A pan-sharpening network using multi-resolution transformer and two-stage feature fusion

Pan-sharpening is a fundamental and crucial task in the remote sensing image processing field, which generates a high-resolution multi-spectral image by fusing a low-resolution multi-spectral image and a high-resolution panchromatic image. Recently, deep learning techniques have shown competitive results in pan-sharpening. However, diverse features in the multi-spectral and panchromatic images are not fully extracted and exploited in existing deep learning methods, which leads to information loss in the pan-sharpening process. To solve this problem, a novel pan-sharpening method based on multi-resolution transformer and two-stage feature fusion is proposed in this article. Specifically, a transformer-based multi-resolution feature extractor is designed to extract diverse image features. Then, to fully exploit features with different content and characteristics, a two-stage feature fusion strategy is adopted. In the first stage, a multi-resolution fusion module is proposed to fuse multi-spectral and panchromatic features at each scale. In the second stage, a shallow-deep fusion module is proposed to fuse shallow and deep features for detail generation. Experiments over QuickBird and WorldView-3 datasets demonstrate that the proposed method outperforms current state-of-the-art approaches visually and quantitatively with fewer parameters. Moreover, the ablation study and feature map analysis also prove the effectiveness of the transformer-based multi-resolution feature extractor and the two-stage fusion scheme.


INTRODUCTION
Multi-Spectral (MS) images are widely used in remote sensing applications such as land cover classification (Ghamisi et al., 2019), environmental change detection (Bovolo et al., 2010) and agriculture monitoring (Gilbertson, Kemp & Van Niekerk, 2017). Due to physical constraints, there is a trade-off between spatial and spectral resolutions during satellite imaging. The satellite can only provide low-spatial-resolution (LR) MS images and corresponding high-spatial-resolution (HR) PANchromatic (PAN) images (Zhang, 2004; Zhou, Liu & Wang, 2021). However, many applications mentioned above require satellite imagery with high spatial and spectral resolutions. Pan-sharpening meets this demand by fusing the LRMS and PAN images to obtain an HRMS image.
Various methods have been proposed for pan-sharpening. They can be separated into four main classes: component substitution (CS), multi-resolution analysis (MRA), variational optimization (VO) and deep learning (DL) (Vivone et al., 2015; Vivone et al., 2021a; Vivone et al., 2021b). The first three classes belong to traditional algorithms that appeared several decades ago, while DL-based methods have emerged recently and achieved promising results. The CS category projects the MS image into a transformed domain and substitutes one of its components with the PAN image before the inverse transform.
The MRA category consists of algorithms that adopt a multi-scale decomposition to extract spatial details from the PAN image and inject them into up-sampled MS bands (Xiong et al., 2021). Wavelet transform (WT) (Kim et al., 2011), discrete wavelet transform (DWT) (Pradhan et al., 2006) and generalized Laplacian pyramids with modulation transfer function (MTF-GLP) (Aiazzi et al., 2003) are well-known MRA methods.
The VO methods rely on defining and solving optimization problems. For instance, P+XS (Ballester et al., 2006) obtains a high-fidelity HRMS image via a variational optimization process with several reasonable hypotheses. SR-D (Vicinanza et al., 2015) uses sparse dictionary elements to represent the desired spatial details; the representation coefficients are obtained by solving a variational optimization problem. In the last few years, VO methods have also been combined with DL techniques to fully exploit the advantages of both classes (Shen et al., 2019; Deng et al., 2021).
The DL class typically leverages data-driven learning to get an optimized solution for pan-sharpening. Huang et al. (2015) launched the first attempt at DL-based pan-sharpening by utilizing a modified sparse de-noising auto-encoder scheme. Inspired by convolutional neural networks (CNN) based image super-resolution (Dong et al., 2016), Masi et al. (2016) proposed an efficient three-layer CNN called PNN. To fully extract spectral and spatial features, RSIFNN (Shao & Cai, 2018) uses a two-stream architecture to extract features from the LRMS and PAN images separately. PanNet (Yang et al., 2017) generates spatial details with a deep residual network in the high-pass filtering domain for better spatial preservation. Spectral information is also preserved by directly adding the up-sampled LRMS image to the details. Observing that the scale of features varies among different ground objects, Yuan et al. (2018) proposed a multi-scale and multi-depth CNN called MSDCNN. Many follow-up works exploit multi-scale feature extraction in pan-sharpening. For example, PSMD-Net (Peng et al., 2021) embeds multi-scale convolutional layers into dense blocks for pan-sharpening. Zhang et al. (2019) proposed a bidirectional pyramid network that reconstructs the HRMS image from coarse resolution to fine resolution. Recently, several transformer-based pan-sharpening methods emerged (Meng et al., 2022;Zhou et al., 2022). They utilize transformers to extract long-range image features but have not considered multi-scale information. DR-NET (Su, Li & Hua, 2022) introduces Swin Transformer (Liu et al., 2021) blocks into a UNet-like architecture for spatial information preservation. CPT-noRef (Li, Guo & Li, 2022) uses a pyramid transformer encoder to supply global and multi-scale features. With long-range feature extraction capacity, these transformer-based methods have achieved promising results. However, the utilization of transformers also induces higher model complexity. 
To avoid this drawback, in this article we aim to design a lightweight network that can still exploit diverse features, in which the transformers only need to extract a few distinct features.
In existing DL-based methods, diverse features with multi-scale, multi-depth and contextual information are not fully extracted. Moreover, in image reconstruction, the indiscriminate use of these diverse features also limits the fusion quality. To solve these problems, we propose a pan-sharpening approach based on multi-resolution transformer and two-stage feature fusion in this article. Two transformer-based multi-resolution feature extractors (MRFE) are applied separately to the LRMS and PAN images to fully extract diverse features. After feature extraction, a multi-resolution fusion module (MRFM) and a shallow-deep fusion module (SDFM) are proposed to exploit multi-scale and multi-depth features for spatial detail generation. Finally, the generated details are injected into the up-sampled LRMS image to obtain the pan-sharpened image. Extensive experiments over QuickBird (QB) and WorldView-3 (WV3) datasets demonstrate that the proposed method outperforms state-of-the-art algorithms visually and quantitatively. The main contributions can be summarized as follows:
1. A two-branch transformer-based feature extractor is designed to facilitate information interaction between different resolutions and finally learn effective multi-scale feature representations of the LRMS and PAN images.
2. An MRFM is proposed to fuse LRMS and PAN features at each resolution, which is simple yet effective for the fusion of multi-resolution modality-specific features.
3. An SDFM is proposed to fuse shallow local and deep multi-scale features, which is essential for fully utilizing diverse features.

MATERIALS & METHODS

Datasets
Pan-sharpening is a technique that uses the PAN image to sharpen the LRMS image and is widely applied in the remote sensing field. However, there are various imaging satellites, and the PAN and LRMS data they capture have different characteristics, which challenges the generality of pan-sharpening methods. Thus, two datasets captured by the QB and WV3 satellites are used in the experiments to evaluate the performance of the proposed method. Another challenge is the absence of real HRMS images. Since ground-truth (GT) HRMS images are unavailable for the pan-sharpening task, we follow Wald's protocol (Wald, Ranchin & Mangolini, 1997) to spatially degrade the LRMS and PAN images by a factor of 4 (the spatial-resolution gap between the PAN and LRMS images). The original full-resolution LRMS images can then be regarded as references, i.e., GT HRMS images. All the images are cropped into PAN patches of size 128 × 128 and LRMS patches of size 32 × 32 to generate the datasets. As a result, the QB dataset has 11,216 patch pairs, and the WV3 dataset has 11,160 patch pairs. The 11,216 QB patch pairs are randomly divided into 8,974/1,121/1,121 (80%/10%/10%) pairs for training, validation and testing.
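As a rough illustration of this reduced-resolution data generation, the NumPy sketch below degrades a PAN/MS pair by a factor of 4 via block-averaging. Note this is a crude stand-in for the MTF-matched low-pass filtering normally used under Wald's protocol, and all names and sizes are illustrative:

```python
import numpy as np

def degrade(img, ratio=4):
    """Spatially degrade an image by block-averaging (a simplified stand-in
    for the MTF-matched low-pass filter + decimation of Wald's protocol)."""
    h, w = img.shape[:2]
    assert h % ratio == 0 and w % ratio == 0
    return img.reshape(h // ratio, ratio, w // ratio, ratio, -1).mean(axis=(1, 3)).squeeze()

# Under Wald's protocol, the degraded pair becomes the network input and
# the original MS image serves as the ground-truth (GT) reference.
pan = np.random.rand(512, 512)       # original PAN patch
ms  = np.random.rand(128, 128, 4)    # original MS patch (4 bands, e.g., QB)
pan_lr = degrade(pan)                # 128 x 128 input PAN
ms_lr  = degrade(ms)                 # 32 x 32 x 4 input LRMS; ms is the GT
```

Averaging over equal-sized blocks preserves the global mean, so the degraded images keep the radiometry of the originals.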

Overall network architecture
The overall architecture of the proposed method is depicted in Fig. 1, where X_P ∈ R^(H×W×1), X_M ∈ R^((H/4)×(W/4)×B) and Ŷ_M ∈ R^(H×W×B) represent the PAN image, the LRMS image and the fused image, respectively. W and H are the width and height of the PAN image. B is the number of MS bands. First, shallow local features of the up-sampled X_M and of X_P are extracted by two convolution layers with kernel size of 3 × 3, respectively. These convolution layers are also referred to as pre-conv layers. Then, the shallow local features are fed into an MRFE to further extract deep multi-resolution feature maps of the two images. The feature maps at the same resolution are concatenated and fused by an MRFM. Finally, the deep and shallow features are merged by an SDFM and added to the up-sampled X_M to obtain the pan-sharpened image Ŷ_M. The entire pan-sharpening process can be described as follows:

M_S = Conv(↑X_M), P_S = Conv(X_P)
(M_HR, M_MR, M_LR) = MRFE(M_S), (P_HR, P_MR, P_LR) = MRFE(P_S)
F_D = MRFM([M_HR, P_HR], [M_MR, P_MR], [M_LR, P_LR])
Y_D = SDFM([M_S, P_S, F_D])
Ŷ_M = ↑X_M + Y_D (1)

where ↑X_M represents the up-sampled LRMS image. M_S and P_S are shallow features of the LRMS and PAN images. M_HR and P_HR are HR feature maps. M_MR and P_MR are Middle-Resolution (MR) feature maps. M_LR and P_LR are LR feature maps. F_D denotes the deep features. Y_D is the generated spatial details injected into the up-sampled LRMS image.
[·] represents concatenation at the channel dimension. The MRFE, MRFM and SDFM are key components of the proposed method, which will be elaborated in the following.
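As a minimal illustration of the detail-injection step described above, the sketch below adds generated details to an up-sampled LRMS image. Nearest-neighbour up-sampling is used here as a stand-in for the network's ↑ operator, and in the actual method the details come from the learned fusion modules; the function names are illustrative:

```python
import numpy as np

def upsample_nearest(x, ratio=4):
    """Nearest-neighbour up-sampling stand-in for the ↑ operator."""
    reps = np.ones((ratio, ratio, 1)) if x.ndim == 3 else np.ones((ratio, ratio))
    return np.kron(x, reps)  # repeat each pixel into a ratio x ratio block

def inject_details(lrms, details):
    """Detail-injection skeleton: fused = up-sampled LRMS + spatial details."""
    return upsample_nearest(lrms) + details

lrms = np.random.rand(32, 32, 4)      # LRMS patch
details = np.zeros((128, 128, 4))     # Y_D would be produced by the SDFM
fused = inject_details(lrms, details)  # pan-sharpened output
```

With zero details the output is exactly the up-sampled LRMS image, which makes the role of the detail branch explicit: the network only has to learn the high-frequency residual.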

Multi-resolution feature extractor
Extracting effective and diverse features is of great importance to spatial information preservation. To this end, the MRFE keeps an HR stream without down-sampling to prevent spatial information loss and gradually adds MR and LR streams to extract multi-scale features. Furthermore, skip connections with up-sampling or down-sampling operations are used between the multi-resolution streams to facilitate information interaction, which helps improve the features' effectiveness and reduce redundancy among streams. For long-range feature extraction, HRFormer blocks (Yuan et al., 2021) are used as basic blocks to build the MRFE. The structure of an HRFormer block is shown in Fig. 2. The Local-Window Self-Attention (LWSA) mechanism models long-range dependencies between pixels. Then, the Feed-Forward Network (FFN) with a 3 × 3 Depth-Wise (DW) convolution exchanges information across windows to acquire contextual information. As shown in Fig. 1, the MRFE consists of HRFormer blocks and skip connections between streams. Therefore, the HRFormer blocks can encode contextual information into the features, and the multi-resolution streams exchange information with each other to generate effective and diverse deep feature representations M_HR, M_MR, M_LR, P_HR, P_MR, and P_LR. In the following, two fusion stages are designed to fuse these features progressively.
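To make the local-window mechanism concrete, the sketch below partitions a feature map into the non-overlapping K × K windows over which LWSA would compute self-attention (K = 8, matching the default window size used later). The attention computation itself is omitted, and the function name is illustrative:

```python
import numpy as np

def window_partition(x, K=8):
    """Split an H x W x C feature map into non-overlapping K x K windows,
    the token groups within which LWSA computes self-attention."""
    H, W, C = x.shape
    assert H % K == 0 and W % K == 0
    x = x.reshape(H // K, K, W // K, K, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, K * K, C)  # (num_windows, tokens_per_window, C)

feat = np.random.rand(32, 32, 8)   # one stream's feature map, C_D = 8
windows = window_partition(feat)   # 16 windows of 64 tokens each
```

Restricting attention to K × K windows keeps the cost linear in image size, while the FFN's depth-wise convolution (not shown) propagates information across window boundaries.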

Multi-resolution fusion module
In the first fusion stage, we propose an MRFM to fuse the LRMS and PAN feature maps at each resolution. The structure of the MRFM is shown in Fig. 3. Each pair of LRMS and PAN feature representations with the same resolution is concatenated and fed into a 3 × 3 convolution layer to fuse the modality-specific features. Then, a residual block (ResBlock) is used to refine the fused features. To restore the spatial resolution of the MR and LR feature maps, 3 × 3 convolution and pixel-shuffle layers are used as learnable up-sampling procedures. Finally, the feature maps with multi-scale information are concatenated to constitute the deep features F_D. Thanks to the different depths and resolutions of these streams, the feature maps in F_D are diverse and complementary to the shallow features.
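The pixel-shuffle step rearranges channels into space. A NumPy sketch of this operation is shown below; note that the exact channel-ordering convention here is one common choice and may differ from a particular framework's implementation:

```python
import numpy as np

def pixel_shuffle(x, r=2):
    """Rearrange an H x W x (C*r^2) tensor into (H*r) x (W*r) x C: each group
    of r^2 channels becomes an r x r spatial block, up-sampling the map."""
    H, W, Cr2 = x.shape
    assert Cr2 % (r * r) == 0
    C = Cr2 // (r * r)
    x = x.reshape(H, W, r, r, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(H * r, W * r, C)

lr_feat = np.random.rand(8, 8, 32)      # LR-stream features with C*r^2 channels
hr_feat = pixel_shuffle(lr_feat, r=2)   # 16 x 16 x 8
```

Because the preceding 3 × 3 convolution produces the extra channels, the up-sampling is learnable, unlike fixed bilinear interpolation.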

Shallow-deep fusion module
To generate spatial details by fully using the shallow features M_S, P_S and the deep features F_D, an SDFM is proposed, which helps the network focus on more informative features and avoids the degradation problem. Figure 4 displays the structure of the SDFM. M_S, P_S and F_D are concatenated along the channel dimension and fed into a 1 × 1 convolution for an initial fusion of the features. Note that the shallow features M_S and P_S skip the MRFM; they are directly fed to this stage to preserve the original image information. Then, a standard squeeze-and-excitation block (Hu, Shen & Sun, 2018) is adopted to excite informative features. Specifically, global average pooling (GAP) is used to aggregate the channel-wise global information into a channel descriptor. Subsequently, two 1 × 1 convolution layers reduce and restore the dimension of the descriptor to capture the correlations among channels. The restored descriptor is mapped to a set of channel weights via a sigmoid function. Thus, informative channels can be excited by scaling the initially fused features with the channel weights. We exploit the excited features to learn a residual component that refines the fused features; a 3 × 3 convolution layer completes this step. Finally, a 1 × 1 convolution generates the spatial details Y_D from the refined features. The details are injected into the up-sampled LRMS image to obtain the pan-sharpening result Ŷ_M.

Metrics
Five widely used indicators are adopted to evaluate different methods quantitatively. The indices can be grouped into four full-reference indicators and one no-reference indicator according to whether they require a GT image in calculations. For the evaluation on reduced-resolution datasets, we measure the four full-reference indices, including Spectral Angle Mapper (SAM) (Yuhas, Goetz & Boardman, 1992), relative dimensionless global error in synthesis (ERGAS) (Wald, 2002), spatial Correlation Coefficient (sCC) (Zhou, Civco & Silander, 1998), and the Q2n (Alparone et al., 2004; Garzelli & Nencini, 2009) index (i.e., Q4 for 4-band data and Q8 for 8-band data). ERGAS and Q2n evaluate the global quality of pan-sharpened results. SAM estimates spectral distortions. sCC measures the quality of spatial details. In the evaluation on full-resolution datasets, we adopt the no-reference index Hybrid Quality with No Reference (HQNR) (Aiazzi et al., 2014) with its spectral distortion component D_λ and spatial distortion component D_S to measure the quality of pan-sharpened results.
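As a rough illustration, SAM and ERGAS can be computed as below. These are simplified reference implementations under their common definitions and may differ in detail from the evaluation toolboxes actually used:

```python
import numpy as np

def sam_degrees(ref, fused, eps=1e-12):
    """Spectral Angle Mapper: mean angle (in degrees) between per-pixel
    spectral vectors of the reference and fused images; 0 is ideal."""
    ref_f = ref.reshape(-1, ref.shape[-1])
    fus_f = fused.reshape(-1, fused.shape[-1])
    dot = (ref_f * fus_f).sum(axis=1)
    denom = np.linalg.norm(ref_f, axis=1) * np.linalg.norm(fus_f, axis=1) + eps
    return np.degrees(np.arccos(np.clip(dot / denom, -1.0, 1.0))).mean()

def ergas(ref, fused, ratio=4):
    """ERGAS: band-wise RMSE relative to band means, scaled by the
    resolution ratio between PAN and MS; 0 is ideal."""
    B = ref.shape[-1]
    rmse = np.sqrt(((ref - fused) ** 2).reshape(-1, B).mean(axis=0))
    mean = ref.reshape(-1, B).mean(axis=0)
    return 100.0 / ratio * np.sqrt(((rmse / mean) ** 2).mean())
```

Note that SAM is invariant to per-pixel spectral scaling (it measures direction, not magnitude), which is why it isolates spectral distortion from intensity differences.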

Experimental setting
The proposed method is implemented with the PyTorch framework and runs on an NVIDIA GeForce RTX 3090 GPU. Our model is trained for 500 epochs by an AdamW (Loshchilov & Hutter, 2019) optimizer with an initial learning rate of 0.0005, momentum parameters β1 = 0.9 and β2 = 0.999, and a weight decay coefficient of 0.05. The mini-batch size is set to 16. The hyper-parameter setting of the proposed method is listed in Table 2. Since the shallow pre-conv layers mainly focus on local regions and extract fine-grained features with rich spatial details, the number of output feature maps at these shallow layers is set to C_S = 16. The deep multi-scale feature maps in the MRFE and MRFM carry contextual information, which is essential to pan-sharpening because of the similarity among ground objects. However, the pan-sharpening task focuses more on fine-grained spatial details than on contextual semantic information. Therefore, the number of output feature maps at each layer in the MRFE and MRFM is set to C_D = 8. In the SDFM, except for the last layer, the number of feature maps equals the total feature map amount of M_S, P_S and F_D (i.e., 56). K is the window size of the LWSA in an HRFormer block, which is set to 8 by default. H denotes the head number of the MHSA in the LWSA; H = 1 is enough as the channel number C_D = 8 is quite low.
Experiments are conducted to verify the proposed model configuration. First, under the condition of a roughly unchanged total number of feature maps, the channel number of shallow features C_S and the channel number of deep features C_D are adjusted to find an appropriate setting. The mean and standard deviation (STD) of the experimental results are reported in Table 3, which demonstrates that the fine-grained shallow feature maps should be slightly more numerous than the coarse-grained deep feature maps. We also tested different values of the MHSA's head number H. The results are listed in Table 4, which confirms that H = 1 is enough.

Experimental results on reduced-resolution datasets
To verify the effectiveness of the proposed method, we conducted comparative experiments on the reduced-resolution QB and WV3 datasets. In this case, the original LRMS images serve as GT images for visual assessment. Figure 5 displays visual results on the reduced-resolution QB data. To highlight the differences, we visualize residual maps between the pan-sharpening results and the reference (GT) in Fig. 6. A pixel with a small mean absolute error (MAE) is shown in blue, whereas a pixel with a large MAE is displayed in yellow. Tables 5 and 6 list the average performance and STD of different methods across all testing reduced-resolution image pairs. It can be found that the transformer-based Zhou et al. (2022) and DR-NET have the second-best and third-best results on both datasets, while our method yields the best quantitative results over the reduced-resolution QB testing set.

Experimental results on full-resolution datasets
All the methods are tested on original PAN and LRMS images to further verify the effectiveness of our method on the full-resolution real data. However, there are no GT HRMS images as references in this circumstance. Figure 9 shows visual results on the full-resolution QB data. As shown in the enlarged view, the fusion result of GSA suffers from severe spectral distortions. It can also be observed that the results of BDSD, MTF-GLP-FS, PNN and DR-NET have slight color distortions. On the other hand, despite well-maintained spectral information, the results of TV, MSDCNN and Zhou et al. (2022) suffer from blurring effects. By comparison, our method presents the best pan-sharpening result regarding spectral and spatial fidelity.  The quantitative results of different methods on the two datasets are listed in Tables 7 and 8. Over the QB dataset, our method yields the best quantitative results on all three indicators. As for the WV3 testing set, TV has the best D λ , and Zhou et al. (2022) has the best D S . Our method has the second-best results on both indicators and the best HQNR value, which indicates that the proposed method has the best overall performance.
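For reference, HQNR combines the two distortion components multiplicatively, with an ideal value of 1 reached when D_λ = D_S = 0. A minimal sketch of the combination step only (the distortion indices themselves require the full no-reference protocols), assuming the common exponents α = β = 1:

```python
def hqnr(d_lambda, d_s, alpha=1.0, beta=1.0):
    """Combine the spectral (D_lambda) and spatial (D_S) distortion indices
    into the HQNR score; 1 is ideal, lower values indicate more distortion."""
    return (1.0 - d_lambda) ** alpha * (1.0 - d_s) ** beta

score = hqnr(0.05, 0.08)  # example distortion values
```

This multiplicative form means a method must keep both distortions low simultaneously to score well, which is why HQNR is read as an overall-quality indicator alongside its two components.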

Parameter numbers and time performance
To further evaluate the complexity of the proposed method, the parameter number and average runtime of each method on the 1,121 QB reduced-resolution testing patches are given in Table 9. The traditional algorithms are tested on a 2.6-GHz Intel Core i7-10750H CPU, and the DL-based approaches are tested on an NVIDIA GeForce RTX 2060 GPU. By comparison, the VO-based TV consumes more runtime than the other methods. PNN exhibits the lowest time consumption because it only has three convolution layers, so its computational complexity is extremely low. The other methods show relatively close time consumption. In terms of parameter numbers, DR-NET has considerably more parameters than the other methods, owing to the large number of feature maps throughout the network; exploiting these feature maps leads to redundancy and high complexity. The proposed method has the fewest parameters, which means low spatial complexity and facilitates model deployment on devices with limited memory resources.

Ablation study
Ablation experiments are conducted to verify the contribution of each component, with the quantitative results of the ablation variants reported in Table 10. Figure 12 shows the visual results of the variants over the QB dataset. Corresponding residual maps are displayed in Fig. 13. Figure 14 shows the convergence performance of the variants during the training process.

Importance of the multi-resolution feature extractor
In Fig. 13, it can also be observed that the variant w/o MRFE shows larger residuals than the variant w/o MRFM. The comparison between these two variants demonstrates that the MRFE plays a significant role in the proposed method.

Importance of the multi-resolution fusion module
The MRFM is removed in the variant w/o MRFM in Fig. 11 to verify the necessity of two-stage feature fusion. In this variant, both the multi-resolution and shallow-deep features are fused by the SDFM. The results of the variant w/o MRFM in Table 10 are clearly inferior to those of our full model, which demonstrates that the MRFM is effective. These results also prove that two-stage feature fusion is better than traditional one-shot fusion. Furthermore, the residual map in Fig. 13 also shows that removing the MRFM is harmful.

Importance of the shallow-deep fusion module
To verify the effectiveness of the SDFM, we replace the module with a simple 3 × 3 convolution layer. The metrics of the variant w/o SDFM in Table 10 show that replacing the SDFM with a simple convolution layer is detrimental to fusion results. In Fig. 14, it is evident that the variant w/o SDFM converges slower than other structures, especially in the first 50 epochs. The slow convergence of the variant w/o SDFM demonstrates that a slightly more complicated shallow-deep fusion module can boost the convergence speed of our method.

Importance of the skip connections between resolutions
Information exchange across resolutions relies on up-sampling and down-sampling skip connections. All the connections except those adding the streams are deleted in the variant w/o SC to investigate their influence on feature extraction. The results of the variant w/o SC in Table 10 show that all metrics are slightly down, which demonstrates that the skip connections are helpful.

Visualization and analysis of feature maps
To verify the proposed network's diverse feature extraction ability, we visualize the feature maps extracted from a QB testing image pair in Fig. 15. These feature maps are finally fused by the SDFM and relate directly to the quality of the pan-sharpening results. The M_S and P_S feature maps are shallow features of the MS and PAN images. It can be observed that the M_S feature maps are blurry and lack spatial details, whereas the P_S feature maps are clear and some contain high-frequency details. For instance, the P_S feature maps in the second and fourth rows of Fig. 15 have rich high-frequency details.
The F_D^1, F_D^2 and F_D^3 feature maps are multi-scale deep features obtained from the MRFM, and feature maps from different resolutions have distinct content. The F_D^1 maps are HR features and thus have rich spatial details. The MR features F_D^2 contribute both detailed and contextual information. The LR features F_D^3 present some unique high-level features apparently different from M_S, P_S, F_D^1 and F_D^2. All these various feature maps demonstrate that our method fully exploits the information in the MS and PAN images.

DISCUSSION
The proposed method has been compared with different pan-sharpening methods on reduced-resolution and full-resolution datasets. The parameter number and inference time of these methods have also been measured. It can be found that our method can yield superior results with minimum parameters and achieve a good trade-off between calculation and performance. The proposed method controls the number of parameters by keeping fewer feature maps throughout the network and turns to multi-resolution feature extraction for more effective feature maps. Visualizing these feature maps has verified that they have distinct content, which brings our method high efficiency.
By comparing the full model with the variants w/o MRFE and w/o SC, it can also be found that the HRFormer-like feature extractor is useful. The skip connections between every two resolutions have slightly improved the pan-sharpening performance.
As for the two-stage feature fusion, all indicators declined significantly when either the MRFM or the SDFM was removed. Besides, the SDFM apparently boosts the convergence speed of our method: the variants w/o MRFE, w/o MRFM, w/o SC and the full model show the same convergence speed thanks to the SDFM. As the final fusion stage of diverse features, the SDFM, inspired by residual learning, significantly impacts the results and eases gradient back-propagation. Among the variants, the variant w/o MRFE is the simplest and can be regarded as a pure CNN-based baseline, which is easier to train than the transformer-based MRFE. Thus, the SDFM ensures the strength of the baseline, and the baseline guarantees the convergence speed of the variants with the MRFE.

CONCLUSIONS
In this article, we proposed a pan-sharpening approach based on a multi-resolution transformer and two-stage feature fusion. In the proposed network, two HRFormer-like structures form a two-branch multi-resolution feature extractor that learns multi-scale and contextual feature maps from the MS and PAN images. A two-stage fusion scheme was proposed to fuse these diverse features: the MRFM fuses modality-specific multi-resolution features, and the SDFM finally fuses shallow and deep features to generate the details to be injected into the up-sampled MS image. Experiments on two kinds of datasets demonstrated the superiority of our method over state-of-the-art methods. The extracted feature maps were visualized to verify their diversity, and the ablation study proved the effectiveness of the MRFE. The two-stage feature fusion was also proved necessary via ablation experiments, and the SDFM can even boost the convergence speed of the network. In future works, efforts will be made to enhance the time efficiency of transformer-based pan-sharpening methods.

ADDITIONAL INFORMATION AND DECLARATIONS Funding
This work was supported by the National Natural Science Foundation of China (No. 61703299). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.