GCRDN: Global Context-Driven Residual Dense Network for Remote Sensing Image Superresolution

Superresolution (SR) of remote sensing images aims to restore high-quality information from low-resolution images. Recently, the field has made great strides with the rapid development of deep learning (DL) techniques. Despite their good performance, DL-based models are often ineffective in balancing global and local feature extraction. Moreover, they are usually hindered by the poor image reconstruction capability of the decoder inside their SR models. To cope with these problems, this work proposes a novel global context-driven residual dense network (GCRDN) for satellite image SR based on the encoder-decoder architecture. In particular, the proposed encoder is endowed with nonlocal sparse attention modules incorporated into the residual dense network to learn robust representations from global features. Furthermore, a decoder equipped with back-sampling blocks is devised to fully exploit the feature maps extracted from the encoder. Extensive experimental comparisons based on two multisensor satellite remote sensing datasets confirm that the proposed GCRDN achieves impressive performance in terms of perceptual quality and fidelity.

the landscape, and they have been applied in numerous remote sensing vision tasks such as semantic segmentation [7], [8], change detection [9], and object detection [10]. However, it is still challenging to reconstruct fine-grained remote sensing images due to their ultralong-range imaging nature, atmospheric disruptions, and equipment noise [2], [11].
A large number of SR methods have been developed in the literature. Broadly speaking, existing SR methods can be categorized into three approaches, namely interpolation-based [12], [13], reconstruction-based [14], [15], and learning-based [16], [17], [18], [19] approaches. The interpolation-based approach is the most elementary kind of SR method and generates SR images by estimating the pixel intensities on an up-sampled grid. In contrast, the reconstruction-based SR approach usually relies on explicit prior information to limit the range of potential solutions, assuming that LR images result from HR images after multiple degradations. As a result, the reconstruction-based SR approach is more capable of producing flexible and precise details. Finally, taking advantage of deep learning (DL) techniques, the learning-based approach has demonstrated superior performance by exploiting the statistical correlations between the LR image and its matching HR counterpart based on large training sample sets [20]. More specifically, DL-based SR models are commonly built upon the encoder-decoder structure, in which the encoder extracts representative feature maps from the LR images, whereas the decoder reconstructs the HR images from the feature maps. These DL-based models can be further divided into three categories based on their baseline models, namely convolutional neural network (CNN)-based, generative adversarial network (GAN)-based, and self-attention-based models.
The CNN was first introduced into SR as the baseline model to map an LR image into the HR one in an end-to-end manner [21]. Along the same direction, Kim et al. [22] proposed very deep superresolution (VDSR) by increasing the depth of the network while utilizing the residual learning and gradient clipping to improve the convergence performance of its deep networks. Recently, Lim et al. [23] devised a more comprehensive network named enhanced deep superresolution networks (EDSR) by further enhancing the residual and dense blocks, whereas the authors in [24], [25], [26], and [27] proposed to learn complicated characteristics of ground objects before restoring them into high-quality images in remote sensing. However, the performance of these CNN-based models is handicapped by the limited receptive field of the CNN, incurring insufficient global texture information extraction. On the other hand, GAN-based models have been applied in SR to derive synthetic images of distribution similar to that of the authentic images [28], which results in comparable visual representation [29], [30], [31], [32], [33]. However, these GAN-based models usually suffer from artifacts in the resulting images and loss of high-resolution details of ground objects. Finally, another line of work focuses on the self-attention mechanism for SR, especially the newly proposed transformer structure [34]. The vision transformer [35] has become a popular choice for exploiting long-range contextual information, benefiting from the multihead self-attention mechanism. Furthermore, the Fusformer [36], especially designed for remote sensing images, applied a pure transformer architecture to address the global relationship modeling in feature maps. However, the self-attention-based methods may lead to disturbance from high-frequency noise and loss of detailed information.
In summary, due to the limitation of the receptive field, the CNN-based encoder is ineffective in extracting global features of remote sensing images. In contrast, as the self-attention-based methods focus on calculating the correlation of all elements in the feature maps, they are susceptible to unrelated and noisy contents, thus leading to high computational costs and inaccurate representation. Furthermore, most state-of-the-art methods overlooked the information restoration capability of the decoder, i.e., reconstructing high-quality HR images from abstract features extracted by the encoder.
To remedy these issues, we propose a global context-driven residual dense network (GCRDN) for SR. Specifically, nonlocal sparse attention (NLSA) is first introduced into the residual dense encoder block, which helps exploit the features generated from the residual dense encoder block and effectively aggregate enriched global information. Furthermore, the back-sampling blocks are employed to construct the decoder in which the image reconstruction error is fed back through successive up-sampling and down-sampling blocks. The rich semantic features from the encoder are repeatedly used for the final high-resolution image generation, improving the utilization of semantic information and HR image quality. The main contributions of this article are summarized as follows.
1) A novel nonlocal sparse residual dense (NLRD) encoder is proposed by incorporating NLSA into residual dense networks to capture similar contextual information from a global perspective. The NLRD encoder is effective and robust as it takes into account global relationships such as repeated textures in the remote sensing images.
2) The enriched features learned from the encoder are up-sampled and down-sampled successively by the deep back-sampling (DBP) decoder. The circulation of samplings guides the finer reconstruction of the target images while retaining detailed information.
3) Capitalizing on the proposed NLRD encoder and DBP decoder, GCRDN is constructed by leveraging global context extraction and a continuous decoding strategy for SR.
The rest of this article is organized as follows: Section II provides a brief overview of different existing DL-based SR models, whereas the architecture of GCRDN is described in Section III. After that, Section IV provides in-depth experimental analyses on GCRDN and other state-of-the-art SR methods. Finally, Section V concludes this article.

A. CNN-Based SR Framework
The conventional CNN-based encoder-decoder networks have been widely applied in computer vision tasks for decades. Motivated by the great success of classical models such as VGG [37], ResNet [38], DenseNet [39], and MemNet [40] in various practical applications, a number of pioneering SR models have been developed including VDSR [22], RCAN [41], RDN [42], PQA-CNN [43], EDSR [23], MSAN [44], RSI-Net [45], and CTN [46]. These CNN-based SR models are characterized by their convolutional layers for encoding and decoding feature maps. Specifically, RCAN [41] addressed the equal treatment of low-frequency and high-frequency features across channels by introducing channel attention into a residual structure with long skip connections among several residual groups. As a result, the high-frequency information was assigned a higher attention weight than the low-frequency information. This module is also applied in MSAN [44] to perform multilevel feature extraction focusing on the complex structure of remote sensing images. Moreover, PQA-CNN [43] was proposed for SR of remote sensing images by adopting a perceptual quality-assured framework with an uncertainty-driven quantification model to meet the human perceptual requirement. RDN [42] utilized a combination of dense and residual blocks to fully extract the hierarchical features from the LR images. Despite their many advantages, these CNN-based models mainly focus on local information and cannot fully exploit long-range contextual information due to their limited global feature extraction capability. Moreover, another drawback of these CNN-based models is that contextual and spatial information may be lost in the decoding stage, limiting the recovery of the high-resolution information [47].
More recently, the GAN has been introduced in image restoration, driving the synthesized results closer to the natural image manifold and discriminating whether they are "real" enough for human perception with the adversarial learning strategy [48]. SRGAN [28], ESRGAN [49], SWCGAN [30], EEGAN [29], MAGAN [31], and CDGAN [32] have been developed for SR. However, their performance is usually hindered by various problems such as mode collapse, unstable training, and gradient vanishing during adversarial learning.

B. Self-Attention-Based Enhancement
The transformer was first proposed to model long-range dependencies in natural language processing before it was introduced to the computer vision field with comparable performance. As the core of the transformer, the self-attention module has proven effective in capturing long-distance information and has attracted much attention. Methods such as SwinIR [50], TranSMS [51], ESRT [52], NLSN [53], Fusformer [36], Interactformer [54], and TransENet [55] are designed based on the self-attention mechanism. In particular, SwinIR [50] adopted a hierarchical transformer with a shifted-window attention mechanism [56]. It consisted of several residual Swin transformer blocks to model the interactions between image contents and attention weights. By restricting self-attention computation to non-overlapping local windows, the shifted windowing scheme obtained greater computational efficiency on high-level vision tasks. Moreover, it has been reported that the synergy of CNN and transformer outperforms the pure transformer network. For instance, TranSMS [51] developed a dual-branch encoder built with a vision transformer module and a dense convolutional module. These two branches are used to capture contextual relationships in low-resolution input features and localize high-resolution features, respectively. Furthermore, it has been proposed to embed self-attention modules into the CNN to enhance global information recognition. For instance, NLSN [53] improved the nonlocal attention with a sparsity constraint by identifying the most informative locations that need attention while ignoring unrelated regions. Although these methods can capture long-range context relationships, they suffer from loss of detailed information in the representation learning and image reconstruction processes. Compared with [53], we combine NLSA with dense residual learning to improve feature expression and introduce the back-sampling strategy to generate finer reconstructed images.

III. METHODOLOGY
This section will provide an overview of the proposed method before elaborating on the two essential components of the proposed GCRDN.

A. Network Framework
As shown in Fig. 1, the proposed GCRDN consists of an NLRD encoder and a DBP decoder. In the NLRD encoder, the shallow convolutional features generated by the encoder head are first fed into a series of nonlocal attentive residual dense blocks (NARDBs). These NARDBs are utilized to preserve the feed-forward nature of networks based on a contiguous memory mechanism while extracting local-global contextual features. After that, the multilevel features obtained by the NLRD encoder are concatenated along the channel axis and integrated by convolution operations. Furthermore, the encoded feature maps are fed into the DBP decoder to reconstruct the HR images through continuous up- and down-sampling processes, in which the encoded features are mapped to the higher resolution feature maps and converted back to the lower resolution repeatedly. Finally, all HR features generated in the up-sampling processes are concatenated and converted into the expected output size before being fed into the decoder tail for SR image reconstruction with the convolution operations.

B. NLRD Encoder
The NLRD encoder aims to learn enriched local-global features of remote sensing images. It consists of an encoder head, several NARDBs and a feature fusion module. The encoder head containing two convolutional layers is used to generate the shallow features of input images. These shallow features are then fed into a series of NARDBs for multilevel representation learning. After that, the feature fusion module composed of convolutional layers is utilized to integrate the multilevel features to enhance representative capabilities.
The architecture of the NARDB is depicted in Fig. 2. As the core component of the NLRD encoder, the NARDB is developed based on hierarchical dense residual learning and the NLSA mechanism.
1) Dense Residual Learning: The dense residual learning in the NARDB can be formulated as follows:
$$O = I + \mathrm{Conv}([I, R_1, \ldots, R_J])$$
where $O$ and $I$ are the output and input of the dense residual learning, respectively. Furthermore, $[\cdot]$ stands for the concatenation process, whereas $\mathrm{Conv}$ denotes a convolutional operation. In addition, $R_j$ represents the feature maps produced by the $j$th convolutional layer followed by a ReLU activation function, where $j \in \{1, \ldots, J\}$. Notably, in dense residual learning, the results generated by all convolutional blocks in the previous stage are connected with the next-stage blocks. After that, an NLSA module is employed to capture global contextual information based on the integrated local features.
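The dense connectivity and the residual path described above can be illustrated with a small pure-Python sketch that operates on the channel vector of a single pixel. It is not the authors' implementation: the per-pixel linear map stands in for a 3 × 3 convolution, and the names `make_block`, `dense_residual_pixel`, and the growth width `g` are hypothetical, introduced only for illustration.

```python
import random

def relu(v):
    return [max(0.0, x) for x in v]

def linear(weights, v):
    # One output per weight row; at a single pixel this acts like a 1x1 convolution.
    return [sum(w * x for w, x in zip(row, v)) for row in weights]

def random_weights(rows, cols, rng, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def make_block(c, g, j, rng):
    """Hypothetical helper: J layers of width g, plus a fusion layer back to c channels."""
    layers = [random_weights(g, c + k * g, rng) for k in range(j)]
    fusion = random_weights(c, c + j * g, rng)
    return layers, fusion

def dense_residual_pixel(i_vec, layer_weights, fusion_weights):
    """Dense residual learning at one pixel (assumed form: O = I + Conv([I, R_1..R_J]))."""
    feats = list(i_vec)                 # running concatenation [I, R_1, ..., R_j]
    for w in layer_weights:
        r_j = relu(linear(w, feats))    # R_j sees the input and every earlier R
        feats = feats + r_j             # dense connection: extend the concatenation
    fused = linear(fusion_weights, feats)
    return [a + b for a, b in zip(i_vec, fused)]   # residual connection back to I

rng = random.Random(0)
layers, fusion = make_block(c=4, g=2, j=3, rng=rng)
out = dense_residual_pixel([0.5, -1.0, 2.0, 0.0], layers, fusion)
print(len(out))  # same number of channels as the input
```

Because every layer receives the full running concatenation, the fusion layer has `c + j * g` inputs, and the block reduces to the identity when the fusion weights are zero.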
2) Nonlocal Sparse Attention (NLSA): Motivated by the observation that remote sensing images typically exhibit repeated patterns, such as vegetation, roads, wasteland, and mountains, NLSA is utilized to capture the global contextual information for better feature extraction. More specifically, NLSA is an improved nonlocal attention operation that partitions the input features into hash buckets to reduce the attention computation.
As shown in Fig. 3, the upper branch presents the first step of NLSA, in which the feature maps are fed into the spherical locality-sensitive hashing (LSH) algorithm [57] to obtain the attention buckets. In the bottom branch, for each query p, the attention operation is computed only within its attention bucket to generate the attention-weighted feature values.
To capture global contextual information from all positions, the input feature map of the NLSA, $X \in \mathbb{R}^{C \times H \times W}$, where $C$, $H$, and $W$ denote the feature dimension, the height, and the width of the feature map, respectively, is first reshaped into a feature sequence $X \in \mathbb{R}^{C \times P}$ where $P = H \times W$. Then, the LSH partitions the features $X$ into $M$ hash buckets based on the similarity of the angular distances between elements. Input elements with high similarity are mapped into the same bucket. More specifically, LSH first randomly rotates a cross-polytope inscribed into a hypersphere and projects the tensor onto the hypersphere. After that, LSH chooses the closest polytope vertex as a tensor's hash code such that vectors of similar angular distances fall into the same hash bucket. The application of the attention bucket achieves high efficiency and robustness by ignoring other noisy or less-correlated partitions. Denoting by $A \in \mathbb{R}^{M \times C}$ a rotation matrix, the resulting tensor after sampling and rotation is given by
$$\hat{x}_p = A \frac{x_p}{\lVert x_p \rVert_2}$$
and subsequently, the hash code at the location $p$, for $p = 1, 2, \ldots, P$, is defined as
$$h(x_p) = \arg\max_{m} \, [\hat{x}_p]_m$$
where $x_p$ is the $p$th column of $X$ and $[\cdot]_m$ stands for the $m$th entry of the enclosed vector. As a result, the locations of the same hash code are put into the same bucket. For the feature $x_p$ at the location $p$, its bucket index set can be obtained by
$$G_p = \{\, q \mid h(x_q) = h(x_p) \,\}.$$
Note that $G_p$ indexes the locations highly related to the location $p$. Using $G_p$, we can further compute the NLSA output $r_p$ as follows:
$$r_p = \sum_{q \in G_p} \alpha_{p,q} \, \mathrm{trans}(x_q)$$
where $\mathrm{trans}(\cdot)$ is a feature transformation function, while $\alpha_{p,q}$ is a weighting coefficient defined as
$$\alpha_{p,q} = \frac{\exp\big(s(x_p, x_q)\big)}{\sum_{q' \in G_p} \exp\big(s(x_p, x_{q'})\big)}$$
with $s(\cdot, \cdot)$ being the feature similarity.
Since the input feature map $X \in \mathbb{R}^{C \times P}$ contains $P$ locations, the set $G_p$ indexes $|G_p|$ nonzero elements in $[s(x_p, x_1), \ldots, s(x_p, x_P)] \in \mathbb{R}^{P}$, where $|\cdot|$ stands for the cardinality of the enclosed set. Since $G_p$ contains the pixel locations the query should attend to, the sparsity constraint can be imposed on the NLSA by reducing the number of nonzero entries to the designated chunk size $K$, i.e., the size of the attention bucket. Finally, the outputs of all NARDBs are integrated before being fed into the decoder:
$$E_{output} = \mathrm{Conv}([N_1, \ldots, N_T]) + N_0$$
where $E_{output}$, $N_t$, and $N_0$ are the outputs of the NLRD encoder, the $t$th NARDB, and the encoder head, respectively, for $t = 1, 2, \ldots, T$. Furthermore, $\mathrm{Conv}$ stands for the convolutional operation.
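As a rough illustration of the bucketing and bucket-restricted attention described above, the following pure-Python sketch hashes each feature vector by projecting its unit-normalized form with a random matrix and taking the index of the largest entry, then applies softmax attention only among locations that share a hash code. Several choices are assumptions made for the sketch rather than details from the paper: the projection matrix is drawn with Gaussian entries, the similarity s(·, ·) is a dot product, trans(·) is the identity, and the cap of K entries per bucket is omitted.

```python
import math
import random

def normalize(v):
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def hash_codes(features, rotation):
    """Spherical LSH sketch: project unit vectors, take the argmax entry as the code."""
    codes = []
    for x in features:
        u = normalize(x)
        proj = [sum(a * b for a, b in zip(row, u)) for row in rotation]
        codes.append(max(range(len(proj)), key=proj.__getitem__))
    return codes

def bucket_attention(features, codes):
    """For each query p, attend only over locations q that share p's hash code."""
    outputs = []
    for p, x_p in enumerate(features):
        bucket = [q for q, c in enumerate(codes) if c == codes[p]]      # G_p
        scores = [sum(a * b for a, b in zip(x_p, features[q])) for q in bucket]
        top = max(scores)                                # subtract max for stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        r_p = [0.0] * len(x_p)
        for w, q in zip(weights, bucket):
            for ch, val in enumerate(features[q]):       # trans(.) assumed identity
                r_p[ch] += (w / total) * val
        outputs.append(r_p)
    return outputs

rng = random.Random(0)
C, M = 3, 4                                              # channels, number of buckets
A = [[rng.gauss(0.0, 1.0) for _ in range(C)] for _ in range(M)]
X = [[1.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.1, 0.9, 0.0]]
codes = hash_codes(X, A)
R = bucket_attention(X, codes)
print(codes, len(R))
```

Because the hash depends only on direction, the first two vectors always land in the same bucket, while a query whose bucket contains only itself simply returns its own feature.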

C. DBP Decoder
In sharp contrast to most existing decoders in the encoder-decoder frameworks that directly reconstruct the HR images through progressive convolution and up-sampling in a feedforward manner, we exploit a DBP decoder to preserve HR components in image reconstructions following an approach similar to [58]. Specifically, the DBP decoder concentrates on boosting feature sampling at various depths while propagating the reconstruction errors across various stages. As a result, the DBP decoder can learn from different up- and down-sampling operators while retaining the details of HR components. The circulation and interlayer dense connections of up- and down-sampling alleviate the vanishing gradient problem and improve the feature reuse for obtaining better results. Moreover, the up-sampling module takes all down-sampled features as input, while the down-sampling module processes those feature maps produced in each up-sampling unit. In this error feedback strategy, the sampling features in the early stages can guide and constrain the feature expression in the later stages. Without loss of generality, the following discussions will focus on the (n + 1)th up-sampling and down-sampling operations.
1) Up-Sampling: We denote by $\{L_n, \ldots, L_0\}$ the outputs of the first $n$ down-sampling blocks. The $(n+1)$th up-sampling process is shown in Fig. 4(a), in which the LR feature maps $\{L_n, \ldots, L_0\}$ are first concatenated before being fed into convolution layers. The resulting feature maps, denoted as $\tilde{L}_n$, can be expressed as
$$\tilde{L}_n = \mathrm{Conv}([L_n, \ldots, L_0]).$$
$\tilde{L}_n$ is then first upscaled before being downscaled. After that, the difference between $\tilde{L}_n$ and its downscaled counterpart, denoted as $e^{L}_n$, is upscaled. Finally, the outputs from both upscale operations are added together to generate the up-sampling output $H_{n+1}$ as follows:
$$H_{n+1} = \mathrm{DC}(\tilde{L}_n) + \mathrm{DC}(e^{L}_n)$$
where $\mathrm{DC}(\cdot)$ stands for the deconvolution-based upscale operation.
2) Down-Sampling: As presented in Fig. 4(b), the $(n+1)$th down-sampling process is the reverse of the up-sampling process. The HR features $\{H_{n+1}, \ldots, H_1\}$ are first concatenated before being fed into convolution layers. The resulting feature maps, denoted as $\tilde{H}_{n+1}$, can be expressed as
$$\tilde{H}_{n+1} = \mathrm{Conv}([H_{n+1}, \ldots, H_1]).$$
Similarly, $\tilde{H}_{n+1}$ is then first downscaled before being upscaled. After that, the difference between $\tilde{H}_{n+1}$ and its upscaled counterpart, denoted as $e^{H}_{n+1}$, is downscaled. Finally, the outputs from both downscale operations are added together to generate the $(n+1)$th down-sampled output $L_{n+1}$ as follows:
$$L_{n+1} = \mathrm{Down}(\tilde{H}_{n+1}) + \mathrm{Down}(e^{H}_{n+1})$$
where $\mathrm{Down}(\cdot)$ stands for the downscale operation.
Finally, the decoder tail $D_{tail}$ with two convolutional layers gathers all the up-sampled results to compute the final HR image as
$$I_{SR} = D_{tail}([H_1, \ldots, H_U])$$
where $U$ is the total number of the up-sampling blocks.
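The error-feedback idea behind the up- and down-sampling units can be sketched on a one-dimensional signal. The sketch below is illustrative only and rests on assumptions: `upscale` (linear interpolation by a factor of 2) and `downscale` (pairwise averaging) are fixed stand-ins for the learned deconvolution and downscale layers, the concatenation and convolution of earlier features are left out, and the residual is taken as the input minus its round-trip reconstruction.

```python
def upscale(x):
    """Double the length by linear interpolation (stand-in for a learned deconvolution)."""
    out = []
    for i, v in enumerate(x):
        nxt = x[i + 1] if i + 1 < len(x) else v
        out.extend([v, 0.5 * (v + nxt)])
    return out

def downscale(y):
    """Halve the length by averaging pairs (stand-in for a learned downscale layer)."""
    return [0.5 * (y[i] + y[i + 1]) for i in range(0, len(y), 2)]

def up_unit(l):
    """Up-sampling unit: upscale, project back, then upscale the LR residual and add."""
    h0 = upscale(l)
    l_back = downscale(h0)
    err = [a - b for a, b in zip(l, l_back)]     # residual at low resolution
    h1 = upscale(err)
    return [a + b for a, b in zip(h0, h1)]

def down_unit(h):
    """Down-sampling unit: the mirrored operation, with the residual at high resolution."""
    l0 = downscale(h)
    h_back = upscale(l0)
    err = [a - b for a, b in zip(h, h_back)]     # residual at high resolution
    l1 = downscale(err)
    return [a + b for a, b in zip(l0, l1)]

lr = [0.0, 2.0, 4.0, 6.0]
hr = up_unit(lr)
print(len(hr), len(down_unit(hr)))
```

For a constant signal the round trip is exact, so the residual vanishes and the unit reduces to a plain upscale; the correction term only contributes where a single upscale loses information.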

A. Datasets and Metrics
In this section, two remote sensing datasets, i.e., OLI2MSI [59] and Alsat [60], are employed to evaluate the proposed model. The OLI2MSI dataset comprises Landsat8-OLI and Sentinel2-MSI images with 5225 and 100 pairs of images for training and testing, respectively. Landsat8-OLI images with a spatial resolution of 30 m serve as the LR input, and Sentinel2-MSI images with a spatial resolution of 10 m are regarded as the HR ground truth. The Alsat dataset contains 2182 training samples and three subdatasets for testing, namely scenes of "agriculture," "urban," and "special" structures, with 56, 282, and 239 image pairs, respectively. Two widely used metrics, i.e., peak signal-to-noise ratio (PSNR) and structural similarity (SSIM), are used to quantitatively evaluate the SR performance.
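For reference, PSNR follows directly from the mean squared error between the reconstructed image and the ground truth. The helper below is a minimal sketch of the standard definition; the peak value of 255 assumes 8-bit imagery and would need to be changed for data stored at another bit depth, and the function name `psnr` is illustrative.

```python
import math

def psnr(reference, estimate, max_value=255.0):
    """PSNR in dB between two equally sized images given as nested lists of rows."""
    ref = [p for row in reference for p in row]
    est = [p for row in estimate for p in row]
    if len(ref) != len(est):
        raise ValueError("images must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(ref, est)) / len(ref)
    if mse == 0:
        return math.inf                      # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

hr = [[10, 20], [30, 40]]
sr = [[11, 21], [31, 41]]                    # every pixel off by one grey level
print(round(psnr(hr, sr), 2))
```

Since PSNR is logarithmic in the error, gains of a few hundredths of a decibel correspond to very small changes in mean squared error.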

B. Implementation Details
In the proposed network, 16 NARDBs are adopted in the encoder with 3 × 3 convolution for feature extraction and 1 × 1 convolution for feature fusion. We use padding operators for all convolutional layers. The chunk size K in the NLSA is set to 144 in OLI2MSI and 25 in Alsat. The total number of up-sampling processes is set to U = 6, i.e., six up-sampling operations are utilized in the decoder. The network is trained using the L1 loss with a batch size of 16 and a learning rate of $10^{-4}$. All images are cropped into patches of 32 × 32 LR inputs and 96 × 96 HR outputs in the OLI2MSI dataset. Similarly, 32 × 32 LR and 128 × 128 HR patches are generated for the training and testing on the Alsat dataset. All experiments are implemented with PyTorch on a single NVIDIA GeForce RTX 3090 GPU with 24 GB of memory.
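The patch sizes above imply scale factors of 3 for OLI2MSI (32 to 96) and 4 for Alsat (32 to 128). A minimal sketch of how aligned LR/HR training patches could be cut and compared with an L1 loss is given below; the helper names `paired_crop` and `l1_loss` are hypothetical, and the random crop position is an assumption, since the paper does not describe its cropping procedure.

```python
import random

def paired_crop(lr, hr, lr_size, scale, rng):
    """Cut an lr_size x lr_size LR patch and the spatially aligned HR patch."""
    h, w = len(lr), len(lr[0])
    if len(hr) != h * scale or len(hr[0]) != w * scale:
        raise ValueError("HR image must be exactly `scale` times the LR size")
    top = rng.randrange(h - lr_size + 1)
    left = rng.randrange(w - lr_size + 1)
    lr_patch = [row[left:left + lr_size] for row in lr[top:top + lr_size]]
    hr_patch = [row[left * scale:(left + lr_size) * scale]
                for row in hr[top * scale:(top + lr_size) * scale]]
    return lr_patch, hr_patch

def l1_loss(pred, target):
    """Mean absolute error over all pixels."""
    diffs = [abs(a - b) for pr, tr in zip(pred, target) for a, b in zip(pr, tr)]
    return sum(diffs) / len(diffs)

rng = random.Random(0)
scale = 3                                              # 30 m LR to 10 m HR
lr_img = [[i * 100 + j for j in range(40)] for i in range(40)]
hr_img = [[(i // scale) * 100 + (j // scale) for j in range(120)] for i in range(120)]
lr_p, hr_p = paired_crop(lr_img, hr_img, lr_size=32, scale=scale, rng=rng)
print(len(lr_p), len(lr_p[0]), len(hr_p), len(hr_p[0]))
```

Multiplying the LR crop offsets by the scale factor keeps the two patches registered, which is what allows a pixel-wise loss such as L1 to be applied.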

C. Comparisons With Advanced SR Models
To demonstrate the effectiveness of the proposed GCRDN, we compare it against 11 different state-of-the-art models, including RCAN [41], RDN [42], EDSR [23], EFDN [61], DBPN [58], EDRN [47], TranSMS [51], NLSN [53], ESRT [52], SRGAN [28], and ESRGAN [49].
1) OLI2MSI SR: As shown in Table I, the proposed GCRDN achieved the best performance among all the methods under evaluation. In particular, GCRDN equipped with the NLRD encoder and the DBP decoder demonstrated noticeable improvements as compared to our baseline RDN, achieving 0.12 dB and 0.0024 improvement in terms of PSNR and SSIM, respectively. This suggests that the proposed GCRDN is more effective in extracting texture information such as the mountains and roads and generating HR images. Furthermore, inspection of Table I suggests that the proposed GCRDN considerably outperformed the self-attention-based and CNN-based models. This is because the CNN-based models lacked global features while the self-attention-based models were incapable of effectively utilizing suitable local features. Furthermore, these models also suffered from the weak image reconstruction capability of the decoder. In contrast, the proposed GCRDN benefits from effective feature extraction by leveraging the synergy of the nonlocal attention modules and the back-sampling strategy. Finally, the GAN-based models, e.g., SRGAN and ESRGAN, showed worse performance as compared to the CNN-based and self-attention-based models, which indicated that these GAN-based models were less effective for this remote sensing dataset. Fig. 5 shows visual comparisons of images obtained with all methods under evaluation. As presented in Fig. 5, the outline of the roads restored by the proposed GCRDN is more precise and coherent than the others, demonstrating better texture information extraction and reconstruction of the proposed GCRDN.
In summary, the experimental results and visual comparisons discussed previously confirmed that the proposed GCRDN outperformed CNN-based and self-attention models by effectively exploiting global-local information in the encoder and making better feature representation in the decoder.
2) Alsat SR: For the Alsat dataset, inspection of Table I revealed that the proposed GCRDN achieved the best results among all methods on the collection of three subdatasets. In particular, GCRDN demonstrated 0.06 dB and 0.0079 improvement in terms of PSNR and SSIM, respectively, as compared to our baseline, RDN. Table I further illustrates the performance of all methods in various scenes. It is observed that the proposed GCRDN showed impressive performance in "agriculture," "special," and "urban" scenes. Indeed, the proposed GCRDN achieved the best performance in "agriculture" and "special" test sets because the images of "agriculture" and "special" scenes possess simple and repetitive patterns that can be easily captured and exploited by GCRDN. In "urban" scenes, although the performance of GCRDN is worse than that of RCAN in PSNR and that of ESRGAN in SSIM, GCRDN also generated SR images with a high perceptual quality in such a complicated situation. Visual comparisons of the results generated by all methods under consideration in "urban," "agriculture," and "special" scenes are shown in Figs. 6-8, respectively. Clearly, the proposed GCRDN generated clearer SR images with more detailed information and texture. The results from CNN-based methods, especially DBPN, are generally very smooth but blurry. The GAN-based methods produce many artifacts, which is highly undesirable in the remote sensing field. In general, SR images provided by the proposed model were visually closest to the HR images while ensuring high fidelity.

D. Ablation Study
Table II presents the ablation investigation on the effect of NLSA and back-sampling. To demonstrate the effect of NLSA and back-sampling, we trained and tested the models by removing the NLSA and replacing the back-sampling decoder with a simple up-sampling decoder on the OLI2MSI and Alsat datasets. The results demonstrate that the NLSA benefited the feature extraction of the residual dense blocks, with improvements of 0.033 dB and 0.0004 in OLI2MSI and 0.022 dB and 0.0063 in Alsat in terms of PSNR and SSIM, respectively. The enhancement of the decoder resulted in an improvement of 0.062 dB and 0.0013 in OLI2MSI and 0.040 dB and 0.0081 in Alsat in terms of PSNR and SSIM, demonstrating its effectiveness. The experimental results confirmed the benefits of the proposed NLSA and back-sampling components in the basic residual dense network.
Furthermore, as shown in Table III, we conduct additional experiments to evaluate the influence of the dense connections in the NLRD encoder and DBP decoder. The dense connections improve the feature extraction and expression performance with 0.029 dB and 0.0050 in the encoder and 0.234 dB and 0.0011 in the decoder in OLI2MSI and Alsat, respectively. Moreover, the last column demonstrates that combining the dense connections in the encoder and decoder could make further improvements.

E. Parameter Analysis
In this section, we investigate the impact of parameter settings on the performance of the proposed GCRDN.
1) Size of Attention Bucket: The sparsity of NLSA is controlled by the size of the attention bucket (chunk size) K. Theoretically, a smaller chunk size can lead to SR images of higher quality under the condition that most correlated elements are identified. We evaluated chunk sizes of {4, 25, 100, 144, 225}. From Fig. 9(a), it is observed that the PSNR and SSIM performance was insensitive to the chunk size, with the largest deviation of 0.007 dB for PSNR and 0.0001 in SSIM across the different chunk sizes evaluated, demonstrating the stability of our proposed GCRDN.
2) Stages of Up/Down-Sampling: In this test, we investigated the influence of the stage number of up- and down-sampling operations, i.e., U − 1 in the DBP decoder, on feature expression performance. We trained and tested the proposed GCRDN with {2, 4, 5, 6, 8} stages of up- and down-sampling. As shown in Fig. 9(b), the SSIM and PSNR performances were not very sensitive to the stage number of up/down-sampling, with the largest deviation of 0.061 dB for PSNR and 0.0014 in SSIM across the different stage numbers evaluated. Furthermore, four and five stages of up- and down-sampling produced the best SSIM and PSNR, respectively.

F. Visual Activation Maps
Fig. 10 compares the activation maps of the RDN encoder, the RDN decoder, the NLRD encoder, and the DBP decoder using two images. For both images, the activation maps of the NLRD encoder clearly showed more apparent details than those from RDN, indicating that NLRD achieved effective feature extraction with edge enhancement. Using the same encoder, the DBP decoder demonstrated more effective high-resolution information reconstruction with more evident textures than RDN, especially on blurred boundary lines. The activation maps of the NLRD encoder and the DBP decoder confirmed that the proposed GCRDN could restore high-frequency details while alleviating the blurring artifacts, resulting in sharp and natural edges.

G. Computational Complexity Analysis
Finally, we compare the computational complexity of all methods under evaluation in terms of model complexity, memory, parameters, and inference speed. In particular, we divided the models into three categories, namely the CNN-based, the self-attention-based, and the GAN-based methods. As shown in Table IV, the proposed GCRDN has an inference speed comparable to that of other models while requiring higher complexity and memory.

V. CONCLUSION
In this article, a novel SR method named GCRDN has been proposed by exploiting an NLRD encoder and a DBP decoder to perform effective feature extraction and expression for remote sensing image SR. The NLRD encoder is designed to extract distinct global contextual features, whereas the DBP decoder bridges the gap between the enriched features and the final high-quality reconstructed images through its circular sampling structure. As a result, the proposed GCRDN can effectively characterize the complex content of remote sensing images and restore accurate high-resolution images. Extensive comparative experiments on the OLI2MSI and Alsat datasets have confirmed the superior performance of the proposed GCRDN. In the future, we will introduce the diffusion model to further improve the texture reconstruction capability of our model. Moreover, we will extend our model to other image restoration tasks, such as image denoising and cloud removal.