Remote Sensing Imagery Super Resolution Based on Adaptive Multi-Scale Feature Fusion Network

Due to increasingly complex image degradation factors, inferring the high-frequency details of remote sensing imagery is more difficult than for ordinary digital photos. This paper proposes an adaptive multi-scale feature fusion network (AMFFN) for remote sensing image super-resolution. First, features are extracted from the original low-resolution image. Then, several adaptive multi-scale feature extraction (AMFE) modules, together with squeeze-and-excitation and adaptive gating mechanisms, are adopted for feature extraction and fusion. Finally, the sub-pixel convolution method is used to reconstruct the high-resolution image. Experiments are performed on three datasets. Key characteristics, such as the number of AMFE modules and the gating connection method, are studied, and the super-resolution results for remote sensing imagery at different scale factors are analyzed qualitatively and quantitatively. The results show that our method outperforms classic methods such as the Super-Resolution Convolutional Neural Network (SRCNN), the Efficient Sub-Pixel Convolutional Network (ESPCN), and the multi-scale residual CNN (MSRN).


Introduction
Image super-resolution (SR) is a classical yet challenging problem in the field of computer vision. The goal of image super-resolution is to reconstruct a visually pleasing high-resolution (HR) image from one or more low-resolution (LR) images [1]. Remote sensing imagery, captured by satellite optical imaging sensors, provides abundant information for monitoring the Earth's surface and has broad applications in object matching and detection, land cover classification, assessment of urban economic levels, resource exploration, etc. High-resolution remote sensing images have proved to play an important role in these applications. However, due to factors such as long-distance imaging, atmospheric turbulence, transmission noise, and motion blur, the quality and spatial resolution of remote sensing imagery are relatively poor compared with natural images. Moreover, the ground objects in remote sensing imagery usually have different scales, causing the objects and their surrounding environment to be mutually coupled in the joint distribution of their image patterns [2]. Therefore, super-resolution for remote sensing imagery has attracted great interest and become a hot research topic.
For remote sensing image super-resolution, this paper proposes an Adaptive Multi-scale Feature Fusion Network (AMFFN). AMFFN can extract dense features directly from the original low-resolution image without any image interpolation preprocessing. Several adaptive multi-scale feature filtering blocks are cascaded to adaptively extract high-frequency detailed feature information of remote sensing imagery.
In summary, this paper makes the following contributions: (1) An adaptive multi-scale feature fusion network for remote sensing image super-resolution, which can adaptively extract multi-scale feature information; (2) The squeeze-and-excitation and adaptive gating mechanisms are integrated for feature extraction and fusion. They learn the channel correlation of the feature maps, adaptively decide how much of the previous feature information should be reserved, reduce the redundant information among the intermediate multi-scale features, and enhance the use of useful feature information.
The remainder of this article is organized as follows. Section 2 describes the network structure and the implementation details. Section 3 presents the experimental results on remote sensing image super-resolution and compares them with other classical methods. The conclusions are given in Section 4.

Network Architecture
The network structure of AMFFN consists of four parts: original feature extraction, adaptive multi-scale feature extraction, feature fusion, and image reconstruction, as shown in Figure 1. The adaptive multi-scale feature extraction part is the core of our algorithm.
Sensors 2020, 20, x FOR PEER REVIEW 3 of 15
The input of our network is the original low-resolution image to be super-resolved, denoted as $I_{LR}$. A convolutional layer conv with $n_0$ filters is first applied to the input image to produce a set of feature maps,
$$A_0 = w_0 * I_{LR} + b_0 \quad (1)$$
where $A_0$ is the original feature maps extracted from the low-resolution remote sensing imagery, $w_0$ corresponds to the filters of the convolutional layer (128 filters with a spatial size of 3 × 3 in this paper), $b_0$ denotes the biases of the convolutional layer, and '$*$' represents the convolution operation. In the adaptive multi-scale feature extraction part, suppose there are $n$ adaptive multi-scale feature extraction (AMFE) modules. The output of the $i$-th AMFE, $A_i$, can be represented as
$$A_i = f_{MFE}(A_{i-1}) + g(A_{i-1}) \quad (2)$$
where $f_{MFE}(\cdot)$ denotes the operation of multi-scale feature extraction and $g(\cdot)$ represents the adaptive feature gating operation. The details are elaborated in the following sub-sections. AMFE is the basic module for adaptive feature extraction. It consists of a multi-scale feature extraction (MFE) unit and a feature gating unit that adaptively retains the feature information from the output of the previous AMFE. Through feature extraction, a series of feature maps, $A_0, \cdots, A_n$, is obtained. These feature maps contain a large amount of redundant information, which would increase the computational burden significantly if they were used directly for image reconstruction.
Therefore, before delivering these features for super-resolution, a feature fusion layer is stacked after the AMFE modules for feature fusion and reduction. The output of the feature fusion layer, $A_{fusion}$, is formulated as
$$A_{fusion} = w_f * [A_0, A_1, \cdots, A_n] + b_f \quad (3)$$
where $w_f$ corresponds to the weights of the feature fusion layer, which has 64 filters of size 1 × 1, $b_f$ is the corresponding bias, and $[A_0, A_1, \cdots, A_n]$ denotes the concatenation of all feature maps extracted by the first feature extraction layer conv and the AMFE modules. As in many CNN-based single image super-resolution (SISR) methods, the sub-pixel convolution method is adopted to reconstruct the high-resolution image. With bias terms omitted for brevity, the reconstruction function can be defined as follows,
$$I_{SR} = w_{s2} * \mathrm{shuffle}(w_{s1} * A_{fusion}) \quad (4)$$
where $w_{s1}$ denotes the weights of a 3 × 3 convolution layer. If the scale factor is $r$ (e.g., ×2), the number of filters in this convolution layer is $C \cdot r^2$, where $C$ refers to the channel number of the input feature maps. $\mathrm{shuffle}(\cdot)$ represents the shuffling operation that rearranges the elements of an $H_{LR} \times W_{LR} \times C \cdot r^2$ tensor acquired in the top layer into an $rH_{LR} \times rW_{LR} \times C$ tensor; more details can be found in [10]. A 3 × 3 convolution layer $w_{s2}$ with $C_1$ filters is then used to reconstruct the remote sensing image, where $C_1$ represents the number of channels of the original input image (e.g., $C_1 = 3$ for an RGB image). The resulting tensor of size $rH_{LR} \times rW_{LR} \times C_1$ is the desired reconstructed high-resolution image $I_{SR}$. In our paper, the L1 loss function is chosen to avoid introducing unnecessary training tricks and to reduce computation.
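The shuffling operation described above can be illustrated with a small pure-Python sketch. It is not the authors' implementation: the function name `pixel_shuffle`, the nested-list tensor layout (height, width, channels), and the channel ordering `k*r*r + di*r + dj` are assumptions made for illustration. The ordering is chosen to match the one documented for PyTorch's `PixelShuffle`.

```python
def pixel_shuffle(x, r):
    """Rearrange an H x W x (C*r*r) nested list into an (r*H) x (r*W) x C one.

    Assumption: channel index k*r*r + di*r + dj of the input holds the value
    for output channel k at sub-pixel offset (di, dj).
    """
    h, w, depth = len(x), len(x[0]), len(x[0][0])
    if depth % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c = depth // (r * r)
    out = [[[0.0] * c for _ in range(w * r)] for _ in range(h * r)]
    for i in range(h):
        for j in range(w):
            for k in range(c):
                for di in range(r):
                    for dj in range(r):
                        src = k * r * r + di * r + dj
                        out[i * r + di][j * r + dj][k] = x[i][j][src]
    return out


# One LR pixel with 4 channels becomes a 2 x 2 HR block with 1 channel.
hr = pixel_shuffle([[[1, 2, 3, 4]]], 2)
print(hr)
```

No values are created or discarded by the rearrangement: the number of elements is unchanged, and only the trade-off between channel depth and spatial size shifts by a factor of r in each spatial dimension.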

Adaptive Multi-Scale Feature Extraction
As previously mentioned, the adaptive multi-scale feature extraction (AMFE) module is the core module of our method. The structure of AMFE is illustrated in Figure 2. It mainly consists of two units: a multi-scale feature extraction and filtering unit, and a feature gating unit. The former contains two parts: multi-scale feature extraction and feature filtering.

Multi-Scale Feature Extraction Unit
First, the feature maps output by the previous AMFE are processed with a convolutional layer. For the $i$-th AMFE, this can be defined as
$$M_i^0 = \varphi(w_i^0 * A_{i-1} + b_i^0) \quad (5)$$
where $M_i^0$ denotes the output of this layer, $A_{i-1}$ is the feature maps from the previous AMFE (each AMFE outputs 128 feature maps in this paper), $w_i^0$ corresponds to 128 filters of size 128 × 3 × 3, $b_i^0$ is the corresponding bias, and $\varphi(\cdot)$ represents the ReLU activation function.
Then, three types of filters, $f_i^1 = 1 \times 1$, $f_i^2 = 3 \times 3$, and $f_i^3 = 5 \times 5$, are used to extract multi-scale features, as expressed by Equation (6), where $j$ denotes the type index of the filters. Each filter bank contains 64 filters, that is, $n_i^1 = n_i^2 = n_i^3 = 64$, and the convolutional outputs are concatenated and divided into 3 groups. With filters of different spatial sizes and the cascaded structure of AMFE, we can build a hierarchical system that extracts multi-scale image feature information. The filter of spatial size 1 × 1 is mainly used to perform dimension reduction, but it can also learn the channel correlation between the feature maps, that is, "extract" feature information along the channel direction.
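To make the channel and parameter bookkeeping of the three filter banks concrete, the following sketch computes the zero padding that preserves spatial size, the concatenated channel count, and the number of learnable parameters. It assumes, as an illustration only, that the three banks run in parallel on the 128 feature maps produced by the first convolution, use stride 1 with "same" padding, and each carry a bias. The helper names are hypothetical.

```python
def same_padding(kernel_size):
    """Zero padding that keeps the spatial size for an odd kernel and stride 1."""
    return kernel_size // 2


def conv_output_size(size, kernel_size, padding, stride=1):
    """Standard convolution output-size formula."""
    return (size + 2 * padding - kernel_size) // stride + 1


def conv_params(in_channels, out_channels, kernel_size):
    """Weights plus one bias per output channel."""
    return in_channels * out_channels * kernel_size * kernel_size + out_channels


IN_CHANNELS = 128     # assumption: banks read the 128 maps of the first conv
BANK_FILTERS = 64     # 64 filters per bank, as stated in the text
KERNELS = (1, 3, 5)   # 1 x 1, 3 x 3 and 5 x 5 filters
PATCH = 64            # LR training patch size used in the paper

branches = []
for k in KERNELS:
    pad = same_padding(k)
    branches.append({
        "kernel": k,
        "padding": pad,
        "out_size": conv_output_size(PATCH, k, pad),
        "params": conv_params(IN_CHANNELS, BANK_FILTERS, k),
    })

concat_channels = BANK_FILTERS * len(KERNELS)
total_params = sum(b["params"] for b in branches)

for b in branches:
    print(b)
print("concatenated channels:", concat_channels)
print("total parameters:", total_params)
```

Under these assumptions the 5 × 5 bank dominates the parameter count, which is one reason a 1 × 1 bank is attractive as an inexpensive channel-mixing path.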

Feature Filtering Unit
To enhance the sensitivity to informative features, feature filtering follows the multi-scale feature extraction. We borrow the idea of squeeze-and-excitation (SE), proposed by Hu et al. [27], to promote useful features and suppress less useful ones. The SE method first uses global average pooling to generate channel-wise statistics, which serve as a channel descriptor. Then, two fully-connected (FC) layers around a non-linearity form a bottleneck to derive the scalar corresponding to each feature map. For computational efficiency, the FC layers are replaced by 1 × 1 convolution layers. The diagram of feature filtering is illustrated in Figure 3. In the formulation of the feature filtering unit, $A_{impor}(\cdot)$ represents the operation of determining the importance score of each feature map, and $w_i^4$ corresponds to 128 filters with a spatial size of 1 × 1.
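The squeeze-and-excitation idea described above can be sketched in a few lines of standard-library Python. This is an illustrative toy, not the paper's unit: the weight matrices `w_reduce` and `w_expand` are hypothetical, biases are omitted, and the sigmoid at the end follows the original SE design of Hu et al. rather than anything stated in this text. On a globally pooled (1 × 1) descriptor, a 1 × 1 convolution reduces to a matrix-vector product, which is what `dense` computes.

```python
import math


def global_average_pool(feature_maps):
    """Mean of each channel; feature_maps is a list of 2-D lists, one per channel."""
    return [sum(sum(row) for row in fm) / (len(fm) * len(fm[0]))
            for fm in feature_maps]


def dense(weights, vector):
    """Matrix-vector product (a 1 x 1 convolution on a 1 x 1 spatial map)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def channel_scores(feature_maps, w_reduce, w_expand):
    """Squeeze (pool), bottleneck with ReLU, then sigmoid importance scores."""
    z = global_average_pool(feature_maps)
    hidden = [max(0.0, h) for h in dense(w_reduce, z)]
    return [1.0 / (1.0 + math.exp(-s)) for s in dense(w_expand, hidden)]


def filter_features(feature_maps, w_reduce, w_expand):
    """Scale every channel by its importance score."""
    scores = channel_scores(feature_maps, w_reduce, w_expand)
    return [[[s * v for v in row] for row in fm]
            for s, fm in zip(scores, feature_maps)]


# Two 2 x 2 channels, a bottleneck of width 1 (all weights are made up).
features = [[[1.0, 1.0], [1.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]]]
w_reduce = [[0.5, 0.5]]
w_expand = [[1.0], [-1.0]]
print(channel_scores(features, w_reduce, w_expand))
print(filter_features(features, w_reduce, w_expand))
```

Each score lies strictly between 0 and 1, so a channel can be attenuated but never amplified beyond its original magnitude in this sketch.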

Feature Gating Unit
When the structure of the network is fixed, it is neither adaptive nor flexible enough to cope with complex situations, especially for remote sensing imagery. Therefore, a simple feature gating mechanism is adopted in the adaptive multi-scale feature extraction module, as illustrated in Figure 1. A shortcut connection enables the features output by the previous AMFE module to be fed to the current AMFE module directly, which helps reduce the loss of feature information during transmission. In this paper, a feature gating mechanism is used to adaptively decide how much of the previous feature information should be reserved, and the implementation details are shown in Figure 4. The key to feature gating is how to adaptively obtain the gating score $score(A_{i-1})$ for the input feature $A_{i-1}$. Once the gating score, which is a scalar, is determined, the reserved feature information is
$$g(A_{i-1}) = score(A_{i-1}) \cdot A_{i-1}$$
where $g(\cdot)$ represents the gating operation. To calculate the gating score while limiting the computational burden, average pooling is used to reduce the dimension of the feature maps, so that global information is used to learn the gating score. Then, to capture the dependencies between channels, we add a simple non-linear function consisting of two fully-connected layers connected by BatchNorm [28] and a ReLU activation function, whose output is a vector $V$ of two elements. After the softmax operation, $V$ becomes a normalized vector with $V[0] + V[1] = 1$. We define the second element, $V[1]$, as the desired gating score, which represents the proportion of feature information that needs to be reserved.
To enhance the robustness, noise with a Gumbel distribution is added when deriving the vector $V$; that is, the Gumbel-Softmax strategy [29] is used in place of the softmax. The new vector $V$ is then calculated as follows,
$$V \leftarrow \mathrm{softmax}\left((V + G)/\tau\right)$$
where $V$ on the right-hand side is the two-element output of the fully-connected layers, $G$ is the Gumbel noise vector, each element $G_i$ of which follows the Gumbel(0,1) distribution and can be sampled by inverse transform sampling, drawing $u_i \sim \mathrm{Uniform}(0,1)$ and computing $G_i = -\log(-\log(u_i))$, and $\tau$ is the softmax temperature, which is set to 1 in our paper.
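The gating score computation can be sketched with the standard library alone. The sketch assumes that the two fully-connected layers have already produced a two-element vector (called `logits` here, a hypothetical name), and that the gate multiplies the previous features by the scalar score. The Gumbel sampling follows the inverse transform given in the text; the clamping of `u` is an added numerical safeguard, not part of the paper.

```python
import math
import random


def sample_gumbel(rng):
    """Draw G = -log(-log(u)) with u ~ Uniform(0, 1)."""
    u = rng.random()
    u = min(max(u, 1e-12), 1.0 - 1e-12)  # safeguard against log(0)
    return -math.log(-math.log(u))


def softmax(values):
    """Numerically stable softmax."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_softmax(logits, tau=1.0, rng=None):
    """Softmax over (logits + Gumbel noise) / tau."""
    if rng is None:
        rng = random.Random()
    noisy = [(v + sample_gumbel(rng)) / tau for v in logits]
    return softmax(noisy)


def gate(previous_features, logits, tau=1.0, rng=None):
    """Return the gating score V[1] and the retained share of the features."""
    v = gumbel_softmax(logits, tau, rng)
    score = v[1]
    return score, [score * a for a in previous_features]


score, kept = gate([1.0, 2.0, 4.0], logits=[0.2, 1.3], rng=random.Random(0))
print(score, kept)
```

With τ = 1, as in the paper, the noise perturbs the score from one forward pass to the next; a smaller τ would push the two-element vector closer to a hard 0/1 decision.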

Datasets and Performance Metrics
To verify the effectiveness of the proposed method, three remote sensing image datasets are used. The first is the UC Merced land-use dataset (referred to as UC hereafter) [30]. The dataset includes 21 types of scenes, and each scene has 100 images with a size of 256 × 256 pixels and a spatial resolution of 0.3 m. For each type of scene, 80 images are randomly selected for the training image set, and the remaining 20 images are used for the test image set. The second is NWPU-RESISC45 (referred to as NW hereafter) [28]. The dataset contains 45 types of scenes, each of which has 700 images with the same size of 256 × 256 pixels and a spatial resolution varying from 30 m to 0.2 m. For each type of scene, 100 images are randomly selected for the training image set, and 10 images are randomly selected from the remainder for the test image set. The third consists of images captured by the TianGong-2 satellite (referred to as TG hereafter) [31]. The dataset consists of 6 types of scenes with 2000 images in total, all of which are used for the test image set. In this way, we build our training and test image sets. The overall information and example images of the experimental datasets are given in Table 1 and Figure 5, respectively.
The algorithm is implemented in the PyTorch framework, and the model is trained on an NVIDIA Titan Xp graphics processing unit (GPU) with an Intel(R) Xeon(R) Silver 4116 central processing unit (CPU). The original high-resolution images are downsampled by bicubic interpolation to generate the corresponding low-resolution images for training, and the training images are augmented by horizontal or vertical flipping and 90° rotation. For all training images, low-resolution patches with a size of 64 × 64 are extracted, and the total number of LR image patches is 11,124. In each training batch, we randomly extract 16 LR patches with a size of 64 × 64, so that one epoch has 696 iterations of back-propagation. The maximum number of epochs is 100, the learning rate is 0.0001, and the Adam optimizer is used. The peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) are selected as the evaluation metrics for each experiment.
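Two of the numbers above are easy to check with a short sketch: the iteration count per epoch and the PSNR metric. The PSNR function below uses the standard definition 10·log10(MAX² / MSE). The peak value of 255 assumes 8-bit images, which the text does not state, and the function names are illustrative.

```python
import math


def iterations_per_epoch(num_patches, batch_size):
    """Number of batches needed to see every patch once (last batch may be partial)."""
    return math.ceil(num_patches / batch_size)


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized 2-D images.

    Assumption: max_value=255 corresponds to 8-bit imagery.
    """
    flat_ref = [v for row in reference for v in row]
    flat_est = [v for row in estimate for v in row]
    mse = sum((a - b) ** 2 for a, b in zip(flat_ref, flat_est)) / len(flat_ref)
    if mse == 0:
        return float("inf")
    return 10.0 * math.log10(max_value ** 2 / mse)


# 11,124 patches in batches of 16, as reported in the text.
print(iterations_per_epoch(11124, 16))

# A reconstruction that is off by one grey level everywhere (MSE = 1).
hr = [[10, 20], [30, 40]]
sr = [[11, 21], [31, 41]]
print(round(psnr(hr, sr), 2))
```

The first call reproduces the 696 iterations per epoch quoted above, since 11,124 / 16 = 695.25 rounds up to 696.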

Number of AMFE Modules
The AMFE module is the core part of our network, and the number of modules largely determines the depth of the network. To find a suitable value, the number of AMFE modules is set to 2, 4, 8, 16, 24, and 32, and the resulting loss and PSNR values are shown in Figure 6 and Table 2.
From Table 2, it can be seen that the PSNR of the reconstructed images increases with the number of AMFE modules up to 16. This can be explained by the fact that, as the number of AMFEs increases, more feature information can be extracted, which benefits the super-resolution of remote sensing imagery. The rate of increase gradually slows as the number of AMFEs grows. When the number of AMFEs reaches 16, our network achieves its best performance: as shown in Figure 6, the loss decreases faster and is more stable than for the other settings, and the PSNR is the highest. When the number of AMFEs is 24 or 32, the network takes more time and converges relatively slowly, yet yields a worse result. Overfitting and insufficient training data may be the reason for this. For our experiments, we set the number of AMFEs to 16, considering the tradeoff between performance and computational efficiency.

Adaptive Feature Gating
To verify the effectiveness of our feature gating mechanism, three methods of connecting the MFE units are compared:
1) The output of an MFE is directly used as the input of the next MFE;
2) A shortcut is added to connect the output of the previous MFE with the input of the next MFE;
3) The shortcut in method 2) is replaced with our gating mechanism.
The comparison results are given in Figure 7 and Table 3. We can see that adding the skip connection enables the network to directly learn the difference between the features and reach a faster convergence speed, while our gating mechanism shows better performance in both convergence rate and PSNR. Its convergence is faster than that of the shortcut connection after 40 epochs, and its PSNR is higher than that of the other two methods by about 0.3 dB. The skip connection provides a shortcut that connects the output of the previous MFE directly with the input of the next MFE, which benefits the propagation of feature information but may result in information redundancy. In addition, excessive parameters might lead to overfitting. This may be the reason that the skip connection method achieves a worse result. Our feature gating strategy can learn from the actual images and adaptively determine the gating score, which decides what proportion of the feature information from the previous MFE is reserved and integrated. The experimental results indicate that the feature gating unit can effectively reduce redundant information and improve the performance of image super-resolution.
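The three connection methods compared above differ only in how the previous features are combined with the MFE output, which the following toy sketch makes explicit. It assumes that the shortcut and the gated path are merged with the MFE output by element-wise addition. The function `connect`, the mode names, and the stand-in `toy_mfe` are hypothetical and only illustrate the control flow, not the trained network.

```python
def connect(previous, f_mfe, mode, score=None):
    """Combine the MFE output with the previous features in one of three ways.

    mode = "plain":    MFE output only (method 1)
    mode = "shortcut": MFE output + previous features (method 2)
    mode = "gated":    MFE output + score * previous features (method 3)
    """
    extracted = f_mfe(previous)
    if mode == "plain":
        return extracted
    if mode == "shortcut":
        return [e + p for e, p in zip(extracted, previous)]
    if mode == "gated":
        if score is None:
            raise ValueError("gated mode needs a gating score")
        return [e + score * p for e, p in zip(extracted, previous)]
    raise ValueError("unknown mode: " + mode)


def toy_mfe(features):
    """Stand-in for multi-scale feature extraction: halves every value."""
    return [0.5 * v for v in features]


prev = [1.0, 2.0]
print(connect(prev, toy_mfe, "plain"))
print(connect(prev, toy_mfe, "shortcut"))
print(connect(prev, toy_mfe, "gated", score=0.25))
```

Seen this way, the shortcut is the special case of the gate with a fixed score of 1, and the plain connection is the case with a fixed score of 0; the gating mechanism lets the network choose a value in between for each input.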

Comparison Results with Other Classical Methods
Our proposed method, AMFFN, is compared with classical methods such as bicubic interpolation, SRCNN [8], ESPCN [10], and MSRN [19]. The quantitative results of these methods for scale factors ×2, ×3, and ×4 are given in Table 4. To ensure fairness, SRCNN, ESPCN, MSRN, and our network AMFFN are trained and tested on the same remote sensing image sets.
Compared with SRCNN and ESPCN, the PSNR obtained by our method is higher by 3 dB to 5 dB, which is a significant improvement. The reason is that our method can extract multi-scale features and realize adaptive feature fusion, which improves the super-resolution results. In contrast, SRCNN and ESPCN are essentially shallow networks with limited ability for feature extraction and fusion. Compared with MSRN, which is also a deep network and achieves better results than SRCNN and ESPCN, our method still performs better in terms of both PSNR and SSIM.

Visual comparisons for scale factor ×2 are shown in Figures 8-12. The results show that AMFFN can clearly reconstruct the green plants and the straight strips in the farm field in the UC dataset. However, bicubic interpolation, SRCNN, and ESPCN cannot accurately reconstruct the green plants. MSRN can reconstruct some green plants, but ringing effects arise when it reconstructs the straight strips in the farmland. For the images of urban scenes in the NW dataset and of mountain and farmland scenes in the TG dataset, the linear features and spatial structure of the high-resolution images reconstructed by our method are clearer.

Conclusions
This paper proposes an adaptive multi-scale feature fusion network (AMFFN) for the super-resolution of remote sensing imagery. Several adaptive multi-scale feature extraction (AMFE) modules are used to extract multi-scale feature information, and the squeeze-and-excitation and feature gating mechanisms are adopted to enhance the adaptability of feature extraction, so that intermediate feature information is adaptively selected and fully used. Quantitative and visual benchmarking results on different test datasets show that AMFFN outperforms the classical image super-resolution methods.
