Adversarial Spectral Super-Resolution for Multispectral Imagery Using Spatial Spectral Feature Attention Module

Acquiring high-quality hyperspectral imagery with high spatial and spectral resolution plays an important role in remote sensing. Due to the limited capacity of sensors, providing high spatial and spectral resolution is still a challenging issue. Spectral super-resolution (SSR) increases the spectral dimensionality of multispectral images to achieve resolution enhancement. In this article, we propose a spectral resolution enhancement method based on the generative adversarial network framework without introducing an additional spectral response prior. In order to adaptively rescale informative features for capturing interdependencies in the spectral and spatial dimensions, a spatial spectral feature attention module is introduced. The proposed method jointly exploits the spatio-spectral distribution in the hyperspectral manifold to increase spectral resolution while maintaining spatial content consistency. Experiments conducted on both synthetic Landsat 8 and Sentinel-2 radiance data and real coregistered Advanced Land Imager (ALI) and Hyperion (MS and HS) images indicate the superiority of the proposed method over other state-of-the-art methods.


I. INTRODUCTION
HYPERSPECTRAL sensors capture hyperspectral images (HSIs) with continuous wavelength coverage of a wide spectral range at nanometer-scale spectral resolution. HSIs record fine spectral signatures of ground materials and provide abundant spectral information for research in the field of remote sensing, such as spectral unmixing [1], mineral detection [2], and environment monitoring [3]. Although high spectral redundancy raises the computational complexity compared to multispectral data, hyperspectral data still allow classification and other tasks with improved accuracy by adopting feature selection for spectral signatures of interest and data reduction [4].

However, directly acquiring HSI with high spatial resolution from remote sensing sensors is challenging due to hardware limitations. A representative spaceborne example is the Hyperion hyperspectral imager carried on the NASA Earth Observing-1 (EO-1) satellite, which covers the spectral range from 0.43 to 2.4 μm with a spatial resolution of only 30 m. An airborne example is AVIRIS, which provides 224 contiguous spectral channels at approximately 20 m spatial resolution [5]. To obtain a reasonable signal-to-noise ratio, the tradeoff between spatial and spectral resolution comes at the cost of an increased instantaneous field of view. Therefore, HSIs are often acquired at the sacrifice of spatial resolution, which leads to the problem of mixed pixels [6]. In addition, degradation mechanisms in the acquisition process inevitably affect the imaging quality [7].

In contrast, multispectral images (MSIs) provide far fewer spectral bands with higher spatial resolution and lower acquisition costs. High-quality images with high spatial and spectral resolution are quite useful for various applications, so combining the advantages of different imagery sources benefits a variety of tasks. Since acquiring such images by improving the imaging quality of hardware sensors is quite challenging, postacquisition enhancement techniques are utilized to compensate for the inability of existing remote sensing sensors to provide high spatial and spectral resolution data simultaneously. In this article, we follow the concept of spectral super-resolution, where an HSI with high spatial and spectral resolution is reconstructed from an MSI, aiming to span the spectral dimension of the MSI while maintaining the same spatial resolution. Hence, the tradeoff between spatial and spectral resolution is overcome by combining the characteristics of HSI and MSI.
The spatio-spectral covariance matrix of hyperspectral data cubes is nondiagonal, and the autocorrelation functions are wide, which indicates the second-order redundancy of the data cube in both the spatial and spectral domains [8]. Hyperspectral data cubes are highly correlated and lie in a low-dimensional manifold; for hyperspectral processing, hyperspectral signals are therefore projected to a low-dimensional subspace to greatly reduce the computational complexity [9]. Due to the high correlation in the joint spatio-spectral domain, learning to reconstruct hyperspectral data from a suitable training dataset is feasible.

The spectral super-resolution problem is heavily underconstrained, especially for satellite remote sensing imagery, where hundreds of bands are reconstructed from no more than 20 bands. Spectrally reconstructed images expand the number of bands while keeping the spatial texture details consistent with the MSI. The spectral signatures exploited from adjacent spatial regions are beneficial to spectral enhancement and spatial content consistency; therefore, exploiting the inherent hyperspectral correlation from the similarity of neighboring pixels in both the spatial and spectral dimensions is beneficial for reconstruction. Previous sparse coding-based methods are shallow learning models trained on a single scene with limited training samples. Due to their poor expressive power and limited generalization ability, these methods only achieve the desired performance in specific scenes. Deep learning-based methods, mostly designed as larger and deeper networks for three-channel RGB spectral reconstruction, have shown much better performance. Some approaches introduce a consistency constraint between the reconstructed MSI and the original MSI based on the known spectral response function to reduce spectral distortion [10], [11], [12]. However, fewer methods are designed to explore the underlying spatio-spectral interdependencies for wider remote sensing scenarios [13].
In this work, we improve the reconstruction accuracy by using an adversarial loss as a constraint without knowing the specific spectral response function. Although the adversarial framework was first utilized in the RGB spectral reconstruction task [14], recovering an HSI of more than 100 bands from an MSI remains quite challenging for adversarial learning.
In this article, we propose a spectral super-resolution technique based on the framework of generative adversarial networks. The proposed approach jointly exploits the underlying spatio-spectral priors through adversarial learning, without introducing a spectral response prior. The main contributions of this article can be summarized as follows.
1) Based on the adversarial learning framework, the proposed method learns an end-to-end mapping between multispectral and hyperspectral imagery acquired by a specific sensor, without additional prior knowledge.

2) To further explore the correlation in the joint spatio-spectral domain, a spatial spectral feature attention module (SSFAM) is proposed to improve the hyperspectral representation ability of features by selectively enhancing information useful for reconstruction, inspired by attention mechanisms in low-level tasks [15], [16], [17]. Experimental results show the superiority of SSFAM, especially on a real paired dataset with high interband correlation.

The rest of this article is organized as follows. Section II presents a brief overview of spectral super-resolution methods. Section III presents the proposed spectral super-resolution scheme for multispectral and hyperspectral imagery. Section IV provides the data preparation and experimental results on both simulated and real paired remote sensing datasets. Finally, Section V concludes this article.

II. RELATED WORK
As mentioned before, acquiring remote sensing imagery with high spatial and spectral resolution is beneficial to numerous practical applications. Postacquisition techniques can be categorized into spatial resolution enhancement and spectral resolution enhancement. Spatial resolution enhancement has been actively exploited. Zhao et al. [18] improved spatial resolution by fusing HSIs and panchromatic images under the joint regularization of spatial and spectral nonlocal similarities. Fu et al. [19] introduced a bidirectional structure in single HSI super-resolution to exploit the spatial-spectral correlation of HSI and the global correlation along spectra. Li [20] fused MSI and HSI by utilizing a novel adaptive nonnegative sparse representation-based model.
Compared with spatial resolution enhancement, spectral resolution enhancement, namely spectral super-resolution, has drawn less attention. Spectral super-resolution corresponds to reconstructing an HSI from a single RGB image or MSI. Recently, reconstruction from a single RGB image to hyperspectral imagery has attracted considerable attention [21]. Nguyen et al. [22] proposed a radial basis function network for mapping normalized RGB values to spectral reflectances and provided a calculation of the global spectral illumination. Arad et al. [23] created an overcomplete hyperspectral dictionary by exploiting hyperspectral priors using the K-SVD algorithm [24]; the hyperspectral spectrum was then estimated as a linear combination of the dictionary atoms with a weight vector calculated by the orthogonal matching pursuit algorithm [25]. Aeschbacher et al. [26] further improved the method of Arad et al. [23] with a shallow A+-based model [27]. Akhtar et al. [28] modeled natural spectra by Gaussian processes with kernels learned from multiple sets of clustered spatial-spectral correlated hyperspectral patches. Over recent years, reconstruction through an end-to-end CNN mapping has become the mainstream approach. Inspired by the significant improvements of CNNs in single image super-resolution [29], [30], [31], Galliani et al. [32] developed an end-to-end network achieving spectral super-resolution based on the DenseNet architecture [33] without knowing the specific camera response function. Inspired by the VDSR network [34], Xiong et al. [35] designed a residual network to recover the HSI from a spectrally upsampled RGB image. Shi et al. [36] further developed the HSCNN-R network by replacing the hand-crafted upsampling and proposed the HSCNN-D network based on a densely-connected structure with a path-widening fusion strategy. Alvarez-Gila et al. [14] applied a conditional generative adversarial framework to help recover spatial contextual information. Li et al. [11] integrated a camera spectral sensitivity prior into an adaptive weighted attention network for more accurate reconstruction. Li et al. [37] embedded a dual second-order attention mechanism in a novel 2-D-3-D CNN structure to further enhance the representational ability of the network. Li et al. [38] proposed a two-pathway attentional network with a structure information consistency module to maintain high-frequency details. Hang et al. [13] fully exploited the spectral correlation and low-dimensional projection property of HSI using two subnetworks.
While most researchers focused on recovering HSI from RGB images obtained with a specific camera response function, Arad et al. [36] demonstrated that the selection of the camera response filter impacts reconstruction accuracy. Motivated by their work, Nie et al. [12] introduced a convolutional layer for learning an optimized camera response function with nonnegativity and smoothness constraints. Similarly, Fu et al. [39] combined optimal camera response selection from a candidate dataset with high-fidelity hyperspectral recovery in a single training process.
Approaches for reconstructing hyperspectral from multispectral remote sensing imagery are mostly based on sparse representation. Fotiadou et al. [40] introduced a coupled dictionary learning technique based on the alternating direction method of multipliers (ADMM) [41] to tackle the SSR problem on remote sensing imagery, but its pixel-wise operation neglects the correlation of neighboring pixels, leading to poor generalization on other datasets. Yi et al. [42] added a spatial constraint as a spatial preservation strategy to ensure the spatial consistency of MSI and HSI in their spectral improvement approach, although multiple dictionaries had to be trained and the reconstruction procedure was complicated. Gao et al. [43] investigated a joint sparse and low-rank dictionary learning method (J-SLoL) for recovering large-coverage hyperspectral imagery. Zheng et al. [44] proposed a spatial-spectral residual attention network (SSRAN) using parallel branches to extract spatial and spectral features; a neighboring spectral attention module was also proposed to maintain the correlation of neighboring spectral bands in the reconstruction stage.
It is also worth mentioning that the concept of spatial and spectral joint super-resolution has been put forward by Mei et al. [45] using simultaneous and sequential spatial-spectral joint SR frameworks.

III. SPECTRAL RESOLUTION ENHANCEMENT METHOD
In the SSR problem, MSIs can be viewed as spectrally downsampled HSIs from a specific sensor. Let $I_{MS} \in \mathbb{R}^{l \times h \times w}$ denote an MSI, where $l$, $h$, and $w$ denote the number of multispectral bands, the image height, and the image width, respectively. The corresponding HSI is denoted as $I_{HS} \in \mathbb{R}^{L \times h \times w}$, where $L$ represents the number of hyperspectral bands, while the spatial dimensions remain the same.
SSR aims to expand the spectral dimension of an MSI; thus, the proposed model is optimized to learn the mapping function that restores $\hat{I}_{HS} \in \mathbb{R}^{L \times h \times w}$ as realistically as possible. The mapping is implemented through a generator G with estimated parameters $\theta_G$ as follows:

$$\hat{I}_{HS} = G_{\theta_G}(I_{MS}).$$

In the training process, a discriminator D with estimated parameters $\theta_D$ takes the generated HSI $\hat{I}_{HS}$ and the real HSI $I_{HS}$ as input and predicts the authenticity of the input. The prediction acts as a penalty in the optimization of G and D: it drives G to produce as realistic an $\hat{I}_{HS}$ as possible, while it drives the discriminator to predict as correctly as possible. Therefore, D and G are optimized in an alternating manner to solve the min-max problem

$$\min_{\theta_G} \max_{\theta_D} V_{GAN}(D, G) = \mathbb{E}\left[\log D(I_{HS})\right] + \mathbb{E}\left[\log\left(1 - D\left(G(I_{MS})\right)\right)\right].$$

Another reading of this formula is that the generated HSI should be so similar to real HSIs as to fool the discriminator D attempting to distinguish spectrally reconstructed images from real images. From this perspective, the proposed model forces the generated image toward the target image manifold to conduct the reconstruction. In the testing process, only the trained generator G is used to reconstruct the HSI.
$V_{GAN}(D, G)$ is the objective function that adopts the sigmoid cross-entropy loss. Viewing the discriminator as a binary classifier causes the problem of vanishing gradients. Least squares generative adversarial networks (LSGANs) [46] instead adopt least squares loss functions that penalize samples far from the decision boundary, which greatly stabilizes the training process. Therefore, the LSGAN framework is utilized in the proposed method to substitute for the traditional sigmoid cross-entropy loss. Since the prediction on real HSIs has no impact on the optimization of G, the objective functions designed for G and D are as follows:

$$\min_{\theta_D} V(D) = \frac{1}{2}\,\mathbb{E}\left[\left(D(I_{HS}) - 1\right)^2\right] + \frac{1}{2}\,\mathbb{E}\left[\left(D\left(G(I_{MS})\right)\right)^2\right]$$

$$\min_{\theta_G} V(G) = \frac{1}{2}\,\mathbb{E}\left[\left(D\left(G(I_{MS})\right) - 1\right)^2\right].$$
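As a concrete illustration, a minimal PyTorch sketch of the LSGAN objectives above follows; the argument names and the 0/1 target labels are illustrative assumptions, not the authors' released code:

```python
import torch
import torch.nn.functional as F

def lsgan_d_loss(D, real_hsi, fake_hsi):
    # V(D): push predictions on real HSI toward 1 and on generated HSI
    # toward 0; the least squares form penalizes samples far from the
    # decision boundary, stabilizing training.
    pred_real = D(real_hsi)
    pred_fake = D(fake_hsi.detach())   # detach so G receives no gradient here
    return 0.5 * (F.mse_loss(pred_real, torch.ones_like(pred_real))
                  + F.mse_loss(pred_fake, torch.zeros_like(pred_fake)))

def lsgan_g_loss(D, fake_hsi):
    # V(G): only the generated term matters, since the prediction on real
    # HSIs has no impact on the optimization of G.
    pred_fake = D(fake_hsi)
    return 0.5 * F.mse_loss(pred_fake, torch.ones_like(pred_fake))
```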

A. Network Architecture
The global residual network structure of the generator G is inspired by VDSR [34], which alleviates the problems of gradient vanishing and explosion. The overall framework of G is composed of feature extraction, attentional feature mapping, and reconstruction. First, the G network extracts shallow features from the input MSI. Then, the attention modules and a feature fusion layer map the shallow features to deep attentional features. Finally, the target HSI is reconstructed from the deep attentional features. The whole reconstruction process comprises pixel-wise feature extraction and reconstruction; thus, the generator learns a pixel-wise mapping to reconstruct hyperpixels.
The G network accepts $I_{MS}$ of size $N \times N \times l$ as input and generates $\hat{I}_{HS}$ of size $N \times N \times L$ as output. A $3 \times 3$ convolutional operation is employed to extract the shallow features

$$F_0 = Y_{SF}(I_{MS})$$

where $Y_{SF}(\cdot)$ represents the shallow convolutional operation. The shallow features $F_0$ then pass through $m$ SSFAMs, and the features extracted by each attention module pass through a $1 \times 1$ convolutional layer. In order to fully utilize different levels of features, the intermediate multilevel features from the $1 \times 1$ convolutional layers are aggregated through a concatenation layer, and the channel dimension of the feature maps is then reduced by a $3 \times 3$ convolutional layer that further fuses the concatenated features. The fused feature $F_{SSAF}$ is expressed as

$$F_{SSAF} = Y_{CON}\left(\left[Y^{1}_{FAB}(F_0),\, Y^{2}_{FAB}(F_1),\, \ldots,\, Y^{m}_{FAB}(F_{m-1})\right]\right)$$

where $Y^{m}_{FAB}(\cdot)$ denotes the $m$th SSFAM together with its following $1 \times 1$ convolutional layer, and $Y_{CON}(\cdot)$ represents the concatenation operation and the following fusion layer.
The fused feature $F_{SSAF}$ is then added to $F_0$ from the first $3 \times 3$ convolutional layer through a global skip connection. Finally, a $3 \times 3$ convolutional layer yields the hyperspectral reconstruction output

$$\hat{I}_{HS} = Y_{REC}\left(F_{SSAF} + F_0\right)$$

where $Y_{REC}(\cdot)$ denotes the reconstruction layer.

The discriminator adopts a pixel-wise prediction strategy for more accurate hyperpixel reconstruction. Owing to the high spectral dimensionality of hyperspectral remote sensing images, the pixel-wise prediction strategy also brings more stable optimization. The discriminator D takes the generated and real HSIs as input, which pass through two $1 \times 1$ convolutional layers, each followed by a Leaky ReLU layer, with the filter number increasing from 64 to 128. A final $1 \times 1$ convolutional layer then yields an $N \times N \times 1$ prediction output corresponding to the predictions for the $N \times N$ input hyperpixels. The predictions over the whole image are averaged to calculate the adversarial loss. The detailed structure is shown in Fig. 1.
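To make the data flow concrete, the following PyTorch sketch assembles the generator and the pixel-wise discriminator as described above. The feature width of 64 and $m = 4$ are assumptions, band counts follow the ALI/Hyperion setting of Section IV, and SSFAM refers to the module sketched in Section III-B:

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Sketch of G: shallow extraction (Y_SF), m SSFAMs each followed by a
    1x1 conv, concatenation plus 3x3 fusion (Y_CON), a global skip from F_0,
    and a 3x3 reconstruction layer (Y_REC)."""
    def __init__(self, in_bands=9, out_bands=108, width=64, m=4):
        super().__init__()
        self.shallow = nn.Conv2d(in_bands, width, 3, padding=1)              # Y_SF
        self.ssfams = nn.ModuleList([SSFAM(width) for _ in range(m)])
        self.compress = nn.ModuleList([nn.Conv2d(width, width, 1) for _ in range(m)])
        self.fuse = nn.Conv2d(m * width, width, 3, padding=1)                # fusion layer
        self.rec = nn.Conv2d(width, out_bands, 3, padding=1)                 # Y_REC

    def forward(self, msi):
        f0 = self.shallow(msi)
        outs, x = [], f0
        for ssfam, comp in zip(self.ssfams, self.compress):
            x = ssfam(x)
            outs.append(comp(x))                # multilevel 1x1-compressed features
        f_ssaf = self.fuse(torch.cat(outs, dim=1))
        return self.rec(f_ssaf + f0)            # global skip connection

class Discriminator(nn.Module):
    """Pixel-wise D: 1x1 convs only (64 -> 128 filters), so every output
    value judges a single hyperpixel; the N x N prediction map is averaged
    when forming the adversarial loss."""
    def __init__(self, bands=108):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(bands, 64, 1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(64, 128, 1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(128, 1, 1))               # N x N x 1 prediction map

    def forward(self, hsi):
        return self.net(hsi)
```

Because every layer of D is a $1 \times 1$ convolution, its receptive field is a single pixel, which matches the pixel-wise prediction strategy described above.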
One advantage of generative adversarial networks is that even a simple network with few parameters can obtain desirable results, and training difficulties can be alleviated by pretraining the generator network. The discriminator D learns the loss function for the specific HSI domain mapping through the pixel-wise prediction strategy, driving the output toward the target HSI manifold and thereby producing high-quality HSIs. The proposed method fully utilizes the similarity of spectral features of neighboring pixels and the correlation of spatial contextual information to reconstruct HSI with high-fidelity details and consistent illuminance.

B. Spatial Spectral Feature Attention Module
In order to fully exploit the correlation of features among spectral bands and maintain spatial consistency, we investigate an approach to model the interchannel and interspatial dependencies of features. The proposed SSFAM is composed of channel attention blocks and spatial attention blocks, and adaptively rescales features both channel-wise and spatial-wise to emphasize those conducive to HSI reconstruction. Extracting channel-wise interdependencies of features boosts the representational capacity of networks in low-level reconstruction tasks [15]. Based on the spatially consistent characteristic of the spectral super-resolution task, we introduce spatial attention to maintain spatial consistency. The detailed structure is shown in Fig. 2.
Let $X \in \mathbb{R}^{c \times l \times h \times w}$ denote the input of the residual attention module. The input of the channel attention block, $Y \in \mathbb{R}^{c \times l \times h \times w}$, is given as follows:

$$Y = W^{2}_{conv}\left(\delta\left(W^{1}_{conv}(X)\right)\right)$$

where $\delta$ denotes the ReLU activation function, and $W^{1}_{conv}$ and $W^{2}_{conv}$ represent the two convolutional layers through which the input $X$ passes. The channel attention block achieves informative feature enhancement by exploiting channel-wise dependencies, which are captured from activated nonlinear interactions between channels [47], [48]. First, channel-wise spatial information is aggregated by global average pooling: the input $Y$ shrinks over its spatial dimensions $h \times w$, yielding the output $y \in \mathbb{R}^{c \times l}$. The channel-wise nonlinear interdependencies are then captured by a two-layer 1D convolution, each layer followed by an activation function. The output $s \in \mathbb{R}^{c \times l}$ is represented as

$$s = f\left(W^{2}_{conv1D}\left(\delta\left(W^{1}_{conv1D}(y)\right)\right)\right)$$

where $W^{1}_{conv1D}$ and $W^{2}_{conv1D}$ represent the two 1D convolutions, followed by the ReLU $\delta(\cdot)$ and sigmoid $f(\cdot)$ activation functions, respectively. $W^{1}_{conv1D}$ downscales the channel dimension by a reduction ratio $r$, and $W^{2}_{conv1D}$ upscales it back to the original dimension with the same ratio $r$. The output of the channel attention block is then obtained by element-wise multiplication of the input feature $Y$ with the spectral scaling feature $s$. Therefore, the output of the channel attention block $X_{ca} \in \mathbb{R}^{c \times l \times h \times w}$ is denoted as

$$X_{ca} = Y \otimes S$$

where $S \in \mathbb{R}^{c \times l \times h \times w}$ denotes the expanded version of $s$ matching the dimensions of $Y$, and $\otimes$ denotes element-wise multiplication.
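A sketch of the channel attention block under a simplified 2-D feature layout (the paper's features carry an extra spectral axis $l$; here channel and spectral axes are folded together, which is an assumption of the sketch):

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Global average pooling squeezes spatial information; a two-layer 1D
    convolution with reduction ratio r models nonlinear cross-channel
    dependencies; the sigmoid gate s rescales Y element-wise."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                  # squeeze h x w
        self.down = nn.Conv1d(channels, channels // r, 1)    # W^1_conv1D, downscale by r
        self.up = nn.Conv1d(channels // r, channels, 1)      # W^2_conv1D, back to c
        self.relu = nn.ReLU(inplace=True)
        self.sigmoid = nn.Sigmoid()

    def forward(self, y):
        b, c, _, _ = y.shape
        s = self.pool(y).view(b, c, 1)                       # y: pooled descriptor
        s = self.sigmoid(self.up(self.relu(self.down(s))))   # s: channel weights
        return y * s.view(b, c, 1, 1)                        # X_ca = Y (x) S
```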
The locations of high-frequency spatial details remain the same during reconstruction, and the recovered HSI bands are spatially consistent, so it is reasonable to enforce the consistency of informative spatial regions. We introduce spatial attention blocks to refine the output of the channel attention block by spatially reweighting useful information. A pooling operation first aggregates channel information, and the spatial dependencies are then captured by a convolutional layer followed by a sigmoid $f(\cdot)$ activation function. The output of the channel attention block $X_{ca} \in \mathbb{R}^{c \times l \times h \times w}$ is also the input of the spatial attention block. The spatial attention map $Z$ is computed as

$$Z = f\left(W_{conv}\left(H_P(X_{ca})\right)\right)$$

where $H_P$ denotes the pooling operation, and $W_{conv}$ and $f(\cdot)$ denote the convolution and sigmoid activation function, respectively. The output of the spatial attention block, $X_{sa}$, is obtained by element-wise multiplication of the input $X_{ca}$ with the spatial scaling feature $Z$:

$$X_{sa} = X_{ca} \otimes Z.$$

Then, the output of the entire SSFAM, $X_{res} \in \mathbb{R}^{c \times l \times h \times w}$, is computed through a skip connection with the input $X$, which is denoted as

$$X_{res} = X + X_{sa}.$$
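A matching sketch of the spatial attention block and the assembled SSFAM follows. The paper leaves the pooling $H_P$ and kernel size unspecified, so the average/max pooling pair and the 7 × 7 kernel below are assumptions:

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """H_P aggregates channel information; a convolution plus sigmoid yields
    the spatial map Z that reweights X_ca location by location."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)  # W_conv
        self.sigmoid = nn.Sigmoid()

    def forward(self, x_ca):
        # assumed H_P: concatenate channel-wise average and max pooling
        pooled = torch.cat([x_ca.mean(dim=1, keepdim=True),
                            x_ca.amax(dim=1, keepdim=True)], dim=1)
        z = self.sigmoid(self.conv(pooled))        # spatial attention map Z
        return x_ca * z                            # X_sa = X_ca (x) Z

class SSFAM(nn.Module):
    """Whole module: two convs with ReLU produce Y, channel then spatial
    attention rescale it, and a skip connection adds the input X."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1))
        self.ca = ChannelAttention(channels, r)
        self.sa = SpatialAttention()

    def forward(self, x):
        y = self.body(x)                   # Y = W2_conv(delta(W1_conv(X)))
        return x + self.sa(self.ca(y))     # X_res = X + X_sa
```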

C. Loss Function
The overall loss of the generator G is a combination of the pixel-wise L2 loss $L_{L2}(G)$, the spectral angle loss $L_{SA}(G)$, and the adversarial loss $L_{Ad}(G)$. The loss of the discriminator D contains only the adversarial loss $L_{Ad}(D)$. The objective functions for G and D are expressed as

$$L(G) = L_{Ad}(G) + \lambda L_{L2}(G) + \alpha L_{SA}(G)$$

$$L(D) = L_{Ad}(D)$$

where $\lambda$ is the weight balancing the L2 loss and $\alpha$ is the weight balancing the SA loss. The adversarial losses for the generator and discriminator are formulated as the LSGAN objective functions; $L_{Ad}(G)$ is expressed as

$$L_{Ad}(G) = \frac{1}{2N}\sum_{n=1}^{N}\left(D\left(G\left(I_{MS}^{(n)}\right)\right) - 1\right)^2$$

where $N$ is the batch size. $L_{L2}(G)$ is the L2 loss between the real HSI $I_{HS}$ and the generated HSI $\hat{I}_{HS}$, which is formulated as

$$L_{L2}(G) = \frac{1}{N}\sum_{n=1}^{N}\left\|I_{HS}^{(n)} - \hat{I}_{HS}^{(n)}\right\|_2^2.$$

The spectral angle concept is derived from the spectral angle mapper (SAM) [49]; this metric measures the similarity between input and reference spectra by calculating the angle between the two spectral vectors. The spectral angle loss is formulated as follows:

$$L_{SA}(G) = \frac{1}{M}\sum_{i=1}^{M}\arccos\left(\frac{\sum_{j} t_{ij}\, r_{ij}}{\sqrt{\sum_{j} t_{ij}^2}\,\sqrt{\sum_{j} r_{ij}^2}}\right)$$

where $M$ is the number of pixels of the HSI, and $t_{ij}$ and $r_{ij}$ represent the $j$th band value of the $i$th hyperpixel from the generated HSI $\hat{I}_{HS}$ and the real HSI $I_{HS}$, respectively. Throughout our work, we empirically set $\lambda = 1$ and $\alpha = 1$ through a tuning process. The combination of the adversarial loss and the L2 loss helps the model produce more realistic results and is less prone to artifacts, while the spectral angle loss imposes a penalty on spectral distortion by minimizing the angle between generated and real hyperpixel spectral vectors. Therefore, we first determine the parameter $\lambda$ to balance the adversarial loss and the L2 loss, and then determine $\alpha$ to improve the spectral reconstruction accuracy.
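A sketch of the combined generator loss with $\lambda = \alpha = 1$ as stated above; the (B, L, H, W) tensor layout and the small epsilon for numerical stability are assumptions:

```python
import torch
import torch.nn.functional as F

def sam_loss(fake_hsi, real_hsi, eps=1e-8):
    # Spectral angle between generated and real hyperpixels, averaged over
    # all M pixels; dim=1 is the spectral (band) axis.
    dot = (fake_hsi * real_hsi).sum(dim=1)
    norms = fake_hsi.norm(dim=1) * real_hsi.norm(dim=1)
    cos = (dot / (norms + eps)).clamp(-1 + eps, 1 - eps)
    return torch.acos(cos).mean()

def generator_loss(D, fake_hsi, real_hsi, lam=1.0, alpha=1.0):
    # L(G) = L_Ad(G) + lambda * L_L2(G) + alpha * L_SA(G)
    pred_fake = D(fake_hsi)
    l_adv = 0.5 * F.mse_loss(pred_fake, torch.ones_like(pred_fake))  # LSGAN term
    l_l2 = F.mse_loss(fake_hsi, real_hsi)                            # pixel-wise L2
    return l_adv + lam * l_l2 + alpha * sam_loss(fake_hsi, real_hsi)
```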

IV. EXPERIMENTAL RESULTS
In this section, we evaluate the performance of the proposed model on both synthetic and real remote sensing datasets to realize spectral dimension enhancement. We compare the model with six other representative methods, including Dense-Unet [32], multiscale CNN [50], the coupled dictionary learning method (SCDL) [40], the HSCNND method [36], the J-SLoL method [43], and the SSRAN method [44]. For the SCDL algorithm, $10^5$ training samples are randomly selected from the training sets, 1024 dictionary atoms are chosen, and the sparsity regularization parameter $\lambda$ is set to 0.1. The J-SLoL method uses the same optimal parameters as in its original article. The last convolutional layer of the CNN-based methods, originally proposed to reconstruct HSIs from natural RGB images, is modified to achieve hyperspectral remote sensing image reconstruction.
The performance of HSI enhancement methods is evaluated by the following metrics: mean peak signal-to-noise ratio (mPSNR) [51], mean root-mean-square error (mRMSE), SAM [49], and mean structural similarity index (mSSIM) [52]. RMSE is used to evaluate spectral reconstruction accuracy, and SAM assesses the distortion of each spectrum, while PSNR and SSIM are utilized as spatial metrics to measure spatial quality and the preservation of spatial structures. mPSNR, mSSIM, and mRMSE are calculated per band and averaged over the band dimension. Smaller values of mRMSE and SAM and higher values of mPSNR and mSSIM indicate better spectral reconstruction performance.
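A short sketch of how the band-averaged metrics can be computed (the data range used for PSNR normalization is an assumption about how the radiance data are scaled; the SAM metric reuses the spectral-angle computation given with the loss functions in Section III-C):

```python
import torch

def band_averaged_metrics(fake, real, data_range=1.0):
    """mPSNR and mRMSE computed per band and then averaged over the band
    dimension, for (L, H, W) tensors. mSSIM follows the same band-averaged
    pattern with any standard SSIM routine."""
    mse = ((fake - real) ** 2).mean(dim=(1, 2))            # one MSE per band
    mpsnr = (10 * torch.log10(data_range ** 2 / mse)).mean()
    mrmse = mse.sqrt().mean()
    return mpsnr.item(), mrmse.item()
```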

A. Implementation Details
The network is trained and tested under the PyTorch framework, using the Adam optimizer for both G and D with $\beta_1 = 0.5$ and $\beta_2 = 0.999$. The learning rate is initially set to 0.0002 for 200 epochs and then decreases linearly to 0 from epoch 200 to epoch 400. We set the batch size to 1 to benefit from the regularization provided by gradient estimation noise. We pretrain the generator for a more stable adversarial training process. Training is performed iteratively, alternating between G and D: at each step, D is optimized first and then G. The reduction ratio $r$ in the channel attention block is set to 16.
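The stated optimizer and schedule translate directly into PyTorch; `G` and `D` below stand for the generator and discriminator module instances:

```python
import torch

# Adam with betas (0.5, 0.999) and lr 2e-4, as stated above.
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))

def lr_lambda(epoch):
    # Constant for the first 200 epochs, then linear decay to 0 at epoch 400.
    return 1.0 if epoch < 200 else max(0.0, (400 - epoch) / 200)

sched_g = torch.optim.lr_scheduler.LambdaLR(opt_g, lr_lambda)
sched_d = torch.optim.lr_scheduler.LambdaLR(opt_d, lr_lambda)
```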

B. Synthetic Data Scenario
In the synthetic data scenario, hyperspectral data are measured as hyperspectral radiance in $\mathrm{W\,m^{-2}\,sr^{-1}\,\mu m^{-1}}$ to synthesize multispectral spectra. Hyperion has two spectrometers: one visible/near-infrared (VNIR) spectrometer and one short-wave infrared (SWIR) spectrometer. We consider a more realistic downsampling approach by simulating the MSI according to the spectral response function rather than directly subsampling by a fixed factor. The spectral response function of Hyperion is described by a Gaussian function with known center wavelength and full width at half maximum (FWHM).
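A sketch of this simulation step (detailed further in the dataset preparation below), assuming the multispectral response curves are already resampled onto a common wavelength grid; all function and argument names are illustrative:

```python
import numpy as np

def gaussian_srf(wavelengths, center, fwhm):
    """Hyperion band response modeled as a Gaussian with known center
    wavelength and full width at half maximum."""
    sigma = fwhm / (2 * np.sqrt(2 * np.log(2)))
    return np.exp(-0.5 * ((wavelengths - center) / sigma) ** 2)

def downsampling_matrix(hs_centers, hs_fwhms, ms_srfs, wavelengths):
    """Sketch of building W in s_m = W^T s_h: weight each hyperspectral band
    by the overlap of its Gaussian response with the multispectral band's
    response, then normalize per MS band so each multispectral value is a
    weighted average and radiance units stay consistent."""
    P, M = len(hs_centers), len(ms_srfs)
    W = np.zeros((P, M))
    for j in range(P):
        h_j = gaussian_srf(wavelengths, hs_centers[j], hs_fwhms[j])
        for i in range(M):
            W[j, i] = np.trapz(ms_srfs[i] * h_j, wavelengths)  # overlap weight
    W /= W.sum(axis=0, keepdims=True)   # normalized weighted average per MS band
    return W
```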
1) Dataset Preparation: The general idea of synthesizing multispectral spectra is to perform a weighted sum of hyperspectral spectra. For each band of the target multispectral sensor, the hyperspectral bands within its wavelength range are used to synthesize the corresponding multispectral band; the synthesis convolves the narrow Hyperion Gaussian spectral responses with the broad multispectral response. Let $P$ denote the number of HSI bands and $M$ the number of MSI bands. After constructing the spectral downsampling matrix $W \in \mathbb{R}^{P \times M}$, a low spectral resolution hyperpixel $s_m \in \mathbb{R}^{M}$ can be simulated from a high spectral resolution hyperpixel $s_h \in \mathbb{R}^{P}$ as $s_m = W^{T} s_h$. To construct $W$, the normalized spectral response curves of the hyperspectral and multispectral sensors must be known. Let $g_i(\lambda)$ be the spectral response function of the $i$th multispectral band, and $h_{ij}(\lambda)$ the spectral response function of the $j$th hyperspectral band that overlaps with the $i$th multispectral band range. The weight $w_{ij}$ is computed from the overlap of the two responses:

$$w_{ij} = \int g_i(\lambda)\, h_{ij}(\lambda)\, d\lambda.$$

The weights are then properly normalized to maintain radiance consistency in units between the two sensors. Under the reasonable assumption that the simulation process is independent of field-of-view location and spectral signature, each multispectral band is generated by averaging the hyperspectral bands with the normalized weights. In this article, two multispectral synthetic data scenarios, NASA's Landsat 8 and ESA's Sentinel-2, are adopted to conduct experiments; thus, two different spectral downsampling matrices are created. The Hyperion instrument provides 196 calibrated bands of the total 242 bands with a spatial resolution of 30 m. Owing to strong water absorption, a low signal-to-noise ratio, the spectral overlap of the VNIR and SWIR spectrometers, and invalid imaging wavelengths, 107 calibrated, high-quality Hyperion bands covering the visible and near-infrared spectrum are selected. In the Landsat 8 case, these 107 Hyperion bands, ranging from VNIR to SWIR, are reconstructed from 8 simulated Landsat 8 spectral observations. Likewise, simulated Sentinel-2 data are synthesized using the corresponding spectral downsampling matrix, and the same 107 Hyperion bands are recovered from 11 Sentinel-2 bands. The synthetic datasets are provided.

2) Results and Analysis: We compare our method with other spectral super-resolution methods on different scenes of the synthetic Landsat 8 and Sentinel-2 datasets. It is worth mentioning that we trained and tested the model on six and three different scenes, respectively, containing various objects. The assessment metric results are shown in Tables I and II, where the best results are written in bold. From the quantitative evaluation results, our method achieves better spectral enhancement performance than the other state-of-the-art methods. The dictionary learning-based method enhances the spectral resolution pixel by pixel without utilizing local spatial correlation. Hence, the SCDL algorithm performs better on datasets of similar scenes but is difficult to generalize to widely different real-world scenarios. The J-SLoL method jointly learns low-rank dictionary pairs and achieves better quantitative results than SCDL on both simulated datasets.
The CNN-based methods generalize better across different scenarios. DenseUnet combines dense blocks with the traditional UNet architecture for iterative feature concatenation, thus performing better than MultiCNN in terms of spectral fidelity and consistent illuminance. The SSRAN method achieves less spectral distortion owing to its neighboring spectral module. Our proposed method outperforms the SCDL technique by fully extracting and merging multiscale features to reconstruct hyperpixels, and thus achieves the best reconstruction results. Figs. 5 and 8 show the reflectance reconstruction errors of different methods for three randomly selected pixels on the two datasets, respectively, revealing the better capability of the proposed method in recovering spectral signature details. Our method reconstructs the spectral curves most similar to the references, exhibiting the lowest radiance difference over most bands.

C. Real Data Scenario

1) Dataset Preparation:
In this section, the performance of the proposed method is evaluated on real pairwise multispectral and hyperspectral datasets captured by the Hyperion hyperspectral imager and the Advanced Land Imager (ALI), which are jointly mounted on the EO-1 satellite. ALI provides 9 VNIR/SWIR multispectral bands at 30 m spatial resolution and a panchromatic band at 10 m spatial resolution. The Hyperion instrument provides 196 calibrated bands of the total 242 bands with a spatial resolution of 30 m. Owing to strong water absorption, a low signal-to-noise ratio, the spectral overlap of the VNIR and SWIR spectrometers, and invalid imaging wavelengths, 108 calibrated, high-quality Hyperion bands covering the visible and near-infrared spectrum are selected. Therefore, the spectral reconstruction aims to recover 108 Hyperion bands from 9 multispectral ALI bands. The coregistered image pairs are used in both the training and testing stages.
2) Results and Analysis: A visual reconstruction comparison is shown in Fig. 9, where the recovered bands at 529 nm, 733 nm, 1084 nm, and 1296 nm are compared with the ground truth. The visual results show that the reconstructions of our proposed method are highly similar to the ground truth in terms of both illuminance and spectral details. Compared with the deep learning-based methods, SCDL shows large differences at edges, resulting in worse spectral detail evaluation. HSCNND recovers more accurate spectral details, but its reconstruction error varies greatly depending on the type of object. DenseUnet and MultiCNN reconstruct the whole-image illuminance less faithfully. Our proposed method achieves better spectral enhancement accuracy while maintaining spatial information. The quantitative results are listed in Table III. The reflectance reconstruction errors of different methods for three randomly selected pixels are shown in Fig. 10. It is evident that our method reconstructs highly accurate spectral curves, with reconstruction errors almost close to zero.

D. Ablation Study

1) Loss Function Selection:
In this section, the experiment compares the impact of introducing the adversarial loss on reconstruction quality, on top of the L2 loss and the spectral similarity loss. The comparative quantitative results with and without the adversarial loss on both simulated radiance datasets and the real paired dataset, based on the same proposed network, are listed in Table IV, where the best results for each dataset are written in bold.
The introduction of the adversarial loss improves the reconstruction quality on all three datasets, yielding the highest mPSNR and mSSIM.

2) Spatial Spectral Feature Attention Module:
In this experiment, the performance of the proposed network with and without the proposed SSFAM is compared on the two simulated datasets and the real paired dataset. The quantitative reconstruction results of the different networks, trained with the proposed loss function, are listed in Table V, where the best results for each dataset are written in bold.
SSFAM improves the reconstruction quality on the real paired dataset more than on the simulated datasets, owing to the higher correlation among the input MSI bands. The improvements indicate that SSFAM is conducive to fully exploring the correlation between bands, so that more informative spatio-spectral features are extracted by SSFAM for further reconstruction. Compared with the network without SSFAM on the simulated Sentinel-2 dataset, mPSNR and mSSIM improve by 2.21 dB and 0.0348, while SAM and mRMSE drop by 0.014 and 0.009. On the simulated Landsat-8 dataset, the proposed method with SSFAM improves mPSNR by 4.51 dB and greatly decreases the SAM value by 0.073. On the real paired dataset, the network with SSFAM significantly improves mPSNR and mSSIM by 6.44 dB and 0.146, while SAM and mRMSE decrease by 0.214 and 0.013.

E. Time Cost
In this section, we evaluate the inference time of different methods on a single patch of shape 1 × 9 × 256 × 256 from the real paired Hyperion-ALI dataset. We also measure the FLOPs of the deep learning-based methods on the same input patch. The inference time and FLOPs of all methods are listed in Table VI, where the best FLOPs and inference time are written in bold. The inference time and FLOPs of Dense-Unet [32], MultiCNN [50], HSCNND [36], SSRAN [44], and our proposed method are measured on a computer equipped with an NVIDIA Tesla K80 GPU, while the inference times of SCDL [40] and J-SLoL [43] are measured on the MATLAB R2016a platform. The proposed method is designed as a lightweight network for large remote sensing scenarios: the FLOPs of SSRAN [44] are slightly lower than those of our method, while its inference time is higher.

V. CONCLUSION
In this article, a spectral resolution enhancement method is proposed to reconstruct high spectral resolution HSI from MSI. An adversarial learning framework is employed to realize the conversion between the multispectral and hyperspectral image domains. The generator, built from several SSFAMs, is designed to learn the nonlinear mapping from MSI to HSI for subsequent applications. SSFAM is proposed to utilize the inherent correlation of spatio-spectral features and achieve feature mapping by selectively enhancing informative features; its structure combines both spatial and spectral attention blocks. Experimental results on both simulated radiance datasets and real paired remote sensing datasets show the superiority of the proposed method in both quantitative and qualitative analyses.