Hyperspectral Pansharpening Based on Spectral Constrained Adversarial Autoencoder

Hyperspectral (HS) imaging is conducive to better describing and understanding the subtle differences in the spectral characteristics of different materials thanks to its rich spectral information compared with traditional imaging systems. However, it is still challenging to obtain high resolution (HR) HS images in both the spectral and spatial domains. Different from previous methods, we first propose a spectral constrained adversarial autoencoder (SCAAE) to extract deep features of HS images and combine them with the panchromatic (PAN) image to competently represent the spatial information of HR HS images, which is more comprehensive and representative. In particular, based on the adversarial autoencoder (AAE) network, the SCAAE network is built with a spectral constraint added to the loss function so that spectral consistency and a higher quality of spatial information enhancement can be ensured. Then, an adaptive fusion approach with a simple feature selection rule is introduced to make full use of the spatial information contained in both the HS image and the PAN image. Specifically, the spatial information from the two different sensors is introduced into a convex optimization equation to obtain the fusion proportion of the two parts and estimate the generated HR HS image. By analyzing the results of the experiments executed on the tested data sets with different methods, it can be found that, in CC, SAM, and RMSE, the performance of the proposed algorithm is improved by about 1.42%, 13.12%, and 29.26%, respectively, on average over the well-performed HySure method. Compared to the MRA-based method, the improvement of the proposed method in the above three indexes is 17.63%, 0.83%, and 11.02%, respectively. Moreover, the results are 0.87%, 22.11%, and 20.66% better, respectively, than those of the PCA-based method, which fully illustrates the superiority of the proposed method in spatial information preservation.
All the experimental results demonstrate that the proposed method is superior to the state-of-the-art fusion methods in terms of subjective and objective evaluations.


Introduction
Hyperspectral (HS) images captured by sensors over different spectral ranges have hundreds of narrow spectral channels with detailed spectral information. Because of this rich spectral information, HS images play a pivotal role in the fields of classification, detection, segmentation, tracking, and recognition [1][2][3][4][5][6]. However, one of the main obstacles in HS imaging is that the dense spectral bands allow only a limited number of photons to arrive at each narrow spectral window on average. To ensure a sufficient signal-to-noise ratio (SNR), long exposure times are often required, thereby sacrificing spatial resolution [7,8]. Overall, existing research has made major contributions and achieved good results. Even so, it often regards pansharpening as a black-box deep learning problem without full consideration of spectral and spatial preservation [40]. With a supervised network, such a method incurs heavy computational costs for the input of high-dimensional HS images. At the same time, the scarcity of training samples remains an unsolved problem. Recently, a number of CNN-based methods have evaluated performance on data sets with fewer spectral bands, but when the number of bands reaches that of an HS image, the computation is heavy, the GPU memory requirement is high, and training becomes difficult.
In this paper, to address the above problems, we propose a new HS pansharpening method based on a spectral constrained adversarial autoencoder (SCAAE), inspired by our previous work [41]. To the best of our knowledge, this is the first time such a method has been used for HS pansharpening, effectively acquiring the spatial information of the HS image and improving the quality of the fused image. In order to reduce spectral distortion, we add spectral constraints to the AAE. Compared with the state-of-the-art methods, the proposed method improves the ability of spatial information enhancement and spectral information preservation. We conduct experiments on different data sets and make further efforts to illustrate the superior performance of the proposed SCAAE based HS pansharpening method.
In summary, the main novelties and contributions of the proposed HS pansharpening method are summarized as follows: 1. We propose the first SCAAE based HS pansharpening method to extract features and obtain the spatial information of HS images. In particular, for spectral information preservation, spectral constraints are added to the loss function of the network to further reduce spectral distortion. 2. An adaptive selection rule is constructed to select an effective feature that can well represent the up-sampled HS image. In particular, the structural similarity is introduced to compare the PAN image with the extracted features of the up-sampled HS image. 3. We construct an optimization equation to solve for the proportions of the HS and PAN images in the final fusion framework. The experiments show that the proposed SCAAE pansharpening method is superior to the existing state-of-the-art methods.
The rest of this paper is organized as follows. Section 2 reviews the related work. Section 3 describes the proposed method. Section 4 is devoted to the experiments and results. Section 5 presents the discussion and analysis. Section 6 concludes the paper.

Related Work
In this section, the frequently used methods for HS pansharpening are reviewed, and their existing challenges are analyzed. In a traditional way, the HS pansharpening problem can be written as:

$$\hat{X} = \arg\min_{X}\; f_1(X, Y) + f_2(X, P).$$

In the first term, f_1 is the mapping relationship between the down-sampled HS image Y and the pansharpened HS image X, which is used to minimize spectral distortion. In the second term, f_2 is the mapping relationship between the PAN image P and the pansharpened HS image X; this part helps preserve spatial information.
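As an illustration, the two-term trade-off above can be sketched numerically. The quadratic forms of f_1 and f_2 below (fidelity of the fused cube to the up-sampled HS image, and of its band mean to the PAN image) are our own simplifying assumptions, not the paper's exact mappings:

```python
import numpy as np

def pansharpen_objective(X, Y_up, P, lam=1.0):
    """Generic two-term pansharpening cost (a sketch, not the paper's f1/f2).

    f1: spectral fidelity to the (up-sampled) low-resolution HS image Y_up.
    f2: spatial fidelity of the band mean of X to the PAN image P.
    """
    f1 = np.sum((X - Y_up) ** 2)               # spectral term
    f2 = np.sum((X.mean(axis=2) - P) ** 2)     # spatial term
    return f1 + lam * f2

# Toy data: a 4 x 4 scene with 3 bands.
rng = np.random.default_rng(0)
X = rng.random((4, 4, 3))
Y_up = X + 0.01 * rng.standard_normal((4, 4, 3))  # slightly perturbed HS
P = X.mean(axis=2)                                 # idealized PAN
cost = pansharpen_objective(X, Y_up, P)
```

A perfect estimate (X equal to the HS data and consistent with the PAN image) drives both terms to zero.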
Yang et al. [42] proposed a deep network-based method called PanNet, which automatically learns the mapping purely from data, incorporates problem-specific knowledge into the deep learning framework, and focuses on two main aspects of the fusion problem: spatial and spectral preservation. In this method, a ResNet is trained to obtain high-frequency information extracted by a high-pass filter; the high-frequency details are then injected into the up-sampled HS image. However, the spectral constraint of PanNet is constructed from the output of the spatial preserving network and the original HS image, which means that the spectral preservation in PanNet depends on the spatial preservation. This is an indirect condition that may lead to sub-optimal preservation results. In addition, the quality of fusion mainly depends on the training result of PanNet, leading to a lack of stability and robustness.
Hence, we propose the SCAAE based pansharpening method. The related techniques, adversarial training and adversarial autoencoders, are described in more detail below.

Adversarial Training
As described by Goodfellow et al. [43], adversarial training involves learning the mapping from latent samples z to data samples x. Adversarial training is an iterative training of two competitive models: the generator model G and the discriminator model D. The generator is fed with input samples x and is optimized to generate latent features z ∼ q(z) that fool the discriminator, until they can be considered as coming from the imposed prior distribution p(z) on the latent feature space. Meanwhile, the discriminator is fed with the latent samples z ∼ q(z) from the output of the generator and the samples z ∼ p(z), which obey the target aggregated posterior distribution. It is trained to correctly predict whether the samples come from the imposed prior distribution or from the generated latent features, and its judgment of truth and falsity is then used to update the parameters of the generator. The discussed competitive training can be achieved by the following min-max objective:

$$\min_{G}\max_{D}\; \mathbb{E}_{z \sim p(z)}[\log D(z)] + \mathbb{E}_{x}[\log(1 - D(G(x)))],$$

where G(x) denotes the output samples of the generator, z ∼ p(z) denotes the target probability distribution, D(z) is the discriminative model, and q(z|x) represents both the encoding model and the generative model.
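The min-max game above can be made concrete with the usual cross-entropy losses. The sketch below assumes sigmoid discriminator outputs in (0, 1) and uses the non-saturating generator objective; it only evaluates the losses on toy scores, with no actual networks:

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    # Cross-entropy the discriminator minimizes: latents drawn from the
    # prior p(z) should score 1, encoder outputs q(z|x) should score 0.
    return -np.mean(np.log(d_real)) - np.mean(np.log(1.0 - d_fake))

def generator_loss(d_fake):
    # Non-saturating generator objective: fool D into scoring the
    # generated latents as real.
    return -np.mean(np.log(d_fake))

# A sharp discriminator yields a low loss; a fooled one a high loss.
sharp = discriminator_loss(np.array([0.99]), np.array([0.01]))
fooled = discriminator_loss(np.array([0.5]), np.array([0.5]))
```

During training, the two losses are minimized alternately, which realizes the competition described above.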

Adversarial Autoencoders
In the AAE [44] network, adversarial learning ideas are added to the autoencoder (AE) model to accurately approximate the latent feature space with an arbitrary prior. In [45], q(z|x) is specified by a neural network whose input is x and whose output is z, which allows q(z|x) to have arbitrary complexity, unlike the variational autoencoder (VAE) [46], where the structure of q(z|x) is usually limited to a multivariate Gaussian. In the VAE, an analytical solution of the Kullback-Leibler (KL) divergence is required, so the choice of the prior and posterior distributions is limited. However, the posterior distribution in AAE does not need to be defined analytically, and q(z|x) can be matched to several different priors p(z). The reason is that AAE can learn, through adversarial training, a model that matches samples with any complex target distribution, avoiding the need to compute a KL divergence analytically.
AAE adopts a dual training goal, including a traditional reconstruction error and an adversarial training criterion that matches the latent representation with an arbitrary prior distribution. To be specific, both the reconstruction error and the discriminator drive the latent feature space distribution toward the imposed prior distribution when updating the encoder. In AAE, q(z|x) plays a double role as both the encoder of the autoencoder framework and the generator in the adversarial framework. Thus, the AAE framework learns the aggregated posterior distribution q(z), which can be described as follows:

$$q(z) = \int_{x} q(z|x)\, p_d(x)\, dx,$$

where p_d(x) is the data distribution. The discriminator of AAE is trained to distinguish the latent samples z ∼ p(z) from those produced by the probabilistic encoder q(z|x). The cost function for training the discriminator D is:

$$\mathcal{L}_D = -\frac{1}{K}\sum_{i=1}^{K}\log D(z_i) - \frac{1}{K}\sum_{j=1}^{K}\log\big(1 - D(z_j)\big),$$

where z_i ∼ p(z), z_j ∼ q(z|x), and K is the size of the training batch.
To match q(z) to an arbitrarily chosen prior p(z), adversarial training is implemented. The cost function for matching q(z|x) to the prior p(z) is:

$$\mathcal{L}_G = -\frac{1}{K}\sum_{j=1}^{K}\log D(z_j), \quad z_j \sim q(z|x).$$

Proposed Method
Let H̃ ∈ R^{M×N×L} represent the up-sampled HS image, in which L is the number of bands and M × N is the number of spatial pixels. Let Z ∈ R^{M×N×l} denote the extracted feature of H̃. The final selected feature is denoted by Z_s ∈ R^{M×N}. Figure 1 shows the overall flowchart of our proposed approach. The proposed method is described in three parts: feature extraction, feature selection, and solving the model. The detailed description is as follows. The overall model can be expressed by the following equation:

$$\min_{\alpha}\; \big\| H_R - \tilde{H} - M \otimes \big( P_c - f_{Gauss}(P_c) \big) \big\|_F^2, \quad P_c = \alpha Z_s + (1 - \alpha) P_e, \tag{6}$$

where ||·||_F denotes the Frobenius norm, M represents the gains matrix, and the symbol ⊗ denotes element-wise multiplication. H̃ and H_R represent the up-sampled HS image and the reference HS image, respectively. The selected feature and the enhanced PAN image are denoted by Z_s and P_e, respectively. f_Gauss(·) represents the Gaussian filter function. For convenience, we denote the combined PAN image as:

$$P_c = \alpha Z_s + (1 - \alpha) P_e, \tag{7}$$

and its Gaussian filtered version is represented as:

$$P_G = f_{Gauss}(P_c). \tag{8}$$

Thus, the model can be described as:

$$\min_{\alpha}\; \big\| H_R - \tilde{H} - M \otimes ( P_c - P_G ) \big\|_F^2. \tag{9}$$

In general, the complete HS image is used for HS pansharpening because the information of the HS image is rich and complex, and it contains abundant pixels. Nevertheless, the effective pixels that can represent the spatial information of HS images are limited; that is to say, there is a certain amount of redundant information in HS images. Therefore, we reduce the dimension of HS images by feature extraction and obtain low-dimensional spatial features to represent the effective information of HS images and reduce the computational cost. Furthermore, traditional feature extraction methods only extract shallow features and may cause the loss of effective information as well as image distortion. It is worth noting that feature extraction methods based on deep learning can better mine deep spatial features.
Through deep learning methods based on DNNs, the obtained features can maintain certain invariances and contain higher-level semantic information, which effectively narrows the gap between low-level features and high-level semantics [47,48]. It is worthwhile to explore an HS pansharpening method specifically based on DNNs that is practical and efficient on the data sets [49][50][51][52]. In this paper, we propose an unsupervised deep learning pansharpening method based on SCAAE to achieve feature extraction. More details are explained in Sections 3.1-3.4.

Feature Extraction
In this section, we discuss the motivation and process of feature extraction by SCAAE. According to our research, the existing methods only consider the spatial information of the PAN image and ignore the spatial information of HS images. Furthermore, some traditional methods are generally used to extract features of HS images to represent the spatial information; unfortunately, most of these are shallow features that cannot fully express the comprehensive information of HS images [53][54][55]. To solve this problem, in this paper we propose the SCAAE based pansharpening method to mine deeper features. The specific operations are as follows.
The three-dimensional HS image is converted to a two-dimensional matrix and sent to the SCAAE network. The input H̃ can be interpreted as MN vectors with L dimensions, which can be denoted as H̃ = [h_1, h_2, ..., h_MN]. The weights and biases of the encoder and the decoder are applied through linear operations to obtain the reconstructed result Ĥ = [ĥ_1, ĥ_2, ..., ĥ_MN]. In SCAAE, a suitable constraint based on the spectral angle distance (SAD) is added to the latent feature space in the loss function so that the spectral consistency can be well kept. In the encoder, which is also called the generator, the hidden part consists of two fully connected layers, and the activation function is LeakyReLU [56]. ReLU may lead to dead nodes during training, because positive responses are kept while negative responses are suppressed by setting them to zero. LeakyReLU overcomes this drawback by scaling negative responses with a small slope, such as 0.2, instead of zeroing them. The network structure of the decoder is similar to the encoder: it consists of two fully connected layers, and the activation functions are LeakyReLU and Sigmoid. As for the discriminator, it contains a fully connected layer and uses LeakyReLU as the activation function. We set the learning rate to 10^-4, and the training batch size is set to the same number as the spatial dimension of the input up-sampled HS image. The loss function of the whole network includes the loss functions of the autoencoder, the generator, and the discriminator. The optimization is performed using the Adam algorithm. More details of the training are discussed as follows.
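A minimal forward pass through an encoder of this shape can be sketched as follows. The layer sizes (176 input bands, 500 hidden nodes, 30 latent nodes) follow the settings reported later in the paper; the random weights and toy batch are illustrative only:

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    # Negative responses are scaled by a small slope instead of zeroed.
    return np.where(x > 0, x, slope * x)

def encoder_forward(h, W1, b1, W2, b2):
    # Two fully connected layers with LeakyReLU, mirroring the SCAAE
    # encoder/generator described above (weights here are untrained).
    z1 = leaky_relu(h @ W1 + b1)
    return leaky_relu(z1 @ W2 + b2)

rng = np.random.default_rng(0)
L_bands, hidden, latent = 176, 500, 30  # band count and node counts
h = rng.random((8, L_bands))            # 8 pixel spectra as a toy batch
W1 = 0.01 * rng.standard_normal((L_bands, hidden))
b1 = np.zeros(hidden)
W2 = 0.01 * rng.standard_normal((hidden, latent))
b2 = np.zeros(latent)
z = encoder_forward(h, W1, b1, W2, b2)  # latent features, one per pixel
```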
The up-sampled HS image is sent as input to the SCAAE network, which is iteratively trained to obtain the feature. The training process consists of two steps. Firstly, the autoencoder is trained to perform image reconstruction, enabling the decoder to recover the original image from the latent samples generated by the encoder. Secondly, the discriminator and generator begin adversarial learning.
In SCAAE, q(z|h) represents both the encoder of the autoencoder framework and the generator in the adversarial framework. The generator G is trained to generate latent samples to deceive the discriminator D; the closer q(z) is to p(z), the better the training effect. As a result, the feature of the up-sampled HS image is obtained, and spatial information is well extracted. The reconstruction error of the SCAAE can be expressed as:

$$\mathcal{L}_{rec} = \big\| \tilde{H} - \hat{H} \big\|_F^2,$$

where H̃ and Ĥ represent the input image and the reconstructed image, respectively. The error between the reconstructed HS image and the input HS image is expressed in the form of a norm, by which we can measure the reconstruction and feature extraction quality of the SCAAE network. The loss function for matching q(z|h) to the prior p(z) is described as follows:

$$\mathcal{L}_G = -\frac{1}{K}\sum_{j=1}^{K}\log D(z_j), \quad z_j \sim q(z|h).$$

The discriminator is trained to distinguish the latent samples z ∼ p(z) from those produced by the probabilistic encoder conditioned on the input samples, q(z|h). The cost function used to train the discriminator D is:

$$\mathcal{L}_D = -\frac{1}{K}\sum_{i=1}^{K}\log D(z_i) - \frac{1}{K}\sum_{j=1}^{K}\log\big(1 - D(z_j)\big),$$

where z_i ∼ p(z), z_j ∼ q(z|h), and K is the size of the training batch.
We improve the structure of AAE by adding spectral constraints to the loss function. By calculating the spectral angle between corresponding pixels of the input image H̃ and the reconstructed image Ĥ, the spectral constraint loss is defined as:

$$\mathcal{L}_{SAD} = \frac{1}{MN}\sum_{i=1}^{MN} \arccos\left( \frac{\langle h_i, \hat{h}_i \rangle}{\| h_i \|_2 \, \| \hat{h}_i \|_2} \right),$$

where MN is the total number of pixels in the HS image. By adding the spectral constraint based on SAD to reduce spectral distortion, the total loss function of SCAAE can be described as follows:

$$\mathcal{L}_{total} = \mathcal{L}_{rec} + \mathcal{L}_G + \mathcal{L}_D + \lambda \mathcal{L}_{SAD},$$

where λ weights the spectral constraint. When the total loss is minimized, we obtain the feature of the HS image. Let Z denote the extracted feature of the up-sampled HS image H̃, and let l denote the total number of feature maps. Therefore, Z can be expressed by:

$$Z = [Z_1, Z_2, \ldots, Z_l],$$

where Z_i (i = 1, 2, ..., l) represents the ith feature map. The process of feature extraction is illustrated in Figure 2.
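The SAD-based spectral constraint can be sketched as a mean spectral angle between matching pixel spectra. The function below is an illustrative reimplementation, not the authors' code; `eps` is our addition to guard against division by zero:

```python
import numpy as np

def sad_loss(H, H_hat, eps=1e-12):
    """Mean spectral angle distance (radians) between matching pixel
    spectra of the input and reconstructed images; rows are pixels,
    columns are bands."""
    dots = np.sum(H * H_hat, axis=1)
    norms = np.linalg.norm(H, axis=1) * np.linalg.norm(H_hat, axis=1)
    angles = np.arccos(np.clip(dots / (norms + eps), -1.0, 1.0))
    return angles.mean()

rng = np.random.default_rng(0)
H = rng.random((100, 176))           # 100 pixels, 176 bands
same = sad_loss(H, 2.0 * H)          # scaling preserves the spectral angle
diff = sad_loss(H, rng.random((100, 176)))
```

Note that a uniform intensity scaling leaves the loss at essentially zero, which is exactly the shape-preserving behavior the spectral constraint targets.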

Feature Selection
Aiming at the problem of insufficient spatial information and spectral distortion, we select the feature based on the structural similarity index (SSIM) with our selection rule in the adaptive fusion approach. We denote Z as the extracted feature of the up-sampled HS image H̃, and l is the number of feature maps; thus, Z can also be expressed by Z = [Z_1, Z_2, ..., Z_l], where the ith map of the feature is denoted by Z_i. To gain less spectral distortion and complete spatial information, we compute the SSIM value between each feature map and the PAN image, and then select the feature map whose SSIM value is the largest. The feature map with the largest SSIM value is the one most similar to the PAN image; since the PAN image has sufficient spatial information, this feature map is selected as the spatial information of the up-sampled HS image. Accordingly, the spatial information used for fusion is complete. The selection rule is based on the following equation:

$$\mathrm{SSIM}(Z_i, P) = \frac{(2\mu_{Z_i}\mu_P + c_1)(2\sigma_{Z_i P} + c_2)}{(\mu_{Z_i}^2 + \mu_P^2 + c_1)(\sigma_{Z_i}^2 + \sigma_P^2 + c_2)},$$

where Z_i and P represent the ith feature map of the extracted feature Z and the PAN image, respectively. μ_{Z_i} and μ_P are the means of Z_i and P, σ_{Z_i}^2 is the variance of Z_i, and σ_P^2 is the variance of P. σ_{Z_i P} is the covariance between Z_i and P. c_1 = (0.01 × C)^2 and c_2 = (0.03 × C)^2 are stabilizing constants, where C is the dynamic range of the pixel values. A bigger SSIM value means a better result; the optimal value is 1, which indicates that the two compared images are exactly similar to each other [57,58]. The feature map with the largest SSIM value is denoted by Z_s. Therefore, we select Z_s, whose spatial structure is most similar to that of the PAN image, and use it to represent the spatial information of the HS image, so as to improve the spatial information of the fusion process and reduce the spatial distortion.
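The selection rule can be sketched as a single-window SSIM computed over whole images, followed by an argmax over feature maps. A windowed SSIM with Gaussian weighting would be closer to common practice; this global version is a simplification:

```python
import numpy as np

def global_ssim(a, b, C=1.0):
    # Single-window SSIM over the whole image; C is the dynamic range.
    c1, c2 = (0.01 * C) ** 2, (0.03 * C) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))

def select_feature(Z, P):
    # Pick the feature map most structurally similar to the PAN image.
    scores = [global_ssim(Z[..., i], P) for i in range(Z.shape[-1])]
    return int(np.argmax(scores)), scores

rng = np.random.default_rng(0)
P = rng.random((16, 16))
Z = rng.random((16, 16, 5))
Z[..., 3] = P + 0.01 * rng.standard_normal((16, 16))  # near-copy of PAN
best, scores = select_feature(Z, P)
```

As expected, the near-copy of the PAN image wins the selection, and an image compared with itself scores exactly 1.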

Solving the Model
As stated above, the overall problem can be described by Equation (6). In this part, we further explain our model and discuss its solution. There are three main steps in our model: obtaining the combined PAN image, injecting details, and solving the optimization equation. The specific processes are described in detail in Sections 3.3.1-3.3.3.

Obtaining the Combined PAN Image
The Laplacian of Gaussian (LOG) image enhancement algorithm is applied to the PAN image to improve robustness to noise and discrete points and to make the spatial information of the PAN image clearer. Firstly, Gaussian convolution filtering is used to remove the noise, and then the Laplacian operator is used to enhance the details of the denoised image. The algorithm is described as:

$$P_e = P - \omega \big( f_{LOG}(x, y) * P \big),$$

where P_e is the enhanced PAN image, f_LOG(x, y) is the kernel function of the LOG operator, ω is a constant, and * denotes the convolution operator. Spectral and spatial information of the HS and PAN images need to be considered simultaneously, because the two sensors provide different and complementary spatial information for the same scene. Consequently, the enhanced PAN image and the selected feature map are adaptively integrated, as described in Equation (7). Finally, the combined PAN image is obtained, and the adequacy of spatial information is guaranteed.
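A rough sketch of the sharpening step follows. The full LOG operator would pre-smooth with a Gaussian before applying the Laplacian, but the discrete Laplacian alone already illustrates the enhancement; the 3 × 3 kernel, the value of ω, and the sign convention are our assumptions:

```python
import numpy as np

LAP_KERNEL = np.array([[0, 1, 0],
                       [1, -4, 1],
                       [0, 1, 0]], dtype=float)  # discrete Laplacian; a
# Gaussian pre-smoothing step (omitted here) would give the full LOG.

def conv2_same(img, k):
    # Naive 'same' 2-D convolution with zero padding.
    ks = k.shape[0]
    pad = ks // 2
    padded = np.pad(img, pad)
    kr = k[::-1, ::-1]
    out = np.zeros_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(padded[i:i + ks, j:j + ks] * kr)
    return out

def enhance_pan(P, omega=0.2):
    # Sharpen by subtracting a weighted Laplacian response (omega is
    # illustrative; the paper treats it as a constant of the operator).
    return P - omega * conv2_same(P, LAP_KERNEL)

P = np.zeros((8, 8))
P[3:5, 3:5] = 1.0        # a bright square on a dark field
Pe = enhance_pan(P)      # edge pixels of the square are boosted
```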

Injecting Details
Although the adaptive fusion approach successfully obtains the spatial information from the feature map of the interpolated HS image and the detail layer of the enhanced PAN image, some spatial and spectral information of flat regions, whose pixel values are similar to the surrounding pixel values, has not been enhanced. As a result, some details should be injected to further improve the appearance. In Equation (8), a Gaussian filter is applied to the combined PAN image to remove the high-frequency component so that the low-frequency component is obtained. The spatial details are then acquired by taking the difference between the combined PAN image P_c and its Gaussian filtered version P_G:

$$I = P_c - P_G.$$

The spatial information I is injected into the respective bands of the interpolated image through the gains matrix to generate the fused HR HS image:

$$H_F^k = \tilde{H}^k + M^k \otimes I,$$

where H_F^k represents the kth band of the merged HS image. The gains matrix is

$$M^k = \beta\, \frac{\tilde{H}^k}{\frac{1}{L}\sum_{l=1}^{L}\tilde{H}^l},$$

where L is the band number of the reference HS image, H̃^k is the kth band of the up-sampled HS image, and M^k is the kth band of the gains matrix.
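The detail-injection step can be sketched as follows. The gains-matrix form (each band scaled by its ratio to the band mean, weighted by β) is an assumption based on common Brovey-style gain definitions, not a verbatim reproduction of the paper's formula:

```python
import numpy as np

def inject_details(H_up, P_c, P_g, beta=1.0):
    """Band-wise detail injection: I = P_c - P_g carries the high
    frequencies of the combined PAN image; each band receives details in
    proportion to its share of the band mean (assumed gains matrix)."""
    I = P_c - P_g                          # spatial detail layer
    band_mean = H_up.mean(axis=2)          # (1/L) * sum over bands
    M = beta * H_up / (band_mean[..., None] + 1e-12)
    return H_up + M * I[..., None]

rng = np.random.default_rng(0)
H_up = rng.random((8, 8, 4)) + 0.5         # toy up-sampled HS cube
P_c = rng.random((8, 8))                   # toy combined PAN image
P_g = 0.9 * P_c                            # stand-in for its Gaussian blur
H_F = inject_details(H_up, P_c, P_g)
```

With an all-zero detail layer (P_g equal to P_c), the cube passes through unchanged, which is a quick sanity check on the formulation.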

Solving the Optimization Equation
As mentioned above, Z_s, P_e, and M are defined or learned by our proposed method. Therefore, only α remains unknown in Equation (6). We obtain the parameter values by separately solving the equations obtained by forcing the partial derivatives with respect to α and β to zero.
Firstly, we take the partial derivative of Equation (6) with respect to α. By setting the resulting factor to zero, a per-pixel condition is obtained. To simplify the equation, we introduce the intermediate variables s_ij, for i = 1, 2, ..., M and j = 1, 2, ..., N, from which the value of s_ij can be obtained. As stated earlier, another expression of s_ij follows from the definitions above. We then define a new function of the input x, so that the equation of α can be formed, and finally we obtain the closed-form solution of Equation (27).
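Although the full derivation is omitted here, the closed-form flavour of the solution can be illustrated on a simplified surrogate: choosing α to best mix two detail sources A and B toward a target T in the least-squares sense. This surrogate problem is ours; the paper's actual α follows from zeroing the derivative of Equation (6):

```python
import numpy as np

def solve_alpha(T, A, B):
    """Closed-form minimizer of ||T - (alpha*A + (1-alpha)*B)||_F^2.
    Setting the derivative with respect to alpha to zero gives
    alpha = <T - B, A - B> / ||A - B||^2 (illustrative surrogate)."""
    num = np.sum((T - B) * (A - B))
    den = np.sum((A - B) ** 2)
    return num / den

rng = np.random.default_rng(0)
A = rng.random((8, 8))                 # e.g., selected feature map
B = rng.random((8, 8))                 # e.g., enhanced PAN image
T = 0.3 * A + 0.7 * B                  # ground-truth mix, alpha = 0.3
alpha = solve_alpha(T, A, B)           # recovers alpha exactly
```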

Performance Evaluation
After the model is established and solved, four widely used reference indexes are adopted for performance evaluation: CC, SAM, RMSE, and ERGAS. The fusion quality of the various HS pansharpening methods is compared and assessed by the cross correlation (CC) [59], the spectral angle mapper (SAM) [14], the root mean square error (RMSE) [60], and the erreur relative globale adimensionnelle de synthèse (ERGAS) [61]. CC is a spatial index evaluating the degree of spatial distortion; its ideal value is 1, and a larger CC value indicates better fusion quality. SAM is a spectral indicator that measures the spectral distortion between the HS image and the fused image. RMSE is a global index that appraises both the spatial and spectral fusion quality. ERGAS is also a global quality index, evaluating the spatial and spectral distortion. The optimal value of SAM, RMSE, and ERGAS is 0; the smaller the value, the better the fusion performance. A simulated HS image and a simulated PAN image can be generated from a given reference HS image. Subsequently, the fusion results can be obtained from the simulated HS image and the simulated PAN image. The fusion results are compared with the available reference HS image to evaluate the objective quality on the synthetic data sets.
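The four indexes can be sketched in a few lines each. These are illustrative implementations (e.g., ERGAS assumes a resolution-ratio parameter and per-band means), not necessarily the exact variants used in the experiments:

```python
import numpy as np

def cc(ref, fus):
    # Cross correlation averaged over bands; ideal value 1.
    vals = [np.corrcoef(ref[..., k].ravel(), fus[..., k].ravel())[0, 1]
            for k in range(ref.shape[2])]
    return float(np.mean(vals))

def sam_deg(ref, fus, eps=1e-12):
    # Mean spectral angle in degrees; ideal value 0.
    dots = np.sum(ref * fus, axis=2)
    norms = np.linalg.norm(ref, axis=2) * np.linalg.norm(fus, axis=2)
    ang = np.arccos(np.clip(dots / (norms + eps), -1.0, 1.0))
    return float(np.degrees(ang.mean()))

def rmse(ref, fus):
    # Global root-mean-square error; ideal value 0.
    return float(np.sqrt(np.mean((ref - fus) ** 2)))

def ergas(ref, fus, ratio=4):
    # Relative dimensionless global error in synthesis; ideal value 0.
    band_rmse = np.sqrt(np.mean((ref - fus) ** 2, axis=(0, 1)))
    band_mean = ref.mean(axis=(0, 1))
    return float(100.0 / ratio * np.sqrt(np.mean((band_rmse / band_mean) ** 2)))

rng = np.random.default_rng(0)
ref = rng.random((16, 16, 8)) + 0.5         # toy reference HS cube
fus = ref + 0.01 * rng.standard_normal(ref.shape)  # near-perfect fusion
```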

Experimental Results and Discussion
In this part, we compare the fusion quality of the proposed HS pansharpening method with ten state-of-the-art algorithms, which are SFIM, MTF-GLP, MTF-GLP-HPM, GS, GSA, GFPCA, CNMF, Lanaras's, HySure, and FUSE.

Data Set
To evaluate the effectiveness of the proposed HS pansharpening method, we perform experiments on four simulated HS data sets, i.e., Moffett Field data set [59], Salinas Scene data set, Pavia University data set, and Chikusei data set. The comparative experiments are conducted both quantitatively and visually.
The first data set is the Moffett Field data set, acquired by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor [62], which provides HS images with 224 bands in the spectral range of 400-2500 nm. Water absorption bands and damaged bands are discarded, and 176 bands are used for the experiment. The dimensions of the test HS image are 75 × 45 with a spatial resolution of 5.2 m, and the size of the test PAN image is 300 × 180 with a spatial resolution of 1.3 m.
The second data set is the Salinas Scene data set, which is also obtained by the AVIRIS sensor. This data set includes vegetables, bare soils, and vineyard fields. The fourth data set is the Chikusei data set, an airborne HS data set taken by the Headwall Visible and Near-Infrared series C (VNIR-C) imaging sensor over agricultural and urban areas in Chikusei, Ibaraki, Japan. This HS data set has 128 bands in the spectral range of 363-1018 nm. The scene consists of 2517 × 2335 pixels, and the ground sampling distance is 2.5 m. The dimension of the test HS image is 150 × 150, and the size of the test PAN image is 600 × 600.
The Moffett Field, Salinas Scene, Pavia University, and Chikusei data sets are all simulated data sets. According to Wald's protocol, given the reference HS image, simulated HS images and simulated PAN images can be obtained. The fusion result is then generated by fusing the simulated PAN image and HS image. To evaluate the objective quality on the simulated data sets, these fusion results are compared with the original reference HS images.
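Wald's protocol can be sketched as follows: the reference image is spatially degraded to simulate the low-resolution HS input, and spectrally averaged to simulate the PAN input. Block averaging and a flat spectral response are simplifying assumptions; in practice an MTF-shaped blur and a sensor spectral response would be used:

```python
import numpy as np

def simulate_inputs(ref, ratio=4):
    """Wald's-protocol sketch: block-average the reference over
    ratio x ratio windows for the simulated HS image, and average over
    bands for the simulated PAN image (flat spectral response assumed)."""
    M, N, L = ref.shape
    hs = ref.reshape(M // ratio, ratio, N // ratio, ratio, L).mean(axis=(1, 3))
    pan = ref.mean(axis=2)
    return hs, pan

rng = np.random.default_rng(0)
ref = rng.random((16, 16, 6))   # toy reference HS cube
hs, pan = simulate_inputs(ref)  # inputs to the pansharpening method
```

The fused output of any method run on `hs` and `pan` can then be scored against `ref` with the reference indexes above.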

Experimental Setup
To evaluate the sensitivity of the SCAAE to its key parameters in pansharpening and to ensure the quality of the final fusion results, we run the SCAAE with different values of the related parameters, including the number of hidden nodes in each layer, the depth, the form of the loss function, the activation function, the learning rate, the batch size, and the number of epochs. By trial and error, the number of hidden nodes in the last layer and the depth of the network are verified to have the most important effect on the pansharpening results. Precisely speaking, the impact of different parameter settings on the fusion performance, represented by the CC and SAM values, is studied with a controlled-variable approach: since the above-mentioned factors are independent of each other, the influence of each factor can be seen clearly by changing only one factor at a time while keeping the remaining factors unchanged. With the evaluation of the SAM and CC indicators, the most favorable parameter settings for the fusion results are obtained. More details are as follows.
In order to meet the requirement of spectral-information abundance, the number of hidden nodes in each of the first two layers is set to 500 to capture rich features of the input. Figure 3 plots the CC value between the input HS image and the generated HR HS image for a varying number of hidden nodes in the third layer. As Figure 3 shows, for most of the data sets, the proposed method achieves the best CC value when the number of hidden nodes in the last layer is set to 30; that is to say, the dimension of the extracted feature is 30 for optimal pansharpening performance. Notably, although the CC for the Pavia University data set is smaller than the others when the number of hidden nodes is 30, at a value of 0.9300, the results are close to the ideal value under the same conditions (0.9807, 0.9684, and 0.9565 for the Moffett, Salinas, and Chikusei data sets, respectively). Furthermore, we set the depth to 3 to evaluate the effect of the number of hidden nodes on SAM. In Figure 3b, it can be seen that the SAM value at 30 hidden nodes is much lower than those obtained at the other numbers, which is consistent with the CC value. Moreover, the performance declines when the number is smaller than 30 for all the tested data sets.
As for the depth of the network, we systematically vary the parameter settings one by one under the precondition of the parameters above and report the CC and SAM values. Figure 3c,d study the effect of the depth of the proposed method on the CC and SAM values. Similarly, in order to study the effect of a single variable on the results, we set the number of hidden nodes in the last layer to 30. CC specifically reflects the geometric distortion and shows an interesting trend, indicating how effectively spatial information is injected in each process. The CC value for the Salinas Scene data set remains stable as the depth varies from 1 to 5. For the Moffett Field and Chikusei data sets, the CC value decreases rapidly as the depth is changed from 3 to 4, while there is an obvious increase from 3 to 4 for the Pavia University data set. As a result, the depth is set to 3 by a comprehensive analysis of the proposed method in the experiments on the four data sets.
As can be seen from Figure 3d, although the values of SAM are very close under different depth settings for the Pavia University and Chikusei data sets, SAM reaches a lower level at a depth of 3 for the Moffett Field and Salinas Scene data sets. Therefore, the proposed method achieves a lower SAM value while ensuring the preservation of spectral information. From Figure 3c,d, we can conclude that the performance is best when the depth is set to 3: when the depth is set to 1 or 5, the values of CC are smaller and the values of SAM are larger than at a depth of 3. Hence, the experimental results and analysis prove that the performance of the whole network is most satisfactory when the depth of the network is set to 3.
In addition, the learning rate, weight decay, batch size, and number of epochs mainly influence the speed of convergence. Through fine-tuning, we empirically find that the network converges faster when the learning rate is set to 10^-4 and the decay is set to 0.9. Considering the trends of the CC and SAM values, we choose 30 as the number of hidden nodes and 3 as the depth of the network for all data sets in the following experiments.
To illustrate the process of feature extraction intuitively, the intermediate results of hidden nodes extracted by the SCAAE are shown visually in Figure 4 with parameters optimizing performance.
All experiments were performed in the Matlab (R2019a) environment on a server with an Intel(R) Core(TM) i5-7200U CPU @ 2.70 GHz, an Nvidia K80 GPU, and 128 GB of memory.

Component Analysis
In this subsection, the objective results of the proposed method are provided to validate the effect of the essential components step by step. Compared with the existing well-performed methods, the most outstanding advantage of the proposed method is that it not only considers the spatial information of the PAN image but also extracts deep features of the HS image as supplementary spatial information through SCAAE. Meanwhile, the method extracts deep features rather than the shallow features extracted by traditional methods, which benefits the fusion results. Thus, the effects of the significant processing components, i.e., deep feature extraction from the HS image, are analyzed from the perspective of the objective experimental results on the four tested data sets. In the proposed SCAAE based HS pansharpening model, Z_s represents the feature selected with the SSIM based feature selection rule. The MRA based method achieves pansharpening without the spatial information of the HS image. The PCA based method considers the spatial information with shallow features. As for the SCAAE based method, the deep feature is extracted by the network. Table 1 lists the average objective results of the traditional MRA based method, the PCA based method, and the SCAAE based method on the four tested data sets.

Pansharpening Results
The first experiment is tested on the Moffett Field data set. Figure 5a-c show the reference high-resolution HS image, the interpolated HS image, and the PAN image, respectively. Figure 5d-n show the false-color results of the estimated HR HS images obtained by the compared methods and the proposed method. The SFIM method preserves spectral information well, but some of the edges and textures are over-sharpened. Results from the MTF-based methods are similar to those of the SFIM method. This may come from the fact that the experiments are performed on simulated data sets, on which the MTF-based methods may not fully realize their potential. The GS-based methods achieve excellent spatial performance but visible spectral distortion. The results generated by GFPCA and CNMF are fuzzy since the effective spatial information is not sufficiently injected. The CNMF method shows promising results in the spectral aspect, whereas the edges in its reconstructed images are too sharp in some areas, such as the trees in the CNMF fused image. The FUSE and HySure methods show outstanding performance, but some details in the field are injected insufficiently, and the spectral information is not well preserved. Lanaras's method produces spectral aberration, with a much higher chromatic aberration than the real ground in some areas. On the contrary, the proposed method produces favorable results: the halo artifacts and the blurring problems are eliminated. The objective quantitative analysis for the Moffett Field data set is depicted in Table 2. After analyzing the experimental data in Table 2, we conclude that the proposed method obtains all the best values in SAM, RMSE, CC, and ERGAS. The results demonstrate that the proposed method performs well in both the spatial and spectral domains and yields the least global error.
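The four indexes reported in the tables follow standard definitions, which can be sketched in a few lines of numpy. This is a minimal sketch of the conventional formulas, not the paper's own implementation; the resolution ratio of 4 is an assumption:

```python
import numpy as np

def quality_metrics(ref, est, ratio=4):
    """Standard pansharpening indexes for (bands, H, W) cubes.

    CC: mean per-band correlation coefficient; SAM: mean spectral
    angle in degrees; RMSE: root mean squared error; ERGAS: relative
    dimensionless global error, with `ratio` the PAN/HS resolution
    ratio (assumed to be 4 here).
    """
    bands = ref.shape[0]
    cc = np.mean([np.corrcoef(r.ravel(), e.ravel())[0, 1]
                  for r, e in zip(ref, est)])
    r = ref.reshape(bands, -1)
    e = est.reshape(bands, -1)
    cos = (r * e).sum(0) / (np.linalg.norm(r, axis=0) *
                            np.linalg.norm(e, axis=0) + 1e-12)
    sam = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))).mean()
    rmse = np.sqrt(((ref - est) ** 2).mean())
    per_band_mse = ((ref - est) ** 2).mean(axis=(1, 2))
    mu2 = ref.mean(axis=(1, 2)) ** 2
    ergas = 100.0 / ratio * np.sqrt((per_band_mse / mu2).mean())
    return cc, sam, rmse, ergas
```

A perfect reconstruction gives CC = 1 with SAM, RMSE, and ERGAS all (near) zero; larger CC and smaller SAM/RMSE/ERGAS indicate better fusion.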
To clearly demonstrate the performance of the different pansharpening methods, the difference map is generated by subtracting the reference image from the pansharpened image pixel by pixel. From the results shown in Figure 6, we can see that the difference map obtained by our proposed method has the smallest value difference, which means that the fusion result obtained by this method is closest to the reference image. The second experiment is conducted on the Salinas Scene data set. Figure 7a-c show the reference high-resolution HS image, the interpolated HS image, and the PAN image, respectively. Figure 7d-n show the false-color results of the estimated HR HS images obtained by the compared methods and the proposed method. Visual analysis shows that the spatial details in the results of the SFIM method are slightly fuzzy. The GS method causes pronounced spectral distortion and lacks spatial detail information. In comparison, the GFPCA method has better spectral quality than GSA but shows an indistinct area in the left region. Although the Lanaras's and FUSE methods have a good spatial quality, they generate obvious spectral distortion in the lower half of the scene. It can be seen clearly that the result of HySure suffers from obvious spectral loss, and the details are injected insufficiently. For the Salinas Scene data set, the CNMF and GFPCA methods have high fidelity in rendering the spatial details, but the color difference from the reference image is noticeable, which indicates spectral distortion. By contrast, the proposed method improves the spatial performance while maintaining spectral information and achieves superior spatial quality as it adds more spatial details. The results of the proposed method are closest to the reference image. These facts show that the proposed algorithm performs well in both the spatial and spectral aspects.
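The difference maps used for the visual comparisons above amount to a per-pixel comparison against the reference. A minimal sketch, assuming the absolute difference averaged over bands (the signed variant differs only in dropping the `abs`):

```python
import numpy as np

def difference_map(reference, fused):
    """Per-pixel absolute difference between the fused and reference
    (bands, H, W) cubes, averaged over bands. Small values (blue in
    the paper's color maps) mean the result is close to the reference."""
    return np.abs(reference.astype(float) - fused.astype(float)).mean(axis=0)
```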
The objective quantitative analysis for the Salinas Scene data set is shown in Table 3. By analyzing the average values of SAM, CC, RMSE, and ERGAS, we can conclude that the proposed method has the largest values in CC and the smallest values in SAM and RMSE. Based on the comparison of the different methods, the proposed method indeed demonstrates excellent performance in both visual quality and objective indicators. In addition, as shown in Figure 8, the difference map is calculated to further verify the effectiveness of the proposed method. Similar to the Moffett data set, the blue areas in the difference map of this method are larger than those of the other competing methods. It can also be observed that, for all the comparison methods, the major differences mainly exist in edges and small-scale pixel regions, which could be further explored in future work. The third experiment is performed on the Pavia University data set. Figure 9a-c show the reference high-resolution HS image, the interpolated HS image, and the PAN image, respectively. Figure 9d-n show the false-color results of the estimated HR HS images obtained by the compared methods and the proposed method for the Pavia University data set. Visually, the results of the SFIM, GS, and GFPCA methods exhibit spectral distortion. The results of the GSA method are dim and blurred in some areas, such as the edges of metal sheets, for lack of sufficient spatial information injection from the PAN image and HS image. By analyzing and comparing the listed results, we can conclude that the MTF-based methods have good fusion performance, and the GFPCA method achieves better capability in preserving the spectral information than the GS and MTF-based methods. The HySure, FUSE, and CNMF methods keep the spectral information very accurate. However, they show deficient improvement of the spatial quality in some marginal areas, such as the edges of trees and roofs.
By contrast, the Lanaras's and the proposed methods achieve satisfactory performance, and the false-color image obtained by the proposed method is closest to the reference one. All the quality measures for each comparison method on the Pavia University data set have been calculated and recorded, and the average results are shown in Table 4. For the Pavia University data set, the objective quantitative analysis in Table 4 shows that the proposed method obtains the largest CC value and the smallest RMSE and SAM values, while the ERGAS value reaches the second best. Generally speaking, the proposed method has a better fusion effect than the other algorithms. In particular, the SAM of the proposed method consistently demonstrates the best objective result. This further demonstrates that the proposed method performs well in both the spatial and spectral aspects. The absolute difference maps between the fused images obtained by the different methods and the reference image are shown in Figure 10. It is obvious that the difference map obtained by our proposed method has more blue areas, which means that the pixel differences obtained by our proposed method are smaller than those of the other compared methods. Similar to the above three experiments, the fourth test, on the Chikusei data set, was conducted. Figure 11a-c show the reference high-resolution HS image, the interpolated HS image, and the PAN image, respectively. The fused false-color results obtained by the compared methods and the proposed method are shown in Figure 11d-n. The SFIM method shows significant spectral distortion, and the spatial information of the GS and GSA methods is injected insufficiently. Here, the MTF-based and GS-based methods do not perform well in the spatial domain. For the result of the GFPCA method shown in Figure 11h, the spatial details are injected insufficiently, and the fused HS image is fuzzy.
By contrast, the HySure, CNMF, and the proposed methods achieve better performance, and the false-color images obtained by the CNMF and the proposed method are closest to the reference one. The Lanaras's method has a slight spectral distortion at the edges, and its spatial information is added deficiently. Furthermore, it can be observed that the main distortion of the generated HR HS images exists in edges and small-scale pixel regions for all compared methods, which may be a direction for our future improvement. On overall consideration, the proposed method achieves better performance than the ten compared state-of-the-art methods. The objective quantitative analysis of the Chikusei data set in Table 5 shows that the proposed method has the smallest SAM and RMSE values. The CC value achieves the second best, being slightly smaller than that of HySure. The ERGAS value is somewhat high, probably because the detail injection is slightly excessive to ensure the quality in the spatial domain. However, in the visualization, we can clearly see that the result of the proposed method is better than those of the other methods, with more detailed texture. In general, the proposed algorithm outperforms the other methods in the comprehensive preservation of spatial and spectral information. The absolute difference maps between the fusion results obtained by the different methods and the reference image are shown in Figure 12. The gap between the difference maps of the other competing methods and that of the proposed method is quite large, which indicates that the proposed method performs better. In summary, the proposed method performs well in both the objective indicators and the visual effects on the above four data sets, which further proves that it can achieve advanced fusion performance.

Discussion
According to the four image quality metrics in Tables 2-5 on the different types of data sets, the proposed SCAAE-based pansharpening method is substantiated to better keep the spectral characteristics than the ten competing methods; Figures 6, 8, 10, and 12 illustrate the corresponding per-pixel differences between the results of the different methods and the reference images. The superiority of the proposed method is owing to the deep features extracted by SCAAE. The extracted features consider the spatial information not only from the PAN image but also from the HS image, making the spatial information more comprehensive, which improves the spatial quality of the fused image. The simple yet useful SSIM-based feature selection rule is a practical approach to select the feature closest to the PAN image, which plays an essential role in reducing the spectral distortion. While the convincing experiments and analysis discussed above have verified the effectiveness of the proposed SCAAE-based pansharpening method, there are still some interesting details that can be further discussed and become future work:
1. As a convenient and straightforward unsupervised learning model, the network structure of SCAAE can be improved in spatial information enhancement and spectral information maintenance. Next, we will try to extract richer features using a new loss function.
2. As an image quality enhancement technique, super-resolution plays a vital role in the preprocessing of every image application field. Next, we will explore more targeted pansharpening methods suitable for specific tasks.
3. The optimization equation that solves for the proportions of the HS and PAN images in the final fusion framework makes the method adaptive in finding these proportions. In future work, we can try to improve our model by adding more priors.
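The adaptive fusion proportion mentioned above can be illustrated with a toy least-squares formulation. This is only a sketch under assumptions: we suppose a single scalar weight `a` blends the two spatial-detail components (`s_hs` from the SCAAE features, `s_pan` from the PAN image, both hypothetical names) so as to best match a target detail image, with the closed-form minimizer clipped to [0, 1]. The paper's actual convex objective is not reproduced here:

```python
import numpy as np

def fusion_weight(target, s_hs, s_pan):
    """Closed-form solution of
        min_a || target - (a * s_hs + (1 - a) * s_pan) ||^2,
    clipped to [0, 1]. Inputs are spatial-detail images of equal shape.
    A toy stand-in for the paper's convex optimization step."""
    d = (s_hs - s_pan).ravel()
    r = (target - s_pan).ravel()
    denom = d @ d
    if denom < 1e-12:          # the two components are identical
        return 0.5
    return float(np.clip((r @ d) / denom, 0.0, 1.0))
```

When the target detail lies halfway between the two components, the weight comes out as 0.5, splitting the contribution evenly; the clipping keeps the blend a convex combination.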

Conclusions
In this paper, we propose a new HS pansharpening algorithm based on SCAAE to improve the spatial resolution of LR HS images with HR PAN images. An adaptive fusion approach is proposed to incorporate the deep spatial information of the HS images with the PAN images as well as preserve the spectral information well. Firstly, we propose the SCAAE-based pansharpening method to obtain the spatial information of the HS image, taking advantage of effective features extracted from the training HS data. Secondly, an adaptive fusion approach with a simple feature selection rule is introduced to make full use of the sufficient spatial information of the HS image and PAN image. Finally, to improve the quality of spectral information preservation, we introduce the spatial information from the two different sensors into an optimization equation to obtain the fusion proportions of the two parts. Moreover, separate from the modern DNN-based methods for HS pansharpening, we mine deep features of the HS image, reduce the computational burden, and obtain comprehensive spatial information. The main advantage of our work is that the deep features extracted through SCAAE provide non-negligible spatial information, which plays an essential role in the final results. As the experimental data show, through our proposed method, especially the feature extraction and feature selection parts, the CC, SAM, and RMSE values are improved by about 1.42%, 13.12%, and 29.26%, respectively, on average compared to the second-best method, HySure. Besides, the improvements in CC, SAM, and RMSE are 17.63%, 0.83%, and 11.02%, respectively, over the MRA-based method and 0.87%, 22.11%, and 20.66% over the PCA-based method, which convincingly demonstrates the validity of the spatial information preservation.
In general, the experiments have systematically verified the superiority of the proposed method in the enhancement of spatial information and the preservation of spectral information and have proved it to be effective and well-performed in pansharpening, both theoretically and objectively.