Hyperspectral and Multispectral Image Fusion by Deep Neural Network in a Self-Supervised Manner

Compared with multispectral sensors, hyperspectral sensors obtain images with high spectral resolution at the cost of spatial resolution, which constrains the further and precise application of hyperspectral images. An attractive way to obtain high-resolution hyperspectral images is hyperspectral and multispectral image fusion. In recent years, many studies have found that deep learning-based fusion methods outperform traditional fusion methods due to the strong non-linear fitting ability of convolutional neural networks. However, the performance of deep learning-based methods depends heavily on the size and quality of the training dataset, which constrains their application when a training dataset is unavailable or of low quality. In this paper, we introduce a novel fusion method, which operates in a self-supervised manner, for the task of hyperspectral and multispectral image fusion without training datasets. Our method proposes two constraints constructed from the low-resolution hyperspectral image and a fake high-resolution hyperspectral image obtained from a simple diffusion method. Several simulation and real-data experiments are conducted with several popular remote sensing hyperspectral datasets under the condition that no training datasets are available. Quantitative and qualitative results indicate that the proposed method outperforms traditional methods by a large margin.


Introduction
Different from multispectral remote sensing images, hyperspectral remote sensing images reflect not only the color information but also the physical properties of ground objects, which benefits Earth observation tasks such as ground object classification [1], target tracking [2] and environment monitoring [3][4][5]. However, due to signal-to-noise ratio limitations, the spatial resolution of hyperspectral images cannot be as high as that of multispectral images. The low resolution constrains the precise application of hyperspectral images. One mainstream strategy to obtain high-resolution hyperspectral (HR HSI) optical images is to fuse the spectral information from low-resolution hyperspectral (LR HSI) images with the spatial information from corresponding multispectral images (HR MSI). According to the study in [6], traditional LR HSI and HR MSI fusion methods can be roughly classified into three families: (1) component substitution-based methods (CS-based methods) [7][8][9][10]; (2) multiresolution analysis-based methods (MRA-based methods) [11][12][13][14]; (3) variation model-based methods (VM-based methods) [15][16][17][18][19][20].
CS-based methods are the most traditional LR HSI and HR MSI fusion methods. They extend the application of CS-based methods from the pansharpening task to LR HSI and HR MSI fusion tasks. CS-based methods share the same three steps to complete the fusion process. First, they project the LR HSI into a novel feature space; then some bands in the new feature space are substituted by bands from the HR MSI. Finally, the HR HSI is obtained by transforming the bands in the novel feature space back to the original space. Different projection methods lead to different CS-based methods. Beyond fusion, self-supervised deep learning has succeeded in low-level vision tasks such as style transfer [34], texture synthesis [39], super-resolution and inpainting [35]. For example, given a style image and a content image, reference [34] extracts the style representation of the style image and the content representation of the content image from the internal layers of a pre-trained deep neural network, such as the VGG19 network [38], which is a famous classification model, and combines them to optimize the input map. A style-transferred image is acquired when the optimization converges. In similar work, given a texture image, Leon Gatys et al. [39] show the generation of images in a VGG19 network with texture similar to but different from the reference image. Ulyanov et al. [35] find that the deep neural network itself can be viewed as a prior: with random noise as the network input and a known degradation model, the network can complete many low-level vision tasks including super-resolution, inpainting and denoising. The above methods work well because they establish a simple yet accurate relationship between the output and the given images. However, when it comes to spatial and spectral fusion tasks such as LR HSI and HR MSI fusion, they cannot extract accurate spatial and spectral features because existing methods cannot establish the complex relationship between the LR HSI, the HR MSI and the target image.
In order to obtain high-quality fusion results without training datasets, we introduce a novel strategy for LR HSI and HR MSI fusion. The proposed method operates in a self-supervised manner where all constraints are constructed from the LR HSI and HR MSI themselves. The process of the proposed strategy can be summarized as follows. First, the network takes a fake HR HSI as input, which is obtained roughly by a traditional information diffusion method, to produce an initial output. Then, one optimization term is constructed between the output and the LR HSI to constrain the spectral accuracy; another optimization term is constructed between the output and the fake HR HSI to constrain the spatial accuracy. By optimizing the network parameters with the two optimization terms, we obtain a final output with both high spatial and spectral accuracy. We summarize our contributions as follows:
• We introduce a strategy for self-supervised fusion of LR HSI and HR MSI. Different from deep learning methods, the proposed strategy gets rid of the dependence on the size and even the existence of a training dataset.
• A simple diffusion process is introduced as the reference to constrain the spatial accuracy of fusion results. Two simple but effective optimization terms are proposed as constraints to guarantee the spectral and spatial accuracy of fusion results.
• Several simulation and real-data experiments are conducted with popular hyperspectral datasets. Under the condition that no training datasets are available, our method outperforms all comparison methods, demonstrating the superiority of the proposed strategy.
The remainder of this paper is organized as follows. In Section 2, we present the workflow and details of the proposed strategy; in Section 3, experimental results of the proposed method are displayed and compared with other state-of-the-art fusion methods under the condition that no training datasets are available; in Section 4, we discuss some findings from our experiments; in Section 5, we summarize the merits and demerits of our method applied to LR HSI and HR MSI fusion and discuss potential improvements of the proposed method in the future.

Problem Formulation
Before introducing the proposed method, we give some important notations for simplification and state the problem of LR HSI and HR MSI fusion. X ∈ R^(w×h×C) denotes the LR HSI, where w, h and C are respectively the width, height and number of channels of X. Y ∈ R^(W×H×c) is the HR MSI, where W, H and c respectively denote the width, height and band number of Y. We aim to obtain Z ∈ R^(W×H×C), which shares the same spatial resolution with Y and the same spectral resolution with X. Specifically, the width and height of Y are much larger than those of X, while the channel number of X is much larger than that of Y, i.e., W ≫ w, H ≫ h and C ≫ c.
It is well accepted that X and Y are two degradation results of Z. On the one hand, X can be viewed as the product of down-sampling Z with some spatial down-sampling operator D_s such as the bicubic or bilinear algorithm. On the other hand, Y is thought to be obtained by down-sampling Z in the spectral dimension with some spectral down-sampling operator D_Φ. The two degradation processes are illustrated in Equations (1) and (2):

X = D_s(Z), (1)

Y = D_Φ(Z). (2)

The target HR HSI can then be obtained by solving Equation (3):

Z = arg min_{Z'} ||D_s(Z') − X|| + ||D_Φ(Z') − Y||. (3)

For the spectral degradation model D_Φ, existing studies all select a linear regression model or the spectral response function, which is also a linear model, in their simulation experiments. However, the results may not be satisfying when a linear spectral response function cannot accurately reflect the complex relation between Z and the real MSI. We will observe this phenomenon in the real-data experiment. We therefore abandon D_Φ and adopt another choice to constrain the spatial information.
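The two degradation models of Equations (1) and (2) can be sketched in code. This is an illustrative sketch only: the tensor shapes and the toy spectral response matrix R are assumptions, not the exact operators used in the paper.

```python
# Sketch of the two degradation models D_s (Eq. 1) and D_Phi (Eq. 2).
import torch
import torch.nn.functional as F

def spatial_degrade(Z, ratio=8):
    """D_s: bicubic spatial down-sampling of HR HSI Z (B, C, W, H) -> LR HSI X."""
    return F.interpolate(Z, scale_factor=1.0 / ratio,
                         mode="bicubic", align_corners=False)

def spectral_degrade(Z, R):
    """D_Phi: linear spectral response; R has shape (c, C), combining C bands into c."""
    return torch.einsum("kc,bcwh->bkwh", R, Z)

Z = torch.rand(1, 31, 64, 64)                               # toy HR HSI with 31 bands
R = torch.rand(3, 31)
R = R / R.sum(dim=1, keepdim=True)                          # row-normalized toy SRF
X = spatial_degrade(Z)                                      # shape (1, 31, 8, 8)
Y = spectral_degrade(Z, R)                                  # shape (1, 3, 64, 64)
```

The same linear SRF assumption is what the comparison methods rely on; the proposed method replaces D_Φ with the spatial constraint introduced below.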

Fusion Process
To guarantee the robustness of the fusion process, we first diffuse the spatial information from Y into all bands of X with GSA [10] to obtain a fake HR HSI Z_T as the backbone:

Z_T = GSA(X, Y). (4)

Z_T then serves as the input of the network G, from which we obtain an initial output Z:

Z = G(Z_T). (5)

To constrain the spectral accuracy of Z, we construct the spectral optimization term directly from X and Z:

Loss_spectral = ||Z ↓_n ↑_n − X ↑_n||_1, (6)

where ↓_n denotes down-sampling by a factor of n, ↑_n denotes up-sampling by a factor of n, and n is the spatial resolution ratio between Y and X. Although the down-sampling operation alone can well represent the spectral information of Z, we add the up-sampling operation to the constraint term to further strengthen the spatial information of Z.
Then we make use of Z_T to construct the spatial optimization term with Z to constrain the spatial accuracy of the output:

Loss_spatial = ||Z − Z_T||_1. (7)

With limited loss of spatial information, Z_T contains more spectral information of X than Y does, because of the diffusion operation in Equation (4). In this way, Z_T affects the spectral accuracy of fusion results less than Y would.
Finally, the total optimization term, described in Equation (8), combines the two optimization terms:

Loss_total = Loss_spectral + λ Loss_spatial, (8)

where λ denotes the weight of the spatial term.
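Equations (6)-(8) can be sketched as follows. The interpolation modes and the default value of λ are assumptions for illustration; the paper does not fix them in this section.

```python
# Sketch of the spectral (Eq. 6), spatial (Eq. 7) and total (Eq. 8) terms.
import torch
import torch.nn.functional as F

def down_up(img, n):
    """Down-sample by n then up-sample by n (bicubic), as in Equation (6)."""
    small = F.interpolate(img, scale_factor=1.0 / n,
                          mode="bicubic", align_corners=False)
    return F.interpolate(small, scale_factor=n,
                         mode="bicubic", align_corners=False)

def total_loss(Z_hat, X, Z_T, n=8, lam=1.0):
    """Z_hat: network output; X: LR HSI; Z_T: fake HR HSI from GSA."""
    X_up = F.interpolate(X, scale_factor=n, mode="bicubic", align_corners=False)
    loss_spectral = (down_up(Z_hat, n) - X_up).abs().mean()  # L1 norm, Eq. (6)
    loss_spatial = (Z_hat - Z_T).abs().mean()                # L1 norm, Eq. (7)
    return loss_spectral + lam * loss_spatial                # Eq. (8)
```

Both terms are simple L1 distances, which keeps the optimization stable on a single image pair.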
The optimization process ends when the total optimization term reaches its optimum. Different from deep learning-based methods whose target is a trained network, we abandon the network after optimization and retain only the output:

Z = G*(Z_T), (9)

where G* denotes the optimized network.

Remote Sens. 2021, 13, 3226

It is worth mentioning that the whole process does not depend on a training dataset and operates in a self-supervised manner. Actually, the proposed method can also be viewed as the process of spectral information correction of Z T .

Network Structure
We design a simple 5-layer deep neural network to complete the whole task. The whole structure and parameters are displayed in Figure 1. In each of the first four layers, there are a convolution operation, a batch-normalization operation and a non-linear activation operation. The last layer contains only a convolution operation and a non-linear activation operation. To avoid the gradient vanishing phenomenon, one skip connection is used between the first layer and the fifth layer. Detailed parameters are listed in Figure 1.
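A minimal PyTorch sketch of such a network is shown below. The channel width, kernel size and choice of activations are assumptions, since Figure 1 is not reproduced here; only the layer pattern (convolution + batch normalization + activation, a plain last layer, and a skip connection from the first layer to the fifth) follows the description above.

```python
import torch
import torch.nn as nn

class FusionNet(nn.Module):
    """Sketch of the simple 5-layer network with one skip connection."""
    def __init__(self, bands=31, width=64):
        super().__init__()
        def block(cin, cout):
            return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                                 nn.BatchNorm2d(cout), nn.ReLU(inplace=True))
        self.layer1 = block(bands, width)           # layer 1
        self.mid = nn.Sequential(block(width, width),
                                 block(width, width),
                                 block(width, width))  # layers 2-4
        # layer 5: convolution + activation only, no batch normalization
        self.layer5 = nn.Sequential(nn.Conv2d(width, bands, 3, padding=1),
                                    nn.Sigmoid())

    def forward(self, z_t):
        f1 = self.layer1(z_t)
        return self.layer5(self.mid(f1) + f1)       # skip: layer 1 -> layer 5
```

The input and output keep the full band count of the HSI, so the network acts as a band-wise correction of Z_T rather than a resolution-changing mapping.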

Datasets
We apply three hyperspectral datasets in the experiments with simulated multispectral images: the CAVE dataset, the Pavia University dataset and the Washington DC dataset. Two datasets are used in the experiments with real multispectral images: the CAVE dataset and the Houston 2018 dataset. We introduce these datasets in detail below.
The CAVE dataset consists of 32 HR HSI, whose spatial resolution is 512 × 512. The spectral range of the hyperspectral images covers 400 nm to 700 nm with a 10 nm interval between bands.
Each HR HSI image has a corresponding natural RGB image with the same spatial resolution. In the experiment on the CAVE dataset, we select six HR HSI images. We set the down-sampling ratio to 8, so the original HR HSI are down-sampled to 64 × 64 as the LR HSI images. For the simulation experiment, HR MSI are produced by down-sampling the HR HSI images in the spectral dimension. For the real-data experiment, we make use of the natural RGB images provided by the official CAVE dataset website https://www.cs.columbia.edu/CAVE/databases/multispectral/ (accessed on 12 August 2021).
The Pavia University dataset is obtained by the Reflective Optics System Imaging Spectrometer (ROSIS) sensor. The image, whose size is 610 × 340, has a spectral range between 430 nm and 860 nm. The original image has 115 bands, of which only 103 bands are used for experiments after processing. We crop a patch of size 280 × 280 × 103 from the dataset as the HR HSI. Then, we follow Wald's protocol and down-sample the HR HSI by 8 times to obtain the LR HSI of size 35 × 35 × 103. The HR MSI is simulated by linearly combining the channels of the HR HSI, finally yielding an HR MSI of size 280 × 280 × 4. In the experiment, we view the HR HSI as the ground truth.
The Washington DC dataset is an aerial hyperspectral image acquired by the Hydice sensor, whose spectral range is between 400 nm and 2400 nm. The image has a total of 191 bands and a size of 1208 × 307. We select a patch of size 280 × 280 × 191 as the HR HSI and the ground truth. Then we down-sample it to 35 × 35 × 191 to acquire the LR HSI, following Wald's protocol. We also simulate the HR MSI by down-sampling the HR HSI in the spectral dimension, finally obtaining an HR MSI of size 280 × 280 × 4.
Houston 2018 dataset: The Houston 2018 dataset was originally used in the 2018 IEEE GRSS Data Fusion Contest. It is produced and published by the University of Houston. The dataset consists of 14 pairs of hyperspectral images and natural HR MSI. The hyperspectral images, whose spatial resolution is 1 m, have a size of 601 × 596 and 48 bands. The HR MSIs, whose spatial resolution is 0.05 m, have a size of 12,020 × 11,920 and 3 bands. In the experiment, we select 8 pairs of hyperspectral and multispectral images to verify the effectiveness of the proposed method. First, we follow Wald's protocol and down-sample the HR MSI to a size comparable to the hyperspectral images. Then we crop patches of size 400 × 400 respectively from the hyperspectral images and the down-sampled multispectral images as the ground truth and the HR MSI for fusion. The hyperspectral images are further down-sampled by 8 times to 50 × 50 to get the LR HSI for fusion.

Comparison Methods
Different from deep learning-based methods which need datasets for training, the proposed method needs no training datasets and can be applied to a single image in a self-supervised manner. Hence, we select six state-of-the-art fusion methods from different families as comparison methods; all of them can operate under the same no-training-data condition as the proposed method. The six selected methods are respectively the Adaptive Gram-Schmidt method (GSA) [13], Coupled Nonnegative Matrix Factorization (CNMF) [17], Coupled Spectral Unmixing (ICCV15) [20], Generalized Laplacian Pyramid for HyperSharpening (GLPHS) [40], Hyperspectral Subspace regularization (HySure) [41] and Smoothing Filter-based Intensity Modulation for HyperSharpening (SFIMHS) [12].

Evaluation Methods
Four commonly used indexes are used to evaluate the fusion results of the proposed and the comparison methods: peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), correlation coefficient (CC) and spectral angle mapper (SAM). The first three indexes judge the spatial accuracy of fusion results, while the last one evaluates the spectral accuracy. For PSNR, SSIM and CC, a higher value means a better result, while for SAM, a lower value indicates more accurate spectral information.
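As a reference, the two indexes with the simplest closed forms, PSNR and SAM, can be computed as below. These are standard textbook definitions, not code from the paper; the cube layout (W, H, C) is an assumption.

```python
import numpy as np

def psnr(ref, est, data_range=1.0):
    """Peak signal-to-noise ratio over the whole cube (higher is better)."""
    mse = np.mean((ref - est) ** 2)
    return 10.0 * np.log10(data_range ** 2 / mse)

def sam(ref, est, eps=1e-8):
    """Mean spectral angle mapper in degrees (lower is better).
    ref, est: arrays of shape (W, H, C); the angle is taken per pixel
    between the two C-dimensional spectral vectors."""
    dot = np.sum(ref * est, axis=-1)
    norms = np.linalg.norm(ref, axis=-1) * np.linalg.norm(est, axis=-1)
    angles = np.arccos(np.clip(dot / (norms + eps), -1.0, 1.0))
    return np.degrees(angles.mean())
```

SSIM and CC follow their usual definitions and are available in libraries such as scikit-image.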

Implementation Details
All experiments are conducted with PyTorch 1.0 under Ubuntu 16.04. The Adam optimizer is used to optimize the fusion result, with the learning rate of the network set to 0.0002.
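The full self-supervised fitting loop with these settings can be sketched as follows. The iteration count, the inline loss (a compact restatement of Equations (6)-(8)) and the one-layer stand-in network in the usage example are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse(net, Z_T, X, n=8, lam=1.0, steps=200, lr=2e-4):
    """Fit the network on one image pair with Adam (lr=2e-4 as stated above).
    Z_T: fake HR HSI from GSA; X: LR HSI; `steps` is an assumed budget."""
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        Z_hat = net(Z_T)                                          # Eq. (5)
        X_up = F.interpolate(X, scale_factor=n,
                             mode="bicubic", align_corners=False)
        du = F.interpolate(F.interpolate(Z_hat, scale_factor=1.0 / n,
                                         mode="bicubic", align_corners=False),
                           scale_factor=n, mode="bicubic", align_corners=False)
        loss = (du - X_up).abs().mean() + lam * (Z_hat - Z_T).abs().mean()
        loss.backward()
        opt.step()
    with torch.no_grad():
        return net(Z_T)   # the network is discarded; only the output is kept

# Toy usage with a one-layer stand-in network and random tensors.
net = nn.Conv2d(31, 31, 3, padding=1)
Z_T = torch.rand(1, 31, 32, 32)
X = torch.rand(1, 31, 4, 4)
Z = fuse(net, Z_T, X, steps=5)
```

Note that, in contrast to trained fusion networks, the optimized weights here are specific to the single input pair and are thrown away after inference.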

Experiment with Simulated Multispectral Images
In our experiments, we attempt to obtain HR HSI with spatial resolution 8 times higher than the LR HSI through fusion, which is a very challenging task. In this part, we test our method with images from the CAVE dataset, the Pavia University dataset and the Washington DC dataset, whose HR MSI are simulated by linearly combining channels of the HR HSI according to the spectral response function. A linear spectral response function is also the basic assumption of the six comparison methods above.

CAVE Dataset
We visualize the fusion results of the proposed method and the six comparison methods on the CAVE dataset in Figure 2. Bands 11, 21 and 31 are selected as the R, G and B bands of the displayed images. Results of GSA, SFIMHS, GLPHS and CNMF are respectively displayed in Figure 2a-d. Results of ICCV15, HySure, the proposed method and the ground truth are presented in Figure 2e-h. To compare the results of all methods more intuitively, we further display their residual difference maps from the ground truth in Figure 3. It can be observed that the residual map of the proposed method is darker than those of the other methods, indicating that the proposed method obtains results with the least difference from the ground truth. We also compare the proposed method with the other six methods quantitatively in Table 1, which lists the four indexes mentioned above. We mark quantitative evaluation results with the highest scores in bold and those with the second highest scores with underline. The proposed method achieves the highest scores in all four indexes and outperforms the method with the second highest score by a large extent. For the co-variance map and RX detection map of results on the CAVE dataset with simulated HR MSI, please view Figures S3 and S8 in the Supplementary Materials.

Pavia University Dataset
The fusion results of patches from the Pavia University dataset obtained by the seven methods are displayed in Figure 4. We select the 37th, 68th and 103rd bands as the R, G and B bands for visualization. Figure 5 presents the residual information of the results from all methods compared with the ground truth. The proposed method has less residual information in the fusion result than the other methods. We also compare the fusion results quantitatively in Table 2. The evaluation results with the highest scores are marked in bold and those with the second highest scores are marked with underline. Again, the proposed method achieves the highest scores in all four evaluation indexes, indicating that the result of the proposed method is the most accurate in both spatial and spectral details. For the co-variance map and RX detection map of results on the Pavia University dataset with simulated HR MSI, please view Figures S6 and S11 in the Supplementary Materials.

Washington DC Dataset
We visualize the fusion results of the proposed method and the six comparison methods on the Washington DC dataset in Figure 6. Bands 16, 82 and 166 are selected as the R, G and B bands of the displayed images. Results of GSA, SFIMHS, GLPHS and CNMF are respectively displayed in Figure 6a-d. Results of ICCV15, HySure, the proposed method and the ground truth are presented in Figure 6e-h. To compare the results of all methods more intuitively, we further display their residual difference maps from the ground truth in Figure 7. It can be observed that the result of the proposed method has the least residual information. We also compare the proposed method with the other six methods quantitatively in Table 3, which lists the four indexes mentioned above. We mark quantitative evaluation results with the highest scores in bold and those with the second highest scores with underline. The proposed method achieves the highest scores in all four indexes and outperforms the method with the second highest score by a large extent. For the co-variance map and RX detection map of results on the Washington DC dataset with simulated HR MSI, please view Figures S4 and S9 in the Supplementary Materials. Scores marked in bold mean the best and those marked with underline mean the second best.

Experiment with Real Multispectral Images
The experiments presented above assume that HR MSI can be simulated by linearly combining the channels of the corresponding HR HSI according to the spectral response function. However, according to reference [6], the actual relationship between physically acquired HSI and MSI is non-linear, and it is hard to precisely establish the complex relationship between them. Therefore, methods based on the assumption of a simple linear relationship between real HR HSI and HR MSI will not operate well in real applications. The proposed method, however, is not based on this assumption and can deal well with complex spectral response functions in real situations.

CAVE Dataset
We pick one image among the six test images from CAVE and display the fusion results of the proposed method and the comparison methods in Figure 8. The 11th, 21st and 31st bands are chosen as the R, G and B bands for visualization. With the real multispectral images in the CAVE dataset, all six comparison methods, which are based on the assumption of a simple linear spectral response function, obtain unsatisfying fusion results. We also display the residual information maps between the ground truth and the results of all methods in Figure 9. For the six comparison methods, there is much more residual information in the experiment with real HR MSI than in the experiments with simulated HR MSI displayed in Figures 3 and 5. This confirms again that the six comparison methods follow the wrong assumption. The proposed method, however, has the least residual information compared with the other six methods on the CAVE dataset, and has equivalent residual information in the experiments with simulated HR MSI and real HR MSI, as can be observed in Figures 3 and 9. That means the proposed method does not rely on the accuracy of the spectral response function and can perform well in real situations. The quantitative evaluation scores of the fusion results from all methods are listed in Table 4. We mark the highest scores in bold and the second highest with underline. It is worth noting that in the experiment with real HR MSI, none of the six comparison methods obtain scores as high as those in the experiments with simulated HR MSI, which again indicates that they are not ready for real situations. The proposed method, however, acquires the highest scores in all indexes and outperforms the comparison methods by a large extent.
Comparing the quantitative evaluation results of the proposed method in the experiments with real HR MSI and those with simulated HR MSI, the proposed method obtains equally good results in both situations, which again confirms that it is not constrained by the accuracy of the spectral response function and can be used in practice. For the co-variance map and RX detection map of results on the CAVE dataset with real HR MSI, please view Figures S2 and S7 in the Supplementary Materials.

Houston 2018 Dataset
We display the fusion results of the proposed method and the comparison methods in Figure 10. From Figure 10, we observe that none of the comparison methods can acquire results with both sharp spatial details and accurate spectral information. Compared with these methods, the results of the proposed method have accurate spectral and spatial information.
We also display the residual information maps between the ground truth and the results of all methods in Figure 11. The proposed method has the least residual information in the fusion results compared with the other six methods on the Houston 2018 dataset. The quantitative evaluation scores of the fusion results from all methods are listed in Table 5. The proposed method acquires the highest scores in all indexes and outperforms the comparison methods by a large extent.

Discussion
In our experiment results, we find that the proposed method can even obtain enhanced results compared with the ground truth images. As mentioned in reference [17], out-of-focus blur exists at the extremes of the spectral range because different channels are acquired individually at a fixed focal length with tunable filters. An example is selected from the CAVE dataset. The 1st band of the hyperspectral image is displayed in Figure 12a. There is no texture information in the 1st band of the ground truth image, while there is rich texture information in the corresponding RGB image. However, the proposed method can inject the spatial information from the RGB image into this band of the result without largely changing the spectral accuracy, as shown in Figure 12b.

Conclusions
In this paper, we introduce a novel strategy that applies the strong fitting ability of deep neural networks to the LR HSI and HR MSI fusion task and operates without training datasets in a self-supervised manner. The spatial information of the target is constrained by the fake HR HSI obtained by spatial diffusion, and the spectral accuracy is constrained by the LR HSI. A simple deep neural network is used to complete the interpolation process. We conduct several simulation and real-data experiments on popular hyperspectral datasets to compare the proposed method with other state-of-the-art methods. Quantitative and qualitative results confirm the superior performance and higher accuracy of the proposed method compared with other fusion methods.
In spite of the good performance of our method, the optimization process is time-consuming. In future work, we will train a deep neural network in a self-supervised manner with the proposed strategy and process images in a feed-forward manner. We will also attempt to further improve the accuracy of fusion results by combining other strategies such as recurrence.
Supplementary Materials: The following are available online at https://www.mdpi.com/article/10.3390/rs13163226/s1. Figure S1: Curve of spectral difference between selected pixels and ground truth; Figure S2: co-variance matrix of CAVE dataset experiment with real HR MSI; Figure S3: co-variance matrix of CAVE dataset experiment with simulated HR MSI; Figure S4: co-variance matrix of Washington DC dataset experiment with simulated HR MSI; Figure S5: co-variance matrix of Houston 2018 dataset experiment with real HR MSI; Figure S6: co-variance matrix of Pavia dataset experiment with simulated HR MSI; Figure S7: RX map of CAVE dataset experiment with real HR MSI; Figure S8: RX map of CAVE dataset experiment with simulated HR MSI; Figure S9: RX map of Washington DC dataset experiment with simulated HR MSI; Figure S10: RX map of Houston 2018 dataset experiment with real HR MSI; Figure S11: RX map of Pavia dataset experiment with simulated HR MSI.

Conflicts of Interest:
The authors declare no conflict of interest.