Hyperspectral Pansharpening Based on Improved Deep Image Prior and Residual Reconstruction

Hyperspectral pansharpening aims to fuse a low-resolution hyperspectral image (LR-HSI) with a registered panchromatic image (PAN) to generate an enhanced HSI with high spectral and spatial resolution. Recently proposed HS pansharpening methods have obtained remarkable results using deep convolutional networks (ConvNets), which typically consist of three steps: (1) up-sampling the LR-HSI, (2) predicting the residual image via a ConvNet, and (3) obtaining the final fused HSI by adding the outputs from the first and second steps. Recent methods have leveraged Deep Image Prior (DIP) to up-sample the LR-HSI due to its excellent ability to preserve both spatial and spectral information without learning from large data sets. However, we observed that the quality of up-sampled HSIs can be further improved by introducing an additional spatial-domain constraint alongside the conventional spectral-domain energy function. We define our spatial-domain constraint as the $L_1$ distance between the predicted PAN image and the actual PAN image. To estimate the PAN image of the up-sampled HSI, we also propose a learnable spectral response function (SRF). Moreover, we noticed that the residual image between the up-sampled HSI and the reference HSI mainly consists of edge information and very fine structures. In order to accurately estimate this fine information, we propose a novel over-complete network, called HyperKite, which focuses on learning high-level features by constraining the receptive field from increasing in the deep layers. We perform experiments on three HSI datasets to demonstrate the superiority of our DIP-HyperKite over the state-of-the-art pansharpening methods. The deployment codes, pre-trained models, and final fusion outputs of our DIP-HyperKite and the methods used for the comparisons will be made publicly available at https://github.com/wgcban/DIP-HyperKite.git.


I. INTRODUCTION
HYPERSPECTRAL images (HSIs) with a large number of spectral bands have gained immense attention in the field of remote sensing due to their applications in broad research areas such as classification [1], unmixing [2], anomaly detection [3], change detection [4], etc. However, due to the limited incident energy available when capturing an image, hyperspectral imaging systems face trade-offs between spectral resolution, spatial resolution, and signal-to-noise ratio (SNR) [5]. For this reason, hyperspectral imaging systems can provide images with high spectral resolution but with low spatial resolution. In contrast, multispectral imaging systems can provide data with high spatial resolution but with fewer spectral bands (e.g., panchromatic images or multispectral images (MSIs) with three or four spectral bands). Low spatial resolution in HSIs leads to relatively poor performance in some practical remote sensing applications, such as road topology extraction [6] and spectral unmixing [7]. Therefore, full-resolution HSIs with high spatial and spectral resolution are desired. One way to obtain such ideal HSIs is to fuse high spectral resolution HSIs with high spatial resolution PAN/MSIs. This fusion process is called HS pansharpening in the remote sensing literature, and it is indeed a form of super-resolution.
Traditional pansharpening methods can be mainly divided into four classes [5]: (1) component substitution (CS), (2) multi-resolution analysis (MRA), (3) Bayesian, and (4) matrix factorization. Component substitution methods rely on substituting the spatial component of the HSI with the MSI/PAN image. The family of CS contains algorithms such as Gram-Schmidt adaptive (GSA) [8], [9], principal component analysis (PCA) [10]-[12], and intensity-hue-saturation (IHS) [13]. Even though the CS methods usually generate pansharpened HSIs with accurate spatial information, they sometimes suffer from critical spectral distortions. The MRA approaches are based on injecting the spatial details obtained through the multi-scale decomposition of the MSI/PAN image into the HSI. In order to extract the spatial details from the PAN image, several algorithms have been proposed in the literature, such as decimated wavelet transform (DWT) [14], undecimated wavelet transform (UDWT) [15], smoothing filter-based intensity modulation (SFIM) [16], modulation transfer function with generalized Laplacian pyramid (MTF-GLP) [17], and MTF-GLP with high-pass modulation (MTF-GLP-HPM) [18]. In contrast to the CS methods, the MRA family performs better in spectral preservation, but is more sensitive to registration errors, which may cause critical distortions in the spatial domain. Due to these inherent advantages and disadvantages of CS and MRA approaches, there have been works which attempted to combine both CS and MRA methods. One representative of hybrid CS-MRA algorithms is guided filter PCA (GFPCA) [19]. The Bayesian-based methods also provide a convenient way to regularize the fusion problem by modeling the posterior distribution of the target HSI given the LR-HSI and the MSI/PAN image.
Examples of algorithms based on the Bayesian inference framework include convex regularization under a Bayesian framework (abbreviated as HySure) [20], naive Bayesian Gaussian prior (abbreviated as BF) [21], and sparsity promoted Gaussian prior (abbreviated as BFS) [22]. Finally, coupled non-negative matrix factorization (abbreviated as CNMF) is a representative matrix factorization-based method, which regularizes the fusion problem by using the priors of spectral unmixing [23]. However, the fusion performance of traditional pansharpening approaches is generally limited due to their inadequate representation ability. In addition, the algorithms mentioned above may result in severe quality degradation when their assumptions do not align with a particular dataset. Furthermore, most traditional pansharpening approaches typically reach the optimal solution through an iterative process, which is time-consuming and inefficient.
Recently, deep learning (DL) models based on convolutional neural networks (ConvNets) have also been introduced for the HS pansharpening problem due to ConvNets' excellent ability to learn high-level features automatically. ConvNet-based HS pansharpening methods generally consist of three steps, 1) Up-sampling step: Up-sampling the LR-HSI to the spatial resolution of the PAN image, 2) Residual reconstruction step: Concatenating the upsampled HSI and PAN image along the spectral dimension and passing it through a residual learning network to learn the residual image, 3) Final fusion step: Obtaining the final fused HSI by adding the up-sampled HSI and the residual image.
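The three-step pipeline above can be sketched end-to-end in a few lines. This is a minimal numpy illustration, not the paper's implementation: nearest-neighbor up-sampling stands in for the DIP network, and `predict_residual` is a placeholder callable standing in for the residual ConvNet; the function name is ours.

```python
import numpy as np

def pansharpen_three_step(lr_hsi, pan, predict_residual, beta):
    """Generic three-step ConvNet-style pansharpening pipeline (sketch).

    lr_hsi: (l, w, h) low-resolution HSI
    pan:    (1, beta*w, beta*h) panchromatic image
    predict_residual: callable standing in for the residual network
    """
    # Step 1: up-sample the LR-HSI to the PAN resolution
    # (nearest-neighbor here; the paper uses a DIP network instead)
    up_hsi = np.kron(lr_hsi, np.ones((1, beta, beta)))
    # Step 2: predict the residual from the stacked up-sampled HSI and PAN image
    residual = predict_residual(np.concatenate([up_hsi, pan], axis=0))
    # Step 3: final fusion by addition
    return up_hsi + residual
```

With a zero residual, the output reduces to the plain up-sampled HSI, which makes the role of the residual branch explicit.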
There have been many methods proposed to up-sample the LR-HSI to the spatial resolution of the PAN image. In the earliest studies, nearest-neighbor and bicubic interpolation were popular choices for up-sampling. However, these methods conduct up-sampling on each band of the LR-HSI successively, thus ignoring the high spectral correlation of HSIs, which may lead to spectral distortions [24], [25]. In order to minimize the spectral distortion, data-driven up-sampling techniques (i.e., deep super-resolution networks) have also been utilized in HS pansharpening. The LapSRN [26] network is an example of such a data-driven super-resolution method, which progressively super-resolves a LR image in a coarse-to-fine manner in a Laplacian pyramid framework. However, the LapSRN method requires a large number of images for training, which is impractical in the HS domain due to the limited number of datasets available to the public. A remedy to this problem was proposed by Ulyanov et al. [25] in a deep learning-based super-resolution framework called deep image prior (DIP). DIP uses the structure of a randomly initialized ConvNet as an image prior, in place of handcrafted priors such as bicubic interpolation. This method requires no training data yet produces much cleaner results with sharper edges. Motivated by the super-resolution performance of DIP in the RGB domain, researchers have applied DIP to the HS pansharpening problem [24], [27] and achieved impressive results. However, we observed that the energy function defined in HS DIP up-sampling directly adopts the energy function formulated for the RGB DIP process, which imposes only a spectral-domain constraint by computing the $L_1$ distance between the down-sampled version of the target up-sampled HSI and the LR-HSI. The existing HS DIP methods do not impose any spatial-domain constraint by utilizing the available PAN image.
We address this issue by introducing an additional spatial-domain constraint to the HS DIP process as our first contribution.
For residual reconstruction, various ConvNet architectures have been proposed in the literature to accurately predict the residual component between the up-sampled HSI and the reference HSI with less spectral and spatial distortion. Among those, Giuseppe et al. [28] were the first to introduce a simple three-layer ConvNet architecture for the residual learning. Further, Lin et al. [29] improved the spatial and spectral prediction capability of Giuseppe's work (abbreviated as HyperPNN) by introducing spectral and spatial prediction modules. To further enhance the representational power of ConvNets, attention mechanisms [30] have also been introduced. Among those, Zheng et al. [24] proposed a spatial and spectral attention mechanism (abbreviated as DHP-DARN) for the residual learning, in which several channel-spatial-attention residual blocks are cascaded to adaptively learn more informative channel-wise and spatial-domain features simultaneously. More recently, Xu et al. [31] proposed a design (abbreviated as SDPNet) based on two encoder-decoder networks to extract deep-level features from the two types of source images, with densely connected blocks to strengthen feature propagation. However, we experimentally observed that most of the existing residual learning methods fail when predicting high-frequency information, such as edges and delicate structures in the residual image. The main reason for this is the increasing receptive field of the network in its deep layers. Motivated by this observation, we introduce an over-complete network, called HyperKite, for the residual reconstruction task as our second contribution, which constrains the receptive field from increasing in deep layers, thus extracting more high-frequency information.
The main contributions of this paper are summarized as follows: 1) A novel spatial constraint is introduced for the DIP up-sampling process. To the best of our knowledge, this is the first study that integrates both spatial and spectral constraints into DIP up-sampling. The proposed spatial constraint significantly improves the spatial and spectral performance measures of the up-sampled HSIs. 2) An over-complete network, called HyperKite, is proposed for the residual reconstruction, which is highly capable of extracting high-frequency information of the residual image by appropriately constraining the receptive field of the network. 3) We conduct extensive experiments to clearly demonstrate the improvements brought by our contributions to HS pansharpening. We compare the fusion performance of DIP-HyperKite with both conventional and deep learning-based approaches. The deployment codes, pre-trained models, and final fusion results of our DIP-HyperKite, as well as the comparison methods in the results and discussion, will be made publicly available at https://github.com/wgcban/DIP-HyperKite.git.
The rest of this paper is organized as follows. Section II provides some basics of DIP and over-complete representations.
In Section III the proposed DIP-HyperKite is described in detail. Section IV describes the datasets and performance metrics that we used in the experiments. In Section V, the experimental results and discussions of different data sets are presented. Finally, the conclusions are drawn in Section VI.
II. RELATED WORK

A. DIP for HSI up-sampling

Generally, ConvNets have an excellent ability to learn realistic image priors from a large amount of visual data, placing them in leading positions on the benchmarks of various image processing tasks [32], [33]. Contrary to the general opinion that deep networks require large data to capture image priors, DIP [25] has shown that a randomly initialized network can capture low-level image statistics before any training. Concretely, in HS pansharpening, DIP can generate the up-sampled HSI x_dip of the LR-HSI y with spatial up-sampling factor β by taking a fixed randomly initialized vector z as the input and utilizing the deep network as a parametric function x_dip = f_θ(z). Next, the network is optimized over its parameters θ to obtain the up-sampled HSI x_dip as follows:

min_{x_dip} Q(x_dip; y) + R(x_dip),   (1)

where Q(x_dip; y) is an energy function that controls the fidelity toward the LR-HSI y, and R(x_dip) is a regularization function based on prior knowledge. In [25], it has been shown that the regularization term R(x_dip) can be implicitly substituted by the deep network. Therefore, the minimization problem in (1) simplifies to optimizing the network over its parameters θ as follows:

θ* = arg min_θ Q(f_θ(z); y),  x_dip = f_θ*(z),   (2)

where θ* denotes the optimal set of parameters of the network. Furthermore, the most straightforward and commonly utilized energy function in HS pansharpening is the L_1 distance [29] between the down-sampled version of the up-sampled HSI x_dip and the LR-HSI y:

Q(x_dip; y) = ‖d(x_dip) − y‖_1,   (3)

where d(·) denotes the down-sampling operator by a factor of β.
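A spectral-only DIP optimization loop of the kind described by (2) and (3) can be sketched in PyTorch as follows. This is a hedged illustration, not the paper's implementation: `dip_upsample` is our own name, bilinear interpolation stands in for the down-sampling operator d(·), and the network `net` is supplied by the caller.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def dip_upsample(lr_hsi, beta, net, steps=2000, lr=1e-3):
    """Spectral-only DIP up-sampling: minimize ||d(f_theta(z)) - y||_1
    over the network parameters theta, for a single LR-HSI y."""
    l, w, h = lr_hsi.shape
    z = 0.1 * torch.rand(1, l, beta * w, beta * h)   # fixed random input tensor
    y = lr_hsi.unsqueeze(0)
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(steps):
        x_dip = net(z)                               # candidate up-sampled HSI
        # d(.): down-sample x_dip back to the LR grid (bilinear stand-in)
        y_hat = F.interpolate(x_dip, size=(w, h), mode='bilinear',
                              align_corners=False)
        loss = (y_hat - y).abs().mean()              # L1 spectral-domain energy
        opt.zero_grad()
        loss.backward()
        opt.step()
    return net(z).squeeze(0).detach()
```

Note that no training set is involved: the loop fits the network parameters to a single LR-HSI, relying on the network structure itself as the prior.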

B. Over-complete ConvNets
Most of the current architectures in deep learning are "encoder-decoder" [34]-[36] based. Here, the encoder translates the high-dimensional input to a low-dimensional latent space, while the decoder learns to take the latent low-dimensional representation back to a high-dimensional output. These types of architectures learn low-level features at their initial layers and high-level features at their deeper layers. They are termed under-complete networks, as the input is taken to a lower spatial dimension in the latent space.
In signal processing, over-complete dictionaries are widely used for their highly robust characteristics [37]. The number of basis functions here is larger than the number of input signal samples, which enables a higher flexibility for capturing structure in data. In [38], over-complete auto-encoders were found to be better feature extractors for denoising when compared to under-complete auto-encoders. In an over-complete network [39], the encoder takes the input data to a higher spatial dimension, unlike a traditional encoder. This is achieved by using an upsampling layer after every convolutional layer in the encoder. Using upsampling layers in the encoder causes the receptive field to be constrained in the deep layers. This causes the deep layers in the network to learn more fine-context high-frequency information when compared to under-complete networks. The growth of the receptive field of an over-complete network at the i-th layer can be generalized as

RF_i = k + (k − 1) Σ_{j=1}^{i−1} (1/2)^j,

where the initial receptive field of the conv filter is assumed to be k × k on the image I, and each convolutional layer is followed by a ×2 upsampling layer. This phenomenon is visualized in Fig. 1. As shown in Figure 1 (b), by employing an upsampling layer after every convolutional layer in the encoder, the over-complete network restricts the receptive field size to a smaller region, which forces the network to learn very fine edges as it tries to focus heavily on smaller regions. This is completely different from conventional under-complete architectures, where downsampling after each convolution block makes the network focus on a much larger region of the input, as shown in Figure 1 (a). Over-complete networks in deep learning are a relatively new topic and were initially proposed for medical image segmentation of small anatomy [39].
It has since been successfully extended to solve fine-context requiring tasks like fine edge segmentation of 3D volumes [40], deep subspace clustering [41], MRI reconstruction [42], adversarial defense against videos [43] and image restoration problems like single image de-raining [44].
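The damped growth of the receptive field described above can be illustrated with a small helper. `receptive_field` is a hypothetical function of ours implementing the recursion in which every conv layer (kernel k × k) is followed by a ×2 resampling, so the effective stride between adjacent feature-map pixels halves (over-complete) or doubles (conventional encoder) at each stage.

```python
def receptive_field(i, k=3, upsample=True):
    """Approximate receptive field (on the input image) of the i-th conv layer.

    upsample=True  -> over-complete encoder: growth is damped, bounded by ~2k-1
    upsample=False -> conventional encoder with x2 downsampling: rapid growth
    """
    rf, jump = float(k), 1.0
    for _ in range(i - 1):
        jump = jump / 2 if upsample else jump * 2  # resampling rescales the stride
        rf += (k - 1) * jump                       # next conv adds (k-1)*jump pixels
    return rf
```

For k = 3, the over-complete receptive field stays bounded below 2k − 1 = 5 pixels regardless of depth, while the conventional encoder's receptive field grows geometrically, which matches the behavior sketched in Fig. 1.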

III. METHODOLOGY
The overall flowchart of the proposed DIP-HyperKite for HS pansharpening is shown in Figure 2. As can be seen from Figure 2, the proposed method consists of two main steps. In the first step, the LR-HSI y ∈ R^{l×w×h}, with w × h pixels and l spectral bands, is up-sampled to the spatial resolution of the PAN image p ∈ R^{1×βw×βh}, where β denotes the ratio between the spatial resolutions of p and y. We denote the output from the DIP process as x_dip ∈ R^{l×βw×βh}. In the second step, we train an over-complete deep network which takes the up-sampled HSI x_dip and the corresponding PAN image p as inputs to predict the residual component x_res between the up-sampled HSI x_dip and the reference HSI x_ref.

Fig. 2: Overall flowchart of the proposed DIP-HyperKite. Step (i): we up-sample the LR-HSI y via the DIP process to obtain the up-sampled HSI x_dip. The DIP network takes a fixed noise tensor z as input for a given LR-HSI y, and produces the up-sampled HSI x_dip by optimizing the proposed spatial+spectral energy function Q_ss over the DIP network parameters θ. Step (ii): we take the up-sampled HSI x_dip and the PAN image p as inputs to predict the residual component x_res using our proposed over-complete network, HyperKite. Finally, the predicted residual image x_res is added to the up-sampled HSI x_dip to obtain the pansharpened HSI x.

A. Up-sampling via DIP
As shown in Figure 2, the low-resolution HSI y is up-sampled to the spatial resolution of the PAN image p using the DIP. This recently introduced DIP method is different from other existing up-sampling techniques such as bicubic interpolation and LapSRN [45]. The main advantage of DIP over these conventional methods is that it does not require a large dataset for training. In other words, for each LR image y, the DIP network takes a fixed random tensor z as an input and optimizes the network parameters θ by minimizing the loss function Q, which is defined in terms of the output up-sampled image x_dip and the available LR-HSI y as given in (3). In contrast, the LapSRN network utilized in [46] relies heavily on RGB image datasets and knowledge adaptation techniques. Furthermore, the bicubic and LapSRN methods up-sample each band in the HSI separately, thus ignoring the high spatial correlation between the spectral bands, which results in the loss of spatial details. Although the DIP method is capable of producing high-quality up-sampled images compared to the other existing methods, it only utilizes the information from the LR-HSI y, thus only imposing a constraint in the spectral domain. However, we observed that the quality of the up-sampled HSIs can be further improved by incorporating an additional spatial constraint in the loss function using the available PAN image p. In the next section we explain our novel spatial+spectral loss function.
1) Proposed spatial+spectral energy function for HS DIP: As we discussed in Section II-A, the energy function given in (3) enforces a constraint only in the spectral domain by defining the L_1 distance between the down-sampled version of the up-sampled HSI x_dip and the LR-HSI y. Instead, we propose a loss function (denoted by Q_ss) for HS DIP, which enforces constraints in both spatial and spectral domains as follows:

Q_ss(x_dip; y, p) = ‖d(x_dip) − y‖_1 + λ‖p̂ − p‖_1,  with  p̂ = (1/l) Σ_{i=1}^{l} (s ⊙ x_dip)[i],   (5)

where s ∈ R^{1×l} denotes the spectral response function, ⊙ is the element-wise (channel-wise) multiplication, p̂ is the estimated PAN image, and λ is a regularization constant. The first term in (5) enforces the spectral constraint on x_dip as in (3), and the additional second term enforces a constraint in the spatial domain on x_dip by utilizing the available PAN image p.

Fig. 3: The proposed learnable spectral response function s, and the computational procedure for evaluating the spatial loss term Q_spatial. We take the up-sampled HSI x_dip as the input and feed it into a Global Average Pooling (GAP) layer, which yields a vector with a single entry for each spectral band. Then we pass it through a gating mechanism, forming a bottleneck with two fully-connected (FC) layers (1 × 1 convolutions) around the non-linearity, to learn the spectral response of each band. Next, we apply a Softmax activation function to obtain the normalized spectral response s, and then take the channel-wise multiplication followed by channel averaging to obtain the estimated PAN image p̂. Finally, we compute the L_1 distance between the estimated PAN image p̂ and the reference PAN image p to obtain the spatial loss Q_spatial.
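Evaluating this spatial+spectral energy can be sketched in numpy as follows, under the assumptions that d(·) is a simple box-filter down-sampling and that the estimated PAN image is the channel average of s ⊙ x_dip; the function name and the default λ are ours, not the paper's.

```python
import numpy as np

def q_ss(x_dip, y, pan, s, lam=0.1, beta=4):
    """Sketch of the spatial+spectral DIP energy of Eq. (5).

    x_dip: (l, bw, bh) up-sampled HSI; y: (l, w, h) LR-HSI;
    pan: (bw, bh) PAN image; s: (l,) spectral response (sums to one).
    """
    l, bw, bh = x_dip.shape
    # spectral term: ||d(x_dip) - y||_1, with d(.) as box-filter down-sampling
    d = x_dip.reshape(l, bw // beta, beta, bh // beta, beta).mean(axis=(2, 4))
    q_spectral = np.abs(d - y).mean()
    # spatial term: ||p_hat - p||_1, with p_hat from channel-wise
    # multiplication by s followed by channel averaging
    p_hat = (s[:, None, None] * x_dip).mean(axis=0)
    q_spatial = np.abs(p_hat - pan).mean()
    return q_spectral + lam * q_spatial
```

When x_dip down-samples exactly to y and its weighted channel average matches the PAN image, both terms vanish and the energy is zero.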
In the simplest case, the spectral response function can be approximated as the average across all spectral bands (i.e., s[i] = 1/l, ∀i ∈ [1, l]) [47], [48]. In this scenario, the spatial loss term in (5) enforces the average across all the spectral bands of the up-sampled HSI x_dip to be as close as possible to the PAN image p, thus assuming a flat (i.e., uniform) spectral response. However, in general, this assumption is not valid, as the spectral response varies with wavelength coverage, and different spectral bands describe the same semantic information across a wide spectral range with varying quality (i.e., PSNR) [49].
A recent attempt [49] estimates the spectral response function s by utilizing the larger eigenvalue of the structure tensor (ST) matrix (originally proposed in the Harris corner detection algorithm [50]). However, this method cannot be directly utilized in an end-to-end deep learning network due to the difficulties encountered while performing back-propagation. In addition, it is highly computationally complex, as it requires computing the derivatives of each band image along both the x- and y-directions at each iteration of learning as part of constructing the structure tensor matrix. Instead, we propose a computationally lightweight and learnable spectral response function which can be easily integrated into the spatial loss term in (5) and can be learned simultaneously with DIP.
In this part we describe our novel way of estimating the spectral response function, which is computationally lightweight, differentiable, and can be easily integrated into the existing DIP learning process. The overall computational procedure of estimating the spectral response function, and thereby evaluating the spatial energy that we introduced for the DIP process in (5), is graphically depicted in Figure 3. First, we assume that the spectral response is proportional to the ratio of information in each spectral band. The next problem arising with this assumption is how to quantify the information embedded in each spectral band. Motivated by the recently proposed Squeeze-and-Excitation networks, we utilize global average pooling to quantify the global information present in each band. Formally, a statistic q ∈ R^{1×l} which quantifies the informative features in each spectral band is generated by shrinking the up-sampled HSI x_dip through its spatial dimensions βw × βh, such that the i-th element of q is

q[i] = (1/(βw · βh)) Σ_{m=1}^{βw} Σ_{n=1}^{βh} x_dip[i, m, n].

Next, we use a simple gating mechanism to capture the dependencies among spectral bands using the band-wise descriptor q that we obtained in the previous step. We parameterize the spectral response function s by forming a bottleneck with two fully-connected (FC) layers around the non-linearity as follows:

s = Softmax(w_2 δ(w_1 qᵀ)),

where δ is the ReLU non-linearity, and w_1, w_2 are the learnable weight matrices. Here, the Softmax activation guarantees that the spectral responses of all the bands sum up to one.
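The GAP-bottleneck-Softmax procedure of Figure 3 can be sketched as a small PyTorch module. The module name and the reduction ratio r are hypothetical choices of ours, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SpectralResponse(nn.Module):
    """Learnable spectral response function: GAP -> FC bottleneck -> Softmax."""
    def __init__(self, l, r=4):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)          # q: global info per band
        self.fc = nn.Sequential(
            nn.Linear(l, l // r), nn.ReLU(inplace=True),
            nn.Linear(l // r, l))

    def forward(self, x_dip):                       # x_dip: (B, l, H, W)
        q = self.gap(x_dip).flatten(1)              # (B, l) band-wise descriptor
        s = torch.softmax(self.fc(q), dim=1)        # responses sum to one
        # estimated PAN: channel-wise weighting, then combination over bands
        p_hat = (s[:, :, None, None] * x_dip).sum(dim=1, keepdim=True)
        return p_hat, s
```

Because the module is built from standard differentiable layers, it can be optimized jointly with the DIP network parameters by back-propagation, avoiding the structure-tensor computation of [49].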
2) DIP network: Figure 4 illustrates the U-Net-like deep network that we used for the DIP method. The DIP network includes five down-sampling blocks d[i], five up-sampling blocks u[i], and five skip-connection blocks sk[i] (i = 1, 2, ..., 5). We use strided convolutions as the down-sampling operator, bilinear up-sampling as the up-sampling operator, and Lanczos2 as non-linearity. We initialize the input noise tensor with uniform noise between 0 and 0.1. Table I tabulates the values of all the hyperparameters of the DIP network.

B. Residual learning via over-complete HyperKite
Our motivation to design an over-complete network for the residual learning task emerged after observing the residual images between the DIP up-sampled image x_dip and the reference HSI x_ref, as visualized in Figure 5. As we can see from Figure 5, the residual images mainly consist of boundary information like edges and other high-frequency components. This observation motivated us to use an over-complete network for the residual learning task, which is highly capable of learning low-level features such as fine edges and structures by transforming the input image into a higher dimension.

Fig. 5: Residual images corresponding to different wavelength bands in the Pavia Center data set, which will be introduced in Section IV-A. We recommend that readers zoom in on this image to get a close-up view.

In order to accurately capture this fine information, we design an over-complete HyperKite for the residual learning as shown in Figure 6. The proposed HyperKite consists of an Initial Feature Extraction Network (IFEN), a High-Dimensional Feature Mapping Network (HDFMN), and a Final Residual Reconstruction Network (FRRN). The input to the HyperKite, x_in, is obtained by concatenating the up-sampled HSI x_dip and the PAN image p along the spectral dimension (denoted as [x_dip, p]). The HyperKite starts with the IFEN, where one 3 × 3 convolutional layer is applied followed by Batch Normalization (BN) and LeakyReLU non-linearity to extract the initial feature representation as

F_{D1} = f_IFEN(x_in),

where f_IFEN(·) denotes the 3 × 3 convolution followed by LeakyReLU and batch normalization, F_{D1} denotes the features extracted from x_in in the D_1 ∈ R^{n[0]×βw×βh} dimensional pixel-space, and n[0] is the number of filters in the convolutional layer. Figure 7 (a) shows six example feature maps of F_{D1} for the 20-th patch of the Pavia Center dataset that we will introduce in Section IV. As we can see from the figure, the initial feature extraction network f_IFEN(·) extracts low-level features of the input x_in. In order to capture the high-level features required for the residual learning, we successively transform the output of IFEN into three higher-dimensional pixel-spaces, denoted as D_2 ∈ R^{2βw×2βh}, D_4 ∈ R^{4βw×4βh}, and D_8 ∈ R^{8βw×8βh}, by utilizing "bilinear" upsampling. Then we perform 3 × 3 convolution followed by BN and LeakyReLU to extract meaningful high-level features at each higher-dimensional space as

F_{D2} = f_{D2}(↑F_{D1}),  F_{D4} = f_{D4}(↑F_{D2}),  F_{D8} = f_{D8}(↑F_{D4}),

where ↑ denotes the "bilinear" interpolation by a factor of 2, and f_{Dd}(·), d ∈ {2, 4, 8}, denotes the 3 × 3 convolution layer followed by BN and LeakyReLU at the d-th higher-dimensional feature space.
Next, we successively transform the extracted high-level features back to the original dimensional space D_1 by employing "bilinear" downsampling and skip connections. Formally, we can define the operations of HDFMN as

F'_{D4} = f'_{D4}(↓F_{D8} ⊕ F_{D4}),  F'_{D2} = f'_{D2}(↓F'_{D4} ⊕ F_{D2}),  F'_{D1} = f'_{D1}(↓F'_{D2} ⊕ F_{D1}),

where ↓ denotes the "bilinear" downsampling by a factor of 2, ⊕ denotes the feature concatenation operator, f'_{Dd}(·), d ∈ {1, 2, 4}, denotes the 3 × 3 convolution followed by BN and LeakyReLU at the d-th dimensional feature space, and F'_{Dd} is the set of most relevant high-level features obtained at the D_d space (see Table II). After flowing through all the downsampling layers (decoder blocks), a 3 × 3 convolutional layer is employed to recover the spectral dimension and reconstruct the residual image x_res as

x_res = f_FRRN(F'_{D1}),

where f_FRRN denotes the 3 × 3 convolutional layer followed by BN and LeakyReLU employed in the FRRN. After carrying out the DIP up-sampling and residual prediction of our DIP-HyperKite, the up-sampled HSI x_dip and the residual x_res are obtained. Finally, we can obtain the fused HSI x from x_dip and x_res as

x = x_dip + x_res.

To this end, we utilize the L_1 loss to optimize HyperKite, which has been demonstrated to be a superior choice for remote sensing image SR [5], [29] and is also experimentally verified to be effective for improving the fusion accuracy. Let x^(i) denote the fused HSI for the i-th training sample, x_ref^(i) the corresponding reference HSI, and L the total number of HSIs in the training set. The L_1 loss function utilized for HyperKite training can then be defined as

L_1 = (1/L) Σ_{i=1}^{L} ‖x^(i) − x_ref^(i)‖_1.

Moreover, all the parameter details of our proposed HyperKite are summarized in Table II. We train our network in the PyTorch framework using an NVIDIA Quadro 8000 GPU. We use the Adam optimizer with a learning rate of 0.001, weight decay of 0.0001, and momentum of 0.9 to train HyperKite. We use a batch size of 4 and train the network for 2500 epochs.
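Putting the IFEN, HDFMN, and FRRN stages together, a hedged PyTorch sketch of an over-complete HyperKite-style network might look as follows. The class name, channel widths, and LeakyReLU slope are our assumptions, not the exact architecture of Table II.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(cin, cout):
    """3x3 conv -> BN -> LeakyReLU, the basic unit used throughout."""
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                         nn.BatchNorm2d(cout), nn.LeakyReLU(0.1))

class HyperKiteSketch(nn.Module):
    """Sketch of the over-complete residual network: IFEN, then three x2
    bilinear up-samplings with convs, then three x2 down-samplings with
    skip concatenations, and a final conv recovering the l spectral bands."""
    def __init__(self, l, n0=32):
        super().__init__()
        self.ifen = conv_block(l + 1, n0)                 # input is [x_dip, p]
        self.enc = nn.ModuleList([conv_block(n0, n0) for _ in range(3)])
        self.dec = nn.ModuleList([conv_block(2 * n0, n0) for _ in range(3)])
        self.frrn = nn.Conv2d(n0, l, 3, padding=1)

    def forward(self, x_dip, pan):
        f = self.ifen(torch.cat([x_dip, pan], dim=1))
        feats = [f]
        for conv in self.enc:                             # over-complete path
            f = conv(F.interpolate(f, scale_factor=2, mode='bilinear'))
            feats.append(f)
        for conv, skip in zip(self.dec, reversed(feats[:-1])):
            f = F.interpolate(f, scale_factor=0.5, mode='bilinear')
            f = conv(torch.cat([f, skip], dim=1))         # skip concatenation
        return self.frrn(f)                               # residual x_res
```

The encoder path up-samples instead of down-sampling, so the deep layers keep a small receptive field and concentrate on the fine, high-frequency content of the residual.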

A. Datasets
To evaluate the performance of our proposed DIP-HyperKite for HS pansharpening, we conduct a series of experiments on three HS data sets, which are described in detail below.
1) Pavia Center dataset: The Pavia Center scene was captured by the ROSIS camera [51]. The original HSI consists of 115 spectral bands spanning from 430 to 960 nm. The spatial size of the original image is 1096 × 1096 pixels, where a single pixel is equivalent to a geometric resolution of 1.3 × 1.3 m². The thirteen noisy spectral bands in the original HSI were discarded, thus resulting in a HSI with 102 spectral bands spanning from 430 to 860 nm. In addition, a rectangular area of size 1096 × 381 pixels with no information at the center of the original HSI was also discarded, and the resulting "two-part" image with size of 1096 × 715 × 102 was used for the experiments. Following the same experimental procedure outlined in [24], we also used only the top-left corner of the HSI with size of 960 × 640 × 102, and partitioned it into 24 cubic patches of size 160 × 160 × 102 with no overlap, which constituted the reference images (x_ref) of the Pavia Center data set. In order to generate PAN images (p) and LR-HSIs (y) corresponding to each HR-HSI, we utilize Wald's protocol [52]. Following Wald's protocol, we generate PAN images (p) of size 160 × 160 by averaging the first 61 spectral bands of the HR reference HSI. In order to generate LR-HSIs of size 40 × 40 × 102, we spatially blurred the HR reference HSI with an 8 × 8 Gaussian filter, and then downsampled the result. The scaling factor (β) was set to 4 for the Pavia Center dataset. We randomly select 17 cubic patches for training, and the rest of the seven patches form the testing set of the Pavia Center dataset.
2) Botswana dataset: The Botswana scene was acquired by the Hyperion sensor on NASA's Earth Observing 1 (EO-1) satellite. The original Botswana HSI consists of 242 spectral bands spanning from 400 to 2500 nm with a spectral resolution of 10 nm. The spatial size of the original Botswana image is 1496 × 256 pixels. We remove the uncalibrated and noisy spectral bands in the original image, thus resulting in a HSI with 145 spectral bands. Following the same experimental procedure outlined in [24], we also use only the top-left corner of the HSI with size of 1200 × 240 × 145, and partitioned it into 20 cubic patches of size 120 × 120 × 145 with no overlap, which constitute the reference images x_ref of the Botswana dataset. In order to generate PAN images p and the LR-HSIs y corresponding to each HR reference image, we follow Wald's protocol. We generate PAN images p of size 120 × 120 by averaging the first 31 spectral bands of the HR-HSI. To generate the LR-HSIs y, we spatially blur the HR-HSI with an 8 × 8 window and perform down-sampling. For the Botswana dataset, we set the down-sampling factor β to 3. We randomly select 14 cubic patches for training, and the rest of the patches are utilized for testing.
3) Chikusei dataset [53]: The Chikusei scene was captured by the Headwall Hyperspec-VNIR-C imaging sensor over agricultural and urban areas in Chikusei, Japan. The original Chikusei HSI consists of 128 spectral bands spanning from 363 to 1018 nm. The spatial size of the Chikusei HSI is 2517 × 2335 pixels, where a single pixel is equivalent to a geometric resolution of 2.5 × 2.5 m². We used the top-left corner of the HSI with size of 2304 × 2304 × 128, and partitioned it into 81 cubic patches of size 256 × 256 × 128 with no overlap, which constituted the reference images x_ref of the Chikusei dataset. Following Wald's protocol, we generate PAN images of size 256 × 256 by averaging the first 65 spectral bands of the high-resolution HSI. To generate the LR-HSIs y, we spatially blur the HR-HSI with an 8 × 8 window and perform down-sampling. For the Chikusei dataset, we set the down-sampling factor β to 4. We randomly select 61 cubic patches for training, and the rest of the patches are utilized for testing.
Note: The standard deviation (σ) of the Gaussian filter that we use to generate LR-HSIs is calculated as σ = 0.4247β [54].
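As a concrete illustration of the degradation procedure above, the following NumPy sketch generates an LR-HSI and a simulated PAN image from a reference HSI under Wald's protocol. This is a minimal sketch under our own assumptions: the function names are ours, the separable Gaussian blur uses an 8 × 8 window with σ = 0.4247β as in the note above, and down-sampling is implemented as simple decimation by β.

```python
import numpy as np

def gaussian_kernel(size, sigma):
    # 1-D Gaussian kernel, normalized to sum to 1
    ax = np.arange(size) - (size - 1) / 2.0
    k = np.exp(-0.5 * (ax / sigma) ** 2)
    return k / k.sum()

def degrade_hsi(hr_hsi, beta, window=8):
    # hr_hsi: reference HSI of shape (H, W, L); returns the simulated LR-HSI
    sigma = 0.4247 * beta                  # std. dev. of the blur, per [54]
    k = gaussian_kernel(window, sigma)
    blurred = np.empty_like(hr_hsi, dtype=float)
    for l in range(hr_hsi.shape[2]):       # separable blur, band by band
        band = hr_hsi[:, :, l].astype(float)
        tmp = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, band)
        blurred[:, :, l] = np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, tmp)
    return blurred[::beta, ::beta, :]      # spatial down-sampling by factor beta

def simulate_pan(hr_hsi, n_bands):
    # PAN image as the average of the first n_bands spectral bands
    return hr_hsi[:, :, :n_bands].mean(axis=2)
```

For the Botswana dataset this would be called with `n_bands=31` and `beta=3`; for Chikusei, `n_bands=65` and `beta=4`.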

B. Performance measures
In order to evaluate the quality of the proposed pansharpening method, we use several image quality measures. Following [24], we use Cross-Correlation (CC), Spectral Angle Mapper (SAM), Root Mean Square Error (RMSE), Erreur Relative Globale Adimensionnelle de Synthèse (ERGAS), and Peak Signal-to-Noise Ratio (PSNR). These measures are widely used in the HSI processing community and are appropriate for evaluating fusion quality in both the spectral and spatial domains.

1) Cross-Correlation (CC):
The CC metric characterizes the geometric distortion and is defined as
$\mathrm{CC}(\mathbf{X}, \hat{\mathbf{X}}) = \frac{1}{L} \sum_{l=1}^{L} \mathrm{CCS}(\mathbf{X}_l, \hat{\mathbf{X}}_l)$,
where CCS denotes the cross-correlation for a single-band image pair $(\mathbf{A}, \mathbf{B})$:
$\mathrm{CCS}(\mathbf{A}, \mathbf{B}) = \frac{\sum_{j=1}^{n} (A_j - \mu_A)(B_j - \mu_B)}{\sqrt{\sum_{j=1}^{n} (A_j - \mu_A)^2 \, \sum_{j=1}^{n} (B_j - \mu_B)^2}}$,
where $\mu_A = \frac{1}{n} \sum_{j=1}^{n} A_j$ is the sample mean of $\mathbf{A}$. The ideal value of CC is 1.0, which indicates that the two HSIs are perfectly correlated.
2) SAM: SAM is a spectral measure which, for a pair of spectral vectors $\mathbf{a}, \mathbf{b} \in \mathbb{R}^{l}$, is defined as
$\mathrm{SAM}(\mathbf{a}, \mathbf{b}) = \arccos \left( \frac{\langle \mathbf{a}, \mathbf{b} \rangle}{\|\mathbf{a}\| \, \|\mathbf{b}\|} \right)$,
where $\langle \mathbf{a}, \mathbf{b} \rangle$ denotes the inner product between $\mathbf{a}$ and $\mathbf{b}$, and $\|\cdot\|$ is the $\ell_2$ norm; the overall score is obtained by averaging over all pixels. SAM measures the preservation of the spectral shape, and its ideal value is 0. 3) RSNR/RMSE: The reconstruction SNR (RSNR) and the root mean square error (RMSE) quantify the difference between the reference and fused images, defined as follows:
$\mathrm{RMSE}(\mathbf{X}, \hat{\mathbf{X}}) = \sqrt{\frac{1}{nL} \|\mathbf{X} - \hat{\mathbf{X}}\|_F^2}, \qquad \mathrm{RSNR}(\mathbf{X}, \hat{\mathbf{X}}) = 10 \log_{10} \frac{\|\mathbf{X}\|_F^2}{\|\mathbf{X} - \hat{\mathbf{X}}\|_F^2}$.

4) ERGAS:
Relative dimensionless global error in synthesis (ERGAS) measures the amount of spectral distortion in the image. The ERGAS measure is defined as
$\mathrm{ERGAS} = 100 \, d \, \sqrt{\frac{1}{L} \sum_{l=1}^{L} \left( \frac{\mathrm{RMSE}_l}{\mu_l} \right)^2}$,
where $\mathrm{RMSE}_l$ and $\mu_l$ denote the RMSE and the mean intensity of the $l$-th band, respectively, and $d$ is the ratio between the linear resolutions of the PAN image and the HSI, defined as
$d = \frac{\text{PAN linear spatial resolution}}{\text{HS linear spatial resolution}}$.
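The four measures above can be sketched in NumPy as follows. This is an illustrative implementation under our own conventions (SAM is reported in radians, and exact normalizations vary across published implementations):

```python
import numpy as np

def cc(ref, fused):
    # mean over bands of the single-band cross-correlation CCS
    vals = []
    for l in range(ref.shape[2]):
        a = ref[:, :, l].ravel() - ref[:, :, l].mean()
        b = fused[:, :, l].ravel() - fused[:, :, l].mean()
        vals.append((a @ b) / np.sqrt((a @ a) * (b @ b)))
    return float(np.mean(vals))

def sam(ref, fused, eps=1e-12):
    # mean spectral angle (radians) over all pixel vectors
    a = ref.reshape(-1, ref.shape[2])
    b = fused.reshape(-1, fused.shape[2])
    cos = (a * b).sum(1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + eps)
    return float(np.mean(np.arccos(np.clip(cos, -1.0, 1.0))))

def rmse(ref, fused):
    # root mean square error over all pixels and bands
    return float(np.sqrt(np.mean((ref - fused) ** 2)))

def ergas(ref, fused, d):
    # d = PAN linear spatial resolution / HS linear spatial resolution
    per_band = [np.sqrt(np.mean((ref[:, :, l] - fused[:, :, l]) ** 2)) / ref[:, :, l].mean()
                for l in range(ref.shape[2])]
    return float(100.0 * d * np.sqrt(np.mean(np.square(per_band))))
```

For identical reference and fused images, CC evaluates to 1 while SAM, RMSE, and ERGAS evaluate to (numerically) 0, matching the ideal values stated above.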

V. RESULTS AND DISCUSSION
This section presents the results of our proposed DIP-HyperKite for HS pansharpening and compares it with the state-of-the-art methods on the Pavia Center, Botswana, and Chikusei datasets. For clarity, we divide this section into two parts. In the first part (Section V-A), we highlight the contribution of our proposed spatial+spectral energy function to the DIP up-sampling process and compare it with state-of-the-art up-sampling techniques such as nearest-neighbor, bicubic, LapSRN, and DIP with only the spectral loss. In the second part (Section V-B), we present the final fusion results obtained from our proposed HyperKite network and compare them with classical and deep-learning-based pansharpening approaches.
A. Effect of the proposed spatial+spectral energy function for the DIP up-sampling process
As we discussed in Section III-A, recently proposed pansharpening methods such as DHP-DARN [24] and DHP [27] utilize the DIP process to up-sample the LR-HSI instead of the nearest-neighbor, bicubic, or LapSRN techniques, owing to its excellent performance. However, we have observed that the quality of the up-sampled HSI can be further improved by carefully redesigning the loss function used in the DIP optimization. Instead of utilizing only a spectral constraint in the DIP loss function, we derive a novel loss function with both spectral and spatial constraints. This section demonstrates the performance improvement brought by our proposed spatial+spectral loss function to the DIP up-sampling process. We compare DIP with the proposed spatial+spectral loss against DIP with the spectral loss only. Furthermore, to make the analysis more comprehensive, we also include conventional up-sampling techniques used in the HS pansharpening domain, such as nearest-neighbor and bicubic. Finally, motivated by the experimental discussion in [24], we also include the results from LapSRN [26], which is trained on a large amount of RGB images. 1) Tuning the hyperparameter λ in our spatial+spectral energy function: We start our discussion with the effect of the regularization constant λ in our proposed spatial+spectral loss function as defined in (5). The variation of the CC, SAM, RMSE, ERGAS, and PSNR values when the regularization parameter λ is varied from 0.0 to 1.0 for the Pavia Center, Botswana, and Chikusei datasets is shown in Figure 8, Figure 9, and Figure 10, respectively. As can be seen from these figures, as the value of the regularization constant λ increases, the performance metrics begin to improve, then hit a saturation point, and finally degrade, for all three datasets.
Therefore, we carefully select the regularization constant λ for each dataset by considering all the performance metrics. For example, consider the variation of the performance metrics with the regularization constant λ for the Pavia Center dataset, shown in Figure 8. As the value of the regularization constant increases from 0.0 to 0.8, CC, RMSE, ERGAS, and PSNR improve, and when λ increases beyond 0.8, the performance metrics start to degrade. Therefore, we set λ = 0.8 as the optimal value of the regularization constant of our proposed spatial+spectral energy term for the Pavia Center dataset. The variation of the performance metrics with the regularization parameter λ for the Botswana and Chikusei datasets is shown in Figure 9 and Figure 10, respectively. Following the same analysis described for the Pavia Center dataset, we select λ = 0.8 as the optimal value of the regularization constant for the Botswana and Chikusei datasets as well. Note the performance improvement brought by our proposed spatial+spectral loss function to the DIP up-sampling process. Under the optimal regularization constant (λ = 0.8), our spatial+spectral energy function improves the quality of the up-sampled HSIs over the spectral-only loss (equivalent to the λ = 0 point in Figures 8, 9, and 10) in terms of the CC, RMSE, RSNR, ERGAS, and PSNR metrics by 6.64%, 37.8%, 63.4%, 37.3%, and 19.5%, respectively, for the Pavia Center dataset. For the Botswana dataset, our proposed loss function improves the CC, SAM, RMSE, RSNR, ERGAS, and PSNR metrics over DIP with the spectral loss by 3.3%, 4.2%, 19.1%, 14.7%, 7.0%, and 5.2%, respectively. Similarly, for the Chikusei dataset, our method improves the CC, RMSE, RSNR, ERGAS, and PSNR metrics compared to DIP with the spectral loss by 1.8%, 26.3%, 22.9%, 26.2%, and 8.9%, respectively.
Discussion on the regularization constant λ: Let us first consider the case where the regularization constant λ is set to zero. This is equivalent to the case where we only have the spectral constraint. In this case, the DIP network minimizes the distance between the down-sampled version of the up-sampled HSI and the LR-HSI. Since the down-sampling operator acts as a low-pass filter in the frequency domain, what the DIP network actually minimizes is the distance between the low-pass version of the up-sampled HSI and the LR-HSI. For this reason, the HSI up-sampled by a DIP network trained only with the spectral constraint lacks high-frequency components such as edge information and fine structures. Now let us consider the case where we have both spatial and spectral constraints in the DIP loss function. As described in Section III-A, we combine the spatial and spectral constraints via the regularization parameter λ. The value of λ controls the fidelity of the predicted PAN image towards the actual PAN image. Since the predicted PAN image and the up-sampled HSI are coupled via the spectral response function, to make the predicted PAN image as close as possible to the actual PAN image, the DIP network tries to predict
some of the high-frequency components, such as edges and fine structures, in the PAN image, while keeping the low-pass version of the up-sampled HSI close to the LR-HSI. Therefore, what the regularization constant actually controls is the amount of high-frequency content fused from the PAN image into the up-sampled HSI. This explains the observation we made in Figure 8, Figure 9, and Figure 10: as the value of the regularization parameter increases, the DIP network embeds more of the high-frequency information into the up-sampled HSI, which ultimately improves the quality of the up-sampled image. However, when the value of the regularization constant is too large, the spatial loss term starts to dominate the loss function, resulting in a drop in spectral-domain performance metrics such as SAM and ERGAS. Therefore, we can obtain high-quality up-sampled HSIs by appropriately controlling the regularization parameter in the spatial+spectral energy function.
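The spatial+spectral energy described above can be sketched as follows. This is a minimal NumPy illustration under our own assumptions: simple decimation stands in for the actual blur-and-down-sample operator, and a fixed, normalized per-band weighting stands in for the learnable SRF of the paper; all names are ours.

```python
import numpy as np

def spatial_spectral_energy(x_up, y_lr, pan, srf_weights, beta, lam=0.8):
    # x_up: candidate up-sampled HSI, shape (H, W, L)
    # y_lr: LR-HSI, shape (H/beta, W/beta, L); pan: PAN image, shape (H, W)
    # srf_weights: per-band weights standing in for the learnable SRF
    # Spectral term: distance between the down-sampled x_up and the LR-HSI
    x_down = x_up[::beta, ::beta, :]          # decimation as the down-sampling operator
    spectral = np.sum((x_down - y_lr) ** 2)
    # Spatial term: L1 distance between the SRF-predicted PAN and the actual PAN
    w = srf_weights / srf_weights.sum()       # SRF weights normalized to sum to 1
    pan_pred = np.tensordot(x_up, w, axes=([2], [0]))
    spatial = np.sum(np.abs(pan_pred - pan))
    return spectral + lam * spatial           # lambda trades spatial vs. spectral fidelity
```

With λ = 0 the spatial term vanishes and only the spectral constraint remains, which is the spectral-only DIP baseline discussed above; increasing λ injects more PAN-derived high-frequency content into the up-sampled HSI.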
2) Comparison of DIP with the proposed spatial+spectral loss against state-of-the-art up-sampling techniques: In the previous section, we determined the optimal value of the regularization constant λ of our proposed spatial+spectral loss function for the three datasets. In this section, we compare DIP with our spatial+spectral loss against DIP with only the spectral loss and other commonly used up-sampling techniques, such as nearest-neighbor, bicubic, and LapSRN, both qualitatively and quantitatively. Table III summarizes the quantitative results of the nearest-neighbor, bicubic, LapSRN, and DIP up-sampling methods for the Pavia Center dataset. For this dataset, our proposed DIP method improves the quality of the up-sampled images in terms of the CC, RMSE, RSNR, ERGAS, and PSNR performance measures by 6.6%, 37.3%, 63.4%, 37.3%, and 19.5%, respectively. We also notice that this improvement is accompanied by a drop in the SAM index of around 1.8% compared to DIP with the spectral loss. This drop in the SAM index is not significant compared to the improvements achieved in all the other performance measures. Further, we can cross-verify these quantitative results with the qualitative results shown in Figure 11 for the Pavia Center dataset. We can see that the DIP up-sampled images with our proposed spatial+spectral constraint look much closer to the reference image and recover very fine structures and edges better than the other up-sampling methods.
We also summarize the quantitative results for the different up-sampling methods on the Botswana dataset in Table IV. As we can see, DIP with the proposed spatial+spectral loss improves the quality of the up-sampled images in terms of all the performance metrics by a significant margin: the CC value increases by 0.4%, the SAM value reduces by 4.3%, the RMSE value reduces by 14.0%, the RSNR value improves by 11.2%, the ERGAS value reduces by 1.7%, and the PSNR value increases by 5.2%. We can also verify these quantitative results with the qualitative results shown in Figure 12 for the Botswana dataset. Similar to the qualitative results observed for the Pavia Center dataset, the up-sampled images obtained using DIP with our proposed spatial+spectral loss are much closer to the reference HSI.
Finally, we summarize the results for the different up-sampling methods on the Chikusei dataset; the qualitative results are shown in Figure 13. From the qualitative results, we can see that DIP with our spatial+spectral constraint predicts very fine structures and edges more accurately than the other methods. In summary, we have shown that the DIP method with our proposed spatial+spectral constraints outperforms the state-of-the-art up-sampling methods by a significant margin on all the datasets considered in this study. In the next section, we present the final fusion results and compare them with state-of-the-art pansharpening algorithms, qualitatively and quantitatively.
a) Final Fusion results on the Pavia Center dataset: The average quantitative results for different pansharpening approaches on the testing set of the Pavia Center dataset are shown in Table VI. As can be seen from Table VI, our proposed HyperKite achieves the highest CC value compared to all the other pansharpening approaches considered in this study; a higher CC value indicates that the fusion result is geometrically closer to the reference. b) Final Fusion results on the Botswana dataset: On the Botswana dataset, our DIP-HyperKite outperforms all the other HS pansharpening approaches by a considerable margin. Concretely, our DIP-HyperKite improves the CC by 1.03% and the PSNR by 3.71%. In addition, our method reduces the SAM by 9.68%, the RMSE by 8.57%, and the ERGAS by 10.85%. Furthermore, we show qualitative results for different pansharpening approaches on the Botswana dataset in Figure 15. By observing the RGB images and error plots in Figure 15, we can see that the fusion results of our method are much closer to the reference image than those of the other pansharpening approaches. c) Final Fusion results on the Chikusei dataset: In this section, we compare the qualitative and quantitative results on the Chikusei dataset. The average quantitative results of different pansharpening approaches on the Chikusei dataset are listed in Table VIII. Similar to the results observed for the other two datasets, on this dataset as well our proposed DIP-HyperKite outperforms all the pansharpening approaches considered in the analysis. Our pansharpening method improves the CC, SAM, RMSE, RSNR, ERGAS, and PSNR performance measures over the state-of-the-art results by 1.45%, 19.0%, 6.67%, 0.67%, 18.5%, and 0.90%, respectively. To further highlight the fusion quality of our method, we present the qualitative results of selected pansharpening approaches for the Chikusei dataset in Figure 16.
By observing the RGB composite image and the error maps, we can clearly see that the fusion quality of the proposed DIP-HyperKite is higher than that of the other pansharpening approaches.
VI. CONCLUSION
In this paper, we have presented a novel approach for HS pansharpening, which consists of three main steps: (1) up-sampling the LR-HSI via DIP, (2) predicting the residual image via the over-complete HyperKite, and (3) obtaining the final fused HSI by summation. Previously proposed DIP methods for HS up-sampling only impose a constraint in the spectral domain by utilizing the LR-HSI. To better preserve both spatial and spectral information, we introduced an additional spatial constraint to DIP by utilizing the available PAN image, thereby imposing both spatial and spectral constraints on the DIP. The comprehensive experiments conducted on three HS datasets showed that our proposed spatial+spectral loss function significantly improves the quality of the up-sampled HSIs in terms of the CC, RMSE, RSNR, SAM, ERGAS, and PSNR performance measures. Next, for the residual prediction task, we showed that the residual component between the up-sampled HSI and the reference HSI primarily consists of edge information and very fine structures. Motivated by this observation, we proposed a novel over-complete deep-learning network for the residual prediction task. In contrast to conventional under-complete representations, we showed that our over-complete network is able to focus on high-level features such as edges and fine structures by constraining the receptive field of the network. Finally, the fused HSI is obtained by adding the residual HSI to the up-sampled HSI. The comprehensive experiments conducted on three HS datasets demonstrated the superiority of our DIP-HyperKite over the other state-of-the-art methods in terms of both qualitative and quantitative evaluations.