Superpixel Spectral Unmixing for Hyperspectral Image Superresolution Using a Coupled Encoder-Decoder Network

In this paper, we propose a novel hyperspectral image superresolution method based on superpixel spectral unmixing using a coupled encoder-decoder network. A low-resolution hyperspectral image and a high-resolution multispectral image are fused to generate a high-resolution hyperspectral image through a spectral unmixing framework with a low-rank constraint. Specifically, the endmember and abundance information is extracted via a coupled encoder-decoder network that integrates the priors for unmixing. The coupled network consists of two encoders and one shared decoder, where the spectral information is preserved through the shared decoder. The multispectral image is clustered into superpixels to exploit self-similarity, and the superpixels are then unmixed to obtain the abundance matrix. By imposing a low-rank constraint on the abundance matrix, we further improve the superresolution performance. Experiments on the CAVE and Harvard datasets indicate that our method outperforms the compared methods in terms of both quantitative evaluation and visual quality.


Introduction
With rich spectral and spatial information, hyperspectral images (HSI) have received extensive attention and have been widely used in many fields, especially for remote sensing [1,2] and medical imaging [3]. High resolution is essential to obtain high performance in many HSI applications, such as spectral unmixing, pixelwise classification, object detection, and object tracking. However, due to hardware limitations, hyperspectral imaging requires long exposure times to ensure a high signal-to-noise ratio, which leads to low spatial resolution [4]. Image superresolution is a promising way to acquire high-resolution images in both spatial and spectral domains.
In recent decades, several methods have been proposed for HSI superresolution that fuse an observed low-resolution HSI with the corresponding high-resolution multispectral image (MSI). Some of these methods are based on sparse representation theory, which is widely used in computer vision. Akhtar et al. [4] imposed nonnegative constraints on the spectral dictionary and the sparse representations to improve HSI superresolution performance. Dong et al. [5] utilized a structural sparsity constraint to exploit clustering-based sparsity. The superpixel-based sparse representation model [6] learns sparse codes on MSI superpixels to exploit the spatial self-similarity of the MSI. Han et al. [7] combined nonnegative sparse representation, local similarity, and nonlocal similarity for HSI superresolution, exploring self-similarity both within superpixels and across the entire image. However, when learning the spectral dictionary, these methods do not take the spectral correlation in the HSI into account, which causes severe spectral distortion in the reconstructed images.
In order to overcome this problem, spectral unmixing-based superresolution methods were proposed. Yokoya et al. [8] exploited the nonnegativity property while unmixing the HSI and MSI. Wycoff et al. [9] further introduced a sparsity constraint when extracting the endmember and abundance matrices. The HSI and MSI were jointly unmixed with sparsity, nonnegativity, and abundance sum-to-one constraints in [10]. Zou and Xia [11] explored the spatial structure information in the spectra and abundances via mutual-distance and graph Laplacian regularization, respectively. These methods showed excellent superresolution performance. However, spectral unmixing-based methods generally assume that the downsampling operation from the high-resolution to the low-resolution image is known, and this assumption is not always valid in practice. Inspired by the great success of deep learning-based methods in the superresolution of colour images, researchers have attempted to apply deep learning to HSI superresolution. These deep learning-based methods mainly learn the mapping from low-resolution to high-resolution images in a supervised way. Typical models include the autoencoder, 3D convolutional neural networks [12,13], and deep residual networks [14,15]. These methods assume that a consistent mapping exists across different image pairs; this assumption does not always hold, which causes severe spectral distortion.
In this paper, the local similarity in high-resolution images is analysed, and the low-rank property of HSI is explored. We then propose a superresolution algorithm based on superpixel spectral unmixing. A coupled encoder-decoder network with a sparsity constraint is employed to extract the endmember and abundance information, which naturally integrates meaningful constraints to enhance the unmixing quality. As shown in Figure 1, the shared decoder preserves the endmember information of the hyperspectral image. To obtain accurate high-resolution abundance information, we cluster the high-resolution image into superpixels, which have adaptive shapes and sizes, and impose a low-rank constraint on the abundance matrix. Finally, we combine the endmember matrix and the abundance matrix to reconstruct the high-resolution HSI. The combination of the coupled encoder-decoder network and the low-rank constraint leads to accurate endmember and abundance information, so promising HSI superresolution performance is obtained.

Spectral Unmixing-Based HSI Superresolution
The goal of HSI superresolution is to predict a high-resolution HSI X ∈ ℝ^(W×H×B), where W and H denote the image width and height in the spatial dimensions and B indicates the spectral dimension. The spectral unmixing-based method takes a low-resolution HSI Z ∈ ℝ^(w×h×B) and a high-resolution MSI Y ∈ ℝ^(W×H×b) as input. For notational convenience, we represent the three-dimensional images in matrix form, giving X ∈ ℝ^(B×N), Y ∈ ℝ^(b×N), and Z ∈ ℝ^(B×n), where N = W × H and n = w × h. The observations are related to the desired HSI X as follows:

Z = XD, Y = RX, (1)

where the matrix D ∈ ℝ^(N×n) models the spatial blurring and downsampling operators and R ∈ ℝ^(b×B) is the spectral degradation matrix determined by the multispectral sensor. In this paper, we assume that the spectral response function is known. Naively estimating the desired HSI from (1) is computationally complex and inaccurate, so spectral unmixing is used to handle the problem efficiently. According to the linear spectral mixing model [16], we have X = EA, where E = [e_1, e_2, ⋯, e_c] is the endmember matrix, c represents the number of spectral bases, and A = [a_1, a_2, ⋯, a_N] is the abundance matrix. In the linear mixture model, each endmember represents a reflectance spectrum, and the abundances are the proportions of the endmembers, so they should be nonnegative. In addition, each abundance column should meet the sum-to-one (STO) requirement, i.e., 1 = 1^T a_i, i = 1, 2, ⋯, N. Since only a few endmembers are present in any single pixel, the abundances should also be sparse. Therefore, we can obtain the endmembers and abundances from Z by solving

min_{E,A} ‖Z − EA‖_F^2 + λ‖A‖_0, s.t. A ≥ 0, 1^T a_i = 1, (2)

Figure 1: The schematic diagram of the proposed approach: (1) cluster the high-resolution MSI into superpixels; (2) unmix the low-resolution HSI and the superpixels in the MSI to obtain the endmember and abundance matrices with useful constraints via a coupled network.

Journal of Sensors
where ‖A‖_0 denotes the sparsity constraint.
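The observation model in (1) and the linear mixing model X = EA can be sketched in NumPy as follows. All sizes, matrices, and the ×2 downsampling factor are illustrative choices for this sketch (the paper's experiments use a ×32 factor and a real camera response).

```python
import numpy as np

# Hypothetical sizes for illustration (not the paper's).
B, b = 31, 3          # hyperspectral / multispectral band counts
W = H = 8             # high-resolution spatial size, N = W*H
w = h = 4             # low-resolution spatial size, n = w*h
c = 5                 # number of endmembers
N, n = W * H, w * h

rng = np.random.default_rng(0)

# Linear mixing model: X = E A with nonnegative, sum-to-one abundances.
E = rng.random((B, c))                     # endmember matrix
A = rng.random((c, N))
A /= A.sum(axis=0, keepdims=True)          # enforce sum-to-one per pixel
X = E @ A                                  # high-resolution HSI (B x N)

# Spatial degradation D (N x n): here, averaging over 2x2 disjoint blocks.
D = np.zeros((N, n))
for j in range(n):
    jy, jx = divmod(j, w)
    for dy in range(2):
        for dx in range(2):
            D[(2 * jy + dy) * W + (2 * jx + dx), j] = 0.25
Z = X @ D                                  # low-resolution HSI (B x n)

# Spectral degradation R (b x B): a toy band-averaging response.
R = np.zeros((b, B))
for i in range(b):
    R[i, i * (B // b):(i + 1) * (B // b)] = 1.0 / (B // b)
Y = R @ X                                  # high-resolution MSI (b x N)
```

Each column of D sums to one (a block average), and each row of R integrates a group of hyperspectral bands into one multispectral band.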

Method
This section describes the proposed HSI superresolution method in detail. The proposed method is based on superpixel spectral unmixing and a coupled encoder-decoder network. As shown in Figure 1, the endmembers and abundances are extracted using the coupled encoder-decoder network. To take advantage of the self-similarity in the high-resolution image, the network extracts the abundances from the superpixels of the high-resolution image.

MSI Superpixel Segmentation and Low-Rank Constraint.
The square window cannot effectively represent the complex structures of an image, since the spatial information in natural images is usually irregular. Therefore, the traditional pixel-based and patch-based unmixing methods cannot fully utilize the spatial information of the image. High-resolution MSIs are locally self-similar; that is, adjacent pixels in the MSI have similar spectral responses. Therefore, we cluster the MSI into superpixels to extract the abundance information. The superpixel was initially proposed by Ren and Malik [17] and groups pixels based on spectral response and other properties. Following the evaluation of 28 state-of-the-art superpixel segmentation algorithms in [18], we adopt the efficient topology-preserving segmentation (ETPS) algorithm [19] to obtain superpixels from the MSI. ETPS formulates the segmentation problem with an objective function similar to k-means clustering, making the obtained superpixels coherent in appearance and regular in shape. The objective function is a Markov energy function, defined as follows:

E(s, c, μ) = Σ_p E_col(s_p, c_{s_p}) + Σ_p E_pos(s_p, μ_{s_p}) + Σ_p Σ_{q∈N_8(p)} E_b(s_p, s_q) + E_topo(s) + E_size(s), (3)

where c = (c_1, c_2, ⋯, c_M) and μ = (μ_1, μ_2, ⋯, μ_M) are the sets of centres and mean positions of all superpixels, and N_8 denotes the eight-neighbourhood of pixel p. E_col(s_p, c_{s_p}) encourages colour homogeneity within each superpixel, E_pos(s_p, μ_{s_p}) encourages superpixels to be regular in shape, E_b(s_p, s_q) encourages superpixels to have short boundaries, and E_topo(s) and E_size(s) promote topological connectivity and appropriate superpixel size, respectively. Since the pixels within a superpixel have similar spectral responses, the matrix formed by reshaping a superpixel has low rank. Therefore, when a hyperspectral image is unmixed over superpixels, a low-rank constraint can be imposed. Moreover, the subspace in which the high-resolution HSI resides is the same as the space spanned by the endmember matrix.
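The ETPS objective resembles k-means clustering on colour and position. As a rough, dependency-free stand-in for such superpixel clustering (not the ETPS implementation, which additionally enforces topology and boundary terms), the colour-plus-position grouping can be sketched as follows; the function name, position weight, and sizes are illustrative.

```python
import numpy as np

def superpixels_kmeans(img, n_segments=4, pos_weight=0.5, iters=10, seed=0):
    """Cluster pixels on (colour, position) features: a rough stand-in for
    the k-means-like objective ETPS uses (colour homogeneity + regular shape)."""
    H, W, C = img.shape
    yy, xx = np.mgrid[0:H, 0:W]
    feats = np.concatenate(
        [img.reshape(-1, C),
         pos_weight * np.stack([yy / H, xx / W], axis=-1).reshape(-1, 2)],
        axis=1)
    rng = np.random.default_rng(seed)
    centres = feats[rng.choice(len(feats), n_segments, replace=False)]
    for _ in range(iters):
        # Assign each pixel to its nearest centre in feature space.
        labels = np.argmin(((feats[:, None] - centres) ** 2).sum(-1), axis=1)
        # Move each centre to the mean of its assigned pixels.
        for k in range(n_segments):
            if np.any(labels == k):
                centres[k] = feats[labels == k].mean(axis=0)
    return labels.reshape(H, W)
```

The position term keeps clusters spatially compact, mirroring the role of E_pos in the energy above; omitting it would yield purely spectral clusters of arbitrary shape.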
Based on this fact, we impose the low-rank constraint on the abundance matrix rather than on the high-resolution HSI. The optimization model of the coupled network is defined as

min_{E, A_h, A_m} ‖Z − EA_h‖_F^2 + ‖Y − REA_m‖_F^2 + λ_1(‖A_h‖_0 + ‖A_m‖_0) + λ_2 Σ_s ‖A_m^(s)‖_*, (4)

with the nonnegative and STO constraints, where A_h and A_m are the abundance matrices of the HSI and MSI, respectively, and A_m^(s) collects the abundance columns of the s-th superpixel. Here, ‖·‖_* denotes the matrix nuclear norm, which approximately replaces the low-rank constraint.
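The role of the nuclear norm as a low-rank surrogate can be illustrated numerically: abundance columns drawn from one superpixel are nearly collinear, so their nuclear norm is close to their Frobenius norm, whereas unstructured columns are not. The sizes and data below are synthetic.

```python
import numpy as np

def nuclear_norm(M):
    # Nuclear norm: sum of singular values, a convex surrogate for rank.
    return float(np.linalg.svd(M, compute_uv=False).sum())

rng = np.random.default_rng(1)

# Near-rank-1 block: 20 abundance columns sharing one pattern (a superpixel).
base = rng.random((5, 1))
A_sp = base @ np.ones((1, 20)) + 0.01 * rng.random((5, 20))

# Unstructured block: 20 independent abundance columns.
A_rand = rng.random((5, 20))

# The nuclear/Frobenius ratio is ~1 for (near) rank-1 matrices, larger otherwise,
# so penalizing the nuclear norm pushes superpixel blocks toward low rank.
ratio_sp = nuclear_norm(A_sp) / np.linalg.norm(A_sp)
ratio_rand = nuclear_norm(A_rand) / np.linalg.norm(A_rand)
```

Minimizing Σ_s ‖A_m^(s)‖_* therefore encourages exactly the per-superpixel spectral coherence that motivates the segmentation.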

Coupled Encoder-Decoder Network.
Motivated by the work in [20], we propose the coupled encoder-decoder network to solve the HSI superresolution problem described in (4). As shown in Figure 2, the basic structure of the coupled network is an asymmetric stacked autoencoder network. During the reconstruction of the HSI and MSI, the endmember and abundance information in the image scene is extracted. The autoencoder network consists of an encoder and a decoder. The encoder E_h(θ_he) maps the hyperspectral data to the low-dimensional representation layer, and the decoder D_h(θ_d) reconstructs the hyperspectral data from the representation layer:

A_h = E_h(Z) = f_k(W_he^k ⋯ f_1(W_he^1 Z)), Z̃ = D_h(A_h) = W_d^k ⋯ W_d^1 A_h, (5)

where W_he^k and W_d^k denote the weights in the k-th layer of the encoder and decoder, respectively, and f_k denotes the activation function in the k-th layer.
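A minimal NumPy sketch of this encoder-decoder structure follows, with illustrative layer sizes. ReLU keeps the hidden layer nonnegative; an explicit normalization stands in for the sum-to-one property here (the paper enforces it via a stick-breaking construction instead), and the toy weights are kept nonnegative for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)
B, c, n = 31, 10, 50                   # HSI bands, endmembers, pixels

def relu(x):
    return np.maximum(x, 0.0)

# Hypothetical 2-layer encoder weights (nonnegative for this toy example).
W_he = [np.abs(rng.standard_normal((16, B))) * 0.1,
        np.abs(rng.standard_normal((c, 16))) * 0.1]
# Linear, bias-free decoder: its weight plays the role of the endmember matrix E.
E = np.abs(rng.standard_normal((B, c)))

def encode(Z, weights):
    h = Z
    for W_k in weights:
        h = relu(W_k @ h)              # nonnegative hidden activations
    return h / h.sum(axis=0, keepdims=True)   # sum-to-one abundances

Z = np.abs(rng.standard_normal((B, n)))       # toy low-resolution HSI
A_h = encode(Z, W_he)                  # hidden layer = abundance matrix
Z_hat = E @ A_h                        # linear decoder: reconstruction of Z
```

In the full method, the MSI branch reuses this decoder through E_m = RE with the decoder weights frozen, so both branches share one set of endmembers.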
Note that when extracting the endmembers from the HSI through the network, the hidden layer A_h should reflect the abundance information, which means A_h needs to satisfy the nonnegativity and sum-to-one properties. The ReLU function is used as the activation function in the encoder to ensure that the hidden layer is nonnegative. Following [21], we make the hidden-layer variables obey a Dirichlet distribution, which is generated using the stick-breaking process.
Figure 2: The details of the asymmetric stacked autoencoder network for HSI.

Let α = (α_1, ⋯, α_i, ⋯, α_n)^T represent the hidden-layer variable; then α_i is defined as follows:

α_i = v_i ∏_{j<i} (1 − v_j), (6)

where v_i is sampled from the inverse transform of the Kumaraswamy distribution, and u and β are obtained from the encoder-decoder network. The bias parameter is not used in the decoder, and the identity function is used as its activation function, so the decoder parameters satisfy θ_d = W_d1 W_d2 ⋯ W_dk. The weights of the decoder correspond to the endmembers. In this way, the deep network can extract the spectral information from the HSI, and the hidden layer effectively preserves the spatial information. Similarly, the high-resolution abundance information can be extracted during the reconstruction of the high-resolution MSI.
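The stick-breaking process described above can be sketched as follows. `kumaraswamy_icdf` is the standard inverse CDF of the Kumaraswamy distribution; its parameter names `a` and `b` are generic (not the paper's network outputs u and β), and the shape values used below are illustrative.

```python
import numpy as np

def kumaraswamy_icdf(u, a, b):
    # Inverse CDF of Kumaraswamy(a, b): F(x) = 1 - (1 - x**a)**b on (0, 1).
    return (1.0 - (1.0 - u) ** (1.0 / b)) ** (1.0 / a)

def stick_breaking(v):
    """Turn fractions v_1..v_{n-1} in (0, 1) into n weights that are
    nonnegative and sum exactly to one (the last weight takes the rest)."""
    v = np.asarray(v, dtype=float)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - v)])
    return np.append(v * remaining[:-1], remaining[-1])

rng = np.random.default_rng(0)
u = rng.uniform(size=9)
v = kumaraswamy_icdf(u, a=1.0, b=3.0)   # illustrative shape parameters
alpha = stick_breaking(v)               # 10 weights, one per endmember
```

Because each α_i is a fraction of the stick left over by its predecessors, the construction guarantees the nonnegativity and sum-to-one properties without any explicit projection step.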
Due to the limited number of MSI bands, unmixing the MSI alone is not ideal for generating the endmember and abundance information. Since the HSI and MSI reflect ground-object information in the same scene, the endmember information of the MSI and HSI is correlated, i.e., E_m = RE_h. Thus, the decoder weights are shared between the HSI and MSI branches, and they are frozen while learning the autoencoder parameters for the MSI.

Network Structure and Implementation Detail.
The L_0 regularization in (4) is nonconvex and is usually replaced with L_1 regularization. However, the abundance vectors must satisfy the STO constraint, so the commonly used L_1 regularization cannot guarantee that the abundance vectors are sparse, since ‖a_i‖_1 = 1 for every pixel. Similar to [21], we introduce the generalized Shannon entropy function as the sparsity constraint on the hidden layer. The function is defined as follows:

H_p(a) = −Σ_j (|a_j|^p / ‖a‖_p^p) log(|a_j|^p / ‖a‖_p^p), (7)

Because the latent variables of the representation layer are nonnegative, we choose p = 1, which is computationally efficient. Substituting (7) into (4), the optimization problem can be rewritten as

min_{E, A_h, A_m} ‖Z − EA_h‖_F^2 + ‖Y − REA_m‖_F^2 + λ_1(H_1(A_h) + H_1(A_m)) + λ_2 Σ_s ‖A_m^(s)‖_*, (8)

with the nonnegative and STO constraints. In this paper, we solve the optimization problem via the coupled deep encoder-decoder network. Following the general practice of deep networks, the L_2 norm is applied to the decoder weights to prevent overfitting. The objective functions of the coupled network are expressed as follows:

L_h(θ_he, θ_d) = ‖Z − Z̃‖_F^2 + λ_1 H_1(A_h) + μ‖θ_d‖_2^2, (9)

L_m(θ_me) = ‖Y − Ỹ‖_F^2 + λ_1 H_1(A_m) + λ_2 Σ_s ‖A_m^(s)‖_* + μ‖θ_d‖_2^2, (10)

where λ_1, λ_2, and μ balance the reconstruction error against the sparsity constraint, the locally low-rank constraint, and the weight loss, respectively; Z̃ and Ỹ are the reconstructed images, and A_h and A_m are the abundance matrices of the HSI and MSI, respectively. The coupled network consists of two sparse autoencoder networks that extract endmember and abundance information from the HSI and MSI, respectively. The network is optimized as follows: first, the HSI autoencoder extracts the endmember information by minimizing the objective in (9); then, the learned decoder weights θ_d are frozen and shared with the decoder of the MSI autoencoder, and only the encoder weights θ_me of the MSI autoencoder are updated according to the objective in (10).
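The generalized Shannon entropy in (7) can be sketched as below for p = 1. Unlike the L_1 norm, it distinguishes a peaked abundance vector from a flat one even when both sum to one; the epsilon guard against log(0) is an implementation detail of this sketch, not from the paper.

```python
import numpy as np

def entropy_sparsity(a, p=1, eps=1e-12):
    # Generalized Shannon entropy: low values mean the mass concentrates
    # on few entries, i.e. a sparse abundance vector.
    w = np.abs(np.asarray(a, dtype=float)) ** p
    w = w / max(w.sum(), eps)
    w = np.clip(w, eps, None)
    return float(-(w * np.log(w)).sum())

sparse_a = np.array([0.9, 0.05, 0.05, 0.0])   # both vectors satisfy STO,
dense_a = np.array([0.25, 0.25, 0.25, 0.25])  # so their L1 norms are equal
```

Here `entropy_sparsity(sparse_a)` is much smaller than `entropy_sparsity(dense_a)`, even though both vectors have unit L_1 norm, which is exactly why the entropy works as a sparsity penalty under the STO constraint.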
In the experiments, the number of endmembers is set to ten; that is, the hidden layer of the autoencoder network has ten nodes. The details of the coupled network are shown in Table 1. The autoencoder network for the HSI includes three hidden layers, whereas the autoencoder network for the MSI includes four hidden layers, with the number of nodes per layer increasing from five to ten. Since different HSIs describe different scenes, the ground materials may change considerably between scenes, so we extract the spectral and spatial information from each HSI and its corresponding MSI separately.

Experimental Results
Two commonly used hyperspectral datasets, CAVE [22] and Harvard [23], are employed to evaluate the proposed method. The details of the two datasets are shown in Table 2. We crop the top-left 1024 × 1024 pixels of the Harvard images in our experiments. For both benchmark datasets, the original images are scaled to [0, 1] and used as ground truth. To simulate the low-resolution HSI, we downsample the HSI by averaging over 32 × 32 disjoint blocks. The MSIs are synthesized by integrating the raw data along the spectral dimension using the spectral response of the Nikon D700 camera. To evaluate the accuracy of the estimated high-resolution HSI, four widely used quality metrics are employed: root mean square error (RMSE), peak signal-to-noise ratio (PSNR), spectral angle mapper (SAM), and relative dimensionless global error in synthesis (ERGAS). For PSNR, a larger value indicates better performance, while smaller values indicate better performance for the other three metrics.
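Common definitions of the four metrics can be sketched as follows, for images stored as (bands, pixels) arrays scaled to [0, 1]. ERGAS conventions vary in the literature, so this follows one common form with the spatial downsampling ratio as a parameter; these are standard definitions, not the paper's exact implementation.

```python
import numpy as np

def rmse(x, y):
    return float(np.sqrt(np.mean((x - y) ** 2)))

def psnr(x, y, peak=1.0):
    # Peak signal-to-noise ratio in dB for images scaled to [0, peak].
    return float(10 * np.log10(peak ** 2 / np.mean((x - y) ** 2)))

def sam(x, y, eps=1e-12):
    # Mean spectral angle (degrees) between per-pixel spectra; x, y: (bands, pixels).
    num = (x * y).sum(axis=0)
    den = np.linalg.norm(x, axis=0) * np.linalg.norm(y, axis=0) + eps
    ang = np.arccos(np.clip(num / den, -1.0, 1.0))
    return float(np.degrees(ang).mean())

def ergas(x, y, ratio=32, eps=1e-12):
    # Relative dimensionless global error; ratio = spatial downsampling factor.
    band_rmse = np.sqrt(np.mean((x - y) ** 2, axis=1))
    band_mean = np.abs(y.mean(axis=1)) + eps
    return float(100.0 / ratio * np.sqrt(np.mean((band_rmse / band_mean) ** 2)))
```

RMSE and PSNR measure overall reconstruction fidelity, SAM isolates spectral distortion independently of brightness, and ERGAS aggregates per-band relative errors, which is why the four are reported together.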

Comparison with State-of-the-Art Methods.
The proposed method is compared with existing HSI superresolution methods, including CNMF [8], SNNMF [9], GSOMP+ [4], BSR [24], CSU [10], NSSR [5], and uSDN [21]. In the experiments, we use the original code provided by the authors, except for uSDN. Among these methods, CNMF, SNNMF, CSU, and NSSR require the downsampling function as a prior, which is generally unknown in practice. Thus, the downsampling operation in the original implementations is replaced by a bicubic function, which differs from the operation used to simulate the low-resolution HSI. Table 3 shows the RMSE and SAM for some examples from the CAVE and Harvard datasets, and Table 4 shows the average RMSE, PSNR, SAM, and ERGAS results. Tables 3 and 4 show that our method outperforms all the other methods and achieves a significant improvement on the CAVE dataset. The unmixing-based methods usually have better SAM scores. However, CNMF [8] and SNNMF [9] are not competitive on the CAVE dataset in terms of RMSE because they do not exploit constraints on the endmember and abundance matrices. The sparse representation-based approaches usually have better reconstruction errors; i.e., GSOMP+ [4], BSR [24], and NSSR [5] achieve better RMSE. However, the first two have larger SAM scores because the spectral correlation in the HSI is not considered. uSDN [21] extracts accurate spectral information and spatial representations via a sparse Dirichlet network and minimizes the angle similarity, which leads to better RMSE and smaller SAM. Our method further exploits the spatial structure information and the local low-rank property to estimate the abundance matrix, achieving a better reconstruction than the other methods.

The Effects of Parameters.
There are three parameters in the proposed method: λ_1, λ_2, and μ. Here, λ_1 and λ_2 balance the reconstruction error against the sparsity constraint and the low-rank constraint, respectively, and μ weights the decoder weight loss. In the experiments, μ is fixed. To select the optimal parameters, we evaluate the HSI recovery performance for different λ_1 and λ_2, with λ_1 ∈ [0.1, 0.5, 1, 1.5, 2] × 10^−6 and λ_2 ∈ [0.5, 1, 1.5, 2, 2.5] × 10^−1. We first evaluate the RMSE by adjusting λ_1 while fixing the low-rank constraint weight λ_2; in a similar way, we then choose the weight parameter λ_2. Figure 3 shows how the RMSE changes with the parameters. Accordingly, the optimal values λ_1 = 1 × 10^−6 and λ_2 = 1.5 × 10^−1 are selected and used in our experiments. Figures 4 and 5 show the recovered HSIs and the absolute difference images at wavelengths of 460 nm, 540 nm, and 670 nm for "cloth" from CAVE and "imgb8" from Harvard, respectively. The recovered images from BSR [24], CSU [10], NSSR [5], uSDN [21], and our method are compared according to their performance in Table 4. The results show that the absolute differences of our method (Figures 4(f) and 5(f)) are smaller than those of the other methods, which indicates that the image recovered by the proposed method is more similar to the original image. Compared with uSDN [21] (Figures 4(e) and 5(e)), our method adopts superpixel segmentation and the low-rank constraint; the smaller differences indicate better superresolution performance and confirm the effectiveness of our improvements.

Conclusion
In this paper, we propose an HSI superresolution method based on superpixel spectral unmixing with a deep network. The method combines spectral unmixing with superpixels to simultaneously exploit the spectral and spatial information in the image. Due to the powerful representation capability of deep networks, more accurate spectral information is preserved. Experiments on two widely used HSI datasets show that our method achieves superior performance compared with the other methods; the average RMSE and SAM on the CAVE dataset reach 3.69 and 6.53, respectively. In future work, we will explore the spectral correlation in HSI for superresolution.