Article

Hyperspectral Image Super-Resolution Based on Feature Diversity Extraction

1
State Key Laboratory of Integrated Service Network, Xidian University, Xi’an 710071, China
2
School of Telecommunication Engineering, Xidian University, Xi’an 710071, China
3
Guangzhou Institute of Technology, Xidian University, Guangzhou 510700, China
4
Hangzhou Institute of Technology, Xidian University, Hangzhou 311231, China
5
School of Space Information, Space Engineering University, Beijing 101416, China
6
System Engineering Research Institute of CSSC, Beijing 100070, China
*
Author to whom correspondence should be addressed.
Remote Sens. 2024, 16(3), 436; https://doi.org/10.3390/rs16030436
Submission received: 21 November 2023 / Revised: 17 January 2024 / Accepted: 18 January 2024 / Published: 23 January 2024
(This article belongs to the Section Remote Sensing Image Processing)

Abstract

Deep learning is an important research topic in the field of image super-resolution. However, the performance of existing hyperspectral image super-resolution networks is limited by their feature learning, and current algorithms struggle to extract diverse features. In this paper, we address these feature learning limitations and introduce the Channel-Attention-Based Spatial–Spectral Feature Extraction Network (CSSFENet) to enhance hyperspectral image feature diversity and to improve the network loss functions. Our contributions are as follows: (a) a convolutional neural network super-resolution algorithm incorporating diverse feature extraction, which strengthens the network’s diversity feature learning by elevating the matrix rank; (b) a three-dimensional (3D) feature extraction convolution module, the Channel-Attention-Based Spatial–Spectral Feature Extraction Module (CSSFEM), which boosts performance in both the spatial and spectral domains; (c) a feature diversity loss function designed around the singular values of the image matrix to maximize element independence; and (d) a spatial–spectral gradient loss function based on spatial and spectral gradient values to enhance the spatial–spectral smoothness of the reconstructed image. Compared with existing hyperspectral super-resolution algorithms under four evaluation indexes (PSNR, MPSNR, SSIM, and SAM), our method showed superior performance on three common hyperspectral datasets.

1. Introduction

Single-image super-resolution reconstruction is a direct approach for hyperspectral image super-resolution that requires no auxiliary information, focusing on enhancing the resolution of a single hyperspectral image. These methods can be broadly categorized into two groups. Firstly, traditional algorithm-based methods leverage prior knowledge of an image to constrain the solution space and alleviate the complexity of reconstruction. Secondly, deep-learning-based methods construct a deep learning (DL) network and train it on a dataset of processed hyperspectral images to learn the nonlinear mapping from low-resolution to high-resolution images.

1.1. Traditional Algorithms

In the early days, hyperspectral image super-resolution reconstruction primarily focused on enhancing the spatial resolution of low-resolution hyperspectral images through interpolation techniques, which reconstruct each pixel value from surrounding pixel information. Common interpolation methods include nearest-neighbor interpolation, bilinear interpolation, and bicubic interpolation, with bicubic interpolation being the preferred choice, as it effectively preserves information in both the spatial and spectral dimensions. While interpolation-based methods are simple and straightforward, they often lead to image blur in super-resolution reconstruction, yielding less-than-ideal results.
The challenges posed by interpolation algorithms have spurred advancements in the academic community. Certain scholars have introduced a hyperspectral image super-resolution reconstruction method rooted in sparse coding. This approach involves creating a sparse dictionary through learning and subsequently leveraging this dictionary to enhance the spatial dimension information of low-resolution hyperspectral images. This aids in establishing a more effective mapping relationship between low-resolution and high-resolution images.
Huang et al. [1] introduced a mapping model for super-resolution reconstruction based on sparse representation with multiple dictionaries in 2014. This model effectively captures the spatial–spectral relationship of hyperspectral images through sparse representation during learning and class assignment, achieving robust noise suppression. In the same year, Huang et al. [2] proposed a method for hyperspectral image super-resolution reconstruction that combined prior low-rank and group-sparse modeling methods. They also modeled degraded image conditions during super-resolution reconstruction, leading to improved results in cases of fuzzy and unknown image spaces. Gou et al. [3] extended these ideas in 2014 by introducing non-local self-similarity and local kernel constraint regularization terms. Building on this, they proposed a dictionary learning model that combines local and non-local priors to extract more local and non-local relationships between a space and a spectrum during hyperspectral image reconstruction, enhancing the effectiveness of hyperspectral image super-resolution reconstruction. In 2016, Li et al. [4] presented a hyperspectral image super-resolution method that integrates the sparse prior and non-local self-similarity of images. This method leverages spectral sparsity and the high similarity of spatial–spectral blocks, allowing reconstructed images to fully utilize the spatial–dimensional prior information of hyperspectral images while maintaining spectral consistency.
While traditional algorithms have made strides in hyperspectral image super-resolution reconstruction tasks, these methods often depend heavily on the prior information of hyperspectral images. However, these priors are mostly heuristic and possess weak representation capabilities. Consequently, in certain scenarios, they may fail to effectively constrain the solution space, resulting in unsatisfactory reconstructed images. Moreover, these methods require parameter adjustments for different images and devices, limiting their generalization capabilities.

1.2. Deep Learning Algorithms

Deep learning, renowned for its effectiveness in various computer vision tasks, has found applications in single-hyperspectral-image super-resolution. Convolutional neural networks (CNNs) have emerged as a primary tool in this domain.
Some notable approaches include Yuan et al.’s [5] utilization of deep learning to transfer a convolutional neural network (CNN) trained on natural images to hyperspectral images in 2017. They employed non-negative matrix decomposition to analyze spectral differences between low-resolution and high-resolution hyperspectral images, enhancing the network’s reconstruction capabilities. Hu et al. [6] proposed a deep spectral differential CNN (SDCNN) that incorporates a spatial error correction model to enhance spatial information while preserving spectral information.
Given the abundant spectral information present in hyperspectral images, researchers are dedicated to restoring as much spectral information as possible while achieving spatial super-resolution. As a result, numerous hyperspectral image super-resolution algorithms leveraging 3D convolution have been developed. In 2020, Li et al. [7] proposed a 3D generative adversarial network (GAN) that incorporates a spectral attention mechanism. However, the increased network parameters and floating-point computational complexity associated with 3D convolution remain a concern, limiting the full extraction and utilization of diverse features from hyperspectral images.
Currently, these algorithms exhibit impressive performance; however, networks based on 3D convolution face limitations in terms of the number of parameters. Additionally, due to the shape constraints of 3D convolution, they lack flexibility in extracting diverse features. In this study, features and convolutions are treated as matrices. The rank of a two-dimensional matrix signifies the independence and freedom of its elements, so the diversity of features can be quantitatively assessed by analyzing the rank of a feature matrix. Based on this observation, we propose a DL method to improve the feature diversity of hyperspectral images. The main contributions of this paper are as follows:
(1)
The 3D convolution topology was modified to enhance feature diversity in the feature map without increasing the number of parameters by raising the rank upper bound. A novel 3D feature extraction convolution method was introduced, followed by a Feature Diversity Enhancement Module. To integrate spatial and spectral information effectively, a spatial–spectral feature fusion module was proposed. The combination of these modules forms a spatial–spectral feature extraction module based on channel attention.
(2)
A loss function was devised to enhance the feature diversity and spatial–spectral smoothness of hyperspectral images. The feature diversity loss function utilizes the singular value of the image matrix to promote element independence. Additionally, a spatial–spectral gradient loss function, guided by the gradient value of space and spectrum, was introduced to enhance the spatial–spectral smoothness of the reconstructed image.

2. Related Work

In the landscape of hyperspectral image super-resolution research, significant strides have been made to enhance the reconstruction capabilities of networks. Drawing inspiration from innovative studies, researchers have explored diverse methodologies to address the challenges posed by limited spatial resolution in hyperspectral images.
Inspired by the 2018 study by Xie et al. [8], who transformed a super-resolution task into a feature matrix decomposition problem (DFMF) to separate 3D convolution models, in 2020, Li et al. [9] proposed a hybrid convolution module (MCNet) using 2D convolution and separable 3D convolution to effectively extract spatial and spectral feature information while reducing the number of network parameters and the complexity of floating-point computations. To further study the relationship between 2D and 3D convolutions, in 2021, Li et al. [10] proposed an improved method for hyperspectral image super-resolution reconstruction. By sharing spatial information, they reduced the redundancy in the model’s structure, thereby achieving improved network reconstruction performance. In 2021, Liu et al. [11] proposed a mechanism called spectral attention, which uses packet convolutions to avoid the spectral distortion caused by full convolutions to better utilize the prior information of spectral images in hyperspectral images.
Other notable contributions include Jiang et al.’s [12] proposed depth neural network based on the spatial–spectral prior network (SSPSR) to fully utilize the spatial and spectral correlation information within hyperspectral images. XINet [13], an X-shaped interactive autoencoder network, addresses limitations in hyperspectral image super-resolution (HSI-SR). By uniting U-Nets and introducing cross-modality mutual learning, it effectively utilizes multimodal information, enhancing spatial–spectral features. A joint self-supervised loss allows for unsupervised optimization, reducing training costs. The proposed EU2ADL network [14] enhances multispectral-aided hyperspectral image super-resolution. Using coupled autoencoders, a hybrid loss function, and attention-embedded degradation learning, it improves representation learning and reduces distortions. J. Li et al. [15] introduced a novel unsupervised model-guided approach for hyperspectral image super-resolution. The method utilizes degradation models and multiscale attentional fusion to improve mapping learning, resulting in superior results with various datasets. Furthermore, Zhang et al. [16] proposed a CNN-based super-resolution reconstruction algorithm using multiscale feature extraction and a multilevel feature fusion structure to solve the problem of the lack of effective model design for spectral segment feature learning in hyperspectral remote sensing images. In 2023, Zhang et al. [17] proposed the spectral correlation and spatial high–low-frequency information of a hyperspectral image super-resolution network (SCSFINet) based on spectrum-guided attention for analyzing the information acquired from hyperspectral images. With a focus on addressing the challenge of obtaining high-resolution 3D surface structures in computer vision, Y. Ju et al. [18] introduced the Super-Resolution Photometric Stereo Network (SR-PSN). Their aim was to overcome the complexities associated with acquiring high-resolution images in linearly responsive photometric stereo-imaging systems. To integrate the self-attention mechanism with Transformer, a novel Transformer model, Dual Aggregation Transformer (DAT) [19], was introduced. This model effectively amalgamates information across two distinct dimensions, enhancing the expressive power of the network. Chen, Y., et al. [20] proposed a lightweight single-image super-resolution network using two-level nested residual blocks and autocorrelation weight unit to address the common problems of blurred image edges, inflexible convolution kernel size selection, and slow convergence during a training procedure.
The diverse array of approaches discussed in this section reflects the dynamic and evolving nature of hyperspectral image super-resolution research. From novel convolution modules to attention mechanisms and unsupervised learning strategies, each contribution has added a unique perspective to the ongoing quest for sharper and more accurate hyperspectral image reconstructions. However, these methods face the challenge of not fully tapping into and utilizing the diverse features inherent to hyperspectral images.

3. Proposed Method

In this section, to introduce the proposed network, details on CSSFENet are presented in four parts, as shown in Figure 1: the overall framework, 3D-Feature Extraction Convolution (3D-FEC), the Channel-Attention-Based Spatial–Spectral Feature Extraction Module (CSSFEM), and loss functions. The goal of CSSFENet is to enhance the feature diversity of the extracted feature map through modifications to the network topology.

3.1. Overall Framework

The structure of the proposed Channel-Attention-Based Spatial–Spectral Feature Extraction Network (CSSFENet) is shown in Figure 1. It has four stages, including 3D-Feature Extraction Convolution (3D-FEC) for shallow feature extraction, Channel-Attention-Based Spatial–Spectral Feature Extraction Modules (CSSFEMs) for deep feature mapping and the fusion of the spatial and spectral dimensions, an image reconstruction module for up-sampling low-resolution hyperspectral images, and loss functions to guide the network’s optimization direction.
In the first stage, the 3D-FEC was applied to extract the shallow feature $F_0$ of the input LR image $I_{LR}$. The value of $F_0$ was determined using Equation (1), as follows:

$$F_0 = H_{3D\text{-}FEC}\left(\mathrm{Unsqueeze}\left(I_{LR}\right)\right) \tag{1}$$

where $H_{3D\text{-}FEC}(\cdot)$ represents the 3D-FEC module, $\mathrm{Unsqueeze}(\cdot)$ represents the dimension-extension function, which transforms the input $I_{LR} \in \mathbb{R}^{S \times H \times W}$ to $I_{LR} \in \mathbb{R}^{1 \times S \times H \times W}$, and $S$, $H$, and $W$ represent the number of spectral bands, the height, and the width of the low-resolution hyperspectral image, respectively.
Then, $F_0$ was sent to the CSSFEMs, which learned the nonlinear mapping of the spatial and spectral dimensions and extracted the image features more comprehensively. The output feature $F_n$ of the last CSSFEM was obtained using Equation (2), as follows:

$$F_n = H_{CSSFEM}^{n}\left(F_{n-1}\right) + F_0 \tag{2}$$

$$F_{n-1} = H_{1\times1}\left(\mathrm{concat}\left(F_1, F_2, \ldots, H_{CSSFEM}^{n-1}\left(F_{n-2}\right)\right)\right)$$

$$F_{n-2} = H_{1\times1}\left(\mathrm{concat}\left(F_1, F_2, \ldots, H_{CSSFEM}^{n-2}\left(F_{n-3}\right)\right)\right)$$

$$F_2 = H_{1\times1}\left(\mathrm{concat}\left(F_1, H_{CSSFEM}^{2}\left(F_1\right)\right)\right)$$

$$F_1 = H_{CSSFEM}^{1}\left(F_0\right)$$

where $H_{CSSFEM}^{n}(\cdot)$ indicates the $n$-th CSSFEM module, $H_{1\times1}(\cdot)$ denotes the $1\times1$ convolution utilized for reducing channel dimensionality, and $\mathrm{concat}(\cdot)$ stands for the concatenation operation.
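For concreteness, the dense aggregation in Equation (2) can be sketched in PyTorch as follows. This is a minimal illustration and not the authors' implementation: the CSSFEM block is replaced by a hypothetical placeholder factory (`make_cssfem`), and only the concatenation, the 1 × 1 × 1 channel-reduction convolutions, and the global skip connection are shown.

```python
import torch
import torch.nn as nn

class DenseCSSFEMChain(nn.Module):
    """Sketch of the dense CSSFEM aggregation of Equation (2)."""
    def __init__(self, channels, n_modules=5, make_cssfem=None):
        super().__init__()
        if make_cssfem is None:  # placeholder block standing in for the CSSFEM
            make_cssfem = lambda c: nn.Sequential(
                nn.Conv3d(c, c, 3, padding=1), nn.ReLU(inplace=True))
        self.blocks = nn.ModuleList(make_cssfem(channels) for _ in range(n_modules))
        # 1x1x1 convolutions that fuse the concatenated intermediate features
        self.fuse = nn.ModuleList(
            nn.Conv3d(channels * (k + 1), channels, kernel_size=1)
            for k in range(1, n_modules - 1))

    def forward(self, f0):                        # f0: (B, C, S, H, W)
        feats = [self.blocks[0](f0)]              # F_1
        for k in range(1, len(self.blocks) - 1):  # F_2 ... F_{n-1}
            new = self.blocks[k](feats[-1])
            feats.append(self.fuse[k - 1](torch.cat(feats[:k] + [new], dim=1)))
        return self.blocks[-1](feats[-1]) + f0    # F_n with the global skip to F_0
```

With five modules and 64 channels this reproduces the connectivity pattern only; the internal structure of each CSSFEM is given in Section 3.3.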
In the third stage, the image reconstruction module, we applied the 3D sub-pixel convolution to restore the hyperspectral SR image $I_{SR} \in \mathbb{R}^{B \times 1 \times S \times rH \times rW}$, where $r$ is the upscale factor. Given the input $F_n \in \mathbb{R}^{B \times C \times S \times H \times W}$, in which $C$ is the channel number of $F_n$, the above process was expressed as follows:

$$I_{SR} = \mathrm{Squeeze}\left(H_{up}\left(F_n\right)\right) + \mathrm{Bicubic}\left(I_{LR}\right) \tag{3}$$

where $H_{up}(\cdot)$ indicates the 3D sub-pixel convolution, $\mathrm{Bicubic}(\cdot)$ denotes the bicubic interpolation up-sampling function, and $\mathrm{Squeeze}(\cdot)$ represents the dimension compression function, which transforms $I_{SR} \in \mathbb{R}^{1 \times S \times rH \times rW}$ to $I_{SR} \in \mathbb{R}^{S \times rH \times rW}$.
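The reconstruction stage of Equation (3) can be sketched as follows. The per-band pixel shuffle below is only a stand-in for the paper's 3D sub-pixel convolution, whose exact layout is not specified here; the class name and kernel size are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Reconstruct3D(nn.Module):
    """Sketch of I_SR = Squeeze(H_up(F_n)) + Bicubic(I_LR) for upscale factor r."""
    def __init__(self, channels, r):
        super().__init__()
        self.r = r
        self.conv = nn.Conv3d(channels, r * r, kernel_size=3, padding=1)

    def forward(self, feat, lr):                 # feat: (B, C, S, H, W); lr: (B, S, H, W)
        x = self.conv(feat)                      # (B, r*r, S, H, W)
        b, c2, s, h, w = x.shape
        x = x.permute(0, 2, 1, 3, 4).reshape(b * s, c2, h, w)
        x = F.pixel_shuffle(x, self.r)           # per-band sub-pixel rearrangement
        x = x.reshape(b, s, h * self.r, w * self.r)
        up = F.interpolate(lr, scale_factor=self.r, mode='bicubic', align_corners=False)
        return x + up                            # add the bicubic global skip
```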
During the final stage, the proposed feature diversity loss function ($L_{FD}$) and spatial–spectral gradient loss function ($L_{SSG}$) were integrated with the standard mean absolute error (MAE) loss to steer the overall optimization direction of the network, as elaborated upon in Section 3.4.

3.2. 3D-Feature Extraction Convolution (3D-FEC)

This section mathematically models the challenge of acquiring diverse features from hyperspectral images. Through quantitative analysis, we derived the convolution type that enhances the feature diversity of feature maps by modifying the network topology.
Let $A \in \mathbb{R}^{N \times C \times k \times k \times k}$ be a kernel tensor containing $N$ 3D convolution kernels of size $k \times k \times k$, where $C$ is the number of input feature channels. Given an input hyperspectral feature map $I \in \mathbb{R}^{C \times S \times H \times W}$, the 3D convolutional layer outputs a feature map $F \in \mathbb{R}^{N \times S \times H \times W}$:

$$F = A \cdot I \tag{4}$$

where $F \in \mathbb{R}^{N \times SHW}$ is the matrix form of the output feature map of the 3D convolutional layer, $A \in \mathbb{R}^{N \times k^3 C}$ is the kernel matrix obtained by flattening the 3D convolution kernels into row vectors and stacking them vertically, as shown in Figure 2a, and $I \in \mathbb{R}^{k^3 C \times SHW}$ is the matrix form of the input hyperspectral feature map obtained through iterative expansion along the spatial–spectral direction.
The rank of a 2D matrix indicates the independence and degree of freedom of its elements. Thus, a quantitative measure of the feature diversity of the feature matrix $F$ can be obtained by analyzing its rank. According to the rule of matrix multiplication, per Equation (5):

$$\mathrm{Rank}(F) \le \min\left(\mathrm{Rank}(A), \mathrm{Rank}(I)\right) \tag{5}$$

where $\mathrm{Rank}(\cdot)$ represents the rank of the input 2D matrix. Because $N$ and $C$ usually have similar mathematical magnitudes and $N \le k^3 C \ll SHW$, it follows that $\mathrm{Rank}(F) \le N$.
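The bound in Equation (5) can be checked numerically with a toy example; the sizes below are illustrative and not taken from the paper.

```python
import torch

# Toy check of Rank(F) <= min(Rank(A), Rank(I)) for the flattened forms of Eq. (4).
N, C, k, SHW = 32, 8, 3, 1000                    # illustrative sizes
A = torch.randn(N, k**3 * C)                     # kernel matrix, N x k^3*C
I = torch.randn(k**3 * C, SHW)                   # unfolded input, k^3*C x SHW
F = A @ I                                        # output feature matrix, N x SHW
print(torch.linalg.matrix_rank(F).item(),        # typically equals N = 32
      min(torch.linalg.matrix_rank(A).item(),
          torch.linalg.matrix_rank(I).item()))
```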
Therefore, the upper limit of $\mathrm{Rank}(F)$ is fundamentally bounded by $\mathrm{Rank}(A)$, and raising the rank limit of $A$ is a prerequisite for facilitating the learning of diverse and powerful image features. In Figure 2a, the convolutional kernels in the 3D convolution layer are flattened into a large 2D matrix, $A \in \mathbb{R}^{N \times k^3 C}$, with a rank limit of $N$. To overcome this limitation, this paper proposes to transform the elements of $A$ into 2D matrices with higher rank limits using standard convolutional operations. Furthermore, considering that all three dimensions of hyperspectral images are equally significant [21], no distinction is made when handling these three dimensions, ensuring that the same convolution pattern can be applied to all of them and thus fully utilizing their symmetrical properties.
From an algorithm-implementation perspective, the number of output channels is expanded by a factor of $3L$ to construct a larger kernel matrix, denoted as $A' \in \mathbb{R}^{3LN \times k^3 C}$. Therefore, the convolution process can be redefined using Equations (6) and (7):

$$F' = A' \cdot I \tag{6}$$

$$A' = \left[A_1; A_2; A_3\right] \tag{7}$$

where $F' \in \mathbb{R}^{3LN \times SHW}$ represents the matrix form of the output feature map, and $A_i \in \mathbb{R}^{LN \times k^3 C}$ corresponds to the matrix form of the convolutional kernels for each dimension of the hyperspectral image. To ensure that the number of parameters in the convolutional operation corresponding to $A'$ does not exceed that of the 3D convolution, a certain number of zeros must be added to $A'$, and the maximum number of parameters for each convolutional kernel is set to $k^3 C / 3L$.
The expected kernel matrix $A'$ described above can be implemented using common convolution modes. In particular, this paper opted for different combinations of widely used 1D and 2D convolution kernels to substitute for the original 3D convolution kernels. This means that a convolution layer with large convolution kernels is approximated by aggregating several small convolution kernels in a specific sequence. For instance, assume $k = 3$ (so that $k^3 = 27$ and $k^3 C / 3L = 9C/L$).
When $L = 1$, a feasible $A'$ can be obtained with the rank upper limit elevated to $3N$. As illustrated in Figure 2b, this kernel matrix can be implemented by employing $3 \times 3$ two-dimensional convolution kernels along all three dimensions. In Figure 2, the colored squares represent the convolution kernel weights, while the blank squares denote zero padding. When $L = 3$, as depicted in Figure 2c, a workable $A'$ can be constructed with a rank upper limit of up to $9N$; this corresponds to applying three convolution kernels of size 3 in each of the three dimensions.
To address the trade-off between the parameter count and the rank upper limit, alternative solutions are also considered. While the number of parameters in the above $A'$ is equivalent to the kernel parameters of $A$, it is also possible to construct a feasible $A'$ with a higher rank upper limit but fewer parameters, reducing the network's parameter count. When $L = 1$, as depicted in Figure 2d, an $A'$ with a rank upper limit of $3N$ can be created by utilizing a separated convolution of size 3 for each dimension. Similarly, for $L = 2$, an $A'$ with a rank upper limit of $6N$ can be established, as displayed in Figure 2e, where two convolution kernels of size 3 are applied to each dimension individually.
Based on the above analysis, we constructed $A'$ without introducing additional parameters, and we labeled Figure 2b as 3D-FEC-b, Figure 2c as 3D-FEC-c, Figure 2d as 3D-FEC-d, and Figure 2e as 3D-FEC-e to compare their parameters and FLOPs, as shown in Table 1.
For an input feature of size 64 × 7 × 7 × 7 (channels × spectral segments × width × height) and an output feature of size 64 × 5 × 5 × 5 with no padding (i.e., padding = 0), using standard convolution kernels of size 3 and a stride of 1, Table 1 reports the parameter counts, floating-point computations, and parameter ratios of standard 3D convolution and the various 3D-FEC convolutions.
For the same input and output feature maps, with size-3 convolution kernels for each dimension, 3D-FEC-b, 3D-FEC-c, and standard 3D convolution have equal parameter counts and floating-point operations. However, 3D-FEC-b raises the rank upper bound by a factor of three, and 3D-FEC-c raises it by a factor of nine. Moreover, 3D-FEC-d reduces the parameters and FLOPs to one-third of those of 3D convolution while expanding the rank upper bound by a factor of three. Similarly, 3D-FEC-e has two-thirds of the parameter count and floating-point operations of 3D convolution, but its rank upper bound is expanded by a factor of six. Therefore, we used 3D-FEC-e as the 3D Feature Extraction Convolution (3D-FEC), which directly modifies the network topology to improve the upper bound of $\mathrm{Rank}(A')$.
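A minimal sketch of the separated one-dimensional-convolution idea behind 3D-FEC-d and 3D-FEC-e is shown below, together with a rough parameter comparison against a standard 3 × 3 × 3 convolution. The channel split, the number of kernels per dimension, and the fusion step are assumptions; the exact configurations follow Figure 2 and Table 1.

```python
import torch
import torch.nn as nn

class Separated3DFEC(nn.Module):
    """Sketch: parallel 1D convolutions along spectrum, height, and width,
    concatenated and fused (in the spirit of Figure 2d/e)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.spec = nn.Conv3d(in_ch, out_ch, (3, 1, 1), padding=(1, 0, 0))
        self.h    = nn.Conv3d(in_ch, out_ch, (1, 3, 1), padding=(0, 1, 0))
        self.w    = nn.Conv3d(in_ch, out_ch, (1, 1, 3), padding=(0, 0, 1))
        self.fuse = nn.Conv3d(3 * out_ch, out_ch, kernel_size=1)  # channel reduction

    def forward(self, x):                        # x: (B, C, S, H, W)
        return self.fuse(torch.cat([self.spec(x), self.h(x), self.w(x)], dim=1))

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(nn.Conv3d(64, 64, 3, padding=1)),    # standard 3D convolution
      count(Separated3DFEC(64, 64)))             # separated variant plus fusion conv
```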

3.3. Channel-Attention-Based Spatial–Spectral Feature Extraction Module (CSSFEM)

After completing the extraction of feature diversity from feature maps, the next thing that we needed to focus on was how to fully leverage this feature information. The primary objective was to design a specific module capable of seamlessly integrating feature information from the spatial and spectral dimensions.
The CSSFEM was designed to meet this demand. As illustrated in Figure 3, for the input image feature $F_{in} \in \mathbb{R}^{C \times S \times H \times W}$, the FDEM module is first used to improve the feature diversity of the image, that is, to raise the upper limit of the matrix rank, yielding the output feature map $F_{FDEM} \in \mathbb{R}^{C \times S \times H \times W}$:

$$F_{FDEM} = H_{FDEM}\left(F_{in}\right)$$

Next, the spatial and spectral information of the image are fused using the Spatial–Spectral Feature Fusion Module (SSFFM) to obtain the feature map $F_{SSFFM} \in \mathbb{R}^{C \times S \times H \times W}$, and a channel attention mechanism produces the module output:

$$F_{SSFFM} = H_{SSFFM}\left(F_{FDEM}\right)$$

$$F_{out} = F_{FDEM} \cdot \sigma\left(H_{conv}\left(\mathrm{ReLU}\left(H_{conv}\left(\mathrm{Avg}\left(F_{SSFFM}\right), \mathrm{Max}\left(F_{SSFFM}\right)\right)\right)\right)\right)$$
The feature maps $F_{Avg} \in \mathbb{R}^{C \times 1 \times 1 \times 1}$ and $F_{Max} \in \mathbb{R}^{C \times 1 \times 1 \times 1}$ are obtained using global average pooling and global maximum pooling, respectively:

$$F_{Avg} = \mathrm{GAP}\left(F_{SSFFM}\right)$$

$$F_{Max} = \mathrm{GMP}\left(F_{SSFFM}\right)$$

where $\mathrm{GAP}(\cdot)$ and $\mathrm{GMP}(\cdot)$ stand for global average pooling and global maximum pooling, respectively.
Finally, a 3D convolution with a kernel size of $1 \times 1 \times 1$ is used to fuse the feature information, obtaining $F_{fusion} \in \mathbb{R}^{C \times 1 \times 1 \times 1}$ and reducing the number of channels to prevent channel explosion. The sigmoid activation function is applied to obtain the coefficient feature map $V \in \mathbb{R}^{C \times 1 \times 1 \times 1}$, which is then multiplied with the FDEM output to obtain the final feature map:

$$F_{fusion} = H_{1 \times 1 \times 1}\left(\mathrm{concat}\left(F_{Avg}, F_{Max}\right)\right)$$

$$V = \sigma\left(F_{fusion}\right)$$

$$F_{out} = F_{FDEM} \cdot V$$

where $\sigma$ is the sigmoid activation function, $\mathrm{concat}(\cdot)$ is the concatenation operation, and $F_{out}$ is the output feature of the CSSFEM.
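A sketch of this channel attention branch is given below. It follows the detailed equations (a single 1 × 1 × 1 fusion convolution after pooling); the two-convolution form shown in the overview formula above would differ slightly. Class and variable names are illustrative.

```python
import torch
import torch.nn as nn

class ChannelAttention3D(nn.Module):
    """Sketch of the CSSFEM channel attention: GAP and GMP over (S, H, W),
    a 1x1x1 fusion convolution, and a sigmoid gate applied to the FDEM features."""
    def __init__(self, channels):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool3d(1)       # F_Avg: (B, C, 1, 1, 1)
        self.gmp = nn.AdaptiveMaxPool3d(1)       # F_Max: (B, C, 1, 1, 1)
        self.fuse = nn.Conv3d(2 * channels, channels, kernel_size=1)

    def forward(self, f_fdem, f_ssffm):
        v = torch.sigmoid(self.fuse(torch.cat([self.gap(f_ssffm),
                                               self.gmp(f_ssffm)], dim=1)))
        return f_fdem * v                        # F_out = F_FDEM * V (broadcast over S, H, W)
```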

3.3.1. Feature Diversity Enhancement Module (FDEM)

Building upon 3D-FEC convolution, we introduced a Feature Diversity Enhancement Module (FDEM). The architecture of the FDEM module is illustrated in Figure 3c.
The FDEM module primarily builds upon 3D-FEC convolution. The input image features are split into three streams upon entering the FDEM module. As per the aforementioned theory, this procedure significantly increases the upper limit of the feature map rank by a factor of three, enhancing the diversity of the features extracted through convolution. The convolution kernels applied are of sizes $3 \times 1 \times 1$, $1 \times 3 \times 1$, and $1 \times 1 \times 3$, respectively. Subsequently, the ReLU activation is applied, and the convolution output feature maps are concatenated. To address potential channel explosion, a $1 \times 1 \times 1$ 3D convolution is used for channel dimensionality reduction, and the result is added to the input feature, yielding the final feature information.
The entire process can be mathematically represented by Equations (15)–(18), as follows:
$$F_{up} = H_{3 \times 1 \times 1}\left(\mathrm{ReLU}\left(H_{3 \times 1 \times 1}\left(F_{in}\right)\right)\right) \tag{15}$$

$$F_{mid} = H_{1 \times 1 \times 3}\left(\mathrm{ReLU}\left(H_{1 \times 1 \times 3}\left(F_{in}\right)\right)\right) \tag{16}$$

$$F_{low} = H_{1 \times 3 \times 1}\left(\mathrm{ReLU}\left(H_{1 \times 3 \times 1}\left(F_{in}\right)\right)\right) \tag{17}$$

$$F_{out} = H_{1 \times 1 \times 1}\left(\mathrm{concat}\left(F_{up}, F_{mid}, F_{low}\right)\right) + F_{in} \tag{18}$$

where $F_{up}$, $F_{mid}$, and $F_{low}$ correspond to the upper, middle, and lower branches in Figure 3c, and $H_{k_1 \times k_2 \times k_3}$ represents a 3D convolution operation with a kernel size of $k_1 \times k_2 \times k_3$.
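A compact sketch of the FDEM following Equations (15)–(18) is given below; the class name and helper function are illustrative.

```python
import torch
import torch.nn as nn

def branch(ch, k):
    """Two stacked 1D 3D-convolutions with a ReLU in between (Eqs. (15)-(17))."""
    pad = tuple(x // 2 for x in k)
    return nn.Sequential(nn.Conv3d(ch, ch, k, padding=pad),
                         nn.ReLU(inplace=True),
                         nn.Conv3d(ch, ch, k, padding=pad))

class FDEM(nn.Module):
    """Sketch of the Feature Diversity Enhancement Module (Figure 3c)."""
    def __init__(self, ch):
        super().__init__()
        self.up  = branch(ch, (3, 1, 1))         # spectral branch
        self.mid = branch(ch, (1, 1, 3))         # width branch
        self.low = branch(ch, (1, 3, 1))         # height branch
        self.fuse = nn.Conv3d(3 * ch, ch, kernel_size=1)  # channel reduction

    def forward(self, x):
        y = torch.cat([self.up(x), self.mid(x), self.low(x)], dim=1)
        return self.fuse(y) + x                  # Eq. (18): fuse, then residual add
```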

3.3.2. Spatial–Spectral Feature Fusion Module (SSFFM)

We introduced a Spatial–Spectral Feature Fusion Module (SSFFM). The architecture of the SSFFM module is illustrated in Figure 3b.
Once the feature diversity had been extracted from the feature maps, the next critical step was to leverage this information effectively. This section’s primary objective is to devise a module capable of seamlessly integrating feature details from both the spatial and spectral dimensions while capitalizing on the benefits of feature diversity. To achieve this, we employed Equations (5)–(7) to dissect these spatial–spectral feature fusion configurations, as illustrated in Figure 4.
Figure 4a illustrates the most straightforward spatial–spectral fusion method, standard 3D convolution. Its formula analysis has already been discussed in Section 3.2, so we will not revisit it here. Instead, we provide detailed explanations of the other structures in Figure 4b–d:
(1) In Figure 4b, we describe the process of “sequential concatenation of 1D-2D convolution kernels” (S-1D-2D) using matrix multiplication, as expressed by Equations (19) and (20):

$$F_{1D} = A_{1D} \cdot I \tag{19}$$

$$F_{out1} = A_{2D} \cdot I_{1D} \tag{20}$$

where $A_{1D} \in \mathbb{R}^{N \times k^3 C}$ represents the matrix form of the 1D convolution kernel, $I \in \mathbb{R}^{k^3 C \times SHW}$ represents the input feature map, and $F_{1D} \in \mathbb{R}^{N \times SHW}$ represents the output feature map obtained from the 1D convolution. $A_{2D} \in \mathbb{R}^{N \times k^3 N}$ represents the matrix form of the 2D convolution kernel, $I_{1D} \in \mathbb{R}^{k^3 N \times SHW}$ represents the input feature map for the 2D convolution, which is derived from a matrix transformation of $F_{1D}$, and $F_{out1} \in \mathbb{R}^{N \times SHW}$ represents the final output feature map.
(2) The process of “sequential connection of three 1D convolution kernels” (S-three-1D) shown in Figure 4c can be represented in matrix multiplication form using Equations (21)–(23):

$$F_{1D1} = A_{1D1} \cdot I \tag{21}$$

$$F_{1D2} = A_{1D2} \cdot I_{1D1} \tag{22}$$

$$F_{out2} = A_{1D3} \cdot I_{1D2} \tag{23}$$

where $A_{1D1}, A_{1D2}, A_{1D3} \in \mathbb{R}^{N \times k^3 N}$ represent the matrix forms of the 1D convolution kernels, $I_{1D1}, I_{1D2} \in \mathbb{R}^{k^3 N \times SHW}$ denote the matrix forms of the input feature maps, and $F_{1D1}, F_{1D2}, F_{out2} \in \mathbb{R}^{N \times SHW}$ represent the matrix forms of the output feature maps.
(3) The process of “parallel joining of 1D-2D convolution kernels” (P-1D-2D) in Figure 4d can be described using matrix multiplication, as shown in Equations (24) and (25):

$$A_{1D2D} = \left[A_{1D}; A_{2D}\right] \tag{24}$$

$$F_{out3} = A_{1D2D} \cdot I \tag{25}$$

where $F_{out3} \in \mathbb{R}^{2N \times SHW}$ represents the final output feature map, and $A_{1D2D} \in \mathbb{R}^{2N \times k^3 C}$ represents the combined form of the 1D and 2D convolution kernels, which are executed in parallel in the fusion process.
According to the equations above, we can derive that $\mathrm{Rank}(F_{out1}) \le N$, $\mathrm{Rank}(F_{out2}) \le N$, and $\mathrm{Rank}(F_{out3}) \le 2N$. This indicates that the feature extraction method in Figure 4d extends the upper limit of the feature map rank from $N$ to $2N$. Building on this insight, we propose the Spatial–Spectral Feature Fusion Module (SSFFM) in this section. This module enhances the feature diversity extraction capability while retaining the same number of network parameters as standard 3D convolution.
The SSFFM module primarily employs two types of convolutions with kernels of 3 × 1 × 1 and 1 × 3 × 3 , respectively, to capture the spectral and spatial features of the image. Subsequently, these feature representations from the spectral and spatial domains are fused using an addition operation. In the context of hyperspectral image super-resolution, the goal is to enhance spatial resolution while preserving hyperspectral fidelity. To achieve this, a convolutional layer with a 1 × 3 × 3 kernel is applied after fusion to further improve spatial domain learning. The primary function of this module is to facilitate cross-band knowledge complementarity between the spectral and spatial domains.
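A minimal sketch of the SSFFM as described above is shown below (spectral 3 × 1 × 1 branch, spatial 1 × 3 × 3 branch, additive fusion, and one further 1 × 3 × 3 convolution); the padding choices and the absence of activation functions are assumptions.

```python
import torch.nn as nn

class SSFFM(nn.Module):
    """Sketch of the Spatial-Spectral Feature Fusion Module (Figure 3b)."""
    def __init__(self, ch):
        super().__init__()
        self.spectral = nn.Conv3d(ch, ch, (3, 1, 1), padding=(1, 0, 0))  # spectral branch
        self.spatial  = nn.Conv3d(ch, ch, (1, 3, 3), padding=(0, 1, 1))  # spatial branch
        self.refine   = nn.Conv3d(ch, ch, (1, 3, 3), padding=(0, 1, 1))  # post-fusion spatial conv

    def forward(self, x):
        return self.refine(self.spectral(x) + self.spatial(x))
```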
The CSSFEM of the network plays a crucial role in feature extraction. Its purpose is to learn the nonlinear mapping of the input five-dimensional feature image to the spatial and spectral dimensions and to enable comprehensive feature extraction. The Channel-Attention-Based Spatial–Spectral Feature Extraction Module (CSSFEM) further enhances information fusion between the feature maps of each module through a densely connected network.

3.4. Loss Function

The loss function serves as a crucial tool for constraining and guiding the optimization process of super-resolution networks. In this section, the mean absolute error (MAE) is adopted as the primary component of the network loss function due to its smooth continuity and effective prevention of overfitting. Building upon the MAE loss, this section analyzes and addresses the limitations of the proposed network. First, concerning feature diversity extraction, inspired by channel attention, the hyperspectral image is transformed into a two-dimensional matrix, and the feature diversity loss function ($L_{FD}$) is designed to enhance feature map diversity by altering the singular value distribution of the matrix; the detailed methodology is outlined in Section 3.4.1. Secondly, with respect to spectral characteristics, to recover hyperspectral images with more abundant spectral information, the spatial–spectral gradient loss function ($L_{SSG}$) is designed accordingly, and the specific method is expounded upon in Section 3.4.2.

3.4.1. Feature Diversity Loss Function ($L_{FD}$)

Consider $F$ as the input feature map and $F_a$ as the feature map after enhancement; $F$ corresponds to $F_0$, and $F_a$ corresponds to $F_n$. Even though enhancing the network topology raises the rank upper bound of the matrix form of the feature map $F$, it is theoretically possible to further enhance the diversity of the resulting feature map $F_a$. Previous research [24,25,26] has demonstrated that, during backpropagation in deep learning networks, the weights of a significant portion of convolutional kernels tend to converge toward the marginal set of major components, known as the long-tail distribution. This convergence may lead to overfitting.
Assume that a two-dimensional feature matrix has singular values $\{a_i\}_{i \in [1, r]}$ and that, after the rank upper limit is raised by the Feature Diversity Enhancement Module described in Section 3.3, the singular values become $\{b_i\}_{i \in [1, nr]}$. This transformation can lead to a long-tail distribution of singular values, which can be expressed by Equation (26):

$$a_i \geq b_i \gg b_j > 0, \quad i \in [1, r], \; j \in [n+1, nr] \tag{26}$$
The new singular values $\{b_j\}_{j \in [n+1, nr]}$ obtained through feature diversity enhancement give the rearranged feature matrix a higher rank. However, their magnitudes may not be large enough to provide a significant advantage in terms of network performance; in other words, the obtained feature map $F_a$ may still be approximately low-rank. To address this issue, this section directly regularizes the feature matrix $F_a$ to enhance its diversity, which is achieved by minimizing the following loss function, as expressed in Equation (27):

$$L_{FD} = \left\| \mathrm{SVD}\left(F_n\right) \right\|_1 \tag{27}$$

where $\|\cdot\|_1$ represents the $l_1$ norm, and $\mathrm{SVD}(\cdot)$ returns the vector of singular values of the input 2D matrix.
The regularization term in Equation (27) narrows the gap between large and small singular values and promotes a more even distribution, thereby preventing the feature map from becoming a low-rank matrix while enhancing feature diversity.
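A sketch of $L_{FD}$ as Equation (27) is written: the feature map is flattened per sample into a 2D matrix and the $l_1$ norm of its singular values is taken. The flattening layout, the batching, and any sign or normalization applied in the authors' code are assumptions.

```python
import torch

def feature_diversity_loss(feat):
    """Sketch of L_FD (Eq. (27)) for a 5D feature map feat of shape (B, C, S, H, W)."""
    b, c, s, h, w = feat.shape
    mat = feat.reshape(b, c, s * h * w)          # one 2D matrix per sample
    sv = torch.linalg.svdvals(mat)               # singular values, shape (B, min(C, S*H*W))
    return sv.sum(dim=-1).mean()                 # l1 norm of the singular-value vector
```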

3.4.2. Spatial–Spectral Gradient Loss Function ($L_{SSG}$)

The mean absolute error (MAE) and mean square error (MSE) loss functions are primarily designed for general image super-resolution tasks. While they can effectively preserve spatial information in super-resolution results, they tend to overlook the correlation between spectral features. This oversight can result in spectral information distortion in the reconstructed hyperspectral image. To ensure that both spatial and spectral information are well preserved in the reconstructed hyperspectral images, we proposed a spatial–spectral gradient (SSG) loss function. This loss function is an extension of the traditional total variation model, and it takes into account the correlation between the spatial and spectral dimensions. Consequently, it enhances the smoothness of both the spectral and spatial dimensions in the reconstructed image, as expressed in Equation (28) below:
$$L_{SSG} = \frac{1}{N} \sum_{n=1}^{N} \left( \left\| \nabla_B I_{SR}^{n} \right\|_1 + \left\| \nabla_H I_{SR}^{n} \right\|_1 + \left\| \nabla_W I_{SR}^{n} \right\|_1 \right) \tag{28}$$

where $I_{SR}^{n}$ represents the $n$-th reconstructed hyperspectral image, and $\nabla_B$, $\nabla_H$, and $\nabla_W$ denote the gradients calculated along the spectral, height, and width dimensions of the reconstructed image, respectively.
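A sketch of $L_{SSG}$ using first-order finite differences along the band, height, and width axes; averaging rather than summing the $l_1$ terms is a normalization assumption.

```python
import torch

def ssg_loss(sr):
    """Sketch of L_SSG (Eq. (28)) for a reconstructed batch sr of shape (B, S, H, W)."""
    d_band = (sr[:, 1:, :, :] - sr[:, :-1, :, :]).abs().mean()   # spectral gradient
    d_h    = (sr[:, :, 1:, :] - sr[:, :, :-1, :]).abs().mean()   # height gradient
    d_w    = (sr[:, :, :, 1:] - sr[:, :, :, :-1]).abs().mean()   # width gradient
    return d_band + d_h + d_w
```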
We combined the mean absolute error loss function, feature diversity loss function, and spatial–spectral gradient loss function to create a comprehensive loss function, as shown in Equation (29):
$$L_{total} = L_{MAE} + \alpha L_{FD} + \beta L_{SSG} \tag{29}$$
where $\alpha$ and $\beta$ are coefficients used to balance the contributions of the different losses. In the experiments conducted in this study, $\alpha = 10^{-3}$ and $\beta = 10^{-5}$.
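The combined objective of Equation (29) can then be assembled from the loss sketches above; the helper names are the hypothetical ones used earlier.

```python
import torch.nn.functional as F

def total_loss(sr, hr, feat, alpha=1e-3, beta=1e-5):
    """L_total = L_MAE + alpha * L_FD + beta * L_SSG (Equation (29))."""
    return (F.l1_loss(sr, hr)
            + alpha * feature_diversity_loss(feat)
            + beta * ssg_loss(sr))
```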

4. Experiments

In this section, we conduct a comprehensive evaluation of CSSFENet, considering both quantitative and qualitative aspects. Initially, we tested CSSFENet on three widely used datasets and provide specific implementation details. Following this, we performed an in-depth analysis of CSSFENet's performance. Finally, we conducted a comparative assessment of the proposed CSSFENet against other methods, including Bicubic, VDSR [27], EDSR [28], MCNet [9], MSDformer, MSFMNet [16], AS3ITransUnet, and PDENet [29].

4.1. Dataset and Evaluation Metrics

The CAVE, Pavia, and Pavia University (PaviaU) hyperspectral image datasets were used in this study. The CAVE dataset comprises 32 scenes, each containing 31 spectral bands with a spatial resolution of 512 × 512 pixels and covering the wavelength range of 400–700 nm. During the assessment phase, we reserved 25 images for training [17] and used the remaining seven images for testing, as shown in Figure 5.
The Pavia dataset consists of 102 spectral bands, whereas the PaviaU dataset consists of 103 spectral bands. Both datasets were acquired using the ROSIS-03 sensor, with a spectral range of 0.43–0.86 µm and a spatial resolution of 1.3 m. The spatial size of the Pavia dataset is 1096 × 1096 pixels, but only 1096 × 715 effective pixels were selected. The PaviaU dataset has a spatial size of 610 × 340 pixels. To facilitate more effective training and testing, we randomly chose a 144 × 144 patch from the Pavia dataset as the test image and utilized the remaining region as the training image. On the PaviaU dataset, we chose to start from the upper left corner of these images as the point of origin. We then extracted images sized at 144 × 144 for the test set, allocating the remaining segments for the training set, as shown in Figure 5.
To assess the efficacy of our reconstruction algorithm, it is essential to employ objective evaluation metrics. Hence, we employed four quantitative evaluation methods to appraise the network’s reconstruction results. These metrics encompass the peak signal-to-noise ratio (PSNR), the mean peak signal-to-noise ratio (MPSNR), structural similarity (SSIM), and spectral angle mapping (SAM). Their definitions are as follows:
$$\mathrm{MSE} = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} \left( X(i,j) - Y(i,j) \right)^2$$

$$\mathrm{PSNR} = 10 \times \log_{10} \frac{\left(2^{Bits} - 1\right)^2}{\mathrm{MSE}}$$

$$\mathrm{MPSNR} = \frac{10}{B} \sum_{i=1}^{B} \log_{10} \frac{\left(2^{Bits} - 1\right)^2}{\mathrm{MSE}_i}$$

$$\mathrm{SSIM} = \frac{\left(2 \mu_x \mu_y + C_1\right)\left(2 \sigma_{xy} + C_2\right)}{\left(\mu_x^2 + \mu_y^2 + C_1\right)\left(\sigma_x^2 + \sigma_y^2 + C_2\right)}$$

$$\theta\left(z^*, z_h\right) = \cos^{-1}\left( \frac{z_h^T z^*}{\sqrt{z^{*T} z^*}\,\sqrt{z_h^T z_h}} \right)$$

$$\mathrm{SAM} = \frac{1}{H \cdot W} \sum_{i=1}^{H} \sum_{j=1}^{W} \theta\left( X(i,j), Y(i,j) \right)$$

where $\mathrm{MSE}(\cdot)$ denotes the mean square error function; $X$ and $Y$ represent the reconstructed hyperspectral image (HSI) $I_{SR}$ and the real/original hyperspectral image $I_{HR}$, with $i$ and $j$ indexing the height $H$ and width $W$ of the HSI; $Bits$ signifies the pixel depth of the image, and $B$ denotes the number of spectral bands of the hyperspectral image; $\mu_x$ and $\mu_y$ are the means of $I_{SR}$ and $I_{HR}$, $\sigma_x^2$ and $\sigma_y^2$ are their variances, and $\sigma_{xy}$ is the covariance between $I_{SR}$ and $I_{HR}$; $C_1$ and $C_2$ are constants used to avoid division by zero. The function $\theta(\cdot)$ represents the spectral angle mapping function, and $z^*$ and $z_h$ represent the pixel spectral vectors of $I_{SR}$ and $I_{HR}$, respectively. Lastly, $T$ denotes the vector transpose operation.
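The PSNR, MPSNR, and SAM definitions above translate directly into code; a minimal NumPy sketch is shown below (SSIM is omitted here because a windowed implementation from a standard library is typically used in practice). Array layouts and the default bit depth are assumptions.

```python
import numpy as np

def psnr(x, y, bits=8):
    """PSNR between two single-band images."""
    mse = np.mean((x.astype(np.float64) - y.astype(np.float64)) ** 2)
    return 10 * np.log10(((2 ** bits - 1) ** 2) / mse)

def mpsnr(x, y, bits=8):
    """Mean of per-band PSNR values; x and y have shape (bands, H, W)."""
    return float(np.mean([psnr(x[b], y[b], bits) for b in range(x.shape[0])]))

def sam(x, y, eps=1e-12):
    """Mean spectral angle (radians) over all pixels; x and y have shape (bands, H, W)."""
    xf = x.reshape(x.shape[0], -1).astype(np.float64)
    yf = y.reshape(y.shape[0], -1).astype(np.float64)
    cos = (xf * yf).sum(0) / (np.linalg.norm(xf, axis=0) * np.linalg.norm(yf, axis=0) + eps)
    return float(np.mean(np.arccos(np.clip(cos, -1.0, 1.0))))
```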

4.2. Implementation Details

The data preparation, network training, and testing described in this study were conducted in MATLAB, Python 3.8, and PyTorch 1.10 environments. The hardware consisted of four NVIDIA RTX 3060 graphics cards.
The number of feature channels (C) was configured as 64. In the CSSFEM module, the reiteration count (n) was set to 5. This signifies that there were five iterations of CSSFEM, which executed non-linear mapping learning across spatial and spectral dimensions through dense connections. This strategy was employed to comprehensively extract image features.
For network training, the mean absolute error (MAE) loss function was used as the primary loss term. The Adam optimizer was employed as the optimization algorithm, with first- and second-order momentum exponential decay rates of 0.9 and 0.999, respectively, a numerical-stability constant of $10^{-8}$, and a step size of 0.001. The batch size of the network was set to 4, the initial learning rate was 5 × 10−4, and the learning rate was halved every 25 epochs.
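A sketch of the optimizer and learning-rate schedule described above; the placeholder module stands in for CSSFENet, and the learning rate follows the reported value of 5 × 10−4.

```python
import torch

model = torch.nn.Conv3d(1, 1, 3, padding=1)          # placeholder for CSSFENet
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4,
                             betas=(0.9, 0.999), eps=1e-8)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=25, gamma=0.5)
# per epoch: train, then call scheduler.step(); the rate is halved every 25 epochs
```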

4.3. Comparison and Analysis

4.3.1. Ablation Study of 3D-FEC

Figure 2 illustrates four methods to enhance the upper rank limit of the feature map by modifying the network topology, specifically 3D-FEC(b), 3D-FEC(c), 3D-FEC(d), and 3D-FEC(e). As presented in Table 2 and Figure 6, we performed a comparative analysis of the parameter counts between standard 3D convolution and each of these four structures. We also examined how these structures affected the network’s final performance using the PaviaU dataset with a scale factor of 2.
Among these indexes, higher values for PSNR, MPSNR, and SSIM and lower values for the SAM indicate better-reconstructed images. It is worth noting that the 3D-FEC-b and 3D-FEC-c contained an additional 1 × 1 × 1 convolution for feature fusion after the 3D-FEC convolution, leading to larger parameter sizes compared to standard 3D convolution.
From Table 2, we observed that the 3D-FEC-c and 3D-FEC-e convolutions increased the rank upper limit of the feature maps to $9N$ and $6N$, respectively. Although they share similar modes and receptive fields with the 3D-FEC-b and 3D-FEC-d convolutions, their improvements were primarily in SSIM and SAM, while PSNR and MPSNR were not enhanced as much. In summary, a higher rank upper limit offers limited performance improvement over a $3N$ rank upper limit but increases the network parameters. Consequently, this paper opted for the 3D-FEC-d convolution structure as the 3D Feature Extraction Convolution (3D-FEC). This structure involves one-dimensional convolution along the three dimensions separately, allowing the learning of diverse spatial and spectral feature information while reducing the network parameters.

4.3.2. Ablation Study of SSFFM

As shown in Table 3 and Figure 7, the S-1D-2D and S-three-1D approaches effectively reduced the network parameters but resulted in lower image reconstruction quality than P-1D-2D; their PSNR values were 0.40 dB and 0.24 dB lower than that of P-1D-2D, respectively.

4.3.3. Ablation Study of CSSFEM

Notably, the core module of the network, the Feature Diversity Enhancement Module (FDEM), was not included in the ablation experiment comparison. As shown in Table 4, the results revealed that each module contributed to the enhanced quality of the reconstructed network image. Specifically, the 3D-FEC convolution primarily improved feature extraction, leading to enhanced performance in the relevant indicators. The Spatial–Spectral Feature Fusion Module (SSFFM) was designed for spatial and spectral dimension feature extraction, demonstrating the most pronounced improvement in the SAM index and substantial effects on the other three indexes. This underlines its effectiveness in the extraction of spectral and spatial features in reconstructed images. The channel attention mechanism enhances network performance by selecting and amplifying the influence of pertinent channels.

4.3.4. Ablation Study of the Number of CSSFEMs

As shown in Table 5 and Figure 8, the number of CSSFEMs significantly impacted the image reconstruction quality. Having too many modules, such as 10 or 15, increased the number of model parameters, while too few modules, such as 3 or 4, weakened the network's learning ability. After comparing the different module counts in Table 5, the optimal configuration was obtained with five CSSFEM modules. Under this setup, image reconstruction quality, specifically in terms of PSNR and MPSNR, was at its best; SSIM was only 0.0003 lower than the optimal setting, and SAM ranked second with a difference of 0.042. Importantly, this configuration keeps the number of parameters low. Therefore, in this paper, five CSSFEMs were applied for the nonlinear mapping of deep features through dense connections.

4.3.5. Ablation Study of Loss Function

As shown in Table 6, incorporating the feature diversity loss function $L_{FD}$ and the spatial–spectral gradient loss function $L_{SSG}$ into the original loss function enhanced the final reconstruction results. This improvement was evident in an increase of approximately 0.10 dB in the PSNR and MPSNR indexes, a 0.002 boost in the SSIM index, and an improvement of 0.006 in the SAM index.
The parameters $\alpha$ and $\beta$ were chosen to maintain the relative magnitude relationship between the different loss terms. For instance, during the initial training phase, the value of $L_{FD}$ was about 800 times greater than that of the MAE loss; therefore, $\alpha$ was set to $10^{-3}$, and $\beta$ was chosen similarly to maintain this proportional relationship.

4.3.6. Comparison Study of CSSFENet

As shown in Table 7, to compare with the latest algorithm, PDENet, we computed the average metric values of CSSFENet over the three datasets at scale factors of 2, 4, and 8. For scale factors of 2, 4, and 8, CSSFENet exhibited average gains of 0.41, 0.20, and 0.10 dB in PSNR; 0.002, 0.008, and 0.006 in SSIM; and 0.48, 0.22, and 0.07 dB in MPSNR, together with average improvements of 0.09, 0.07, and 0.08 in SAM, respectively.
From the perspective of subjective visual effects, per Figure 9, Figure 10 and Figure 11, an analysis of the reconstructed images of the Pavia, CAVE, and PaviaU datasets at magnifications of 2, 4, and 8 revealed that the final reconstructed images obtained using CSSFENet were better than those obtained using the other algorithms, with improvements in image texture detail and clarity. Moreover, in the residual images, the color representation of CSSFENet was closer to dark blue, indicating a higher similarity to the original image.

5. Conclusions

Based on the diversified extraction of hyperspectral image features, this study proposed 3D Feature Extraction Convolution (3D-FEC) by mathematically modeling image feature diversity and changing the topological structure of standard 3D convolution, so as to increase the upper limit of the feature map rank and thus improve the ability to extract diverse features. On this basis, the Feature Diversity Enhancement Module (FDEM) was proposed to enable the network to extract the features of hyperspectral images more fully. To improve feature diversity extraction as much as possible without increasing the number of parameters, the spectral and spatial features of hyperspectral images are fused, and the Spatial–Spectral Feature Fusion Module (SSFFM) was proposed for this purpose. The FDEM was then combined with the SSFFM and the channel attention mechanism to enhance the proportion of channels that benefit the reconstruction results, yielding the Channel-Attention-Based Spatial–Spectral Feature Extraction Module (CSSFEM) for the extraction of deep image features. Finally, in terms of loss functions, the feature diversity loss ($L_{FD}$) was introduced to enhance the independence of each element and thus further strengthen the expression of feature diversity, and the spatial–spectral gradient loss ($L_{SSG}$) was introduced to increase the smoothness of the reconstructed hyperspectral images. However, this study did not conduct in-depth research on the weighting coefficients of $L_{FD}$ and $L_{SSG}$; in future work, we will investigate how to match the two loss functions more finely to improve the reconstruction ability of the network.

Author Contributions

Conceptualization, J.Z. and R.Z.; methodology, J.Z. and R.Z.; software, J.Z., R.Z., Z.W. and X.Z.; validation, R.Z., Z.W., X.Z., R.G., Y.W. and Y.Y.; formal analysis, J.Z.; investigation, R.Z., Z.W. and X.Z.; resources, J.Z.; data curation, R.Z., Z.W. and X.Z.; writing—original draft preparation, R.Z.; writing—review and editing, J.Z. and R.Z.; visualization, R.Z., Z.W. and X.Z.; supervision, Y.L.; project administration, J.Z. and Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China under Grant 62371362 and by the General Project of the Key R&D Plan of Shaanxi Province under Grant 2022GY-060.

Data Availability Statement

Data are contained within the article. The data are not publicly available due to privacy.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Huang, H.; Yu, J.; Sun, W. Super-resolution mapping via multi-dictionary based sparse representation. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 4–9 May 2014; pp. 3523–3527. [Google Scholar]
  2. Huang, H.; Christodoulou, A.G.; Sun, W. Super-resolution hyperspectral imaging with unknown blurring by low-rank and group-sparse modeling. In Proceedings of the 2014 IEEE International Conference on Image Processing (ICIP), Paris, France, 27–30 October 2014; pp. 2155–2159. [Google Scholar]
  3. Gou, S.; Liu, S.; Yang, S.; Jiao, L. Remote sensing image super-resolution reconstruction based on nonlocal pairwise dictionaries and double regularization. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2014, 7, 4784–4792. [Google Scholar] [CrossRef]
  4. Li, J.; Yuan, Q.; Shen, H.; Meng, X.; Zhang, L. Hyperspectral image super-resolution by spectral mixture analysis and spatial—Spectral group sparsity. IEEE Geosci. Remote Sens. Lett. 2016, 13, 1250–1254. [Google Scholar] [CrossRef]
  5. Yuan, Y.; Zheng, X.; Lu, X. Hyperspectral image super-resolution by transfer learning. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2017, 10, 1963–1974. [Google Scholar] [CrossRef]
  6. Hu, J.; Jia, X.; Li, Y.; He, G.; Zhao, M. Hyperspectral image super-resolution via intrafusion network. IEEE Trans. Geosci. Remote Sens. 2020, 58, 7459–7471. [Google Scholar] [CrossRef]
  7. Li, J.; Cui, R.; Li, B.; Song, R.; Li, Y.; Dai, Y.; Du, Q. Hyperspectral image super-resolution by band attention through adversarial learning. IEEE Trans. Geosci. Remote Sens. 2020, 58, 4304–4318. [Google Scholar] [CrossRef]
  8. Xie, S.; Sun, C.; Huang, J.; Tu, Z.; Murphy, K. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 305–321. [Google Scholar]
  9. Li, Q.; Wang, Q.; Li, X. Mixed 2D/3D convolutional network for hyperspectral image super-resolution. Remote Sens. 2020, 12, 1660. [Google Scholar] [CrossRef]
  10. Li, Q.; Wang, Q.; Li, X. Exploring the relationship between 2D/3D convolution for hyperspectral image super-resolution. IEEE Trans. Geosci. Remote Sens. 2021, 59, 8693–8703. [Google Scholar] [CrossRef]
  11. Liu, D.; Li, J.; Yuan, Q. A spectral grouping and attention-driven residual dense network for hyperspectral image super-resolution. IEEE Trans. Geosci. Remote Sens. 2021, 59, 7711–7725. [Google Scholar] [CrossRef]
  12. Jiang, J.; Sun, H.; Liu, X.; Ma, J. Learning spatial-spectral prior for super-resolution of hyperspectral imagery. IEEE Trans. Comput. Imaging 2020, 6, 1082–1096. [Google Scholar] [CrossRef]
  13. Li, J.; Zheng, K.; Li, Z.; Gao, L.; Jia, X. X-Shaped Interactive Autoencoders with Cross-Modality Mutual Learning for Unsupervised Hyperspectral Image Super-Resolution. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5518317. [Google Scholar] [CrossRef]
  14. Gao, L.; Li, J.; Zheng, K.; Jia, X. Enhanced Autoencoders with Attention-Embedded Degradation Learning for Unsupervised Hyperspectral Image Super-Resolution. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5509417. [Google Scholar] [CrossRef]
  15. Li, J.; Zheng, K.; Liu, W.; Li, Z.; Yu, H.; Ni, L. Model-Guided Coarse-to-Fine Fusion Network for Unsupervised Hyperspectral Image Super-Resolution. IEEE Geosci. Remote Sens. Lett. 2023, 20, 5508605. [Google Scholar] [CrossRef]
  16. Zhang, J.; Shao, M.; Wan, Z.; Li, Y. Multiscale Feature Mapping Network for Hyperspectral Image Super-Resolution. Remote Sens. 2021, 13, 4180. [Google Scholar] [CrossRef]
  17. Zhang, J.; Zheng, R.; Chen, X.; Hong, Z.; Li, Y.; Lu, R. Spectral Correlation and Spatial High–Low Frequency Information of Hyperspectral Image Super-Resolution Network. Remote Sens. 2023, 15, 2472. [Google Scholar] [CrossRef]
  18. Ju, Y.; Jian, M.; Wang, C.; Zhang, C.; Dong, J.; Lam, K.-M. Estimating High-resolution Surface Normals via Low-resolution Photometric Stereo Images. IEEE Trans. Circuits Syst. Video Technol. 2023; early access. [Google Scholar] [CrossRef]
  19. Chen, Z.; Zhang, Y.; Gu, J.; Kong, L.; Yang, X.; Yu, F. Dual Aggregation Transformer for Image Super-Resolution. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 4–6 October 2023; pp. 12278–12287. [Google Scholar] [CrossRef]
  20. Chen, Y.; Xia, R.; Yang, K.; Zou, K. MFFN: Image super-resolution via multi-level features fusion network. Vis. Comput. 2023, 1–16. [Google Scholar] [CrossRef]
  21. Zhang, L.; Zhang, Q.; Du, B.; Huang, X.; Tang, Y.Y.; Tao, D. Simultaneous spectral-spatial feature selection and extraction for hyperspectral images. IEEE Trans. Cybern. 2016, 48, 16–28. [Google Scholar] [CrossRef]
  22. Heide, F.; Heidrich, W.; Wetzstein, G. Fast and flexible convolutional sparse coding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 5135–5143. [Google Scholar]
  23. Yanai, K.; Tanno, R.; Okamoto, K. Efficient mobile implementation of a cnn-based object recognition system. In Proceedings of the 24th ACM International Conference on Multimedia, Amsterdam, The Netherlands, 15–19 October 2016; pp. 362–366. [Google Scholar]
  24. Denil, M.; Shakibi, B.; Dinh, L.; Ranzato, M.A.; De Freitas, N. Predicting parameters in deep learning. Adv. Neural Inf. Process. Syst. 2013, 26. [Google Scholar]
  25. Shang, W.; Sohn, K.; Almeida, D.; Lee, H. Understanding and improving convolutional neural networks via concatenated rectified linear units. In Proceedings of the International Conference on Machine Learning, PMLR, New York, NY, USA, 20–22 June 2016; pp. 2217–2225. [Google Scholar]
  26. Lin, M.; Ji, R.; Wang, Y.; Zhang, Y.; Zhang, B.; Tian, Y.; Shao, L. Hrank: Filter pruning using high-rank feature map. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1529–1538. [Google Scholar]
  27. Kim, J.; Lee, J.K.; Lee, K.M. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1646–1654. [Google Scholar]
  28. Lim, B.; Son, S.; Kim, H.; Nah, S.; Mu Lee, K. Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 136–144. [Google Scholar]
  29. Hou, J.; Zhu, Z.; Hou, J.; Zeng, H.; Wu, J.; Zhou, J. Deep posterior distribution-based embedding for hyperspectral image super-resolution. IEEE Trans. Image Process. 2022, 31, 5720–5732. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Overall architecture of the Channel-Attention-Based Spatial–Spectral Feature Extraction Network (CSSFENet).
Figure 2. The flattening of different convolution combinations; N denotes the upper bound on the matrix rank. (a) The 3D convolution kernel is flattened into a two-dimensional matrix whose rank is at most N. (b) 3D convolution realized with 2D kernels of size 3 × 3 applied along the three dimensions. (c) Three kernels of size 3 applied in each of the three dimensions. (d) A single kernel of size 3 applied in each dimension. (e) Two kernels of size 3 applied in each dimension.
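The rank intuition in Figure 2 can be checked numerically. The snippet below is a minimal NumPy sketch of the flattening idea (ours, not the paper's Equation (4), whose exact form is given in the main text): it reshapes a single 3 × 3 × 3 kernel into a 3 × 9 matrix and compares the rank of a generic kernel with that of a separable kernel built from three 1D filters, showing how separability lowers the attainable rank.

```python
import numpy as np

# Minimal illustration of the Figure 2 rank bound, not the paper's Equation (4):
# flatten a single 3x3x3 kernel into a 3x9 matrix and compare ranks.
rng = np.random.default_rng(0)
k = 3

generic = rng.standard_normal((k, k, k))
print(np.linalg.matrix_rank(generic.reshape(k, k * k)))    # 3: the upper bound N for a dense kernel

u, v, w = rng.standard_normal((3, k))                      # three 1-D filters
separable = np.einsum('i,j,l->ijl', u, v, w)               # their outer product: a separable 3-D kernel
print(np.linalg.matrix_rank(separable.reshape(k, k * k)))  # 1: separability collapses the rank
```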
Figure 3. (a) Architecture of the Channel-Attention-Based Spatial–Spectral Feature Extraction Module (CSSFEM). (b) Architecture of the Spatial–Spectral Feature Fusion Module (SSFFM). (c) Architecture of the Feature Diversity Enhancement Module (FDEM).
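As a concrete reference point for the "CA" path in Figure 3a, the sketch below shows a generic squeeze-and-excitation-style channel attention block operating on 5-D hyperspectral feature tensors. It is only one common realization of channel attention, not the module's exact layout; the channel count and reduction ratio are assumptions.

```python
import torch
import torch.nn as nn

class ChannelAttention3D(nn.Module):
    """Generic squeeze-and-excitation-style channel attention for a 5-D feature
    tensor shaped (batch, channels, bands, height, width). This is a common
    realization of a channel-attention path, not the exact design in Figure 3a."""
    def __init__(self, channels: int = 64, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool3d(1)                 # squeeze: one statistic per channel
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c = x.shape[:2]
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1, 1)
        return x * w                                        # excitation: reweight channels

print(ChannelAttention3D()(torch.randn(1, 64, 31, 16, 16)).shape)  # torch.Size([1, 64, 31, 16, 16])
```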
Figure 4. (a) Standard 3D convolution. (b) Sequential concatenation of 1D–2D convolution kernels (S-1D-2D). (c) Sequential connection of three 1D convolution kernels (S-three-1D). (d) Parallel joining of 1D–2D convolution kernels (P-1D-2D).
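A schematic PyTorch sketch of the parallel 1D–2D combination in Figure 4d is given below. The branch shapes (a 3 × 1 × 1 spectral kernel and a 1 × 3 × 3 spatial kernel) and the fusion by summation are our assumptions for illustration, not the paper's exact configuration; the parameter comparison against a dense 3 × 3 × 3 convolution follows directly from the kernel sizes.

```python
import torch
import torch.nn as nn

class P1D2D(nn.Module):
    """Schematic parallel 1D-2D block in the spirit of Figure 4d: a spectral
    branch (3x1x1) and a spatial branch (1x3x3) act on a 5-D tensor shaped
    (batch, channels, bands, height, width) and are merged by addition.
    Kernel shapes and additive fusion are illustrative assumptions."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.spectral = nn.Conv3d(channels, channels, (3, 1, 1), padding=(1, 0, 0), bias=False)
        self.spatial = nn.Conv3d(channels, channels, (1, 3, 3), padding=(0, 1, 1), bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.spectral(x) + self.spatial(x)

block = P1D2D(64)
dense = nn.Conv3d(64, 64, 3, padding=1, bias=False)      # standard 3x3x3 3-D convolution
x = torch.randn(1, 64, 31, 32, 32)
print(block(x).shape)                                    # torch.Size([1, 64, 31, 32, 32])
print(sum(p.numel() for p in dense.parameters()),        # 110592 weights (64*64*27)
      sum(p.numel() for p in block.parameters()))        # 49152 weights (64*64*3 + 64*64*9)
```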
Figure 5. Partitioning of the training and test sets for the three datasets.
Figure 6. Bar chart depicting the results from various ablation studies on 3D-FEC.
Figure 7. Bar chart depicting the results from various ablation studies on different SSFEMs.
Figure 8. Bar chart depicting the results from the ablation study on the number of CSSFEMs.
Figure 9. Comparison of subjective visual effects on the Pavia dataset.
Figure 10. Comparison of subjective visual effects on the PaviaU dataset.
Figure 11. Comparison of subjective visual effects on the CAVE dataset.
Table 1. Parameters and FLOPs of different 3D-FECs on the Pavia dataset with a scale factor of 2. The 3D convolution process can be expressed in the form of 2D matrix multiplication [22,23], as shown in Equation (4).
Variant | Parameters | FLOPs | Ratio
3D | 1.10 × 10^5 | 1.38 × 10^7 | 100%
3D-FEC-b | 1.10 × 10^5 | 1.38 × 10^7 | 100%
3D-FEC-c | 1.10 × 10^5 | 1.38 × 10^7 | 100%
3D-FEC-d | 3.69 × 10^4 | 4.61 × 10^6 | 33.3%
3D-FEC-e | 7.38 × 10^4 | 9.22 × 10^6 | 66.7%
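The ratios in Table 1 follow from counting kernel weights per channel pair: a dense 3 × 3 × 3 kernel has 27 weights, one size-3 kernel per dimension has 9 (33.3%), and two per dimension have 18 (66.7%), while the variants in rows b and c keep 27 weights. The short check below assumes 64 input and 64 output channels, which reproduces the table's absolute figures; the FLOP counts scale by the same factors.

```python
# Kernel-weight counts behind Table 1, assuming 64 input and 64 output channels
# (an assumption chosen because it reproduces the table's absolute numbers).
c_in = c_out = 64

dense_3d = c_in * c_out * 3 * 3 * 3      # one 3x3x3 kernel per channel pair: 110592 ~ 1.10e5
fec_b    = c_in * c_out * 3 * (3 * 3)    # three 3x3 2-D kernels (3D-FEC-b): also 110592, ratio 100%
fec_d    = c_in * c_out * (3 + 3 + 3)    # one size-3 1-D kernel per dimension (3D-FEC-d): 36864 ~ 3.69e4
fec_e    = 2 * fec_d                     # two size-3 kernels per dimension (3D-FEC-e): 73728 ~ 7.38e4

print(fec_d / dense_3d, fec_e / dense_3d)  # 0.333... and 0.666..., i.e., 33.3% and 66.7%
```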
Table 2. Ablation study results evaluating the efficiency of different 3D-FECs on the Pavia dataset with a scale factor of 2.
Variant | Parameters | PSNR | MPSNR | SSIM | SAM
Standard 3D | 3.64M | 35.029 | 35.369 | 0.9536 | 3.115
3D-FEC-b | 3.76M | 35.424 | 35.676 | 0.9556 | 3.090
3D-FEC-c | 3.98M | 35.463 | 35.714 | 0.9606 | 3.047
3D-FEC-d | 2.21M | 35.498 | 35.742 | 0.9603 | 3.049
3D-FEC-e | 3.16M | 35.435 | 35.688 | 0.9605 | 3.091
Table 3. Ablation study results evaluating the efficiency of the different SSFEMs on the Pavia dataset with a scale factor of 2.
Variant | Parameters | PSNR | MPSNR | SSIM | SAM
Standard 3D | 3.64M | 35.029 | 35.369 | 0.9502 | 3.115
S-1D-2D | 1.98M | 35.102 | 35.296 | 0.9509 | 3.098
S-three-1D | 2.16M | 35.259 | 35.371 | 0.9508 | 3.092
P-1D-2D | 2.21M | 35.498 | 35.742 | 0.9603 | 3.049
Table 4. Ablation study results evaluating the efficiency of the CSSFEM on the Pavia dataset with a scale factor of 2.
FDEM | 3D-FEC | SSFFM | CA | PSNR | MPSNR | SSIM | SAM
 |  |  |  | 35.165 | 35.437 | 0.9510 | 3.134
 |  |  |  | 35.231 | 35.491 | 0.9546 | 3.106
 |  |  |  | 35.425 | 35.681 | 0.9582 | 3.067
 |  |  |  | 35.498 | 35.742 | 0.9603 | 3.049
Table 5. Ablation study results evaluating the efficiency of the number of CSSFEMs on the Pavia dataset with a scale factor of 2.
Number | Parameters | PSNR | MPSNR | SSIM | SAM
3 | 1.21M | 35.845 | 35.369 | 0.9532 | 3.579
4 | 1.41M | 35.941 | 35.427 | 0.9540 | 3.561
5 | 1.61M | 35.991 | 35.522 | 0.9544 | 3.542
6 | 1.81M | 35.962 | 35.493 | 0.9547 | 3.546
7 | 2.03M | 35.932 | 35.457 | 0.9538 | 3.559
10 | 2.68M | 35.945 | 35.473 | 0.9543 | 3.493
15 | 2.91M | 35.664 | 35.185 | 0.9507 | 3.605
Table 6. Ablation study results evaluating the efficiency of the loss function on the Pavia dataset with a scale factor of 2.
3D-FEC | SSFFM | CA | PSNR | MPSNR | SSIM | SAM
 |  |  | 35.498 | 35.742 | 0.9603 | 3.049
 |  |  | 35.561 | 35.771 | 0.9617 | 3.045
 |  |  | 35.591 | 35.819 | 0.9620 | 3.043
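For illustration, a spatial–spectral gradient loss of the kind ablated in Table 6 is often written as the L1 distance between first-order differences of the reconstruction and the reference along the band, height, and width axes. The sketch below is a generic formulation under that assumption, not the paper's exact loss term.

```python
import torch

def spatial_spectral_gradient_loss(sr: torch.Tensor, hr: torch.Tensor) -> torch.Tensor:
    """Generic spatial-spectral gradient loss for cubes shaped (batch, bands, H, W):
    L1 distance between first-order differences of the reconstruction and the
    reference along the spectral and the two spatial axes. A common formulation,
    not necessarily the loss defined in this paper."""
    loss = torch.zeros((), dtype=sr.dtype)
    for dim in (1, 2, 3):  # spectral, vertical, horizontal differences
        loss = loss + torch.mean(torch.abs(torch.diff(sr, dim=dim) - torch.diff(hr, dim=dim)))
    return loss

sr = torch.rand(2, 31, 32, 32, requires_grad=True)  # toy reconstruction
hr = torch.rand(2, 31, 32, 32)                      # toy reference
print(spatial_spectral_gradient_loss(sr, hr))
```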
Table 7. Quantitative evaluation of different hyperspectral image SR algorithms on the CAVE, Pavia, and PaviaU datasets.
Scale Factor | Algorithm | CAVE (PSNR/MPSNR/SSIM/SAM) | Pavia (PSNR/MPSNR/SSIM/SAM) | PaviaU (PSNR/MPSNR/SSIM/SAM)
×2 | Bicubic | 40.330/39.500/0.9820/3.311 | 32.406/31.798/0.9036/4.370 | 30.509/30.497/0.9055/3.816
 | VDSR | 44.456/43.531/0.9895/2.866 | 35.392/34.879/0.9501/3.689 | 33.988/34.038/0.9524/3.258
 | EDSR | 45.151/44.207/0.9907/2.606 | 35.160/34.580/0.9452/3.898 | 33.943/33.985/0.9511/3.334
 | MCNet | 45.878/44.913/0.9913/2.588 | 35.124/34.626/0.9455/3.865 | 33.695/33.743/0.9502/3.359
 | MSDformer | 45.985/46.007/0.9915/2.551 | 35.557/35.028/0.9493/3.691 | 34.115/34.159/0.9553/3.211
 | MSFMNet | 46.015/45.039/0.9917/2.497 | 35.678/35.200/0.9506/3.656 | 34.807/34.980/0.9582/3.160
 | AS3ITransUNet | 46.294/45.199/0.9925/2.519 | 35.728/35.221/0.9511/3.612 | 35.004/35.163/0.9591/3.149
 | PDENet | 46.348/45.278/0.9926/2.527 | 35.731/35.244/0.9519/3.595 | 35.115/35.275/0.9594/3.142
 | Ours | 46.757/45.773/0.9931/2.408 | 35.991/35.522/0.9544/3.542 | 35.688/35.924/0.9625/3.038
×4 | Bicubic | 34.616/33.657/0.9388/4.784 | 26.596/26.556/0.7091/7.553 | 29.061/29.197/0.7322/5.248
 | VDSR | 37.027/36.045/0.9591/4.297 | 28.328/28.317/0.7707/6.514 | 29.761/29.904/0.7753/4.997
 | EDSR | 38.117/37.137/0.9626/4.132 | 28.649/28.591/0.7782/6.573 | 29.795/29.894/0.7791/5.074
 | MCNet | 38.589/37.679/0.9690/3.682 | 28.791/28.756/0.7826/6.385 | 29.889/29.993/0.7835/4.917
 | MSDformer | 38.699/37.702/0.9692/3.671 | 28.864/28.810/0.7833/5.897 | 29.998/30.098/0.7905/4.885
 | MSFMNet | 38.733/37.814/0.9697/3.676 | 28.920/28.873/0.7863/6.300 | 30.140/30.283/0.7948/4.861
 | AS3ITransUNet | 39.003/39.112/0.9715/3.685 | 28.999/28.874/0.7893/5.972 | 30.155/30.289/0.7940/4.859
 | PDENet | 39.248/38.343/0.9723/3.670 | 29.004/28.951/0.7900/5.876 | 30.171/30.295/0.7944/4.853
 | Ours | 39.389/38.513/0.9728/3.523 | 29.121/29.054/0.7961/5.816 | 30.486/30.689/0.8107/4.839
×8 | Bicubic | 30.554/29.484/0.8657/6.431 | 24.464/24.745/0.4899/7.648 | 26.699/26.990/0.5936/7.179
 | VDSR | 32.184/31.210/0.8852/5.747 | 24.526/24.804/0.4944/7.588 | 26.737/27.028/0.5962/7.133
 | EDSR | 33.416/32.337/0.9002/5.409 | 24.854/25.067/0.5282/7.507 | 27.182/27.467/0.6302/6.678
 | MCNet | 33.607/32.520/0.9125/5.172 | 24.877/25.096/0.5391/7.429 | 27.201/27.483/0.6254/6.683
 | MSDformer | 33.641/32.511/0.9131/5.102 | 25.014/25.215/0.5462/7.427 | 27.291/27.323/0.6341/6.668
 | MSFMNet | 33.675/32.599/0.9136/5.084 | 25.027/25.257/0.5464/7.449 | 27.334/27.586/0.6356/6.615
 | AS3ITransUNet | 33.925/32.887/0.9177/4.954 | 25.031/25.258/0.5435/7.417 | 27.421/27.689/0.6413/6.574
 | PDENet | 34.031/33.018/0.9182/4.929 | 25.045/25.288/0.5436/7.402 | 27.477/27.738/0.6457/6.531
 | Ours | 34.109/33.077/0.9199/4.798 | 25.157/25.359/0.5493/7.306 | 27.585/27.825/0.6569/6.505
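The metrics reported in Tables 2–7 follow standard definitions: PSNR over the whole cube, MPSNR as the mean of per-band PSNRs, and SAM as the mean spectral angle over all pixels (taken in degrees here). The NumPy sketch below is a generic implementation of these definitions, not the authors' evaluation code.

```python
import numpy as np

def psnr(ref: np.ndarray, est: np.ndarray, peak: float = 1.0) -> float:
    """Peak signal-to-noise ratio over an entire cube or band."""
    mse = np.mean((ref - est) ** 2)
    return float(10.0 * np.log10(peak ** 2 / mse))

def mpsnr(ref: np.ndarray, est: np.ndarray, peak: float = 1.0) -> float:
    """Mean of per-band PSNRs for a cube shaped (bands, H, W)."""
    return float(np.mean([psnr(ref[b], est[b], peak) for b in range(ref.shape[0])]))

def sam(ref: np.ndarray, est: np.ndarray, eps: float = 1e-8) -> float:
    """Spectral angle mapper in degrees, averaged over all pixels."""
    r = ref.reshape(ref.shape[0], -1)   # (bands, pixels)
    e = est.reshape(est.shape[0], -1)
    cos = np.sum(r * e, axis=0) / (np.linalg.norm(r, axis=0) * np.linalg.norm(e, axis=0) + eps)
    return float(np.degrees(np.mean(np.arccos(np.clip(cos, -1.0, 1.0)))))

gt = np.random.rand(31, 64, 64)                 # toy ground-truth cube
sr = gt + 0.01 * np.random.randn(*gt.shape)     # toy reconstruction
print(psnr(gt, sr), mpsnr(gt, sr), sam(gt, sr))
```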