A Joint Network of Edge-Aware and Spectral–Spatial Feature Learning for Hyperspectral Image Classification

Hyperspectral image (HSI) classification is a vital part of the HSI application field. Since HSIs contain rich spectral information, effectively extracting deep representation features is a major challenge. In existing methods, although edge data augmentation is used to strengthen the edge representation, a large amount of high-frequency noise is also introduced at the edges. In addition, the importance of different spectra for classification decisions has not been emphasized. Responding to the above challenges, we propose an edge-aware and spectral–spatial feature learning network (ESSN). ESSN contains an edge feature augment block and a spectral–spatial feature extraction block. Firstly, in the edge feature augment block, the edges of the image are sensed, and the edge features of different spectral bands are adaptively strengthened. Then, in the spectral–spatial feature extraction block, the weights of different spectra are adaptively adjusted, and more comprehensive deep representation features are extracted on this basis. Extensive experiments on three publicly available hyperspectral datasets indicate that the proposed method has higher accuracy and stronger immunity to interference than state-of-the-art (SOTA) methods.


Introduction
HSIs are generated by hyperspectral sensors that capture the reflection of an object in multiple consecutive spectral bands, yielding complete spectral information for each pixel. Like RGB images, HSIs contain information about the shape, texture, and structure of objects [1], but they additionally carry a large amount of waveband information, which allows identification and differentiation of substances with similar colors but different spectral characteristics. Thus, HSIs are widely used in scientific and industrial applications that require precise substance identification and analysis, such as medical imaging and diagnosis [2], geological and mineral exploration [3], environmental protection [4], agricultural crop monitoring [5], food safety monitoring [6], and military reconnaissance and security [7]. To fully exploit the value of HSIs, many subtasks have been derived, such as classification [8,9], target detection [10–12], and unmixing [13–15]. Among these tasks, the land cover classification task has received extensive attention.
When classifying objects in HSIs, the phenomenon that "the spectra of the same object may be different and the spectra of different objects may be the same" [16] means that it is not feasible to simply apply the methods used for RGB image classification to HSI classification. To address this challenge, researchers around the world have proposed various approaches, such as principal component analysis (PCA) [17], the Bayesian estimation method [18], SVM [19,20], and k-means clustering [21,22].
However, with the breakthrough of deep learning, convolutional neural networks (CNNs) have gradually replaced traditional HSI classification methods due to their stronger generalization ability and deep feature characterization, and in recent years CNNs have developed rapidly in the field of HSI classification. For example, Hu et al. used a 1-D CNN [23] to extract spectral information. However, for the HSI classification task, spectral information alone is not enough to obtain accurate classification results. Therefore, Zhao et al. proposed a 2-D CNN [24] to extract spatial features. However, neither 1-D CNNs nor 2-D CNNs fully utilize the 3-D characteristics of HSIs. Thus, Chen et al. applied a 3-D CNN [25] to the field of HSI classification in order to fuse the spatial–spectral features of HSIs, and the experimental results showed that the performance of the model was improved. Based on these experiments, many researchers have proposed hybrid convolutional methods [26–32]. Among them, Roy et al. proposed HybridSN [29] with a linear structure; HybridSN contains three sequentially connected 3D convolutional layers for fusing spatial and spectral information and one 2D convolutional layer for extracting local spatial features. In addition, Zhong et al. proposed SSRN [32], which introduces residual connections between convolutional blocks to promote backpropagation of the gradient. However, these convolution-based methods are limited by the convolutional kernel, which can only learn information within its coverage, thus restricting the representation of global features.
However, the characteristics of HSIs, especially the importance of edge features between different classes and spectral bands in the classification process, are not fully considered in these methods that mix CNNs and ViTs. To enhance edge features, edge data augmentation methods are often employed. Traditional image edge data augmentation methods usually apply edge detection operators (e.g., Laplacian, Canny, Sobel) [50] directly on the original image to obtain the edge information, which is then superimposed on the original image for subsequent model training. However, in HSI classification, because the boundaries of the same object may differ across spectral bands, processing the original data in this way introduces a large amount of noise, which degrades subsequent classification performance. In order to minimize the effects of superimposed noise, Tu et al. applied edge-preserving filtering to the edge portion in their proposed MSFE [51] with a pyramidal structure, but MSFE does not take into account the fact that different spectral bands play different roles in the classification process.
Therefore, inspired by the above work, and in order to enhance the image features and weaken the noise interference of the initial HSI, we adopt a dynamic learning approach to obtain the edge information and the decision weights of different spectral bands. Then, we use a mixture of attention mechanisms and CNNs on this basis with the aim of obtaining global spectral-spatial features. Figure 1 shows the proposed edge-aware and spectral-spatial feature extraction network. The network contains two parts: an edge feature augment block and a spectral-spatial feature extraction block. Different from traditional data augmentation, which is not dynamically learnable, our edge feature augment block adaptively learns the degree of edge feature enhancement in different spectral bands, which reduces the high-frequency noise. In addition, in the spectral attention block, we adaptively adjust the weights of different spectral bands for classification and then perform feature extraction on this basis. To sum up, there are three main contributions:
1. We propose a novel feature extraction network (ESSN) with a richer and more efficient representation of edge features and spectral-spatial features compared to existing networks;
2. We design a novel edge feature augment block. The block consists of an edge-aware part and a dynamic adjustment part. Compared with edge data augmentation methods that are not dynamically learnable, this block greatly reduces edge distortion and noise amplification;
3. We propose a spectral-spatial feature extraction block. It contains a spectral attention block, a spatial attention block, and a 3D-2D hybrid convolution block. The spectral attention block and the spatial attention block obtain effective features by enhancing the information favorable for classification and suppressing noise and other interfering information; the convolution block fuses these features.

Figure 1. Framework of the classification process using ESSN. Note that BN and ReLU after each convolution operation have been omitted.
The subsequent sections are organized as follows. Our proposed method is described in Section 2. In Section 3, we describe our experimental environment and make a detailed comparison with other SOTA methods in the same environment. We perform sensitivity analysis experiments and ablation experiments to verify the importance of each part of the model in Section 4. In Section 5, we summarize the paper and suggest directions for model improvement.

Methodology
Figure 1 shows the whole process of HSI classification. It consists of a data preprocessing block, the backbone of the proposed network, and a linear classifier.
A hyperspectral sensor images real objects to produce a hyperspectral image (HSI). Assume the raw HSI is I_raw ∈ R^(H×W×C), where H, W, and C are the height, width, and number of spectral bands of the raw HSI, respectively. In an HSI, each pixel can be represented by the vector X_pixel = (V_1, V_2, ..., V_C), where V_c represents the pixel value in the cth spectral band. Obviously, the greater the number of spectra, the richer the information, but this greatly reduces computational efficiency. Therefore, we adopt the PCA technique to preprocess the HSI data to improve efficiency, keeping the height and width unchanged and reducing the spectral number from C to P. We denote the HSI after PCA dimensionality reduction as I_pca ∈ R^(H×W×P), where P denotes the number of spectra after PCA dimensionality reduction. In order to obtain a suitable input format for the network, we crop the image into pixel-centered patches I_patch ∈ R^(h×w×P), where h, w, and P represent the height, width, and spectral number of the patch, respectively. The data preprocessing block is shown in Figure 1. Note that the same symbols appearing in this section represent the same meaning.
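As a concrete illustration, the preprocessing step above can be sketched in plain NumPy. The choice of P = 30 and the use of an eigendecomposition of the band covariance are illustrative assumptions (the paper does not fix P here); the 15 × 15 patch size matches the experimental setting.

```python
import numpy as np

def pca_reduce(I_raw, P):
    """Reduce the spectral dimension of an HSI cube (H, W, C) to P bands via PCA."""
    H, W, C = I_raw.shape
    X = I_raw.reshape(-1, C).astype(np.float64)
    X = X - X.mean(axis=0)                            # center each band
    cov = np.cov(X, rowvar=False)                     # (C, C) covariance between bands
    eigvals, eigvecs = np.linalg.eigh(cov)            # eigenvalues in ascending order
    top = eigvecs[:, np.argsort(eigvals)[::-1][:P]]   # top-P principal axes
    return (X @ top).reshape(H, W, P)

def extract_patch(I_pca, row, col, h, w):
    """Crop an (h, w, P) patch centered on pixel (row, col), zero-padding the borders."""
    padded = np.pad(I_pca, ((h // 2, h // 2), (w // 2, w // 2), (0, 0)))
    return padded[row:row + h, col:col + w, :]

I_raw = np.random.rand(145, 145, 200)                 # dummy cube with IP dimensions
I_pca = pca_reduce(I_raw, P=30)
patch = extract_patch(I_pca, 72, 72, 15, 15)
print(I_pca.shape, patch.shape)                       # (145, 145, 30) (15, 15, 30)
```

Zero-padding the borders lets every labeled pixel, including those at the image edge, receive a full-sized patch.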
The backbone of ESSN contains both an edge feature augment block and a global spectral-spatial feature extraction block, and we will describe the content of ESSN in as much detail as possible. Finally, we use a linear classifier to determine the class of each pixel.

Edge Feature Augment Block
As shown in Figure 2, the points where the model fails to predict are mostly at the intersections of different categories, which is due on the one hand to the high feature similarity between certain categories, and on the other to possible boundary blurring between categories [52].

Previously, edge data augmentation was usually used to strengthen the edges for the above problems. However, the direct superposition of edge information may produce strong edge noise, leading to confusion of similar categories. Therefore, we propose a novel edge feature augment block, as shown in Figure 1, which can adaptively adjust the model's emphasis on the edges of a region by learning the importance of the edge information in the input data, and personalize the edge information.


Laplacian of Gaussian Operator
The Laplacian of Gaussian operator is generated by the convolution of the Laplace operator and the Gaussian filtering operator. The Laplace operator is particularly sensitive to regions of the image that change abruptly, and therefore performs well in edge-awareness tasks. Because Gaussian noise is prevalent in images captured by electronic devices and seriously affects the accuracy of edge perception, hyperspectral images need to be processed with Gaussian filtering before the edges are perceived. The Gaussian filtering operator and Laplace operator can be expressed by Equations (1) and (2), respectively:

G(x, y) = (1 / (2πσ²)) · exp(−(x² + y²) / (2σ²)) (1)

∇²I = ∂²I/∂x² + ∂²I/∂y² (2)

where x, y denote the spatial coordinate positions of the HSIs, σ is the Gaussian standard deviation, and I represents the value of the pixel on the image. Convolution is associative, so we use the result of convolving the Gaussian filter operator with the Laplace operator as a new edge-aware operator (LoG) and then convolve the image with the LoG to obtain the image edges. The LoG expression is shown in Equation (3):

LoG(x, y) = −(1 / (πσ⁴)) · (1 − (x² + y²) / (2σ²)) · exp(−(x² + y²) / (2σ²)) (3)
Because hyperspectral images are discretely represented, we discretize Equation (3) to obtain an approximate LoG operator for practical use. As shown in Figure 3, we list the LoG operators for the two cases σ < 0.5 and σ = 1.4, respectively. Then, let the result after the edge-aware operator be I_LoG, with the following expression:

I_LoG = DWConv_LoG(I_patch) (4)

where DWConv_LoG(·) indicates depthwise separable convolution with the LoG kernel.
In the edge feature augment block, because of the characteristic that "the spectra of the same object may be different and the spectra of different objects may be the same", strengthening the edge features at the same rate in different spectral bands would generate interference noise, so we design a learnable parameter γ ∈ R^(1×P) for adjusting the degree of feature augmentation in different spectra. We explore the importance of γ in Section 4.2. In order to make the network more flexible and the optimization process smoother and more efficient, we use residual connectivity. The output (I_LoGout) of the module is shown below:

I_LoGout = I_patch + φ(γ) ⊗ I_LoG (5)

where φ(·) indicates the sigmoid activation function, and ⊗ denotes the element-wise product at corresponding positions.
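In code, the residual enhancement with the learnable gate γ might look like the NumPy sketch below. The exact form I_patch + sigmoid(γ) ⊗ I_LoG is our reading of the block's description, not a verbatim reproduction of the paper's implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def edge_augment(I_patch, I_LoG, gamma):
    """Residual edge enhancement: each band's edge map is scaled by a learned
    per-band gate sigmoid(gamma) before being added back to the input."""
    gate = sigmoid(gamma).reshape(1, 1, -1)   # (1, 1, P): broadcasts over space
    return I_patch + gate * I_LoG

P = 30
gamma = np.zeros(P)                           # learnable parameter during training
I_patch = np.random.rand(15, 15, P)
I_LoG = np.random.rand(15, 15, P)
out = edge_augment(I_patch, I_LoG, gamma)
print(out.shape)                              # (15, 15, 30)
```

The sigmoid bounds each band's gate to (0, 1), so the block can at most add the full edge response of a band and at least suppress it almost entirely.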

Spectral-Spatial Feature Extraction Block

Spectral Attention Block
HSIs are rich in spectral information; to make this easier to see, we show the image of each spectral band as a grayscale map in Figure 4. Obviously, the importance of different spectra in the decision-making process differs [53]. Spectral attention helps the model adaptively adjust the weights of different spectra and enhance their representation during the learning process, which helps the model suppress the influence of task-irrelevant spectra.

In the spectral attention block, to strengthen the correlation between encoded and decoded data, we use residual concatenation. Let the input features of the block be I_input ∈ R^(h×w×P); then the results after global maximum pooling and global average pooling are v_max and v_mean, respectively, where global maximum pooling is complementary to global average pooling:

v_max = GMP(I_input) ∈ R^(1×P) (6)

v_mean = GAP(I_input) ∈ R^(1×P) (7)
To reduce the number of parameters, the pooled features are fed into a shared multilayer perceptron (MLP), giving the results h_max and h_mean. Let the rate of dimensionality reduction be r and the weights of the two MLP layers be, in order, W_1 ∈ R^(P×(P/r)) and W_2 ∈ R^((P/r)×P). h_max and h_mean are as follows:

h_max = W_2(W_1 · v_max) (8)

h_mean = W_2(W_1 · v_mean) (9)

Adaptive spectral weights W_h of the input feature map are obtained by adding h_max and h_mean and passing the sum through the sigmoid activation function:

W_h = σ(h_max + h_mean) (10)

where σ(·) is the sigmoid activation function. Finally, let the output of the spectral attention block be I_output ∈ R^(h×w×P); the expression is as follows:

I_output = W_h ⊗ I_input (11)
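A minimal NumPy sketch of this spectral attention follows. The text does not specify a nonlinearity between the two shared MLP layers, so none is used here; that is an assumption of the sketch:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spectral_attention(I_input, W1, W2):
    """Pooled spectral descriptors pass through a shared two-layer MLP;
    their sigmoid-activated sum reweights the P spectral bands."""
    v_max = I_input.max(axis=(0, 1))          # global max pooling, (P,)
    v_mean = I_input.mean(axis=(0, 1))        # global average pooling, (P,)
    h_max = (v_max @ W1) @ W2                 # shared MLP: W1 (P, P//r), W2 (P//r, P)
    h_mean = (v_mean @ W1) @ W2
    W_h = sigmoid(h_max + h_mean)             # adaptive spectral weights in (0, 1)
    return I_input * W_h                      # broadcast over the spatial dimensions

P, r = 30, 6
rng = np.random.default_rng(0)
I_input = rng.random((15, 15, P))
W1 = rng.standard_normal((P, P // r))
W2 = rng.standard_normal((P // r, P))
out = spectral_attention(I_input, W1, W2)
print(out.shape)                              # (15, 15, 30)
```

Since every weight in W_h lies in (0, 1), the block can only attenuate bands relative to the input, which is how task-irrelevant spectra are suppressed.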


Spatial Attention Block
In contrast to traditional convolutional operations, which focus on only a portion of the input data, the spatial attention mechanism [54] can adaptively adjust the area of attention over the global spatial range of the input data and give more importance and weight to these locations during preprocessing, thus improving the recognition accuracy and efficiency of the model.
In Figure 5, the structure of spatial attention in the spatial attention block is illustrated. Considering that the spatial information of the same location may behave differently in different spectral bands, we first fuse the local spatial features by 2D convolution and then project the convolved feature maps to obtain Q, K, and V, respectively:

Q = Conv2D_q(X) (12)

K = Conv2D_k(X) (13)

V = Conv2D_v(X) (14)

Then, the attention map Attn can be calculated as follows:

Attn = softmax(QK^T / √d_k) (15)

where d_k is the dimension of K. Let Output_sa be the output of the network in Figure 5; the expression is as follows:

Output_sa = Attn · V (16)

Finally, we reshape Output_sa to the size of h × w × P for subsequent processing.
In Figure 1, the spatial attention block contains two spatial attention parts; we use two convolutional kernels of different sizes in the two spatial attention parts to enlarge the perceptual region. In addition, in order to strengthen the spatial expression, we use a residual structure.
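The scaled dot-product attention over spatial positions can be sketched as follows. Here Q, K, and V are obtained by plain linear projections rather than the 2D convolutions used in the paper, and the shapes are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_attention(feat, Wq, Wk, Wv):
    """Scaled dot-product attention over all h*w spatial positions.
    feat: (h, w, P); Wq, Wk, Wv: (P, d_k) projection matrices."""
    h, w, P = feat.shape
    X = feat.reshape(h * w, P)                # one token per pixel
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    attn = softmax(Q @ K.T / np.sqrt(d_k))    # (h*w, h*w) attention map
    out = attn @ V
    return out.reshape(h, w, -1), attn

rng = np.random.default_rng(0)
feat = rng.random((15, 15, 30))
Wq, Wk, Wv = rng.standard_normal((3, 30, 30))
out, attn = spatial_attention(feat, Wq, Wk, Wv)
print(out.shape, attn.shape)                  # (15, 15, 30) (225, 225)
```

Each row of the attention map is a probability distribution over all 225 pixel positions, so every output location aggregates information from the whole patch rather than a fixed kernel neighborhood.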

2D-3D Convolution
A 2D convolutional layer can extract spatial features, and a 3D convolutional layer can extract spectral features. Therefore, as shown in Figure 1, both 2D and 3D convolutions are used in the spectral-spatial feature extraction block. The 2D-3D convolution block contains three consecutive 3D convolutional layers with different kernels and one 2D convolutional layer. A detailed description follows.
In the 3D convolutional layer, a single 3D convolution can be regarded as a 3D convolution kernel sliding along the three dimensions (H, W, C). During the convolution process, the spatial and spectral information of neighboring spectra are fused. The value of the nth feature map of the mth layer at spatial location (x, y, z) is as follows:

v_{m,n}^{x,y,z} = φ( b_{m,n} + Σ_{d=1}^{d_{m−1}} Σ_{h=−h_m}^{h_m} Σ_{w=−w_m}^{w_m} Σ_{c=−c_m}^{c_m} ω_{m,n,d}^{h,w,c} · v_{m−1,d}^{x+h, y+w, z+c} ) (17)

where φ(·) is the activation function, and b_{m,n} and ω_{m,n} are the bias parameter and kernel weights corresponding to the nth feature map of the mth layer, respectively. d_{m−1} indicates the number of feature maps in the (m − 1)th layer and the depth of ω_{m,n}. The height, width, and spectral dimension of the kernel are (2h_m + 1), (2w_m + 1), and (2c_m + 1), respectively.
In the 2D convolutional layer, the convolution kernel slides over the entire spatial extent, and the output of the convolution is the sum of the dot products between the kernel and the input data. During the convolution process, the information of different spectra at the same spatial position is fully integrated. In 2D convolution, the value of the nth feature map of the mth layer at spatial location (x, y) is as follows:

v_{m,n}^{x,y} = φ( b_{m,n} + Σ_{d=1}^{d_{m−1}} Σ_{h=−h_m}^{h_m} Σ_{w=−w_m}^{w_m} ω_{m,n,d}^{h,w} · v_{m−1,d}^{x+h, y+w} ) (18)

where the parameters in Equation (18) have the same meaning as in Equation (17).
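The quadruple sum of Equation (17) can be evaluated directly, as in the NumPy check below; the identity activation and zero bias are assumptions made for the check, and the shapes are illustrative:

```python
import numpy as np

def conv3d_value(prev_maps, kernel, bias, x, y, z):
    """Direct evaluation of Equation (17) at one location (identity activation).
    prev_maps: (d_{m-1}, H, W, C) feature maps of layer m-1;
    kernel:    (d_{m-1}, 2h_m+1, 2w_m+1, 2c_m+1) weights for one output map."""
    D, kh, kw, kc = kernel.shape
    hm, wm, cm = kh // 2, kw // 2, kc // 2
    val = bias
    for d in range(D):                         # over input feature maps
        for i in range(-hm, hm + 1):           # height offsets
            for j in range(-wm, wm + 1):       # width offsets
                for k in range(-cm, cm + 1):   # spectral offsets
                    val += kernel[d, i + hm, j + wm, k + cm] * prev_maps[d, x + i, y + j, z + k]
    return val

rng = np.random.default_rng(0)
prev = rng.random((2, 7, 7, 9))                # two input feature maps
w = rng.standard_normal((2, 3, 3, 3))          # 3x3x3 kernel per input map
v = conv3d_value(prev, w, 0.0, 3, 3, 4)
# Agrees with the vectorized product over the same 3x3x3 window:
assert np.isclose(v, np.sum(prev[:, 2:5, 2:5, 3:6] * w))
```

The 2D case of Equation (18) is the same computation with the spectral-offset loop removed.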

Comparison Experiments

HSI Datasets
We apply the model to three public datasets: Indian Pines (IP), Kennedy Space Center (KSC), and Pavia University (PU), to evaluate the validity of the model.
The IP size is 145 × 145 pixels, and the spatial resolution is 20 m. In the experiment, noisy and water-absorption bands have been removed, leaving 200 spectral bands. The false-color map, ground truth map, training ratios, and 16 vegetation classes of the IP dataset are shown in Figure 6.

The KSC size is 512 × 614 pixels, and the spatial resolution is 18 m. In the experiment, noisy and water-absorption bands have been removed, leaving 176 spectral bands. The false-color map, ground truth map, training ratios, and 13 wetland categories of the KSC dataset are shown in Figure 7.
The PU size is 610 × 340 pixels, and the spatial resolution is 1.3 m. In the experiment, noisy bands have been removed, leaving 103 spectral bands. The false-color map, ground truth map, training ratio, and 9 urban land cover categories of the PU dataset are shown in Figure 8.
In order to avoid the effect of dataset randomness as much as possible, all experiments in this section are trained and tested on ten randomly generated identical training sets and corresponding test sets.


Experimental Setting 3.2.1. Measurement Indicators
The overall accuracy (OA), average accuracy (AA), and kappa coefficient (κ) metrics are being used for quantitative evaluation to fairly consider ESSN against other comparative methods.The larger the value of the model corresponding to the above three metrics, the better the model performs.

Environment Configuration
The software environment for all experiments is PyTorch1.12.0, cuDNN8.0,CUDA11.7, and Python3.8.The hardware environment is an Intel i5-12490F CPU, an NVIDIA GeForce RTX 3060 GPU, RAM 32 GB, and 1 TB of memory.The Stochastic Gradient Descent (SGD) optimizer was chosen as the initial optimizer for all experiments, the cross-entropy loss function is used to calculate the loss, the learning rate is set to 0.01, the batch size is 100, and the patch size is 15 × 15.One hundred epochs are applied to each dataset.

Comparison Experiment Results
All experiments were conducted in the same environment, and ESSN was compared with other SOTA models. In total, eight comparison models were selected, including 2-D CNN [24], 3-D CNN [25], HybridSN [29], SSRN [32], SSTN [45], SSFTT [37], CTMixer [43], and 3DCT [39]. To attain a relatively fair comparison, only the model was changed during the comparison process and the other parts were kept unchanged, so that the influence of external factors could be minimized.
Tables 1-3 show the mean classification results and standard deviations for each category on the IP, KSC, and PU datasets, respectively, as well as the evaluation metrics for each model; all results are reported as percentages. Figures 9-11 show the performance of each model on the same training set on the IP, KSC, and PU datasets, respectively. In Figures 9-11, the background is labeled as black pixels, and the categories to be predicted are labeled as colored pixels.
By examining the classification maps, it can be found that the misclassified points of each model occur mostly in regions where multiple categories appear densely, so we zoomed in on these localized multi-category regions.

Comparative Results on the IP
Based on Table 1, ESSN outperforms the other methods; the overall results are OA (98.93%), AA (96.49%), and Kappa (98.78%). Compared with the second-best model (HybridSN [29]), all performance metrics are improved: 0.21% higher in OA, 0.68% higher in AA, and 0.24% higher in Kappa, respectively. This indicates that including the edge feature augment block and the spectral-spatial attention block benefits the feature representation of the model. In addition, all models perform poorly on classes '1' and '9'; analysis of the IP dataset shows that the overall number of samples in classes '1' and '9' is too small. Consequently, when the training dataset is drawn by proportional random selection, an inadequate number of samples is obtained for these classes, resulting in suboptimal classification performance by all models on them.
In addition, the overall map in Figure 9 shows that ESSN has the best overall performance, and in the local maps with high category density, ESSN also outperforms the other models. This indirectly shows that the edge feature enhancement is effective. Figure 9 also verifies that most mispredictions occur in areas where different categories border one another.

Comparative Results on the KSC
Table 2 shows that ESSN scores higher than the other models. The overall results are OA (99.18%), AA (98.79%), and Kappa (99.08%). Compared with the second-best model (HybridSN [29]), all performance metrics are improved: 0.19% higher in OA, 0.32% higher in AA, and 0.21% higher in Kappa, respectively. In addition, ESSN's performance is optimal in most categories, and in the remaining few categories, the classification results of our model do not differ much from the best results.
Based on Figure 10, ESSN outperforms the other models on the KSC dataset in general. As the local zoomed-in images show, at the edges of categories in category-dense regions, ESSN's performance is also better than that of the other models. The reason is that in KSC each category is relatively independent, and there are few sample points in close proximity to other categories; ESSN's performance on the edges is much sharper than that of methods without edge feature augmentation.

Comparative Results on the PU
By observing Table 3, the overall performance of ESSN is OA (99.45%), AA (98.82%), and Kappa (99.27%). Compared with the second-best model (3DCT [39]), all performance metrics are improved: OA by 0.09%, AA by 0.14%, and Kappa by 0.12%, respectively. Combining Table 3 with Figures 7 and 11 and comparing against HybridSN [29], which has the highest number of optimal single-category classification accuracies, ESSN's classification accuracies are competitive, though not optimal, in categories '1', '2', '7', and '8', but ESSN outperforms it significantly in categories '3', '4', and '9'. The same result can be observed in the localized zoomed-in maps in Figure 11. This explains why the average classification accuracy of ESSN is higher than that of the other methods, even though ESSN's single-category accuracy is not optimal in most categories.

Depletion of Resources
We take the parameter size, training time, and testing time as resource-consumption metrics; smaller values are better for all of them. All results are reported in Table 4.

Parametric Analysis
In this section, we perform a sensitivity analysis on three parameters: the patch size, the training ratio, and the LoG operator, and explore their impact on model performance.
In Table 5, ESSN performs better on IP and PU when the patch size is 15 × 15 but does not perform as well on KSC as when the patch size is 19 × 19. Considering the computational cost, a patch size of 15 × 15 is selected as the optimal size. In addition, as the patch size increases, the OA on KSC keeps growing; combined with the full ground truth map of KSC, there are two reasons for this result. One is that as the patch grows, each patch contains more spatial information, so the model can learn more key elements from it. The other is that as the patch grows, longer-distance edges are gradually incorporated into the model's observation range. By contrast, the ground truth maps of IP and PU show more category intermingling in these two datasets. Increasing the patch size yields a larger receptive field, which benefits the model, but at the same time introduces more noise and confusing information, which harms it. Therefore, on both IP and PU, ESSN's performance first increases and then decreases as the patch size grows; from Table 5, a patch size of 15 × 15 is the turning point where performance goes from rising to falling.

As seen in Figure 12, as expected, the performance of all models improves with more training samples, with OA gradually approaching 100%. ESSN has a large advantage when the number of training samples is insufficient; as the number of training samples increases, ESSN's advantage in OA gradually shrinks, and eventually the performance of all models converges. All in all, ESSN outperforms the other models across different training ratios.

Figure 13 shows the performance in the different cases. Comparing 'a' with 'c', it is clear that if traditional data augmentation is used without processing the raw data with learnable parameters, the performance is not as good as when the edge feature augment block is used. In addition, comparing 'a' with 'e', it can be found that traditional data augmentation does not have a positive effect; on the KSC dataset in particular, it greatly reduces classification performance. When the edge feature augment block is used, the performance of the different LoG operators is very close, and comparing 'e' with 'b', 'c', and 'd', the classification capability of the model improves when the edge feature augment block is added. Comparing 'b' with 'c', the performance gap between the two on all the datasets used in the experiment is very small. The reason is that the LoG operators used for both 'b' and 'c' are discrete approximations at σ < 0.5; the difference between them lies in the angle at which rotational invariance is satisfied, with the LoG operator corresponding to 'b' being invariant to rotations in the 90° direction and the LoG operator corresponding to 'c' being invariant to rotations in the 45° direction. In this study, the raw data are not rotationally transformed, so the difference between the two is not significant, and both are better than case 'e'. 'b' performs optimally on KSC, and 'c' performs optimally on IP and PU; after comprehensive consideration, the LoG operator corresponding to 'c' is chosen.
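The 90°- and 45°-invariant discrete operators discussed above can be illustrated with the standard 4-neighbour and 8-neighbour Laplacian kernels (a sketch; whether these exactly match the paper's operators (a)-(c) in Figure 3 is our assumption):

```python
import numpy as np

# 4-neighbour Laplacian: response invariant to 90° rotations (case 'b').
LAP_90 = np.array([[0,  1, 0],
                   [1, -4, 1],
                   [0,  1, 0]], dtype=float)

# 8-neighbour Laplacian: response invariant to 45° rotations (case 'c').
LAP_45 = np.array([[1,  1, 1],
                   [1, -8, 1],
                   [1,  1, 1]], dtype=float)

def log_filter(band, kernel):
    """'Valid' 2-D filtering of one spectral band with a 3x3 Laplacian kernel."""
    h, w = band.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            out[i, j] = np.sum(band[i:i + 3, j:j + 3] * kernel)
    return out

band = np.zeros((5, 5)); band[:, 2:] = 1.0   # vertical step edge
edges = log_filter(band, LAP_45)             # nonzero only near the edge
```

Because both kernels sum to zero, flat regions give zero response, which is why only edge pixels are strengthened.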

Ablation Experiment
In this section, to exclude the influence of external factors, the training samples in the ablation experiments are kept the same as those used in Section 3; thus, effects arising from hyperparameters and randomized training samples are excluded.
The PCA operation is used in the data preprocessing part, but it also causes loss of spectral information when extracting the principal components of hyperspectral images. Therefore, we explore the effect of the PCA operation on the overall performance of the model by conducting PCA ablation experiments on the three datasets: IP, KSC, and PU. The experimental results are shown in Figure 14; subplot (a) of Figure 14 shows the classification results when PCA is used to reduce the data to 50 dimensions. Overall, the edge feature augment block and the spectral attention block play a great role in suppressing the noise in HSIs, and combining them with the spatial attention block results in better performance than other combinations.
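The PCA preprocessing discussed here can be sketched as follows (a minimal NumPy version; the 103-band input and 50-dimension target follow the text, while the function and data are illustrative):

```python
import numpy as np

def pca_reduce(cube, n_components=50):
    """Project an (H, W, B) hyperspectral cube onto its top principal components."""
    h, w, b = cube.shape
    X = cube.reshape(-1, b).astype(float)
    X = X - X.mean(axis=0)                 # centre each spectral band
    cov = X.T @ X / (X.shape[0] - 1)       # band covariance matrix (B x B)
    vals, vecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    top = vecs[:, ::-1][:, :n_components]  # leading eigenvectors
    return (X @ top).reshape(h, w, n_components)

cube = np.random.default_rng(0).normal(size=(16, 16, 103))  # PU has 103 bands
reduced = pca_reduce(cube, n_components=50)
```

Discarding the trailing components is exactly where the spectral-information loss discussed above comes from.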

Conclusions
In this paper, a novel feature extraction network (ESSN) is proposed for efficiently extracting local edge features and global spectral-spatial features from HSIs. In ESSN, the edge feature augment block first performs edge awareness and selective feature enhancement, which is more effective than traditional edge data augmentation using the LoG operator with no learnable parameters. Secondly, because some spectra in an HSI contain a large amount of noise, different spectra do not have the same importance for the classification decision, so we introduce the spectral attention block to enhance the effective spectra and suppress the noise. Also, due to the geometric constraints of the convolution operation, we introduce spatial attention to model pixel-to-pixel interactions at all locations. Finally, we fuse the representations of the feature maps reconstructed by the above methods through the 2D-3D convolution block to obtain the final feature representation. The experimental results show that ESSN performs competitively on the IP, KSC, and PU datasets.
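The spectral-attention idea summarized above, reweighting bands by learned per-band scores, can be sketched as follows; the gate structure and weight shapes are illustrative assumptions, not the exact ESSN design:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def spectral_attention(cube, W1, W2):
    """Reweight the spectral bands of an (H, W, B) cube by per-band scores in (0, 1)."""
    squeeze = cube.mean(axis=(0, 1))              # global average per band: (B,)
    scores = sigmoid(W2 @ np.tanh(W1 @ squeeze))  # tiny two-layer gate: (B,)
    return cube * scores                          # broadcast over H and W

rng = np.random.default_rng(0)
cube = rng.normal(size=(8, 8, 16))
W1 = rng.normal(size=(4, 16))   # hypothetical bottleneck weights
W2 = rng.normal(size=(16, 4))
out = spectral_attention(cube, W1, W2)
```

In a trained network, W1 and W2 would be learned, so noisy bands receive scores near 0 and informative bands scores near 1.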
Although ESSN performs well in HSI classification, further improvements are needed. In future work, we will pursue the following studies:
1. Exploring better edge-aware algorithms to reduce noise interference from isolated nodes;
2. Reducing the parameter size to speed up training and increase efficiency.

Figure 1 .
Figure 1.Framework of the classification process using ESSN.Note that BN and ReLU after each convolution operation have been omitted.

Figure 2 .
Figure 2. Comparison of prediction and truth plots.(a) Partial ground truth of IP, (b-d) Predicted classification maps.


Figure 4 .
Figure 4. Binarized images of different spectral bands of PU dataset.(a-c) derived from raw PU dataset, (d-f) derived from PU dataset after PCA.


Figure 5 .
Figure 5.The framework of spatial attention in Figure 1.


Figure 6 .
Figure 6.Specific information on the IP dataset.


Figure 7 .
Figure 7. Specific information on the KSC dataset.


Figure 8 .
Figure 8. Specific information on the PU dataset.

Figure 13 .
Figure 13. Classification accuracy (%) on IP (left), KSC (middle), and PU (right) in different cases. 'a' denotes traditional edge data augmentation using the LoG operator with no learnable parameters. In turn, 'b', 'c', and 'd' correspond to the use of the (a), (b), and (c) operators in Figure 3, respectively. 'e' indicates no processing on edges, corresponding to case '6' in Table 6.


Table 1 .
Classification accuracy (%) of different models on IP.

Table 2 .
Classification accuracy (%) of different models on KSC.

Table 3 .
Classification accuracy (%) of different models on PU.

Table 4 .
Resource consumption for each model.

Based on Table 4, compared with 3DCT, although ESSN contains the largest number of parameters, it has a shorter training time. And by comparing SSTN and 2-D CNN, it can be found that the parameter size of a model does not necessarily have a linear relationship with its training time.

Table 6 .
Classification accuracy (%) of different combinations of blocks.