SSANet-BS: Spectral–Spatial Cross-Dimensional Attention Network for Hyperspectral Band Selection

Abstract: Band selection (BS) aims to reduce redundancy in hyperspectral imagery (HSI). Existing BS approaches typically model HSI in only a single dimension, either spectral or spatial, without exploring the interactions between dimensions. To this end, we propose an unsupervised BS method based on a spectral–spatial cross-dimensional attention network, named SSANet-BS. The network comprises three stages: a band attention module (BAM) that employs an attention mechanism to adaptively identify and select highly significant bands; two parallel spectral–spatial attention modules (SSAMs), which fuse complex spectral–spatial structural information across dimensions in HSI; and a multi-scale reconstruction network that learns spectral–spatial nonlinear dependencies in the SSAM-fusion image at various scales and guides the BAM weights to converge automatically to the target bands via backpropagation. The three-stage structure of SSANet-BS enables the BAM weights to fully represent the saliency of the bands, so that valuable bands are obtained automatically. Experimental results on four real hyperspectral datasets demonstrate the effectiveness of SSANet-BS.


Introduction
Hyperspectral imagery (HSI) records numerous contiguous, narrow spectral bands and has been used extensively in diverse fields such as military and industrial applications [1]. Nonetheless, the high redundancy among bands in HSI poses great challenges for data transmission, storage, and computation, and can also lead to the Hughes phenomenon in classification, thereby reducing classification accuracy [2,3]. Consequently, dimensionality reduction is essential for HSI.
Dimensionality reduction for HSI can be divided into two categories: feature extraction (FE) and band selection (BS) [4]. FE projects the original HSI into a lower-dimensional space, which loses the physical information of the HSI because the feature space is altered. Conversely, BS selects a representative band subset from the HSI that preserves physical significance and offers higher interpretability, which makes it preferable for practical applications [5][6][7].
According to the task scenario, BS methods can be broadly categorized into target detection-oriented methods and classification-oriented methods. The former typically consider the spectral differences between targets and backgrounds when selecting a subset of bands [8,9], whereas the latter select bands that contain a large amount of information and exhibit strong discrimination capability [10,11]. Most of these methods are supervised and depend on prior information, such as ground truth labels, which limits their practical application. In contrast, unsupervised methods do not rely on prior information, but select representative bands by identifying the intrinsic properties of the HSI data. They are more versatile and can be applied to downstream tasks in all scenarios, including classification [12]. Hence, this paper focuses on unsupervised band selection methods.
Early BS approaches employed hand-crafted band evaluation metrics or heuristic strategies to obtain target bands [1]. However, a manually designed BS process cannot account comprehensively for complex real-world factors, resulting in unsatisfactory performance. The advent of machine learning has offered novel insights into the field of BS. Maximum-variance principal component analysis (MVPCA) [13] and Boltzmann entropy-based band selection (BE) [14] employ specific metrics derived from machine learning models to evaluate the quality of bands. Techniques such as enhanced fast density-peak-based clustering (E-FDPC) [15] and graph regularized spatial-spectral subspace clustering (GRSC) [16] use clustering to partition bands into multiple clusters, and the most representative band is selected from each cluster. Sparse representation-based band selection (SpaBS) [17] and spectral-spatial hypergraph-regularized self-representation (HyGSR) [18] assume that the original HSI can be represented by a linear combination of a limited number of bands, and identify the optimal band combination through iterative optimization of the sparse representation model. These machine learning-based BS methods attain considerable performance and can effectively reduce band redundancy.
However, these BS methods usually rely on strong assumptions to model the internal interactions within HSI [19,20]. In reality, the interaction between bands and pixels is complex [21]. Due to various physical factors, the reflectance of a band at a given pixel is influenced by its surrounding pixels and bands. Thus, predefined strong assumptions cannot cover all situations and hence do not yield the optimal solution [22][23][24][25].
Neural networks possess remarkable fitting capabilities and can reveal the intricate interdependencies in HSI [26][27][28]. The attention mechanism is effective at distinguishing important features [29,30]. Networks with an attention mechanism can automatically learn latent interrelations and identify the most representative bands of HSI [31,32]. Therefore, various attention modules are widely employed in the field of band selection. BS-Nets [33] is the first band selection framework that combines a band attention mechanism with an autoencoder. This model uses the attention mechanism to search for important bands and applies band-wise attention weighting to the original HSI. Once the model has been optimized using the autoencoder, significant bands are selected according to the band attention weights.
Subsequently, various models have employed different attention mechanisms to model the spectral or spatial dimensions of HSI and enhance model performance. The attention-based autoencoder (AAE) [34] generates an attention mask for each pixel; the band correlations are then calculated from the attention masks, and the final band subset is obtained by clustering. The non-local band attention network (NBAN) [20] employs a global-local attention mechanism that fully considers the nonlinear long-range dependencies of HSI in the band dimension. This approach significantly enhances the effectiveness and robustness of the attention mechanism, facilitating the automatic selection of bands. The dual-attention reconstruction network (DARecNet-BS) [35] incorporates two independent self-attention mechanisms, one in the spectral dimension and the other in the spatial dimension, which improve band selection by exploring the dependencies of HSI in each dimension.
The aforementioned methods have exploited the nonlinear interaction information within HSI, yielding promising results. However, they model the spectral or spatial dimensions independently, overlooking the performance gains available from considering the connections between them. Most of them are two-stage models, i.e., an attention stage followed by a reconstruction network, which brings new challenges. Methods such as DARecNet-BS and the triplet-attention and multi-scale reconstruction network (TAttMSRecNet) [36] introduce spatial attention modules connected in parallel with the band attention module to mine image spatial information, as shown in Figure 1a. Because the band attention module and the spatial attention modules function in unison, the band attention weights cannot represent the saliency of the bands independently. Hence, these methods cannot select bands based on the converged band attention weights and must instead rely on calculating the entropy of the reconstructed image, a process that constrains the potential of the attention mechanism to identify significant bands automatically.
To this end, we propose a deep neural network, SSANet-BS, based on spectral-spatial cross-dimensional attention. SSANet-BS is a three-stage model, as shown in Figures 1b and 2. The network regards BS as a reconstruction task for HSI in order to achieve unsupervised BS that is applicable to scenarios such as classification. First, a band attention module (BAM) is designed to model the spectral dimension of HSI, extracting salient band features and outputting band attention weights. Subsequently, two spectral-spatial attention modules (SSAMs) are constructed in the band-width (b-w) and band-height (b-h) directions, using the BAM-weighted image as input, to explore the complex spectral-spatial interactions within HSI and to generate the SSAM weights along with the SSAM-fused image. Finally, a multi-scale reconstruction network is used to reconstruct the fused image. During optimization, the band attention weights obtained by the BAM gradually converge to the bands with rich information and high saliency. Compared with the two-stage methods DARecNet-BS and TAttMSRecNet, the three-stage structure of SSANet-BS makes the SSAMs compatible with the BAM and takes full advantage of the automatic convergence of the attention mechanism to the important bands during backpropagation, thereby achieving automatic band selection. The main contributions of this paper are as follows:
1. This paper proposes a deep neural network based on spectral-spatial cross-dimensional attention for hyperspectral BS, named SSANet-BS. This network employs complementary multi-dimensional attention mechanisms to automatically discover salient bands, and improves the performance of BS by exploring the complex spectral-spatial interactions in HSI.
2. SSANet-BS, with its three-stage structural design, addresses the problem that existing BS methods which introduce spatial modules compromise the independence of the band attention weights. The experimental results demonstrate that SSANet-BS is effective and stable. This offers a novel solution for the field of hyperspectral BS.

The Proposed Method
This section introduces the proposed method, SSANet-BS, outlines its design concept and overall structure, and presents the implementation details of every module and step.

Overview of SSANet-BS
SSANet-BS treats BS as a band-weighted reconstruction task for HSI. To enhance performance, it models the nonlinear interactions between pixels and bands in HSI [37] throughout the reconstruction process. SSANet-BS comprises three stages, and its overall structure is shown in Figure 2.
In the first stage, image patches $X \in \mathbb{R}^{M \times N \times L}$, with width $M$, height $N$, and $L$ bands, are taken from the original HSI and fed to SSANet-BS one at a time. Repeating this over multiple patches ensures that SSANet-BS reads the original HSI thoroughly. Each $X$ is then fed into the band attention module (BAM) to obtain band attention weights, which are applied to the bands proportionally. The output of the BAM is a band-attention-weighted image in which the salient bands are enhanced.
The second stage is designed to extract the spectral-spatial information of the HSI. The BAM-weighted image is input into two spectral-spatial attention modules (SSAMs) to explore the complex spectral-spatial cross-dimensional interactions. This yields the spectral-spatial attention weights, which are then used to construct the SSAM-fusion image.
The third stage reconstructs the attention-weighted HSI for model optimization. A multi-scale reconstruction network based on 3D convolution and transposed convolution is employed to reconstruct the SSAM-fusion image. The loss function is defined on the residual between the reconstructed image and the original image, and is used to optimize SSANet-BS.
It should be noted that existing two-stage approaches employ the band attention module and other modules in the same stage. Consequently, their band attention weights cannot represent the salience of bands independently. In contrast, the BAM of SSANet-BS is employed on its own in the first stage. Therefore, the weight vector generated by the BAM represents the salience, or reconstruction capability, of each band with respect to the original HSI. When SSANet-BS reaches convergence, the band attention weights are sorted in descending order: the higher a band ranks, the higher its priority. The details of each module in SSANet-BS are described below.

The Band Attention Module
The BAM takes $X$ as input and generates a band attention weight vector through the neural network $f_b$ within this module:
$$w_b = f_b(X) \quad (1)$$
The $i$-th element $w_b^i \in [0, 1]$ of the vector $w_b \in \mathbb{R}^{L}$ represents the salience of the $i$-th band $b_i$ in $X$. A higher value of $w_b^i$ indicates that $b_i$ contributes more to the reconstruction of $X$, making it more salient. The structure of $f_b$ is detailed in Figure 3 and Table 1. Compared to using a fully connected network to extract band information from a single pixel, employing a convolutional neural network with spatial inductive bias [33] can make effective use of spatial information and boost modeling capability. Consequently, the first layer of the network uses multiple 2D convolution kernels to extract band information, while the second layer employs max-pooling to reduce the feature dimension of the convolutional output. Finally, the weight vector $w_b$ is obtained after a fully connected network with a sigmoid activation function and batch normalization.
Subsequently, $X$ is weighted band-by-band to generate the output image $X_b$ of the BAM:
$$X_b = f_{linear}(w_b) \otimes X \quad (2)$$
Here, $X_b \in \mathbb{R}^{M \times N \times L}$, $\otimes$ represents band-wise multiplication, and $f_{linear}$ denotes the linear transformation operation. $L_1$ regularization is imposed on the loss function of SSANet-BS, which introduces a sparse constraint on $w_b$ in order to reduce the redundancy of the final band subset. Therefore, some elements of $w_b$ may be 0 or close to 0. If $X_b$ were obtained directly as $w_b \otimes X$, it would inevitably lose some original band information, making it difficult for the subsequent SSAMs to fully model the spectral-spatial cross-dimensional interactions in HSI. Therefore, in this paper, the linear transformation $f_{linear}$ is adopted to map each element of $w_b$ from the range $[0, 1]$ to $[0.5, 1]$ without changing the relative order of band saliency. As the input of the subsequent module, $X_b$ enhances the features of the salient bands in $X$, improving the rationality of BS.
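The band-wise weighting step can be illustrated with a minimal NumPy sketch. It assumes that $f_{linear}$ is the affine map $w \mapsto 0.5 + 0.5w$, which is the simplest order-preserving map from $[0, 1]$ onto $[0.5, 1]$; the exact form used by the authors is not stated in the text. The function names `f_linear` and `apply_band_attention` are illustrative only, and the weight vector is supplied by hand rather than produced by the network $f_b$.

```python
import numpy as np

def f_linear(w_b):
    """Assumed affine map from [0, 1] to [0.5, 1]; it preserves the band ordering."""
    return 0.5 + 0.5 * np.asarray(w_b, dtype=float)

def apply_band_attention(X, w_b):
    """Band-wise weighting: band i of X (shape M x N x L) is scaled by f_linear(w_b)[i]."""
    return X * f_linear(w_b)[None, None, :]

# Toy patch and hand-picked weights (not learned values).
rng = np.random.default_rng(0)
X = rng.random((7, 7, 5))
w_b = np.array([0.0, 0.2, 1.0, 0.05, 0.6])
X_b = apply_band_attention(X, w_b)
```

Under this assumption, a band whose weight has been driven to 0 by the sparsity constraint is still passed to the SSAMs at half amplitude instead of being erased.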

The Spectral-Spatial Attention Module
If only a single dimension, spectral or spatial, is considered in HSI reconstruction, the interdependence between these dimensions is ignored. In reality, proper modeling of the complex nonlinear interactions in HSI can effectively improve model performance [36]. Based on this, SSANet-BS not only uses the BAM to learn and model the interactions in the spectral dimension but also introduces two spectral-spatial attention modules (SSAMs) for the band-width (b-w) and band-height (b-h) directions. This approach fuses the spectral-spatial information in order to explore the complex spectral-spatial cross-dimensional dependencies in HSI.
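Taking the b-w direction as an example, the module SSAM$_{b-w}$ generates its attention weights from $X_b$ through the network $f_{b-w}$:
$$W_{b-w} = f_{b-w}(X_b) \quad (3)$$
In Equation (3), the elements of $W_{b-w} \in \mathbb{R}^{M \times L}$ are non-negative. The detailed structure of $f_{b-w}$ is shown in Table 1. SSAM$_{b-w}$ first performs max-pooling and average-pooling along the height direction of $X_b$ to reduce its dimensionality, obtaining feature maps of salient and global information in the b-w direction. These two feature maps are then stacked and passed through a convolutional layer, a batch normalization layer, and a ReLU nonlinear activation function to obtain $W_{b-w}$, from which the SSAM-weighted image in the b-w direction, $X_{b-w} \in \mathbb{R}^{M \times N \times L}$, is computed:
$$X_{b-w} = W_{b-w} \odot X_b$$
Here, $\odot$ represents position-wise multiplication, with $W_{b-w}$ applied at every position along the height direction. SSAM$_{b-h}$ produces the weighted image $X_{b-h}$ for the b-h direction in the same way, and the two weighted images are combined into the SSAM-fusion image:
$$X_{b-w-h} = \mathrm{Avg}(X_{b-w}, X_{b-h})$$
In this case, Avg is the average operation. $X_{b-w-h}$ provides spectral-spatial cross-dimensional interaction information for the adjustment of the BAM weight vector $w_b$ and for the subsequent image reconstruction, thus enabling SSANet-BS to use the spectral-spatial correlation information to select more reasonable bands and improve performance.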
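The data flow of the two SSAMs can be sketched in NumPy as follows. This is only a shape-level illustration under stated assumptions: the learned convolution, batch normalization, and ReLU of $f_{b-w}$ are replaced by a fixed linear mix of the two pooled maps followed by ReLU (the coefficients `a_max` and `a_avg` are hypothetical), the b-h module is assumed to pool along the width direction by analogy, and the fusion is taken to be an element-wise mean of the two weighted images.

```python
import numpy as np

def ssam_weights(X_b, pool_axis, a_max=0.5, a_avg=0.5):
    """Pool X_b along one spatial axis, then mix the pooled maps and apply ReLU.

    The fixed mix is a stand-in for the learned conv + BN + ReLU in Table 1.
    """
    pooled_max = X_b.max(axis=pool_axis)    # salient information
    pooled_avg = X_b.mean(axis=pool_axis)   # global information
    return np.maximum(a_max * pooled_max + a_avg * pooled_avg, 0.0)

def ssam_fuse(X_b):
    """X_b has shape (M, N, L) = (width, height, bands)."""
    W_bw = ssam_weights(X_b, pool_axis=1)   # (M, L): pooled over height
    W_bh = ssam_weights(X_b, pool_axis=0)   # (N, L): pooled over width (assumed)
    X_bw = W_bw[:, None, :] * X_b           # broadcast along height
    X_bh = W_bh[None, :, :] * X_b           # broadcast along width
    X_bwh = 0.5 * (X_bw + X_bh)             # Avg of the two weighted images
    return X_bwh, W_bw, W_bh

X_b = np.random.default_rng(0).random((7, 7, 5))
X_bwh, W_bw, W_bh = ssam_fuse(X_b)
```

The point of the sketch is the broadcasting pattern: each attention map lacks one spatial axis, so it rescales every pixel along that axis identically for a given band.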

The Multi-Scale Reconstruction Network
3D convolutional networks can exploit spectral-spatial information and have been used extensively for reconstructing HSI [36,38]. To model the interactions in HSI at varying scales and enhance reconstruction capability, this paper proposes a multi-scale reconstruction network $f_{ms\_rec}$, inspired by MSRN [38], that incorporates 3D convolutions and transposed convolutions with diverse kernel scales. The SSAM-fusion image $X_{b-w-h}$ is then reconstructed by $f_{ms\_rec}$:
$$\hat{X} = f_{ms\_rec}(X_{b-w-h})$$
The detailed implementation of $f_{ms\_rec}$ is given in Table 1, and $\hat{X} \in \mathbb{R}^{M \times N \times L}$ denotes the reconstructed image. To ensure that bands with adjacent spectral positions are not assigned similar attention weights and to reduce the redundancy of the band subset, the loss function of SSANet-BS combines the reconstruction residual between $\hat{X}$ and $X$ with an $L_1$ sparse constraint $\lambda \lVert w_b \rVert_1$ on the band attention weights. Here, $\theta$ represents all trainable parameters of SSANet-BS, over which the loss is minimized; $P = M \times N$ denotes the total number of pixels in $X$; $\lVert \cdot \rVert_1$ represents the $L_1$ sparse constraint; and the coefficient $\lambda$ controls the degree of sparsity of $w_b$.
The three-stage design of SSANet-BS enables the band attention weight $w_b$ of the BAM to represent the band saliency independently, thus facilitating band selection. Specifically, once SSANet-BS has converged, the average $\bar{w}_b$ of the vectors $w_b$ obtained for all patches $X$ is treated as the final salience score of each band. The larger the $i$-th element of $\bar{w}_b$, the more important the $i$-th band $b_i$. After the elements of $\bar{w}_b$ are sorted in descending order, the bands associated with the top $n$ values are selected as the final band subset.
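The objective and the final selection step can be sketched as follows. The selection function follows the text directly (average the per-patch weight vectors, sort in descending order, keep the top $n$). The loss function is an assumption: it uses a squared residual normalized by $2P$ plus $\lambda \lVert w_b \rVert_1$, a form common in attention-based BS networks, whereas the exact norm and normalization used by SSANet-BS are not recoverable from the text. The names `reconstruction_loss` and `select_bands` are illustrative.

```python
import numpy as np

def reconstruction_loss(X, X_hat, w_b, lam):
    """Assumed loss: squared residual / (2P) + lam * ||w_b||_1, with P = M * N."""
    P = X.shape[0] * X.shape[1]
    residual = np.sum((X_hat - X) ** 2) / (2.0 * P)
    return residual + lam * np.sum(np.abs(w_b))

def select_bands(w_b_per_patch, n):
    """Average w_b over all patches and return the indices of the n largest scores."""
    w_bar = np.mean(np.asarray(w_b_per_patch, dtype=float), axis=0)
    order = np.argsort(-w_bar, kind="stable")   # descending salience
    return order[:n], w_bar

# Toy weights for two patches and four bands (illustrative values only).
weights = [[0.1, 0.9, 0.5, 0.0],
           [0.3, 0.7, 0.5, 0.2]]
selected, w_bar = select_bands(weights, n=2)    # bands 1 and 2
```

Because the selection depends only on $\bar{w}_b$, no entropy computation on the reconstructed image is needed, which is the practical consequence of keeping the BAM in its own stage.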

Experiments
This section presents a comparative analysis of the proposed SSANet-BS model, two state-of-the-art feature extraction methods, and eight state-of-the-art BS methods on four publicly available datasets. The classification results are used to verify the effectiveness of each method. The experimental data, parameter settings, comprehensive analysis, and discussions are detailed in the following sections.

Experimental Setup
The comparison methods include locally linear embedding (LLE) [39], isometric mapping (Isomap) [40], maximum-variance principal component analysis (MVPCA) [13], enhanced fast density-peak clustering (E-FDPC) [15], adaptive subspace partitioning strategy (ASPS) [41], scalable one-pass self-representation learning (SOPSRL) [42], graph regularized spatial-spectral clustering (GRSC) [16], BSNet-Conv [33], DARecNet-BS [35], and spatial and spectral structure preserved self-representation (S$^4$P) [43]. Note that LLE and Isomap require significant computational resources for processing large-scale HSI. Therefore, sampled versions are used to ensure that they run successfully on the four datasets. Further, to facilitate comparison between the two feature extraction methods (LLE and Isomap) and the BS methods, the number of dimensions after feature extraction is set equal to the number of selected bands. The four hyperspectral datasets, shown in Figure 4, are as follows:
• Indian Pines (IP220): IP220 was captured by the AVIRIS sensor in 1992 over an Indian pine forest landscape located in northwestern Indiana. It contains 220 bands, with a resolution of 145 × 145 pixels and 16 labeled classes of ground objects.
• Washington DC Mall (DC191): An airborne HSI acquired by the HYDICE sensor, which contains 191 bands, with a resolution of 280 × 307 pixels and 6 classes.
The experiment uses a support vector machine (SVM) as the classifier, with 10%, 1%, 5%, and 10% of the samples selected from IP220, DC191, PU103, and QY176, respectively, for training. Producer's accuracy (PA), average producer's accuracy (APA), average user's accuracy (AUA), overall accuracy (OA), and the kappa coefficient (kappa) are used to assess the effectiveness of each method. To reduce the uncertainty caused by random sample selection, the OA of each band subset is the average of five independent tests. The experiment divides the HSI into multiple non-overlapping patches $X \in \mathbb{R}^{7 \times 7 \times L}$ as input for SSANet-BS and uses SGD as the optimizer. SSANet-BS is implemented using the PyTorch framework based on CUDA 10.7. All experiments are run on an Intel Xeon E5-2699 v4 CPU and an Nvidia Tesla P40 GPU.
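The division of a scene into non-overlapping 7 × 7 patches can be sketched as below. The text does not say how pixels that do not fill a complete patch are handled; this sketch simply crops them, which is an assumption, and the function name `split_into_patches` is illustrative.

```python
import numpy as np

def split_into_patches(hsi, size=7):
    """Split an (H, W, L) cube into non-overlapping (size, size, L) patches.

    Rows and columns that do not fill a complete patch are cropped (assumption).
    """
    H, W, L = hsi.shape
    rows, cols = H // size, W // size
    cropped = hsi[:rows * size, :cols * size, :]
    # (rows, size, cols, size, L) -> (rows, cols, size, size, L)
    blocks = cropped.reshape(rows, size, cols, size, L).transpose(0, 2, 1, 3, 4)
    return blocks.reshape(rows * cols, size, size, L)

# A cube with the spatial size of IP220 and a reduced band count for brevity.
cube = np.arange(145 * 145 * 4, dtype=float).reshape(145, 145, 4)
patches = split_into_patches(cube, size=7)   # shape (400, 7, 7, 4)
```

With cropping, a 145 × 145 scene yields 20 × 20 = 400 patches, and the last 5 rows and columns are unused; padding or an overlapping final patch would be equally plausible alternatives.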

Parameter Setting
The hyperparameter $\lambda$ of SSANet-BS is the coefficient that controls the regularization. Its candidate values are {0.0001, 0.001, 0.01, 0.1}. The optimal $\lambda$ is determined by the average OA (AOA) obtained as the number of selected bands $n_{BS}$ varies from 5 to 30 with a step of 5. Table 2 shows the AOA values of SSANet-BS under different $\lambda$. It can be observed that the optimal values on the IP220, DC191, PU103, and QY176 datasets are 0.01, 0.0001, 0.001, and 0.0001, respectively.
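The selection rule for $\lambda$ amounts to a small grid search, sketched below in plain Python. The callback `evaluate_oa(lam, n)` is hypothetical: it stands for training SSANet-BS with a given $\lambda$, selecting $n$ bands, and returning the resulting OA, none of which is implemented here.

```python
LAMBDAS = [0.0001, 0.001, 0.01, 0.1]
BAND_COUNTS = list(range(5, 31, 5))   # n_BS = 5, 10, ..., 30

def average_overall_accuracy(oa_values):
    """AOA: mean of the OA values obtained for the different band counts."""
    return sum(oa_values) / len(oa_values)

def pick_lambda(evaluate_oa, lambdas=LAMBDAS, band_counts=BAND_COUNTS):
    """Return the lambda with the highest AOA, plus the AOA of every candidate.

    evaluate_oa(lam, n) is a user-supplied (hypothetical) function returning OA.
    """
    aoa = {
        lam: average_overall_accuracy([evaluate_oa(lam, n) for n in band_counts])
        for lam in lambdas
    }
    best = max(aoa, key=aoa.get)
    return best, aoa
```

Averaging over six band counts makes the choice of $\lambda$ less sensitive to a single favorable value of $n_{BS}$.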

Result Analysis
To validate the effectiveness of the proposed method, Figure 5 shows the OA values, averaged over five runs, of each BS method at different $n_{BS}$. For the IP220 dataset, SSANet-BS achieves the best results for most values of $n_{BS}$. It is closely followed by DARecNet-BS, GRSC, and ASPS, while E-FDPC, LLE, and Isomap perform poorly. The advantage of SSANet-BS becomes more pronounced when fewer bands are selected. In terms of stability, the OA values of SSANet-BS, GRSC, and DARecNet-BS vary only slightly across different $n_{BS}$, whereas the OA of ASPS drops when $n_{BS}$ is 20, indicating that it is less stable. Meanwhile, Figure 5b reveals that SSANet-BS has a more significant advantage for most values of $n_{BS}$ on the DC191 dataset. As $n_{BS}$ increases to 20, the gap between SSANet-BS and the other comparison methods gradually narrows, but SSANet-BS remains the best performer. For the PU103 dataset, SSANet-BS is less effective than S$^4$P when $n_{BS}$ is under 15 but still performs well, and it outperforms the other methods for all other values of $n_{BS}$. As with the other datasets, SSANet-BS demonstrates an advantage over the other methods on the QY176 dataset when fewer bands are selected, such as 5 and 10. As the number of bands increases, the performance of SSANet-BS gradually approaches that of the other methods, with the exception of DARecNet-BS and MVPCA.
As shown in Figure 5, methods such as SSANet-BS, DARecNet-BS, and S$^4$P outperform the full band set for most values of $n_{BS}$ on the IP220 dataset. This indicates that these BS methods effectively reduce data redundancy and thereby obtain good performance. On the DC191, PU103, and QY176 datasets, the full band set surpasses all BS methods. However, as the number of bands increases, this gap gradually narrows. It is important to emphasize that the objective of BS is to improve data transmission and processing speed, conserve computational resources, and enhance model usability while maintaining task accuracy as much as possible. For instance, on the DC191 dataset with 191 bands, when the number of selected bands is 15, SSANet-BS achieves a reduction of approximately 92% in data volume with a 1.32% loss in OA. Moreover, in this experiment, the running time of the SVM with 15 bands and with full bands is 0.43 s and 2.94 s, respectively, a difference of considerable importance in practical applications with large-scale datasets. Therefore, BS methods incur an acceptable loss of accuracy in exchange for a significant reduction in the data volume of HSI, thereby increasing processing efficiency.
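The figures quoted for DC191 can be checked with a few lines of arithmetic. The sketch assumes that data volume scales linearly with the number of bands (same spatial size and bit depth); the running times are those reported in the text, and the function name `data_reduction` is illustrative.

```python
def data_reduction(n_selected, n_total):
    """Fraction of the data volume removed when keeping n_selected of n_total bands."""
    return 1.0 - n_selected / n_total

reduction = data_reduction(15, 191)   # about 0.921, i.e. roughly 92%
svm_speedup = 2.94 / 0.43             # about 6.8x faster SVM run
```

In other words, keeping 15 of 191 bands removes about 92% of the data and makes the SVM run roughly 6.8 times faster, at the reported cost of 1.32% in OA.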
Further, Figures 6-9 illustrate the classification maps of each method on the four datasets at $n_{BS} = 15$. There are discrepancies between the false color image (a) and the ground truth (b) in Figures 6-9, and these differences are most pronounced in the areas highlighted by the yellow box in Figure 9. One of the reasons for these discrepancies is the interference from shadows, reflections, and other disturbances. Such regions are therefore well suited to validating and distinguishing the effectiveness of different band selection methods. The classification maps demonstrate that the results obtained with the bands selected by SSANet-BS are more closely aligned with the ground truth than those of other methods. The prediction accuracy of SSANet-BS is higher in adjacent regions belonging to the same class, and this is most visible in the yellow-box regions of Figures 6-9. For instance, on the IP220 dataset, the bands selected by SSANet-BS yield a lower misclassification rate in the yellow-box region than those of MVPCA, E-FDPC and other methods. Similarly, on the QY176 dataset, SSANet-BS is the most closely aligned with the ground truth in the yellow box, whereas methods such as DARecNet-BS and MVPCA are less effective. This indicates that the joint spectral-spatial information of HSI has been fully utilized.
Furthermore, Tables 3-6 present the producer's accuracy (PA), average producer's accuracy (APA), average user's accuracy (AUA), overall accuracy (OA) and kappa coefficient (kappa) for each method at $n_{BS} = 15$ on the IP220, DC191, PU103, and QY176 datasets, respectively. For the IP220 dataset, SSANet-BS achieves the optimal APA, AUA, OA and kappa, and the optimal PA in 11 classes. In the classes where SSANet-BS does not achieve the optimal outcome, the PA gap between SSANet-BS and the optimal method is less than 3%, except for classes 7 and 16. The performance of SSANet-BS on DC191 and QY176 is comparable to that on IP220: its APA, AUA, OA and kappa are all optimal. This indicates that the band subset selected by SSANet-BS is of high quality and that the classification performance is stable. Considered collectively, SSANet-BS achieves the optimal values of APA, AUA, OA, and kappa on every dataset except PU103, where it is outperformed by S$^4$P. This indicates that SSANet-BS is a stable method and that the selected bands can effectively represent the original HSI.
To further assess the stability of SSANet-BS, Figure 10 presents the AOA values of each BS method across six band subset sizes from 5 to 30 with a step size of 5. The AOA values of the optimal and suboptimal methods are bolded in red and black. Figure 10 shows a significant discrepancy in the performance of the various methods on the IP220 and PU103 datasets, whereas the differences on the DC191 and QY176 datasets are relatively minor. Moreover, most methods perform better on DC191 and QY176. This can be attributed to a variety of factors, including sensor characteristics, the attributes of the ground objects within the scenes, atmospheric conditions, and the impact of lighting. Consequently, IP220 and PU103 present greater challenges for the different BS methods. Figure 10 shows that the AOA values of SSANet-BS exceed those of all other comparison methods on the IP220, DC191 and QY176 datasets, leading the suboptimal methods by 3.08%, 2.05% and 0.44%, respectively. On the PU103 dataset, SSANet-BS is suboptimal, trailing S$^4$P by only 1.42% while improving on the third-best method, BSNet-Conv, by 2.05%. These outcomes indicate that SSANet-BS produces good and stable performance on various datasets by modeling complex spectral-spatial cross-dimensional interactions in the reconstruction process.
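The accuracy measures reported in Tables 3-6 can all be derived from a confusion matrix. The following is a minimal sketch under the standard remote-sensing definitions (PA as per-class recall, UA as per-class precision); the function name and the toy matrix are illustrative assumptions and are not taken from the paper's code or results.

```python
import numpy as np

def classification_metrics(conf):
    """Compute PA, APA, AUA, OA and kappa from a confusion matrix.

    conf[i, j] is the number of samples of true class i predicted as class j.
    Assumes every class has at least one true and one predicted sample.
    """
    conf = np.asarray(conf, dtype=float)
    total = conf.sum()
    diag = np.diag(conf)
    pa = diag / conf.sum(axis=1)      # producer's accuracy per class
    ua = diag / conf.sum(axis=0)      # user's accuracy per class
    oa = diag.sum() / total           # overall accuracy
    # Expected agreement by chance, used by the kappa coefficient.
    pe = (conf.sum(axis=1) * conf.sum(axis=0)).sum() / total ** 2
    kappa = (oa - pe) / (1.0 - pe)
    return {"PA": pa, "APA": pa.mean(), "AUA": ua.mean(), "OA": oa, "kappa": kappa}

# Toy two-class confusion matrix (illustrative values only).
toy = [[45, 5],
       [10, 40]]
metrics = classification_metrics(toy)
```

For the toy matrix, OA and APA are both 0.85 and kappa is 0.7, which shows how kappa discounts the agreement expected by chance.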

Discussion
This section discusses the quality of the selected band subset and the runtime of each method, verifies the effectiveness of two SSAM modules through ablation experiments, and concludes with the advantages and limitations of SSANet-BS.

Band Quality
Hyperspectral band selection methods aim to select a subset of bands that is both informative and low in redundancy, while also providing a comprehensive representation of the original HSI. Consequently, the quantity of information and the degree of redundancy are pivotal metrics for evaluating the quality of the band subset selected by a BS method. On the one hand, bands with greater information content exhibit higher Shannon entropy values. On the other hand, the content of adjacent bands in HSI is similar and tends to be redundant [44], which means that the distribution of the selected bands can reflect the redundancy of the band subset.
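As an illustration of the information criterion, the Shannon entropy of a single band can be estimated from its grey-level histogram. This is a minimal sketch; the function name, the bin count of 256, and the synthetic test bands are assumptions made here, not settings reported in the paper.

```python
import numpy as np

def band_entropy(band, n_bins=256):
    """Estimate the Shannon entropy (in bits) of one band from its histogram."""
    counts, _ = np.histogram(band, bins=n_bins)
    p = counts[counts > 0] / counts.sum()   # drop empty bins: 0 * log(0) := 0
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(0)
textured = rng.integers(0, 256, size=(64, 64))  # many distinct grey levels
flat = np.full((64, 64), 7)                      # a single grey level

h_textured = band_entropy(textured)
h_flat = band_entropy(flat)
```

A band whose pixels spread over many grey levels approaches the upper bound of log2(n_bins) bits, while a constant band has zero entropy. For a full cube of shape (H, W, B), the same function can be applied to each slice `cube[:, :, b]`.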
It can be observed in Figure 11 that bands with high entropy show the features of ground objects more clearly. Conversely, bands with low entropy, such as that in Figure 11c, are noisy bands, which can have a detrimental impact on subsequent classification tasks. In order to assess the quality of the bands selected by each method, Figure 12 further plots the distribution of the selected bands (top of each subplot) and the entropy values of all bands (bottom of each subplot) for the IP220 dataset. The subplots of Figure 12 indicate that the bands selected by MVPCA are concentrated in comparison to those of other methods. Although the bands selected by MVPCA are concentrated in the region of higher entropy, the classification performance is unsatisfactory. In contrast, methods such as E-FDPC, ASPS, SOPSRL, BSNet-Conv and S$^4$P select bands that exhibit greater dispersion but inevitably fall within the low entropy range. The sparse constraints imposed on SSANet-BS result in a uniform distribution of bands across the four datasets. The selected bands are spaced further apart, with lower redundancy and superior quality. This demonstrates the effectiveness of SSANet-BS.

Computation Time
This section focuses on the computation time of SSANet-BS. Deep learning-based methods can be accelerated by GPU, so SSANet-BS, DARecNet-BS, and BSNet-Conv run on GPU, while the others run on CPU. Table 7 shows the computation time of different methods for selecting 30 bands on the IP220 dataset. Compared to other methods, deep learning-based methods take more time. Among the three deep learning methods, DARecNet-BS requires a significantly longer processing time than BSNet-Conv and SSANet-BS. The reason is that the band attention weights of DARecNet-BS cannot fully represent the band saliency. Consequently, DARecNet-BS can only select bands by calculating the entropy of the reconstructed image, which introduces additional computational cost. This disadvantage becomes more pronounced as the image size increases. Conversely, the three-stage structure of SSANet-BS enables bands to be selected directly from the converged band attention weights, as in BSNet-Conv, thereby reducing the computational cost. This represents a distinct advantage of the three-stage structure of SSANet-BS.
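Selecting bands directly from converged attention weights amounts to a ranking step, which is why it adds almost no cost after training. The sketch below assumes that the bands with the largest weights are kept; the function name and the weight vector are hypothetical and only illustrate the idea.

```python
import numpy as np

def select_bands(weights, n_bs):
    """Return the indices of the n_bs bands with the largest attention weights."""
    weights = np.asarray(weights, dtype=float)
    top = np.argsort(weights)[::-1][:n_bs]   # indices sorted by descending weight
    return np.sort(top)                       # report in spectral order

# Hypothetical converged weights for an 8-band image (illustrative values only).
w_b = np.array([0.02, 0.91, 0.00, 0.47, 0.05, 0.88, 0.01, 0.63])
selected = select_bands(w_b, 3)
```

By contrast, an entropy-based criterion has to process every band of the reconstructed image, so its cost grows with the image size.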

Ablation Study for SSAMs
In this section, three variants of the SSANet-BS model are constructed to verify the effectiveness of SSAM. This is achieved by removing from SSANet-BS the SSAM$_{b-w}$ in the band-width (b-w) direction, the SSAM$_{b-h}$ in the band-height (b-h) direction, and both of them, respectively. The three variants, designated as no-SSAM$_{b-w}$, no-SSAM$_{b-h}$, and no-SSAM, are tested on the IP220 dataset. The OA and AOA values of SSANet-BS and its three variants for $n_{BS}$ ranging from 5 to 30 are shown in Figure 13.
As can be seen in Figure 13a, SSANet-BS exceeds the three variants mentioned above at all $n_{BS}$. Meanwhile, both variants lacking the SSAM module in one direction, namely no-SSAM$_{b-w}$ and no-SSAM$_{b-h}$, are superior to the variant without any SSAM module. In addition, the AOA values shown in Figure 13b indicate that, in comparison to the complete SSANet-BS, the variants lacking SSAM$_{b-w}$, SSAM$_{b-h}$, or both modules exhibit reductions in AOA of 2.89%, 4.94% and 14.59%, respectively. Therefore, both SSAM$_{b-w}$ and SSAM$_{b-h}$ developed in this paper can effectively improve the model's performance. The ablation study indicates that SSANet-BS successfully utilizes SSAM to capture the spectral-spatial information of HSI during the band selection process. The three-stage structure of SSANet-BS, comprising BAM, SSAMs and the reconstruction network, has been demonstrated to be effective.
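The ablation comparison can be read as a difference of averaged accuracies. The sketch below assumes that AOA is the mean of the OA values over the six subset sizes from 5 to 30, which is an interpretation of the text rather than a definition quoted from it; all OA values are made-up toy numbers, not results from Figure 13.

```python
def average_overall_accuracy(oa_by_size):
    """Mean OA (in per cent) over all tested band subset sizes."""
    return sum(oa_by_size.values()) / len(oa_by_size)

# Toy OA curves (per cent) for n_BS = 5, 10, ..., 30; illustrative values only.
toy_full = {5: 60.0, 10: 66.0, 15: 70.0, 20: 72.0, 25: 73.0, 30: 73.0}
toy_ablated = {n: oa - 3.0 for n, oa in toy_full.items()}

aoa_full = average_overall_accuracy(toy_full)
aoa_ablated = average_overall_accuracy(toy_ablated)
drop = aoa_full - aoa_ablated   # reduction in AOA, in percentage points
```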

Effectiveness of the Three-Stage Structure
In order to validate the necessity and effectiveness of the three-stage structure, a two-stage variant of SSANet-BS, named SSANet-BS-2S, has been constructed. In SSANet-BS-2S, the BAM is situated in the same stage as the two SSAMs, so the three modules are in a parallel relationship. This is in contrast to the progressive relationship in the three-stage version of SSANet-BS.
Under the optimal parameters, Figure 14 shows that the performance of SSANet-BS-2S is markedly inferior to that of SSANet-BS, exhibiting a deficit of approximately 21% in AOA. The discrepancy can be attributed to the fact that, in the two-stage SSANet-BS-2S, the BAM and the two SSAMs operate in a cooperative manner, jointly modelling the HSI. Information pertaining to the significance of bands is distributed throughout the hidden features of $w_b$, $W_{b-w}$, and $W_{b-h}$. Consequently, $w_b$ cannot independently and comprehensively represent the significance of bands. In contrast, within the three-stage SSANet-BS, the BAM and SSAMs are in a progressive order. The spectral-spatial information within HSI learned by the SSAMs is used to guide the adjustment of $w_b$ in the BAM via backpropagation, enabling $w_b$ to automatically converge to the bands with high significance during the training process.

To further validate the effectiveness of the three-stage structure of SSANet-BS, we developed a variant based on SSANet-BS-2S, termed SSANet-BS-2SE. To address the aforementioned issue arising from the introduction of spatial or spectral-spatial modules, DARecNet-BS selects the bands with higher entropy values during the reconstruction process. The variant SSANet-BS-2SE implements band selection in an analogous manner. Figure 14 illustrates that SSANet-BS-2SE demonstrates a notable enhancement in performance relative to SSANet-BS-2S. Nevertheless, it still exhibits a performance deficit when compared to SSANet-BS. This suggests that, in comparison to a manually designed criterion (entropy), the automatic discovery of salient bands using attention mechanisms can effectively enhance model performance. The two-stage structure of SSANet-BS-2SE is therefore unable to fully capitalize on the advantages of the attention mechanism.
In conclusion, the three-stage structure of SSANet-BS guarantees that the band attention weights can independently and comprehensively represent the significance of bands. This allows the attention mechanism to automatically evaluate and select salient bands and achieve superior results. Therefore, the three-stage structure is both effective and necessary.

Comments on Existing BS Methods and SSANet-BS
The attention mechanism can be used to learn the complex spectral-spatial interactions within HSI and enable the automated identification of significant bands. Current research on deep learning-based BS methods predominantly focuses on how to utilize attention mechanisms more effectively to enhance model performance. BS-Net [33] is the first BS method to automatically select bands using an attention mechanism. Then, NBAN [20] employs a non-local attention mechanism to capture long-range contextual information in the spectral dimension. Next, DARecNet-BS [35] introduces an independent spatial attention module, and TAttMSRecNet [36] further exploits spectral-spatial information to improve model performance. By contrast, the proposed SSANet-BS makes the SSAMs compatible with the BAM through the design of the three-stage structure, and achieves automatic band selection using the attention mechanism while making full use of spectral-spatial information. The experimental results demonstrate that SSANet-BS is an effective and stable method.
BSNet-Conv is characterized by a straightforward structure that facilitates fast processing in real-world scenarios. Nonetheless, its performance is generally mediocre. DARecNet-BS incorporates an independent spatial attention module, which offers new insights for the HSI BS domain. However, the band selection process of DARecNet-BS relies on entropy, which results in a slower processing speed. SSANet-BS achieves promising results by learning the spectral-spatial information of HSIs. However, according to statistics from the PyTorch framework, for the IP220 dataset (comprising 220 bands), the parameter count of SSANet-BS is about 43% higher than that of BSNet-Conv. The larger number of parameters requires more GPU memory. This is less suitable for computing platforms with lower specifications, which may limit the applicability of SSANet-BS in certain contexts. With today's highly developed GPU hardware, however, the parameter volume of SSANet-BS does not present a significant bottleneck in application, and this disadvantage is being mitigated as computer technology advances.
Further, deep learning-based methods, including SSANet-BS, have the following potential issues. In contrast to domains such as CV and NLP, where models are employed in the manner of inference [45,46], the band selection process of existing attention-based BS methods is conducted during training. An existing BS model can only learn information about the target HSI, which greatly limits the potential capability of the neural network. In addition, due to the training process, deep learning-based BS methods take tens or even hundreds of times longer than machine learning-based methods. Therefore, it would be interesting to train the model on multiple HSIs and perform BS on the target HSI in the manner of inference. The inference process of a neural network is much faster than the training process, and if training can be done on multiple HSIs, it may be possible to obtain a BS method with higher performance and a runtime comparable to that of machine learning-based methods. In the future, we will study these issues and improve SSANet-BS.

Conclusions
This paper presents SSANet-BS, a network designed for BS. SSANet-BS is a three-stage BS method that solves the problem that existing two-stage BS methods cannot automatically search for salient bands using the attention mechanism while learning spatial information. It considers BS as a weighted reconstruction task of HSI, and leverages the BAM and SSAMs to model the complex spectral-spatial cross-dimensional nonlinear interactions in HSI during the reconstruction process. Further, a multi-scale reconstruction network, featuring convolution kernels of various scales, is used to reconstruct the HSI and thereby optimize the model. Experimental results on four publicly available datasets demonstrate that SSANet-BS outperforms existing BS methods and exhibits satisfactory stability. In the future, SSANet-BS is expected to be deployed for tasks including HSI classification, segmentation, and target detection, providing strong support for the HSI processing field.

Figure 2. Overall structure of SSANet-BS.

Figure 3. Schematic diagram of the neural network in the BAM.

The product operator represents band-wise multiplication, and $f_{linear}$ denotes the linear transformation operation. The $L_1$ regularization is imposed on the loss function of SSANet-BS, which introduces a sparse constraint on $w_b$ in order to reduce the redundancy of the final band subset. Therefore, some elements in $w_b$ may be 0 or close to 0.
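The effect of band-wise multiplication and of the $L_1$ penalty can be sketched in a few lines. This is only an illustration: the mean-squared-error reconstruction term, the coefficient name `lam`, and both function names are assumptions made for this sketch, and the actual loss used by SSANet-BS may differ.

```python
import numpy as np

def reweight_bands(x, w_b):
    """Scale every band of an (H, W, B) cube by its attention weight."""
    return x * w_b   # w_b of shape (B,) broadcasts over the last axis

def sparse_reconstruction_loss(x, x_rec, w_b, lam):
    """Assumed loss form: MSE reconstruction error plus an L1 penalty on w_b."""
    mse = np.mean((x - x_rec) ** 2)
    l1 = np.sum(np.abs(w_b))   # pushes many weights towards 0
    return mse + lam * l1

x = np.ones((2, 2, 3))                 # toy cube with 3 bands
w_b = np.array([1.0, 0.0, 0.5])        # hypothetical band weights
x_b = reweight_bands(x, w_b)           # band 1 is suppressed entirely
loss = sparse_reconstruction_loss(x, x, w_b, lam=0.01)
```

With a perfect reconstruction the loss reduces to the penalty term, so the optimizer is rewarded for driving the weights of uninformative bands to zero.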

SSAM$_{b-w}$ and SSAM$_{b-h}$ are implemented in the same way, except that the directions are different. Taking the SSAM$_{b-w}$ in the b-w direction as an example, the neural network $f_{b-w}$ takes $X_b$ as input and generates the spectral-spatial attention weight matrix $W_{b-w}$ in the b-w direction: $W_{b-w} = f_{b-w}(X_b)$. Let $X_{b-w}^k$ and $X_b^k$ denote the k-th section or layer of $X_{b-w}$ and $X_b$ in the height direction, $1 \le k \le L$, respectively. Then, $X_{b-w}^k$ is obtained by the element-wise multiplication of $X_b^k$ and $W_{b-w}$. Similarly, the module SSAM$_{b-h}$ outputs the SSAM-weighted image $X_{b-h}$ of the b-h direction. Then, $X_{b-w}$ and $X_{b-h}$ are fused to generate the SSAM-fusion image.
SSANet-BS is implemented in the PyTorch framework based on CUDA 10.7. All experiments are run on an Intel Xeon E5-2699 v4 CPU and an Nvidia Tesla P40 GPU.
• Indian Pines (IP220): It contains 220 bands, with a resolution of 145 × 145 pixels and 16 classes of ground objects labeled.
• Washington DC Mall (DC191): It is an airborne HSI acquired by the HYDICE sensor, which contains 191 bands, with a resolution of 280 × 307 and 6 classes.
• Pavia University (PU103): PU103 was taken in 2002 by the ROSIS sensor over the campus of Pavia University in Italy. Its size is 610 × 340 × 103, and it has 9 classes.
• QUH-Qingyun (QY176) [44]: The image was captured on 18 May 2021 in Qingdao, China, utilising a Gaiasky mini2-VN imaging spectrometer mounted on a UAV platform. It comprises 176 spectral bands. After cropping, it is 600 × 200 in size and contains 5 classes of ground labels.

Figure 4. The datasets used in the experiment. The land cover types and the number of samples for each dataset are indicated, respectively. (a) IP220. (b) DC191. (c) PU103. (d) QY176.
Remote Sens. 2024, 16, x FOR PEER REVIEW

Figure 10. The AOA values of SSANet-BS and comparison methods on four HSI datasets. (a) IP220. (b) DC191. (c) PU103. (d) QY176. The optimal and suboptimal results are bolded in red and black.

Figure 13. The results of the ablation study for SSAMs on the IP220 dataset. (a) OA values. (b) AOA values. The optimal and suboptimal results are bolded in red and black.

Figure 14. The OA and AOA values of SSANet-BS, SSANet-BS-2S and SSANet-BS-2SE on the IP220 dataset. (a) OA values. (b) AOA values. The optimal and suboptimal results are bolded in red and black.

Table 1. Detailed structure of each module in SSANet-BS.


Table 2. The AOA of SSANet-BS under different λ on four datasets. Optimal results are highlighted in bold.

Table 3. Classification results of SSANet-BS and comparative methods with 15 bands on the IP220 dataset. Values in the table are in per cent. Optimal results of BS methods are highlighted in bold.
Columns: Label, Full Bands, LLE, Isomap, MVPCA, E-FDPC, ASPS, SOPSRL, GRSC, BSNet-Conv, DARecNet-BS, S$^4$P.

Table 4. Classification results of SSANet-BS and comparative methods with 15 bands on the DC191 dataset. Values in the table are in per cent. Optimal results of BS methods are highlighted in bold.

Table 5. Classification results of SSANet-BS and comparative methods with 15 bands on the PU103 dataset. Values in the table are in per cent. Optimal results of BS methods are highlighted in bold.

Table 6. Classification results of SSANet-BS and comparative methods with 15 bands on the QY176 dataset. Values in the table are in per cent. Optimal results of BS methods are highlighted in bold.

Table 7. The runtime (s) of selecting 30 bands by different BS methods on the IP220 dataset.