Pyramid Cascaded Convolutional Neural Network with Graph Convolution for Hyperspectral Image Classification

Abstract: Convolutional neural networks (CNNs) and graph convolutional networks (GCNs) have made considerable advances in hyperspectral image (HSI) classification. However, most CNN-based methods learn features at a single scale in HSI data, which may be insufficient for multi-scale feature extraction in complex data scenes. To learn the relations among samples in non-grid data, GCNs are employed and combined with CNNs to process HSIs. Nevertheless, most CNN-GCN-based methods may overlook the integration of pixel-wise spectral signatures. In this paper, we propose a pyramid cascaded convolutional neural network with graph convolution (PCCGC) for hyperspectral image classification. It mainly comprises CNN-based and GCN-based subnetworks. Specifically, in the CNN-based subnetwork, a pyramid residual cascaded module and a pyramid convolution cascaded module are employed to extract multiscale spectral and spatial features separately, which enhances the robustness of the proposed model. Furthermore, an adaptive feature-weighted fusion strategy is utilized to adaptively fuse the multiscale spectral and spatial features. In the GCN-based subnetwork, a band selection network (BSNet) is used to learn the spectral signatures in the HSI using nonlinear inter-band dependencies. Then, the spectral-enhanced GCN module is utilized to extract and enhance the important features in the spectral matrix. Subsequently, a mutual-cooperative attention mechanism is constructed to align the spectral signatures between the BSNet-based matrix and the spectral-enhanced GCN-based matrix for spectral signature integration. Extensive experiments performed on four widely used real HSI datasets show that our model achieves higher classification accuracy than fourteen comparative methods, demonstrating the superior classification performance of PCCGC over the state-of-the-art methods.


Introduction
With the development of hyperspectral imaging techniques, hyperspectral images (HSIs) with abundant spectral signatures and spatial features have become available. HSIs have hundreds of narrow and continuous electromagnetic spectrum bands, spanning from the visible to the near-infrared range. HSIs can reflect areas of the Earth's surface with spectral and spatial information, and they have been widely utilized in various applications, e.g., mineral exploitation [1], environmental surveillance [2], water quality monitoring [3], and urban planning [4,5]. Among these applications, HSI classification, which discriminates each land-cover type at the pixel level [6], plays a critical role and has attracted increasing research attention. However, HSI classification remains a significant challenge because HSIs contain excessively high spectral dimensionality and complex spatial features [7].
Traditional methods based on hand-crafted features for HSI classification mainly fall into two categories: spectral-based and spectral-spatial-based methods [8,9]. Spectral-based methods, e.g., random forest [10], k-nearest neighbor [11], and support vector machine [12], which only learn spectral signatures, produce classification maps with salt-and-pepper noise and unsatisfactory classification performance. At the same time, the principal component analysis [13] method was developed to remove redundant spectral signatures for spectral signature extraction. To further improve classification performance, another category of methods based on spectral-spatial features has emerged, such as the extended morphological profile [14] and superpixel segmentation [15], whose classification results are enhanced compared with the spectral-based methods. Although the aforementioned methods can address HSI classification tasks, they have limited feature extraction capability and fail to learn deep semantic information due to a lack of strong data-fitting ability and limited robustness in complex HSI data scenes [16,17].
Over the past few years, deep learning (DL) has been recognized as a powerful data analysis technique [18] for effectively addressing nonlinear problems, and it has been extensively used for HSI processing tasks. Compared with traditional machine learning methods, DL, which includes various deep networks such as CNNs [19], RNNs [20], GNNs [21], Transformers [22-24], and GANs [25], has made significant progress in HSI classification. Among these architectures, CNNs and GCNs have received the most attention. CNNs are extensively used because they reuse convolutional kernels with shared weights across input feature maps, enabling them to adeptly learn image features [26]. For example, Hu et al. [19] employ 1D convolution to learn features along the spectral channel of the HSI data cube, achieving good classification performance compared with traditional methods. Although methods based on 1D convolution adapt well to spectral signatures, they overlook spatial information, which limits their ability to describe spatial contextual information. To cope with this problem, a novel dual-branch spectral and spatial model is proposed in [27]. This network uses 1D convolution and 2D convolution to learn spectral signatures and spatial features separately; a fully connected layer is then utilized to exploit spectral and spatial correlation. The aforementioned double-branch network learns the spectral signatures and spatial features separately. To jointly excavate features from a 3D HSI cube, a cascaded 3D convolution is developed in [28] to learn spectral-spatial features. However, this cascaded 3D CNN uses a 3D kernel of size (h × w × d), which has many parameters. In this article, 3D convolutions with kernel sizes of (h × w × 1) and (1 × 1 × d) are employed to reduce the number of network parameters. Meanwhile, stacking too many 3D convolutions will lead to vanishing gradients and decrease the final classification accuracy. To overcome this, residual networks [29] and dense networks [30] are utilized. Although the above-mentioned CNN-based methods can be used effectively for HSI classification, they overlook the extraction of multiscale features [31] and are thus not robust enough for various HSI data scenes. Meanwhile, CNN methods based on grid data have limitations in capturing the relationships between samples in HSI data.
With the emergence of the graph neural network (GNN) [32] architecture, GNNs have been used by some researchers as effective tools to deal with HSI data based on non-Euclidean geometric properties. Among various GNN architectures [33-35], the GCN has been widely used in HSI processing because it applies directly to arbitrarily shaped graphs, allowing it to learn graph structure information and node features simultaneously [36]. To exploit the relationships between different nodes, the methods for graph construction on HSI data typically include superpixel-based and pixel-based [37,38] approaches. From the superpixel perspective, a novel graph convolutional method was explored in [33] that utilizes the GCN to extract features from a graph constructed with superpixel methods. Based on the superpixel graph, Wan et al. [33] employ multiscale stacked GCN layers to learn context-based spatial information. Although these GCN-based methods achieve satisfactory feature extraction, a graph constructed by the superpixel method tends to overlook certain local spectral and spatial information. Then, a graph-in-graph method was devised by Jia et al. [39], where each node within a local range structure is called an internal graph, and all the nodes of the HSI form an external graph. This approach can highlight both the local and global information of the entire graph. Although superpixel-based graphs can describe the local features of HSIs well, pixels with different labels may be assigned to the same superpixel region, resulting in misclassification in the final classification map [8]. Meanwhile, superpixel-based graph models may overlook pixel-level feature descriptions, which further leads to poor pixel-level classification results. To cope with these difficulties, from a pixel-based perspective, a mini-batch GCN method is proposed in [40], which employs a batch-by-batch GCN training approach for pixel-wise feature extraction from HSI data. Based on the patch training strategy, Gao et al. [41] propose a novel model that improves neighborhood node aggregation by adaptively learning the weight correlations between different nodes. Zhang et al. [42] devise a GCN method that regards HSI data as graph-structured data and systematically aggregates structural information between different nodes for pixel-wise land-cover processing.
GCNs use convolution as a weighting function to indicate the influence exerted on a target node by its neighbors and itself, which is beneficial for graph-based data processing. The constructed graph can be updated to adapt to the HSI data representations produced by each GCN layer, which in turn makes the data representations more accurate [43]. Meanwhile, the GCN can handle arbitrary graph-based data and efficiently learn the internal similarity relationships between adjacent nodes in HSI data [44]. Following these benefits of GCNs, some researchers try to combine the advantages of CNNs and GCNs for classifying different HSI data scenes [45]. Lu et al. [46] develop a novel model that combines a separable GCN with a CNN for HSI classification. Specifically, the model encodes spectral-spatial features, adaptively learned by the designed attention module, into the structure of a graph. Then, a separable deep GCN is developed to learn long-range contextual structure relationships from the graph. Meanwhile, a local convolutional feature extraction network is employed to extract complementary local features. To make the model compatible with different HSI data scenes, Li et al. [47] design a staged feature fusion model that combines CNNs and GCNs. In the first stage, the model uses a CNN to extract non-local features. In the second stage, a GCN is employed to optimize the connectivity relationships of a graph constructed based on spectral similarity. Shi et al. [44] propose a novel network that has a graph convolution branch and a grouping convolution branch. In the graph branch, a multi-hop graph rectify attention is proposed to weight the features extracted by the GCN. In the convolution branch, a spectral intra-group and inter-group signature extraction module is designed to address the problem of high spectral dimensionality. Ghotekar et al. [45] devise a feature segmentation network, consisting of hybrid convolution and graph convolution networks, for HSI classification. First, a CNN is used to extract multi-layer features. Then, the features are fed into a GCN module to obtain patch-to-patch correlation feature maps. Finally, the extracted features are concatenated and fed into a linear layer for the final classification results. These hybrid collaborative networks, which combine CNNs and GCNs, show efficient performance in HSI processing. Nevertheless, some challenges persist in hybrid networks for HSI feature learning. In hybrid networks, a CNN module based on single-scale kernels exhibits limited effectiveness in extracting features from the original spectral and spatial information contained in complex HSI data scenes, and the lack of multiscale features inevitably leads to poor classification results. To extract graph-based high-level features, the GCN-based module employs multiple stacked graph convolutional layers, which inevitably results in oversmoothing issues. Considering that CNNs employ shared kernels across spectral feature maps to learn pixel spectral signatures while GCNs obtain pixel spectral signatures through two different matrix multiplications, the CNN-based spectral signatures may be partially incompatible with the GCN-based spectral signatures. Therefore, directly integrating the two different types of pixel spectral signatures can degrade the final classification accuracy.
In this article, we propose a novel pyramid cascaded convolutional neural network with graph convolution (PCCGC) for HSI classification. It contains two parallel subnetworks, i.e., a CNN-based subnetwork and a GCN-based subnetwork. Specifically, the CNN-based subnetwork includes a spectral pyramid residual cascaded module and a spatial pyramid convolution cascaded module. The spectral module features a designed spectral pyramid hybrid convolution block and multiple 3D spectral convolution layers, which are connected in a cascaded manner for multiscale spectral signature extraction. Moreover, considering that HSIs have rich spectral signatures, a residual connection is employed for spectral signature extraction. The spatial module is similar to the spectral module and performs multiscale spatial feature extraction; the difference between the two modules lies in the 3D convolution kernels used. Then, an adaptive feature-weighted fusion strategy is utilized to fuse the multiscale spectral and spatial features based on their respective weights. In the GCN subnetwork, a band selection network (BSNet) is used to learn the spectral signatures in the HSI using nonlinear inter-band dependencies. Then, a spectral-enhanced GCN module is utilized to learn and accentuate the important information in the spectral matrix. To prevent the oversmoothing problem while learning deep features, multiple graph convolution layers are utilized in a one-shot strategy. Subsequently, a mutual-cooperative attention mechanism is constructed that aligns the spectral signatures of the BSNet-based matrix with those of the spectral-enhanced GCN-based matrix for pixel-wise spectral signature integration. It transfers the spectral features extracted by BSNet to the GCN-based feature matrix through a cross multi-head self-attention block, and transfers the spectral features learned by the spectral-enhanced GCN module to the BSNet-based spectral feature matrix through another cross multi-head self-attention block. Finally, an additive fusion strategy is utilized to fuse the features extracted by the CNN-based and GCN-based subnetworks. Our main contributions are as follows:
(1) To extract multiscale spectral and spatial features from complex HSI datasets, the spectral pyramid residual cascaded module and the spatial pyramid convolution cascaded module are designed. The spectral module includes a devised spectral pyramid hybrid convolution block and multiple 3D spectral convolution layers, which are connected in a cascaded manner. Moreover, considering that HSIs have rich spectral signatures, a residual connection is employed to enhance spectral signature extraction. The spatial module is similar to the spectral module but is used for spatial feature extraction. Furthermore, the 3D convolution kernels used in the spectral and spatial modules are different, which benefits the separate extraction of spectral and spatial features. Then, an adaptive feature-weighted fusion strategy is utilized to fuse the multiscale spectral and spatial features based on their respective weights.
(2) To model the important spectral relations of the samples, a spectral-enhanced GCN module is employed. It strengthens significant deep spectral relations based on the constructed graph and captures the interconnectivity between pixels as well as the interdependencies among spectral signatures. To prevent the oversmoothing problem, the multiple graph convolution layers in the spectral-enhanced GCN module are stacked in a one-shot strategy.
(3) A mutual-cooperative attention mechanism is constructed to align the spectral signatures between the BSNet-based matrix and the spectral-enhanced GCN-based matrix for spectral signature integration. It transfers the spectral features extracted by BSNet to the GCN-based feature matrix through a cross multi-head self-attention block, and transfers the spectral features learned by the spectral-enhanced GCN module to the BSNet-based spectral feature matrix through another cross multi-head self-attention block. Subsequently, the two aligned matrices are concatenated for spectral signature integration.
(4) A novel method called PCCGC is proposed to realize hyperspectral image classification. PCCGC can extract CNN-based multiscale spectral and spatial features, which are then fused adaptively. In addition, PCCGC utilizes BSNet and the spectral-enhanced GCN for significant pixel-wise spectral signature extraction, and these extracted pixel-wise spectral signatures are integrated using the mutual-cooperative attention mechanism. Furthermore, the integrated spectral signatures are added to the CNN-based features, enabling the proposed model to achieve good classification performance.
The remainder of this article is structured as follows: related work is shown in Section 2, the devised PCCGC is characterized in Section 3, the experimental results are listed in Section 4, some parameters of PCCGC are discussed in Section 5, and the conclusion is presented in Section 6.

Convolutional Neural Network
An HSI data patch of size H × W × D is specified as input data, where H × W indicates the spatial size and D represents the number of spectral bands [48]. In (1), the 3D convolution has p 3D convolution kernels of size (h × w × c). Following the 3D convolution process, p feature maps of size (H − h + 1) × (W − w + 1) × (D − c + 1) are generated. Moreover, each feature map (F) is obtained by calculating the dot product between the local area at position (x, y, z) and the weight matrix [49]. The output of a neuron v_{l,i}^{x,y,z} at position (x, y, z) of the ith F in the lth layer can be calculated by:

v_{l,i}^{x,y,z} = \delta\Big( b_{l,i} + \sum_{p} \sum_{h=0}^{H_l-1} \sum_{w=0}^{W_l-1} \sum_{c=0}^{C_l-1} k_{l,i,p}^{h,w,c} \, v_{(l-1),p}^{(x+h),(y+w),(z+c)} \Big)   (1)

where \delta indicates the activation function, such as Mish, and b_{l,i} is the bias of the ith F in the lth layer. The index p indicates the connection between the current F and the F in the previous layer. W_l and H_l indicate the width and height of the 3D convolution kernel in the spatial dimension, respectively. C_l refers to the 3D convolution kernel size in the spectral dimension. The weight k_{l,i,p}^{h,w,c} is used to convolve the input data cube v_{(l-1),p}^{(x+h),(y+w),(z+c)} with an offset of (h, w, c) [50].
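To make the arithmetic in (1) concrete, the following is a minimal numpy sketch of a "valid" 3D convolution for a single input feature map (p = 1), with no activation; all names and shapes here are illustrative, not the paper's implementation.

```python
import numpy as np

def conv3d_valid(x, k, b=0.0):
    """x: (H, W, D) input cube; k: (h, w, c) kernel; returns one feature map."""
    H, W, D = x.shape
    h, w, c = k.shape
    out = np.empty((H - h + 1, W - w + 1, D - c + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for z in range(out.shape[2]):
                # dot product between the local region and the shared kernel, plus bias
                out[i, j, z] = np.sum(x[i:i+h, j:j+w, z:z+c] * k) + b
    return out

x = np.random.rand(9, 9, 20)      # H x W x D patch
k = np.random.rand(3, 3, 5)       # h x w x c kernel
print(conv3d_valid(x, k).shape)   # (7, 7, 16) = (H-h+1, W-w+1, D-c+1)
```

The printed shape matches the (H − h + 1) × (W − w + 1) × (D − c + 1) output size stated above; a real layer would additionally sum over the previous-layer maps indexed by p and apply the activation δ.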
Equation (1) describes the process of 3D convolution, which is similar to 1D and 2D convolution. However, it is essential to note that the input format of a 3D convolutional network is (B_batch, C_channel, H_height, W_width, D_spectral). The 3D feature extraction model has proven very effective at simultaneously capturing the spatial and spectral features of 3D feature maps by applying 3D kernels to 3D hyperspectral image data scenes [8,51]. Compared with the 1D and 2D convolution operations on HSI data with rich spectral signatures, 3D convolution can greatly reduce the spectral distortion phenomenon and learn more information (e.g., spatial-spectral correlation characteristics and absorption differences between adjacent spectral bands). Moreover, the 3D CNN is theoretically well suited to excavating 3D feature maps for HSI processing, since HSIs are usually denoted as 3D patch cubes.

Graph Convolutional Network
Graph neural networks (GNNs) can generalize the convolution process from grid-based data to graph-based data. The fundamental concept is to describe a node V with its own features and its neighbors' features. A model based on graph convolution can learn high-level node feature representations through multiple stacked convolutional layers. GNNs fall into two categories [52]: spectral-based [53] and spatial-based [54] methods. Spectral-based GNN methods define convolution in terms of graph signal processing, while spatial-based GNN methods define convolution through an information propagation strategy. Among these, the GCN is widely employed due to its generality. An undirected graph is typically defined as G = (V, E), where V denotes the set of nodes or vertices, and E represents the set of edges. According to the undirected graph G, the adjacency matrix A is constructed. Based on convolution on the undirected graph, the layer-by-layer propagation rule for a multi-layer GCN is defined as follows:

H^{(i+1)} = \varphi\big( \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(i)} W^{(i)} \big)

where \tilde{A} = A + I is termed the renormalization of A, A is the adjacency matrix of G, and I is the identity matrix. \tilde{D} is defined as the renormalization of D, where D_{ii} = \sum_j A_{ij} is a diagonal matrix indicating the degree of A. \varphi(\cdot) indicates an activation function, such as ELU(). H^{(i+1)} \in R^{N×D} and H^{(i)} \in R^{N×D} are the feature matrices of the (i+1)th and ith layers, respectively. Besides, H^{(0)} = X, where X indicates the matrix of feature vectors of the input nodes. W^{(i)} represents the trainable weight matrix.
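The propagation rule above can be sketched in a few lines of numpy; the ELU activation, the small path graph, and all dimensions are illustrative choices, not taken from the paper.

```python
import numpy as np

def gcn_layer(A, H, W, phi=lambda x: np.where(x > 0, x, np.expm1(x))):
    """One GCN layer: H' = phi(D̃^{-1/2} Ã D̃^{-1/2} H W), with Ã = A + I.
    phi defaults to ELU, computed as expm1(x) = exp(x) - 1 for x <= 0."""
    A_t = A + np.eye(A.shape[0])            # renormalization Ã = A + I
    d = A_t.sum(axis=1)                     # degrees of Ã
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))  # D̃^{-1/2}
    return phi(D_inv_sqrt @ A_t @ D_inv_sqrt @ H @ W)

A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)      # 3-node path graph
H = np.random.rand(3, 4)                    # node features, N x D
W = np.random.rand(4, 2)                    # trainable weight matrix
print(gcn_layer(A, H, W).shape)             # (3, 2)
```

Note how the symmetric normalization D̃^{-1/2} Ã D̃^{-1/2} averages each node's features with its neighbors', which is what causes oversmoothing when many such layers are stacked.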

CNN and GCN for HSI Classifications
Considering that HSI data contain rich spectral and spatial information, an effective model is required for HSI data classification. CNNs can effectively extract local features using shared-weight kernels. Depending on the convolutional kernels used, 1D CNNs extract spectral signatures, 2D CNNs extract spatial features, and 3D CNNs extract spectral-spatial features. The use of CNNs can greatly improve HSI processing performance. Meanwhile, GCNs generalize the convolution operation to graph data [8], allowing the learning of node feature representations, and high-level feature representations can be captured by multiple stacked GCN layers [52]. In HSI data, GCNs are employed to capture spatial contextual structure information, which is advantageous for HSI information processing. Based on the above, some models combining CNNs with GCNs have been designed for HSI classification. Liu et al. [43] design a novel heterogeneous network called CNN-enhanced GCN. Specifically, the 2D CNN is used to extract features from local-range regular regions, while the GCN is employed to learn features from long-range irregular regions. The features extracted by both the CNN and the GCN are then used as complementary features for HSI classification. Lu et al. [46] develop a novel SDGCP method. It employs a separable deep GCN to learn long-range contextual structure features; the learned features are then combined with local complementary features extracted by a CNN for HSI classification. Wang et al. [55] design a novel DF2Net for HSI classification, which includes two subnetworks: a spectral-spatial hypergraph convolutional subnetwork for learning long-range and high-order correlations, and a spectral-spatial convolution subnetwork for pixel-wise local feature extraction.

The Overall Structure of PCCGC
In this article, we propose a novel PCCGC method for HSI classification. As depicted in Figure 1, it contains two parallel subnetworks, i.e., a CNN-based subnetwork and a GCN-based subnetwork. In the CNN subnetwork, a spectral pyramid residual cascaded module (SpePRCM) is used to extract multiscale spectral signatures. Meanwhile, a spatial pyramid convolution cascaded module (SpaPCCM) is employed to extract multiscale spatial features. The features extracted by the CNN subnetwork make the proposed model more robust in classifying HSIs. Furthermore, an adaptive feature-weighted fusion strategy is utilized to adaptively fuse the multiscale spectral and spatial features based on their respective weights. In the GCN subnetwork, a BSNet is used to learn the spectral signatures in the HSI using nonlinear inter-band dependencies, which also reduces the computational cost of the GCN. Then, the spectral-enhanced GCN module is utilized to learn and accentuate the important information in the spectral matrix. Subsequently, a mutual-cooperative attention mechanism is constructed to align the spectral signatures between a BSNet-based matrix and a spectral-enhanced GCN-based matrix for spectral signature integration. Finally, the additive fusion strategy is utilized to fuse the features extracted from the GCN-based and CNN-based subnetworks. In the following, we elaborate on the functionalities of each module in the proposed model.

Adaptive Feature-Weighted Feature Fusion Based SpePRCM and SpaPCCM
Considering that HSI data cubes contain plentiful spectral signatures and rich spatial information, the SpePRCM and SpaPCCM are devised to extract multiscale spectral and spatial features separately. Moreover, the spectral pyramid hybrid convolution (SpePHC) block and the spatial pyramid hybrid convolution (SpaPHC) block are included in SpePRCM and SpaPCCM, respectively. An adaptive feature-weighted fusion strategy is then employed to fuse the extracted multiscale spectral and spatial information. Furthermore, the 3D convolutional layer used hereinafter refers to a 3D convolution, the Mish activation function, and Batch Normalization.

Spectral Pyramid Hybrid Convolution Block
The proposed SpePHC block, as shown in Figure 2, adopts a pyramid architecture with different types of convolutional layers, featuring various kernel sizes and varying numbers of output feature channels. The processed spectral feature maps FM_spe^i ∈ R^{h×w×d}, where i indicates the ith layer, are fed into the SpePHC block; the FM_spe^i are then processed in parallel by three different steps, i.e., Step_1, Step_2, and Step_3. For Step_1, the FM_spe^i are convolved by the 3D convolutional layer Conv_spe^{i_3} located at the bottom of the pyramid architecture, with the purpose of learning spectral features using a kernel size of (1 × 1 × 5). This results in the spectral feature maps FM_spe^{(i+1)_3}, which have an output dimension of 36. For Step_2, the FM_spe^i are convolved by the 3D transpose convolutional layer TransConv_spe^{i_2} located at the middle of the pyramid architecture, with the purpose of learning spectral features using a kernel size of (1 × 1 × 3). This yields the spectral feature maps FM_spe^{(i+1)_2}, which have an output dimension of 24. For Step_3, the FM_spe^i are convolved by the 3D convolutional layer Conv_spe^{i_1} located at the top of the pyramid architecture, which uses a kernel size of (1 × 1 × 1). The output dimension of the resulting spectral feature maps FM_spe^{(i+1)_1} is 12. Then, we concatenate the three different spectral feature maps along the channel dimension, and the output spectral feature maps FM_spe^{i+2} are obtained. The detailed operation process is shown in the following:

FM_spe^{i+2} = Concat(FM_spe^{(i+1)_3}, FM_spe^{(i+1)_2}, FM_spe^{(i+1)_1})_{dim=channel}

where Concat()_{dim=channel} indicates the concatenation operation that operates on the channel dimension.
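The branch-and-concatenate pattern of the SpePHC block can be sketched at the shape level as follows. The random arrays below merely stand in for the outputs of the three convolutional branches, assuming the layers are padded so their spatial and spectral extents stay equal and concatenation is possible; only the channel bookkeeping (36 + 24 + 12) follows the text.

```python
import numpy as np

rng = np.random.default_rng(0)
h, w, d = 9, 9, 20                               # assumed spatial/spectral extent
step1 = rng.random((36, h, w, d))                # Conv, kernel (1,1,5) -> 36 channels
step2 = rng.random((24, h, w, d))                # TransConv, kernel (1,1,3) -> 24 channels
step3 = rng.random((12, h, w, d))                # Conv, kernel (1,1,1) -> 12 channels

# concatenate the three branch outputs along the channel axis
fused = np.concatenate([step1, step2, step3], axis=0)
print(fused.shape[0])                            # 72 output channels
```

The wider branches carry finer-grained object information while the narrow 1 × 1 × 1 branch preserves more contextual detail, which is why the block mixes them before the next cascade stage.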
Remote Sens. 2024, 16, x FOR PEER REVIEW


Spatial Pyramid Hybrid Convolution Block
The proposed spatial pyramid hybrid convolution (SpaPHC) block, as shown in Figure 3, is employed to learn multiscale spatial features. It has almost the same pyramid architecture as SpePHC and includes three parallel steps for processing the spatial feature maps FM_spa^i. The first and second steps follow almost the same procedures as Step_1 and Step_2 in the SpePHC block, but with different kernel sizes in the 3D convolutional layer Conv_spa^{i_3} and the 3D transpose convolutional layer TransConv_spa^{i_2}, which are (5 × 5 × 1) and (3 × 3 × 1), respectively. Two different spatial feature maps are then generated, namely FM_spa^{(i+1)_3} with an output dimension of 12 and FM_spa^{(i+1)_2} with an output dimension of 24. For the third step, the FM_spa^i are convolved by the 3D convolutional layer Conv_spa^{i_1} located at the top of the pyramid architecture using a kernel size of (1 × 1 × 1), resulting in the spatial feature maps FM_spa^{(i+1)_1} with an output dimension of 48. Then, we concatenate the three different spatial feature maps along the channel dimension, and the output spatial feature maps FM_spa^{i+2} are obtained:

FM_spa^{i+2} = Concat(FM_spa^{(i+1)_3}, FM_spa^{(i+1)_2}, FM_spa^{(i+1)_1})_{dim=channel}

where Concat()_{dim=channel} indicates the concatenation operation that operates on the channel dimension of the spatial feature maps.
The proposed SpePHC and SpaPHC blocks can generate spectral and spatial feature maps with various receptive fields and corresponding output feature channels, learning information about finer-grained objects with larger numbers of output feature channels, as well as capturing more contextual detail with smaller numbers of output feature channels.

The Multiscale Spectral and Spatial Feature Extraction of SpePRCM and SpaPCCM
In the spectral pyramid residual cascaded module (SpePRCM), the original feature map FM_j^{HSI} (j indicates the j-th layer) is first processed by the 3D convolutional layer Conv_j^{spe} with a kernel size of (1 × 1 × 7) to extract the spectral signatures. This generates the spectral feature maps FM_{j+1}^{spe} with an output channel number of 46. To extract the multiscale spectral features, the FM_{j+1}^{spe} are fed into the SpePHC block P_{spe}(x; ε), and the FM_{j+2}^{spe} are generated, where ε is a learnable parameter. Then, the concatenation operation is employed on FM_{j+1}^{spe} and FM_{j+2}^{spe} along the channel dimension. To further extract the spectral features from the previous feature maps and prevent information loss, the 3D convolutional layer Conv_{j+2}^{spe} with a kernel size of (1 × 1 × 1) is employed to obtain the multiscale spectral feature maps FM_{j+3}^{spe}.

The spatial pyramid convolution cascaded module (SpaPCCM) includes the 3D convolutional layer Conv_j^{spa} with a kernel size of (1 × 1 × band), which is employed to squeeze the depth of the FM_j^{HSI}, resulting in the spatial feature maps FM_{j+1}^{spa}. Then, the SpaPHC block P_{spa}(x; ε) is utilized to extract the multiscale spatial features, yielding the spatial feature maps FM_{j+2}^{spa}. The FM_{j+3}^{spa} are the output spatial feature maps generated from the SpaPCCM. The SpePHC and SpaPHC blocks included in the CNN-based subnetwork make it more generalized and robust while learning from different datasets.

The Multiscale Spectral and Spatial Feature Fusion with the Adaptive Feature-Weighted Fusion Strategy
Considering the importance of features for classification, the extracted multiscale spectral and spatial features play significant but unequal roles. Meanwhile, to fully harness the extracted spectral and spatial features, inspired by [56], an adaptive feature-weighted fusion strategy is employed, which fuses the spectral signatures and spatial features adaptively. In this strategy, the extracted features located at the same location are added element-wise, aggregating information across spectral signatures and spatial features, thereby enhancing the features that are important to the classification accuracy. Meanwhile, to dynamically allocate weights, two different weight coefficients, namely α_1 and α_2, are used. To avoid the poor classification accuracy that could result from these two values being too large or too small, a softmax function is applied to adjust the values of α_1 and α_2. Then, α_1 and α_2 balance the spectral signatures and spatial features, enhancing the fusion of different information according to the impact of various features on classification accuracy. The detailed operation is illustrated as follows:
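The strategy above can be sketched in a few lines of pure Python: two raw weights are normalized with a softmax and used for an element-wise weighted sum. The function names and the flat-list feature representation are illustrative assumptions, not the paper's implementation.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scalars."""
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def adaptive_fuse(f_spe, f_spa, a1, a2):
    """Element-wise weighted sum of spectral and spatial features,
    with weights alpha_1, alpha_2 normalized by a softmax so neither
    can grow too large or too small (a sketch of the strategy above)."""
    w1, w2 = softmax([a1, a2])
    return [w1 * s + w2 * p for s, p in zip(f_spe, f_spa)]

# Equal raw weights give alpha_1 = alpha_2 = 0.5, i.e. a plain average.
fused = adaptive_fuse([1.0, 2.0], [3.0, 4.0], a1=0.0, a2=0.0)
```

In training, a1 and a2 would be learnable scalars updated by backpropagation; the softmax keeps their effective values bounded and summing to one.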

Spectral-Enhanced GCN Module
To learn the features in non-grid image data scenes, the GCN is employed. It can capture information different from that of the CNN, thereby enhancing the classification accuracy. A graph is usually defined as G = (V, E), where V represents the set of nodes (or vertices) and E represents the set of edges. Let v_i ∈ V denote a node and e_ij = (v_i, v_j) ∈ E denote an edge from v_i to v_j. The neighborhood of a node v is characterized by the set N(v) = {u ∈ V | (v, u) ∈ E}. The adjacency matrix A ∈ R^{n×n} of graph G is defined such that A_ij = 1 if e_ij ∈ E and A_ij = 0 otherwise. Graph G might possess node attributes denoted as S, where S ∈ R^{m×c} is a node feature matrix and s_v ∈ R^c represents the feature vector of a node v. Simultaneously, graph G might possess edge attributes denoted as S^e, where S^e ∈ R^{m×d} is an edge feature matrix and s^e_{v,u} ∈ R^d denotes the feature vector of an edge (v, u).
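These graph definitions translate directly into code. The following minimal sketch (function names are ours) builds the adjacency matrix A and the neighborhood set N(v) exactly as defined above:

```python
def adjacency_matrix(n, edges, undirected=True):
    """Build A (n x n) with A[i][j] = 1 if (v_i, v_j) is in E, else 0."""
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        A[i][j] = 1
        if undirected:
            A[j][i] = 1  # undirected graphs have symmetric A
    return A

def neighborhood(A, v):
    """N(v) = {u in V | (v, u) in E}, read off row v of A."""
    return {u for u, a in enumerate(A[v]) if a == 1}

# A small path graph 0-1-2-3 as an undirected example.
A = adjacency_matrix(4, [(0, 1), (1, 2), (2, 3)])
```

For an undirected graph, A is symmetric, so the neighborhood can be read from either the row or the column of a node.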
In this part, the spectral-enhanced GCN module is developed to extract and accentuate the features in the spectral channel. Considering the input structure required by the GCN, we need to construct an undirected graph G = (V, E) from each patch of the HSI data. Due to the abundance of spectral signatures in each pixel of the original HSI, GCN methods based on the original HSI data create a larger graph, resulting in high computational resource costs [37,57]. To address this issue, inspired by [58], we employ the BSNet to select important spectral signatures from the original HSI patch FM_{HSI} ∈ R^{h×w×c} in the spectral channel.
where FM_{BS} is the output of BSNet, the BSNet() function indicates the BSNet, and θ is the learnable parameter in BSNet. This selection enhances the performance of the GCN for HSI classification by considering the nonlinear interdependencies among different spectral bands. By stacking multiple GCN layers, it is possible to learn deeper node information from the constructed graph. However, stacking too many GCN layers results in a decrease in model performance and lower classification accuracy. To avoid these issues, the multi-hop adjacency matrix [57] is constructed, which records the nodes at a distance of d hops from the selected nodes. It can excavate the underlying feature relationships and enlarge the receptive field. The multi-hop adjacency matrix is constructed in the spectral channel to help the GCN learn the spectral signatures in the HSI.
To construct the multi-hop adjacency matrix, the FM_{HSI} processed by the BSNet is first transformed into feature nodes M_{BS} ∈ R^{hw×c}:

M_{BS} = reshape(FM_{BS})

where M_{BS} is the result processed by BSNet, abbreviated as the BS-based feature matrix, and reshape() indicates the reshape operation. Then, the multi-hop adjacency matrix A is constructed based on M_{BS}.
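The multi-hop idea can be sketched with plain matrix powers: pairs of nodes connected by a path of length at most d are marked as adjacent, which enlarges the receptive field of each GCN layer. This is a minimal illustration under our own assumptions (self-loops included, binarized output); the paper's exact construction from [57] may differ.

```python
def matmul(A, B):
    """Plain nested-list matrix product."""
    n, m, p = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

def multi_hop_adjacency(A, d):
    """Mark node pairs reachable within d hops (self-loops added so
    that powers accumulate all paths of length <= d), then binarize."""
    n = len(A)
    P = [[A[i][j] + (1 if i == j else 0) for j in range(n)]
         for i in range(n)]
    M = P
    for _ in range(d - 1):
        M = matmul(M, P)
    return [[1 if M[i][j] > 0 else 0 for j in range(n)] for i in range(n)]

# Path graph 0-1-2-3: with d = 2, node 0 now also reaches node 2.
A = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
A2 = multi_hop_adjacency(A, d=2)
```

With d = 1 the construction reduces to the ordinary adjacency matrix plus self-loops.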
The spectral-enhanced GCN module, depicted in Figure 4, is employed to learn the feature relationships within the multi-hop adjacency matrix through a multi-layer GCN. To learn the features of HSIs from the perspective of the GCN, a four-layer cascaded GCN (the number of GCN layers is discussed in Section 5.2) is first implemented in a one-shot strategy. In each GCN layer, a collection of feature nodes from M_{BS}, denoted as N_j^i, where n represents the number of feature nodes, is fed into the GCN layer. A learnable parameter matrix W ∈ R^{d×d} is employed on each node, resulting in the nodes N_j^{i+1} with comprehensive expressive abilities. Then, the nodes N_j^{i+1} are multiplied by the adjacency matrix A. An exponential linear unit (Elu) activation function is used to accelerate the GCN learning process. The hyperparameter α controls the point at which the Elu function saturates towards negative values for negative inputs in the GCN layer. Then, the concatenation operation is utilized to concatenate the features from the outputs of the four GCN layers.
where Cat() denotes the concatenation operation and F^{i+1} indicates the features yielded from the i-th GCN layer. To mitigate the overfitting problem in the four-layer GCN, which may lead to a decrease in HSI classification performance, the dropout technique is utilized. The parameter p (here, we set p to 0.3) in the dropout layer of each GCN layer is used as a threshold value that determines which part of the features in the GCN layer is dropped. To further learn and integrate information from the HSIs, another one-layer GCN is employed after these four GCN layers, followed by an Elu nonlinearity activation function. Finally, to ensure comparability among the features yielded from different GCN layers, the softmax function is applied.
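One propagation step of the GCN layer described above can be sketched as H' = Elu(A · H · W). This is a deliberately minimal pure-Python version (no degree normalization, no dropout, names are ours), intended only to show the aggregate-transform-activate pattern:

```python
import math

def elu(x, alpha=1.0):
    """Exponential linear unit; alpha controls where the function
    saturates towards negative values for negative inputs."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def gcn_layer(A, H, W, alpha=1.0):
    """One GCN propagation step, H' = Elu(A @ H @ W) — a sketch of the
    layer described above, without degree normalization."""
    def mm(X, Y):
        return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
                 for j in range(len(Y[0]))] for i in range(len(X))]
    Z = mm(mm(A, H), W)
    return [[elu(v, alpha) for v in row] for row in Z]

# Two mutually connected nodes with self-loops, 2-d one-hot features,
# identity weights: each node aggregates its neighbor's features.
A = [[1, 1], [1, 1]]
H = [[1.0, 0.0], [0.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
H1 = gcn_layer(A, H, W)
```

Stacking four such layers and concatenating their outputs along the feature dimension corresponds to the Cat() step in the text.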
After obtaining the features extracted from the four-layer cascaded GCN and the one-layer GCN, we further enhance them through a spectral-enhanced method. Specifically, a linear layer is employed to improve the linear expressive skill of the features. Then, we reshape the features to obtain the feature matrix M_1 ∈ R^{c×hw}. In order to learn the significant spectral signature, two cascaded adaptive average pooling layers are used. Additionally, a Mish activation function is included in the adaptive average pooling layer to avoid gradient saturation. Then, the feature matrix M_2 ∈ R^{c×1} is obtained. Finally, M_1 ∈ R^{c×hw} is multiplied with M_2 ∈ R^{c×1} to obtain the feature matrix M_{GCN} ∈ R^{c×hw} with significant information in the spectral channel, in which the important values in the spectral channel are emphasized. The detailed procedure is shown as follows:

M_{GCN} = M_2 ⊗ M_1

where ⊗ indicates the matrix product operation, and M_{GCN} is the result of processing by the spectral-enhanced GCN module, abbreviated as the GCN-based feature matrix.

In this module, the GCN layers are employed to learn the inherent features of the nodes, which differ from those extracted by the CNN. Then, the spectral-based method is employed to accentuate significant features in the spectral dimension. Through the spectral-enhanced GCN module, the significant spectral signature can be learned and accentuated, and the HSI classification performance can be enhanced.
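Given the stated shapes (M_1 ∈ R^{c×hw}, M_2 ∈ R^{c×1}, M_GCN ∈ R^{c×hw}), the product M_2 ⊗ M_1 acts as a per-channel scaling: each spectral channel of M_1 is weighted by its pooled descriptor in M_2. The sketch below implements that reading in pure Python; the function name is ours and the interpretation of ⊗ as broadcast scaling is our assumption from the dimensions.

```python
def spectral_enhance(M1, M2):
    """Scale each spectral channel of M1 (c x hw) by the pooled channel
    descriptor M2 (c x 1) — a sketch of M_GCN = M2 (x) M1, emphasizing
    important values in the spectral channel."""
    return [[M2[ch][0] * v for v in row] for ch, row in enumerate(M1)]

M1 = [[1.0, 2.0], [3.0, 4.0]]   # c = 2 channels, hw = 2 positions
M2 = [[0.5], [2.0]]             # per-channel importance weights
M_gcn = spectral_enhance(M1, M2)
```

Channels with larger pooled weights (here the second channel, weight 2.0) are amplified relative to the rest.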

Mutual-Cooperative Attention Mechanism
After obtaining the feature matrix M_{GCN}, extracted by the spectral-enhanced GCN module, and the feature matrix M_{BS}, extracted by BSNet, we construct a customized mutual-cooperative attention mechanism (MCAM) to align the spectral signatures between M_{GCN} and M_{BS}. As shown in Figure 5, the MCAM mainly includes two cross multi-head self-attention mechanisms (CMSM). One improved CMSM, referred to as the M_{BS} to M_{GCN} cross multi-head self-attention block (BG-CMSB), enables the transfer of spectral signatures from M_{BS} into M_{GCN}. Vice versa, the other CMSM, referred to as the M_{GCN} to M_{BS} cross multi-head self-attention block (GB-CMSB), enables the transfer of enhanced spectral signatures from M_{GCN} to M_{BS}. Subsequently, the two obtained feature matrices are merged using an element-wise addition operation. The detailed process is shown as follows:

where f_{BG-CMSB}(), f_{GB-CMSB}(), and A() represent the BG-CMSB, the GB-CMSB, and the element-wise addition, respectively. M indicates the output feature matrix of the mutual-cooperative attention mechanism.

BS-Based Feature Matrix to GCN-Based Feature Matrix Cross Multi-Head Self-Attention Block
This block is aimed at improving the transfer and expression of spectral signatures between the BS-based feature matrix M_{BS} and the GCN-based feature matrix M_{GCN}. Given M_{BS} ∈ R^{hw×c} and M_{GCN} ∈ R^{hw×c} as inputs to the BG-CMSB, the BG-CMSB initially combines M_{BS} and M_{GCN} using element-wise addition, resulting in the feature matrix M_{i_1}^{BG}. Then, M_{i_1}^{BG} is concatenated with M_{BS} to construct a new feature matrix. Next, the new feature matrix is separately multiplied with two different matrices, thereby linearly constructing the key K ∈ R^{hw×2c} and value V ∈ R^{hw×2c} simultaneously. At the same time, M_{BS} is multiplied with a weight matrix to linearly construct the query Q ∈ R^{hw×c}. Additionally, Q, K, and V are all projection matrices. Then Q, K, and the scale factor are processed by softmax, and subsequently combined with V, to calculate the cross multi-head self-attention score from M_{BS} to M_{GCN}. Moreover, the scale factor is used to control the gradient of the model during the training process. The detailed overall operation is shown as follows:
To obtain stable results, we execute the attention calculation process multiple times in parallel; here, it is executed eight times. Then, we reshape CA_{BG}, and a linear operation is used to project the reshaped attention score into M_{i_2}^{BG} ∈ R^{hw×c}. Subsequently, a softmax operation is performed. To enhance the important features, a residual connection is added on M_{BS}. Subsequently, element-wise addition is used to combine M_{BS}, the result of processing M_{BS} with softmax, and M_{i_2}^{BG}. The process is shown as follows:

where M_{BG-CMSB} indicates the resulting feature maps from the BG-CMSB, and softmax() is the softmax function.
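At the core of both BG-CMSB and GB-CMSB is scaled dot-product attention, softmax(QK^T/√d_k)V. The following single-head pure-Python sketch (names and toy shapes are ours; the paper uses eight heads and learned projections) shows the score computation described above:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scalars."""
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def cross_attention(Q, K, V):
    """Single-head scaled dot-product attention,
    softmax(Q K^T / sqrt(d_k)) V — a sketch of one head of the cross
    attention inside BG-CMSB/GB-CMSB; the sqrt(d_k) scale factor keeps
    gradients controlled during training."""
    d_k = len(K[0])
    scores = [[sum(q * k for q, k in zip(qr, kr)) / math.sqrt(d_k)
               for kr in K] for qr in Q]
    weights = [softmax(row) for row in scores]
    return [[sum(w * V[j][c] for j, w in enumerate(row))
             for c in range(len(V[0]))] for row in weights]

# One query attending over two key/value rows.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0], [0.0]]
out = cross_attention(Q, K, V)
```

Running this head eight times with different projections and concatenating the results would correspond to the multi-head computation in the text.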

GCN-Based Feature Matrix to BS-Based Feature Matrix Cross Multi-Head Self-Attention Block
This block is devised to transfer and improve the expression of the enhanced spectral signature between the GCN-based feature matrix M_{GCN} and the BS-based feature matrix M_{BS}. Given M_{BS} ∈ R^{hw×c} and M_{GCN} ∈ R^{hw×c} as inputs to the GB-CMSB, the GB-CMSB first combines M_{GCN} and M_{BS} using element-wise addition, resulting in the feature matrix M_{i_1}^{GB}. Then, M_{i_1}^{GB} is concatenated with M_{GCN} to linearly build the key K ∈ R^{hw×2c} and value V ∈ R^{hw×2c} simultaneously. At the same time, M_{GCN} is employed to linearly construct the query Q ∈ R^{hw×c}. The following operations are similar to those of the BG-CMSB:

where CA_{GB} is the result of the CMSM score from M_{GCN} to M_{BS}, and reshape() indicates the reshape operation. Then, the obtained M_{BG-CMSB} and M_{GB-CMSB} are concatenated to obtain the output of the mutual-cooperative attention mechanism. With the help of the mutual-cooperative attention mechanism, integrating comprehensive spectral features from various neural networks has a positive impact on subsequent classification tasks.

Additive Feature Fusion Based on CNN-Based Subnetworks and GCN-Based Subnetworks
Considering that the features extracted from different network architectures may play different roles in HSI classification, it is important to fuse multi-type features with a proper fusion strategy. When different types of features are effectively integrated, classification performance can be enhanced; conversely, when they cannot be effectively integrated, the classification performance may be degraded.
In this article, the proposed network includes two subnetworks: one is the CNN-based subnetwork, and the other is the GCN-based subnetwork, which includes the spectral-enhanced GCN module and the mutual-cooperative attention mechanism. Specifically, we first reshape the features extracted from the CNN subnetwork, and then a linear layer is utilized to yield the classification output of the CNN subnetwork. Concurrently, the features extracted from the GCN subnetwork are reshaped, and an adaptive average pooling layer is utilized to produce the classification output of the GCN subnetwork. Finally, an additive fusion strategy is employed to achieve the final classification results of PCCGC. Based on the above operations, the CNN-based features can be well integrated with the GCN-based features in an additive strategy for HSI processing.
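The additive fusion step reduces to summing the two subnetworks' class scores and taking the arg-max. The sketch below illustrates this under our own assumptions (function name and toy scores are hypothetical):

```python
def additive_fusion(logits_cnn, logits_gcn):
    """Fuse the class scores of the CNN- and GCN-based subnetworks by
    element-wise addition, then predict the arg-max class — a sketch of
    the additive fusion strategy described above."""
    fused = [a + b for a, b in zip(logits_cnn, logits_gcn)]
    return fused, max(range(len(fused)), key=fused.__getitem__)

# The subnetworks disagree (CNN favors class 0, GCN favors class 2),
# but the summed evidence selects class 2.
fused, pred = additive_fusion([2.0, 1.0, 1.5], [0.5, 1.0, 2.5])
```

Because addition is symmetric, neither subnetwork is privileged; the decision reflects the combined evidence of both feature types.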

Experimental Datasets
In this part of the experiments, five widely used real HSI datasets are utilized to validate the robustness and practicality of our designed model, namely, the Pavia University, Houston University, WHU-Hi-Honghu, Indian Pines, and Xiongan New Area datasets.
(1) Pavia University (PU) dataset: The PU dataset is collected using the ROSIS sensor over the University of Pavia, Italy, and its surrounding areas. The spatial area of the PU dataset is 610 × 340 pixels, with an approximate resolution of 1.3 m per pixel. The PU dataset in our experiment includes 103 spectral bands, spanning wavelengths from 430 to 860 nm. The quantities of samples for each class utilized in the training, validation, and testing sets are exhibited in Table 1.
(2) Houston University (Houston) dataset: The Houston dataset is gathered over the campus of the University of Houston, Houston, USA. The spatial dimension of the Houston dataset is 349 × 1905 pixels, with 144 spectral bands from 380 to 1050 nm, and a spatial resolution of about 2.5 m per pixel. The Houston dataset contains 15 categories, and the numbers of samples used for training, validation, and testing are recorded in Table 2.
(3) WHU-Hi-Honghu (Honghu) dataset: The Honghu dataset is obtained in Hubei Province, China via imaging sensors mounted on a UAV platform. The Honghu dataset has a spatial size of 940 × 475 pixels, containing 270 spectral bands spanning from 400 to 1000 nm. In our experiments, only 16 categories are selected due to the limitations of the utilized device. The numbers of training, validation, and testing samples for each selected class, as well as the corresponding totals, are listed in Table 3.
(4) Indian Pines (IP) dataset: The IP scene is acquired using the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor over the Indian Pines area of northwestern Indiana. The spatial scale of the IP imagery is 145 × 145 pixels, consisting of 220 spectral bands ranging from 400 to 2500 nm. The IP image includes 16 categories, and the numbers of labelled samples for training, validation, and testing are displayed in Table 4.
(5) Xiongan New Area (Xiongan) dataset: The Xiongan (Matiwan Village) scene is acquired using the Visible and Near-Infrared Imaging Spectrometer over the Xiongan New Area and Baiyangdian Lake. The spatial range of the Xiongan imagery is 3750 × 1580 pixels, containing 250 spectral bands ranging from 400 to 1000 nm. The Xiongan image includes 19 categories, and the numbers of labelled samples for training, validation, and testing are listed in Table 5.
To validate the superior classification performance of our devised method, fourteen different comparative methods, namely, SSRN [59], DBDA [60], SSGCA [61], PCIA [62], MDBNet [63], HDDA [64], DBPFA [65], ChebNet [66], GCN [67], MVAHN [68], DGFNet [38], DKDMN [69], FTINet [49], and MRCAG [41], are used for comparison with our method; a description of each comparative method is given in Section 4.2.2. To provide a clearer perspective on the classification results of each method, the metrics of overall accuracy (OA), average accuracy (AA), kappa (K), and per-class classification accuracies are used to assess classification performance. All experiments are performed in the same environment, namely, a mini base station equipped with 128 GB of DDR4 RAM and 8 × NVIDIA GeForce RTX 2080Ti graphics processing units with 11 GB of memory each. The software environment used in our experiments includes CUDA 11.6, PyTorch 1.10.1, and Python 3.8.
To keep the comparison fair, we standardized the parameters, optimizer, and architecture of the other fourteen comparison methods to be consistent with the experimental settings of our proposed model. During the training process of our proposed model, the parameters are updated using the Adam optimizer. The sets {0.0005, 350}, {0.0009, 150}, {0.0007, 120}, and {0.0003, 130} are selected as the learning rate and number of epochs for the proposed model in the PU, Houston, Honghu, and IP data scenes, respectively, which is discussed in Section 5.3. Moreover, the set {0.0007, 200} is selected as the learning rate and number of epochs for the proposed model in the Xiongan data scene. A spatial size of 9 × 9 is employed for the HSI patch cube, and a batch size of 64 is chosen. Early stopping is utilized in the training process of our model. The numbers of training, validation, and test samples for the PU, Honghu, Houston, IP, and Xiongan data scenes can be observed in Tables 1-5.
The averaged results and standard deviations of the quantitative assessments, in terms of OA, AA, K, and the accuracy values for each class, as well as the qualitative assessments for the fourteen comparative methods and our proposed method on the five HSI datasets, are recorded in Tables 6-10 and Figures 6-10, respectively. The averaged results and standard deviations for all measurements are derived from ten repeated experiments. Additionally, the highest values for the three indices and the accuracy values for each class are bolded.

The Fourteen State-of-the-Art Comparison Methods
(1) SSRN: The SSRN adopts spectral and spatial residual modules as its backbone and combines them in a consecutive manner to address the accuracy-degradation problem. It first extracts spectral signatures and then extracts spatial features for pixel-wise HSI classification. Additionally, batch normalization is used in each 3D convolutional layer to regulate the feature extraction process.
(2) DBDA: The DBDA has spectral and spatial branches, with dense spectral block and channel attention mechanisms included in the spectral branch for extracting and refining spectral features, and a spatial attention block and a dense spatial block included in the spatial branch for learning and optimizing spatial information.Then, a concatenation operation is utilized to fuse spectral and spatial information.
(3) SSGCA: The SSGCA first uses a spectral-spatial module to excavate spectral and spatial features separately.Then, a channel global context attention mechanism is developed to enhance the significance of the extracted spectral signatures, and a position global context attention mechanism is devised to enhance the importance of the extracted spatial features.
(4) PCIA: The PCIA is a dual-branch model, which first uses spectral and spatial pyramidal blocks to efficiently learn spectral and spatial information.Then, a novel iterative attention, namely, a new expectation-maximization attention, is employed to refine the learned spectral and spatial information.Finally, the refined spectral-spatial information is conveyed to the fully connected layer for the final classification outcomes.
(5) MDBNet: The MDBNet uses PCA to preprocess the original dataset. The processed dataset is then passed through a multiscale spectral-spatial feature extraction module to extract the multiscale spectral and spatial features. Then, a dual-branch information fusion block consisting of residual connections and dense connections is used to learn discriminant features. Finally, a new shuffle attention is proposed to adaptively weigh the spectral and spatial features, resulting in improved classification accuracy.
(6) HDDA: The HDDA architecture features a novel hybrid dense module and dual attention mechanisms. It utilizes a stacked autoencoder to decrease the number of channels in the HSI. Then, a hybrid 2D-3D CNN module is employed to extract the spectral and spatial information. The channel and spatial attention mechanisms are designed to refine the extracted spectral and spatial features separately. Additionally, a dropout layer and batch normalization are employed to mitigate overfitting and enhance computational efficiency, respectively.
(7) DBPFA: The DBPFA mainly consists of dual branches and an improved attention mechanism for HSI classification. It includes a spectral feature extraction branch for extracting spectral signatures and a spatial feature excavation branch for extracting spatial features. It also includes a polarized full attention for learning contextual feature information.
(8) ChebNet: The ChebNet is a spectral GCN with fast localized convolutional filters, where the filter is approximated by a K-order Chebyshev polynomial, and it is applicable to any graph structure. In this article, a filter with a first-order Chebyshev polynomial is used.
(9) GCN: The GCN is an efficient tool based on convolutional neural networks and constructed from an approximation of localized first-order spectral graph convolutions, which can directly operate on graphs. The GCN is also a linear-layer model that can learn representations of both graph edge relations and node information. In this article, a five-layer GCN is employed.
(10) MVAHN: The MVAHN is a new hybrid vision architecture-based model, which first utilizes CNN to extract the spectral signatures and spatial features from HSIs.Next, the generated features are divided into two components; one is for the GCN module, and another is for the transformer module.Finally, a residual learning block is used to fuse the extracted features.
(11) MRCAG: The MRCAG model mainly has three components: a multiscale random-shape convolution part for learning convolution-based multiscale features, where the convolution kernels used are randomized; an adaptive graph convolution part for learning graph-based features, where the weights for neighborhood nodes are learned adaptively; and a local feature processing part designed to exploit CNN-based features and GCN-based features, enhancing the feature representation.
(12) FTINet: The FTINet method consists of three stages: First, multiple stacked CIformers are used to learn the dynamic and static spatial contextual information of the data.Then, concatenated FTCUs are employed to learn the spectral and topological features of the processed data.Meanwhile, the edges of the graph are learned for information aggregation and propagation.Finally, the learned features and information are fed into the classification stage to produce the classification results.
(13) DKDMN: The DKDMN method is a hybrid neural network architecture. It first employs the proposed multi-scale spectral signature extraction module for spectral signature extraction. Then, the extracted multiscale spectral signature is combined with positional embedding for Transformer preprocessing. Next, the signatures are fed into the designed module for comprehensive spectral signature learning, which is composed of multiple CNN-Transformer blocks and a residual GCN. To better achieve the final classification results, the learned spectral signature is combined with the features extracted by the diffusion model.
(14) DGFNet: The DGFNet is a dual-branch GNN fusion network, which includes a spatial-based branch and a spectral-based branch. It takes HSI subcubes as input data. The spatial branch first employs a graph attention network to learn the intrinsic relationships within the input data and then develops a local guidance module to learn significant features. The spectral branch employs weights for different spectral bands to obtain spectral features. Finally, a linear layer is used to fuse the spatial and spectral features.

Experimental Results
In this section, for the five widely used data scenes, the training, validation, and test sets are composed of randomly selected labeled samples from each category, as recorded in Tables 1-5. Tables 6-10 show the OA, AA, K, and per-class accuracies of the fourteen competitive methods as well as our proposed model on the five data scenes. Figures 6-10 show the ground-truth maps, full-pixel classification maps, and false-color images for the five data scenes. The detailed discussion of the classification results on the five data scenes is as follows:
(1) Classification results on the PU dataset: On the PU dataset, as recorded in Table 6, in terms of OA, our proposed method surpasses SSRN, DBDA, SSGCA, PCIA, MDBNet, HDDA, DBPFA, ChebNet, GCN, and MVAHN by about 13.33%, 2.76%, 0.76%, 1.30%, 15.63%, 2.89%, 1.38%, 10.93%, 9.88%, and 0.50%, respectively, demonstrating the superior classification performance of our method. The CNN-based methods, namely SSRN, DBDA, SSGCA, PCIA, MDBNet, HDDA, and DBPFA, all achieve excellent classification performance in terms of OA, AA, K, and per-class accuracies, except for SSRN and MDBNet. This can be attributed to the CNN being one of the powerful data-fitting tools of deep learning. The GCN-based methods, namely ChebNet and GCN, exhibit lower classification performance, which can be ascribed to the fact that GCN-based methods only consider features from a single perspective. To some extent, the pure GCN-based methods have limited feature extraction performance compared to the spectral and spatial CNN-based methods. The hybrid vision architectures, namely our proposed method and MVAHN, which combine CNN and GCN architectures, both achieve outstanding classification performance on the PU dataset. Specifically, our proposed method achieves the highest OA, AA, and K values among all fourteen comparative methods on the PU dataset. Furthermore, as shown in Figure 6, the classification map yielded by our proposed method is not only clearer than those from the other fourteen comparative methods, but also has smoother land cover edges. Conversely, the classification maps yielded by MDBNet and GCN show numerous salt-and-pepper artifacts.
(2) Classification results on the Houston dataset: On the Houston dataset, our proposed method achieves the highest OA, AA, and K compared to the fourteen comparative methods, as documented in Table 7. The OA value of our proposed method is approximately 0.47% higher than that of MVAHN, about 0.33% higher than that of SSGCA, the CNN-based method with the highest OA among the CNN-based methods, and about 18.53% higher than that of ChebNet, the GCN-based method with the highest OA among the GCN-based methods. This is because our proposed method combines the CNN and GCN networks, which can excavate the multiscale spectral-spatial features and learn the pixel-wise spectral signatures among the graphs. From Figure 7, it can be observed that our proposed method exhibits a classification map that most closely resembles the ground-truth map in comparison to the other fourteen classification methods.
(3) Classification results on the Honghu dataset: On the Honghu dataset, as seen in Figure 8a, the terrain regions of the same land covers are more concentrated, so this dataset is more conducive to being distinguished. As shown in Table 8, our proposed method achieves higher OA, AA, and K values than the fourteen comparative methods. The MVAHN achieves the second-best classification performance, which can be attributed to its combination of the CNN and GCN, enabling the extraction of different types of features. The CNN-based methods, namely PCIA and DBPFA, achieve good classification performance, but worse than the hybrid CNN-GCN methods, which also shows that the features extracted by CNN-based methods are less expressive than those of the hybrid CNN-GCN methods. Furthermore, the classification accuracies for categories C1, C4, C5, C12, and C15 obtained by our proposed method are higher than those of the other fourteen comparative methods, which demonstrates the better feature extraction ability of our proposed method. At the same time, as shown in Figure 8, the classification map obtained by our proposed method has less salt-and-pepper noise and greater similarity to the ground truth map.
(4) Classification results on the IP dataset: To validate the classification performance under limited training samples, the IP data scene is used. From Table 9, it can be seen that our proposed model achieves the best OA, AA, and K. Meanwhile, our proposed model reaches 100% classification accuracy on classes C1, C8, and C13, which demonstrates its effectiveness on the IP data scene. Even under the condition of limited sample quantities for classes C1, C7, C9, and C16, our proposed model achieves good individual-class accuracy. SSRN and ChebNet obtain 0% classification accuracy on C7 and C9, respectively, whereas both MVAHN and our proposed model achieve better accuracy on C7 and C9. This shows that a model combining CNN with GCN has better feature extraction performance than a model based only on a CNN or GCN architecture. In Figure 9, the classification map generated by our proposed model shows clear boundaries between different classes. Compared to the other comparative methods, MDBNet obtains a worse OA value, and its classification map exhibits more salt-and-pepper noise.
(5) Classification results on the Xiongan dataset: To further validate the superior classification performance of our designed method, we use the Xiongan dataset. From Table 10, it can be observed that our method achieves the best OA, AA, and K compared to the other methods. Conversely, SSRN, which is based on CNN, has lower classification results than the other comparative methods, especially in terms of AA: the classification accuracy of C3, C4, C6, C9, C11, C12, C16, C17, and C18 produced by SSRN is 0%, which contributes to the lower AA value. MVAHN, a hybrid architecture that combines CNN and GCN models, yields comparable but slightly lower classification performance than our method. The GCN-based comparative methods, such as ChebNet and GCN, obtain relatively lower classification results, while other comparative methods, such as DGFNet, FTINet, and MRCAG, exhibit acceptable results. From Figure 10, the full-pixel classification map of SSRN displays unclear edges between different classes, resulting in a poor classification map, whereas our method produces a classification map more similar to the ground-truth map, demonstrating superior classification performance.
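The OA, AA, and kappa coefficient (K) reported throughout Tables 6-10 can all be derived from a per-class confusion matrix. The following is a generic sketch of these three metrics, not the authors' evaluation code:

```python
import numpy as np

def classification_scores(conf):
    """Compute OA, AA, and Cohen's kappa from a confusion matrix.

    conf[i, j] = number of samples of true class i predicted as class j.
    """
    conf = np.asarray(conf, dtype=float)
    total = conf.sum()
    oa = np.trace(conf) / total                   # overall accuracy
    per_class = np.diag(conf) / conf.sum(axis=1)  # per-class accuracy
    aa = per_class.mean()                         # average accuracy
    # chance agreement from the row and column marginals
    pe = (conf.sum(axis=0) * conf.sum(axis=1)).sum() / total ** 2
    kappa = (oa - pe) / (1.0 - pe)
    return oa, aa, kappa
```

Note that AA averages the per-class accuracies, so classes with 0% accuracy (as SSRN shows on several Xiongan classes) pull AA down sharply even when OA stays moderate.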

The Importance of an Adaptive Feature-Weighted Strategy in Feature Fusion
In the CNN subnetwork of our proposed method, the extracted spectral and spatial information plays unequally significant roles in the classification process. The spectral signature learning weight α_1 and the spatial feature learning weight α_2 are employed to show the importance of spectral and spatial features for the training and validation accuracy on the four used HSI datasets. As shown in Figure 11a, for the PU data scene, the value of the spectral learning weight α_1 is higher than that of the spatial weight α_2, and the difference between them becomes increasingly larger as the epochs increase, while both the training accuracy and validation accuracy increase. Therefore, when the proposed method achieves higher training and validation accuracy, the spectral learning weight α_1 and the spatial learning weight α_2 play different roles in classification, which shows the appropriateness of the adopted adaptive feature-weighted fusion strategy; the extracted spectral signatures account for a relatively larger proportion than the spatial features in classification. From Figure 11a, when the value of α_1 is 0.5329 and the value of α_2 is 0.4671, the proposed model achieves the best training and validation accuracy on the PU dataset. Additionally, in Figure 11b-d, the feature learning weights α_1 and α_2 are analyzed on the Honghu, Houston, and IP data scenes, respectively, and phenomena similar to those in Figure 11a can be observed. In particular, on the Honghu data scene, when the value of α_1 is 0.6279 and the value of α_2 is 0.3721, the training accuracy reaches 100% and the validation accuracy reaches 98.09%.
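A weighted fusion of this kind can be sketched as below. The softmax normalization is our assumption (the reported weight pairs, e.g. 0.5329 + 0.4671, sum to 1), not necessarily the authors' exact parameterization; in training, the raw weights w would be learned jointly with the network:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def adaptive_weighted_fusion(f_spe, f_spa, w):
    """Fuse spectral and spatial feature maps with learnable scalar weights.

    w is a length-2 vector of raw (trainable) parameters; softmax keeps the
    effective weights alpha_1, alpha_2 positive and summing to 1.
    """
    alpha = softmax(np.asarray(w, dtype=float))
    return alpha[0] * f_spe + alpha[1] * f_spa, alpha
```

With equal raw weights the fusion reduces to a plain average; the adaptive strategy lets training shift the balance toward the more informative branch, which on PU ends up favoring the spectral features.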

The Value of n in the n-Layer GCN of the Spectral-Enhanced GCN Module
To learn features from the GCN perspective as well as the relations in the spectral feature matrix, an n-layer GCN is used. In general, GCN-based architectures can learn deep node features by stacking more GCN layers. However, beyond a certain depth, additional layers degrade the performance of the spectral-enhanced GCN module, while too few GCN layers result in inadequate capture of deep information. To ensure a reasonable value of n, we conduct experiments with different n-layer (n = 1, 2, 3, 4, 5) GCNs in the spectral-enhanced GCN module on the four used datasets, so as to determine the optimal value of n for our model. From Figure 12, when n is 4, our model achieves the best OA on each of the four used data scenes. Therefore, a four-layer GCN is utilized in the spectral-enhanced GCN module, which is beneficial for spectral signature learning.
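An n-layer GCN stack of the standard Kipf-Welling form can be sketched as follows; this is a generic illustration of layer stacking (symmetrically normalized adjacency with self-loops, ReLU between layers), not the paper's exact module:

```python
import numpy as np

def normalize_adjacency(A):
    """Symmetrically normalize A with self-loops: D^{-1/2} (A + I) D^{-1/2}."""
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

def n_layer_gcn(A, H, weights):
    """Propagate node features H through len(weights) stacked GCN layers:
    H <- ReLU(A_norm @ H @ W) at each layer."""
    A_norm = normalize_adjacency(A)
    for W in weights:
        H = np.maximum(A_norm @ H @ W, 0.0)
    return H
```

Each extra layer widens the receptive field by one hop, which explains the trade-off observed in Figure 12: too few layers under-capture deep node relations, while too many smooth the node features toward each other (over-smoothing) and hurt OA.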

The Learning Rate under Different Epoch Numbers
The learning rate, a vital hyperparameter, plays a great role in training a deep learning-based model and has a significant influence on both the convergence and the classification performance of our model. Additionally, the number of epochs affects the convergence speed of the model as well as its training time. We therefore evaluate the impact of various learning rates on the classification accuracy of our proposed model under different numbers of epochs on the four used data scenes, with the results shown in Figure 13. To analyze the impact of learning rates and various epoch numbers on the proposed model, the learning rates for the PU, Honghu, Houston, IP, and Xiongan datasets are selected from {0.007, 3 × 10…}.
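A joint sweep over learning rates and epoch budgets of the kind described above can be sketched generically; `train_and_eval` stands in for a full training run returning validation OA, and the candidate sets are placeholders, not the exact values used in the paper:

```python
def grid_search(train_and_eval, learning_rates, epoch_counts):
    """Return the (lr, epochs, oa) triple with the highest validation OA."""
    best = (None, None, -1.0)
    for lr in learning_rates:
        for n_epochs in epoch_counts:
            oa = train_and_eval(lr, n_epochs)  # one full training run
            if oa > best[2]:
                best = (lr, n_epochs, oa)
    return best
```

Since each cell of the grid is an independent training run, the cost grows as the product of the two set sizes, which is why both sets are kept small per dataset.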

Impact of Different Training Samples on the Classification Result
For our proposed model and other comparative methods, as deep learning-based models, the training sample is an important factor that influences the classification performance.


Visual Results about Different Methods
In this section, to intuitively visualize the classification performance of the fourteen comparative methods and our devised model, the t-distributed stochastic neighbor embedding (t-SNE) technique is employed, taking the IP data scene as an example. From Figure 15e, it can be observed that the classification map of MDBNet mixes different classes together, causing confusion. This is consistent with the result from Table 9 that the OA of MDBNet is lower than that of the other comparative methods.
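A 2-D embedding of per-pixel features for this kind of qualitative inspection can be produced with scikit-learn's t-SNE; this is a generic sketch (the perplexity value is an illustrative choice, not the paper's setting):

```python
import numpy as np
from sklearn.manifold import TSNE

def embed_features_2d(features, seed=0):
    """Project high-dimensional per-pixel features to 2-D with t-SNE
    for a qualitative view of inter-/intra-class separation."""
    tsne = TSNE(n_components=2, init="pca", random_state=seed, perplexity=15)
    return tsne.fit_transform(np.asarray(features, dtype=np.float32))
```

The resulting 2-D points are typically scattered with one color per ground-truth class, so tight same-color clusters indicate good intra-class compactness and well-separated clusters indicate good inter-class separation.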

Ablation Experiment
As described in the Methods Section, our proposed model mainly includes a CNN-based subnetwork and a GCN-based subnetwork. Within these subnetworks, the spectral and spatial pyramid hybrid convolution blocks, the adaptive feature-weighted fusion strategy, the spectral-enhanced GCN module, and the mutual-cooperative attention mechanism are crucial to our proposed model. In this section, ablation experiments are performed to verify the effectiveness of the various designed modules. The environment settings of the ablation experiments performed on the PU, Honghu, Houston, and IP data scenes are described in Section 4.2. We take the OA value as the evaluation indicator, comparing the complete model with variants lacking the corresponding modules to show the effectiveness of each module. In detail, from Figure 16, the OA of model_0 is higher than that of model_1 through model_8, which shows the reasonableness and the superior classification accuracy of our proposed model.

To demonstrate the validity of the designed spectral pyramid hybrid convolution block and spatial pyramid hybrid convolution block, we individually eliminate each of them from the complete model. As demonstrated in Figure 16, the OAs of model_8 and model_7 are both lower than that of model_0, which indicates the importance of the spectral and spatial pyramid hybrid convolution blocks in multiscale feature extraction.
To verify the effectiveness of the devised GCN-based subnetwork, we remove the GCN-based subnetwork from the proposed model. The OA of model_5 is lower than that of model_0, which shows the effectiveness of the GCN-based subnetwork.
To demonstrate the contribution of the CNN-based subnetwork to the complete model, we remove the CNN-based subnetwork. As shown in Figure 16, the OA of model_4 is lower than that of model_0 on the four widely used HSI data scenes. Meanwhile, the OA of model_4 is lower than that of the other variants; especially on the IP data scene, the OA of model_4 is much lower than that of model_0, which also reveals the limited feature extraction ability of the GCN-based subnetwork alone.
To show the importance of the mutual-cooperative attention mechanism, we remove it from the proposed model. As shown in Figure 16, the OA of model_1 is lower than that of model_0, which indicates the importance of the mutual-cooperative attention mechanism. Meanwhile, the OAs of model_2 and model_3 are both lower than that of model_0, which shows the significance of the mutual-cooperative attention mechanism without the BSNet-based spectral features and the GCN-based spectral signatures, respectively. The fact that the OA of model_3 is lower than that of model_0 also demonstrates the significance of the spectral-enhanced GCN module.
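The module-by-module comparisons above can be summarized compactly as OA drops relative to the complete model; this is a small bookkeeping sketch, and any numbers fed to it would come from Figure 16, not from this snippet:

```python
def ablation_report(oa_by_model, full_model="model_0"):
    """Return each ablated variant's OA drop (in points) relative to the
    complete model; larger drops indicate more important modules."""
    base = oa_by_model[full_model]
    return {name: round(base - oa, 2)
            for name, oa in oa_by_model.items() if name != full_model}
```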

The Visualization of the Spectral-Enhanced GCN Module
To validate the effectiveness of the proposed spectral-enhanced GCN module, the heatmaps of features before and after using the module are shown in Figure 17. Taking the PU data scene as an example, we randomly chose a 9 × 11 pixel region from the spectral matrix to show the features contained in it. From Figure 17a, it can be seen that the image has a lighter color, especially in the lighter green range, and the pixels do not show significant differences, indicating that the features within this region have weak differences before using the spectral-enhanced GCN module. Conversely, from Figure 17b, the heatmap displays darker colors, with different darker shades among pixels of different classes. According to Figure 17b,d,f,h, the features within the feature matrix are accentuated, which demonstrates the effectiveness of the proposed spectral-enhanced GCN module. On the Houston, Honghu, and IP datasets, Figure 17c,e,g shows that the feature matrices display relatively lighter colors, whereas Figure 17d,f,h present relatively darker feature maps after processing by the spectral-enhanced GCN module. These heatmaps exhibit more pronounced differences between pixels of different classes, highlighting the significant features in the feature map.


Training Times
In this subsection, the time consumed in the experiments is discussed to compare the efficiency of our proposed method on each dataset used. Tables 11-14 show the detailed training and testing times for each comparative method and our method. Taking the PU data scene as an example, Table 11 shows that HDDA has the highest training time, and it also has the highest testing time compared to the other methods. The GCN comparative method has the lowest training and testing times. SSRN, DBDA, and MDBNet have much higher training and testing times than our proposed method. For the other datasets used in the experiment section, our method exhibits training and testing times similar to those of the other comparative methods. Although our proposed method does not have the lowest training and testing times, its time efficiency is acceptable.
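Wall-clock measurements of this kind can be collected with a minimal timing wrapper; this sketch is generic and not tied to any particular training loop:

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```

Wrapping the training routine and the testing routine separately yields the two columns reported per method in the timing tables.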


Conclusions
In this article, we propose a novel PCCGC method that combines CNN and GCN for HSI classification. It contains two parallel subnetworks, namely a CNN-based subnetwork and a GCN-based subnetwork. Specifically, in the CNN subnetwork, the SpePRCM is employed to extract multiscale spectral signatures, while the SpaPCCM is used to extract multiscale spatial features. Furthermore, an adaptive feature-weighted fusion strategy is employed to adaptively fuse the multiscale spectral and spatial features based on their respective weights. On this basis, the CNN subnetwork can enhance the robustness of the proposed model in classifying HSIs. In the GCN subnetwork, BSNet is first used to learn the spectral signatures in the original HSI using nonlinear inter-band dependencies. Then, the spectral-enhanced GCN module is employed to learn and accentuate the important features in the spectral channel. Subsequently, a mutual-cooperative attention mechanism is constructed to align the spectral signatures between the BSNet-based matrix and the spectral-enhanced GCN-based matrix for spectral signature integration. Finally, an additive fusion strategy is utilized to fuse the features extracted from the GCN-based and CNN-based subnetworks. The effectiveness and robustness of our designed model are demonstrated by quantitative and qualitative experiments. In addition, numerous parametric analyses and ablation experiments are conducted to verify the superior performance of our model.
However, the spectral-enhanced module used in the GCN-based subnetwork only learns the significant features in the spectral channel. In the future, the designed GNN will be employed to extract features from the spectral and spatial channels simultaneously, and a fusion-based mechanism will be employed to combine the CNN and GCN models more elaborately.

Figure 1 .
Figure 1. The overall structure of the PCCGC.


Figure 2 .
Figure 2. The detailed structure of the SpePHC block.

Figure 3 .
Figure 3. The detailed structure of the SpaPHC block.


At the same time, the residual connection Res(·) is added to FM_spe^{j+1} and FM_spe^{j+3} to assist the spectral pyramid feature extraction module in learning the original spectral signatures, thereby benefiting the improvement of classification accuracy. Finally, to fully extract the multiscale spectral signatures in the HSIs and decrease the depth of the HSI data cube, a 3D convolutional layer Conv_spe^{j+3} with a kernel size of 1 × 1 × (band − 7) is utilized to obtain the spectral feature maps FM_spe^{j+4} with an output channel number of 72. The subsequent operations in the spatial pyramid feature extraction module are similar to the spectral signature extraction process in the spectral pyramid feature extraction module; the detailed multiscale spatial feature extraction process in the spatial pyramid convolution cascaded module is as follows:
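As a sanity check on the spectral-depth reduction described above, the output length along the spectral axis follows the standard convolution size formula; the 103-band input below is illustrative (the Pavia University scene has 103 bands), not a claim about the layer's exact configuration:

```python
def conv3d_out_depth(depth, kernel, stride=1, pad=0):
    """Output length along one axis of a convolution:
    floor((depth + 2*pad - kernel) / stride) + 1."""
    return (depth + 2 * pad - kernel) // stride + 1

# A kernel of size (band - 7) along the spectral axis (stride 1, no padding)
# collapses a band-deep input to 8 spectral positions, whatever `band` is.
band = 103  # illustrative; the Pavia University scene has 103 bands
spectral_depth = conv3d_out_depth(band, band - 7)
```

This shows why the kernel size is written relative to the band count: the layer always reduces the spectral dimension to the same small, fixed depth regardless of the dataset.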

Figure 4 .
Figure 4. The detailed structure of the spectral-enhanced GCN module.


Figure 5 .
Figure 5. The structure of the mutual-cooperative attention mechanism.

3.4.1. BS-Based Feature Matrix to GCN-Based Feature Matrix Cross Multi-Head Self-Attention Block

Figure 10 .
Figure 10. Full-pixel classification maps for the IP data scene. (a) Ground-truth; (b) SSRN; (c) DBDA; (d) SSGCA; (e) PCIA; (f) MDBNet; (g) HDDA; (h) DBPFA; (i) ChebNet; (j) GCN; (k) MVAHN; (l) DGFNet; (m) FTINet; (n) DKDMN; (o) MRCAG; (p) Ours; (q) False-color image.

4.2.2. The Fourteen State-of-the-Art Comparison Methods
(1) SSRN: The SSRN adopts spectral and spatial residual modules as its backbone and combines them consecutively to address the accuracy-decreasing problem. It first extracts spectral signatures and then extracts spatial features for pixel-wise HSI classification. Additionally, batch normalization is used in each 3D convolutional layer to regulate the feature extraction process.
(2) DBDA: The DBDA has spectral and spatial branches, with a dense spectral block and a channel attention mechanism included in the spectral branch for extracting and refining spectral features, and a spatial attention block and a dense spatial block included in the spatial branch.

Figure 11 .
Figure 11. The train accuracy and validation accuracy under the influence of weight α_1 and weight α_2 on (a) PU; (b) Honghu; (c) Houston; and (d) IP data scenes.


Figure 12 .
Figure 12. The OA of our proposed model under different GCN layers on PU, Honghu, Houston, and IP data scenes.


Figure 13 .
Figure 13. The OA of the learning rate of our method under different epochs on (a) PU; (b) Honghu; (c) Houston; (d) IP data scenes.
From Figure 14, it is clear that our proposed model exhibits the best classification accuracy under different training samples, especially with limited training samples.

Figure 14 .
Figure 14. The OA of different classification methods under different training samples on (a) PU; (b) Honghu; (c) Houston; (d) IP data scenes.
Additionally, the t-SNE-based classification maps of DBPFA, MVAHN, and our proposed model demonstrate clearer inter-class separation. Furthermore, the intra-class clustering of the t-SNE-based classification map produced by our model is the best compared to DBPFA and MVAHN, demonstrating the superior classification performance of our method. Meanwhile, as depicted in Figure 15f,j,k, the t-SNE-based classification maps generated by the hybrid GCN-CNN models exhibit better inter-class separation than those generated by models based solely on GCN or CNN, which shows the benefit of combining CNN and GCN for feature extraction. From Figure 15l, the t-SNE-based classification map of our proposed model, which shows better inter-class and intra-class clustering, indicates better classification performance compared to the other comparative methods.


Figure 16 .
Figure 16. Ablation experiments of our proposed model on the PU, Houston, Honghu, and IP data scenes: model_0: complete model; model_1: model without the mutual-cooperative attention mechanism; model_2: model with a mutual-cooperative attention mechanism that includes only the GCN-based spectral signatures; model_3: model without the spectral-enhanced GCN module; model_4: model that only includes the GCN-based subnetwork; model_5: model that only includes the CNN-based subnetwork; model_6: model without the adaptive feature-weighted fusion strategy; model_7: model without the spectral pyramid hybrid convolution block; model_8: model without the spatial pyramid hybrid convolution block.


Figure 17 .
Figure 17. The features before the spectral-enhanced GCN module: (a,c,e,g,i); the features after the spectral-enhanced GCN module: (b,d,f,h,j).


Table 1 .
The landcover classes of the PU, the color of each class, and the number of each class in the training set, validation set, and test set.

Table 2 .
The landcover classes of the Houston, the color of each class, and the number of each class in the training set, validation set, and test set.

Table 3 .
The landcover classes of the Honghu, the color of each class, and the number of each class in the training set, validation set, and test set.

Table 4 .
The landcover classes of the IP, the color of each class, and the number of each class in the training set, validation set, and test set.

Table 5 .
The landcover classes of the Xiongan, the color of each class, and the number of each class in the training set, validation set, and test set.

Table 6 .
Classification results of the PU data based on 1% training samples.

Table 7 .
Classification results of the Houston data based on 2% training samples.

Table 8 .
Classification results of the Honghu data based on 1% training samples.

Table 9 .
Classification results of the IP data based on 5% training samples.

Table 10 .
Classification results of the Xiongan data based on 1% training samples.

Table 11 .
Training and testing times of different comparative methods and our method on the PU data.


Table 12 .
Training and testing times of different comparative methods and our method on the Houston data.

Table 13 .
Training and testing times of different comparative methods and our method on the Honghu data.

Table 14 .
Training and testing times of different comparative methods and our method on the IP data.
Author Contributions: Conceptualization, H.P. and H.Y.; methodology, H.P., H.Y. and H.G.; software, H.P., H.Y. and H.G.; validation, H.P., H.Y. and H.G.; formal analysis, H.P. and H.Y.; investigation, H.P., H.Y. and H.G.; resources, H.P., L.W. and C.S.; data curation, H.P. and H.G.; writing-original draft preparation, H.P. and H.Y.; writing-review and editing, H.P. and H.Y.; visualization, H.P., H.Y. and H.G.; supervision, H.P. and H.Y.; project administration, H.P., L.W. and C.S.; funding acquisition, H.P. and C.S. All authors have read and agreed to the published version of the manuscript.
Funding: This research was funded by the Heilongjiang Provincial Natural Science Foundation of China, grant number LH2023F050; the Fundamental Research Funds in Heilongjiang Provincial Universities, grant number 145309208; the National Natural Science Foundation of China, grant number 42271409; and the Heilongjiang Provincial Higher Education Teaching and Reform Project, grant number SJGZ20220112.