Rotation is All You Need: Cross Dimensional Residual Interaction for Hyperspectral Image Classification

The performance of deep convolutional neural networks has improved significantly in recent years as a result of attention mechanisms added to standard networks. Numerous experiments have demonstrated that spectral-spatial attention enhances a network's classification ability. However, the spatial, spectral, and channel attention modules currently in use are isolated from one another, and their interrelationships are not fully considered. To solve this problem and establish the dependencies among different channels, spectral bands, spatial height, and width simultaneously, this article proposes a new cross attention module called quadlet attention, which captures information through simultaneous interaction of the channel, spectral depth, and spatial location to improve the classification accuracy of hyperspectral images (HSIs). By incorporating the quadlet attention module, a cross-dimensional residual network (QuadNet) is proposed for HSIs classification. A series of experiments conducted on four publicly available hyperspectral datasets showed that the proposed cross-attention residual network can effectively establish the dependencies among different dimensions of the input tensor and achieves 98.22%, 99.88%, 99.10%, and 96.46% overall accuracy on the IN, UP, SA, and UH datasets, respectively.


I. INTRODUCTION
HYPERSPECTRAL images (HSIs) can efficiently distinguish objects with similar appearance through contiguous spectral signatures, which provide abundant and detailed spectral information [1], [2]. They have been widely used in various fields of Earth observation, such as agriculture, forestry, land management, and military monitoring [3]. One of the fundamental research areas of HSIs processing is classification, which aims to assign a class label to each pixel in HSIs [4]. Although hundreds of bands can provide rich spectral features, they also cause severe band redundancy and the spectral curse of dimensionality [5]. To solve these problems and extract feature bands efficiently, many band selection and band extraction methods have been applied to HSIs over the past decades, such as principal component analysis and search-based or clustering methods [6], [7]. However, feature extraction methods that rely on manual intervention cannot achieve the expected results with good generalization ability. Therefore, discriminative feature extraction from HSIs remains challenging.
With the application of deep learning techniques, especially deep convolutional neural networks (CNNs), HSIs classification performance has made great progress [4]. According to the type of input features, CNN-based methods can be roughly divided into spectral-based methods and spectral-spatial-based methods. Spectral-based methods [8], [9] utilize the spectral signatures of each pixel as input, without considering the spatial information. Spectral-spatial-based methods [10], [11], [12] extract patches that consist of the central target pixel and its neighboring pixels to effectively integrate both spectral and spatial features. Besides CNNs, recurrent neural networks (RNNs) [13], gated recurrent unit (GRU) networks [14], long short-term memory [15], and generative adversarial networks (GANs) [16], [17], [18] have also been widely explored for HSIs classification.
The abovementioned models extract features using deep neural networks but without attention modules. Attention mechanisms have also been introduced to improve the image classification results [19]. The attention mechanism is an emerging technique in recent years to simulate the signal processing mechanism unique to the human vision system, and it quickly acquires the target regions that need to be focused on [20], [21]. The aim of attention is to create dependencies among different channels within feature maps and capture the meaningful information encoded in channel dimensions. The attention mechanism has been widely studied in the field of computer vision [22], [23], as well as in HSIs classification tasks [24], [25].
However, most of the attention modules applied in the existing networks establish interrelationships between the spectral channels and the spatial features, together or separately, for HSIs classification, while ignoring the importance of cross-dimensional interaction with the number of obtained feature maps [26]. 3-D CNNs generate feature maps from inputs with spatial and spectral dimensions, and these feature maps correspond to the learned representation of the input data. Incorporating the number of feature maps in an attention module of 3-D CNNs can selectively highlight the most informative features and suppress both irrelevant and noisy features. Therefore, incorporating the number of feature maps in an attention module can improve the quality of the learned features and increase the accuracy of the model. To simultaneously model the interactions among the number of feature maps, spectral depth, and spatial locations, i.e., height and width, and inspired by the triplet attention mechanism [27], a new quadlet attention is proposed in this article for accelerating the learning of discriminative spectral and spatial features during model training. Quadlet attention constructs the relationship among different dimensions of the input tensor, i.e., the number of feature maps, spectral channels, and spatial locations, and extracts the cross-dimensional attention weights by capturing cross-dimension interaction using a four-branch parallel architecture.
Consider an input tensor of shape (B, C, D, H, W), where the batch size B, the number of feature maps C, the spectral depth D, the spatial height H, and the width W are generated during the forward propagation of CNNs. The corresponding four independent branches of quadlet attention can be modeled as (D, H, W), (C, H, W), (D, C, W), and (D, H, C), to establish the dependencies between channels, bands, spatial height, and width, respectively. Quadlet attention encodes interchannel and spatial information for a given input tensor and develops interdimensional interdependence through a permutation operation, followed by a cost-effective residual connection. The proposed quadlet attention module is utilized to design a simple and effective cross-dimensional spectral-spatial residual interaction network for HSI classification. The main contributions of this article are summarized as follows.
1) We integrate a simple and effective cross-dimensional attention called triplet attention in HSI classification. Moreover, we consider one additional dimension to further propose the quadlet attention, which can establish the dependencies between any three dimensions among the number of feature maps, spectral depth, spatial height, and width of the input tensor.
2) The quadlet attention module is integrated with an improved spectral-spatial residual network (SSRN), which enables learning of cross-dimensional spectral-spatial feature representations for the HSIs classification task.
The rest of this article is organized as follows. Section II introduces related work in detail. The proposed quadlet attention module and the developed QuadNet architecture are described in detail in Section III. The experimental setup and results are provided in Section IV. Finally, Section V concludes this article.

A. HSIs Classification Using Conventional Networks
Many conventional networks without attention mechanisms have been widely explored for HSIs classification. From the 1-D spectral perspective, Hu et al. [8] directly classified HSIs from the spectral domain using a 1-D CNN. Gao et al. [9] extracted spectral information and transformed the 1-D spectral array to a 2-D feature map. Then, they classified the HSIs by stacking convolutional layers with kernel sizes of 1 × 1 and 3 × 3. For 2-D spatial frameworks, Yu et al. [28] took the original data as input and employed a 2-D CNN for HSIs classification. Ding et al. [29] trained a 2-D CNN framework where the kernel size was adaptively learned from the data to classify HSIs. More common approaches involve extracting spectral and spatial features jointly for HSIs classification. For instance, Roy et al. [30] proposed a hybrid spectral convolutional neural network (HybridSN), which includes spectral-spatial 3-D-CNN and spatial 2-D-CNN. The former learns the joint spectral-spatial feature representations, and the latter extracts more abstract spatial information. Zhong et al. [31] developed a supervised deep learning framework called spectral-spatial residual network (SSRN) for HSI classification. SSRN includes four consecutive residual blocks to capture discriminative features from spectral signatures and spatial contexts. Paoletti et al. [32] proposed a deep pyramidal residual network to extract deeper spectral-spatial representations through more convolutional filters of the network. Zhang et al. [33] designed a multiscale dense network to combine and make full use of different scale features. Mou et al. [34] proposed a fully end-to-end conv-deconv network for unsupervised spectral-spatial feature learning. The proposed conv-deconv network can largely alleviate the reliance on training sample data with labels and solve the problem of a limited number of hyperspectral remote sensing image samples.
In addition to CNNs, other types of networks, such as RNNs and GANs, have also been applied to HSIs classification. Mou et al. [35] considered HSIs as sequential data and explored an efficient RNN for HSIs classification. In addition, Hang et al. [36] considered spectral signatures to be sequences and used GRUs to create a cascaded RNN model to separate the important representation from the redundant data. To deal with the issue of limited sample data and the challenge of gathering ground-truth labels, Hang et al. [37] proposed a multitask generative adversarial network (MTGAN). The proposed MTGAN consists of a generator network for hypercube reconstruction and classification, and a discriminator network to discriminate between the real and reconstructed data. Similarly, Roy et al. [38] introduced a generative model that can efficiently tackle the problem of classwise imbalanced training samples for HSIs classification.

B. HSIs Classification Using Attention-Aided Networks
In recent studies, attention modules have been introduced to establish the dependencies within spectral bands or spatial locations. Paoletti et al. [39] designed an attention-aided capsule network to increase hyperspectral classification performance and computational efficiency. The attention mechanisms could help extract and identify the most representative and meaningful features of the images. Yu et al. [40] presented a feedback attention-guided spectral-spatial dense CNN to address the problem of information redundancy and inefficient representations of spectral-spatial features for hyperspectral classification tasks. Yang et al. [41] proposed a cross-attention spectral-spatial network to solve the problem of the high sensitivity of convolutional features extracted from the HSIs. However, it still exhibited poor classification performance for the pixels near the edges. Hang et al. [42] designed an attention-aided CNN model to fully explore the discriminative features by focusing on the spectral bands and spatial positions within small hypercubes. Zhu et al. [43] proposed a residual spectral-spatial attention network (RSSAN) for HSIs classification. Nevertheless, a notable limitation of RSSAN was that it did not use 3-D CNNs for discriminative spectral-spatial feature extraction. Haut et al. [44] incorporated attention mechanisms into residual networks for characterizing the spectral-spatial information. This approach resulted in improved classification performance, but the quality of the captured features needed to be enhanced further. Mou et al. [45] developed a spectral attention module to selectively highlight the most important spectral bands in HSIs using gating mechanisms. Although their method achieved promising results, it lacked an interpretation for assessing the importance of the spectral bands obtained through the spectral attention module. Li et al. [46] developed a double-branch dual-attention mechanism network (DBDA), which enables learning of complementary spectral and spatial features by utilizing channel attention and spatial attention separately. Mei et al. [47] introduced a novel approach named the spectral-spatial attention network for HSI classification. The network utilized a combination of RNNs with attention to capture spectral correlations within a continuous spectrum, and CNNs with attention to model the spatial relevance between neighboring pixels in the spatial domain. However, the generalization of these methods to complex scenarios was not taken into consideration. Wu et al. [48] constructed a 3-D CNN-based residual group channel and spatial attention network for HSIs classification. The attention modules selectively strengthened the informative features in the input data, enhancing both the spatial and the channelwise representations. The developed method improved the classification accuracy but also increased the number of network parameters.
Furthermore, self-attention transformers have also been applied in the field of HSIs classification [49], [50], [51], [52]. Liu et al. [53] designed a scaled dot-product central attention module to extract spectral-spatial information from the central pixels and their adjacent pixels. Based on the proposed attention module, a central attention network was developed, which achieved superior classification performance. Zhao et al. [54] presented a graph transformer network with a graph attention mechanism to learn node features on heterogeneous graphs. The proposed method can solve the problem of zero-weight edges in heterogeneous graphs by assigning weights to edges. Nonetheless, most of the aforementioned methods primarily focus on establishing dependencies among spectral and spatial dimensions, ignoring the potential interaction across different dimensions, such as the number of feature maps.

III. PROPOSED METHOD
Fig. 1 shows the overall architecture of the proposed multibranched cross attention residual network (QuadNet). The proposed method mainly consists of four crucial steps as follows.
1) Extraction of HSIs small hypercubes and low-level representations.
2) Establishment of dependencies among the number of feature maps, spectral channels, spatial height and width through the quadlet attention module.
3) Extraction of spectral-spatial features via triplet attention-aided spectral-spatial residual blocks.
4) Classification by fully connected layer and softmax function.

A. Preprocessing of HSI Data
Suppose the HSI data are represented as $H \in \mathbb{R}^{h \times w \times b}$, where h and w denote the spatial dimensions, i.e., height and width, and b denotes the spectral dimension, namely, the number of bands of the hyperspectral data. Among all pixels, suppose there are N pixels with category labels, $\{x_1, x_2, \ldots, x_N\} \in \mathbb{R}^{1 \times 1 \times b}$, and the number of land-cover categories is c; then the corresponding true values of the N pixels are $\{y_1, y_2, \ldots, y_N\} \in \mathbb{R}^{1 \times 1 \times c}$. To consider both the spectral and spatial information from HSIs, each central pixel with a ground truth label and its neighborhood pixels within a certain spatial range are extracted together, thus forming a small hypercube $X \in \mathbb{R}^{p \times p \times b}$, where p denotes the patch size. All N samples with their associated labels are randomly divided into a training set ($X_{\mathrm{train}}$), a validation set ($X_{\mathrm{val}}$), and a test set ($X_{\mathrm{test}}$). $X_{\mathrm{train}}$ is used to train the model and optimize the model parameters, $X_{\mathrm{val}}$ is used for model selection during training, and $X_{\mathrm{test}}$ is used for final model evaluation.
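The patch extraction and random split described above can be illustrated with a minimal sketch in plain Python. The function names `extract_patch` and `split_indices`, the zero padding at image borders, the odd patch size, and the rounding of split sizes are all assumptions made for illustration; the text does not specify how border pixels or fractional sample counts are handled.

```python
import random

def extract_patch(cube, row, col, p):
    """Return the p x p x b neighborhood centered on (row, col).

    `cube` is a nested list indexed as cube[row][col][band]; p is assumed odd.
    Positions outside the image are filled with zeros (an assumed padding choice).
    """
    h, w, b = len(cube), len(cube[0]), len(cube[0][0])
    half = p // 2
    patch = []
    for r in range(row - half, row + half + 1):
        patch_row = []
        for c in range(col - half, col + half + 1):
            if 0 <= r < h and 0 <= c < w:
                patch_row.append(list(cube[r][c]))
            else:
                patch_row.append([0.0] * b)
        patch.append(patch_row)
    return patch

def split_indices(n, train_frac, val_frac, seed=0):
    """Randomly split range(n) into train, validation, and test index lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = round(n * train_frac)
    n_val = round(n * val_frac)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

# Toy 4 x 5 image with 3 bands; each value encodes the (row, col) position.
cube = [[[float(10 * r + c)] * 3 for c in range(5)] for r in range(4)]
patch = extract_patch(cube, row=0, col=0, p=3)
train, val, test = split_indices(100, train_frac=0.1, val_frac=0.1)
print(len(patch), len(patch[0]), len(patch[0][0]))  # patch dimensions p, p, b
print(len(train), len(val), len(test))              # sizes of the three subsets
```

In this toy run the corner patch contains zero-filled positions where the neighborhood leaves the image, and the three index lists partition the labeled samples without overlap.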

B. Overall Classification Framework
Let us consider the Indian Pines (IN) dataset with a size of 200 × 145 × 145 as an example. Overlapping patches are extracted to create small hypercubes of size 200 × 11 × 11. The input data of shape (B, 1, 200, 11, 11) are given to the initial 3-D convolution layer, where B represents the batch size, 1 is the number of feature maps, and 200, 11, and 11 denote the spectral bands, height, and width, respectively, of the extracted hypercube of the IN dataset, as shown in Fig. 1. Table I shows the model parameters for each block. The network takes a small hypercube as input, and the first 3-D convolution layer extracts low-level features by convolving over both the spectral and spatial dimensions with a 3-D convolutional kernel of size (7, 1, 1) and a stride of (2, 1, 1), followed by a batch normalization (BN) layer. We treat the input small hypercube as a single initial feature map, and the number of kernels in the first layer, i.e., the number of output feature maps, is set to 24. The first 3-D convolutional layer therefore increases the number of feature maps from 1 to 24, decreases the spectral dimension from 200 to 97, and keeps the spatial dimensions unchanged. BN is applied after every convolutional layer to prevent the model from overfitting.
To establish the cross-dimensional interaction among different dimensions of the feature maps, i.e., the number of feature maps, spectral depth, and spatial locations, a quadlet attention module is introduced, which can effectively extract a meaningful feature representation using the interaction of all the dimensions of the feature maps; it is explained step by step in the following section. The quadlet attention module does not change the shape of its input feature map, so the shape of the output feature map obtained after the attention module is still (B, 24, 97, 11, 11). After exploring the interaction in different dimensions of the feature maps, feature extraction is further performed using a 3-D convolution with a kernel size of (1, 1, 1) and a stride of (1, 1, 1), followed by a BN layer and a ReLU activation layer. Then four successive triplet attention aided spectral-spatial residual blocks are used to further extract spatial and spectral features. Finally, two fully connected layers are used to obtain the predicted label. Fig. 1 shows the framework of the proposed QuadNet network, which is described in the following sections.
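The shape arithmetic for the first layer can be checked with a short sketch. The helper names `conv_out_len` and `conv3d_out_shape` are hypothetical, and the floor convention is the one commonly used for strided convolutions. The padding values used for the residual-block kernels are an assumption ("same" padding), chosen only to show that those kernels can leave the shape unchanged.

```python
def conv_out_len(size, kernel, stride=1, padding=0):
    """Output length along one axis of a convolution (floor convention)."""
    return (size + 2 * padding - kernel) // stride + 1

def conv3d_out_shape(shape, out_maps, kernel, stride, padding=(0, 0, 0)):
    """Map a (C, D, H, W) input shape to the output shape of a 3-D convolution."""
    _, d, h, w = shape
    dims = [conv_out_len(n, k, s, p)
            for n, k, s, p in zip((d, h, w), kernel, stride, padding)]
    return (out_maps, *dims)

# First layer from the text: 24 kernels of size (7, 1, 1), stride (2, 1, 1), no padding.
first = conv3d_out_shape((1, 200, 11, 11), 24, kernel=(7, 1, 1), stride=(2, 1, 1))
print(first)  # (24, 97, 11, 11)

# Assumed "same" padding for the residual-block kernels keeps the shape fixed.
spectral = conv3d_out_shape(first, 24, kernel=(7, 1, 1), stride=(1, 1, 1), padding=(3, 0, 0))
spatial = conv3d_out_shape(first, 24, kernel=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1))
print(spectral, spatial)
```

The spectral axis gives (200 - 7) // 2 + 1 = 97, matching the reduction from 200 to 97 stated above.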

1) 3-D Convolution:
Suppose the input feature map of the lth 3-D convolutional layer is defined by $X^{l-1} \in \mathbb{R}^{C^{l-1} \times D^{l-1} \times H^{l-1} \times W^{l-1}}$, where $C^{l-1}$ is the number of input feature maps, and $D^{l-1}$, $H^{l-1}$, and $W^{l-1}$ represent the spectral depth, spatial height, and width of the (l − 1)th layer output feature maps. The lth convolutional layer has $C^{l}$ convolutional kernels of size $(k^{l}_{1}, k^{l}_{2}, k^{l}_{3})$ and subsampling strides of $(s_1, s_2, s_3)$. Zero padding is also employed to keep the shape of output feature maps unchanged. Then the lth convolutional layer generates an output feature map of shape $(C^{l}, D^{l}, H^{l}, W^{l})$, where the number of output feature maps is equal to the number of kernels $C^{l}$. Without padding, the spectral depth $D^{l}$ equals $1 + \lfloor (D^{l-1} - k^{l}_{1})/s_1 \rfloor$, the spatial height $H^{l}$ equals $1 + \lfloor (H^{l-1} - k^{l}_{2})/s_2 \rfloor$, and the spatial width is computed in the same way as the height. The lth 3-D convolutional layer with the BN operation is expressed by (1) and (2),
where * denotes the convolution operation, $X^{l-1}_{j}$ is the jth input tensor, and $W^{l}_{p}$ and $b^{l}_{p}$ are the weights and additive bias of the pth filter bank in the lth convolution layer. BN is represented in (2), where μ and $\sigma^2$ represent the expectation and variance, and γ and β are parameters learned during the training process. R is the activation function, calculated as ReLU(x) = max(0, x).
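As a sketch consistent with the symbol definitions above, the convolution, BN, and activation steps can be written as follows. The ordering of BN and R, the summation limits, and the small stabilizing constant $\epsilon$ are assumptions, so the published form of (1) and (2) may differ in detail.

```latex
\begin{align}
X^{l}_{p} &= R\Big(\mathrm{BN}\Big(\sum_{j=1}^{C^{l-1}} X^{l-1}_{j} * W^{l}_{p} + b^{l}_{p}\Big)\Big),
  \qquad p = 1, \dots, C^{l}, \tag{1}\\
\mathrm{BN}(X) &= \gamma \, \frac{X - \mu}{\sqrt{\sigma^{2} + \epsilon}} + \beta. \tag{2}
\end{align}
```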
2) Quadlet Attention:
This attention aims to model the cross-dimensional interaction of the input tensor across the number of feature maps, spectral depth, spatial height, and width. The quadlet attention module $F_{\mathrm{quad}}(\cdot)$ takes the convolutional feature map X with shape (B, C, D, H, W) as input, where C denotes the number of feature maps, D is the spectral depth, and H and W are the spatial height and width, respectively, and produces a calibrated feature map $O = F_{\mathrm{quad}}(X; \theta)$ of the same shape as the input, where θ represents the learnable parameters of the attention function.
The quadlet attention mechanism is implemented via four independent branches for extraction of the cross-dimensional feature information, and they are named $F_c(\cdot)$, $F_d(\cdot)$, $F_h(\cdot)$, and $F_w(\cdot)$, respectively, according to their corresponding rotational dimensions. To establish the cross-dimensional relationship of input feature maps with a shape of (B, C, D, H, W), the features extracted by the four branches are aggregated using an elementwise addition operation ⊕, as in (4), where $\theta_c$, $\theta_d$, $\theta_h$, and $\theta_w$ represent the trainable parameters of the four branches of the quadlet attention module, i.e., $F_c$, $F_d$, $F_h$, and $F_w$, respectively. Each branch of the quadlet attention module is explained next.
The $F_c(X; \theta_c)$ branch is used to build interaction among the spectral depth, spatial height, and width of the feature maps. To do this, global max pooling and average pooling are first performed along the number-of-feature-maps dimension, as shown in (6), where MaxPool and AvgPool represent max pooling and average pooling, respectively. The results are concatenated to produce a feature map with a shape of (2, D, H, W). After that, feature extraction is performed using a 3-D convolution followed by a BN operation to produce an intermediate feature of dimensions (1, D, H, W). To obtain the cross-dimensional attention weights, the intermediate features are passed through a sigmoid (σ) function, and finally the elementwise product (⊗) with the input features, denoted by the identity function I(X), is computed to obtain the output features.
The second branch $F_d(X; \theta_d)$ is used to perform the interactions among the number of feature maps, the spatial height, and the width. First, the positional relationship between the feature maps C and the spectral depth D is created by rotating the input tensor, which produces a permuted feature map of shape (D, C, H, W). The term "rotation" means the permutation of the dimensions of the input tensor. In Fig. 3, the red arrow represents the dimension permutation operation, which can be viewed as a rotation operation. Then global max pooling and average pooling are performed along the spectral depth dimension, and the resultant features are concatenated to obtain a feature map of shape (2, C, H, W). Next, a 3-D convolutional layer and a BN block are applied over the last three dimensions, i.e., the number of feature maps, height, and width, to capture the cross-dimensional interaction. Finally, the obtained feature maps are passed through the sigmoid function to produce the corresponding attention weights. The attention weights are multiplied with the permuted feature maps, and the spectral and channel dimensions are exchanged again to obtain the final output of shape (C, D, H, W), as shown in Fig. 3.
In the third branch $F_h(X; \theta_h)$, the cross-dimensional attention weights between the spectral depth D, the number of feature maps C, and the spatial width W are constructed. Similar to the second branch, the input feature map is first permuted between the number of feature maps C and the spatial height H to obtain a feature map of shape (H, D, C, W). Next, global max pooling and average pooling are applied along the height dimension H, which is now the second dimension of the batched tensor, and the results are concatenated to obtain a feature map with shape (2, D, C, W). The output is passed through a 3-D convolution layer, followed by BN and a sigmoid (σ) function, to calculate the attention weights of the spectral, channel, and width dimensions. Finally, the output is rotated back so that it matches the original input feature shape (C, D, H, W).
The fourth branch $F_w(X; \theta_w)$ is similar to the second and third branches of quadlet attention and captures the relationship among the spectral depth, spatial height, and channel dimensions. The input features are first permuted in the C and W dimensions to obtain a feature map of shape (W, D, H, C). Then, global max pooling and average pooling, followed by a 3-D convolution and BN, are applied, and the result is passed through the sigmoid activation function to obtain the cross-dimensional attention weights. After the elementwise product between the attention weights and the input tensor, we obtain an output feature map with the same shape as its input.
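The permutation ("rotation") and pooling steps of the four branches can be traced at the level of tensor shapes with a small sketch. The helpers `swap` and `branch_shapes` are hypothetical names, the batch axis is omitted, and only shapes are tracked; no attention weights are computed here.

```python
def swap(shape, i, j):
    """Return `shape` with axes i and j exchanged (the 'rotation' step)."""
    out = list(shape)
    out[i], out[j] = out[j], out[i]
    return tuple(out)

def branch_shapes(c, d, h, w):
    """Shapes seen by each quadlet branch for a (C, D, H, W) feature map.

    For every branch this records the permuted shape and the shape after the
    max-pooled and average-pooled maps over the leading axis are concatenated,
    which turns the leading axis into size 2.
    """
    base = (c, d, h, w)
    result = {}
    for name, axis in (("c", 0), ("d", 1), ("h", 2), ("w", 3)):
        permuted = swap(base, 0, axis)   # bring the axis to be pooled to the front
        pooled = (2,) + permuted[1:]     # [max ; mean] over the leading axis
        result[name] = (permuted, pooled)
    return result

shapes = branch_shapes(c=24, d=97, h=11, w=11)
for name, (permuted, pooled) in shapes.items():
    print(name, permuted, pooled)
```

For the IN example, the pooled shapes are (2, D, H, W), (2, C, H, W), (2, D, C, W), and (2, D, H, C), which correspond to the four branch descriptions above; swapping the same pair of axes again restores the original (C, D, H, W) layout.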
The calibrated features of shape (C, D, H, W) generated by each branch of the quadlet attention module are then aggregated using elementwise addition, and the result is divided by the total number of branches, as shown in (4). It can be seen from (4) that the triplet attention is a special case of the quadlet attention (shown in Fig. 2), which ignores the cross-dimensional interaction in the channel dimension of the input tensor; hence, (4) can be rewritten in terms of $F_{\mathrm{TA}}(X; \theta)$, which denotes the triplet attention applied on the input tensor X.
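Based on the verbal description above, the module output, the branch aggregation, and the first branch can be sketched as follows. The subscript on the pooling operators, the operator name Conv3D, and the bracket notation for concatenation are notational assumptions; only the structure (pool, concatenate, convolve, normalize, gate, then average four branches) is taken from the text.

```latex
\begin{align}
O &= F_{\mathrm{quad}}(X;\theta), \\
F_{\mathrm{quad}}(X;\theta) &= \tfrac{1}{4}\big[\,F_c(X;\theta_c) \oplus F_d(X;\theta_d)
  \oplus F_h(X;\theta_h) \oplus F_w(X;\theta_w)\,\big], \\
F_c(X;\theta_c) &= I(X) \otimes \sigma\Big(\mathrm{BN}\big(\mathrm{Conv3D}\big(
  [\,\mathrm{MaxPool}_{C}(X);\ \mathrm{AvgPool}_{C}(X)\,]\big)\big)\Big).
\end{align}
```

The remaining branches follow the same pattern after the corresponding permutation, with the inverse permutation applied to the gated output.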

3) Triplet Attention Aided Spectral-Spatial Residual Block:
The residual network was designed to deal with the vanishing gradient problem that occurs during training and has generated great interest in the remote sensing research community for hyperspectral classification tasks. In order to extract more robust spectral and spatial information, the triplet attention aided SSRN, which incorporates a triplet attention layer after every residual block, is introduced. The output of quadlet attention is first passed through a 3-D convolution with a (1, 1, 1) pointwise convolutional kernel, followed by BN and the ReLU activation function, to perform feature normalization. The gray shaded structures in Fig. 1 show the triplet attention aided spectral-spatial residual block. The choice of triplet rather than quadlet attention in the spectral-spatial residual block is mainly based on the tradeoff between classification performance and computational cost. In the spectral-spatial residual blocks, both the spectral and spatial residual blocks are repeated twice, which requires the attention module to be used four times. In comparison to the triplet attention module, the quadlet attention module has more parameters and operations. If it were also used in the subsequent spectral-spatial residual blocks, the computational cost of the network would increase significantly.
To learn a robust representation, the normalized convolutional feature input is passed through four consecutive triplet attention aided spectral-spatial residual blocks. Each residual block consists of a Conv3D layer followed by a BN layer and a ReLU activation layer, and these three primitive steps are repeated twice in a residual block to enhance feature extraction as well as the forward and backward propagation of information. Suppose $X^{k-1}$ represents the input feature map of a spectral-spatial residual block, which is parameterized with $F_{\mathrm{RN}}(X^{k-1}; \omega_1, \omega_2)$. The output feature map is then passed through the triplet attention layer $F_{\mathrm{TA}}(\cdot)$ to calculate the final feature representation, where $I(X^{k-1})$ is the skip function that connects the input to the output of a residual unit and $X^{k+1}$ is the output of the triplet attention aided spectral-spatial residual block. Here, $\omega_1 = \{W^{k}, W^{k+1}\}$ and $\omega_2 = \{b^{k}, b^{k+1}\}$ denote the weight matrices and biases associated with the kth and (k + 1)th 3-D convolutional layers, respectively. In Fig. 1, we consider $L_x$ as 2, which means that two units each of the spectral and spatial residual blocks are used sequentially to improve the discriminative capability of the network. Spectral feature learning and spatial feature learning can be readily distinguished by whether the convolution is performed along the spectral depth or spatially. Each residual block shown in Fig. 1 has 24 kernels; the last two ResBlocks extract spectrally focused spatial features, whereas the first two ResBlocks extract spatially focused spectral features. Table I shows the shapes of the applied kernels and strides for both the spectral and spatial residual blocks.
The values of the hyperparameters (the number of feature maps, the patch size, and the convolutional kernel size) are determined by referring to the widely used settings for A2S2K-ResNet [25] and SSRN [31]. As shown by many existing studies, the 3-D convolutional operation enables the model to capture spatial and spectral dependencies among the input data, which is essential for achieving high performance on HSIs classification. Thus, the 3-D convolutional operation is also employed here. The receptive field of the convolutional filters, which regulates the amount of features to be learned by the model, is determined by the kernel size in a 3-D convolutional operation. In our method, the kernel size is selected by referring to widely accepted settings, such as those in SSRN [31]. A kernel of size (7, 1, 1) with a stride of (2, 1, 1) is selected in the first convolutional layer to reduce the spectral dimension and save computation resources. A kernel size of (7, 1, 1) is used in the first two residual blocks to extract spectral features, whereas a kernel size of (1, 3, 3) is applied in the last two residual blocks to extract spatial features.
As a result, the discriminative capability of the proposed model is increased by cooperative learning of spectral and spatial information. In addition, the triplet attention is embedded in the residual blocks to achieve spectral and spatial cross-dimensional interaction of the residual feature representation. The triplet attention in the spectral-spatial residual block captures the spectral and spatial cross-dimensional relationship in spectral depth, spatial height, and width, respectively, ignoring the dimension of the number of channels.
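Reading the description above literally, with the residual mapping passed through the triplet attention layer and then combined with the skip connection, one consistent form of the block output is sketched below. The placement of $F_{\mathrm{TA}}$ relative to the addition is an assumption; the text fixes only the symbols involved.

```latex
\begin{equation}
X^{k+1} = I\big(X^{k-1}\big) + F_{\mathrm{TA}}\Big(F_{\mathrm{RN}}\big(X^{k-1};\ \omega_1, \omega_2\big)\Big)
\end{equation}
```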

A. Datasets
The experiments are conducted on four widely used HSI datasets, including IN, University of Pavia (UP), Salinas (SA), and University of Houston (UH). The details of each dataset are explained in the following.
1) The IN dataset was collected by the airborne visible/infrared imaging spectrometer (AVIRIS) sensor over the Indian Pines test site in northwestern Indiana in 1992. The IN dataset has a size of 145 × 145 pixels in the spatial dimension and contains 224 bands in the spectral dimension. Twenty-four bands were excluded due to the effect of water vapor absorption. The wavelength range is 400-2500 nm, and the spatial resolution of each pixel is 20 m. Out of 21 025 pixels, a total of 10 249 pixels that contain 16 different kinds of vegetation classes are selected. 10% of the selected samples are used for training, 10% for validation, and 80% for testing. Fig. 4 shows the distribution of various categories in the IN dataset and the color representation for each land cover category.
2) The UP dataset was acquired by the reflective optics system imaging spectrometer sensor in 2001 at the University of Pavia in northern Italy. It contains 610 × 340 pixels, each with a spatial resolution of 1.3 m. The spectral dimension is 103 with wavelengths in the 430-860 nm range. All the pixels with labels were classified into nine different urban land cover types. The color representation and the number of instances for each category are illustrated in Fig. 5. For the UP dataset, 5% of the samples are used as training samples, 10% as validation samples, and 85% as test samples.
3) The SA dataset was acquired by the AVIRIS sensor in Salinas Valley, California. The dataset has 512 × 217 pixels and 204 spectral bands with a spatial resolution of 3.7 m. The 54 129 pixels with labels were divided into a total of 16 different terrestrial categories. Fig. 6 shows the ground truth and the number of labeled samples in the SA dataset. During the experiment, 5% of the samples are selected as training samples, 10% as validation samples, and the remaining 85% are used as the test set.
4) The UH dataset was collected over the campus of the University of Houston using a compact airborne spectrographic imager with a spatial resolution of 2.5 m and a spectral range of 380-1050 nm. The dataset consists of 144 bands and covers an area of 340 × 1905 pixels. It contains 15 land cover classes. Fig. 7 shows the ground truth and the number of labeled samples in the UH dataset. During the experiment, 5% of the samples are selected as training samples.

B. Experiment Setup
To evaluate the effectiveness of the proposed QuadNet, we compare it with various state-of-the-art models on the four adopted datasets. The reference models include HybridSN [30], SSRN [31], and its variants with different attention mechanisms, e.g., A2S2K-ResNet [25], DBMA [33], DBDA [46], RSSAN [43], and SPANet [55]. In addition, two representative transformer-based methods, vision transformer (ViT) [56] and multimodal fusion transformer (MFT) [52], are selected for comparison. Note that the MFT used in our experiments is a modified version in which only the hyperspectral branch of the model in [52] is kept and the LiDAR branch is removed, because this study involves only HSI classification.
For all the models, cross-entropy is used as the loss function during training, and the Adam optimizer with a learning rate of 0.001 is used to update the network weights from the back-propagated error gradients. The models are trained for up to 200 epochs in each experiment. To prevent overfitting on the training set, an early stopping strategy is used: when the validation loss does not decrease for 50 consecutive epochs, training is terminated. The weights with the lowest validation loss are saved and used for evaluation on the test set. Each experiment is repeated three times, and the average and standard deviation over the three runs are reported to reduce the randomness of a single run. All experiments are performed on a Compute Canada server with 64 GB of memory.
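The early stopping rule described above can be sketched independently of any deep learning framework. The class `EarlyStopping` and the helper `simulate_training` below are hypothetical names introduced for illustration; the `state` argument stands in for the model weights, and the validation losses are replayed from a list rather than computed by a real network.

```python
import copy


class EarlyStopping:
    """Track the best validation loss and signal when patience runs out."""

    def __init__(self, patience=50):
        self.patience = patience
        self.best_loss = float("inf")
        self.best_epoch = None
        self.best_state = None
        self.bad_epochs = 0  # consecutive epochs without improvement

    def step(self, epoch, val_loss, state):
        """Record one epoch; return True if training should stop."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.best_state = copy.deepcopy(state)  # checkpoint of best weights
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def simulate_training(val_losses, patience=50, max_epochs=200):
    """Replay a list of validation losses through the early stopping rule."""
    stopper = EarlyStopping(patience)
    last_epoch = -1
    for epoch, loss in enumerate(val_losses[:max_epochs]):
        last_epoch = epoch
        # Placeholder state; a real run would pass the model weights here.
        if stopper.step(epoch, loss, {"epoch": epoch}):
            break
    return stopper, last_epoch
```

After the loop ends, `stopper.best_state` holds the checkpoint with the lowest validation loss, which is the one that would be evaluated on the test set.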
Each input tensor (i.e., an extracted small patch) is normalized by min-max scaling before being fed into the deep learning models used in our experiments. Min-max scaling is a commonly used technique that maps the input features to a fixed range of values, here [-0.5, 0.5]. It reduces the impact of differing feature scales on the learning process and can improve the performance of the model.
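As a minimal sketch of this normalization, the function below maps a flat list of values linearly onto [-0.5, 0.5]. Whether the minimum and maximum are taken per patch, per band, or over the whole image is not specified in the text, so the scope of `values` here is an assumption, and the name `min_max_scale` is hypothetical.

```python
def min_max_scale(values, low=-0.5, high=0.5):
    """Linearly rescale values so that min -> low and max -> high."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # Constant input: map everything to the middle of the range.
        return [(low + high) / 2.0 for _ in values]
    scale = (high - low) / (v_max - v_min)
    return [low + (v - v_min) * scale for v in values]


scaled = min_max_scale([0.0, 5.0, 10.0])
```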

C. Classification Results
In this section, the qualitative and quantitative experimental results are analyzed. Three evaluation metrics, namely, overall accuracy (OA), average accuracy (AA), and the kappa coefficient (Kappa), are used for model evaluation.
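For reference, the three metrics can be computed from a confusion matrix using their standard definitions: OA is the fraction of correctly classified samples, AA is the mean of the per-class accuracies, and Kappa corrects OA for chance agreement. The sketch below uses hypothetical function names and assumes integer class labels starting at 0; Kappa is undefined when the expected agreement equals 1.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """cm[t][p] counts samples of true class t predicted as class p."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def classification_metrics(cm):
    """Return (OA, AA, Kappa) as fractions in [0, 1] from a confusion matrix."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    diag = [cm[i][i] for i in range(n)]
    row_sums = [sum(cm[i]) for i in range(n)]
    col_sums = [sum(cm[i][j] for i in range(n)) for j in range(n)]

    oa = sum(diag) / total
    # Per-class accuracy, skipping classes with no reference samples.
    per_class = [diag[i] / row_sums[i] for i in range(n) if row_sums[i] > 0]
    aa = sum(per_class) / len(per_class)
    # Expected agreement by chance, from the row and column marginals.
    pe = sum(row_sums[i] * col_sums[i] for i in range(n)) / (total * total)
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa


cm = confusion_matrix([0, 0, 0, 0, 1, 1], [0, 0, 0, 1, 1, 1], n_classes=2)
oa, aa, kappa = classification_metrics(cm)
```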

1) Results on Random Split Datasets:
We first conduct experiments over the IN, UP, and UH datasets to compare the performance with and without quadlet attention, the results are   This is because the proposed multibranch cross-attention can simultaneously establish the dependencies among four different dimensions, i.e., the number of feature maps, spectrum, spatial height, and spatial width, thus achieving higher classification accuracy. In addition, the accuracy of ViT is lower than that of QuadNet, whereas MFT shows only a small decrease in performance compared with QuadNet. Fig. 4 shows the classification results of different methods on the IN dataset. A large amount of noise appears in the classification maps obtained using the HybridSN, RSSAN, and ViT methods, indicating that a large number of pixels are misclassified. The classification map obtained by A2S2K-ResNet shows some confusion between Alfalfa (red) and Hay-Windrowed (dark gray), and the maps obtained by DBMA and SSRN show confusion between soybean-notill (yellow-brown) and corn-notill (light green). The classification map obtained by the proposed QuadNet is the closest to the ground truth, demonstrating its superiority over the other seven methods, although a few misclassifications occur in the boundary region between corn-notill (light green) and soybean-notill (dark green).
To assess how sensitive the classification performance of these models is to the training sample size, the classification results of the proposed QuadNet and other models are compared using    Table 8 shows the corresponding OA, AA, and Kappa results. For the IN dataset, 2%, 3%, 5%, 10%, and 15% of the samples are chosen as training data, and the corresponding OA, AA, and Kappa values of the QuadNet model are depicted by the light blue curves in Fig. 8. The figure shows that the classification performance of all models increases with the number of training samples. The proposed QuadNet has the best classification performance across the whole range of training sample sizes. Even with limited training samples, QuadNet obtains the highest classification accuracy, whereas other models, such as HybridSN and RSSAN, perform worse in terms of OA, AA, and Kappa.
The classification results of different models on the UP dataset are listed in Table IV. The UP dataset has more labeled samples than the IN dataset; therefore, the OA of all methods is higher than 95% with only 5% of the samples used for training. Among all the methods, HybridSN, RSSAN, and ViT show lower OA, whereas QuadNet achieves the highest OA (99.88%), AA (99.83%), and Kappa (99.85%). Fig. 5 shows the classification maps obtained by different models when 5% of the data are used for training. Large differences can be observed between the ground truth and the classification maps obtained by HybridSN and RSSAN. For example, there is obvious confusion between bare soil (pink) and meadows (light green) in these maps. Since the training data are sufficient, SSRN, A2S2K-ResNet, DBDA, DBMA, and the proposed QuadNet all achieve an accuracy greater than 99.5%. Fig. 9 shows the OA, AA, and Kappa obtained with different numbers of training samples. Owing to the availability of sufficient labeled samples in the UP dataset, most methods demonstrate good classification performance. To assess the models' capability with fewer samples, training sample proportions of 1%, 2%, 3%, 4%, and 5% are considered, and the figure shows that the proposed model still achieves better results with small training sample proportions. When only 1% of the samples are used for training, QuadNet achieves the highest OA (98.42%), which is better than SSRN (97.58%), A2S2K-ResNet (96.04%), DBMA (97.41%), and DBDA (96.89%), as shown in Fig. 9(a). In addition, HybridSN and RSSAN are less effective under small-sample conditions.
Table V lists the average OA, AA, and Kappa and their standard deviations over three runs using 5% of the SA dataset for training. Similar to the IN and UP datasets, the proposed method outperforms the other network models. Specifically, QuadNet achieves an OA of 99.10%, while SSRN, DBMA, and DBDA achieve 98.03%, 98.75%, and 98.60%, respectively. HybridSN, RSSAN, and ViT are relatively less effective. In terms of AA and Kappa, the proposed QuadNet also displays the highest scores compared with SSRN, DBMA, and A2S2K-ResNet. In addition, QuadNet has the lowest standard deviations over the three runs for OA (0.01%), AA (0.02%), and Kappa (0.01%), lower than those of A2S2K-ResNet, DBMA, and DBDA, which indicates that the proposed network has higher stability. Fig. 6 illustrates the classification maps generated by different methods and the ground truth. The classification map obtained by QuadNet is the closest to the ground truth, while other methods, such as DBMA, RSSAN, and SSRN, display more confusion between vineyard untrained (orange) and grapes untrained (dark gray), resulting in lower classification accuracy. Like the UP dataset, the SA dataset provides sufficient labeled samples. Therefore, we also conduct experiments using smaller training proportions of 1%, 2%, 3%, 4%, and 5%, as shown in Fig. 10. Again, QuadNet achieves better results than all other models for all training set sizes.
To further demonstrate the effectiveness and robustness of the proposed model, a relatively new and advanced dataset, the UH dataset, is also employed. Table VI displays the classification results of various methods on the UH dataset. As evident from the table, the proposed QuadNet outperforms all other methods in terms of OA, AA, and Kappa, as well as for the majority of the classes. In contrast, ViT, HybridSN, and RSSAN exhibit lower accuracies than the other CNN- or transformer-based methods. Fig. 7 presents the classification maps generated by different methods on the UH dataset. The maps show that the proposed method generates clearer and more distinct boundaries between different land cover categories than the other methods. Fig. 11 displays the accuracy corresponding to different percentages (1%, 2%, 3%, 4%, and 5%) of training samples. As shown in the graph, the proposed method (see the light blue curve) achieves the highest accuracy even with a limited number of training samples, demonstrating its superior generalization capacity.
Figs. 14-17 present the distributions of the features extracted from the four datasets by different methods, visualized using t-distributed stochastic neighbor embedding (t-SNE). For the proposed method, samples of the same class are clustered together and different categories are clearly separated, further demonstrating the strong classification capacity of the proposed method.
2) Results on Disjointed Datasets: Randomly selecting training data for HSIs is prone to the problem that the training and test sets are similar. To classify a pixel, a patch centered on it is used as input. With random sampling, the patches extracted for training and testing always overlap to some extent. For example, for two adjacent pixels, one belonging to the training set and the other to the test set, a large portion of their corresponding patches overlaps; thus, the training and test sets are not completely separated. To avoid this issue, disjoint datasets sampled from nonoverlapping regions, as illustrated in the figure below, are used. This ensures that the training and test sets are entirely disjoint, avoiding any spatial overlap between them and allowing a better evaluation of the robustness of the models. The training and test data distribution of disjointed      Table VII shows the classification results of different models on the DIP and DUP datasets. The classification results of all models are degraded to different degrees due to the spatial separation of the training and test sets. However, the proposed method still achieves the best classification results on both datasets. On the DIP dataset, QuadNet produces OA=81.70%, AA=79.29%, and Kappa=85.99%. For the DUP dataset, the OA, AA, and Kappa are 87.88%, 90.82%, and 84.31%, respectively.
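The overlap between training and test patches under random sampling can be quantified with a short sketch. It assumes square patches of odd side length centered on each pixel; the patch size of 9 used in the example is a hypothetical value, not one taken from this section, and `patch_overlap_fraction` is an illustrative name.

```python
def patch_overlap_fraction(patch_size, d_row, d_col):
    """Fraction of a square patch shared with a second patch of the same
    size whose center is offset by (d_row, d_col) pixels."""
    shared_rows = max(0, patch_size - abs(d_row))
    shared_cols = max(0, patch_size - abs(d_col))
    return shared_rows * shared_cols / (patch_size * patch_size)


# Two horizontally adjacent pixels with 9 x 9 patches share 8 of 9 columns.
adjacent = patch_overlap_fraction(9, 0, 1)
# Centers at least one patch width apart share no pixels at all.
far_apart = patch_overlap_fraction(9, 0, 9)
```

Under these assumptions, adjacent pixels share roughly 89% of their patch content, whereas patches drawn from regions separated by at least one patch width do not overlap, which is the motivation for the disjoint split.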

D. Ablation Study
Ablation studies are conducted to further validate the effectiveness of the different modules in the proposed QuadNet model. Three configurations, namely, TA-Residual, Quadlet-Residual, and the proposed QuadNet model, are compared. In TA-Residual, the quadlet attention module is removed from the proposed QuadNet model, while the triplet-attention-aided spectral and spatial residual blocks are kept unchanged. The second model, Quadlet-Residual, removes the triplet attention from the residual blocks and retains the quadlet attention module in its original position in the QuadNet model. The final configuration is the proposed QuadNet model discussed in Section III.
Table VIII lists the OA, AA, and Kappa results of the ablation experiments over the IN, UP, and SA datasets. Overall, QuadNet, which incorporates both quadlet and triplet attention, improves the classification results in terms of OA, AA, and Kappa on all three datasets and achieves the best performance. This demonstrates that the cross-dimensional interaction among the number of feature maps, the spectral depth, and the spatial height and width helps strengthen the discriminative power of the extracted features by suppressing useless or redundant information.
Table IX lists the number of trainable weights and the computational cost during training of the proposed QuadNet and the comparison networks. HybridSN has the largest number of parameters owing to its 3-D convolution operations with large kernel sizes. DBMA and DBDA have similar numbers of parameters because both use multiscale kernels in the feature extraction process. The proposed QuadNet has a similar number of parameters to SSRN and fewer than A2S2K-ResNet, and RSSAN has the fewest parameters. In terms of floating point operations (FLOPs), the proposed QuadNet requires nearly 550 × 10^6, fewer than DBMA, DBDA, and SPANet but more than other models, such as SSRN and A2S2K-ResNet.
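To make the link between kernel size and model size concrete, the sketch below counts the weights and multiply-accumulate operations (MACs) of a single convolution layer. The layer sizes in the example are hypothetical and do not correspond to any network in Table IX; note also that FLOP-counting conventions differ (some tools report MACs, others twice that), so the figures are only indicative.

```python
import math


def conv_params(kernel, c_in, c_out, bias=True):
    """Trainable weights of a convolution with the given kernel shape.

    `kernel` is a tuple such as (3, 3) for 2-D or (3, 3, 3) for 3-D.
    """
    return math.prod(kernel) * c_in * c_out + (c_out if bias else 0)


def conv_macs(kernel, c_in, c_out, out_shape):
    """Multiply-accumulate operations for one forward pass.

    `out_shape` is the size of the output feature map (without channels).
    """
    return math.prod(kernel) * c_in * c_out * math.prod(out_shape)


# Hypothetical layer: 16 input and 32 output feature maps.
params_2d = conv_params((3, 3), 16, 32)      # 2-D, 3 x 3 kernel
params_3d = conv_params((3, 3, 3), 16, 32)   # 3-D, 3 x 3 x 3 kernel
```

With the same channel counts, the 3-D kernel carries about three times as many weights as the 2-D one, and the gap widens further for larger kernels.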

V. CONCLUSION
In this article, a cross-attention module named quadlet is proposed for capturing the dependencies of HSIs across different dimensions during the forward propagation of the network. The quadlet attention builds the relationships among the number of feature maps, the spectral bands, and the spatial height and width. In addition, triplet attention is incorporated into the spectral-spatial residual blocks to enhance the learning of spectral-spatial features. Based on the quadlet cross-attention module and the improved spectral-spatial residual blocks, a quadlet cross-attention aided residual network (QuadNet) is built for the HSI classification task. With the help of the generalized triplet attention, the developed network can extract more discriminative features and boost the classification performance. A series of experiments show that the proposed QuadNet achieves higher classification accuracy with limited samples owing to the extracted cross-dimensional dependencies and the discriminative power of its feature representation.
However, some limitations should be noted even though the proposed strategy yields encouraging results in the experiments. The presented approach uses 3-D convolution operations, which can be computationally expensive compared with 2-D convolutions. Moreover, the attention module involves an additional dimension, the number of feature maps, which further increases the computational complexity. In the future, the effect of different attention mechanisms, especially self-attention mechanisms, on the classification performance of HSIs will be investigated.