Hyperspectral Image Classification Based on 3D Coordination Attention Mechanism Network

Abstract: In recent years, owing to its powerful feature extraction ability, deep learning has been widely used in hyperspectral image classification tasks. However, the features extracted by classical deep learning methods have limited discrimination ability, resulting in unsatisfactory classification performance. In addition, because the data samples of hyperspectral images (HSIs) are limited, achieving high classification performance under limited samples is also a research hotspot. To solve the above problems, this paper proposes a deep learning framework named the three-dimensional coordination attention mechanism network (3DCAMNet). In this paper, a three-dimensional coordination attention mechanism (3DCAM) is designed. This attention mechanism can not only obtain the long-distance dependence of the spatial positions of HSIs in the vertical and horizontal directions, but also capture the differences in importance between spectral bands. In order to extract the spectral and spatial information of HSIs more fully, a convolution module based on a convolutional neural network (CNN) is adopted. In addition, a linear module is introduced after the convolution module to extract finer high-level features. To verify the effectiveness of 3DCAMNet, a series of experiments was carried out on five datasets, namely, Indian Pines (IP), Pavia University (UP), Kennedy Space Center (KSC), Salinas Valley (SV), and University of Houston (HT). The OAs obtained by the proposed method on the five datasets were 95.81%, 97.01%, 99.01%, 97.48%, and 97.69%, respectively, which are 3.71%, 9.56%, 0.67%, 2.89%, and 0.11% higher than those of the state-of-the-art A2S2K-ResNet. The experimental results show that, compared with several state-of-the-art methods, 3DCAMNet not only achieves higher classification performance but is also more robust.


Introduction
In the past decades, with the rapid development of hyperspectral imaging technology, sensors can capture hyperspectral images (HSIs) in hundreds of bands. In the field of remote sensing, an important task is hyperspectral image classification. Hyperspectral image classification is used to assign accurate labels to different pixels according to multidimensional feature space [1][2][3]. In practical applications, hyperspectral image classification technology has been widely used in many fields, such as military reconnaissance, vegetation and ecological monitoring, specific atmospheric assessment, and geological disasters [4][5][6][7][8].
Traditional machine-learning methods mainly include two steps: feature extraction and classification [9][10][11][12][13][14]. In the early stage of hyperspectral image classification, many classical methods appeared, such as feature mining technology [15] and Markov random fields [16]. However, these methods cannot effectively extract features with strong discrimination ability. To adapt to the nonlinear structure of hyperspectral data, the support vector machine (SVM), a pattern recognition algorithm, was applied [17], but this method struggles to solve multiclass classification problems effectively.
With the development of deep learning (DL) technology, some methods based on DL have been widely used in hyperspectral image classification [18][19][20]. In particular, the hyperspectral image classification method based on convolutional neural networks (CNNs) has attracted extensive attention because it can effectively deal with nonlinearly structured data [21][22][23][24][25][26][27][28]. In [29], the first attempt to extract the spectral features of HSIs by stacking multilayer one-dimensional convolutional neural networks (1DCNN) was presented. In addition, Yu et al. [30] proposed a CNN with a deconvolution and hashing method (CNNDH). Exploiting the spectral correlation and band variability of HSIs, a recurrent neural network (RNN) was used to extract spectral features [31]. In recent years, some two-dimensional neural networks have also been applied to hyperspectral image classification, and satisfactory classification performance has been obtained. For example, a two-dimensional stacked autoencoder (2DSAE) was used to extract deep spatial features [32]. In addition, Makantasis et al. [33] proposed a two-dimensional convolutional neural network (2DCNN), which was used to extract spatial information and classify the original HSIs pixel by pixel in a supervised manner. In [34], Feng et al. proposed a CNN-based multilayer spatial-spectral feature fusion and sample augmentation with local and nonlocal constraints (MSLN-CNN). MSLN-CNN not only fully extracts the complementary spatial-spectral information between shallow and deep layers, but also avoids the overfitting caused by an insufficient number of samples. In addition, in [35], Gong et al. proposed a multiscale convolutional neural network (MSCNN), which improves the representation ability of HSIs by extracting deep multiscale features. At the same time, a spatial-spectral unified network (SSUN) for HSIs was proposed [36].
This method shares a unified objective function for feature extraction and classifier training, so all parameters can be optimized simultaneously. Considering the inherent data attributes of HSIs, spatial-spectral features can be extracted more fully by using a three-dimensional convolutional neural network (3DCNN). In [37], an unsupervised feature learning strategy based on a three-dimensional convolutional autoencoder (3DCAE) was used to maximally explore spatial-spectral structure information and learn effective features in an unsupervised mode. Roy et al. [38] proposed a mixed 3DCNN and 2DCNN feature extraction method (Hybrid-SN). This method first extracts spatial and spectral features through a 3DCNN, then extracts deep spatial features using a 2DCNN, and finally realizes high-precision classification. In [39], a robust generative adversarial network (GAN) was proposed, which effectively improved classification performance. In addition, Paoletti et al. [40] proposed the pyramid residual network (PyResNet).
Although the above methods can effectively improve the classification performance on HSIs, the results are still not satisfactory. In recent years, to further improve classification performance, the channel attention mechanism has been widely studied in computer vision and applied to the field of hyperspectral image classification [41][42][43][44]. For example, the squeeze-and-excitation network (SENet) improved classification performance by introducing the channel attention mechanism [45]. Wang et al. [46] proposed the spatial-spectral squeeze-and-excitation network (SSSE), which utilizes a squeeze operator and an excitation operation to refine the feature maps. In addition, embedding an attention mechanism into a popular model can also effectively improve classification performance. In [47], Mei et al. proposed bidirectional recurrent neural networks (bi-RNNs) based on an attention mechanism, where the attention map is calculated by the tanh and sigmoid functions. Roy et al. [48] proposed a fused squeeze-and-excitation network (FuSENet), which obtains channel attention through global average pooling (GAP) and global max pooling (GMP). Ding et al. [49] proposed the local attention network (LANet), which enriches the semantic information of low-level features by embedding local attention in high-level features. However, channel attention can only obtain the attention map of the channel dimension, ignoring spatial information. In [50], to obtain prominent spatial features, the convolutional block attention module (CBAM) not only emphasizes the differences between channels through channel attention, but also uses pooling operations along the channel axis to generate a spatial attention map that highlights the importance of different spatial pixels. To fully extract spatial and spectral features, Zhong et al. [51] proposed the spatial-spectral residual network (SSRN). Recently, Zhu et al. [52] added spatial and spectral attention to SSRN, forming the residual spectral-spatial attention network (RSSAN), and achieved better classification performance. In the process of feature extraction, to avoid interference between the extracted spatial and spectral features, Ma et al. [53] designed a double-branch multi-attention (DBMA) network that extracts spatial and spectral features using different attention mechanisms in its two branches. Similarly, Li et al. [54] proposed a double-attention network (DANet), incorporating spatial attention and channel attention. Specifically, spatial attention is used to obtain the dependence between any two positions of the feature map, and channel attention is used to obtain the dependence between different channels. In [55], Li et al. proposed double-branch dual attention (DBDA). By adding spatial attention and channel attention modules to the two branches, DBDA achieves better classification performance. To highlight important features as much as possible, Cui et al. [56] proposed a new dual triple-attention network (DTAN), which uses three branches to obtain cross-dimensional interactive information and attention maps between different dimensions. In addition, in [57], to expand the receptive field and extract more effective features, Roy et al. proposed an attention-based adaptive spectral-spatial kernel improved residual network (A2S2K-ResNet).
Although many excellent methods have been applied to hyperspectral image classification, extracting features with strong discrimination ability and realizing high-precision classification with small samples remain major challenges. Although the spatial and channel attention mechanisms can obtain spatial and channel dependence, they are still limited in capturing long-distance dependence. Considering the spatial location relationships and the different importance of different bands, we propose a three-dimensional coordination attention mechanism network (3DCAMNet). 3DCAMNet includes three main components: a convolution module, a linear module, and a three-dimensional coordination attention mechanism (3DCAM). Firstly, the convolution module uses a 3DCNN to fully extract spatial and spectral features. Secondly, the linear module aims to generate a feature map containing more information. Lastly, the designed 3DCAM not only considers the vertical and horizontal directions of spatial information, but also highlights the importance of different bands.
The main contributions of this paper are summarized as follows: (1) The three-dimensional coordination attention mechanism network (3DCAMNet) proposed in this paper is mainly composed of a three-dimensional coordination attention mechanism (3DCAM), a linear module, and a convolution module. This network structure can extract features with strong discrimination ability, and a series of experiments showed that 3DCAMNet achieves good classification performance and strong robustness. (2) A 3DCAM is proposed. This attention mechanism obtains the 3D coordination attention map of HSIs by exploring the long-distance relationship between the vertical and horizontal directions of space and the importance of the different channels of the spectral dimension. (3) In order to extract spatial-spectral features as fully as possible, a convolution module is used. Similarly, in order to obtain a feature map containing more information, a linear module is introduced after the convolution module to extract finer high-level features.
The main structure of the remainder of this paper is as follows: in Section 2, the components of 3DCAMNet are introduced in detail. Some experimental results and experimental analysis are provided in Section 3. Section 4 draws the conclusions.

Methodology
In this section, we introduce the three components of 3DCAMNet in detail: the 3D coordination attention mechanism (3DCAM), linear module, and convolution module.

Overall Framework of 3DCAMNet
For a hyperspectral image, Z = {X, Y}, where X is the set of all pixel data of the image, and Y is the set of labels corresponding to all pixels. In order to effectively learn edge features, the input image is padded and sampled pixel by pixel to obtain N cubes of size S ∈ R^(H×W×L). Here, H × W is the spatial size of the cube, and L is the number of spectral bands. The designed 3DCAMNet is mainly composed of three parts. Firstly, features are extracted from the input cubes by the convolution module. Secondly, in order to fully consider the importance of the spatial and spectral dimensions of the input, a 3D coordination attention mechanism (3DCAM) is designed. After feature extraction, in order to extract high-level features more accurately, a linear module inspired by the ghost module is designed. Lastly, the final classification results are obtained through the fully connected (FC) layer and the softmax layer. The overall framework of 3DCAMNet is shown in Figure 1. Next, we introduce the principle and structure of each module in 3DCAMNet step by step.
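The padded, pixel-by-pixel cube extraction described above can be sketched in NumPy as follows. The padding mode (reflect) and the toy image size are assumptions for illustration, since the section does not specify how the border is filled:

```python
import numpy as np

def extract_cubes(image, patch=9):
    """Pad an HSI of shape (rows, cols, bands) and extract one
    patch x patch x L cube centred on every pixel, as described for
    3DCAMNet's input stage. Reflect padding is an assumption here."""
    r = patch // 2
    rows, cols, bands = image.shape
    # Pad only the spatial axes so edge pixels get full neighbourhoods.
    padded = np.pad(image, ((r, r), (r, r), (0, 0)), mode="reflect")
    cubes = np.empty((rows * cols, patch, patch, bands), dtype=image.dtype)
    idx = 0
    for i in range(rows):
        for j in range(cols):
            cubes[idx] = padded[i:i + patch, j:j + patch, :]
            idx += 1
    return cubes

hsi = np.random.rand(6, 5, 10)        # toy image: 6 x 5 pixels, 10 bands
cubes = extract_cubes(hsi, patch=9)
print(cubes.shape)                    # (30, 9, 9, 10): one cube per pixel
```

Each cube's centre pixel coincides with the original pixel it labels, which is what allows pixel-wise classification of the full scene.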

3DCAM
The application of attention mechanisms in convolutional neural networks (CNNs) can effectively enhance feature discrimination ability, and they are widely used in hyperspectral image classification. Hyperspectral images contain rich spatial and spectral information, and effectively extracting features in both the spatial and the spectral dimensions is key to better classification. Therefore, we propose a 3D coordination attention mechanism (3DCAM), which explores the long-distance relationship between the vertical and horizontal directions of the spatial dimension and the differences in band importance along the spectral dimension. From these, the mechanism derives attention masks for the spatial and spectral dimensions.
The structure of the proposed 3DCAM is shown in Figure 2. 3DCAM includes two parts: spectral attention and spatial coordination attention. Spectral and spatial attention can adaptively learn the importance of different spectral bands and spatial contexts, so as to improve the ability to distinguish different bands and obtain more accurate spatial relationships. Assuming that the input of 3DCAM is F ∈ R^(H×W×L), the output F_out can be represented as

F_out = F ⊗ M_H(F) ⊗ M_W(F) ⊗ M_L(F),

where F and F_out represent the input and output of 3DCAM, respectively, and ⊗ denotes element-wise multiplication with broadcasting. M_H(·) represents the attention map in direction H, and its output size is H × 1 × 1. M_W(·) represents the attention map in direction W, and its output size is 1 × W × 1. Similarly, M_L(·) represents the attention map in direction L, and its output size is 1 × 1 × L. M_H(·) and M_W(·) are obtained by considering the vertical and horizontal directions of the spatial information, so as to capture long-distance dependence. Specifically, F is pooled into F_H ∈ R^(H×1×1) in the vertical direction and F_W ∈ R^(1×W×1) in the horizontal direction through global average pooling, and the two results are concatenated. In order to obtain the long-distance dependence in the vertical and horizontal directions, the concatenated result is sent through a unit convolution layer, a batch normalization (BN) layer, and a nonlinear activation layer. The activation function of the nonlinear activation layer is h_swish [58]; this activation function has relatively few parameters and gives the network a richer representation ability. The h_swish function can be expressed as

h_swish(x) = x · ReLU6(x + α) / 6,

where α is a trainable parameter. Finally, the obtained result is split and convolved to obtain the vertical attention map M_H(·) and the horizontal attention map M_W(·).
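The h_swish activation can be sketched directly. Note that α is trainable in this paper, whereas the standard formulation fixes the offset at 3, which is used as the default below:

```python
import numpy as np

def h_swish(x, alpha=3.0):
    # h_swish(x) = x * ReLU6(x + alpha) / 6.
    # alpha is a trainable parameter in the paper; the standard
    # formulation fixes alpha = 3, used here as the default.
    return x * np.clip(x + alpha, 0.0, 6.0) / 6.0

print(h_swish(np.array([-4.0, 0.0, 4.0])))   # [0. 0. 4.]
```

For large positive inputs the function behaves like the identity, and for inputs below −α it is exactly zero, giving a smooth, cheap gate.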
Similarly, F passes through the global average pooling layer to obtain F_L ∈ R^(1×1×L); the result then passes through a unit convolution layer and an activation function layer to obtain the spectral attention map M_L(F). The implementation process of 3DCAM is shown in Algorithm 1.

Algorithm 1 Details of 3DCAM.
Input: feature F ∈ R^(H×W×L).
Output: feature of 3DCAM F_out ∈ R^(H×W×L).
Initialization: initialize all weight parameters of the convolutional kernels.
1: Pass F through the L-AvgPool, H-AvgPool, and W-AvgPool layers to generate F_L ∈ R^(1×1×L), F_H ∈ R^(H×1×1), and F_W ∈ R^(1×W×1), respectively;
2: Reshape F_H to 1 × H × 1 and concatenate it with F_W to generate F_HW;
3: Convolve F_HW with the 3D unit convolution kernel and pass the result through the batch normalization and nonlinear activation layers to generate F'_HW;
4: Split F'_HW and convolve the two parts with 3D unit convolution kernels to generate F'_H and F'_W;
5: Normalize F'_H and F'_W with the sigmoid function to generate the attention maps M_H(F) ∈ R^(H×1×1) and M_W(F) ∈ R^(1×W×1);
6: Convolve F_L with the 3D unit convolution kernel to generate F'_L;
7: Normalize F'_L with the sigmoid function to generate the attention map M_L(F) ∈ R^(1×1×L);
8: Apply M_H(F), M_W(F), and M_L(F) to the input feature F to obtain F_out ∈ R^(H×W×L).
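A shape-level NumPy sketch of 3DCAM's directional pooling, gating, and broadcasting is given below. The unit convolutions, BN, and h_swish layers of the full mechanism are omitted (replaced by identity) to keep the example minimal, so this illustrates only the data flow, not the trained mechanism:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dcam_3d(F):
    """Shape-level sketch of 3DCAM for an input F of shape (H, W, L).
    The 1x1x1 convolutions, BN, and h_swish from Algorithm 1 are
    omitted here; only pooling, sigmoid gating, and broadcasting remain."""
    # Directional global average pooling.
    F_H = F.mean(axis=(1, 2), keepdims=True)   # (H, 1, 1)
    F_W = F.mean(axis=(0, 2), keepdims=True)   # (1, W, 1)
    F_L = F.mean(axis=(0, 1), keepdims=True)   # (1, 1, L)
    # Sigmoid gates in each direction.
    M_H, M_W, M_L = sigmoid(F_H), sigmoid(F_W), sigmoid(F_L)
    # Broadcast the three attention maps over the input.
    return F * M_H * M_W * M_L

F = np.random.rand(7, 7, 20)
F_out = dcam_3d(F)
print(F_out.shape)   # (7, 7, 20), same shape as the input
```

Because each gate lies in (0, 1) and is broadcast along the remaining two axes, the output keeps the input's shape while re-weighting every position by its row, column, and band importance.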

Convolution Module
CNNs have strong feature extraction abilities. In particular, the convolution and pooling operations in a CNN can extract deeper information from the input data. Given the data properties of HSIs, applying a three-dimensional convolutional neural network (3DCNN) preserves the correlation between data pixels, so that structural information is not lost. The effective extraction of spatial and spectral information in hyperspectral images remains a central focus of hyperspectral image classification.
In order to effectively extract the spatial-spectral features of HSIs, a convolution block based on space and spectrum is proposed in this paper. Inspired by Inception V3 [58], the convolution layers use smaller convolution kernels, which can not only learn the spatial-spectral features of HSIs but also effectively reduce the number of parameters. The structure of the convolution module based on space and spectrum is shown in Figure 3. As can be seen from Figure 3, the input X_i consists of c feature maps with the size of n × n × b. X_o is the output of X_i after multilayer convolution, which can be expressed as

X_o = F(X_i),

where F(·) is a nonlinear composite function. Specifically, the module consists of three layers, each composed of a convolution, batch normalization (BN), and a nonlinear activation function (ReLU). The kernel size of each convolution layer is 1 × 1 × 3. The ReLU function increases the nonlinearity between the layers of the neural network, enabling it to model complex tasks, as shown below:

g_activate(x) = max(0, x),

where x represents the input of the nonlinear activation function, and g_activate(·) represents the nonlinear activation function. In addition, in order to accelerate convergence, a BN layer is added before ReLU to normalize the data, which alleviates the problem of gradient dispersion to a certain extent [59]. The normalization formula is as follows:

x̂^(i) = (x^(i) − E[x^(i)]) / sqrt(Var[x^(i)]),

where E[x^(i)] represents the mean of the input to each neuron, and Var[x^(i)] represents the variance of the input to each neuron.
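The BN and ReLU operations of the composite function can be illustrated as follows. This is an inference-style sketch over a toy batch, omitting BN's learnable scale and shift parameters:

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # x_hat = (x - E[x]) / sqrt(Var[x]), computed per feature
    # (last axis); the learnable scale/shift of a full BN layer
    # are omitted in this sketch.
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mean) / np.sqrt(var + eps)

def relu(x):
    # g_activate(x) = max(0, x)
    return np.maximum(0.0, x)

batch = np.random.randn(32, 8) * 5.0 + 2.0   # 32 samples, 8 features
out = relu(batch_norm(batch))
print(out.min() >= 0.0)                       # True: ReLU clips negatives
```

After normalization each feature has approximately zero mean and unit variance, which is what keeps the gradients well scaled through the stacked Conv-BN-ReLU layers.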

Linear Module
In the task of hyperspectral image classification, extracting as much feature information as possible is key to improving classification performance. Inspired by the ghost module [60], this paper adopts a linear module. Based on the features output after fusing the 3DCAM and the convolution module, the linear module generates a feature map containing more information.
The structure of the linear module is shown in Figure 4. The input y_i is linearly convolved to obtain y_m, and the resulting feature map y_m is then concatenated with the input y_i to obtain the output y_o. The output y_m of the linear convolution is calculated as follows:

v^(x,y,z)_(i,j) = φ( Σ_C Σ_(α=0)^(h_i−1) Σ_(β=0)^(w_i−1) Σ_(γ=0)^(l_i−1) K^(α,β,γ)_(i,j,C) · v^((x+α),(y+β),(z+γ))_((i−1),C) + b_(i,j) ),

where φ(·) is a linear convolution function, v^(x,y,z)_(i,j) represents the neuron at position (x, y, z) of the j-th feature map in the i-th layer, h_i, w_i, and l_i represent the height, width, and spectral dimension of the convolution kernel, respectively, and C indexes the feature maps of layer (i − 1). In addition, K^(α,β,γ)_(i,j,C) represents the weight at (α, β, γ) of the j-th convolution kernel for the C-th feature map of layer i, v^((x+α),(y+β),(z+γ))_((i−1),C) represents the value of the neuron at (x + α, y + β, z + γ) of the C-th feature map in layer (i − 1), and b_(i,j) is the bias term.
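A minimal sketch of the ghost-style linear branch is shown below. The cheap linear map φ is stood in for by a fixed per-channel scaling, a hypothetical choice for illustration, since the actual kernel shape h_i × w_i × l_i is a design parameter of the module:

```python
import numpy as np

def linear_module(y_i, weights=None):
    """Ghost-module-style linear branch: apply a cheap linear map phi
    to the input features and concatenate the result with the input.
    A per-channel scaling stands in for the linear convolution phi."""
    c = y_i.shape[-1]
    if weights is None:
        weights = np.full(c, 0.5)       # hypothetical fixed linear map
    y_m = y_i * weights                  # phi(y_i): cheap extra features
    return np.concatenate([y_i, y_m], axis=-1)   # cascade y_i with y_m

y_i = np.random.rand(9, 9, 16)           # toy feature map, 16 channels
y_o = linear_module(y_i)
print(y_o.shape)                         # (9, 9, 32): channels doubled
```

The point of the design is that half of the output channels are produced by an operation far cheaper than a full convolution, enriching the feature map at little extra cost.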

Experimental Results and Analysis
In order to verify the classification performance of 3DCAMNet, this section conducts a series of experiments using five datasets. All experiments were run on the same configuration, i.e., a server with an Intel(R) Core(TM) i9-9900K CPU, an NVIDIA GeForce RTX 2080 Ti GPU, and 32 GB of RAM. This section covers the experimental setup, comparison of results, and discussion.

Datasets
Five common datasets were selected, namely, Indian Pines (IP), Pavia University (UP), Kennedy Space Center (KSC), Salinas Valley (SV), and University of Houston (HT). The IP, KSC, and SV datasets were captured by airborne visible infrared imaging spectrometer (AVIRIS) sensors. The UP and HT datasets were obtained by the reflective optical spectral imaging system (ROSIS-3) sensor and the compact airborne spectral imager (CASI) sensor, respectively.
Specifically, IP has 16 feature categories with a spatial size of 145 × 145, and 200 spectral bands are available for experiments. Compared with IP, UP has fewer feature categories, only nine, and the image size is 610 × 340. Excluding 13 noise bands, 103 bands are used in the experiment. The spatial resolution of KSC is 20 m, and the spatial size of each image is 512 × 614. Similarly, after removing the water absorption bands, 176 bands are left for the experiment. The SV spatial size is 512 × 217 with 16 feature categories, and 204 spectral bands are available for experiments. The last dataset, HT, has a high spatial resolution and a spatial size of 349 × 1905; the number of bands is 114, the wavelength range is 380-1050 nm, and there are 15 feature categories. The details of the datasets are shown in Table 1. In 3DCAMNet, the batch size and maximum number of training epochs were 16 and 200, respectively, and the Adam optimizer was used during training. The learning rate and input space size were 0.0005 and 9 × 9, respectively. In addition, the cross-entropy loss was used to measure the difference between the true probability distribution and the predicted probability distribution. Table 2 shows the hyperparameter settings of 3DCAMNet. Table 2. Hyperparameter settings of 3DCAMNet.

Layer Name | Output Shape | Filter Size | Padding

Evaluation Index
Three evaluation indicators were adopted in the experiments, namely, overall accuracy (OA), average accuracy (AA), and the Kappa coefficient (Kappa) [61]. These indicators are all dimensionless. The confusion matrix H = (a_(i,j))_(n×n) is constructed from the true category of each pixel and the predicted category, where n is the number of categories, and a_(i,j) is the number of samples of category i classified as category j. Assuming that the total number of samples of HSIs is M, OA is the ratio of the number of accurately classified samples to the total number of samples:

OA = (Σ_(i=1)^(n) a_(i,i)) / M,

where a_(i,i) is a correctly classified (diagonal) element of the confusion matrix. Similarly, AA is the average of the per-category classification accuracies:

AA = (1/n) Σ_(i=1)^(n) a_(i,i) / a_(i,_).

The Kappa coefficient is another performance evaluation index, calculated as follows:

Kappa = (M · Σ_(i=1)^(n) a_(i,i) − Σ_(i=1)^(n) a_(i,_) · a_(_,i)) / (M² − Σ_(i=1)^(n) a_(i,_) · a_(_,i)),

where a_(i,_) and a_(_,i) represent the sum of all elements in row i and in column i of the confusion matrix H, respectively.
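The three indices follow directly from the confusion matrix, as the following sketch shows:

```python
import numpy as np

def classification_metrics(H):
    """OA, AA, and Kappa from an n x n confusion matrix H, where
    H[i, j] counts samples of class i predicted as class j."""
    H = np.asarray(H, dtype=float)
    M = H.sum()                      # total number of samples
    diag = np.diag(H)                # correctly classified counts
    oa = diag.sum() / M              # overall accuracy
    aa = (diag / H.sum(axis=1)).mean()   # mean per-class accuracy
    # Chance agreement from row and column marginals, then Kappa.
    pe = (H.sum(axis=1) * H.sum(axis=0)).sum() / M**2
    kappa = (oa - pe) / (1.0 - pe)
    return oa, aa, kappa

H = [[50, 2, 0],
     [3, 40, 5],
     [0, 1, 49]]
oa, aa, kappa = classification_metrics(H)
print(round(oa, 4))   # -> 0.9267
```

The (oa − pe) / (1 − pe) form used here is algebraically identical to the Kappa formula above after multiplying numerator and denominator by M².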
SVM is a classification method based on the radial basis function (RBF) kernel. SSRN designs spatial and spectral residual modules to extract spatial-spectral information from neighborhood blocks of the input three-dimensional cube data. PyResNet gradually increases the feature dimension of each layer through residual connections, so as to obtain more location information. To further improve classification performance, DBMA and DBDA design spectral and spatial branches to extract the spectral-spatial features of HSIs, using attention mechanisms to emphasize the channel and spatial features in the two branches, respectively. Hybrid-SN verifies the effectiveness of a hybrid spectral CNN, whereby spectral-spatial features are first extracted through a 3DCNN, and then spatial features are extracted through a 2DCNN. A2S2K-ResNet designs an adaptive kernel attention module, which not only automatically adjusts the receptive fields (RFs) of the network, but also jointly extracts spectral-spatial features, so as to enhance the robustness of hyperspectral image classification. Unlike the attention mechanisms in the above methods, a 3D coordination attention mechanism is proposed in this paper to obtain the long-distance dependence in the vertical and horizontal directions and the importance of the spectral bands. Similarly, in order to extract more discriminative spectral and spatial features, the 3DCNN and linear module are used to fully extract joint spectral-spatial features, so as to improve classification performance.
The classification accuracy of all methods on the IP, UP, KSC, SV, and HT datasets is shown in Tables A1-A5, respectively. It can be seen that, on the five datasets, compared with the other methods, the method proposed in this paper not only obtained the best OA, AA, and Kappa, but also had an advantage in classification accuracy for almost every class. Specifically, due to the complex distribution of ground objects in the IP dataset, the classification accuracy of all methods on this dataset was low, but the proposed method obtained better accuracy not only in the categories that were easy to classify, but also in categories that were difficult to classify, such as Class 2, Class 4, and Class 9. Similarly, on the UP dataset, the accuracy of the proposed method, whether measured by OA, AA, and Kappa or by the individual categories, shows great advantages over the other methods. Compared with the IP dataset, the UP dataset has fewer feature categories, and all methods exhibited better classification results, but the proposed method obtained the highest classification accuracy. The KSC dataset has the same number of feature categories as the IP dataset (16), but its categories are spatially scattered. It can be seen from Table A3 that all classification methods obtained satisfactory results, but the proposed method obtained the best classification accuracy. In addition, because the sample distribution of the SV dataset is relatively balanced and the ground object distribution is relatively regular, the classification accuracy of all methods was high. In contrast, the HT images were collected from the University of Houston campus, with a complex distribution and many categories, but the proposed method could still achieve high-precision classification.
In addition, Figures 5-9 show the classification visualization results of all methods, including the false-color composite image and the classification map of each method. Because traditional classification methods such as SVM cannot effectively extract spatial-spectral features, their classification maps are poor, rough, and noisy. The ResNet-based deep network methods, including SSRN and PyResNet, can obtain good classification results, but a small amount of noise remains. In addition, DBMA, DBDA, and A2S2K-ResNet all add an attention mechanism to the network, which yields better classification visualization results, but there are still many classification errors. In contrast, the classification visualization results obtained by the proposed method are smoother and closer to the real feature map. This fully verifies the superiority of the proposed method.
In conclusion, the analysis from multiple angles verifies that this method has more advantages than the others. First, among all methods, the proposed method had the highest overall accuracy (OA), average accuracy (AA), and Kappa coefficient (Kappa). In addition, the proposed method could not only achieve high classification accuracy in categories that were easy to classify, but also had strong discrimination ability in categories that were difficult to classify. Second, among the classification visualization results of all methods, the proposed method obtained smoother results that were closer to the false-color composite image.

Discussion
In this section, we discuss in detail the modules and parameters that affect the classification performance of the proposed method, including the impact of different attention mechanisms on classification accuracy OA, the impact of different input space sizes and different training sample ratios on classification accuracy OA, ablation experiments of different modules in 3DCAMNet, and the comparison of running time and parameters of different methods on IP datasets.

Effects of Different Attention Mechanisms on OA
In order to verify the effectiveness of 3DCAM, we consider two other typical attention mechanisms for comparison, SE and CBAM, as shown in Figure 10. The experimental results of the three attention mechanisms are shown in Table 3. The results show that the classification accuracy of 3DCAM on the five datasets was better than that of SE and CBAM, and CBAM was better than SE on the whole. The reason is that SE attention only emphasizes the importance differences between channels, without considering spatial differences. Although CBAM considers both channel dependence and spatial dependence, it does not fully consider spatial location information. In contrast, for hyperspectral data, 3DCAM fully considers the positional relationships in the horizontal and vertical directions of space, obtains long-distance dependence, and considers the differences along the spectral dimension. Therefore, our proposed 3DCAM can better mark important spectral bands and spatial locations.


Effects of Different Input Space Sizes and Different Training Sample Ratios on OA
The input spatial size n × n and the training sample ratio p are two important hyperparameters of 3DCAMNet, and their values have a large impact on classification performance. Input spatial sizes of 5 × 5, 7 × 7, 9 × 9, 11 × 11, and 13 × 13 were tested to find the optimal spatial size for the 3DCAMNet method. The training sample ratio p denotes the fraction of labeled samples used for training: for the IP, KSC, and HT datasets, p was taken from {1.0%, 2.0%, 3.0%, 4.0%, 5.0%}, while for the UP and SV datasets, p was taken from {0.5%, 1.0%, 1.5%, 2.0%, 2.5%}. Figure 11 shows the OA of 3DCAMNet for different input sizes n and training sample ratios p on all datasets. As can be seen from Figure 11, when n = 5 and the training sample ratios of the IP, UP, KSC, SV, and HT datasets were 1.0%, 0.5%, 1.0%, 0.5%, and 1.0%, respectively, the proposed method obtained the lowest OA. As the training sample ratio increased, OA rose gradually. The best classification performance was obtained when n = 9 and the training sample ratio was the highest.
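The patch extraction behind the input-size sweep can be sketched as follows. The `extract_patch` helper and its mirror padding are illustrative assumptions, since the paper does not specify how border pixels are handled; the point is only that each labeled pixel contributes an n × n spatial neighborhood across all bands.

```python
import numpy as np

def extract_patch(cube, row, col, n):
    """Return the n x n spatial neighborhood (all bands) centered at
    (row, col). The cube of shape (bands, H, W) is mirror-padded so
    that border pixels still receive a full-sized patch."""
    r = n // 2
    padded = np.pad(cube, ((0, 0), (r, r), (r, r)), mode="reflect")
    # After padding by r, original pixel (row, col) sits at the
    # center of the slice starting at (row, col).
    return padded[:, row:row + n, col:col + n]

cube = np.arange(5 * 5, dtype=float).reshape(1, 5, 5)  # 1-band toy cube
patch = extract_patch(cube, 0, 0, 9)
print(patch.shape)  # (1, 9, 9)
```

Sweeping n over {5, 7, 9, 11, 13} then amounts to re-extracting these patches and retraining at each size, which is why the cost of the sweep grows with n.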

Comparison of Contributions of Different Modules in 3DCAMNet
To verify the effectiveness of the proposed method, we conducted ablation experiments on its two key modules: the linear module and 3DCAM. The experimental results are shown in Table 4. When both the linear module and 3DCAM were used, the OA on all datasets was the highest, which reflects the strong generalization ability of the proposed method. Conversely, when neither module was used, the OA on all datasets was the lowest. Applying either the linear module or 3DCAM alone also improved OA. Overall, the ablation experiments show that the base network achieved the lowest classification performance and that performance improved steadily as modules were added, fully verifying the effectiveness of the linear module and 3DCAM.
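The four ablation settings can be organized as a simple configuration sweep. The stage names below are placeholders mirroring the paper's modules, not actual model code; the sketch only shows how the on/off combinations of the two optional modules are enumerated.

```python
from itertools import product

def build_pipeline(use_linear, use_3dcam):
    """Assemble the stage list for one ablation setting: the
    convolution module and classifier are always present, while the
    linear module and 3DCAM are toggled independently."""
    stages = ["conv_module"]
    if use_linear:
        stages.append("linear_module")
    if use_3dcam:
        stages.append("3dcam")
    stages.append("classifier")
    return stages

# Enumerate the four rows of an ablation table like Table 4.
for use_linear, use_3dcam in product([False, True], repeat=2):
    print(use_linear, use_3dcam, build_pipeline(use_linear, use_3dcam))
```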

Comparison of Running Time and Parameters of Different Methods on IP Dataset
For an input size of 9 × 9 × 200, Table 5 compares the parameter counts and running times of 3DCAMNet and other advanced methods. The spatial-spectral PyResNet required the most parameters, because it obtains more location information by gradually increasing the feature dimension of all layers, which inevitably requires more parameters. Among all methods, DBDA had the longest running time. The parameter count of the proposed method was similar to that of the other methods, and its running time was moderate. For further comparison, the OA values obtained by these methods on the IP dataset are shown in Figure 12. Combined with Table 5, it can be seen that the proposed 3DCAMNet had a moderate parameter count and running time while achieving the highest OA.
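Parameter counts such as those in Table 5 are sums over the layers of each network; the sketch below shows the standard count for a 3D convolution layer. The channel widths and kernel sizes are illustrative assumptions, not the actual 3DCAMNet configuration.

```python
def conv3d_params(in_ch, out_ch, kernel):
    """Parameter count of one 3D convolution layer with kernel
    (kd, kh, kw), including one bias per output channel."""
    kd, kh, kw = kernel
    return out_ch * (in_ch * kd * kh * kw + 1)

# Hypothetical three-layer 3D-CNN stack (widths/kernels are
# assumptions for illustration only).
layers = [(1, 8, (7, 3, 3)), (8, 16, (5, 3, 3)), (16, 32, (3, 3, 3))]
total = sum(conv3d_params(i, o, k) for i, o, k in layers)
print(total)  # 20144
```

The same formula explains why progressively widening every layer, as in PyResNet, inflates the count: parameters grow with the product of the input and output channel widths.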

Conclusions
A 3DCAMNet method was proposed in this paper. It consists of three main modules: a convolution module, a linear module, and 3DCAM. Firstly, the convolution module uses 3DCNN to fully extract spatial-spectral features. Secondly, the linear module is introduced after the convolution module to extract finer high-level features. Lastly, 3DCAM was designed to capture not only the long-distance dependence along the vertical and horizontal directions of HSI space, but also the differences in importance between spectral bands. The proposed 3DCAM was compared with two classical attention mechanisms, SE and CBAM, and the experimental results show that the classification method based on 3DCAM obtained better classification performance. Compared with state-of-the-art methods such as A2S2K-ResNet and Hybrid-SN, 3DCAMNet achieved better classification performance. The reason is that, although A2S2K-ResNet can expand the receptive field (RF) via an adaptive convolution kernel, its deep features cannot be reused; similarly, Hybrid-SN extracts spatial and spectral features using 2DCNN and 3DCNN, but its classification performance was still worse than that of 3DCAMNet because of its small RF and insufficiently discriminative features. In addition, a series of experiments on five datasets verified the effectiveness of the proposed method: 3DCAMNet showed higher classification performance and stronger robustness than other state-of-the-art methods, highlighting its effectiveness in hyperspectral classification. In future work, we will consider more efficient attention and spatial-spectral feature extraction modules.

Acknowledgments:
We would like to thank the handling editor and the anonymous reviewers for their careful reading and helpful remarks.

Conflicts of Interest:
The authors declare no conflict of interest.