Spectral and Spatial Global Context Attention for Hyperspectral Image Classiﬁcation

: Recently, hyperspectral image (HSI) classiﬁcation has attracted increasing attention in the remote sensing ﬁeld. Plenty of CNN-based methods with diverse attention mechanisms (AMs) have been proposed for HSI classiﬁcation due to AMs being able to improve the quality of feature representations. However, some of the previous AMs squeeze global spatial or channel information directly by pooling operations to yield feature descriptors, which inadequately utilize global contextual information. Besides, some AMs cannot exploit the interactions among channels or positions with the aid of nonlinear transformation well. In this article, a spectral-spatial network with channel and position global context (GC) attention (SSGCA) is proposed to capture discriminative spectral and spatial features. Firstly, a spectral-spatial network is designed to extract spectral and spatial features. Secondly, two novel GC attentions are proposed to optimize the spectral and spatial features respectively for feature enhancement. The channel GC attention is used to capture channel dependencies to emphasize informative features while the position GC attention focuses on position dependencies. Both GC attentions aggregate global contextual features of positions or channels adequately, following a nonlinear transformation. Experimental results on several public HSI datasets demonstrate that the spectral-spatial network with GC attentions outperforms other related methods.


Introduction
Compared with traditional panchromatic and multispectral remote sensing images, hyperspectral images (HSIs) contain rich spectral information owing to the hundreds of narrow contiguous wavelength bands. In addition, some spatial information from homogeneous areas is also incorporated into HSIs. Recently, HSIs have been widely used in various kinds of fields, such as land cover mapping [1], change detection [2], object detection [3], vegetation analysis [4], etc. With the rapid development of HSI technology, HSI classification has become a hot and valuable topic, which aims at assigning each pixel vector to a specific land cover class [5,6]. Due to the curse of dimensionality and the Hughes phenomenon [7,8], how to explore the plentiful spectral and spatial information of HSIs remains extremely challenging.
To take advantage of abundant spectral information, traditional HSI classification methods tend to take an original pixel vector as the input, such as κ-nearest neighbors (KNNs) [9], multinomial logistic regression (MLR) [10], and linear discriminant analysis (LDA) [11]. These methods mainly focus on two steps: feature engineering and classifier training. Feature engineering reduces the high dimensionality of the spectral pixel vector to capture effective features. Then, the extracted features are fed into a general-purpose classifier to yield the classification results. However, these spectral-based classifiers only global spectral or spatial information, which inadequately utilizes global contextual information. On the other hand, they recalibrate the importance of channels and positions by the rescaling operation, which cannot effectively capture long-range feature dependencies. Then, Li et al. [32] proposed a double-branch dual-attention mechanism network (DBDA) for HSI classification based on the DBMA and the DANet [33]. The AMs in the DBDA consume considerable computing resources because of matrix multiplication operations when obtaining attention maps. The interactions among enhanced channels or positions are not well exploited.
In order to alleviate the problems of AMs in the DBMA and the DBDA, we developed two novel AMs known as channel global context (GC) attention and position GC attention inspired by the GC block [34]. The proposed channel GC attention and position GC attention can make full use of global contextual information with less time consumption. In addition, the interactions among enhanced channels or positions are modeled to learn high-level semantic feature representations. Concretely, our channel and position GC attentions can be abstracted into three procedures: (a) feature aggregation, which aggregates the features of all positions or channels via weighted summation under the aid of the channel or position attention map to yield global contextual features; (b) feature transformation, which learns the adaptive channel-wise or position-wise non-linear relationships by a bottleneck transform module consisting of two 1 × 1 convolutions, the ReLU activation function, and layer normalization; (c) feature enhancement, which merges the transformed global contextual information into features of all positions or channels by element-wise summation to capture long-range dependencies and obtain more powerful feature representations. To sum up, the main contributions of this paper are the following: • An end-to-end spectral-spatial framework with channel and position global context (GC) attention (SSGCA) is proposed for HSI classification. The SSGCA has two branches: the spectral branch with channel GC attention is used to capture spectral features, while the spatial branch with position GC attention is used to obtain spatial features. At the end of the network, spectral and spatial features are combined for HSI classification. • A channel GC attention and a position GC attention are proposed for feature enhancement in the spectral branch and the spatial branch, respectively. The channel GC attention is designed to capture interactions among channels, while the position GC attention is invented to explore interactions among positions. Both GC attentions can make full use of global contextual information with less time consumption and model long-range dependencies to obtain more powerful feature representations. • The SSGCA network is applied to three well-known public HSI datasets. Experimental results demonstrate that our network achieves the best performance compared with other well-known networks.
The remainder of this article is organized as follows. Section 2 introduces the related work, and Section 3 describes the proposed methodology in detail. Next, the experimental results and comprehensive analysis are reported in Sections 4 and 5. Finally, some conclusions of this article are drawn in Section 6.

Related Work
In this section, we introduce the related work that plays a significant role in our work, including 3D convolution operation, CNN-based methods for HSI classification, the dense connection block, and the attention mechanism (AM).

3D Convolution Operation
The 3D convolution operation was first proposed in [35] to compute features from both spatial and temporal dimensions for human action recognition. Later, various 3D CNN networks based on the 3D convolution operation were designed for HSI classification. For example, Chen et al. [25] proposed a deep 3D CNN model, which employed several 3D convolutional and pooling layers to extract deep spectral-spatial feature maps for classification. Different from [25], Li et al. [36] designed a 3D CNN network, which stacks 3D convolutional layers without the pooling layer to extract deep spectral-spatialcombined features effectively. Furthermore, Roy et al. [26] proposed a hybrid model named HybridSN which consists of 3D CNN based on 3D convolution and 2D CNN based on 2D convolution to facilitate the joint spectral-spatial feature representations and spatial feature representations. To sum up, 1D convolution extracts spectral features, whereas 2D convolution extracts local spatial features; unlike 1D and 2D convolution, 3D convolution allows extracting spatial and spectral information simultaneously.
In this paper, 3D convolution is employed as a fundamental element playing a crucial role in the feature extraction stage. At the same time, BN [37] and the ReLU activation function are attached to each 3D convolution operation in our network, which can accelerate the learning rate of DL models and assist the network to learn non-linear feature relationships, respectively. The 3D convolution operation can be formulated as [35]: where l indicates the layer that is discussed, i is the number of feature maps in this layer, v xyz li represents the output at position (x, y, z) on the ith feature maps of the lth layer, H l , W l , and R l stand for the height, width, and channel number of the 3D convolution kernel, respectively, j indexes the feature maps in the (l − 1)th layer connected to the current feature maps, k hwr lij is the value at position (h, w, r) of the kernel corresponding to the jth feature maps, g is the activation function, and b is the bias.

Cube-Based Methods for HSI Classification
Different from traditional pixel-based classification methods [18][19][20] that only utilize spectral information, cube-based methods explore both spectral and spatial information from HSIs. Recently, many cube-based methods were proposed for HSI classification, such as DRCNN [38], CDCNN [39], DCPN [40], and SSAN [41]; these methods have attracted increasing attention and made considerable achievements. In the remainder of this section, we will briefly introduce the process of HSI classification with the aid of cube-based methods.
For a specific pixel of an HSI, a square HSI data cube is cropped, centered on this pixel, which is taken as the input data of the cube-based network, and the land cover label of the HSI cube is determined by its central pixel. Let x i ∈ R h×w×c represent the ith HSI data cube and y i ∈ {1, 2, . . . , m} represent the corresponding land cover label, where h × w is the spatial size, c is the number of channels, and m is the number of land cover categories. Consequently, the ith sample can be denoted as (x i , y i ), and all samples will be divided into three sets. To be specific, a certain number of samples are randomly selected as the training set; another certain number of samples are randomly assigned as the validation set; and the remaining samples are used as the testing set. The training set is fed into the network in batches to adjust the trainable parameters; the validation set acts as the monitor to observe the optimization process of the network, while the testing set serves to evaluate the classification performance of the network. In this article, the designed network is also cube-based, but the difference is that we introduce AMs into our network to enhance the extracted features and thus obtain more powerful feature representations for HSI classification.

Dense Connection Block
In order to alleviate the vanishing gradient problem, strengthen feature propagation, and encourage feature reuse, Gao et al. [29] designed DenseNet, which introduces direct connections from any layer to all subsequent layers. Inspired by SSRN [28] and DenseNet, Wang et al. [30] proposed the FDSSC network for HSI classification to learn discriminative spectral and spatial features separately. The FDSSC consists of a spectral dense block, a reducing dimension block, and a spatial dense block. In the DBMA [31] and DBDA [32], the capability to extract features of dense connection blocks from the FDSSC was demonstrated once again. For this article, we also use the spectral dense block and spatial dense block from the FDSSC as the baseline network for feature extraction, and the difference is that the spectral and spatial branches are connected in parallel, the same as the DBMA and DBDA, rather than in a cascaded manner. These two dense connection blocks can reduce the high dimensionality and automatically learn more effective spatial and spectral features separately with little depletion of the computing resources.
As shown in Figures 1 and 2, the spectral and spatial dense blocks consist of several sets of feature maps, direct connections, and 3D convolution operations. The BN operation and the ReLU activation function are also included. In the spectral dense block, the size of the 3D convolution kernels is (1 × 1 × d, k), the stride is (1, 1, 1), and the manner of padding is "same", which can ensure the output shape is the same as the input shape. The feature maps of the lth layer receive feature maps from all previous outputs and the initial input. Consequently, if we set x l as the lth feature maps, we can calculate it by the previous (l − 1) feature maps, as shown in the following equation: where [x 0 , x 1 , . . . , x l−1 ] refers to the concatenation of the previous feature maps. H l (·) includes batch normalization (BN), the ReLU activation function, and 3D convolution operations. Furthermore, the number of lth input feature maps k l can be calculated as follows: where k 0 is the number of initial feature maps and k is the kernel number of the 3D convolution operation. In this paper, the spectral dense block consists of four layers; if the input shape is (p × p × b, k 0 ), after the above analysis, we can know that the shape of the output remains (p × p × b), and the channel number will change to k 0 + 3 × k. In the spatial dense block, the kernel size of 3D convolution is (r × r × 1, k), the manner of padding is "same", and the stride is (1, 1, 1). Same as spectral dense block, the feature maps of current layer have been linked to previous feature maps and back feature maps, and the final output is merged by all previous outputs and the initial input. As shown in Figure 2, if the input shape is (q × q × m, k 0 ), the shape of output feature maps remain (q × q × m) and the channel number will change into k 0 + 3 × k after spatial dense block.

Attention Mechanism
Different spectral channel and spatial position features acquired by the DL-based network may provide different contributions to classification. Therefore, how to make models focus on the most informative part and differentiate low-correlation information is essential for classification. The AM was firstly presented in language translation [42]. Immediately afterwards, it developed rapidly and acquired incredible breakthroughs in the field of computer vision. For example, Hu et al. [43] invented a squeeze-and-excitation network (SENet) to adaptively refine the channel-wise feature response by modeling interdependencies among channels, which brings significant improvements to the performance of CNNs. Different from SENet, Woo et al. [44] proposed a convolutional block attention module (CBAM), which exploits attention in both the channel and position dimensions to learn what and where to emphasize or suppress, aiming at refining intermediate features effectively. With the purpose of capturing long-range dependencies, Wang et al. [45] presented non-local operations as a generic family of building blocks for capturing long-range dependencies, which compute the response at a position as a weighted sum of the features at all positions. Inspired by the non-local block from [45] and the SE block from [43], Yue Cao et al. [34] proposed a global context (GC) block, which is lightweight and can model the global context more effectively.
Recently, AMs have received increasing attention in the field of remote sensing. Since HSIs contain more abundant information, especially in the spectral dimension, it is critical to avoid the impact of redundant information while utilizing useful information efficiently. Therefore, plenty of DL-based methods combined with various kinds of AMs have been designed for HSI classification, such as SSAN [41], the DBMA [31], the DBDA [32], and so on. These networks further improved the classification performance with the assistance of AMs; however, previous AMs attached to CNN-based networks for HSI classification have some insufficiencies. For example, they inadequately utilize global contextual information and cannot model long-range dependencies effectively; in addition, they are time-consuming and cannot learn non-linear feature relationships. To tackle these issues, the GC block [34] employed in the field of general images seems to work. However, the GC block only focuses on the channel dimension, and in the HSI data cube, the position characteristics also play a crucial role in classification. Therefore, we invented a position-wise framework resembling the GC block, aiming at processing the position information of HSIs. These two frameworks called channel GC attention and position GC attention are attached to the spectral and spatial branch to optimize the features, avoiding the disadvantages of previous AMs. The implementation details of our channel GC attention and position GC attention are described in Sections 3.1 and 3.2.

Methodology
In this Section, we introduce the general architecture of the proposed channel global context (GC) attention and position GC attention. We describe the implementation details of our spectral-spatial network with channel and position GC attention (SSGCA) in three specific stages.

Channel Global Context Attention
As illustrated in Figure 3, the channel GC attention is abstracted into three procedures: (a) feature aggregation, which performs a weighted summation on the features of all positions to yield global contextual features; (b) feature transformation, which learns nonlinear channel-wise inner relationships by a bottleneck transform module; (c) feature enhancement, which merges the transformed channel features into each position to utilize long-range dependencies, optimizing the feature representations. Specifically, for the input feature maps A ∈ R h×w×c , we first feed it into a 1 × 1 convolutional layer of which the kernel number is 1, and a new feature map B ∈ R h×w×1 is obtained. Then, B is reshaped to R 1×1×n , and a softmax operation is applied to it to obtain the attention map D ∈ R 1×1×n , where n = h × w. Meanwhile, A is reshaped to C ∈ R n×c . After that, we conduct a matrix multiplication between C and the attention map D to obtain channel features that contain global contextual position information. Next, the channel features are fed into a bottleneck transform module to learn nonlinear interactions between channels. The bottleneck transform module consists of two 1 × 1 convolutional layers, a ReLU activation function between them, and a layer normalization operation (LN) [46] before the ReLU to benefit optimization. The bottleneck ratio r is selected from {8, 12, 16, 20} (this parameter is discussed in Section 5.2) to reduce the computational cost and prevent overfitting. Finally, we perform an element-wise sum operation between transformed features and initial input features A to get the final output E ∈ R h×w×c . Above all, our channel GC attention can be defined as: where p and p are the input and output feature maps of the channel GC attention, respectively, i ∈ {1, 2, . . . , N p } is the index of the positions, and N p = h × w is the number of positions. ∑ N p j=1 α j · p j represents the global contextual features aggregated from all positions via weighted summation with weight α j . The weight α j is calculated according to: where W u denotes the linear transformation matrix. f (·) denotes the feature transformation operations in the bottleneck transform module, shown as: which includes two linear transformation matrices W v1 and W v2 , a layer normalization, and a ReLU activation function.

Position Global Context Attention
As illustrated in Figure 4, the position GC attention is also described as three procedures: (a) feature aggregation, which applies a weighted summation among features of all channels to generate global contextual features; (b) feature transformation, which aims to learn the nonlinear inner relationships between position features by a bottleneck transform module; (c) feature enhancement, which captures long-range dependencies to form better feature representations by merging the transformed position features into each channel. Concretely, as shown in Figure 4, the input feature maps is A ∈ R h×w×c . First, a global average pooling is applied on A to squeeze global spatial information into a feature vector B ∈ R 1×1×c ; meanwhile, A is reshaped to C ∈ R c×n . Then, B is fed into a softmax layer to obtain attention map D ∈ R 1×1×c . After that, a matrix multiplication is performed on attention map D and matrix C, where the result is the position features, which include global contextual channel information. The same as the channel GC attention, the position features are fed into a bottleneck transform module to learn nonlinear position-wise interactions, and the value of bottleneck ratio r is the same as that of the channel GC attention. Finally, we reshape the output of transform module to E ∈ R h×w×1 and execute an element-wise sum operation between A and E to obtain the final output F ∈ R h×w×c .  Likewise, our position GC attention can be defined as: where c and c denote the input and output feature maps of the position GC attention, respectively, i ∈ {1, 2, . . . , N c } is the index of channels, and N c is the number of channels. ∑ N c j=1 b j · c j represents the global contextual features aggregated from all channels via weighted summation with weight b j . The weight b j is calculated from Equation (8), in which F avg represents a global average pooling operation. δ(·) denotes the operations of the bottleneck transform module formulated as Equation (9), where W k1 and W k2 are also linear transformation matrices.

Spectral-Spatial Network with Global Context Attention
The proposed SSGCA consists of three parts: feature extraction, feature enhancement, and feature fusion classification. Feature extraction focuses on collecting discriminative spectral and spatial features based on dense connection blocks. Feature enhancement pays close attention to obtaining more powerful feature representations via GC attentions. Feature fusion classification concatenates spectral and spatial features directly to determine the final land cover categories. Next, we take the Indian Pines dataset as an example to introduce the network structure of these three stages.

Feature Extraction
As shown in Figure 5, this part consists of two parallel stages: spectral feature extraction stage and spatial feature extraction stage. The former stage aims to extract spectral features, while the latter is to explore spatial information. For each pixel to be classified, a corresponding HSI data cube of size (9 × 9 × 200) is cropped for both the spectral and spatial branches. In the spectral branch, a 3D convolution is first applied to reduce the number of bands. The kernel size is (1 × 1 × 7, 24); the stride is (1, 1, 2); and the padding is "valid". After that, the feature maps with the shape of (9 × 9 × 97, 24) are generated. Then, they are fed into the spectral dense block, in which the kernel size, padding, and stride of 3D convolution are all (1 × 1 × 7, 12), "same", and (1, 1, 1). As a result, the shape of feature maps changes into (9 × 9 × 97, 60), and they are re-imported to a 3D convolution layer with a kernel size of (1 × 1 × 97, 60) and a padding manner of "valid" to reduce the number of channels again. Consequently, the final feature maps of the spectral feature extraction stage can be obtained, and the shape is (9 × 9 × 1, 60). Similar to the spectral branch, in the spatial branch, the input data cube is first fed into a 3D convolution layer, and the kernel size, stride, and padding are (1 × 1 × 200, 24), (1, 1, 1), and "valid". Then, the output feature maps are fed into the spatial dense block, in which the kernel size, stride, and padding of 3D convolution are all (3 × 3 × 1, 12), (1, 1, 1), and "same". After that, the final output of the spatial feature extraction stage with the size of (9 × 9 × 1, 60) is acquired. The use of the ReLU activation function and the BN operation can be seen in Figure 5. In addition, the detailed descriptions of the spectral dense block and the spatial dense block are shown in Section 2.3, and the implementation details of feature extraction are described in Tables 1 and 2.

Feature Enhancement
After the feature extraction stage, spectral feature maps with abundant spectral signatures and spatial feature maps with plentiful spatial information can be acquired. Then, these two type of features will be fed into spectral and spatial feature enhancement stage, as shown in Figure 5, respectively. The spectral feature enhancement stage serves to capture interactions between channels via the channel GC attention, while the spatial feature enhancement stage aims to explore position relationships by the position GC attention. Both GC attentions can make full use of global contextual information, learn nonlinear feature interrelationships, and adequately capture long-range dependencies. As described in Figure 5, the extracted spectral and spatial feature maps are first reshaped to (9 × 9 × 60) and then separately fed into the GC attention module. Both the input size and output size of the channel GC attention and the position GC attention are (9 × 9 × 60). Detailed descriptions of GC attentions are displayed in Sections 3.1 and 3.2, and the implementation details of the feature enhancement stage are reported in Tables 3 and 4.

Feature Fusion Classification
The transformed spectral and spatial features yielded by the feature enhancement stages are separately fed into a global average pooling layer and a reshape layer to aggregate information in each channel, and as a result, both the size of the spectral and spatial feature maps are from (9 × 9 × 60) to (1 × 60). Finally, we perform a concatenation between these two feature maps and feed the result to a fully-connected layer with the softmax activation function to obtain the certain land cover label. The implementation details of the feature fusion classification stage are shown in Table 5.

Datasets
In this paper, we selected three well-known HSI datasets to evaluate the effectiveness of our network compared with other widely used methods proposed before. The selected datasets include Indian Pines (IN), University of Pavia (UP), and the Salinas Valley (SV) dataset.
The IN dataset was captured by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) over the Indian Pines test site in northwestern Indiana in 1992, and the image contains 16 vegetation classes and has (145 × 145) pixels with a spatial resolution of 20 m per pixel. After removing 20 water absorption bands, this dataset includes 200 spectral bands for analysis ranging from 400 to 2500 nm.
The UP dataset was acquired over the city of Pavia, Italy, in 2002 by an airborne instrument-the Reflective Optics Spectrographic Imaging System (ROSIS). This dataset consists of (640 × 340) pixels with a 1.3 m per pixel spatial resolution and 103 spectral bands ranging from 430 to 860 nm after removing 12 noisy bands. The dataset contains a large number of background pixels, so the total number of pixels including the features is only 42,776. There are 9 types of features, including asphalt roads, bricks, pastures, trees, bare soil, etc.
The SV dataset was also gathered by the AVIRIS sensor over the region of Salinas Valley, CA, USA, with a 3.7 m per pixel spatial resolution. The same as the IN dataset, twenty water absorption bands of the SV dataset were discarded. After that, the SV dataset consisted of 204 spectral bands and (512 × 217) pixels for analysis from 400 to 2500 nm. The dataset presents 16 classes related to vegetables, vineyard fields, and bare soil.
In this paper, we randomly selected a few samples of each dataset for training and validation. To be specific, from the IN dataset, we selected 5% of the samples for training and 5% for validation. For the UP dataset, we selected 1% of the samples for training and 1% for validation. In addition, we also selected 1% as training samples and 1% as validation samples from the SV dataset. Tables 6-8 list the sample numbers for the training, validation, and testing for the three datasets.

Evaluation Measures
In order to quantify the performance of the proposed method, four evaluation metrics were selected: the accuracy of each class, the overall accuracy (OA), the average accuracy (AA), and the kappa coefficient. The OA is the ratio of the number of correctly classified HSI pixels to the total number of HSI pixels in the testing samples. The AA is the mean of the accuracies for different land cover categories. Kappa measures the consistency between the classification results and the ground truth. Let M ∈ R (m×n) represent the confusion matrix of the classification results, where m denotes the number of land cover categories. The values of the OA, AA, and kappa can be calculated as follows [47]: where diag(M) ∈ R m×1 is a vector of the diagonal elements of M, sum(·) ∈ R 1 represents the sum of all elements of the matrix, sum(·, 1) ∈ R 1×m represents the sum of the elements in each column, sum(·, 2) ∈ R m×1 represents the sum of the elements in each row, mean(·) ∈ R 1 represents the mean of all elements, and ./ represents the element-wise division.

Experimental Setting
In this paper, SVM with the RBF kernel [48] and several well-known DL-based methods were selected for comparison, including SSRN [28], the FDSSC [30], the DBMA [31], and the DBDA [32]. To ensure the fairness of the comparative experiments, we adopted the same hyperparameter settings for these methods, and all experiments were executed on an NVIDIA GeForce GTX 2070 SUPER GPU with a memory of 32 GB. For the DL-based methods, the spatial size of the HSI cubes was set to 9 × 9, the batch size was 64, and the number of training epochs was set to 200. Besides, a cross-entropy loss function was exploited to measure the difference between the predicted value and the real value to train the parameters of the networks, which can be formulated as follows: where m is the number of land cover categories, y j denotes the land cover label (if the category is j, y j is 1; otherwise, y j equals to 0), and p j represents the probability that the category is j, which is calculated by the softmax function. To prevent overfitting, the early stopping strategy was adopted. If the loss value of the validation dataset no longer decreases for 20 epochs, the training process will be stopped. Furthermore, the optimizer was set to Adam with a 0.001 learning rate, and we used the cosine annealing [49] method to dynamically adjust the learning rate, which can prevent the model from falling into local minima. The learning rate was adjusted according to the following equation [32,49]: where η t is the learning rate within the ith run and [η i max − η i min ] is the range of the learning rate. T cur represents the count of epochs that were executed, and T i controls the count of epochs that will be executed in a cycle of adjustment. Finally, a dropout layer was adopted in the bottleneck transform module of GC attentions to further avoid overfitting, and the dropout percentage p was set to 0.5.

Classification Results
Tables 9-11 show the OAs, AAs, kappa coefficients, and classification accuracies of each class for the three HSI datasets. Obviously, the proposed method SSGCA achieves the best OAs, AAs, and kappa coefficients compared with the other methods on the three HSI datasets, which can demonstrate the effectiveness and generalizability of our method. For example, when 5% of the samples are randomly selected for training on the IN dataset, our method achieves the best accuracy with 98.13% OA, improving 1.14% over the DBDA (96.99%), 1.35% over the DBMA (96.78%), 1.95% over the FDSSC (96.18%), and 2.68% over SSRN (95.45%). In contrast to SVM (74.74%), our method achieves a considerable improvement of more than 23% in terms of OA. Besides, from the results, we can learn that all the DL-based methods achieve higher performance than SVM on three HSI datasets. For example, the OAs of the DL-based methods obtain more than a 20% increase in contrast to SVM on the IN dataset and about a 10% increase on the UP and SV dataset. The reason is that DL-based methods can exploit high-level, abstract, and discriminative feature representations to improve the classification performance.
Furthermore, the classification results of the FDSSC, DBMA, and DBDA on the three datasets are higher than those of SSRN with approximately a 1-2% improvement in OA. These results demonstrate the effectiveness of a dense connection structure, which is adopted in the FDSSC, DBMA, and DBDA. Moreover, comparing the FDSSC to the three attention-based methods, we can find that the proposed SSGCA achieves a higher OA than the FDSSC for all three datasets; however, the results of the DBMA and DBDA are not always higher than those of the FDSSC. For instance, the OA of the DBDA is lower than the FDSSC on the UP and SV datasets, and the OA of the DBMA is below the FDSSC on the UP dataset. This means that our GC attentions can optimize the feature representations more effectively compared with the AMs in the DBMA and DBDA when these three methods employ the same feature extraction network. The reason is that our GC attentions can utilize global contextual information adequately, capture feature interactions well, and model long-range dependencies effectively. Finally, from the classification accuracies of each class for the three HSI datasets, we can find that our SSGCA achieves more stable results, benefiting from the invented GC attentions. Taking the SV dataset as an example, the best and worst single-category results of the SSGCA are 100% and 95.76%, respectively, with a difference of only 4.24%, while the difference is 8.82% for the DBDA (best: 100%; worst: 91.18%), 8.26% for the DBMA and FDSSC (best: 100%; worst: 91.74%), 12.16% for SSRN (best: 100%; worst: 87.84%), and 43.19% for SVM (best: 99.62%; worst: 56.43%).  8 show the visualization maps of all methods along with the corresponding ground truth maps for the three HSI datasets. Firstly, from the visual classification results, we can intuitively conclude that the proposed SSGCA delivers the most accurate and smooth classification maps on all datasets, because the SSGCA can obtained more powerful feature representations with the aid of our GC attentions for classification. Secondly, compared to DL-based methods, the classification maps of SVM on the three datasets show plenty of mislabeled areas due to the lack of incorporation of spatial neighborhood information, and the extracted features are at a low-level. Thirdly, we can find that the visual classification maps of the attention-based methods are smoother than SSRN and the FDSSC, especially at the edges of land cover areas, and it can be observed that the SSGCA yields better classification maps in contrast to the DBDA and DBMA, because our GC attentions can adequately utilize global contextual information, learn nonlinear feature interactions, and model long range dependencies for feature enhancing compared with the attentions in the DBDA and DBMA.

Other Investigations
In this section, we conduct further investigations in the following three aspects. Firstly, different proportions of training samples from the three datasets are selected for all methods to investigate the performance of our method with different training sample numbers. Secondly, several ablation experiments are designed to investigate the effectiveness of our channel GC attention and position GC attention; meanwhile, we explore the effectiveness of different bottleneck ratio values in GC attentions. Thirdly, we report the total trainable parameter number and running time of all methods to investigate the computational efficiency of different methods.

Investigation of the Proportion of Training Samples
In this part, several experiments are designed to explore the robustness and generalizability of the proposed method with different training proportions. On the one hand, two percent, 3%, 7%, and 9% training samples were randomly selected from the IN dataset; on the other hand, zero-point-three percent, 0.5%, 0.7%, and 1.3% training samples were randomly selected from the UP and SV datasets.  Firstly, it is evident that different numbers of training samples bring about different classification performances for all methods: as the number of training samples increases, the classification accuracy increases, and the proposed method SSGCA achieves the best performance compared with other methods with different numbers of training samples on the three datasets. Secondly, the performance of the FDSSC is relatively poor when the training set is quite small compared to the DBMA, DBDA, and SSGCA, but as the training samples increase, the FDSSC outperforms the DBMA and DBDA, while the classification accuracy is closer to the best result, especially on the UP and SV datasets. From this comparison, we can learn that AMs have a more significant effect when lacking training samples. Thirdly, we can see that the accuracies of the DL-based methods get closer along with the increasing of the training samples, particularly on the IN and SV datasets.

Investigation of the Global Context Attentions
In this part, several ablation experiments are executed to demonstrate the effectiveness of the GC attentions in this article, including experiments only with the channel GC attention, experiments only with the position GC attention, and experiments without GC attentions. The training sample proportions of IN, UP, and SV in these experiments were 5%, 1%, and 1%, respectively, and the bottleneck ratio r was set to 16. It can be observed from Figure 12 that both GC attentions can improve the classification performance on each dataset. For example, on the IN dataset, the channel GC attention brings an improvement of 0.37% OA, the position GC attention brings an improvement of 0.55% OA, and using both GC attentions can improve by 1% OA, which is a considerable advance in the case of limited training samples and high baseline accuracy. In addition, we can find that the channel GC attention plays a more significant role compared with the position GC attention on the IN and SV datasets, and the result is opposite on the UP dataset; however, the highest results can be achieved when both GC attentions are included in our network for all three datasets. The bottleneck ratio r introduced in Sections 3.1 and 3.2 is a hyperparameter, which aims to reduce the computational cost of our GC attentions. In this part, we also designed several experiments to explore the effectiveness of different values of the bottleneck ratio. The training sample proportions of IN, UP, and SV selected in these experiments were 5%, 1%, and 1%, respectively. The classification results reported in Table 12 show that our SSGCA network acquires the highest performance when the bottleneck ratio r is set to 16 on the three datasets.

Investigation on Running Time
Tables 13-15 report the total number of trainable parameters of the DL-based methods. From the results, we can find that the SSGCA has fewer trainable parameters than the FDSSC, DBMA, and DBDA. Meanwhile, Tables 13-15 show the training time and test time of all methods on the three datasets. Note that all experiments' results were collected when the training sample proportions were 5%, 1%, and 1% on IN, UP, and SV, respectively, and the bottleneck ratio r of the SSGCA was set to 16. From these three tables, we can learn that SVM consumes less training and testing time than the DL-based methods. Furthermore, the proposed SSGCA spends the least training time and the second least testing time among all DL-based methods on the three datasets. Above all, we can conclude that the proposed SSGCA obtains the best classification performance with little time consumption, which can prove the high efficiency of our method.

Conclusions
In this article, an end-to-end spectral-spatial network with channel and position global context (GC) attention (SSGCA) is proposed for HSI classification. The proposed SSGCA is based on the work of many predecessors, including 3D convolution, DenseNet, the FDSSC, GCNet, the DBDA, and so on. The SSGCA mainly contains three stages: feature extraction, feature enhancement, and feature fusion classification. Feature extraction is based on dense connection blocks to acquire discriminative spectral and spatial features in two separate branches. Feature enhancement aims to optimize spectral and spatial feature representations to improve the classification performance by the channel GC attention and the position GC attention, respectively. Compared to the previous AMs used in HSI classification methods, our GC attentions can make full use of global contextual information and capture adaptive nonlinear feature relationships in the spectral and spatial dimensions with less computation consumption. Furthermore, our AMs can adequately model longrange dependencies. Feature fusion classification concatenates two types of features in the channel dimension, then the fusion features are fed into an FC layer with the softmax function to generate the final land cover category. Moreover, we designed a great quantity of experiments on three public HSI datasets to verify the effectiveness, generalizability, and robustness of our method. Later, analyses on the experimental results demonstrated that the proposed method acquired the best performance compared to other well-known methods. In the future, we will combine our GC attention with other DL-based networks and apply these novel models on other HSI datasets.