HResNetAM: Hierarchical Residual Network With Attention Mechanism for Hyperspectral Image Classification

This article proposes a novel hierarchical residual network with attention mechanism (HResNetAM) for hyperspectral image (HSI) spectral-spatial classification to improve the performance of conventional deep learning networks. Straightforward convolutional neural network-based models have limitations in exploiting multiscale spatial and spectral features, which is a key factor in dealing with the high-dimensional nonlinear characteristics present in HSIs. The proposed hierarchical residual network can extract multiscale spatial and spectral features at a granular level, which enlarges the range of receptive fields of the network and enhances the feature representation ability of the model. Besides, we utilize the attention mechanism to set adaptive weights for spatial and spectral features of different scales, which further improves the discriminative ability of the extracted features. Furthermore, a double-branch structure is exploited to extract spectral and spatial features in parallel with corresponding convolution kernels, and the extracted spatial and spectral features of multiple scales are fused for hyperspectral image classification. Four benchmark hyperspectral datasets collected by different sensors and at different acquisition times are employed for classification experiments, and comparative results reveal that the proposed method has competitive advantages in terms of classification performance when compared with other state-of-the-art deep learning models.


I. INTRODUCTION
REMOTE sensing technology is one of the most important components in the field of earth observation (EO); it can perceive and recognize observed scenes from their different reflection characteristics without making physical contact with the objects. The imaging spectroradiometer can observe the continuous spectrum from visible to short-wave infrared, so the acquired hyperspectral images (HSIs) have hundreds of narrow and approximately continuous spectral bands, and this unique characteristic offers both opportunities and challenges for subsequent information extraction and geoscience applications [1]. Based on these unique spectral and spatial characteristics, HSI classification aims to determine the ground category of each pixel, and it has been widely used in, e.g., environmental monitoring, resource management, urban planning, military, and security applications over the past decade [2]. The intrinsic specificities of HSIs bring several challenges for the classification task; basically, three tough problems need to be solved. 1) The high-dimensional nonlinear characteristic in the spectral domain causes the Hughes phenomenon and seriously affects classification accuracy. 2) The number of annotated samples is often insufficient because labeling samples is expensive and time consuming. 3) Spatial information needs to be effectively integrated for spectral-spatial classification to improve pixel-wise classification performance. To address these typical problems, many classic machine learning models have been exploited for HSI classification [3]. Containing multiple processing layers, deep learning models can learn abstract, intricate, and discriminative features from raw data using the backpropagation algorithm, and they have brought about striking breakthroughs in many scientific research fields [4].
Deep learning techniques have also revolutionized remote sensing image processing, especially in the HSI classification field [5]-[8]. According to the feature types employed for classification, HSI classification methods based on deep learning can be generally divided into three categories: spectral-feature-based, spatial-feature-based, and spectral-spatial-feature-based networks. Because both spatial and spectral information contribute to HSI classification, the spatial-feature-based and spectral-spatial-feature-based networks have attracted more interest in recent years [9].
Because 3D convolutional neural networks (3D-CNNs) can learn spectral-spatial features simultaneously without compressing spectral and spatial information, it is now commonly accepted that 3D-CNNs can be directly utilized for spectral-spatial-feature-based classification without any preprocessing or postprocessing [10]-[12]. Combining the recurrent network with 3D convolution operators, the recurrent 3D CNN (R-3D-CNN) can exploit both spatial and spectral information for classification [13]. Because deeper networks can learn more high-level discriminative features, deeper learning models have shown more advantages in image recognition and classification [14], [15]. But the major problem of very deep networks is the vanishing gradient in the training process. By introducing identity mapping to the main path of the network structure, the residual network (ResNet) framework can ease this training problem because the underlying error can be propagated through the shortcut [16]. In the contextual deep CNN (CDCNN), initial spectral and spatial information is extracted by a multiscale convolutional filter bank, and these joint spatial-spectral features are fed into two residual blocks and a fully convolutional network to predict the corresponding class label [17]. The spectral-spatial residual network (SSRN) employs spectral and spatial residual blocks to facilitate backpropagation of gradients and alleviate the declining-accuracy phenomenon, and batch normalization is also used to regularize the learning process [18]. Aiming to explore the intrinsic complexity of HSI, the deep pyramidal residual networks use pyramidal bottleneck residual blocks to learn high-level spectral-spatial features [19].
To address small-sample HSI classification, deep few-shot learning and multiview learning have recently been proposed within the deep residual learning framework [20], [21]. The dense network (DenseNet) connects each layer to every other layer in a feedforward way, which can also alleviate the vanishing-gradient problem [22]. Using a densely connected structure in the network architecture, the end-to-end fast dense spectral-spatial convolution network (FDSSC) can extract spectral-spatial features for classification, which can lead to extremely accurate classification [23]. The deep and dense convolutional network introduces two dense blocks to construct a deep network and integrate various spectral-spatial features for classification [24].
Visual patterns appear at multiple scales in natural scenes. Different objects have different sizes in the same image, and the context information of an object may occupy different areas in different images. Therefore, in order to accurately understand objects in an image, it is essential to perceive information from different scales. Recently, some CNN-based models have tried to learn spectral-spatial features of multiple scales for HSI classification. The multilayer fusion dense network (MFDN) uses PCA and a 2D dense network to extract spatial features, and the spectral features are extracted by 3D dense blocks; these features are then fused for classification [25]. The CNNs with multiscale convolutions (MS-CNNs) use convolution kernels of different sizes to extract features of different scales, and three types of classification network structures are proposed [26]. The multiscale deep middle-level feature fusion network (MMFN) uses two stages to fuse complementary and related information: the first stage extracts middle-level spectral and spatial features by the corresponding scale model, and these middle-scale features are fused using residual blocks in the second stage [27]. The hierarchical multiscale CNN with the auxiliary classifier (HMCNN-AC) extracts multiscale features from image patches of different sizes, and a bidirectional long short-term memory (LSTM) network treats these features as sequential data to capture dependence and correlation [28]. In [29], the multiscale residual network (MSRN) utilizes depthwise separable convolution (DSC) to construct the multiscale residual block (MRB), and two MRBs are connected by a high-level shortcut to aggregate features of different levels.
Inspired by the visual perception of the human visual system, the attention mechanism has been employed for HSI classification. In [30], a recurrent neural network (RNN) with attention learns the continuous spectrum features, and a CNN with attention is designed to extract robust spatial features. Then, the multilayer network uses spectral and spatial features to extract conjoint characteristics. The double-branch multiattention mechanism network (DBMA) and the double-branch dual-attention mechanism network (DBDA) use spectral and spatial dense blocks to extract spectral and spatial features, respectively, and attention modules are utilized to set different weights for the extracted features [31], [32]. Aiming to solve the problem that CNNs set the same weight for all spectral bands, the spectral attention module-based convolutional network recalibrates spectral bands so as to strengthen important bands and suppress less useful ones [33]. The end-to-end spectral-spatial squeeze-and-excitation residual bag-of-feature (S3EResBof) model combines the residual block and the squeeze-and-excitation block to boost the classification performance, and batch normalization is also used to regularize the network [34]. In order to suppress the influence of interfering pixels, the spectral-spatial attention network (SSAN) introduces two attention modules to learn more discriminative spectral-spatial features [35]. A series of attention blocks are used in the end-to-end residual spectral-spatial attention network (RSSAN): the first group of attention modules adaptively selects spectral bands and spatial pixels, the second group refines the spectral-spatial features, and the residual blocks embedded with attention modules are utilized to optimize the training process [36].
To obtain multiscale representations of objects, feature extractors need to employ different receptive fields to describe objects at different scales [37]. However, the existing CNN-based multiscale extractors can only extract features with fixed receptive fields, so they cannot extract global and local features at the same time. Current hierarchical features are extracted in a layer-wise manner, but this approach may cause the vanishing-gradient phenomenon and needs many labeled samples for training. In addition, existing attention-based HSI classification methods only employ single-scale features, so they cannot make full use of the complex spectral and spatial features of multiple scales. All these factors affect the HSI classification accuracy to some extent. Drawing intuition from the success achieved by using the hierarchical residual network (HResNet) to extract multiscale features, the hierarchical residual network with attention mechanism (HResNetAM) is proposed, which not only extracts spectral and spatial features of different scales but also employs the attention mechanism to promote the discriminative ability of features for HSI classification. Besides, by using the residual-like style and batch normalization in the module, the proposed method can also avoid the vanishing-gradient problem. Our main contributions in this article can be summarized as follows.
1) First, the HResNet block is exploited to extract multiscale spectral and spatial features, and these features can represent the global and local receptive fields of the datasets. This is the first time that spectral and spatial features of multiple scales are extracted at a granular level for HSI classification.
2) Second, to take full advantage of the hierarchical spectral and spatial features for classification, the attention mechanism is employed to adaptively calibrate spectral and spatial features of different scales, which can further promote the discriminability of the extracted features for HSI classification.
3) Third, a double-branch structure for HSI classification is utilized. In two parallel branches, convolution kernels of different sizes are employed to learn the corresponding spectral and spatial features, and the spatial and spectral features of different scales are fused for spectral-spatial classification. In addition, residual learning and batch normalization facilitate the model training.
4) The experimental results, obtained over four benchmark HSI datasets, reveal that the proposed method exhibits potential to learn more discriminative spectral-spatial features, providing competitive performance advantages compared with state-of-the-art deep learning classification models.
The remainder of this article is organized as follows. Section II introduces the proposed HResNet with attention mechanism model in detail. Parameter analysis and comparative HSI classification results are presented in Section III, and Section IV concludes this article.

II. METHODOLOGY
The proposed model makes full use of the multiscale feature extraction ability of the HResNet and the weight calibration capability of the attention mechanism. First, drawing intuition from the success achieved by residual network, the hierarchical residual block can not only extract multiscale features from raw data but also avoid the gradient vanishing problem. Then, in order to enhance the discriminative ability of spatial and spectral features with different scales in HSI classification, the spectral attention module and spatial attention module are employed. Finally, the proposed double branch structure which extracts spectral and spatial features separately is described, and detailed model architecture and parameters are also introduced.

A. Residual Learning
Deeper learning models have stronger feature learning and expression capabilities, but the vanishing-gradient problem is exposed in the training process. As the network depth increases, accuracy gets saturated and then degrades rapidly. Unexpectedly, this problem is not caused by overfitting, and adding more layers leads to higher training error. The key idea of residual learning is to introduce identity mapping into the backbone path of the network structure. In the training process of deep residual networks, the underlying error can be propagated through the shortcut, which can effectively solve the notorious vanishing-gradient problem. The identity shortcut requires no additional parameters, so it does not increase the computational complexity compared with the original network. The deep residual network is composed of many stacked residual units, and a single residual unit is illustrated in Fig. 1. This residual unit contains one convolutional layer, one batch normalization (BN) layer, and one rectified linear unit (ReLU) layer as well as the identity mapping. The basic form of the residual unit is formulated as

$$x_{i+1} = x_i + F(x_i)$$

in which $x_i$ and $x_{i+1}$ are the input and output of the unit, respectively, and $F$ refers to the residual function. In order to train the model more efficiently, batch normalization is implemented after every convolutional layer [38]. Moreover, the rectified linear unit layer is utilized to extract nonlinear features. Through this skip connection strategy, residual networks can build very deep network structures without worrying about the vanishing-gradient problem. Deep residual networks have been exploited for HSI classification and can obtain higher classification accuracy than the CNN-based methods [18], [19], [29], [39].
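As a minimal sketch of the residual unit described above, the following NumPy code computes $x_{i+1} = x_i + F(x_i)$ with $F$ built from a linear map, batch normalization, and ReLU. The dense `weight` matrix is a hypothetical stand-in for the convolutional layer, and the batch normalization omits the learnable scale and shift; neither detail is taken from the authors' implementation.

```python
import numpy as np

def residual_unit(x, weight, eps=1e-5):
    """Sketch of one residual unit: x_{i+1} = x_i + F(x_i).

    x      : array of shape (batch, features)
    weight : array of shape (features, features); hypothetical stand-in
             for the convolutional layer of the unit
    """
    z = x @ weight                                            # "convolution" (linear map)
    z = (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)   # batch normalization
    f = np.maximum(z, 0.0)                                    # ReLU
    return x + f                                              # identity shortcut

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))

# If the residual function outputs zero, the unit reduces to the identity.
y_identity = residual_unit(x, np.zeros((4, 4)))

# The shape is preserved, so units can be stacked to build a deeper network.
y = residual_unit(residual_unit(x, rng.normal(size=(4, 4))),
                  rng.normal(size=(4, 4)))
```

Because the shortcut adds the input unchanged, a unit whose residual branch contributes nothing still passes its input (and the backpropagated error) through intact, which is the property the text relies on.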

B. Hierarchical Residual Learning
It is essential to extract multiscale features for the image classification task. Most existing CNNs enhance multiscale representation strength in a layer-wise way, while the multiscale representation ability of HResNet refers to the multiple available receptive fields at a granular level. To achieve this goal, the hierarchical residual block divides the input feature maps into several groups, and each subgroup of feature maps is processed by a different number of convolution layers. In the hierarchical residual block, different subgroups of feature maps have different receptive fields, so the combined feature maps can represent multiscale features and increase the receptive fields of the network [40]. Existing convolutional networks obtain multiscale features by stacking convolutional layers, but these features have relatively fixed receptive fields. The hierarchical residual learning introduces a new scale dimension as an essential factor in addition to the existing dimensions of depth, width, and cardinality [41]. In HResNet, the scale dimension means the number of feature groups in a hierarchical residual unit. Fig. 2 shows the hierarchical residual unit with 3 scales, in which the split operation and the concatenation operation (⊕) are marked.
We denote the input and output of the hierarchical residual unit by $x$ and $y$. First, we split the input feature map $x$ into $s$ feature subsets, and every subset is represented as $x_i$, where $i \in \{1, 2, \ldots, s\}$. The subset $x_i$ has the same spatial size as the input $x$, but only $1/s$ of the channels. Except for $x_1$, every $x_i$ has a corresponding convolution operator, denoted by $K_i()$, and we use $y_i$ to denote the output of $K_i()$. To obtain hierarchical features, we add the output of $K_{i-1}()$ to the feature subset $x_i$, and the sum is fed into $K_i()$. Thus, $y_i$ can be generally written as follows:

$$y_i = \begin{cases} x_i, & i = 1 \\ K_i(x_i), & i = 2 \\ K_i(x_i + y_{i-1}), & 2 < i \le s \end{cases}$$

Through this hierarchical residual structure, each convolution operator $K_i()$ can receive information from all subsets $x_j$ ($j \le i$), so its output $y_i$ has a larger receptive field than the subsets $x_j$ it receives. The concatenation operation at the end of the hierarchical residual unit combines feature maps of different receptive fields. In addition, the split and concatenation strategy can force the hierarchical residual block to process features more efficiently. In the hierarchical residual unit, a larger scale factor $s$ allows the unit to learn features with richer receptive field sizes. We also conduct batch normalization and the rectified linear unit activation function after every convolutional layer to train the HResNet more effectively. Therefore, the residual-like connections within the hierarchical residual unit allow it to capture global and local features at a granular level.
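The split, hierarchical addition, and concatenation steps can be sketched as follows. This is an illustrative NumPy version under stated assumptions: the callables in `operators` are hypothetical stand-ins for $K_2, \ldots, K_s$ (the paper uses convolution followed by batch normalization and ReLU), and the rule $y_1 = x_1$, $y_2 = K_2(x_2)$, $y_i = K_i(x_i + y_{i-1})$ for $i > 2$ is read from the description above. The helper `max_receptive_fields` is also an assumption; it only does the arithmetic for stride-1 kernels along one axis, relative to the unit input.

```python
import numpy as np

def hierarchical_residual_unit(x, operators):
    """Sketch of a hierarchical residual unit with s = len(operators) + 1 scales.

    x         : array whose first axis holds the channels
    operators : callables standing in for K_2, ..., K_s (hypothetical)
    """
    s = len(operators) + 1
    splits = np.split(x, s, axis=0)      # s groups, each with 1/s of the channels
    outputs = [splits[0]]                # y_1 = x_1 (no operator)
    previous = None
    for x_i, k_i in zip(splits[1:], operators):
        # y_2 = K_2(x_2); y_i = K_i(x_i + y_{i-1}) for i > 2
        inp = x_i if previous is None else x_i + previous
        previous = k_i(inp)
        outputs.append(previous)
    return np.concatenate(outputs, axis=0)   # combine all scales

def max_receptive_fields(s, kernel):
    """Largest 1-D receptive field of y_1..y_s for stride-1 kernels."""
    return [1 + i * (kernel - 1) for i in range(s)]

x = np.arange(8.0).reshape(4, 2)             # 4 channels, 2 positions
double = lambda t: 2.0 * t                   # toy operator for illustration only
y = hierarchical_residual_unit(x, [double, double, double])
fields = max_receptive_fields(4, 3)
```

With four scales and kernels of size 3, the groups see receptive fields of 1, 3, 5, and 7 along the convolved axis, which illustrates why a larger scale factor yields richer receptive field sizes.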

C. Attention Mechanism
Drawing intuition from the human visual system, the attention mechanism can recalibrate channel-wise features by explicitly establishing the relationships between channels [42]. Traditional HSI classification models assign equivalent weights to all pixels and bands in the spatial and spectral domains, respectively. However, different spatial pixels and spectral bands make unequal discriminative contributions to classification results. For instance, several edge pixels in the HSI block may have labels different from that of the center pixel, and these interfering pixels weaken the discriminative ability of spectral-spatial features, thereby affecting the classification accuracy. If the weight of these pixels can be suppressed, the discriminability of the spectral-spatial features will be increased. Thus, it is feasible to introduce the attention mechanism to HSI classification, which can focus more on the discriminative and effective spatial and spectral features and weaken information detrimental to classification. Because exploiting spectral-wise and spatial-wise attention is superior to only using channel-wise attention [43], we adopt the spectral attention module and the spatial attention module simultaneously to recalibrate spectral and spatial features of multiple scales. The two attention modules are introduced in detail as follows.

1) Spectral Attention Module: The spectral attention module is constructed by modeling the interdependencies between channels, as shown in Fig. 3. The spectral attention map $X \in \mathbb{R}^{c \times c}$ is calculated from the initial input $A \in \mathbb{R}^{c \times h \times w}$, in which $h \times w$ represents the spatial size while $c$ denotes the channels of the original features. Specifically, we first reshape $A$ into $\mathbb{R}^{c \times n}$, where $n = h \times w$, and conduct a matrix multiplication between $A$ and its transpose $A^T \in \mathbb{R}^{n \times c}$. The results are fed into a softmax layer to get the attention map $X$:

$$x_{ji} = \frac{\exp(A_i \cdot A_j)}{\sum_{i=1}^{c} \exp(A_i \cdot A_j)}$$

in which $x_{ji}$ represents the influence of the $i$th channel on the $j$th channel. In addition, a matrix multiplication is conducted between $X^T$ and $A$, and the results are reshaped into $\mathbb{R}^{c \times h \times w}$. Finally, a scale parameter $\alpha$ is used to weight the results, and an element-wise sum with the input $A$ is performed to obtain the spectral attention feature map $E \in \mathbb{R}^{c \times h \times w}$:

$$E_j = \alpha \sum_{i=1}^{c} (x_{ji} A_i) + A_j$$

where the parameter $\alpha$ is initialized to 0 and is optimized gradually in the training process. We can see that the spectral attention feature map $E$ is a weighted combination of all the original channels, which can selectively strengthen informative channels and suppress less useful ones. Therefore, the spectral feature discriminability can be increased through this spectral attention module.
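The spectral attention steps can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' PyTorch code: it assumes the attention weights are obtained by a softmax over channel-to-channel inner products, $x_{ji} \propto \exp(A_i \cdot A_j)$ normalized over $i$, and that the output is $E_j = \alpha \sum_i x_{ji} A_i + A_j$. The function names and the fixed `alpha` argument (a learned parameter in the paper) are assumptions for illustration.

```python
import numpy as np

def softmax(z, axis):
    """Numerically stable softmax along one axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def spectral_attention(a, alpha):
    """Sketch of the spectral (channel) attention module.

    a     : input features of shape (c, h, w)
    alpha : scale parameter (initialized to 0 in the paper, then learned)
    """
    c, h, w = a.shape
    flat = a.reshape(c, h * w)              # (c, n) with n = h * w
    energy = flat @ flat.T                  # (c, c); entry [j, i] = A_j . A_i
    x = softmax(energy, axis=1)             # x[j, i]: influence of channel i on j
    out = (x @ flat).reshape(c, h, w)       # sum_i x[j, i] * A_i for every j
    return alpha * out + a                  # element-wise sum with the input

rng = np.random.default_rng(0)
a = rng.normal(size=(6, 7, 7))              # c = 6 channels, 7 x 7 patch
e0 = spectral_attention(a, alpha=0.0)       # alpha = 0 returns the input unchanged
e1 = spectral_attention(a, alpha=1.0)       # same shape, channels re-weighted
```

Initializing `alpha` at 0 means the module starts as an identity mapping and only gradually mixes in the attention-weighted channels as training adjusts the parameter.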
2) Spatial Attention Module: Fig. 4 shows the spatial attention module. The initial input $A \in \mathbb{R}^{c \times h \times w}$ is fed into two different convolution layers to generate two new feature maps $B$ and $C$, respectively, in which $\{B, C\} \in \mathbb{R}^{c \times h \times w}$. These two feature maps are reshaped into $\mathbb{R}^{c \times n}$, where $n = h \times w$ refers to the number of spatial pixels. Then a matrix multiplication between $B^T$ and $C$ is performed, and the results are fed into a softmax layer to obtain the spatial attention map $S \in \mathbb{R}^{n \times n}$ as follows:

$$s_{ji} = \frac{\exp(B_i \cdot C_j)}{\sum_{i=1}^{n} \exp(B_i \cdot C_j)}$$

in which $s_{ji}$ measures the $i$th pixel's influence on the $j$th pixel. The closer the spatial distance between two pixels, the greater the correlation between them.
A new feature map $D \in \mathbb{R}^{c \times h \times w}$ is also generated from the initial input feature $A$ through a convolution layer and subsequently reshaped into $\mathbb{R}^{c \times n}$. Then a matrix multiplication between $D$ and $S^T$ is performed, and the results are reshaped into $\mathbb{R}^{c \times h \times w}$. Finally, a scale parameter $\beta$ is utilized to weight the results, and an element-wise sum with the initial input $A$ is performed to get the spatial attention feature map $E \in \mathbb{R}^{c \times h \times w}$ as follows:

$$E_j = \beta \sum_{i=1}^{n} (s_{ji} D_i) + A_j$$

in which the parameter $\beta$ is initialized to 0 and is optimized gradually in the training process. It can be inferred that each position in the spatial attention feature map $E$ is a weighted combination of all the original pixels, so it has a global view and selectively emphasizes informative positions. Thus, the feature discriminability will be improved in the spatial domain.
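A NumPy sketch of the spatial attention module follows, under stated assumptions: the three convolution layers that produce $B$, $C$, and $D$ are replaced by hypothetical channel-mixing matrices `wb`, `wc`, and `wd`, the attention weights are taken as $s_{ji} \propto \exp(B_i \cdot C_j)$ normalized over $i$, and the output is $E_j = \beta \sum_i s_{ji} D_i + A_j$. It is an illustration of the data flow, not the authors' implementation.

```python
import numpy as np

def softmax(z, axis):
    """Numerically stable softmax along one axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def spatial_attention(a, wb, wc, wd, beta):
    """Sketch of the spatial attention module.

    a          : input features of shape (c, h, w)
    wb, wc, wd : (c, c) matrices, hypothetical stand-ins for the
                 convolution layers that generate B, C, and D
    beta       : scale parameter (initialized to 0 in the paper, then learned)
    """
    c, h, w = a.shape
    flat = a.reshape(c, h * w)               # (c, n) with n = h * w
    b, cm, d = wb @ flat, wc @ flat, wd @ flat
    energy = b.T @ cm                        # (n, n); entry [i, j] = B_i . C_j
    s_t = softmax(energy, axis=0)            # s_t[i, j] = s_ji, i.e. S transposed
    out = (d @ s_t).reshape(c, h, w)         # column j holds sum_i s_ji * D_i
    return beta * out + a                    # element-wise sum with the input

rng = np.random.default_rng(0)
c = 6
a = rng.normal(size=(c, 7, 7))
wb, wc, wd = (rng.normal(size=(c, c)) for _ in range(3))
e0 = spatial_attention(a, wb, wc, wd, beta=0.0)   # beta = 0 returns the input
e1 = spatial_attention(a, wb, wc, wd, beta=1.0)   # same shape, pixels re-weighted
```

Every output position is a convex combination of the $D$ features at all $n$ positions plus the original input, which is what gives the module its global view.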

D. Framework of the Proposed Model
The whole structure of the HResNetAM model is illustrated in Fig. 5. In order to make the most of the spectral and spatial features of different scales, we adopt the double-branch architecture for HSI classification. The upper spectral branch consists of the hierarchical spectral residual network and the corresponding spectral attention module. The HResNet containing spectral-based convolution operators is utilized to extract hierarchical spectral features, and the spectral attention module is employed to assign different weights to the hierarchical spectral features. The lower spatial branch is composed of the hierarchical spatial residual network and the corresponding attention block. Similarly, the spatial attention module recalibrates the spatial features of different scales. The adaptively weighted multiscale spectral and spatial features are fused to conduct the HSI spectral-spatial classification.
The model in Fig. 5 is an HResNetAM network with 4 scales and 6 kernels, and the corresponding detailed parameters of the spatial and spectral feature extraction networks in HResNetAM are listed in Table I. In this representative model, we employ the Pavia Centre dataset, and the spatial size is set to 7. The HSI block serves as the input of both branches. In our proposed model, we employ convolution kernels of size (1, 1, 5) and (3, 3, 1) to extract spectral and spatial features, respectively. Note that the stride of Conv1 in the spectral branch is (1, 1, 2), and the stride of all other convolution operations in HResNetAM is (1, 1, 1).
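To make the effect of the kernel sizes and strides concrete, the following sketch applies the standard convolution output-size formula to a 7 × 7 × 102 Pavia Centre block. The padding values, and the assumption that Conv1 of the spectral branch uses the (1, 1, 5) kernel, are illustrative guesses; the layer shapes in Table I remain the authoritative reference.

```python
def conv_output_length(size, kernel, stride=1, padding=0):
    """Output length of a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1

def conv3d_output_shape(shape, kernel, stride, padding=(0, 0, 0)):
    """Apply the per-axis formula to a (height, width, bands) block."""
    return tuple(
        conv_output_length(n, k, s, p)
        for n, k, s, p in zip(shape, kernel, stride, padding)
    )

patch = (7, 7, 102)   # 7 x 7 spatial window, 102 Pavia Centre bands

# Spectral branch, Conv1: kernel (1, 1, 5), stride (1, 1, 2), no padding (assumed).
spectral = conv3d_output_shape(patch, kernel=(1, 1, 5), stride=(1, 1, 2))

# Spatial kernel (3, 3, 1), stride (1, 1, 1); padding of 1 in the spatial
# axes is an assumption that keeps the 7 x 7 window size unchanged.
spatial = conv3d_output_shape(patch, kernel=(3, 3, 1), stride=(1, 1, 1),
                              padding=(1, 1, 0))
```

Under these assumptions the stride of 2 roughly halves the spectral dimension (102 bands become 49) while the spatial window is untouched, and the (3, 3, 1) kernel leaves the number of bands unchanged.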

III. EXPERIMENTAL RESULTS AND ANALYSIS
All the comparative classification experiments are carried out on a workstation equipped with an Intel Core i9-7900X, an Nvidia GeForce RTX 2080 Ti GPU, and 128 GB of RAM. The proposed HResNetAM model is implemented in Python using the PyTorch library. We employ three main classification evaluation metrics, namely, overall accuracy (OA), average accuracy (AA), and the Kappa coefficient (κ), to quantitatively assess the classification performance. We also use classification maps to qualitatively evaluate the experimental results. In order to increase the reliability and credibility of the experimental results, we conducted ten trials for each classification experiment with randomly selected training samples.
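For reference, the three metrics can be computed from a confusion matrix as in the sketch below, which uses the standard definitions of OA, AA (mean of per-class recalls), and Cohen's kappa. The function name and the small confusion matrix are hypothetical and are not results from the paper.

```python
def classification_metrics(confusion):
    """Return (OA, AA, kappa) from a square confusion matrix.

    confusion[i][j] = number of samples of true class i predicted as class j.
    """
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    row_sums = [sum(row) for row in confusion]
    col_sums = [sum(confusion[i][j] for i in range(k)) for j in range(k)]
    correct = sum(confusion[i][i] for i in range(k))

    oa = correct / total                                   # overall accuracy
    aa = sum(confusion[i][i] / row_sums[i] for i in range(k)) / k  # mean per-class accuracy
    expected = sum(row_sums[i] * col_sums[i] for i in range(k)) / total ** 2
    kappa = (oa - expected) / (1 - expected)               # agreement beyond chance
    return oa, aa, kappa

# Hypothetical two-class example with imbalanced classes.
confusion = [[50, 0],
             [5, 5]]
oa, aa, kappa = classification_metrics(confusion)
```

In this toy case OA is about 0.917 while AA is 0.75 and κ is 0.625, which shows why AA and κ are reported alongside OA when class sizes differ.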

A. Data Description
Four different benchmark hyperspectral datasets collected by different sensors and at different times are utilized to conduct the HSI classification experiments.
Pavia Centre: The Pavia Centre dataset was acquired by the ROSIS sensor over the side of Ticino river, Pavia, northern Italy. The spatial size of this dataset is 1096 × 715 pixels, and corresponding geometric resolution is 1.3 m. This sensor can acquire 115 bands in total in the wavelength range of 0.43-0.86 μm. After removing the greatly noise-affected channels, the remaining 102 spectral bands are employed for experiments. The corresponding image ground truth differentiates 9 classes, and the detailed land-cover classes, training samples, and test samples are shown in Table II.
Houston 2013: The Houston 2013 dataset was collected by the ITRES CASI-1500 sensor over the University of Houston campus in June 2012, which is provided by the 2013 IEEE GRSS Data Fusion Competition [44]. The spatial size of this image dataset is 349 × 1905 pixels, and the spatial resolution is 2.5 m. This dataset has 144 spectral bands in the wavelength range of 0.38-1.05 μm. There are 15 land-cover classes within the image coverage, and the detailed land-cover classes, training samples, and test samples are shown in Table III.
Dioni: The Dioni dataset is one of the HyRANK benchmark datasets which have been developed in the framework of the ISPRS Scientific Initiatives [45]. The HyRANK benchmark datasets contain two training images (i.e., Dioni and Loukia) along with the corresponding ground truth and two validation images. The spatial size of the Dioni dataset is 250 × 1376 pixels, which contains 176 spectral channels. There are 16 different land cover classes in the HyRANK benchmark datasets, and the selected Dioni dataset covers 12 classes. The detailed number of training samples and test samples along with the corresponding labels is reported in Table IV.
Houston 2018: The Houston 2018 dataset was gathered by the ITRES CASI-1500 sensor over the University of Houston campus in February 2017, which is provided by the 2018 IEEE GRSS Data Fusion Competition [46]. We only use the training portion of the whole HSI, and the ground truth is resampled to adapt the hyperspectral dataset [47]. The spatial size of this dataset is 601 × 2384 pixels at 1-m ground sampling distance. There are 48 spectral bands in the wavelength range of 0.38-1.05 μm. And there are 20 urban land-cover classes within image coverage. The detailed number of training samples as well as test samples with corresponding labels is shown in Table V.

B. Experimental Setup
To evaluate the performance of proposed HResNetAM, we use several state-of-the-art methods for comparative experiments. These models include the deep learning-based models (i.e., 3DCNN, CDCNN, SSRN, FDSSC, DBMA, and DBDA) as well as the SVM with radial basis function (RBF) kernel. In order to carry out comparative experiments more fairly, we use the same number of training samples in all methods, and 20% of the training samples are set as validation samples. Specifically, the parameters of each method are given separately according to the corresponding articles.
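One possible reading of this sampling protocol is sketched below: a fixed number of labeled pixels is drawn at random from each class, 20% of the drawn pixels are held out for validation, and the remaining labeled pixels form the test set. The function name, the rounding rule, and the choice to carve the validation set out of the selected samples are assumptions, since the text does not specify them.

```python
import random

def split_per_class(labels, per_class, val_fraction=0.2, seed=0):
    """Return (train, val, test) index lists from per-pixel class labels.

    labels       : sequence of class labels, one per labeled pixel
    per_class    : number of samples drawn from each class
    val_fraction : share of the drawn samples held out for validation
    """
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train, val, test = [], [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)                         # random selection per class
        n_selected = min(per_class, len(indices))    # cap for small classes
        n_val = int(round(val_fraction * n_selected))
        val += indices[:n_val]
        train += indices[n_val:n_selected]
        test += indices[n_selected:]                 # everything not selected
    return train, val, test

# Hypothetical labels: 50 pixels of class 0 and 30 pixels of class 1.
labels = [0] * 50 + [1] * 30
train, val, test = split_per_class(labels, per_class=10)
```

Changing `seed` across runs gives a different random selection each time, which is one way to realize repeated trials with randomly selected training samples.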
3DCNN: This method directly uses 3D convolution operators to extract features of the HSI. The architecture of the 3DCNN in [11] contains two convolution layers and a fully connected layer. This model uses a 3D image cube as input, and the input size is 5×5×B, where B refers to the number of spectral bands.
CDCNN: The contextual deep CNN constructs a deeper classification model with a residual learning structure, which is composed of the multiscale filter bank and two residual blocks. Then three convolutional layers and one fully connected layer are utilized for HSI classification [17]. The input of the CDCNN is a 5 × 5 × B block.
FDSSC: The FDSSC is based on 3D-CNN and the dense block; this model contains two dense blocks followed by average pooling, flatten, and fully connected layers [23]. We use the 9 × 9 × B image block as input.
SSRN: The SSRN combines residual learning and 3D-CNN, extracting spectral and spatial features in sequence using the corresponding residual blocks, and the following average pooling layer and a fully connected layer are employed for classification. This method uses the 7×7×B image block as input [18].
DBMA: The DBMA is based on the attention mechanism and the dense block, and it contains a spectral branch and a spatial branch as well as the corresponding attention blocks. Convolutions with (1, 1, 7) and (7, 7, 1) kernels are utilized in the spectral and spatial branches, respectively, and the size of the input is 7×7×B [31].
DBDA: The architecture of the DBDA is presented in [32], which also contains spectral and spatial dense blocks and corresponding attention blocks. And we use 7×7×B image block as input.
The DBMA [31] and DBDA [32] models utilize dense network to extract spectral and spatial features, and attention modules are employed to recalibrate extracted features. These two methods are used as comparative methods to verify the feature extraction capability of HResNet. In order to conduct the ablation study of attention mechanism, we also design the HResNet model as one comparative method, which has the same network structure with corresponding HResNetAM but without spectral and spatial attention modules.

C. Parameters Analysis and Setting
The parameters in deep learning models can influence the HSI classification to some extent, so we evaluate the main parameters of our proposed model: the learning rate, the spatial size, the number of training samples, and the numbers of scales and kernels. In our classification experiments, the HResNetAM yields relatively stable classification accuracies across different batch sizes and numbers of epochs, so we set the batch size and the number of epochs to 32 and 200, respectively.
1) Learning Rate: The learning rate greatly influences the convergence rate of the network and the HSI classification performance. Referring to the relevant experiments, we analyze the effect of learning rates in {0.0001, 0.0002, 0.0003, 0.0008, 0.001, 0.005, 0.01} on overall accuracies. Fig. 6 shows the results of the ten trials on the four datasets with different learning rates. In this figure, the two independent horizontal lines represent the overall range of the classification results, and the two edges of the box denote the upper quartile and the lower quartile, respectively. The horizontal line in the box refers to the median value, and individually plotted markers denote outliers. It can be found that a smaller learning rate gives relatively stable classification accuracy, while a bigger learning rate results in larger variance in the classification accuracy. According to the average OA and variance in the four groups of HSI classification, we set the learning rate to 0.0002, 0.0001, 0.0002, and 0.0005 for the four benchmark datasets, respectively.
2) Spatial Size: To exploit spatial information for spectral-spatial classification, we use a 3-D image cube as input. The spatial size can also influence the HSI classification results, and we test neighborhood sizes in {3, 5, 7, 9, 11, 13}. Table VI shows the average overall accuracy and corresponding variance of the proposed method on the four hyperspectral datasets with different spatial sizes. The experimental results show that the classification accuracy generally increases and then decreases as the neighborhood grows. Thus, the optimal neighborhood sizes of the four datasets are set to 5, 7, 7, and 11, respectively.
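The 3-D image cube used as input can be illustrated with a small sketch that cuts a size x size x bands neighborhood around one pixel. The helper `extract_cube` is hypothetical, and the border handling (clamping indices to the nearest valid pixel) is an assumption, since the article does not describe how pixels near the image edge are treated.

```python
def extract_cube(image, row, col, size):
    """Return the size x size x bands neighborhood centred on (row, col).

    `image` is a nested list indexed as image[row][col][band]. Border
    handling is an assumption: out-of-range indices are clamped to the
    nearest valid pixel (edge replication).
    """
    if size % 2 == 0:
        raise ValueError("size must be odd so the pixel sits at the centre")
    height, width = len(image), len(image[0])
    half = size // 2
    cube = []
    for dr in range(-half, half + 1):
        r = min(max(row + dr, 0), height - 1)
        cube_row = []
        for dc in range(-half, half + 1):
            c = min(max(col + dc, 0), width - 1)
            cube_row.append(list(image[r][c]))
        cube.append(cube_row)
    return cube


# Toy 4 x 5 image with 3 bands; each pixel stores [row, col, row + col].
toy = [[[r, c, r + c] for c in range(5)] for r in range(4)]
cube = extract_cube(toy, row=0, col=2, size=3)
print(len(cube), len(cube[0]), len(cube[0][0]))
```

Larger values of `size` bring in more spatial context but also more pixels from neighboring classes, which is consistent with the rise and subsequent fall in accuracy reported for growing neighborhoods.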

3) Training Samples: The number of training samples also has a great influence on HSI classification performance. To evaluate the robustness and generalization of the HResNetAM model with respect to the number of training samples, we randomly choose {20, 40, 60, 80, 100, 120, 140, 160, 180, 200} annotated samples per class for the four datasets. Fig. 7 shows the average overall accuracy achieved by HResNetAM with different numbers of training samples on the four hyperspectral datasets. From this figure, we can observe that the classification accuracy quickly reaches a relatively stable level as the number of training samples increases. According to the overall accuracies of each dataset, we use 140, 180, 140, and 180 training samples per class for the four datasets, respectively. Since some classes have fewer labeled samples, we set specific numbers of training samples for them. In the Houston 2013 dataset, we use 100 labeled samples for the Water and Tennis Court classes. In the Dioni dataset, we use 100 labeled samples for the Mineral Extraction Sites class and 50 labeled samples for the Fruit Trees class. And in the Houston [...] Fig. 8. Based on the comparative results, we can see that the classification accuracy is lower when the numbers of scales and kernels are small, especially for the Dioni and Houston 2018 datasets. Since the proposed method extracts multiscale features, a larger number of scales can effectively improve the classification accuracy, especially for more complex images. The optimal parameter combinations for the four HSI datasets are selected as {6, 8}, {7, 8}, {7, 6}, and {8, 6}, respectively.
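The per-class selection of training samples described in the Training Samples paragraph can be sketched as follows. This is a minimal illustration under stated assumptions: the function `sample_per_class`, its `overrides` argument, the random seed, and the toy class sizes are all hypothetical and only mimic the idea of a default count with smaller counts for classes that have few labeled pixels.

```python
import random
from collections import defaultdict


def sample_per_class(labels, per_class, overrides=None, seed=0):
    """Split sample indices into training and test sets, class by class.

    labels    : list with one class name per labelled pixel.
    per_class : default number of training samples drawn from each class.
    overrides : optional dict mapping a class name to its own count
                (hypothetical mechanism for classes with few samples).
    """
    overrides = overrides or {}
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for label, indices in by_class.items():
        wanted = overrides.get(label, per_class)
        if wanted > len(indices):
            raise ValueError(f"class {label!r} has only {len(indices)} samples")
        rng.shuffle(indices)
        train.extend(indices[:wanted])   # randomly chosen training samples
        test.extend(indices[wanted:])    # everything else is held out
    return sorted(train), sorted(test)


# Toy label list; the class sizes are made up for illustration.
labels = ["Water"] * 150 + ["Grass"] * 400 + ["Tennis Court"] * 120
train_idx, test_idx = sample_per_class(
    labels, per_class=180, overrides={"Water": 100, "Tennis Court": 100}
)
print(len(train_idx), len(test_idx))
```

Fixing the seed makes one split reproducible; repeating the experiment with different seeds gives the kind of repeated runs from which an average accuracy and its variance can be computed.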

D. Comparative Classification Results With State-of-the-Art Methods
1) Quantitative Comparisons: The average OAs, AAs, and Kappa coefficients with their corresponding variances, as well as the average classification accuracy of every land-cover class, obtained by the different classification methods on the four benchmark HSI datasets are listed in Tables VII-X. Note that the bold value in these tables represents the optimal value in the corresponding row. From the quantitative comparisons, the following conclusions can be drawn.
1) First of all, with the same number of training samples, deeper models tend to achieve higher classification accuracy. In the four classification experiments, the deeper models (i.e., CDCNN, FDSSC, SSRN, DBMA, and DBDA) have higher classification accuracies than the 3DCNN.
2) The attention mechanism improves classification to a certain degree. In the four groups of HSI classification experiments, the DBMA and DBDA with attention mechanism generally have higher classification accuracies than the SSRN and FDSSC. Because the HResNet model has the same network structure as HResNetAM but without attention blocks, the classification results of HResNet can also verify the effectiveness of the attention mechanism, and HResNetAM has higher classification accuracies than HResNet in all four classification experiments.
3) Comparing the overall accuracies across the four datasets, the classification performance differs from one HSI to another. Among them, the Pavia Centre dataset has the highest classification accuracy and the Houston 2018 dataset has a low accuracy. This is mainly due to the different levels of complexity within the hyperspectral datasets. However, by introducing the scale factor into the HSI classification model, the classification performance can be significantly improved. For example, the average overall classification accuracy of the proposed HResNetAM model is 83.61%, which is 3.27 percentage points higher than that of the DBDA model (i.e., 80.34%). Thus, the scale factor is helpful for the HSI classification model on complex datasets.
2) Qualitative Comparisons: In addition to the quantitative evaluation in Tables VII-X, the classification maps obtained by the nine different methods are also used for qualitative evaluation. Figs. 9-12 show the classification maps on the Pavia Centre dataset, the Houston 2013 dataset, the Dioni dataset, and the Houston 2018 dataset with different classification methods, respectively.
In these figures, the pseudocolor image and ground truth are also displayed, and different land-cover classes are represented by different colors. Comparing the classification maps obtained by the different methods in (c)-(k) with the ground-truth map in (b), we can see that the proposed HResNetAM model produces more reasonable classification maps, which supports the superiority of HResNetAM. In addition, on more complex images such as the Houston 2018 dataset, traditional deep learning methods produce more noisy pixels in the classification maps. Owing to the introduction of the scale factor and attention mechanism, the proposed HResNetAM generates more homogeneous and reasonable classification maps.
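For readers who want to reproduce the three summary metrics used in the quantitative comparison, the following sketch computes OA, AA, and the Kappa coefficient from a confusion matrix. It assumes the standard definitions of these metrics (overall fraction correct, mean per-class recall, and Cohen's kappa); the function name and the toy matrix are illustrative and do not come from Tables VII-X.

```python
def classification_scores(confusion):
    """Compute OA, AA, and Cohen's kappa from a square confusion matrix.

    confusion[i][j] is the number of samples whose true class is i and
    whose predicted class is j. Every class is assumed to have at least
    one true sample, otherwise the per-class accuracy is undefined.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    row_sums = [sum(row) for row in confusion]
    col_sums = [sum(confusion[i][j] for i in range(n_classes))
                for j in range(n_classes)]

    # Overall accuracy: correctly classified samples over all samples.
    oa = sum(confusion[i][i] for i in range(n_classes)) / total
    # Average accuracy: mean of the per-class accuracies (recalls).
    aa = sum(confusion[i][i] / row_sums[i] for i in range(n_classes)) / n_classes
    # Kappa: agreement corrected for the agreement expected by chance.
    expected = sum(row_sums[k] * col_sums[k] for k in range(n_classes)) / total ** 2
    kappa = (oa - expected) / (1 - expected)
    return oa, aa, kappa


# Toy 3-class confusion matrix (made-up counts).
toy_confusion = [
    [50, 0, 0],
    [5, 40, 5],
    [0, 10, 40],
]
oa, aa, kappa = classification_scores(toy_confusion)
print(f"OA={oa:.4f}  AA={aa:.4f}  Kappa={kappa:.4f}")
```

OA and AA coincide in this toy case because every class has the same number of samples; on imbalanced datasets AA weights rare classes more heavily than OA does, which is why both are usually reported.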

E. Discussion
When applying deep learning models to HSI classification, traditional models often have difficulty extracting multiscale information at a granular level, which affects the classification accuracy to some degree. To address this problem, we propose the HResNet with attention mechanism, which learns spectral and spatial features at different scales and fuses them for joint classification. The designed HResNetAM model, based on hierarchical residual learning and the attention mechanism, achieves better classification results than state-of-the-art deep learning models. The main reasons can be summarized in the following two aspects.
First, the hierarchical feature learning ability is important. The designed model uses HResNet to extract spatial and spectral features at different scales for the first time, which allows it to learn characteristics from different receptive fields. These global and local features contribute to the HSI classification results, especially for more complex images such as the Dioni and Houston 2018 datasets, and the comparative experiments confirm the effectiveness of the scale factor for HSI classification.
Second, the attention mechanism can further improve the classification performance to a certain extent. The attention mechanism is orthogonal to HResNet, so it is feasible to combine these two learning strategies for HSI classification. The experimental results also verify the advantages of combining the attention mechanism with HResNet.

IV. CONCLUSION
In this study, we propose a novel HResNet with attention mechanism model for HSI spectral-spatial classification, which has three advantages. First, the proposed network uses hierarchical residual blocks to extract more discriminative spectral-spatial features at different scales, so as to retain multiscale information for classification. Second, the attention mechanism is employed to calibrate the weights of the hierarchical spectral and spatial features. Third, the double-branch structure has potential for learning the spectral and spatial features separately. Therefore, the novel hierarchical residual network architecture with attention mechanism can extract more complete and discriminative information from HSI data by managing spectral-spatial features at a hierarchical level. In addition, the residual learning structure and batch normalization can further improve the efficiency of the training process. The performance of HResNetAM has been verified on four benchmark HSIs against state-of-the-art models, and the experimental results confirm the superiority of the proposed method.

ACKNOWLEDGMENT
The authors would like to thank Prof. P. Gamba for providing the Pavia Centre data, ISPRS for providing the Dioni data, and the IEEE GRSS Image Analysis and Data Fusion Technical Committee for providing the Houston 2013 and Houston 2018 datasets. The authors also want to thank the editor and reviewers for their careful reading and constructive comments.
Zhixiang Xue received the B.S. degree in measure-