Pyramidal Dilation Attention Convolutional Network With Active and Self-Paced Learning for Hyperspectral Image Classification

In recent years, deep neural networks have been widely used for hyperspectral image (HSI) classification and have shown excellent performance when numerous labeled samples are available. The acquisition of HSI labels is usually based on field investigation, which is expensive and time consuming. Hence, the available labels are usually limited, which affects the effectiveness of deep HSI classification methods. To improve the classification performance while reducing the labeling cost, this article proposes a semisupervised deep learning (DL) method for HSI classification, named pyramidal dilation attention convolutional network with active and self-paced learning (PDAC-ASPL), which integrates active learning (AL), self-paced learning (SPL), and DL into a unified framework. First, a densely connected pyramidal dilation attention convolutional network is trained with a limited number of labeled samples. Then, the most informative samples from the unlabeled set are selected by AL and their real labels are queried, and the highest confidence samples with corresponding pseudo labels are extracted by SPL. Finally, the samples from AL and SPL are added to the training set to retrain the network. Compared with some DL- and AL-based HSI classification methods, our PDAC-ASPL achieves better performance on four HSI datasets.


I. INTRODUCTION
Hyperspectral imaging is an important technique in remote sensing that collects the electromagnetic spectrum from visible to near-infrared wavelength ranges and can provide hundreds of narrow spectral band images of the same region for Earth observation. In a hyperspectral image (HSI), each pixel can be considered as a high-dimensional vector whose entries correspond to the spectral reflectance in particular wavelengths [1]. The HSI has the advantage of distinguishing subtle spectral differences and has been widely used in many fields, such as precision agriculture, land use mapping, and urban planning. HSI classification is important for HSI analysis and has received much attention in the past few decades. According to previous works, HSI classification methods can use spectral features, spatial features, or spectral-spatial features [2], [3]. The spectral feature is the fundamental characteristic of the HSI, and spectral-based methods only use spectral information in the classification process. In the early days of HSI classification research, researchers focused on methods solely based on spectral features and simply performed classification on pixel vectors, using techniques such as principal component analysis (PCA) [4] and linear discriminant analysis [5]. However, spectral-based methods ignore the rich spatial information of HSIs. The spatial information of a pixel mainly reflects the relationship between the pixel and its spatial neighbors, which can greatly improve the robustness of the model [6]. To use spectral and spatial features simultaneously, spectral-spatial-based approaches have been proposed, including filter-based methods [7], [8], morphological methods [9], composite kernel methods [10], [11], sparse or low-rank representation methods [12], [13], [14], deep learning (DL) methods [3], [15], etc.
Recently, DL has been gradually applied to HSI classification. DL methods, such as the convolutional neural network (CNN) [16] and DenseNet [17], can automatically learn deep spectral-spatial features from HSIs and have achieved excellent classification performance. However, DL methods usually need a large number of labeled samples to train the network [6]. In practice, the collection of labeled samples requires human involvement, and the process is labor intensive and costly [18], [19]. Therefore, one of the main problems facing DL is the difficulty of obtaining sufficient labeled samples. To solve this problem, researchers have proposed a series of deep HSI classification methods for small-sample problems, such as data augmentation (DA) strategies [20], [21], [22], lightweight networks [23], [24], etc.
DA is a popular technique to improve the generalization ability of deep neural networks by generating more training samples. Traditional DA strategies, such as translation, clipping, flipping, rotation, and adding noise, are utilized to increase both the amount and diversity of samples. In recent years, some new DA strategies have been proposed. Li et al. [20] proposed a pixel-block-pair (PBP) DA method, which greatly increases the number of training samples while using a deep CNN to extract PBP features and decision fusion for final label assignment. Haut et al. [21] used a random occlusion data augmentation (RODA) method during CNN training to randomly occlude pixels in different rectangular spatial regions of the HSI, generating training images with various levels of occlusion and reducing the risk of overfitting.
Lightweight networks have the advantages of fewer parameters, less computation, and shorter inference time compared with typical deep neural networks, and they reduce the dependence on labeled samples by pruning, distillation, and group convolution without degrading performance. In HSI classification, Zhang et al. [23] proposed an end-to-end 3-D lightweight convolutional neural network (3-D-LWnet) and alleviated the small-sample problem by using cross-sensor and cross-modal strategies. The LiteDepthwiseNet proposed by Cui et al. [24] decomposes standard convolution into depthwise convolution and pointwise convolution based on 3-D depthwise convolution and removes the ReLU layer and the batch normalization layer in the original 3-D depthwise convolution, which not only alleviates the overfitting of the model on small datasets, but also achieves high classification performance with minimal parameters.
The aforementioned methods have achieved good performance on small-sample classification problems, but they still cannot solve the scarcity of labeled samples. Recently, active learning (AL) has become a hot research topic. AL selects useful samples for labeling and theoretically guarantees a significant reduction in label usage. It assumes that samples are of different importance: only a few samples are informative for the classifier, while the others are redundant. In general, AL is used as an auxiliary strategy to select more valuable samples. Moreover, AL is versatile and can be combined with various classifiers, such as the support vector machine (SVM) [25], [26], CNN [27], ResNet [28], generative adversarial networks [29], etc.
Although all these AL methods reduce the use of labeled samples to some extent, the lack of labels still largely limits the classification performance. In order to obtain more labeled samples without consuming more resources, this article expands the training set by self-paced learning (SPL). SPL is a learning mechanism proposed in recent years that gradually incorporates samples, from easy to complex, into training, simulating the human learning process [30], [31]. Contrary to AL, SPL prefers to select samples with high confidence, so their combination can capture better information about the dataset and, thus, improve the classification accuracy.
In order to alleviate the problem of insufficient labeled samples in training DL-based HSI classification methods, in this article, following the idea of semisupervised learning [32], we integrate DL, AL, and SPL into a unified framework and propose a pyramidal dilation attention convolutional network with active and self-paced learning (PDAC-ASPL) model. In the DL part, we propose a pyramidal dilation attention convolutional (PDAC) network, which improves the original pyramidal dilation convolutional (PDC) network [33] by incorporating squeeze-and-excitation (SE) attention modules [34], [35] into different PDC blocks. It can effectively suppress the features of unimportant channels and, meanwhile, allows the adaptive selection of spatial information by focusing on the features of the central pixel and its neighboring pixels. Besides the SE-based attention, other attention mechanisms, such as the self-attention mechanism in the transformer model, could also be considered [36], [37]. By embedding the AL and SPL strategies into the PDAC main network, we can select the most informative samples with the highest uncertainty to assign true labels and query the samples with the highest confidence to assign pseudo labels. The labeled informative samples and pseudo-labeled high-confidence samples are added to the training set to refine the model training.

The contributions of the proposed PDAC-ASPL are mainly threefold.

1) We integrate DL, AL, and SPL into a unified framework and propose a PDAC-ASPL model for small-sample HSI classification. The PDAC network can effectively extract spatial-spectral features for classification. The AL and SPL strategies select the most informative samples and high-confidence samples to enlarge the training set for refining model training.

2) A PDAC network is designed for spatial-spectral feature extraction and classification. To make full use of valid information across all PDC blocks and to better use the output information from the last PDC block, an SE attention mechanism is incorporated before the first PDC block and after the last PDC block, respectively.

3) A new SPL strategy is constructed for the selection of high-confidence samples. It can effectively address the class imbalance problem by designing a class budget and alleviate the effect of noisy pseudo labels by using a weighted symmetric cross-entropy (SCE) loss.

The rest of this article is organized as follows. Section II describes the related works on AL and SPL. Section III presents the proposed model in detail. Section IV reports a series of experiments on four HSI benchmark datasets. Section V provides the ablation experiments and parameter analysis. Finally, Section VI concludes this article.

II. RELATED WORKS

A. AL Methods
In recent years, AL has been extensively studied for HSI classification. For example, Rajan et al. [38] applied the AL method to single-image classification and knowledge transfer, validating it with a maximum-likelihood classifier and a binary hierarchical classifier. Wang et al. [26] used a supervised clustering technique and the labeling process of classification results to discover the representativeness and differentiation of samples and assigned pseudo labels to unlabeled data based on the clustering and classification results. The samples that were not assigned pseudo labels in each iteration were also considered as candidates for AL, and finally, a semisupervised AL method was designed based on the SVM. Xu et al. [25] constructed leave-one-class-out multiviews and designed a sample query strategy from the perspectives of classification confidence and training contribution. The most inconsistent high-quality samples are filtered out by making full use of the iterative prediction information and the spatial-spectral features of the HSI. Then, the target samples are obtained by AL in each iteration through two-layer screening, and an SVM is used to obtain the final classification results. Cao et al. [27] integrated AL and DL into a unified framework, in which a CNN is trained using a limited number of labeled pixels, and the most informative pixels from the candidate pool are selected by AL and then added to the original training set to fine-tune the CNN. Ding et al. [39] proposed a clustering-inspired AL method, which selects highly informative and diverse samples from the unlabeled candidate set for manual labeling by clustering via fast search and find of density peaks, and pretrains the CNN with a k-means clustering-based pseudo-labeling scheme.
Although the aforementioned works have achieved good classification results, they have not fundamentally solved the scarcity of labeled samples. Therefore, we combine AL with SPL to address this problem at its root.

B. Self-Paced Learning
SPL has been previously used in the field of HSI classification. Peng et al. [31] combined SPL with sparse representation to construct a self-paced joint sparse representation (JSR) model for HSI classification. It learns the weights of neighboring pixels using SPL and, in each iteration, selects the neighboring pixels with nonzero weights (i.e., easy pixels) to be added to the JSR learning process. Yang et al. [40] proposed an SPL-based probability subspace projection (SL-PSP) method for HSI classification. After assigning a probability label to each pixel and a risk to each labeled pixel, two regularizers are developed in SL-PSP for classification, derived from an SPL maximum margin and a probability label graph, respectively.
Recently, some studies have combined SPL with AL in other fields. For example, in computer vision, Lin et al. [41] first proposed to combine AL with SPL for face recognition. Subsequently, Ren et al. [42] applied AL and SPL with DL to synthetic aperture radar automatic target recognition. In each of these fields, the combination of SPL and AL has yielded more desirable results.

III. PROPOSED METHOD
The overall framework of the proposed PDAC-ASPL is shown in Fig. 1, which combines AL and SPL with DL to achieve better results with limited labeled samples. The input hyperspectral dataset D is first preprocessed by the PCA to reduce its dimensionality from B to b. Then, a densely connected PDAC network is trained with a limited number of labeled samples and outputs features for the labeled training set and the unlabeled set. Based on these features, the AL strategy selects the most informative samples from the unlabeled set and assigns them real labels, and SPL selects high-confidence samples and assigns them pseudo labels. The samples selected by AL and SPL are added to the training set to retrain the network.
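To make the preprocessing concrete, here is a minimal sketch in Python (NumPy and scikit-learn); the function names `reduce_bands` and `extract_cube`, the reflect padding, and the defaults b = 30 and 11 × 11 patch size (taken from the experimental settings later in this article) are illustrative choices, not the article's code.

```python
import numpy as np
from sklearn.decomposition import PCA

def reduce_bands(hsi, b=30):
    """Reduce an H x W x B hyperspectral cube to H x W x b with PCA."""
    H, W, B = hsi.shape
    flat = hsi.reshape(-1, B)                      # one row per pixel
    reduced = PCA(n_components=b).fit_transform(flat)
    return reduced.reshape(H, W, b)

def extract_cube(hsi_reduced, row, col, patch=11):
    """Extract a patch x patch x b data cube centered at (row, col)."""
    r = patch // 2
    padded = np.pad(hsi_reduced, ((r, r), (r, r), (0, 0)), mode="reflect")
    return padded[row:row + patch, col:col + patch, :]
```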

A. PDAC Network
For feature extraction, a densely connected PDAC network is designed, as shown in Fig. 2(a). In the PDAC network, all the layers in the dense convolutional network are directly connected to ensure the maximum transmission of information [33], and the dilation convolution is used to integrate the multiscale context information of the HSI.
The main structure of the PDAC network is the PDC block, as shown in Fig. 2(b). It consists of several PDC layers, and dense connections are adopted between different PDC layers. The PDC layers are composed of dilated convolution layers with different dilation factors as [33]

$$N_k = n_k^{1} \wedge n_k^{2} \wedge \cdots \wedge n_k^{d}, \qquad d = 2^{k-1}$$

where $N_k$ represents the $k$th PDC layer, $n_k^{d}$ indicates the sub-dilated convolutional layer with dilation factor $d = 2^{k-1}$ in the $k$th PDC layer, and $\wedge$ represents the stacking of sub-dilated convolutional layers. Different skip connections correspond to different dilation factors; in general, a shallow skip connection corresponds to a small dilation factor. The width of the network increases as the number of PDC layers increases. The advantage of this structure is that spatial information over larger ranges can be obtained [33]. In this article, our network uses three PDC layers in each PDC block for feature extraction.
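As a plausible reading of the PDC structure in [33] and the equation above, the following PyTorch sketch stacks sub-dilated convolutions inside each PDC layer (dilation factors 1, 2, ..., 2^{k-1}) and densely connects the layers within a block; the channel widths, kernel size, and class names are illustrative assumptions, not values from the article.

```python
import torch
import torch.nn as nn

class PDCLayer(nn.Module):
    """kth PDC layer: a stack of sub-dilated 3x3 convolutions whose
    largest dilation factor is d = 2**(k-1)."""
    def __init__(self, in_ch, out_ch, k):
        super().__init__()
        convs, ch = [], in_ch
        for d in (2 ** j for j in range(k)):      # dilations 1, 2, ..., 2^(k-1)
            convs.append(nn.Sequential(
                nn.Conv2d(ch, out_ch, 3, padding=d, dilation=d),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True)))
            ch = out_ch
        self.stack = nn.Sequential(*convs)

    def forward(self, x):
        return self.stack(x)

class PDCBlock(nn.Module):
    """Three densely connected PDC layers: each layer receives the
    concatenation of the block input and all preceding layer outputs."""
    def __init__(self, in_ch, growth=32, num_layers=3):
        super().__init__()
        self.layers, ch = nn.ModuleList(), in_ch
        for k in range(1, num_layers + 1):
            self.layers.append(PDCLayer(ch, growth, k))
            ch += growth                          # dense concatenation widens input

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        return torch.cat(feats, dim=1)
```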
To make full use of valid information for all the PDC blocks and to better use the output information from the last PDC block, an SE attention mechanism is incorporated before the first PDC block and after the last PDC block [34], [35], respectively.

1) Spectral Attention Module:
Considering that the importance of different channels differs, an SE attention is first imposed on the spectral feature channels, as shown in Fig. 3. In detail, for a feature map of size h × w × b from the first 2-D convolution layer of the network, a global average pooling (pooling size h × w) is used to generate a feature map of size 1 × 1 × b. Then, it passes through two fully connected layers, where the number of neurons in the first fully connected layer is b/16 and that in the second fully connected layer is b. Next, the attention map of size 1 × 1 × b is obtained through the sigmoid layer. Finally, the original feature map of size h × w × b and the generated attention map of size 1 × 1 × b are multiplied channel-wise to obtain a feature map that reflects the different channel importances.
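The described module follows the standard SE design and can be sketched in PyTorch as below; the ReLU between the two fully connected layers is an assumption carried over from the original SE formulation [34], and the class name is ours.

```python
import torch
import torch.nn as nn

class SpectralSE(nn.Module):
    """Squeeze-and-excitation over spectral channels: global average
    pooling -> FC(b/16) -> FC(b) -> sigmoid -> channel-wise rescaling."""
    def __init__(self, b, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(b, b // reduction),
            nn.ReLU(inplace=True),           # assumed, as in standard SE
            nn.Linear(b // reduction, b),
            nn.Sigmoid())

    def forward(self, x):                    # x: (N, b, h, w)
        n, c, _, _ = x.shape
        s = x.mean(dim=(2, 3))               # global average pooling -> (N, b)
        s = self.fc(s).view(n, c, 1, 1)      # attention map of size 1 x 1 x b
        return x * s                         # channel-wise multiplication
```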
2) Spatial-Spectral Attention Module: For the feature map $h$ of size h × w × b output by the third PDC layer, we generate a 1-D spectral attention map $M_{se}$ (size 1 × 1 × b) and a 2-D spatial attention map $M_{sa}$ (size h × w × 1).
Here, the spectral attention is slightly different from that in the last subsection. It adds a global maximum pooling alongside the global average pooling and uses the combination of the two pooling outputs as the input to the next layer, in order to better complement the global information. The spectral attention can be expressed as

$$s_k = s_k^{\text{avg}} + s_k^{\text{max}}, \qquad \tilde{h}_k = F_{\text{scale}}(h_k, s_k)$$

where $s_k$ is the sum of the average pooling result $s_k^{\text{avg}}$ and the maximum pooling result $s_k^{\text{max}}$, and $F_{\text{scale}}(h_k, s_k)$ refers to the spectral-wise multiplication between the feature map $h_k$ and the scalar $s_k$.
For spatial attention, we first perform global average pooling and maximum pooling on the input features, concatenate the two pooling results, feed them into a convolution layer, and then use the sigmoid activation function to obtain the output $sa$:

$$sa = \sigma\big(W * [\,\text{AvgPool}(h);\ \text{MaxPool}(h)\,]\big)$$

where $\sigma$ is the sigmoid function, $*$ is the convolution operation, and $W$ is a learnable built-in parameter. The attended feature is then obtained as $\tilde{h}_{i,j} = F_{\text{scale}}(h_{i,j}, sa_{i,j})$, where $F_{\text{scale}}$ represents the spatial-wise multiplication operation between the feature vector $h_{i,j}$ (size 1 × 1 × b) and the scalar $sa_{i,j}$.
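A minimal PyTorch sketch of the spatial attention branch described above; the 7 × 7 convolution kernel size is an assumption borrowed from CBAM-style designs [35] rather than a detail stated here.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Channel-wise average and max pooling are concatenated, convolved,
    and squashed by a sigmoid to produce the h x w x 1 attention map sa."""
    def __init__(self, kernel_size=7):       # kernel size is an assumption
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                    # x: (N, b, h, w)
        avg = x.mean(dim=1, keepdim=True)    # (N, 1, h, w)
        mx, _ = x.max(dim=1, keepdim=True)   # (N, 1, h, w)
        sa = self.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * sa                        # spatial-wise multiplication
```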

B. SPL for Selecting High-Confidence Samples
SPL is a learning mechanism that borrows the idea of human learning from simple to complex [30]. Traditional SPL is limited to selecting "simple" samples with high confidence derived from the network. This may cause the selected "simple" samples to all come from the same class, eventually resulting in overfitting. Here, we set a class budget for sample selection and design a new loss to alleviate the effect of noisy pseudo labels in SPL.
After predicting the unlabeled samples with the PDAC network, a predicted one-hot label $\hat{y}_i$ can be obtained for each sample $x_i$. SPL assigns a weight $v_i$ to each sample $x_i$ through the following optimization model [30]:

$$\min_{w,\, v \in \{0,1\}^n} \; \sum_{i=1}^{n} \big[\, v_i\, \ell\big(\hat{y}_i, f(x_i; w)\big) - \lambda v_i \,\big]$$

where $\lambda$ is a pace parameter, $w$ is the model parameter, $f(\cdot\,; w)$ denotes the network prediction, and $\ell$ is the loss function.
With $w$ fixed, it is easy to obtain the optimal weight $v_i$ as

$$v_i^{*} = \begin{cases} 1, & \ell\big(\hat{y}_i, f(x_i; w)\big) < \lambda \\ 0, & \text{otherwise.} \end{cases}$$

The samples whose losses are smaller than $\lambda$ can be considered as "easy" or high-confidence samples. However, the samples with smaller losses may all come from a few "easy-to-classify" categories, which may cause a class imbalance problem. To avoid selecting too many high-confidence samples from one class, we design a class budget $M$, i.e., we select at most $M$ samples in each class. That is, for each class, the selected high-confidence samples should have losses smaller than $\lambda$, and the number of selected samples per class is at most $M$.
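The budgeted selection rule is straightforward to implement. A minimal NumPy sketch is given below; the function name and array layout are illustrative, and ties are broken by keeping the smallest losses first.

```python
import numpy as np

def select_high_confidence(losses, pseudo_labels, lam, M):
    """Keep, per class, the samples whose loss is below the pace
    threshold lam, capped at the class budget M (smallest losses first)."""
    selected = []
    for c in np.unique(pseudo_labels):
        idx = np.where((pseudo_labels == c) & (losses < lam))[0]
        idx = idx[np.argsort(losses[idx])][:M]    # enforce class budget M
        selected.extend(idx.tolist())
    return np.array(selected, dtype=int)
```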
In the SPL step, we use a new SCE loss function to select high-confidence samples. The general cross-entropy (CE) loss is

$$\ell_{ce} = -\sum_{k=1}^{K} q(k|x) \log p(k|x) \tag{5}$$

where $p(k|x)$ and $q(k|x)$ are the predictive and true probability distributions for sample $x$, respectively. The CE loss can be intuitively understood as an effort to increase the predicted probability of the labeled category. Wang et al. [43] indicated that the CE loss has difficulty adjusting for the effect of noisy labels and proposed an SCE loss as

$$\ell_{sce} = \ell_{ce} + \ell_{rce} \tag{6}$$

where $\ell_{rce} = -\sum_{k=1}^{K} p(k|x) \log q(k|x)$. In simple terms, $\ell_{rce}$ swaps the label and the predicted value. In the SPL step, the sample label $\hat{y}_i$ is a one-hot pseudo label, so noisy labels are inevitable. To alleviate their effect, we use a new weighted SCE loss function as

$$\ell = \alpha\, \ell_{ce} + (1 - \alpha)\, \ell_{rce} \tag{7}$$

where $\alpha$ is a weight parameter. As $\ell_{rce}$ plays an auxiliary role in alleviating the effect of noisy labels, the weight $\alpha$ is set as 0.7 in the experiments.
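For concreteness, a per-sample version of the weighted SCE loss can be sketched in PyTorch as follows. The placement of α reflects our reading of (7) and should be treated as an assumption; the clamp is required because the one-hot pseudo label makes log q(k|x) undefined at zero, as also handled in [43].

```python
import torch

def weighted_sce(pred_probs, pseudo_onehot, alpha=0.7, eps=1e-7):
    """pred_probs: (N, K) softmax outputs; pseudo_onehot: (N, K) one-hot
    pseudo labels. Returns the per-sample weighted SCE loss (no reduction),
    which the SPL step compares against the pace threshold lambda."""
    p = pred_probs.clamp(min=eps)
    q = pseudo_onehot.clamp(min=eps)
    ce = -(q * p.log()).sum(dim=1)           # cross entropy
    rce = -(p * q.log()).sum(dim=1)          # reverse cross entropy
    return alpha * ce + (1 - alpha) * rce
```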

C. AL for Selecting Most Informative Samples
AL is a common machine learning approach that queries sample information through iterations and then labels some informative samples. It aims to use fewer labels to obtain better learning performance. AL generally selects informative samples based on the uncertainty criterion and the diversity criterion. Existing AL-based hyperspectral classifiers normally employ off-the-shelf uncertainty-based algorithms [44], such as least confidence [45], entropy sampling [45], best versus second best (BvSB) [46], and Bayesian active learning disagreement [47].
For each sample $x_i$, the PDAC network produces a vector $z_i$ of size K × 1, which can be viewed as the class probability vector of sample $x_i$. To use the vector $z_i$ generated by the PDAC network for sample selection in AL, we adopt the BvSB strategy based on the uncertainty criterion for querying.
The BvSB criterion is specifically designed for multiclass problems; thus, it is well suited for HSI classification. In this criterion, we only need to consider the two classes with the highest classification probabilities. The criterion is defined as

$$\text{BvSB}(z_i) = P_B(z_i) - P_{SB}(z_i)$$

where $P_B(z_i)$ and $P_{SB}(z_i)$ denote the highest and second highest class membership probabilities of the unlabeled sample $x_i$, respectively. For this strategy, a smaller value of $\text{BvSB}(z_i)$ indicates that the best and second-best affiliation probabilities are closer, that is, the sample has higher uncertainty; therefore, it will be selected by AL.
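BvSB querying reduces to a few array operations; a minimal NumPy sketch, with illustrative names, is shown below.

```python
import numpy as np

def bvsb_select(probs, n_query):
    """probs: (n, K) class-probability matrix for the unlabeled set.
    Returns the indices of the n_query most uncertain samples, i.e.,
    those with the smallest gap P_B - P_SB."""
    srt = np.sort(probs, axis=1)            # ascending per row
    bvsb = srt[:, -1] - srt[:, -2]          # best minus second best
    return np.argsort(bvsb)[:n_query]       # smallest gap = most uncertain
```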

D. Whole Training Process
Our proposed PDAC-ASPL method combines DL, AL, and SPL to achieve better results with fewer labeled samples. In the training process, the PCA is first used to reduce the dimensionality of the hyperspectral dataset D (size H × W × B, K classes in total) from B to b. Then, data cubes are constructed and divided into a training set T (m labeled samples per class, Km labeled samples in total) and an unlabeled set U (n unlabeled samples). Next, we use the Km labeled samples in T for the initial training of the PDAC network and predict the n unlabeled samples in U, which yields a matrix of size n × K that can be regarded as a probability matrix. For the AL branch, we use the BvSB strategy to select the $N_{AL}$ most informative samples, assign them real labels, and add the selected samples with their labels to T, while removing them from U. For the SPL branch, we use the SCE loss function to calculate the loss value of each sample, select the samples with lower losses, and assign them pseudo labels. Meanwhile, we consider the class budget M and finally select $N_{SPL}$ samples ($N_{SPL} \le KM$), and then add these samples with their pseudo labels to T. At this point, the training set T completes one round of updating and serves as the new training set for the next round. This procedure is executed iteratively for R rounds until the termination condition is met.
The entire procedure of PDAC-ASPL is summarized in Algorithm 1.
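As a complement to Algorithm 1, the sketch below outlines one way the loop could be organized. It reuses `bvsb_select` and `select_high_confidence` from the earlier sketches; `train_fn`, `predict_fn`, and `oracle_fn` are placeholders for the network training routine, the probability prediction routine, and the human labeling step, and the per-sample loss is simplified to a negative log-confidence proxy for brevity (the full method uses the weighted SCE loss of the SPL section).

```python
import numpy as np

def pdac_aspl_loop(train_fn, predict_fn, oracle_fn, X_train, y_train,
                   X_unlab, rounds=5, n_al=50, lam=5e-6, M=100, spl_start=3):
    pool = np.ones(len(X_unlab), dtype=bool)       # mask of unlabeled samples
    for r in range(1, rounds + 1):
        train_fn(X_train, y_train)                 # (re)train the network on T
        idx = np.where(pool)[0]
        probs = predict_fn(X_unlab[idx])           # (n, K) probabilities
        # AL branch: query true labels for the most uncertain samples and
        # remove them from the unlabeled pool.
        al = idx[bvsb_select(probs, n_al)]
        X_train = np.concatenate([X_train, X_unlab[al]])
        y_train = np.concatenate([y_train, oracle_fn(al)])
        pool[al] = False
        # SPL branch (from round spl_start): add high-confidence samples
        # with pseudo labels; they stay in the pool and may recur later.
        if r >= spl_start:
            idx = np.where(pool)[0]
            probs = predict_fn(X_unlab[idx])
            pseudo = probs.argmax(axis=1)
            losses = -np.log(np.clip(probs.max(axis=1), 1e-7, None))
            spl = select_high_confidence(losses, pseudo, lam, M)
            X_train = np.concatenate([X_train, X_unlab[idx[spl]]])
            y_train = np.concatenate([y_train, pseudo[spl]])
    return X_train, y_train
```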

3) Salinas (SA):
The SA dataset was collected by the AVIRIS sensor in Salinas Valley, CA, USA. The size of the Salinas image is 512 × 217, and the spatial resolution is 3.7 m. There are 224 spectral bands ranging from 400 to 2500 nm, in which 20 water absorption bands are removed before classification. The dataset contains 16 ground objects and 54 129 labeled pixels. The pseudo-color composite image and ground truth map are shown in Fig. 6.

4) HuangHeKou:
The HHK dataset was captured in 2019 by the GF5_AHSI in the area around the Yellow River Estuary ("Huanghekou" in Chinese) in China. The overall image contains 330 spectral bands in the wavelength range of 390-1029 nm (VNIR) and 1005-2513 nm (SWIR). Forty-five bad bands were eliminated and the remaining 285 bands were used for classification. The dataset has the size of 1185 × 1342 pixels. There are 21 types of materials. The pseudo-color composite image and ground truth map are shown in Fig. 7.
The categories and corresponding number of pixels in four datasets are shown in Tables I-IV.

B. Method Comparison and Parameter Settings
We compare the proposed PDAC-ASPL method with ten other methods for HSI classification on four datasets. For each comparison method, we mostly used the original parameters of the referenced article. The ten compared methods and their parameter settings are listed as follows.

1) SVM [48] is one of the representatives of traditional HSI classification algorithms.

2) RODA [21] is a DA method. In this method, the spatial size of the data cube is 23 × 23 and the learning rate is 0.001.

3) 3-D lightweight Siamese network (3DLSN) [49] is a lightweight DL method. In this method, the spatial size of the data cube is 7 × 7 and the learning rate is 0.001.

4) HybridSN [50] is a classical convolutional network combining 3-D CNN and 2-D CNN. In this method, the spatial size of the data cube is 25 × 25 and the learning rate is 0.001.

5) Double-branch multiattention mechanism network (DBMA) [51] is a two-branch spatial-spectral DL classification method based on the attention mechanism. In this method, the spatial size of the data cube is 7 × 7 and the learning rate is 0.0005.

6) Spectral-spatial residual network (SSRN) [52] is a supervised deep residual network. In this method, the spatial size of the data cube is 7 × 7 and the learning rate is 0.0003.

7) MVSS-AL [25] is an AL method based on leave-one-class-out multiviews.

8) FAAL [29] is an AL method based on adversarial learning. In this method, the spatial size of the data cube is 25 × 25 and the learning rate is 0.001.

9) SpectralFormer (SF) [36] is a transformer-based method. SF is able to learn spectral local sequence information from neighboring bands of the HSI to produce grouped spectral embeddings. In this method, the spatial size of the data cube is 9 × 9 and the learning rate is 0.0005.

10) Spectral-spatial feature tokenization transformer (SSFTT) [37] is a transformer-based method that can capture spectral-spatial features and high-level semantic features. In this method, the spatial size of the data cube is 9 × 9 and the learning rate is 0.001.

C. Experimental Settings
In the PDAC-ASPL, the reduced spectral dimension is set to b = 30 using the PCA, and the spatial size of the data cube is 11 × 11. The learning rate is 0.001. All the experiments are run on a computer with a 2.70-GHz CPU, 128 GB of RAM, and two NVIDIA GeForce RTX 2080Ti GPUs, using Python 3.7, to obtain the results and computational times.
On the IP dataset, for our method and the other two AL-based methods (i.e., MVSS-AL and FAAL), the initial training set contains 160 labeled samples (i.e., ten labeled samples per class), and the top 50 informative samples are selected and assigned true labels in each of the following four rounds of AL. In total, 360 labeled samples are used for the three AL-based methods. For the other eight algorithms (i.e., SVM, RODA, 3DLSN, HybridSN, DBMA, SSRN, SF, and SSFTT), we randomly select 360 labeled samples as the training set and use the remaining samples for testing.

On the UP dataset, for our method and the other two AL-based methods, the initial training set contains 45 labeled samples (i.e., five labeled samples per class), and the top 50 informative samples are selected and assigned true labels in each of the following four rounds of AL. In total, 245 labeled samples are used for the three AL-based methods. For the other eight algorithms, we randomly select 245 labeled samples as the training set and use the remaining samples for testing.

On the SA dataset, for the three AL-based methods, the initial training set contains 48 labeled samples (i.e., three samples per class), and the top 30 informative samples are selected and assigned true labels in each of the following four rounds of AL. In total, 168 labeled samples are used for the three AL-based methods. For the other eight algorithms, we randomly select 168 labeled samples as the training set and use the remaining samples for testing.

On the HHK dataset, for the three AL-based methods, the initial training set contains 105 labeled samples (i.e., five samples per class), and the top 30 informative samples are selected and assigned true labels in each of the following four rounds of AL. In total, 225 labeled samples are used for the three AL-based methods. For the other eight algorithms, we randomly select 225 labeled samples as the training set and use the remaining samples for testing.
The overall accuracy (OA), class accuracy (CA), average accuracy (AA), and kappa coefficient (κ) on the testing set are used to evaluate the classification performance of each method. To ensure the stability of the experimental results, we conduct ten random experiments for each method.
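For reference, all four metrics can be computed from a confusion matrix, as in the following NumPy sketch (labels are assumed to be integers in {0, ..., K−1}).

```python
import numpy as np

def classification_metrics(y_true, y_pred, K):
    """Return OA, per-class accuracies (CA), AA, and the kappa coefficient."""
    cm = np.zeros((K, K), dtype=np.int64)          # rows: truth, cols: prediction
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    n = cm.sum()
    oa = np.trace(cm) / n                          # overall accuracy
    ca = np.diag(cm) / cm.sum(axis=1)              # per-class accuracy
    aa = ca.mean()                                 # average accuracy
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n ** 2  # chance agreement
    kappa = (oa - pe) / (1 - pe)
    return oa, ca, aa, kappa
```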

1) IP Dataset:
On the IP dataset, 360 labeled samples are used for each method, and the averaged results are shown in Table V.
From Table V, we can see that our proposed PDAC-ASPL method achieves the best performance in terms of OA, AA, and κ. Due to the joint use of SPL and AL, the proposed method is able to select high-confidence samples for all categories, which enlarges the training set and increases the classification performance. Hence, the proposed PDAC-ASPL yields the best average accuracy over all categories. In addition, for the "Grass-pasture-mowed" and "Oats" categories with limited labeled samples, the PDAC-ASPL and FAAL methods correctly classify all samples, which demonstrates that the AL strategy can select informative samples in these categories to improve the small-sample classification performance. Fig. 8 visually shows the classification maps of different methods. It can be clearly seen that the classification maps of FAAL and PDAC-ASPL are much more consistent with the ground truth than the maps of the other methods. Compared with FAAL, our PDAC-ASPL provides much better results in the "Soybean-mintill" category shown in blackish green color.
2) UP Dataset: On the UP dataset, 245 labeled samples are used for each method. Table VI shows the per-class accuracy, OA, AA, and κ of different methods, where the bolded values indicate the best results. It can be seen that our method achieves optimal performance in terms of OA, AA, and κ using only 245 (0.6%) labeled samples. The higher κ coefficient shows that the predicted labels of our PDAC-ASPL are highly consistent with the true labels. Compared with the two AL-based methods, the proposed PDAC-ASPL shows 19% and 6% improvement in OA over MVSS-AL and FAAL, respectively. This demonstrates that both the AL strategy for selecting informative samples and the SPL strategy for selecting high-confidence samples in our PDAC-ASPL are effective. Compared with classical DL methods, such as HybridSN and SSRN, our PDAC-ASPL also improves the OA by at least 3%. Compared with recently proposed transformer-based methods, the PDAC-ASPL improves the OA by at least 4.5%. For the UP data, the categories "Asphalt" and "Bitumen" are similar materials and are difficult to distinguish from each other. MVSS-AL shows poor results on these two categories. FAAL can classify the "Bitumen" category well but performs poorly on the "Asphalt" category. In contrast, the proposed PDAC-ASPL provides consistently good results on both categories. Fig. 9 visually shows the classification maps of different methods. It can be seen that the map of PDAC-ASPL is more similar to the ground truth map than the maps of the comparison methods. In particular, for the categories "Bare Soil" in blackish green color and "Bricks" in green color, our method shows much better results.

3) SA Dataset: Table VII shows the per-class accuracy, OA, AA, and κ of different methods on the SA dataset, where the bolded values indicate the best results. It can be seen that our method shows excellent performance in multiple classes. For categories with similar materials (e.g., categories 11-14 are subclasses of "Lettuce"), the proposed PDAC-ASPL provides the best overall results. For categories with similar characteristics (e.g., category 8 "Grape untrained" and category 15 "Vinyard untrained" are spatially adjacent and similar), our PDAC-ASPL also shows the best results on these two categories. The excellent performance of the proposed method in distinguishing pixels with similar materials or characteristics demonstrates that the samples selected by AL and SPL are representative and discriminative.

4) HHK Dataset: Table VIII shows the per-class accuracy, OA, AA, and κ of different methods on the HHK dataset. It can be seen that our method also produces the best overall results. As the spectral resolution of the HHK image is very high, some traditional algorithms (e.g., SVM and DBMA) also show good classification performance on this dataset. However, our algorithm achieves at least a 1.42% improvement in OA compared to the more advanced AL algorithms.

5) Effect of Training Rounds:
In the experiments, the proposed PDAC-ASPL method is trained in five rounds. To illustrate the accuracy of the PDAC-ASPL model at each training round more intuitively, we conduct experiments on the four datasets separately, and the results are shown in Fig. 11. It can be seen that the accuracy on the four datasets gradually increases as the number of training rounds increases, which indicates that the samples selected by AL and SPL in each round are useful. In addition, we can see that the OA increases slowly after four rounds of training because the model has been well trained using the selected high-confidence and informative samples.

A. Computational Time for Each Method
The computational times (in seconds) of the PDAC-ASPL and the other ten comparative methods on the four hyperspectral datasets are shown in Table IX. All methods were trained on the same device to obtain the computational times. For the eight methods other than the AL-based ones, the time includes model training and testing. For the three AL-based methods, the time refers to the total time of classifier training, testing, sample selection, and iterative retraining.
According to Table IX, the overall computational time of traditional DL methods is short, and our method runs at least 739.28 s faster than the two small-sample HSI classification methods (i.e., RODA and 3DLSN) on the four datasets. On the HHK dataset, which has fewer samples than the other datasets, our method is much faster than the existing AL methods MVSS-AL and FAAL. However, since our method needs to calculate the loss between the predicted one-hot label and the predicted probabilities for every unlabeled sample in the SPL step, its computational time grows with the size of the unlabeled set.

B. Ablation Experiment
The proposed PDAC-ASPL contains three main modules, i.e., the PDAC, AL, and SPL modules. The PDAC modifies the original PDC by embedding the SE-based spectral and spatial attention mechanism (Att) into different PDC blocks. To verify the validity of each module in the proposed PDAC-ASPL, we conduct experiments on the IP dataset and show the results in Table X. For the PDC-based classification, we randomly select 360 labeled samples as the training set and use the remaining samples as the testing set. For PDAC with AL or SPL, we take ten randomly selected samples per class (160 in total) as the initial training set and select 50 samples per round for AL. It can be seen that the attention mechanism improves the OA of the original PDC network by 4.66%. By gradually adding the AL and SPL strategies, the OA is further increased by 1.58% and 1.9%, respectively. These results demonstrate that each module contributes to the overall performance of the model.

1) Training Round of SPL:
The whole PDAC-ASPL model is trained in five rounds, in which the first round is the initial training. After the first round, the samples that meet the AL and SPL criteria, with their corresponding labels/pseudo labels, are selected and added to the second round of training, and so on, until the end of the five rounds. However, one possible problem of adding pseudo labels right after the first round is that the model is easily underfitted due to the small number of initial training samples. That is, the prediction ability of the model is poor in the first round and the confidence of the samples is low, so the new samples and corresponding pseudo labels added at this time are close to noise, which will degrade subsequent model training.
It is therefore important to choose an appropriate time to start adding pseudo labels. We analyze the OA of PDAC-ASPL on the IP dataset when SPL is first used from the second, third, fourth, and fifth rounds, respectively. The results are shown in Table XI. When SPL is added in the second round, the OA is 88.05%. When SPL is added beginning from the third round, the OA is 89.81%, which demonstrates that SPL plays its role once the model is reasonably well fitted. When SPL is added beginning from the fourth or fifth round, the AL model has already been relatively well trained, and the SPL samples with their pseudo labels bring no significant improvement. Therefore, we choose to add SPL from the third round on all four datasets, taking into account both the training effect of the model and the confidence of the samples.
2) Threshold λ in SPL: The threshold parameter λ determines the number of samples selected by SPL. If it is too large, SPL is likely to select more samples with low confidence. If it is too small, SPL will select very few samples or even no samples. Fig. 12 shows the OA of PDAC-ASPL versus parameter λ, where λ changes in the set $\{10^{-6}, 5 \times 10^{-6}, 10^{-5}, 10^{-4}\}$.
As shown in Fig. 12, when λ is $1 \times 10^{-6}$, the OA is low because the threshold is too small, resulting in insufficient samples being selected; at this point, SPL does not work, which can also be seen from Table X. On the contrary, when λ is $1 \times 10^{-4}$, the threshold is too large and too many low-confidence samples are included, and thus the model accuracy decreases. In the experiments, we set the value of λ to $5 \times 10^{-6}$ for all four datasets.

3) Class Budget M in SPL:
Considering the problem of unbalanced sample classes, we design a class budget M, i.e., selecting at most M samples in each class. Fig. 13 shows the OA versus the class budget M, where M is chosen from the set {50, 100, 150, 200, 250, 300}. As the samples selected in SPL are not removed from the unlabeled set, part of the selected M samples in different training rounds can be the same. As shown in Fig. 13, when the number of samples selected for each class is insufficient, the model does not achieve the desired performance. When too many samples are selected from each category, the OA decreases due to the inclusion of more low-confidence samples.
In the experiments, we set the class budget M as 100 for four datasets.

VI. CONCLUSION
In this article, we proposed a PDAC-ASPL model for small-sample HSI classification. The proposed model effectively integrates DL for spatial-spectral feature extraction and classification, AL for the selection of informative samples with true labels, and SPL for the selection of high-confidence samples with pseudo labels. Due to the gradual and effective transfer of samples from the unlabeled set to the training set, the network performance is dramatically improved even with very limited labeled samples for initial training. Experimental results on four HSI datasets show that the proposed method can obtain better classification results with fewer labeled samples.
In future research, we may consider expanding the application range of this semisupervised method. Specifically, we may further apply it to large-region or cross-scene image classification problems.