A Survey: Deep Learning for Hyperspectral Image Classification with Few Labeled Samples

With the rapid development of deep learning technology and improvement in computing capability, deep learning has been widely used in the field of hyperspectral image (HSI) classification. In general, deep learning models often contain many trainable parameters and require a massive number of labeled samples to achieve optimal performance. However, in HSI classification, a large number of labeled samples is generally difficult to acquire because manual labeling is difficult and time-consuming. Therefore, many research works focus on building deep learning models for HSI classification with few labeled samples. In this article, we concentrate on this topic and provide a systematic review of the relevant literature. Specifically, the contributions of this paper are twofold. First, the research progress of related methods is categorized according to the learning paradigm, including transfer learning, active learning and few-shot learning. Second, a number of experiments with various state-of-the-art approaches have been carried out, and the results are summarized to reveal potential research directions. More importantly, although there is a vast gap between deep learning models (which usually need sufficient labeled samples) and the HSI scenario with few labeled samples, the small-sample issue can be well characterized by fusing deep learning methods with related techniques, such as transfer learning and lightweight models. For reproducibility, the source code of the methods assessed in the paper can be found at https://github.com/ShuGuoJ/HSI-Classification.git.

few-shot learning

Introduction
Hyperspectral remote sensing technology is a method that organically combines the spectrum of ground objects determined by their unique material composition with the spatial image reflecting the shape, texture and layout of ground objects, to realize the accurate detection, recognition and attribute analysis of ground objects. The resultant hyperspectral images (HSIs) not only contain abundant spectral information reflecting the unique physical properties of the ground features but also provide rich spatial information of the ground features. Therefore, HSIs can be utilized to solve problems that cannot be solved well in multispectral or natural images, such as the precise identification of each pixel. Since different materials exhibit specific spectral characteristics, the classification performance of HSI can be more accurate. Due to these advantages, hyperspectral remote sensing has been widely used in many applications, such as precision agriculture [1], crop monitoring [2], and land resources [3,4]. In environmental protection, HSI has been employed to detect gas [5], oil spills [6], water quality [7,8] and vegetation coverage [9,10], to better protect our living environment. In the medical field, HSI has been utilized for skin testing to examine the health of human skin [11].
As a general pattern recognition problem, HSI classification has received a substantial amount of attention, and a large number of research results have been achieved in the past several decades. According to previous work [12], existing studies can be divided into spectral-feature methods, spatial-feature methods, and spectral-spatial-feature methods. The spectral feature is the primitive characteristic of the hyperspectral image and is also called the spectral vector or spectral curve. The spatial feature [13] describes the relationship between the central pixel and its context, which can greatly increase the robustness of the model. In the early period of the study of HSI classification, researchers mainly focused on purely spectral feature-based methods, which simply apply classifiers, such as support vector machines (SVM) [14], neural networks [15], and logistic regression [16], to pixel vectors to obtain classification results without any feature extraction. However, raw spectra contain much redundant information, and the relation between spectra and ground objects is nonlinear, which increases the difficulty of classification. Therefore, most later methods give more attention to dimension reduction and feature extraction to learn more discriminative features. Among the approaches based on dimension reduction, principal component analysis [17], independent component analysis [18], linear discriminant analysis [19], and low-rank methods [20] are widely used. Nevertheless, the performance of those models is still unsatisfactory, because in hyperspectral images different surface objects may have the same spectral characteristics and, conversely, the same surface objects may have different spectral characteristics. This variability of the spectra of ground objects is caused by illumination, environmental, atmospheric, and temporal conditions, and it increases the probability of misclassification.
Thus, methods that rely only on spectral information and ignore spatial information yield unsatisfactory classification performance. The spatial characteristics of ground objects supply abundant information about their shape, context, and layout, and neighboring pixels belong to the same class with high probability, which is useful for improving the classification accuracy and robustness of methods. Consequently, a large number of feature extraction methods that integrate spatial structural and texture information with spectral features have been developed, including morphological [21,22,23], filtering [24,25], and coding [26] methods. Since this paper is mainly concerned with deep learning-based methods, readers are referred to [27] for more details on these conventional techniques.
In the past decade, deep learning technology has developed rapidly and received widespread attention. Compared with traditional machine learning models, deep learning does not require manually designed feature patterns and can automatically learn patterns from data. Therefore, it has been successfully applied in the fields of natural language processing, speech recognition, semantic segmentation, autonomous driving, and object detection, and has achieved excellent performance. Recently, it has also been introduced into the field of HSI classification. Researchers have proposed a number of new deep learning-based HSI classification approaches, as shown in the left part of Figure 2. Currently, all methods based on the joint spectral-spatial feature can be divided into two categories, two-stream and single-stream, according to whether they extract the joint spectral-spatial feature simultaneously. A two-stream architecture usually includes two branches: a spectral branch and a spatial branch. The former extracts the spectral feature of the pixel, and the latter captures the spatial relation between the central pixel and its neighboring pixels. Existing methods have covered all kinds of deep learning modules, such as the fully connected layer, the convolutional layer, and the recurrent unit.
In the general deep learning framework, a large number of training samples should be provided to train the model well and tune its numerous parameters. However, in practice, manual labeling is often very time-consuming and expensive due to the need for expert knowledge, and thus a sufficient training set is often unavailable. As shown in Figure 1 (here the widely used Kennedy Space Center (KSC) hyperspectral image is utilized for illustration), the left figure randomly selects 10 samples per class and contains 130 labeled samples in total, which are very scattered and can hardly be seen. In contrast, the right figure in Figure 1 displays 50% of the labeled samples, which is more suitable for deep learning-based methods. Hence, there is a vast gap between the training samples required by deep learning models and the labeled samples that can be collected in practice. Many learning paradigms have been proposed to solve the problem of few labeled samples, as shown in the right part of Figure 2. We discuss them in detail in Section 3. They can be integrated with any model architecture. Some pioneering works such as [28] opened the topic by training a deep model with good generalization using only a few labeled samples.
However, there are still many challenges for this topic.
Figure 1: Illustration of the massive gap between practical situations (i.e., few labeled samples) and the large number of labeled samples required by deep learning-based methods. Here, the widely used Kennedy Space Center (KSC) hyperspectral image is employed, which contains 13 land covers and 5211 labeled samples (detailed information can be found in the experimental section). Generally, sufficient samples are required to train a deep learning model well (as illustrated in the right figure), which is hard to achieve in practice due to the difficulty of manual labeling (as shown in the left figure).
In this paper, we hope to provide a comprehensive review of the state-of-the-art deep learning-based methods for HSI classification with few labeled samples. First, instead of separating the various methods according to the feature fusion manner, such as spectral-based, spatial-based, and joint spectral-spatial-based methods, the research progress of methods related to few training samples is categorized according to the learning paradigm, including transfer learning, active learning, and few-shot learning. Second, a number of experiments with various state-of-the-art approaches have been carried out, and the results are summarized to reveal potential research directions. Further, it should be noted that, unlike the previous review papers [12,29], this paper mainly focuses on the few-labeled-sample issue, which is considered the most challenging problem in the HSI classification scenario. For reproducibility, the source code of the methods evaluated in the paper can be found at the website for the paper (https://github.com/ShuGuoJ/HSI-Classification.git).
The remainder of this paper is organized as follows. Section 2 introduces the deep models that have been popular in recent years. In Section 3, we divide the previous works into mainstream learning paradigms, including transfer learning, active learning, and few-shot learning. In Section 4, we report extensive experiments in which a number of representative deep learning-based classification methods are compared on several real hyperspectral image data sets. Finally, conclusions and suggestions are provided in Section 5.

Deep learning models for HSI classification
In this section, three classical deep learning models, including the autoencoder, convolutional neural network (CNN), and recurrent neural network (RNN), for HSI classification are respectively described, and the relevant references are reviewed.

Autoencoder for HSI classification
An autoencoder [30] is a classic neural network that consists of two parts: an encoder and a decoder. The encoder $p_{encoder}(h \mid x)$ maps the input $x$ to a hidden representation $h$, and the decoder $p_{decoder}(x \mid h)$ then reconstructs $\hat{x}$ from $h$. The aim is to make the input and output as similar as possible. The loss function can be formulated as follows:
$$\min L(x, \hat{x}),$$
where $L$ is the similarity measure. If the dimension of $h$ is smaller than that of $x$, the autoencoder is undercomplete and can be used to reduce the data dimension. Evidently, if there is no constraint on $h$, the autoencoder can simply become the identity function. In other words, the network does not learn anything.
To avoid such a situation, the usual approach is to add a regularization term $\Omega(h)$ to the loss. In [31?], the regularized autoencoder, referred to as a sparse autoencoder, uses $\Omega(h) = \lambda \sum_i |h_i|$, which drives most components of the hidden representation very close to zero. Therefore, it is equipped with a certain degree of noise immunity and can produce a sparse representation of the input. Another way to avoid the identity mapping is to add some noise to $x$ to obtain the corrupted input $x_{noise}$ and then force the decoder to reconstruct the original $x$. In this situation, the model becomes the denoising autoencoder [32], which can remove the additional noise from $x_{noise}$ and produce a powerful hidden representation of the input. In general, the autoencoder plays the role of a feature extractor [33] that learns the internal pattern of data without labeled samples. Figure 3 illustrates the basic architecture of the autoencoder model. Building on this idea, Chen et al. [34] used an autoencoder for the first time for feature extraction and classification of HSIs. First, in the pretraining stage, the spectral vector of each pixel is fed directly into the encoder module, and the decoder is then used to reconstruct it so that the encoder acquires the ability to extract spectral features. To obtain the spatial features, principal component analysis (PCA) is utilized to reduce the dimensionality of the hyperspectral image, and the image patch is then flattened into a vector. Another autoencoder is employed to learn the spatial features. Finally, the spatial-spectral joint information obtained above is fused and classified. Subsequently, a large number of hyperspectral image classification methods [35,36] based on autoencoders appeared. Most of these methods adopt the same training strategy as [34], which is divided into two stages: fully training the encoder in an unsupervised manner and fine-tuning the classifier in a supervised manner.
Each of these methods tries different types of encoders or preprocessing methods to adapt to HSI classification under the condition of small samples. For example, Xing et al. [36] stack multiple denoising autoencoders to form a feature extractor, which has a stronger anti-noise ability and extracts more robust representations. Given that the same ground objects may have different spectra while different ground objects may exhibit similar spectra, spectral-based classification methods often fail to achieve satisfactory performance, and the spatial structural information of objects provides an effective supplement. To gain a better spatial description of an object, some autoencoder models combined with convolutional neural networks (CNNs) have been developed [37,38]. Concretely, the autoencoder module is able to extract spectral features from large sets of unlabeled samples, while the CNN has proven able to extract spatial features well. After fusion, the spatial-spectral features are obtained. Further, to reduce the number of trainable parameters, some researchers use lightweight models, such as SVMs [39,40], random forests [41,42] or logistic regression [34,43], as the classifier.
Due to the three-dimensional (3D) pattern of hyperspectral images, it is desirable to investigate the spectral and spatial information simultaneously so that the joint spatial-spectral correlation can be better examined. Some three-dimensional operators and methods have been proposed. In the preprocessing stage, Li et al. [44] utilized the 3D Gabor operator to fuse spatial information and spectral information into spatial-spectral joint features, which were then fed into the autoencoder to obtain more abstract features. Mei et al. [40] used a 3D convolutional operator to construct an autoencoder that extracts spatial-spectral features directly. In addition, image segmentation has been introduced to characterize the region structure of objects and avoid misclassification of pixels at the boundary [45]. Accordingly, Liu et al. [46] utilized superpixel segmentation as a postprocessing method to perform boundary regularization on the classification map.
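The sparse and denoising objectives described in this subsection can be illustrated with a short sketch. It is not the implementation of any cited method: the encoder and decoder are plain linear maps without activations, the reconstruction error is the mean squared error, and the names `autoencoder_loss`, `lam` and `noise_std` are hypothetical choices made for illustration only.

```python
import random


def linear(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def reconstruction_error(x, x_hat):
    """Mean squared error between the clean input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def sparsity_penalty(h, lam):
    """L1 penalty on the hidden code: lam * sum_i |h_i|."""
    return lam * sum(abs(v) for v in h)


def corrupt(x, noise_std, rng):
    """Add Gaussian noise to the input (denoising-autoencoder setting)."""
    return [v + rng.gauss(0.0, noise_std) for v in x]


def autoencoder_loss(x, W_enc, W_dec, lam=0.0, noise_std=0.0, seed=0):
    """Loss of a linear autoencoder (an assumed simplification).

    lam > 0 adds the sparsity penalty; noise_std > 0 corrupts the encoder
    input while the reconstruction target stays the clean x.
    """
    rng = random.Random(seed)
    x_in = corrupt(x, noise_std, rng) if noise_std > 0 else list(x)
    h = linear(W_enc, x_in)      # encoder: x -> h
    x_hat = linear(W_dec, h)     # decoder: h -> x_hat
    return reconstruction_error(x, x_hat) + sparsity_penalty(h, lam)


# A 4-band "spectrum" compressed to a 2-dimensional code (undercomplete case).
x = [1.0, 2.0, 3.0, 4.0]
W_enc = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
W_dec = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
plain = autoencoder_loss(x, W_enc, W_dec)
sparse = autoencoder_loss(x, W_enc, W_dec, lam=0.1)
denoising = autoencoder_loss(x, W_enc, W_dec, noise_std=0.5)
```

In practice the encoder and decoder would be nonlinear networks trained by gradient descent on many unlabeled pixels; the sketch only shows how the three objectives differ.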

Convolutional Neural Networks (CNNs) for HSI classification
In theory, the CNN uses a group of parameters, referred to as a kernel function or kernel, to scan the image and produce a specified feature. It has three main characteristics that make it very powerful for feature representation, and thus the CNN has been successfully applied in many research fields. The first is local connectivity, which greatly decreases the number of trainable parameters and makes the network suitable for processing large images. This is the most obvious difference from the fully connected network, which has full connections between two neighboring layers and is ill-suited to large images. The second characteristic is parameter sharing: to further reduce the number of parameters, the same convolutional kernel applies the same parameters at every position, whereas in a traditional neural network the parameters of each output are independent of each other. Parameter sharing leads to the third characteristic, shift invariance: even if the feature of an object has shifted from one position to another, the CNN model can still capture it regardless of where it appears. Specifically, a common convolutional layer consists of three components: linear mapping, the activation function and the pooling function. As in other modern neural network architectures, activation functions are used to introduce nonlinearity into the network; generally, the rectified linear unit (ReLU) is the preferred choice. Pooling uses a statistic of a local region to represent the output at a specified position. Taking max pooling as an example, it replaces each input region with its maximum value. Clearly, the pooling operation is robust to small changes and noise interference, which are smoothed out in the output, and thus more abstract features can be preserved.
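The three components of a convolutional layer can be made concrete with a minimal single-channel sketch. The helper names (`conv2d_valid`, `relu`, `max_pool`), the toy 5 x 5 image and the edge kernel are assumptions chosen for illustration; real HSI models operate on multi-band patches with learned kernels.

```python
def conv2d_valid(image, kernel):
    """Slide one shared kernel over the image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(kernel[a][b] * image[i + a][j + b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def relu(feature_map):
    """Rectified linear unit applied element-wise."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool(feature_map, size=2):
    """Non-overlapping max pooling with a size x size window."""
    rows = range(0, len(feature_map) - size + 1, size)
    cols = range(0, len(feature_map[0]) - size + 1, size)
    return [
        [
            max(feature_map[i + a][j + b]
                for a in range(size) for b in range(size))
            for j in cols
        ]
        for i in rows
    ]


# A vertical edge, and the same edge shifted one pixel to the left.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
shifted = [[0, 1, 1, 1, 1] for _ in range(5)]
kernel = [[-1, 1], [-1, 1]]  # responds to a dark-to-bright vertical edge

pooled = max_pool(relu(conv2d_valid(image, kernel)))
pooled_shifted = max_pool(relu(conv2d_valid(shifted, kernel)))
```

Here the same four kernel weights are reused at every position (parameter sharing), the raw feature maps of the two images differ, and the pooled outputs coincide, which illustrates the robustness of pooling to a small shift.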
In the early works applying CNNs to HSI classification, two-dimensional convolution was the most widely used method, mainly to extract spatial texture information [47,28,48], but the redundant bands greatly enlarge the size of the convolutional kernel, especially along the channel dimension. Later, a combination of one-dimensional and two-dimensional convolution appeared [49] to solve this problem. Concretely, one-dimensional and two-dimensional convolutions are responsible for extracting spectral and spatial features, respectively. The two types of features are then fused before being input to the classifier. For the small training sample problem, insufficient labeled samples make it difficult for CNNs to learn effective features. For this reason, researchers have introduced traditional machine learning methods, such as attribute profiles [50], the gray-level co-occurrence matrix (GLCM) [51], hash learning [52], and Markov random fields [53], to bring prior information into the convolutional network and improve its performance. Similar to the trend of autoencoder-based classification methods, three-dimensional CNN models have also been applied to HSI classification in recent years and have shown better feature fusion capabilities [54,55]. However, due to its large number of parameters, three-dimensional convolution is not suitable for solving small-sample classification problems under supervised learning. To reduce the number of parameters of 3D convolution, Fang et al. [56] designed a 3D separable convolution. In contrast, Mou et al. [57,58] introduced an autoencoder scheme into the three-dimensional convolution module to solve this problem. By combining it with the classic autoencoder training method, the three-dimensional convolutional autoencoder can be trained in an unsupervised manner; the decoder is then replaced with a classifier, while the parameters of the encoder are frozen.
Finally, a small classifier is trained by supervised learning. Moreover, due to the success of ResNet [59], scholars have studied the HSI classification problem based on convolutional residual networks [57,58,60,61]. These methods use skip connections to enable the network to learn complex features with a small number of labeled samples. Similarly, CNNs with dense connections have also been introduced into this field [62,63]. In addition, the attention mechanism is another hotspot for fully mining sample features. Concretely, Haut and Xiong et al. [64,65] incorporated the attention mechanism into CNNs for HSI classification. Although the above models can work well on HSIs, they cannot overcome the low spatial resolution of HSIs, which may cause mixed pixels. To make up for this shortcoming, multimodal CNN models have been proposed. These methods [66,67,68] combine HSIs and LiDAR data to increase the discriminability of sample features. Moreover, to achieve good performance in the small-sample scenario, Yu et al. [28] enlarged the training set through data augmentation by rotation and flipping. On the one hand, this method increases the number of samples and improves their diversity. On the other hand, it enhances the rotation invariance of the model, which is important in some fields such as remote sensing. Subsequently, Li et al. [69,70] designed a data augmentation scheme for HSI classification. They combined the samples in pairs so that the model no longer learns the characteristics of the samples themselves but learns the differences between the samples. The different combinations enlarge the training set, which is more conducive to model training.
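Augmentation by rotation and flipping can be sketched as follows. This is an illustrative assumption rather than the exact procedure of [28]: the function `augment` (a hypothetical name) returns the eight variants obtained from the four 90-degree rotations and their horizontal mirrors. Each cell of the patch may hold a scalar or a whole spectral vector, since cells are only moved, never modified.

```python
def rot90(patch):
    """Rotate a 2D patch 90 degrees clockwise."""
    return [list(row) for row in zip(*patch[::-1])]


def flip_lr(patch):
    """Mirror a 2D patch left to right."""
    return [list(row[::-1]) for row in patch]


def augment(patch):
    """Return the 8 rotated / flipped variants of a square patch."""
    variants = []
    current = [list(row) for row in patch]
    for _ in range(4):
        variants.append(current)
        variants.append(flip_lr(current))
        current = rot90(current)
    return variants


# One labeled 2 x 2 patch yields eight training patches with the same label.
patch = [[1, 2],
         [3, 4]]
variants = augment(patch)
```

Because only the spatial arrangement changes, the spectral content of every pixel is preserved, so the class label of the central pixel of a square patch with odd side length remains valid for all variants.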

Recurrent neural network (RNN) for HSI classification
Compared with other forms of neural networks, recurrent neural networks (RNNs) [71] have memory capabilities and can record the context information of sequential data. Because of this memory characteristic, recurrent neural networks are widely used in tasks such as speech recognition and machine translation. More precisely, the input of a recurrent neural network is usually a sequence of vectors. At each time step $t$, the network receives an element $x_t$ of the sequence and the state $h_{t-1}$ of the previous time step, and produces an output $y_t$ and a state $h_t$ representing the context information at the current moment. This process can be formulated as:
$$h_t = f(W_{xh} x_t + W_{hh} h_{t-1} + b),$$
where $W_{xh}$ represents the weight matrix from the input layer to the hidden layer, $W_{hh}$ denotes the state transition weight in the hidden layer, $b$ is the bias, and $f$ is a nonlinear activation function such as tanh. It can be seen that the current state of the recurrent neural network is controlled by both the state of the previous time step and the current input. This mechanism allows the recurrent neural network to implicitly capture the contextual semantic information between the input vectors. For example, in the machine translation task, it enables the network to understand the semantic relationship between words in a sentence. However, the classic RNN is prone to gradient explosion or gradient vanishing problems during training. When the input sequence is too long, the derivative chain of the RNN becomes too long, making the gradient value approach infinity or zero. Therefore, the classic RNN model is replaced by a long short-term memory (LSTM) network [71] or a gated recurrent unit (GRU) [72] in the HSI classification task.
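The recurrence can be sketched in a few lines. The sketch assumes a tanh activation and treats a pixel spectrum as a sequence with one band per time step; both choices, as well as the names `rnn_step` and `encode_spectrum`, are illustrative assumptions rather than the design of a specific cited method.

```python
import math


def rnn_step(x_t, h_prev, W_xh, W_hh, b):
    """One recurrent update: h_t = tanh(W_xh x_t + W_hh h_prev + b)."""
    h_t = []
    for i in range(len(b)):
        pre = b[i]
        pre += sum(W_xh[i][j] * x_t[j] for j in range(len(x_t)))
        pre += sum(W_hh[i][k] * h_prev[k] for k in range(len(h_prev)))
        h_t.append(math.tanh(pre))
    return h_t


def encode_spectrum(spectrum, W_xh, W_hh, b):
    """Feed a spectrum band by band and return the final hidden state."""
    h = [0.0] * len(b)  # initial state h_0
    for band_value in spectrum:
        h = rnn_step([band_value], h, W_xh, W_hh, b)
    return h


# Toy example: a 2-band spectrum and a single hidden unit.
final_state = encode_spectrum([1.0, 0.0], W_xh=[[1.0]], W_hh=[[1.0]], b=[0.0])
```

The final state depends on the first band only through the previous state, which is how context is carried along the sequence; the same weights are reused at every time step.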
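Both LSTM and GRU use gating to filter the input and the previous state so that the network can forget unnecessary information and retain the most valuable context. LSTM maintains an internal memory state and has three gates: the input gate $i_t$, the forget gate $f_t$ and the output gate $o_t$, which are formulated as:
$$i_t = \sigma(W_i [x_t, h_{t-1}] + b_i), \quad f_t = \sigma(W_f [x_t, h_{t-1}] + b_f), \quad o_t = \sigma(W_o [x_t, h_{t-1}] + b_o),$$
where $\sigma$ is the sigmoid function, $[x_t, h_{t-1}]$ denotes the concatenation of the current input and the previous state, and $W$ and $b$ with the corresponding subscripts are the weights and biases of each gate. It can be seen that the three gates are generated from the current input and the previous state. First, the current input and the previous state are concatenated and mapped to a new input $g_t$ according to the following formula:
$$g_t = \tanh(W_g [x_t, h_{t-1}] + b_g).$$
Subsequently, the input gate, the forget gate, the new input $g_t$ and the previous internal memory state $\hat{h}_{t-1}$ together update the internal memory state, $\hat{h}_t = f_t \odot \hat{h}_{t-1} + i_t \odot g_t$. In this process, LSTM discards invalid information and adds new semantic information.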
Finally, the new internal memory state is filtered by the output gate to form the output of the current time step:
$$h_t = o_t \odot \tanh(\hat{h}_t).$$
Concerning HSI processing, the spectrum of each pixel is a high-dimensional vector and can be regarded as a sequence of data. There are many works using LSTM for HSI classification tasks. For instance, Mou et al. [73] proposed an LSTM-based HSI classification method for the first time, and their work focused only on spectral information. For each sample pixel vector, the bands are input into the LSTM step by step. To improve the performance of the model, spatial information is considered in subsequent research. For example, Liu et al. fully considered the spatial neighborhood of the sample and used a multilayer LSTM to extract spatial-spectral features [74]. Specifically, at each time step, the sampling points of the neighborhood are sequentially input into the network to deeply mine the context information in the spatial neighborhood. In [75], Zhou et al. used two LSTMs to extract spectral features and spatial features. In particular, for the extraction of spatial features, PCA is first used to extract principal components from the rectangular spatial neighborhood of the sample. Then, the first principal component is divided into several lines to form a sequence, which is gradually input into the network. In contrast, Ma and Zhang et al. [76,77] measure the similarity between each sample point in the spatial neighborhood and the center point. The sample points in the neighborhood are reordered according to the similarity and then input into the network step by step. This approach allows the network to focus on learning sample points that are highly similar to the center point, and the memory of the internal hidden state can thus be enhanced. Pan et al. [78] proposed an effective tiny model for spectral-spatial classification of HSIs based on a single gated recurrent unit (GRU).
In this work, the rectangular space neighborhood is flattened into a vector, which is used to initialize the hidden vector h 0 of GRU, and the center point pixel vector is input into the network to learn features.
In addition, Wu and Saurabh argue that it is difficult to mine the internal features of the sample by directly inputting a single original spectral vector into the RNN [79,80]. The authors use a one-dimensional convolution operator to extract multiple feature vectors from the spectral vector, which form a feature sequence and are then input to the RNN. Finally, the fully connected layer and the softmax function are adopted to obtain the classification result. It can be seen that using only recurrent neural networks or one-dimensional convolution to extract the spatial-spectral joint features is not efficient because it causes the loss of spatial structure information. Therefore, some researchers combine two-dimensional or three-dimensional CNNs with an RNN and use convolution operators to extract spatial-spectral joint features. For example, Hao et al. [81] utilized U-Net to extract features and input them into an LSTM or GRU so that the contextual information between features could be explored. Moreover, Shi et al. [82] introduced the concept of the directional sequence to fully extract the spatial structure information of HSIs. First, the rectangular area around the sampling point is divided into nine overlapping patches. Second, each patch is mapped to a set of feature vectors through a three-dimensional convolutional network, and the relative positions of the patches generate 8 combinations of directions (for example, top, middle, bottom, left, center, and right) to form a direction sequence. Finally, the sequence is input into the LSTM or GRU to obtain the classification result. In this way, the spatial distribution and structural characteristics of the features can be explored.
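The gating mechanism of the LSTM described earlier in this subsection can be sketched as a single cell update. The sketch follows the standard LSTM formulation, which is assumed here; the parameter names (`W_i`, `b_i`, and so on) and the function `lstm_step` are hypothetical, and the variable `c` plays the role of the internal memory state.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, b, v):
    """Compute W v + b for a matrix W (list of rows) and vectors b, v."""
    return [sum(w * a for w, a in zip(row, v)) + b_i for row, b_i in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM update (standard formulation, assumed for illustration)."""
    z = list(x_t) + list(h_prev)  # concatenate input and previous state
    i = [sigmoid(v) for v in affine(p["W_i"], p["b_i"], z)]    # input gate
    f = [sigmoid(v) for v in affine(p["W_f"], p["b_f"], z)]    # forget gate
    o = [sigmoid(v) for v in affine(p["W_o"], p["b_o"], z)]    # output gate
    g = [math.tanh(v) for v in affine(p["W_g"], p["b_g"], z)]  # new input
    # Forget part of the old memory and write the gated new input.
    c = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, g)]
    # Filter the updated memory with the output gate.
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c


# With all-zero parameters every gate equals 0.5 and the new input is 0.
zero = {name: [[0.0, 0.0]] for name in ("W_i", "W_f", "W_o", "W_g")}
zero.update({name: [0.0] for name in ("b_i", "b_f", "b_o", "b_g")})
h, c = lstm_step(x_t=[1.0], h_prev=[0.0], c_prev=[2.0], p=zero)
```

In the toy call the forget gate halves the stored memory and nothing new is written, which makes the roles of the forget and output gates easy to check by hand.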

Deep learning paradigms for HSI classification with few labeled samples
Although different HSI classification methods have different specific designs, they all follow some learning paradigms. In this section, we mainly introduce several learning paradigms that are applied to HSI classification with few labeled training samples. These learning paradigms are based on specific learning theories. We hope to provide a general guide for researchers to design algorithms.

Deep Transfer Learning for HSI classification
Transfer learning [83] is an effective method to deal with the small-sample problem. Transfer learning tries to transfer knowledge learned from one domain to another. It involves two data sets (domains): the source domain, which contains abundant labeled samples, and the target domain, which contains only a few labeled samples. To facilitate the subsequent description, we define the source domain as $D_s$, the target domain as $D_t$, and their label spaces as $Y_s$ and $Y_t$, respectively. Usually, the data distributions of the source domain and the target domain are inconsistent: $P(X_s) \neq P(X_t)$. Therefore, the purpose of transfer learning is to use the knowledge learned from $D_s$ to identify the labels of samples in $D_t$.
Fine-tuning is a general method in transfer learning that uses $D_s$ to train the model and $D_t$ to adjust it. Its original motivation is to reduce the number of samples needed during training. Since deep learning models generally contain a vast number of parameters, a model trained directly on the target domain $D_t$ easily overfits and performs poorly in practice. Fine-tuning, however, first brings the model parameters to a suboptimal state, from which a small number of training samples of the target domain can tune the model to reach the optimal state. It involves two steps. First, the specific model is fully trained on the source domain $D_s$ with abundant labeled samples so that the model parameters arrive at a good state. Then, the model, except for some task-related modules, is transferred to the target domain $D_t$ and slightly tuned on $D_t$ so that it fits the data distribution of the target domain. Because the fine-tuning method is relatively simple, it is widely used in transfer learning for hyperspectral image classification. To our knowledge, Yang et al. [84] were the first to combine deep learning with transfer learning to classify hyperspectral images. The model consists of two convolutional neural networks, which are used to extract spectral features and spatial features. The joint spectral-spatial feature is then input into the fully connected layer to obtain the final result. Following the fine-tuning procedure, the model is first fully trained on the hyperspectral image of the source domain. Next, the fully connected layer is replaced, and the parameters of the convolutional networks are retained. Finally, the transferred model is trained on the target hyperspectral image to adapt to the new data distribution. The later transfer learning models based on fine-tuning basically follow this architecture [85,86,87,88]. It is worth noting that Deng et al. [89] combined transfer learning with active learning to classify HSIs.
Data distribution adaptation is another commonly used transfer learning method. The basic idea is that, in the original feature space, the data probability distributions of the source domain and the target domain are usually different. However, they can be mapped together to a common feature space in which their data probability distributions become similar. In 2014, Ghifary et al. [90] first proposed a shallow neural network-based domain adaptation model, called DaNN. The innovation of this work is that a maximum mean discrepancy (MMD) adaptation layer is added to calculate the distance between the source domain and the target domain. Moreover, the distance is merged into the loss function to reduce the difference between the two data distributions. Subsequently, Tzeng et al. [...] Two hyperspectral images from different scenes are mapped to two low-dimensional subspaces by the deep neural network, in which the samples are represented as manifolds. MMD is used to measure the distance between the two low-dimensional subspaces and is added to the loss function to make the two subspaces highly similar. In addition, the sum of the distances between samples and their neighbors is added to the loss function to ensure that the low-dimensional manifold is discriminative. Motivated by the excellent performance of generative adversarial networks (GANs), Ganin et al. [93] first introduced them into transfer learning. The network is named DANN (domain-adversarial neural network), which is different from the DaNN proposed by Ghifary et al. [90]. The generator $G_f$ and the discriminator $G_d$ compete with each other until they converge. In transfer learning, the data in one of the domains (usually the target domain) are regarded as the generated samples.
The generator aims to learn the characteristics of the target domain samples so that the discriminator cannot distinguish which domain a sample comes from, thereby achieving domain adaptation. Therefore, $G_f$ is used to represent the feature extractor here.
Elshamli et al. [94] first introduced the concept of DANN to the task of hyperspectral image classification. Compared with a general GAN, it has two discriminators. One is the class discriminator, which predicts the class labels of samples, and the other is the domain discriminator, which predicts the source of the samples. Unlike two-stage methods, DANN is an end-to-end model that performs representation learning and classification simultaneously, and it is easy to train. Furthermore, it outperforms two-stage frameworks such as the denoising autoencoder and traditional approaches such as PCA in hyperspectral image classification.
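To make the role of the MMD term in distribution adaptation more concrete, the following is a minimal sketch of a biased squared-MMD estimate between source and target features, added to a task loss. The Gaussian kernel, the `gamma` value, the toy feature vectors, the placeholder `task_loss`, and the weight `lam` are all illustrative assumptions and are not taken from [90] or any other cited work.

```python
import math

def rbf_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel between two feature vectors (kernel choice is an assumption)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def mmd_squared(source, target, gamma=1.0):
    """Biased estimate of the squared MMD between two sets of feature vectors."""
    def mean_kernel(xs, ys):
        total = sum(rbf_kernel(x, y, gamma) for x in xs for y in ys)
        return total / (len(xs) * len(ys))
    return (mean_kernel(source, source)
            + mean_kernel(target, target)
            - 2.0 * mean_kernel(source, target))

# Toy features standing in for the outputs of a shared feature extractor.
source_feats = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2]]
close_target = [[0.1, 0.1], [0.0, 0.2], [0.2, 0.1]]
far_target = [[2.0, 2.1], [2.2, 1.9], [1.9, 2.0]]

mmd_close = mmd_squared(source_feats, close_target)
mmd_far = mmd_squared(source_feats, far_target)

# The adaptation term is added to the task loss with a hypothetical weight lam.
task_loss = 0.7  # placeholder classification loss, not a measured value
lam = 0.25
total_loss = task_loss + lam * mmd_far
```

Minimizing `total_loss` with respect to the feature extractor would pull the two feature distributions together while still fitting the labeled source data; in this toy setting `mmd_far` is larger than `mmd_close`, as expected for more dissimilar domains.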

Deep Active Learning for HSI classification
Active learning [95] is a supervised learning approach that can efficiently deal with small-sample problems. It can effectively learn discriminative features by autonomously selecting representative or highly informative samples from the training set, especially when labeled samples are scarce. Generally speaking, active learning consists of five components, A = (C, L, U, Q, S). Among them, C represents one classifier or a group of classifiers. L and U represent the labeled samples and unlabeled samples, respectively. Q is the query function, which is used to find the most informative samples among the unlabeled samples. S is an expert who can label unlabeled samples. In general, active learning has two stages. The first stage is initialization. In this stage, a small number of samples are randomly selected to form the training set L and are labeled by experts to train the classifier. The second stage is the iterative query. Based on the results of the previous iteration, Q selects new samples from the unlabeled sample set U for S to label, and they are added to the training set L. The active learning methods applied to hyperspectral image classification are mainly committee-based algorithms and algorithms based on the posterior probability. Among the committee-based active learning algorithms, the EQB method uses entropy to measure the amount of information in unlabeled samples. Specifically, the training set L is divided into k subsets to train k classifiers, and these k classifiers are then used to classify all unlabeled samples. Therefore, each unlabeled sample corresponds to k predicted labels. The entropy value is calculated from these labels:

$$H(x_i) = -\sum_{\omega=1}^{N_i} p(y_i^* = \omega \mid x_i) \log p(y_i^* = \omega \mid x_i)$$

where H represents the entropy value, $N_i$ represents the number of classes predicted for the sample $x_i$, and $p(y_i^* = \omega \mid x_i)$ is the fraction of the k classifiers that predict class $\omega$ for $x_i$. Samples with large entropy are selected and manually labeled [96].
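The committee-based selection described above can be sketched as follows. This is a minimal illustration that assumes the k classifiers have already been trained and have produced their votes; the function names and the toy votes are hypothetical.

```python
import math
from collections import Counter

def vote_entropy(predicted_labels):
    """Entropy of the label votes cast by the k committee members for one sample."""
    k = len(predicted_labels)
    counts = Counter(predicted_labels)
    return -sum((c / k) * math.log(c / k) for c in counts.values())

def select_most_uncertain(votes_per_sample, n_select):
    """Indices of the n_select unlabeled samples with the largest vote entropy."""
    ranked = sorted(range(len(votes_per_sample)),
                    key=lambda i: vote_entropy(votes_per_sample[i]),
                    reverse=True)
    return ranked[:n_select]

# Each row holds the labels predicted by k = 4 classifiers for one unlabeled sample.
votes = [
    [1, 1, 1, 1],  # full agreement: zero entropy
    [1, 2, 1, 2],  # even split between two classes
    [1, 2, 3, 4],  # complete disagreement: maximal entropy
    [1, 1, 1, 2],  # mild disagreement
]
to_label = select_most_uncertain(votes, n_select=2)
```

Here `to_label` contains the samples on which the committee disagrees most; these would be passed to the expert S and then moved from U to L.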
In [97], the deep belief network is used to generate the mapping feature h of the input x in an unsupervised way, and h is then used to calculate the information entropy. At the same time, sparse representation is used to estimate the representativeness of the sample. In the process of selecting samples for active learning, both the information entropy and the representativeness of the samples are considered. In contrast, the active learning method based on posterior probability [98,99,100] is more widely used. Breaking ties (BT) is a posterior-probability-based active learning method that is widely used in hyperspectral classification tasks. This method first uses specific models, such as convolutional networks, maximum likelihood estimation classifiers, support vector machines, etc., to estimate the posterior probabilities of all samples in the candidate pool. Then, the posterior probabilities are substituted into the following formula to produce a measure of sample uncertainty:

$$x^{BT} = \arg\min_{x_i \in U} \Big\{ \max_{\omega} p(y_i = \omega \mid x_i) - \max_{\omega \neq \omega^{+}} p(y_i = \omega \mid x_i) \Big\} \qquad (10)$$

where $\omega^{+}$ denotes the most probable class of $x_i$. In the above formula, we first calculate the difference between the largest probability and the second-largest probability among the posterior probabilities of each candidate sample and then select the sample with the minimum difference to join the valuable data set. The smaller this difference is, the more uncertain the sample is. A similar method was proposed in [...]; however, it uses only spectral features.
Because of the effectiveness of spatial information, in [101], joint spatial-spectral features are considered when generating the posterior probability. In contrast, Cao et al. [100] use convolutional neural networks to generate the posterior probability.
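The breaking-ties criterion can be illustrated with a short sketch. It assumes that some classifier has already produced a posterior vector for every candidate sample; the function names and the toy posteriors are hypothetical.

```python
def breaking_ties_margin(posterior):
    """Difference between the largest and second-largest posterior probability."""
    top, second = sorted(posterior, reverse=True)[:2]
    return top - second

def select_breaking_ties(posteriors, n_select):
    """Indices of the n_select candidates with the smallest margin (most uncertain)."""
    ranked = sorted(range(len(posteriors)),
                    key=lambda i: breaking_ties_margin(posteriors[i]))
    return ranked[:n_select]

# Posterior vectors over three classes for three candidate samples.
candidate_posteriors = [
    [0.90, 0.05, 0.05],  # confident prediction, large margin
    [0.40, 0.35, 0.25],  # two classes nearly tied, small margin
    [0.50, 0.30, 0.20],  # intermediate case
]
valuable = select_breaking_ties(candidate_posteriors, n_select=2)
```

The sample whose two most probable classes are nearly tied is ranked first, which matches the intuition that a small margin signals high uncertainty.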
In general, the active learning method can automatically select effective samples according to certain criteria and discard inefficient, redundant samples, which alleviates the scarcity of training samples in the small-sample problem.

Deep Few-shot Learning for HSI classification
Few-shot learning belongs to the family of meta-learning approaches. Unlike most other deep learning methods, it aims to learn the differences between samples instead of directly learning what each sample is; in other words, it makes the model learn to learn. In few-shot classification, given a small support set with N labeled samples $S_N^k = \{(x_1, y_1), \cdots, (x_N, y_N)\}$ covering k categories, the classifier labels a query sample with the label of the most similar sample in $S_N^k$. To achieve this goal, many learning frameworks have been proposed, and they can be divided into two categories: meta-based models and metric-based models.
The prototype network [102] is one of the metric-based models of few-shot learning. Its basic idea is that every class can be depicted by a prototype representation and that the samples belonging to the same category should lie around the class prototype. First, all samples are transformed into a metric space through an embedding function $f_\phi: \mathbb{R}^D \to \mathbb{R}^M$, and each class k is represented by a prototype vector $c_k \in \mathbb{R}^M$. Owing to its powerful representation ability, a convolutional network is used as the embedding function. The prototype vector $c_k$ is usually the mean of the embedding vectors of the support-set samples that belong to class k.
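A minimal sketch of the prototype computation and nearest-prototype classification follows. It assumes the embedding function has already been applied, so the inputs are embedding vectors; the Euclidean distance, the class names, and the toy values are illustrative assumptions.

```python
import math

def class_prototypes(embeddings, labels):
    """Mean embedding vector of the support samples of each class."""
    sums, counts = {}, {}
    for vec, label in zip(embeddings, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}

def classify(query_embedding, prototypes):
    """Assign the query to the class whose prototype is closest in Euclidean distance."""
    return min(prototypes,
               key=lambda label: math.dist(query_embedding, prototypes[label]))

# Toy support set: two embedded samples per class in a 2-dimensional metric space.
support_embeddings = [[0.0, 0.0], [1.0, 0.0], [4.0, 4.0], [6.0, 4.0]]
support_labels = ["soil", "soil", "meadow", "meadow"]

prototypes = class_prototypes(support_embeddings, support_labels)
prediction = classify([1.0, 1.0], prototypes)
```

In a full model the embeddings would come from the trained network, and training would adjust the network so that queries fall close to the prototype of their own class.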
In [103], Liu et al. directly introduce the prototype network into the hyperspectral image classification task and use ResNet [59] as a feature extractor that maps the samples into a metric space. The prototype network is then significantly improved for the hyperspectral image classification task in [104].
In that paper, the spatial-spectral feature is first integrated by local pattern coding, and a 1D-CNN converts it into an embedding vector. The prototype is the weighted mean of these embedding vectors, unlike the general prototype network, which uses the plain mean. In [105], Xi et al. replace the mapping function with hybrid residual attention [106] and introduce a new loss function that forces the network to increase the interclass distance and decrease the intraclass distance.

The relation network [107] is another metric-based model of few-shot learning. In general, it has two modules: the embedding function $f_\phi: \mathbb{R}^D \to \mathbb{R}^M$ and the relation function $f_\psi: \mathbb{R}^{2M} \to \mathbb{R}$. The embedding module plays the same role as in the prototype network, and the key idea lies in the relation module, which computes the similarity of samples. Unlike the Euclidean distance or cosine distance, it is a learnable module. In other words, the relation network introduces a learnable metric function on top of the prototype network, and through learning, the relation module can describe the difference between samples more precisely. During inference, the query embedding $f_\phi(x_i)$ is combined with the support embedding $f_\phi(x_j)$ as $C(f_\phi(x_i), f_\phi(x_j))$. Usually, $C(\cdot, \cdot)$ is a concatenation operation. Then, the relation function transforms the concatenated vector into a relation score $r_{i,j}$, which indicates the similarity between $x_i$ and $x_j$.

Several works have introduced the relation network into hyperspectral image classification to solve the small-sample problem. Deng et al. [108] first introduced the relation network into HSI classification. They use a 2-dimensional convolutional neural network to construct both the embedding function and the relation function. Gao et al. [109] and Ma et al. [110] have also proposed similar architectures.
In [111], to extract the joint spatial-spectral feature, Rao et al. implemented the embedding function with a 3-dimensional convolutional neural network.
The Siamese network [112,113,114] is a typical network in few-shot learning. Compared with the above networks, its input is a sample pair. Thus, it is composed of two parallel subnetworks $f_{\phi_1}: \mathbb{R}^D \to \mathbb{R}^M$ with the same structure and shared parameters. Each subnetwork accepts one input sample and maps it to a low-dimensional metric space to generate its embedding, $f_{\phi_1}(x_i)$ or $f_{\phi_1}(x_j)$. The Euclidean distance $D(x_i, x_j)$ is used to measure their similarity.
The higher the similarity between the two samples is, the more likely they are to belong to the same class. Recently, the Siamese network was introduced into HSI classification. Usually, a 2-dimensional convolutional neural network [115,116] serves as the embedding function, as in the above two networks. In the same way, several methods combine a 1-dimensional convolutional neural network with a 2-dimensional one [117,118] or use a 3-dimensional network [119] to obtain the joint spectral-spatial feature. Moreover, Miao et al. [120] have tried to use a stacked autoencoder to construct the embedding function $f_{\phi_1}$. After training, the model is able to identify the difference between samples.
Unlike the prototype network and the relation network, the Siamese network still needs a separate classifier to classify the embedding feature of a sample and obtain the final classification result. To avoid overfitting under limited labeled samples, an SVM is usually used as the classifier because it is lightweight.

Experiments
In most papers, comprehensive experiments and analyses are presented to describe the advantages and disadvantages of the proposed methods. However, different papers may choose different experimental settings. For example, even when the same number of training or test samples is used, the chosen samples usually differ because they are selected randomly. To evaluate different methods fairly, the exact same experimental setting should be used. This is why we design our own experiments to evaluate the different methods.

Figure 9: Architecture of the Siamese network.
As described above, the main methods of small-sample learning currently include the autoencoder, few-shot learning, transfer learning, active learning, and data augmentation. Therefore, some representative networks of these methods, namely S-DMM [121], SSDL [37], 3DCAE [40], TwoCnn [122], SSLstm [75] and 3DVSCNN [123], which include convolutional network models and recurrent network models, are selected to conduct experiments on three benchmark data sets: PaviaU, Salinas and KSC. All models are based on deep learning. Because we focus only on the robustness of the models on small-sample data sets, all of them classify hyperspectral images based on joint spectral-spatial features.
According to the sample size per category in the training data set, the experiment is divided into three groups. The first has 10 samples for each category, the second has 50 samples for each category and the third has 100 samples for each category. To obtain stable results, each group of experiments is performed ten times, and the training data set is different each time. Finally, the models are evaluated by average accuracy (AA) and overall accuracy (OA).
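The protocol above can be sketched in a few lines: draw a fixed number of training pixels per class, then score predictions with OA (fraction of all test samples classified correctly) and AA (mean of the per-class accuracies). Using all remaining labeled pixels as the test set, as well as the function names and toy labels, are assumptions made for illustration.

```python
import random
from collections import defaultdict

def sample_per_class(labels, n_per_class, seed):
    """Randomly pick n_per_class training indices from every class; the rest form the test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train = []
    for label in sorted(by_class):
        # rng.sample raises ValueError if a class has fewer than n_per_class samples.
        train.extend(rng.sample(by_class[label], n_per_class))
    train_set = set(train)
    test = [i for i in range(len(labels)) if i not in train_set]
    return sorted(train), test

def overall_accuracy(y_true, y_pred):
    """OA: fraction of all samples that are classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def average_accuracy(y_true, y_pred):
    """AA: mean of the per-class accuracies, so small classes weigh as much as large ones."""
    correct, total = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)

# Toy example: class 0 dominates, and one of the two class-1 samples is misclassified.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [0, 1]
oa = overall_accuracy(y_true, y_pred)  # 9 of 10 correct
aa = average_accuracy(y_true, y_pred)  # mean of 1.0 and 0.5

train_idx, test_idx = sample_per_class([0] * 20 + [1] * 15, n_per_class=10, seed=0)
```

The toy example shows why both metrics are reported: OA stays high when a large class is classified well, whereas AA drops as soon as a small class is handled poorly. Repeating the split with ten different seeds and averaging the metrics would mirror the ten runs per group.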

Introduction of data sets
• Pavia University (PaviaU): The Pavia University data set consists of a hyperspectral image with 610 × 340 pixels and a spatial resolution of 1.3 meters, which was taken by the ROSIS sensor above Pavia University in Italy. The sensor records 115 contiguous wavelengths in the range of 0.43∼0.86 µm. Since 12 of the bands are corrupted by noise, each pixel in the final data set contains 103 bands. It contains 42,776 labeled samples in total, covering 9 classes. The sample size of each class is shown in Table 1.
• Salinas: The Salinas data set consists of a hyperspectral image with 512 × 217 pixels and a spatial resolution of 3.7 meters, taken over the Salinas Valley in California by the AVIRIS sensor. The sensor records 224 contiguous wavelengths in the range of 0.2∼2.4 µm. After the 20 water absorption bands are removed, each pixel in the final data set contains 204 bands. It contains 54,129 labeled samples in total, covering 16 classes. The sample size of each class is shown in Table 2.
• Kennedy Space Center (KSC): The KSC data set was taken over the Kennedy Space Center (KSC) in Florida by the AVIRIS sensor. Its hyperspectral image contains 512 × 614 pixels, and the spatial resolution is 18 meters. The sensor records 224 contiguous wavelengths in the range of 400∼2500 nm. Similarly, after removing 48 bands that are absorbed by water or have a low signal-to-noise ratio, each pixel in the final data set contains 176 bands. It contains 5,211 labeled samples, covering 13 classes. The sample size of each class is shown in Table 3.

Selected models
Some state-of-the-art methods are chosen to evaluate their performance. The original implementations were trained using different platforms, including Caffe, PyTorch, etc. Some platforms, such as Caffe, are not well supported by newer development environments. Most models are therefore our re-implementations and are trained using the exact same setting. Most of the model settings follow the original papers, and some are modified slightly based on the experiments. All models are trained on the same training data set, which is randomly picked at the pixel level, and tested on the same test data set, and their settings have been optimally tuned. The implementation status of the code is shown in Table 4. The descriptions of the chosen models are provided below.
• SAE LR [34]. This is the first paper to introduce the autoencoder into hyperspectral image classification, opening a new era of hyperspectral image processing. It adopts a plain autoencoder composed of linear layers to extract the feature. The size of the neighbor region is 5 × 5, and the first 4 components of PCA are chosen, which yields a spatial feature vector. Before being fed into the model, the raw spectral feature and the spatial feature are stacked to form a joint feature. To reduce the difficulty of training, it uses a greedy layerwise pretraining method to train each layer, and the parameters of the encoder and decoder are symmetric. Then, a linear classifier is appended to the encoder for fine-tuning. According to [34], the hidden size is set to 60, 20, and 20 for PaviaU, Salinas, and KSC, respectively.
• S-DMM [121]. This is a relation network that contains an embedding module and relation module implemented by 2D convolutional networks. The model aims to make samples in the feature space have a small intraclass distance and a large interclass distance through a learnable feature embedding function and a metric function. After training, all samples will be assigned to the corresponding clusters. Finally, a simple KNN is used to classify the query sample. In the experiment, the neighbor region of the pixel is fixed as 5 × 5 and the feature dimension is set to 64.
• 3DCAE [40]. This is a 3D convolutional autoencoder adopting a 3D convolution layer to extract the joint spectral-spatial feature. First, 3DCAE is trained by the traditional method, and then, an SVM classifier is adopted to classify the hidden features on the top of 3DCAE. In the experiment, the neighbor region of the pixel is set to 5 × 5 and 90% of the samples are used to train the 3D autoencoder. There are two different hyperparameter settings corresponding to Salinas and PaviaU, and the model has not been tested on KSC in [121]. Therefore, on the KSC, the model uses the same hyperparameter configuration as on the Salinas because they are collected by the same sensor.
• SSDL [37]. This is a typical two-stream structure that extracts the spectral and spatial features separately through two different branches and merges them at the end. Inspired by [34], the authors adopt a 1D autoencoder to extract the spectral feature. In the spatial branch, the model uses a spatial pyramid pooling layer to replace the traditional pooling layer on top of the last convolutional layer. The spatial pyramid pooling layer enables the deep convolutional neural network to generate a fixed-length feature. On the one hand, it converts inputs of different sizes into a fixed-length output, which benefits modules that are sensitive to the input size; on the other hand, it helps the model adapt to objects of different scales, and the output includes features from coarse to fine, achieving multiscale feature fusion. Then, a simple logistic classifier is used to classify the spectral-spatial feature.
In the experiment, 80% of the data are used to train the autoencoder through greedy layer-wise pretraining. Moreover, in the spatial branch, the size of the neighbor region is set to 42 × 42, and PCA is used to extract the first component. Then, the overall model is trained jointly.
• TwoCnn [122]. This is a two-stream structure based on fine-tuning. In the spectral branch, it adopts a 1D convolutional layer to capture local information of spectral features, which is entirely different from SSDL. In particular, transfer learning is used to pretrain the parameters of the model and endow it with good robustness on limited samples. The pairs of source and target data sets are Pavia Center-PaviaU, Indian Pines-Salinas, and Indian Pines-KSC. In [122], the model was also not tested on KSC. Thus, we regard Indian Pines as the source domain for KSC, given that both data sets come from the same type of sensor. The neighbor region of the pixel is set to 21 × 21. Additionally, instead of PCA, it averages along the spectral channel to reduce the input dimension. In the pretraining process, 15% of the samples of each category of Pavia Center and 90% of the samples of each category of Indian Pines are treated as the training data set, and the rest serve as the test data set. To make the number of bands in the source data set and target data set the same, we filter out the bands with the smallest variance. According to [122], all layers are transferred except for the softmax layer. Finally, the model is fine-tuned on the target data set with the same configuration.
• 3DVSCNN [123]. This is a general CNN-based image classification model that uses a 3D convolutional network to extract spectral-spatial features simultaneously, followed by a fully connected network for classification. The main idea of [123] is the usage of active learning. The process can be divided into two steps: the selection of valuable samples and the training of the model. In [123], an SVM serves as a selector to iteratively select some of the most valuable samples according to Eq. (10). Then, the 3DVSCNN is trained on the valuable data set. The size of its neighbor region is set to 13 × 13. During data preprocessing, it uses PCA to extract the top 10 components for PaviaU and Salinas and the top 30 components for KSC, which contain more than 99% of the original spectral information and still keep a clear spatial geometry. In the experiment, the SVM picks 80% of the samples to form the valuable data set, adding 4 samples in each iteration. Then, the model is trained on the valuable data set.
• CNN HSI [28]. The model combines multilayer 1 × 1 2D convolutions followed by local response normalization to capture the feature of hyperspectral images. To avoid the loss of information after PCA, it uses 2D convolution to extract spectral and spatial joint features directly, instead of 3D convolution. At the same time, it also adopts a dropout layer and data augmentation, including rotation and flipping, to improve the generalization of the model and reduce overfitting. After data augmentation, an image can generate eight different orientation images. Moreover, the model removes the linear classifier to decrease the number of trainable parameters. According to [28], the dropout rate is set to 0.6, the size of the neighbor region is 5 × 5, and the batch size is 16 in the experiment.
• SSLstm [75]. Unlike the above methods, SSLstm adopts recurrent networks to process spectral and spatial features simultaneously. In the spectral branch, called SeLstm, the spectral vector is treated as a sequence. In the spatial branch, called SaLstm, each row of the image patch is treated as a sequence element, so the image patch is converted into a sequence along the column direction. In particular, it fuses the predictions of the two branches in the label space to obtain the final prediction result, which is defined as

$$P(y = j \mid x_i) = w_{spe} P_{spe}(y = j \mid x_i) + w_{spa} P_{spa}(y = j \mid x_i)$$

where $P(y = j \mid x_i)$ denotes the final posterior probability, $P_{spe}(y = j \mid x_i)$ and $P_{spa}(y = j \mid x_i)$ denote the posterior probabilities from the spectral and spatial modules, respectively, and $w_{spe}$ and $w_{spa}$ are fusion weights that sum to 1. In the experiment, the size of the neighbor region is set to 32 × 32 for PaviaU and Salinas and to 64 × 64 for KSC. The first component of PCA is retained on all data sets. The numbers of hidden nodes of the spectral branch and the spatial branch are 128 and 256, respectively. In addition, $w_{spe}$ and $w_{spa}$ are both set to 0.5.
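The label-space fusion used by SSLstm amounts to a weighted sum of the two branch posteriors followed by an argmax. The sketch below uses the equal weights of 0.5 reported above; the function name and the toy posteriors are hypothetical.

```python
def fuse_posteriors(p_spe, p_spa, w_spe=0.5, w_spa=0.5):
    """Weighted sum of the spectral and spatial posteriors; the weights must sum to 1."""
    if abs(w_spe + w_spa - 1.0) > 1e-9:
        raise ValueError("fusion weights must sum to 1")
    return [w_spe * a + w_spa * b for a, b in zip(p_spe, p_spa)]

# Toy posteriors over three classes from the two branches for one pixel.
p_spe = [0.6, 0.3, 0.1]  # spectral branch favors class 0
p_spa = [0.2, 0.7, 0.1]  # spatial branch favors class 1

fused = fuse_posteriors(p_spe, p_spa)
predicted_class = max(range(len(fused)), key=fused.__getitem__)
```

Because the weights sum to 1, the fused vector remains a valid probability distribution, and the branch with the more confident prediction decides the final label when the two branches disagree.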

Experimental results and analysis
The accuracy on the test data set is shown in Table 5, Table 6, and Table 7. The corresponding classification maps are shown in Figures 11∼19. In these maps, the final class of each pixel is decided by voting over the results of the 10 experiments.
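The per-pixel voting over repeated runs can be sketched as a simple majority vote. The tie-breaking rule (the label seen first among the tied ones wins), the function name, and the toy predictions are assumptions; the paper does not state how ties are resolved.

```python
from collections import Counter

def majority_vote(predictions_per_run):
    """predictions_per_run[r][i] is the label predicted for pixel i in run r."""
    n_pixels = len(predictions_per_run[0])
    voted = []
    for i in range(n_pixels):
        votes = Counter(run[i] for run in predictions_per_run)
        # most_common orders equal counts by first occurrence (assumed tie-breaking rule).
        voted.append(votes.most_common(1)[0][0])
    return voted

# Three runs over three pixels (the experiments in the paper use ten runs).
runs = [
    [1, 2, 3],
    [1, 2, 1],
    [2, 2, 1],
]
final_map = majority_vote(runs)
```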
Taking Table 5 as an example, the experiment is divided into three groups, with sample sizes of 10, 50, and 100 per class, respectively. Each model is run 10 times in every group. Then, we compute the averages of the per-class accuracy, AA, and OA to compare the models. When the sample size is 10, S-DMM has the highest AA and OA, which are 91.08% and 84.45%, respectively, in comparison with the AA and OA of 71.58% and 60.00%, 75.34% and 74.79%, 74.60% and 78.61%, 75.64% and 75.17%, 72.77% and 69.59%, 85.12% and 82.13%, and 72.40% and 66.05% for 3DCAE, SSDL, TwoCnn, 3DVSCNN, SSLstm, CNN HSI and SAE LR, respectively. In addition, S-DMM achieves the best accuracy on the largest number of classes. When the sample size is 50, S-DMM and CNN HSI have the highest AA and OA, respectively, which are 96.47% and 95.21%. In the last group, 3DVSCNN and CNN HSI have the highest AA and OA, which are 97.13% and 97.35%. The other two tables lead to similar conclusions.
As shown in Table 5, Table 6 and Table 7, most models, except for 3DCAE, perform better on KSC than on the other two data sets. Especially when the data set contains few samples, the accuracy of S-DMM on KSC reaches 94%, higher than on the other data sets. This is because the surface objects in KSC have clearly discriminable borders between each other, even though its spatial resolution is coarser than that of the other data sets, as shown in Figures 17∼19. In the other data sets, models easily misclassify objects that have a similar spatial structure, such as Meadows (class 2) and Bare soil (class 6) in PaviaU and Fallow rough plow (class 4) and Grapes untrained (class 8) in Salinas, as shown in Figures 11∼16. The accuracy of all models on Grapes untrained is lower than on the other classes in Salinas. As Figure 10 shows, on all data sets, the accuracy of all models improves as the number of samples increases.
As shown in Figure 10, when the sample size of each category is 10, S-DMM and CNN HSI achieve stable and excellent performance on all data sets. They are not sensitive to the size of the data set. In Figure 10(b) and Figure 10(c), with increasing sample size, the accuracy of S-DMM and CNN HSI improves slightly, but the increase is smaller than that of the other models. In Figure 10(a), when the sample size increases from 50 to 100, we reach the same conclusion. This result shows that both of them can be applied to solve the small-sample problem in hyperspectral images. S-DMM in particular achieves the best AA and OA on Salinas and KSC in the experiment with a sample size of 10. On PaviaU, it still ranks third. This result also shows that it works well with few samples. Although TwoCnn, 3DVSCNN, and SSLstm achieve good performance on all data sets, they do not work well when the data set contains fewer samples. It is worth mentioning that 3DVSCNN uses fewer training samples than the other models because it selects valuable samples. This approach may not be beneficial for classes with few samples. As shown in Table 7, 3DVSCNN performs well on OA but poorly on AA. For class 7, when the sample size increases from 10 to 50 and 100, the accuracy drops. This is because the total sample size of class 7 is the smallest on KSC, so it contains few valuable samples. Moreover, the step of selecting valuable samples causes an imbalance between the classes, which decreases the accuracy of class 7. On almost all data sets, autoencoder-based models perform poorly compared with the other models. Although unsupervised learning does not need labeled samples, without constraints the autoencoder might actually learn nothing. Moreover, its symmetric architecture results in a vast number of parameters and increases the difficulty of training.
Therefore, SSDL and SAE LR use a greedy layerwise pretraining method to solve this problem. However, 3DCAE does not adopt this approach and achieves the worst performance on all data sets. As shown in Figure 10, it still has considerable room for improvement.
Overall, the classification results based on few-shot learning, active learning, transfer learning, and data augmentation are better than those of autoencoder-based unsupervised learning methods on limited samples in all experiments. Few-shot learning benefits from exploring the relationship between samples to find a discriminative decision boundary. Active learning benefits from the selection of valuable samples, which enables the model to pay more attention to samples that are hard to distinguish. Transfer learning makes good use of the similarity between different data sets, which reduces the quantity of training data and the number of trainable parameters required, improving the model's robustness. Data augmentation generates additional samples from the raw data to expand the diversity of samples. Although the autoencoder can learn the internal structure of the unlabeled data set, the final feature representation might not have task-related characteristics. This is why its performance on a small-sample data set is inferior to that of supervised learning.

Model parameters
To further explore why the models achieve different results on the benchmark data sets, we also counted the number of trainable parameters of each framework (including the decoder module) on the different data sets, as shown in Table 8. On all data sets, the model with the fewest trainable parameters is SAE LR, the second is CNN HSI, and the model with the most is TwoCnn. SAE LR is the most lightweight architecture among all models because it uses only simple linear layers, but its performance is poor. Unlike other 2D convolution approaches in HSI, CNN HSI solely uses 1 × 1 kernels to process an image. Moreover, it uses a 1 × 1 convolution layer as the classifier instead of a linear layer, which greatly reduces the number of trainable parameters. The next is S-DMM. This also explains why S-DMM and CNN HSI are less affected by increases in sample size but are very effective on few samples. Additionally, overfitting is of little concern in these approaches. Stacking the spectral and spatial features to generate the final fused feature is the main reason for the large number of parameters of TwoCnn. However, despite its potentially millions of trainable parameters, it works well on limited samples thanks to transfer learning, which decreases the number of parameters that must be trained and achieves good performance on all target data sets. The models with the next most parameters are, in order, 3DCAE and SSLstm. 3DCAE has up to eight times as many trainable parameters as SSDL, which contains not only a 1D autoencoder in the spectral branch but also a spatial branch based on a 2D convolutional network, yet 3DCAE still performs worse than SSDL.
Although 3D convolutional and pooling modules can largely avoid the loss of data structure information caused by the flattening operation, the complexity of the 3D structure and the symmetric structure of the autoencoder increase the number of model parameters, which makes the model prone to overfitting. 3DVSCNN also uses a 3D convolutional module and performs better than 3DCAE, because it first reduces the number of redundant bands by PCA. The same preprocessing might also be applied to 3DCAE to decrease the number of model parameters while still exploiting the ability of 3D convolution to extract spectral and spatial information simultaneously. Most of the parameters of SSLstm come from the spatial branch. Although the gate structure of the LSTM improves the model's long- and short-term memory, it increases the complexity of the model. When the number of hidden units increases, the number of parameters grows rapidly. Perhaps it is the coupling between the spectral features and the recurrent network that keeps the performance of SSLstm from being as poor as that of 3DCAE, which has a similar number of parameters, on all data sets, and that even lets SSLstm achieve superior results on KSC. Moreover, SSLstm adopts no method specifically aimed at the few-sample problem. This finding also shows that supervised learning is better than unsupervised learning in some tasks.
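Two of the parameter-count arguments above can be checked with simple arithmetic: a 1 × 1 convolutional classifier is far smaller than a linear classifier on the flattened feature map, and LSTM parameters grow roughly quadratically with the hidden size. The layer sizes below are hypothetical and do not reproduce Table 8; the comparison assumes the convolutional head is followed by spatial pooling, and the LSTM formula assumes one bias vector per gate (some frameworks store two).

```python
def conv2d_params(in_channels, out_channels, kernel_size, bias=True):
    """Trainable parameters of a 2D convolution with a square kernel."""
    return out_channels * (in_channels * kernel_size * kernel_size + (1 if bias else 0))

def linear_params(in_features, out_features, bias=True):
    """Trainable parameters of a fully connected layer."""
    return out_features * (in_features + (1 if bias else 0))

def lstm_params(input_size, hidden_size):
    """Single-layer LSTM: four gates, each with input weights, recurrent weights and one bias."""
    return 4 * (hidden_size * (input_size + hidden_size) + hidden_size)

# Hypothetical head: 128 feature channels on a 5 x 5 patch, 9 output classes.
channels, patch, classes = 128, 5, 9
conv_head = conv2d_params(channels, classes, kernel_size=1)
linear_head = linear_params(channels * patch * patch, classes)

# Doubling the hidden size roughly quadruples the LSTM parameter count.
lstm_small = lstm_params(input_size=1, hidden_size=128)
lstm_large = lstm_params(input_size=1, hidden_size=256)
```

With these assumed sizes, the 1 × 1 convolutional head has 1,161 parameters against 28,809 for the linear head, and the LSTM grows from 66,560 to 264,192 parameters when the hidden size doubles.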

The speed of model convergence
In addition, we compare the convergence speed of the models according to the changes in the training loss of each model in the first 200 epochs of each group of experiments (see Figures 20∼22). Because the autoencoder and classifier of 3DCAE are trained separately, and all data are used when training the autoencoder, it is not comparable to the other models. Therefore, it is not listed here. On all data sets, S-DMM has the fastest convergence speed. After approximately 3 epochs, its training loss becomes stable, given its small number of parameters. Although CNN HSI has a similar performance to S-DMM and fewer parameters, its learning curve converges more slowly than that of S-DMM and is sometimes accompanied by fluctuations. TwoCnn ranks second in convergence speed, mainly because transfer learning provides a better initialization of the parameters and because it actually has fewer parameters that require training. Thus, it needs only a few epochs of fine-tuning on the target data set. Moreover, the training curves of most models stabilize after 100 epochs. The training loss of SSLstm oscillates severely on all data sets. This is especially noticeable in SeLstm, where the loss sometimes has difficulty decreasing. When the sequence is very long, a recurrent neural network is more susceptible to vanishing or exploding gradients. The pixels of a hyperspectral image usually contain hundreds of bands, which is why the training loss of SeLstm has difficulty decreasing or oscillates. The spatial branch does not suffer from this problem as severely because the spatial sequence, whose length depends on the patch size, is shorter than the spectral sequence. During training, the LSTM-based model takes a considerable amount of time because it cannot be trained in parallel.

Conclusions
In this paper, we introduce the current research difficulty in the field of hyperspectral image classification, namely, the scarcity of labeled samples, and discuss popular learning frameworks. Furthermore, we introduce several popular learning algorithms for solving the small-sample problem, such as autoencoders, few-shot learning, transfer learning, active learning, and data augmentation. Based on these methods, we select several representative models and conduct experiments on hyperspectral benchmark data sets. We designed three different experiments to explore the performance of the models on small-sample data sets and documented how it changes with increasing sample size, evaluating their effectiveness and robustness through AA and OA. We then compared the number of parameters and convergence speeds of the models to further analyze their differences. Finally, we highlight several possible future directions for hyperspectral image classification with small samples:
• Autoencoders, including linear autoencoders and 3D convolutional autoencoders, have been widely explored and applied to solve the small-sample problem in HSI. Nevertheless, their performance remains far from excellent. Future development should focus on few-shot learning, transfer learning, and active learning.
• Several learning paradigms can be fused to make good use of the advantages of each approach. For example, a fusion of transfer learning and active learning can select the valuable samples on the source data set and transfer the model to the target data set, avoiding the imbalance of the class sample size.
• According to the experimental results, the RNN is also suitable for hyperspectral image classification. However, there is little work focused on combining the learning paradigms with RNN. Recently, the transformer, as an alternative to the RNN that is capable of processing in parallel, has been introduced into the computer vision domain and has achieved good performance on some tasks such as object detection. Therefore, we can also employ this method in hyperspectral image classification and combine it with some learning paradigms.
• Graph convolution networks have attracted growing interest in hyperspectral image classification. Fully connected networks, convolution networks, and recurrent networks are suitable only for processing Euclidean data and cannot handle non-Euclidean data directly. An image can be regarded as a special case of Euclidean data. Thus, many studies [124,125,126] utilize graph convolution networks to classify HSIs.
• A large number of labeled samples is required because deep learning models have a tremendous number of trainable parameters. Many methods, such as group convolution [127], have been proposed to lighten deep neural networks. How to construct even more lightweight models is therefore also a future direction.
Although few-label classification can save much of the time and labor needed to collect and label diverse samples, the models easily suffer from overfitting and weak generalization. Thus, how to avoid overfitting and improve the generalization of the model is the major challenge for the practical application of few-label HSI classification.
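As an illustration of how group convolution lightens a model, the following sketch counts the parameters of a 2D convolution layer with and without groups. It assumes the common convention that each output channel is connected only to the input channels of its own group. The function name and the channel and kernel sizes are hypothetical and are not taken from [127].

```python
def conv2d_param_count(in_channels: int, out_channels: int,
                       kernel_size: int, groups: int = 1,
                       bias: bool = True) -> int:
    """Trainable parameters of a 2D convolution with optional groups."""
    if in_channels % groups or out_channels % groups:
        raise ValueError("channel counts must be divisible by groups")
    # Each output channel sees only in_channels // groups input channels.
    weights = (out_channels * (in_channels // groups)
               * kernel_size * kernel_size)
    return weights + (out_channels if bias else 0)


# Hypothetical layer: 64 input channels, 128 output channels, 3 x 3 kernel.
standard = conv2d_param_count(64, 128, 3, groups=1)
grouped = conv2d_param_count(64, 128, 3, groups=4)
print(standard, grouped)
```

With g groups, the number of weights is divided by g while the number of biases is unchanged.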