Hyperspectral Image Classification Based on Unsupervised Regularization

Owing to the powerful feature representation ability of deep learning and its end-to-end nonlinear mapping, deep-learning-based methods have become the mainstream approach to hyperspectral image (HSI) classification. However, the accuracy of deep learning methods depends heavily on a large number of labeled samples for training. HSIs have few labeled samples and unbalanced categories, which makes deep models prone to overfitting and seriously affects the classification accuracy. Therefore, how to alleviate the overfitting caused by small samples in deep-learning-based classification remains an open problem. Considering that it is relatively easy to obtain a large number of unlabeled samples in the field of remote sensing, the unsupervised information learned from unlabeled data can be used to regularize the supervised classification model, which effectively alleviates the overfitting caused by the small sample problem. In the supervised training process, unsupervised information from the overall distribution of the samples is introduced to guide the regularization of the model, so as to classify the data effectively when only a small number of labeled samples is available. Experimental results demonstrate the effectiveness of the proposed method for HSI classification with few training samples.


I. INTRODUCTION
HYPERSPECTRAL image (HSI) has a strong pixel representation ability. Hyperspectral imaging is based on the spectral reflectance of ground objects, so it has strong ground penetration and its resolution is not easily affected by color shading. It has unique advantages in military reconnaissance, agricultural observation, geological prospecting, and transportation planning [1]. As a prerequisite for the practical application of HSIs, the classification of HSIs is of great significance. HSI classification refers to assigning each pixel to a specific ground-object category according to the spectral curve of that pixel. There are many research works on HSI classification, but because the ground annotation of remote sensing images is expensive, HSI classification lacks enough training samples. Also, the high dimensionality and spectral redundancy of HSIs lead to a high data volume. The combination of small samples and high-dimensional data easily causes the "curse of dimensionality" [2], that is, the dimensionality is too high and the samples are too few, so the accuracy of the classification task is reduced by the overfitting of the model [16].
In recent years, many algorithms have emerged for HSI classification, including traditional algorithms based on statistical theory and algorithms based on deep learning. According to whether the classification algorithm works pixel by pixel or uses the semantic information of the surrounding pixels, existing HSI classification methods can be divided into two types: 1) pixel-level classification and 2) super-pixel-level classification [4].
The traditional methods of pixel-level classification include the following. 1) Linear classifiers, such as logistic regression [5] and Gaussian maximum likelihood classification [6], assume that the samples obey the Bernoulli distribution and the normal distribution, respectively, and construct a likelihood function to find the decision boundary, which is then used to classify each pixel. 2) Distance-based classifiers, such as K-nearest neighbor classification [25], minimum distance classification [8], and support vector machine (SVM) [9], use the distance between the test sample and each class as the decision criterion. The model assigns the test sample to the closest class, where SVM uses the distance of the sample in the feature space after kernel function mapping. Except for SVM, these methods cannot solve the "curse of dimensionality" of HSIs, but this problem can be alleviated by dimensionality reduction or band selection, that is, by first performing feature extraction on the input sample and then performing the classification operation. Feature extraction methods include a method based on the binary discrete wavelet transform [10] and a fast dimensionality reduction method based on dynamic programming [11]. In addition to the dimensionality reduction of the data [27], another method for high-dimensional problems is band selection, such as independent component analysis for band selection [28]; the bands containing more information are selected by evaluating the average weight coefficient of each band, using an adaptive band weight measurement method based on information entropy [14]. These methods of deleting redundant bands reduce the computational complexity of HSI classification and ease the high-dimensional problem to a certain extent [26].
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
The task of HSI classification based on deep learning mainly focuses on proper feature representation and effective classifier design. For example, the deep belief network (DBN) [32] uses a network of restricted Boltzmann machines, which learns layer by layer to extract robust nonlinear features from HSIs. However, DBN adopts a fully connected (FC) structure [17], which leads to too many parameters in the model, and it performs poorly in the case of small samples. Different from FC networks, convolutional neural networks (CNNs) [40] use local connections and shared weights, thereby reducing the number of parameters. Since the local connections of CNNs are suited to the situation where few training samples are available for HSIs [19], a series of classification methods based on CNN and its variants have appeared in HSI classification. For example, AlexNet [15] is deeper than the original CNN structure, so it has better feature representation capabilities, but it brings more parameters. In addition, there are GoogLeNet [13], VGGNet [3], DenseNet [7], recurrent neural networks, etc. These networks make the original CNN wider or use convolutional kernels of the same size, a fixed pooling size, and direct connection structures to reduce the difficulty of training, and they achieve better classification results than general deep networks [24].
Considering that it is relatively easy to obtain a large number of unlabeled samples in the field of remote sensing, the unsupervised information learned from unlabeled data can be used to regularize the supervised classification model, which effectively alleviates the small sample and overfitting problems. Based on the clustering assumption that adjacent samples on the same manifold structure have similar output values, this article designs an HSI classification framework that shares unsupervised information. That is, unsupervised information from the overall distribution of the samples is introduced into the supervised training process to guide the regularization of the model, so as to classify the data effectively when only a small number of labeled samples is available.
The main work of this article is embodied in the following three aspects.
1) A shared feature extraction module (SFEM) is designed based on the Kullback-Leibler (KL) sparse stacked autoencoder (AE) structure, which is used to extract features from both the labeled data and the unlabeled data in the feature extraction stage to obtain their consistency information. That is, the structural information of the samples is put into the supervised learning process as prior knowledge to regularize the learning process.
2) Supervised learning and unsupervised clustering are used to classify and cluster the two kinds of data in the same dataset. With the introduction of unsupervised clustering, the information obtained from unsupervised learning is introduced into the supervised learning process in the form of a classification loss. This process provides information on similarities and differences between classes, and more information can be obtained from the supervised learning part.
3) The effectiveness of this scheme is tested on mainstream HSI datasets. In particular, when dividing samples, we focus on the categories with a small number of labels to show the superior performance of this method in the case of small samples.

II. RELATED WORK
In HSI classification, traditional methods require fewer parameters and have low computational complexity. However, because HSIs collected under natural conditions do not generally follow a definite distribution in the mathematical and statistical sense, traditional methods based on distribution assumptions or on Vapnik-Chervonenkis (VC) dimension theory are limited. Deep-learning-based methods can extract high-level features of data through multiple hidden layers, but they require more labeled data for training. The classification performance depends on the number of effective training samples, so the lack of labeled training samples and the imbalance of HSI categories have become important factors limiting their development.
Methods based on deep learning merge the spatial features into the classifier: they extract the spatial features and spectral features separately during feature extraction and then use feature fusion to perform joint classification. For example, using a 2-D CNN to extract features from the neighborhood of hyperspectral pixels, which contains spatial information, while performing principal component analysis (PCA) on the spectral dimension can extract discriminative spatial features while reducing the computational cost. Based on this, Liang et al. [12] introduced sparse representation to encode deep spatial features extracted by CNN into low-dimensional sparse features to improve feature representation capabilities. Long et al. [18] used the trained fully convolutional network FCN8 to explore deep multiscale spatial structure information, used a weighted fusion mechanism to fuse the original spectral features and deep multiscale spatial features, and finally input the fused features into the classifier for classification prediction.
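As a rough illustration of the PCA step on the spectral dimension mentioned above, the following NumPy sketch projects every pixel spectrum onto its leading principal components. The function name `spectral_pca`, the toy cube size, and the number of retained components are assumptions made for illustration and are not taken from the cited works.

```python
import numpy as np

def spectral_pca(cube, n_components):
    """Project each pixel's spectrum onto the top principal components."""
    w, h, d = cube.shape
    pixels = cube.reshape(-1, d).astype(np.float64)   # (w*h, d) matrix of spectra
    centered = pixels - pixels.mean(axis=0)           # remove the mean spectrum
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    reduced = centered @ vt[:n_components].T
    return reduced.reshape(w, h, n_components)

rng = np.random.default_rng(0)
toy_cube = rng.normal(size=(8, 6, 20))   # hypothetical 8x6 image with 20 bands
reduced_cube = spectral_pca(toy_cube, n_components=3)
```

The spatial layout is preserved, so a 2-D CNN could then be applied to neighborhoods of the reduced cube.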
Model regularization (MR) has recently been used to effectively alleviate the model overfitting caused by the small sample problem. In deep learning, when there are too few training samples and many parameters in the model, the model learned from the limited training data lacks generalization ability, that is, it overfits. In response to this problem, there are several strategies in HSI processing: transfer learning (TL) [22], active learning (AL), and model optimization [30]. TL is a method that learns useful information from auxiliary data and introduces it into the target dataset to effectively reduce the data dependence of the algorithm. Deng et al. [48] used the values of deep network parameters trained on other remote sensing datasets to initialize a 2-D CNN for classification. Compared with random initialization, the TL algorithm converged faster after parameter migration [29]. AL is based on the selection of training samples: selected unlabeled samples are added to the training dataset as new training samples, thereby increasing the number of labeled training samples. Ma et al. [20] combined AL with iterative training sampling, expanded the multidimensional dataset by iteratively incorporating other spatial classification information into the unlabeled data samples enhanced by AL, and updated the current training samples in a single iteration. This further improves the classification accuracy while reducing the inconsistency of classification. The last strategy is based on MR. MR refers to adding some restriction rules to the objective function that needs to be trained, which reduces the parameter space and thus constrains the solution space while minimizing classification errors. In a hyperspectral dataset, differences in illumination, weather, and shading during acquisition cause the spectral reflectance of the same object to deviate.
The efficiency of TL relies on the consistency of the datasets, so using TL to solve the small sample problem in HSI classification raises the problem of selecting the auxiliary dataset. AL expands the training samples by selecting data that are helpful for classification from the unlabeled samples, querying human experts, and obtaining the labels of these samples. This requires additional information, which is often not available in actual classification applications. Therefore, choosing the simpler and more feasible MR can alleviate the overfitting problem caused by small samples by introducing prior knowledge about the samples and reducing the structural risk of the model.
Making use of the unsupervised information contained in the dataset can improve the classification performance. Machine learning algorithms can be divided into two categories according to whether the training process uses labeled training samples: 1) supervised learning and 2) unsupervised learning. Supervised learning establishes training sets based on samples in different categories and then makes decisions based on the trained parameters. The unsupervised training process does not require labeled samples, and its main purpose is to extract useful features from a large amount of unlabeled data. Because of the high cost of remote sensing data processing and labeling, the number of unlabeled samples in hyperspectral data is much larger than that of labeled samples [31]. To deal with this problem, more and more research works are devoted to designing an unsupervised deep learning framework for HSI data that realizes an encoder-decoder which can learn without label information and then migrates the trained network and fine-tunes it on the labeled dataset to improve classification performance [34]. The advantage of the unsupervised algorithm is that it does not need labeled data to obtain the distribution information of the samples, but it also has the disadvantage that the classification is not accurate enough and the categories need to be determined manually. Therefore, we consider using a clustering algorithm to initially extract the difference information of the samples in the feature space and introduce it as a regular term into the supervised classification process to improve the classification accuracy.
HSIs have a large amount of data and many feature channels. Liu et al. used the Ghost module to reduce the complexity of the model and, combining it with extended morphological profile (EMP) features, proposed an HSI classification method based on EMP features and the Ghost module (GhostEMP). GhostEMP can improve the efficiency of operation [33]. Also, Shen et al. proposed a method named GLSESP to improve the performance of supervised classification. They used global spatial and local spectral similarity to extend the labeled sample size.
Also, in order to alleviate band redundancy, they extended subspace projection, which projects the original image to a lower-dimensional subspace. GLSESP is also very practical and effective in HSI classification [23]. Recently, CNNs have been widely used for HSI classification due to their detailed representation of features. However, the current CNN-based HSI classification methods mainly follow a patch-based learning framework. These methods not only limit the use of global information but also require a high computational cost. So, Xu et al. used an image-based global learning framework for HSI classification. They proposed a dual-channel convolutional network (DCCN) for HSI classification to maximize the exploitation of the global and multiscale information of HSI [35].
CNNs have also emerged as a popular choice for HSI analysis. However, the performance of traditional CNN-based patchwise classification methods is limited by insufficient training samples, and the evaluation strategies tend to provide overoptimistic results due to training-test information leakage. To address these concerns, Liang Zou et al. proposed a novel spectral-spatial 3-D fully convolutional network to jointly explore the spectral-spatial information and the semantic information. It takes small patches of the original HSI as inputs and produces outputs of the corresponding size, which enhances the utilization rate of the scarce labeled images and boosts the classification accuracy [36].

III. PROPOSED METHOD
The core idea of this article is to introduce the unsupervised information contained in the whole sample set, which consists of a small amount of labeled data and a large amount of unlabeled data, as a regularization constraint in the training process of supervised HSI classification and to backpropagate it through the classification loss. The supervised classification model can thus learn the unsupervised information of the full set of samples in addition to the information contained in the labeled samples, which alleviates the overfitting caused by small samples. Consider a given hyperspectral dataset $x \in \mathbb{R}^{w \times h \times d}$, where $w \times h$ represents the width and height of the image and $d$ represents the number of spectral bands. There are a total of $n$ samples in the input dataset, of which $l$ samples belong to the labeled dataset $X_L$. The $n - l$ samples other than the labeled samples constitute the unlabeled dataset $X_U$. The purpose of the classification task is to train a classifier with $X_L$ and use this classifier to correctly classify $X_U$.
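The problem setup above can be sketched in NumPy as follows. This is only a toy illustration: the cube size, the number of labeled pixels, and the convention that label 0 means "unlabeled" are assumptions, not details given in the article.

```python
import numpy as np

rng = np.random.default_rng(0)
w, h, d = 10, 12, 30                      # hypothetical image size and band count
cube = rng.random((w, h, d))              # hyperspectral cube of shape (w, h, d)

# Ground-truth map: 0 means "no label available" (assumed convention).
label_map = np.zeros((w, h), dtype=int)
labeled_positions = rng.choice(w * h, size=15, replace=False)
label_map.flat[labeled_positions] = rng.integers(1, 4, size=15)  # toy classes 1..3

pixels = cube.reshape(-1, d)              # n = w*h samples, one d-band spectrum each
flat_labels = label_map.ravel()
is_labeled = flat_labels > 0

X_L, y_L = pixels[is_labeled], flat_labels[is_labeled]   # l labeled samples
X_U = pixels[~is_labeled]                                # n - l unlabeled samples
n, l = pixels.shape[0], X_L.shape[0]
```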

A. Unsupervised Pretraining Based on Stacked AE
In order to make better use of the information contained in unlabeled samples, this article first designs a stacked AE (SAE), which uses backpropagation to perform unsupervised learning on the labeled and unlabeled samples and to extract their features together. The features are then input into the corresponding classifier for end-to-end training. The purpose of the AE is to learn an effective representation of the input data. The AE uses a three-layer network that treats the original data as its own target output and constructs a supervision error for training from the loss between the actual output and the target output. After the training is completed, the output layer is removed to obtain the feature representation of the input data. The structure of the AE is shown in Fig. 1.
The training process of the AE is as follows. In order to learn more meaningful representations and prevent the AE from degenerating into a linear encoder that learns the identity mapping, this article adds a regularization constraint to the hidden layer L2 and constructs the AE as a sparse AE. Specifically, a sparsity penalty is added to the reconstruction error during training, as in (3). In order to limit the sparsity of the AE hidden layer neurons, KL divergence is used to constrain the average activation value of most of the hidden layer neurons: a sparsity parameter is specified that represents the desired average activation of the hidden neurons on the training set, KL divergence is used to measure the relative entropy between the expected activation and the actual activation of each neuron, and this term is then added as a regular term to the objective function. In the resulting loss function of the AE, $m$ represents the dimension of the input data, $a_j$ represents the activation value of hidden layer neuron $j$, and $\beta$ is the weight of the sparsity penalty term. Since KL divergence is an asymmetric measure of the difference between two probability distributions, the KL-divergence sparse regularized AE introduced in this article can better learn the similarity information between samples of $X_L$ and $X_U$ that belong to the same class. In deep learning, a deep network can learn multiple representations of the original data layer by layer. Based on the same principle, this article uses an SAE structure that stacks three AEs, where the input of each AE is the output of the previous AE, so that increasingly abstract feature representations are learned. The structure of the stacked AE is shown in Fig. 2.
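A minimal sketch of a KL-sparsity-regularized AE loss is given below. It follows the commonly used sparse-AE formulation (reconstruction error plus $\beta$ times the sum of Bernoulli KL divergences between a target sparsity and each hidden unit's mean activation) and is not claimed to reproduce the article's own equations. The sigmoid activation, the values `rho=0.05` and `beta=3.0`, and all function names are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def kl_sparsity_penalty(rho, rho_hat, eps=1e-8):
    """Sum over hidden units of KL(rho || rho_hat_j) for Bernoulli activations."""
    rho_hat = np.clip(rho_hat, eps, 1.0 - eps)   # avoid log(0)
    return np.sum(rho * np.log(rho / rho_hat)
                  + (1.0 - rho) * np.log((1.0 - rho) / (1.0 - rho_hat)))

def sparse_ae_loss(x, w_enc, b_enc, w_dec, b_dec, rho=0.05, beta=3.0):
    """Reconstruction error plus beta times the KL sparsity penalty."""
    hidden = sigmoid(x @ w_enc + b_enc)       # activations a_j for every sample
    recon = sigmoid(hidden @ w_dec + b_dec)   # the target output is the input itself
    recon_error = 0.5 * np.mean(np.sum((recon - x) ** 2, axis=1))
    rho_hat = hidden.mean(axis=0)             # average activation of each hidden unit
    return recon_error + beta * kl_sparsity_penalty(rho, rho_hat)

rng = np.random.default_rng(0)
x = rng.random((32, 20))                      # 32 toy spectra with 20 bands in [0, 1]
w_enc, b_enc = rng.normal(0, 0.1, (20, 8)), np.zeros(8)
w_dec, b_dec = rng.normal(0, 0.1, (8, 20)), np.zeros(20)
loss = sparse_ae_loss(x, w_enc, b_enc, w_dec, b_dec)
```

The penalty is zero only when every hidden unit's mean activation equals the target sparsity, which is what pushes most hidden units toward low average activity.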
In order to implement an SAE for shared feature extraction and layer-by-layer unsupervised pretraining, the labeled dataset and the unlabeled dataset are used to train the KL-divergence sparse regularized AEs step by step. For the input vector x, a first-order feature representation of x is obtained first; this is then input to the next AE to obtain the second-order feature representation of the original data x, which is in turn input to the last AE. After the last AE, a softmax layer is added. The output of each layer serves as the input to the next layer, and the high-level feature representation of x is output after processing. After unsupervised pretraining, the softmax layer and the last AE are removed, and the final SAE with a three-layer structure is obtained. The pretrained SAE fits the structure of the training data to a large extent. This SAE serves as the shared feature extractor of the labeled data and unlabeled data of the image, and it can well obtain the unsupervised information contained in the whole sample, that is, the relationship between sample similarity and corresponding label similarity.

B. Few Shot HSI Classification Framework With Shared Unsupervised Information
In order to alleviate the overfitting problem caused by the small number of labeled samples, this article proposes a training framework (Shared Unsupervised Information Classification Framework, SUICF) that shares the unsupervised information contained in all samples with the supervised classification process, using the loss function to guide the supervised classification model, as shown in Fig. 3.
The model is completed in two steps. On the one hand, the SAE uses backpropagation to perform feature extraction on the raw data through unsupervised learning, and the extracted features of the labeled data and unlabeled data are input into the supervised feature extractor and the unsupervised feature extractor, respectively. On the other hand, all data are clustered. This article uses the K-means clustering algorithm with the parameter k set to the number of categories of the input data; after clustering, every sample receives the pseudolabel of its own cluster.
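The pseudolabeling step can be sketched with a plain Lloyd's-algorithm K-means in NumPy, shown below. The random initialization from data points, the iteration cap, and the toy features are assumptions; the article does not state which K-means variant or initialization is used.

```python
import numpy as np

def kmeans_pseudolabels(features, k, n_iter=50, seed=0):
    """Cluster features with Lloyd's algorithm and return (pseudolabels, centers)."""
    rng = np.random.default_rng(seed)
    # Initialize centers from k distinct samples (assumed initialization scheme).
    centers = features[rng.choice(len(features), size=k, replace=False)].copy()
    for _ in range(n_iter):
        dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        new_centers = centers.copy()
        for c in range(k):
            members = features[assign == c]
            if len(members) > 0:              # keep the old center if a cluster empties
                new_centers[c] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Final assignment so that labels are consistent with the returned centers.
    dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1), centers

rng = np.random.default_rng(1)
labeled_feats = rng.normal(0.0, 0.1, (10, 4))     # toy features of labeled samples
unlabeled_feats = rng.normal(3.0, 0.1, (30, 4))   # toy features of unlabeled samples
all_features = np.vstack([labeled_feats, unlabeled_feats])
pseudolabels, centers = kmeans_pseudolabels(all_features, k=2)
```

Here `k` plays the role of the number of categories, and every sample, labeled or not, receives a cluster index as its pseudolabel.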
Specifically, based on the KL sparse stacked AE structure, an SFEM is designed to extract features from the labeled data and the unlabeled data in the feature extraction phase to obtain their consistency information. The supervised learning and unsupervised clustering processes are used to classify and cluster the two kinds of data in the same dataset. Because unsupervised clustering is introduced, the information obtained from unsupervised learning is introduced into the supervised learning process in the form of a classification loss.
Fig. 3. Few-shot HSI classification framework for sharing unsupervised information. The upper part is the SAE, a layer-by-layer pretrained network whose parameters are updated by the backpropagation algorithm; the lower part is K-means, which obtains pseudolabels from the fused data.
The pretrained SAE fits the structure of the training data to a large extent. As the shared feature extractor of the labeled data $X_L$ and the unlabeled data $X_U$ of the image, this SAE can well obtain the unsupervised information contained in the whole sample, that is, the relationship between sample similarity and corresponding label similarity.
The interclass similarity between the data captured by the K-means algorithm is learned by the CNN using the pseudolabels generated by the clustering, so the unsupervised information is effectively strengthened in a supervised way. The feature extraction module based on the pretrained shared SAE effectively shares the unsupervised information in the data, makes the unsupervised information flow to the supervised task, and provides an effective regularization for the network. In this framework, the inputs of the three branches pass through three softmax layers to calculate the probability that each pixel belongs to a certain category. The cross-entropy between the supervised features of the supervised data and their corresponding pseudolabels is calculated as the regular term $J_1$. Similarly, $J_2$ is the cross-entropy between the unsupervised features of the supervised data and their corresponding pseudolabels, and $J_3$ is the cross-entropy between the unsupervised data and their corresponding pseudolabels. In our classification model, a KL-divergence sparse stacked AE is designed. Its function is to perform feature extraction on the labeled data and unlabeled data in the sample in the same way. In addition, the unsupervised pretraining retains its weights, which reduces the training parameters of the classifier, reduces the structural risk of the classification model, and improves the classification accuracy.

C. Classification Model Design Based on 3-D CNN
In order to extract the spatial and spectral features of the input raw hyperspectral data at the same time, 3-D CNN is used as the backbone network of the shared unsupervised information classification model. 3-D CNN is usually used to process video files because it takes three-dimensional data as input, so it can capture the two-dimensional image features and the one-dimensional temporal features in video files at the same time; it has therefore been successful in dynamic target recognition and human behavior understanding. HSIs are different from images in ordinary computer vision tasks. They combine one-dimensional features that record the spectral response of an object with two-dimensional features that characterize the spatial distribution of the target. Therefore, using 3-D CNN to process HSIs directly yields a joint spectral-spatial representation of the original data, which is then trained end to end, making the classification task easier to implement and more accurate.
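To make the joint spectral-spatial operation concrete, the sketch below applies a single 3-D kernel to a small hyperspectral patch with a naive "valid" cross-correlation. It only illustrates the basic operation inside a 3-D convolutional layer; the patch size, the kernel size, and the averaging kernel are assumptions, and a real layer would use many learned kernels, a bias, and a nonlinearity.

```python
import numpy as np

def conv3d_valid(volume, kernel):
    """Naive 'valid' 3-D cross-correlation of one volume with one kernel."""
    H, W, D = volume.shape
    kh, kw, kd = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1, D - kd + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for b in range(out.shape[2]):
                # Each output value mixes a spatial window with a spectral window.
                out[i, j, b] = np.sum(volume[i:i + kh, j:j + kw, b:b + kd] * kernel)
    return out

patch = np.arange(5 * 5 * 8, dtype=float).reshape(5, 5, 8)  # toy 5x5 patch, 8 bands
kernel = np.ones((3, 3, 3)) / 27.0                           # averaging kernel
feature_map = conv3d_valid(patch, kernel)
```

Because the kernel slides along the band axis as well as the two spatial axes, the output keeps a spectral dimension, which a 2-D convolution over stacked bands would collapse.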
Based on the network structure of 3-D CNN, this section presents the main processing units included in the proposed framework, namely the SAE, the supervised and unsupervised feature extraction modules, the parameterization details of the classifier, and the implementation of the clustering operation used for the regularization of the model. Finally, this section introduces the training process of this classification model.
1) Stack-type AE: The SAE designed in this section has two AEs based on KL divergence and a softmax classifier. The purpose of unsupervised pretraining is to constrain the hidden weights W and bias terms of the network within the parameter space, which generates a better starting point than random initialization for the subsequent supervised training phase. Specifically, the two hidden AEs use the same structure, but their inputs are different: the input to the second AE comes from the output of the first AE. The parameters are set as in Table I. After pretraining the SAE, the decoder is detached, the weights of the encoder are saved, and the SAE is added to the 3-D CNN-based classification model. This main classification model is trained with the loss function formed by three cross-entropy terms.
2) Supervised and unsupervised feature extraction module: The main steps of the supervised feature extraction module include global average pooling (GAP), batch normalization (BN), and nonlinear activation. Using GAP instead of a fully connected operation reduces the redundancy of fully connected parameters. BN is used to improve the training speed and to make training less sensitive to the weight scale. LeakyReLU is used for nonlinear activation to increase the convergence rate. The unsupervised feature extraction module uses the same settings as the supervised feature extraction module.
3) Classifier settings: This article sets up three classifiers to classify supervised features, unsupervised features, and fusion features. The classifiers are all implemented with a softmax layer, and their purpose is to map the different input features to the real label space and the cluster label space.
4) Clustering algorithm: The important part of the information in the shared unsupervised information classification model in this article is the prior knowledge of the sample. The prior knowledge from unsupervised clustering is introduced into the supervised classifier to regularize the classification model, thereby reducing the model's dependence on samples. This article uses the K-means algorithm to cluster all pixels and characterize the sample prototypes. The value of k is set to the number of true categories in the sample. That is, the pseudolabels obtained by the clustering algorithm are used as input data.
5) Loss function: The loss function in this article is given by the combination of the cross-entropies obtained by the three classifiers. The specific formula is as follows:
$$J = \lambda_1 J_1 + \lambda_2 J_2 + \lambda_3 J_3$$
where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are balance coefficients, and $J_1$, $J_2$, and $J_3$ are the cross-entropies between the outputs of the three classifiers and the real labels and the pseudolabels produced by clustering, respectively. Since the three losses are differentiable, the backpropagation algorithm can be used to effectively train this framework in an end-to-end manner.
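The weighted combination of three cross-entropy terms with balance coefficients λ 1 , λ 2 , and λ 3 can be sketched as below. The pairing of each classifier output with real labels or pseudolabels is left to the caller, and the coefficient values `(1.0, 0.5, 0.5)` are purely illustrative assumptions rather than the article's settings.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(logits, targets):
    """Mean negative log-probability of the target class."""
    probs = softmax(logits)
    picked = probs[np.arange(len(targets)), targets]
    return -np.mean(np.log(picked + 1e-12))

def combined_loss(branches, lambdas=(1.0, 0.5, 0.5)):
    """branches: three (logits, targets) pairs, one per classifier."""
    terms = [cross_entropy(logits, targets) for logits, targets in branches]
    total = sum(lam * term for lam, term in zip(lambdas, terms))
    return total, terms

# Toy check: uniform predictions over 4 classes give cross-entropy log(4).
uniform_logits = np.zeros((6, 4))
toy_targets = np.array([0, 1, 2, 3, 0, 1])
total, (j1, j2, j3) = combined_loss([(uniform_logits, toy_targets)] * 3)
```

In a deep learning framework the same weighted sum would be built from differentiable tensors so that gradients flow back through all three branches at once.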

IV. EXPERIMENTS RESULTS AND ANALYSIS
This section conducts different experiments on four HSI classified datasets [Pavia University, Kennedy Space Center (KSC), Indian Pines, Salinas] to verify the effectiveness of the proposed method.

A. Datasets
This article uses four internationally popular public benchmark hyperspectral datasets to evaluate the experimental results of the proposed HSI classification algorithm, namely 1) Pavia University, 2) KSC, 3) Indian Pines, and 4) Salinas.
1) Pavia University is a scene captured by ROSIS sensors during a flight mission over Pavia in northern Italy. The size of the original data is 610×610×103, the geometric resolution is 1.3 m, and it contains nine types of ground objects, such as asphalt, gravel, grass, and trees.

2) KSC was collected by NASA's AVIRIS instrument at an altitude of approximately 20 km over the KSC in Florida. AVIRIS collected data in 224 bands of 10-nm width with center wavelengths from 400 to 2500 nm, and the spatial resolution was 18 m.

B. Comparison Methods
In order to verify the effectiveness of the algorithm proposed in this article, it is compared on the four datasets with classic algorithms in the field of HSI classification. The comparison algorithms are SVM, DBN, and 3-D CNN [37]. SVM is a typical example of the traditional algorithms for HSI classification, and it can still perform well in the case of limited training samples. DBN is a generative model used to represent the probability distribution between predicted data and labels. The pretrained and fine-tuned DBN has good performance in HSI classification tasks. 3-D CNN is a successful recent example of a network used to extract the spatial-spectral features of HSIs simultaneously. The backbone network of the algorithm in this article is 3-D CNN.
In the experiment, the proportion of samples used for training on the four datasets is 5%, and the remaining samples are used for testing.
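A 5% training split can be drawn as in the sketch below. Sampling the 5% separately within each class (a stratified split) and keeping at least one training sample per class are assumptions; the article states only the overall proportion.

```python
import numpy as np

def stratified_split(labels, train_fraction=0.05, seed=0):
    """Return (train_idx, test_idx), sampling train_fraction of every class."""
    rng = np.random.default_rng(seed)
    train_idx = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        # Keep at least one training sample per class (assumed safeguard).
        n_train = max(1, int(round(train_fraction * len(idx))))
        train_idx.extend(idx[:n_train])
    train_idx = np.sort(np.array(train_idx))
    test_idx = np.setdiff1d(np.arange(len(labels)), train_idx)
    return train_idx, test_idx

toy_labels = np.repeat([0, 1, 2], [200, 100, 40])   # hypothetical unbalanced classes
train_idx, test_idx = stratified_split(toy_labels)
```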

D. Experiments Results
Using the same experimental settings, we conducted a series of classification experiments with the different methods on the four hyperspectral datasets, and the results are summarized in Tables II-V. It is worth mentioning that the test data did not participate in unsupervised learning.
We use supervised learning and unsupervised clustering to classify and cluster the two kinds of data in the same dataset. Because unsupervised clustering is introduced, the information obtained from unsupervised learning is introduced into the supervised learning process in the form of a classification loss. This process not only provides information on similarities and differences between classes but also obtains more information from supervised learning.

1) Comparison Results on the Pavia University Dataset:
For the classification of bitumen, self-blocking bricks, and painted metal sheets, the algorithm proposed in this article achieves 100% accuracy, which shows its ability to distinguish man-made materials. Table II shows that, on the Pavia University dataset, compared with the traditional SVM algorithm, the proposed algorithm increases the average classification accuracy (AA) by 11.14%, the overall classification accuracy (OA) by 11%, and the Kappa coefficient by 0.1316. In addition, compared with the deep-learning-based methods, the proposed algorithm increases AA by more than 5%, OA by 5%, and Kappa by 0.0537.
2) Comparison Results on the KSC Dataset: Table III shows that both the traditional method and the deep-learning-based methods have their merits on the KSC dataset. The deep-learning-based methods perform better than the traditional SVM because of their deep feature extraction ability, and their results on the shrub (Scrub) and salt marsh (Salt Marsh) categories are significantly better than those of SVM. However, SVM classifies category 13, graminoid marsh, better. Graminoid marsh accounts for a small proportion of the KSC dataset; SVM is not sensitive to high-dimensional data, whereas the deep-learning-based methods, including the algorithm in this article, perform poorly on categories with few samples because they have too many parameters. Compared with the existing classification algorithms, the algorithm proposed in this article increases AA by 2.2%, OA by 4.2%, and Kappa by 0.0221.

3) Comparison Results on the Indian Pines Dataset:
The classification accuracy of 3-D CNN on the wheat category is higher than that of the algorithm in this article. However, in most categories, the algorithm in this article is better than 3-D CNN. Overall, Table IV shows that the proposed algorithm improves AA by more than 6.3%, OA by 5.32%, and Kappa by more than 0.04 on the Indian Pines dataset.

4) Comparison Results on the Salinas Dataset:
Table V shows that, on the Salinas dataset, our experimental results are still better than those of SVM, DBN, and 3-D CNN. The accuracy in many classes reaches 100%. Also, OA is 2%-7% higher and AA is 2%-4% higher than those of the other methods.

5) Other Hyperspectral Classification Frameworks:
Tables VI and VII show that our model is much better than the other models on the Pavia University dataset. The Grass-pasture-mowed and Oats categories of the Indian Pines dataset have only 28 and 20 samples, respectively, so even when 5% of the samples were selected, only one sample was available for training in this experiment. However, our algorithm regularizes the supervised classification model by making full use of the unsupervised information learned from the unlabeled data, which effectively alleviates the overfitting problem caused by the small number of samples. Because the Pavia University dataset has a larger sample size than the Indian Pines dataset, its classification accuracy is relatively high, and our algorithm also has obvious advantages over SAGP, MCNN-CP, and others.
For the Salinas dataset (Table VIII), the sample distribution is more balanced than in the Indian Pines and Pavia University datasets, and the classification difficulty is lower. Our algorithm effectively strengthens the input unsupervised information in a supervised way, and the feature extraction module efficiently shares the unsupervised information in the data. The unsupervised information therefore flows to the supervised task during classification and provides an effective regularization for the network, so the experimental results are more robust. [...] SAGP and SSUN, respectively. Besides, compared with the BTA-Net algorithm, although our algorithm has slightly higher computational complexity, it still has certain advantages when OA, AA, and Kappa on the Indian Pines, Pavia University, and Salinas datasets are considered together. In general, our algorithm greatly improves accuracy while maintaining relatively good computational complexity.
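The comparisons above are stated in terms of OA, AA, and the Kappa coefficient. For reference, the following minimal sketch computes the three metrics from a confusion matrix using their standard definitions; the function name `classification_metrics` and the convention that rows are true classes and columns are predicted classes are assumptions made for illustration.

```python
import numpy as np

def classification_metrics(conf):
    """Return (OA, AA, Kappa) for a square confusion matrix.

    conf[i, j] counts samples of true class i predicted as class j
    (row/column convention assumed here). Every class needs >= 1 sample.
    """
    conf = np.asarray(conf, dtype=float)
    total = conf.sum()
    oa = np.trace(conf) / total                       # overall accuracy
    per_class = np.diag(conf) / conf.sum(axis=1)      # per-class recall
    aa = per_class.mean()                             # average accuracy
    # Chance agreement from the row and column marginals.
    pe = (conf.sum(axis=0) * conf.sum(axis=1)).sum() / total**2
    kappa = (oa - pe) / (1.0 - pe)
    return oa, aa, kappa
```

Because AA averages the per-class accuracies with equal weight, it is more sensitive than OA to classes with very few samples, such as Oats in the Indian Pines dataset.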

F. Classification Performance in the Case of Small Samples
This experiment verifies that the algorithm in this article alleviates the overfitting of the model by introducing regularization and improves the classification accuracy in the case of small samples, addressing the insufficient classification accuracy of existing algorithms when training samples are few. Specifically, the proportion of training samples to the total number of pixels is set to 0.5%, 1%, 2%, 3%, and 4% on the four datasets, and the results are compared with those of the other algorithms. The results are shown in Fig. 4.
With a small number of training samples, taking 1% and 3% as examples, the classification accuracy of the proposed algorithm on the four datasets is higher than that of the comparison algorithms. In particular, with 1% training samples, the OA of the proposed algorithm exceeds 75%, which is a large improvement over the comparison algorithms. The comparison with 3-D CNN in particular shows that, because this article introduces the unsupervised information of the samples into the training process, the accuracy of the model does not decrease quickly when the sample size is very low, which improves the effectiveness of the algorithm when training samples are too few.
For the third type of object, gravel (Gravel), both the traditional SVM algorithm and the deep-learning-based methods show misclassification, as shown in Fig. 5. Part of the gravel is classified as trees (Trees). The spectral information of gravel and trees extracted from the Pavia University dataset, shown in Fig. 6, indicates that the spectral curves of the two classes are relatively similar, so they are easy to misclassify. However, the accuracy of the algorithm proposed in this article is above 96% in these two categories, which demonstrates its effectiveness on classes with similar spectra.
It can be seen from Fig. 7 that, in the comparison experiments, some of the shrubs (Scrub) in the upper right corner of the image were mistakenly classified as salt marsh (Salt Marsh). Our algorithm avoids this, and its classification accuracy in the salt marsh category is 100%. In addition, the algorithm in this article also classifies mud flats more accurately.
As can be seen from Fig. 8, wheat in the Indian Pines dataset is easily classified as woods, and SVM and DBN have low classification accuracy for these two categories. There are also many misclassifications between stone-steel-towers and buildings-grass-trees-drives.
It can also be seen from Fig. 9 that, on the Salinas dataset, our algorithm performs better than SVM, DBN, and 3-D CNN on the class fallow_rough_plow.
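The confusion between gravel and trees on Pavia University is attributed above to the similarity of their spectral curves. The article does not state how this similarity was measured; one common way to quantify it, used here purely as an illustrative assumption, is the spectral angle between two spectra (for example, the mean spectra of two classes). The function name `spectral_angle` is hypothetical.

```python
import numpy as np

def spectral_angle(a, b):
    """Angle in radians between two spectra; smaller means more similar shape.

    a, b: 1-D sequences of reflectance values over the same bands.
    The measure ignores overall brightness, since scaling a spectrum
    does not change its direction.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    # Clip guards against values slightly outside [-1, 1] from rounding.
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))
```

A small angle between the mean spectra of two classes would be consistent with the kind of misclassification seen in the comparison methods.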

V. CONCLUSION
Based on the idea of introducing regularization to alleviate overfitting, this article shares the unsupervised information of the complete set of samples with the training process of supervised classification and designs a 3-D CNN-based HSI classification model with shared unsupervised information. Because HSIs have few training samples in practical applications, the proposed classification method introduces unsupervised information to alleviate the overfitting problem caused by small samples in deep models. Compared with the traditional SVM method and typical deep-learning-based methods such as DBN and 3-D CNN, the proposed algorithm has higher classification accuracy in most categories. It also remains advantageous when the number of training samples is reduced.