Spatial-spectral hierarchical vision permutator for hyperspectral image classification

ABSTRACT In recent years, the convolutional neural network (CNN) has been widely applied to hyperspectral image classification because of its powerful feature capture ability. Nevertheless, the performance of most convolutional operations is limited by the fixed shape and size of the convolutional kernel, which prevents CNNs from fully extracting global features. To address this issue, this article proposes a novel classification architecture named the spatial-spectral hierarchical Vision Permutator (S2HViP). It contains a spectral module and a spatial module. In the spectral module, we divide the data into groups along the spectral dimension and treat each pixel within a group as a spectral token. Spectral long-range dependencies are obtained by fusing intra- and inter-group spectral correlations captured by multi-layer perceptrons (MLPs). In the spatial module, we first model spatial information via morphological methods and divide the resulting spatial feature maps into spatial tokens of uniform size. Then, the global spatial information is extracted through MLPs. Finally, the extracted spectral and spatial features are combined for classification. Notably, the proposed MLP structure is an improved Vision Permutator, which adopts a hierarchical fusion strategy aimed at generating discriminative features. Experimental results show that S2HViP provides competitive performance compared to several state-of-the-art methods.


Introduction
Hyperspectral image (HSI) data contain hundreds of narrow and contiguous spectral bands, with wavelengths spanning the visible to the infrared spectrum (Chang, 2007). Because of its rich spectral and spatial information, HSI has been widely used in various fields, such as target detection, precision agriculture, mineral exploration, and image fusion (Carrino et al., 2018; Imani & Ghassemian, 2020; Shen et al., 2022; Wei et al., 2019). HSI classification, which aims to assign each pixel's spectrum to a certain class, is one of the most vibrant tasks in remote sensing applications and has received a great deal of attention (Bioucas-Dias et al., 2013; Camps-Valls et al., 2013; Liu et al., 2020). In the early stages of the study, most methods used exclusively the spectral characteristics of a pixel to complete the classification task. Some of these methods focused on feature extraction or dimensionality reduction, such as principal component analysis (PCA; Prasad & Bruce, 2008), independent component analysis (Tu, 2000), and linear discriminant analysis (Bandos et al., 2009). Other pixel-wise approaches were designed as classifiers, e.g. support vector machines (SVM; Melgani & Bruzzone, 2004), multinomial logistic regression (Pal, 2012), and random forest (Ham et al., 2005).
The problem of purely using spectral information is not only the lack of spatial information, but also that spectral data are affected by incident illumination, atmospheric effects and instrument noise. These factors greatly limit the accuracy of pixel-level classification methods (Li, Song et al., 2019). To compensate for the inadequacy of spectral information alone, some researchers proposed new methods to obtain spatial features as supplementary information (Benediktsson et al., 2005; Camps-Valls et al., 2006; Fauvel et al., 2013; C. Zhao et al., 2017). The results demonstrated the validity of spatial-spectral classification approaches. In addition, some approaches attempted to explore effective feature extraction algorithms for better revealing the structural information of HSI, which can help with subsequent classification tasks. Manifold learning is one of the widely used techniques that can capture the nonlinear characteristics of data; it has been combined with other approaches to improve performance, such as sparse representation (Y. Y. Tang et al., 2014), the hypergraph framework (Duan et al., 2021), and geodesic distance (Duan et al., 2022).
In the past few years, with the continuous improvement of computer performance, deep learning has rapidly become a research hotspot (L. Zhang et al., 2016). The learning process of deep learning methods is fully automated through data-driven technology (Liu, Wu et al., 2022). Owing to its outstanding performance, deep learning has been used in object detection, action recognition, and natural language processing (Minar & Naher, 2018). Furthermore, deep learning has also been adopted by many researchers to implement HSI classification tasks. Compared with traditional approaches, deep learning can learn complex and abstract information from data by simulating the recognition process of the brain. Among the different deep network models, convolutional neural networks (CNNs) are widely used in HSI classification research (Makantasis et al., 2015; Yang et al., 2018). CNNs capture contextual spatial information in an end-to-end and hierarchical manner. Additionally, CNNs adopt a weight-sharing mechanism, which greatly reduces the number of network parameters. Although the role of CNNs in HSI classification is undeniable, they are not good at capturing long-range dependencies in the data.
The advent of the vision Transformer (Dosovitskiy et al., 2020) has given researchers new inspiration to reconsider the image classification process in terms of sequential data in order to obtain long-range dependencies. The Transformer was first developed by Vaswani et al. (2017) and applied in natural language processing. Numerous experiments have verified the powerful image processing capability of Transformers in the computer vision domain (Carion et al., 2020; Z. Liu et al., 2021; Touvron et al., 2021). In contrast to the Transformer, Tolstikhin et al. (2021) discarded the self-attention operation and proposed the MLP-Mixer (a.k.a. Mixer) architecture, based only on multi-layer perceptrons (MLPs). It obtains a global receptive field through matrix transposition and token-mixing projection. Subsequently, Hou et al. (2022) proposed a new MLP-based method named Vision Permutator (ViP), which further splits the encoding of the spatial dimensions to produce more precise positional information. ViP therefore has three branches for encoding the width, height and channel dimensions, respectively, and it directly sums the features of the three branches at the end. However, for HSI, the channel dimension also contains rich features, and aggregating the three branches in one step may result in insufficient fusion of spatial and spectral information.
To address the problems mentioned above, a novel spatial-spectral hierarchical Vision Permutator (S2HViP) network is proposed in this paper for HSI classification. First, in response to the fact that HSI feature maps are usually represented as 3-D cubes, we design a hierarchical Vision Permutator (HViP). By utilizing a hierarchical fusion strategy, HViP can not only extract long-range dependencies in parallel from each of the three dimensions of HSI, but also improve the fusion between dimensions. Then, we design a spectral feature extraction module and a spatial feature extraction module to capture different information. In the spectral module, we group the raw data along the spectral dimension, and the pixels in each group become different spectral tokens. Spectral features are obtained by computing intra- and inter-group spectral correlations. In the spatial module, morphological methods are introduced to help the subsequent network model deep spatial features. Finally, the features extracted from the different modules are fused to further improve the representation capacity of the deep features.
The key contributions of this article can be summarized as follows: (1) We propose a novel S2HViP network for HSI classification.

Related works
In recent years, HSI classification research has continued to advance. During this period, researchers have tried a variety of methods to improve classification performance. We divide those methods into traditional approaches, CNN-based approaches, Transformer-based approaches, and MLP-based approaches. In this section, we review the development of these classification methods.

Traditional approaches
In the early days, the spectral characteristics of HSI were the focus of most methods. However, high intra-class spectral variability and low inter-class spectral variability both bring challenges for pixel-wise approaches (Ghamisi et al., 2017; L. He et al., 2018).
Considering that the 3-D structure of HSI contains a wealth of spectral and spatial features, the introduction of spatial information can effectively improve classification performance (Ghamisi et al., 2018). Morphological profiles (MPs) are generated by morphological transformation approaches and are used to model spatial information (Pesaresi & Benediktsson, 2001). On this basis, the extended MP was further proposed for the HSI classification task (Benediktsson et al., 2005). Random fields and probabilistic graphical models can incorporate spatial features into the classification stage. The Markov random field is one of the classic methods used for HSI classification (J. Li et al., 2012); it can be combined with other techniques, such as SVM (Ghamisi et al., 2014), multinomial logistic regression (Khodadadzadeh et al., 2014), ensemble classifiers (Xia et al., 2015) and active learning (S. Sun et al., 2015). Besides, the conditional random field (J. Zhao et al., 2018), superpixel segmentation (Tu et al., 2020), collaborative representation (Liu et al., 2016) and subspace projection (Wang et al., 2016) are also effective ways to utilize spatial features.

CNN-based approaches
Benefiting from advances in parallel computing, deep learning has become a popular way to handle data. Compared with traditional approaches, deep learning-based classifiers use multiple nonlinear layers to hierarchically construct high-level features in an automated manner. Numerous experiments demonstrate its effectiveness in the field of HSI classification (Audebert et al., 2019; Zhu et al., 2017). Typical methods include the stacked autoencoder (Chen et al., 2014), deep belief network (T. Li et al., 2014), recurrent neural network (Mou et al., 2017), and CNN (Chen et al., 2016). However, most of these models take vector inputs, except for the CNN, which causes the spatial contextual relationships between pixels to be ignored. The excellent ability of CNNs to extract spatial information makes them popular in the HSI field (Guo et al., 2020; X. Li et al., 2019). Paoletti et al. (2019) presented a deep residual network using pyramidal bottleneck residual blocks. Song et al. (2018) introduced a deep feature fusion network (DFFN) to fuse the outputs of different hierarchical layers and obtain more robust features. In addition to the 2D-CNN-based approaches described above, using a 3-D kernel to extract spectral-spatial features is also a natural solution for HSI classification (Hamida et al., 2018; Zhong et al., 2018). Roy et al. (2020) proposed HybridSN, which combines 3-D and 2-D convolutions to extract spectral-spatial features. A multi-scale method was presented by Yu et al. (2022); it aims to make full use of the global and multi-scale information of HSI.

MLP-based approaches
Recently, Gong et al. (2022) proposed an MLP architecture that applies a ladder-like connected structure to obtain contextual interaction for HSI classification tasks. Moreover, Lin et al. (2022) proposed a multi-scale U-shape MLP, which consists of a designed multi-scale channel block and a U-shape MLP.

Proposed framework
In this section, we introduce the overall structure of S2HViP in detail. Our method contains two modules: the spatial module and the spectral module.
Figure 1 shows the framework diagram of our method.

Spectral module
An initial HSI data cube $I \in \mathbb{R}^{H \times W \times B}$ is given, where $H$, $W$, and $B$ represent the height, the width, and the number of spectral bands, respectively. We divide $I$ into $M$ non-overlapping groups along the spectral dimension by a 3D-CNN to get grouped data $X_{group} \in \mathbb{R}^{H \times W \times M}$, where $M$ is the number of groups, and consider each pixel in the groups as a spectral token. If $B$ is not divisible by $M$, we use the first few bands of $I$ for padding. Then, $X_{group}$ is projected to spectral token data $X_{spe} \in \mathbb{R}^{H \times W \times C}$ by a linear embedding layer, where $C$ is the hidden-layer dimension. The above process is spectral patch embedding; implementation details can be found in Table 1. We feed $X_{spe}$ into the subsequent MLP-based layers. The flowchart of the hierarchical Permutator (HP) block is shown in Figure 2; the biggest difference between our method and the Transformer is that we abandon the self-attention operation. A basic HP block consists of LayerNorms, skip connections, a hierarchical Permute-MLP (a.k.a. Permute-MLP) and a Channel-MLP. We choose the Gaussian error linear unit (GELU; Hendrycks & Gimpel, 2016) as the activation function. The data handling process of HP can be expressed as
$$Y_{spe} = \text{Permute-MLP}(\text{LN}(X_{spe})) + X_{spe},$$
$$Z_{spe} = \text{Channel-MLP}(\text{LN}(Y_{spe})) + Y_{spe},$$
where LN means LayerNorm, and $Y_{spe}$ and $Z_{spe}$ represent the spectral features extracted at the two stages, respectively. $Z_{spe}$ serves as the input to the subsequent HP blocks.
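As a sketch of the block structure just described, the following numpy forward pass implements the two residual sub-layers (token mixing via a Permute-MLP, then channel mixing) with LayerNorm and GELU. The function names, the toy 9 × 9 × 16 shapes, and the expansion ratio of 3 are illustrative assumptions, not the paper's exact configuration; the token-mixing step is passed in as a callable so an identity stand-in can be used here.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize over the last (channel) axis, as LayerNorm does.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # Tanh approximation of the Gaussian error linear unit.
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def channel_mlp(x, w1, w2):
    # Two-layer position-wise MLP acting along the channel dimension.
    return gelu(x @ w1) @ w2

def hp_block(x, permute_mlp, w1, w2):
    """One HP block: Y = PermuteMLP(LN(X)) + X;  Z = ChannelMLP(LN(Y)) + Y."""
    y = permute_mlp(layer_norm(x)) + x   # token mixing with skip connection
    z = channel_mlp(layer_norm(y), w1, w2) + y  # channel mixing with skip connection
    return z

# Toy run: 9x9 spatial grid, C = 16 hidden channels, identity token mixing.
rng = np.random.default_rng(0)
x = rng.standard_normal((9, 9, 16))
w1 = rng.standard_normal((16, 48)) * 0.02   # expansion ratio 3 (assumed)
w2 = rng.standard_normal((48, 16)) * 0.02
z = hp_block(x, lambda t: t, w1, w2)
print(z.shape)  # (9, 9, 16)
```

The callable `permute_mlp` slot is where the hierarchical Permute-MLP of the next subsection would plug in.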
The visual illustration of the hierarchical Permute-MLP is shown in Figure 3; it has three branches. Linear projections are used to model the input 3-D features separately along their respective dimensions. In the spectral module, we denote the height, width and channel features of the extracted representation by $X_{spe}^{H}$, $X_{spe}^{W}$ and $X_{spe}^{C}$, respectively. The height and width dimensions represent the spectral correlation within the same group, and the channel dimension stands for the spectral correlation between groups. Specifically, we illustrate the information extraction process using the height branch as an example. We conduct a height-channel permutation operation on $X_{spe}$: the channel dimension is first divided into $T$ segments; the height axis and each channel segment are then permuted so that a subsequent linear projection mixes information along the full height extent, and the inverse permutation restores the original layout, yielding the height feature $X_{spe}^{H}$. The width feature $X_{spe}^{W}$ and the channel feature $X_{spe}^{C}$ are produced analogously.

Once $X_{spe}^{H}$, $X_{spe}^{W}$ and $X_{spe}^{C}$ have been produced by the three branches, we use a hierarchical fusion approach to obtain more recognisable features. The intra-group spectral information is first fused to refine the correlated features between different pixels in the same group. The fused features are then combined with inter-group correlations, which are obtained by mining the spectral features of different groups, to obtain spectral long-range dependencies. To distinguish the importance of different branches, we adopt the split-attention proposed by H. Zhang et al. (2020) to assign weights in the fusion process. The recalibrated fusion result of each stage is passed to the next FC layer. The fusion of the spectral correlations can be calculated as
$$\hat{X}_{spe}^{HW} = \text{FC}(\text{SA}(X_{spe}^{H}, X_{spe}^{W})),$$
$$\hat{X}_{spe} = \text{FC}(\text{SA}(\hat{X}_{spe}^{HW}, X_{spe}^{C})),$$
where $\text{FC}(\cdot)$ stands for an FC layer and $\text{SA}(\cdot)$ denotes the split-attention weighting.
$\hat{X}_{spe}$ denotes the output spectral features. After several HP blocks, the spectral information is sent to a global average pooling layer to obtain the spectral vector used for fusion and data classification.
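The two-stage fusion described above can be sketched as follows. The softmax over per-branch global descriptors is a deliberately simplified stand-in for the split-attention of H. Zhang et al. (2020), and the identity FC weights and toy shapes are illustrative assumptions rather than the paper's configuration.

```python
import numpy as np

def split_attention_fuse(branches):
    """Weight same-shaped branch features by softmax scores computed from
    their global average descriptors (a simplified stand-in for split-attention)."""
    stacked = np.stack(branches)                      # (n, H, W, C)
    desc = stacked.mean(axis=(1, 2))                  # (n, C) global descriptors
    scores = np.exp(desc) / np.exp(desc).sum(axis=0)  # softmax over the branches
    return (stacked * scores[:, None, None, :]).sum(axis=0)

def hierarchical_fuse(x_h, x_w, x_c, fc1, fc2):
    """Fuse the intra-group (height, width) branches first, then combine the
    result with the inter-group (channel) branch."""
    intra = split_attention_fuse([x_h, x_w]) @ fc1    # intra-group correlations
    return split_attention_fuse([intra, x_c]) @ fc2   # add inter-group correlations

rng = np.random.default_rng(1)
x_h, x_w, x_c = (rng.standard_normal((9, 9, 16)) for _ in range(3))
fc1 = np.eye(16)  # identity FC weights, for illustration only
fc2 = np.eye(16)
out = hierarchical_fuse(x_h, x_w, x_c, fc1, fc2)
print(out.shape)  # (9, 9, 16)
```

Because the branch weights sum to one per channel, feeding three identical branches through identity FCs returns the input unchanged, which is a convenient sanity check on the fusion.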

Spatial module
In the spatial module, the extended multi-attribute profile (EMAP; Dalla Mura, Atli Benediktsson et al., 2010) is utilized to extract preliminary spatial features. Specifically, attribute profiles (APs; Dalla Mura, Benediktsson et al., 2010) are a multi-level decomposition of the input image based on attribute filters. In contrast to MPs, APs can process images according to a variety of flexibly defined attributes. An AP is obtained by
$$\text{AP}(f) = \{\phi_n(f), \ldots, \phi_1(f), f, \gamma_1(f), \ldots, \gamma_n(f)\},$$
where $f$ is the input image, and $\phi_j(f)$ and $\gamma_j(f)$ represent opening and closing operators, respectively. Opening and closing are a pair of opposite morphological operators that process images through a sliding window called the structuring element (SE). The size of the SE affects the degree of image processing. Then, we calculate the AP for each principal component (PC) after dimension reduction to get extended APs (EAPs), which model the needed spatial features.
Concretely, $\text{EAP} = \{\text{AP}(\text{PC}_1), \text{AP}(\text{PC}_2), \ldots, \text{AP}(\text{PC}_m)\}$, where $m$ is the number of PCs retained after PCA. An EMAP stacks different EAPs with the aim of extracting different information from the scene, which can be expressed as
$$\text{EMAP} = \{\text{EAP}_{a_1}, \text{EAP}_{a_2}, \ldots, \text{EAP}_{a_k}\},$$
where $\{a_1, a_2, \ldots, a_k\}$ denotes the $k$ different attributes.
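To make the stacking structure AP to EAP to EMAP concrete, the sketch below builds a toy profile with plain grey-scale openings and closings over square SEs. Using the SE size as the varying parameter is a simplification of true attribute filters (which threshold attributes such as area on connected components), so this only illustrates how the maps are stacked, not the exact filtering.

```python
import numpy as np

def _filter(img, size, func):
    # Sliding-window min/max filter with a size x size square SE (edge-padded).
    r = size // 2
    padded = np.pad(img, r, mode="edge")
    out = np.empty(img.shape, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = func(padded[i:i + size, j:j + size])
    return out

def opening(img, size):   # erosion (min) followed by dilation (max)
    return _filter(_filter(img, size, np.min), size, np.max)

def closing(img, size):   # dilation (max) followed by erosion (min)
    return _filter(_filter(img, size, np.max), size, np.min)

def attribute_profile(pc, sizes):
    """Stack closings (coarse to fine), the image itself, then openings
    (fine to coarse), mirroring the AP layout."""
    return np.stack([closing(pc, s) for s in reversed(sizes)]
                    + [pc.astype(float)]
                    + [opening(pc, s) for s in sizes])

def emap(pcs, sizes):
    """Concatenate the AP of every principal component into an extended profile."""
    return np.concatenate([attribute_profile(pc, sizes) for pc in pcs])

rng = np.random.default_rng(2)
pcs = rng.standard_normal((3, 16, 16))   # m = 3 PCs after PCA (toy)
profile = emap(pcs, sizes=[3, 5])
print(profile.shape)  # (15, 16, 16): 3 PCs x (2 closings + original + 2 openings)
```

A useful property preserved by this sketch is anti-extensivity: each opening never exceeds the original image pointwise, so larger SEs progressively flatten bright structures.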
Mathematically, the input to the spatial module is also $I \in \mathbb{R}^{H \times W \times B}$. PCA is first used to process the data to obtain the primary spatial information of the $m$ PCs. After the EMAP procedure, we have preliminary spatial features $I_{emap} \in \mathbb{R}^{H \times W \times G}$, where $G$ is the number of spectral channels. The next step is to divide the obtained features into several tokens evenly and convert the channel dimension to the hidden-layer dimension. In this module, we choose a 2D-CNN to accomplish the spatial token embedding operation. The convolutional layer is composed of $C$ 2-D kernels, and the spatial size and stride of each 2-D kernel are both $p$. Then, $C$-dimensional spatial token data $X_{spa} \in \mathbb{R}^{h \times w \times C}$ are generated, where $h$ and $w$ denote the height and the width of the token grid, satisfying $h = H/p$ and $w = W/p$. The implementation details of the embedding operation are available in Table 2. These spatial tokens are sent into a series of HP blocks to extract deeper, robust spatial information. Instead of flattening the spatial dimensions, we input the 3-D data $X_{spa}$ directly and encode the width and height of the tokens separately. This enables HViP to capture long-range dependencies in one spatial dimension while preserving position information in the other. In the Permute-MLP layer, we first perform a weighted fusion of the location information of the height branch $X_{spa}^{H}$ and the width branch $X_{spa}^{W}$ to produce spatial feature representations. Following the hierarchical fusion strategy, after fusing the features of $X_{spa}^{H}$ and $X_{spa}^{W}$, the information from the channel branch $X_{spa}^{C}$ is utilized as a complement to generate the global spatial features. The corresponding calculation process is
$$\hat{X}_{spa}^{HW} = \text{FC}(\text{SA}(X_{spa}^{H}, X_{spa}^{W})),$$
$$\hat{X}_{spa} = \text{FC}(\text{SA}(\hat{X}_{spa}^{HW}, X_{spa}^{C})),$$
where $\hat{X}_{spa}$ denotes the output spatial features. The deep spatial long-range dependencies are obtained after several HP blocks. At the end of the spatial module, the final spatial vector is obtained by passing through a global average pooling layer and is also used for fusion and data classification.
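The equivalence between a p × p, stride-p convolutional embedding and a patch-wise linear projection can be sketched in a few lines of numpy. The 144 × 144 × 10 input, the 64-dimensional embedding, and the function name are illustrative assumptions, not the paper's settings from Table 2.

```python
import numpy as np

def patch_embed(x, weight, p):
    """Non-overlapping 2-D patch embedding: a p x p convolution with stride p
    equals cutting p x p patches and applying one shared linear projection.
    x: (H, W, G) feature map;  weight: (p*p*G, C)."""
    H, W, G = x.shape
    h, w = H // p, W // p                       # token grid: h = H/p, w = W/p
    patches = (x[:h * p, :w * p]
               .reshape(h, p, w, p, G)
               .transpose(0, 2, 1, 3, 4)        # (h, w, p, p, G)
               .reshape(h, w, p * p * G))       # flatten each patch
    return patches @ weight                     # (h, w, C) spatial tokens

rng = np.random.default_rng(3)
x = rng.standard_normal((144, 144, 10))         # toy EMAP output, G = 10 channels
weight = rng.standard_normal((9 * 9 * 10, 64)) * 0.02
tokens = patch_embed(x, weight, p=9)
print(tokens.shape)  # (16, 16, 64)
```

Each output token depends only on its own p × p patch, which is why the token grid keeps a meaningful height/width layout for the subsequent HP blocks.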

Weight fusion
Considering that the extracted spectral and spatial features are generated by separate modules, we utilize a weighted fusion approach to obtain spectral-spatial joint features. Specifically, different weights are assigned to the spectral features $F_{spe}$ and the spatial features $F_{spa}$, and the two are fused by summing:
$$F = \lambda F_{spe} + (1 - \lambda) F_{spa},$$
where $F$ denotes the fused features and $\lambda \in [0, 1]$ is the weighting parameter. We send $F$ into an FC layer to get the final feature vector. Note that the length of this vector equals the number of classes, and it is treated as a class-specific response.
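A minimal sketch of this weighted fusion, assuming the single parameter lambda weights the spectral branch and 1 − lambda the spatial branch (the natural reading of a lone parameter constrained to [0, 1]):

```python
import numpy as np

def weighted_fuse(f_spe, f_spa, lam):
    """Convex combination of the spectral and spatial vectors, lam in [0, 1]."""
    assert 0.0 <= lam <= 1.0
    return lam * f_spe + (1.0 - lam) * f_spa

# Toy pooled vectors from the two modules.
f_spe = np.array([0.2, 0.8, 0.0])
f_spa = np.array([0.6, 0.0, 0.4])
print(weighted_fuse(f_spe, f_spa, 0.5))  # [0.4 0.4 0.2]
```

At the extremes, lam = 1 keeps only the spectral vector and lam = 0 only the spatial one, so lambda directly trades off the two modules.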

Hyperspectral datasets
In this paper, the effectiveness of the proposed method was validated on three benchmark hyperspectral datasets.The detailed data description is presented below.
(1) The Indian Pines (IP) dataset was collected by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor.

Experimental configuration
To verify the effectiveness of the proposed method, we compare it with several recently proposed HSI classification methods, including DFFN (Song et al., 2018), SSRN (Zhong et al., 2018), DBMA (Ma et al., 2019), pResNet (Paoletti et al., 2019), HybridSN (Roy et al., 2020), A2S2K-ResNet (Roy et al., 2021), FreeNet (Zheng et al., 2020), DPSCN (Dang et al., 2021), SSTN (Z. Zhong et al., 2022), and Mixer (Tolstikhin et al., 2021). For fairness, we adopt unified measurements to illustrate the effectiveness of our framework: overall accuracy (OA), average accuracy (AA) and the Kappa coefficient (Kappa). Details of the experimental configurations are as follows. For the IP dataset, 1% of labelled samples are randomly chosen from each class to form the training set, and the rest are used as the testing set. For the UP and SV datasets, 0.5% and 99.5% of labelled samples are selected for the training set and the testing set, respectively. Stochastic gradient descent (SGD) is utilized to train the parameters of the whole architecture. The batch size is 50 and the learning rate is 0.001. We set the size of the input patch to 9 × 9, and the network is trained for 200 epochs to get the final result. In the spatial module, two kinds of APs are selected to calculate EMAPs: moment of inertia and area. For the moment of inertia, all three datasets use the same values {0.2, 0.3, 0.4, 0.5}. The area attribute has a total of 14 values, which differ between datasets. For the IP dataset, the initial area value is set to 100 with a step size of 400. For the SV dataset, the area values range from 270 to 7300 with a step of 540. For the UP dataset, the area values range from 770 to 10,769 and the step size is 769.
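For reference, each area-threshold sequence is an arithmetic progression; the small helper below reproduces the Indian Pines values stated above (start 100, step 400, 14 values).

```python
def area_thresholds(start, step, count=14):
    """Arithmetic progression of area-attribute thresholds for the EMAP."""
    return [start + step * i for i in range(count)]

ip = area_thresholds(100, 400)   # Indian Pines settings from the text
print(len(ip), ip[0], ip[-1])    # 14 100 5300
```

The SV and UP sequences follow the same pattern with their own start and step values given above.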

Results analysis
For the aforementioned classification approaches, Tables 6-8 show the OAs, AAs, and Kappas on the IP, SV, and UP datasets, respectively. As can be seen, the proposed method performs better than the other methods on all three datasets. In the following, we further analyse the experimental results on each dataset.
As shown in Table 6, the improvements in terms of OA over DFFN, SSRN, DBMA, pResNet, HybridSN, FreeNet, A2S2K-ResNet, DPSCN, SSTN, and Mixer are 21.19%, 8.43%, 8.31%, 22.91%, 20.55%, 9.64%, 8.31%, 4.84%, 4.18%, and 5.69%, respectively. For the IP dataset, the diversity of land cover and the imbalance between sample sizes pose a challenge, especially the quantitative difference between categories 7 and 9 and the other categories. Even so, compared with the previous approaches, the proposed S2HViP obtains the best AA and achieves satisfactory accuracy in the classes with few samples. In comparison with the second-best method, DPSCN, the AA improvement is 4.69%. This is probably due to the strong ability of our method to capture long-range spatial interactions, resulting in a lower error rate when distinguishing between different land-cover classes. The classification map of the IP dataset is displayed in Figure 7. It is noteworthy that our delineated boundaries are sharper and straighter in the boundary regions between different land-cover classes, which also reflects the fact that S2HViP can better extract global distribution features. This helps to generate a more reasonable regional division.
Table 7 shows the classification results on the SV dataset. A characteristic of this dataset is that similar objects are clustered in the same area; therefore, the SV dataset has very strong spatial similarity and rich texture features. Since FreeNet uses the whole HSI as input, spatial features are well preserved, and FreeNet accordingly achieves a good OA of 97.98%. However, the proposed S2HViP obtains a better OA of 98.85%. This can be attributed to the long-range dependencies extracted by S2HViP, which may be important for classification tasks in large-scale regions. Compared with the MLP-based method Mixer, our approach shows improvements of 0.66%, 1.04%, and 0.98% in terms of OA, AA, and Kappa. One reason is that S2HViP further splits the spatial information into the width and height dimensions and encodes the information separately, thus obtaining finer location information. Another reason is that the hierarchical fusion strategy facilitates better interaction of the information in the three dimensions of the feature map. The classification map of the dataset is shown in Figure 8. The map again demonstrates the strong classification ability of S2HViP. Our method has fewer misclassified points in comparison with the other methods, which means our classification map is closer to the ground truth of the SV dataset.
Focusing on the UP dataset, Table 8 shows the classification results. As shown in Table 8, the proposed approach achieves the best OA, AA, and Kappa values of 98.40%, 96.89%, and 97.87%, respectively, while the SSTN network achieves 96.35%, 94.27%, and 95.13%, respectively. The improvements in the OA, AA and Kappa values are 2.05%, 2.62%, and 2.74%, respectively. Compared with Transformer-based methods, S2HViP abandons the self-attention operation. Nevertheless, the results strongly prove the effectiveness of the MLP-based structure. According to Figure 9, compared with the classification maps of the other methods, our method has fewer pixels of other categories appearing within the same category region.
On the one hand, our method can efficiently extract long-range dependencies through global receptive fields, reducing the likelihood that pixels within connected regions are mistaken for other highly similar objects. On the other hand, the introduction of the morphological approaches complements local spatial features, which helps the subsequent information extraction process. As a result, S2HViP obtains competitive results in comparison with other methods.

Investigation of different numbers of training samples
To further verify the effectiveness of S 2 HViP, we investigate the performance of the different approaches with different training sample sizes.
For the IP and SV datasets, 0.1%, 0.3%, 0.5%, 1% and 3% of samples are randomly selected, and for the UP dataset, we randomly choose 0.3%, 0.5%, 1%, 3% and 5% of samples as training sets. The specific results for the different datasets are displayed in Figure 10. Overall, the proposed S2HViP shows better performance. As the sample size increases, the performance of all methods tends to improve, and the differences between them become smaller. Nevertheless, when samples are insufficient, our method still performs better than the other methods. The results in the figure demonstrate the robustness and efficiency of S2HViP, which can be attributed to its excellent long-range dependency extraction capability. Even with limited training samples, our method is still able to capture satisfactory global distribution information, resulting in better classification results.

Investigation of different sizes of patch
To explore the influence of patch size on the experimental results, in addition to the default 9 × 9, we conducted experiments with patch sizes of 3 × 3, 6 × 6 and 12 × 12; the results are shown in Table 9. For the IP and UP datasets, OA first increases and then decreases as the patch size grows. In addition, S2HViP makes good use of the spatial aggregation properties of the SV dataset: its OA continues to improve as the patch size increases, indicating that the SV dataset is more amenable to large patch sizes. Considering the running time of the model, we finally choose 9 × 9 as the consistent patch size for all experiments.

Ablation study
The proposed S2HViP consists of a spatial module and a spectral module. The purpose of the former is to obtain robust spatial features; the latter is dedicated to extracting information from the rich spectral features. To investigate the respective functions of these two parts, we divide the network into two single-module variants and evaluate each separately. As can be gathered from Table 10, without the initial modelling of the image using EMAP, the spectral module outperforms the spatial module overall. After modelling the spatial information using EMAP, the resulting spatial features are more easily extracted by the subsequent modules. The sliding-window mechanism of the SE also helps to compensate for the local spatial information of the feature maps. These factors lead to a significant improvement in the performance of the spatial module. In addition, the global spatial-spectral information obtained by fusing the different features gives the best results when both modules are used. To further validate the effectiveness of the hierarchical fusion strategy, we compare the performance gap between HViP and the direct weighted fusion approach, i.e. ViP. The results are listed in Table 11. On the three datasets, HViP achieves improvements of 1.78% (IP), 0.17% (SV), and 0.45% (UP) in terms of OA compared to ViP. The experimental results demonstrate the effectiveness of the hierarchical fusion strategy, which helps the three dimensions of the feature map fuse better and generates more discriminative global features.

Conclusion
In this article, we present a novel MLP-based deep classification method. Considering that HSI is presented as 3-D data, we encode the three dimensions individually and combine them through a hierarchical fusion strategy. The strategy is demonstrated to be effective in improving the feature extraction ability. The proposed S2HViP contains two feature extraction modules for learning the spectral and spatial information of HSI, respectively. In the spectral module, we first capture intra- and inter-group spectral correlations by grouping. Then, we aggregate the different correlations to refine long-range spectral dependencies. In the spatial module, we introduce morphological approaches to better model spatial features. On this basis, deep spatial information is further captured through MLPs. Finally, the information interaction between the spectral and spatial domains is achieved by weighted fusion, which effectively enhances the classification results. Experiments on three benchmark HSI datasets demonstrate the satisfactory performance of S2HViP and show the potential of MLP-based networks for HSI classification.
In the future, we will consider applying more innovative morphological methods to improve the feature extraction capability of the network. Moreover, the proposed method still has room for improvement in terms of time consumption, so we will also explore a lightweight MLP-based model for classification.
Tolstikhin et al. (2021) demonstrated that the self-attention layer of Transformers is not necessary and presented a concise alternative model, MLP-Mixer. It is made up of two components: the token-mixing MLP and the channel-mixing MLP. The token-mixing MLP projects feature maps along the spatial dimension to obtain spatial features between different locations. The channel-mixing MLP acts independently on each channel of the feature map to capture the communication between different channels. Inspired by MLP-Mixer, Hou et al. (2022) proposed an effective MLP architecture, Vision Permutator (ViP). The main difference from MLP-Mixer is that ViP splits the spatial feature representation into height-coding and width-coding features and performs the linear projections independently. This allows ViP to capture long-range dependencies along one spatial direction while retaining location information along the other. The MLP-based method has also received attention in the hyperspectral field. He and Chen (2021) proposed a pure MLP architecture for the HSI classification task, which demonstrates that MLP networks can provide promising classification performance. Meng et al. (2021) used Mixer as the backbone of a network to extract spatial and spectral information alternately by matrix transposition and MLPs, which allows interaction between the different kinds of information. A model based on an MLP network and residual learning was presented by X. J. Tang et al. (2022), in which the MLP removes the constraints of translation invariance and local connectivity.
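The permutation trick underlying Mixer and ViP can be sketched in a few lines of numpy: mixing along an axis is simply "move the axis last, project across it, move it back". The one-step sum of the three branches below mirrors ViP's direct aggregation, which S2HViP replaces with hierarchical fusion; all names, shapes and random weights here are illustrative assumptions.

```python
import numpy as np

def mix_along(x, axis, w):
    """Move `axis` to the end, apply a linear projection across it, move it back.
    Mixer's token mixing projects across flattened tokens in this way; ViP
    instead projects across height and width with separate branches."""
    moved = np.moveaxis(x, axis, -1)
    return np.moveaxis(moved @ w, -1, axis)

rng = np.random.default_rng(4)
x = rng.standard_normal((9, 9, 16))       # (height, width, channels), toy shape

w_h = rng.standard_normal((9, 9)) * 0.1   # height-mixing weights
w_w = rng.standard_normal((9, 9)) * 0.1   # width-mixing weights
w_c = rng.standard_normal((16, 16)) * 0.1 # channel-mixing weights

# ViP-style one-step aggregation: sum the three branch outputs directly.
out = mix_along(x, 0, w_h) + mix_along(x, 1, w_w) + mix_along(x, 2, w_c)
print(out.shape)  # (9, 9, 16)
```

With an identity weight matrix, `mix_along` returns its input unchanged, which confirms that the permutations are mutually inverse.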

Figure 1. The overall architecture of the proposed S2HViP.

Figure 3. Basic structure of the hierarchical Permute-MLP layers.

The three datasets are described as follows. (1) The Indian Pines (IP) dataset was collected by the AVIRIS sensor over the Indian Pines test site in Northwestern Indiana, USA. It contains 145 × 145 pixels with a spatial resolution of 20 m/pixel and 224 spectral bands in the wavelength range from 400 to 2500 nm. We utilize 200 bands after removing four bands containing zero values and 20 noisy bands. The ground truth includes 16 land-cover classes and 10,249 labelled pixels. Figure 4 shows the false-colour composite image and corresponding ground reference of IP, and Table 3 outlines the number of labelled samples and classes of IP. (2) The Salinas Valley (SV) dataset was recorded by the AVIRIS sensor over Salinas Valley, California, USA. The available dataset is composed of 204 bands with 512 × 217 pixels and a spatial resolution of 3.7 m, after the low signal-to-noise ratio (SNR) bands were removed. The ground truth includes 16 land-cover classes and 54,129 labelled pixels. Figure 5 shows the false-colour composite image and corresponding ground reference of SV, and the details of labelled samples and classes are presented in Table 4. (3) The University of Pavia (UP) dataset was acquired by the ROSIS sensor during a flight campaign over Pavia, Northern Italy. It consists of 610 × 340 pixels with a spatial resolution of 1.3 m and has 115 spectral channels in the wavelength range from 0.43 to 0.86 µm. We utilize 103 bands after removing 12 noisy bands. The ground truth includes 9 land-cover classes and 42,776 labelled pixels. Figure 6 shows the false-colour composite image and corresponding ground reference of UP. The detailed numbers of labelled samples and classes of UP are listed in Table 5.

Figure 4. False-colour composite image and ground-truth map of Indian Pines.

Figure 5. False-colour composite image and ground-truth map of Salinas Valley.

Figure 6. False-colour composite image and ground-truth map of University of Pavia.

Figure 10. The comparison results using different training sample ratios on (a) Indian Pines, (b) Salinas Valley, (c) University of Pavia.
He et al. (2020) constructed a vision backbone based exclusively on the Transformer, which demonstrates excellent global feature extraction capabilities. Given this powerful performance, it is no surprise that the Transformer has been migrated to HSI classification tasks. J. He et al. (2020) proved that language models can be applied to HSI classification and proposed HSI-BERT, a model consisting of numerous attention layers. Inspired by HSI-BERT, Z. Zhong et al. (2022) designed a spectral-spatial Transformer network (SSTN) that replaced convolution operations with attention modules.
Sun et al. (2022) designed a network with several spatial attention modules and spectral correlation modules. The spatial attention module is responsible for capturing the interactions between pixels at all locations, while the spectral correlation module is concerned with the correlation of a compact set of spectral vectors to all locations. Hong et al. (2022) applied a densely sampled method to group the spectral dimensions, and then a cross-layer Transformer encoder module was employed to learn advanced features from groupwise adjacent bands. Liu, Yu et al. (2022) successfully replaced the traditional convolutional layer with the Transformer, designing a spectral-spatial HSI classification model named DSS-TRM. L. Sun et al. (2022) developed a spectral-spatial feature tokenization Transformer method, which organically combines CNN and Transformer to capture spectral-spatial information and high-level semantic information.

Table 1. Configuration of the spectral module for the Indian Pines dataset.
Table 1 traces the feature shapes from the (9, 9, 180) input to the (1, 1, 180) output. We denote the height-, width-, and channel-encoding feature maps by $X^{spe}_H$, $X^{spe}_W$, and $X^{spe}_C$; the number of spectra in each group is numerically equal to $H$. Next, the first and third dimensions of $X$ are permuted so that the spectral tokens can be projected by a $C \times C$ weight. After feature extraction, the original input dimensions are recovered by performing the inverse of the dimension operations described earlier, yielding $\hat{X}_{H_i} \in \mathbb{R}^{N \times W \times H}$. All $\hat{X}_{H_i}$ are then spliced together along the third dimension to output $\hat{X}^{spe}_H \in \mathbb{R}^{N \times W \times C}$. A fully connected (FC) layer with weight $W^{spe}_H$ is responsible for encoding $\hat{X}^{spe}_H$.
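The grouping-and-encoding step can be sketched in numpy. This is a schematic illustration only: the group count `G`, the shared intra-group weight `w_h`, and the contiguous-band grouping are assumptions chosen to match the (9, 9, 180) shapes in Table 1, not the paper's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(2)
H, W, C = 9, 9, 180   # patch height, patch width, spectral bands (Table 1 input)
P = H * W             # pixels per patch
G = C // H            # spectral groups, each holding H contiguous bands

x = rng.standard_normal((H, W, C))

# Group the bands: (G, P, H) -- every pixel within a group is a spectral token
tokens = x.reshape(P, G, H).transpose(1, 0, 2)

# Intra-group encoding: an FC layer shared across groups mixes the H bands
w_h = rng.standard_normal((H, H))
encoded = tokens @ w_h                          # (G, P, H)

# Invert the dimension operations and splice the groups back along the
# spectral dimension to recover the original input shape
out = encoded.transpose(1, 0, 2).reshape(H, W, C)
print(out.shape)  # (9, 9, 180)
```

Because the projection acts only within each group of bands, intra-group correlations are captured first; inter-group fusion is then handled by subsequent layers.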

Table 2. Configuration of the spatial module for the Indian Pines dataset.

Table 3. Land cover classes and the numbers of samples in the Indian Pines dataset.

Table 4. Land cover classes and the numbers of samples in the Salinas Valley dataset.

Table 5. Land cover classes and the numbers of samples in the University of Pavia dataset.

Table 6. Classification results for the Indian Pines dataset using 1% training samples.

Table 7. Classification results for the Salinas Valley dataset using 0.5% training samples.

Table 8. Classification results for the University of Pavia dataset using 0.5% training samples.

Table 9. Overall accuracy (OA) on three datasets with different patch sizes.

Table 10. The ablation analysis of different modules (OA%) for three datasets.

Table 11. The ablation analysis of hierarchical integration strategy (OA%) for three datasets.