Spectral-Spatial Mamba for Hyperspectral Image Classification

Recently, deep learning models have achieved excellent performance in hyperspectral image (HSI) classification. Among these deep models, the Transformer has gradually attracted interest for its excellence in modeling the long-range dependencies of spatial-spectral features in HSI. However, the Transformer suffers from quadratic computational complexity due to its self-attention mechanism, which makes it heavier than other models and has limited its adoption in HSI processing. Fortunately, the recently emerging state space model-based Mamba shows great computational efficiency while matching the modeling power of Transformers. Therefore, in this paper, we make a preliminary attempt to apply Mamba to HSI classification, leading to the proposed spectral-spatial Mamba (SS-Mamba). Specifically, the proposed SS-Mamba mainly consists of a spectral-spatial token generation module and several stacked spectral-spatial Mamba blocks. First, the token generation module converts any given HSI cube into sequences of spatial and spectral tokens. These tokens are then sent to stacked spectral-spatial Mamba blocks (SS-MB). Each SS-MB consists of two basic Mamba blocks and a spectral-spatial feature enhancement module. The spatial and spectral tokens are processed separately by the two basic Mamba blocks. In addition, the feature enhancement module modulates the spatial and spectral tokens using the HSI sample's center-region information. In this way, the spectral and spatial tokens cooperate with each other and achieve information fusion within each block. Experimental results on widely used HSI datasets reveal that the proposed model achieves competitive results compared with state-of-the-art methods. The Mamba-based method opens a new window for HSI classification.


Introduction
Hyperspectral images (HSIs) can capture hundreds of narrow spectral bands across the electromagnetic spectrum, containing both rich spatial and spectral information [1]. Compared with traditional RGB images and multispectral images, HSIs can provide richer information about land covers, and thus have been widely exploited in many applications, such as environmental monitoring [2], precision agriculture [3], mineral exploration [4], and military reconnaissance [5]. Aiming at pixel-level category labeling, classification is a fundamental task for HSI processing and applications. It has become a hot topic in remote sensing research, drawing significant academic and practical interest [6], [7].
HSI classification mainly consists of feature extraction and classifier-based classification. In the early days, researchers mainly focused on spectral features. They usually mapped the original spectral features to a new space through linear or nonlinear transformations such as principal component analysis (PCA) [8], linear discriminant analysis (LDA) [9], and manifold learning methods [10], [11]. These methods only used spectral features without considering spatial information, which limited classification performance. Therefore, spectral-spatial feature extraction techniques have attracted much attention, including extended morphological profiles (EMP) [12], extended multi-attribute profiles (EMAP) [13], Gabor filtering [14], sparse representations [15], etc. Commonly used classifiers included support vector machines [16], logistic regression [17], etc. Different combinations of feature extraction techniques and classifiers form a variety of methods.
However, these methods relied on manually crafted features, requiring careful design tailored to specific scenarios. Additionally, their ability to extract high-level image features is limited [18].
In recent years, the rapid development of deep learning has greatly propelled the advancement of HSI classification research. Deep learning networks can automatically learn high-level and discriminative features from the data and have been widely applied in the field of HSI classification [19], [20]. Typical deep learning models include the stacked autoencoder (SAE) [21], deep belief network (DBN) [22], convolutional neural network (CNN) [23], and recurrent neural network (RNN) [24]. Among these models, the CNN has achieved many successes and attracted considerable interest [25]. Various CNN architectures have been proposed for extracting spectral and spatial features, including one-dimensional (1D) CNN [26], 2D CNN [27], 1D-2D CNN [28], and 3D CNN [29]. For example, Zhong et al. proposed an end-to-end spectral-spatial 3D CNN, deepening the network using residual connections [30]. Zhang et al. proposed a 3D densely connected CNN for classification, utilizing dense connections to ensure effective information transmission between layers [31]. Gong et al. introduced a novel CNN, which could extract deep multi-scale features from HSI and enhance classification performance through a diversified metric [32]. In [33], a double-branch dual-attention mechanism was combined with a CNN for enhanced feature extraction. In [34], a two-stream residual separable convolution network was introduced, specifically designed to address the redundant information, data scarcity, and class imbalance typically encountered in HSIs. Researchers have also extensively investigated the integration of CNNs with various machine learning techniques, such as transfer learning, ensemble learning, and few-shot learning, to enhance the performance of CNN models under different conditions. Yang et al.
proposed an effective transfer learning approach to deal with images from different sensors and with different numbers of bands [35]. The proposed method could use multi-sensor HSIs to jointly train a robust and general CNN model. In [36], a pixel-pair feature generation mechanism was specifically designed to augment the training dataset, while ensemble learning strategies were employed to mitigate the overfitting problem and bolster the robustness of CNN-based classifiers. Yu et al. [37] utilized the scheme of prototype learning to address few-shot HSI classification, emphasizing efficient use of training samples.
Compared with CNNs, which mainly focus on modeling locality, Transformers have demonstrated proficiency in capturing long-range dependencies, enabling a comprehensive understanding of the relationships among spatial and spectral features in HSI [38]. Consequently, Transformer-based HSI classification methods have emerged as a promising approach [39], [40]. For instance, He et al. proposed HSI-BERT, in which each pixel within a given HSI cube sample is treated as a token for the Transformer to capture global context [41]. It is viewed as the initial application of a Transformer-based model for HSI classification with competitive accuracies. Recognizing the significance of long-range dependencies in spectral bands, Hong et al. introduced SpectralFormer, utilizing a pure Transformer to process spectral features [42]. Tang et al. devised a double-attention Transformer encoder to separately capture spatial and spectral features of HSIs, which were subsequently fused to enhance discriminative capability [43]. In [44], a cross spatial-spectral dense Transformer was proposed to extract spatial and spectral features in a dense learning manner, with a cross-attention mechanism used for feature fusion. In addition to utilizing a pure Transformer as described above, researchers have explored fusing Transformers with CNNs to leverage the strengths of both models effectively [45]. For example, Sun et al. [46] proposed a network comprising 2D and 3D convolution layers to preprocess input HSI samples, followed by a Gaussian-weighted feature tokenizer to generate input tokens for Transformer blocks. To extract multiscale spectral-spatial features in HSIs, Wu et al.
introduced an enhanced Transformer utilizing multiscale CNNs to generate spectral-spatial tokens with hash-based positional embeddings [47]. Experimental results demonstrated that leveraging multiscale CNN features is advantageous for improving classification performance. In [48], a lightweight CNN and Transformers in a dual-branch manner were integrated into a basic spectral-spatial block to extract hierarchical features. The proposed model outperformed comparison methods by a significant margin.
However, the traditional Transformer architecture introduces its own set of challenges, primarily its quadratic computational complexity driven by the self-attention mechanism [49]. This complexity becomes prohibitive when dealing with the high-dimensional data typical of HSI, which contains both spatial and spectral information. Therefore, fully modeling the long-range dependencies of spatial-spectral features makes a Transformer model computationally heavy and, hence, impractical even when compared with other deep learning models. Recently, however, structured state space models (SSMs) have been developed to address Transformers' computational inefficiency on long sequences [50]. As an advanced SSM, Mamba has emerged as a promising model that excels in both computational efficiency and feature extraction capability [51]. Like Transformers, Mamba models can capture long-range dependencies, but with greater computational efficiency, making them well-suited for high-dimensional data such as HSI.
Therefore, in this paper, we explore the application of the Mamba model to HSI classification and design a spatial-spectral learning framework based on the Mamba model, named spectral-spatial Mamba (SS-Mamba). The SS-Mamba model mainly consists of a spectral-spatial token generation module and several stacked spectral-spatial Mamba blocks to extract deep and discriminant features of HSI. The token generation module transforms HSI cubes into sequences of spatial and spectral tokens. These tokens are then processed through several stacked spectral-spatial Mamba blocks. Each spectral-spatial Mamba block employs a double-branch structure that includes spatial Mamba feature extraction, spectral Mamba feature extraction, and a spectral-spatial feature enhancement module. The obtained spatial and spectral tokens are processed by the two basic Mamba blocks, respectively. After that, the spectral and spatial tokens cooperate with each other in the designed spectral-spatial feature enhancement module. The dual-branch architecture and spectral-spatial feature enhancement effectively maximize spectral-spatial information fusion and thus improve the performance of HSI classification.
The main contributions are summarized as follows: 1) A spectral-spatial Mamba-based learning framework is proposed for HSI classification, which can effectively utilize Mamba's computational efficiency and powerful long-range feature extraction capability. 2) We design a spectral-spatial token generation mechanism to convert any given HSI cube into spatial and spectral token sequences for input. It improves and combines the spectral and spatial patch partitions to fully exploit the spectral-spatial information contained in HSI samples. 3) A feature enhancement module is designed to enhance the spectral-spatial features and achieve information fusion. By modulating the spatial and spectral tokens using the HSI sample's center-region information, the model can focus on the informative region and conduct spectral-spatial information interaction and fusion within each block.
The rest of the paper is organized as follows. Section 2 presents the proposed SS-Mamba for HSI classification. In Section 3, experiments are conducted on four widely used HSI datasets, and the results are presented and discussed. Section 4 gives a discussion of the proposed method. In Section 5, the conclusion is briefly summarized.

Methodology
The flowchart of the proposed spectral-spatial Mamba model for HSI classification is depicted in Figure 1. The proposed SS-Mamba mainly consists of a spectral-spatial token generation module and several stacked spectral-spatial Mamba blocks to extract deep and discriminant features of HSI. Each spectral-spatial Mamba block employs a double-branch structure that includes spatial Mamba feature extraction, spectral Mamba feature extraction, and a spectral-spatial feature enhancement module. To start with, an HSI cube is generated as the model's input for any given pixel using its spatial neighborhood region, following most HSI classification methods. The HSI input samples are then processed to generate spectral and spatial tokens, and these tokens are sent to several spectral-spatial Mamba blocks. Each block is composed of two basic Mamba blocks and a spectral-spatial feature enhancement module. The obtained spatial and spectral tokens are processed by the two basic Mamba blocks, respectively. After that, the spectral and spatial tokens cooperate with each other in the designed spectral-spatial feature enhancement module. After feature extraction by these spectral-spatial Mamba blocks, the spectral and spatial tokens are averaged and then added to form the final spectral-spatial feature for the given HSI sample. Finally, the obtained feature is sent to a fully connected layer to accomplish classification. The spectral-spatial information interaction and fusion thus occurs in the token generation process (early stage), the spectral-spatial Mamba blocks (middle stage), and the mean-token addition (last stage).

Overview of the State Space Models
The state space model serves as a framework for modeling the relationship between input and output sequences. Specifically, it maps a one-dimensional input $x(t) \in \mathbb{R}$ to an output $y(t) \in \mathbb{R}$ through a hidden state $h(t) \in \mathbb{R}^{N}$. This procedure can be formulated through the following ordinary differential equation:
$$h'(t) = \mathbf{A}h(t) + \mathbf{B}x(t), \quad y(t) = \mathbf{C}h(t),$$
where $\mathbf{A} \in \mathbb{R}^{N \times N}$ represents the state matrix, while $\mathbf{B} \in \mathbb{R}^{N \times 1}$ and $\mathbf{C} \in \mathbb{R}^{1 \times N}$ are the system parameters. To adapt this continuous system for deep learning applications on discrete sequences such as images and text, the structured state space model (S4) further employs discretization. Specifically, a timescale parameter $\Delta$ is introduced, and a consistent discretization rule is applied to convert $\mathbf{A}$ and $\mathbf{B}$ into discrete parameters $\bar{\mathbf{A}}$ and $\bar{\mathbf{B}}$, respectively. The discretization employed in the S4 model uses the zero-order hold (ZOH) rule, defined as:
$$\bar{\mathbf{A}} = \exp(\Delta \mathbf{A}), \quad \bar{\mathbf{B}} = (\Delta \mathbf{A})^{-1}(\exp(\Delta \mathbf{A}) - \mathbf{I}) \cdot \Delta \mathbf{B}.$$
The discretized SSM can then be calculated as a recurrence:
$$h_t = \bar{\mathbf{A}} h_{t-1} + \bar{\mathbf{B}} x_t, \quad y_t = \mathbf{C} h_t.$$
For fast and efficient parallel training, the above recurrence can also be reformulated as a convolution:
$$y = x * \bar{\mathbf{K}}, \quad \bar{\mathbf{K}} = \left(\mathbf{C}\bar{\mathbf{B}},\; \mathbf{C}\bar{\mathbf{A}}\bar{\mathbf{B}},\; \ldots,\; \mathbf{C}\bar{\mathbf{A}}^{L-1}\bar{\mathbf{B}}\right),$$
where $\bar{\mathbf{K}}$ denotes the structured convolutional kernel and $L$ denotes the input sequence's length.
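The equivalence between the recurrent and convolutional forms can be checked numerically. The following minimal NumPy sketch uses a real diagonal $\mathbf{A}$ for simplicity (S4 actually uses a structured, complex-parameterized state matrix); all variable names are illustrative, not from the paper:

```python
import numpy as np

# Toy ZOH discretization of a diagonal SSM.
N, L = 4, 16                          # state size, sequence length
rng = np.random.default_rng(0)
a = -np.abs(rng.standard_normal(N))   # diagonal of A (negative -> stable)
b = rng.standard_normal(N)            # B, shape (N,)
c = rng.standard_normal(N)            # C, shape (N,)
dt = 0.1                              # timescale parameter Delta

# Zero-order hold: A_bar = exp(dt*A); for diagonal A,
# B_bar = (exp(dt*a) - 1) / a * b elementwise
a_bar = np.exp(dt * a)
b_bar = (a_bar - 1.0) / a * b

x = rng.standard_normal(L)            # one-dimensional input sequence

# 1) Recurrent form: h_t = A_bar h_{t-1} + B_bar x_t, y_t = C h_t
h = np.zeros(N)
y_rec = np.empty(L)
for t in range(L):
    h = a_bar * h + b_bar * x[t]
    y_rec[t] = c @ h

# 2) Convolutional form: y = x * K_bar with K_bar[k] = C A_bar^k B_bar
K = np.array([c @ (a_bar ** k * b_bar) for k in range(L)])
y_conv = np.array([sum(K[k] * x[t - k] for k in range(t + 1)) for t in range(L)])

print(np.allclose(y_rec, y_conv))     # the two forms agree: True
```

The convolutional form is what makes parallel training tractable, while the recurrence gives constant-memory inference.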
In Mamba, a selective scan mechanism is further used. Specifically, the matrices $\mathbf{B}$, $\mathbf{C}$, and the timescale $\Delta$ are generated from the input data $x$, allowing the model to dynamically model the contextual relationships within the input sequence. Inspired by the Transformer and Hungry Hungry Hippo (H3) architectures [52], normalization, residual connectivity, a gated MLP, and the SSM are combined to form a basic Mamba block, which constitutes the fundamental component of a Mamba network. The structure of a basic Mamba block is shown in Figure 1.
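A shape-level sketch may clarify how these pieces fit together. The NumPy code below is a simplified, hypothetical rendition of a basic Mamba block (the real block also includes a depthwise 1D convolution and a hardware-aware parallel scan; all weight names here are assumptions for illustration only):

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def mamba_block(tokens, params, n_state=8):
    """Sketch of a basic Mamba block. tokens: (L, D); returns (L, D)."""
    L, D = tokens.shape
    # Normalization (LayerNorm-style, per token)
    x = (tokens - tokens.mean(-1, keepdims=True)) / (tokens.std(-1, keepdims=True) + 1e-6)
    u = silu(x @ params["W_in"])      # main branch, (L, D)
    g = silu(x @ params["W_gate"])    # gating branch (gated MLP), (L, D)
    # Selective scan: Delta, B, C are generated from the input itself
    dt = np.log1p(np.exp(u @ params["W_dt"]))   # softplus, per-token timescale (L, D)
    B = u @ params["W_B"]                       # (L, n_state)
    C = u @ params["W_C"]                       # (L, n_state)
    A = -np.exp(params["A_log"])                # stable diagonal A, (D, n_state)
    h = np.zeros((D, n_state))
    y = np.empty((L, D))
    for t in range(L):                          # recurrent selective scan
        a_bar = np.exp(dt[t][:, None] * A)
        h = a_bar * h + dt[t][:, None] * B[t][None, :] * u[t][:, None]
        y[t] = (h * C[t][None, :]).sum(-1)
    return tokens + (y * g) @ params["W_out"]   # gate, project, residual

rng = np.random.default_rng(0)
D, n_state = 16, 8
params = {k: rng.standard_normal(s) * 0.1 for k, s in {
    "W_in": (D, D), "W_gate": (D, D), "W_dt": (D, D),
    "W_B": (D, n_state), "W_C": (D, n_state),
    "A_log": (D, n_state), "W_out": (D, D)}.items()}
out = mamba_block(rng.standard_normal((10, D)), params)
print(out.shape)   # (10, 16)
```

The input-dependent $\Delta$, $\mathbf{B}$, $\mathbf{C}$ are what distinguish Mamba's selective scan from the static S4 kernel.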

Spectral-Spatial Tokens Generation
To utilize Mamba's powerful sequence modeling ability, the HSI input needs to be transformed into sequences as Mamba's input. Given an HSI training sample $\mathbf{X} \in \mathbb{R}^{H \times W \times B}$, where $H$ and $W$ are the height and width of the input, respectively, and $B$ is the number of spectral bands, the proposed method converts it into a spatial token sequence and a spectral token sequence, respectively. Figure 2 illustrates the spectral and spatial token generation process.
For spatial token generation, there are mainly three steps: spectral mapping, spatial partition, and patch embedding.
We first process the spectral features before performing spatial partition for the spatial Mamba. The purpose is to fully exploit the spectral-spatial information contained in HSI samples, given that the spatial-partition-based Mamba feature extraction mainly focuses on the relationships among spatial tokens. Specifically, the input HSI sample is first reshaped to a tensor of shape $HW \times B$, and then a lightweight multilayer perceptron (MLP) is used for spectral feature mapping:
$$\mathbf{X}' = \mathrm{MLP}(\mathbf{X}) \in \mathbb{R}^{HW \times B'},$$
where $\mathrm{MLP}(\cdot)$ denotes the corresponding mapping function and $B'$ is the mapped feature dimension. The spectrally mapped HSI sample is then spatially partitioned into $N = HW/p_s^2$ non-overlapping patches $\bar{\mathbf{x}}_i \in \mathbb{R}^{p_s^2 B'}$, where $p_s$ is the spatial patch size. After that, a linear layer is used to project these patches to a given dimension, obtaining the spatial input tokens:
$$\mathbf{t}_i^{spa} = \bar{\mathbf{x}}_i \mathbf{W}_{spa},$$
where $\mathbf{W}_{spa} \in \mathbb{R}^{p_s^2 B' \times D}$ denotes the learnable matrix in the linear layer and $D$ represents the token dimension.
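The three spatial-token steps can be sketched as follows, using the paper's window and partition sizes ($H = W = 27$, $p_s = 3$, $D = 64$); the mapped dimension $B' = 8$ and all weights are illustrative assumptions:

```python
import numpy as np

# Sketch of spatial token generation: spectral MLP mapping, non-overlapping
# spatial partition, then linear patch embedding (hypothetical weights).
H = W = 27; B = 103; B2 = 8; ps = 3; D = 64
rng = np.random.default_rng(0)
cube = rng.standard_normal((H, W, B))          # one HSI input sample

# 1) Spectral mapping: per-pixel MLP from B bands to B' features (ReLU MLP)
W_mlp = rng.standard_normal((B, B2)) * 0.05
mapped = np.maximum(cube.reshape(-1, B) @ W_mlp, 0).reshape(H, W, B2)

# 2) Spatial partition into N = HW / ps^2 non-overlapping ps x ps patches
n_side = H // ps
patches = (mapped.reshape(n_side, ps, n_side, ps, B2)
                 .transpose(0, 2, 1, 3, 4)
                 .reshape(n_side * n_side, ps * ps * B2))

# 3) Patch embedding: linear projection to the token dimension D
W_embed = rng.standard_normal((ps * ps * B2, D)) * 0.05
spa_tokens = patches @ W_embed
print(spa_tokens.shape)    # (81, 64): N = 27*27 / 3^2 = 81 spatial tokens
```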
Similar to spatial token generation, the spectral token generation process also contains three steps: spatial mapping, spectral partition, and patch embedding.
To start with, we extract the center region as the input $\hat{\mathbf{X}} \in \mathbb{R}^{c \times c \times B}$, where $c$ is a small integer, e.g., 3. A small center region makes the spectral feature more robust compared with using only the center pixel. As in spatial token generation, the spatial features are processed before performing the spectral partition. The obtained input $\hat{\mathbf{X}}$ is reshaped to a tensor of shape $B \times c^2$, and then another lightweight MLP is used for spatial feature mapping:
$$\mathbf{X}'' = \mathrm{MLP}(\hat{\mathbf{X}}) \in \mathbb{R}^{B \times S'}.$$
The spatially mapped HSI sample $\mathbf{X}''$ is then spectrally partitioned into $M = B/p_b$ non-overlapping patches $\bar{\mathbf{x}}_j \in \mathbb{R}^{p_b S'}$, where $p_b$ is the spectral patch size. After that, the spectral input tokens can be obtained by patch embedding:
$$\mathbf{t}_j^{spe} = \bar{\mathbf{x}}_j \mathbf{W}_{spe},$$
where $\mathbf{W}_{spe} \in \mathbb{R}^{p_b S' \times D}$ denotes the learnable matrix in the linear layer. Following the vision Transformer, the obtained tokens are added with positional embeddings to provide location information within the HSI sample. It is worth noting that the spatial tokens are provided with 2D sinusoidal position embeddings, while the spectral tokens use 1D positional embeddings.
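The spectral-token steps can be sketched analogously. Here $B = 100$ is chosen so that it divides evenly by $p_b = 2$; the mapped dimension $S' = 8$ and all weights are illustrative assumptions:

```python
import numpy as np

# Sketch of spectral token generation: center-region extraction, spatial MLP
# mapping, spectral partition, linear patch embedding (hypothetical weights).
H = W = 27; B = 100; c = 3; S2 = 8; pb = 2; D = 64
rng = np.random.default_rng(0)
cube = rng.standard_normal((H, W, B))

# 1) Extract the c x c center region (more robust than a single center pixel)
s = (H - c) // 2
center = cube[s:s + c, s:s + c, :]                 # (c, c, B)

# 2) Spatial mapping: per-band MLP from c^2 spatial values to S' features
W_mlp = rng.standard_normal((c * c, S2)) * 0.05
mapped = np.maximum(center.reshape(c * c, B).T @ W_mlp, 0)   # (B, S')

# 3) Spectral partition into M = B / pb patches, then linear embedding to D
patches = mapped.reshape(B // pb, pb * S2)         # (M, pb*S')
W_embed = rng.standard_normal((pb * S2, D)) * 0.05
spe_tokens = patches @ W_embed
print(spe_tokens.shape)    # (50, 64): M = 100 / 2 = 50 spectral tokens
```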

Spectral-Spatial Mamba Block
The spectral-spatial feature extraction is mainly achieved by several spectral-spatial Mamba blocks, each of which consists of two distinct basic Mamba blocks and a feature enhancement module. The two basic Mamba blocks are mainly used for spatial and spectral feature extraction, respectively. In the feature enhancement module, the spatial or spectral tokens are modulated using the HSI sample's center-region information from the tokens of the other type. In this way, the spectral and spatial tokens cooperate with each other and achieve information fusion within each block.
Specifically, in the $l$-th block, the spectral and spatial tokens are first processed by the basic Mamba blocks:
$$\mathbf{T}_{spa}^{l} = \mathrm{Mamba}_{spa}(\mathbf{T}_{spa}^{l-1}), \quad \mathbf{T}_{spe}^{l} = \mathrm{Mamba}_{spe}(\mathbf{T}_{spe}^{l-1}).$$
After the Mamba feature extraction, the obtained spectral and spatial tokens are sent to the feature enhancement module for information interaction, as Figure 3 shows. For the spatial tokens, we first take out the token corresponding to the central patch, denoted as $\mathbf{f}_1^{l}$. For the spectral tokens, the $M$ tokens are averaged to obtain the center region's spectral feature from the view of the spectral branch, denoted as $\mathbf{f}_2^{l}$. The obtained features are then fused by averaging:
$$\mathbf{f}^{l} = \tfrac{1}{2}(\mathbf{f}_1^{l} + \mathbf{f}_2^{l}).$$
An MLP is used for further feature extraction, and a sigmoid activation is used for scaling:
$$\mathbf{w}^{l} = \sigma(\mathrm{MLP}(\mathbf{f}^{l})),$$
where $\sigma$ represents the sigmoid activation function. The obtained feature $\mathbf{w}^{l}$ can be viewed as the modulation weights to be multiplied with the spectral and spatial tokens. Before multiplication, the dimensions must be matched: we make $N$ and $M$ copies of $\mathbf{w}^{l}$ for the spatial and spectral tokens, obtaining the modulation matrices $\mathbf{W}_{spa}^{l}$ and $\mathbf{W}_{spe}^{l}$, respectively. These are then multiplied with the spatial and spectral tokens, forcing the model to focus more on the most informative central region, since the label of an input HSI sample is mainly determined by its center pixel's label and the spectral information is mainly needed here for enhancement. The procedure can be formulated as:
$$\mathbf{T}_{spa}^{l} \leftarrow \mathbf{W}_{spa}^{l} \odot \mathbf{T}_{spa}^{l}, \quad \mathbf{T}_{spe}^{l} \leftarrow \mathbf{W}_{spe}^{l} \odot \mathbf{T}_{spe}^{l},$$
where $\odot$ denotes element-wise multiplication.
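The enhancement step can be sketched as follows, with hypothetical weights and a single linear layer standing in for the module's MLP; broadcasting plays the role of the $N$ and $M$ copies described above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Sketch of the spectral-spatial feature enhancement module: the center-patch
# spatial token and the mean spectral token are fused, passed through an
# MLP + sigmoid, and used to rescale both token sequences.
def enhance(spa_tokens, spe_tokens, W_mlp, center_idx):
    t1 = spa_tokens[center_idx]          # spatial token of the central patch
    t2 = spe_tokens.mean(axis=0)         # mean spectral token (center region)
    f = 0.5 * (t1 + t2)                  # average fusion
    w = sigmoid(f @ W_mlp)               # modulation weights in (0, 1), shape (D,)
    # Broadcasting over rows = the N / M copies of w in the text
    return spa_tokens * w, spe_tokens * w

rng = np.random.default_rng(0)
N, M, D = 81, 50, 64
spa = rng.standard_normal((N, D))
spe = rng.standard_normal((M, D))
W_mlp = rng.standard_normal((D, D)) * 0.1
spa_out, spe_out = enhance(spa, spe, W_mlp, center_idx=N // 2)
print(spa_out.shape, spe_out.shape)      # (81, 64) (50, 64)
```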
After feature extraction by $L$ spectral-spatial Mamba blocks, the obtained spectral and spatial tokens are averaged and then added to form the final spectral-spatial feature of the given HSI sample:
$$\mathbf{z} = \mathrm{mean}(\mathbf{T}_{spa}^{L}) + \mathrm{mean}(\mathbf{T}_{spe}^{L}).$$
The representation $\mathbf{z}$ of the input HSI sample is sent into a fully connected layer to obtain the final logit prediction:
$$\hat{\mathbf{y}} = \mathrm{FC}(\mathbf{z}) \in \mathbb{R}^{K},$$
where $\mathrm{FC}(\cdot)$ denotes the mapping function of a fully connected layer, $\hat{\mathbf{y}}$ is the predicted label vector for the input HSI sample, and $K$ is the number of classes. The cross-entropy loss is used for optimizing the designed model:
$$\mathcal{L} = -\sum_{k=1}^{K} y_k \log \hat{y}_k,$$
where $\hat{y}_k$ is the $k$-th element of $\hat{\mathbf{y}}$ and $y_k$ is the $k$-th element of the one-hot label vector.
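The classification head can be sketched end-to-end as follows (hypothetical weights; a softmax is applied before the logarithm in the cross-entropy, as is standard):

```python
import numpy as np

# Sketch of the classification head: mean-pool each token sequence, add,
# apply a fully connected layer, and compute the cross-entropy loss.
rng = np.random.default_rng(0)
N, M, D, K = 81, 50, 64, 9                  # token counts, dim, classes
spa = rng.standard_normal((N, D))           # final spatial tokens
spe = rng.standard_normal((M, D))           # final spectral tokens

z = spa.mean(axis=0) + spe.mean(axis=0)     # final spectral-spatial feature
W_fc = rng.standard_normal((D, K)) * 0.1
logits = z @ W_fc                           # FC layer

p = np.exp(logits - logits.max())
p /= p.sum()                                # softmax prediction y_hat
y = np.zeros(K); y[3] = 1.0                 # one-hot ground-truth label
loss = -np.sum(y * np.log(p + 1e-12))       # cross-entropy
print(p.shape, loss > 0)
```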

Datasets Description
The performance of the proposed mamba-based method is assessed using four widely-used HSI datasets: Indian Pines, Pavia University, Houston, and Chikusei.
1) Indian Pines: The Indian Pines dataset primarily records agricultural areas in Northwestern Indiana, USA, captured in June 1992 by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS). The dataset comprises 145×145 pixels with a spatial resolution of 20 meters. It includes 220 spectral bands, covering wavelengths from 400 nm to 2500 nm. In practice, 200 bands are retained after removing 20 bands with a low signal-to-noise ratio. The dataset includes 10,249 labeled samples belonging to 16 distinct land-cover categories. The dataset's false color and ground-truth maps are illustrated in Figure 4, and the distribution of training and test samples is detailed in Table 1.

2) Pavia University: The Pavia University dataset, captured by the Reflective Optics System Imaging Spectrometer (ROSIS) over the University of Pavia, Italy, contains 610×340 pixels at a fine spatial resolution of 1.3 m. This dataset records 103 spectral bands and labels 42,776 pixels from nine land-cover classes. Figure 5 shows the false color and ground-truth maps, while Table 2 shows the distribution of training and test samples of each class.
3) Houston: The Houston dataset focuses on an urban area around the University of Houston campus, USA. It was captured by the National Center for Airborne Laser Mapping (NCALM) for the 2013 IEEE GRSS Data Fusion Contest [53]. The dataset contains 349×1905 pixels with a spatial resolution of 2.5 m per pixel and includes 144 spectral bands spanning 380 nm to 1050 nm. It consists of 15,029 labeled pixels categorized into 15 land-cover classes. Figure 6 illustrates both the false color and ground-truth maps, and Table 3 details the class-wise distribution of training and test samples.
4) Chikusei: The Chikusei dataset is an aerial hyperspectral dataset acquired using the Headwall Hyperspec-VNIR-C sensor in Chikusei, Japan, on July 29, 2014 [54]. It comprises 128 spectral bands spanning wavelengths from 343 nm to 1018 nm. The dataset's spatial size is 2517×2335 pixels, with a resolution of 2.5 meters. It includes 19 types of land covers, encompassing urban and rural areas. In Figure 7, the true color composite image and the corresponding ground-truth map of the Chikusei dataset are depicted. Table 4 shows the distribution of training and test samples of each class.

Experimental Setup
To demonstrate the effectiveness of the proposed SS-Mamba, several kinds of HSI classification methods are selected and evaluated as comparison methods, listed as follows. 1) EMP-SVM [16]: this method utilizes EMP for spatial feature extraction followed by a classic SVM for final classification. This approach is commonly employed as a benchmark against deep learning-based methodologies.
2) CNN: a vanilla CNN that simply contains four convolutional layers. It is viewed as a basic spectral-spatial deep learning-based model for HSI classification. 3) SSRN [30]: a 3D deep learning framework that uses three-dimensional convolutional kernels and residual blocks to improve the CNN's performance on HSI classification. 4) DBDA [33]: an advanced CNN model that integrates a double-branch dual-attention mechanism for enhanced feature extraction. It is used for comparison with Transformer-based methodologies, which rely on self-attention mechanisms. 5) MSSG [54]: it employs a superpixel-structured graph U-Net to learn multiscale features across multilevel graphs. As a graph CNN and global learning model, MSSG is contrasted with the proposed Mamba and patch-based methods. 6) SSFTT [46]: a spatial-spectral Transformer that designs a unique tokenization method and uses a CNN to provide local features for the Transformer. 7) LSFAT [55]: a local semantic feature aggregation-based Transformer that has the advantage of learning multiscale features. 8) CT-Mixer [45]: an aggregated framework of CNN and Transformer, which aims to effectively utilize the advantages of both classic models. The listed comparison methods encompass a diverse range of approaches, including a traditional method, CNN-based methods, Transformer-based models, attention-based methods, and mixture methods, and thus offer a comprehensive basis for comparison. For details, one can refer to the original papers.
When using SS-Mamba, a spatial window of size 27×27 (i.e., $H = W = 27$) is used for any given pixel to generate the model's HSI input sample. As for the hyper-parameters of the proposed SS-Mamba, the spatial partition size is set to three ($p_s = 3$), and the spectral partition size is set to two ($p_b = 2$). The embedding dimension of each Mamba block is 64 ($D = 64$). The widely used multi-step learning rate scheduler is employed for training. Specifically, the initial learning rate is 0.0005, and it is divided by two every 80 epochs. The total number of epochs is set to 180 for all four datasets, and the mini-batch strategy with a batch size of 256 is used for all the datasets.
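The multi-step learning-rate schedule described above amounts to a simple step function, which can be sketched as:

```python
# Learning rate: start at 5e-4 and halve every 80 epochs (180 epochs total).
def lr_at(epoch, base=5e-4, step=80, gamma=0.5):
    return base * gamma ** (epoch // step)

print(lr_at(0), lr_at(80), lr_at(160))   # 0.0005 0.00025 0.000125
```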
The classification performance is primarily evaluated using the overall accuracy (OA), average accuracy (AA), and Kappa coefficient (K). To ensure the reliability and robustness of the reported results, the experiments are repeated ten times with varying random initializations. Besides, the training samples are randomly selected from all labeled samples each time; however, within each run the training samples remain the same for all the involved methods.
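For reference, the three metrics can be computed from a confusion matrix as follows (a standard formulation, not code from the paper):

```python
import numpy as np

# OA, AA, and Kappa from a confusion matrix C, where C[i, j] counts samples
# of true class i predicted as class j.
def oa_aa_kappa(C):
    C = np.asarray(C, dtype=float)
    n = C.sum()
    oa = np.trace(C) / n                               # overall accuracy
    aa = np.mean(np.diag(C) / C.sum(axis=1))           # average per-class accuracy
    pe = np.sum(C.sum(axis=0) * C.sum(axis=1)) / n**2  # chance agreement
    kappa = (oa - pe) / (1.0 - pe)
    return oa, aa, kappa

C = np.array([[48, 2], [4, 46]])     # toy two-class confusion matrix
oa, aa, kappa = oa_aa_kappa(C)
print(round(oa, 2), round(aa, 2), round(kappa, 2))   # 0.94 0.94 0.88
```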

Ablation Experiment with Basic Sequence Model
To demonstrate the classification ability of the Mamba model, ablation experiments with different sequence models are conducted on the four datasets. These sequence models include the long short-term memory network (LSTM), gated recurrent unit (GRU), and Transformer. The overall framework of the model remains unchanged, with only the basic sequence model varying. The related results are shown in Tables 5-8. From the obtained results, one can see that: i) Spectral-spatial models achieve the highest accuracies with the same basic sequence model, followed by the spatial models, with the spectral models proving to be the least effective. For example, in Table 5, the spectral-spatial LSTM achieves better results than the spatial LSTM and spectral LSTM, with improvements of 3.76 and 32.85 percentage points in terms of OA, respectively. Besides, the spectral-spatial Mamba outperforms the spatial Mamba by 2.77 percentage points, 5.14 percentage points, and 0.0364 in terms of OA, AA, and K on the Pavia University dataset, respectively. The results indicate that the designed spectral-spatial learning framework is effective for different sequence models. ii) Within this learning framework, the Mamba-based models achieve higher accuracies than the classical sequence models such as LSTM, GRU, and Transformer. For example, the spectral-spatial Mamba outperforms the spectral-spatial GRU by 0.45 percentage points, 0.52 percentage points, and 0.0046 in terms of OA, AA, and K on the Pavia University dataset, respectively. On the Houston dataset, the spectral-spatial Mamba yields better results than the Transformer, GRU, and LSTM counterparts, with improvements of 0.92, 0.49, and 0.80 percentage points in OA, respectively. One can draw similar conclusions for the spatial or spectral learning methods on the four datasets. The results indicate that the Mamba-based sequence models are effective across the different learning frameworks.
Notably, we find that all the spectral models seem harder to train well compared with the other models, especially on the Indian Pines dataset. Specifically, the spectral models need larger learning rates and more epochs to train. These models are also the hardest to train and perform worst on the Indian Pines dataset, probably due to the low quality of that dataset.

Ablation Experiment with Feature Enhancement Module
For the spectral-spatial Mamba model, we have designed the feature enhancement module, which uses the HSI sample's center-region information from the tokens to enhance the spectral-spatial features. To demonstrate the effectiveness of this module, an ablation experiment is conducted, and Table 9 shows the results.
From the results, one can see that the spectral-spatial Mamba with the feature enhancement module achieves higher accuracies than the model without it. For example, the feature enhancement module improves the classification results by 2.09 percentage points, 1.78 percentage points, and 0.0226 in terms of OA, AA, and K on the Houston dataset, respectively. The results indicate that the designed feature enhancement module is effective.

Classification results
The classification results of the different methods on the four datasets are shown in Tables 10-13. From these results, one can see that the proposed SS-Mamba has achieved superior classification performance over the comparison methods when using twenty training samples per class across all four datasets.
Specifically, it can be seen that: 1) The CNN-based methods usually achieve higher classification accuracies than the Transformer-based methods. As an example, on the Pavia University dataset, MSSG performs better than CT-Mixer, with an improvement of 1.46 percentage points in OA, 3.44 percentage points in AA, and 0.0194 in K. On the Houston dataset, the overall accuracies of the comparison Transformer-based methods are lower than 93%. In contrast, the overall accuracies achieved by the CNN-based methods are generally higher than 93%, and the OA of MSSG is even close to 94%. It seems necessary to improve Transformer-based methods in the case of limited training samples. 2) The designed spectral-spatial Mamba model obtains the highest accuracies compared with the other methods on the four datasets. For example, in Table 9, SS-Mamba yields better results than DBDA, with an improvement of 0.53 percentage points in OA, 3.68 percentage points in AA, and 0.0073 in K. On the Houston dataset, the proposed method achieves better results than MSSG, DBDA, and SSFTT, with improvements of 0.38, 0.63, and 4.12 percentage points in terms of OA, respectively. Compared with the Transformer, Mamba, which is also a sequence model, has better classification performance, which shows the potential of sequence models for HSI classification.

Classification Maps
As qualitative classification results, the classification maps of the different methods on the four datasets are shown in Figures 8-11. From these maps, it can be clearly seen that the proposed SS-Mamba achieves better classification performance than the comparison methods. For example, on the Pavia University dataset, many methods tend to misclassify some pixels among Asphalt, Trees, and Bricks, possibly because these classes are located close to each other. This suggests that these models may focus more on the spatial information brought by the spatial window, potentially neglecting the spectral characteristics needed for accurate classification. In contrast, the proposed method demonstrates good performance in this context. Thanks to the designed spectral-spatial learning framework and the feature extraction ability of Mamba, it can make full use of spectral-spatial information and improve the performance of HSI classification. Table 14 shows the number of parameters (i.e., Param.) and consumed time (i.e., Test Time) when tested on a batch of 100 samples from the Pavia University dataset. SS-LSTM/GRU/Transformer denote the corresponding spectral-spatial sequence models in the ablation experiment.

Complexity Analysis
The results reveal that the Mamba model exhibits faster inference than the Transformer model. Additionally, sequence models like Mamba generally require longer inference times than CNN models but also achieve higher classification accuracies. Moreover, the newly proposed Mamba-2 (i.e., the second generation of Mamba) has faster training and inference, which will further benefit our work in the future.
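The efficiency gap stated above comes from the asymptotic costs of the two sequence operators: self-attention scales quadratically with token count, while the selective-scan recurrence scales linearly. A back-of-envelope sketch (the constant factors and the state dimension of 16 are assumptions for illustration, not measurements from the paper):

```python
def attention_ops(seq_len, d_model):
    """Rough operation count for one self-attention layer:
    forming QK^T and applying attention to V each cost ~seq_len^2 * d_model."""
    return 2 * seq_len ** 2 * d_model

def ssm_ops(seq_len, d_model, state_dim=16):
    """Rough operation count for one selective-scan (Mamba-style) layer:
    the recurrence visits each token once, so cost is linear in seq_len."""
    return 2 * seq_len * d_model * state_dim

# For short token sequences the two are comparable, but the gap
# between them grows linearly with sequence length.
ratio_short = attention_ops(64, 128) / ssm_ops(64, 128)
ratio_long = attention_ops(1024, 128) / ssm_ops(1024, 128)
```

This is only an order-of-growth argument; measured wall-clock times (as in Table 14) also depend on implementation and hardware.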

Feature Maps
Taking two input samples from the Pavia University dataset as an example, Figure 12 shows the corresponding spatial feature maps of each block, with different columns showing the feature maps on different feature channels (i.e., the first 8 channels). It can be seen that the model still retains some of the image details in the tokens of each block. On the one hand, different channels focus on different image information, so their feature maps differ. On the other hand, the feature maps of different blocks also differ, and the image details (semantic structure) are gradually blurred with depth. However, probably because the model is shallow with only a few blocks, the change of feature maps with depth is not particularly obvious.
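Visualizations like Figure 12 are obtained by folding a block's token sequence back onto the patch grid and slicing out the leading channels. A minimal sketch of that rearrangement (names and the 7x7 grid are illustrative assumptions):

```python
import numpy as np

def token_feature_maps(tokens, grid_h, grid_w, n_channels=8):
    """Rearrange a block's spatial tokens of shape (num_tokens, dim)
    into per-channel 2-D maps of shape (n_channels, grid_h, grid_w)."""
    assert tokens.shape[0] == grid_h * grid_w
    maps = tokens.reshape(grid_h, grid_w, -1)          # (H', W', dim)
    return np.moveaxis(maps[..., :n_channels], -1, 0)  # channels first

# e.g., 49 tokens from a 7x7 patch grid with 64-dim embeddings
tokens = np.random.randn(49, 64)
maps = token_feature_maps(tokens, 7, 7)
```

Each of the eight resulting 2-D arrays can then be rendered as one panel of the figure, which is why per-channel differences are directly visible.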

Comparison of Proposed Classification Method with Spectral Unmixing-based Methods
The spectral unmixing technique plays an important role in HSI processing and applications, and many classification methods utilize spectral unmixing to improve performance. Therefore, it is necessary to discuss the proposed SS-Mamba's advantages over spectral unmixing-based classification methods. In [57], hard examples were first selected based on specific criteria; spectral unmixing was then used to improve their quality, facilitating subsequent classification. In [58], the researchers designed autoencoders for unmixing tasks based on the linear mixing model and then used the encoder's features for classification. Li et al. collected the abundance representations from multiple HSIs to form an enlarged dataset and then used this enlarged abundance dataset to train a classifier such as a CNN, which could alleviate the overfitting problem [59]. Based on these relevant works, we would like to emphasize the distinctive advantages of our classification approach over these spectral unmixing-based methods.
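To make the comparison concrete, the core step these pipelines share is abundance estimation under the linear mixing model, where a pixel spectrum y is modeled as a nonnegative combination of endmember spectra, y ≈ Ea. The sketch below uses unconstrained least squares followed by clipping and renormalization as a crude stand-in for fully constrained unmixing; the endmember matrix is assumed known, and all names are illustrative:

```python
import numpy as np

def abundances_ls(pixel, endmembers):
    """Estimate abundances under the linear mixing model y ≈ E a.
    Unconstrained least squares, then clip to [0, inf) and renormalize
    to sum to one (a rough surrogate for fully constrained unmixing)."""
    a, *_ = np.linalg.lstsq(endmembers, pixel, rcond=None)
    a = np.clip(a, 0.0, None)
    return a / a.sum()

# Two synthetic 5-band endmembers and a noiseless 60/40 mixture
E = np.array([[1.0, 0.1], [0.8, 0.2], [0.6, 0.4], [0.4, 0.6], [0.2, 0.9]])
y = E @ np.array([0.6, 0.4])
a_hat = abundances_ls(y, E)
```

In the unmixing-based pipelines above, such abundance vectors must first be estimated for every pixel before a separate classifier is trained on them, which is precisely the multi-step dependence the end-to-end SS-Mamba avoids.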
While spectral unmixing methods are effective in enhancing image quality and extracting spectral information, the resulting classification pipelines typically involve a multi-step process of spectral unmixing followed by separate feature extraction and classification steps. This can be time-consuming and requires prior knowledge of hyperspectral mixing models, whose accuracy may vary and impact the final classification results.
In contrast, our proposed classification method is designed as an end-to-end framework. This streamlined approach eliminates the need for intermediate processing stages and expert knowledge of spectral unmixing models, making it more efficient and user-friendly. By leveraging the capabilities of Mamba for feature extraction and classification simultaneously, our method offers a more seamless and accessible solution for achieving high classification accuracy. In the future, we can try to combine the proposed method with the spectral unmixing technique to obtain better classification performance.

Limitations and Future Work
The SS-Mamba divides the HSI samples into tokens, which form the input of the model. This token generation process may disrupt the semantic structure of the HSI samples to some extent. Specifically, the current partitioning method does not take an object's orientation and shape into consideration, potentially causing pixels belonging to the same object to be distributed across different patches. Therefore, future work will focus on combining the model with traditional architectures like CNNs to enhance local feature extraction and improve token generation.
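The limitation above can be seen directly from the shape of a non-overlapping partition: token boundaries are fixed by the grid, not by object shape. A minimal sketch of spatial token generation (names and sizes are illustrative, not the paper's exact implementation):

```python
import numpy as np

def spatial_tokens(cube, patch):
    """Partition an HSI sample of shape (H, W, C) into non-overlapping
    patch x patch spatial tokens, each flattened into one vector.
    H and W are assumed to be divisible by `patch`."""
    H, W, C = cube.shape
    tokens = (cube.reshape(H // patch, patch, W // patch, patch, C)
                  .transpose(0, 2, 1, 3, 4)    # group the patch grid
                  .reshape(-1, patch * patch * C))
    return tokens

# An 8x8 sample with 30 bands split into 2x2 patches -> 16 tokens
cube = np.random.randn(8, 8, 30)
tok = spatial_tokens(cube, 2)
```

Because the grid lines fall at fixed positions, any object straddling a patch boundary is split across tokens regardless of its shape, which is exactly what a CNN stem or a shape-aware tokenizer could mitigate.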

Conclusions
In this study, we make an effective exploration of building a spectral-spatial model for HSI classification purely from an emerging sequence model named Mamba. The proposed model converts any given HSI cube into spatial and spectral tokens as sequences, and stacked Mamba blocks are then used to effectively model the relationships between the tokens. Besides, the spectral and spatial tokens are enhanced and fused for more discriminant features. The proposed SS-Mamba is evaluated on four widely used datasets (i.e., Indian Pines, Pavia University, Houston, and Chikusei), and the main conclusions drawn from the experimental results are as follows: 1) A comparative analysis of the classification results shows that the proposed SS-Mamba can make full use of spatial-spectral information and achieves superior performance on the HSI classification task. 2) The ablation experiments show that, as a sequence model, Mamba is effective and attains competitive classification performance for HSI classification when compared with other sequence models such as Transformer, LSTM, and GRU.
3) The ablation experiments also show that the designed spectral-spatial learning framework is effective for different sequence models when compared with spectral-only or spatial-only models. 4) The designed feature enhancement module effectively enhances the spectral and spatial features and improves SS-Mamba's classification performance. This research explores the prospects of utilizing the Mamba model in HSI classification and provides new insights for further studies.

Conflicts of Interest:
The authors declare no conflict of interest.

Figure 1 .
Figure 1. The illustration of the design details of the proposed SS-Mamba workflow. For illustrative purposes, a single image flow instead of a batch is shown here. The proposed SS-Mamba mainly consists of a spectral-spatial token generation module and several stacked spectral-spatial Mamba blocks to extract the deep and discriminant features of HSI. The spectral-spatial information interaction and fusion occur in the token generation process (early stage), the spectral-spatial Mamba blocks (middle stage), and the mean token addition (last stage).

Figure 2 .
Figure 2. The procedure of spectral and spatial token generation. The spatial branch processes the entire region by partitioning it along the spatial dimension after spectral mapping, while the spectral branch focuses on the center region, partitioning it along the spectral dimension after spatial mapping.
Here, the two outputs denote the l-th block's output spectral and spatial tokens, respectively, and the two mappings represent the functions of the basic Mamba blocks introduced in Section II.A for the spectral and spatial tokens, respectively.

Figure 3 .
Figure 3.The illustration of spectral-spatial feature enhancement module.

Figure 12 .
Figure 12. The spatial feature maps of each block. For simplicity of illustration, only the spatial feature maps on the first 8 channels are given. Each row corresponds to the feature maps of different channels, while each column represents the feature maps output by different blocks.

Author Contributions:
Conceptualization: Y.C.; methodology: L.H. and Y.C.; writing-original draft preparation: L.H., Y.C. and X.H. All authors have read and agreed to the published version of the manuscript.
Funding: This research was funded by the Natural Science Foundation of China, Grants 62371169, 61971164 and U20B2041.

Table 1 .
Land cover classes and numbers of samples in the Indian Pines dataset.

Table 2 .
Land cover classes and numbers of samples in the Pavia University dataset.

Table 3 .
Land cover classes and numbers of samples in the Houston dataset.

Table 4 .
Land cover classes and numbers of samples in the Chikusei dataset.

Table 5 .
Ablation experiment with different sequence models on the Indian Pines dataset

Table 6 .
Ablation experiment with different sequence models on the Pavia University dataset

Table 7 .
Ablation experiment with different sequence models on the Houston dataset

Table 8 .
Ablation experiment with different sequence models on the Chikusei dataset

Table 9 .
Ablation experiment for the feature enhancement module

Table 10 .
Testing data classification results (mean ± standard deviation) on the Indian Pines dataset

Table 11 .
Testing data classification results (mean ± standard deviation) on the Pavia University dataset

Table 12 .
Testing data classification results (mean ± standard deviation) on the Houston dataset

Table 13 .
Testing data classification results (mean ± standard deviation) on the Chikusei dataset

Table 14 .
Complexity analysis for different models