Spatial–Spectral ConvNeXt for Hyperspectral Image Classification

Hyperspectral image (HSI) classification is a difficult task due to the heterogeneous spatial–spectral information, high dimensionality, and noise effects in the HSI. Recently, an enhanced convolutional approach, ConvNeXt, has demonstrated a stronger feature representation capability than the popular vision transformer approaches. This article presents a spatial–spectral ConvNeXt approach, called SS-ConvNeXt, for hyperspectral classification. To better learn the spatial and spectral information in the HSI, a Spatial-ConvNeXt block, a Spectral-ConvNeXt block, and a spectral projection module are designed. Depthwise and pointwise convolutions are adopted to reduce the model size and prevent vanishing gradients. The proposed model is evaluated against 14 other state-of-the-art methods on four different HSI datasets. Moreover, extensive ablation studies are conducted to investigate the roles of the building blocks in the proposed model. The results demonstrate that the proposed method not only achieves high classification accuracy but also better preserves class boundaries and reduces within-class noise.


I. INTRODUCTION
HYPERSPECTRAL image classification (HSIC) aims to estimate the semantic class label for each pixel of a hyperspectral image (HSI) [1]. It is one of the most important HSI processing tasks and has been widely used to support various applications, e.g., land cover and crop mapping [2], [3], urban monitoring [4], mineral mapping [5], etc. Despite its importance, HSIC is a challenging task due to the heterogeneous spectral-spatial information, high dimensionality, and noise effects in an HSI, which make it very difficult to extract discriminative features from the HSI [6].
Early approaches rely on handcrafted feature extractors, e.g., the morphological profile (MP) [12] and the extended multiattribute MP (EAMP) [13]. However, these feature extractors are mostly knowledge driven and, as such, cannot effectively adapt to the data characteristics for enhanced HSIC. On the other hand, data-driven feature learning approaches, represented by deep convolutional neural networks (CNNs), have led to an improved spatial-spectral feature extraction capability that greatly boosts the HSIC performance; e.g., see [14], [15], [16], [17], [18], [19], [20], [21], and [22]. Recently, the transformer [23] model, originally designed for natural language processing (NLP), has demonstrated a stronger feature-learning capability than CNNs by using the attention mechanism and has been successfully applied to the HSIC [24], [25], [26], [27], [28]. Nevertheless, transformer models have quadratic complexity with respect to the input size, leading to a high computational cost and a risk of overfitting given limited training samples. To overcome these limitations, more recent transformer approaches, e.g., the Swin transformer [29], tend to reuse key CNN features, such as local windows and weight-sharing mechanisms; transformer-based architectures have thus become increasingly similar to CNNs [30]. Therefore, for enhancing the HSIC, it is essential to explore enhanced CNN approaches that avoid the transformer's limitations.
Recently, in response to transformer models, i.e., the vision transformer (ViT) [31] and the Swin transformer [29], the ConvNeXt method [32] was proposed to improve traditional CNN approaches. The ConvNeXt introduces the Swin transformer design concepts to modernize the standard residual neural network (ResNet), leading to better performance than transformer-based architectures on ImageNet classification [33] and on object detection and semantic segmentation tasks on COCO [34]. ConvNeXt's success owes to rethinking and redesigning the key CNN components. From the macrodesign perspective, the ConvNeXt has four main characteristics. First, the ConvNeXt adopts a four-stage architecture and changes the stage compute ratio to (3, 3, 9, 3), which represents the number of blocks in each stage and is likely to be the optimal distribution of computation. Second, the ConvNeXt uses a 4 × 4 nonoverlapping convolution to aggressively downsample the input images at the network's beginning. Third, following the strategy proposed in ResNeXt, the ConvNeXt uses depthwise convolution and 1 × 1 convolution to separate the mixed spatial and channel information. Fourth, motivated by the transformer block, the ConvNeXt also uses inverted bottlenecks and revisits the use of large-sized convolutions. At the microscale, fewer activation functions and normalization layers are adopted in the ConvNeXt, in line with transformer models. Moreover, the ConvNeXt performs better by replacing ReLU with GELU and substituting batch normalization (BN) with layer normalization (LN). The aforementioned macro- and microdesigns make the ConvNeXt a cutting-edge model. Given the advantages of the ConvNeXt, the critical research question is how to adapt it while effectively extracting spatial and spectral features separately, one of the prime challenges in HSIC.
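As a concrete illustration, the macro- and microdesign choices above can be sketched as a single ConvNeXt block in PyTorch. This is a minimal rendering following [32]; the expansion ratio and layer-scale initialization shown here are the defaults reported for ConvNeXt, and the channels-last permutation is an implementation convenience rather than part of the design.

```python
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    """Minimal sketch of a ConvNeXt block [32]:
    7x7 depthwise conv -> LayerNorm -> 1x1 expansion (as Linear) ->
    GELU -> 1x1 projection -> layer scale -> residual connection."""
    def __init__(self, dim, expansion=4, layer_scale_init=1e-6):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)
        self.pwconv1 = nn.Linear(dim, expansion * dim)   # inverted bottleneck
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(expansion * dim, dim)
        self.gamma = nn.Parameter(layer_scale_init * torch.ones(dim))

    def forward(self, x):                    # x: (B, C, H, W)
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)            # channels-last for LN/Linear
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = self.gamma * x                   # layer scale
        return shortcut + x.permute(0, 3, 1, 2)
```

Note that only one normalization and one activation appear per block, reflecting the microdesign discussed above.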
This article, therefore, presents a new spatial-spectral ConvNeXt approach, called SS-ConvNeXt, for hyperspectral classification, with the following characteristics.
1) To better learn the discriminative spatial and spectral information in the HSI, we design a new Spatial-ConvNeXt (Spa-cv) block and a Spectral-ConvNeXt (Spe-cv) block. The Spa-cv and Spe-cv blocks are used to implement a four-stage architecture, with the number of blocks per stage being (3, 3, 9, 3). The Spa-cv block is used to implement the first two stages, and the Spe-cv block is used to implement the last two stages. 2) We use both depthwise and pointwise convolutions to reduce the model size and prevent vanishing gradients.
To decouple spatial and spectral information learning, instead of using depthwise and pointwise convolutions together in all blocks, we use depthwise convolution in Spa-cv for spatial information learning and pointwise convolution in Spe-cv for spectral information learning. 3) To better learn the rich spectral information in the HSI, instead of performing downsampling in the spatial domain using 4 × 4 and 2 × 2 convolution layers, as done in the ConvNeXt, we perform projection in the spectral domain using a pointwise convolution layer to enhance discriminative features between stages, which we call the spectral projection module (SPM). The aforementioned characteristics of the proposed model enable efficient discriminative spatial-spectral feature learning, leading to an enhanced HSIC approach that can better address the key HSI challenges. We qualitatively and quantitatively evaluate the classification performance of the proposed method on four HSI datasets. The results demonstrate that the proposed model not only achieves high classification accuracy but also better preserves class boundaries and reduces within-class noise. The proposed approach shows significant improvement over the original ConvNeXt (i.e., ConvNeXt-T [32]) and various state-of-the-art (SOTA) CNN-based and transformer-based backbone networks.
The remainder of this article is organized as follows. Section II introduces the proposed model. Section III presents the experimental results and compares them with those of other methods. Section IV presents a discussion. Finally, Section V concludes this article.

II. PROPOSED METHOD

A. Problem Formulation
The HSI is denoted by X, where the ith pixel x_i is extracted as a 3-D cube of size W × W × P, with W being the patch size and P being the number of bands in the HSI. The class label of x_i is denoted by y_i, which takes discrete values, i.e., y_i ∈ {1, 2, . . ., C}, where C is the total number of classes. The HSIC task aims to estimate the labels of all pixels, i.e., Y = {y_i | i ∈ T}, where T is the set of pixel indices. Deep-learning-based approaches solve this task by mapping x_i to y_i using a neural network model y_i = g(x_i, Θ) and estimating the model parameters Θ using training samples. Once g(x_i, Θ) is established, it can be used to predict all pixels in X and generate classification maps.

B. Overall Architecture

Fig. 1 shows the overall architecture of the proposed SS-ConvNeXt model. As can be seen in the top row of Fig. 1, the proposed model consists of four stages, where the first two stages are implemented by the Spa-cv block and the last two stages by the Spe-cv block. The Spa-cv block uses depthwise convolutions for spatial information learning, whereas the Spe-cv block uses pointwise convolutions for spectral information learning. Different stages are connected by the SPM via a pointwise convolution layer. An adaptive average pooling (AAP) layer and a fully connected layer are used to generate the class label. Mathematically, the proposed model can be formulated as

F_s = Stage_s(GELN(PC_{1×1}(F_{s−1}))), s = 1, . . ., 4,  ŷ_i = FC(AAP(F_4))   (1)

where F_0 = x_i; Stage_s is the stack of Spa-cv (s = 1, 2) or Spe-cv (s = 3, 4) blocks in stage s; PC_{1×1} is the pointwise convolution layer, i.e., a convolution layer with kernel size being 1; GELN represents the coactivation function of GELU and the LN layer; and FC and AAP denote the fully connected and adaptive average pooling layers, respectively.
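The four-stage architecture described above can be mirrored by a minimal structural sketch in PyTorch. Each stage is collapsed here to a single convolution (the real model stacks (3, 3, 9, 3) blocks per stage), and the GELN coactivation is reduced to GELU alone for brevity; the per-stage channel widths (64, 128, 256, 512) follow the filter counts given for the Spa-cv and Spe-cv modules below, while the band count and class number are illustrative placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SSConvNeXtSketch(nn.Module):
    """Structural sketch of the SS-ConvNeXt forward pass: four stages
    (two spatial via depthwise conv, two spectral via pointwise conv),
    connected by 1x1 spectral projections, then AAP and an FC head."""
    def __init__(self, bands=200, num_classes=16, dims=(64, 128, 256, 512)):
        super().__init__()
        chans = (bands,) + dims
        # Spectral projection (PC_1x1) placed before each stage
        self.spm = nn.ModuleList(
            nn.Conv2d(chans[s], chans[s + 1], kernel_size=1) for s in range(4)
        )
        self.stages = nn.ModuleList([
            nn.Conv2d(dims[0], dims[0], 3, padding=1, groups=dims[0]),  # Spa stage 1
            nn.Conv2d(dims[1], dims[1], 3, padding=1, groups=dims[1]),  # Spa stage 2
            nn.Conv2d(dims[2], dims[2], 1),                             # Spe stage 3
            nn.Conv2d(dims[3], dims[3], 1),                             # Spe stage 4
        ])
        self.act = nn.GELU()
        self.head = nn.Linear(dims[3], num_classes)

    def forward(self, x):                       # x: (B, P, W, W) patch cube
        for proj, stage in zip(self.spm, self.stages):
            x = stage(self.act(proj(x)))        # F_s = Stage_s(GELN(PC(F_{s-1})))
        x = F.adaptive_avg_pool2d(x, 1).flatten(1)  # AAP layer
        return self.head(x)                     # FC -> class logits
```

For a 9 × 9 patch with 200 bands, the sketch maps an input of shape (B, 200, 9, 9) to (B, 16) class logits without any spatial downsampling, consistent with the SPM design.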

C. Spa-cv Module
As indicated in Fig. 1, we design a new Spa-cv module to implement the first two stages of the proposed model. The Spa-cv module consists, sequentially, of a depthwise convolution layer, a layer normalization (LN) layer, an expansion linear layer, a GELU activation layer, another linear layer, and a dropout and scaling layer. The residual learning approach is also adopted via a skip connection. The use of depthwise convolution encourages the Spa-cv module to focus on learning the spatial information in the HSI. Moreover, with fewer parameters, depthwise convolution also reduces the size of the proposed model.
The Spa-cv module in (1) can be expressed as

Spa-cv(F) = F + Drop(γ · FC2(GELU(FC1(LN(DConv_{3×3}(F))))))

where DConv_{3×3} is a depthwise convolution with a total of 64 convolution filters of size 3 × 3 × 1 in the first Spa-cv module and 128 same-sized convolution filters in the second Spa-cv module, FC1 and FC2 denote the expansion and projection linear layers, γ is a learnable scaling factor, and Drop is the dropout layer.
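The layer composition of the Spa-cv module can be sketched in PyTorch as follows. The layer order follows the description above; the expansion ratio, dropout rate, and layer-scale initialization are illustrative assumptions rather than values reported in the text.

```python
import torch
import torch.nn as nn

class SpaCvBlock(nn.Module):
    """Sketch of the Spa-cv block: depthwise 3x3 conv -> LN ->
    expansion linear -> GELU -> linear -> dropout/scaling -> residual."""
    def __init__(self, dim, expansion=4, drop=0.0, scale_init=1e-6):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=3, padding=1,
                                groups=dim)          # depthwise: spatial mixing only
        self.norm = nn.LayerNorm(dim)
        self.fc1 = nn.Linear(dim, expansion * dim)   # expansion linear layer
        self.act = nn.GELU()
        self.fc2 = nn.Linear(expansion * dim, dim)
        self.drop = nn.Dropout(drop)
        self.gamma = nn.Parameter(scale_init * torch.ones(dim))  # scaling

    def forward(self, x):                      # x: (B, C, H, W)
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)              # channels-last for LN/Linear
        x = self.fc2(self.act(self.fc1(self.norm(x))))
        x = self.drop(self.gamma * x)
        return shortcut + x.permute(0, 3, 1, 2)  # skip connection
```

Because the depthwise convolution uses one 3 × 3 filter per channel (`groups=dim`), the block mixes information only spatially, leaving cross-band mixing to the linear layers and the Spe-cv module.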

D. Spe-cv Module
As indicated in Fig. 1, we design a new Spe-cv module to implement the last two stages of the proposed model. The Spe-cv module consists, sequentially, of a pointwise convolution layer, a layer normalization (LN) layer, an expansion linear layer, a GELU activation layer, another linear layer, and a dropout and scaling layer. Similar to Spa-cv, the residual learning approach is adopted via a skip connection. The use of pointwise convolution encourages the Spe-cv module to focus on learning the spectral information in the HSI in an efficient manner.
The Spe-cv module in (1) can be expressed as

Spe-cv(F) = F + Drop(γ · FC2(GELU(FC1(LN(PConv_{1×1}(F))))))

where PConv_{1×1} is a pointwise convolution with a total of 256 1 × 1 convolution filters in the first Spe-cv module and 512 filters in the second Spe-cv module, respectively, and the remaining layers are defined as in the Spa-cv module.
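The Spe-cv block differs from Spa-cv only in its first layer: the depthwise convolution is replaced by a pointwise (1 × 1) convolution, which mixes across spectral channels at each pixel instead of across spatial neighbors. A sketch under the same illustrative assumptions as before:

```python
import torch
import torch.nn as nn

class SpeCvBlock(nn.Module):
    """Sketch of the Spe-cv block: pointwise 1x1 conv -> LN ->
    expansion linear -> GELU -> linear -> dropout/scaling -> residual."""
    def __init__(self, dim, expansion=4, drop=0.0, scale_init=1e-6):
        super().__init__()
        self.pwconv = nn.Conv2d(dim, dim, kernel_size=1)  # spectral mixing only
        self.norm = nn.LayerNorm(dim)
        self.fc1 = nn.Linear(dim, expansion * dim)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(expansion * dim, dim)
        self.drop = nn.Dropout(drop)
        self.gamma = nn.Parameter(scale_init * torch.ones(dim))

    def forward(self, x):                      # x: (B, C, H, W)
        shortcut = x
        x = self.pwconv(x)
        x = x.permute(0, 2, 3, 1)
        x = self.fc2(self.act(self.fc1(self.norm(x))))
        x = self.drop(self.gamma * x)
        return shortcut + x.permute(0, 3, 1, 2)
```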

E. Spectral Projection Module (SPM)
As indicated in Fig. 1, we design the SPM to connect different stages using a pointwise convolution layer. By applying the SPM to patchwise samples, more discriminative features in the spectral domain are progressively established across stages, instead of performing spatial downsampling as in the ConvNeXt model. In detail, before the first and third stages, we insert a pointwise convolution layer, an LN layer, and a GELU layer. Before the second and fourth stages, we add an LN layer and a pointwise convolution layer.
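The two SPM variants described above can be sketched as a single module with a flag. The exact ordering of the normalization relative to the projection is our reading of the text, and the channel widths are illustrative.

```python
import torch
import torch.nn as nn

class SPM(nn.Module):
    """Sketch of the spectral projection module: a 1x1 (pointwise) conv
    re-projects the spectral/channel dimension between stages, in place
    of ConvNeXt's spatial downsampling. Before stages 1 and 3 the
    projection is followed by LN and GELU; before stages 2 and 4 only
    LN precedes the projection."""
    def __init__(self, in_ch, out_ch, with_gelu=True):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.norm = nn.LayerNorm(out_ch if with_gelu else in_ch)
        self.with_gelu = with_gelu
        self.act = nn.GELU()

    def forward(self, x):                    # x: (B, C, H, W)
        if self.with_gelu:                   # conv -> LN -> GELU (stages 1, 3)
            x = self.proj(x)
            x = self.norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
            return self.act(x)
        # LN -> conv (stages 2, 4)
        x = self.norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        return self.proj(x)
```

Note that both variants preserve the spatial size W × W of the patch; only the channel (spectral) dimension changes.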

III. EXPERIMENTAL RESULTS

A. Data Description
To evaluate the performance of the proposed method, four classical HSI datasets are adopted, i.e., Indian Pines (IN), Pavia University (PU), WHU-Hi-HongHu (WHHH), and WHU-Hi-HanChuan (WHHC); the first two are available at https://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes. 1) IN Data: The IN data were collected in 1992 by the Airborne Visible/Infrared Imaging Spectrometer sensor over Northwestern Indiana, USA. The HSI consists of 145 × 145 pixels at a ground sampling distance (GSD) of 20 m and 224 spectral bands covering the wavelength range of 400-2500 nm with a 10-nm spectral resolution. In the experiment, 24 water-absorption and noise bands were removed, and 200 bands were retained. Sixteen land-cover categories are investigated in this scene.
2) PU Data: The PU data were acquired by the Reflective Optics System Imaging Spectrometer (ROSIS) sensor over Pavia University and its surroundings, Pavia, Italy. This dataset has 103 spectral bands ranging from 430 to 860 nm. Its spatial resolution (SR) is 1.3 m, and its image size is 610 × 340 pixels. Nine land-cover categories are covered.
To train the proposed model, the cross-entropy loss function is adopted, and a gradient-descent approach, i.e., the Adam optimizer [35], is used to estimate the unknown parameters of the proposed model. The ExponentialLR scheduler is adopted; the initial learning rate is set to 0.0001 and decayed by a factor of 0.9 after each one-tenth of the total 400 epochs. We set the batch size to 16, 32, 64, and 64 and the patch size to 9, 9, 13, and 13 for IN, PU, WHHC, and WHHH, respectively, to allow a better computational efficiency.
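The training configuration above maps directly onto standard PyTorch components. The sketch below uses a stand-in model (the real SS-ConvNeXt would take its place) and shows how the stepwise ExponentialLR decay, applied every tenth of the 400 epochs, reproduces the described schedule:

```python
import torch

# Hypothetical training configuration matching the described setup:
# cross-entropy loss, Adam optimizer (lr = 1e-4), and an ExponentialLR
# schedule decaying the rate by a factor of 0.9 every 40 of 400 epochs.
model = torch.nn.Linear(200, 16)          # stand-in for SS-ConvNeXt
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)

epochs = 400
for epoch in range(epochs):
    # ... one pass over the training patches would go here ...
    if (epoch + 1) % (epochs // 10) == 0:  # decay 10 times over training
        scheduler.step()

# After the 10 decay steps, the learning rate is 1e-4 * 0.9**10
final_lr = optimizer.param_groups[0]["lr"]
```

With this schedule, the learning rate ends near 3.49e-5, a gentle decay that suits the relatively small HSI training sets.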
3) Methods for Ablation Studies: Moreover, to investigate the performance gain of the proposed SS-ConvNeXt, five variants of the SS-ConvNeXt (i.e., SS-ConvNeXt(E), SS-ConvNeXt(D), SS-ConvNeXt(F), Spa-ConvNeXt, and Spe-ConvNeXt) as well as the ConvNeXt-T [32] model are also compared in the ablation studies. Fig. 2 shows the architecture designs of the variants of the SS-ConvNeXt. In Fig. 2, the SS-ConvNeXt(E) is the same as the SS-ConvNeXt, except that it exchanges the locations of the Spa-cv and Spe-cv modules. The difference between the SS-ConvNeXt(D) and the SS-ConvNeXt is that the SS-ConvNeXt(D) replaces the SPM with a 2 × 2 spatial downsampling layer before stages 2 and 4. The SS-ConvNeXt(F) fuses the Spa-cv and Spe-cv modules in a branch manner, with three feature fusion methods, i.e., point-wise multiplication, point-wise addition, and concatenation. The Spa-ConvNeXt and Spe-ConvNeXt use only the spatial and spectral modules, respectively.

D. Numerical Evaluation
We conduct experiments on the four datasets to investigate the classification accuracy of the SS-ConvNeXt and the compared algorithms under different numbers of labeled samples; four errorbar plots are drawn based on the OA. The training-sample settings (proportions, fixed numbers, or samples per class) are varied for each dataset. The results are shown in Fig. 3(a)-(d). In general, the classification accuracy of each algorithm increases as the number of training samples increases. Moreover, deep learning models exhibit better stability in their classification results, as reflected by their lower standard deviations compared with the traditional classification methods (i.e., SVM and 1D-CNN). As anticipated, the results demonstrate that the proposed SS-ConvNeXt surpasses the other methods with superior OA values on all four datasets. Since the labeling process of HSI data samples is time consuming, the classification performance in the small-sample case can better test the quality of an algorithm. For example, on the PU dataset, under a 1% training proportion, the SS-ConvNeXt's OA reaches 98.50%; and on the WHHH dataset, under a 0.1% training ratio, the SS-ConvNeXt's OA reaches 92.8%, which is much higher than that of the other algorithms.
Tables I-IV show the numerical results on the four datasets. 1) For the IN dataset, in Table I, the proposed SS-ConvNeXt model achieves an OA of 94.34% with only 200 labeled training samples, which is 4% higher than the second-best method, i.e., SSTN. In Fig. 3(a), the bar of the proposed method is much higher than those of the other methods, regardless of the number of training samples. As the number of training samples increases, the OA obtained by the proposed method increases very fast, while its standard deviation decreases. 2) For the PU dataset, in Table II, the SS-ConvNeXt achieves an OA about 2% higher than that of the second-best method, i.e., SSTN. Moreover, in Fig. 3(b), the bar of the SS-ConvNeXt is higher than those of the other methods in all cases, with the only exception being when there are five labeled samples per class. In Fig. 3(b), the standard deviation of the SS-ConvNeXt is lower than those of the other methods. 3) For the WHHH dataset, in Table III, the SS-ConvNeXt achieves an OA of 97.75% with only 0.5% training samples, which is about 2.6% higher than the second-best method, i.e., SSRes. Fig. 3(c) indicates that the proposed method outperforms all other methods in all cases. 4) For the WHHC dataset, in Table IV, the SS-ConvNeXt achieves an OA of 96.74% with only 0.5% training samples, which is 3% higher than the transformer-based model SSTN. In Fig. 3(d), as the number of training samples increases, the SS-ConvNeXt performs significantly better than the other networks. Table V compares the proposed method with the published results of other advanced methods, which indicates that the proposed approach performs the best in most cases.

E. Visual Evaluation
Figs. 4-7 show the classification maps of different methods on the four datasets. Regions of interest are used to highlight differences. Overall, on all datasets, the proposed SS-ConvNeXt offers better classification maps that are closest to the ground-truth maps. Moreover, referring to the RGB composite images, the SS-ConvNeXt shows less within-class misclassification, more accurate class boundaries and edges, and finer details with less oversmoothing.
The conventional approaches, as exemplified by the SVM and 1D-CNN models, yield classification maps that are noisy and exhibit discontinuous land-cover blocks, resulting in rough classification outcomes. Classic backbone networks, as exemplified by the 2D-CNN, 3D-CNN, and HybridSN models, show better classification maps with less noise. The methods based on residual networks, as exemplified by SSRN, A2S2K-ResNet, and SSRes, have strong feature extraction ability, which improves the classification accuracy to a certain extent. The transformer-based networks, represented by SSFTT and SSTN, perform better because of the attention mechanism. The figures indicate that the SS-ConvNeXt outperforms the other methods in identifying most areas of the Corn-notill class (the red box on the left in Fig. 4) in the IN dataset, while also maintaining more precise class boundaries and edges, as shown in the zoomed-in area. The SS-ConvNeXt also accurately classifies the building boundaries on the PU dataset. The WHHH and WHHC results demonstrate that the SS-ConvNeXt delivers clearer delineation, even though the ground-cover distributions of these two agricultural scenes are large and complex.

F. Ablation Analysis
Table VI shows the results achieved by the variants of the proposed SS-ConvNeXt model, whose architectures are illustrated in Fig. 2. The ConvNeXt-T is also included for comparison. As we can see in Table VI, the SS-ConvNeXt outperforms all its variants and the original ConvNeXt-T method. In Table VI, the SS-ConvNeXt implementations with different numbers of blocks (i.e., [1, 1, 3, 1], [2, 2, 6, 2], and [3, 3, 9, 3]) achieve similar classification performance. In this article, we use the block numbers [3, 3, 9, 3], as shown in Fig. 1. The poor performance of the ConvNeXt is probably due to the fact that it is not designed for the HSIC and, thereby, cannot efficiently capture the discriminative spectral and spatial information. The accuracy of the SS-ConvNeXt is more than 1% higher than that of the SS-ConvNeXt(E).
The difference between the SS-ConvNeXt(D) and the SS-ConvNeXt is that the SS-ConvNeXt(D) replaces the SPM with a 2 × 2 spatial downsampling layer before stages 2 and 4. The better performance of the SS-ConvNeXt over the SS-ConvNeXt(D) demonstrates the superiority of the proposed SPM over spatial downsampling. Among the fusion methods, the SS-ConvNeXt(F) with a concatenation operation shows better results. We observe that the parallel SS-ConvNeXt(F) performs worse than the serial SS-ConvNeXt, which might be because the different branches of the parallel SS-ConvNeXt(F) are concatenated in a manner that allows insufficient interaction between the spatial and spectral branches, whereas in the serial SS-ConvNeXt, spatial and spectral information is extracted stage by stage, allowing more efficient extraction of both low- and high-level features in a hierarchical manner. Fig. 8 shows the classification maps of the different variants of the SS-ConvNeXt. Overall, the SS-ConvNeXt provides better preserved class boundaries with fewer within-class artifacts and less noise. Direct application of the original ConvNeXt-T model to HSIC gives the worst results.

Fig. 9 shows the performance variation of the SS-ConvNeXt with changes of patch size, learning rate, and activation function (i.e., GELU versus ReLU) on the four datasets.

Fig. 9. Errorbar plots of the OA (averaged over ten runs) achieved by different hyperparameter settings (i.e., patch size and learning rate) and different activation functions.

Except for the IN dataset, the accuracy increases with the patch size on the remaining three datasets. Additionally, the SRs of the four datasets are quite different: IN has an SR of only 20 m, whereas PU, WHHH, and WHHC have SRs of 1.3, 0.043, and 0.109 m, respectively. This influence of the window size can be interpreted as follows: a smaller patch contains insufficient spatial information on a high-SR HSI dataset, whereas a larger patch is not conducive to extracting key information on a low-SR HSI dataset. Based on this observation, we set the patch size of the IN and PU datasets to 9 and that of the WHHH and WHHC datasets to 13.

G. Hyperparameter Sensitivity Analysis and Feature Map Visualization
We also conduct an ablation study on the learning rate strategy. As we can see in Fig. 9 (middle), the ExponentialLR strategy enables higher performance than all the fixed-learning-rate approaches. We, therefore, adopt ExponentialLR to train our model.
As shown in the red box of Fig. 10, the feature maps obtained with the GELU function perceive detailed information better than those with the ReLU function; the SS-ConvNeXt with ReLU fails to perceive the boundaries between different object types. The OA is also slightly improved with the use of GELU, as shown in Fig. 9 (right). We, therefore, use the GELU function in the SS-ConvNeXt.

IV. DISCUSSION
A. "Serial Structure (Spatial-Spectral)" or "Parallel Scheme (Spatial and Spectral)"?

We propose a spatial-spectral ConvNeXt approach for HSIC. The architectures of SSRN, SSTN, and SSRes are also serial, i.e., they extract spatial information first and then spectral information. The ablation analysis shows that the serial SS-ConvNeXt performs better than the parallel SS-ConvNeXt(F), which processes the spatial and spectral branches separately before combining them. At the same time, the ablation shows that extracting only spatial or only spectral information cannot solve the HSIC problem well.

B. How Does the Window Size Affect the Accuracy?
The spatial size of the input data is one of the main factors that influence the HSIC performance. Based on the observations in the ablation analysis, a smaller input patch contains insufficient spatial information on a high-SR HSI dataset (e.g., the WHHH dataset), whereas a larger input patch is not conducive to extracting key information on a low-SR HSI dataset (e.g., the IN dataset). Consequently, to make a fair comparison, the window size should be kept consistent within the same dataset.

V. CONCLUSION
This article has presented a new spatial-spectral convolutional neural network model, called SS-ConvNeXt, for the HSIC. The new model was inspired by the recent ConvNeXt model, which has demonstrated a stronger feature representation capability than the popular ViT approaches. The proposed SS-ConvNeXt was tailored to the characteristics of HSIs and, thereby, can efficiently learn discriminative spatial-spectral information for enhanced HSIC. To better learn the spatial and spectral information in the HSI, the Spa-cv and Spe-cv blocks were, respectively, designed. Depthwise and pointwise convolutions were adopted to reduce the model size and prevent vanishing gradients. The proposed model was evaluated against 14 other state-of-the-art methods on four different datasets. Moreover, extensive ablation studies were conducted to investigate the roles of the building blocks in the proposed model. The results demonstrated that the proposed SS-ConvNeXt not only achieves high classification accuracy but also better preserves class boundaries and reduces within-class noise.