Multi-Scale Depthwise Separable Capsule Network for hyperspectral image classification

Addressing the challenges of effectively extracting multi-scale features and preserving pose information in hyperspectral image (HSI) classification, this article proposes a Multi-Scale Depthwise Separable Capsule Network (MDSC-Net). First, MDSC-Net extracts hierarchical features with parallel multi-scale convolutional kernels while reducing computational complexity through depthwise separable convolutions, achieving efficient feature extraction at a lower overall computational cost. Second, to enhance the translational invariance of features and limit the loss of pose information, features at each scale are processed in parallel by independent capsule networks, with dynamic routing replacing max pooling. Finally, features of different scales are fused through a concatenation operation, enabling precise analysis of multi-level information during classification. Experimental comparisons show that MDSC-Net achieves average accuracies of 94%, 98%, and 99% on the Kennedy Space Center, University of Pavia, and Salinas datasets, respectively, a significant performance advantage over recent HSI classification models that validates the effectiveness of the proposed model.


Introduction
Hyperspectral imaging (HSI), as a remote sensing data source rich in physical and chemical properties, has demonstrated immense potential for applications in fields such as earth science, resource exploration, and environmental monitoring [1][2][3]. In these applications, the task of HSI classification fundamentally involves the precise identification of land cover based on the spectral characteristics of pixels, thereby aiding scientific research and decision-making. The classification of HSI, especially its feature extraction, occupies a crucial position in the domains of machine learning and image analysis, and has long been a subject of keen interest among researchers. Although this task poses significant challenges for machines, precise labels can be created through manual annotation of images, providing high-quality training data for supervised learning and thereby effectively supporting the learning process. In the field of feature extraction, the main approaches are divided into two categories: pixel-based and region-based. In pixel-based classification methods, each pixel is classified individually, primarily based on its independent spectral information, without consideration of its spatial relationship with surrounding pixels [4]. Although straightforward, this approach has clear limitations, as it neglects the potential spatial relationships between pixels. Conversely, a more effective method analyzes the data cube as a whole, thereby leveraging both spatial and spectral information [5]. This approach considers the spatial connections between pixels more comprehensively, enabling more accurate image analysis. Hence, this study focuses on region-based analysis methods.
Neural networks, particularly Convolutional Neural Networks (CNNs), which have been developed since the 1970s, have shown significant effectiveness in addressing such issues [6]. Architectures like AlexNet [7], ResNet [8], and GoogLeNet [9], although originally designed for processing the red, green, and blue (RGB) bands of visible light, are equally applicable and effective for multi-channel data analysis. With the widespread application of CNNs in image processing, their outstanding performance in HSI classification tasks has been confirmed by numerous studies [10][11][12]. For instance, 3D-CNNs have been used for feature extraction from HSI [13]; the introduction of attention mechanisms [14] has optimized feature map extraction; and network structures like HybridSN [15] and SpectralNET [16] have achieved breakthroughs in HSI classification accuracy. Notably, recent studies [17,18] have proposed further methods to enhance HSI classification accuracy. The aforementioned research, although achieving significant accomplishments in HSI classification tasks, still faces limitations such as the loss of spatial information due to pooling operations and large model parameter counts. Notably, CNNs struggle to effectively recognize object pose information, as their convolutional filters cannot represent feature transformation activities. To overcome the limitations of traditional CNN approaches, the concept of capsule networks was first introduced by Hinton and colleagues [19]. Capsules are groups of neurons that describe an entity's pose and probability of existence, containing more information about the entity's attributes than the scalar neurons in CNNs. The capsule network (CapsNet), proposed by Sabour and others [20], encodes the probability of an object's presence and its pose through the length and orientation of activity vectors, enhancing the performance of capsule networks in image analysis. Further, H. Zhang and colleagues applied capsule networks to HSI classification [21], achieving significant results. Additionally, Hinton and colleagues proposed matrix capsules with EM routing [22], addressing some deficiencies of the dynamic routing algorithm in [20], such as using the negative log variance of Gaussian clusters to measure the consistency between pose vectors, and representing poses with matrices instead of vector lengths.
Recently, the trend has been to apply capsule network technology to HSI analysis. Methods based on capsule networks have demonstrated significant effectiveness in extracting deep semantic features of hyperspectral images. This approach utilizes neurons within capsule networks to simultaneously capture both the spectral and spatial features of HSI, as evidenced by several studies [23][24][25]. Although existing capsule networks have improved on CNNs in some aspects of HSI classification, they face significant challenges in parsing complex spectral-spatial features that vary with scale. Particularly in capturing and analyzing multi-scale features, capsule networks often fail to fully extract spectral characteristics, thereby affecting the effective interpretation of the rich geophysical details and multi-level information in HSI, which in turn reduces classification accuracy. In summary, the problems this study aims to address in HSI classification are that the pooling operations in convolutional networks often lead to a loss of spatial information, thereby diminishing the ability to effectively recognize object pose information [26], and that capsule networks have not been sufficiently capable of extracting multi-scale spectral features [27]. To overcome the limitations of existing methods, we propose an innovative Multi-Scale Depthwise Separable Capsule Network (MDSC-Net). The main contributions of this paper are as follows:
1. A novel multi-scale capsule network architecture is introduced, utilizing multi-scale convolutional kernels to extract features at different levels, effectively parsing the rich detail and multi-layered information in hyperspectral images. This enhances the precise identification and understanding of complex terrains and land cover types. Feature maps at each scale are processed independently by capsule networks to accurately capture subtle changes in land cover at different scales, ensuring the integrity of spatial information and improving the efficiency of hyperspectral image data analysis.
2. By employing a dynamic routing mechanism instead of the traditional max-pooling method, the dimensionality of feature maps is effectively reduced, and the pose matrices in capsule networks enhance the model's translation invariance of features.This approach improves the network's ability to capture details when handling complex terrains and ensures the in-depth analysis of the rich hierarchical information in hyperspectral image data, achieving highly accurate and efficient data processing.
3. Depthwise separable convolution is incorporated into the architecture by replacing the original three-dimensional convolution structure with depthwise and pointwise convolutions.This ensures efficient feature extraction while significantly reducing the model's computational complexity, thereby effectively alleviating the overall computational burden.
4. In addition to evaluating the effectiveness of the proposed MDSC-Net architecture in terms of overall accuracy (OA), average accuracy (AA), and the kappa coefficient (Kappa), the method is also compared with existing state-of-the-art hyperspectral image classification methods in terms of efficiency.

Basic principles of the EM routing algorithm
The EM routing algorithm plays a crucial role in selecting the optimal capsule pathways to optimize information flow within this model. The algorithm alternates between E-steps and M-steps. The E-step computes

Q(θ | θ^(t)) = E_{Z|X, θ^(t)}[log L(θ; X, Z)]

where θ represents the parameters, Z the latent variables, X the observed variables, L the likelihood function, and t the iteration number. This step selects the optimal output capsule by evaluating the response of each capsule to the observed data. In the M-step, the objective is to maximize the Q function:

θ^(t+1) = arg max_θ Q(θ | θ^(t))

The Q function is thus defined as the expectation, with respect to Z given the current parameter estimates θ^(t), of the complete-data log-likelihood of the observed data X. This expectation is used to update the model parameters in each iteration of the EM algorithm. The convergence property of the EM algorithm ensures that log L(θ^(t) | X) does not decrease after each iteration, ultimately converging to a local maximum or saddle point.
To better understand the process, consider a specific example. Assume a simple binary-classification capsule network in which each input vector can belong to one of two capsule output classes. The initial parameters θ^(0) comprise the means μ_1^(0) and μ_2^(0) and the variances σ_1^(0) and σ_2^(0) of the two classes. The E-step uses Bayes' theorem to calculate the posterior probability of each observed data point x_i belonging to each class. In the t-th iteration, given the observed data X = {x_1, x_2, ..., x_N} and the latent variables Z = {z_1, z_2, ..., z_N}, the posterior probability (assuming equal class priors)

γ_{i1}^(t) = p(z_i = 1 | x_i; θ^(t)) = N(x_i; μ_1^(t), (σ_1^(t))²) / Σ_k N(x_i; μ_k^(t), (σ_k^(t))²)

indicates the probability that, at the t-th iteration, the i-th sample x_i belongs to the first output capsule. The M-step uses the posterior probabilities calculated in the E-step to update the parameters. The formulas for updating the mean and variance are:

μ_k^(t+1) = Σ_i γ_{ik}^(t) x_i / Σ_i γ_{ik}^(t),   (σ_k^(t+1))² = Σ_i γ_{ik}^(t) (x_i − μ_k^(t+1))² / Σ_i γ_{ik}^(t)

By iteratively calculating the posterior probabilities and updating the parameters in this manner, the EM algorithm continuously optimizes the model parameters. Ultimately, it converges to a local maximum or saddle point, thereby selecting the optimal capsule path to enhance information flow.
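The two-class example above can be sketched in a few lines of NumPy. This is a didactic sketch under the stated assumptions of one-dimensional data and equal class priors; it is not the authors' EM routing implementation, which operates on capsule pose matrices.

```python
import numpy as np

def em_two_class(x, mu, sigma, n_iter=50):
    """Toy EM for the two-class Gaussian example: alternates an E-step
    (posterior responsibilities via Bayes' theorem, equal priors assumed)
    and an M-step (mean/variance updates weighted by responsibilities)."""
    for _ in range(n_iter):
        # E-step: Gaussian density of each x_i under each class
        pdf = np.stack([
            np.exp(-(x - mu[k]) ** 2 / (2 * sigma[k] ** 2))
            / (np.sqrt(2 * np.pi) * sigma[k])
            for k in range(2)
        ])                                            # shape (2, N)
        gamma = pdf / pdf.sum(axis=0, keepdims=True)  # responsibilities
        # M-step: re-estimate means and variances from weighted data
        for k in range(2):
            w = gamma[k]
            mu[k] = (w * x).sum() / w.sum()
            sigma[k] = np.sqrt((w * (x - mu[k]) ** 2).sum() / w.sum()) + 1e-8
    return mu, sigma, gamma
```

Run on a mixture of two well-separated Gaussians, the estimated means converge toward the true component means, mirroring how EM routing iteratively sharpens each capsule's agreement with its inputs.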

Model topology
In this section, the design of the Multi-Scale Depthwise Separable Capsule Network (MDSC-Net) model is presented. Its overall structure is depicted in Fig 1. The architecture of the model is divided into six parts, each responsible for different processing and analytical tasks. The first part is dedicated to data preprocessing, where dimensionality reduction is conducted through Principal Component Analysis (PCA), followed by edge padding and image segmentation. In the second part, the initialization layer employs a 5×5 convolutional kernel to preliminarily extract features, providing a foundation for deep learning. The third part utilizes a multi-scale convolutional structure akin to the Inception module, extracting rich features through convolutional kernels ranging from 3×3 to 11×11, thereby enhancing the model's ability to recognize details. The fourth part consists of a capsule layer, including a primary capsule layer and two intermediate capsule layers. In the fifth part, an integration layer is employed, where the extracted multi-scale features are concatenated. Finally, classification is performed in the classification capsule layer, where image classification is determined based on the activation values in the feature maps.

Data preprocessing module
The preprocessing stage of the data, as illustrated in Fig 2, aims to enhance computational efficiency, reduce data redundancy, and adjust the data to meet the input requirements of the model.
The original hyperspectral dataset possesses a spatial resolution of m×n and consists of k spectral bands. Initially, PCA is performed to reduce the dimensionality of the dataset, compressing it from k spectral bands to the d bands that contain the most information. This not only improves computational efficiency and reduces data redundancy but also mitigates the impact of noise by discarding the bands that contain less information, as these are more susceptible to noise interference. Subsequently, to meet the model's input requirements and preserve information at the edges, zero-padding is applied to the dataset. Specifically, for a selected window size w, a zero boundary of width p pixels is added around all four sides of the dataset, altering its dimensions to (m+2p)×(n+2p)×d.
Next, a pixel-wise sliding window operation is performed on the padded image using w×w pixel blocks (patches), with a stride of 1, ensuring that every original pixel can serve as a central point. This process generates m×n samples, reflecting the region-based approach of the model. Given the prevalence of non-informative background information in the dataset, which is not conducive to model training, samples carrying the background label (label value 0) are systematically eliminated. In each training iteration, 64 samples are sequentially selected from the dataset to form a batch. Following the aforementioned preprocessing steps, the input data format for the model is finalized as 64×w×w×d.
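The preprocessing pipeline described above might be sketched as follows. The PCA is a plain eigendecomposition of the spectral covariance, and the defaults d=20 and w=21 follow the KSC settings reported later in the article; the authors' exact implementation may differ.

```python
import numpy as np

def preprocess_hsi(cube, labels, d=20, w=21):
    """Sketch of the preprocessing stage: PCA to d bands, zero-padding
    of width p = w//2, w-by-w patch extraction centred on every pixel,
    and removal of background-labelled (label 0) samples."""
    m, n, k = cube.shape
    # PCA over the spectral dimension: keep the d most informative bands
    flat = cube.reshape(-1, k).astype(np.float64)
    flat -= flat.mean(axis=0)
    cov = flat.T @ flat / (flat.shape[0] - 1)
    eigval, eigvec = np.linalg.eigh(cov)
    comps = eigvec[:, np.argsort(eigval)[::-1][:d]]
    reduced = (flat @ comps).reshape(m, n, d)
    # Zero-pad so that edge pixels can also serve as patch centres
    p = w // 2
    padded = np.pad(reduced, ((p, p), (p, p), (0, 0)))
    # Sliding-window patches with stride 1; drop background (label 0)
    patches, ys = [], []
    for i in range(m):
        for j in range(n):
            if labels[i, j] == 0:
                continue
            patches.append(padded[i:i + w, j:j + w, :])
            ys.append(labels[i, j])
    return np.stack(patches), np.array(ys)
```

Batching the resulting samples in groups of 64 then yields the 64×w×w×d input format described above.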

Initial convolution and activation module
During the initialization phase of the model, as shown in Fig 3, the aim is to enhance the model's ability to learn and extract complex features from the raw data. Specifically, the first depthwise convolutional layer (Depthwise Convolution, DW Conv) employs d 5×5 convolutional kernels to extract spatial features from each independent channel.
Subsequently, the data processed by the convolutional layer are passed through a normalization layer and a Rectified Linear Unit (ReLU) activation layer. The normalization layer standardizes the convolved data, enhancing the model's generalization capability and accelerating the training process. It also aids in reducing noise interference by ensuring that all features are scaled equally, mitigating the unequal impact of noise on features at different scales. The ReLU layer introduces non-linearity, boosting the model's ability to learn complex data structures. After processing through these two layers, the structural dimensions of the data remain unchanged.
The pointwise convolution layer (PW Conv), using 64 1×1×d convolutional kernels, follows next. It integrates the outputs from the normalization and activation layers and adjusts the number of output channels. Another pass through a normalization layer and a ReLU activation layer completes the feature extraction function of the initialization convolution module on the raw data. In this manner, 64 distinct feature maps are extracted from the channels of the raw data, with two example feature maps illustrated in Fig 4.
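The initialization module described above might be sketched in PyTorch as follows. The channel counts follow the text; the padding of 2 (which keeps spatial dimensions unchanged, as the text requires) and the use of BatchNorm as the normalization layer are assumptions.

```python
import torch
import torch.nn as nn

class InitBlock(nn.Module):
    """Sketch of the initial module: a 5x5 depthwise convolution per input
    band, BatchNorm + ReLU, a 1x1 pointwise convolution to 64 channels,
    then BatchNorm + ReLU again."""
    def __init__(self, d=20, out_channels=64):
        super().__init__()
        # groups=d makes this a depthwise convolution: one 5x5 kernel per band
        self.dw = nn.Conv2d(d, d, kernel_size=5, padding=2, groups=d)
        self.bn1 = nn.BatchNorm2d(d)
        # pointwise 1x1 convolution integrates channels and sets their count
        self.pw = nn.Conv2d(d, out_channels, kernel_size=1)
        self.bn2 = nn.BatchNorm2d(out_channels)

    def forward(self, x):
        x = torch.relu(self.bn1(self.dw(x)))
        return torch.relu(self.bn2(self.pw(x)))
```

For a 64×20×21×21 input batch, the block produces a 64×64×21×21 output, i.e. 64 feature maps per sample with spatial size preserved.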

Multi-scale capsule convolution module
The multi-scale capsule convolutional module, illustrated in Fig 5, is designed to extract multi-scale features and pose information from HSI, thereby enhancing the model's performance and classification accuracy on complex datasets. The module employs a structure akin to the Inception model for processing data activated by ReLU. The initial stage includes an adjustment layer with 1×1×64 convolutional kernels, aimed at modulating the number of feature maps for the subsequent multi-scale convolutional layers.
Specifically, for the paths with 3×3 and 5×5 convolutional kernels, the output of this adjustment layer comprises 96 feature maps. The subsequent 3×3 and 5×5 depthwise convolutional layers further process these feature maps, ultimately outputting 128 feature maps through a 1×1×96 pointwise convolutional layer. This design is intended to capture more refined spatial features while enhancing feature extraction capability by increasing the number of convolutional kernels.
For the paths with larger convolutional kernels (i.e., 7×7, 9×9, and 11×11), on the other hand, the adjustment layer outputs 16 feature maps. After processing by the 7×7, 9×9, and 11×11 depthwise convolutional layers and a 1×1×16 pointwise convolutional layer, each path ultimately outputs 64 feature maps. This approach not only reduces the number of parameters but also effectively captures coarse spatial features through convolutional kernels of different sizes, achieving a balance between model simplification and precise feature extraction in HSI processing.
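Taken together, the five parallel paths might be sketched in PyTorch as follows. Kernel sizes and channel counts (96→128 for the 3×3 and 5×5 paths, 16→64 for the 7×7, 9×9, and 11×11 paths) follow the text; the padding choices, which preserve spatial size, are assumptions.

```python
import torch
import torch.nn as nn

def branch(in_ch, mid_ch, k, out_ch):
    """One multi-scale path: 1x1 adjustment -> kxk depthwise -> 1x1 pointwise."""
    return nn.Sequential(
        nn.Conv2d(in_ch, mid_ch, 1),                                  # adjustment layer
        nn.Conv2d(mid_ch, mid_ch, k, padding=k // 2, groups=mid_ch),  # depthwise kxk
        nn.Conv2d(mid_ch, out_ch, 1),                                 # pointwise 1x1
    )

class MultiScaleModule(nn.Module):
    """Sketch of the Inception-like multi-scale stage with five parallel paths."""
    def __init__(self, in_ch=64):
        super().__init__()
        self.paths = nn.ModuleList(
            [branch(in_ch, 96, k, 128) for k in (3, 5)] +
            [branch(in_ch, 16, k, 64) for k in (7, 9, 11)]
        )

    def forward(self, x):
        # per-scale feature maps; downstream capsule layers process each
        # path independently before the final concatenation
        return [p(x) for p in self.paths]
```

The smaller-kernel paths carry more channels (fine detail), while the larger-kernel paths carry fewer (coarse context), matching the parameter-saving rationale above.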
Following these convolutional layers is a primary capsule layer that integrates pose information and activation values. Subsequently, the model incorporates two intermediate capsule layers. Ultimately, the feature maps output by the five different convolutional-kernel pathways are concatenated along the feature map dimension. At the end of the model, classification is performed by a classification capsule layer. Unlike traditional fully connected layers, this classification layer uses the activation values within the feature maps to vote, facilitating effective classification.

Capsule layer module
The capsule layer structure within the multi-scale capsule convolution module is illustrated in Fig 6. The Primary Capsule Layer enhances the richness and precision of feature representation by integrating pose information and activation values. In this layer, the number of pose feature maps is given by B×P², where B represents the number of capsule types and P denotes the dimension of the pose matrix. Using a 1×1×128 convolutional kernel, the pose layer generates B×P² feature maps containing pose information. Similarly, the activation layer employs a 1×1×128 convolutional kernel to produce 32 feature maps containing activation values, culminating in a feature map dimension of B×P²+32. Subsequently, the model incorporates two intermediate capsule layers, both utilizing the Expectation-Maximization (EM) routing algorithm for optimized processing. Specifically, the first intermediate capsule layer comprises a 3×3 depthwise convolutional layer and a 1×1×544 pointwise convolutional layer, aimed at capturing more complex feature information. The second intermediate capsule layer includes a 3×3 depthwise convolutional layer and a 1×1×272 pointwise convolutional layer, further refining feature extraction and processing.

Loss function
In capsule networks, the transformation matrix W converts the outputs of lower-level capsules (a group of neurons) into the inputs of higher-level capsules, allowing the lower-level capsules to predict the activation states of the higher-level capsules. MDSC-Net employs a spread loss function for training this transformation matrix W. This loss function is specifically designed for optimizing capsule networks, aiming to diminish the impact of model weight initialization and to streamline the hyperparameter tuning process. The spread loss L_i is calculated as follows:

L_i = (max(0, m − (a_t − a_i)))²,   L = Σ_{i≠t} L_i

Here, m represents the margin, while a_t and a_i denote the activations of the target class t and class i, respectively. The model incurs a penalty equal to the squared shortfall from the margin m whenever the activation gap between the true class and another class is less than m. To prevent the emergence of ineffective capsules at the onset of training, the initial margin m is set to a low value of 0.2.
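The spread loss above can be implemented in a few lines; this sketch operates on a (batch, num_classes) tensor of class activations and follows the formula as written.

```python
import torch

def spread_loss(activations, target, margin=0.2):
    """Spread loss: penalise the squared shortfall whenever the target-class
    activation a_t does not exceed a wrong-class activation a_i by at least
    the margin m. `activations` has shape (batch, num_classes)."""
    a_t = activations.gather(1, target.unsqueeze(1))            # (batch, 1)
    gap = torch.clamp(margin - (a_t - activations), min=0.0) ** 2
    gap.scatter_(1, target.unsqueeze(1), 0.0)                   # exclude i = t
    return gap.sum(dim=1).mean()
```

With a comfortable activation gap (e.g. a_t − a_i = 0.8 > m) the loss is zero; with a gap of 0.05 against m = 0.2 the penalty is (0.15)² = 0.0225 for that class.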

Model architecture
The detailed description of the MDSC-Net model includes the types of layers used, the dimensions of the generated output maps, and the required number of parameters. The specific layer types and their corresponding parameters are illustrated in detail in Fig 7.


Experimental results and analysis

Experimental setup and dataset description
All experiments and testing were carried out on the PyTorch 1.10.2 framework, accelerated with CUDA 11.3. The Adam optimizer was combined with an exponentially decaying learning rate to optimize the training strategy. The initial learning rate was set at 0.003 and decreased with a decay rate of 0.96 as the number of iterations increased. The batch size was 64 in all cases. To achieve optimal model performance on each dataset, the number of epochs was 400 for the KSC dataset, 300 for Pavia University, and 200 for the Salinas dataset. Training was performed on a desktop computer running Windows 11 with an Intel Core i7-12700H processor, an NVIDIA RTX 3070 Ti graphics card with 8GB of VRAM, and 16GB of system RAM.
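The optimizer configuration described above might be set up as follows. The placeholder model and the per-epoch decay stepping are assumptions; the learning rate of 0.003 and decay factor of 0.96 follow the text.

```python
import torch

# Adam with an initial learning rate of 0.003 and exponential decay
# (gamma = 0.96) applied once per epoch.
model = torch.nn.Linear(10, 2)          # placeholder for MDSC-Net
optimizer = torch.optim.Adam(model.parameters(), lr=0.003)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.96)

for epoch in range(3):                  # 400/300/200 epochs in the paper
    optimizer.step()                    # training step would go here
    scheduler.step()                    # lr <- lr * 0.96
```

After n epochs the learning rate is 0.003 × 0.96ⁿ, giving the gradual decay described above.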
To comprehensively evaluate the performance of the newly proposed MDSC-Net model, three publicly available hyperspectral remote sensing imaging datasets were selected: the Kennedy Space Center (KSC) Dataset, the Pavia University Dataset, and the Salinas Dataset. The details of the datasets are as follows:
1. The KSC Dataset was acquired from the area around the Kennedy Space Center, USA. It comprises 13 categories across 224 spectral bands, reduced to 176 after removing water-vapor noise. The resolution of the original image is 512×614 pixels, equating to an 18m×18m area per pixel. For computational efficiency in the MDSC-Net model, the band count was reduced to 20 through PCA, and the images were resized to 532×634 pixels and further segmented into 21×21 pixel blocks. All background pixel blocks (labeled as zero) were excluded from the experimental dataset. The category details of the Kennedy Space Center dataset are shown in Table 1. In the entire dataset, there are 260 samples in the training set, 1823 samples in the validation set, and 3128 samples in the test set, with the training set comprising 5% of the total samples.
3. Designed for hyperspectral image classification, the Salinas Dataset was obtained over California's Salinas Valley, USA, using the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS). It features images of 512×217 pixels, a 3.7m spatial resolution, and originally 224 bands, reduced to 204 after noise removal. The ground truth consists of 16 types of features, including vegetables, bare soil, and vineyards. The original image is preprocessed to obtain blocks with 30 bands and a size of 23×23 pixels. The category details of the Salinas dataset are shown in Table 3. In the entire dataset, there are 2706 samples in the training set, 25711 samples in the validation set, and 25712 samples in the test set, with the training set comprising 5% of the total samples.

Evaluation indicators
Several widely utilized quantitative metrics are employed to assess the proposed network model. These include Overall Accuracy (OA), which reflects the proportion of correctly classified samples and serves as an intuitive performance measure; Average Accuracy (AA), ensuring balanced performance assessment across different categories; Precision, measuring the accuracy of predicted positive samples; Recall, indicating the proportion of actual positives correctly predicted; the Kappa coefficient, evaluating the degree to which the model's performance surpasses chance; and the F1 Score (F1), providing a comprehensive assessment of classification accuracy and recall by calculating the harmonic mean of Precision and Recall. The formulae for these metrics are as follows:

OA = (1/N) Σ_{i=1}^{C} x_ii
AA = (1/C) Σ_{i=1}^{C} Accuracy_i
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 × Precision × Recall / (Precision + Recall)
Kappa = (OA − p_e) / (1 − p_e)

In these equations, C is the total number of categories; N is the total number of samples; x_ii represents the number of samples of class i correctly identified as class i; Accuracy_i is the precision of class i; p_e is the expected agreement by chance; True Positives (TP) denotes the number of actual positives correctly identified; False Positives (FP) indicates incorrect positive identifications among negatives; and False Negatives (FN) represents missed positives, where actual positives are incorrectly classified as negatives.
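OA, AA, and Kappa can be computed directly from a confusion matrix; this sketch follows the definitions above, taking per-class accuracy as the diagonal entry divided by the class's row total.

```python
import numpy as np

def classification_metrics(conf):
    """Compute OA, AA, and Kappa from a confusion matrix `conf`, where
    conf[i, j] counts class-i samples predicted as class j (x_ii on the
    diagonal, N the total sample count)."""
    conf = np.asarray(conf, dtype=np.float64)
    n = conf.sum()
    oa = np.trace(conf) / n                               # overall accuracy
    per_class = np.diag(conf) / conf.sum(axis=1)          # per-class accuracy
    aa = per_class.mean()                                 # average accuracy
    # chance agreement p_e from the marginal row/column totals
    pe = (conf.sum(axis=0) * conf.sum(axis=1)).sum() / n ** 2
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa
```

For example, the matrix [[50, 10], [5, 35]] yields OA = 0.85 with a chance agreement of 0.51, giving Kappa ≈ 0.694.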

Results of ablation studies
To validate the effectiveness of the proposed MDSC-Net model, ablation experiments were conducted on the Salinas dataset. The performance of different models was compared using three metrics: Overall Accuracy (OA), Average Accuracy (AA), and the Kappa score (Kappa), with consistent parameter settings and training strategies across models. Two comparative models were selected: one is a capsule network model using a single-scale instead of a multi-scale convolutional kernel, and the other is a multi-scale convolutional network model using a max-pooling layer directly instead of the dynamic routing mechanism of the capsule network. A description of each model is given in Table 4, with the comparative results of the ablation studies presented in Table 5.
Table 5 shows that the overall performance of MDSC-Net is significantly improved by extracting multi-level features through multi-scale convolutional kernels and replacing the max-pooling layer with the dynamic routing mechanism of the capsule network. Specifically, compared with MDSC-0 (the capsule network model with a single-scale convolutional kernel), MDSC-Net improves the three key performance indexes of OA, AA, and Kappa by 1.94%, 1.91%, and 1.23%, respectively. Against MDSC-1 (the multi-scale convolutional network model with a max-pooling layer), the improvements were 7.44%, 7.50%, and 7.44% in the same metrics. The comparison between MDSC-0 and MDSC-Net highlights that multi-scale convolutional kernels can capture rich multi-level features, effectively enhancing the capsule network's capability to process ground detail and multi-level information in HSI. The comparison between MDSC-1 and MDSC-Net indicates that substituting traditional max-pooling with the dynamic routing mechanism of capsule networks prevents the loss of precise location and pose information of objects, realizes the translational invariance of the model's features, and further improves feature extraction ability.
The classification results of the ablation experiments are presented in Fig 8, including: (a) the original image of the dataset; (b) the manually annotated ground truth; (c) the output of the capsule network model using only a single-scale convolutional kernel; (d) the output of the multi-scale convolutional network model with a max-pooling layer; and (e) the output of the MDSC-Net model, which displays more refined features compared to (c) and (d). This result is attributed to MDSC-Net's ability to extract rich hierarchical features through multi-scale convolutional kernels while effectively preserving object pose information via the capsule network, thereby extracting the feature details of HSI more comprehensively.
The first comparative experiment, shown in Table 6, provides a quantitative analysis of the classification results on the Kennedy Space Center dataset. The MDSC-Net model achieves better classification accuracy than all other methods, with AA, OA, F1, and Recall of 93%, 94%, 94%, and 92%, respectively. The experiment shows that, compared with SPP, DCNN, and 3-D CNN, the metrics AA, OA, Recall, and F1 Score improved by 1% to 8%. The overall parameter count is higher than that of CNN_HSI and lower than that of SpectralNET. It can be concluded that the proposed MDSC-Net provides superior classification results on the Kennedy Space Center dataset with an acceptable parameter volume.
The second experiment, illustrated in Fig 10, presents the classification results of various models on the Pavia University dataset, with our proposed method achieving the most accurate segmentation results. Table 7 lists the classification results on the Pavia University dataset. Compared to the six comparative models, MDSC-Net achieved AA, OA, F1, and Recall of 97%, 98%, 94%, and 92%, respectively. In summary, MDSC-Net exhibited better segmentation outcomes than the benchmark algorithms on the Pavia University dataset, showing superior performance with a reasonable number of parameters and offering significant advantages over other models. These results confirm the robustness and broad applicability of MDSC-Net.
In the third experiment, as shown in Fig 11, the proposed method achieved better classification results on the Salinas dataset, with finer extraction results from minor to major categories, closely approximating the manually annotated Ground Truth.
As can be seen from Table 8, on the Salinas dataset the proposed method still maintains state-of-the-art performance, with AA, OA, F1 Score, and Recall all at 99%. For hyperspectral remote sensing image classification, some models are complex [34][35][36], while others capture insufficient semantic information [37][38][39], necessitating efficient classification algorithms that ensure high accuracy while quickly processing large volumes of remote sensing image data. The introduction of the multi-scale convolution strategy increases computational complexity due to the parallel computation over multiple kernel scales; nevertheless, the overall approach not only enhances accuracy but also significantly reduces the parameter count, making it more suitable for real-world image classification tasks. This is attributed to MDSC-Net's adoption of multi-scale convolutional kernels, depthwise separable convolutions, and capsule networks, which together enhance its ability to efficiently process and interpret complex image data while minimizing computational overhead.

Cross-validation experiment and analysis
To ensure the effectiveness and generalization ability of the proposed MDSC-Net model, this study employs 5-fold cross-validation on the Kennedy Space Center dataset. In 5-fold cross-validation, the dataset is divided equally into five subsets, with each subset used as the test set in turn while the remaining four subsets serve as the training set. This validation method effectively reduces random errors and the risk of overfitting during model training. From the comparative experiments, model parameter experiments, and cross-validation experiments, the following observations can be made: by employing multi-scale convolutional kernels, depthwise separable convolutions, and capsule networks, MDSC-Net can effectively extract complex ground object information from hyperspectral images, maintain the translational invariance of features, and significantly reduce computational complexity and the number of parameters, thereby optimizing performance and efficiency.
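The 5-fold split described above might be sketched as follows; the shuffling seed is an illustrative assumption.

```python
import numpy as np

def five_fold_indices(n_samples, seed=0):
    """Sketch of 5-fold cross-validation: shuffle the sample indices,
    divide them into five (near-)equal parts, and yield each part in
    turn as the test fold with the rest as training data."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    folds = np.array_split(idx, 5)
    for k in range(5):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(5) if j != k])
        yield train, test
```

Every sample appears in exactly one test fold, so each is evaluated once while contributing to training in the other four folds.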
Compared to state-of-the-art models, MDSC-Net performs better in terms of Overall Accuracy (OA), verifying its consistency and stability across all categories. Observing the confusion matrix, it is evident that MDSC-Net achieves high classification accuracy even in small-sample categories. For instance, in categories 1, 4, 5, and 6 of the Kennedy Space Center dataset, despite having fewer training samples, the classification accuracy does not significantly decrease. However, the MDSC-Net model has certain limitations. Due to the reduction in the number of parameters, the time complexity is relatively increased, requiring approximately 4 hours of training on a computer equipped with a 3070 Ti GPU.

Conclusion
This article presents a Multi-Scale Depthwise Separable Capsule Network for hyperspectral image (HSI) classification, demonstrating significant effectiveness in the classification of hyperspectral imagery. By employing multi-scale convolutional kernels, the network is capable of capturing features across different scales, effectively deciphering the complex terrain details and hierarchical information in HSIs. Moreover, the model employs depthwise separable convolutions to streamline processing, achieving efficient feature extraction with reduced computational load. The incorporation of capsule networks, as an improvement over traditional max-pooling layers, allows the model to reduce the size of feature maps while maintaining translational invariance and preventing the loss of terrain pose information. Ablation studies show that MDSC-Net, with its multi-scale convolutional kernels and capsule networks, significantly enhances HSI classification performance. Comparative experiments on three HSI datasets further affirm the performance advantages of this model over existing ones, including structural simplicity, lower computational complexity, and higher classification accuracy.
For future work, this model can be utilized in precision agriculture for crop health monitoring. By analyzing hyperspectral images, issues such as pest infestations and nutrient deficiencies can be detected at an early stage. This study has some limitations, such as the potential for improved classification accuracy on datasets with a smaller number of samples. Additionally, the generalizability of the model to other hyperspectral image datasets remains to be verified. Challenges also exist regarding processing speed in real-time applications and computational demands on resource-limited devices such as embedded systems. Future research should focus on exploring effective ways to handle large datasets under constrained computational resources. Incremental Principal Component Analysis (IPCA) and incremental learning methods can be employed to gradually extract the main components of the data, reducing the demand for computational resources and storage, and supporting dynamic model updates to lower computational costs. Moreover, the MDSC-Net model faces the challenge of maintaining efficient performance on battery-powered devices. This requires adjustments through quantization and sparsity techniques to reduce energy consumption, as well as the exploration of adaptive computing methods to achieve an optimal balance between performance and energy consumption. Furthermore, investigating cross-modal learning strategies, such as combining HSI data with LiDAR data, may enhance the accuracy and robustness of HSI classification, thereby improving the performance and applicability of the MDSC-Net model in real-world scenarios. Finally, to enhance the model's generalizability across different datasets, it is crucial to test and optimize it on a more diverse range of datasets.
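The IPCA idea mentioned above, extracting principal components without holding the full dataset in memory, can be sketched in a minimal batch-wise form. In practice a library routine such as scikit-learn's IncrementalPCA would be used; the moment-accumulation scheme below is a simplified stand-in that only ever stores per-band running sums:

```python
import numpy as np

def streaming_pca(batches, n_components):
    """Accumulate first and second moments over data batches, then
    eigendecompose the covariance - the full dataset never needs
    to be resident in memory at once."""
    n, s, ss = 0, None, None
    for X in batches:                       # X: (batch_size, n_bands)
        X = np.asarray(X, dtype=float)
        if s is None:
            s = np.zeros(X.shape[1])
            ss = np.zeros((X.shape[1], X.shape[1]))
        n += X.shape[0]
        s += X.sum(axis=0)
        ss += X.T @ X
    mean = s / n
    cov = ss / n - np.outer(mean, mean)     # E[x x^T] - mu mu^T
    vals, vecs = np.linalg.eigh(cov)
    order = np.argsort(vals)[::-1][:n_components]
    return mean, vecs[:, order]             # project with (X - mean) @ vecs
```

Memory use is O(n_bands²) regardless of sample count, which is the property that matters on the embedded and battery-powered platforms discussed above.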

2. The Pavia University Dataset was collected over the University of Pavia, Italy, and is widely used in hyperspectral image classification and clustering algorithm research. The dataset is 610×340 pixels in size, has a spatial resolution of 1.3 meters, and initially contains 103 bands. It covers three different scenes: urban, rural, and wilderness, showcasing a variety of ground objects, including various buildings, land covers, and plant types. After compressing to 20 bands and dividing into 19×19 pixel blocks, the category details of the Pavia University dataset are shown in Table 2. In the entire dataset, there are 4277 samples in the training set, 11549 samples in the validation set, and 26950 samples in the test set, with the training set comprising 10% of the total samples.
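The blocking step described above (20 retained bands, 19×19 pixel blocks) can be sketched as follows. Centring one patch on each labelled pixel with reflect padding at the image borders is a common convention in HSI pipelines; the article does not state its exact blocking scheme, so this is an assumed variant:

```python
import numpy as np

def extract_patches(cube, labels, patch=19):
    """Cut a (H, W, bands) cube into patch x patch blocks centred on
    each labelled pixel; borders are handled with reflect padding."""
    r = patch // 2
    padded = np.pad(cube, ((r, r), (r, r), (0, 0)), mode="reflect")
    X, y = [], []
    for i, j in zip(*np.nonzero(labels)):   # labelled pixels only
        X.append(padded[i:i + patch, j:j + patch, :])
        y.append(labels[i, j])
    return np.stack(X), np.array(y)
```

Each resulting block of shape (19, 19, 20) pairs the centre pixel's label with its spatial neighbourhood, which is what lets the network exploit spatial as well as spectral information.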

Fig 9 demonstrates the classification results of various classification models on the Kennedy Space Center dataset. Different subgraphs are labeled in the figure: (a) the RGB composite image; (b) the manually annotated ground truth; (c) to (h) the outputs of several advanced classification techniques, including SPP, DCNN, 3-D CNN, SPL-SR, CNN_HSI, and SpectralNET; and (i) the proposed MDSC-Net model.

Table 4. Model variants: a capsule network model utilizing only single-scale convolutional kernels; MDSC-1, a multi-scale convolutional network model employing max-pooling layers; and MDSC-Net, a multi-scale network model with an alternative to max-pooling layers. https://doi.org/10.1371/journal.pone.0308789.t004

Table 6. Quantitative comparative experimental results on the KSC dataset.
ensuring the reliability and consistency of the experimental results. The confusion matrix, averaged over the five test sets, is shown in Fig 13.