Land Cover Mapping in a Mangrove Ecosystem Using Hybrid Selective Kernel-Based Convolutional Neural Networks and Multi-Temporal Sentinel-2 Imagery

Abstract: Mangrove ecosystems provide numerous ecological services and serve as vital habitats for a wide range of flora and fauna. Accurate mapping and monitoring of the relevant land covers in mangrove ecosystems are therefore crucial for effective conservation and management efforts. In this study, we proposed a novel approach for mangrove ecosystem mapping using a Hybrid Selective Kernel-based Convolutional Neural Network (HSK-CNN) framework and multi-temporal Sentinel-2 imagery. A time series of Normalized Difference Vegetation Index (NDVI) products derived from Sentinel-2 imagery was produced to capture the temporal behavior of land cover types in the dynamic ecosystem of the study area. The proposed algorithm integrated Selective Kernel-based feature extraction techniques to facilitate the effective learning and classification of multiple land cover types within the dynamic mangrove ecosystem. The model demonstrated a high Overall Accuracy (OA) of 94% in classifying eight land cover classes, including mangrove, tidal zone, water, mudflat, urban, and vegetation. The HSK-CNN demonstrated superior performance compared to other algorithms, including random forest (OA = 85%), XGBoost (OA = 87%), Three-Dimensional (3D)-DenseNet (OA = 90%), Two-Dimensional (2D)-CNN (OA = 91%), Multi-Layer Perceptron (MLP)-Mixer (OA = 92%), and Swin Transformer (OA = 93%). Additionally, it was observed that the structure of the network, such as the types of convolutional layers and patch sizes, affected the classification accuracy of the proposed model, and thus the optimum scenarios and values of these parameters should be determined to obtain the highest possible classification accuracy. Overall, the produced map offers valuable insights into the distribution of different land cover types in the mangrove ecosystem, facilitating informed decision-making for conservation and sustainable management efforts.


Introduction
Mangrove ecosystems are critically important coastal environments that provide a wide range of ecological and socio-economic services [1]. These unique ecosystems act as natural barriers against coastal erosion and storm surges, protecting coastlines and inland communities [2]. They serve as nurseries for many fish species, supporting both local livelihoods and global fisheries [3]. In addition, these ecosystems support rich biodiversity, providing habitats for numerous species of birds, mammals, and marine life [4]. Mangroves, as a constituent of these ecosystems, are also significant carbon sinks, sequestering carbon at rates far exceeding those of terrestrial forests and thus playing a critical role in mitigating climate change [5,6]. It is, however, essential to recognize that mangrove ecosystems extend beyond the boundaries of the mangroves themselves. These intricate ecosystems encompass a multitude of interrelated land cover types, such as mangroves, tidal zones, mudflats, water bodies, and barren land, that impact the entire ecosystem [7]. Therefore, monitoring all relevant land covers in such an ecosystem is essential to support the preservation of its ecological balance [8].
Mangroves live in brackish seawater and are highly salt-tolerant due to their unique ultrafiltration and biological mechanisms [9]. These valuable ecosystems provide manifold environmental, ecological, and socio-economic benefits, including water purification [10], coastal protection [11], biodiversity conservation [12], and ecotourism [13]. Furthermore, mangroves have a high potential to sequester atmospheric carbon and help mitigate climate change [14,15]. In particular, the Intergovernmental Panel on Climate Change (IPCC) reported the significance of mangroves as blue carbon sources with high carbon burial and accumulation [16]. Although mangroves offer numerous benefits, their extent has declined continuously over the past decades. For example, it has been argued that 62% of global mangrove losses between 2000 and 2016 were driven by direct human impacts [17,18]. These mangrove losses have been associated with natural events, including tropical cyclones [19] and droughts [20], as well as anthropogenic activities, such as aquaculture development [21] and urban expansion [22]. Therefore, it is vital to establish robust workflows to monitor mangrove ecosystems for effective conservation and to avoid further degradation.
In situ data collection is a conventional approach for mapping and monitoring mangrove ecosystems. Such an approach can deliver highly accurate information about the ecosystem; however, it is resource-intensive and impractical for frequent large-scale monitoring. These issues become even more challenging because of access limitations to these ecosystems, as they are tidally inundated. Consequently, employing other advanced approaches that reduce such limitations and provide reliable information is highly imperative. For example, leveraging remote sensing technologies along with advanced Artificial Intelligence (AI) models helps overcome these limitations and provides frequent synoptic views of mangroves anywhere in the world [23]. Furthermore, the use of open-access satellite images significantly reduces expenses and enables low-cost monitoring programs for mangrove ecosystems. Overall, remote sensing allows the accurate mapping of mangrove ecosystems, which is one of the fundamental needs for monitoring and conservation [23][24][25]. Therefore, many studies have been conducted to map mangrove ecosystems using different remote sensing datasets and AI algorithms [26,27]. For instance, multispectral [28], hyperspectral [29], Light Detection and Ranging (LiDAR) [30], and Synthetic Aperture Radar (SAR) [31] remote sensing data have been employed for mangrove mapping in different parts of the world. These datasets have been combined with AI algorithms, such as the Support Vector Machine (SVM), random forest, and Artificial Neural Networks (ANNs). For instance, Ghorbanian et al. [23] applied SAR and multispectral data from Sentinel-1 and Sentinel-2 to a random forest algorithm within the Google Earth Engine (GEE) platform for mangrove ecosystem mapping in Iran. Likewise, Kumar et al. [32] calculated vegetation indices using hyperspectral data from Hyperion and fed them into an SVM algorithm for mangrove mapping in India.
Previous studies have demonstrated the applicability of combining remote sensing data and AI algorithms for mangrove mapping. However, few studies have explored the potential of deep learning and Convolutional Neural Network (CNN) algorithms for mangrove ecosystem mapping. These algorithms have become prevalent for land cover mapping in other disciplines because of their advantages in feature extraction and hierarchical learning abilities [33]. Accordingly, scholars have started to implement such algorithms for mangrove mapping. For instance, de Souza Moreno et al. [34] used Sentinel-1 time-series data to map mangroves in the Cananéia-Iguape Estuarine Complex, Brazil. To this end, three backbones comprising the Visual Geometry Group (VGG-16), Residual Network (ResNet-101), and EfficientNet-B7 were incorporated within the UNet architecture, and their results confirmed the better performance of EfficientNet-B7. An F-score of 85.36% was achieved using the annual time series of Sentinel-1 data (29 images) with both polarizations. Likewise, Guo et al. [35] implemented the Capsules-UNet architecture to extract mangroves along the Maritime Silk Road. They employed Landsat archive imagery from 1990, 2000, 2010, and 2015 and produced four mangrove maps with an average F-score of 73.25%. They reported a significant decline in mangrove cover (approximately 21.5%, equal to 1,356,686 ha) during the last 25 years in their study area. In another study, Sentinel-2 spectral bands and indices, such as the Normalized Difference Vegetation Index (NDVI), were fed into a modified ResNet-101 model for mangrove extraction in Hainan Island, China [36]. The ResNet-101 architecture was modified with Multiscale Context Embedding (MCE), a Boundary Fitting Unit (BFU), and a Global Attention Model (GAM) and obtained an F-score of 97.87%. Finally, Jamaluddin et al. [37] developed a Fully Convolutional Network (FCN) to map mangrove degradation after Hurricane Irma in 2017. This network applied Convolutional Long Short-Term Memory (ConvLSTM) for feature extraction from bi-temporal Sentinel-2 imagery and reported that 26.64% (i.e., around 41,000 ha) of mangroves along the southwest coast of Florida were degraded.
Although previous studies have successfully combined remote sensing and deep learning algorithms for coastal land cover mapping, they considered only two or three land cover classes, with a focus on delineating mangrove extents [26][27][28]. Therefore, further studies are required to investigate the applicability of deep learning algorithms for detailed mangrove ecosystem mapping. The importance of considering detailed land covers lies in the necessity of mapping the essential components of a mangrove ecosystem (e.g., the tidal zone and mudflat) for ecological balance preservation [6,38]. Moreover, the potential of the most recent CNN models for mangrove mapping has not been thoroughly investigated. Accordingly, in this study, a new Hybrid Selective Kernel-based CNN (HSK-CNN) algorithm was developed for detailed mangrove ecosystem mapping using multi-temporal open-access Sentinel-2 imagery. The Selective Kernel (SK) module allows for multiscale spatial information extraction from remote sensing imagery to improve the mapping results [39]. Furthermore, a hybrid approach was adopted by combining Two-Dimensional (2D) and Three-Dimensional (3D) convolutional layers, with the advantage of injecting temporal information. The utility of temporal information has been reported to be effective in reducing water level fluctuation and inundation effects in the inter-tidal zones where mangroves grow [23]. Finally, the robustness of the proposed algorithm was validated through visual interpretation, statistical accuracy assessment, and comparison of its results with those of other state-of-the-art algorithms.

Study Area
The study area (Figure 1) contains the largest mangrove ecosystem in the Persian Gulf and Oman Sea [40], which has a substantial environmental and ecological impact on nearby regions. It is located on the Khuran Strait, between the northern coast of Qeshm Island and Hormozgan Province, in the south of Iran. The mangrove ecosystem, called Harra by the local community, covers an area of 20 km × 20 km between latitudes 26°43′ and 26°59′N and longitudes 55°28′ and 55°48′E. The ecosystem is formed by a network of shallow waterways with many tidal channels and is a habitat for a wide variety of flora and fauna [41]. The semi-diurnal tidal fluctuations cause the water level to change between 0.3 m and 4.6 m [42], emphasizing the importance of considering these variations (i.e., the tidal effect) in mapping studies. At high tide, only the tree canopy is visible above the water level, while trunks and aerial roots are underwater and can be seen only at low tide. This ecosystem is part of Iran's national protected areas and is covered by several international protection programs, such as the Ramsar Convention. Avicennia marina is the major mangrove species in this region, growing in sediments with a small amount of dissolved oxygen and high salinity concentrations. Commercial practices, such as oil leakage, fishing, shrimp farming, tourist boat trips, and mangrove leaf cutting, are among the factors disturbing these valuable natural environments.


Reference Polygons for Classification
Reference samples are required either to train supervised algorithms or to evaluate their performance. To collect reliable reference samples, very high-resolution satellite images embedded in the Google Earth and Esri basemaps were used as the primary sources. Additionally, four other sources, false-color (NIR-Red-Green bands) Sentinel-2 satellite images, previous maps of the study area, a global mangrove extent map [43], and a global distribution of tidal flats map [44], were considered to ensure the reliability of sample collection. The reference samples were collected in Geographic Information System (GIS) polygon format with a suitable spatial distribution over the study area (see Figure 1). Attention was paid to selecting homogeneous sites for each land cover class and avoiding mixed pixels. In total, 824 reference polygons (304.83 ha) for eight major land cover classes were collected (see Table 1). These reference samples were randomly divided into test (70%) and train (30%) datasets in polygon format to ensure spatially disjoint samples in each dataset. The train dataset was further divided into training (80%, equal to 24% of the total reference polygons) and validation (20%, equal to 6% of the total reference polygons) subsets. This resulted in a final distribution of 70%, 24%, and 6% for the test, training, and validation samples, respectively. In the next step, the reference polygons were converted to a raster format, where each raster cell became a sample point. These sample points then served as the center pixels of the extracted patches. For each sample point, a 7 × 7 × 12 pixel window was extracted, where 7 × 7 represents the spatial dimension in pixels and 12 represents the temporal dimension corresponding to the 12 monthly NDVI images used in our analysis. This patch size covers an area of 70 m × 70 m in the Sentinel-2 images for each time step. These patches were allowed to overlap to ensure comprehensive coverage and to capture a wide range of spatial contexts. Furthermore, this provided sufficient spatial context for the model to learn local patterns and textural characteristics of different land cover types while maintaining computational efficiency. These patches, along with their corresponding class labels, were finally utilized as input data to train, validate, and test the developed classification model.
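The polygon-level splitting described above can be sketched as follows. This is a minimal illustration (the function name and random seed are hypothetical, not from the paper), showing how the 70/30 split followed by the 80/20 split yields the final 70/24/6% distribution while keeping each polygon in exactly one subset:

```python
import random

def split_polygons(polygon_ids, seed=42):
    """Split reference polygons (not pixels) into spatially disjoint sets:
    70% test; the remaining 30% is split 80/20 into training (24% overall)
    and validation (6% overall)."""
    ids = list(polygon_ids)
    random.Random(seed).shuffle(ids)
    n_test = round(0.70 * len(ids))
    test_polys, train_pool = ids[:n_test], ids[n_test:]
    n_train = round(0.80 * len(train_pool))
    train_polys, val_polys = train_pool[:n_train], train_pool[n_train:]
    return test_polys, train_polys, val_polys

# For the 824 reference polygons of this study:
test_polys, train_polys, val_polys = split_polygons(range(824))
```

Splitting at the polygon level, before rasterization, is what prevents pixels of one polygon from leaking into both the training and test sets.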

Satellite Imagery
In this study, Sentinel-2 multispectral satellite images were utilized. Sentinel-2 is a European satellite mission launched by the European Space Agency (ESA) carrying the MultiSpectral Instrument (MSI), which is capable of covering the entire surface of the Earth in 10-day cycles. To shorten this revisit gap, the ESA placed two identical satellites, Sentinel-2A/B, into the same orbit with a 180° phase difference, reducing the revisit time to 5 days. The MSI acquires information from the Earth's surface in 13 spectral bands with 10-60 m spatial resolutions. Here, 12 cloud-free MSI images, one per month, captured in 2020, were utilized for the task.

Methodology
In this study, an HSK-CNN model (Figure 2) was developed for accurate mangrove ecosystem mapping. This algorithm can extract deep features from time series of Sentinel-2 images at multiple Receptive Field (RF) sizes. The HSK-CNN is composed of modules with 2D and 3D convolutional layers.

Time-Series NDVI Products
Sentinel-2 imagery can be negatively affected by clouds and atmospheric conditions. The surface reflectance product of Sentinel-2 (COPERNICUS/S2_SR) was therefore utilized, as it is atmospherically corrected by the European Space Agency (ESA), and it was accessed within the Google Earth Engine (GEE) [45]. The GEE reduced the complexity of data manipulation and improved processing without the need to download large volumes of data. Since satellite images acquired at different times of the year improve the accuracy of mangrove ecosystem mapping and reduce the effects of seasonality and diurnal tidal variations, we utilized 12 Sentinel-2 images, one per month, acquired from 1 January 2020 to 28 December 2020. Using these images, the time-series NDVI products were produced (Equation (1)):

NDVI = (NIR − Red) / (NIR + Red) (1)

where NIR and Red refer to the near-infrared (band 8, 842 nm) and red (band 4, 665 nm) spectral bands, respectively.
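Equation (1) can be applied per monthly scene to build the 12-layer stack. A minimal NumPy sketch (array names, sizes, and the epsilon guard are our assumptions for illustration) is:

```python
import numpy as np

def ndvi(nir, red, eps=1e-10):
    """NDVI = (NIR - Red) / (NIR + Red), Equation (1).
    eps guards against division by zero over dark water or shadow pixels."""
    nir = nir.astype(np.float64)
    red = red.astype(np.float64)
    return (nir - red) / (nir + red + eps)

# Stack one NDVI layer per monthly Sentinel-2 scene (hypothetical reflectances):
months = [ndvi(np.random.rand(100, 100), np.random.rand(100, 100))
          for _ in range(12)]
ndvi_stack = np.stack(months, axis=-1)   # shape (H, W, 12)
```

The resulting (H, W, 12) array is the multi-channel input X described in the next paragraphs.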

The input data were prepared in arrays X ∈ R^(H×W×C) over the study area, where H, W, and C indicate the height, width, and number of image channels, respectively. In our case, where time-series datasets were used, the number of channels refers to the considered time epochs of the 12 NDVI images within the year.
Patches with equal shapes were required to be fed as input data into the network. Thus, for each pixel (i, j) of the input dataset X, a 3D patch x_(i,j) ∈ R^(P×P×C) with a size of P × P × C was extracted. This neighborhood window was fed into the proposed model, which acts as a mapping function M_θ: X → Y that maps x_(i,j) → y_(i,j), where y_(i,j) is the final land cover class corresponding to the focused pixel of x_(i,j), Y ∈ R^(H×W) is the classification map, and θ are the learnable parameters of the network.
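As a concrete illustration of extracting x_(i,j), the following NumPy sketch crops a P × P × C window around each pixel. The reflect-padding border policy is our assumption, since the text does not state how edge pixels were handled:

```python
import numpy as np

def extract_patch(X, i, j, P=7):
    """Return the P x P x C neighborhood x_{i,j} centered on pixel (i, j).
    Edge pixels are handled by reflect-padding the array (an assumption;
    the paper does not specify its border policy)."""
    r = P // 2
    Xp = np.pad(X, ((r, r), (r, r), (0, 0)), mode="reflect")
    # After padding, pixel (i, j) of X sits at (i + r, j + r) of Xp,
    # so the P x P crop starting at (i, j) is centered on it.
    return Xp[i:i + P, j:j + P, :]

X = np.random.rand(50, 60, 12)      # H x W x C stack of 12 NDVI layers
patch = extract_patch(X, 0, 0)      # works even at the image corner
```

The predicted class of each patch is then assigned back to its center pixel to build the classification map Y.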

Convolutional Layers
After preparing the input 3D patches, i.e., the time-series NDVI patches and their corresponding reference patches, they were ingested into a hybrid network of 3D and 2D convolution layers. These layers were capable of extracting meaningful features from the input dataset. Each convolution layer was followed by a ReLU activation function and a batch normalization layer. The deep feature extraction section of the proposed network was a combination of three 3D convolutional feature extractors followed by three 2D convolutional feature extractors, where 3D/2D SK modules, along with Global Max Pooling (GMP) layers, were placed between these layers for feature map reduction (see Figure 2).
Convolutional layers are specialized linear operations that were utilized as the core building blocks of the proposed HSK-CNN model. In the 2D convolution layer, the feature map value f at position (α, β) on the y-th feature map of the l-th layer is given by Equation (2) [46,47]:

f_(α,β)^(l,y) = F(b_(l,y) + Σ_m Σ_(ρ=−r)^(r) Σ_(σ=−s)^(s) w_(m,ρ,σ)^(l,y) f_((α+ρ),(β+σ))^((l−1),m)) (2)

where F is the activation function; b_(l,y) is the bias parameter; m indexes the feature maps in the (l − 1)-th layer; w are the kernel weights; and 2r + 1 and 2s + 1 are the width and height of each kernel, respectively. Similarly, the feature map value f of the 3D convolution operation is defined as [48,49]:

f_(α,β,γ)^(l,y) = F(b_(l,y) + Σ_m Σ_(ρ=−r)^(r) Σ_(σ=−s)^(s) Σ_(τ=−t)^(t) w_(m,ρ,σ,τ)^(l,y) f_((α+ρ),(β+σ),(γ+τ))^((l−1),m)) (3)

in which F is the activation function; b_(l,y) is the bias parameter; m indexes the feature maps in the (l − 1)-th layer; and 2r + 1, 2s + 1, and 2t + 1 are the width, height, and depth of each kernel, respectively. The generated deep features were fed into Global Average Pooling (GAP) and GMP layers in parallel, and the concatenated outputs were passed through a Fully Connected (FC) layer and finally a Softmax classifier to predict the land cover class for each input patch. It should be noted that the classification result was assigned to the central pixel of the patch.
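Equation (2) can be made concrete with a naive, loop-based, valid-mode implementation for a single output feature map. This is an illustrative sketch of the operation only, not the network code used in the paper:

```python
import numpy as np

def conv2d_single(x, w, b):
    """Naive valid-mode 2D convolution for one output feature map, in the
    form of Equation (2): x has shape (M, H, W) for M input feature maps,
    w has shape (M, 2r+1, 2s+1), b is a scalar bias, and F = ReLU is
    applied to the result."""
    M, H, W = x.shape
    _, kh, kw = w.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for alpha in range(out.shape[0]):
        for beta in range(out.shape[1]):
            # Sum over input feature maps m and kernel offsets (rho, sigma)
            out[alpha, beta] = np.sum(w * x[:, alpha:alpha + kh,
                                               beta:beta + kw]) + b
    return np.maximum(out, 0.0)          # activation F = ReLU
```

The 3D case of Equation (3) adds one more loop over the depth offset τ; deep learning frameworks fuse these loops into optimized kernels.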

Selective Kernel-Based (SK) Network Modules
The main idea behind the proposed network was to improve the accuracy and efficiency of simple CNN structures by adaptively selecting the most descriptive features using different RF sizes in the network. The RF is a critical hyperparameter in designing CNNs and enables networks to be sensitive to features at different spatial scales. Thus, utilizing fixed RFs can decrease the robustness and generality of a CNN: if the RF size is too large, fine-grained structures may be missed, while small RF sizes cannot detect overall patterns. With the SK network design, the input image can be convolved with different RF sizes in different branches, and the effect of these branches can be adaptively adjusted by an attention mechanism based on the input feature dataset. Taking advantage of the SK modules helps select the most descriptive spatial/spectral feature maps through an attention mechanism based on SK layers. The attention mechanism recalibrates the weights of each feature response nonlinearly with respect to the other features (i.e., channels). As demonstrated in Figure 2, the proposed HSK-CNN was composed of three 3D SK modules followed by three 2D SK modules, where each SK module followed a 3D or 2D convolutional layer. Each SK module contained three main operators, Split, Fuse, and Select, which are further described in the following subsections. As an example, Figure 3 provides a detailed overview of the proposed 3D SK modules in terms of layers, kernel sizes, and output map dimensions.

Split
Multiple paths were generated by the Split operator with different kernel sizes corresponding to different neuron RF sizes. To this end, for a given input patch x_(i,j) ∈ R^(P×P×C), two convolution layers with different kernel sizes (1 × 3 and 3 × 1) were used to generate the feature maps F_2D ∈ R^(P×P×C) and F̃_2D ∈ R^(P×P×C). These feature maps in the 2D SK module were obtained by the convolution layers H_2D^1 and H_2D^2 with kernel sizes 1 × 3 and 3 × 1, respectively. Furthermore, a 3D point-wise convolution H_3D^1 with kernel size 1 × 1 × 3 and a standard 3D convolution layer H_3D^2 with kernel size 3 × 3 × 3 were utilized to produce the feature maps in the 3D SK module for a given 3D patch x_(i,j) ∈ R^(P×P×C). The feature maps were generated by the convolutions followed by Batch Normalization and ReLU functions.



Fuse
This part integrated the multiscale feature maps from the multiple paths in the Split part to obtain a global and comprehensive representation for weight selection. This was implemented using element-wise summation of the feature maps. Then, GAP and GMP were performed to generate channel-wise statistics of the data. GAP reduced the feature maps to s_c ∈ R^C, a one-dimensional vector of a size equal to the number of channels, where each element acts as a weight for that channel, by averaging over the P × P spatial dimension of each channel. The GAP for 2D and 3D feature maps can be defined by the following equations:

s_c^2D = (1 / (P × P)) Σ_(i=1)^(P) Σ_(j=1)^(P) U_c(i, j) (4)

s_c^3D = (1 / (P × P × D)) Σ_(i=1)^(P) Σ_(j=1)^(P) Σ_(k=1)^(D) U_c(i, j, k) (5)

where s_c^2D and s_c^3D denote the aggregation results for the c-th element of the input feature map for the 2D and 3D SK modules, respectively, U is the fused feature map, and D is the depth of the 3D feature map. The results of GAP and GMP were concatenated into a vector s_c and fed into an FC layer with the ReLU activation function and Batch Normalization. The result, z ∈ R^d, was a feature vector enabling precise and adaptive selection of the most descriptive features (Equation (6)):

z = F_ReLU(BN(W_fc · s_c)) (6)

where W_fc, the weights of the neurons in the Fully Connected layer, acts as a weight matrix of size d × C. The parameter d = max(C/r, L), where L = 32 is the minimum value of d, played an important role in the structure of the SK modules; r is the reduction rate, which controls the compression rate of z.

Select
The Select operator integrated the feature maps from the different RF sizes based on the computed weights. This was controlled by an attention mechanism, which selected the most important regions of the feature vector z. To achieve this, a softmax function was applied to the results of the Fuse operator using the following equations:

a_c = e^(A_c z) / (e^(A_c z) + e^(B_c z)) (7)

b_c = e^(B_c z) / (e^(A_c z) + e^(B_c z)) (8)

where A, B ∈ R^(C×d) and a, b ∈ R^C are the soft attention vectors for F and F̃, respectively, and A_c and B_c denote the c-th rows of A and B. This operation was the same for both the 2D and 3D SK modules, with the constraint that a_c + b_c = 1. Finally, the output feature map was obtained using Equation (9):

V_c = a_c · F_c + b_c · F̃_c (9)
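The Fuse and Select steps can be sketched together in NumPy as below. For brevity this uses GAP only (the paper concatenates GAP and GMP) and omits Batch Normalization; the weight matrices W_fc, A, and B stand in for learned parameters:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sk_fuse_select(F, F_tilde, W_fc, A, B):
    """Sketch of the 2D SK Fuse and Select steps for branch feature maps of
    shape (P, P, C). W_fc (d x C), A and B (C x d) are assumed learned;
    Batch Normalization and the GMP branch are omitted for brevity."""
    # Fuse: element-wise sum, then channel-wise GAP to s in R^C (Eq. (4))
    U = F + F_tilde
    s = U.mean(axis=(0, 1))                  # GAP, shape (C,)
    z = relu(W_fc @ s)                       # compact descriptor, shape (d,)
    # Select: per-channel softmax attention, so a_c + b_c = 1 (Eqs. (7)-(8))
    ea, eb = np.exp(A @ z), np.exp(B @ z)    # shapes (C,)
    a = ea / (ea + eb)
    b = eb / (ea + eb)
    # Weighted combination of the two branches (Eq. (9))
    return a * F + b * F_tilde

P, C, d = 7, 12, 32
rng = np.random.default_rng(0)
F = rng.normal(size=(P, P, C))
F_tilde = rng.normal(size=(P, P, C))
V = sk_fuse_select(F, F_tilde, rng.normal(size=(d, C)),
                   rng.normal(size=(C, d)), rng.normal(size=(C, d)))
```

Because a_c + b_c = 1 with both non-negative, each output channel is a convex combination of the two branches, which is what lets the module interpolate between receptive field sizes per channel.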

Accuracy Assessment
It is important to evaluate the performance of classification algorithms using independent reference samples and appropriate evaluation criteria. In this study, both visual and statistical accuracy assessments were conducted to evaluate the accuracy of the produced mangrove maps. Visual interpretation was conducted to evaluate the performance of the HSK-CNN against high-resolution satellite imagery. Statistical accuracy assessment was performed by comparing the maps with the test samples. To this end, criteria derived from the confusion matrix, including the Overall Accuracy (OA), Cohen's Kappa Coefficient (CKC), Matthews Correlation Coefficient (MCC), and Balanced Accuracy (BA), were employed.
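For reference, OA, CKC, and BA can be computed directly from a confusion matrix as below (the matrix values are hypothetical; the multiclass MCC is omitted here, though sklearn.metrics.matthews_corrcoef provides an implementation):

```python
import numpy as np

def overall_accuracy(cm):
    """OA: correctly classified samples (trace) over total samples."""
    return np.trace(cm) / cm.sum()

def cohens_kappa(cm):
    """CKC: agreement corrected for chance, (p_o - p_e) / (1 - p_e)."""
    n = cm.sum()
    po = np.trace(cm) / n
    pe = np.sum(cm.sum(axis=0) * cm.sum(axis=1)) / n**2
    return (po - pe) / (1 - pe)

def balanced_accuracy(cm):
    """BA: mean of per-class recalls (diagonal over row sums)."""
    return np.mean(np.diag(cm) / cm.sum(axis=1))

# Hypothetical 3-class confusion matrix (rows: reference, cols: predicted)
cm = np.array([[50,  2,  3],
               [ 4, 40,  1],
               [ 2,  3, 45]])
```

Reporting several of these criteria together guards against a single metric hiding class-imbalance effects, which matters here since the eight land cover classes cover very different areas.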
The results of the HSK-CNN were also compared with those of multiple classification algorithms, including random forest, XGBoost, 2D-CNN, MLP-Mixer [50], Swin Transformer [51], and 3D-DenseNet [52,53]. This allowed a more comprehensive validation of the proposed model.

Implementation
The inputs to all algorithms were the patches extracted from the time-series NDVI images, along with the corresponding reference patches to support the training, validation, and accuracy assessment steps. All of the convolutional kernel weights were randomly initialized and trained using the back-propagation algorithm. Adam was selected as the optimizer, and the loss was computed using the cross-entropy function. Mini-batches with a size of 100 samples were utilized, and the networks were trained for 500 epochs without data augmentation. The deep learning models were implemented in a Python 3.8 environment using several libraries, such as TensorFlow, GDAL, and scikit-learn.
Each of the classification algorithms utilized different hyperparameters. To ensure a fair comparison, the algorithms were initialized with their default or best configurations, as reported in their respective reference studies. These hyperparameters are provided in Table 2.

Results
The classification maps produced using the proposed HSK-CNN model and the other algorithms are illustrated in Figures 4 and 5, respectively. The optical satellite image of the study area, along with three zoomed regions, is also provided in Figure 4 for better comparison. It can be observed that the different classes were correctly classified by most classification algorithms, with a superior performance for the HSK-CNN. In general, the results obtained from the patch-based deep learning models (i.e., HSK-CNN, 2D-CNN, MLP-Mixer, Swin Transformer, and 3D-DenseNet) were smoother and contained fewer noisy pixels, a direct influence of exploiting spatial context in these types of algorithms. These smoother maps are best observed in the results of the HSK-CNN and Swin Transformer algorithms. The 2D-CNN and 3D-DenseNet models, which contained relatively fewer convolutional layers, generated noisier outputs than the other deep learning models. Overall, the random forest and XGBoost algorithms, which are pixel-based methods, produced the noisiest outputs.
Table 3 shows the accuracies of the different classification algorithms for mangrove mapping in the study area. The results showed that the proposed HSK-CNN model achieved the highest performance, with an OA of 94%. The other deep learning models also produced high accuracies (OAs above 90%). The pixel-based models had the lowest classification accuracies, with random forest and XGBoost resulting in OAs of 85% and 87%, respectively.
Figure 6 shows the confusion matrices of different models.The proposed HSK-CNN (Figure 6g) demonstrated the best overall performance, with higher accuracy levels across all classes and minimal misclassifications.The proposed model particularly excelled in distinguishing between tidal zones, vegetation, and water classes, which are critical for accurate mangrove ecosystem mapping.fewer noisy pixels, which was a direct influence of using convolutional kernels in these types of algorithms.These smoother maps could be better observed in the results of the HSK-CNN and Swin Transformer algorithms.The 2D-CNN model and 3D-DenseNet models, which contained relatively fewer convolutional layers, generated more noisy outputs compared to other convolutional-based models.Overall, the random forest and XGBoost algorithms, which are pixel-based methods, resulted in the most noisy pixels.Table 3 shows the accuracies of different classification algorithms for mangrove ma ping in the study area.The results showed that the proposed HSK-CNN model achiev the highest performance with an OA of 94% among different methods.Other convo tional-based models also produced high accuracy (e.g., OAs were more than 90%).T non-convolutional-based models had the lowest classification accuracy, where rando forest and XGBoost resulted in OAs of 85% and 87%, respectively.Two different complementary scenarios were investigated to provide a deeper understanding of the effect of different network structures in the proposed HSK-CNN model.
First, the effect of removing the 2D or 3D convolutions on the classification results was studied. In this regard, the 2D or 3D convolutional layers were omitted from the feature-extraction section of the network architecture, and the algorithm was retrained using the new architectures. The results of this investigation are presented in Table 4. It was observed that the hybrid structure of the proposed model, which included both 2D and 3D convolutional layers, provided superior performance. Moreover, the results showed that the effect of the 3D convolutional layers was greater than that of the 2D layers. Second, the effect of utilizing various patch sizes as network inputs was studied. In this regard, three patch sizes of 7 × 7, 9 × 9, and 11 × 11 pixels were tested, and the results are reported in Table 5. The results showed that although larger patch sizes yielded higher classification accuracies, there was an optimum value beyond which the classification accuracy did not change. It is finally worth noting that increasing the patch size increases the computation time, and thus there should be a trade-off between the accuracy level and the computation time.
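To make the patch-size scenario concrete, the sketch below shows one common way of cutting fixed-size spatio-temporal patches from an NDVI time-series cube around each labeled pixel. This is a hypothetical helper for illustration only; the paper does not publish its sampling code, and the reflect-padding choice for border pixels is an assumption.

```python
import numpy as np

def extract_patch(cube, row, col, patch_size=9):
    """Cut a patch_size x patch_size spatio-temporal patch around a pixel.

    cube: NDVI time series of shape (T, H, W); the patch keeps all T dates.
    Reflect-padding is assumed so border pixels still yield full-size patches.
    """
    half = patch_size // 2
    padded = np.pad(cube, ((0, 0), (half, half), (half, half)), mode="reflect")
    # After padding, original pixel (row, col) sits at (row + half, col + half),
    # so the window starting at (row, col) is centered on that pixel.
    return padded[:, row:row + patch_size, col:col + patch_size]

# Example: 12 NDVI dates over a 100 x 100 scene, 9 x 9 input patch.
cube = np.random.rand(12, 100, 100).astype(np.float32)
patch = extract_patch(cube, row=0, col=0, patch_size=9)
```

Testing 7 × 7, 9 × 9, and 11 × 11 inputs then amounts to rerunning training with different `patch_size` values, at the cost of quadratically growing input volume.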
The confusion matrices of the different algorithms showed that the HSK-CNN model performed particularly well in distinguishing between mangroves and other vegetation types, a common challenge in remote sensing-based mangrove mapping. The model also showed high accuracy in delineating water bodies and tidal zones, which is crucial for understanding the dynamics of these coastal ecosystems. Overall, the HSK-CNN model provided better accuracies for all land cover types in the mangrove ecosystem. This high level of overall classification accuracy is critical for holistic mapping and understanding of the ecosystem. The slightly lower accuracy for mangroves in our model may be due to its balanced approach to handling all classes, which may be beneficial for overall ecosystem analysis but may slightly compromise the accuracy of any single class. While our proposed HSK-CNN model achieved the highest OA, it is important to note that some other models, specifically XGBoost, 2D-CNN, and MLP-Mixer, also showed high accuracies for the mangrove class itself (e.g., see Figure 6b-d). This suggests that these models might be particularly effective at identifying the mangrove class.
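The accuracy measures reported in Tables 3-5 (OA, CKC, MCC, and BA) can all be derived directly from a confusion matrix. A minimal sketch of these standard formulas is shown below; the two-class matrix at the end is a toy example, not the paper's data.

```python
import numpy as np

def accuracy_measures(cm):
    """Derive OA, Cohen's Kappa, multiclass MCC, and Balanced Accuracy.

    cm: square confusion matrix, rows = reference labels, cols = predictions.
    """
    cm = np.asarray(cm, dtype=float)
    n = cm.sum()                       # total number of samples
    trace = np.trace(cm)               # correctly classified samples
    rows = cm.sum(axis=1)              # reference totals per class
    cols = cm.sum(axis=0)              # predicted totals per class

    oa = trace / n                                   # Overall Accuracy
    pe = (rows * cols).sum() / n**2                  # chance agreement
    ckc = (oa - pe) / (1.0 - pe)                     # Cohen's Kappa Coefficient
    mcc = (trace * n - (rows * cols).sum()) / np.sqrt(
        (n**2 - (cols**2).sum()) * (n**2 - (rows**2).sum())
    )                                                # multiclass Matthews corr.
    ba = np.mean(np.diag(cm) / rows)                 # mean per-class recall
    return oa, ckc, mcc, ba

# Toy two-class confusion matrix (illustrative values only):
oa, ckc, mcc, ba = accuracy_measures([[9, 1], [2, 8]])
```

Reporting all four measures together, as in Table 3, guards against class-imbalance effects that OA alone can mask.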
The superior performance of our HSK-CNN model could be attributed to several key features that addressed the challenging task of informative feature extraction in complex mangrove ecosystems. First, multiscale feature extraction was achieved by the SK modules, which allowed the model to adaptively capture features at different scales. This is particularly advantageous in mangrove ecosystems, where land cover types exhibit different spatial patterns and textures. Second, the integration of temporal information was another critical aspect of our model. It effectively captured seasonal variations and tidal effects in the study area using the time-series NDVI datasets. This temporal aspect is critical in coastal environments, where water levels and vegetation phenology can significantly affect the appearance of land covers throughout the year. Third, the hybrid 2D-3D architecture of the proposed model combined 2D and 3D convolutional layers, allowing the model to effectively process both spatial and temporal information. This approach enabled the model to learn both spatial context and temporal patterns, which was particularly beneficial for distinguishing between spectrally similar classes, such as different water bodies or vegetation types. Fourth, our HSK-CNN employed a novel attention mechanism to focus on the most relevant features for each land cover class. This attention mechanism adaptively weighted the importance of different spatial and temporal features, improving the model's ability to discriminate between similar land cover types.
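The selective-kernel idea behind the first and fourth points can be sketched as a soft, per-channel attention over branches with different kernel sizes: the branches are summed, pooled to a channel descriptor, passed through a small fully connected layer, and a softmax then weights each branch before fusion. The numpy sketch below is an illustration of this mechanism only, not the trained model; the weight matrices (`w_fc`, `w_a`, `w_b`) are random stand-ins for learned parameters, and the branch feature maps are random placeholders.

```python
import numpy as np

def selective_kernel_fuse(branch_a, branch_b, w_fc, w_a, w_b):
    """SK-style attention fusion of two multiscale feature branches.

    branch_a, branch_b: (C, H, W) feature maps from convolutions with
    different kernel sizes. All weights are illustrative stand-ins.
    """
    fused = branch_a + branch_b                   # element-wise sum of branches
    s = fused.mean(axis=(1, 2))                   # global average pooling -> (C,)
    z = np.maximum(w_fc @ s, 0.0)                 # compact descriptor (FC + ReLU)
    logits = np.stack([w_a @ z, w_b @ z])         # one logit per branch: (2, C)
    weights = np.exp(logits) / np.exp(logits).sum(axis=0)   # softmax over branches
    a = weights[0][:, None, None]                 # per-channel branch weights
    b = weights[1][:, None, None]
    return a * branch_a + b * branch_b            # attention-weighted fusion

rng = np.random.default_rng(0)
branch_a = rng.standard_normal((8, 9, 9))         # e.g., 3x3-kernel features
branch_b = rng.standard_normal((8, 9, 9))         # e.g., 5x5-kernel features
out = selective_kernel_fuse(
    branch_a, branch_b,
    w_fc=rng.standard_normal((4, 8)),
    w_a=rng.standard_normal((8, 4)),
    w_b=rng.standard_normal((8, 4)),
)
```

Because the softmax weights sum to one per channel, the fused output is a convex combination of the two branches, which is what lets the network adaptively emphasize the kernel size best suited to each class.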
Although the proposed model showed high potential for land cover classification in mangrove ecosystems, it is suggested to implement and evaluate the model's performance in diverse global mangrove ecosystems and to assess its generalizability, potentially leading to a standardized global mapping tool. Additionally, it is suggested to integrate complementary data sources, such as SAR and LiDAR data, to further enhance the model's capabilities. SAR data could provide insight into mangrove structure and biomass, especially in cloud-covered areas, while LiDAR could provide detailed canopy height models.

Conclusions
Accurately mapping and monitoring mangroves is fundamental for effective conservation and management strategies aimed at preserving these invaluable ecosystems. In this study, we introduced an innovative deep learning model, called HSK-CNN, for mapping mangroves using multi-temporal Sentinel-2 NDVI products. The proposed algorithm incorporated SK feature extraction techniques, enabling it to efficiently learn and classify mangrove areas in the input remote sensing imagery. The outcomes of our study demonstrated the superior performance of the proposed HSK-CNN model in mangrove classification, achieving an OA of 94%. This outperformed several other algorithms, including random forest, XGBoost, 3D-DenseNet, 2D-CNN, MLP-Mixer, and Swin Transformer, which recorded lower OA values ranging from 85% to 93%. Furthermore, our analysis revealed that the architecture of the neural network, such as the types of convolutional layers and patch sizes, influenced the classification accuracy. Therefore, determining the optimal scenarios and values of these parameters is crucial for maximizing classification accuracy. In conclusion, our findings underscore the importance of accurately mapping mangroves, as the resulting maps provide valuable insights into the distribution of and changes in mangrove ecosystems. This information is vital for guiding informed decision-making processes aimed at conserving and sustainably managing these critical habitats.

Figure 1.
Figure 1. (A) The study area and the distribution of all reference polygons within the study area; (B) the location of the study area within the Persian Gulf; (C) a zoomed image of the locations of mangroves; (D) the distribution of test samples; (E) the distribution of training samples; and (F) the distribution of validation samples.


Figure 2.
Figure 2. General framework of the proposed HSK-CNN model for mangrove ecosystem mapping. SK, GMP, GAP, FC, and NDVI stand for Selective Kernel-based, Global Max Pooling, Global Average Pooling, Fully Connected, and Normalized Difference Vegetation Index, respectively.

Figure 3.
Figure 3. The structure of the proposed 3D SK module.


Figure 4.
Figure 4. (a) Study area with three zoomed regions. (b) Mangrove ecosystem map produced using the proposed HSK-CNN model (see Figure 1 for the colors of different classes).


Figure 5.
Figure 5. The mangrove ecosystem maps produced using different classification algorithms from the study area (see Figure 1 for the colors of different classes).


Table 1.
Table 1. The number of reference patches for different land cover classes, divided into training, validation, and test sets.

Table 3.
Table 3. The accuracies of different algorithms based on various accuracy measures derived from the confusion matrices. OA, CKC, MCC, and BA refer to Overall Accuracy, Cohen's Kappa Coefficient, Matthews Correlation Coefficient, and Balanced Accuracy, respectively.


Table 4.
Table 4. The effects of removing the 2D and 3D convolutional layers from the network structure of the proposed mangrove mapping model. SK, OA, CKC, MCC, and BA refer to Selective Kernel-based, Overall Accuracy, Cohen's Kappa Coefficient, Matthews Correlation Coefficient, and Balanced Accuracy, respectively.

Table 5.
Table 5. The effects of using different patch sizes in the network structure of the proposed mangrove mapping model. OA, CKC, MCC, and BA refer to Overall Accuracy, Cohen's Kappa Coefficient, Matthews Correlation Coefficient, and Balanced Accuracy, respectively.