Water extraction from optical high-resolution remote sensing imagery: a multi-scale feature extraction network with contrastive learning

ABSTRACT Accurate knowledge of the spatiotemporal distribution of water bodies is of great importance in the fields of ecology and environment. Recently, convolutional neural networks (CNNs) have been widely used for this purpose due to their powerful feature extraction ability. However, CNN methods have two limitations in extracting water bodies. First, the large variations in both the spatial and spectral characteristics of water bodies require that CNN-based methods be able to extract multi-scale features and use multi-layer features. Second, collecting enough samples for the training phase of a CNN is difficult. Therefore, this paper proposes a multi-scale feature extraction network (MSFENet) for water extraction, whose advantages stem from two distinct features: (1) a multi-scale feature extractor (MSFE) is designed to extract multi-layer, multi-scale features of water bodies; (2) contrastive learning (CL) is adopted to reduce the sample size requirement. Experimental results show that MSFE can effectively improve small water body extraction performance, and that CL can significantly improve extraction accuracy when the training sample size is small. Compared with other methods, MSFENet achieves the highest F1-score and kappa coefficient on two datasets. Furthermore, spectral variability analysis shows that MSFENet is more robust than other neural networks in spectrum variation scenarios.


Introduction
Water is the source of life and essential for land ecological systems (Du, Ottens, and Sliuzas 2010), with a significant impact on public health, economic development and the living environment (J.J. Li et al. 2021). Therefore, it is of great importance to obtain the spatiotemporal distribution of surface water bodies timely and accurately. With the rapid development of earth observation technologies in recent decades, optical remote sensing images with high spatial, temporal, and spectral resolution have become more accessible (Wang et al. 2022). Hence, using optical high-resolution images for surface water body extraction is a promising approach. Although optical high-resolution remote sensing images can provide much useful information for surface water body extraction (Chen et al. 2020), including spectral features (Zhou et al. 2021), texture features, etc., several challenges remain. Firstly, water bodies closely resemble low-albedo objects such as shadows in image features, and thus are easily confused with these dark objects. Secondly, the spectral signals of water bodies vary greatly in remote sensing images due to different solar altitude angles and the interference of atmospheric conditions and topography. Thirdly, differences in both the sediment content of water bodies and the density of aquatic plants cause spectral variations of water bodies in remote sensing images. Finally, small target detection in remote sensing images is always difficult because small objects can be easily affected by their surroundings (Zhou et al. 2022). In short, the extraction of small water bodies remains challenging.
The methods of water extraction from optical high-resolution remote sensing images mainly include water index methods (WIMs) and image classification methods (ICMs). Generally, WIMs extract water in three steps: (1) selecting the bands closely related to identifying water bodies in terms of spectral characteristics; (2) constructing different water index models by combining water-correlated spectral bands; (3) determining a threshold to classify water and non-water (Su et al. 2021). To date, many water indexes have been proposed for water extraction. McFeeters (1996) proposed the normalized difference water index (NDWI) with the green and NIR bands.
However, NDWI is greatly affected by the shadows of buildings, so it struggles with water extraction in built-up areas. To overcome the shortcomings of NDWI, Xu (2005) developed the modified normalized difference water index (MNDWI) by replacing the NIR band in NDWI with the SWIR band. Yan, Zhang, and Zhang (2007) proposed the enhanced water index (EWI) by combining NDWI and MNDWI. Feyisa et al. (2014) developed the automated water extraction index (AWEI) using multiple bands (1, 2, 4, 5 and 7) of Landsat 5 TM to reduce noise effects in built-up and mountainous areas. Additionally, there are other less used water indexes, such as the revised normalized difference water index (RNDWI) (Cao et al. 2008), the new water index (NWI) (Ding 2009), the Gaussian normalized difference water index (GNDWI) (Shen et al. 2013), the false normalized difference water index (FNDWI) (Zhou et al. 2014), and the shadow water index (SWI) (Chen et al. 2015). Although WIMs can extract water bodies simply and quickly, their results are often unsatisfactory because of the spectral similarity between water bodies and low-albedo objects and the spectral variability within water bodies. Moreover, the thresholds of WIMs must be determined appropriately to produce accurate water maps, which is not an easy task. Finally, since high spatial resolution imagery usually has only four spectral bands (i.e. blue, green, red and NIR), most WIMs are not applicable.
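As an illustration of the band combinations above, NDWI and MNDWI can be sketched in a few lines of NumPy. The reflectance values below are toy numbers for demonstration, not measurements from any dataset used in this paper:

```python
import numpy as np

def ndwi(green, nir):
    """Normalized difference water index (McFeeters 1996)."""
    return (green - nir) / (green + nir + 1e-12)  # epsilon avoids division by zero

def mndwi(green, swir):
    """Modified NDWI (Xu 2005): the SWIR band replaces NIR."""
    return (green - swir) / (green + swir + 1e-12)

# Toy reflectances: water absorbs strongly in the NIR, so NDWI is high over water.
green = np.array([0.30, 0.20])
nir   = np.array([0.05, 0.40])   # first pixel water-like, second vegetation-like
mask = ndwi(green, nir) > 0.3    # a threshold of 0.3, as used later in this paper
```

The thresholding step in the last line is exactly the part that WIMs must tune per scene, which is why a fixed threshold often fails across regions.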
Various ICMs have been proposed to overcome the shortcomings of WIMs and utilize the spatial information of optical high-resolution remote sensing images (Nath and Deb 2010). ICMs perform water extraction by combining spectral, shape and texture features and using various machine learning classifiers. Commonly used classifiers in ICMs include the decision tree (Fu, Wang, and Li 2008), random forest (RF) (Cui, Wang, and Huang 2022), support vector machine (SVM) (Nandi, Srivastava, and Shah 2017), expert system (Pekel et al. 2016), etc. Although ICMs can achieve better results than WIMs with various features, these features must be constructed manually for a specific water extraction task. Additionally, the scope of application of manually constructed features is limited, making it difficult to extract water bodies in different regions.
In recent years, deep learning, especially convolutional neural networks (CNNs), has provided an effective approach for automatically learning features at multiple levels (Chen et al. 2016), and has been widely used in scene classification (Liu, Zhong, and Qin 2018; De Lima and Marfurt 2020), object detection (Ren et al. 2017; Chen, Zhang, and Ouyang 2018), and semantic segmentation (Shelhamer, Long, and Darrell 2017; W. Z. Zhao et al. 2017). As water extraction is a specific semantic segmentation task, many CNN models have been designed for it. Chen et al. (2018) proposed a self-adaptive pooling layer for water extraction to reduce the loss of features in the pooling process. Chen et al. (2020) performed spatial-spectral convolution to extract features from both the spatial and spectral dimensions by factorizing 3D convolutions into 2D spatial convolutions and 1D spectral convolutions. M. Y. Li et al. (2021) applied the DenseBlocks of DenseNet (Huang et al. 2017) to construct a dense-local-feature-compression (DLFC) network in which each layer receives all of its previous feature maps; this network can automatically extract water bodies from different images of one sensor and from different sensors. Wang et al. (2022) proposed SADA-Net for water extraction, which utilizes both an atrous spatial pyramid pooling (ASPP) module and a dual attention (DA) module. Lu et al. (2022) developed a weakly supervised deep learning model named the neighbor feature aggregation network (NFANet) to improve label quality by recursive training.
Compared with ICMs, deep learning methods can automatically extract features, which is more beneficial to water extraction. However, the large number of samples required to train deep neural networks is often hard to obtain. To reduce the need for large sample sets, contrastive learning (CL) is introduced into water extraction in this paper. CL is an unsupervised representation learning method for extracting unsupervised features (He et al. 2020). CL aims to learn a representation space by contrasting semantically positive and negative sample pairs, such that the features of positive sample pairs are similar while the features of negative sample pairs are different. Several recent studies have used approaches related to contrastive loss for unsupervised visual representation learning with promising results (Wu et al. 2018; Bachman, Hjelm, and Buchwalter 2019). However, the use of image pairs as positive and negative samples for CL in these works is not applicable to pixel-level tasks such as water extraction. Recently, several works (Chaitanya et al. 2020; Xie et al. 2021; Zhao et al. 2021; Bai et al. 2022) have shown that CL can provide a proper representation for semantic segmentation tasks requiring high-dimensional features at the pixel level. Therefore, pixel-wise CL is adopted to pretrain deep learning models and reduce the required training sample size.
The variation of water body size is also a challenging problem. Most current methods cannot meet the requirement of extracting water bodies of different sizes. In addition, the sizes of feature maps decrease as features are extracted layer by layer, so small water bodies with inconspicuous features tend to be ignored, causing biased results. Therefore, multi-layer and multi-scale features are required to solve these problems. Multi-layer features are those extracted from different layers of a CNN: low-layer features mainly express the detailed information of targets, while deep-layer features mainly express their overall information, such as semantic information. Multi-scale features are those extracted at several different scales: small-scale features with small receptive fields are suitable for detecting small targets, while large-scale features with large receptive fields are suitable for extracting large targets. Several modules for extracting multi-scale features already exist in the field of semantic segmentation, such as spatial pyramid pooling (SPP) (He et al. 2015), the pyramid pooling module (PPM) in PSPNet (H. S. Zhao et al. 2017), and atrous spatial pyramid pooling (ASPP) in DeepLabv2 (Chen et al. 2018). However, these modules extract multi-scale features without multi-layer settings and mainly perform pooling operations on the feature maps, leading to a loss of information. Therefore, a multi-scale feature extraction network (MSFENet) with a multi-scale feature extractor (MSFE) is developed in this work to extract multi-layer and multi-scale features for water extraction.
In summary, the main contributions of this work are as follows: (1) a multi-scale feature extractor (MSFE) is designed to extract multi-layer, multi-scale features of water bodies; (2) pixel-wise contrastive learning is adopted to reduce the required training sample size.

Loss function

The MSFENet is trained with a cross-entropy loss, which drives the output of the MSFENet toward the ground truth. Considering an image with N pixels, let the output p_i of the MSFENet denote the probability of the i-th pixel belonging to water and y_i denote whether the i-th pixel is water. The cross-entropy loss can then be formulated as:

L_CE = -(1/N) * sum_{i=1}^{N} [ y_i log p_i + (1 - y_i) log(1 - p_i) ]
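For illustration, the per-image binary cross-entropy described above can be computed directly in NumPy (a real training pipeline would use a deep learning framework's built-in loss; this is only a sketch of the formula):

```python
import numpy as np

def binary_cross_entropy(p, y, eps=1e-12):
    """Pixel-wise binary cross-entropy: p holds the predicted water
    probability per pixel, y the binary ground truth (1 = water)."""
    p = np.clip(p, eps, 1 - eps)   # guard against log(0)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

p = np.array([0.9, 0.2, 0.8])      # toy predicted probabilities
y = np.array([1.0, 0.0, 1.0])      # toy ground-truth labels
loss = binary_cross_entropy(p, y)
```

The loss shrinks toward zero as the predicted probabilities approach the ground truth, which is exactly the behavior the training objective exploits.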

Multi-scale feature extractor
To solve the information loss problem caused by increasing the receptive field through pooling or general convolution, atrous convolution is used in the MSFE to extract multi-scale information.
As a special kind of convolution, atrous convolution has a dilation rate parameter in addition to the parameters of a general convolution. Figure 2 shows atrous convolutions with different dilation rates. An atrous convolution with a dilation rate of d can be regarded as a general convolution with d - 1 zeros inserted between each row and each column of the convolution kernel. The receptive field r_n of the n-th layer of a CNN is calculated as:

r_n = r_{n-1} + (k_n - 1) * prod_{i=1}^{n-1} s_i

where k_n denotes the kernel size of the n-th layer and s_i denotes the stride of the i-th layer. Therefore, atrous convolution can increase the receptive field of a CNN by increasing the dilation rate without increasing the number of parameters. Each DACM is formed by parallel atrous convolutions with different dilation rates to extract multi-scale features under different receptive fields (Figure 3). There are three kinds of atrous convolutions in the DACM, with dilation rates of 1, 3, and 5, respectively. A 1 × 1 convolution is adopted in each atrous branch for rectified linear activation, and the features of all branches and the original features are combined to obtain the output of the DACM. The branch with a large receptive field excels at extracting features for large objects, while the branch with a small receptive field extracts features for small objects. Through the parallel connection of different branches, the DACM can extract multi-scale features for objects of different sizes. The MSFE consists of four SCs and four DACMs in total, with each DACM designed to extract features from a corresponding SC. Thus, multi-layer, multi-scale features can be extracted and used in the decoder for final water extraction.
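The receptive-field recurrence and the effect of the dilation rate can be checked numerically. The `(kernel, stride, dilation)` layer encoding below is an illustrative convention of this sketch, not notation from the paper:

```python
def receptive_field(layers):
    """Receptive field after a stack of layers, following
    r_n = r_{n-1} + (k_n - 1) * prod(s_1 .. s_{n-1}).
    Each layer is (kernel_size, stride, dilation); a dilation of d turns
    a kernel of size k into an effective kernel of size k + (k-1)(d-1)."""
    r, jump = 1, 1                      # jump = product of preceding strides
    for k, s, d in layers:
        k_eff = k + (k - 1) * (d - 1)   # zeros inserted between kernel taps
        r += (k_eff - 1) * jump
        jump *= s
    return r

# A 3x3 convolution with dilation 3 sees as far as a 7x7 kernel,
# with no extra parameters: this is the key property exploited by the DACM.
assert receptive_field([(3, 1, 3)]) == receptive_field([(7, 1, 1)])
```

Stacking the DACM's three branch dilations (1, 3, 5) on the same input therefore yields three different receptive fields from kernels of identical parameter count.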

Pixel-wise supervised contrastive learning
The key to CL is the selection of positive and negative sample pairs. In most cases, CL is unsupervised: different images constitute negative sample pairs, while an image and its distorted version constitute a positive sample pair. In this work, pixel-wise CL is performed because water extraction is a semantic segmentation task. Pixel-wise CL requires pixel-level positive and negative sample pairs, which cannot be obtained by data augmentation alone. Therefore, labeled images are used to select the pairs: pixels in two images with the same label form positive sample pairs, while pixels with different labels form negative sample pairs, as shown in Figure 4.
The purpose of CL is to make the features learned from positive samples similar while making those learned from negative samples different, which is generally achieved by a contrastive loss. Let I denote an image, I' its distorted version, N_I the number of pixels in image I, and N_I' the number of pixels in image I'. The contrastive loss can be formulated as:

L_CL = -(1/N_I) * sum_{k=1}^{N_I} (1/N_k^+) * sum_{k^+} log[ exp(z_k · z_{k^+} / τ) / ( exp(z_k · z_{k^+} / τ) + sum_{k^-} exp(z_k · z_{k^-} / τ) ) ]

where N_k^+ denotes the number of positive samples of pixel k, N_k^- denotes the number of negative samples of pixel k, z denotes the normalized deep features (the output of the fourth DB in this work), (·) denotes the dot product of two vectors, and τ > 0 is a hyper-parameter. The final loss function of the MSFENet is therefore:

L = L_CE + L_CL

Accuracy assessment
Four evaluation metrics are used for accuracy assessment of water body extraction: precision (P), recall (R), F1-score (F1), and kappa coefficient (K). They are computed as follows:

P = TP / (TP + FP),  R = TP / (TP + FN),  F1 = 2PR / (P + R),  K = (p_o - p_e) / (1 - p_e)

where p_o = (TP + TN) / N is the observed agreement, p_e = [(TP + FP)(TP + FN) + (FN + TN)(FP + TN)] / N^2 is the chance agreement, N is the total number of pixels, TP denotes true positives (the number of correctly extracted water pixels), FP denotes false positives (the number of incorrectly extracted water pixels), TN denotes true negatives (the number of correctly extracted non-water pixels), and FN denotes false negatives (the number of incorrectly extracted non-water pixels). The larger the values of these four metrics, the better the water extraction results.
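The four metrics can be computed from a pair of binary maps as in the following sketch (it assumes the prediction contains at least one water pixel, since precision is undefined otherwise, a case discussed later for region 1):

```python
import numpy as np

def water_metrics(pred, truth):
    """Precision, recall, F1 and kappa from binary prediction/truth maps."""
    pred, truth = pred.ravel().astype(bool), truth.ravel().astype(bool)
    tp = np.sum(pred & truth)       # correctly extracted water pixels
    fp = np.sum(pred & ~truth)      # incorrectly extracted water pixels
    fn = np.sum(~pred & truth)      # missed water pixels
    tn = np.sum(~pred & ~truth)     # correctly extracted non-water pixels
    n = tp + fp + fn + tn
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * p * r / (p + r)
    po = (tp + tn) / n                                            # observed agreement
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2   # chance agreement
    kappa = (po - pe) / (1 - pe)
    return p, r, f1, kappa
```

A perfect prediction scores 1.0 on all four metrics, while the kappa coefficient discounts agreement that could arise by chance, which is why it is reported alongside F1 throughout the paper.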

Data introduction
The Gaofen Image Dataset (GID) is a large-scale land-cover dataset produced from GF-2 satellite images. GID consists of two parts: a fine land-cover classification set (FLCCS) and a large-scale classification set (LSCS) (Tong et al. 2020). There are fifteen land-cover categories in the FLCCS and five in the LSCS. Water is a separate category in the LSCS, whereas it is divided into several subcategories in the FLCCS; thus, the LSCS is used for water extraction for convenience. The LSCS contains 150 GF-2 images distributed over nearly 60 cities in China, and each GF-2 image has a corresponding pixel-level labeled image indicating the category of each pixel. Although water is included in the LSCS, its labels are not accurate. Therefore, four GF-2 images are selected from the LSCS for the experiments in this article, and the water bodies in these images are relabeled by careful visual interpretation (Figure 5). The spatial resolution and image size of each GF-2 image are 3.2 m and 6800 × 7200 pixels, respectively. GF-2 images contain four bands (i.e., NIR, red, green, and blue). The study areas include ponds, lakes, and rivers of different sizes and characteristics. In addition, low-albedo objects such as shadows are widespread in the study areas and can easily be confused with water bodies. Therefore, the datasets used in this article are well suited to evaluating water extraction models.

Data preprocessing
The four GF-2 images and corresponding relabeled images are divided into training, validation and test sets ((a) and (b) in Figure 5). In the accuracy assessment phase, when large water bodies are well extracted, the evaluation of small water bodies will be biased because of imbalanced samples. To better evaluate water extraction accuracy, two regions mainly containing small water bodies are selected from the test set to evaluate the results of different models on small water bodies (Figure 6).
Targets smaller than 32 × 32 pixels are usually defined as small targets in the field of object detection. Therefore, water bodies with fewer than 1000 pixels are regarded as small objects in this paper. The area distribution histogram and cumulative area distribution histogram of the water bodies in regions 1 and 2 are shown in Figure 7. The proportions of small water bodies in these two regions are 97.57% and 88.69%, respectively, which makes them suitable for evaluating a model's ability to extract small water bodies. Assuming that Figure 5(b) is the water extraction result obtained by a model and Figure 5(c) is the ground truth, the four metrics are shown in Table 1. For the entire test set, although only a few large water bodies are extracted, the kappa coefficient is still close to 70%, and the F1-score even exceeds 70%. This result clearly does not reflect the actual accuracy of the water extraction results. For region 1, since no water bodies are extracted, precision and F1-score cannot be calculated, and the recall and kappa coefficient are zero. For region 2, the F1-score and kappa coefficient are 18.51% and 17.28%, respectively. The evaluation results of these two local regions better reflect the true accuracy of the water extraction results. Image patches are clipped in the prediction phase, and water prediction results are merged, for MSFENet and the other deep learning methods. An i7-12700F central processing unit and a 12 GB NVIDIA GeForce RTX 3060 graphics processing unit are used to run the experiments.
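The small-water-body criterion (connected water regions with fewer than 1000 pixels) can be applied with a simple connected-component pass. The 4-connectivity choice in this sketch is an assumption, and the pure-Python traversal is only for illustration; a real pipeline would use an optimized labeling routine:

```python
import numpy as np
from collections import deque

def small_water_mask(water, max_pixels=1000):
    """Flags water bodies (4-connected components of the boolean map
    `water`) containing fewer than `max_pixels` pixels."""
    visited = np.zeros_like(water, dtype=bool)
    small = np.zeros_like(water, dtype=bool)
    h, w = water.shape
    for i in range(h):
        for j in range(w):
            if water[i, j] and not visited[i, j]:
                comp, q = [], deque([(i, j)])   # breadth-first flood fill
                visited[i, j] = True
                while q:
                    y, x = q.popleft()
                    comp.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w and water[ny, nx] and not visited[ny, nx]:
                            visited[ny, nx] = True
                            q.append((ny, nx))
                if len(comp) < max_pixels:      # the paper's small-object cut-off
                    for y, x in comp:
                        small[y, x] = True
    return small
```

Restricting the evaluation metrics to the pixels flagged by such a mask is one way to obtain region statistics like those reported for regions 1 and 2.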

Experiments and analysis
During the training phase, distorted images for CL are obtained by contrast enhancement. The optimizer, initial learning rate, maximum number of epochs and training batch size in all experiments were Adam, 0.0002, 50, and 4, respectively. To make the MSFENet converge faster, model parameters pre-trained on ImageNet are used to initialize the encoder, and the learning rate is halved once the F1-score on the validation set stops increasing for 5 consecutive epochs. In addition, early stopping helps to prevent the MSFENet from overfitting; the training process is therefore terminated once the F1-score on the validation set stops improving for 10 consecutive epochs.
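The plateau-based schedule described above can be sketched as a small helper that inspects the validation F1 history. This is a simplified illustration (a real scheduler would also reset its plateau counter after each halving):

```python
def plateau_actions(f1_history, lr, lr_patience=5, stop_patience=10):
    """Simplified sketch of the schedule used here: count epochs since the
    best validation F1, halve the learning rate once a plateau of
    `lr_patience` epochs is reached, and stop at `stop_patience`."""
    best_epoch = f1_history.index(max(f1_history))
    since_best = len(f1_history) - 1 - best_epoch
    if since_best >= lr_patience:
        lr /= 2                      # learning-rate halving on plateau
    return lr, since_best >= stop_patience   # (new lr, early-stop flag)
```

With the paper's settings, an F1 curve flat for 5 epochs halves the initial rate of 0.0002 to 0.0001, and a 10-epoch plateau terminates training.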
During the prediction phase, only original image patches are fed into the MSFENet, whereas the training phase also requires the distorted image patches as input. Additionally, due to the limitation of video memory, the remote sensing images in the test set must be cropped into patches for prediction. The most commonly used method is to crop a remote sensing image into non-overlapping patches; however, this can produce stitching lines when the results are merged. To solve this problem, an overlapping cropping method is used in the prediction of water bodies with deep learning models, as shown in Figure 8. Each 512 × 512 image patch (red boxes in Figure 8) obtained by cropping shares 512 × 128 pixels with each adjacent patch. The white part inside the red boxes in Figure 8, i.e. the part beyond the image extent, is zero-padded during cropping. All image patches are fed into the trained deep learning models to obtain prediction results. Only the 384 × 384 pixels (yellow boxes) at the center of each prediction patch are considered valid, and the final water extraction result is obtained by merging all the 384 × 384 prediction patches.
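The overlap-tiling procedure above can be sketched as follows. The `predict` callable stands in for model inference on one patch; the indexing arithmetic matches the 512/384 geometry of Figure 8 (64 pixels of discarded context per side, zero padding beyond the image border):

```python
import numpy as np

def predict_tiled(image, predict, tile=512, valid=384):
    """Overlap-tiled prediction (sketch): adjacent tiles share
    (tile - valid) = 128 px; only the central `valid` region of each
    prediction is kept, and out-of-image areas are zero-padded."""
    pad = (tile - valid) // 2                      # 64 px context per side
    h, w = image.shape[:2]
    out = np.zeros((h, w), dtype=np.float32)
    for y0 in range(0, h, valid):
        for x0 in range(0, w, valid):
            patch = np.zeros((tile, tile) + image.shape[2:], image.dtype)
            ys = slice(max(y0 - pad, 0), min(y0 - pad + tile, h))
            xs = slice(max(x0 - pad, 0), min(x0 - pad + tile, w))
            sub = image[ys, xs]
            py, px = max(pad - y0, 0), max(pad - x0, 0)
            patch[py:py + sub.shape[0], px:px + sub.shape[1]] = sub
            pred = predict(patch)                  # model inference on one patch
            vy, vx = min(valid, h - y0), min(valid, w - x0)
            out[y0:y0 + vy, x0:x0 + vx] = pred[pad:pad + vy, pad:pad + vx]
    return out
```

Because every output pixel comes from the center of some patch, each prediction is made with full spatial context, which removes the stitching lines that non-overlapping cropping produces.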

Ablation study
Ablation experiments with different settings are conducted to verify the influence of each component. Three components, SC, DACM, and CL, are analyzed in detail. The encoder-decoder architecture, i.e. the MSFENet without the proposed MSFE, is used as the baseline for a fair comparison. Firstly, the influence of the multi-layer features obtained by SC on the overall results and on small water bodies is discussed. Subsequently, the effects of the multi-scale features obtained by DACM on the overall results and on small water bodies are discussed. Finally, the impact of CL is discussed.
Table 2 shows the accuracy assessment results of four methods on the test set: the baseline, the baseline with SC, the MSFENet, and the MSFENet with CL. The MSFENet achieves the highest F1-score and kappa coefficient, which are 1.1% and 1.28% higher than the baseline, respectively. The baseline with SC also achieves good results, 0.71% and 0.82% higher than the baseline. In addition, compared with the baseline, the MSFENet brings a 2.66% increase in recall at the expense of 0.62% in precision. Higher recall means that fewer water bodies are missed; thus, the MSFENet can largely reduce the omission rate of water bodies. However, the accuracy assessment result of the MSFENet with CL shows that CL cannot further increase water extraction accuracy on top of the MSFENet; this phenomenon is discussed separately in detail in the next section.
Although the accuracy assessment results on the test set illustrate that MSFE can effectively improve water extraction accuracy, they cannot fully reflect its effectiveness on small water bodies. Tables 3 and 4 show the accuracy assessment results of different network configurations on regions 1 and 2. In region 1, the MSFENet achieves 1.83% and 1.88% improvements over the baseline in F1-score and kappa coefficient, respectively. In region 2, the F1-score and kappa coefficient of the MSFENet are 2.31% and 2.18% higher than those of the baseline, respectively. Compared with the baseline, the MSFENet achieves 4.61% and 5.23% improvements in recall in regions 1 and 2, which are 1.95% and 2.67% higher than the improvement on the whole test set. These results show that MSFE can effectively extract multi-scale features and reduce the omission rate. In addition, within MSFE, SC contributes more than DACM to the extraction accuracy of small water bodies. CL is again not effective in improving accuracy in regions 1 and 2. The results of the different network configurations are shown in Figure 9.
Multi-layer features are extracted in the baseline by the initial convolution layer and four RBs, but only the most abstract features extracted by the fourth RB are used by the decoder for water extraction. Since the baseline can only use abstract features, more information about water bodies is lost: small water bodies are missed, and the boundary information of large water bodies is not accurate enough (Figure 9(c)). Fortunately, SC enables the decoder to make full use of the multi-layer features extracted by the encoder. For small water body extraction and accurate boundary information, the low-layer features that undergo fewer pooling operations are more effective than the deep-layer features. As shown in Figure 9(d), the baseline with SC can detect many small water bodies that are missed by the baseline, and the boundaries of other water bodies are more accurate. However, although adding SC to the baseline effectively reduces the omission rate, it also increases the false alarm rate, which is clearly visible in Figure 9(d). The accuracy assessment results in Tables 2, 3, and 4 also show a significant decrease in precision after adding SC to the baseline, which represents an increase in false alarms. After adding DACM to the baseline with SC to form the MSFE, the false alarm rate stops rising and starts to drop again. The MSFE extracts both multi-layer features and multi-scale features in each layer, which further reduces the omission rate, makes the boundaries of water bodies more accurate, and suppresses the rise in false alarms. As shown in Figure 9(e), the water bodies falsely detected in Figure 9(d) no longer appear. The accuracy assessment results in Tables 2, 3, and 4 also show that the MSFENet achieves a significant improvement in precision over the baseline with SC.

The effectiveness of contrastive learning
In the ablation study, CL could not further improve water extraction accuracy when all training samples were used. The reason is that the training set is much larger than the test set, and with sufficient training samples the MSFENet is already able to extract effective features, so CL brings no further improvement. However, when the number of training samples is insufficient, the MSFENet cannot extract effective features from the training samples, and adding CL can effectively increase the representation ability of the features, thus improving water extraction accuracy.
To verify the ability of CL, nine experiments are conducted using 1%, 5%, 10%, 20%, 40%, 60%, 80%, and 100% of the total training samples to explore whether CL can increase the water extraction capability of the MSFENet at small sample sizes. Figure 10 shows the accuracy curves as the sample size increases. As the sample size increases, the accuracy improvement brought by MSFENet + CL becomes smaller and smaller. As seen in Figure 10, among the nine experiments, the greatest accuracy improvement from CL is achieved when only 1% of the training samples are used to train the MSFENet, as shown by the accuracy assessment results in Table 5. The F1-score and kappa coefficient on the test set improve by 2.67% and 3.04%, showing that CL is still very effective when the sample size is small. An interesting finding is that CL is more helpful for extracting small water bodies when the sample size is small, leading to improvements of 22.54% and 13.31% in F1-score and kappa coefficient in region 1, and of 4.72% and 4.96% in region 2. Therefore, CL is crucial for water extraction when samples are clearly insufficient for a large-scale mission.

Comparison with other methods
The proposed MSFENet is compared with NDWI, RF, FCN (Shelhamer, Long, and Darrell 2017), PSPNet (Zhao et al. 2017), UNet (Ronneberger, Fischer, and Brox 2015), and DeepLabv3+ (Chen et al. 2018) on the four selected GF-2 images in GID. The NDWI method classifies pixels with NDWI greater than 0.3 as water. For RF, the images were segmented into image objects and object-based features were then extracted to train the classifier. The four deep learning methods were implemented with MMSegmentation (MMSC 2020), and the backbone of FCN, PSPNet, and DeepLabv3+ is ResNet34. All methods except NDWI are trained on the training set, and all methods are evaluated on the test set.
The values of the four metrics are shown in Table 6. Compared with NDWI, RF, FCN, PSPNet, UNet, and DeepLabv3+, the F1-score of the MSFENet achieves significant improvements of 2.38%, 3.62%, 3.36%, 3.33%, 1.53%, and 2.88%, respectively, and the kappa coefficient achieves significant improvements of 2.96%, 4.2%, 3.89%, 3.86%, 1.77%, and 3.32%, respectively. UNet has the second highest F1-score and kappa coefficient because it also has multiple SCs, which can extract multi-layer features of water bodies. The effectiveness of SC for water extraction has been verified in the ablation study, but the accuracy of UNet still does not reach the level of the proposed baseline. NDWI achieves the third highest accuracy, mainly owing to the selection of a suitable threshold: 11 thresholds from 0 to 0.5 at 0.05 intervals were tested, and a threshold of 0.3 was selected as it achieved the highest accuracy. In practical water extraction, NDWI cannot achieve such high accuracy, because a suitable threshold is hard to select for a large-scale mission. RF achieves the lowest accuracy because its features are manually defined and selected, which cannot meet the requirements of different types of water extraction. The remaining three methods, FCN, PSPNet, and DeepLabv3+, achieve similar accuracy. To compare the extraction accuracy of these methods on small water bodies, the four evaluation metrics calculated on regions 1 and 2 are shown in Tables 7 and 8. The MSFENet still achieves the highest F1-score and kappa coefficient in regions 1 and 2, indicating that it achieves the best extraction results for small water bodies. With the multi-layer extraction capability of SC, the F1-score and kappa coefficient of UNet are only 4.28% and 4.4% lower than those of the MSFENet in region 1, respectively, and are only
1.49% and 1.57% lower than those of the MSFENet in region 2, respectively. DeepLabv3+, with the multi-scale feature extraction capability of ASPP, also achieves a high F1-score and kappa coefficient in regions 1 and 2. In contrast, the NDWI method achieves the lowest accuracy in region 1, and its accuracy in region 2 is also not very high, indicating the poor extraction ability of NDWI for small water bodies. The remaining three methods, RF, FCN, and PSPNet, are also much less capable of extracting small water bodies than the MSFENet. Figure 11 shows the results of the different methods, demonstrating the same findings as Tables 6, 7 and 8: the MSFENet has a stronger extraction capability for small water bodies and obtains more accurate water boundaries.

Spectral variability analysis
Spectral variation of the same category of ground objects poses great challenges to water extraction from remote sensing images. To check the stability of the proposed MSFENet under spectral variation, a simulated image is generated by replacing the second band of the GF-2 test image with its third band (Figure 12). The MSFENet is trained on the training set, tested on the simulated image, and compared with four deep learning models: FCN, PSPNet, UNet, and DeepLabv3+.
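The band-replacement simulation can be expressed in one NumPy operation. The sketch below assumes bands are stored along the last axis and uses 0-based indexing, so "second band replaced by third band" becomes index 1 replaced by index 2:

```python
import numpy as np

def simulate_spectral_shift(image):
    """Build a spectrally shifted image: the second band is overwritten
    with the third band (bands on the last axis, 0-indexed here).
    The shapes of water bodies are unchanged; only spectra shift."""
    sim = image.copy()              # leave the original image intact
    sim[..., 1] = image[..., 2]
    return sim
```

Because only the spectral values change while shapes stay fixed, this test isolates how much each network relies on spectral features versus the shape cues captured by multi-layer, multi-scale features.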
Table 9 shows the four evaluation metrics of the different methods on the simulated image. After the spectral variation, the F1-score and kappa coefficient of all methods decrease substantially, but the MSFENet still achieves the highest F1-score and kappa coefficient. This is mainly because the MSFE in the MSFENet can extract multi-layer, multi-scale features of water bodies: the spectral features change in this spectral variation analysis, but the shapes of water bodies do not, and the multi-layer, multi-scale features can reflect these shape characteristics. DeepLabv3+ and PSPNet, two networks that can extract multi-scale features using ASPP and PPM respectively, also achieve high accuracy, while FCN, which can only extract single-scale features, achieves lower accuracy. UNet has the lowest accuracy because its encoder is not a ResNet34 with strong feature extraction ability, but a series of convolutional layers that cannot effectively cope with spectral variations. When the whole simulated image is evaluated, the MSFENet does not achieve a significant advantage in accuracy. But when the two regions dominated by small water bodies are evaluated, the MSFENet not only achieves the highest accuracy but also improves significantly on the second highest. As shown in Tables 10 and 11, the F1-score and kappa coefficient of the MSFENet are 14.28% and 14.58% higher than the second highest in region 1, and 20.18% and 20.9% higher in region 2. This indicates that the MSFENet is more stable than the other methods and can extract more robust features for small water bodies. The water extraction results of the different methods on the simulated image are shown in Figure 13. It is obvious that the MSFENet obtains better results than the other methods, which is
consistent with the results of accuracy assessment.
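For reference, the F1-score and kappa coefficient reported in Tables 9 to 11 can be computed from a binary confusion matrix as follows (a self-contained sketch; the function name is ours):

```python
import numpy as np

def binary_f1_kappa(pred, label):
    """F1-score and Cohen's kappa for a binary (water / non-water) map.
    `pred` and `label` are boolean arrays of the same shape."""
    pred = np.asarray(pred, bool).ravel()
    label = np.asarray(label, bool).ravel()
    tp = np.sum(pred & label)
    fp = np.sum(pred & ~label)
    fn = np.sum(~pred & label)
    tn = np.sum(~pred & ~label)
    n = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    po = (tp + tn) / n                                            # observed agreement
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2   # chance agreement
    kappa = (po - pe) / (1 - pe) if pe < 1 else 0.0
    return f1, kappa
```

Because kappa discounts chance agreement, it penalizes a model that over-predicts the dominant non-water class more strongly than overall accuracy does, which is why it is informative for small water bodies.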

Validation on LoveDA
To further validate the effectiveness of MSFENet, experiments were performed to compare the water extraction accuracy of MSFENet, FCN, PSPNet, UNet, and DeepLabv3+ on LoveDA (Wang et al. 2021). LoveDA contains 5987 high-resolution images of size 1024 × 1024, including 2522 images in the training set, 1669 in the validation set, and 1796 in the test set. The spatial resolution of these images is 0.3 m, and all images, with red, green, and blue bands, were obtained from the Google Earth platform. The images in LoveDA are located in Nanjing, Changzhou, and Wuhan, China.

Table 12 shows the values of the four evaluation metrics of the different methods on LoveDA. All five methods achieve low water extraction accuracy, because the images in LoveDA contain only the R, G, and B bands, unlike the GID images, which also include an NIR band. Another issue is that the labels of the GID images are more accurate than those of LoveDA, as all GID labels were produced manually; mislabeling in LoveDA therefore lowers the accuracy of all models. Moreover, the 0.3 m spatial resolution of the LoveDA images is much finer than the 3 m of the GID images; finer spatial resolution increases landscape heterogeneity and makes object identification more difficult. Although all methods achieve low accuracy, MSFENet still outperforms the other methods on LoveDA, improving the F1-score and kappa coefficient by 1.35% and 1.46% over DeepLabv3+. UNet produces the lowest accuracy because its backbone is not ResNet34, so no pre-trained parameters can be loaded in the training step. Figure 14 shows the results of the different methods on LoveDA; MSFENet achieves the best results in terms of both boundaries and accuracy.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Funding
This work was supported by the National Natural Science Foundation of China under Grant 41871372.

Figure 1. The architecture of the multi-scale features extraction network (MSFENet), including four residual blocks (RB) composed of several basic blocks (BB), four dense atrous convolution modules (DACM), and five decoder blocks (DB).

Figure 4. Illustration of pixel-level positive and negative sample pairs. Different colours represent different categories. Ac, Ad, Ba, Bb, Cc, Cd, Dc, and Dd are positive sample pairs; Aa, Ab, Bc, Bd, Ca, Cb, Da, and Db are negative sample pairs.
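The pairing rule in Figure 4 can be illustrated by enumerating positive and negative pair masks from a label map (an illustrative sketch with names of our own choosing; in the actual CL training the pairs are formed over feature vectors sampled from the network, not over raw pixels):

```python
import numpy as np

def pair_masks(labels):
    """Given a flat array of per-pixel class labels, return boolean matrices
    marking positive pairs (same class) and negative pairs (different class).
    The diagonal (a pixel paired with itself) is excluded from both."""
    labels = np.asarray(labels).ravel()
    same = labels[:, None] == labels[None, :]
    eye = np.eye(len(labels), dtype=bool)
    positive = same & ~eye
    negative = ~same
    return positive, negative
```

A contrastive loss then pulls the features of positive pairs together and pushes those of negative pairs apart, which is how CL reduces the sample size requirement: supervision is extracted from pairwise relations rather than from per-pixel labels alone.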

Figure 5. Four selected GF-2 images and the corresponding labelled images. Row A shows standard false-colour composites of the GF-2 images, Row B shows the labelled images in LSCS, and Row C shows the images relabelled by visual interpretation.

4.1 Implementation details
All codes are implemented in Python and C++. The PyTorch deep learning framework in Python is used to implement MSFENet and the other deep learning methods. The GDAL library in C++ is employed to clip training samples in the training phase and to clip validation

Figure 7. The water area distribution histograms of region 1 (a) and region 2 (b), and the cumulative water area distribution histograms of region 1 (c) and region 2 (d).

Figure 8. Water body prediction process using deep learning models. The yellow boxes represent valid areas after cropping, and the red boxes represent all areas after cropping.
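The scheme in Figure 8, predicting on overlapping crops while keeping only each crop's central valid area, could be sketched as follows (a hypothetical minimal implementation; the names, the margin value, and the assumption that the image sides are multiples of the valid step are ours):

```python
import numpy as np

def predict_tiled(image, predict, tile=512, margin=64):
    """Sliding-window inference: run `predict` on overlapping `tile` windows
    and keep only each window's central valid area (tile - 2*margin),
    discarding `margin` border pixels to avoid edge artifacts.
    `image` is (bands, H, W); `predict` maps a (bands, tile, tile) array to a
    (tile, tile) label map. H and W are assumed to be multiples of the step."""
    _, h, w = image.shape
    step = tile - 2 * margin
    out = np.zeros((h, w), dtype=np.int64)
    # Reflect-pad so border windows also have a full margin on every side.
    padded = np.pad(image, ((0, 0), (margin, margin), (margin, margin)),
                    mode="reflect")
    for y in range(0, h, step):
        for x in range(0, w, step):
            window = padded[:, y:y + tile, x:x + tile]
            pred = predict(window)
            out[y:y + step, x:x + step] = pred[margin:margin + step,
                                               margin:margin + step]
    return out
```

Dropping the window borders matters because convolutional predictions near crop edges lack full spatial context and are systematically less reliable.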

Figure 10. Accuracy curves with increasing sample size: (a) test set, (b) region 1, and (c) region 2.

Figure 12. The GF-2 image in the test set (a) and the simulated image (b).
The LoveDA images cover both urban and rural areas. Seven land cover types are included in LoveDA: background, buildings, roads, water, barren, forests, and agriculture. Since the ground truth is not provided for the LoveDA test set, the validation set was divided into two random equal parts, one used as the new validation set and the other as the new test set. All images were cropped into 512 × 512 patches, yielding 10,088 images for training, 3336 for validation, and 3340 for testing. The experimental settings are the same as in the experiments on GID.
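The cropping step can be sketched as follows (an illustrative NumPy version with a name of our own choosing; the paper's pipeline uses GDAL). Each 1024 × 1024 image yields four non-overlapping 512 × 512 patches, consistent with 2522 × 4 = 10,088 training patches:

```python
import numpy as np

def crop_to_patches(image, patch=512):
    """Crop a (bands, H, W) array into non-overlapping patch x patch tiles,
    scanning row by row; partial tiles at the edges are discarded."""
    _, h, w = image.shape
    return [image[:, y:y + patch, x:x + patch]
            for y in range(0, h - patch + 1, patch)
            for x in range(0, w - patch + 1, patch)]
```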
Conclusions
This paper proposed a multi-scale features extraction network (MSFENet) with CL for water extraction from optical high-resolution remote sensing imagery. In MSFENet, an MSFE is designed to extract multi-layer multi-scale features of water bodies, and these features are then combined to obtain the final water extraction results. Compared with other methods, MSFENet achieves the best performance on both GID and LoveDA. Moreover, the recall, F1-score, and kappa coefficient are improved by 9.49%, 4.28%, and 4.4% in region 1, and by 4.93%, 1.49%, and 1.12% in region 2, compared with UNet, which achieves the second-highest F1-score and kappa coefficient on the test set. These results indicate that MSFENet can effectively improve the extraction accuracy of small water bodies. In addition, the spectral variability analysis shows that MSFENet is more stable than the other neural networks, especially for small water body extraction: the recall, F1-score, and kappa coefficient are improved by 21.61%, 14.28%, and 14.58% in region 1, and by 26.14%, 20.18%, and 20.9% in region 2, compared with PSPNet, which achieves the second-highest F1-score and kappa coefficient in regions 1 and 2 of the simulated image. Furthermore, CL does not further improve water extraction accuracy when the sample size is sufficient, but brings a large improvement when the sample size is insufficient. In future work, the structure of MSFENet will be further optimized, and a larger test set will be used to evaluate the water extraction ability of MSFENet and to explore the contribution of CL to the model.


Table 1. Accuracy assessment results of the water label image in LSCS.

Table 2. Accuracy assessment results of the ablation study on the test set (bold numbers refer to the highest values).

Table 3. Accuracy assessment results of the ablation study on region 1 (bold numbers refer to the highest values).

Table 4. Accuracy assessment results of the ablation study on region 2 (bold numbers refer to the highest values).

Table 5. Accuracy assessment results when 1% of the training samples are used (bold numbers refer to the highest values).

Table 6. Accuracy assessment results of different methods on the test set (bold numbers refer to the highest values).

Table 7. Accuracy assessment results of different methods in region 1 (bold numbers refer to the highest values).

Table 8. Accuracy assessment results of different methods in region 2 (bold numbers refer to the highest values).

Table 9. Accuracy assessment results of different methods on the simulated image (bold numbers refer to the highest values).

Table 10. Accuracy assessment results of different methods on region 1 of the simulated image (bold numbers refer to the highest values).

Table 11. Accuracy assessment results of different methods on region 2 of the simulated image (bold numbers refer to the highest values).

Table 12. Accuracy assessment results of different methods in LoveDA (bold numbers refer to the highest values).