A landslide extraction method of channel attention mechanism U-Net network based on Sentinel-2A remote sensing images

ABSTRACT Accurate landslide extraction is significant for landslide disaster prevention and control. Remote sensing images have been widely used in landslide investigation, and landslide extraction methods that combine deep learning with remote sensing images (such as U-Net) have received considerable attention. However, because landslides in remote sensing images vary in shape and texture, have rich spectral features, and are surrounded by complex ground objects, landslide extraction using U-Net can suffer from false detection and missed detection. Therefore, this study introduces a channel attention mechanism, the squeeze-and-excitation network (SENet), into the feature fusion part of U-Net, constructs an attention U-Net landslide extraction model that combines SENet and U-Net, and uses Sentinel-2A remote sensing images for model training and validation. The extraction results are evaluated with several metrics and compared with those of two models: U-Net and U-Net Backbone (U-Net without skip connections). The results show that the proposed model can effectively extract landslides from Sentinel-2A remote sensing images with an F1 value of 87.94%, which is about 2% and 3% higher than U-Net and U-Net Backbone, respectively, with less false detection and more accurate extraction results.


Introduction
Landslides are a common geological disaster in which a large amount of rock, debris, or soil is loosened by rivers, rain, earthquakes, and other factors and then moves downward along a slope under the effect of gravity (Prakash, Manconi, and Loew 2020). Landslides are characterized by high hazard, destructiveness, susceptibility, and suddenness, posing a serious threat to the safety of human life and property worldwide while causing damage to the surface environment (Cheng et al. 2021; Thomas et al. 2021; Wandong et al. 2021). Therefore, rapid and accurate extraction of landslide areas is essential for carrying out emergency relief work and postdisaster recovery. In addition, accurate landslide extraction provides spatial information on landslides, including their location and extent, which is the basis for landslide susceptibility modeling, risk assessment, and other work (Ghorbanzadeh, Gholamnia, and Ghamisi 2022; Trinh et al. 2022).
Currently, landslide extraction methods mainly include field surveys and investigations, remote sensing methods, and deep learning methods (Li et al. 2022; Mohan et al. 2021). Field surveys and investigations are the most direct and traditional methods for landslide identification; they obtain landslide information with high accuracy but require a lot of labor, material, and time (Li, Shi, Lu, et al. 2016). Remote sensing methods include visual interpretation, pixel-based methods, and object-based methods (Chen, Trinder, and Niu 2017; Guzzetti et al. 2012; Li, Shi, Myint, et al. 2016; Martha et al. 2012; Zhao et al. 2017). Visual interpretation was the earliest method used in remote sensing interpretation. It is a professional method for landslide extraction based on the characteristics of image tone, texture, shape, and position (Xu et al. 2009). The results of visual interpretation are generally more accurate, but the method requires staff with extensive knowledge and experience as well as a lot of time and effort, so it may fail to meet the needs of emergency disaster relief decision-making (Liu et al. 2020; Wang et al. 2021). The pixel-based landslide detection method mainly adopts a change detection strategy: by observing land cover changes across different periods, landslides can usually be detected (Yang, Wang, and Shi 2013). However, this method only uses the spectral feature information of a single pixel and ignores the correlation between adjacent pixels, which can cause pixels to be misclassified (Wandong et al. 2021; Yi and Zhang 2020). The object-oriented method can effectively use the spectral characteristics of the target object in the image to reduce errors in pixel-level classification (Blaschke, Feizizadeh, and Hölbling 2014). However, the thresholds set for the object-oriented method are generally valid only for specific research areas; hence, its universality needs to be further improved (Keyport et al. 2018).
With its strong feature extraction capability, deep learning can extract intrinsic and deep features (Liu and Wu 2016; Sarkar and Mishra 2018; Tien Bui et al. 2020; Zhu et al. 2020). The fully convolutional network (FCN) (Long, Shelhamer, and Darrell 2015) was the first end-to-end fully convolutional network model for pixel-level prediction. Building on FCN, U-Net introduced concatenate operations through the skip connection structure to effectively fuse high- and low-dimensional image features and greatly improve segmentation accuracy (He et al. 2021; Ronneberger, Fischer, and Brox 2015). The U-Net network model handles the complex features of remote sensing images well and has strong feature-learning capability (Shamsolmoali et al. 2019).
U-Net was initially used for biomedical image segmentation; later, it was widely used in landslide extraction based on remote sensing images, and satisfactory results were achieved (Dong et al. 2022; Qi et al. 2020; Zhang et al. 2020). Soares, Dias, and Grohmann (2020) used the U-Net model to automatically extract landslides in the mountains of Rio de Janeiro, Brazil. However, when the U-Net model is directly used to extract small-scale landslides, the RGB spectral information of remote sensing images is insufficient, the landslide characteristics are not obvious, and the model has trouble learning them. When the image contains objects that are spectrally similar to landslides, such as bare land and dry land, the U-Net model produces false and missed detections. Liu et al. (2020) added the residual learning unit to U-Net and expanded the input data from three RGB channels to six channels by adding a digital surface model (DSM), slope, and aspect, three parameters closely related to landslides; in doing this, they were able to achieve good results. However, less attention has been paid to the multiple feature channels formed after the U-Net skip connection. In addition, when extracting landslides, U-Net directly fuses the extracted shallow features with the deep features through the skip connection structure; the semantic difference between them is large, which easily generates a semantic gap (He et al. 2021; Pang et al. 2019) and thus interferes with the deep features learned by the U-Net model. Meanwhile, multiple feature channels are formed after feature fusion, and the U-Net model pays no attention to the relative importance of these channels, which affects the model performance and the accuracy of the landslide extraction results. Therefore, how to distinguish the importance of feature channels, enhance the learning of landslide features, effectively distinguish landslides from easily confused ground objects, and improve the accuracy of landslide extraction is a thorny problem faced by various landslide extraction methods (Bragagnolo et al. 2021).
In the present paper, the channel attention mechanism network model squeeze-and-excitation network (SENet) (Hu, Shen, and Sun 2018) is added after the skip connection of U-Net so that the improved U-Net model can adjust the weights of the feature channels, focus on the channels that contribute most to feature classification, and process the fused feature channels more effectively. The model thereby concentrates on learning landslide features and mitigates the semantic gap that arises when U-Net directly connects features of different levels through the skip connection. Meanwhile, Sentinel-2A remote sensing data are used for landslide extraction. These data contain 13 spectral bands with rich spectral information, which is conducive to the model's learning of landslide features and thus improves the accuracy of landslide extraction.

Experimental scene and data sources
Lanzhou City is located at 35°34′20″-37°07′07″N, 102°35′58″-104°34′29″E. It is situated in the northwest region of China and is one of the most important central cities in the western region (Figure 1). The study area covers about 2565 km². The topography of Lanzhou is high in the west and south and low in the northeast, and the Yellow River flows from the southwest to the northeast and crosses the entire territory. Climatically, Lanzhou lies deep inland and belongs to the temperate semiarid climate zone, with a dry climate, an annual average temperature of 10.3°C, and an annual average precipitation of 327 mm, which is mainly concentrated in June to September and falls mostly as heavy and torrential rain. Coupled with the increasing impact of human engineering activities on the geological environment in recent years, this has made landslide disasters in the territory frequent. The landslides in Lanzhou are mainly mixed, medium, and small loess landslides. They mainly develop in loess and loess-like soils, which are difficult to distinguish from the surrounding ground objects, and the landslide boundaries are difficult to define (Mei and Zhang 2010).
Sentinel-2A satellite remote sensing data obtained from the official website of the USGS (United States Geological Survey, http://www.usgs.gov) were used for landslide extraction. Sentinel-2A is a high-resolution multispectral imaging satellite that carries a multispectral imager (MSI), orbits at an altitude of 786 km, and has a swath width of 290 km, covering 13 spectral bands with rich spectral information. Sentinel-2A covers the visible, near-infrared, and mid-infrared bands and has three levels of ground resolution: 10, 20, and 60 m. The specific sensor parameters are shown in Table 1. To avoid the influence of clouds on satellite imaging, the cloud coverage was set to 0% when downloading images. Under this cloud coverage setting, the present study selected remote sensing image data of Lanzhou acquired by the Sentinel-2A satellite in December 2021.

Methods
First, two Sentinel-2A images were selected to ensure coverage of the entire study area, and the experimental dataset was constructed after preprocessing to generate experimental samples. Then, a channel attention U-Net model was constructed with TensorFlow as the base framework, and the model was trained using the experimental dataset to extract the landslides. Finally, the extraction results of the U-Net, U-Net Backbone, and improved attention U-Net models were compared, the experimental results were evaluated using relevant evaluation metrics, and the performance of the landslide extraction model was analyzed. The overall workflow is shown in Figure 2, and the following subsections provide detailed information about various aspects of the methods.

Data processing and dataset
The Sentinel-2A image data downloaded from the USGS were Level-1C (L1C) data, which were not atmospherically corrected. The L1C data needed to be preprocessed before they could be used for landslide extraction. First, to remove radiometric errors caused by atmospheric influences and recover the true surface reflectance of ground objects, the L1C data were atmospherically corrected to obtain Level-2A (L2A) data. In the present paper, this was done with Sen2cor, a plug-in released by the European Space Agency (ESA) for producing L2A data. Next, the L2A data were resampled to 10 m resolution using SNAP software and exported for subsequent processing and analysis. Finally, the resampled image data were imported into ENVI software and mosaicked so that the result covered the entire study area.
After the data preprocessing was completed, the landslide dataset needed to be produced. The production process included the following three parts: Sentinel-2A image data: The 12 preprocessed bands of the Sentinel-2A images were loaded into ArcMap 10.5 for processing, and Sentinel-2A images of the study area were obtained. Because the original satellite image was too large for the network input, code was written in Python to crop the image using an overlapping cut strategy with a step size of 128; this divided the original Sentinel-2A satellite image into subimages of size 256 × 256, which were then normalized, with each one containing one or more landslide events (Figure 3a). In deep learning, the number of training samples is generally sufficient, but there is still a shortage of high-quality representative training samples. To increase the number of samples, including noisy ones, and thereby improve the generalization ability and robustness of the model, the present study employed data enhancement strategies to generate more training samples. Specifically, the cut images were mirrored and rotated (90° counterclockwise) to obtain 5,448 training samples of size 256 × 256.
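The cropping and enhancement steps described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the min-max normalization, the choice of a left-right mirror, and the random toy scene are all assumptions, since the paper does not specify them.

```python
import numpy as np

def tile_image(image, tile=256, stride=128):
    """Cut an (H, W, C) array into overlapping tile x tile patches.

    Rows and columns that do not fill a whole tile at the right or bottom
    edge are dropped (an assumption; the paper does not describe edge handling).
    """
    h, w = image.shape[:2]
    patches = []
    for top in range(0, h - tile + 1, stride):
        for left in range(0, w - tile + 1, stride):
            patches.append(image[top:top + tile, left:left + tile])
    return patches

def normalize(patch):
    """Min-max scale a patch to [0, 1] (assumed normalization scheme)."""
    patch = patch.astype(np.float32)
    lo, hi = patch.min(), patch.max()
    return (patch - lo) / (hi - lo) if hi > lo else np.zeros_like(patch)

def augment(patch):
    """Return the patch, a left-right mirrored copy, and a 90-degree counterclockwise rotation."""
    return [patch, np.fliplr(patch), np.rot90(patch, k=1, axes=(0, 1))]

# Hypothetical 12-band scene standing in for the mosaicked Sentinel-2A image.
scene = np.random.default_rng(0).integers(0, 10000, size=(512, 640, 12))
tiles = [normalize(p) for p in tile_image(scene)]
samples = [a for t in tiles for a in augment(t)]
```

With a stride of half the tile size, each interior pixel appears in up to four tiles, which is what makes the cut "overlapping". The same functions would be applied to the label raster so that images and labels stay aligned.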
For the dataset, first, according to the historical landslide data, the landslides in the study area were vectorized and labeled on Google Earth through visual interpretation. Then, the labeled landslide vector data were imported into ArcMap 10.5, converted into raster data, and exported as a binary image to obtain the ground truth, which served as the label for deep learning model training. The foreground (white) represents the landslide area, and the background (black) represents the non-landslide area. Next, the same cutting and data augmentation were applied to the ground truth as to the Sentinel-2A image. Finally, the 5,448 training samples and the corresponding ground truth were randomly divided into a training set, validation set, and test set at a ratio of 6:2:2. After the division, the training set contained 3,270 samples, the validation set contained 1,089 samples, and the test set contained 1,089 samples. The training set, validation set, and test set were mutually exclusive (Figure 3b). The training set was used to train the model, the validation set was used to select the optimal model parameters, and the test set was used to quantitatively evaluate the performance of the model.
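A 6:2:2 random split that reproduces the sample counts reported above might look like the sketch below. The function name, the seed, and the rounding rule (validation and test sizes rounded down, remainder assigned to training) are assumptions chosen so that the counts match; the paper does not describe its exact procedure.

```python
import random

def split_indices(n_samples, ratios=(0.6, 0.2, 0.2), seed=42):
    """Shuffle sample indices and split them into disjoint train/val/test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = int(n_samples * ratios[1])   # rounded down
    n_test = int(n_samples * ratios[2])  # rounded down
    n_train = n_samples - n_val - n_test  # remainder goes to training
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test

train_idx, val_idx, test_idx = split_indices(5448)
print(len(train_idx), len(val_idx), len(test_idx))  # 3270 1089 1089
```

Splitting by index keeps each image paired with its ground-truth label, because both are looked up with the same index.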

U-Net network
U-Net is a two-dimensional image semantic segmentation network based on the FCN (Ronneberger, Fischer, and Brox 2015). One critical idea of U-Net is the skip connection structure, which fuses low-dimensional and high-dimensional features and significantly improves segmentation accuracy. At the same time, U-Net can accurately segment images from little training data and trains quickly, which has made the network widely used in remote sensing image segmentation. The U-Net architecture consists of a contracting path (encoder), an expanding path (decoder), and the skip connection structure, forming a U-shaped structure, which is a typical encoder-decoder structure (Chang et al. 2021; Liu et al. 2019).
The encoder extracts features from the image and reduces spatial dimensionality. It is similar to the standard CNN architecture and contains convolutional layers and downsampling layers. The convolutional layers extract image features, and the downsampling layers filter out unimportant high-frequency information and reduce feature dimensionality. Repeated convolution and pooling operations can fully extract the higher-level features of an image. The role of the decoder is to upsample the feature maps extracted by the encoder to match the size of the input image, thus performing pixel-level semantic prediction of the input image. The decoder contains convolutional layers, upsampling layers, and dimensional splicing operations (concatenate). In the decoding process, the number of channels of the feature map is halved at each upsampling; then, the feature maps of the corresponding scales in the encoding and decoding parts are fused by the concatenate operation, which splices the feature maps together along the channel dimension to form thicker features. This operation retains higher-resolution, more detailed information and improves the resolution and edge accuracy of the final segmentation results. In the present study, U-Net was used as the basic network of the landslide boundary information extraction model. The number of U-Net channels started from 32 and rose to 512 after four downsampling iterations. The structure is shown in Figure 4.
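To make the channel progression concrete, the sketch below lists the feature-map size and channel count at each encoder stage. It assumes that every downsampling step halves the spatial size (as with 2 × 2 pooling) and doubles the channel count, which is consistent with the stated 32-to-512 progression but is not spelled out in the text; the function name is hypothetical.

```python
def encoder_schedule(input_size=256, base_channels=32, depth=4):
    """Return (spatial size, channels) for the input stage and each downsampled stage."""
    stages = [(input_size, base_channels)]
    for _ in range(depth):
        size, channels = stages[-1]
        # Assumed: each downsampling halves height/width and doubles channels.
        stages.append((size // 2, channels * 2))
    return stages

for size, channels in encoder_schedule():
    print(f"{size} x {size} x {channels}")
```

Under these assumptions a 256 × 256 input reaches a 16 × 16 × 512 bottleneck, and the decoder walks the same schedule in reverse.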
The skip connection structure of U-Net can improve segmentation accuracy by combining shallow, low-level feature mappings from the encoder with deep, high-level features from the decoder. However, some studies have pointed out that segmentation is not effective when the segmentation target is very small and that directly combining shallow and deep features through the skip connection tends to generate a semantic gap (He et al. 2021). Although the skip connection is a natural design, the feature mappings at the same scale in the encoder and decoder networks are semantically different, and they cannot be guaranteed to be the best match for feature fusion (Abdollahi and Pradhan 2021). Therefore, in the current study, a U-Net network model with the skip connection structure removed (U-Net Backbone) (Wang et al. 2022; Xiao, Yang, and Sadovnik 2021) (structure shown in Figure 5) was also used for landslide extraction, and this model was compared with U-Net and the improved attention U-Net.

Channel attention mechanism U-Net network
The attention mechanism (AM) is derived from the study of human vision: the human visual system quickly scans the global image to find the target region, that is, the focus of attention; the eye then invests more attention resources in this region to obtain more detailed information about the target while suppressing other irrelevant information (Quader et al. 2020). AM was first applied in the field of machine translation in 2014 and was then widely used in the field of computer vision (Niu, Zhong, and Yu 2021; Oktay et al. 2018). AMs applied in computer vision are generally classified into spatial domain attention, channel domain attention, and mixed domain attention. The core of the AM is to let the network focus on what it needs to pay more attention to. The AM is generally embodied in the form of weights that allocate limited information processing resources to the important parts, thus improving the performance of deep learning models (Guo et al. 2022; Yang 2020). In the current study, only the landslide features were extracted. One of the difficulties in landslide research is that other ground objects such as vegetation, bare ground, and dry land are treated as background, and the background, which occupies most of the image area, is regarded as irrelevant information. Some backgrounds have high spectral similarity with landslides, and the complex background information affects the accuracy of landslide extraction. Moreover, U-Net forms 'thicker' features after the channel-wise concatenate operation in the skip connection part, so the number of channels increases, and the semantic gap and redundant information that may arise after concatenation can impair the feature-learning ability of the model. Therefore, the present study incorporated SENet, a channel attention mechanism, after each skip connection of the U-Net network to estimate the contribution of different feature channels to landslide classification and to enhance or suppress each channel according to its contribution. The squeeze-and-excitation block (SE-block) is the core of SENet, and its structure is shown in Figure 6. The modified attention U-Net network structure is shown in Figure 7.
SENet aims to model the correlation between different channels and automatically obtain the importance of each feature channel through network learning; it then assigns a weight coefficient to each channel to reinforce the important features and suppress the unimportant ones. The implementation of SENet can be abstracted into three steps, as follows:
Squeeze operation: The first step is the squeeze operation, which compresses the feature map U along its spatial dimensions by global average pooling (formula 1). This turns each two-dimensional feature channel into a single real number, giving a vector $z \in \mathbb{R}^{C}$ that has a global receptive field and whose dimensionality equals the number of input feature channels:
$$ z_c = F_{sq}(u_c) = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} u_c(i, j) \tag{1} $$
where $u_c$ is the $c$-th feature map of U, H and W are the height and width of the feature map, C is the number of channels, and $z_c$ is a scalar value reflecting the global features of that feature map.
Excitation operation: This operation is used to fully capture the dependence between channels and to learn the nonlinear relationship between them. It is a mechanism similar to the gate in a recurrent neural network. A weight is generated for each feature channel through the parameter W. The excitation operation is mainly composed of two fully connected layers and two activation functions. The first fully connected layer uses ReLU as the activation function, and the second fully connected layer uses Sigmoid as the activation function. The expression is shown in formula 2:
$$ s = F_{ex}(z, W) = \sigma\left(W_2\,\delta(W_1 z)\right) \tag{2} $$
where $\delta$ denotes the ReLU function, $\sigma$ denotes the Sigmoid function, and $W_1$ and $W_2$ are the weights of the first and second fully connected layers, respectively.
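The squeeze and excitation operations can be illustrated with a small NumPy sketch: global average pooling over the spatial dimensions, then two fully connected layers with ReLU and Sigmoid. This is not the authors' TensorFlow implementation. The channel count, the reduction ratio r = 16, the random weights, and the omission of bias terms are assumptions, and the final channel-wise rescaling follows the scale step of Hu, Shen, and Sun (2018), which is not described in the text above.

```python
import numpy as np

def se_block(u, w1, w2):
    """Apply a squeeze-and-excitation block to one feature map.

    u:  (H, W, C) feature map
    w1: (C // r, C) weights of the first fully connected layer
    w2: (C, C // r) weights of the second fully connected layer
    Returns the reweighted feature map and the per-channel weights.
    """
    z = u.mean(axis=(0, 1))                    # squeeze: global average pooling -> (C,)
    hidden = np.maximum(w1 @ z, 0.0)           # first fully connected layer + ReLU
    s = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))   # second fully connected layer + Sigmoid
    return u * s, s                            # scale: multiply each channel by its weight

rng = np.random.default_rng(0)
C, r = 64, 16                                  # hypothetical channel count and reduction ratio
u = rng.standard_normal((32, 32, C))
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1
out, weights = se_block(u, w1, w2)
```

Because the Sigmoid output lies strictly between 0 and 1, each channel is attenuated rather than amplified in absolute terms; what the block learns is the relative weighting between channels.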

Training model
The training of the deep learning model relies on high computational power (Qi et al. 2020). U-Net stitches the feature maps together along the channel dimension to form thicker features. At the same time, the intermediate variables in the network and the large number of intermediate parameters generated by the optimization algorithm cause U-Net to consume a large amount of video memory during training, which places high demands on the hardware. The hardware configuration of this experiment is shown in Table 2, and the software configuration is shown in Table 3.
There are numerous parameters to be set and adjusted in the process of model training, and batch size is one of the more important ones. Batch size is often set between a few dozen and a few hundred but generally not more than 1000. In this experiment, considering the memory limitation of the graphics card, the batch size was set to 64.
After several pre-experiment tests and considering the hardware limitations, the computational efficiency of the model, and the accuracy of the results, the number of training epochs was set to 128, the learning rate to 0.00001, and the optimizer to adaptive moment estimation (Adam) (Kingma and Ba 2014). Adam is simple to implement and is well suited to a wide range of nonconvex optimization problems in the field of machine learning (Xiao, Yang, and Sadovnik 2021). The present paper studies an image semantic segmentation problem, which is a binary classification of image pixels, so binary cross-entropy (BCE) was chosen as the loss function. BCE applies to binary classification tasks and can be defined as follows:
$$ Loss_{BCE} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log p(y_i) + (1 - y_i)\log\left(1 - p(y_i)\right)\right] $$
where N is the number of samples over which the loss is averaged, $y_i$ is the binary label (0 or 1) of sample i, and $p(y_i)$ is the predicted probability that sample i belongs to the positive class (label 1).
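In its standard form, binary cross-entropy averages -[y log p + (1 - y) log(1 - p)] over all samples, where y is the 0/1 label and p is the predicted probability of the positive class. The pure-Python sketch below shows that computation on a handful of hypothetical pixel predictions; the function name and the clipping constant `eps` are assumptions added to avoid log(0), not details from the paper.

```python
import math

def binary_cross_entropy(labels, probs, eps=1e-7):
    """Mean binary cross-entropy over paired 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip so that log() is always defined
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)

# Hypothetical labels (1 = landslide) and predicted landslide probabilities.
loss = binary_cross_entropy([1, 0, 1, 0], [0.9, 0.1, 0.8, 0.3])
print(round(loss, 4))
```

Confident correct predictions contribute almost nothing to the loss, while confident wrong predictions are penalized heavily, which is why the loss pushes the per-pixel probabilities toward the labels.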
In the experiments, all models were trained from scratch without any pretrained weights. The area under the ROC curve (AUC) was used to monitor the training process, and the U-Net, U-Net Backbone, and modified attention U-Net models were trained under the same experimental settings.

Evaluation index
To make a comprehensive quantitative evaluation of the performance of the trained model for landslide extraction, six commonly used quantitative evaluation indexes were used: Kappa coefficient (Kappa), AUC, mean intersection over union (MIoU), precision (P), recall (R), and F1-score (F1). The classification task involved was a binary classification task, and its confusion matrix is shown in Figure 8, where TP (true positive) indicates that the model correctly predicted the landslide target, FN (false negative) indicates that the landslide target was incorrectly predicted as non-landslide, FP (false positive) indicates that the non-landslide target was incorrectly predicted as a landslide, and TN (true negative) indicates that the model correctly predicted the non-landslide target. However, these four indicators are relatively basic and do not clearly reflect the comprehensive performance of the model. Precision and recall were calculated based on these four basic indicators (Gao et al. 2021).
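As an illustration of how the four confusion-matrix counts feed the derived indexes, the sketch below computes precision, recall, F1, and a two-class MIoU using their standard definitions. The function name and the pixel counts are hypothetical and are not results from this study; the code also assumes that no denominator is zero.

```python
def segmentation_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F1, and two-class mean IoU from confusion counts."""
    precision = tp / (tp + fp)            # correct landslide pixels / predicted landslide pixels
    recall = tp / (tp + fn)               # correct landslide pixels / actual landslide pixels
    f1 = 2 * precision * recall / (precision + recall)
    iou_landslide = tp / (tp + fp + fn)   # IoU of the landslide class
    iou_background = tn / (tn + fp + fn)  # IoU of the non-landslide class
    miou = (iou_landslide + iou_background) / 2
    return {"precision": precision, "recall": recall, "f1": f1, "miou": miou}

# Hypothetical pixel counts for illustration only.
m = segmentation_metrics(tp=80, fp=20, fn=10, tn=890)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

In this toy case the background IoU is much higher than the landslide IoU because background pixels dominate, which is one reason MIoU is read alongside precision, recall, and F1 rather than on its own.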
The Kappa coefficient is a measure of consistency, which for classification problems is the agreement between the model prediction and the actual classification result. The Kappa coefficient can be calculated based on the confusion matrix and takes values between −1 and 1, usually greater than 0. The formula for calculating the Kappa coefficient is as follows:
$$ Kappa = \frac{N\sum_{i=1}^{n} X_{ii} - \sum_{i=1}^{n} X_{i+}X_{+i}}{N^{2} - \sum_{i=1}^{n} X_{i+}X_{+i}} $$
where n is the total number of columns of the confusion matrix (total number of categories); $X_{ii}$ is the number of samples in row i and column i of the confusion matrix, that is, the number of correctly classified samples; $X_{i+}$ and $X_{+i}$ are the total numbers of samples in row i and column i, respectively; and N is the total number of samples used for accuracy evaluation.
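The Kappa computation compares the observed agreement on the diagonal of the confusion matrix with the agreement expected from the row and column totals. A minimal sketch, assuming the matrix is given as a list of rows and using hypothetical counts rather than values from this study:

```python
def kappa(matrix):
    """Cohen's Kappa from a square confusion matrix given as a list of rows."""
    n_total = sum(sum(row) for row in matrix)                 # N: all samples
    diagonal = sum(matrix[i][i] for i in range(len(matrix)))  # correctly classified samples
    row_sums = [sum(row) for row in matrix]                   # X_{i+}
    col_sums = [sum(col) for col in zip(*matrix)]             # X_{+i}
    chance = sum(r * c for r, c in zip(row_sums, col_sums))
    return (n_total * diagonal - chance) / (n_total ** 2 - chance)

# Hypothetical 2 x 2 matrix for the landslide and non-landslide classes.
k = kappa([[80, 10], [20, 890]])
print(round(k, 4))
```

A matrix with all samples on the diagonal gives a Kappa of exactly 1, while agreement no better than the row and column totals would predict gives a value near 0.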
The AUC is defined as the area under the ROC curve and is a performance metric measuring the merits of a learner. Simply put, the larger the AUC value, the higher the correct rate of the classifier. The calculation formula is as follows:
$$ AUC = \frac{\sum_{ins_i \in positiveclass} rank_{ins_i} - \frac{M(M+1)}{2}}{M \times N} $$
where $ins_i$ denotes the i-th sample, $rank_{ins_i}$ is its rank when the probability scores are sorted from small to large, and M and N are the numbers of positive and negative samples, respectively. The sum over $ins_i \in positiveclass$ adds up the ranks of the positive samples only.
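The rank-based AUC computation sorts all samples by score, sums the ranks of the positive samples, subtracts M(M + 1)/2, and divides by M × N, where M and N are the numbers of positive and negative samples. The sketch below implements this in pure Python. The function name and the toy scores are hypothetical, and assigning tied scores their average rank is an added assumption that the text does not mention.

```python
def auc_by_rank(labels, scores):
    """AUC from the rank sum of positive samples (ranks start at 1, ascending score)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied scores.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    m = sum(labels)        # number of positive samples
    n = len(labels) - m    # number of negative samples
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - m * (m + 1) / 2) / (m * n)

auc = auc_by_rank([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 0.75
```

In this toy example three of the four positive-negative pairs are ordered correctly, which is exactly what the value 0.75 expresses.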
MIoU is a standard evaluation index of semantic segmentation networks; it is the ratio of the intersection to the union of the prediction results and the real labels, computed for each category and then averaged over the categories. For the binary case considered here, the calculation formula is as follows:
$$ MIoU = \frac{1}{2}\left(\frac{TP}{TP + FP + FN} + \frac{TN}{TN + FN + FP}\right) $$
Precision measures the percentage of correctly predicted samples (whose true label is positive) among all samples that the model predicts as positive; this is calculated as follows:
$$ P = \frac{TP}{TP + FP} $$
Recall measures the proportion of actual targets that are detected, that is, the percentage of all samples with positive labels that the model predicts as positive. The calculation formula is as follows:
$$ R = \frac{TP}{TP + FN} $$
The F1-score (F1) is the harmonic mean of precision and recall, which makes it a more balanced indicator. Because it considers both precision and recall, it better reflects the comprehensive performance of the model. The larger the F1, the better the performance of the model. The calculation formula is as follows:
$$ F1 = \frac{2 \times P \times R}{P + R} $$

Results and analysis

Landslide extraction model performance analysis
To further evaluate model performance, the present study used the confusion matrix to calculate five evaluation metrics (precision, recall, MIoU, F1, and Kappa) and quantitatively evaluate the prediction results of the three models. The results are shown in Figure 10. Figure 10 shows that the attention U-Net model outperformed the U-Net model and the U-Net Backbone model in four evaluation metrics: precision, F1, Kappa, and MIoU. However, the recall of the proposed attention U-Net model (95.19%) was slightly lower than that of the U-Net model (96.09%). Recall evaluates how completely a detector covers all the targets to be detected. From the analysis above, the U-Net model had a high false detection rate for small landslides. The experimental scene in the present paper contained small landslides, so it is highly likely that the U-Net model identified other objects as landslides, which eventually led to its high recall value. Evaluating a deep learning model requires combining multiple metrics to better reveal its performance. In the present study, taking the five metrics together, the proposed attention U-Net model showed the best balance and comprehensive performance. Therefore, the proposed attention U-Net model performed well and is a suitable model for landslide extraction.

Analysis of fine landslide extraction results
In the present study, U-Net, U-Net Backbone, and the proposed channel attention mechanism U-Net were used to extract the landslide boundaries of the main urban area of Lanzhou City. The fine extraction results for typical areas are shown in Figure 11. The first column in Figure 11 shows the composite images of bands 2, 3, and 4 of the Sentinel-2A satellite, the second column shows the ground truth, and the third to fifth columns show the extraction results of the U-Net, U-Net Backbone, and proposed models, respectively. By selecting several typical regions with fine landslide results for analysis, we found that the U-Net model extracted landslide boundaries more finely than the U-Net Backbone. The main reason may be that the skip connection structure of the U-Net model fused the low-level and high-level features of the landslides in the image during extraction and supplemented the semantic information of the landslides in the decoding stage; hence, the U-Net model was more refined in landslide boundary segmentation than the U-Net Backbone. However, compared with the ground truth, the U-Net model produced false detections (Figure 11c, rows 1-5, red circles). This was probably because, during training, the U-Net model paid no attention to the multiple feature channels formed after the fusion of the skip connections, which affected the model performance: when extracting landslides, the model tended to identify small non-landslide targets whose spectral features differ little from those of landslides as landslides. These false detections reduced the accuracy of landslide extraction and show that the U-Net model was still lacking in landslide feature extraction and learning.
Compared with the U-Net model, the U-Net Backbone model extracted rougher and more blurred landslide boundaries, especially for some fine landslides. An adhesion problem also occurred; that is, the model ignored the landslide boundaries and surrounding ground objects and identified adjacent landslides as a single one (Figure 11d, rows 1, 4, 5, and 6, red circles), resulting in misdetection. Hence, the overall segmentation results and edge details were not fine enough.
Compared with the U-Net and U-Net Backbone models, the channel attention mechanism U-Net model proposed in the present study produced results closer to the ground truth (Figure 11b) in each typical scene selected (Figure 11a) and extracted fine landslide boundary information in all of them (Figure 11e). In addition, the analysis revealed that the proposed model had the lowest false detection rate and was also able to extract small and slender landslides (Figure 11e, rows 2-6, red circles). Because the improved U-Net model had both the skip connection structure for fusing low-level and high-level features and an attention module (the SE-block) for processing the fused feature channels, it could suppress noise, emphasize important channels, focus on learning landslide features, and better distinguish landslide from non-landslide features in images, in turn improving model robustness, enabling finer prediction of landslide contours, and reducing the false detection rate of landslides in complex geographic environments.
To further examine the boundary accuracy of the proposed model in detail, the current study generated and analyzed probability mapping plots of landslide extraction for the three models (Figure 12). A probability mapping plot shows the raw probability (from 0 to 1) that each pixel in the input image is classified as landslide. The color bar on the right is the scale of the raw probability: blue indicates low probability and red indicates high probability. The first column is the ground truth, where all pixels are labeled as feature pixels (landslide) or background pixels (non-landslide). The second to fourth columns show the probability mapping plots of the U-Net model, the U-Net Backbone model, and the improved attention U-Net model for landslide segmentation, respectively.
Comparing the probability mapping plots of the three models (U-Net, U-Net Backbone, and the attention U-Net) with the ground truth (Figure 12a) shows that the proposed channel attention mechanism U-Net model classified the pixels in the test images best, especially for small landslides, with finer and more accurate boundaries. Compared with the improved channel attention mechanism U-Net model, the U-Net model misclassified more pixels around small landslides (Figure 12b); that is, some non-landslide pixels had a higher probability of being identified as landslide pixels, indicating insufficient learning of landslide features and weaker model performance. In Figure 12, the transition between the colors representing low and high probability occurred mainly at the edges of the landslides. The U-Net Backbone showed greater uncertainty at these edges: the transition zone (the area between blue and red) in its probability mapping plots (Figure 12c) was larger than that in the plots of the improved attention U-Net (Figure 12d); that is, the probability that a pixel at the edge of a landslide was determined to be landslide was lower than for the improved attention U-Net. At the same time, the third row of Figure 12c also showed a misclassification of non-landslide pixels. This also indicated that the U-Net Backbone model was not sufficiently refined in extracting landslide boundaries, especially for small landslides. In summary, the analysis of the probability mapping plots showed that the attention U-Net model proposed in the current paper was able to take the characteristics of landslides into account and extract fine landslide boundary information, especially for small landslides.
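One way to put a number on the width of the blue-to-red transition zone discussed above is to count the pixels whose probability is neither clearly low nor clearly high. The sketch below is only an illustration of that idea; the function names, the 0.1 and 0.9 bounds, the 0.5 decision threshold, and the two small probability maps are all assumptions and do not come from the study.

```python
def uncertain_fraction(prob_map, low=0.1, high=0.9):
    """Share of pixels whose landslide probability lies strictly between
    `low` and `high` (both bounds are illustrative assumptions)."""
    pixels = [p for row in prob_map for p in row]
    uncertain = sum(1 for p in pixels if low < p < high)
    return uncertain / len(pixels)

def binarize(prob_map, threshold=0.5):
    """Turn raw probabilities into a 0/1 landslide mask (threshold assumed)."""
    return [[1 if p >= threshold else 0 for p in row] for row in prob_map]

# Hypothetical 3x4 probability maps of the same scene.
sharp_edges = [
    [0.02, 0.03, 0.95, 0.98],
    [0.01, 0.50, 0.97, 0.99],
    [0.02, 0.04, 0.96, 0.98],
]
blurred_edges = [
    [0.02, 0.30, 0.70, 0.98],
    [0.01, 0.40, 0.60, 0.99],
    [0.02, 0.35, 0.75, 0.98],
]
print(uncertain_fraction(sharp_edges))    # 1 of 12 pixels is uncertain
print(uncertain_fraction(blurred_edges))  # 6 of 12 pixels are uncertain
```

A larger fraction corresponds to a wider transition zone at the landslide edge, that is, to less confident boundary predictions.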

Analysis of landslide extraction results on a large scale
The landslides in Lanzhou City are mainly mixed, medium, and small loess landslides composed of loess and secondary loess of various genesis. In addition, their spectral and textural characteristics are highly similar to those of the surrounding environment, which makes landslide extraction difficult. In the present study, based on Sentinel-2A images, the three trained models (U-Net, U-Net Backbone, and the improved channel attention U-Net) were used to extract landslides in the main urban area of Lanzhou City, and the results were used to verify the overall performance of the proposed model. The extraction results are shown in Figure 13. From Figure 13, it can be seen that the improved channel attention U-Net model performed the best and could extract landslide boundary information accurately, especially for small landslides. Its extraction result was the closest to the ground truth. According to quantitative statistics, the ground truth had a total of 255 landslides with a total area of 17.83 km²; the U-Net model extracted 307 landslides with a total area of 20.00 km²; the U-Net Backbone model extracted 230 landslides with a total area of 18.88 km²; and the improved channel attention U-Net model extracted 252 landslides with a total area of 19.16 km². Among the three models, the U-Net model extracted the largest number of landslides and the largest total area, whereas the U-Net Backbone model extracted the fewest landslides and the smallest total area. The number of landslides extracted by the attention U-Net model proposed in the present paper was close to the ground truth. Based on the above analysis, compared with the U-Net and improved attention U-Net models, U-Net Backbone had a weaker learning ability for landslide features, thus ignoring some landslides and having a higher missed detection rate. The U-Net model had a higher false detection rate for small landslides, and the improved attention U-Net model had the best landslide extraction results. Combined with the visual effect, the improved attention U-Net model had the best extraction effect on landslides among the three models and was the closest to the ground truth. In general, by incorporating SENet, a channel attention mechanism, the improved attention U-Net model could assign different weights to the feature channels generated after fusing low-dimensional and high-dimensional image features through the skip connection structure, focusing on the channels that contribute most to landslide feature classification. This helped to improve the recognition of the differences between landslide and non-landslide spectral features and to enhance the performance of the network in extracting landslide features, thus improving the accuracy of landslide extraction and reducing the missed and false detection rates of large-scale landslide extraction.
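The landslide counts and total areas reported above can be compared with the ground truth as signed relative deviations. The short calculation below uses only the figures given in the text; the helper name `relative_deviation` and the dictionary layout are illustrative choices.

```python
# Counts and total areas as reported in the text for the Lanzhou experiment.
ground_truth = {"count": 255, "area_km2": 17.83}
models = {
    "U-Net": {"count": 307, "area_km2": 20.00},
    "U-Net Backbone": {"count": 230, "area_km2": 18.88},
    "Attention U-Net": {"count": 252, "area_km2": 19.16},
}

def relative_deviation(value, reference):
    """Signed deviation from the reference value, in percent."""
    return 100.0 * (value - reference) / reference

deviations = {
    name: {
        key: round(relative_deviation(vals[key], ground_truth[key]), 1)
        for key in ground_truth
    }
    for name, vals in models.items()
}
for name, dev in deviations.items():
    print(f"{name}: count {dev['count']:+.1f}%, area {dev['area_km2']:+.1f}%")
```

With these numbers, the count deviations are about +20.4% (U-Net), -9.8% (U-Net Backbone), and -1.2% (attention U-Net), and the area deviations are about +12.2%, +5.9%, and +7.5%, respectively. Counts and total areas summarize the extraction only at an aggregate level; they do not by themselves measure how well individual extracted landslides overlap the ground truth.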

Comparison of models
Accurate and rapid landslide identification is crucial for disaster monitoring and emergency rescue work. Therefore, the efficiency of deep learning models is critical: the complexity of a model has a large impact on the time needed to extract landslides. In the present study, four metrics were selected to measure model complexity: the number of model parameters, the model size, the training time per step, and the testing time on the test set (Table 4) (Yi and Zhang 2020).
Based on these four metrics, the complexity of the attention U-Net network proposed in the current paper was slightly higher than that of the U-Net and U-Net Backbone networks, but there was no significant difference. The attention U-Net model used a feature recalibration strategy to obtain the importance of each channel, which introduced few additional parameters and required few additional computing resources (Goel et al. 2020; Niu, Zhong, and Yu 2021). At the same time, because the improved attention U-Net model incorporated the SENet module, which helped emphasize the important channels while suppressing noise (Suzuki and Yamane 2020), it processed landslide feature information more accurately, and its landslide extraction accuracy was better than that of the U-Net and U-Net Backbone models, with lower false detection and missed detection rates. Considering both model complexity and extraction accuracy, the complexity of the proposed attention U-Net model was found to be acceptable, and the model can meet the demand for accurate and fast landslide detection in disaster monitoring and emergency response.
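Why an SE block adds comparatively few parameters can be illustrated with a simple count. The sketch below assumes an SE block built from two fully connected layers with a reduction ratio of 16, and a decoder stage with 128 fused channels; neither the ratio nor the channel width is stated in this section, so both are hypothetical, as are the function names.

```python
def se_block_params(channels, reduction=16, bias=True):
    """Parameter count of one SE block with two fully connected layers.

    The reduction ratio of 16 is an assumed value; the ratio used in the
    study is not given in this section.
    """
    hidden = max(1, channels // reduction)
    weights = channels * hidden + hidden * channels
    biases = (hidden + channels) if bias else 0
    return weights + biases

def conv3x3_params(in_channels, out_channels, bias=True):
    """Parameter count of one 3x3 convolution layer, for comparison."""
    return 3 * 3 * in_channels * out_channels + (out_channels if bias else 0)

# Hypothetical decoder stage with 128 fused channels.
se = se_block_params(128)        # 128*8 + 8*128 + 8 + 128 = 2184
conv = conv3x3_params(128, 128)  # 9*128*128 + 128 = 147584
print(se, conv, f"{100 * se / conv:.2f}%")
```

Under these assumptions, one SE block holds roughly 1.5% of the parameters of a single 3x3 convolution at the same width, which is consistent with the small complexity increase described above.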

Comparison with previous work
Some scholars have used deep learning methods for landslide identification and extraction and have achieved good results (Ghorbanzadeh et al. 2019; Lei et al. 2019). Bragagnolo et al. (2021) identified and semantically segmented landslide scars in a region of Nepal based on the U-Net model and Landsat satellite images, which demonstrated the feasibility of the U-Net method. However, in the face of a wide range of complex scenarios, the U-Net model still has some shortcomings, and scholars have begun to improve it to obtain better landslide extraction results. Liu et al. (2020) embedded residual learning units in U-Net and added terrain data to extract landslides, obtaining better results than the original U-Net model. In addition, some scholars have used multisource data to extract landslides. Chen et al. (2022) extracted potential active landslides in the Three Rivers Region of the Qinghai-Tibet Plateau based on an improved U-Net (DRs-UNet) combined with InSAR deformation phase images.
Compared with existing studies, the method proposed in the current paper mainly takes into account the fact that multiband Sentinel-2A image data form multiple feature channels after being input into the model. The model performance was improved by focusing on the channels that contributed significantly to feature classification. In contrast, existing studies pay little attention to the importance of feature channels, which can lead to higher rates of missed and false detection. Therefore, in comparison, the method proposed in the present study can identify landslides quickly and accurately and is well suited to landslide extraction from multiband imagery.

Application of the model in other scenarios
To further verify the generalizability of the model, 12 bands of Sentinel-2A images from a typical area with extensive landslide distribution in the Bailong River basin were selected for validation.
The typical area was about 720 km², from 104°26′19.06″E to 104°47′37.27″E and from 32°53′13.19″N to 33°4′57.70″N. The number of landslides in the validation area was large; 123 landslides were identified by visual interpretation of Google Earth images and used as landslide samples. Because the geographic environment of the Bailong River basin was similar to that of Lanzhou City, the model trained on the Lanzhou City landslide dataset was used to provide the initial weights, and landslide samples from the surrounding areas outside the validation area, mutually exclusive with those in the validation area, were selected to fine-tune the model parameters. The landslides in the validation area were then extracted using the proposed attention U-Net model; the extraction results are shown in Figure 14.
The results show that the proposed attention U-Net method extracted most of the landslide boundaries in the validation area. The metrics of the landslide extraction results were as follows: F1 93.53%, precision 93.19%, recall 94.61%, MeanIoU 94.30%, and Kappa 93.30%. Hence, the proposed method had a low false detection rate and good evaluation indexes for landslide extraction, and its results were visually close to the ground truth, demonstrating the universality and robustness of the proposed method for landslide extraction.
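For reference, the five metrics listed above can be computed from pixel-level confusion counts using their standard definitions for binary segmentation. The sketch below shows that calculation; the confusion counts are hypothetical, and the exact averaging scheme used in the study (for example per image or per class) is not specified in this section.

```python
def binary_metrics(tp, fp, fn, tn):
    """Standard binary segmentation metrics from pixel-level confusion counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    # Mean IoU: average of the landslide and background IoU.
    iou_landslide = tp / (tp + fp + fn)
    iou_background = tn / (tn + fp + fn)
    mean_iou = (iou_landslide + iou_background) / 2
    # Cohen's kappa: observed agreement corrected for chance agreement.
    observed = (tp + tn) / total
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    kappa = (observed - expected) / (1 - expected)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mean_iou": mean_iou,
        "kappa": kappa,
    }

# Hypothetical confusion counts for a 1000-pixel tile.
metrics = binary_metrics(tp=90, fp=10, fn=5, tn=895)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.2f}%")
```

As a side calculation, the harmonic mean of the reported precision (93.19%) and recall (94.61%) is about 93.89%, slightly above the reported F1 of 93.53%; one possible explanation is that the reported values were averaged over images or classes rather than computed from a single pooled confusion matrix.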

Limitations and prospects
The attention U-Net model proposed in the present study is simple in structure and easy to deploy. The model focuses on the feature channels that contribute most to the classification of landslide features while fusing deep and shallow features, which improves model performance and the accuracy of landslide extraction. The attention module can be easily embedded into other classification or detection models to improve their performance, and the model can extract landslides quickly and accurately, reducing the false detection rate and missed detection rate. However, it tends to overextract surrounding non-landslide features, resulting in problems such as the overdetection of landslides (Figure 13). The main reasons for this can be summarized in the following three points. First, deep learning models require a large amount of training data (Yi et al. 2019). The current paper used data augmentation to expand the training data, but the augmentation strategy, which was limited to spatial transformations (mirroring and rotation), was insufficient. The training samples were not representative enough to cover all kinds of landslides, making it difficult for the proposed model to fully cope with some complex and rare landslides.
Second, landslides are natural disasters that occur under the action of gravity, so their occurrence is closely related to topographic factors. Optical images alone cannot effectively distinguish landslides from features with similar characteristics (bare ground, special artificial buildings, etc.). The present study did not include topographic data in addition to the remote sensing images, which led to false detections in complex environments.
Third, landslides have different spatial scales, and their areas range from a few square meters to several square kilometers (Ghorbanzadeh et al. 2019; Lei et al. 2019). In the present study, we used Sentinel-2A images with limited spatial resolution, so the visual features of very small landslides were not clear enough; as a result, the model ignored some of the detailed information on landslides during training, which increased the rate of missed detection.
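The spatial augmentation named in the first point above (mirroring and rotation) can be sketched as follows. This is a minimal illustration on nested lists, not the pipeline used in the study; the function names and the choice of six variants per tile are assumptions. The same transform is applied to every band and to the ground-truth mask so that image and label stay aligned.

```python
def flip_horizontal(tile):
    """Mirror a 2-D tile left to right."""
    return [list(reversed(row)) for row in tile]

def flip_vertical(tile):
    """Mirror a 2-D tile top to bottom."""
    return [list(row) for row in reversed(tile)]

def rotate90(tile):
    """Rotate a 2-D tile clockwise by 90 degrees."""
    return [list(row) for row in zip(*reversed(tile))]

def augment(bands, mask):
    """Return (bands, mask) pairs: original, two mirrors, three rotations."""
    transforms = [
        lambda t: [list(row) for row in t],         # original (copied)
        flip_horizontal,
        flip_vertical,
        rotate90,                                   # 90 degrees
        lambda t: rotate90(rotate90(t)),            # 180 degrees
        lambda t: rotate90(rotate90(rotate90(t))),  # 270 degrees
    ]
    return [([f(band) for band in bands], f(mask)) for f in transforms]

# Hypothetical 2x2 tile with two bands and its landslide mask.
bands = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
mask = [[0, 1], [0, 0]]
pairs = augment(bands, mask)
print(len(pairs))  # 6 variants of the same tile
```

Such transforms change only the orientation of existing samples, which is why they cannot add landslide types that are absent from the original training set.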
In future work, we will collect different kinds of landslides in multiple scenes to expand the samples and prepare richer data for the model to learn landslide features. At the same time, we can consider training on new samples from different regions and types through transfer learning (Zhang, Zhang, and Du 2016). We will also consider inputting DEM and slope data together with optical images into the model for training and landslide extraction (Liu et al. 2020), as well as using remote sensing images with higher spatial resolution.
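One simple way to feed DEM and slope data to the model together with the optical bands is to append them as additional input channels. The sketch below illustrates this under stated assumptions: all layers are already resampled to the same grid, the terrain layers are min-max scaled to the range 0 to 1, and the function names and sample values are hypothetical. It describes a possible design, not work carried out in the study.

```python
def min_max(layer):
    """Scale a 2-D layer to [0, 1]; a constant layer maps to zeros."""
    values = [v for row in layer for v in row]
    lo, hi = min(values), max(values)
    if hi == lo:
        return [[0.0 for _ in row] for row in layer]
    return [[(v - lo) / (hi - lo) for v in row] for row in layer]

def stack_inputs(optical_bands, dem, slope):
    """Append scaled DEM and slope layers to the optical bands as channels."""
    height, width = len(dem), len(dem[0])
    for layer in list(optical_bands) + [dem, slope]:
        if len(layer) != height or any(len(row) != width for row in layer):
            raise ValueError("all layers must share the same grid")
    return list(optical_bands) + [min_max(dem), min_max(slope)]

# Hypothetical 2x2 tile: 12 optical bands, elevation in metres, slope in degrees.
optical = [[[0.1, 0.2], [0.3, 0.4]] for _ in range(12)]
dem = [[1500.0, 1600.0], [1700.0, 1900.0]]
slope = [[5.0, 10.0], [20.0, 35.0]]
stacked = stack_inputs(optical, dem, slope)
print(len(stacked))  # 14 channels
```

With such an input, the first convolution of the network would need to accept 14 channels instead of 12, and the channel attention could then also weight the terrain-derived channels.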

Conclusion
In the present study, an attention U-Net landslide extraction model combining SENet and U-Net was proposed, and landslides were extracted using 12 bands of Sentinel-2A images. We introduced the channel attention mechanism network SENet after the skip connection part of the U-Net network to adjust the weights of different feature channels. This allowed the model to focus on the feature channels that contribute more to landslide classification, concentrate on learning landslide features, better distinguish the spectral differences between landslides and non-landslides, and enhance the performance of the network in extracting landslide features. The present study used the same test set to perform landslide extraction experiments on the three models: U-Net, U-Net Backbone, and the proposed attention U-Net. The results showed that, compared with the U-Net and U-Net Backbone models, the proposed attention U-Net model achieved better performance by assigning different weights to different feature channels. The landslide extraction accuracy was higher, the false detection rate and missed detection rate were lower, and the landslide edge extraction was more refined. The quantitative evaluation results showed that the evaluation indexes of the proposed method were generally high, including an F1 value of 87.94%, which was about 2% and 3% higher than those of the U-Net and U-Net Backbone models, respectively, verifying the effectiveness of the method. In addition, the proposed model showed robustness and universality when applied to other scenarios, which again indicates that the proposed model is reliable and can meet the demand for the rapid acquisition of landslide disaster information.
The objectives of the present paper are as follows: (1) to propose an attention U-Net landslide extraction model combining SENet and U-Net to reduce the false and missed detection rates of landslide identification in remote sensing images and improve the accuracy of landslide extraction; (2) to use the U-Net and U-Net Backbone models for the same remote sensing image landslide extraction task and compare their extraction performance and results with those of the improved attention U-Net model; and (3) to use the improved attention U-Net model, based on Sentinel-2A satellite images, for large-scale landslide extraction in a landslide-prone scene (Lanzhou City).

Figure 3. Sample of the partial dataset (12 bands, ground truth). (a) Image of a band of Sentinel-2A in the study area (band 8; the yellow rectangles represent the positions of some of the subimages obtained after cutting, with numbers corresponding to those in Figure 3b). (b) Subimages obtained after cutting (12 bands, ground truth).

Figure 9. Training process curves of the three models. (a) U-Net, (b) U-Net Backbone, (c) Attention U-Net.

Figure 10. Evaluation index of landslide extraction results of different models.

Figure 11. Extraction results of the three models on the test set. (a) Sentinel-2A images, (b) Ground truth, (c) U-Net, (d) U-Net Backbone, (e) Attention U-Net. The red circles mark landslides for which the results extracted by the different methods differ markedly from the ground truth.

Figure 13. Landslide extraction results of the three models in the experimental area.

Figure 14. Application of the model in other scenarios.

Table 2. Basic system platform configuration.

Table 3. The core software configuration.

Table 4. Comparisons of model complexity.