Boundary-Assisted Learning for Building Extraction from Optical Remote Sensing Imagery

Abstract: Deep learning methods have been shown to significantly improve the performance of building extraction from optical remote sensing imagery. However, preserving the morphological characteristics of buildings, especially their boundaries, remains a challenge that requires further study. In this paper, we propose a novel fully convolutional network (FCN) for accurately extracting buildings, in which a boundary learning task is embedded to help maintain the boundaries of buildings. Specifically, in the training phase, our framework simultaneously learns building extraction and boundary detection, and only outputs extraction results during testing. In addition, we introduce spatial variation fusion (SVF) to establish an association between the two tasks, coupling them so that they share latent semantics and interact with each other. Furthermore, we utilize separable convolution with a larger kernel to enlarge the receptive fields while reducing the number of model parameters, and adopt the convolutional block attention module (CBAM) to boost the network. The proposed framework was extensively evaluated on the WHU Building Dataset and the Inria Aerial Image Labeling Dataset. The experiments demonstrate that our method achieves state-of-the-art performance on building extraction and that, with the assistance of boundary learning, the boundary maintenance of buildings is ameliorated. A further experiment on satellite images also shows that our method improves building extraction from satellite images of relatively lower quality.


Introduction
Building extraction from optical remote sensing imagery is one of the fundamental tasks in remote sensing, playing a key role in many applications, such as urban planning and construction, natural crisis and disaster management, and population and regional development [1][2][3]. Years of development in Earth observation technology have made high-quality remote sensing images available, in which the spatial resolution and the elaborate spectral, structural, and texture information of objects are increasingly well represented [4]. These advances make the various objects in the imagery distinguishable and accurate extraction of buildings possible. Meanwhile, the amount of remote sensing imagery is growing rapidly, which creates a strong demand for automatic image processing.
During the past several decades, the major methods for building extraction from aerial or satellite imagery consisted of designing features (spectrum, edge, shape, shadow, and so on) that could best represent buildings [5]. For example, Huang et al. [6] utilized the intrinsic spectral-structural properties of buildings and proposed the morphological building index. Wang et al. [7] introduced a semi-automatic building extraction method by tracking the edge and linear features. Hu et al. [8] proposed an enhanced morphological building index for automatic building extraction based on shape characteristics. Ok et al. [9] modeled the directional spatial relationship between buildings and their shadows. Those methods have made some achievements, but only work under certain conditions, as the features are handcrafted, and such empirically designed features have poor generalization ability.
In recent years, deep learning techniques [10], especially convolutional neural networks (CNNs), have been applied to many fields and have shown important implications in the field of computer vision. CNNs automatically and efficiently learn hierarchical features that map the original inputs to the designated labels without any prior knowledge. Due to their powerful feature representation ability, CNNs outperform traditional approaches by leaps and bounds in many vision tasks [11]. Early CNNs [12][13][14] focused on the classification of whole images, but lacked accurate identification and positioning of objects in images. Since Long et al. [15] first put forward fully convolutional networks (FCNs), semantic segmentation, a pixel-level classification task [16] that aims to assign each pixel to the class of its enclosing object, has developed dramatically. Based on the FCN paradigm, a large variety of state-of-the-art (SOTA) FCN frameworks have been proposed to improve segmentation performance. For example, the encoder-decoder network (SegNet [17]) adopts elegant structures to better recover object details. Multi-scale networks (PSPNet [18], DeepLab [19]) extract information from feature maps of various scales to fit objects of different sizes. Networks with skip connections (SharpMask [20], U-Net [21]) combine multi-level features to generate crisp segmentation maps. These classic FCNs have made significant achievements on natural scene or medical image datasets, such as PASCAL VOC [22], Microsoft COCO [23], and BACH [24]. Inspired by the great success of CNNs in computer vision, researchers have extended them to remote sensing image processing [25]. Building extraction, which can be settled via semantic segmentation, has also benefited greatly from FCNs. For example, Yi et al. [26] modified the U-Net [21], proposing an encoder-decoder network to perform urban building segmentation. Maggiori et al. [27] developed a multi-scale structure in an FCN to reduce the tradeoff between recognition and precise localization. Ji et al. [5] proposed a Siamese structure that takes an original image and its down-sampled counterpart as inputs to improve the prediction of large buildings. Liu et al. [4] introduced a spatial residual inception module that aggregates multi-scale contexts to capture buildings of different sizes.
The past few years have witnessed the attainments of FCNs designed for remote sensing imagery segmentation in building extraction. However, owing to the complex shapes of buildings, the occlusion of shadows, and the similarity of some artificial features, making segmented buildings maintain their unique morphological characteristics, such as straight lines and right angles, is still a problem that requires immediate resolution [4], as shown in the example results of U-Net [21] in Figure 1: clutter exists around the boundaries, some buildings are incomplete, and the straight lines and right angles of some buildings are not well preserved.
The morphology of buildings is reflected in their boundaries; thus, some studies have made efforts to predict more accurate boundaries by extracting more boundary information or making use of post-processing techniques. For instance, Sun et al. [28] proposed a method based on the SegNet [17], in which an active contour model is used to refine boundary prediction. Yuan [29] designed a deep FCN that fuses outputs from multiple layers; a signed distance function was designed as the output representation to delineate the boundary information in this work. Shrestha et al. [30] adopted conditional random fields (CRFs) as the post-processing technique to improve the quality of building boundaries in their segmentation results. Xu et al. [31] designed a deep network that takes remote sensing images as input, as well as some hand-crafted features, to extract urban buildings, and employed a guided filter for further optimization. However, methods focusing on acquiring more information about boundaries require more sophisticated structures or use auxiliary data, such as light detection and ranging (LiDAR) [28], normalized differential vegetation index (NDVI), and normalized digital surface model (NDSM) [31], while the post-processing stage usually complicates the methods. To further improve the morphology of buildings, a few studies attempted multi-task frameworks, incorporating segmentation and boundary prediction. For instance, Bischke et al. [32] proposed a multi-task model that preserves the semantic segmentation boundary by optimizing a comprehensive loss, which is composed of the losses of boundary categories and segmentation labels. However, the model training requires considerable time because there is no direct connection between the tasks.
To address this problem, we propose a boundary-assisted learning method for segmenting buildings while keeping their boundary morphology. On the one hand, an ancillary task, boundary learning, is embedded in parallel with the segmentation task to help maintain the boundaries. On the other hand, a spatial variation fusion (SVF) module is introduced to establish an association between the two tasks, so that the boundary learning task obtains a prototype from the segmentation task, while the segmentation task is constrained by boundary learning, making the two tasks promote each other. In addition, to enlarge the receptive fields and decrease the computational cost, separable convolutions [33] with large filters are adopted as a substitute for standard convolutions. Additionally, following the attention mechanism [34], a prevalent technique for enhancing neural networks, we introduce the convolutional block attention module (CBAM) [35], which improves visual tasks by combining channel attention and spatial attention, into our network to boost the model. On this basis, we propose a novel end-to-end FCN framework to accurately extract buildings from remote sensing images. Our experiments show that, without using auxiliary data or post-processing, our method achieves superior performance over several SOTA works on two challenging public datasets, the WHU Building Dataset [5] and the Inria Aerial Image Labeling Dataset [36]. The innovations and contributions of this paper are summarized in the following points.
• A boundary-assisted learning pattern is proposed, with the assistance of which the boundary morphology maintenance of buildings is markedly ameliorated. Moreover, the SVF module combines the segmentation task and the boundary learning task so that they can interact with each other, making the network easier to train.
• A new FCN-based architecture is proposed. The utilization of separable convolutions reduces the number of parameters in the model while expanding the receptive fields by using large filters. The introduction of a CBAM plays a role in boosting the model.

Methodology
The main purpose of this paper is to explore a means that improves building segmentation by overcoming the inaccurate boundary morphology extracted from remote sensing images. The key idea lies in using the boundary learning task to help maintain buildings' morphological characteristics and guide the network to optimize the segmentation results. This section begins with an overview of the proposed framework. Then, the significant components, including the SVF module, separable convolution, and CBAM, are elaborated. The loss functions come at the end.

Overall Framework
The overall pipeline of our framework is shown in Figure 2; it consists of three stages. In the first stage, an image is fed into a backbone network to generate multi-level semantics. The backbone is modified from Darknet-53, proposed in YOLOv3 [37]: we replace the original residual block with our "Separable Residual Block" and down-sample the input image four times (in YOLOv3, an input image is down-sampled five times; as excessive down-sampling leads to information loss for small buildings, we only down-sample four times). In the second stage, the feature maps with different spatial resolutions are gradually up-sampled and aggregated to generate the segmentation mask. In the last stage, a preliminary boundary mask is produced after the spatial variation operation is performed; then, the boundary mask is fused with the features extracted from the aggregated semantics to generate the refined boundary. Both the segmentation and boundary masks are generated to compute the loss function in the training phase, but only the first two stages are retained during testing. Figure 2. The overall architecture of our framework. The first stage produces multi-level semantics with different spatial resolutions (1/2, 1/4, and 1/16 of the input size, respectively). The second stage parses them into the segmentation mask. The last stage generates the boundary mask. During testing, only the first two stages are retained.

Significant Modules
As mentioned above, some key components play important roles in our framework. This subsection gives details about them.

Spatial Variation Fusion
After several steps in the second stage, we obtain the segmentation mask probability map F ∈ R^(H×W×1). Inspired by the spatial gradient fusion proposed in [38], we modify it and propose the SVF, which easily generates buildings' semantic boundaries from segmentation masks by deriving spatial variation. We use adaptive pooling to derive the spatial variation ∇F:

∇F(x, y) = |pool_s^k(F(x, y)) − F(x, y)|, (1)

where (x, y) denotes a location on the mask probability map and |·| refers to the absolute value function. pool_s^k is the max-pooling operation with kernel size k and stride s; k controls the width of the derived boundary. In our framework, we set k to 3 and s to 1. In [38], average pooling is used to produce boundaries; compared to this, max-pooling produces more distinct boundaries. Some examples are shown in Figure 3. Once the boundary mask is produced, it is fused with the single-channel features by concatenation; the fusion is then fed into a convolution layer to produce the final boundary mask probability map. The SVF module provides a prototype for boundary learning, while the derived boundary acts as a constraint on segmentation learning. Compared with separate segmentation and boundary generation, network training becomes easier, as the two tasks can interact with and benefit each other.
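As an illustration, the spatial variation derivation above can be sketched in a few lines of NumPy. This is a minimal re-implementation, not the authors' code; it assumes edge padding so the boundary map keeps the input size.

```python
import numpy as np

def max_pool_same(f, k=3):
    """k x k max-pooling with stride 1 and edge padding (keeps spatial size)."""
    pad = k // 2
    padded = np.pad(f, pad, mode="edge")
    h, w = f.shape
    out = np.empty_like(f)
    for i in range(h):
        for j in range(w):
            out[i, j] = padded[i:i + k, j:j + k].max()
    return out

def spatial_variation(f, k=3):
    """Boundary map from a segmentation probability map: |maxpool(F) - F|.
    Interior and far-background pixels give 0; only pixels near a
    probability transition survive, with a width controlled by k."""
    return np.abs(max_pool_same(f, k) - f)

# A hard 5x5 mask with a 3x3 "building" in the centre: the derived
# boundary is the one-pixel ring of background around the building.
mask = np.zeros((5, 5))
mask[1:4, 1:4] = 1.0
boundary = spatial_variation(mask)
```

With max-pooling the transition ring comes out at full strength 1.0, which matches the paper's observation that max-pooling yields more distinct boundaries than the average pooling used in [38].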

Separable Convolution
Separable convolution was initially designed for mobile vision applications [33], aiming to achieve equal or even better results with less computation. It performs a depth-wise spatial convolution with a filter size of k that acts separately on each input channel, followed by a point-wise convolution with a filter size of 1 that mixes the resulting output channels. Based on separable convolutions, our "Separable Residual Block" is designed as presented in Figure 4.
The "Separable Residual Block" enlarges the receptive field by applying larger filters; meanwhile, the number of parameters is reduced. For example, suppose that the input tensor has C_in channels and the output tensor has C_out channels; then, the number of filter parameters in a standard convolution layer (with a filter size of 3) reaches C_in · C_out · 3 · 3. However, the number of filter parameters in a separable convolution layer (with a filter size of 5) only reaches C_in · 5 · 5 + C_in · C_out · 1 · 1 = C_in · (5 · 5 + C_out), which is dramatically less than that of the standard convolution. Thanks to separable convolution, the number of parameters in our model is reduced by about 18 million. Figure 4. (a) denotes the standard convolution and the original residual block of Darknet-53 [37], and (b) denotes the separable convolution and our "Separable Residual Block".
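The parameter arithmetic above is easy to verify directly. The sketch below plugs in an assumed channel width of 256 (the paper does not state the channel counts of individual layers); biases are ignored, as in the text.

```python
def standard_conv_params(c_in, c_out, k):
    """Filter parameters of a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k

def separable_conv_params(c_in, c_out, k):
    """Depth-wise k x k convolution (c_in * k * k parameters) followed by
    a point-wise 1x1 convolution (c_in * c_out parameters)."""
    return c_in * k * k + c_in * c_out

# Even with the larger 5x5 filter, the separable layer is far cheaper
# than a standard 3x3 layer for a typical channel width of 256.
c_in, c_out = 256, 256
std3 = standard_conv_params(c_in, c_out, 3)   # 256 * 256 * 9
sep5 = separable_conv_params(c_in, c_out, 5)  # 256 * (25 + 256)
```

For these assumed widths the standard 3×3 layer needs 589,824 parameters versus 71,936 for the 5×5 separable layer, roughly an 8× reduction, which is consistent with the overall saving of about 18 million parameters reported for the full model.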

Convolutional Block Attention Module
Deep neural networks (DNNs) extract hierarchical semantic features, among which the low-level features have poor semantic information but rich spatial location information owing to their small receptive fields and high resolution. On the contrary, the high-level features have strong semantic information but weak spatial location information because of their large receptive fields and low resolution. Hence, indiscriminate concatenation of features from different levels may cause inconsistencies, which confuses networks about the allocation of attention between high-level and low-level features [39]. On the other hand, DNNs generate the feature representation of complex objects by collecting semantic sub-features from different regions. However, those sub-features are often spatially affected by similar patterns and noisy backgrounds [40]. Thus, it is necessary to emphasize the important parts and suppress the unimportant parts.
Due to these factors, semantic segmentation based on FCNs is a task that needs to weigh features in the channel dimension and sub-features in the spatial dimension. Therefore, following [35], we introduce the CBAM, which infers attention along these two separate dimensions, into our framework. The CBAM is designed as shown in Figure 5; it is a combination of a channel attention mechanism and a spatial attention mechanism.

Channel attention (Figure 6a) "tells" the network how to allocate weights for different channels. Given an input feature map F, we first squeeze and excite its spatial information [41] by using global average-pooling and global max-pooling, generating two descriptors: V_avg and V_max. The former integrates the global information of each channel, while the latter gathers clues about distinctive object features. Then, the descriptors are forwarded to a shared block S (consisting of two convolution layers with a filter size of 1) to produce two vectors, which are next merged by an element-wise sum and finally output the 1D attention vector V after a sigmoid function σ is applied. The channel attention is computed as:

V = σ(S(V_avg) + S(V_max)). (2)

Spatial attention (Figure 6b) "tells" the network which regions are informative and should be paid more attention. We first perform average-pooling and max-pooling along the channel axis and concatenate the results to generate a descriptor [D_avg; D_max], which helps highlight informative regions [42]. Then, this descriptor is fed into a convolution layer with a filter size of 7, which finally produces the 2D attention map M after a sigmoid operation. The spatial attention is computed as:

M = σ(f^(7×7)([D_avg; D_max])), (3)

where f^(7×7) denotes the convolution with a filter size of 7. In short, the computation in the CBAM for an input feature map F is as follows:

F′ = V ⊗ F, F″ = M ⊗ F′, (4)

where ⊗ denotes element-wise multiplication (the attention values are broadcast accordingly), V is computed from F, M is computed from F′, and F″ is the refined output. Figure 6. The channel attention mechanism and spatial attention mechanism. (a) and (b) respectively denote the channel attention module and the spatial attention module.
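A minimal NumPy sketch of the two attention branches follows. It is illustrative only: the shared block is modeled as two plain matrix multiplications with a ReLU in between, and the weight shapes (including the hidden reduction size) are assumptions, since the text does not specify the reduction ratio.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat, w1, w2):
    """feat: (H, W, C). Global average- and max-pooling give two C-dim
    descriptors; a shared two-layer block (matmuls w1, w2 with a ReLU in
    between) is applied to both, the results are summed and squashed to
    a per-channel weight vector in (0, 1)."""
    v_avg = feat.mean(axis=(0, 1))
    v_max = feat.max(axis=(0, 1))
    shared = lambda v: np.maximum(v @ w1, 0.0) @ w2
    return sigmoid(shared(v_avg) + shared(v_max))  # shape (C,)

def spatial_attention(feat, kernel):
    """Average- and max-pooling along the channel axis give an (H, W, 2)
    descriptor; a single kxk convolution (here a dense 'kernel' of shape
    (k, k, 2)) reduces it to an (H, W) attention map."""
    desc = np.stack([feat.mean(axis=2), feat.max(axis=2)], axis=2)
    k = kernel.shape[0]
    pad = k // 2
    padded = np.pad(desc, ((pad, pad), (pad, pad), (0, 0)))
    h, w, _ = feat.shape
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = (padded[i:i + k, j:j + k] * kernel).sum()
    return sigmoid(out)

def cbam(feat, w1, w2, kernel):
    """Sequential application: channel attention first, then spatial."""
    feat = feat * channel_attention(feat, w1, w2)          # broadcast over C
    return feat * spatial_attention(feat, kernel)[..., None]
```

Both attention outputs lie in (0, 1), so the module rescales, and never inverts, the input features.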

Loss Functions
As illustrated in Figure 2, two masks are generated by our proposed network: a segmentation probability map and a boundary probability map. Accordingly, we compute two loss functions that correspond to them.
Segmentation loss: For semantic segmentation, the cross-entropy loss is most commonly used, which treats all pixels equally. However, we argue that treating backgrounds and buildings equally may make a network less able to recognize buildings in scenes it has never seen before. Therefore, to guide the network to pay more attention to the buildings themselves, we propose the foreground-enhanced loss function for this binary classification task:

L_seg = −(1/N) Σ_(i=1)^N ω_i [y_i log p_i + (1 − y_i) log(1 − p_i)], (5)

where y_i and p_i respectively denote the label and predicted probability of pixel i, and N is the number of pixels in a mini-batch. ω_i (ω_i > 0) is the weight assigned to pixel i. In our experiments, if pixel i belongs to a building, ω_i = 2; otherwise, ω_i = 1. In this way, buildings contribute more to the loss function, making the network focus more on them during training.

Boundary loss: Boundary learning suffers from a higher missing rate, as boundaries are very sparse. To alleviate this impact, we follow [39,43] and define the following class-balanced cross-entropy loss function:

L_bd = −(1/N) Σ_(i=1)^N [β y_i log p_i + (1 − β)(1 − y_i) log(1 − p_i)], (6)

where β is the percentage of non-boundary pixels in the ground truth.
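The two losses can be sketched as follows. This is an illustrative NumPy version, not the authors' TensorFlow implementation; the probability clipping by eps is our addition for numerical safety.

```python
import numpy as np

def foreground_enhanced_loss(y, p, w_building=2.0, eps=1e-7):
    """Weighted binary cross-entropy: building pixels (y == 1) receive
    weight w_building, background pixels weight 1, so buildings
    contribute more to the loss."""
    y = np.asarray(y, float)
    p = np.clip(np.asarray(p, float), eps, 1 - eps)
    w = np.where(y == 1, w_building, 1.0)
    return -(w * (y * np.log(p) + (1 - y) * np.log(1 - p))).mean()

def class_balanced_loss(y, p, eps=1e-7):
    """Class-balanced cross-entropy for sparse boundaries: beta is the
    fraction of non-boundary pixels, so the rare boundary class is
    weighted by beta and the background by (1 - beta)."""
    y = np.asarray(y, float)
    p = np.clip(np.asarray(p, float), eps, 1 - eps)
    beta = 1.0 - y.mean()  # percentage of non-boundary pixels
    return -(beta * y * np.log(p)
             + (1 - beta) * (1 - y) * np.log(1 - p)).mean()
```

On a toy batch with one building pixel out of four, the weighting doubles the foreground term, and the boundary loss self-balances with beta = 0.75.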
In general, a multi-task framework trains all tasks at the same time. However, boundary learning is more difficult to train from scratch due to the sparsity of boundaries. Therefore, in the early phase of training, we freeze the boundary learning task and only train the segmentation task; once the network can generate a reasonably accurate segmentation mask, we add boundary learning. Thus, the integrated loss function for network optimization in the early phase is:

L = L_seg, (7)

and the integrated loss function in the later phase is:

L = L_seg + λ L_bd, (8)

where λ is the weight for balancing the losses (we set it to 2 in our experiments). By this means, a preparatory boundary mask can be provided by the spatial variation ∇F to reduce the difficulty of boundary learning.
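The two-phase schedule amounts to a small helper like the one below. Note that warmup_epochs is a hypothetical parameter: the text does not state at which epoch boundary learning is switched on.

```python
def total_loss(l_seg, l_boundary, epoch, warmup_epochs=5, lam=2.0):
    """Two-phase training loss: boundary learning is frozen during the
    warm-up phase, then added with balancing weight lam (the paper uses
    lam = 2; warmup_epochs is an assumed value)."""
    if epoch < warmup_epochs:
        return l_seg
    return l_seg + lam * l_boundary
```

In the early phase only the segmentation loss drives optimization; afterwards the boundary loss is added with twice the weight, reflecting the sparsity of boundary pixels.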

Experiments and Comparisons
This section first introduces the datasets used in the experiments and the experimental implementation details, and then displays the building extraction results and provides qualitative and quantitative comparisons between our methodology and some SOTA FCNs.

Datasets
The WHU Building Dataset [5] and Inria Aerial Image Labeling Dataset [36] were chosen to evaluate our proposed methodology.
WHU Building Dataset (this dataset is available at https://study.rsgis.whu.edu.cn/pages/download/building_dataset.html): This dataset contains four sub-datasets, from which we selected the aerial imagery dataset and satellite dataset II. The former covers a surface area of about 450 km² in New Zealand, comprising 8189 tiles of 512 × 512 pixels with a 0.3 m spatial resolution, officially divided into a training set (4736 images), a validation set (1036 images), and a testing set (2416 images). The latter covers 860 km² in East Asia with a 0.45 m ground resolution. This sub-dataset has 17,388 tiles of 512 × 512 pixels, among which 13,662 tiles were separated for training and the rest were used for testing.

Inria Aerial Image Labeling Dataset (this dataset is available at https://project.inria.fr/aerialimagelabeling/download/): This dataset has 360 tiles of 5000 × 5000 pixels with a spatial resolution of 0.3 m, covering 10 cities all over the world. It covers various types of urban buildings, such as sparse courtyards, dense residential areas, and large venues.
Many images from the WHU Building Dataset do not contain any buildings; we performed data scrubbing by excluding such images. The sizes of the images from the Inria Aerial Image Labeling Dataset were too large, so we cropped them into 480 × 480 tiles with a stride of 452 pixels to fit in with the GPU's capacity. Note that the Inria Aerial Image Labeling Dataset only provides ground truth for the training set. Therefore, we followed the official suggestion and selected the first five images of each city from the training set for validation, and the rest were used for training. For both datasets, we adopted random flipping to augment them.
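The cropping arithmetic can be checked with a short sketch (the authors' actual cropping code is not shown; this helper only computes tile origins under the stated tile size and stride).

```python
def tile_coords(size=5000, tile=480, stride=452):
    """Top-left corner offsets for cropping a size x size image into
    tile x tile patches with the given stride. If the stride does not
    land exactly on the border, a final tile is shifted back so it ends
    at the image edge and no pixels are lost."""
    coords = list(range(0, size - tile + 1, stride))
    if coords[-1] != size - tile:
        coords.append(size - tile)
    return coords

coords = tile_coords()
```

For the Inria images, a stride of 452 with 480-pixel tiles gives 11 positions per axis (0, 452, ..., 4520), so each 5000 × 5000 image yields 121 tiles with a 28-pixel overlap between neighbors and exact coverage of the image border.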

Implementation Details
All work was done with TensorFlow [44] using Python. For optimization, we adopted the Adam optimizer [43] with a base learning rate of 0.001, which decayed at a rate of 0.9 after every epoch. All models were trained for up to 20 epochs on their corresponding datasets, and only the best weights were saved. The pixel values of images were rescaled to between 0 and 1 before being input into the networks, and L2 regularization was introduced in all convolutions with a weight decay of 0.0001 to avoid over-fitting. For the aerial imagery dataset from the WHU Building Dataset, we initialized the convolution kernels with "He initialization" [45]. For the Inria Aerial Image Labeling Dataset, we used the weights pre-trained on the WHU aerial imagery dataset for initialization. Training was accelerated with an NVIDIA GTX 1070Ti GPU. As U-Net [21] has been shown to achieve good performance in building extraction [46] and quite a few research works have been inspired by it, we used our re-implementation of U-Net as the baseline. It took 430 and 618 minutes, respectively, to train our model and U-Net on the WHU Aerial Building Dataset, while training our model and U-Net on the Inria Aerial Image Labeling Dataset took 756 and 1026 minutes, respectively.
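The learning-rate schedule stated above reduces to simple exponential decay, assuming the 0.9 factor is applied once at the end of each epoch:

```python
def learning_rate(epoch, base=0.001, decay=0.9):
    """Per-epoch exponential decay: the base rate 0.001 is multiplied
    by 0.9 after every completed epoch."""
    return base * decay ** epoch
```

Over the 20-epoch budget this brings the rate down to roughly 0.00012, about an eighth of the initial value.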

Results and Comparisons
To demonstrate the superiority of our methodology in building extraction, and especially in the maintenance of the morphological characteristics of buildings, we list the segmentation results of U-Net and our method side by side and focus on comparing the predicted boundaries. We also quantitatively compare our methodology with some SOTA FCNs by adopting four metrics, namely precision, recall, F1-score, and intersection-over-union (IoU), to evaluate the performance from multiple perspectives. The precision, recall, and F1-score are respectively defined as:

precision = TP / (TP + FP), (9)
recall = TP / (TP + FN), (10)
F1 = 2 · precision · recall / (precision + recall), (11)

where TP, FP, and FN represent the pixel numbers of true positives, false positives, and false negatives, respectively. Building pixels are positive, while background pixels are negative. IoU is defined as:

IoU = |P_p ∩ P_t| / |P_p ∪ P_t|, (12)

where P_p denotes the set of pixels predicted as buildings, P_t denotes the ground truth set, and |·| denotes the number of pixels in a set.

Figure 7 shows segmentation examples from the WHU Aerial Building Dataset. Through visual inspection, it appears that our method outputs finer segmentation results: there are fewer false positives and less clutter, and the morphological characteristics of buildings, such as straight lines and angles, are better preserved. The last two rows indicate that fewer errors occur with our method, while more wrongly predicted pixels exist for U-Net, especially around the boundaries. For further comparison, we compare our framework with several SOTA methods, including well-known FCNs proposed by the computer vision community, that is, SegNet [17], DeepLab [19], and RefineNet [47], as well as some recent works in remote sensing, that is, SRI-Net [4], CU-Net [46], and SiU-Net [5]. The quantitative comparison is summarized in Table 1.
Our method outperforms the baseline (U-Net) by leaps and bounds and ranks first among these methods in terms of recall, F1, and IoU, while only the precision is slightly lower than that of SRI-Net. Table 1. Quantitative comparison (%) of several state-of-the-art (SOTA) methods used on the WHU Aerial Building Dataset (the highest values are underlined). The first three methods and SRI-Net were implemented by [4], SiU-Net [5] is the official method provided by the dataset, and the others were implemented by [5].
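The four evaluation metrics can be computed directly from binary masks, as in the following illustrative sketch:

```python
import numpy as np

def building_metrics(pred, gt):
    """Pixel-level precision, recall, F1, and IoU for binary masks
    (1 = building/positive, 0 = background/negative)."""
    pred = np.asarray(pred, bool)
    gt = np.asarray(gt, bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)   # |Pp ∩ Pt| / |Pp ∪ Pt|
    return precision, recall, f1, iou
```

Note that IoU penalizes both false positives and false negatives in a single number, which is why it is commonly the headline metric in the comparison tables.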

Comparison on the Inria Aerial Image Labeling Dataset
Building extraction examples from the Inria Aerial Image Labeling Dataset are displayed in Figure 8. The results produced by our method contain much fewer errors, and less chaos occurs around the boundaries of buildings. In addition, it is observed that our method outputs more complete buildings compared to U-Net, especially large buildings. This is due to the larger convolution kernels in the separable convolutions and the fact that our network is deeper, which expands the receptive fields of our network, making it capture more context information.
The quantitative comparison with SOTA methods is reported in Table 2. On this dataset, our method also achieves the highest recall, F1, and IoU, far beyond the others. The performance on this aerial dataset is poorer than that on the WHU Aerial Building Dataset; the main reason is that there are more challenging cases, such as higher buildings and shadows. In addition, a few incorrect labels exist in this dataset, as has been illustrated in [5].
In summary, our method achieves the best performance on three indicators, while only the precision is slightly lower. We hypothesize that this could be due to our foreground-enhanced loss function, which highlights buildings more than backgrounds and may cause a few more false positives. Nevertheless, it improves the recall, as fewer building pixels are wrongly predicted as backgrounds. More details are discussed in Section 4.1. Table 2. Quantitative comparison (%) with several SOTA methods on the Inria Aerial Image Labeling Dataset (the highest values are underlined). The first three methods and SRI-Net were implemented by [4], who cropped the Inria aerial images to tiles of 256 × 256 pixels.

Effectiveness of Boundary-Assisted Learning
To verify the effectiveness of boundary-assisted learning in preserving the morphological characteristics of buildings, we designed and trained other kinds of networks that have the same structure as our boundary-assisted learning framework, except that the components for boundary generation are removed. As has been mentioned in Section 3.3.2, our foreground-enhanced loss function may influence the performance; to further discuss this, we trained the networks on the WHU Aerial Building Dataset with different loss functions by changing the weight ω i in Equation (5). In total, we trained eight networks on this dataset-four of them simultaneously learned building extraction and boundary detection, while the other four only learned building extraction. All models were trained for 20 epochs. Table 3 exhibits the performances of networks with different configurations. The table clearly shows that with the assistance of boundary learning, the network attains better performance on all metrics no matter what value of ω i is set (except for ω i = 3, where the precision is slightly lower, and ω i = 2, where the recall is negligibly lower). Figure 9 also supports this conclusion; the network with the boundary learning task outputs buildings with more complete and explicit boundaries, and fewer pixels are wrongly classified. The weight ω i determines how much attention our network pays to buildings, and the attention grows as ω i increases. In our ablation study, we find that larger ω i leads to lower precision, but brings higher recall most of the time. This is because the more attention is given to buildings, the less likely they are to be classified as backgrounds, but the probability of classifying backgrounds as buildings is increased, too. For a compromise, it is appropriate to set this parameter to 2, which makes the network achieve the best comprehensive performance. Figure 9. Examples of building extraction produced by "w/B" and "B". 
From left to right: aerial images, labels, predictions of "w/B", predictions of "B", errors of "w/B", and errors of "B". The weight ω i is set to 2.

Analysis of the Attention Module
The attention module, CBAM, plays an important role in boosting our network. We take CBAM-3 as an example and visualize the channel attention as well as the spatial attention to illustrate how it works within the network.
The values of the channel attention vector and spatial attention map are visualized in Figure 10. Channel attention is designed to automatically weigh the channels in the adjacent-level concatenated feature maps, and we analyze it by visualizing the 1D channel attention vector (the former half corresponds to the lower-level feature maps with higher spatial resolution, as well as the latter half for the higher-level one with lower spatial resolution) and drawing the histogram. The vectors indicate that CBAM-3 tends to choose larger weights to act on higher-level feature maps in general, and the tendency becomes obvious as building areas increase. Spatial attention is designed to automatically weigh different regions in feature maps. From the 2D spatial attention map, we can see that the spatial attention successfully guides CBAM-3 to put larger weights on buildings than backgrounds, and the emphasis varies in dealing with different buildings. CBAM-3 focuses more on boundaries when buildings are large and high, while more attention is paid to whole individuals for small and low buildings.

Evaluation on Satellite Images
The experiments above were performed on aerial image datasets. To explore whether our method also applies to satellite images, we trained our network on the WHU Satellite Building Dataset II, and all training settings were kept the same as for the WHU Aerial Building Dataset. The quantitative evaluation and building extraction results are in Table 4 and Figure 11, respectively. Compared to the aerial image datasets, the resulting accuracies decreased. Nevertheless, our method is still ahead of the others in the four metrics. The other two methods also perform worse on this dataset, and we argue that the main reason is that those satellite images are poorer in quality and buildings are more blurred, which increases the difficulty of building extraction, as can be observed from the aerial and satellite datasets. However, it is indispensable to further improve current methods for satellite imagery, as they can provide a wider range of Earth observations, which is irreplaceable by aerial imagery.

Conclusions
In this paper, a state-of-the-art FCN model for performing building extraction from optical remote sensing images was proposed. The proposed method mainly focuses on three key improvements. (1) A boundary learning task and the spatial variation fusion module are embedded in the semantic segmentation architecture, which helps ameliorate the morphological characteristics of segmented buildings. (2) Separable convolution with a larger kernel is adopted to expand the receptive fields, while the number of model parameters is prominently reduced. (3) The convolutional block attention module, which combines channel attention and spatial attention, is utilized to boost the segmentation performance of the model. Experiments on two challenging aerial image datasets exhibit the superiority of our method. The comparisons demonstrate that our method obtains state-of-the-art building extraction results and ameliorates building boundaries. Moreover, ablation studies give a detailed explanation of the working mechanism of our improvements. A further experiment on satellite images also shows that our method improves the performance of building extraction from satellite images of relatively lower quality. In future research, we will further improve our method.

Data Availability Statement:
Codes and models that support this study are available at the private link: https://gitee.com/sheng029/rs-balnet.