1. Introduction
Efficient extraction of buildings from remote sensing images can provide geospatial building data with wide coverage, clear spatial information, and fast update cycles for urban planning, disaster management, and other applications [1,2,3,4,5,6,7]. With the continuous progress of remote sensing technology, researchers have used building feature information in multi-temporal high-resolution remote sensing images to identify and manually extract urban buildings through visual interpretation and manual labelling [8,9]. However, under varying lighting conditions, some non-building objects (containers, cars, and roads) may share similar spectral and spatial features with buildings; visual interpretation may then misjudge these objects, leading to mislabeled annotations and poor building extraction results [10,11,12]. Therefore, researchers have sought better methods to improve building extraction accuracy.
Presently, researchers primarily extract building information from remote sensing images using both traditional and deep learning methods. Traditional methods for urban building extraction from remote sensing images include clustering algorithms, support vector machines, and random forests, among others [13,14,15]. Gavankar et al. devised an object-based approach that leverages high-resolution multispectral satellite images, combining K-means clustering and shape parameters to extract building outlines [16]. While this approach combines pixel-level information with object-level features, thus improving the accuracy of building extraction, clustering algorithms may struggle to differentiate between buildings and other features when they share similar spectral characteristics in complex background environments. Arham et al. explored object-based image analysis methods, combining support vector machines (SVMs) with rule-based image classification for building extraction tasks [17]. SVMs can provide relatively high precision in building extraction, especially with appropriate feature engineering and parameter tuning, even in complex urban settings. However, an SVM’s performance can be limited by factors such as noise and occlusions (e.g., trees, clouds, and tall buildings), especially when extracting fine-grained building outlines. Chen et al. employed a method based on random forests and superpixel segmentation to automatically extract buildings from remote sensing data [18]. While random forests exhibit robustness to noisy data and missing information, they struggle to provide detailed internal decision rules in complex urban scenes, resulting in reduced interpretability and potentially affecting the accuracy of urban building extraction. The performance of these three traditional methods heavily relies on the chosen feature sets, and inappropriate or incomplete feature selection can lead to suboptimal building extraction results. Therefore, a more precise research approach is needed to enhance the accuracy of urban building extraction.
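To illustrate the clustering-based family of methods described above, the following sketch clusters pixel spectra with a minimal K-means implementation in NumPy. The toy image, cluster count, and deterministic initialization are illustrative assumptions for exposition, not the configuration used in [16]:

```python
import numpy as np

def kmeans_pixels(image, k=2, n_iter=10):
    """Cluster pixel spectra with plain K-means; returns an H x W label map."""
    h, w, c = image.shape
    pixels = image.reshape(-1, c).astype(float)
    # Deterministic init: pick k pixels evenly spaced along the brightness order.
    idx = np.argsort(np.linalg.norm(pixels, axis=1))
    centers = pixels[idx[np.linspace(0, len(pixels) - 1, k).astype(int)]].copy()
    for _ in range(n_iter):
        # Assign each pixel to its nearest spectral center.
        dists = np.linalg.norm(pixels[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center as the mean spectrum of its cluster.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = pixels[labels == j].mean(axis=0)
    return labels.reshape(h, w)

# Toy 4x4 "image": a bright block (building-like roof) on a dark background.
img = np.zeros((4, 4, 3))
img[:2, :2] = 200.0
label_map = kmeans_pixels(img, k=2)
```

As the paragraph notes, such purely spectral clustering cannot separate buildings from spectrally similar non-buildings, which is why the object-based variants add shape parameters on top of the cluster map.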
With the rapid development of computer vision, researchers have started applying deep learning methods to urban building extraction tasks [19]. Deep learning, based on neural network structures, autonomously learns the relevant features of buildings in large-scale high-resolution remote sensing images (such as spectral, scale, and texture features), enabling efficient and accurate building extraction [20]. Currently, deep learning-based building extraction methods have been widely applied in areas such as object detection and semantic segmentation [21,22]. However, while object detection methods can successfully detect buildings, they cannot extract detailed urban building contours. Therefore, researchers have turned to deep learning-based semantic segmentation methods for urban building extraction. In 2016, Zhong et al. applied a fully convolutional network (FCN) with an encoder–decoder structure to extract buildings from high-resolution remote sensing images, achieving pixel-level segmentation [23]. Subsequently, derived semantic segmentation methods such as PSPNet [24], U-Net [25], HRNet [26], and the Deeplab series [27,28,29,30] have been utilized to further enhance building extraction efficiency [31,32]. To address the shortage of high-precision building datasets and the inability of semantic segmentation to further classify buildings, Ji et al. established a large-scale, high-precision building dataset (the WHU building dataset) covering multiple sample forms (raster and vector) and multiple data sources (aerial and satellite), and achieved building identification and extraction through an instance segmentation method based on Mask R-CNN [33]. Since traditional building extraction methods struggle to accurately segment buildings, roads, and trees in complex scenes, Xu et al. proposed a multi-layer feature fusion dilated convolution ResNet model, effectively overcoming interference from non-building objects such as trees and roads [34]. Considering the low segmentation accuracy and blurred edge contours of traditional building extraction methods, Zhang et al. combined a U-Net neural network with a fully connected CRF network and optimized the segmentation results according to image features, significantly improving segmentation accuracy and building contour integrity [35]. Yang et al. adopted the Deeplabv3plus algorithm to enhance the expression of building detail information and compared the classification performance of Deeplabv3, Deeplabv3plus, and BiseNet on a building sample library, addressing the poor robustness of machine learning in building extraction tasks and its difficulty in fully mining deep building features [36,37]. Although these methods improve accuracy and efficiency to some extent, building features remain difficult to extract when buildings are occluded by other ground objects (high-rise buildings and trees) or building edges are mixed with non-building elements (roads, vehicles, and trees), resulting in partially missing building outlines. Therefore, a deep learning method is needed that avoids the occlusion and mixing problems between non-building elements and buildings and maintains the integrity of building contours in urban building extraction.
To address the issue of incomplete contours in the building extraction process, we propose an improved deep learning network. To tackle the problem of missing building contours, we introduce a coordinate attention module into the improved network; by learning the coordinate information of different positions, it strengthens attention to the accurate spatial positioning of buildings and thereby improves the accuracy of building edge contour extraction. To further optimize the building contour, we design a pooling fusion module that improves contour clarity and enhances the perception of the overall building structure, achieving comprehensive optimization of the building contour. Moreover, existing methods are computationally inefficient and cannot support large-scale urban building extraction applications. To achieve fast extraction of urban buildings, we employ a lightweight backbone network that reduces the number of model parameters and thus improves the model’s inference efficiency. The main contributions of this research are as follows:
- (1)
We propose an advanced deep learning-based method for extracting urban buildings from high-resolution remote sensing images. In this study, we replaced the backbone network with a lightweight model to address the low computational efficiency of existing models. Additionally, we introduced an attention mechanism to strengthen the focus on the spatial coordinate information of building contours at different locations in the image, alleviating the problem of missing building outlines.
- (2)
Our improved network incorporates a fusion module that combines strip pooling with atrous spatial pyramid pooling (ASPP) to introduce lateral context information, enhancing the network’s representation of building edge features and further recovering building contour profiles. We validate the significant role of strip pooling in enhancing urban building feature extraction.
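The coordinate attention idea underlying contribution (1) factorizes global pooling into two directional poolings, along height and width, so that positional information survives in the attention weights. The NumPy sketch below is a deliberate simplification: the published coordinate attention module inserts shared 1×1 convolutions and normalization between pooling and gating, which we omit here to expose only the pooling geometry (an assumption about the core step, not our exact layers):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def coordinate_attention(x):
    """Simplified coordinate attention for one feature map x of shape (C, H, W).

    Pools along each spatial axis separately, so the attention weights retain
    the row/column position of responses (unlike a single global pooling).
    """
    pool_h = x.mean(axis=2, keepdims=True)    # (C, H, 1): per-row context
    pool_w = x.mean(axis=1, keepdims=True)    # (C, 1, W): per-column context
    # Direction-aware gates; the real module learns transforms here.
    attn = sigmoid(pool_h) * sigmoid(pool_w)  # (C, H, W) via broadcasting
    return x * attn

feat = np.zeros((1, 4, 4))
feat[0, 1, 2] = 4.0                           # a single strong edge response
out = coordinate_attention(feat)
```

Because the row and column of the strong response both receive elevated gates, the reweighted map keeps the response localized at its exact coordinates, which is the property the network exploits for positioning building edges.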
4. Discussion
According to the ablation experiments, the improved model not only improves the accuracy of building extraction but also significantly improves computational efficiency. Compared with the baseline model, the improved model increases the mIoU by 0.64%. As indicated in Table 1 and Table 2, employing MobileNetV2 as the backbone network significantly boosts the model’s prediction speed, increasing FPS by 96.89, but results in a 0.34% decrease in mIoU. This suggests that while MobileNetV2 improves prediction speed, it slightly compromises feature extraction accuracy. In addition, as can be seen in Figure 6, the improved algorithm incorrectly segmented non-building entities and omitted small buildings when using MobileNetV2 as the backbone network, indicating that the algorithm extracts insufficient features from small buildings and consequently fails to recover their contours. The introduction of the coordinate attention (CA) mechanism enables the network to learn the coordinate information of different locations in the input image, focusing on the exact spatial location of buildings, which in turn better captures their edges and contour features, resulting in an increase of 0.27% in mIoU. Figure 6 shows that although occlusion (low-rise buildings occluded by high-rise buildings, and buildings occluded by trees) leads to missing building contours, the CA mechanism recovers building contours in occluded regions by enhancing attention to the buildings’ own features (e.g., the edges and textures related to building contours). However, in enhancing attention to building features, the CA mechanism over-attends to some local features and thus mis-extracts non-building features, indicating that the mechanism is efficient at forming building contours but slightly less effective at suppressing non-building mis-extractions. As shown in Figure 6 and Table 1, introducing the SP module into ASPP enhances the model’s ability to extract building edges by capturing edge features more efficiently, yielding a more accurate building profile. Meanwhile, the SP module adopts an appropriate pooling scale to limit the feature extraction range of the improved model under a larger receptive field, reducing the focus on non-buildings and thereby significantly reducing the mis-extraction of non-building elements, resulting in an improvement of 0.52% in mIoU. Compared with the previous two improvement strategies, the SP-ASPP module shows the greatest increase in accuracy, indicating that the SP-ASPP structure achieves more complete building outline extraction. These results indicate that MobileNetV2 significantly enhances computational efficiency and prediction speed, while the other two modules effectively segment urban buildings. According to Table 1 and Figure 6, the improved method achieves the highest extraction accuracy, with no obvious false positives or discontinuous building contours, which demonstrates the effectiveness and accuracy of the improvement strategy.
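The strip pooling (SP) operation discussed above replaces the square pooling window with full-row and full-column strips, so every position receives context from elongated structures such as building edges. The NumPy sketch below illustrates only this pooling geometry; the learned 1-D convolutions on each strip and the fusion with the ASPP branches in our network are omitted, so this is an assumed simplification of the module’s core step:

```python
import numpy as np

def strip_pool(x):
    """Strip pooling over a 2-D feature map x of shape (H, W).

    Averages along full rows and full columns, broadcasts the two strips
    back to H x W, and sums them, so each position aggregates context
    from its entire row and column. Banded structures (e.g., building
    edges) therefore dominate the pooled response along their direction.
    """
    row_strip = x.mean(axis=1, keepdims=True)  # (H, 1): one value per row
    col_strip = x.mean(axis=0, keepdims=True)  # (1, W): one value per column
    return row_strip + col_strip               # broadcasts to (H, W)

fmap = np.zeros((3, 4))
fmap[1, :] = 1.0            # a horizontal edge-like activation
pooled = strip_pool(fmap)
```

Note how the row containing the edge keeps a uniformly high pooled value across its full width, which is the long-range behavior that square pooling windows of the same footprint cannot provide.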
In comparative experiments, PSPNet, U-Net, Deeplabv3plus, and HRNetv2 were employed as mainstream algorithms for evaluation. According to Table 3 and Table 4, on the Mapchallenge dataset the improved model achieves the highest building extraction precision and the fastest prediction speed among the compared algorithms. As depicted in Figure 7, contours of buildings extracted by the comparative algorithms appear discontinuous, whereas the improved algorithm produces smoother building contours without obvious jagged edges, demonstrating its capability to effectively address the issue of discontinuity in building contour edges. Additionally, the PSPNet algorithm exhibits false positives on non-building entities, indicating that the improved algorithm can achieve precise building extraction without introducing such false positives. Furthermore, as shown in Figure 7, HRNetv2_w32 and Mobilenet-PSPNet produce irregular contours for small buildings, incomplete contours for medium-sized buildings, and unclear contours for large buildings when extracting multiscale buildings. In contrast, the improved algorithm clearly extracts the contours of buildings at different scales, which affirms the effectiveness of the proposed method. According to Table 5 and Table 6, on a typical dataset of urban buildings in China, the improved algorithm achieves an mIoU of 85.11% and a prediction speed of 110.67 FPS. This confirms the effectiveness and efficiency of the improved model, making it suitable for practical deployment and real-world demands.
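The mIoU figures quoted throughout are the mean of per-class intersection-over-union between the predicted and reference masks. A short sketch of the computation for the two-class (background/building) case follows; the toy masks are illustrative only:

```python
import numpy as np

def mean_iou(pred, truth, n_classes=2):
    """Mean intersection-over-union across classes (0 = background, 1 = building)."""
    ious = []
    for cls in range(n_classes):
        inter = np.logical_and(pred == cls, truth == cls).sum()
        union = np.logical_or(pred == cls, truth == cls).sum()
        if union > 0:                       # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

# Reference mask: a 2x2 building block; prediction over-segments one pixel.
truth = np.array([[1, 1, 0, 0],
                  [1, 1, 0, 0]])
pred  = np.array([[1, 1, 1, 0],
                  [1, 1, 0, 0]])
miou = mean_iou(pred, truth)  # background IoU 3/4, building IoU 4/5 -> 0.775
```

Because mIoU averages over both classes, a single mis-extracted non-building pixel penalizes the score twice (once per class), which is why the sub-1% mIoU gains reported here still correspond to visibly cleaner contours.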
Figure 8 illustrates that the comparative algorithms suffer from missed extractions of small buildings and intersecting contours for densely arranged buildings. In contrast, the improved algorithm clearly extracts the edge contours of various types of buildings without edge overlaps, demonstrating its ability to achieve fine-grained extraction of building contours. At the same time, in simpler urban scenes, the improved algorithm’s building extraction results are more realistic than those of PSPNet and HRNet_w32: it minimizes interference from non-building entities, achieves more accurate contour extraction for irregular buildings, and demonstrates stability in distinguishing buildings from non-buildings. In summary, the improved algorithm effectively accomplishes the building extraction task and, compared with other mainstream semantic segmentation algorithms, exhibits higher speed and accuracy.
Furthermore, although the improved algorithm demonstrates a significant increase in prediction speed, the improvement in extraction accuracy is less than 1%. This is attributed to the excessive simplification of the model structure after the lightweight redesign, which reduces the model’s complexity and consequently limits its precision. It is also noteworthy that the improved algorithm addresses only urban building extraction from remote sensing images and lacks experimental validation on other computer vision tasks. Therefore, in future research, from a model optimization perspective, we can refine the algorithm by reconsidering the complexity of the model structure; from an application standpoint, the algorithm’s adaptability to other computer vision tasks could be explored through techniques such as transfer learning.
5. Conclusions
To confront the challenge that occlusion, mixing, and related problems make it difficult to completely extract the outlines of urban buildings from high-resolution remote sensing images, a more efficient and accurate deep learning network is proposed for building extraction. The coordinate attention module is used to strengthen attention to building contours at different positions, extracting more complete building contours. The SP-ASPP module is used to further capture building edge feature information and improve building extraction accuracy. Experimental results on a dataset of typical architectural developments in Chinese cities and on the Mapchallenge building dataset demonstrate that the improved model achieves higher extraction accuracy. Ablation experiments reveal that the three improvement strategies effectively enhance the model’s capacity for urban building extraction. The introduction of the SP-ASPP module increases the model’s mIoU from 84.47% to 84.99%, effectively enhancing urban building extraction accuracy. Furthermore, with the integration of the lightweight MobileNetV2 backbone, the model’s prediction speed rises from 40.64 FPS to 137.53 FPS, enabling real-time response and practical deployment in engineering projects. Notably, combining MobileNetV2 with the SP-ASPP and CA modules further improves urban building extraction accuracy to 85.11%, while achieving a speed of 110.67 FPS. In the future, combining spectral features of remote sensing images (e.g., vegetation, water, and building indices) with deep learning models could be explored to further improve the efficiency of urban building extraction.