SuperpixelGraph: Semi-automatic generation of building footprint through semantic-sensitive superpixel and neural graph networks

Most urban applications require building footprints as concise vector graphics with sharp boundaries rather than pixel-wise raster images. This need contrasts with the majority of existing methods, which typically generate over-smoothed footprint polygons. Editing these automatically produced polygons can be inefficient, sometimes even more time-consuming than manual digitization. This paper introduces a semi-automatic approach for building footprint extraction based on semantically-sensitive superpixels and graph neural networks. Drawing inspiration from object-based classification techniques, we first learn to generate superpixels that are not only boundary-preserving but also semantically sensitive: they respond exclusively to building boundaries rather than to other natural objects, while simultaneously producing a semantic segmentation of the buildings. These intermediate superpixel representations can naturally be regarded as nodes of a graph. Consequently, graph neural networks are employed to model the global interactions among all superpixels and enhance the representativeness of node features for building segmentation. Classical approaches are then used to extract and regularize boundaries for the vectorized building footprints. With minimal clicks and straightforward strokes, we efficiently obtain accurate segmentation outcomes, eliminating the need to edit polygon vertices. Experimental assessments on several public benchmark datasets validate the precision and efficacy of the proposed approach, with a significant improvement of 8% in AP50 over established techniques in vector-graphics evaluation. We additionally devise a streamlined pipeline for interactive editing that further improves the overall quality of the results.


Introduction
Buildings are essential geographical features in urban environments. Accurate and up-to-date building footprints in the form of concise vector graphics are increasingly in demand for various applications, such as map services, urban planning, and reconstruction (Zhu et al., 2020b; Haala and Kada, 2010). With the advancement of data acquisition techniques, high-resolution images with improved temporal and spatial resolution are now readily available. However, these images pose challenges for current building detection approaches, particularly in dealing with intricate roof designs and complex surrounding environments (Xiong et al., 2014).
Two prevalent paradigms for generating building footprints currently exist: (1) pixel-wise segmentation followed by regularization (Microsoft, 2018), and (2) end-to-end learnable approaches. The former utilizes semantic segmentation approaches (Long et al., 2015; Ronneberger et al., 2015) to create a binary probability map of buildings, followed by classical polygonal simplification and regularization methods for generating vector graphics. The latter directly predicts corner locations and connects them into closed polygons. Despite numerous impressive works on these paradigms, two issues remain unresolved: (1) Building boundaries often do not align well with human expectations. Segmentation-based approaches generally produce over-smoothed building probability maps due to the inherent receptive field of convolutional neural networks (CNNs) (He et al., 2016) and the networks' multi-scale nature. Direct end-to-end approaches (Zorzi et al., 2022; Wei et al., 2023) frequently create irregular shapes that deviate from human preferences, which can be attributed to the limited generalization ability of the learned corner-linking procedure.
(2) Automatically extracted building footprints can be inefficient for subsequent interactive quality control. Building footprints that have been over-smoothed or have irregular shapes can be challenging to edit efficiently, limiting their widespread use in real-world applications despite their commonly achieved 90% precision metrics in segmentation. Currently, editing operations typically involve direct manipulation of polygon vertices, such as moving, adding, and deleting, which can be extremely time-consuming.
To address these issues, this paper proposes SuperpixelGraph for the semi-automatic generation of building polygons. We learn semantically-sensitive superpixels that react only to building boundaries, thus achieving improved boundary metrics (Fig. 1a). Superpixels naturally partition images into globally connected graphs (denoted as SuperpixelGraph, Fig. 1b), where node features and edge relationships are learned through graph neural networks. Owing to boundary preservation and global edge knowledge, local editing of the SuperpixelGraph with addition and deletion strokes has global effects (Fig. 1c). More specifically, we learn both superpixel clustering and semantic segmentation of buildings using the same encoder-decoder architecture, i.e., Fully Convolutional Networks (Long et al., 2015). The pixel-wise feature maps are adaptively pooled into each superpixel, generating the initial node features of the SuperpixelGraph. Graph attention networks (GAT) (Veličković et al., 2017) are employed to embed node features and model pairwise potentials of superpixels. Segmentation is performed on each embedded superpixel feature to generate rasterized building masks, and classical approaches (Suzuki et al., 1985; Dyken et al., 2009) are used to extract vector graphics of building footprints.

In summary, the contributions of the proposed method are twofold: (1) a multi-task network is designed to preserve building boundaries directly, effectively segmenting superpixels and building semantics simultaneously, and (2) the construction of a superpixel graph enhances the ability to distinguish between buildings and backgrounds while reducing the difficulty of human interaction.

The remainder of the paper is organized as follows. Section 2 discusses related work, while Section 3 presents the proposed methods in detail. Section 4 comprises assessment and discussion. Finally, Section 5 concludes the paper.

Related Work
In the following sections, we focus on the most relevant topics related to this paper: (1) building footprint generation, (2) superpixels, and (3) graph neural networks.

Building Footprint Generation
Raster building segmentation. The advent of deep learning, particularly the proliferation of encoder-decoder architectures, has led to the adoption of semantic segmentation approaches for generating raster building masks. Multi-scale structures, such as UNet (Ronneberger et al., 2015), FCN (Long et al., 2015), and FPN (Lin et al., 2017), have enabled end-to-end training of binary building masks, significantly improving the practicality of such approaches. Buildings, as typical man-made structures, exhibit sharp edges. However, the hierarchically enlarged receptive field of CNNs, despite their impressive power in semantic modeling, inevitably smooths features over boundaries (Liu et al., 2022). Many dedicated building segmentation studies focus on enhancing the representativeness of features with respect to boundaries. Popular approaches include emphasizing features in shallow layers, auxiliary feature maps, and augmenting with low-level features (Zhu et al., 2020a; Liao et al., 2021; Zhou et al., 2022; Jung et al., 2021). Nonetheless, the contradiction between an effective receptive field and precise boundary information remains unresolved.
Footprint simplification and regularization. Due to the concise data structure and support for spatial analyses, many applications require building representations in vector graphic formats. The conversion between raster and vector graphics has been extensively studied within the GIS and computer graphics communities, with a focus on polyline simplification and regularization. Polylines can be directly traced from binary masks using marching squares, forming the vectorized contour of objects (Suzuki et al., 1985). By applying certain geometrical criteria such as length, distance, and angle, the number of vertices in the contour polylines can be reduced (Dyken et al., 2009;Gribov, 2019). To enforce shape regularities, strategies such as line fitting, discrete model selection, and least-squares optimization are commonly adopted (dos Santos et al., 2020;Xie et al., 2018). However, the quality of simplified and regularized results is inevitably affected by the initial raster mask. Nuisances in the raster mask, such as oversmoothing and zigzag issues, are generally difficult to remedy at this stage.
Learned vectorized footprint. Recently, a new paradigm has emerged for generating building footprints through directly learning vector generation. Various strategies have been attempted, including recurrent prediction of polygon vertices (RNN-based approaches) (Huang et al., 2021), CurveGCN-like graph convolution networks along 1D curves (Ling et al., 2019; Peng et al., 2020), and learning offsets with initial object detection (Zorzi et al., 2022; Wei et al., 2023). Moreover, extracting polygons with the assistance of auxiliary layers has demonstrated promising results, such as attractive field maps and polyvector flow fields (Girard et al., 2021). However, creating such layers still requires features in high-level semantic contexts, leading to a gap between pixel-wise features and high-level semantic features. Additionally, the generalization ability is also a concern when learning to connect the corners of buildings.

Superpixels
Classical Clustering-based Superpixels. Superpixels partition an image into over-segments that should not cross the boundaries of objects of interest (Stutz et al., 2018; Achanta et al., 2012). By confining superpixels within the same objects, we can classify the region belonging to a single superpixel as a whole, thus improving the robustness and sharpness of semantic classification. This technique is also known as Object-based Image Analysis (Blaschke, 2010). Most classical methods for generating superpixels rely on clustering low-level features, such as intensity or gradient values of images. Clustering approaches include watershed segmentation (Benesova and Kottman, 2014), mean shift (Comaniciu and Meer, 2002; Vedaldi and Soatto, 2008), geometric flows (Levinshtein et al., 2009), K-means (Achanta et al., 2012; Li and Chen, 2015), graph cuts (Veksler et al., 2010), and others. However, it is important to note that superpixels only need to respect the boundaries of the objects of interest, rather than of every object. For example, if we are only interested in buildings, allowing superpixels to cross roads may even benefit the final outputs (Fig. 1a).
Learned approach for superpixels. The hard assignment of each pixel to its superpixel is not differentiable, which long made superpixel generation difficult to learn. This challenge was addressed by the seminal Superpixel Sampling Network (SSN) (Jampani et al., 2018), which proposed an elegant design for a soft association matrix and introduced a differentiable SLIC-like K-means iteration as a workaround. Following a similar pipeline, subsequent studies proposed FCN architectures that avoid the iterations of differentiable SLIC by directly learning the soft association matrix (Yang et al., 2020; Zhu et al., 2021). However, these approaches consider all natural boundaries, posing difficulties in improving the recall rate of the objects of interest (Ng et al., 2023). In this paper, we adopt an additional semantic branch to make the superpixel generation semantically sensitive.

Graph Neural Networks
Graph Neural Networks (GNNs) have emerged as a powerful framework for learning representations of structured data in the form of graphs. Over the past decade, GNNs have experienced significant growth and have been applied to a wide range of problems. The foundational work of Scarselli et al. (2008) introduced the concept of Graph Neural Networks, which paved the way for various GNN architectures, such as Graph Convolutional Networks (GCNs) (Kipf and Welling, 2016), Graph Attention Networks (GATs) (Veličković et al., 2017), and Graph Isomorphism Networks (GINs) (Xu et al., 2018). These architectures leverage both local and global information in graphs by iteratively aggregating and transforming neighboring node features. In addition to embedding the features attached to the nodes of a graph, features can also be learned for its edges, representing the relationships between nodes.
Upon segmenting images into superpixels, it becomes intuitive to model the image as a graph. The features corresponding to each superpixel can be derived through global pooling operations within the superpixel. The primary distinction between the proposed methodology and prior works in image or point cloud classification lies in our approach to generate semantically-relevant superpixels. Additionally, our method places emphasis on both edges within the graph and the relationships they represent, as opposed to merely concentrating on node feature embeddings.

Overview and problem setup
In this manuscript, we present a practical pipeline tailored for the generation of vectorized building polygons. While a fully automated pipeline remains the ultimate goal, the significance of developing effective interactions should not be understated. Our findings reveal that manual digitization or modification of vertices, particularly when strict conformity to the underlying orthophotos is required, is considerably laborious. Human operators excel at perceptually accurate yet geometrically imprecise tasks. Consequently, to achieve streamlined interactions, it is imperative to incorporate automatically generated auxiliary information characterized by the following attributes: (1) geometric segments exhibiting a high degree of alignment with the intended boundaries, and (2) information encapsulating global context. These two criteria pose an intrinsic contradiction: the former requires localized information, whereas the latter demands global context. This insight led us to partition the pipeline responsible for generating auxiliary information into two discrete phases. In the initial phase, the pipeline produces superpixels that adhere to building boundaries; the learning process harnesses SSN-like methodologies (Jampani et al., 2018), fundamentally capturing pixel pair-wise similarities and inherently encoding local information. In the second phase, the superpixel segmentation is treated as a graph, with global context modeled through graph neural networks. Node features are hierarchically aggregated to encapsulate global context, while the interrelationships between these features are estimated correspondingly.

More formally, the inputs comprise the color image x ∈ R^(S×3) (with image size S = W × H) and the corresponding binary building mask b ∈ Z^S.
Each pixel b_p ∈ {0, 1} signifies either building or background. To render the superpixels learnable, the building mask b is additionally transformed into the one-hot feature h ∈ R^(S×2). Our proposed approach, denoted as SuperpixelGraph, incorporates the following key steps to semi-automatically generate vectorized buildings (refer to Fig. 2): The initial stage yields an auxiliary over-segmentation M ∈ Z^S of the image x into N superpixels, such that each pixel of the segmentation satisfies M_p ∈ {0, 1, 2, …, N − 1}. A conventional encoder-decoder network is employed to learn the superpixels in an end-to-end fashion, wherein a differentiable pixel-superpixel association matrix Q ∈ R^(S×9) (Jampani et al., 2018) and a feature map f ∈ R^(S×C) are concurrently generated. Each pixel in Q represents the probability that pixel p belongs to one of the 9 adjacent superpixels. The N superpixels, as opposed to the individual S pixels, are utilized for classification. Consequently, the superpixels ought to be sensitive to building semantics. Further details are elaborated upon in Subsection 3.2.
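As a concrete illustration, the hard superpixel map M can be derived from the soft association Q by taking, for each pixel, the most probable of its 9 neighboring grid cells. The sketch below assumes Q is flattened to shape (S, 9) and that its channels are ordered row-major over the 3×3 neighborhood; both the function name and the channel layout are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def hard_assignment(Q, grid_rc, n_cols):
    """Collapse the soft 9-way association into a hard superpixel map.

    Q       : (S, 9) soft probabilities over the 3x3 neighborhood of grid cells
    grid_rc : (S, 2) row/column of the initial grid cell containing each pixel
    n_cols  : number of grid columns (sqrt(N) for a square lattice)
    """
    # offsets of the 3x3 neighborhood, in the same order as Q's 9 channels
    offsets = np.array([(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)])
    best = Q.argmax(axis=1)              # winning neighbor cell per pixel
    rc = grid_rc + offsets[best]         # grid cell of the winner
    return rc[:, 0] * n_cols + rc[:, 1]  # superpixel id n = r * sqrt(N) + c
```

For a pixel whose mass concentrates on the central channel, the result is simply its own initial grid cell.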
The construction of the SuperpixelGraph G(V, E), derived from the superpixels M and feature map f, is rather straightforward. The vertices V are endowed with the corresponding node features V ∈ R^(N×C), produced by a weighted average using Q and f. The edge indices E ∈ Z^(|E|×2) are acquired by tracing adjacent pixels in the segmentation M and gathering boundary indices.
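A minimal sketch of this construction, using plain mean pooling per superpixel instead of the Q-weighted average for brevity; `build_superpixel_graph` and its 4-neighbor edge scan are illustrative assumptions rather than the paper's exact implementation:

```python
import numpy as np

def build_superpixel_graph(M, f):
    """Build node features and edge indices from a hard superpixel map.

    M : (H, W) integer superpixel labels
    f : (H, W, C) pixel-wise feature map
    Returns V (N, C), the mean feature per superpixel, and E, the sorted
    undirected edge list between touching superpixels.
    """
    N = M.max() + 1
    C = f.shape[-1]
    V = np.zeros((N, C))
    counts = np.zeros(N)
    np.add.at(V, M.ravel(), f.reshape(-1, C))   # scatter-sum features per label
    np.add.at(counts, M.ravel(), 1)
    V /= counts[:, None].clip(min=1)

    edges = set()
    # a 4-neighbor label change between adjacent pixels marks a graph edge
    for a, b in [(M[:, :-1], M[:, 1:]), (M[:-1, :], M[1:, :])]:
        diff = a != b
        for i, j in zip(a[diff].ravel(), b[diff].ravel()):
            edges.add((min(i, j), max(i, j)))
    E = np.array(sorted(edges))
    return V, E
```

The scatter-based pooling keeps the construction linear in the number of pixels, which matters for 512×512 tiles.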
SuperpixelGraph Embedding and Classification Θ_G(G) → (V^L, α). A graph neural network Θ_G incrementally aggregates and projects the node features into the final layer V^L ∈ R^(N×C) (L represents the number of layers). Subsequently, the features V^L are classified into superpixel-wise building probabilities B̂ ∈ R^N. Moreover, Θ_G concurrently generates similarities α ∈ R^|E| for adjacent nodes, as described in Subsection 3.3.
Globally Optimized Interactive Editing. The graph G can also be construed as a Markov Random Field (MRF), wherein the probability B̂ serves as the data term and the node similarities α as the smooth term. Editing operations, encompassing addition and deletion, are expressed as strokes on the segmentation M. Probabilities in B̂ whose indices in M overlap with strokes are forced to 1 or 0 for addition and deletion, respectively. A global label optimum is attained through graph cuts.
Building Footprint Vectorization. After editing, the probabilities of the superpixel nodes are rendered to a binary mask. Classical raster-to-vector conversion methods (Suzuki et al., 1985) with polygon simplification and regularization (Dyken et al., 2009;Xie et al., 2018) are adopted to generate the vectorized building footprint.

Semantically-Sensitive Superpixel Network
This paper adopts the SSN strategy (Jampani et al., 2018) to learn features for generating superpixels. The core of SSN lies in establishing the pixel-superpixel association matrix Q, where each pixel Q p ∈ R 9 denotes the probability that pixel p belongs to one of the adjacent 9 potential superpixels. Utilizing Q, we can directly aggregate the superpixel feature from the pixel-wise feature map and also disperse the superpixel feature back to the pixel-wise feature through weighted interpolation. To make this process learnable, appearance reconstruction and size regularities of the superpixel are used as loss functions. This strategy enforces intra-consistency of the image's appearance information, which is sensitive to various appearance changes along boundaries. To further enhance boundary preservation efficacy for buildings, this paper introduces an additional semantic branch. The goal is to make the superpixel solely responsive to building boundaries, thereby mitigating potential influences from other objects.
The architecture of the network (Fig. 3) for the generation of semantically-sensitive superpixels is quite straightforward: it consists of a feature extractor, two heads for superpixel and semantic segmentation, and the loss functions. The core strategy for the network is inspired by SSN. For completeness, we also briefly introduce the necessary components of SSN for a better understanding of this paper.

Feature Extractor. We adopt a standard encoder-decoder architecture to generate a pixel-wise feature map f from the input image x, i.e., Θ_F(x) = f. In the encoder part, we use a structure similar to the nested U²-Net (Qin et al., 2020), with each encoder block being a Residual U-block. U²-Net has been experimentally found to better capture subtle information along object boundaries for salient object detection, particularly for thin objects, making it suitable for our purpose. In this way, each layer can combine information extracted from multiple scales, allowing the encoder to capture more context.
Superpixel and Segmentation Heads. The feature map f branches into two distinct paths: the pixel-superpixel association matrix Q in the superpixel head, and the estimated building segmentation map b̂ ∈ R^(S×2) in the semantic segmentation head. The head weights are essentially convolutions with a 1 × 1 kernel.
As illustrated in Fig. 4, the image is initially partitioned into N square lattice grids, with √N cells along each dimension. Each pixel's association Q_p estimates the probability of belonging to one of the 9 grids adjacent to pixel p,

N_p = {(r + i, c + j) | i, j ∈ {−1, 0, 1}},

where (r, c) represents the row and column of the grid in which pixel p is located. For cells situated along the image border, the cells are padded with reflected values.

Superpixel Aggregation and Dispersion. Given the pixel-superpixel association matrix Q, we can softly aggregate pixel-wise features into superpixel-wise features. In this paper, we use the pixel-wise one-hot feature h for the binary building mask and convert it to the superpixel-wise representation H ∈ R^(N×2). Since only the adjacent 9 superpixels are relevant, we sparsely estimate the aggregation over the adjacent grids. The row and column (r, c) of the n-th superpixel satisfy n = r × √N + c. In addition to the aggregation operator, we can also reconstruct the pixel-wise feature ĥ from the aggregated superpixel-wise feature H using weighted interpolation. For the n-th superpixel, the aggregated feature H_n and the dispersed feature ĥ_p are computed as

H_n = (1 / Z_n) Σ_{p : n ∈ N_p} Q_p(n) h_p,    ĥ_p = Σ_{n ∈ N_p} Q_p(n) H_n,

where Q_p(n) indicates the probability of the n-th superpixel in Q_p, and h_p ∈ R^2 is the one-hot feature for pixel p in the building map. The aggregation only considers pixels inside the adjacent lattice grids N_p, and Z_n = Σ_{p : n ∈ N_p} Q_p(n) is the normalizer for the corresponding superpixel. The same aggregation and dispersion strategy is also applied to the pixel locations p and the superpixel centers P, where p̂ is the location reconstructed from the superpixels; this positional term is crucial for maintaining the regular shape of the superpixels.

Loss Functions.
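The aggregation and dispersion operators can be sketched compactly if, for illustration, we use a dense association matrix rather than the sparse 9-neighbor form used in the paper:

```python
import numpy as np

def aggregate_disperse(Q, h):
    """Soft superpixel aggregation and dispersion (dense illustration).

    Q : (S, N) soft pixel-to-superpixel association
    h : (S, F) pixel-wise features (e.g. a one-hot building mask)
    Returns H (N, F), the aggregated superpixel features, and
    h_hat (S, F), the features dispersed back to pixels.
    """
    Z = Q.sum(axis=0, keepdims=True).clip(min=1e-8)  # per-superpixel normalizer Z_n
    H = (Q.T @ h) / Z.T                              # H_n = sum_p Q_p(n) h_p / Z_n
    h_hat = Q @ H                                    # h_hat_p = sum_n Q_p(n) H_n
    return H, h_hat
```

A pixel assigned half-and-half to two superpixels receives the average of their features back, which is exactly what makes the round trip differentiable.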
The loss function comprises two components: the superpixel loss L_sp, which enforces intra-consistency and shape regularity within each superpixel, and the semantic loss L_se, which facilitates learning features that distinguish buildings from the background. For the superpixel loss, supervision is provided by the original one-hot vector for buildings and the one reconstructed through the pixel-superpixel association matrix Q, together with a shape-regularity term on the reconstructed positions:

L_sp = CE(h, ĥ) + λ ‖p − p̂‖₂,

where CE(·, ·) denotes the cross-entropy and λ serves as a weight balancing the regularization term. For the semantic loss, the conventional formulation employed in semantic segmentation is utilized:

L_se = CE(b, b̂).

The final loss is the direct sum of the two components, L = L_sp + L_se.

Superpixel Clustering and Feature Pooling. To generate the superpixel clustering map M, we compute the hard pixel-superpixel association directly by taking the maximum in the soft association matrix Q.
The corresponding feature for the n-th superpixel, V_n ∈ R^C, is aggregated from the feature map f in the same manner as the aggregation of the building mask described above.
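To make the two supervision signals concrete, here is a hedged numpy sketch of the loss computation; the weight `lam` and the exact form of the position regularizer are assumptions for illustration:

```python
import numpy as np

def superpixel_losses(h, h_hat, p, p_hat, b, b_logits, lam=0.1):
    """Sketch of the two loss terms.

    L_sp : cross-entropy between the one-hot mask h and its reconstruction
           h_hat, plus a squared-distance regularizer on positions p vs p_hat.
    L_se : standard pixel-wise cross-entropy for the semantic branch,
           with b one-hot and b_logits the raw head outputs.
    """
    eps = 1e-8
    ce_sp = -(h * np.log(h_hat + eps)).sum(axis=1).mean()
    reg = ((p - p_hat) ** 2).sum(axis=1).mean()
    prob = np.exp(b_logits) / np.exp(b_logits).sum(axis=1, keepdims=True)
    ce_se = -(b * np.log(prob + eps)).sum(axis=1).mean()
    return ce_sp + lam * reg, ce_se
```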

SuperpixelGraph Embedding with Graph Attention Networks
The aggregated superpixel feature V ∈ R^(N×C) is intrinsically local: L_sp takes into account only the intra-consistency within each superpixel. Although the encoder-decoder architecture progressively expands the receptive field of the CNN, and L_se leverages semantic information, the features must capture a broader context to achieve robust semantic segmentation. Moreover, as only intra-consistency is considered, the relationships between adjacent superpixels remain ambiguous. Consequently, the embedding step for the SuperpixelGraph serves two primary objectives: (1) propagating and aggregating contextual information throughout the graph, and (2) explicitly estimating the relationships between superpixels, thereby enriching the information on the edges of the graph.
Graph Embedding Network. Extending the convolution operator to irregular domains typically involves a neighborhood aggregation or message passing scheme (Fey and Lenssen, 2019). In this paper, a vertex represents a superpixel with its node feature denoted by V_n ∈ R^C. The message passing formulation is employed to propagate information progressively to the target node through multiple layers l along the neighborhood. Specifically, a single layer of a message passing graph neural network can be described as

V_i^(l+1) = Θ₁(V_i^l) ⊕_{j ∈ N(i)} Θ₂(V_i^l, V_j^l),

where the superscript l signifies the layer index and the subscripts (i, j) denote the indices of nodes. The symbol ⊕ represents an order-invariant function, such as sum or max. Θ refers to learnable functions, typically Multi-Layer Perceptrons (MLPs) with activation and normalization functions. The concept of message passing is general, however, and other modular schemes are also possible. In this paper, we leverage Graph Attention Networks (GAT) (Veličković et al., 2017) for the message passing layer.
As depicted in Fig. 5, this paper employs four message passing layers. As detailed in Section 3.1, the construction of the SuperpixelGraph considers only direct neighbors for edge links. Each layer propagates and aggregates information in a ring structure toward the target node; consequently, after four layers, information from nodes up to four hops away becomes visible to the target node. Incorporating more context can be achieved by stacking additional layers; however, this may also introduce irrelevant information, thereby compromising performance. We find that four layers strike an optimal balance between contextual information and precision.

Graph Attention Networks. GAT is an instantiation of the message passing layer and is formulated as

V_i^(l+1) = σ( Σ_{j ∈ N(i) ∪ {i}} α_ij Θ V_j^l ),

where Θ is a shared MLP for each layer and α_ij is the attention score between nodes (i, j). In contrast to classical attention maps that model dense connections between all nodes, α_ij only considers the directly linked nodes (i, j) ∈ E in the graph, making it sparse. The attention coefficients α_ij are computed using a modified dot-product attention mechanism:

α_ij = softmax_j( σ( aᵀ [ΘV_i ‖ ΘV_j] ) ),

where a ∈ R^(2C) is a learned vector, the operator [·‖·] concatenates two node features, and σ is the activation function, LeakyReLU (Xu et al., 2015) in this paper. As this formulation shows, α_ij essentially represents a learnable correlation between adjacent node features (V_i, V_j), normalized with the softmax function over all nodes connected to V_i. As depicted in Fig. 6, superpixels belonging to the same object exhibit a markedly higher correlation than inter-object connections. This observation inspired us to utilize the attention scores for subsequent interactive editing.

Figure 6: Visualization of the attention score for a specific node. Attentional aggregation constructs a dynamic graph between adjacent superpixels. Weights α_ij are shown as segments, with normalized values represented by colors.
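The sparse attention computation can be sketched as follows; this is a single-head toy version over an explicit edge list, not the paper's batched GAT implementation:

```python
import numpy as np

def gat_attention(V, edges, W, a):
    """Sparse GAT-style attention coefficients over graph edges.

    V     : (N, C) node features
    edges : list of directed (i, j) pairs; j sends a message to i
    W     : (C, C') shared projection matrix
    a     : (2*C',) learned attention vector
    Returns a dict mapping (i, j) to the normalized coefficient alpha_ij.
    """
    H = V @ W
    scores = {}
    for i, j in edges:
        e = np.concatenate([H[i], H[j]]) @ a
        scores[(i, j)] = np.where(e > 0, e, 0.2 * e)  # LeakyReLU activation
    alpha = {}
    for i, j in edges:
        # softmax over all edges arriving at node i
        denom = sum(np.exp(scores[(i, k)]) for (ii, k) in edges if ii == i)
        alpha[(i, j)] = np.exp(scores[(i, j)]) / denom
    return alpha
```

Because the softmax runs only over each node's incident edges, the cost scales with |E| rather than N².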
Loss functions. The final layer of the above embedding yields the logits for the classification of the superpixels, denoted as V^4. The corresponding label mask b is likewise aggregated onto the graph: the superpixel-level label B ∈ Z^N is generated by scattered reduction from the dynamically created superpixel segmentation M and the mask b. Cross-entropy is employed as the loss function L_G to learn the weights for the graph embedding and attentional aggregation.

Stroke editing with global optimization
In the inference step, the outputs of the SuperpixelGraph consist of the superpixel segmentation map M, the estimated superpixel-level probability of building segmentation B̂, and the correlation scores α. In practical applications, incorrect classification of buildings is inevitable. Therefore, we design an efficient global optimization step that takes user interaction into account.
As depicted in Fig. 7, an intermediate building mask is created by hard thresholding of the segmentation score B̂. The mask is overlaid on the image, allowing the operator to add or delete building regions using stroke editing. The probabilities in B̂ for the superpixels affected by the editing strokes are reset to 0 or 1 for deletion and addition, respectively. Then, a Markov Random Field using the weighted Potts model is constructed, and the final label L ∈ Z^N for each superpixel is solved using graph cuts (Boykov and Kolmogorov, 2004):

E(L) = Σ_{n ∈ V} |L_n − B̂_n| + ϕ Σ_{(i,j) ∈ E} α_ij · I(L_i, L_j),

where I(·, ·) is an indicator function that evaluates to 0 if the two labels are the same and 1 otherwise, and ϕ = 10 is the weight balancing the data term and the smooth term.

Figure 7 (excerpt): The edited nodes are highlighted by yellow outlines, with red and blue representing modification of the node's building-class probability to 0 and 1, respectively. (d) The optimized building segmentation mask obtained after the optimization process.
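For intuition, the energy minimized by the graph cut can be written out directly. This sketch only evaluates the energy of a given labeling; the actual minimization uses the max-flow solver of Boykov and Kolmogorov:

```python
def potts_energy(labels, B_hat, edges, alpha, phi=10.0):
    """Energy of a labeling under the weighted Potts model.

    labels : per-superpixel binary labels L_n
    B_hat  : per-superpixel building probabilities (strokes clamp these to 0/1)
    edges  : list of (i, j) superpixel adjacencies
    alpha  : per-edge attention scores weighting the smoothness term
    """
    # data term: cost of label L_n given the predicted probability B_hat[n]
    data = sum(1.0 - B_hat[n] if l == 1 else B_hat[n]
               for n, l in enumerate(labels))
    # smooth term: penalize label changes across strongly correlated edges
    smooth = sum(alpha[k] for k, (i, j) in enumerate(edges)
                 if labels[i] != labels[j])
    return data + phi * smooth
```

Because edited superpixels have their probabilities clamped to 0 or 1, any labeling that contradicts a stroke pays the maximal data cost, so the optimum propagates the user's intent through high-α edges.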
After the graph cut-based optimization concludes, a precise building mask can be generated. The next task is to track the building outline and vectorize it based on parallel and vertical constraints. In this work, we adopt existing solutions (dos Santos et al., 2020;Xie et al., 2018) to trace the boundary and regularize the vector graphics.
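As an illustration of the simplification stage, here is a generic Douglas-Peucker routine; this is a stand-in for exposition only, since the paper adopts the method of Dyken et al. together with regularity constraints:

```python
def simplify(points, eps):
    """Douglas-Peucker polyline simplification.

    points : list of (x, y) vertices along a traced contour
    eps    : maximum allowed perpendicular deviation from the chord
    """
    if len(points) < 3:
        return points
    (x0, y0), (x1, y1) = points[0], points[-1]
    dx, dy = x1 - x0, y1 - y0
    norm = (dx * dx + dy * dy) ** 0.5 or 1.0
    # perpendicular distance of every interior point to the chord
    dists = [abs(dy * (x - x0) - dx * (y - y0)) / norm for x, y in points[1:-1]]
    k = max(range(len(dists)), key=dists.__getitem__)
    if dists[k] <= eps:
        return [points[0], points[-1]]           # flat enough: keep endpoints
    k += 1                                       # index of the split vertex
    return simplify(points[:k + 1], eps)[:-1] + simplify(points[k:], eps)
```

Vertices that deviate less than `eps` from the chord are dropped, while true corners survive the recursion.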

Experiments and evaluation
In this section, we conduct a comprehensive experimental evaluation of the proposed techniques. A brief overview of the data sets is initially presented in Section 4.1, followed by an assessment of the superpixel segmentation in Section 4.2. Subsequently, an in-depth analysis of the precision and efficiency of the building extraction, focusing on local details, is provided in Section 4.3.

Datasets
Three datasets are employed in our experiments. The first, the WHU aerial dataset (Ji et al., 2018), is sourced from New Zealand, encompassing 187,000 buildings within a 450 km² area. The dataset comprises 8,188 RGB image tiles in total, each containing 512×512 pixels with a 0.3 m spatial resolution. It is divided into three parts: training, validation, and testing sets, which consist of 4,736, 1,036, and 2,416 tiles, respectively. The second, the INRIA aerial dataset (Maggiori et al., 2017), is gathered from 10 cities worldwide, covering an area of 810 km². Each city features 36 images with dimensions of 5000×5000 and a spatial resolution of 0.3 m. We employed only the training set of this dataset for quantitative experiments, cropping the images into 512×512 tiles with 12 pixels of overlap. The tiles were randomly split by an 8:2 ratio into training and validation sets. The third, the Vegas satellite dataset, is a subset of the DeepGlobe Building Extraction Challenge dataset (Demir et al., 2018), captured by the WorldView-3 satellite. This dataset consists of 3,851 images and 110,000 buildings. We selected pan-sharpened RGB images with dimensions of 650×650 and a spatial resolution of 0.3 m. The ground-truth binary labels were generated from the summary file containing the spatial coordinates of all annotated building footprint polygons. All images were randomly divided by a 6:1.5:2.5 ratio into training, validation, and testing sets. A comparison of the datasets is provided in Table 1.

Evaluation of superpixel segmentation
The performance of the proposed superpixel segmentation algorithm is compared to nine state-of-the-art approaches. All evaluations are conducted using the protocols and codes provided by the superpixel benchmark library (Stutz et al., 2018). The implementations of SLIC (Achanta et al., 2012), SEEDS (Van den Bergh et al., 2012), and LSC are provided by the OpenCV library; ETPS (Yao et al., 2015) and ERS (Liu et al., 2011) are provided in the superpixel benchmark (Stutz et al., 2018); and SSN (Jampani et al., 2018), SPFCN (Yang et al., 2020), LNSNET, and SICLE (Belém et al., 2022) use the corresponding authors' implementations. The evaluation metrics for superpixels include the achievable segmentation accuracy (ASA) and boundary recall and precision (BR-BP). ASA quantifies the accuracy achievable when the superpixels are used as a preprocessing step for segmentation, while BR and BP measure the boundary adherence of superpixels given the ground truth.

For qualitative results, Figs. 8, 9 and 10 present the details of the superpixel segmentation outcomes and compare them to two typical approaches: SLIC for classical clustering methods and SSN for learned methods. It is evident that the proposed method generates superpixels that form an approximately uniform grid in the background region, while the edges of superpixels in the building region exhibit strong consistency with the building edge orientation. This demonstrates that the proposed method is exclusively sensitive to building edges, resulting in superior boundary segmentation precision. In contrast, other methods make superpixels respond to all kinds of differences in the images, rendering them insensitive to buildings and leading to inaccurate and incomplete building boundaries.

For the quantitative assessments, we classify the techniques into two primary categories: non-learned methods (top rows of Table 2) and learned methods (bottom rows of Table 2).
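The ASA metric, for instance, can be computed directly from the superpixel and ground-truth label maps:

```python
import numpy as np

def achievable_segmentation_accuracy(sp, gt):
    """ASA: the upper bound on segmentation accuracy obtained when every
    superpixel is assigned the majority ground-truth label it overlaps.

    sp : integer superpixel label map
    gt : integer ground-truth label map of the same shape
    """
    correct = 0
    for s in np.unique(sp):
        # count the best-case (majority-label) pixels inside superpixel s
        labels, counts = np.unique(gt[sp == s], return_counts=True)
        correct += counts.max()
    return correct / gt.size
```

An ASA of 1.0 means no superpixel straddles a ground-truth boundary, which is why boundary-preserving superpixels score highest.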
These evaluations focus on the delineation of ground-truth building polygons. Our proposed methodology surpasses existing approaches by a substantial margin when analyzing building boundaries (Fig. 11). We additionally report the standard performance metrics with the number of superpixels held at approximately 1,000, and the results remain consistent. It should be emphasized that the learned methods (bottom rows) are fine-tuned on the same building segmentation maps as the proposed method; this underscores the importance of semantic sensitivity for achieving strong performance. The improved preservation of building contours substantially benefits subsequent building extraction.
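A minimal sketch of the BR-BP computation described above. It assumes the common convention of a small square tolerance band around boundary pixels; the benchmark library's exact tolerance and boundary definition may differ, and all function names are our own.

```python
import numpy as np

def boundary_map(labels):
    """True where a pixel's label differs from its right or bottom neighbour."""
    b = np.zeros(labels.shape, dtype=bool)
    b[:, :-1] |= labels[:, :-1] != labels[:, 1:]
    b[:-1, :] |= labels[:-1, :] != labels[1:, :]
    return b

def dilate(mask, tol):
    """Binary dilation with a (2*tol+1)^2 square window: the tolerance band."""
    padded = np.pad(mask, tol)
    h, w = mask.shape
    out = np.zeros_like(mask)
    for dy in range(2 * tol + 1):
        for dx in range(2 * tol + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out

def boundary_recall_precision(sp_labels, gt_labels, tol=2):
    """BR: fraction of GT boundary pixels near a superpixel boundary.
    BP: fraction of superpixel boundary pixels near a GT boundary."""
    sp_b, gt_b = boundary_map(sp_labels), boundary_map(gt_labels)
    br = (gt_b & dilate(sp_b, tol)).sum() / max(gt_b.sum(), 1)
    bp = (sp_b & dilate(gt_b, tol)).sum() / max(sp_b.sum(), 1)
    return br, bp
```

High BR with low BP is the signature of over-responsive superpixels: they trace the building edges but also many irrelevant image gradients.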

Evaluation of Building Extraction
In this section, we meticulously evaluate the performance of the proposed methodology for building extraction by examining both the pixel-level accuracy and the vector-level correspondence. Despite the primary focus of this paper being the generation of vector graphics, the pixel-level assessment remains crucial, as we employ conventional simplification and regularization techniques.
Pixel-wise Segmentation. To assess the influence of superpixels within the network on the precision of building detection, we employ U²-Net for semantic segmentation (using the same encoder as our superpixel segmentation network) as the baseline, i.e., classification heads following the feature map f. The segmentation results are depicted in Fig. 12. SuperpixelGraph generates more distinct and sharper boundaries than the baseline. This can be attributed to the fact that traditional networks such as U²-Net rely on convolution and pooling operations, which cannot preserve boundaries in the encoder, whereas our superpixel generation network provides boundary-preserving superpixels.
Furthermore, to assess the efficacy of the learned pixel-superpixel association matrix Q (Equation 8), we perform ablation analyses, as displayed in Table 3. U²-Net+SLIC refers to averaging the baseline's segmentation results within SLIC superpixels; U²-Net+SLIC+GAT additionally applies average superpixel pooling over the SLIC results, followed by graph optimization. The quantitative outcomes in Table 3 show that the disparities between SuperpixelGraph and the baseline in pixel-wise metrics are relatively minor. However, as we demonstrate subsequently, the vector-level differences are substantially more pronounced. Additionally, naively substituting the graph construction with SLIC does not improve performance and in some cases fails to converge (as seen on the Vegas dataset). This underscores the significance of the soft superpixel aggregation facilitated by the association matrix Q.

Vector-Level Segmentation Analysis. In this section, we assess and compare the performance of the resulting building vector polygons against existing methodologies. For U²-Net (Qin et al., 2020), which only provides building labels, we employ the ASIP and ArcGIS (Gribov, 2019) techniques for vectorization. We also compare our approach against state-of-the-art end-to-end networks, namely PolyWorld (Zorzi et al., 2022) and Frame Field Learning (FrameField) (Girard et al., 2021). The evaluation metrics are consistent with those used by Huang et al. (2021): weighted coverage (WC), boundary F-score (BF), Hausdorff distance (HD), vertex number error (VNE), and average precision (AP). The evaluation results, computed over vectors that have at least 50% intersection over union with the ground-truth instances, are presented in Table 4.
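The soft superpixel aggregation via the association matrix Q used in the ablations above can be sketched as follows. This is a simplified NumPy illustration with hypothetical function names; in the actual network the pooling operates on learned feature maps, Q is produced by the superpixel branch, and the pooled node features are refined by GAT layers before being scattered back to pixels.

```python
import numpy as np

def superpixel_pool(feat, Q):
    """Soft superpixel pooling.

    feat: (N, C) per-pixel features; Q: (N, K) soft pixel-superpixel
    association with rows summing to 1. Each superpixel's node feature
    is the Q-weighted mean of the pixel features assigned to it.
    """
    mass = Q.sum(axis=0)                                  # (K,) soft pixel count
    return (Q.T @ feat) / np.maximum(mass, 1e-8)[:, None]  # (K, C)

def superpixel_unpool(nodes, Q):
    """Scatter (e.g. GAT-refined) node features back to pixels: (N, C)."""
    return Q @ nodes
```

Because Q is learned jointly with the segmentation task, gradients flow through both pooling and unpooling, which is precisely what a hard SLIC assignment cannot offer.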
It is important to note that the results for PolyWorld and FrameField are obtained using their respective pre-trained weights, as the training code has been kept proprietary by the authors. Table 4 compares the vector-level metrics: the top rows show end-to-end approaches, while the bottom rows require raster-to-vector conversion. From the results, it is evident that U²-Net+ASIP yields high WC and BF values but a low AP50. This is because WC and BF are computed only over correctly matched instances; in other words, these methods produce a significantly smaller number of high-quality buildings. Given that the U²-Net baseline achieves nearly identical pixel-wise metrics to SuperpixelGraph, the disparity between their pixel-level and vector-level scores is striking. FrameField demonstrates substantially better generalization than PolyWorld, likely because it learns a low-level field representation while the vector polygons are still generated with traditional tracing techniques. Regarding U²-Net+ArcGIS, its segmentation accuracy surpasses that of U²-Net+ASIP, underscoring the importance of the building tracing method for raster-to-vector conversion. Nonetheless, the proposed method, despite employing conventional polygon simplification and regularization, still outperforms all the aforementioned techniques.
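As background for the AP50 figures in Table 4, the following sketch matches predicted instances to ground truth at a 50% IoU threshold. For brevity it uses axis-aligned boxes in place of general polygons (a full implementation would compute polygon IoU with a computational-geometry library); the greedy one-to-one matching and all names are illustrative assumptions, not the paper's exact protocol.

```python
def box_iou(a, b):
    """IoU of axis-aligned boxes (x1, y1, x2, y2); stand-in for polygon IoU."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(ix2 - ix1, 0.0) * max(iy2 - iy1, 0.0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def match_at_iou(preds, gts, thr=0.5):
    """Greedy one-to-one matching at IoU >= thr; returns (TP, FP, FN)."""
    remaining = list(gts)
    tp = 0
    for p in preds:
        best, best_iou = None, thr
        for g in remaining:
            iou = box_iou(p, g)
            if iou >= best_iou:
                best, best_iou = g, iou
        if best is not None:
            remaining.remove(best)   # each GT instance matches at most once
            tp += 1
    return tp, len(preds) - tp, len(remaining)
```

The gap between pixel-wise IoU and AP50 arises exactly here: a prediction that covers a building's pixels but merges two instances, or fragments one, counts against AP50 even though the raster overlap looks fine.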

Interactive editing with global optimization
Owing to the intricacy of roof types, environmental disruptions, occlusions, and noise, segmentation outputs are inevitably imperfect, which significantly hinders the adoption of current methods in real-world applications. In our work, we enable operators to edit the segmentation results using the auxiliary information generated by our approach. We showcase several typical scenarios before and after interactive optimization in Fig. 13, and the quantitative results are provided in Table 5. In practical implementations, we overlay the detected results on the images with transparent shading. Operators draw strokes across the image, and when the mouse hovers over a pixel, the corresponding superpixel is highlighted. Substantial improvements can be achieved through minimal manual interaction, rectifying various types of errors: missed small roofs (Fig. 13a), false-positive detections (Fig. 13b), imperfect boundaries (Fig. 13c), and mixed errors in complex scenes with small buildings (Fig. 13d).

Figure 13: Examples of interactive optimization. From left to right: input images with ground-truth footprints filled in cyan, initial results accompanied by interactive strokes, and the optimized results.
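The stroke-based correction described above can be sketched as follows. This is an illustrative simplification with hypothetical names: each superpixel touched by a stroke has its building state set, and the mask is re-rasterized for display; the actual tool additionally feeds the result back through boundary regularization, which is omitted here.

```python
import numpy as np

def apply_stroke(sp_labels, is_building, stroke_pixels, set_to=True):
    """Toggle the building state of every superpixel touched by a stroke.

    sp_labels: (H, W) int map of superpixel ids.
    is_building: dict mapping superpixel id -> bool, updated in place.
    stroke_pixels: (row, col) coordinates along the operator's stroke.
    set_to: True to add the touched superpixels to the building mask,
            False to erase them (false-positive correction).
    Returns the rasterized building mask after the edit.
    """
    for r, c in stroke_pixels:
        is_building[int(sp_labels[r, c])] = set_to
    mask = np.zeros(sp_labels.shape, dtype=bool)
    for sp, b in is_building.items():
        if b:
            mask |= sp_labels == sp
    return mask
```

Because edits operate on whole superpixels rather than pixels, a single short stroke fixes a missed roof or a spurious detection without any vertex editing.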

Conclusion
In this paper, we have presented a learning-based, semi-automatic building footprint detection algorithm that utilizes superpixels as the fundamental segmentation units, thereby ensuring improved preservation of building boundaries. By integrating both the superpixel segmentation and building semantic generation tasks within a single, multi-task network, our approach streamlines the process and enhances its efficiency. Furthermore, the superpixel graph construction within the network facilitates subsequent manual refinement. Our experimental findings reveal the effectiveness of the proposed method, showcasing its ability to produce more precise building outlines while also demonstrating significant potential for integration into practical applications. Despite its current reliance on a post-processing step for generating vectorized polygons, an intriguing avenue for future research involves the combination of the superpixel representation with an end-to-end vectorization module, which could potentially elevate the algorithm's performance and applicability.