A METHOD FOR ROOF WIREFRAME RECONSTRUCTION BASED ON SELF-SUPERVISED PRETRAINING

Abstract: In this paper, we present a two-stage method for roof wireframe reconstruction employing a self-supervised pretraining technique. The initial stage utilizes a multi-scale mask autoencoder to generate point-wise features. The subsequent stage involves three steps for edge parameter regression. First, the initial edge directions are generated under the guidance of edge point identification. Next, edge parameter regression and matching modules extract the parameters of the edge representation (namely, direction vector and length) from the obtained edge features. Finally, a specifically designed edge non-maximum suppression and an edge similarity loss function are employed to optimize the final wireframe models and eliminate redundant edges. Experimental results indicate that the pre-trained self-supervised model, combined with the roof wireframe reconstruction task, demonstrates superior performance on both the publicly available Building3D dataset and its post-processed iteration, the Dense dataset, outperforming even traditional methods.


INTRODUCTION
Accurate 3D roof models play a crucial role in architectural applications and urban planning. Existing 3D roof reconstruction methods are mainly divided into two categories: image-based and point cloud-based (Wang and Zakhor, 2022). In this paper, we focus on research related to point cloud-based reconstruction. In comparison to the 3D roof mesh representation (constructed from vertices, lines, and faces) used in surface reconstruction, a wireframe model (constructed from vertices and lines) is a simplified representation of a complex 3D shape. It can be generated from a set of point cloud data and can also serve as a foundation for constructing mesh models. Because a wireframe model is built from vertices and lines, many researchers have transformed the problem of wireframe reconstruction into vertex identification and edge-linking problems.
With the widespread adoption of deep learning techniques, and considering that traditional methods may introduce accumulated errors, several end-to-end 3D wireframe reconstruction methods have been proposed. A pioneering work is Point2Roof (Li et al., 2022), the first end-to-end 3D roof modeling method for airborne LiDAR point clouds. It employs PointNet++ (Qi et al., 2017b) as a backbone to extract point-wise features and identifies a series of candidate corner points. These candidate corner points are clustered into a set of initial vertices, and a paired point attention module is proposed to predict the final accurate vertices. However, this method is tested only on synthetic roof datasets covering 16 simple roof types. Following this trajectory, Wang et al. (Wang et al., 2023) proposed a supervised baseline to predict the final wireframe model on an urban-scale dataset consisting of more than 160 thousand real-world buildings. WireframeNet (Cao et al., 2023) utilizes the medial axis transform to filter the original input point cloud; it integrates the corner and edge point information and analyzes the connectivity between the edge points to construct the complete wireframe structure. The aforementioned methods are supervised, however, and require a large amount of labeled data. Given the expensive cost of manual labeling, self-supervised learning (SSL) has been proposed to address this problem: the model is trained with supervisory signals generated from the data itself. In previous work (Wang et al., 2023), a self-supervised baseline is proposed that identifies corner points and uses a graph neural network to generate the wireframe model. In essence, this method also follows the common route of corner identification and edge prediction, which may fail in the presence of missing corner points and sparsity in real-world datasets.
In this paper, we propose a pretrain-finetune pattern for 3D roof wireframe reconstruction, aiming to address the limitations encountered on real-world datasets. Specifically, we adopt a mask autoencoder as our self-supervised feature extractor. The input point clouds are partitioned into several patches using farthest point sampling and Ball Query techniques. These patches are split into visible and invisible patches, and the visible patches are fed into a mask autoencoder with a multi-scale feature mechanism to extract point-wise features. Subsequently, these point-wise features are input into a specifically designed edge point regression module to generate the initial direction of each edge; this initial direction is determined through an efficient 3D line fitting method. The output, including the edge point identification results and the initial edge directions, is then fed into the edge regression module to obtain each edge's equation, which comprises the direction vector and length of the edge. More precisely, the edge parameter regression module parses the obtained edge features (edge confidence scores, edge length, direction vector, and direction offset) by constructing multiple multilayer perceptrons (MLPs) to regress the distinct parameterized edges. Additionally, bipartite edge matching and edge non-maximum suppression strategies are employed to generate accurate wireframe models.
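As a concrete illustration of the patch-generation step, the greedy farthest point sampling (FPS) used to pick patch centers can be sketched as follows. This is a minimal NumPy sketch; the function name and array shapes are illustrative choices, not the paper's implementation.

```python
import numpy as np

def farthest_point_sampling(points: np.ndarray, n_samples: int) -> np.ndarray:
    """Greedy FPS: repeatedly pick the point farthest from the chosen set.

    points: (N, 3) array; returns indices of the sampled patch centers.
    """
    n = points.shape[0]
    chosen = np.zeros(n_samples, dtype=np.int64)
    # Distance from every point to the nearest already-chosen center.
    dist = np.full(n, np.inf)
    chosen[0] = 0  # start from an arbitrary point
    for i in range(1, n_samples):
        d = np.linalg.norm(points - points[chosen[i - 1]], axis=1)
        dist = np.minimum(dist, d)
        chosen[i] = int(np.argmax(dist))
    return chosen

pts = np.random.default_rng(0).random((1024, 3))
centers = farthest_point_sampling(pts, 16)
```

The sampled centers then seed the Ball Query grouping that forms the patches.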
Our main contributions can be summarized as follows: (1) To the best of our knowledge, our proposed method is the first to utilize a pretrain-finetune strategy for edge regression under the guidance of edge point identification. The pre-trained self-supervised model, when combined with the downstream task of roof wireframe reconstruction, demonstrates superior performance.
(2) We introduce a novel method for parameterized edge extraction that leverages edge point information to establish the initial direction vector of each edge. This initial direction vector, combined with the edge direction, direction offset, and edge length, is used iteratively to generate wireframe drafts. Subsequently, edge-based non-maximum suppression is applied to eliminate redundant edges.

RELATED WORK

SSL for point clouds based on mask recovery
The fundamental concept underlying mask-recovery SSL methods is to learn point cloud representations by reconstructing corrupted point clouds, with the aim of recovering the original structure as faithfully as possible. For example, Point-BERT (Yu et al., 2022) randomly masks a portion of the input point cloud and employs a BERT-style transformer to reconstruct the invisible tokens. This reconstruction is performed under the supervision of visible tokens obtained from a pre-trained discrete Variational Autoencoder (dVAE). However, this approach tends to depend heavily on data augmentation. To address this issue, Point-MAE (Pang et al., 2022) introduces a unified mask autoencoder framework. It uses a standard transformer backbone with an asymmetric encoder-decoder architecture to effectively recover data masked at a high ratio. Occupancy-MAE (Min et al., 2023), in contrast, reconstructs masked voxels from only a small number of visible voxels; it combines a range-aware random masking strategy with a pretext task of occupancy prediction after dividing the input point clouds into multiple voxels. Continuing along this path, we employ a mask-recovery-based autoencoder as our self-supervised feature extractor.
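The high-ratio random masking shared by these autoencoders can be sketched as follows. This is an illustrative NumPy sketch; `random_mask` and the 75% ratio in the example are our own choices, not taken from any of the cited methods.

```python
import numpy as np

def random_mask(n_patches, mask_ratio, rng):
    """Split patch indices into visible and masked (invisible) sets,
    mimicking the high-ratio random masking used by mask autoencoders."""
    n_mask = int(n_patches * mask_ratio)
    perm = rng.permutation(n_patches)
    return perm[n_mask:], perm[:n_mask]  # visible indices, masked indices

# E.g. 64 patches with a 75% mask ratio: only 16 patches stay visible
# and are passed to the encoder; the decoder must recover the other 48.
visible, masked = random_mask(64, 0.75, np.random.default_rng(0))
```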

Wireframe reconstruction
The detection of corner points, akin to 3D key points, plays a crucial role in capturing and representing structural information. In previous work, USIP (Li and Lee, 2019) utilized a feature proposal network to obtain features for key points along with their corresponding transformation pairs. The generated key points were optimized by minimizing the distances between detected key points in pairs of point clouds. Li et al. (Li et al., 2022) employed point-wise binary classification and point clustering techniques to identify initial vertices. The PointNet backbone was then utilized to extract vertex features and perform offset regression, and a subsequent refinement process precisely determined the locations of the initial vertices, yielding accurate vertex representations. Jiang et al. (Jiang et al., 2023) introduced a double-flow structure for extracting both semantic and offset information. These informative features were then leveraged to identify high-quality interest regions, and deep estimators were employed to predict corner point proposals within each region, facilitating accurate corner detection. It is noteworthy, however, that these methods are supervised learning approaches and necessitate a substantial number of annotated point clouds. Furthermore, they have primarily been applied to synthetic datasets and regular, close-range geometric objects.
Additionally, alternative methods based on different patterns have been proposed. LC2WF (Yicheng Luo and Bao, 2022) introduced a Line-Patch transformer-based network to construct building wireframes by extracting junctions and connectivities from line cloud datasets. Ma et al. (Ma et al., 2022) presented a deep spatial gestalt model designed to infer the relationship between visible and invisible cues of 3D structure. Furthermore, some research has focused on generating wireframes by extracting line segments; in particular, several works segment the input point cloud into a collection of facets to extract line segments from large-scale point clouds (Lin et al., 2015; Lin et al., 2017). In contrast to these approaches, our proposed roof wireframe reconstruction method omits the corner detection steps and instead regresses the edges directly, guided by the initial directions generated from the edge points.

Overview of experimental datasets
We present a comprehensive description of the publicly available Building3D dataset and its post-processed Dense dataset, to elucidate the motivation behind constructing the Dense dataset and introducing our proposed approach. As depicted in Fig. 1(a), we randomly selected four roof samples from the public Building3D dataset (Wang et al., 2023). This selection revealed certain issues, including noise and missing corner points, denoted by the red-dotted circles. The conventional wireframe reconstruction pattern of "corner detection + edge prediction" struggles under these conditions. In response, we curated a Dense dataset, as illustrated in Fig. 1(b), with the aim of observing the performance disparity of our proposed method on both dataset versions.

Edge regression and matching strategies: The purpose of the edge regression module is to extract the parameters defining the edge representation from the acquired edge features. This module comprises four dedicated MLPs, each operating on specific tensors (edge confidence scores, direction offset, edge direction, and edge length) to regress the corresponding edge parameters. Initially, an Edge Confidence Scores MLP is employed to derive confidence scores for the edges. Subsequently, the orientation of the edge is decomposed into three components along the XYZ axes, using a Direction Offset MLP to predict the three component values. Finally, an Edge Length MLP is utilized to predict the residual offsets from candidate points to the midpoints of the edges. Moreover, the conventional algorithm for solving bipartite edge matching is the Hungarian algorithm (Zhang et al., 2016). Nevertheless, such methods prove unsuitable for our case, given the absence of a corner detection process in the 3D wireframe reconstruction. We therefore formulate a novel edge similarity based on three distinct factors: the distance between the two lines, the edge length, and the edge direction.
(1) The distance between the two lines. The Hausdorff Edit Distance (HED) (Fischer et al., 2015) is employed for pairwise node matching: each node in one graph is compared with every node in the other graph, akin to comparing subsets of a metric space using the Hausdorff distance. Given two sets of edges A and B, with edges denoted A_i and B_j and points on these edges denoted a ∈ A_i and b ∈ B_j, the Hausdorff distance with respect to the metric dis(a, b) is defined as

d_H(A_i, B_j) = max{ max_{a ∈ A_i} min_{b ∈ B_j} dis(a, b), max_{b ∈ B_j} min_{a ∈ A_i} dis(a, b) }. (1)

(2) Edge length. The proximity between two edges is quantified by the ratio of the minimum distance d_min between the end points of the two edges to the length of the shorter segment:

r(A_i, B_j) = d_min / min(len(A_i), len(B_j)). (2)

(3) Edge direction. Cosine similarity (Hoe et al., 2021) is employed to assess the directional similarity between two edges with direction vectors v_i and v_j:

cos θ_ij = (v_i · v_j) / (‖v_i‖ ‖v_j‖). (3)

The edge similarity is expressed by combining Eqs. (1), (2), and (3):

S(A_i, B_j) = α d_H(A_i, B_j) + β r(A_i, B_j) + γ (1 − cos θ_ij). (4)

Here, α, β, and γ are balancing coefficients; in our experiments they are set to α = 2, β = 0.5, and γ = 0.5, respectively. The smaller the value of Eq. (4), the greater the similarity between the two edges.
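The three factors and their weighted combination (α = 2, β = 0.5, γ = 0.5) can be sketched as follows. The exact per-term formulas are our interpretation of the text; the names (`hausdorff`, `edge_similarity`) and the point-sampling count are illustrative, not the authors' code.

```python
import numpy as np

def hausdorff(P, Q):
    """Symmetric Hausdorff distance between two point sets (n,3) and (m,3)."""
    D = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=-1)
    return max(D.min(axis=1).max(), D.min(axis=0).max())

def edge_similarity(a, b, alpha=2.0, beta=0.5, gamma=0.5, n=16):
    """Smaller value = more similar. a, b are (2, 3) endpoint pairs."""
    t = np.linspace(0.0, 1.0, n)[:, None]
    Pa = a[0] + t * (a[1] - a[0])            # points sampled along edge a
    Pb = b[0] + t * (b[1] - b[0])
    d_lines = hausdorff(Pa, Pb)              # term (1): distance between lines
    ends = [np.linalg.norm(p - q) for p in a for q in b]
    lens = (np.linalg.norm(a[1] - a[0]), np.linalg.norm(b[1] - b[0]))
    r = min(ends) / min(lens)                # term (2): endpoint gap / shorter length
    va, vb = a[1] - a[0], b[1] - b[0]
    cos = abs(va @ vb) / (np.linalg.norm(va) * np.linalg.norm(vb))
    return alpha * d_lines + beta * r + gamma * (1.0 - cos)  # term (3): direction

e1 = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
e2 = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
```

Two identical edges score 0 (maximally similar), while distant or differently oriented edges score higher.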
Edge Non-Maximum Suppression: The obtained edge similarity representation and the predicted edge confidence scores are combined to eliminate redundant edges. First, all edges are sorted in descending order of confidence score. Next, the edge similarity function is constructed, and a threshold is set to identify similar edges. Finally, an empty list is initialized to store the selected candidate edges, and edges are iterated from high to low confidence and added to the list only if they meet the threshold criterion.
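The three steps above can be sketched as a greedy procedure. This is an illustrative sketch: `edge_nms` and the 1-D stand-in edges are ours, and the real similarity function would be the edge similarity of Eq. (4), where a smaller value means a more similar pair.

```python
def edge_nms(edges, scores, sim_fn, sim_thresh):
    """Greedy edge non-maximum suppression: iterate edges by descending
    confidence and keep an edge only if its similarity value to every
    already-kept edge exceeds sim_thresh (smaller value = more similar)."""
    order = sorted(range(len(edges)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(sim_fn(edges[i], edges[j]) > sim_thresh for j in kept):
            kept.append(i)
    return kept

# Toy example: 1-D stand-ins for edges, similarity = absolute difference.
edges = [0.0, 0.05, 1.0]
kept = edge_nms(edges, [0.9, 0.8, 0.7], lambda a, b: abs(a - b), 0.1)
```

Here the second edge (0.05) is suppressed as a near-duplicate of the first, while the distant third edge survives.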

Loss function
The overall loss of the entire framework comprises three components: the self-supervised feature extractor module, the edge point regression module, and the edge regression and matching modules. Self-supervised feature extractor loss: it is computed using the Chamfer distance (Pang et al., 2022). Edge point regression and edge length losses: we present an illustrative example, depicted in Fig. 3, to elucidate the design of the edge point loss. Let any point in a point cloud be denoted as p; the distances between p and its neighboring edges (Line1, Line2, Line3, and Line4) are marked as d1, d2, d3, and d4, respectively. Assume the distance between p and its projection point on Line1 is the shortest. In three-dimensional space, the points at this shortest distance are distributed on the surface of a sphere, such as the marked point termed the candidate point. In addition to ensuring that the distance between p and its projection point is the shortest, the distance between the candidate point and the projection point must also be minimized. The L1-norm distance loss (He et al., 2017) is employed to enforce these constraints and optimize edge point regression and edge length. Edge confidence scores and edge direction losses: the predicted edge confidence scores and edge directions are optimized with cross-entropy and cosine similarity loss functions, respectively.
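The geometric quantity behind the edge point loss, the distance from a point to its projection on a nearby edge, can be sketched as follows. This is an illustrative helper: `point_to_segment` is our naming, not the authors' implementation.

```python
import numpy as np

def point_to_segment(p, a, b):
    """Distance from point p to segment ab, plus the projection point.
    The edge point loss pushes candidate points toward the projection
    point on the nearest edge while keeping that distance minimal."""
    ab = b - a
    # Clamp the projection parameter so the foot stays on the segment.
    t = np.clip(np.dot(p - a, ab) / np.dot(ab, ab), 0.0, 1.0)
    proj = a + t * ab
    return float(np.linalg.norm(p - proj)), proj

# A point hovering above the middle of a unit segment along the x-axis.
p = np.array([0.5, 1.0, 0.0])
d, proj = point_to_segment(p, np.array([0.0, 0.0, 0.0]),
                           np.array([1.0, 0.0, 0.0]))
```

The nearest edge (Line1 in Fig. 3) would be the one minimizing this distance over all neighboring edges.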
Overall edge similarity loss: The total loss is computed by combining the aforementioned self-supervised loss, edge point loss, edge length loss, edge confidence score loss, and edge direction loss, each with its own weighting factor.

Details of the experimental datasets
In this experimental section, the roof data are drawn from two versions: the Building3D dataset (Wang et al., 2023) and the Dense dataset. The Dense dataset was generated with the CloudCompare software, producing a dense point representation derived from the Building3D dataset. This was done to assess the performance difference of the proposed method between the Dense dataset, which has no missing corner points, and Building3D. The primary distinction between the two lies in point density and the presence of missing corner points: the Dense dataset maintains uniform point density throughout, whereas the Building3D dataset may exhibit density variations and lack some points. Each dataset consists of 5,698 training point clouds and 583 testing point clouds.

Implementation details
During the SSL pipeline, we employ the AdamW optimizer with a learning rate of 0.001 and train for 1000 epochs; the pre-training phase of the model consists of 300 epochs with a batch size of 128. In the fine-tuning process, the input size is set to 2048 points. The point-wise features obtained from the SSL pipeline are then fed into the edge point identification module. These features, along with 128 query points and 64 query embeddings, are input into the decoder to obtain edge features with 256 channels. All models are trained on an RTX A6000 GPU with 48 GB of memory.

Evaluation metric
The precision and recall of corners and edges are evaluated on the aforementioned roof datasets, using the same metrics as (Wang et al., 2023) and (Li et al., 2022). The average precision (AP) and recall (AR) of the predicted corners are calculated over the set of predicted labels G and the set of ground-truth labels Q as AP = |G ∩ Q| / (|G ∩ Q| + |G ∩ Q̄|) and AR = |G ∩ Q| / (|G ∩ Q| + |Ḡ ∩ Q|), where |·| and the overbar denote the cardinality of a given set and the complementary set of a given set, respectively. In addition, we also calculate the ACO, i.e., the average offset between predicted corners and ground-truth corners.
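A minimal sketch of how such corner metrics could be computed, assuming a greedy one-to-one matching within a distance threshold. The matching rule and `corner_metrics` are our simplification of the evaluation protocol, not the benchmark's official code.

```python
import numpy as np

def corner_metrics(pred, gt, thresh=0.1):
    """Greedily match each predicted corner to its nearest unused
    ground-truth corner within `thresh`, then report precision (AP),
    recall (AR), and the average corner offset (ACO) of matched pairs."""
    used = set()
    offsets = []
    for p in pred:
        d = np.linalg.norm(gt - p, axis=1)
        j = int(np.argmin(d))
        if d[j] <= thresh and j not in used:
            used.add(j)
            offsets.append(float(d[j]))
    tp = len(offsets)
    ap = tp / len(pred)                       # matched / predicted
    ar = tp / len(gt)                         # matched / ground truth
    aco = float(np.mean(offsets)) if offsets else float("inf")
    return ap, ar, aco

pred = np.array([[0.0, 0.0, 0.0], [1.01, 0.0, 0.0], [5.0, 0.0, 0.0]])
gt = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
ap, ar, aco = corner_metrics(pred, gt)
```

In this toy case, two of the three predictions match, so AP = 2/3, AR = 1, and ACO is the mean of the two matched offsets.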
The experimental results demonstrate that our proposed method exhibits a significant performance margin over the baseline provided by Building3D (Wang et al., 2023). On the Dense dataset, the metrics CR and ER increase by 19% and 54% (Building3D: 22% and 50%), respectively, and the corresponding F1 scores improve by 16% and 39% (Building3D: 13% and 29%). Furthermore, the ACO distance decreases significantly. As illustrated in Fig. 6, the red and blue dotted boxes indicate that higher values correspond to better performance for CP, CR, EP, ER, and the F1 scores of corners and edges, while lower values of ACO signify better results. These improvements can be attributed to the guidance provided by the edge point regression and the initial edge directions in our proposed method.
Furthermore, we conduct visual comparisons with several methods, including three traditional approaches, 2.5D Dual Contouring (Zhou and Neumann, 2010), City3D (Huang et al., 2022), and KSR (Bauchet and Lafarge, 2020), as well as two recent deep learning methods, PC2WF (Liu et al., 2021) and NerVE (Zhu et al., 2023), as illustrated in Fig. 7. Compared with the two deep learning-based methods, our results achieve the most accurate representation of the roofs, as evidenced by the visualizations. Our outputs are wireframes, which can be converted into mesh format by filling each face of the wireframe and assigning specific height values; this convertibility is a clear advantage over the reconstruction results of the traditional methods. Overall, the reconstructions achieved by our proposed method surpass those of all compared methods, both traditional and deep learning-based.

CONCLUSION
This paper introduces a two-stage SSL method for the reconstruction of 3D roof wireframes. The proposed approach consists of four key components: a self-supervised feature extractor, an edge point regression module, edge regression and matching modules guided by the initial edge direction, and an edge non-maximum suppression module that eliminates redundant edges to achieve an accurate wireframe model. Notably, we incorporate an efficient edge point regression loss to identify the distribution of edge points and ensure the accuracy of the initial edge direction. The resulting parameterized edges then undergo bipartite edge matching using the designed edge similarity algorithm. Experimental results demonstrate the superiority of our approach over state-of-the-art methods in terms of reconstruction performance on both the Dense and Building3D datasets. Furthermore, the proposed method can be extended to other domains within architecture and urban planning where labeled data is limited, enabling the generation of more precise and diverse models.

Figure 1. Characterization of experimental datasets: (a) illustration of the public Building3D dataset and (b) its post-processed Dense dataset.

Figure 2. Workflow of our proposed methodology.

Workflow of our proposed methodology

Self-supervised feature extractor and edge point regression module: The Point-MAE (Pang et al., 2022) serves as the self-supervised feature extractor backbone, facilitating the extraction of multi-scale features. As depicted in the upper-left corner of Fig. 2, the grey points and circles represent the output of the farthest point sampling (FPS) and Ball Query operations on the original input point cloud. The patches (Patch 1, Patch 2, and Patch 3) illustrate point sets at varying scales generated by applying the Ball Query operation to the original input. Subsequently, a random masking strategy is applied to the acquired point sets at different scales, segregating them into visible and invisible segments. The unmasked portions are then fed into an encoder-decoder structure to extract point-wise features. These point-wise features are passed to an edge point regression module to derive the identification results for the edge points, after which an efficient 3D line fitting strategy is applied to extract the edge features.
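The Ball Query grouping that forms the multi-scale patches can be sketched as follows. This is an illustrative NumPy sketch; `ball_query` mirrors common PointNet++-style implementations rather than the paper's exact code.

```python
import numpy as np

def ball_query(points, centers, radius, k):
    """For each patch center, collect up to k neighbour indices within
    `radius`, padding short groups by repeating the first neighbour.
    Centers are assumed to be drawn from `points`, so groups are non-empty."""
    groups = []
    for c in centers:
        idx = np.flatnonzero(np.linalg.norm(points - c, axis=1) <= radius)[:k]
        pad = np.full(k, idx[0], dtype=np.int64)
        pad[: len(idx)] = idx
        groups.append(pad)
    return np.stack(groups)  # (n_centers, k) index array: one patch per center

points = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [2.0, 0.0, 0.0]])
groups = ball_query(points, points[:1], radius=0.5, k=4)
```

Varying the radius yields the point sets at different scales (Patch 1, 2, 3) described above.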

Figure 3. Illustration of the edge point regression loss.
TP = |G ∩ Q|, FP = |G ∩ Q̄|, and FN = |Ḡ ∩ Q|. The symbols CP and EP, as well as CR and ER, have the same formulation as the previously mentioned AP and AR, respectively, and represent the precision and recall of corner points and edges. The symbols |·| and the overbar denote the cardinality of a given set and the complementary set of a given set.

Figure 4. Results from the iterative step in our proposed method on the Dense dataset.

Figure 5. Results from the iterative step in our proposed method on the Building3D dataset.

Figure 6. Comparative analysis of methodological performance on the Building3D dataset.

Figure 7. Qualitative comparisons among three conventional approaches, two affiliated deep learning methods, and our proposed method.

Table 1. Comparative analysis of methodological performance on the Building3D dataset.