Iterative multilinear optimization for planar model fitting under geometric constraints

Planes are the core geometric models present everywhere in the three-dimensional real world. There are many examples of manual constructions based on planar patches: facades, corridors, packages, boxes, etc. In these constructions, planar patches must satisfy orthogonal constraints by design (e.g. walls with a ceiling and floor). The hypothesis is that by exploiting orthogonality constraints when possible in the scene, we can perform a reconstruction from a set of points captured by 3D cameras with high accuracy and a low response time. We introduce a method that can iteratively fit a planar model in the presence of noise according to three main steps: a clustering-based unsupervised step that builds pre-clusters from the set of (noisy) points; a linear regression-based supervised step that optimizes a set of planes from the clusters; a reassignment step that challenges the members of the current clusters in a way that minimizes the residuals of the linear predictors. The main contribution is that the method can simultaneously fit different planes in a point cloud providing a good accuracy/speed trade-oﬀ even in the presence of noise and outliers, with a smaller processing time compared with previous methods. An extensive experimental study on synthetic data is conducted to compare our method with the most current and representative methods. The quantitative results provide indisputable evidence that our method can generate very accurate models faster than baseline methods. Moreover, two case studies for reconstructing planar-based objects using a Kinect sensor are presented to provide qualitative evidence of the efficiency of our method in real applications.


INTRODUCTION
Fitting multiple plane-based models under geometric constraints to point clouds obtained with RGBD noisy sensors in time-constrained applications remains a challenging problem. More broadly, geometric primitive detection (planes, cuboids, spheres, cylinders, etc.) has been extensively studied for multiple applications (robotics, modeling, shape processing, rendering, interaction, animation, architecture, etc.) (Kaiser, Ybanez Zepeda & Boubekeur, 2019). Multiple plane-based primitives are particularly interesting because they are very common in several engineering and architectonic projects in industry and construction, configuring objects and scenarios designed with geometric characteristics between planes (angles, position, etc.). Building corridors, facades or rooms, manufactured components, packages, boxes, etc., are commonly formed by planar patches. These engineering and architectonic elements are present in different application domains, such as robot navigation, object reconstruction and reverse engineering (Werghi et al., 1999;Benko et al., 2002;Anwer & Mathieu, 2016).
Computer vision plays an important role in providing methods for modeling objects or scenes by means of processing 3D point clouds. Plane detection and model fitting are frequently used as the first stage in object and scene modeling pipelines. Robot navigation systems use plane detection and fitting to perform Simultaneous Localization and Mapping (SLAM) tasks (Lu & Song, 2015;Xiao et al., 2011). Indoor and outdoor scene reconstruction may be performed with plane detection and fitting phases followed by setting up the piecewise planar model (Cohen et al., 2016;Zhang et al., 2015;Engelmann & Leibe, 2016). Object detection and reconstruction from 3D point clouds is a complex task in the computer vision field, frequently involving a prior plane detection and fitting step (Hu & Bai, 2017, Saval-Calvo et al., 2015b, Saval-Calvo et al., 2015a. Various devices (Sansoni, Trebeschi & Docchio, 2009) can be used to obtain threedimensional (3D) point clouds depending on the application such as medical imaging, industry and robotics. Recently, 3D consumer RGBD cameras have been used extensively in video games and other areas. Point clouds from RGBD cameras are commonly noisy owing to the presence of outliers and missing data (Villena-Martínez et al., 2017). The challenge of model fitting this type of point cloud forces the method to be robust against noise. However, different application areas, such as SLAM or Human-Computer Interatcion (HCI), introduce real-time constraints that must be satisfied by the system.
In this context, we propose a novel iterative method to accurately reconstruct scenes composed by planes from noisy 3D point clouds, using the geometric constraints of the scene. The main contribution of this method is that it can simultaneously fit different planes in a point cloud using linear regression estimators and simple constraints (normal vectors to the planes), providing a robust solution. It exhibits high accuracy in the presence of noise and outliers and reduces the processing time.
The reminder of this paper is organized as follows. First, we review studies related to multi-plane model reconstruction focusing on methods that exploit orthogonality constraints to optimize the relationship between accuracy and temporal cost. In "MC-LSE: An Iterative Multi-Constraint Least Squares Estimation Algorithm", we present the method, MC-LSE, for the multi-constraint least squares estimation (MC-LSE). In "Experiments", we describe an extensive experimental study of the method compared with other related methods. Finally, the conclusions of the experimental analysis are presented in "Discussion and Conclusions".

Related work
The multi-class multi-model fitting is a classic problem dealing with the interpretation of a set of input points originating from noisy observations as a multiple model with different geometric primitives (Barath & Matas, 2019). Common cases of this problem are the estimation of multiple line segments and circles in 2D edge maps, homographies from point correspondences, and multiple motions in videos, planes and spheres in 3D point clouds.
A particular case of model fitting is planar model fitting, which estimates a model composed of a set of planes that are represented by a noisy 3D point cloud. Problems such as point cloud segmentation or clustering are aimed at grouping points in subsets with similar characteristics (geometric, radiometric, etc.), often with unsupervised processes, providing plane detection (Grilli, Menna & Remondino, 2017), without addressing model fitting. Several authors categorize model fitting methods into three groups: Hough transform-based methods, iterative methods (regression and RANSAC), and regiongrowing-based methods (Kaiser, Ybanez Zepeda & Boubekeur, 2019;Xie, Tian & Zhu, 2020;Jin et al., 2017;Deschaud & Goulette, 2010). This common classification includes methods not only related to model fitting but also for segmentation and clustering in plane detection.
Region-growing methods build regions by expanding the area from seeds as certain conditions are met. A plane detection algorithm was proposed in (Poppinga et al., 2008) using a two-point-seed-growing approach. Different clustering methods have been proposed, such as k-means (Cohen-Steiner, Alliez & Desbrun, 2004), which is one of the most commonly used methods. In some cases, these methods are focused on clustering or region segmentation (Nurunnabi, Belton & West, 2012). The success of these methods usually depends on the selection of the initial seed. Furthermore, they are oriented toward plane detection rather than a complete model fitting process.
Hough transform (Hough, 1962) is a common method for parameterized object detection, such as lines or circles, formerly defined for 2D images (Duda & Hart, 1972). Extensions to deal with 3D images have been proposed in several studies (Hulik et al., 2014). The voting process of Hough transform-based methods results in a high computational cost, especially in the presence of large input data. To reduce the computational cost, the randomized Hough transform (RHT) satisfies the voting process in a probabilistic manner (Borrmann et al., 2011). The RHT is adequate for detecting planes in large structures. However, these methods have not yet demonstrated their efficiency in complex model-fitting problems of scenes with geometric constraints between planes.
Random sample consensus (RANSAC) is a robust approach for fitting single models in an iterative manner (Fischler & Bolles, 1981). The method randomly selects an initial subset of the data to calculate a tentative model. It is then validated by counting the remaining data whose distance is under a threshold (i.e. inliers). The process iterates, and the best model is finally selected. In computer vision, RANSAC and its variants are widely used owing to their robustness against noise in the input data. Some methods aimed at fitting planes in images or 3D data are based on RANSAC. In some cases, the least squares estimation (LSE) and RANSAC are used together as the LSE method is used to calculate the model from the initial subset. However, RANSAC tends to simplify complex planar structures (Jin et al., 2017). To overcome this problem, several variants have been proposed (Saval-Calvo et al., 2015a). The CC-RANSAC (Gallo, Manduchi & Rafii, 2011) and its variants (Zhou et al., 2011;Zhou et al., 2013) can obtain multiple surfaces by employing a modified RANSAC loss function to obtain better results. These approaches select multiple subsets (one per expected plane) and consider the relationships among them, instead of selecting one random data and trying to find the best planar model by counting the inliers. CC-RANSAC includes the largest connected component of inliers with eight-neighbor topology. The CC-RANSAC variants improve the basic method by adding vector normal information to allow the estimation of each cluster in the clustering and patch-joining steps. However, these methods do not consider the information related to the planes themselves. MC-RANSAC (Saval-Calvo et al., 2015a) uses a pre-clustering process by using a k-means algorithm to estimate the clusters and a search tree technique to improve the solutions while considering the prior constraints (angles between planes). Moreover, it extends traditional RANSAC by introducing a novel step for evaluating whether the inliers comply with the prior constraints among pre-clusters. Hence, this method outperforms the previous methods, achieving high accuracy in the final model estimation. However, the introduction of the search tree to calculate the pre-clusters and the step in RANSAC to check the constraints among plane models make the method too slow for some applications. New methods have been developed based on RANSAC. Prior-MLESAC (Zhao et al., 2020) is based on the previous maximum likelihood estimation sampling consensus (MLESAC), which improves the extraction of vertical and non-vertical planar and cylindrical structures by exploiting prior knowledge of physical characteristics. Progressive-X (Prog-X) is an any-time algorithm for geometric multi-model fitting using the termination criterion adopted from RANSAC, improving similar methods in terms of accuracy (Barath & Matas, 2019;Barath et al., 2020).
Regression, which is one of the most referred approaches for plane fitting, is a statistical strategy to solve the problem by finding a model that minimizes the overall error in the data. Ordinary LSE and its variants such as least median of squares (LMS) regression are widely used methods. The least squares method is a standard approach used to estimate model parameters by minimizing the squared distances between the observed data and their expected values. In computer vision, it has been widely used as the most popular form of regression analysis in various tasks. However, LSE results are highly influenced by outliers, leading to inconsistent results (Mitra & Nguyen, 2003). LMS (Massart et al., 1986) minimizes the median of the squares and has proved to be more robust than the original LSE. However, it still fails when more than 50% of the data are outliers. Although other robust approaches have been proposed to overcome this problem, such as the least K-th order of square (LKS) or adaptive LKS (Lee, Meer & Park, 1998), the estimation of the optimal parameters requires high computational effort. Consequently, they are not viable for many applications, as the original LSE is one of the most used methods for this purpose, at least as a part of more sophisticated systems. LSE has been widely used to estimate planes or planar patches in computer vision or as a part of robust methods that can calculate them. Araújo & Oliveira (2020) provided a new robust statistical approach, robust statistic plane detection (RSPD), for detecting planes in unorganized point clouds to achieve better accuracy and response times than previous approaches.
Regression, which is one of the most referred approaches for plane fitting, is a statistical strategy to solve the problem by finding a model that minimizes the overall error in the data. Ordinary LSE and its variants such as least median of squares (LMS) regression are widely used methods. The least squares method is a standard approach used to estimate model parameters by minimizing the squared distances between the observed data and their expected values. In computer vision, it has been widely used as the most popular form of regression analysis in various tasks. However, LSE results are highly influenced by outliers, leading to inconsistent results (Mitra & Nguyen, 2003). LMS (Massart et al., 1986) minimizes the median of the squares and has proved to be more robust than the original LSE. However, it still fails when more than 50% of the data are outliers. Although other robust approaches have been proposed to overcome this problem, such as the least K-th order of square (LKS) or adaptive LKS (Lee, Meer & Park, 1998), the estimation of the optimal parameters requires high computational effort. Consequently, they are not viable for many applications, as the original LSE is one of the most used methods for this purpose, at least as a part of more sophisticated systems. LSE has been widely used to estimate planes or planar patches in computer vision or as a part of robust methods that can calculate them. Araújo & Oliveira (2020) provided a new robust statistical approach, robust statistic plane detection (RSPD), for detecting planes in unorganized point clouds to achieve better accuracy and response times than previous approaches.
In recent studies, the problem of multi-model fitting have been addressed using energy functions to balance geometric errors (Isack & Boykov, 2012) and solve multiple geometric models, where greedy approaches such as RANSAC do not perform properly. These global energy-based approaches search for an optimal solution to fit all models present in the multi-structured data set, usually at high computational costs with respect to the number of models fitted (Amayo et al., 2018). Moreover, these methods (PEARL, T-linkage and CORAL (Isack & Boykov, 2012;Magri & Fusiello, 2014;Amayo et al., 2018)) have shown their efficiency in solving 2D multi-model fitting problems and homographies, but do not show extensive experimentation in the reconstruction of 3D multiple plane-based models with geometric constraints. Lin et al. (2020) formulated the problem as a global gradient minimization, proposing an updated method (Global-L0) based on a constraint model that outperforms traditional plane fitting methods.
A review of related works shows that methods for model fitting considering both accuracy and computational cost are needed. We propose a method for plane-based models reconstructed from a set of 3D points, taking advantage of the geometric constraints that are present in the original scene, exhibiting high accuracy in the presence of noise and outliers, and reducing the processing time. We selected the most representative methods for comparison: LSE (Mitra & Nguyen, 2003) and RANSAC (Fischler & Bolles, 1981) as classic baseline methods and MC-RANSAC (Saval-Calvo et al., 2015a), RSPD (Araújo & Oliveira, 2020), Prior-MLESAC (Zhao et al., 2020), Prog-X (Barath & Matas, 2019;Barath et al., 2020) and Global-L0 (Lin et al., 2020) are the most recent, providing a wide range of methods.

MC-LSE: AN ITERATIVE MULTI-CONSTRAINT LEAST SQUARES ESTIMATION ALGORITHM
In this section, we present our iterative method named MC-LSE for multi-constraint leastsquares estimation. After the introduction of our notations in "Notations and Setting", we present the workflow of MC-LSE in "MC-LSE", and it involves three main steps: clustering, linear regression, and reassignment.

Notations and setting
Let us assume that we have access to a set of m points, P ¼ fp , captured by using a 3D camera and assumed to be (likely noisy) representatives of some target planar model (e.g. cube, facade, corridor and office). The task we are dealing with in this study consists of reconstructing this target model by learning n planes under orthogonality constraints. These constraints take the form of an n × n-matrix, A, where A[j, k] = 1 if planes j and k must be orthogonal, and 0 otherwise. In this supervised machine learning setting, p x i ; p y i À Á plays the role of the input feature vector, and p z i is the dependent variable (corresponding to the depth). We approach this task by solving a joint constrained regression problem (which is described in the next section), where the minimizer takes the form of a set of n models, h h j p x i ; p y i À Á (with j = 1…n), that maps linearly from input is supposed to provide a good estimation ofp z i of p z i as follows: Therefore, we deduce that h h j p x i ; p y i À Á is defined as follows: where h θj is the j th plane, and is the corresponding set of parameters learned from a certain subset of points, P j & P.
Let N j ¼ ðh 1 j ; h 2 j ; Àh 3 j Þ be the normal vector of the j th plane andÑ l be the normal vector of any set of points P l & P.Ñ l can be easily computed by selecting the eigenvector corresponding to the smallest eigenvalue of the scatter matrix of P l . Finally, let kNNðp i Þ & P be the set of k-nearest neighbors of p i , given a metric distance.

MC-LSE
MC-LSE is an iterative algorithm that aims at reconstructing the target model from a set of points captured by using a 3D camera. The workflow of our method is presented in Fig. 1, where the target model is a cube. It involves three main steps.
1. A clustering-based unsupervised step that consists of initializing n clusters from set P of points that are supposed to represent at most the n planes viewed by the 3D camera. Note that this step is performed only once.

2.
A linear regression-based supervised step that learns, under orthogonal constraints (using matrix A), n planes from the clusters.
3. A reassignment step that challenges the membership of the points to the current clusters in a manner that minimizes the residuals of the regression tasks.
Steps 2 and 3 are repeated until no (or only a few) reassignments are performed.

Clustering step
As illustrated in Fig. 2, performing clustering from set P might be a tricky task. The points captured by the 3D camera (i) are often noisy representatives of the actual faces of the cube and (ii) may represent overlapping areas, especially at the corners of the cube (Fig. 2 (left)). Therefore, using a standard clustering algorithm with the Euclidean distance (such as K-Means (Forgy, 1965) as used in this study) would lead to irrelevant clusters that would tend to bring together points (see p 1 and p 2 in the figure) that likely do not belong to the same plane. To overcome this drawback, an efficient solution consists of projecting a 3D-point, p i , onto a 6-dimensional space by considering not only the original features, p x i ; p y i ; p z i À Á , but also the 3D-normal vector,Ñ kNNðp i Þ , of set kNN(p i ) of the nearest neighbors of p i (see Fig. 2 (right)). Letp i 2 R 6 be the corresponding point. Set is then used as the input for the clustering algorithm that outputs n clusters, P 1 ,…,P n . Note that if the number of points assigned to a given cluster is not large enough (according to a threshold tuned by the user), the corresponding plane is (at least partially) hidden and cannot be properly captured by the camera. In such a case, the cluster is deleted, and only n − 1 planes are learned. Furthermore, note that the size of the neighborhood, k (which is tuned by cross-validation), affects the homogeneity of the orientation of the normal vectors. The normal is smoothed for a large neighborhood, but the edges of the objects are also smoothed; thus, they are less descriptive. In contrast, if k is small, the normal vectors are significantly affected by noise and less uniform for a single plane surface. To consider the specificity of the application at hand (quality of the 3D camera, level of noise in the data, and camera point of view), we suggest assigning a weight to the normal vector, which is also tuned by cross-validation.
The next step of the process is optimization under the constraints of parameters θ 1 ,…,θ n corresponding to the n planes of the target model. Note that even if the clustering step considers the normal vectors to prevent the algorithm from building irrelevant clusters, outliers may still have a considerable impact on the slope of the learned planes at the first iteration of MC-LSE (see Fig. 3A in the case of a target cube). To address this limitation, we suggest selecting landmarks as a certain percentage of points from each cluster, P i , that are the closest (according to the Manhattan distance) to the centroid of P i . Thus, the initialization of our planes should be improved (see Fig. 3B).

Regression step
The regression step aims to use the points of the current clusters to fit parameters θ 1 ,…,θ n of the 3D planes such that the orthogonality constraints of matrix A are satisfied. To achieve this task, only the original coordinates of the points (p x , p y , p z ) are used. Planes h θj (j = 1…n) are of the following form: The corresponding normal vector of each plane is given by To correctly reconstruct the target planar model, h θ1 ,…,h θn must be learned under constraints such that the normal vectors, N i , N j , are orthogonal; that is, Note that these constraints can be rewritten in terms of θ i and θ j . Let L be a 4 × 4-matrix defined as follows: We can deduce that N t i N j can be rewritten as follows: Planes h θj (p) (j = 1…n) is learned to find the 4 × n matrix θ = (θ 1 ,…,θ n ) as the minimizer of the following constrained optimization least-squares problem. Here,

Reassignment step
The objective of the reassignment is to challenge the membership of the points to the clusters, allowing us to better fit, using a step-by-step method, the parameters of the planes. Similar to the clustering step, the reassignment of p j , assumed to be currently part of cluster P l , accounts for both the original features, ðp x j ; p y j ; p z j Þ, and the 3D-normal vector,Ñ kNNðp j Þ , to prevent two close examples belonging to two different faces from being reassigned to the same cluster. The only difference is that the normal vector,Ñ kNNðp j Þ , is calculated from the nearest neighbors that belong to P l instead of from the entire dataset, P. Letp j be the projection of p j in R 6 .
The reassignment of every point,p j , consists of looking for the closest plane, h θi , in terms of the Euclidean distance and assigning it to the corresponding cluster. In other words, this procedure selects a plane that minimizes the residuals. As illustrated in Fig. 4, this procedure allows us to significantly improve the quality of the clusters, having a positive impact on the planes optimized at the next iteration.

EXPERIMENTS
In this section, we present an extensive experimental study of the proposed algorithm, MC-LSE. After the presentation of the experimental setup, we report the results of comparisons with previous methods in terms of several evaluation criteria. Then, we present an analysis of MC-LSE according to different levels of noise added to the data. The results were used for the cases presented in the next section.  Figure 2 Construction of the pre-clusters using the normal vectors of the points. For simplicity, only two faces are considered here. On the left: a set of noisy points captured by the 3D camera. Points p 1 and p 2 are very close to each other according to the Euclidean distance in R 3 . On the right: a 6-dimensional feature vector is used to represent each point. The three additional features correspond to the normal 3D-vector,Ñ kNNðpiÞ , calculated from the neighborhood, kNN(p i ), of each point, p i . In this way, p 1 and p 2 are no longer close and will probably belong to two different clusters, C 1 and C 2 , shown in blue and green, respectively. Full-size  DOI: 10.7717/peerj-cs.691/fig-2

Experimental setup
Synthetic data from a target cube were used for the experiments. Without loss of generality, this setup allows us to quantitatively compare the methods on a simple target model by having access to the ground truth (analytical expression of the true planes of the cube). In this cube-based scenario, the number of learned planar models was set to n = 3, and matrix A was based on 2-by-2 orthogonality constraints, as presented in Table 1. Synthetic data were generated by simulating a Microsoft Kinect sensor with Blensor (Gschwandtner et al., 2011). This tool allows us to create a cube and obtain images from different points of view by moving a virtual camera. The experiments included eight points of view of the cube (see Fig. 5), simulating a counter-clockwise self-rotation of the object to validate the algorithm in terms of different geometrical and clusters characteristics. The method was implemented with MATLAB 2019 and YALMIP, a toolbox for modeling and optimization (Löfberg, 2004). Note that different levels of Gaussian noise (mean μ = 0 and standard deviation σ from 1.00E−05 to 1.00E−04) were added to the synthetic data to evaluate the robustness of the methods (see Fig. 6).
Four performance criteria (see Fig. 7) were used to evaluate the methods. The first two aim at evaluating the global accuracy of the learned models with respect to the ground truth. The last two are used to assess the robustness of the methods in the presence of noisy data: Angle error: the root mean square (RMS) of the angles (in degrees) composed by the calculated fitted plane and the ground truth for each plane of the model. Model error: the mean of the differences between the angles (in degrees) among the learned planes that compose the model and the corresponding from the ground truth.  Distance error: the RMS of the Euclidean distance in R 6 between the points and their closest plane. This quantity provides an insight into the residuals of the linear regressions.
Cluster error: the percentage of points incorrectly clustered compared to the ground truth.

Experimental comparison
MC-LSE is compared to seven other methods as mentioned previously. To assess the impact of the orthogonal constraints, we employed an ordinary LSE and RANSAC Synthetic data for different levels of Gaussian noise: Performance criteria used to evaluate methods. The angle error (A) is calculated using the angles between the ground truth (solid line) and the estimated plane (dotted line). In this example, it is The model error (B) considers the angles among the planes that compose the object compared with those of the ground truth (solid line). In this example, for an object composed of two planes, it is ka À bk=1. Finally, the distance error (C) can be calculated as ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ffi Full-size  DOI: 10.7717/peerj-cs.691/fig-7 (Fischler & Bolles, 1981) as baseline methods. We also compared MC-LSE with state-ofthe-art methods such as multi-constraint RANSAC (MC-RANSAC) (Saval-Calvo et al., 2015a), RSPD (Araújo & Oliveira, 2020), Prior-MLESAC (Zhao et al., 2020), Prog-X (Barath et al., 2020), and Global-L0 (Lin et al., 2020), providing an extensive comparison of our method. For the sake of comparison, the methods that allow a previous clustering (LSE, MC-RANSAC and RANSAC) use the same pre-clusters as proposed in this paper. Hence, the RANSAC method used in the comparison could be considered as a variant of the related methods based on CC-RANSAC (Gallo, Manduchi & Rafii, 2011;Zhou et al., 2011;Zhou et al., 2013). Because these methods consider the consensus set, the largest connected component of inliers within the spatial neighborhood and within normal vector information coherence, the use of these clusters allows them to be pre-calculated. In other words, the RANSAC checking step to conform to the consensus set considers only the patches previously calculated by k-means that contain coherence in the spatial neighborhood as well as in the normal vector. The RSPD, Prior-MLESAC, Prog-X and Global-L0 methods do not allow the use of the initial pre-clustering because they perform a different pre-treatment that is part of the proposed method such as Prior-MLESAC (calculation of curvature characteristics, normals, etc.), a progressive sampling schema (Prog-X), global optimization approach considering the whole data (Global-L0), or a growth scheme (RSPD).
The results are reported in Table 2, based on which, we can make the following remarks. MC-LSE outperforms all the other methods in terms of the angle error and model error (being the most reliable performance parameters in terms of model fitting). It is worth noting that this is true regardless of the level of noise added to the data, and the advantage of MC-LSE is even higher with an increasing level of noise. Moreover, our constrainedoptimization problem allows us to automatically satisfy the orthogonality requirements (i.e. the model error is always equal to 0), whereas the other methods suffer from an increasing inability to fulfill the 90-degrees constraint. Specifically, the RSPD method suffers more due to its use of the bottom-up patch growth scheme that detects several planes in the same cluster with a large angle deviation. The two other criteria (distance error and cluster error) provide some insight into the capacity of the methods to fit models from relevant clusters, to reduce the residuals and to segment the input data. For the former, the two best performing methods are MC-RANSAC and MC-LSE, with very similar results for all noise levels, being the best on average. Except for LSE and RANSAC, all obtained results are very similar because the methods make use, among others, of the point distances to the estimated plane in their optimization processes to perform the fitting. In terms of the cluster error, the worst results are obtained by RSPD because it detects several planar patches per plane. LSE and RANSAC obtain the second-worst results because they only make use of partial data (pre-clusters) to obtain the planes and, consequently, have local minima. The methods that consider global constraints (MC-RANSAC, Global-LO, Prog-X and ours) obtain very similar results, with a variation of approximately 1% of errors in the obtained clusters. Finally, it is worth noting that the Prior-MLESAC (also considering the whole data) does not perform a complete assignment  of the points of the scene to a cluster, and the results are better than the others. Thus, although MC-LSE does not obtain the best results in terms of clusters, it is capable of achieving the best fit.
Our objective is to compare MC-LSE with other methods, in terms of not only its error, but also its computational cost. It is worth remembering that MC-LSE is based on an iterative process that is repeated until convergence. Formally, convergence is reached if no reassignment is performed between the two iterations. However, note that if only a few points are assigned to the wrong cluster, MC-LSE is not prevented from learning a good model. To evaluate this behavior, we performed three additional experiments consisting of stopping MC-LSE when no more than 3%, 5% and 10% of the points change between two iterations. Figure 8 shows a plot of the joint behavior of MC-LSE in terms of both the running time (in seconds) and angle error in the four settings: convergence until no (red ball), no more than 3% (cyan ball), no more than 5% (brown ball) and no more than 10% (blue point) of reassignment. We also report the results of MC-RANSAC (blue diamond), RANSAC variant (purple square), Global-L0 (orange right arrow), RSPD (purple left arrow), Prior-MLESAC (green triangle) and Prog-X (blue star). It is important to note that LSE, RANSAC, Prior-MLESAC, MC-RANSC and MC-LSE were implemented in MATLAB 2019, whereas Global-L0, RSPD and Prog-X were implemented in C++, as described by the authors of the papers. Therefore, the latter obtains results in a faster time than the MATLAB version. We can see that by relaxing the convergence constraint, MC-LSE is still very accurate with an execution time very close to the minimum provided by Global-L0, Prog-X and RANSAC (not counting LSE and RSPD as the fastest but worst results). The RANSAC variant is faster but much less accurate, whereas MC-RANSAC yields a small error, but is much more time-consuming. The implementations of Prog-X and Global-L0 (in C++) provide similar processing times with better performance for Prog-X. Although the median of Prog-X is similar to that of MC-LSE, the results are very scattered, ranging from less than 1 degree to more than 10 for certain acquisitions.

Specific analysis of MC-LSE
Once the comparative analysis is performed, in this section, we focus on the analysis of MC-LSE. First, the pre-clustering results obtained by the clustering-based unsupervised

step (see "Clustering
Step") are analyzed. Subsequently, the behavior of MC-LSE according to the number of iterations and processing time is analyzed. Next, view V5 is analyzed in detail because the results are the most different compared to the rest of the views. The main difference from the other views is that the data are highly imbalanced.

Pre-clustering results
The first step of MC-LSE is to obtain a set of pre-clusters in the scene. The calculation of this first step is critical in the final result because the entire process starts from it. Therefore, in this section, we analyze the results obtained for the four most representative clustering methods: Gaussian mixture model-based expectation and maximization (GMM-EM) (Ari & Aksoy, 2010), hierarchical clustering (HC) (Cai et al., 2014), selforganizing maps (SOM) (Mingoti & Lima, 2006), and k-means (Forgy, 1965). It should be noted that the clustering performed with SOM applies k-means to the map generated from the setP of normal vectors and 3D points (see "Clustering Step"). Table 3 lists the results of the entire MC-LSE for the evaluation criteria (see "Experimental Setup") with respect to various levels of Gaussian noise (from 1.E−05 to 1.E −04) for the eight tested different views, except for the model error. It is not calculated as the proposed method meets the geometrical constraints (see results in Table 2). Additionally, the cluster error is calculated for the pre-clusters resulted of the unsupervised methods (pre-cluster error in the Table 3).  First, the error of the clusters generated in the first step (pre-cluster error) and the result of the whole process (cluster error) are analyzed. It can be seen that the best method is GMM-EM, which optimizes the GMM distributions. This is the best result in all cases except for the lowest noise level (1.E−05). Because the noise used for testing is Gaussian noise, it can explain the very good results. However, in comparison to the final result of the process, it can be seen that all methods, on average, generate similar results, with k-means being the best. In addition, the GMM-EM algorithm provides the best results for noise levels of 2.E−05, 4.E−05, 8.E−05 and 1.E−04. As can be seen, the GMM-EM and HC methods show an excessively high deviation, resulting in very good results in some views and very bad results in others. Therefore, good pre-cluster results do not guarantee a good final result because it depends on the set of points in the pre-cluster that may be in different planes of the real model. This case will be analyzed later with View 5 in "Analysis of View V5: A Difficult Case".
Finally, according to the distance error variable for all noise levels, similar results were obtained (again, except for GMM-EM with high standard deviations). In any case, the angle error and model error could be the most reliable variables for defining the accuracy of the method. However, as previously mentioned, because the method meets the geometrical constraints, the model error is 0. However, angle error shows that both SOM and k-means provide the best results, with k-means being the one with the best average error (0.66 degrees). In all cases for k-means, the RMS error is less than 1 degree except for the maximum level of noise (1.E−04) and having small deviations, indicating its capability to obtain very accurate results regardless of the viewpoint.
To conclude the study of the pre-clustering results, the influence of the input data on the pre-clustering calculation step was also analyzed. As discussed in "Clustering Step", the clustering algorithm considers not only the original features, p i ¼ p x i ; p y i ; p z i À Á , but also the 3D-normal vector,Ñ kNNðp i Þ , of set kNN(p i ) of the nearest neighbors of p i to conform to the corresponding points,p i 2 R 6 , used as inputs of the clustering algorithm. In the Table 4, the results of using onlyÑ kNNðp i Þ (Normals), only original features p i or both,p i (Normals and points), according to the variables studied previously are shown. As can be seen, the use of normal vectors improves the performance of MC-LSE. In addition, the most reliable parameter angle error is less than 1 degree using normal vectors, and more than 14 degrees if only the original features p i are used. In this case, MC-LSE is not capable of extracting accurate plane models. From these data, it can be concluded that the use Table 4 Results (mean ± standard deviation) of the pre-clustering method considering all experimental data (for different various levels of Gaussian noise and viewpoints). The results in bold font indicate the best method.

Normals Normals & points Points
Pre-cluster error ( of normal vectors is very important in the clustering step and, consequently, in the results obtained by the method. Furthermore, in this case, although practically the same results are obtained, it is better to usep i as inputs to separate planes that have the same orientation but are in different positions of the scene. We explain this case in the room reconstruction of the case study described in "Room Reconstruction".

Convergence and processing time results
The method converges very quickly, as shown in Table 5 (mean and standard deviation) and Fig. 9. The behavior in terms of convergence of the method is similar for the different  levels of noise and views, except for view V5. It can converge in six or fewer iterations, even for high levels of noise (up to 7E−5). For the highest levels of noise (from 8E−5 to 1E−4) slightly increased, with a mean of approximately 6.5 iterations for this range. The worst case is view V8 for the 1E−4 level of noise, reaching up to nine iterations. Regarding view V5, it requires the largest number of iterations to converge for all levels of noise (except for level 2E−5), reaching the maximum for noises 6.0E−05 and 10.0E−05, requiring a total of 12 iterations. Hence, this view is analyzed in depth in the following subsection. Given only the number of iterations to converge, the variation in the quality of the solution in each iteration is not considered. Hence, it is interesting to analyze the number of iterations according to the angle error as the most precise accuracy parameter studied in the previous section. Figure 10 shows the behavior of the variable with respect to the number of iterations. It represents the mean of the quality variable for different views at a specific level of noise. The error drops significantly in the first iteration, with a performance similar to an exponential decrease. Moreover, the error does not decrease significantly after approximately 4-5 iterations for any level of noise.
It is important to remember that the method converges when the number of points that are reassigned for a cluster with respect to the previous iteration is lower than a certain threshold. In the experiments, the threshold is 0, which means that the method does not stop as long as changes in the assignment continue to occur. However, as shown in Fig. 11 for a noise level of 10.0E−05 (the level with most iterations), in the first four iterations, the percentage of points that change is less than 1% except for views V5 and V6. In iteration 5, view V6 decreases from 2.62% to 0.75%; and view V5 decreases from 3.58% to 2.08%. From iteration 6 onwards, in all cases, it is less than 1%. After iteration 8, the change in view V5 was 0.3. The fitting time and reassignment time are completely dependent on the iterations made by the method, and they are not affected by noise. Specifically, the former in the experiments is also not dependent on the number of points in the input data. The method can provide a model fitted into the input data within 0.2463 ± 0.0151 s per iteration. The latter is dependent on the number of planes in which the model to be fitted is composed of the number of input data.

Analysis of view V5: a difficult case
The behavior of the results for view V5 is different from that of the rest. The face P 2 in view of V5 (Fig. 5E) is highly imbalanced with respect to the other faces P 1 and P 3 . Moreover, the normal vector of P 2 is approximately orthogonal to the point of view of the camera. In this section, we focus on the results for the highest level of noise (Fig. 12).
First, the pre-clusters are calculated by the clustering-based unsupervised step (see "Clustering Step") could be analyzed in Fig. 12A. Although pre-clusters of planes P 1 (green) and P 3 (blue) are mainly distributed around the corresponding planes, the points preclustered as P 2 (red) are distributed on planes P 2 and P 3 . These results can be analyzed quantitatively in Table 6 (left). The points predicted as P 2 are 33.7% for the actual face, whereas 55% of the predictions are for points belonging to P 3 . In other words, the precluster is distributed in two faces P 2 and P 3 , with most of the points incorrectly classified.
For the improved initialization of the planes, landmarks are selected using 50% of the points for each cluster P i closest to their centroid. The results are shown in Fig. 12B and quantitatively analyzed in Table 6 (middle). The landmarks for planes P 1 and P 3 are well calculated, but the results for P 2 are even worse. The points predicted as P 2 are 33.3% for the actual face, whereas 63.9% of the predictions are for points belonging to P 3 . Finally, Fig. 12C shows the fitted model for the method after convergence and Table 6 (right) the confusion matrix. The model was fitted perfectly to the data. The errors for the clusters were mainly distributed at the intersection of the planes and at the edges of the faces. For example, 21% of the points predicted as P 2 for actual P 3 points are those at the edges. Because the number of points of the cluster P 2 is less than the others, for approximately 10% of the captured points, the relative error is higher. Cases of study: reconstructing scenes This section presents two case studies to qualitatively evaluate the performance of MC-LSE in realistic situations. Specifically, this experiment consists of registering a set of 3D point clouds using a state-of-the-art method, μ-MAR (Saval-Calvo et al., 2015b), which uses multiple 3D planar surfaces to find the transformations between them. To obtain an accurate model, the estimation of transformations to align views is critical. μ-MAR uses the model of the planes instead of the actual 3D data to reduce noise effects and, hence, improve registration. By using multiple non-coplanar planes, the method estimates the correspondences between views and calculates the transformation using the normals and centroids of the planes. The method uses the normals of both the fixed and moving sets of plane models to determine the rotation. Next, the translation is iteratively calculated by projecting the centroids of the moving set to the fixed set and minimizing the distances. For more details, refer to the original paper (Saval-Calvo et al., 2015b). Because the final registration result highly depends on the accuracy of the planar models, the better those models are, the more accurate the result will be. "Object Reconstruction" shows the reconstruction of two objects that are part of the original dataset of the μ-MAR. As in the original study, the evaluation was performed by visual inspection because there is no ground truth to compare with. Next, "Room Reconstruction" presents a reconstruction of a scene composed of multiple orthogonal planes (walls and ceiling), the models of which have been estimated using MC-LSE, and the μ-MAR has been used to align the views.

Object reconstruction
As previously mentioned, here, we show the reconstruction of two objects using a 3D planar marker registration method, where both MC-RANSAC and MC-LSE methods are used to estimate the planes of the markers (cubes) under orthogonal constraints. The planes extracted for the cubes were registered using the μ-MAR method. Finally, the transformations calculated for the cube are applied to the point clouds of the objects to reconstruct them.
The objects were a bomb and a taz toy. Figure 13 shows the two objects and the configuration with the 3D markers around them. The set-up includes a turntable that rotates using a stepper motor and an RGB-D camera that takes color and depth images in every step of the motor. The table was covered with a blue fabric to ease the segmentation of the objects and markers. After applying the μ-MAR registration method, we obtained an aligned point cloud. The result is shown in Fig. 14, where the markers were removed from the scene for a clearer interpretation. Although both methods provide good results, small improvements can be observed using MC-LSE. In the taz toy, the right leg showed better alignment. The bomb toy in Fig. 14B shows a more rounded shape. For better visualization, Fig. 15 shows the details of the reconstruction. The first row presents a zoomed view of the leg in the Taz toy, where the MC-LSE planes achieve better reconstruction than the planes of MC-RANSAC. The second row shows the bomb toy, where the object reconstructed by MC-LSE is more round, and the eyes are more defined (i.e. the views are better aligned). Figure 16 shows a bottom view where the markers (cubes) have not been removed to provide another reference to evaluate the accuracy of the alignment. In general, the markers are better aligned in MC-LSE, with the right cube a clear example with a more compact shape.

Room reconstruction
In the second part of the case study, we present an indoor scene reconstruction composed of multiple orthogonal planes (walls and ceiling) registered using the μ-MAR algorithm, to estimate planes extracted by the proposed method. This application is related to the SLAM problem in robot localization and indoor building reconstruction.
The point cloud was captured using factory calibration for the Kinect camera. The reconstruction is more challenging than in the previous case of study because it has to deal with the same problems as before, but with larger errors; the optical aberration is worse than the fact that most points are closer to the border of the image (Fig. 17B) and the planes are further (about 4 meters in some views) increasing the noise in the point cloud (Fig. 17C). Figure 18 shows the five acquired data viewpoints of the room. μ-MAR uses pairwise alignment; hence, every two consecutive frames have some planes in common. Overlapped on the images are the clusters corresponding to the planes. For this case study, the preclusters corresponding to planes have been manually segmented because it is not a core contribution of the study. Moreover, the constraints have been relaxed, not being strictly orthogonal due to the curves present in the planes as shown in Fig. 17B.
The final reconstruction is shown in Fig. 19. As can be seen, the final alignment is correct, and the geometrical constraints are preserved using the plane models estimated by MC-LSE. The capabilities and robustness of the proposed method for estimating planes even in the presence of high noise and optical aberration can be seen in the planes estimated in Fig. 20 from the frame in Fig. 18, which are far from the camera and close to the borders in the image; hence, the proposed method is very noisy and curved. In this figure, the three planes of the frame are shown along with the fitted model planes. The ceiling is the top part of the figure with the red model; on the right with turquoise is the frontal wall with shelves; and in green is the side wall beside the window. Note the high amount of noise and curved data in which accurate plane models are estimated by MC-LSE.

DISCUSSION AND CONCLUSIONS
In this paper, we propose MC-LSE, an iterative method to accurately reconstruct objects composed of planes from 3D point clouds captured by cameras. The fundamental aspect of this method is the use of geometric constraints given in the object by design. These constraints provide a robust solution for estimating the planes that are able to deal with noisy data and in the presence of a high number of outliers. The optimization process with the aim of obtaining the model is performed for all planes of the object at the same time. This allows a minimization of the error as a whole while fulfilling the geometric constraints, improving partial solutions for individually fitting planes that do not guarantee a minimum error for each plane or, of course, a fitting of the model under geometric constraints. To fit the planes to the noisy 3D data, first, MC-LSE provides an initial estimation of the regions of points captured by the 3D camera representing the planes of the model by means of an unsupervised clustering process. It is based on the k-means algorithm and aims to cluster at most the number of faces expected for the object. The input of the method is a point in R 6 as a result of projecting the input point cloud onto a 6-dimensional space formed by each point and its normal vector calculated from the neighborhood. Working on this space allows the method to discriminate better points close to the edges that belong to different faces. However, for very noisy data that consider the normal vector, the clustering process can lead to errors assigned to different faces, points that are close in location. Consequently, the method considers the specificity of the application, weighting the coordinates of location and the normal vector by coefficients.
Once the initial clusters are estimated, they are the input for the core linear regressionbased supervised step that learns the planes simultaneously, minimizing the residuals for each cluster under the geometric constraints. Although accurate for the clusters, the output of the regression is highly dependent on the clusters provided by the unsupervised step. Hence, the final proposed step consists of a re-assignment of all input points to the plane models calculated by regression. The re-assignment is performed by the closest plane using (again) the six-dimensional space composed of the point coordinates and the normal vector. This provides a minimization of the residuals of the previous step. In this way, the method iterates refining the solution until a threshold of points is not re-assigned. Experimental results show that the supervised regression and the reassignment step produce model results converging to a minimum in a few iterations. It allows the selection of the threshold again by considering the specificity of the application to refine the solution depending on the remaining time (i.e. real-time purposes).
The experiments in this study allow us to validate the method in the presence of different noise levels and points of view from an RGBD camera. Moreover, the comparison with LSE as baseline (and considered as part of many methods) and the other state-of-the art methods (RANSAC variant and MC-RANSAC) provides the big picture of the performance of the method. LSE and RSPD provide the best results according to processing time, which is too far from the obtained results for MC-LSE, but the error is extremely high for considerable noise levels. The use of this method as part of a more robust method should be considered. On the other hand, the other methods (Global-L0, CC-RANSAC, MC-RANSAC, Prog-X and Prior-MLSAC), and ours provide robust estimations of planar models based on inliers, but the computational cost is higher (according to the number of outliers, model parameters, etc.). The MC-RANSAC adds steps to consider the constraints among faces, allowing it to provide highly accurate results at the expense of dramatically increasing the computational cost. The proposed MC-LSE can provide the best results at a balanced time (comparable to the C++ implementation of Prog-X). Moreover, to validate in a real case study, a reconstruction based on a planarbased marker (cube) registration method using a Kinect sensor was presented.
Consequently, the importance and relevance of the proposal could be analyzed from the point of view of the computer vision problems that have to deal with planes as the core geometric model present everywhere in the hand-made 3D real world. As mentioned previously, they were designed with geometric constraints that could be exploited. In this experimental setup, the validation was performed for cubes whose visible faces were at most three orthogonal planes. This is a subset of problems in which the solution can be considered, but it is widely present in many applications. For example, the case study for 3D markers, but also for mapping in robotic purposes such as SLAM, in which different views of a corridor could be seen at most as three orthogonal planes (floor, wall and ceil).
The generalization of the proposal for many orthogonal planes is direct, and strict constraints could be incorporated without changing the proposal, as we analyzed in the scene reconstruction case study. This case is frequent for non-calibrated cameras, which are not able to provide orthogonal planes even though they exist because of the geometric distortions of the camera. Hence, it is recommended that intrinsic calibration be used to minimize the effects of optical distortions on the image.
In future research, we plan to model other geometrical constraints for planes that can be useful for other applications (for example, we plan to use other geometrical objects for calibration purposes of a multi-camera system). At the same time, we would like to introduce a random selection of the landmarks of the first iteration to not depend on the first selection of points. This could be approached from a RANSAC point of view but speeding up the solution based on the characteristics of the method. This is because it could provide the hypothesis of the planar model considering all planes while reducing the number of iterations that the method should perform. Finally, the clustering-based unsupervised step could be improved by incorporating other characteristics of the scene as the Prior-MLESAC does to reduce the iterations needed to obtain the model fitting and, consequently, the processing time.

ADDITIONAL INFORMATION AND DECLARATIONS Funding
This work was supported by the French ANR project LIVES (ANR-15-CE23-0026-03) and the Spanish State Research Agency (AEI) and the European Regional Development Fund (FEDER) under projects TIN2017-89069-R and PID2020-119144RB-I00. There was no additional external funding received for this study. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.