Single-Shot Cuboids: Geodesics-based End-to-end Manhattan Aligned Layout Estimation from Spherical Panoramas

It has been shown that global scene understanding tasks like layout estimation can benefit from wider fields of view, and specifically from spherical panoramas. While much progress has been made recently, all previous approaches rely on intermediate representations and postprocessing to produce Manhattan-aligned estimates. In this work we show how to estimate full room layouts in a single shot, eliminating the need for postprocessing. Our work is the first to directly infer Manhattan-aligned outputs. To achieve this, our data-driven model exploits direct coordinate regression and is supervised end-to-end. As a result, we can explicitly add quasi-Manhattan constraints, which set the necessary conditions for a homography-based Manhattan alignment module. Finally, we introduce geodesic heatmaps, a geodesic loss, and a boundary-aware center of mass calculation that facilitate higher quality keypoint estimation in the spherical domain. Our models and code are publicly available at https://vcl3d.github.io/SingleShotCuboids/.


Introduction
Modern hardware advances have commoditized spherical cameras¹, which have evolved beyond elaborate optics and camera clusters. Affordable handheld 360° cameras are finding widespread use in various applications, the more prominent ones being real estate, interior design, and virtual tours, with recently introduced datasets following the same trends. Realtor360 [60] contains panoramas acquired by a real-estate company, while Kujiale [28] and Structured3D [65] were rendered using a large corpus of computer-generated data from an interior design company. Further, datasets containing spherical panoramas like Matterport3D [3] and Stanford2D3D [1] were created using the Matterport camera, originally developed for virtual tours. This signifies the importance of spherical panoramas for indoor 3D capturing, as they are (re-)used in multiple 3D vision tasks [55,50,67].

¹ We will be using the adjectives spherical, omnidirectional and 360° for cameras and images interchangeably.

Figure 1: From a single indoor scene panorama input, we estimate a Manhattan-aligned cuboid of the room's layout, in a single shot. To achieve this, we rely on spherical coordinate localization using geodesic heatmaps. This explicit reasoning about the corner positions in the image allows for the integration of vertical alignment constraints that drive a differentiable homography-based cuboid fitting module.
Spherical panoramas capture the entire scene context within their field of view (FoV), an important trait for scene understanding. While humans can infer out-of-FoV information, the same cannot be said for machines, with view extrapolation methods [44] using spherical data to address this. Certain tasks like illumination or layout estimation implicitly extrapolate outside narrow FoVs. Neural Illumination [43] estimates a scene's lighting from a single perspective image, employing a perspective-to-spherical completion intermediate task within their end-to-end model. Estimating a scene's layout involves extrapolating structural information, and, thus, many works now resort to spherical panoramas to exploit their holistic structural and contextual information.

arXiv:2102.03939v2 [cs.CV] 9 Feb 2021
The seminal work of PanoContext [64] reconstructs an entire room into a 3D cuboid, fully exploiting the large FoV of omnidirectional panoramas. Its complex formulation and weak priors resulted in high computational complexity, requiring several minutes for each panorama. While modern deep priors produce higher quality results [68,60], increasing the accuracy of their predictions and ensuring Manhattan-aligned layouts requires postprocessing and hurts runtime efficiency.
Spherical panoramas necessitate higher resolution processing, and therefore, increased computational complexity, as evidenced by recent data-driven layout estimation models [68,60,47]. More efficient alternatives [15] produce irregular (i.e. non-Manhattan) outputs, require parameter sensitive postprocessing, and increase efficiency by lowering spatial resolution, which comes at the cost of accuracy. Moreover, data-driven spherical vision needs to address the distortion of the projective omnidirectional data formats. But distortion mitigating convolutions add a significant computational overhead as reported in [15] and [9].
In this work, we present a single-shot spherical layout estimation model. As presented in Figure 1, we employ spherical-aware corner coordinate estimation and thus, add explicit constraints that facilitate vertically aligned corners. Capitalizing on this, we further integrate full Manhattan alignment directly into the model, allowing for end-to-end training, lifting the postprocessing requirement.

Layout Estimation
While an excellent review regarding the 3D reconstruction of structured indoor environments exists [40], our discussion will provide the necessary details for positioning our work. We focus on monocular layout estimation and thus refrain from discussing works using multiple panoramas [39,41,37,38], interaction [30], or other types of cameras [27,29].
PanoContext [64] showcased the expressiveness of 360° panoramas in terms of structural and contextual information. Prior to the maturation of deep data-driven methods, PanoContext relied on edge and line detection, the Hough transform, and deformable part models to generate different room layout hypotheses. Similarly, low-level line segments were used in an energy minimization formulation to estimate a scene's structural planes [17]. In Panoramix [59], the line features were supplemented by superpixel facets, and embedded as vertices in a graph for a constrained least squares problem.
Hybrid data-driven methods [16] used structural edge detection to improve the performance and runtime of [64] when using fewer hypotheses. Pano2CAD [58] used a probabilistic formulation that relied on CNN object recognition and detection. It generated a synthetic scene reconstruction but required several minutes of processing. Its computational overhead largely comes from the fusion of narrow FoV predictions from perspective 360° crops. This is common to all aforementioned methods relying on line segments, and to [61], which runs various CNNs on all narrow FoV sub-views before merging them in 360°.
PanoRoom [14] and LayoutNet [68] were the first models to be trained on spherical panoramas. They both modelled layout corner and structural edge estimation as a spatial probabilistic inference task. While it is possible to extract the layout's corners by relying on heuristically or empirically parameterized peak detection, these estimations will most likely not deliver Manhattan-aligned outputs. Consequently, joint optimization is performed using both sources of information to recover the final layout corner estimates. LayoutNet requires several seconds to infer and optimize the layout on a CPU, but PanoRoom is much faster as it uses a greedy RANSAC approach.
DuLa-Net [60] employs a novel approach for 360° layout estimation. The main insight is that spherical images can be projected in multiple ways, and different projections highlight different cues. Specifically, DuLa-Net uses a 'ceiling-view' that offers a more informative viewpoint with respect to the floor-plan, which is a projection of a Manhattan 3D layout. It performs feature fusion across both the equirectangular and ceiling-view branches, using a height prediction to estimate the final 3D layout. HorizonNet [47] is another novel take on omnidirectional layout estimation. Instead of image-localised predictions, it encodes the boundaries and intersections in one-dimensional vectors, which are then used to reconstruct the scene's corners. This allows HorizonNet to exploit the expressiveness of recurrent models (LSTM [22]) to offer globally coherent predictions. After a postprocessing step involving peak detection and height optimization, the final Manhattan-aligned layout is computed. A recent thorough comparison between LayoutNet, DuLa-Net and HorizonNet was presented in [69]. Unified encoding models and training scripts were used to fairly evaluate these approaches. Their findings indicate that the PanoStretch data augmentation proposed in [47], as well as its heavier encoder backbone, lead to improved performance for the other models as well. The Corners-for-Layout (CFL) [15] model is currently the most efficient approach for 360° layout estimation in terms of runtime, but at the expense of accuracy and Manhattan alignment. While an end-to-end model is discussed, an empirically or heuristically parameterized postprocessing image peak detection step is still required.
Compared to these approaches, our model is end-to-end trainable, producing Manhattan-aligned corners in a single shot. We approach the layout estimation task as a keypoint localization one and use an efficiently designed spherical model.

Learning on the Sphere
There are multiple representations for spherical images with the more straightforward being the cube-map. Traditional CNN models can be applied to the cube faces [33], and then warped back to the sphere. This was used in [64] and [59] to detect lines on each cube's faces [53], while [58] and [61] used CNN inference on each face. Still, cubemaps suffer from distortion as well, and additionally require face-specific padding [4] to deal with the faces' discontinuities. Yet, to capture the global context these approaches need to expand their receptive field to connect all faces continuously, which leads to inefficient models.
A novel line of research pursues model adaptation from the perspective domain to the equirectangular one [45]. The follow-up work, Kernel Transformer Networks [46], adapts traditional kernels to the spherical domain in a learned manner, also discussing two important aspects. First, the accuracy-resolution trade-off for spherical images, which necessitates the use of higher resolutions. Indeed, most aforementioned data-driven layout estimation methods for 360° images operate on 1024×512 images, which are unusually large for CNNs. Only [15] is the exception to this rule, which further supports this point, taking into account its reduced performance. The second point of discussion is related to the effect that non-linearities have when combined with kernel projection methods like [6] and [51]. It is shown that the assumption that needs to hold for no error to accumulate when using kernel projection only holds for the first layers of the network, and as it deepens, the accumulated error grows even larger. Still, [15] shows that their EquiConvs offer more robust predictions. A generalization of this concept, Mapped Convolutions [9], decouples the sampling operation from the filtering one, and demonstrates increased performance in dense estimation tasks. Still, runtime performance is greatly reduced, as reported in both [15] and [9]. This is also the main drawback of frequency-based spherical convolutions, as presented in the concurrent works of [5] and [11]. They are also highly inefficient in terms of memory, allowing for training and inference on very low resolution images only. DeepSphere [8] and [25] present another approach to handle distortion and discontinuity by leveraging graph convolutions and lifting the sphere representation to a graph. Nonetheless, this requires a graph generation step and loses efficiency compared to traditional convolutions, whose implementations are highly optimized to exploit the memory regularity of image representations.
The most efficient way to handle the discontinuity is circular padding [54,47,7], which is partly our approach as well, taking into account the inefficiency of distorted kernels. It should also be noted that model adaptation methods would not transfer well to the layout estimation task. While an object detection task parses a scene in a local manner, layout estimation requires reasoning about the global context, with perspective methods typically needing to extrapolate the scene's structure. However, as first shown by PanoContext [64], the availability of the entire scene is much more informative, and this would hinder the applicability of transferring models like RoomNet [27] to the 360° domain using such techniques [45,46].

Coordinate Regression
Regressing coordinates in an image has been shown to be an intriguingly challenging problem [31]. The proposed solution was to offer the coordinate information explicitly. Yet, most keypoint estimation works in the literature initially used fully connected layers to regress coordinates. The counter-intuition is that convolutions are inherently spatial, and should be better behaved in spatial prediction tasks. This is how data-driven layout estimation models have addressed this problem up to now ([68], [15]): transforming coordinates into spatial configurations, using smoothing kernels to approximate coordinates, and leveraging dense supervision. Keypoint localisation tasks with semantically inter-correlated structures typically use one heatmap per keypoint. However, an issue that has recently received attention [62] is the way the final coordinate is estimated from each dense prediction. Indeed, the spatial maxima might not always best approximate the coordinate, and thus, heuristic approaches have persisted. Specifically for layout estimation, where the corners are predicted on the same map, manually set peak detection thresholds are used.
The overlapping works of [32], [48] and [35] derive smooth operations to reduce a heatmap to a single coordinate. Using the coordinate grid and a spatial softmax function, they smoothly, and differentiably, transform a spatial probabilistic representation into a single location. As shown in [52], all the above operations treat pixels as particles with masses, and estimate their center of mass.
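The smooth heatmap-to-coordinate reduction described above is commonly implemented as a soft-argmax. Below is a minimal NumPy sketch of this center-of-mass readout, given purely as illustration (not any particular paper's implementation):

```python
import numpy as np

def soft_argmax_2d(heatmap):
    """Differentiable 'center of mass' readout of a single heatmap.

    Treats every pixel as a particle whose mass is given by a spatial
    softmax over the predicted logits, then returns the mass-weighted
    mean coordinate: a smooth alternative to a hard argmax.
    """
    h, w = heatmap.shape
    # Spatial softmax turns the logits into a probability mass per pixel.
    z = np.exp(heatmap - heatmap.max())
    mass = z / z.sum()
    ys, xs = np.mgrid[0:h, 0:w]
    # Mass-weighted mean of the coordinate grid is the center of mass.
    return float((mass * xs).sum()), float((mass * ys).sum())
```

Because the readout is a weighted mean, it yields sub-pixel coordinates, the same property that later allows operating at a reduced working resolution.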

Single-Shot Cuboids
Unlike previous works, we approach layout estimation as a keypoint localisation task, alleviating the need for postprocessing while simultaneously ensuring Manhattan-aligned outputs. Section 3.1 formulates our coordinate regression objective and its adaptation to the spherical domain, Section 3.2 introduces the geodesic heatmaps and loss function, and Section 3.3 provides insights into our model's design and the techniques to achieve end-to-end Manhattan alignment.

Spherical Center of Mass
The center of mass (CoM) c_P for a collection of particles P : {p_0, . . . , p_N} ∈ R^3 is defined as:

c_P = (1/M) Σ_i m_i p_i, with M = Σ_i m_i,    (1)

with m_i being the mass of particle p_i and M the system's total mass. The CoM c_P represents a concentration of the particle system's mass and does not necessarily lie on an existing particle. This way, when considering a sparse keypoint estimation task in a structured grid, we can reformulate it as a dense prediction task by instead inferring the mass of each grid point. Using Eq. (1) we can directly supervise it with the keypoint coordinates, instead of relying on a surrogate objective as commonly done in pose estimation [62] or facial landmark detection [13]. For spherical layout estimation, the set of particles P, for which we seek to individually estimate each particle's mass, lies on a sphere. Each layout corner is considered as the CoM of a distinct particle system defined on the sphere. Each particle p = (φ, θ) on the sphere is represented by its longitude φ and latitude θ. While there are ways for learning directly on the 2-sphere S² manifold, as explained in Section 2.2, they are very inefficient. Consequently, we consider the equirectangular projection of the sphere, which preserves the angular parameterization of each particle. The equirectangular projection is an equidistant planar projection of the sphere, where the pixels in the image domain map linearly to the angular coordinates². Nevertheless, this format necessitates a different approach to overcome its weaknesses, namely, image boundary discontinuity and planar projection distortion.
The discontinuity arises at the horizontal panorama boundary, where the particles, even though at the opposite sides of the image, are actually neighboring on the sphere. For traditional images, the (normalized) grid coordinates are typically defined in [0, 1] or [−1, 1], and thus, the particles at the boundary would be maximally distant. However, for spherical panoramas, the longitudinal coordinate φ is periodic and wraps around, with the particles at the boundaries being proximal (i.e. minimally distant). To address this, we split the CoM calculation for the longitude and latitude coordinates, and adapt the former to consider each point as lying on a circle. Therefore, for each panorama row, which represents a circle of (equal) latitude, we define new particles r ∈ R with

r = (cos φ, sin φ),    (2)

which lie on a unit circle. We can then calculate the CoM c_R:

c_R = (1/M) Σ_i m_i r_i.    (3)

This estimates, exactly and continuously, the CoM of the circle. To map this back to the original domain, we extract the angle φ̂:

φ̂ = atan2(c_R,y, c_R,x),    (4)

which represents the longitudinal CoM across the discontinuity. Figure 2 shows a toy example of CoM calculations along two circles of latitude on the sphere, with the erroneous estimates acquired on the equirectangular projection and the correct ones when considering the boundary.

Although the equirectangular projection maps circles of latitude (longitude) to horizontal (vertical) lines of constant spacing, the same does not apply for its sampling density. Indeed, while it samples the sphere with a constant density vertically, it stretches each circle of latitude to fit the same constant horizontal line. Thus, its sphere sampling density is not uniform across planar pixel locations. The sampling density is 1/sin θ [49] and it approaches infinity near the pole singularities. When calculating the CoM in the equirectangular domain, we need to compensate for it by reweighting the contribution of each pixel p by σ(p) = sin θ [66].

² We transition between these terms flexibly given their linear mapping.
Essentially, given a dense mass prediction M(p), p ∈ A, we calculate the spherical CoM by first estimating a three-dimensional coordinate c_a:

c_a = (1/M_σ) Σ_{p∈A} σ(p) M(p) a(p), with M_σ = Σ_{p∈A} σ(p) M(p),    (5)

with a(p) = (r(φ), θ) = (cos φ, sin φ, θ), and then drop it back to two dimensions, extracting the angle φ̂ as above, to calculate the final CoM.

Figure caption (excerpt): In addition, the geodesic distances between the red square and the colorized diamond coordinates are also presented on the same image. The geodesic distance similarly respects the boundary and distortion of the equirectangular projection, as seen by the great circles drawn on the image that correspond to each pair's angular distance.
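As a concrete illustration of the two adaptations above (circular longitude averaging and sin θ density compensation), here is a minimal NumPy sketch of a spherical CoM on an equirectangular mass map; it assumes a colatitude convention θ ∈ [0, π] so that the density weight sin θ is non-negative:

```python
import numpy as np

def spherical_com(mass):
    """Spherical center of mass of an equirectangular (H, W) mass map.

    Longitude is periodic, so it is averaged on the unit circle via
    (cos phi, sin phi) embeddings and read back with atan2; every pixel
    is reweighted by sin(theta) to compensate for the equirectangular
    oversampling near the poles. Returns (phi, theta), phi in [-pi, pi].
    """
    h, w = mass.shape
    phi = (np.arange(w) + 0.5) / w * 2.0 * np.pi - np.pi  # longitude per column
    theta = (np.arange(h) + 0.5) / h * np.pi              # colatitude per row
    # Sampling-density compensation: sigma(p) = sin(theta).
    m = mass * np.sin(theta)[:, None]
    m = m / m.sum()
    # Circular mean of longitude through the unit-circle embedding.
    x = (m * np.cos(phi)[None, :]).sum()
    y = (m * np.sin(phi)[None, :]).sum()
    com_phi = np.arctan2(y, x)
    # Latitude is not periodic, so a plain weighted mean suffices.
    com_theta = (m.sum(axis=1) * theta).sum()
    return float(com_phi), float(com_theta)
```

Placing equal mass on the two horizontally opposite border columns correctly yields a longitude at the wrap-around boundary (±π), instead of the image center that a flat CoM would report.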

Geodesic Heatmaps
Accordingly, predicting the sparse coordinates of a corner comes down to predicting the dense mass map M, or otherwise heatmap, which is the terminology we will be using hereafter. Previous approaches complemented the sparse objective with a dense regularisation term [35]. The reason was that CoM regression is not constrained in any way as to the shape of its dense prediction. This was addressed by adding a distribution loss over the predicted heatmap and a Gaussian centered at the groundtruth coordinate.
Yet while extracting the CoM, as presented in Section 3.1, takes the spherical domain into account, traditional (flat) Gaussian heatmaps do not. A spatial normal distribution N (c, s) centered around a coordinate c = (u, v), using a standard deviation s = (s x , s y ) would consider the equirectangular image as a flat one, with a discontinuous boundary and no distortion.
To overcome this, we construct geodesic heatmaps, which are reconstructed directly on the equirectangular domain using a shifted angular coordinate grid A_s³ defined on the panorama:

G(p) = exp(−g(p, c_m)² / (2α²)), p ∈ A_s,    (6)

where α is the angular standard deviation around the distribution's center c_m, and g(·) is the geodesic distance:

g(p_1, p_2) = 2 arcsin( √( sin²(∆θ/2) + cos θ_1 cos θ_2 sin²(∆φ/2) ) ),    (7)

where ∆φ = φ_1 − φ_2 and ∆θ = θ_1 − θ_2. As illustrated in Figure 3, using the geodesic distance between two angular coordinates on the equirectangular panorama, we reconstruct geodesic heatmaps that simultaneously take into account both the continuous boundary and the projection's distortion.
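A geodesic heatmap can be reconstructed on the equirectangular grid with the haversine form of the great-circle distance. The sketch below assumes a latitude convention θ ∈ [−π/2, π/2] and a squared-exponential falloff:

```python
import numpy as np

def geodesic_heatmap(center, h, w, alpha):
    """Geodesic Gaussian on an (h, w) equirectangular grid.

    The value at every pixel is exp(-g^2 / (2 alpha^2)), with g the
    great-circle (haversine) distance to the center, so the blob wraps
    across the horizontal image boundary and stretches towards the
    poles exactly as the projection's distortion dictates.
    """
    phi = (np.arange(w) + 0.5) / w * 2.0 * np.pi - np.pi    # longitude
    theta = (np.arange(h) + 0.5) / h * np.pi - np.pi / 2.0  # latitude
    phi, theta = np.meshgrid(phi, theta)
    phi0, theta0 = center
    dphi, dtheta = phi - phi0, theta - theta0
    # Haversine great-circle distance on the unit sphere.
    g = 2.0 * np.arcsin(np.sqrt(
        np.sin(dtheta / 2.0) ** 2
        + np.cos(theta) * np.cos(theta0) * np.sin(dphi / 2.0) ** 2))
    return np.exp(-g ** 2 / (2.0 * alpha ** 2))
```

A center placed just inside the left border produces high responses at the right border as well, since the two are geodesically adjacent.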

End-to-end Manhattan Model
Our model infers a set of heatmaps M j , one for each layout corner j ∈ [1, J] (or junction, given that 3 planes intersect), with J = 8 for cuboid layouts. It operates in a single-shot manner, as these predictions are directly mapped into layout corners c j m . Apart from removing the post-processing step, another advantage of our single-shot approach is the sub-pixel level accuracy that it allows for, as the CoM of the particles is not necessarily one of the particles themselves. This translates to a reduction of the input and working resolution of the model.
We choose a light-weight stacked hourglass (SH) architecture [34]. It is designed for multi-scale feature extraction and merging, which enables the effective capturing of spatial context. It suits spherical layout estimation very well, as this is a global scene understanding task that benefits from spatial context aggregation, achieved by lowering the spatial dimension of the features. Still, it also requires precise localisation of specific keypoints, which needs higher spatial fidelity (i.e. resolution) predictions.

Stacked Hourglass Model Adaptation
We made several modifications to the original SH model, stemming mainly from recent advances in the field. While we preserve the original residual block [20] in the feature preprocessing block, we replace the hourglass residual blocks with preactivated ones [21]. Essentially, this adds direct identity mappings between the stack of hourglasses, allowing for immediate information propagation from the output to the earlier hourglass modules. We also use antialiased max-pooling [63], which preserves shift equivariance and leads to smoother activations across downsampled layers. Finally, unlike some state-of-the-art spherical layout estimation methods [68,60,69], we address feature map discontinuity by using spherical padding. For the horizontal image direction, we apply circular padding, as also done in [54] and [47], and for the vertical one, at the pole singularities, we resort to replication padding.

Figure 4: The predicted geodesic heatmaps get transformed directly to panoramic layout coordinates through a spherical CoM module. Since we regress coordinates, we explicitly enforce quasi-Manhattan alignment. This sets the ground for a homography-based cuboid alignment head that ensures the Manhattan alignment of our estimates. The symbol denotes a global multiply-accumulate operation, reducing the predicted dense representation to a set of sparse coordinates. Color-graded spheres indicate coordinate-based distance from the origin.
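The spherical padding scheme can be illustrated on a raw array. This is a minimal NumPy sketch; in the actual model the same idea is applied to feature maps before each convolution:

```python
import numpy as np

def spherical_pad(x, pad):
    """Sphere-aware padding of an equirectangular (H, W) map.

    Horizontally the panorama wraps around, so circular padding copies
    the opposite border columns; vertically the poles are singular, so
    replication padding repeats the first and last rows.
    """
    # Circular padding left/right.
    x = np.concatenate([x[:, -pad:], x, x[:, :pad]], axis=1)
    # Replication padding top/bottom.
    top = np.repeat(x[:1], pad, axis=0)
    bottom = np.repeat(x[-1:], pad, axis=0)
    return np.concatenate([top, x, bottom], axis=0)
```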

Quasi-Manhattan Alignment
Since we are directly regressing coordinates, we can explicitly ensure quasi-Manhattan alignment during training and inference alike. Previous approaches either use postprocessing to ensure the Manhattan alignment of their predictions [68,60,47], or simply forego it and produce non-Manhattan outputs [15]. While this relaxation is sometimes presented as an advantage, most man-made environments are Manhattan-aligned, with walls being orthogonal to ceilings and floors, and therefore, the corners of the same wall edge are vertically aligned. For each wall-to-ceiling junction there exists a wall-to-floor junction, effectively splitting our heatmaps into two groups, the top M_t^j and bottom M_b^j heatmaps (i.e. ceiling and floor junctions respectively). We enforce quasi-Manhattan alignment by averaging the longitudinal coordinates of each wall's vertical edge, guaranteeing a consistent longitudinal coordinate for both the top and bottom junction.
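The longitude-averaging constraint amounts to a few lines of code. A simplified sketch follows (for clarity it uses a plain mean; a circular mean via sin/cos, as in the spherical CoM, would be needed for wall edges that straddle the image boundary):

```python
import numpy as np

def quasi_manhattan_align(top, bottom):
    """Tie each wall's ceiling and floor corners to one longitude.

    top, bottom: (J/2, 2) arrays of (phi, theta) estimates for the
    wall-to-ceiling and wall-to-floor junctions of the same walls.
    Averaging the two longitudes per wall guarantees vertically
    aligned wall edges (quasi-Manhattan alignment).
    """
    phi = (top[:, 0] + bottom[:, 0]) / 2.0
    top = np.stack([phi, top[:, 1]], axis=1)
    bottom = np.stack([phi, bottom[:, 1]], axis=1)
    return top, bottom
```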

Homography-based Full Manhattan Alignment
This quasi-Manhattan alignment ensures that wall edges are perpendicular to the floor, but does not enforce the orthogonality of adjacent walls. To achieve this, we introduce a differentiable operation that transforms the predicted corners so as to ensure the orthogonality between adjacent walls. While the estimated corners are up-to-scale, with a single center-to-floor/ceiling measurement/assumption we can extract metric 3D coordinates for each corner as in [64]⁴, by fixing the ceiling/floor vertical distance to the corresponding average height.
We extract the f = (x, y) horizontal coordinates, corresponding to an orthographic floor view projection, which comprise a general trapezoid. This is transformed to a unit square by estimating the projective transformation H (planar homography) mapping the former to the latter [18]. Using the trapezoid's edge norms ‖v‖₂, with v = f_{j+1} − f_j, we calculate the average opposite edge distances and use them to scale the unit square to a rectangle, after translating it so that their centroids align. Then, we rotate and translate the rectangle to align with the original trapezoid using orthogonal Procrustes analysis [42]. Finally, the rectangle gets lifted to a cuboid using the vertical (z) ceiling and floor coordinates. The resulting cuboid vertices can be transformed back to angular coordinates for loss computation, with the overall process presented in Figure 5. We use this cuboid alignment transform C as the final block of our model to ensure full Manhattan alignment in an end-to-end manner.
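The rectangle-fitting portion of this head can be sketched compactly. The snippet below is a simplified NumPy illustration of the averaged-opposite-edge scaling and the orthogonal Procrustes rotation; the homography estimation, reflection handling, and the lifting to a cuboid are omitted:

```python
import numpy as np

def fit_rectangle(trapezoid):
    """Fit an orthogonal rectangle to a (4, 2) floor-view trapezoid.

    Side lengths come from the averaged opposite edge lengths, the
    rotation from an orthogonal Procrustes fit (via SVD), and the
    translation from the trapezoid's centroid. Corners must be given
    in consecutive order.
    """
    v = np.roll(trapezoid, -1, axis=0) - trapezoid      # edge vectors
    lengths = np.linalg.norm(v, axis=1)
    w = (lengths[0] + lengths[2]) / 2.0                 # avg opposite edges
    h = (lengths[1] + lengths[3]) / 2.0
    # Axis-aligned, origin-centered rectangle with the averaged sides.
    rect = np.array([[-w, -h], [w, -h], [w, h], [-w, h]]) / 2.0
    centered = trapezoid - trapezoid.mean(axis=0)
    # Orthogonal Procrustes: rotation R minimizing ||rect @ R - centered||.
    u, _, vt = np.linalg.svd(rect.T @ centered)
    r = u @ vt
    return rect @ r + trapezoid.mean(axis=0)
```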
We supervise the junction angular coordinates using the geodesic distance of Eq. (7):

L_G = Σ_j g(c_m^j, ĉ_m^j),    (8)

with c_m^j and ĉ_m^j being the groundtruth and predicted coordinates. The geodesic distance smoothly handles the continuous boundary and provides a more appropriate distance metric on the sphere than the equirectangular projection. We additionally supervise the spatially normalized heatmaps H^j = spatial_softmax(M^j) predicted by our model with the Kullback-Leibler divergence:

L_D = Σ_j KL( G̃(c_m^j) ‖ H^j ),    (9)

where G̃(·) is the spatially normalized geodesic heatmap G(·). Apart from regularizing the predicted heatmaps, this loss allows for stable end-to-end training with the cuboid alignment transform, as pure coordinate supervision destabilized the model during early training, preventing convergence as a consequence of the double solve required in the homography and Procrustes analysis. Our final loss is defined as:

L = Σ_{n=1}^{N} ( λ_G L_G + λ_D L_D ),    (10)

with λ_G and λ_D being weighting factors between the geodesic distance and KL losses, applied to each of the N hourglass predictions.
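For a single hourglass output, the combined objective can be condensed as follows; this sketch assumes the heatmaps are already spatially normalized and uses the haversine form of the geodesic distance:

```python
import numpy as np

def geodesic_dist(p1, p2):
    """Great-circle (haversine) distance between (phi, theta) points."""
    dphi, dtheta = p1[0] - p2[0], p1[1] - p2[1]
    return 2.0 * np.arcsin(np.sqrt(
        np.sin(dtheta / 2.0) ** 2
        + np.cos(p1[1]) * np.cos(p2[1]) * np.sin(dphi / 2.0) ** 2))

def layout_loss(pred_corners, gt_corners, pred_hm, gt_hm,
                lam_g=1.0, lam_d=0.15):
    """Geodesic coordinate loss on the regressed corners plus a KL
    divergence between the normalized groundtruth (geodesic) and
    predicted heatmaps, weighted as in the text."""
    l_g = sum(geodesic_dist(p, g) for p, g in zip(pred_corners, gt_corners))
    eps = 1e-12  # numerical stability for the logs
    l_d = float((gt_hm * (np.log(gt_hm + eps) - np.log(pred_hm + eps))).sum())
    return lam_g * l_g + lam_d * l_d
```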
The higher-level SH architecture allows for global processing without relying on heavy bottlenecks [68], computationally expensive feature fusion [60], or recurrent models [47]. It also requires no post-processing, as it can produce a Manhattan-aligned layout in a single shot with high accuracy, albeit operating at lower than typical resolutions.

Implementation Details
The input to our model is a single upright⁵, i.e. horizontal floor, 512 × 256 spherical panorama. We use 128 features for each hourglass's residual block, with a 128 × 64 heatmap resolution, and initialize our SH model using [19]. We use the Adam [26] optimizer with a learning rate of 0.002 and default values for the other parameters, no weight decay, and a batch size of 8; each parameter update accumulates the gradients of 16 samples. Further, after an empirical greedy search, we use a fixed α = 2° and s = (3.5, 3.5) for our geodesic and isotropic Gaussian distribution reconstructions respectively, which are created using the encoding of [62], and set the loss weights to λ_G = 1.0 and λ_D = 0.15. For cuboid alignment we use the joint approach and a floor distance of −1.6m. We implement our models using PyTorch [36,12], setting the same seed for all random number generators.

Figure 5: Starting from quasi-Manhattan corner estimates, these get first deprojected (K⁻¹) to 3D coordinates. Then, keeping only the horizontal coordinates (F), we get a floor view trapezoid, which, depending on the measurement and coordinates (floor/ceiling) our projection operated on, is slightly different (cyan for the ceiling, and blue for the floor). Using these floor view horizontal coordinates, we estimate a homography H to transform them to an axis-aligned unit square. This gets translated and scaled (S) using the average opposite edge lengths and centroid of the original untransformed floor view coordinates. An orthogonal Procrustes analysis (O) is used to align the rectangle to the trapezoid, which then gets lifted to a cuboid (Q) using the original heights, taking into account the quasi-Manhattan alignment of our estimates. The cuboid's 3D coordinates then get projected (K) back to equirectangular domain corners. Apart from the ceiling and floor starting corners, we also consider a joint approach where the horizontal floor view coordinates get averaged from both 3D estimates, before proceeding to estimate the homography. For this approach to work, we rescale the ceiling coordinates so that their camera-to-floor distances align, therefore removing any scale difference from the camera's position deviation from the true center.
We apply heavy data augmentation during training, as established in prior work [69,47,15]. Apart from photometric augmentations (random brightness, contrast, and gamma [2]), following [15], we further apply random erasing, with a uniform random selection between 1 and 3 blocks erased per sample. We also probabilistically apply a set of 360° panorama-specific augmentations in a cascaded manner: i) uniformly random horizontal rotations spanning the full angle range, ii) left-right flipping, and iii) PanoStretch augmentations [47] using the default stretching ratio ranges. All augmentation probabilities are set to 50%.
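The panorama-specific horizontal rotation reduces to a circular column shift. A minimal sketch follows (the function name and the pixel-coordinate corner format are illustrative):

```python
import numpy as np

def random_horizontal_rotation(pano, corners, rng):
    """Uniformly random horizontal rotation of a 360-degree panorama.

    On the equirectangular image this is just a circular column shift
    (np.roll); the corner longitudes are shifted and wrapped to match.
    corners: (J, 2) pixel coordinates (u, v).
    """
    h, w = pano.shape[:2]
    shift = int(rng.integers(0, w))
    pano = np.roll(pano, shift, axis=1)
    corners = corners.copy()
    corners[:, 0] = (corners[:, 0] + shift) % w  # wrap the longitudes
    return pano, corners
```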

Datasets
Prior work up to now has experimented with small scale datasets. PanoContext [64] manually annotated a total of 547 panoramas from the Sun360 dataset [57] as cuboids. Additionally, LayoutNet manually annotated 552 panoramas from the Stanford2D3D dataset [1], which are not complete spherical images as their vertical FoV is narrower. Similar to previous works, we use the common train, test and validation splits as used in [15] and [68] for the PanoContext and Stanford2D3D datasets respectively. Taking into account their small scale, we jointly consider them as a single real dataset and train all our models for 150 epochs.
More recently, layout annotations have been provided in newer computer-generated datasets, the Kujiale dataset used in [28] and the Structured3D dataset [65], totaling 3550 and 21835 annotated images respectively. Albeit synthetic, they offer a much larger data corpus than what is currently available for real datasets. Given their synthetic nature, these datasets offer different room styles for the same scene. In particular, they provide empty rooms as well as rooms filled with furniture by interior designers. For the Kujiale dataset we use both types of scenes, while for Structured3D we only use full scenes, and follow their respective official dataset splits. Our models are trained for 30 epochs on Structured3D and 125 epochs on Kujiale.

Metrics
For the quantitative assessment of our approach against prior works we use a set of standard metrics found in the literature [69], complemented by another set of accuracy metrics. The standard metrics include 2D and 3D intersection over union (IoU2D and IoU3D), normalized corner error (CE), pixel error (PE), and the depth-based RMSE and δ 1 accuracy [10]. For all 3D calculations a fixed floor distance at −1.6m is used. We also use junction (J d ) and wireframe (W d ) accuracy metrics, defined as correct when the closest groundtruth junction or line segment respectively is within a pixel threshold d. More specifically, we use the thresholds d = [5,10,15]. Finally, since we regress sub-pixel coordinates, all metric calculations are evaluated on a 1024 × 512 panorama resolution, and the arrows next to each metric denote the direction of better performance.
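A minimal sketch of how the junction accuracy J_d could be computed, assuming each predicted corner is scored against its nearest groundtruth junction (the wireframe variant W_d would use point-to-segment distances instead):

```python
import numpy as np

def junction_accuracy(pred, gt, d):
    """Fraction of predicted junctions whose nearest groundtruth
    junction lies within d pixels, evaluated on the target resolution
    (e.g. 1024 x 512). pred, gt: (J, 2) pixel coordinates.
    """
    # Pairwise distances between every predicted and groundtruth corner.
    dists = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)
    return float((dists.min(axis=1) <= d).mean())
```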

Performance Analysis
First, we focus on the latest results reported in [69], where three data-driven cuboid panoramic layout estimation methods ([68,60,47]) were adapted for fairer comparison. Similar to [69], we train a 3-stack (HG-3) single-shot cuboid (SSC) model using the real dataset. We present results tested on the real (combined and single) datasets in Table 1, where our model compares favorably with the state-of-the-art⁶, offering robust performance and end-to-end Manhattan-aligned estimates, a trait no other state-of-the-art method currently offers. For these results, we report the same metrics as those reported in [69]. Furthermore, Figure 6 presents a set of qualitative results for our HG-3 model on these two datasets.
With the recent availability of large scale synthetic datasets, we additionally train a model using Structured3D [65]. Since only HorizonNet offers a pretrained model using the same data, we present results on the Structured3D test set for two HorizonNet variants and our model in Table 2. Apart from the standard model that includes postprocessing, we also assess a single-shot variant of HorizonNet. For this, we only perform peak detection on the predicted wall-to-wall boundary vector and directly sample the heights at the detected peaks to reconstruct the layout. While this saves some processing, the postprocessing scheme used by HorizonNet improves the results when applied to Structured3D's test set. On the other hand, our model produces accurate layout corner estimates without any postprocessing. While SSC outperforms HorizonNet in the established metrics, HorizonNet offers higher accuracy in the junction and wireframe metrics. This is also the case for the cross-validation experiment that we present in Table 3, where we test the models trained on Structured3D on the test set of Kujiale, using only the furnished rooms. The difference in this setting is that the single-shot variant of HorizonNet provides more accurate layout estimates than the postprocessed one. This exposes the weakness of postprocessing approaches, which require empirical or heuristic tuning. Nonetheless, this HorizonNet model is trained for general layout estimation, and the performance deviation might be related to this extra trait. Qualitative results for our end-to-end model on both synthetic datasets are presented in Figure 7.
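The single-shot HorizonNet variant described above can be sketched as a simple peak-detection decoder. This is a hedged illustration under our own assumptions about the per-column output format (a corner probability vector plus ceiling/floor boundary rows), not HorizonNet's actual decoding code:

```python
import numpy as np

def single_shot_corners(corner_prob, ceil_y, floor_y, min_prob=0.5):
    """Illustrative single-shot decoding of a HorizonNet-style output.

    corner_prob: (W,) per-column wall-to-wall corner probability
    ceil_y, floor_y: (W,) ceiling/floor boundary rows per column
    Returns (x, y_ceiling, y_floor) triplets at the detected peaks.
    """
    left = np.roll(corner_prob, 1)    # wraps around the panorama seam
    right = np.roll(corner_prob, -1)
    # A peak is a thresholded local maximum of the corner probability.
    peaks = np.where((corner_prob > left) &
                     (corner_prob >= right) &
                     (corner_prob >= min_prob))[0]
    # Directly sample the boundary heights at the peak columns.
    return [(int(x), float(ceil_y[x]), float(floor_y[x])) for x in peaks]
```

Using `np.roll` for the neighbor comparison keeps the horizontal periodicity of the panorama, so a corner straddling the image seam is still detected.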

Ablation Study
We perform an ablation study across all datasets. Tables 4, 2 and 5 present the results on the real and synthetic datasets. Our baseline is the model as presented in Section 3.3 without the end-to-end Manhattan alignment homography module (Section 3.3.3), but with the quasi-Manhattan alignment (Section 3.3.2) obtained by aligning the longitudes of the top and bottom corners. Apart from adding the end-to-end Manhattan alignment module, we also ablate the effect of the geodesic heatmap and loss (Section 3.2), the SH model adaptation (spherical padding, pre-activated residual blocks and anti-aliased max-pooling, Section 3.3.1), and the quasi-Manhattan alignment itself, by training a model with an unrestricted, traditional (i.e., not spherical, as presented in Section 3.1) CoM calculation for each corner.

Figure 7: Qualitative results on the Structured3D (top) and Kujiale (bottom) datasets. The same scheme as Figure 6 applies.

Table 2: Quantitative results and ablation on the synthetic Structured3D dataset.

Table 3: Cross-validation results on the Kujiale dataset using the Structured3D-trained model.
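The geodesic heatmap (Section 3.2) replaces the planar pixel distance of a conventional Gaussian target with the great-circle distance on the sphere. A minimal sketch, assuming an equirectangular grid and a keypoint given in radians (the function name, resolution and σ are our own illustrative choices):

```python
import numpy as np

def geodesic_heatmap(kp_lonlat, width=1024, height=512, sigma=0.1):
    """Gaussian heatmap over the great-circle (geodesic) distance to a
    keypoint on the sphere (illustrative; sigma in radians).

    kp_lonlat: (lon, lat) of the keypoint in radians,
               lon in [-pi, pi), lat in [-pi/2, pi/2].
    """
    lon = (np.arange(width) + 0.5) / width * 2 * np.pi - np.pi
    lat = np.pi / 2 - (np.arange(height) + 0.5) / height * np.pi
    lon, lat = np.meshgrid(lon, lat)              # (H, W) coordinate grids
    kl, kt = kp_lonlat
    # Spherical law of cosines for the central angle between points.
    cos_d = (np.sin(lat) * np.sin(kt) +
             np.cos(lat) * np.cos(kt) * np.cos(lon - kl))
    d = np.arccos(np.clip(cos_d, -1.0, 1.0))      # geodesic distance
    return np.exp(-d ** 2 / (2 * sigma ** 2))
```

Unlike a planar Gaussian, this target stays isotropic near the poles and wraps correctly across the horizontal seam of the panorama.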
These results offer a number of insights. The end-to-end model provides the most robust performance across all datasets, and its performance is uncontested in the IoU- and depth-related metrics. However, on the remaining projective metrics, the unrestricted coordinate regression approaches usually perform better. This is reasonable, as the homography fits a cuboid to the predictions, while the un-/semi-constrained approaches can freely localize the corners, albeit at the expense of unnatural, non-Manhattan outputs, which manifests as an IoU3D drop. Overall, we observe that the addition of explicit Manhattan constraints (quasi and homography-based) offers increased performance compared to directly regressing the corners. The same applies to the spherical components (periodic CoM and geodesics) and the model adaptation, which consistently increase performance.
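The periodic CoM ablated above can be illustrated with a circular mean over longitude. A naive linear expectation of the column coordinates fails when a corner's heatmap mass straddles the ±π seam, averaging to the wrong side of the panorama; averaging unit vectors on the circle does not. This is a sketch under our own naming, not the paper's implementation:

```python
import numpy as np

def periodic_com_lon(col_mass):
    """Circular center of mass over longitude (illustrative sketch).

    col_mass: (W,) non-negative per-column mass of a corner heatmap.
    Returns the expected longitude in (-pi, pi].
    """
    W = col_mass.shape[0]
    lon = (np.arange(W) + 0.5) / W * 2 * np.pi - np.pi
    w = col_mass / col_mass.sum()
    # Average unit vectors on the circle, then read back the angle,
    # so mass on both sides of the +/-pi seam averages correctly.
    return float(np.arctan2((w * np.sin(lon)).sum(),
                            (w * np.cos(lon)).sum()))
```

For example, equal mass just left and just right of the seam yields a CoM at ±π, whereas a linear average would incorrectly place it near longitude 0.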
We also ablate the three approaches (floor/ceiling/joint) that use different starting coordinates for the homography estimation in Tables 4 and 5. We find that the joint approach produces higher quality results, as it enforces the top and bottom predictions to be consistent with each other. This way, the cuboid misalignment errors are backpropagated to all corner estimates through the homography.
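The homography estimation underlying the alignment module can be illustrated with the standard Direct Linear Transform (DLT), e.g. fitting the four predicted floor-corner coordinates to a canonical square. This is a generic, non-differentiable numpy sketch of the technique, not the paper's exact module:

```python
import numpy as np

def fit_homography(src, dst):
    """Direct Linear Transform homography from point correspondences.

    src, dst: (N, 2) arrays with N >= 4. Returns a 3x3 H such that
    dst ~ H @ [src, 1] up to scale (illustrative, no normalization).
    """
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
    A = np.asarray(rows, dtype=float)
    # The right singular vector of the smallest singular value solves Ah = 0.
    _, _, vt = np.linalg.svd(A)
    return vt[-1].reshape(3, 3)

def apply_homography(H, pts):
    """Apply H to (N, 2) points and dehomogenize."""
    p = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return p[:, :2] / p[:, 2:3]
```

Since the SVD solution is differentiable almost everywhere, the same fit can in principle be embedded in an end-to-end pipeline, which is what lets the misalignment errors flow back to all corner estimates.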

Conclusion
Our work has focused on keypoint estimation on the sphere, and in particular on layout corner estimation. Through coordinate regression we integrate explicit constraints into our model. Moreover, while we have shown that end-to-end single-shot layout estimation is possible, our approach is rigid, as it is based on a frequent and logical assumption: that the underlying room is, or can be approximated by, a cuboid. This rigidity comes from the structured predictions that CNNs enforce, with the number of predicted heatmaps being strictly defined at the design phase. Future work should address this limitation to fully exploit the potential that single-shot approaches offer, mainly stemming from end-to-end supervision. Finally, as with all prior layout estimation works, predictions are only up to scale, which hinders applicability. Even so, structured scene layout estimation is an important task that can even serve as an intermediate task to improve other tasks, as shown in [28]. With metric scale inference, it has the potential for significant interplay with other 3D vision tasks like depth or surface estimation.

Table 5: Ablation study on the synthetic Kujiale dataset.

Supplement
Supplementary material, including additional ablation experiments and qualitative results, is appended after the references.

A. Supplementary Material
In this supplementary material we present additional information regarding runtime and floating point operations, with the data offered in Table 6 and illustrated in Figure 8. Apart from the models presented in the main document, we also add efficient CFL models for completeness. In addition, we provide evaluation results for the Stanford2D3D and PanoContext datasets separately, in Tables 7 and 8 respectively. Further, in Tables 9, 10, and 11, we offer a decomposed model ablation for Stanford2D3D, PanoContext, and both datasets (averaged) respectively, where each individual component is ablated (namely, pre-activated bottlenecks, spherical padding, and anti-aliased max pooling). The pre-activated residual blocks offer the largest gains, followed by the spherical padding and, finally, the anti-aliased max pooling. Nonetheless, each component contributes to increased performance, with their combined effect being the most significant, as observed from the model trained without all of these components. Figures 9, 10, 11 and 12 present additional qualitative results of our single-shot, end-to-end Manhattan-aligned layout estimation model using the joint homography head module on the Stanford2D3D, PanoContext, Structured3D and Kujiale datasets respectively. Finally, Figures 13 and 14 present the qualitative samples from the real and synthetic datasets respectively, which are included in the main manuscript, as animated 3D views (viewable only in recent Adobe Acrobat Reader versions).

Figure 8: Visual comparison of spherical layout estimation models in terms of parameters (denoted by each bullet's size), computational complexity (x axis, in log scale, billions of multiply-accumulate operations) and accuracy (y axis, average IoU3D accuracy). Our model (SSC) is the most lightweight and offers a good compromise between complexity and accuracy, surpassing most other approaches. It also provides an end-to-end layout prediction in a single shot, compared to all other approaches, which require postprocessing. Different variants of each model are depicted. The exact data of this plot can be found in Table 6.