Towards 3D Scene Reconstruction from Locally Scale-Aligned Monocular Video Depth

Graphical abstract: Our pipeline for dense 3D scene reconstruction is composed of a robust monocular depth estimation module, a metric depth recovery module, and an RGB-D fusion module.

Ensuring depth consistency over consecutive frames is receiving increasing attention. Bian et al. [9,20] employ an unsupervised paradigm and model the predicted depth as scale-invariant, proposing a geometric consistency loss to implicitly learn scale consistency over consecutive frames. Similarly, CVD [23] uses an unsupervised objective but performs test-time training to ensure consistency. By contrast, RCVD [24] takes the MiDaS model as a depth prior and estimates consistent dense depth maps and camera poses from a monocular video. These methods achieve promising visual consistency of depth maps. However, we observe that their reconstructed point clouds are still unsatisfactory; please see the supplemental material for more detailed analyses. In this work, instead of visual consistency, we focus on geometric consistency, i.e., achieving 3D scene reconstruction from consecutive frames.
Following previous methods, we enforce the model to predict affine-invariant depth, so recovering the scale and shift of the prediction is the main barrier to 3D scene reconstruction from a monocular video. Existing methods [1][2][3] directly compute a scale and a shift value by least-squares fitting with the ground truth (GT), i.e., a global recovery strategy. However, we observe that the optimal scale and shift are always heteroscedastic. In practical applications, global fitting does not consider the distribution difference between the affine-invariant depth and the ground-truth depth, and fails to effectively align local regions. In Figure 2(a), we visualize an example error map between the ground truth and such globally scale-shift-recovered depth, and a low-frequency error is clearly visible. Such a coarse alignment method cannot recover a high-quality metric depth for reconstruction.
Motivated by this observation, we propose a local recovery strategy to recover locally aligned metric depth. Concretely, we employ locally weighted linear regression as a novel metric depth alignment method, and demonstrate that heteroscedastic scale and shift maps can be recovered from as few as 25 points by enforcing spatial smoothness. Rather than fitting a unified scale map and shift map, i.e., sharing the same scale and shift values over all coordinates of the image, our local recovery strategy retrieves a location-related scale map and shift map to adjust the distribution of the prediction. Experiments show that our method significantly improves metric accuracy. Although some sparse anchor points are required, our strategy needs far fewer points than existing state-of-the-art depth completion methods [25][26][27][28], which generally take hundreds to thousands of sparse points. We also show experimentally that, although completion methods already take sparse depth measurements as inputs, our local recovery strategy can further boost their performance. Besides boosting performance, the second benefit of our local recovery strategy is that it better exposes the weaknesses of existing depth estimation methods and guides the design and choice of loss functions. The depth error can be decoupled into two parts: the coarse misalignment error and the detail-missing error. Compared with the error of global fitting, the error alleviated by local recovery is the coarse misalignment error, while the remaining one is the detail-missing error. Through our re-local-alignment analytical experiments in Table 3, we observe that current state-of-the-art depth estimation methods, including supervised metric depth, unsupervised scale-invariant depth, supervised affine-invariant depth, and depth completion, all suffer from a noticeable misalignment issue w.r.t. the ground truth.
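In code, this decomposition is just a comparison of Rel errors under the two alignment strategies; the following minimal NumPy sketch (function and variable names are ours, not from the paper) makes it concrete:

```python
import numpy as np

def decouple_rel_error(gt, pred_global, pred_local, mask):
    """Decouple the absolute relative error (Rel) into a coarse
    misalignment part and a detail-missing part.

    gt:          ground-truth metric depth map
    pred_global: prediction aligned with a single global scale/shift
    pred_local:  prediction aligned with the local recovery strategy
    mask:        boolean map of valid ground-truth pixels
    """
    rel = lambda p: float(np.mean(np.abs(p[mask] - gt[mask]) / gt[mask]))
    rel_global, rel_local = rel(pred_global), rel(pred_local)
    coarse_misalignment = rel_global - rel_local  # removed by local alignment
    detail_missing = rel_local                    # remains after local alignment
    return coarse_misalignment, detail_missing
```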

To achieve 3D scene reconstruction from a monocular video, a strong and robust monocular depth prediction model is also essential. We collect over 6.3 million images from existing RGB-D datasets to train models with backbones such as ResNet50 [29] and Swin-L [30], and investigate how much accuracy we can gain from such a large-scale dataset. With the robust monocular depth and the local recovery strategy, the metric depth can be recovered by locally aligning with some sparse points.
The last challenge is how to obtain accurate sparse anchor points as the metric guidance. For analytical evaluation, we leverage the ground-truth depth to decouple and analyze the composition of errors. For practical application, aligning with ground-truth depth can be seen as the upper bound of our local recovery strategy. We manually perturb the ground truth and analyze the resulting error to simulate the inaccuracy of sparse anchor points. SfM [31] can also be employed in practice to recover sparse depth at distinguishable feature points. Through per-frame alignment with such accurate guidance, we can achieve geometrically consistent depth and perform robust 3D scene reconstruction. Although existing geometry-based methods, e.g., multi-view stereo reconstruction [32][33][34], take a similar structure-from-motion (SfM) paradigm, their performance may suffer from inaccurate correspondences in low-texture regions. By contrast, our per-frame prediction comes from a well-trained, strong monocular depth estimation model, which is much more robust in low-texture regions.
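For illustration, the perturbation protocol can be simulated along the following lines; the multiplicative uniform noise model and helper names are our assumptions, chosen to match the "maximum perturbation percentage" described later in the ablation study:

```python
import numpy as np

def sample_noisy_anchors(gt_depth, num_points=100, max_noise=0.05, seed=0):
    # Sample sparse anchor points from the ground truth and perturb them
    # multiplicatively to simulate inaccurate guidance.
    rng = np.random.default_rng(seed)
    h, w = gt_depth.shape
    vs = rng.integers(0, h, num_points)
    us = rng.integers(0, w, num_points)
    depths = gt_depth[vs, us]
    # perturb each depth by up to +/- max_noise (e.g., 5%)
    depths = depths * (1.0 + rng.uniform(-max_noise, max_noise, num_points))
    return np.stack([us, vs], axis=1), depths
```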
Finally, by applying the local recovery strategy, our ResNet50 model reduces the depth absolute relative error by up to 50% on current affine-invariant depth evaluation benchmarks. For 3D scene reconstruction, our pipeline significantly improves both accuracy and consistency, and achieves better performance than related works on five NYU [15] videos. To summarize, our main contributions are as follows:
• We propose a novel and effective metric depth recovery strategy, i.e., locally weighted linear regression, which significantly improves the accuracy of the recovered metric depth with a very sparse set of anchor points. Extensive experiments show that the depth absolute relative error of state-of-the-art methods can drop by up to 50% with our proposed method.
• Our local recovery strategy can serve as an analytical tool for subsequent depth prediction works, enabling the decoupling of prediction errors and the analysis of model weaknesses.
• We train a robust monocular depth estimation model on a large and diverse dataset containing 6.3 million images in total. We provide detailed analyses of its performance w.r.t. the training dataset size using our analytical tool.
• Aiming at video-based scenarios, we combine our strong monocular depth estimation model with a geometry-based method for retrieving high-confidence anchor points, and design a new pipeline for robust and dense 3D scene reconstruction.

Materials and methods
The pipeline for our dense 3D scene reconstruction method is shown in Figure 1. Overall, our pipeline contains robust data-driven monocular depth estimation, a novel metric depth recovery module, and RGB-D fusion [35].

Our Pipeline for Dense 3D Scene Reconstruction
Robust Monocular Depth Estimation Module. Retrieving robust and accurate depth maps from 2D images is significant for 3D scene reconstruction. In the supplementary material, we analyze how unsupervised depth estimation methods suffer from the weak supervision of the photometric loss, while inaccurate correspondences may degrade the accuracy and robustness of MVS-based methods. Thus, supervised monocular depth estimation is employed in our pipeline; its promising robustness and accuracy have been demonstrated in recent works [1,3].
To train strong monocular depth estimation models, we collect over 6.3 million images from 14 diverse datasets, which cover a wide range of scenes, camera poses, and camera intrinsic parameters. Following previous works, we enforce the network to learn affine-invariant depth. Several scale-shift-invariant losses are employed during training to better learn the inherent geometric information of depth maps, including the pair-wise normal (PWN) loss [1], the image-level normalized regression (ILNR) loss [1], and the multi-scale gradient (MSG) loss [5], as follows:
$$\mathcal{L}_{\mathrm{PWN}} = \frac{1}{N}\sum_{i=1}^{N} \left| n_{A_i} \cdot n_{B_i} - n^{*}_{A_i} \cdot n^{*}_{B_i} \right|,$$

$$\mathcal{L}_{\mathrm{ILNR}} = \frac{1}{N}\sum_{i=1}^{N} \left( \left| d_i - \bar{d}^{*}_i \right| + \left| \tanh\!\left(\tfrac{d_i}{100}\right) - \tanh\!\left(\tfrac{\bar{d}^{*}_i}{100}\right) \right| \right), \qquad \bar{d}^{*} = \frac{d^{*} - \mu_{\mathrm{trim}}}{\sigma_{\mathrm{trim}}},$$

$$\mathcal{L}_{\mathrm{MSG}} = \frac{1}{N}\sum_{k=1}^{K}\sum_{i=1}^{N} \left( \left| \nabla^{k}_{x} d_i - \nabla^{k}_{x} d^{*}_i \right| + \left| \nabla^{k}_{y} d_i - \nabla^{k}_{y} d^{*}_i \right| \right),$$

$$\mathcal{L} = \mathcal{L}_{\mathrm{ILNR}} + \lambda_{1}\mathcal{L}_{\mathrm{PWN}} + \lambda_{2}\mathcal{L}_{\mathrm{MSG}}.$$

Here, $n_{A_i}$ and $n_{B_i}$ represent the surface normals of the sampled point pair $(A_i, B_i)$, and $n^{*}_{A_i}$ and $n^{*}_{B_i}$ are the corresponding ground truth. $\bar{d}^{*}$ is the Z-score normalized ground-truth depth, and $\mu_{\mathrm{trim}}$ and $\sigma_{\mathrm{trim}}$ represent the mean and standard deviation of the trimmed ground-truth depth map, which removes the nearest and farthest 10% of values in advance. $\nabla^{k}_{x}$ and $\nabla^{k}_{y}$ stand for the gradients at the $k$-th scale along the $x$ and $y$ axes, respectively. The PWN loss samples paired points on edge, planar, and random regions to supervise the surface normal information. The ILNR loss is introduced to reduce the average element-wise difference over the $N$ pixels between the predicted depth $d$ and the normalized ground truth $\bar{d}^{*}$. The MSG loss ensures accurate depth gradients at $K$ scales. The loss function is balanced by the hyperparameters $\lambda_{1}$ and $\lambda_{2}$, which are set to 1 and 0.2, respectively, in our experiments.
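For illustration, a minimal PyTorch sketch of the ILNR and MSG terms is given below; the PWN term is omitted since it requires surface-normal estimation from sampled point pairs, and the tensor shapes and trimming implementation are our assumptions rather than the authors' released code:

```python
import torch
import torch.nn.functional as F

def ilnr_loss(pred, gt, trim=0.10, eps=1e-8):
    """ILNR loss sketch. pred, gt: (B, H*W) flattened depth maps."""
    lo = torch.quantile(gt, trim, dim=1, keepdim=True)
    hi = torch.quantile(gt, 1.0 - trim, dim=1, keepdim=True)
    nan = torch.full_like(gt, float("nan"))
    trimmed = torch.where((gt >= lo) & (gt <= hi), gt, nan)  # trim 10% tails
    mu = torch.nanmean(trimmed, dim=1, keepdim=True)
    sigma = torch.sqrt(torch.nanmean((trimmed - mu) ** 2, dim=1, keepdim=True))
    gt_norm = (gt - mu) / (sigma + eps)  # Z-score normalized ground truth
    return (torch.abs(pred - gt_norm)
            + torch.abs(torch.tanh(pred / 100) - torch.tanh(gt_norm / 100))).mean()

def msg_loss(pred, gt, num_scales=4):
    """MSG loss sketch. pred, gt: (B, 1, H, W) depth maps."""
    loss = 0.0
    for _ in range(num_scales):
        diff = pred - gt
        loss = loss + torch.abs(diff[..., :, 1:] - diff[..., :, :-1]).mean()
        loss = loss + torch.abs(diff[..., 1:, :] - diff[..., :-1, :]).mean()
        pred, gt = F.avg_pool2d(pred, 2), F.avg_pool2d(gt, 2)  # next scale
    return loss
```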
Metric Depth Recovery Module. In the supplemental material, we analyze the shape distortion and duplication caused by inaccurate shifts and inconsistent scales, which underlines the importance of recovering accurate and consistent scale-shift values along consecutive frames. Existing approaches recover a single scale and shift value for a depth map using least-squares fitting with some anchor points, which neglects the distribution difference between the affine-invariant prediction and the anchor points. In contrast, we propose to perform locally weighted linear regression to recover the metric depth (see Section 2.2 for details). Compared to the global least-squares fitting, our local recovery strategy generates a scale map and a shift map for each depth map, which not only recovers metric depth but also corrects the overall depth maps and ensures the accuracy and consistency of the 3D reconstruction.

Our proposed local recovery strategy leverages sparse anchor points obtained from a SLAM system [36], SfM algorithms [31,37], or low-quality sensors, and can not only boost the performance of depth estimation but also serve as an analytical tool to decouple the predicted depth errors into the coarse misalignment error and the detail-missing error. Please see Section 3.2 for details.

RGB-D Fusion Module. Through per-frame depth estimation and local recovery, locally scale-aligned monocular video depths are obtained. However, subtle details may still remain partially inconsistent between frames, which can cause outliers and duplication if we simply unproject them to 3D space without post-processing. Therefore, we fuse multi-frame information with an RGB-D fusion module, which takes the RGB frames, depth maps, camera poses, and intrinsic parameters as inputs. It balances the differences between frames, filters out outliers and inconsistent regions, and outputs the fused dense 3D mesh or point cloud.
In this work, we employ TSDF fusion [35] to fuse multiple depth maps into a projective truncated signed distance function (TSDF) voxel volume during reconstruction. The sparse guided points used for local alignment can be obtained from various SLAM systems [36], SfM algorithms [31,37], or low-quality sensors such as the ToF sensors of mobile phones. Two strong and robust monocular depth estimation models are trained with ResNet50 [29] and Swin-L [30] backbones, respectively.
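As a concrete illustration, TSDF fusion of the aligned depth maps can be performed with Open3D as sketched below; this is not necessarily the implementation of [35], and `frames` and `intrinsic` (an `o3d.camera.PinholeCameraIntrinsic`) are assumed inputs:

```python
import numpy as np
import open3d as o3d

# Minimal TSDF-fusion sketch: integrate per-frame RGB-D predictions
# into a scalable TSDF volume, then extract a mesh.
volume = o3d.pipelines.integration.ScalableTSDFVolume(
    voxel_length=0.04,  # 4 cm voxels
    sdf_trunc=0.20,     # truncation distance in meters
    color_type=o3d.pipelines.integration.TSDFVolumeColorType.RGB8)

for rgb, depth, pose in frames:  # depth: locally scale-aligned metric depth
    rgbd = o3d.geometry.RGBDImage.create_from_color_and_depth(
        o3d.geometry.Image(rgb), o3d.geometry.Image(depth),
        depth_scale=1.0, depth_trunc=8.0, convert_rgb_to_intensity=False)
    # integrate() expects a world-to-camera extrinsic
    volume.integrate(rgbd, intrinsic, np.linalg.inv(pose))

mesh = volume.extract_triangle_mesh()
mesh.compute_vertex_normals()
```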

Metric Depth Recovery

Monocular depth estimation methods [1-3,7] have achieved promising results on diverse scenes. The problem is that their predicted depth/inverse depth is scale-shift-invariant, namely affine-invariant depth/inverse depth [7]. Here we take the affine-invariant depth as an example. To recover the metric depth, the prediction should be scaled and shifted, i.e., $\hat{D} = sD + \theta J$, where $\hat{D}$, $D$, $s$, $\theta$, and $J$ are the recovered metric depth, the predicted affine-invariant depth, the scale, the shift, and an all-ones matrix, respectively.

Figure 1. The pipeline for dense 3D scene reconstruction. The robust monocular depth estimation model trained on 6.3 million images, the locally weighted linear regression strategy, and TSDF fusion [35] are the main components of our method.

Some methods propose to obtain them through a global least-squares fitting with ground-truth depth:

$$\hat{\beta} = \arg\min_{\beta} \left\| d^{*} - \tilde{D}\beta \right\|_{2}^{2},$$

where $d^{*} \in \mathbb{R}^{hw}$ is the flattened ground-truth metric depth, $\tilde{D} = [d, \mathbf{1}] \in \mathbb{R}^{hw \times 2}$ is the homogeneous representation of the flattened predicted depth $d$, and $hw$ represents the flattened length of a depth map with a shape of $h \times w$. $\beta = [s, \theta]^{\top}$ is composed of the scale value $s$ and the shift value $\theta$, and $\hat{\beta}$ is the optimized value of $\beta$. Note that the scale value $s$ and shift value $\theta$ can be regarded as a scale map and a shift map shared over the whole image. However, such global scaling and shifting often fails to reduce the spatially heteroscedastic errors, even though they follow a rather simple pattern. For example, we visualize the pixel-wise absolute relative error map between the ground truth and the globally recovered predicted depth in Figure 2(a), and observe a clear low-frequency spatial error: the left part has a higher error than the right part. Motivated by this observation, we propose to leverage a local recovery method, i.e., locally weighted linear regression (LWLR), to recover a scale map and a shift map. Guided by very sparse ground-truth points, we can fix and quantify these low-rank spatial errors, which are common in depth estimation tasks.
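In NumPy, this global baseline reduces to a two-parameter least-squares fit; a minimal sketch (names are ours):

```python
import numpy as np

def global_scale_shift(pred, gt, mask):
    # Flatten valid pixels and solve for beta = [s, theta] in closed form.
    d = pred[mask].ravel()
    d_tilde = np.stack([d, np.ones_like(d)], axis=1)  # homogeneous form [d, 1]
    beta, *_ = np.linalg.lstsq(d_tilde, gt[mask].ravel(), rcond=None)
    s, theta = beta
    return s * pred + theta
```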
Locally Weighted Linear Regression. We thus employ a locally weighted linear regression method:

$$\hat{\beta}_{u,v} = \arg\min_{\beta_{u,v}} \left( d^{*}_{\mathrm{sp}} - \tilde{D}_{\mathrm{sp}} \beta_{u,v} \right)^{\top} W_{u,v} \left( d^{*}_{\mathrm{sp}} - \tilde{D}_{\mathrm{sp}} \beta_{u,v} \right),$$

where $d^{*}_{\mathrm{sp}} \in \mathbb{R}^{m}$ is the sampled sparse ground-truth metric depth (we use around 25-100 points in practice), $\tilde{D}_{\mathrm{sp}} \in \mathbb{R}^{m \times 2}$ is the homogeneous representation of the sampled sparse predicted depth $d_{\mathrm{sp}}$, and $m$ stands for the number of sampled points.
Different from recovering a single scale-shift value (equivalently, a globally shared scale-shift map) by the global least-squares fitting method, we recover a location-aware scale-shift map. For each 2D coordinate $(u, v)$, the predicted depth is fitted to the ground-truth depth by minimizing the squared locally weighted distance, which is re-weighted by a diagonal weight matrix $W_{u,v}$. It pays more attention to sparse points closer to the estimated location, based on the idea that points near each other in the explanatory variable space are more likely to be related in a simple way. By iterating over the whole image, the scale map $S$ and shift map $\Theta$ can be generated, composed of the scale value $s_{u,v}$ and shift value $\theta_{u,v}$ of each location $(u, v)$. Finally, the locally recovered metric depth is $\hat{D} = S \odot D + \Theta$, i.e., the shift map plus the Hadamard product ($\odot$, also known as the element-wise product) of the affine-invariant depth $D$ and the scale map $S$. In our implementation, we employ a Gaussian kernel function to compute the weight matrix:

$$W_{u,v} = \mathrm{diag}(w_{1}, w_{2}, \ldots, w_{m}), \qquad w_{i} = \exp\left( -\frac{\operatorname{dist}(p_{i}, p_{u,v})^{2}}{2b^{2}} \right),$$

where $b$ is the bandwidth of the Gaussian kernel, and $\operatorname{dist}(p_{i}, p_{u,v})$ is the Euclidean distance between the guided point $p_{i}$ and the target point $p_{u,v}$.
The scale map and shift map obtained this way can yield much more accurate metric depth than the global method. However, some scale values can be fitted to negative values, due to the shift-invariant characteristic of monocular depth and the flexibility of weighted linear regression, which inverts the distribution of the depth prediction and is physically implausible. Since the bias is not centered and the solution space is not bounded, the resulting scales and shifts are widely distributed, carry no physical meaning, and lie far from the real scale and shift. Therefore, we first apply the global recovery strategy to the monocular depth, and then restrict the solution to be simple by adding an $\ell_2$ regularization on the shift:

$$\hat{\beta}_{u,v} = \arg\min_{\beta_{u,v}} \left( d^{*}_{\mathrm{sp}} - \tilde{D}^{g}_{\mathrm{sp}} \beta_{u,v} \right)^{\top} W_{u,v} \left( d^{*}_{\mathrm{sp}} - \tilde{D}^{g}_{\mathrm{sp}} \beta_{u,v} \right) + \lambda \theta_{u,v}^{2},$$

where $\tilde{D}^{g}_{\mathrm{sp}}$ is the homogeneous representation of the globally recovered depth $d^{g}_{\mathrm{sp}}$. With the regularization on the shift, the location-related scale map is encouraged to be positive. Please see the supplemental material for the visualization of the scale value distribution.
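Putting the pieces together, the regularized LWLR can be sketched as follows; the hyperparameter values and the dense per-pixel loop are illustrative (in practice the solve can be run on a coarse grid and the scale/shift maps bilinearly upsampled, since they are spatially smooth):

```python
import numpy as np

def lwlr_recover(pred_global, anchor_uv, anchor_depth,
                 bandwidth=50.0, shift_reg=1.0):
    """LWLR sketch: recover per-pixel scale/shift maps from sparse anchors.

    pred_global:  globally recovered depth map (H, W)
    anchor_uv:    (m, 2) integer pixel coordinates of sparse anchors
    anchor_depth: (m,) metric depths of the anchors
    bandwidth:    Gaussian-kernel bandwidth b, in pixels
    shift_reg:    l2 penalty on the shift (hyperparameter value assumed)
    """
    h, w = pred_global.shape
    d_sp = pred_global[anchor_uv[:, 1], anchor_uv[:, 0]]
    D_sp = np.stack([d_sp, np.ones_like(d_sp)], axis=1)  # (m, 2) homogeneous
    reg = np.diag([0.0, shift_reg])                      # penalize shift only
    scale = np.zeros((h, w))
    shift = np.zeros((h, w))
    for v in range(h):
        for u in range(w):
            dist2 = ((anchor_uv - np.array([u, v])) ** 2).sum(axis=1)
            wgt = np.exp(-dist2 / (2.0 * bandwidth ** 2))  # Gaussian kernel
            A = D_sp * wgt[:, None]                        # = W_{u,v} @ D_sp
            beta = np.linalg.solve(D_sp.T @ A + reg, A.T @ anchor_depth)
            scale[v, u], shift[v, u] = beta
    return scale * pred_global + shift, scale, shift
```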
With our proposed local recovery strategy, we only need very sparse ground-truth depth (around 25-100 points) to recover the metric depth map by fitting a location-related scale map and shift map. Figure 2 compares the global least-squares fitting and our proposed locally weighted linear regression. Thanks to the optimized pixel-wise scale map (Figure 2(c)) and shift map (Figure 2(d)), the overall error is reduced considerably (Figure 2(b)). Note that the predicted affine-invariant depths are the same for the two methods. Importantly, the metric depth recovered with our method is more linearly correlated with the ground truth (see Figure 2(e) and Figure 2(f)). Please see the supplemental material for more examples.

Results and Discussion
The components of training datasets, the implementation details, and the evaluation details can be found in the supplemental material.

Dense 3D Scene Reconstruction
With our model trained on 6.3 million images and the local scale-shift recovery method, we can achieve high-quality 3D scene reconstruction through per-frame prediction and TSDF fusion [35]. To evaluate consistency and accuracy, we collect five NYU videos and compare with a single-image 3D reconstruction method (LeReS [1]), a state-of-the-art depth completion method (NLSPN [25]), a robust consistent video depth estimation method (RCVD [24]), an unsupervised video depth estimation method (SC-DepthV2 [20]), and a learning-based MVS method (DPSNet [38]). Note that NLSPN and SC-DepthV2 are trained on NYU, and only NLSPN can predict metric depth. Our method uses the same sparse ground truth (100 points) as NLSPN. For LeReS and RCVD, we align their predictions with the metric depth globally. For SC-DepthV2 and DPSNet, only a global scale value is recovered, by matching their medians to the ground truth. Besides leveraging sparse ground truth, we also sample points from SfM methods, e.g., COLMAP [37], to reconstruct a 3D scene from an RGB video with ground-truth intrinsics and poses. The absolute relative error (Rel), the $\delta_{1.25}$ accuracy, the Chamfer distance, and the F-score with a threshold of 5 cm are employed for evaluation.
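For reference, these metrics can be computed along the following lines; this sketch assumes the standard definitions, and details such as point-cloud sampling density follow the supplemental material:

```python
import numpy as np
from scipy.spatial import cKDTree

def rel_and_delta(pred, gt, mask, thresh=1.25):
    # Absolute relative error and delta threshold accuracy on valid pixels.
    p, g = pred[mask], gt[mask]
    rel = np.mean(np.abs(p - g) / g)
    delta = np.mean(np.maximum(p / g, g / p) < thresh)
    return rel, delta

def chamfer_and_fscore(pred_pts, gt_pts, tau=0.05):
    # Nearest-neighbor distances between predicted and GT point clouds.
    d_p2g = cKDTree(gt_pts).query(pred_pts)[0]
    d_g2p = cKDTree(pred_pts).query(gt_pts)[0]
    chamfer = d_p2g.mean() + d_g2p.mean()
    precision, recall = np.mean(d_p2g < tau), np.mean(d_g2p < tau)
    fscore = 2 * precision * recall / max(precision + recall, 1e-8)
    return chamfer, fscore
```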
Quantitative comparisons are shown in Table 1. First, we compare with the depth completion method NLSPN [25], which also uses sparse guided points to obtain metric information. The main difference is that its model must be trained on the testing domain and lacks generalization in the wild. By contrast, we achieve better performance and generalize well to zero-shot datasets thanks to the robust depth prior. RCVD [24] and SC-DepthV2 [20] aim to solve the visual consistency problem of video depth prediction. LeReS [1] reconstructs 3D scene shape from a single image and performs well in the wild. DPSNet [38] leverages CNNs to extract features and match them between frames automatically. Before evaluation, NLSPN and SC-DepthV2 have been trained on the NYU dataset.

Table 1. Quantitative comparison of monocular depth estimation and 3D scene reconstruction with diverse related methods [1,20,24,25,38] on five NYU scenarios.

The "global" and "local" labels denote the global and our proposed local metric depth recovery strategies, respectively. Ours-SfM (local) means performing the local recovery strategy with points sampled from SfM [31,37] depth. As a result, our video-based 3D scene reconstruction pipeline achieves state-of-the-art performance on all five scenes. Qualitative comparisons are shown in the supplemental material. The depth completion method NLSPN performs well but misses some high-quality details due to the lack of geometry supervision during training. RCVD focuses on visual depth consistency but fails to recover the shift of depth maps, leading to distortion of the 3D structure. SC-DepthV2 achieves visually consistent video depth through an unsupervised paradigm, but the weak supervision introduces some distortion during reconstruction. LeReS achieves excellent detail prediction but lacks consistency between frames because of the misalignment caused by the global recovery strategy. DPSNet improves the quality of extracted features with the help of CNNs, but without training on the NYU dataset it lacks the robustness to generalize to unseen scenarios. With our local recovery strategy, our method reconstructs better 3D point clouds than all of them. For Ours-SfM (local), we obtain SfM depth first, then iteratively fit the monocular depth to the SfM depth and filter out residuals above the 99th percentile, before performing the local recovery strategy. Note that even with slightly inaccurate sparse SfM points, Ours-SfM (local) still achieves results comparable to Ours (global), which requires ground-truth depth acquired from sensors.
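The iterative fit-and-filter step for SfM anchors can be sketched as follows; the iteration count and helper names are illustrative:

```python
import numpy as np

def fit_and_filter_sfm(pred, sfm_uv, sfm_depth, iters=3, pct=99):
    # Iteratively fit monocular depth to SfM depth with least squares and
    # drop points whose residuals exceed the given percentile.
    uv, d_sfm = sfm_uv.copy(), sfm_depth.copy()
    for _ in range(iters):
        d = pred[uv[:, 1], uv[:, 0]]
        A = np.stack([d, np.ones_like(d)], axis=1)
        beta, *_ = np.linalg.lstsq(A, d_sfm, rcond=None)
        residual = np.abs(A @ beta - d_sfm)
        keep = residual <= np.percentile(residual, pct)
        uv, d_sfm = uv[keep], d_sfm[keep]
    return uv, d_sfm  # cleaned anchors for the local recovery strategy
```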

Monocular Depth Estimation
Comparison with State-of-the-Art Methods. In this experiment, we compare with state-of-the-art robust monocular depth estimation methods [1-7, 39, 40] on five zero-shot datasets, with scale and shift recovered by the global least-squares fitting method. During evaluation, the latest released model weights are adopted uniformly. As shown in Table 2, our ResNet50 [29] model outperforms the other ResNet50 and ResNeXt101 [41] models on four of the testing datasets, and our Swin-L [30] model achieves results comparable to the ViT-Large [42] model of DPT [2]. By recovering scale and shift with the proposed locally weighted linear regression method, our models with ResNet50 and Swin-L backbones (i.e., "Ours-R50 (local)" and "Ours-Swin (local)") outperform all previous methods and our own global predictions by a large margin on all zero-shot testing datasets. The qualitative comparison can be found in the supplemental material.
Effectiveness of Locally Weighted Linear Regression. To demonstrate that our proposed locally weighted linear regression can boost various monocular depth estimation methods, we apply it to several different kinds of methods: 1) methods learning affine-invariant depth, e.g., LeReS [1], MiDaS [3], and DPT [2]; 2) a method learning metric depth on a specific dataset (VNL [7]); 3) a method learning scale-invariant depth in an unsupervised manner (Monodepth2 [21]); and 4) a depth completion method (NLSPN [25]). Results are shown in Table 3. We uniformly sample 100 guided points to perform the local recovery, and all of their performances are boosted significantly (see the "w" columns). Critically, even though NLSPN already takes these 100 sampled points as input for completion, our method can still further boost its performance. Note that the latest released weights and code are used for this experiment; NLSPN (KITTI) and Monodepth2 are trained on the KITTI dataset, while NLSPN (NYU) and VNL are trained on the NYU dataset.
Decoupling of Monocular Depth Error. Besides improving performance, the local recovery strategy can also be used to decouple the monocular depth error between the ground truth and the globally aligned prediction into a coarse misalignment error and a detail-missing error. Compared with the error of global recovery, the error alleviated by local recovery represents the coarse misalignment error, and the remaining one stands for the detail-missing error. As shown in Table 3, the percentage of coarse misalignment error ("%" columns) is substantial for all evaluated methods.

Table 2. Quantitative comparison of monocular depth estimation with state-of-the-art methods on five unseen datasets. The numbers in brackets represent the reduced absolute relative error brought by our local recovery method.

Ablation Study
Ablation Study for Training Data. In this experiment, we study the relation between data volume and performance. We gradually aggregate more data for training and evaluate the performance on five zero-shot datasets. Note that three data sources of different quality are increased in a balanced manner; the results are reported in Table 4. We observe that when the data size increases from 42K to 900K (around 20 times), the performance is boosted significantly. However, when it is further increased by 7 times, the accuracy improves only slightly. We conjecture that such large-scale data has fully exploited the capacity of the model (ResNet50 backbone). Furthermore, we also apply local recovery here to decouple the error into the coarse misalignment error and the detail-missing error. As shown in Table 4, the percentage of coarse misalignment error remains nearly constant, which shows that the model learns the detailed information and the global structure simultaneously.
Ablation Study for the Locally Weighted Linear Regression Method. The performance of our proposed local recovery strategy may be affected by the number of sparse points, their spatial distribution, random noise in the sparse points, and the bandwidth $b$. Their effects on depth accuracy are explored in Table 5. Here, "Amount", "Distribution", and "Noise" correspond to the number, distribution, and maximum perturbation percentage of the sampled ground truth, and parameter $b$ represents the bandwidth of the Gaussian kernel function. "Grid" means sampling points from the vertexes of an evenly divided image plane, and "Uniform" means sampling randomly. "Whole image" and "half image" stand for sampling ground truth from the whole image or from only half of it. All experiments are conducted on the NYU dataset.

Table 3. Boosting various monocular depth estimation methods with our local recovery method. We compare the accuracy without ("w/o") and with ("w") our recovery method, and show the reduced errors and their percentages ("w" and "%").

Table 6. Analysis of the number of ground-truth points used when recovering monocular metric depth. The Rel decreases faster with our proposed local recovery strategy as the number of ground-truth points increases.
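The two sampling patterns compared in the ablation above can be reproduced with a small helper (a sketch; names are ours):

```python
import numpy as np

def sample_anchor_coords(h, w, n, pattern="grid", seed=0):
    # "Grid": vertexes of an evenly divided image plane; "Uniform": random.
    if pattern == "grid":
        k = int(np.ceil(np.sqrt(n)))
        vs, us = np.meshgrid(np.linspace(0, h - 1, k),
                             np.linspace(0, w - 1, k), indexing="ij")
        return np.stack([us.ravel(), vs.ravel()], axis=1).astype(int)[:n]
    rng = np.random.default_rng(seed)
    return np.stack([rng.integers(0, w, n), rng.integers(0, h, n)], axis=1)
```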