Unsupervised Monocular Depth and Camera Pose Estimation with Multiple Masks and Geometric Consistency Constraints

This paper presents a novel unsupervised learning framework for estimating scene depth and camera pose from video sequences, a capability fundamental to many high-level tasks such as 3D reconstruction, visual navigation, and augmented reality. Although existing unsupervised methods have achieved promising results, their performance suffers in challenging scenes, such as those with dynamic objects and occluded regions. We therefore adopt multiple mask techniques and geometric consistency constraints to mitigate these negative impacts. Firstly, multiple masks are used to identify outliers in the scene, which are excluded from the loss computation. In addition, the identified outliers are employed as a supervisory signal to train a mask estimation network. The estimated mask is then used to preprocess the input to the pose estimation network, mitigating the potential adverse effects of challenging scenes on pose estimation. Furthermore, we propose geometric consistency constraints to reduce sensitivity to illumination changes; these constraints act as additional supervisory signals to train the network. Experimental results on the KITTI dataset demonstrate that the proposed strategies effectively enhance the model's performance, outperforming other unsupervised methods.


Introduction
Estimating scene depth and camera pose from video sequences is a critical topic in visual perception and forms the foundation of many advanced tasks. Such estimates can be used to build 3D scene structures, which can be applied in various industrial settings, including autonomous driving, visual navigation, and augmented reality [1][2][3]. Traditional methods rely on geometric cues in the image for inference, making them sensitive to challenging environments with low texture or strong lighting changes [4][5][6][7][8]. Conversely, learning-based depth and pose estimation methods exhibit better adaptability to challenging environments [9][10][11][12][13]. These methods take the image sequence as input and output the depth map and camera pose through a nonlinear mapping learned by a neural network. This is similar to the human brain processing high-dimensional information after observing with human eyes. Supervised learning-based methods primarily use high-precision sensors such as differential GPS, LiDAR, and IMU to obtain labeled data, and then train the estimation network on these data to learn the mapping function by minimizing the difference between the model's predictions and the labels.
While supervised methods have shown excellent performance, their reliance on massive quantities of labeled data during model training poses a significant limitation. Acquiring labeled data in real-world scenarios often requires expensive equipment or large amounts of manpower, which limits how far the performance of such models can be improved [9]. In contrast, unsupervised methods do not require expensive labels during training and can leverage larger datasets and more complex models to achieve better performance [10]. This is exemplified by recent models such as ChatGPT, which uses large amounts of data and parameters to achieve state-of-the-art performance [14]. Researchers have also proposed numerous unsupervised methods for tasks related to depth recovery and pose estimation [10][11][12][13]. A common principle is to use the view synthesis process to generate supervisory signals to train the model. This technique employs two sub-networks based on convolutional neural networks (CNNs), which respectively estimate the depth map of a view (target view) and the relative camera pose between that view and an adjacent view (source view). Once the depth and camera pose are estimated, the source view can be projected onto the target view to synthesize a new view, and the entire framework is then trained by minimizing the photometric error between the synthesized view and the target view.
View synthesis assumes a static scene without occluded areas; however, real-world scenes are rife with dynamic objects and occlusions, inevitably resulting in unstable training [15]. Consequently, numerous studies have proposed various masks to mitigate outliers in the scene during the view synthesis process [15][16][17][18][19][20][21]. However, these methods often overlook the impact of dynamic objects and occluded areas on pose estimation, which is trained jointly with depth estimation during the view synthesis process. As a result, any degradation in pose estimation accuracy will decrease depth estimation accuracy. Furthermore, training the entire unsupervised framework relies primarily on the photometric differences between the synthesized view and the target view, implying that when the illumination changes drastically or when the training sequence is lengthy, inconsistent illumination intensities may interfere with the model's learning process.
To overcome these challenges, we propose a new unsupervised learning framework for estimating scene depth and camera pose. Because dynamic objects and occluded regions in the scene affect both the computation of the loss function and the estimation of the camera pose, we design multiple masks to identify different types of outliers in the scene during forward propagation. These computed masks are combined and applied in two ways: firstly, to prevent outliers from participating in the calculation of the loss function during view synthesis, and secondly, as a supervisory signal for MaskNet, a neural network trained to estimate outliers such as dynamic objects and occluded regions. The mask obtained by MaskNet is then used to preprocess the input of the pose network. This involves multiplying the regions that may be outliers by a small weight coefficient, thereby reducing the influence of outliers in the scene on pose estimation. Moreover, since training the network solely with view synthesis is susceptible to changes in lighting, we propose several geometric consistency losses, namely flow consistency, depth consistency, and pose consistency losses, to exploit the geometric properties of the 3D scene and further strengthen the model as a whole.
The main contributions of the work are twofold: (1) We introduce multiple mask techniques to mitigate the adverse impact of outliers in the scene during the view synthesis process. Additionally, we employ a MaskNet network to address the detrimental effects of outliers on pose estimation; (2) we propose several geometric consistency constraints to alleviate the limitations of sole training with photometric consistency. Finally, we evaluate our model on the widely used KITTI dataset and demonstrate its superiority over other unsupervised methods.
The remainder of the paper is structured as follows: Section 2 provides an overview of the existing literature concerning depth and pose estimations. Section 3 elaborates on the method and the enhanced strategies proposed in this paper. The effectiveness of our approach is validated through experimental results and ablation studies, which are presented in Section 4. Finally, Section 5 offers a summary of our work.

Related Works
Visual Simultaneous Localization and Mapping (VSLAM) [4][5][6] and Structure from Motion (SFM) [7,8] are two examples of traditional geometry-based techniques used to estimate scene depth and camera pose. Both of these methods require extracting hand-designed key points from the image and matching them to estimate the camera pose, followed by the triangulation technique to estimate the scene's structure. However, the traditional methods may fail in low-texture regions where key points cannot be extracted, presenting a challenge. Learning-based methods have emerged as a solution to overcome this challenge. These methods are classified as supervised or unsupervised based on whether or not they use labels for training.
Sensors 2023, 23, 5329

Supervised Learning of Scene Depth and Camera Pose
Supervised depth recovery and camera motion estimation approaches typically treat these as separate tasks and learn each one individually by minimizing the discrepancy between the estimated values and the corresponding ground truth. The pioneering work by Eigen et al. [22] demonstrated the use of deep CNNs to predict depth from a single image using two network stacks: one for global prediction and the other for local refinement. In contrast, Liu et al. [23] employed hierarchical conditional random fields and a superpixel pooling method to improve the quality of the depth map. Regarding camera pose estimation, also called visual odometry (VO), optical flow (OF) is widely used in learning-based VO methods [24][25][26] because the optical flow field encodes geometric motion. Costante et al. [25] used an autoencoder to learn a low-dimensional latent feature space of the optical flow field and then used this feature space to regress the camera's 6-dimensional pose, thus improving the robustness of the estimated model. Zhao et al. [26] also used optical flow as the input to the pose estimation network and extended the work of [25] by adding recurrent neural networks for sequence learning to further improve the estimation accuracy. Despite the promising performance of supervised methods, their utility is limited by the requirement for labeled datasets, which are arduous and costly to collect and may limit generalizability.

Unsupervised Learning of Scene Depth and Camera Pose
Garg et al. [11] pioneered a novel method to reduce the dependence on labeled data by training a network with stereo image pairs as input. The objective function of the training is to minimize the photometric discrepancies between the left image and its corresponding right synthetic image using epipolar geometric inference for view synthesis. Godard et al. [12] extended this methodology by incorporating the left-right consistency constraint for depth estimation. Additionally, Zhan et al. [13] continued to follow this approach of using stereo images as input, solving the absolute scale problem of VO estimation [6]. Meanwhile, they also proposed a feature reconstruction loss to strengthen the training of the framework.
Nevertheless, camera calibration in stereo systems is complex, leading Zhou et al. [15] to propose an approach relying on monocular sequences. Their core idea is to train both the depth estimation network and the pose estimation network by using the photometric consistency loss generated by the view synthesis process as the objective function. Building upon this pioneering work, subsequent studies have made significant strides. For example, in [27], Mahjourian et al. further considered the 3D geometry of the whole scene and required that the estimated 3D point cloud be consistent across the continuous images. Similarly, [28] utilized the 3D-2D correspondence constraint and deployed it on an autonomous driving platform via 5G telephony and wireless communication.
Despite its potential, the monocular unsupervised method has two significant limitations. Firstly, it cannot provide global scale consistent pose estimation, and secondly, the photometric consistency loss assumes a static scene without occlusion regions and dynamic objects. For scale ambiguity, Bian et al. [18] introduced geometric consistency loss to cope with scale inconsistencies on different samples, resulting in VO results comparable to the stereo image training model. Sun et al. [29] proposed two constraints that operate on predicted depth and relative poses, enforcing consistency across different training samples and jointly promoting pose and depth estimation. For dynamic objects and occlusion areas in the scene, an intuitive method is to design a mask to remove these outliers to avoid their adverse effects on the calculation of reconstruction loss. Therefore, researchers have proposed a variety of mask-generation methods. Some of them generate masks by adding networks, such as the explainability mask [15] proposed by Zhou et al., the uncertain map [16] proposed by Klodt et al., and the confidence mask [17] proposed by Chen et al. Alternatively, some methods generate masks through computation, such as those in [18][19][20][21]. In [18], Bian et al. calculated depth inconsistency to generate a self-discovered mask. Zhao et al. [19] generated an outlier elimination mask by analyzing the consistency of forward and backward optical flows. Wang et al. [20] derive overlap and blank masks during forward calculation, while Jiang et al. [21] generate a mask based on the assumption that the outlier reconstruction error is significantly greater than the average photometric error. These methods have a positive impact on reconstruction loss computation.
Additionally, several studies have explored the potential benefits of jointly training the subtasks of scene depth, camera pose, and optical flow by exploiting their inherent geometric correlation, thus allowing for better nonlinear solutions when constraints are added to the loss function [30]. For instance, Yin et al. [31] proposed a collaborative learning framework to estimate all three subtasks and reconstruct a view containing both static and dynamic scenes using the geometric relationships between them. Zhang et al. [32] introduced an optical flow estimation network and added further supervisory signals to the training of the framework through multi-view synthesis to improve the overall estimation accuracy. Based on this joint estimation framework, Zou et al. [33] also proposed an optical flow consistency loss for rigid regions in the scene to improve their results. Finally, Ranjan et al. [34] added a motion segmentation task to the above three tasks, and the individual performance of each task was enhanced by joint training.
Although these methods have shown considerable advancements in the basic models, their treatment of changing scenes is limited to calculating the loss function in view synthesis, ignoring the impact on VO estimation. Moreover, the detrimental effects on training when the photometry is inconsistent are not considered.

Methods
This paper aims to learn depth and camera pose from unlabeled monocular video sequences while remaining robust to outliers in the scene, such as dynamic objects and occluded areas. Our framework uses training samples composed of three consecutive frames, with the middle frame as the target view and the other two as source views, yielding two target-source image pairs. Since the operations performed on the two image pairs are the same, we only show the process for one image pair, as shown in Figure 1. Our method comprises four sub-networks: (1) DepthNet, which estimates the depth map of a single image; (2) FlowNet, an off-the-shelf model requiring no training, which estimates the optical flow between adjacent frames; (3) MaskNet, which estimates a mask of outliers from two adjacent frames and whose supervisory signal mainly comes from the multiple masks calculated in the view synthesis process; and (4) PoseNet, which takes the mask-preprocessed optical flow as input and outputs the camera motion.
We train the DepthNet, MaskNet and PoseNet sub-networks jointly with a final loss function comprising four parts: a photometric consistency loss, which is the primary supervisory signal used during network training; a depth smoothness loss, which ensures the smoothness of the estimated depth values; a mask loss, which enhances the model's performance by training a mask to preprocess PoseNet's input; and a geometric consistency loss, which provides an additional weak supervisory signal by adding several constraints to the model. Therefore, the final loss function can be expressed as
$$L = \sum_{l} \left( L_{ph}^{l} + \lambda_s L_{sm}^{l} + \lambda_m L_{m}^{l} + L_{g}^{l} \right)$$
with $L_{ph}^{l}$, $L_{sm}^{l}$, $L_{m}^{l}$ and $L_{g}^{l}$ denoting the photometric consistency loss, smoothness loss, mask loss, and geometric consistency loss, respectively, and $\lambda_s$ and $\lambda_m$ representing the corresponding weight values. The parameter $l$ indexes the image scale: similar to previous work [15], the DepthNet outputs four depth estimation maps at different scales, with the loss function being calculated for each scale separately.
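As a minimal sketch of how the four terms might be accumulated over scales, the snippet below assumes that each scale contributes a weighted sum in which only the smoothness and mask terms carry explicit weights. The function name `total_loss`, the dictionary keys, and all numeric values are hypothetical and serve only as an illustration.

```python
def total_loss(per_scale_terms, lambda_s, lambda_m):
    """Sum the four loss terms over all scales.

    per_scale_terms: one dict per scale l with keys "ph", "sm", "m", "g".
    lambda_s, lambda_m: weights of the smoothness and mask losses
    (hypothetical values; the text does not state them here).
    """
    total = 0.0
    for terms in per_scale_terms:
        total += (terms["ph"] + lambda_s * terms["sm"]
                  + lambda_m * terms["m"] + terms["g"])
    return total


# Toy numbers for two scales, for illustration only.
scales = [
    {"ph": 0.20, "sm": 0.05, "m": 0.30, "g": 0.10},
    {"ph": 0.10, "sm": 0.02, "m": 0.20, "g": 0.05},
]
loss = total_loss(scales, lambda_s=0.1, lambda_m=0.5)
```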

Photometric Consistency Loss and Smoothness Loss
The primary supervisory signal for training the entire framework is derived from the photometric consistency loss generated during view synthesis. The process is illustrated in Figure 2. Given two consecutive input frames $(I_t, I_{t+1})$, taken as the target view and the source view, the DepthNet estimates the depth map $D_t$ of the target view. Using this depth information, a pixel $p_t$ in the image (in homogeneous coordinates) can be back-projected to a 3D point as follows:
$$Q_t = D_t(p_t) K^{-1} p_t$$
where $K$ is the intrinsic camera matrix and $Q_t$ is the 3D point. The camera motion $T_{t \to t+1}$ estimated by the PoseNet is used to transform $Q_t$ to the coordinates at frame $t+1$ via $Q_{t+1} = T_{t \to t+1} Q_t$. This 3D point can then be projected onto the image plane at frame $t+1$ using the intrinsic camera matrix $K$. Therefore, by using these projection transformation relations, we can get the projection relationship between the coordinates of the target view and the source view, which is expressed as:
$$p_{t+1} \sim K T_{t \to t+1} D_t(p_t) K^{-1} p_t$$
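The back-projection, rigid transformation, and reprojection steps described above can be sketched for a single pixel as follows. This is an illustrative sketch in plain Python rather than the authors' implementation: the helper names `mat_vec` and `reproject`, the toy intrinsics, and the pure-translation pose are all assumptions, and a practical implementation would vectorize these operations over the whole image.

```python
def mat_vec(m, v):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def reproject(p_t, depth, K, K_inv, T):
    """Map pixel p_t = (u, v) of the target view into the source view.

    depth: estimated depth D_t(p_t).
    K, K_inv: 3x3 intrinsic matrix and its inverse.
    T: 4x4 rigid transform from target-camera to source-camera coordinates.
    Returns the projected pixel (x, y) and the depth in the source frame.
    """
    u, v = p_t
    # Back-project: Q_t = D_t(p_t) * K^{-1} * [u, v, 1]^T
    ray = mat_vec(K_inv, [u, v, 1.0])
    q_t = [depth * c for c in ray]
    # Rigid transform in homogeneous coordinates: Q_{t+1} = T * Q_t
    q_s = mat_vec(T, q_t + [1.0])[:3]
    # Perspective projection with K, followed by division by the depth.
    x, y, z = mat_vec(K, q_s)
    return (x / z, y / z), q_s[2]


# Toy camera: focal length 100 px, principal point (50, 40).
K = [[100.0, 0.0, 50.0], [0.0, 100.0, 40.0], [0.0, 0.0, 1.0]]
K_INV = [[0.01, 0.0, -0.5], [0.0, 0.01, -0.4], [0.0, 0.0, 1.0]]
# Hypothetical pose: no rotation, points shift by -1 along x.
T = [[1.0, 0.0, 0.0, -1.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]]

p_src, z_src = reproject((60.0, 40.0), 10.0, K, K_INV, T)
```

With these toy values, the pixel at (60, 40) with depth 10 lands close to (50, 40) in the source view, i.e., it shifts by about 10 pixels to the left.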

To reconstruct the target view $I_t$, we need to obtain the corresponding pixel values of the reconstructed frame $\hat{I}_t$ based on the projected position on frame $I_{t+1}$. However, the projected coordinates are generally not integers, so we use bilinear interpolation to obtain the pixel value at $p_{t+1}$. Specifically, we linearly interpolate the values of the four pixels (top-left, top-right, bottom-left, and bottom-right) around $p_{t+1}$ using the formula:
$$\hat{I}_t(p_t) = I_{t+1}(p_{t+1}) = \sum_{i \in \{top, bottom\},\, j \in \{left, right\}} w^{ij} I_{t+1}(p_{t+1}^{ij})$$
where $w^{ij}$ is the proportional term for bilinear interpolation, measuring the spatial proximity of $p_{t+1}$ and $p_{t+1}^{ij}$, with $\sum_{i,j} w^{ij} = 1$. Then, the synthesized view $\hat{I}_t$ is obtained.
Assuming that the photometric values of 3D spatial points projected onto the target view and the source view are equal, it can be theoretically deduced that the photometric values of the target view and the synthesized target view obtained through interpolation using the source view should be consistent. This property is used to construct the loss function for training the entire framework. Similar to other works [20,21], we use the combination of the L1 norm and the structural similarity index measure (SSIM) [35] to construct the photometric consistency loss, which is expressed as:
$$L_{ph} = \alpha \frac{1 - \text{SSIM}(I_t, \hat{I}_t)}{2} + (1 - \alpha) \left\| I_t - \hat{I}_t \right\|_1$$
where $\alpha$ is set to 0.85 empirically.
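The bilinear sampling step and the combination of L1 and SSIM terms can be illustrated with the following sketch. It is a simplified illustration under stated assumptions: images are grayscale lists of rows with values in [0, 1], SSIM is evaluated over a single patch rather than a sliding window, and the constants `c1` and `c2` follow the commonly used SSIM defaults. The function names are hypothetical.

```python
import math


def bilinear_sample(img, x, y):
    """Sample a grayscale image (list of rows) at a real-valued position.

    Returns None when any of the four neighbours lies outside the image.
    """
    h, w = len(img), len(img[0])
    x0, y0 = math.floor(x), math.floor(y)
    x1, y1 = x0 + 1, y0 + 1
    if x0 < 0 or y0 < 0 or x1 > w - 1 or y1 > h - 1:
        return None
    ax, ay = x - x0, y - y0
    # The four weights are non-negative and sum to 1.
    w_tl, w_tr = (1 - ax) * (1 - ay), ax * (1 - ay)
    w_bl, w_br = (1 - ax) * ay, ax * ay
    return (w_tl * img[y0][x0] + w_tr * img[y0][x1]
            + w_bl * img[y1][x0] + w_br * img[y1][x1])


def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """SSIM between two equally sized patches given as flat lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    num = (2 * mx * my + c1) * (2 * cov + c2)
    den = (mx ** 2 + my ** 2 + c1) * (vx + vy + c2)
    return num / den


def photometric_error(target, synth, alpha=0.85):
    """Blend the SSIM term and the mean absolute difference of two patches."""
    l1 = sum(abs(a - b) for a, b in zip(target, synth)) / len(target)
    return alpha * (1 - ssim(target, synth)) / 2 + (1 - alpha) * l1


center = bilinear_sample([[0.0, 1.0], [2.0, 3.0]], 0.5, 0.5)
patch = [0.2, 0.4, 0.6, 0.8]
same = photometric_error(patch, patch)
flipped = photometric_error(patch, patch[::-1])
```

Sampling at the center of the 2x2 toy image averages the four neighbours, and the error of a patch against itself is essentially zero, while the reversed patch yields a clearly positive error.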
During the training of the depth estimation model, the edge smoothness loss is often used to filter out incorrect predictions and preserve clear details. We use the same loss function as [31], expressed as
$$L_{sm} = \sum_{p_t} \left| \nabla D_t(p_t) \right| \cdot \left( e^{-\left| \nabla I_t(p_t) \right|} \right)^{T}$$
where $|\cdot|$ denotes the elementwise absolute value, $\nabla$ represents the vector differential operator, and $T$ denotes the transpose of image gradient weighting.
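The edge-aware smoothness term can be sketched with forward differences as below. This is an illustrative sketch only: it assumes a grayscale image and a depth map stored as lists of rows, and it averages rather than sums the weighted gradients so that the value does not depend on image size. The function name is hypothetical.

```python
import math


def smoothness_loss(depth, image):
    """Edge-aware first-order smoothness of a depth map.

    Depth gradients are down-weighted by exp(-|image gradient|), so depth
    discontinuities that coincide with image edges are penalized less.
    """
    h, w = len(depth), len(depth[0])
    total, count = 0.0, 0
    for y in range(h):
        for x in range(w):
            if x + 1 < w:  # horizontal forward difference
                d_grad = abs(depth[y][x + 1] - depth[y][x])
                i_grad = abs(image[y][x + 1] - image[y][x])
                total += d_grad * math.exp(-i_grad)
                count += 1
            if y + 1 < h:  # vertical forward difference
                d_grad = abs(depth[y + 1][x] - depth[y][x])
                i_grad = abs(image[y + 1][x] - image[y][x])
                total += d_grad * math.exp(-i_grad)
                count += 1
    return total / count if count else 0.0


depth_step = [[1.0, 1.0, 2.0, 2.0]]
flat_image = [[0.5, 0.5, 0.5, 0.5]]
edge_image = [[0.0, 0.0, 1.0, 1.0]]
loss_flat = smoothness_loss(depth_step, flat_image)
loss_edge = smoothness_loss(depth_step, edge_image)
```

In this toy case the same depth step costs less when the image also has an edge at that location, which is the intended behaviour of the weighting.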

Calculated Mask and Mask Loss
View synthesis relies on several assumptions, such as a static scene free of dynamic objects and occlusions, so training images that violate these assumptions will inevitably hinder model training. Therefore, it is necessary to consider the influence of these factors when using the view synthesis process to calculate the loss function. A common principle is to design masks to shield these outlier regions and prevent them from participating in the calculation of the loss function, thereby improving the performance of the model.
During the view synthesis process, when the pixels in the target view are projected onto the source view, some pixels will be projected beyond the imaging plane of the source view, so that their pixel values cannot be reconstructed. As shown in Figure 3, suppose there are two points $(p_t^1, p_t^2)$ in the target view $I_t$. When the camera moves to the frame $I_{t+1}$, the projection of $p_t^1$ onto source view $I_{t+1}$ is $p_{t+1}^1$, and the projection of $p_t^2$ is $p_{t+1}^2$, which is located outside the boundary of $I_{t+1}$, making it impossible to determine the pixel value of that point accurately. Therefore, we mark all these points projected outside the boundary as 0, thus generating a boundary mask denoted as $M_e$. After being multiplied by the target view, the mask excludes these boundary points from the calculation of the loss function.
In addition to the boundary points, the occluded areas in the scene will also affect the training of the model. For example, as shown in Figure 4a, when a car equipped with a camera travels from frame $I_t$ to frame $I_{t+1}$, there is an obstacle on the left side of the yellow car. In frame $I_t$, the camera can capture the areas of $S_1$ and $S_2$, but in frame $I_{t+1}$, the area of $S_1$ will be occluded by the area of $S_2$, and only the area of $S_2$ can be seen. Thus, when the areas of $S_1$ and $S_2$ captured in frame $I_t$ are projected onto frame $I_{t+1}$, they overlap in coordinates, making it impossible to use the image captured at frame $I_{t+1}$ to restore the area of $S_1$ captured at frame $I_t$.
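The boundary mask described above amounts to a per-pixel bounds check on the projected coordinates. A minimal sketch, assuming the projected positions are available as a grid of (x, y) tuples and using a hypothetical function name:

```python
def boundary_mask(proj_coords, height, width):
    """Mark pixels whose projection falls inside the source view.

    proj_coords: grid (list of rows) of projected (x, y) positions.
    Returns a grid of the same shape with 1 for valid and 0 for
    out-of-boundary projections.
    """
    return [[1 if 0 <= x <= width - 1 and 0 <= y <= height - 1 else 0
             for (x, y) in row]
            for row in proj_coords]


# Toy 2x2 target view projected into a source view of height 2, width 3.
coords = [[(0.5, 0.5), (3.2, 0.0)],
          [(-0.1, 1.0), (1.0, 0.9)]]
mask_boundary = boundary_mask(coords, height=2, width=3)
```

Here the second and third projections leave the image on the right and left side, respectively, so they are marked as 0.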
When occluded areas appear in the image, they must be marked to avoid their participation in calculating the loss function. As shown in Figure 4b, when two pixels $(p_t^1, p_t^2)$ in frame $I_t$ are projected onto frame $I_{t+1}$, if they fall in the same grid, that is, they have four identical interpolation points, we consider occlusion to have occurred. According to the distance between these two pixels and the camera, we mark the farther point as 0 and the closer point as 1. In Figure 4b, we assume that $p_t^1$ is closer, so it is marked as 1, while $p_t^2$ is marked as 0. This generates a mask denoted as $M_o$, which is used to identify the occlusion area.
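The occlusion test can be sketched as a grouping of projected pixels by the grid cell they fall into, keeping only the closest pixel per cell. The sketch assumes that the distance to the camera is measured by the depth of each point after transformation to frame $t+1$, and it uses flat lists indexed by pixel; both choices, like the function name, are illustrative assumptions.

```python
import math


def occlusion_mask(proj, depths):
    """Keep, per source-view grid cell, only the pixel closest to the camera.

    proj: list of projected (x, y) positions, one per target pixel.
    depths: list of the corresponding depths in the source camera frame.
    Returns a list with 1 for the closest pixel of each cell, 0 otherwise.
    """
    nearest = {}  # grid cell -> index of the closest pixel seen so far
    for i, (x, y) in enumerate(proj):
        # Pixels sharing this cell share the same four interpolation points.
        cell = (math.floor(x), math.floor(y))
        if cell not in nearest or depths[i] < depths[nearest[cell]]:
            nearest[cell] = i
    mask = [0] * len(proj)
    for i in nearest.values():
        mask[i] = 1
    return mask


proj = [(2.3, 1.7), (2.8, 1.2), (5.1, 0.4)]
depths = [12.0, 7.5, 9.0]
mask_occ = occlusion_mask(proj, depths)
```

The first two pixels share the cell (2, 1); the second is closer to the camera, so the first is marked as occluded.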
For pixels that satisfy the assumptions of view synthesis, the photometric error gradually converges to a low value during training. In contrast, for pixels affected by various adverse factors, the photometric error always remains high [21]. Using this characteristic, points with photometric errors much higher than the average are identified as outliers. Precisely, for each pixel $p_t$ on the target view, we determine whether it is an outlier according to the following expression:
$$M_a(p_t) = \begin{cases} 1, & L_{ph}(p_t) \le \beta \bar{L}_{ph} \\ 0, & \text{otherwise} \end{cases}$$
where $\bar{L}_{ph}$ is the average value of all photometric errors, $M_a$ is the resulting mask, and $\beta$ is a weight that reflects the tolerance for outliers: the larger the value, the more outliers are retained. We set $\beta = 1.5$ empirically.
The three masks mentioned above can effectively remove the majority of outliers. However, there are still some special cases that need to be considered, such as objects moving at a speed similar to that of the camera or scenes where the camera is stationary, both of which violate the assumptions required for the view synthesis process, i.e., a moving camera and a stationary scene. To address this issue, we adopt the auto-masking technique used in [12], i.e., the photometric error of the synthesized target view should be less than the photometric error calculated directly using the source view. The mask, denoted as $M_s$, is expressed as follows:
$$M_s = \left[ L_{ph}(I_t, \hat{I}_t) < L_{ph}(I_t, I_{t+1}) \right]$$
where $[\cdot]$ equals 1 when the condition holds and 0 otherwise. The minimum projection technique proposed in [12] addresses the problem of occluded areas in the scene; in essence, it is also a mask, which we denote by $M_m$ and calculate by the following expression:
$$M_m = \left[ L_{ph}(I_t, \hat{I}_t) = \min_{s} L_{ph}(I_t, \hat{I}_t^{s}) \right]$$
where $\hat{I}_t^{s}$ denotes the target view synthesized from source view $s \in \{t-1, t+1\}$. Then, all the masks are combined using element-wise logical conjunction, as shown below, to generate the final mask, denoted $M_f$:
$$M_f = M_e \wedge M_o \wedge M_a \wedge M_s \wedge M_m$$
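The thresholding, auto-masking, and conjunction steps can be illustrated on flat lists of per-pixel photometric errors. This sketch assumes a non-strict comparison against $\beta$ times the mean error and a strict comparison for the auto-mask; the function names and the toy error values are hypothetical.

```python
def outlier_mask(errors, beta=1.5):
    """Mark pixels whose error does not exceed beta times the mean error."""
    mean_err = sum(errors) / len(errors)
    return [1 if e <= beta * mean_err else 0 for e in errors]


def auto_mask(err_synth, err_source):
    """Keep pixels where the synthesized view matches the target better
    than the unwarped source view does."""
    return [1 if s < i else 0 for s, i in zip(err_synth, err_source)]


def combine_masks(*masks):
    """Element-wise logical conjunction of any number of binary masks."""
    return [int(all(values)) for values in zip(*masks)]


err_synth = [0.1, 0.1, 0.1, 0.9]      # error of the synthesized view
err_source = [0.3, 0.05, 0.2, 1.0]    # error of the raw source view
m_a = outlier_mask(err_synth)
m_s = auto_mask(err_synth, err_source)
m_f = combine_masks(m_a, m_s)
```

In this toy case the last pixel is rejected as a high-error outlier, and the second pixel is rejected because warping did not reduce its error; the remaining masks would be combined in the same way.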
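Finally, the photometric consistency loss is updated as follows:
$$L_{ph} = M_f \odot \left( \alpha \frac{1 - \text{SSIM}(I_t, \hat{I}_t)}{2} + (1 - \alpha) \left\| I_t - \hat{I}_t \right\|_1 \right)$$
where $\odot$ denotes element-wise multiplication.
The final mask obtained through forward propagation also serves as a supervisory signal for the MaskNet. The MaskNet takes the target-source image pairs as input and generates a mask of estimated outliers. Unlike the final mask, the values in the generated mask are not binary but continuous, ranging from 0 to 1, making it more convenient for model training. The MaskNet is trained by minimizing the difference between the estimated mask ($M_e$, which here denotes the MaskNet output rather than the boundary mask) and the calculated mask ($M_f$). The loss function is as follows:
$$L_m = -\frac{1}{n} \sum_{i=1}^{n} \left[ M_f^{i} \log M_e^{i} + (1 - M_f^{i}) \log \left( 1 - M_e^{i} \right) \right]$$
where $n$ represents the number of elements in $M_f$ and $M_e$, and $\log$ is the logarithm function.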
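The masked photometric loss and the mask loss can be sketched as follows. The sketch assumes that the masked error is averaged over the valid pixels and that the mask loss takes the form of a binary cross-entropy between the continuous MaskNet output and the binary calculated mask; the clamping constant `eps` and the function names are hypothetical additions for numerical safety and illustration.

```python
import math


def masked_mean(errors, mask):
    """Average the per-pixel errors over the pixels kept by the mask."""
    valid = sum(mask)
    if valid == 0:
        return 0.0
    return sum(e * m for e, m in zip(errors, mask)) / valid


def mask_bce(estimated, target, eps=1e-7):
    """Binary cross-entropy between an estimated mask with values in [0, 1]
    and a binary target mask."""
    total = 0.0
    for m_est, m_tgt in zip(estimated, target):
        m_est = min(max(m_est, eps), 1.0 - eps)  # keep the logarithm finite
        total += m_tgt * math.log(m_est) + (1 - m_tgt) * math.log(1.0 - m_est)
    return -total / len(target)


loss_ph = masked_mean([0.2, 0.8, 0.4], [1, 0, 1])
loss_m = mask_bce([0.9, 0.1], [1, 0])
```

The masked mean ignores the second pixel entirely, and the cross-entropy is small because the estimated mask already agrees with the calculated one.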

Figure 5. Two examples of our proposed mask visualization. The goal of the calculated and estimated masks is to mitigate the negative impact of challenging scenes on the view synthesis process and VO estimation, respectively.

Geometric Consistency Loss
To address the issue that the photometric consistency loss function fails to hold in scenes with significant illumination variations, we propose the geometric consistency loss with its generation mechanism illustrated in Figure 6. This loss leverages the geometric constraints inherent to the 3D scene, rendering it robust to illumination variations and serving as a complementary measure to the photometric consistency loss. The geometric consistency loss comprises optical flow consistency loss, depth consistency loss, and pose consistency loss, which is expressed as follows:
$$L_g = \lambda_f L_{flo} + \lambda_d L_{dep} + \lambda_p L_{pos}$$
with $L_{flo}$, $L_{dep}$ and $L_{pos}$ denoting the optical flow consistency loss, the depth consistency loss, and the pose consistency loss, respectively, and $\lambda_f$, $\lambda_d$ and $\lambda_p$ being the corresponding weighting coefficients.
Figure 6. The generation mechanism of the geometric consistency loss function. The geometric consistency loss consists of three components: optical flow consistency loss, depth consistency loss, and pose consistency loss. The optical flow consistency loss is calculated from the difference between the estimated optical flow and the calculated projected optical flow; the depth consistency loss is calculated from the difference between the estimated depth and the inverse warping depth [18]; and the pose consistency loss is obtained in three frames of snippets by ensuring a tight coupling of the transformation matrices with each other.
To better extract the geometric features in the image when estimating the camera pose, we use the optical flow estimated by FlowNet as an input. In addition, the estimated optical flow ($F_f$) provides an additional supervised signal to the framework. Specifically, using the estimated depth and the estimated camera pose, we can calculate the projected optical flow ($F_{cal}$), i.e., the displacement between each target-view pixel and its projection into the source view. After removing outliers, $F_{cal}$ should theoretically be consistent with $F_f$. The optical flow consistency loss is therefore calculated from the difference between $F_{cal}$ and $F_f$ over $V$, where $V$ denotes the valid region after excluding the outliers.
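As an illustration of how a projected optical flow can be obtained from depth and pose, the sketch below back-projects a pixel with a pinhole camera model, applies a rigid transformation, and re-projects it. The pinhole parameterization (`fx, fy, cx, cy`), the representation of the pose as a rotation matrix `R` and translation `t`, the use of an L1 difference, and all function names are assumptions made for this example, not details confirmed by the paper.

```python
def projected_flow(u, v, depth, intrinsics, R, t):
    """Flow induced at pixel (u, v) by camera motion (R, t), given its depth."""
    fx, fy, cx, cy = intrinsics
    # Back-project the pixel to a 3D point in the target camera frame.
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    z = depth
    # Move the point into the source camera frame.
    xs = R[0][0] * x + R[0][1] * y + R[0][2] * z + t[0]
    ys = R[1][0] * x + R[1][1] * y + R[1][2] * z + t[1]
    zs = R[2][0] * x + R[2][1] * y + R[2][2] * z + t[2]
    # Re-project and subtract the original pixel coordinates.
    us = fx * xs / zs + cx
    vs = fy * ys / zs + cy
    return us - u, vs - v

def flow_consistency_loss(flow_cal, flow_est, valid):
    """Mean L1 difference between two flows over the valid pixels only."""
    total, count = 0.0, 0
    for (cu, cv), (eu, ev), ok in zip(flow_cal, flow_est, valid):
        if ok:
            total += abs(cu - eu) + abs(cv - ev)
            count += 1
    return total / count if count else 0.0
```

For example, with an identity pose the projected flow is zero everywhere, and pixels flagged as outliers contribute nothing to the loss.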
Since the correspondence between the target view and source view coordinates can be determined computationally during the projection process, there is also a correspondence between the target view depth map and the source view depth map estimated by DepthNet. That is, the source view depth map can be inversely warped into the target view using the projected optical flow, and the result should be consistent with the target view depth map estimated by DepthNet [18]. Therefore, we also use the L1 norm to calculate their difference, which gives the depth consistency loss.
The pose consistency loss is obtained in the three-frame snippet by ensuring that the transformation matrices are closely coupled. Specifically, with the PoseNet network, the pose information between each pair of the three image frames can be estimated, denoted as $T_{t-1 \to t}$, $T_{t \to t+1}$, and $T_{t-1 \to t+1}$. Using the transformation relationship between them, i.e., $T_{t-1 \to t} \cdot T_{t \to t+1} = T_{t-1 \to t+1}$, the pose consistency loss penalizes the discrepancy between the composed transformation and the directly estimated one, which further constrains the entire model.
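The pose consistency idea can be sketched with 4 × 4 homogeneous transformation matrices. The multiplication order below follows the relationship stated in the text; the choice of a mean absolute difference as the discrepancy measure and the helper names (`matmul4`, `translation`, `pose_consistency_loss`) are assumptions made for this example.

```python
def matmul4(a, b):
    """Product of two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def translation(tx, ty, tz):
    """Homogeneous transform for a pure translation (illustrative helper)."""
    return [[1.0, 0.0, 0.0, tx],
            [0.0, 1.0, 0.0, ty],
            [0.0, 0.0, 1.0, tz],
            [0.0, 0.0, 0.0, 1.0]]

def pose_consistency_loss(T_prev_cur, T_cur_next, T_prev_next):
    """Mean absolute difference between the composed transform
    T_prev_cur * T_cur_next and the directly estimated T_prev_next."""
    composed = matmul4(T_prev_cur, T_cur_next)
    diff = sum(abs(composed[i][j] - T_prev_next[i][j])
               for i in range(4) for j in range(4))
    return diff / 16.0
```

When the three estimated transformations are mutually consistent the loss is zero, and any drift between the composed and the direct estimate makes it positive.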

Experiments
In this section, we conduct several experiments to evaluate the estimation results of our pose and depth models and visualize the estimated results. We also conduct ablation experiments to validate the effectiveness of the strategies we use, as well as tests on an unseen dataset to verify the generalization ability of the model.

Implementation Details
Our framework comprises four sub-networks. For the depth estimation network (DepthNet), we employ a U-shaped architecture [36] with skip connections, which takes a single image as input and outputs the corresponding depth map. The encoder mainly consists of cascaded residual convolutional blocks, with ResNet18 [37], which contains about 11 million trainable parameters, as the underlying network. The decoder primarily consists of cascaded deconvolutional layers. For the pose estimation network (PoseNet), we also use ResNet18 as the encoder, followed by a global average pooling layer before the final prediction to obtain the 6-DoF camera pose (3 for translation and 3 for rotation in Euler angles); its input is the pre-processed optical flow. The mask estimation network (MaskNet) has the same network structure as DepthNet but different inputs: it takes a stack of two RGB frames and outputs the probability of each pixel being an outlier in the scene. Finally, for optical flow estimation, we adopt a pre-trained network, MaskFlownet [38]. To accelerate the training process, the optical flow for all adjacent and intermediate frames is computed in advance and read directly during training.
The framework was implemented on the PyTorch platform [39], and all experiments were performed using a single NVIDIA graphics card (GTX 1080 Ti) and an Intel Core i7 3.6 GHz CPU. We use Adam [40] for optimization. The hyperparameters of the loss function are set to $\lambda_s = 0.001$, $\lambda_m = 0.2$, $\lambda_f = 0.2$, $\lambda_d = 0.2$, and $\lambda_p = 0.5$. To ensure fair comparisons with other works, we cropped the images to a resolution of 640 × 192, which also accelerates the network training. Similar to other works [20,21], we applied data augmentation techniques such as cropping, random scaling, and horizontal flips. For all ResNet18-based models, including DepthNet, PoseNet, and MaskNet, we initialized their encoder parts using the weights pre-trained on ImageNet [41], following the practice of Monodepth2 [12]. The batch size was set to 8, and the initial learning rate was set to 0.0001 and multiplied by 0.6 every 5 epochs. The network was trained for 20 epochs, and the total training time was approximately 30 h.
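The learning rate schedule described above can be written as a simple step decay. The sketch below assumes zero-indexed epochs and that the first decay happens at the start of epoch 5; the function name `learning_rate` is illustrative.

```python
def learning_rate(epoch, base_lr=1e-4, gamma=0.6, step=5):
    """Step decay: multiply the base learning rate by gamma every `step` epochs."""
    return base_lr * gamma ** (epoch // step)

# Learning rate used in each of the 20 training epochs.
schedule = [learning_rate(e) for e in range(20)]
```

Under these assumptions the learning rate takes four distinct values over the 20 epochs. In PyTorch, the same behaviour is typically obtained with `torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.6)`.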

Datasets and Metrics
We conducted our training and testing on the KITTI benchmark [42], one of the most widely used datasets in the field of autonomous driving. It provides 56 driving scenes at a rate of 10 frames per second with an image resolution of approximately 1226 × 370, covering various urban, residential, and highway driving scenarios. In addition, the dataset includes ground truth for camera poses and depth maps, which are derived from multiple modalities such as high-precision LiDAR, GPS, and IMU sensors. Since our method is unsupervised and only requires a sequence of consecutive frames as input, we also pre-trained our model on the Cityscapes dataset [43], which consists of video sequences of cars driving in over 50 cities and stereo data without annotations. Additionally, we validated the generalization performance of our trained model on the Make3D dataset [44], which includes single-view images and corresponding low-resolution depth maps but no monocular sequences or stereo images.
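As in previous work [15], we adopted the absolute trajectory error (ATE) to evaluate pose estimation and the synthetic policy [12] to evaluate depth estimation. The ATE is defined as follows:

$$\mathrm{ATE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left\|\mathrm{trans}\left(Q_i^{-1} S P_i\right)\right\|^2}$$

where $P_i$ and $Q_i$ represent the estimated pose and its corresponding ground truth, $S$ denotes the similarity transformation matrix, $\mathrm{trans}$ represents the translation component [45], and $N$ is the number of poses.
Depth estimation was evaluated using various metrics, including absolute relative error (Abs. Rel), which measures the relative error between predicted and ground truth values; square relative error (Sq. Rel), which squares the depth difference and thus accentuates large errors; root mean squared error (RMSE), which reflects the absolute error between predicted and ground truth values; log root mean squared error (RMSE log), which uses logarithmic operations to reduce the impact of outliers on RMSE; and prediction accuracy ($\delta$), which intuitively reflects the accuracy of the predictions. These evaluation standards are defined as follows:

$$\text{Abs. Rel} = \frac{1}{N}\sum_{i=1}^{N}\frac{\left|D_i - D_i^*\right|}{D_i^*}$$

$$\text{Sq. Rel} = \frac{1}{N}\sum_{i=1}^{N}\frac{\left|D_i - D_i^*\right|^2}{D_i^*}$$

$$\text{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(D_i - D_i^*\right)^2}$$

$$\text{RMSE log} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\log D_i - \log D_i^*\right)^2}$$

$$\delta = \frac{1}{N}\left|\left\{ i : \max\left(\frac{D_i}{D_i^*}, \frac{D_i^*}{D_i}\right) < T \right\}\right|$$

where $D_i$ and $D_i^*$ represent the estimated depth and its related ground truth, and $N$ is the number of evaluated pixels. We used $T$ values of $1.25$, $1.25^2$, and $1.25^3$. Lower values for the error metrics (Abs. Rel, Sq. Rel, RMSE, RMSE log) and higher values for the accuracy metric ($\delta$) indicate better performance.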
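The depth metrics listed above are straightforward to compute once the valid pixels have been selected. The following sketch operates on flat lists of positive depth values; the function name `depth_metrics` and the dictionary keys are illustrative choices.

```python
import math

def depth_metrics(pred, gt):
    """Standard depth metrics for paired positive depths (pred vs. ground truth)."""
    n = len(gt)
    pairs = list(zip(pred, gt))
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    sq_rel = sum((p - g) ** 2 / g for p, g in pairs) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    rmse_log = math.sqrt(
        sum((math.log(p) - math.log(g)) ** 2 for p, g in pairs) / n)
    # Accuracy: fraction of pixels whose depth ratio is below each threshold T.
    ratios = [max(p / g, g / p) for p, g in pairs]
    deltas = {k: sum(r < 1.25 ** k for r in ratios) / n for k in (1, 2, 3)}
    return {"abs_rel": abs_rel, "sq_rel": sq_rel, "rmse": rmse,
            "rmse_log": rmse_log, "delta": deltas}
```

A perfect prediction gives zero for all four error metrics and an accuracy of 1 at every threshold.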

Evaluation of Depth Estimation
In the monocular depth estimation experiments, we evaluated the performance of our depth estimation network using the widely used Eigen split of the KITTI dataset [42]. Specifically, our training set consisted of 39,810 monocular triplets, the validation set consisted of 4424 triplets, and the testing set included 697 representative frames. As with other unsupervised methods, each predicted depth map was aligned with the corresponding ground truth depth map using the median ratio [15]. The conventional metrics and the cropping region in [12] were used, and the upper limit of the depth was set to 80 m. Finally, we compared our method with other classic learning-based approaches; the quantitative and qualitative comparisons are shown in Table 1 and Figure 7, respectively. In the supervised signal column of Table 1, "Depth" indicates that the method is supervised, "Stereo" indicates that the method is trained on stereo images with baseline information, and "Mono" indicates that the method is trained solely on monocular image sequences. The third column indicates the dataset used for training, where "K" denotes training only on the KITTI dataset and "CS + K" denotes fine-tuning on the KITTI dataset following pre-training on the Cityscapes dataset. Our method outperforms all the other unsupervised methods, in particular Godard's Monodepth2 [12], a classical depth estimation network. Our model has the same number of parameters as Monodepth2, but its performance is significantly improved by the optimization strategies we have proposed. We also pre-trained our model on the Cityscapes dataset [43] and fine-tuned it on the KITTI dataset. The results (in the bottom part of Table 1) show that increasing the training data can improve the model's performance.
This demonstrates an advantage of unsupervised methods over supervised methods: their estimation accuracy can increase as the training data and the model size grow, without additional labeling effort. For supervised methods, in contrast, increasing the training data requires more work to generate labels, and blindly increasing the model capacity without increasing the data can lead to overfitting. This is why unsupervised methods are increasingly preferred by researchers.
A qualitative comparison of our method with some classical methods is shown in Figure 7. Compared with other methods, our method estimates the boundaries of various objects more clearly, including dynamic objects (moving cars and pedestrians) and distant cars.
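The median-ratio alignment used in this evaluation resolves the scale ambiguity of monocular predictions before the metrics are computed. A minimal sketch is given below; treating non-positive ground truth values as missing measurements and clamping the scaled prediction to the 80 m limit are assumptions of this example, as is the function name `median_scale`.

```python
from statistics import median

def median_scale(pred, gt, max_depth=80.0):
    """Scale predictions so that their median matches the ground truth median.

    Only pixels with a valid ground truth depth (0 < gt <= max_depth) are
    used to estimate the ratio. Returns the scaled predictions and the ratio.
    """
    valid = [(p, g) for p, g in zip(pred, gt) if 0.0 < g <= max_depth]
    ratio = median(g for _, g in valid) / median(p for p, _ in valid)
    scaled = [min(p * ratio, max_depth) for p in pred]
    return scaled, ratio
```

Because a single scalar is estimated per image, this alignment corrects only the global scale and leaves the relative structure of the predicted depth map unchanged.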

Evaluation of VO Estimation
For pose estimation, we conducted experiments on the KITTI odometry dataset, using sequences 00-08 for training and sequences 09 and 10 for testing, following the method of previous work [15]. To ensure a fair comparison, the input sequences were modified from 3 to 5 frames to be consistent with other methods [11][12][13]. Our method's qualitative and quantitative results compared to similar methods are shown in Figure 8 and Table 2, respectively.

Table 2. Absolute trajectory error (ATE) on KITTI odometry sequences 09 and 10.

| Method | Seq. 09 | Seq. 10 |
|---|---|---|
| ORB-SLAM [6] | 0.014 ± 0.008 | 0.012 ± 0.011 |
| Zhou et al. [15] | 0.016 ± 0.009 | 0.013 ± 0.009 |
| Bian et al. [18] | 0.016 ± 0.007 | 0.015 ± 0.015 |
| Mahjourian et al. [27] | 0.013 ± 0.010 | 0.012 ± 0.011 |
| Yin et al. [31] | 0.012 ± 0.007 | 0.012 ± 0.009 |
| Ranjan et al. [34] | 0.012 ± 0.007 | 0.012 ± 0.008 |
| Ours | 0.008 ± 0.005 | 0.007 ± 0.005 |

Our method achieves the best results, with a particularly large margin over Zhou et al. [15]. On the one hand, they use raw RGB images to estimate the pose, which contain redundant information that does not help the network learn the motion of the camera. On the other hand, many outliers in the scene, such as occlusions and dynamic objects, affect their pose estimation model. In contrast, our method uses optical flow as input, which contains camera motion information that can be more easily learned by the network. In addition, the interference of outliers, such as dynamic objects, is effectively avoided by the multiple masking techniques, which leads to a significant improvement in the accuracy of pose estimation.
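To make the trajectory error concrete, the sketch below computes an ATE-style RMSE on a short snippet of camera positions. It is a simplification and not the evaluation protocol of the paper: instead of a full similarity transformation, both trajectories are shifted so that their first position is the origin and a single least-squares scale factor is applied to the prediction. The function name `ate_scale_aligned` is illustrative.

```python
import math

def ate_scale_aligned(pred_xyz, gt_xyz):
    """RMSE of position errors after origin alignment and scale fitting."""
    p0, g0 = pred_xyz[0], gt_xyz[0]
    # Express both trajectories relative to their first position.
    pred = [[a - b for a, b in zip(p, p0)] for p in pred_xyz]
    gt = [[a - b for a, b in zip(g, g0)] for g in gt_xyz]
    # Least-squares scale that best maps the prediction onto the ground truth.
    num = sum(a * b for p, g in zip(pred, gt) for a, b in zip(p, g))
    den = sum(a * a for p in pred for a in p)
    scale = num / den if den > 0 else 1.0
    sq_err = sum((scale * a - b) ** 2
                 for p, g in zip(pred, gt) for a, b in zip(p, g))
    return math.sqrt(sq_err / len(pred_xyz))
```

A prediction that differs from the ground truth only by a global scale and offset gives an error of zero under this simplified alignment, which reflects the scale ambiguity of monocular pose estimation.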


Ablation Study
To verify the effectiveness of our proposed strategies, we conducted an ablation study on the entire framework. Since the DepthNet and the PoseNet are trained jointly and their accuracies affect each other, it is sufficient to test only one sub-network. Here we test the PoseNet, using sequences 00-08 of the KITTI dataset as the training set and sequences 09 and 10 as the test set. The results of the ablation experiments are shown in Table 3.
The baseline model uses no mask, and its training loss includes only the photometric consistency loss and the smoothness loss. $M_f$ denotes the model that calculates the mask used to eliminate outliers during training and applies it when calculating the photometric consistency loss. $M_e$ denotes the model that not only calculates the mask $M_f$ during training but also adds $L_m$ to the final loss function to train the MaskNet and preprocesses the input of PoseNet with the estimated mask $M_e$. $L_{flo}$, $L_{dep}$, and $L_{pos}$ denote the models that use only the optical flow consistency constraint, only the depth consistency constraint, and only the pose consistency constraint, respectively, and $L_{full}$ denotes the complete model, in which two masks and three consistency constraints are used. We evaluate each improved component of the proposed monocular system and also remove it from the whole system to demonstrate its effectiveness indirectly. It can be seen from Table 3 that the accuracy of the baseline is the worst of all models, and all the strategies we propose help to improve the accuracy of the framework. In addition, we found that the improvement from $L_{dep}$ and $L_{pos}$ was relatively small, while the improvement from $L_{flo}$ was relatively large. We believe this is because the supervised signal used by the optical flow consistency constraint comes from a pre-trained network, and this additional supervised signal further improves the accuracy of the model. Depth consistency and pose consistency, in contrast, are implicit constraints within the model. They are mainly designed to ensure global consistency in pose estimation: since the input consists of consecutive frames, the scale consistency constraint between every two frames also ensures that the consecutive frames maintain the same scale, thus ensuring the global consistency of pose estimation.

Generalization on Make3D Dataset
To verify the generalization of our trained model, we conducted tests on the Make3D [44] dataset using the model trained on the KITTI and Cityscapes datasets. This means that our model has never encountered any images from the Make3D dataset before. The qualitative and quantitative results of the depth estimation network are shown in Figure 9 and Table 4. It can be observed that our method has certain advantages over other methods of the same type. However, due to different domain biases in different datasets, there is still room for improvement in the performance of our model compared to the results obtained on the KITTI dataset.

Figure 9. A qualitative comparison of our method with that of Zhou et al. [15] and Godard et al. [12] on the Make3D dataset.

Conclusions
This paper introduces a novel unsupervised learning framework for estimating scene depth and camera pose from video sequences, focusing on challenging scenes. Our proposed method incorporates multiple mask techniques to identify and eliminate the influence of outliers in challenging scenes during the view synthesis process and pose estimation. Furthermore, we have proposed several geometric consistency loss functions as additional supervised signals to enhance the performance of our model. We conducted evaluation and ablation experiments on the KITTI dataset, and the results validate the effectiveness of our contributions. Our framework has promising potential for addressing the challenges of estimating scene depth and camera pose in real-world scenarios. Future work can extend our method to handle more complex scenes, build the framework on high-capacity models such as Transformers, and train it with more data.