Learning Effective Geometry Representation from Videos for Self-Supervised Monocular Depth Estimation

Abstract: Recent studies on self-supervised monocular depth estimation have achieved promising results, mainly through the joint optimization of depth and pose estimation via a high-level photometric loss. However, how to learn a latent and beneficial task-specific geometry representation from videos is still far from being explored. To tackle this issue, we propose two novel schemes to learn a more effective representation from monocular videos: (i) an Inter-task Attention Module (IAM) to learn the geometric correlation representation between the depth and pose learning networks, making structure and motion information mutually beneficial; (ii) a Spatial-Temporal Memory Module (STMM) to exploit the long-range geometric context representation among consecutive frames both spatially and temporally. Systematic ablation studies are conducted to demonstrate the effectiveness of each component. Evaluations on KITTI show that our method outperforms current state-of-the-art techniques.


Introduction
Understanding the 3D structure of scenes is an essential topic in machine perception and plays a crucial part in applications such as autonomous driving, robot vision, and virtual reality [1][2][3][4]. In most scenarios, the input videos contain a vast amount of latent geometric information. One of the key challenges in this domain is how to acquire an effective task-specific geometry representation from videos to obtain more accurate and reliable depth information.
Recently, there have been several successful attempts [1,2,5] to perform monocular depth estimation and visual odometry prediction together in a self-supervised manner by fully exploiting the transformation between consecutive frames. In this pipeline, two networks are generally used to predict the depth and camera pose separately; their outputs are then jointly used to warp source frames to the target ones, converting the depth estimation problem into a reprojection error minimization process, as shown in Figure 1a.
Despite various extensions of the self-supervised pipeline that add more penalty terms [5][6][7] or combine it with other tasks (optical flow or segmentation) [8,9], these methods only design high-level loss functions to combine and regularize network learning. They neglect to leverage valuable geometry representation from videos, e.g., inter-task geometric correlation learning, inter-frame long-range dependency learning, and 3D geometry consistency representation from continuous frames.
Intuitively, modeling the process of perceiving 3D structure from videos can be informed by human experience. According to research in biology and neuroscience [10], human brains process motion information during the inference of depth, and conversely, the perceived depth information can bring significant benefits to motion estimation [11]. Inspired by this biological mechanism, we present an Inter-task Attention Module (IAM) to guide feature-level inter-task geometric correlation learning. It enhances the interaction between the depth and pose estimation networks and makes structure and motion information mutually beneficial, improving estimation accuracy.
Furthermore, many psychologists believe that humans rely not only on immediate sensory feedback but also on perception memories from the past to understand an environment [12,13]. Similarly, it is important to help networks learn a representation that leverages long-range context and memorizes historical information in order to resolve ambiguity and achieve more precise perception. Therefore, we introduce a Spatial-Temporal Memory Module (STMM) to learn spatial and temporal dependency from video clips and mimic the above perception mechanism of human beings. After exploring various attention structures, we implement the STMM based on the Non-local network [14], which has been shown to be effective in modeling long-range information.
In summary, the learning process of our method is shown in Figure 1b, and our main contributions are as follows:

• We devise an Inter-task Attention Module to exploit the inter-task geometric correlation between depth and pose estimation networks. It learns attention maps from depth information as guidance to help the pose network identify key regions to be targeted. To the best of our knowledge, this is the first attempt to exploit the inter-task geometric correlation in this way in self-supervised monocular depth estimation.

• We introduce a Spatial-Temporal Memory Module in the depth estimation network to leverage the spatial and temporal geometric context among consecutive frames, which is effective for utilizing historical information and improving estimation results.

• We conduct comprehensive empirical studies on the KITTI dataset, and the single-frame inference result of our method outperforms state-of-the-art methods by a relative gain of 6.6% on the major evaluation metric.

Inter-Task Monocular Video Learning
Ref. [1] proposed a fully unsupervised end-to-end network for training with monocular videos that can jointly predict the depth and pose transformation between consecutive frames. The core technique is a spatial transformer network [15] that synthesizes target frames from source frames, which converts the depth estimation problem into a reprojection error minimization process. This pipeline was subsequently extended by many researchers. Ref. [6] added a feature-based warping loss to the original photometric loss and trained the networks with stereo image pairs to resolve the scale ambiguity. Ref. [2] further proposed an auto-masking strategy to handle situations where the camera is static or objects move at the same speed as the camera, yielding more accurate results. Moreover, some works combined depth and pose estimation with other tasks, e.g., normal, segmentation, and optical flow estimation. Ref. [7] estimated surface normals in scenes and incorporated an edge-aware depth-normal consistency constraint. Ref. [16] used the Mask R-CNN [17] model to extract semantic information and obtain pre-computed object masks to filter out moving objects. Ref. [8] defined a cascaded network to jointly learn depth, pose, and optical flow for handling rigid motions and moving objects separately, using a forward-backward coherence loss. Ref. [18] also presented an architecture to simultaneously learn depth, ego-motion, and optical flow and focused on enforcing cross-task consensus between depth and optical flow. JPerceiver [19] jointly learns depth estimation, visual odometry, and Bird's-Eye-View segmentation.

Despite the progress made by these methods, almost all of them follow the pipeline in [1], which uses separate networks to learn depth, pose, and other tasks without any interaction before they are combined in the final loss. By contrast, we propose an IAM that learns geometry information from the depth network as guidance to help the pose network learn a more valuable representation for pose estimation. Notably, the two tasks are jointly optimized via a high-level photometric error, which enables an interaction between the two networks via gradient back-propagation. The depth task can therefore also benefit from the IAM and learn a more useful representation to improve the estimation results.

Long-Range Representation Learning
Taking videos instead of single images as input is extremely important for many applications, such as autonomous driving, robotic vision, and drones. However, the rich long-range dependency, including spatial and temporal correlations, is still far from being fully utilized to eliminate ambiguity and obtain more consistent estimation. UnDeepVO [20] was proposed as the first end-to-end visual odometry by combining CNNs and two stacked LSTMs to achieve simultaneous representation learning and sequential modelling of the monocular VO. Kumar et al. [21] proposed a convolutional LSTM-based network architecture for depth estimation to capture inter-frame dependencies and variations. Wang et al. [22] also adopted a ConvLSTM architecture, multi-view reprojection, and forward-backward consistency constraints to utilize the temporal information effectively. These research efforts demonstrated that utilizing long-range information from videos helps in learning a more effective representation for both depth and pose networks and in improving the estimation accuracy. However, convolutional and recurrent operations both process a local neighborhood, either in space or time [14]. Moreover, all these RNN methods focused only on temporal dependency learning without long-range spatial context, which offers limited benefit for single-image inference.

Recently, Transformer-based methods [23][24][25][26][27] have attracted researchers' attention by using a stronger backbone to extract a better visual representation. MonoViT [26] introduces a Vision Transformer-based encoder for self-supervised monocular depth estimation, leveraging both local and global reasoning capabilities to achieve state-of-the-art performance on the KITTI dataset. PixelFormer [25] was proposed as a novel pixel query refinement approach for monocular depth prediction, using a Skip Attention Module to effectively fuse global and local features. MonoFormer [27] was introduced as a deep analysis of self-supervised monocular depth estimation models, and the authors proposed methodological enhancements to improve their generalization across various environments. These Transformer-based methods achieve impressive performance but require larger network structures and consume more computational resources.

In this paper, in line with our lightweight network design, we introduce an STMM to learn long-range geometric relationships both spatially and temporally among pixels in consecutive frames. We implement the STMM based on the Non-local network [14] after exploring various attention structures. The STMM is shown to be beneficial for both multi-frame training and single-frame inference.

Problem Definition
A typical self-supervised monocular depth estimation pipeline is built mainly upon the perspective projection among consecutive frames. Taking $\langle I_t, I_{t+1}, \ldots, I_{t+n} \rangle$ as a training video within time window $N$, once the depth $D_{t+n}$ and camera transformation $T_{t+n \to t}$ are obtained, we can warp the source frame $I_{t+n}$ to reconstruct the target frame $\hat{I}_{t+n \to t}$ using the differentiable bilinear sampling approach [15], which can be formulated as

$$D_t^{ij} I_t^{ij} = K \left( R \, D_{t+n}^{ij} K^{-1} I_{t+n}^{ij} + \mathbf{t} \right),$$

where $I_{t+n}^{ij}$ is the homogeneous coordinate of a pixel in image $I_{t+n}$ and $D_{t+n}^{ij}$ denotes its depth value in the view $I_{t+n}$. Given a rotation matrix $R$, translation vector $\mathbf{t}$, and camera intrinsics $K$, the transformed homogeneous coordinate $I_t^{ij}$ and depth $D_t^{ij}$ can be obtained. Thus, the reprojected image coordinate $\hat{I}_{t+n \to t}$ can be acquired by dehomogenization of $D_t^{ij} I_t^{ij}$. Then, the self-supervised learning is conducted based on the difference between the synthetic view $\hat{I}_{t+n \to t}$ and the original view $I_t$:

$$L = L_r\left(\hat{I}_{t+n \to t}, I_t\right).$$

Here, $L_r$ denotes a consistency measurement loss.
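The projection step described above can be illustrated with a minimal NumPy sketch. It assumes the usual pinhole model (back-project each pixel with its depth, apply the rigid transform, project with the intrinsics, then dehomogenize); the function name `reproject_pixels` and the array layout are hypothetical and not taken from the paper, and the bilinear sampling step that produces the synthetic image is omitted.

```python
import numpy as np

def reproject_pixels(depth_src, K, R, t):
    """Map every pixel of one view into the other view (illustrative sketch).

    depth_src: (H, W) depth map of the view being reprojected
    K: (3, 3) camera intrinsics, R: (3, 3) rotation, t: (3,) translation
    Returns (H, W, 2) reprojected (u, v) coordinates and (H, W) transformed depth.
    """
    H, W = depth_src.shape
    # Homogeneous pixel grid: one (u, v, 1) column per pixel, row-major order.
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=0).reshape(3, -1).astype(np.float64)
    # Back-project to 3D camera coordinates: D * K^{-1} * p.
    cam = (np.linalg.inv(K) @ pix) * depth_src.reshape(1, -1)
    # Rigid transform into the other camera, then project with K.
    proj = K @ (R @ cam + t.reshape(3, 1))
    depth_new = proj[2]
    # Dehomogenize, guarding against division by (near-)zero depth.
    uv = proj[:2] / np.clip(depth_new, 1e-8, None)
    return uv.T.reshape(H, W, 2), depth_new.reshape(H, W)
```

With an identity rotation and zero translation the function returns the original pixel grid, which is a convenient sanity check before plugging in predicted poses.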

Network Architecture
As shown in Figure 2a, our framework is composed of two main networks for depth estimation and pose estimation, respectively. The pose network is split into two branches for the estimation of rotation and translation. The proposed IAM captures the geometric correlation representation between the depth and pose tasks, while the proposed STMM exploits the long-range geometric relevance among continuous frames. The details of the IAM and STMM are presented later. The depth network adopts a U-shaped encoder-decoder architecture with skip connections similar to DispNet [28]. The encoder is a Resnet18 [29] network pre-trained on ImageNet [30]. Our depth decoder is similar to that of [2], using sigmoid activation functions in the multi-scale side outputs and ELU nonlinear functions [31] otherwise. Most importantly, we take a three-frame snippet as the sequential input and stack the encoded features as the input of the STMM to learn the temporal and spatial geometric correlations among video sequences during training. The outputs are then decoded into a three-frame depth sequence.
The pose network takes two consecutive frames as input at a time and outputs the corresponding pose transformation, also based on an encoder-decoder structure. To generate more accurate estimations, the network is divided into two branches to calculate R and t, respectively. In the feature encoding phase, the IAM is employed to produce attention from depth features as guidance for the R and t branches.

Inter-Task Attention Module
The IAM aims to leverage the latent geometric correlation between depth and pose estimation tasks during learning. To exploit geometry information, features from the penultimate layer of the depth decoder are first stacked in the same order as the input sequence and fed to the IAM. In the IAM, the features are first processed by an average pooling layer and a max pooling layer along the channel axis and then concatenated as a compact representation, as previous studies [32,33] show that pooling layers can help highlight features. A subsequent convolution layer is then used to obtain the attention maps.
The varying weights of different pixels in the learned attention maps guide the R and t branches in deciding which features should be focused on and prioritized. Therefore, we use the attention maps to obtain scaled pose features by element-wise multiplication, which are then added to the original pose features as a residual term. A schematic diagram of the IAM is provided in Figure 2b, which can be formulated as

$$F'_p = F_p + F_p \otimes W_c\left(\left[\mathrm{AVP}(F_{mn});\, \mathrm{MAP}(F_{mn})\right]\right).$$

Here, $F_{mn}$ represents the stacked depth features encoded from continuous frames $I_m$ and $I_n$ by the depth network, while $F_p$ and $F'_p$ denote the original pose feature and the attended one, respectively. The average and max pooling layers are represented by AVP and MAP, respectively, $[\cdot\,;\cdot]$ denotes concatenation, $\otimes$ denotes element-wise multiplication, and $W_c$ denotes the learnable weight of the convolution layer.
Intuitively, the geometric patterns of the learned attention maps for the two branches should be different or even opposite, as nearby regions tend to matter more to translation, while distant pixels may play a more important role in determining the rotation. The ablation study and visualization results demonstrate that the IAM does learn different attention patterns for the R and t branches from geometry information. These patterns guide the pose network to learn a more valuable and effective representation, improving estimation accuracy for both tasks via joint optimization.
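The data flow of the IAM described above (channel-wise average and max pooling, concatenation, a convolution producing the attention map, element-wise scaling, and a residual addition) can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the names `conv2d_same` and `inter_task_attention` are hypothetical, the depth and pose features are assumed to share the same spatial size, the kernel size is a free choice, and no activation is applied to the attention map because the text does not name one.

```python
import numpy as np

def conv2d_same(x, w):
    """Naive zero-padded convolution. x: (C_in, H, W), w: (C_in, k, k) -> (H, W)."""
    k = w.shape[1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    H, W = x.shape[1:]
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(xp[:, i:i + k, j:j + k] * w)
    return out

def inter_task_attention(depth_feat, pose_feat, w_c):
    """Illustrative IAM forward pass.

    depth_feat: (C_d, H, W) stacked depth features (F_mn)
    pose_feat:  (C_p, H, W) pose features (F_p)
    w_c:        (2, k, k) convolution weight (W_c), k odd
    """
    avg = depth_feat.mean(axis=0, keepdims=True)   # average pooling along channels
    mx = depth_feat.max(axis=0, keepdims=True)     # max pooling along channels
    compact = np.concatenate([avg, mx], axis=0)    # compact (2, H, W) representation
    attn = conv2d_same(compact, w_c)               # one attention weight per pixel
    # Scale the pose features and add them back as a residual term.
    return pose_feat + pose_feat * attn[None]
```

Because the attended features are added as a residual, an all-zero convolution weight leaves the pose features unchanged, so the module can start close to the plain pose network. In the paper the R and t branches each learn their own attention, which here would correspond to calling the function with two different `w_c` weights.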

Spatial-Temporal Memory Module
Instead of learning a representation from each frame individually [1,2,5] in the depth network, we introduce an STMM to leverage and aggregate long-range geometric correlation from both a spatial context and a temporal context to obtain a more representative feature embedding for depth estimation. To this end, various attention structures can be leveraged, including SE [34], CBAM [32], and Non-local attention [14]. After a comparison study, we found that Non-local attention is more effective at capturing long-range context and therefore chose it to implement the STMM in this work. First, the encoded depth features from several consecutive frames are concatenated and used as the input of the Non-local block. Then, the attention map for each frame is obtained, which is multiplied and added to the original depth features after a group convolution layer. The group number is the same as the number of input frames. As shown in Figure 2c, in the STMM, the aggregated depth feature $F'_D$ is calculated as follows:

$$F'_D = F_D + W_\delta\left(\mathrm{NL}(F_D)\right),$$

where $F_D$ is the input depth feature, NL denotes the operation of the Non-local block, and $W_\delta$ is the learnable weight of the group convolution layer.
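A rough NumPy sketch of this kind of memory module is given below. It is an assumption-laden illustration rather than the authors' implementation: it uses the embedded-Gaussian form of the Non-local block [14], treats every pixel of every frame as one space-time position, and models the group convolution as one 1 × 1 weight per frame. All names (`stmm`, `w_theta`, `w_phi`, `w_g`, `w_delta`) are hypothetical.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def stmm(feats, w_theta, w_phi, w_g, w_delta):
    """Illustrative spatial-temporal memory over T stacked frames.

    feats: (T, C, H, W) encoded depth features (F_D)
    w_theta, w_phi, w_g: (C, C_k) 1x1 embeddings of the Non-local block
    w_delta: (T, C_k, C) one 1x1 weight per frame (group convolution, T groups)
    """
    T, C, H, W = feats.shape
    # One row per space-time position.
    x = feats.transpose(0, 2, 3, 1).reshape(T * H * W, C)
    theta, phi, g = x @ w_theta, x @ w_phi, x @ w_g
    # Pairwise affinities between all positions across space and time.
    attn = softmax(theta @ phi.T, axis=1)
    # Aggregate context for every position, then regroup by frame.
    y = (attn @ g).reshape(T, H * W, -1)
    # Group convolution: frame t is transformed by its own weight w_delta[t].
    out = np.einsum('tnk,tkc->tnc', y, w_delta)
    out = out.reshape(T, H, W, C).transpose(0, 3, 1, 2)
    return feats + out                         # residual connection
```

Under these assumptions, stacking the same encoded feature T times yields the same affinities for every copy, so the attention effectively reduces to spatial context within a single image, as happens during single-frame inference.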

Experiments

Depth Estimation Result
Although our method is trained with three-frame snippets as input, it can infer depth from a single image at test time by stacking the same encoded features three times before feeding them into the STMM. Following the common evaluation protocol [1], we report the single-frame inference results in the following experiments, although better results can be achieved by leveraging multiple frames. The extensive experimental results on KITTI are presented in Table 1, which shows that our method outperforms all prior works trained with monocular and even stereo videos in a self-supervised manner. The visual results in Figure 3 demonstrate that our method can generate more accurate and sharper depth maps, especially in challenging situations such as moving objects, distant objects, and fine structures. More experimental results for both depth and pose estimation can be found in the Supplementary Materials.

Table 1. Quantitative performance of single-image depth estimation on the KITTI test set [35]. For a fair comparison, all the results are evaluated with 80 m as the maximum depth threshold. "S" and "M" in the train column denote stereo and monocular inputs for training, while "R18" and "R50" denote the Resnet [29] version used. "†" denotes a result updated after publication. We train our models using only KITTI without any post-processing. The best results are shown in bold text.

Evaluation of Generalization Ability
Though our models were trained only on KITTI [44], competitive results can be achieved on unseen datasets without any fine-tuning. We evaluated our method on two outdoor datasets: Make3D [45] and Cityscapes [46]. In Table 2, our model outperforms other self-supervised methods on the Make3D test protocol, showing good domain adaptation ability. The qualitative comparison on Cityscapes in Figure 4 provides additional intuitive evidence of the generalization ability. More test results can be found in the Supplementary Materials.

Ablation Study
The ablation study was conducted on KITTI to highlight the effects of the individual components of our model. Table 3 shows the detailed results obtained by removing specific component(s) from our model. An in-depth analysis of each part is given in the corresponding sections. Moreover, we conducted systematic experiments to test the performance under various training conditions, listed in Table 3, including input resolution (1024 × 320 vs. 640 × 192), backbone network (Resnet18 vs. Resnet50), and with/without pretraining.

Effect of Inter-Task Attention Module
As mentioned before, when predicting the pose from two consecutive frames, we believe that different geometry information has a different impact on the estimation of rotation and translation. Thus, we introduce attention guidance learned by the IAM into the prediction of R and t. The attention maps learned for the two branches during training are visualized in Figure 5, with color variation denoting different weight values. The attention maps indicate that the two branches did learn different geometric priorities from the depth information to help with their own estimation and, in turn, improve the depth estimation result, as shown in Table 3. The learned geometric patterns show that the estimation of R attaches more importance to farther regions and corners, while the t branch places more weight on closer areas. Our IAM adopts an attention mechanism and works in a generalized representation learning manner to utilize the geometric correlation between depth and pose, which can also be useful in other similar tasks to improve estimation quality.

Effects of STMM on distant objects. The motivation of the STMM is to leverage the rich temporal and spatial geometric dependency among continuous frames. By exploiting the temporal information of depth features from three consecutive frames as input, the STMM helps utilize historical knowledge within the time window and enhances the estimation of distant objects. During inference, the input of the networks is a single image, and the STMM can exploit only spatial correlations among the pixels within that image. The ablation results in Table 3 demonstrate the benefit of the STMM compared with models that replace it with other attention structures (CBAM and SE). To better evaluate our model's performance on distant objects, we segmented each scene into two groups of pixels according to a distance of 20 m, following [41] to ensure fairness. We conducted an ablation test for the estimation of distant objects, and the results are listed in Table 4. The results show that removing the STMM severely decreases the performance of our model for distant objects, which demonstrates the effectiveness of the STMM in distant scenes.
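The near/far split used in this ablation can be sketched as follows. This is a hypothetical illustration: the function name `abs_rel_by_distance` is not from the paper, and the absolute relative error is assumed here as the metric because it is the common headline metric on KITTI; the paper's tables may report additional metrics.

```python
import numpy as np

def abs_rel_by_distance(pred, gt, split_m=20.0, max_depth=80.0):
    """Absolute relative error for near (<= split_m) and far (> split_m) pixels.

    pred, gt: depth arrays of identical shape, in meters; gt == 0 marks missing values.
    Returns (near_error, far_error); a group without valid pixels yields NaN.
    """
    valid = (gt > 0) & (gt <= max_depth)       # keep pixels with usable ground truth
    near = valid & (gt <= split_m)
    far = valid & (gt > split_m)

    def abs_rel(mask):
        if not mask.any():
            return float('nan')
        return float(np.mean(np.abs(pred[mask] - gt[mask]) / gt[mask]))

    return abs_rel(near), abs_rel(far)
```

Reporting the two groups separately makes it visible when a component mainly helps the far range, which an average over all pixels tends to hide because far-range pixels with valid ground truth are comparatively sparse.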

Evaluation with Improved Ground Truth
The main evaluation method proposed by Eigen [35] uses the reprojected raw LiDAR points as ground truth, which adversely affects the evaluation of difficult cases such as occlusion and object motion. To conduct a fair comparison with [2], we also adopted the annotated depth maps from the official KITTI website as ground truth to evaluate methods. These annotated depth maps, introduced by [48], address the above-mentioned difficult cases and improve the ground truth using stereo pairs. We compared our models with other self-supervised methods, as shown in Table 5. The results demonstrate that our method outperforms all previous methods, including both monocular and stereo training approaches.
Table 5. Quantitative performance of single-image depth estimation using the annotated depth maps [48] as ground truth. For a fair comparison, all the results are evaluated with 80 m as the maximum depth threshold. The resolution column indicates the size of input images during training. We trained our network using only KITTI without any post-processing. The best results are shown in bold text. "†" denotes a result updated after publication.

Single-Scale Evaluation
Monocular training methods usually need a scaling step during evaluation because monocular solutions do not have a definite metric scale during training. For evaluation, ref. [1] used the ratio between the median of the ground truth and the median of each predicted depth map as the scaling factor. However, according to [2], using a distinct scaling factor for every frame may give an unfair advantage over stereo methods, which use a fixed scale for all images.
In [2], the authors changed this evaluation protocol by taking the median of all the scaling ratios of the depth maps on the test set as a constant scale for all test images. To conduct a fair comparison, we adopted this modified protocol to validate our method. The quantitative comparison can be found in Table 6, in which our method still outperforms all previous approaches. The standard deviation $\sigma_{scale}$ of our method is also lower than that of other methods, which indicates that our approach generates more consistent depth map scales.
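The two scaling protocols can be contrasted with a short NumPy sketch. The function names are hypothetical, and normalizing the ratios by their median before taking the standard deviation is an assumption about how $\sigma_{scale}$ is computed, chosen to mirror common practice rather than taken from the text.

```python
import numpy as np

def median_scaling_ratios(preds, gts):
    """Per-image ratio median(gt) / median(pred) over pixels with valid ground truth."""
    ratios = []
    for pred, gt in zip(preds, gts):
        mask = gt > 0
        ratios.append(np.median(gt[mask]) / np.median(pred[mask]))
    return np.array(ratios)

def per_frame_scale(preds, gts):
    """Protocol of [1]: every prediction gets its own scaling factor."""
    ratios = median_scaling_ratios(preds, gts)
    return [p * r for p, r in zip(preds, ratios)]

def single_scale(preds, gts):
    """Protocol of [2]: one constant scale, the median of all per-image ratios."""
    ratios = median_scaling_ratios(preds, gts)
    scale = np.median(ratios)
    # Assumed definition: spread of the ratios relative to the shared scale.
    sigma_scale = np.std(ratios / scale)
    return [p * scale for p in preds], scale, sigma_scale
```

A small `sigma_scale` means the per-image ratios barely differ, so replacing them by one constant changes the evaluated depths very little.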

Results with Post-Processing
To complete the comprehensive comparison with the previous state-of-the-art work [2], we also evaluated our method with post-processing. This technique was proposed by [37] to improve stereo-based methods, but it has proved effective for monocular training methods as well. As shown in Table 7, this post-processing step did improve the results of our method. In addition, the performance of our models exceeds the post-processed results of [2] even without post-processing.

Inference Speed
Depth inference usually plays an important role in autonomous driving and robotic vision, and these fields generally impose strict requirements on computation speed. To test the practicability of our models, we measured their inference speed on both a GPU and a CPU device. In Table 8, we list the average time cost for testing the 697 frames of Eigen's test set [35]. The inference speeds of our model on the GPU and CPU devices are significantly different. Frames of the KITTI [44] dataset were collected at 10 Hz, and our inference speed on the GPU device is over 10 fps, which indicates the practicability of our method on the GPU device. However, the speed on the CPU device is much lower, which we will improve in future work.
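The relation between the total time reported for the test set and the frame rate can be made explicit with a small timing sketch. The helper names `measure_fps` and `is_real_time` are hypothetical, `infer` stands for any callable that runs the model on one frame, and the warm-up count is an arbitrary assumption.

```python
import time

def measure_fps(infer, frames, warmup=5):
    """Return (total_seconds, frames_per_second) for running infer on all frames."""
    # Warm-up runs are excluded so one-time setup costs do not distort the timing.
    for frame in frames[:warmup]:
        infer(frame)
    start = time.perf_counter()
    for frame in frames:
        infer(frame)
    total = time.perf_counter() - start
    return total, len(frames) / total

def is_real_time(fps, sensor_hz=10.0):
    """A model keeps up with the sensor if it processes frames at least as fast as they arrive."""
    return fps >= sensor_hz
```

For example, with 697 frames a total time below 69.7 s corresponds to more than 10 fps, which matches the 10 Hz capture rate of KITTI. When timing on a GPU, the device usually has to be synchronized before reading the clock; that step depends on the framework and is omitted here.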

Conclusions
Our work addresses the self-supervised monocular depth estimation problem with a focus on learning a more effective task-specific representation. In our method, the IAM actively explores the geometric correlation between the depth- and pose-estimation tasks by learning an attentive representation from depth that guides the pose network to highlight and leverage more valuable geometry information, which improves the estimation quality of both depth and pose. We also introduce an STMM to learn the spatial and temporal geometric dependencies among sequential frames, which help utilize long-range historical knowledge within the time window to perceive distant objects. Experimental results demonstrate that our method is superior to existing state-of-the-art approaches and can generate higher-quality depth maps. In future work, we will explore more powerful network architectures, such as Transformers and their corresponding attention mechanisms.

Figure 1. Comparison of the learning process of the general pipeline (a) and our method (b) for self-supervised monocular depth estimation. Different from the general pipeline that learns the depth feature $F_D$ and the pose feature $F_P$ separately using a 2D photometric loss $L$, we propose a new scheme for learning a better representation from videos. A memory mechanism $M$ is devised to exploit the long-range context from videos for depth feature learning. An inter-task attention mechanism $A$ is devised to leverage depth information to help pose feature learning, which in turn benefits depth feature learning as well via gradient back-propagation.

Figure 2. Illustration of our network framework (a) and the architecture of the IAM (b) and the STMM (c). The network takes three consecutive frames as input to learn the long-range geometric correlation representation by introducing the STMM after the encoder. The pose network is split into two branches to predict rotation R and translation t separately. The IAM is applied after the second convolution layer of both the R and t branches, learning valuable geometry information to assist the R and t branches in leveraging the inter-task correlation representation.

Figure 3. Qualitative results on the KITTI test set. Our method produces more accurate depth maps for low-texture regions, moving vehicles, delicate structures, and object boundaries.

Figure 4. Visual results evaluated on the Cityscapes dataset. The evaluation uses models trained on KITTI without any refinement. Compared with the methods in [2], our method generates higher-quality depth maps and captures moving and slim objects better. The differences are highlighted with dashed circles.

Figure 5. Visualization of the learned attention maps in the IAM. It indicates that the IAM places distinct emphasis on different regions for the two branches to improve their estimation.

Figure 6. Visual comparison of the visual odometry trajectories. Full trajectories are plotted using the Evo visualization tool [51].

Table 2. Quantitative results on the Make3D dataset. The best results are shown in bold text.

Table 3. Ablation results on KITTI with each individual component removed, using different backbone networks (Resnet18 or Resnet50) and different resolutions of input videos during training. The term "plain" means removing all components, while "pre" means pretraining on ImageNet. The best results are shown in bold text.

Table 4. Ablation study and comparison for distant objects.

Table 8. Inference speed on Eigen's test set [35]. "Time" means the total time required for inference on 697 frames.

Table 9. Results of the visual odometry on the KITTI Odometry dataset. "Frame" means the number of frames used when calculating the absolute trajectory error. "†" denotes a result updated after publication. The best results are shown in bold text.