
Self-supervised learning of monocular 3D geometry understanding with two- and three-view geometric constraints

  • Original article
  • The Visual Computer

Abstract

Understanding the 3D geometry of dynamic scenes captured by moving cameras is a cornerstone of 3D scene understanding. Optical flow estimation, visual odometry, and depth estimation are its three most fundamental tasks. In this work, we present a unified framework for joint self-supervised learning of optical flow estimation, visual odometry, and depth estimation with two- and three-view geometric constraints. Visual odometry and depth estimation are sensitive to dynamic objects, whereas optical flow estimation struggles in boundary regions that move out of the image. We therefore use the estimated optical flow to help visual odometry and depth estimation handle dynamic objects, and use a rigid flow synthesized from the estimated pose and depth to help learn the optical flow of regions that move out of the image boundary due to camera motion. To further improve cross-task consistency, we introduce three-view geometric constraints and propose a three-view consistency loss. Experiments on the KITTI dataset show that our method effectively improves performance in occluded boundary regions and dynamic object regions, and achieves performance comparable to or better than other state-of-the-art monocular self-supervised methods on all three subtasks.
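The rigid-flow synthesis the abstract refers to is a standard pinhole-camera construction: each source pixel is back-projected with its predicted depth, transformed by the estimated relative camera pose, and re-projected into the target view; the displacement between the re-projected and original pixel coordinates is the flow induced by camera motion alone. The paper itself does not include code, so the following is a minimal NumPy sketch under those standard assumptions (function and variable names are ours, not the authors').

```python
import numpy as np

def synthesize_rigid_flow(depth, K, T):
    """Rigid (camera-motion-induced) optical flow from depth and pose.

    depth : (H, W) predicted depth of the source view
    K     : (3, 3) camera intrinsics
    T     : (4, 4) estimated relative pose mapping source-camera
            coordinates to target-camera coordinates
    Returns the rigid flow field, shape (H, W, 2).
    """
    H, W = depth.shape
    # Homogeneous pixel grid of the source view, shape (3, H*W).
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)]).reshape(3, -1).astype(np.float64)

    # Back-project each pixel to 3D camera coordinates: X = depth * K^{-1} p.
    cam = np.linalg.inv(K) @ pix * depth.reshape(1, -1)

    # Move the 3D points into the target camera frame.
    cam_h = np.vstack([cam, np.ones((1, cam.shape[1]))])
    cam_tgt = (T @ cam_h)[:3]

    # Re-project into the target view (guard against near-zero depths).
    proj = K @ cam_tgt
    proj = proj[:2] / np.clip(proj[2:3], 1e-6, None)

    # Rigid flow = displacement of each pixel induced by camera motion alone.
    return (proj - pix[:2]).T.reshape(H, W, 2)
```

In the same spirit, one common way to instantiate a three-view consistency constraint is to compose the flow from view 1 to view 2 with the warped flow from view 2 to view 3 and penalize its deviation from the directly estimated flow from view 1 to view 3; the paper's exact three-view consistency loss may differ in its details.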



Data availability

The datasets generated and analyzed during this study are available at http://www.cvlibs.net/datasets/kitti/index.php.


Acknowledgements

This work was supported in part by the STI 2030-Major Projects of China under Grant 2021ZD0201300 and by the National Science Foundation of China under Grant 62276127.

Author information

Corresponding author

Correspondence to Furao Shen.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Liu, X., Shen, F., Zhao, J. et al. Self-supervised learning of monocular 3D geometry understanding with two- and three-view geometric constraints. Vis Comput 40, 1193–1204 (2024). https://doi.org/10.1007/s00371-023-02840-y
