
Self-supervised learning of monocular 3D geometry understanding with two- and three-view geometric constraints

  • Original article
  • The Visual Computer

Abstract

Understanding the 3D geometry of dynamic scenes captured by moving cameras is a cornerstone of 3D scene understanding. Optical flow estimation, visual odometry, and depth estimation are its three most fundamental tasks. In this work, we present a unified framework for joint self-supervised learning of optical flow estimation, visual odometry, and depth estimation with two- and three-view geometric constraints. Visual odometry and depth estimation are sensitive to dynamic objects, whereas optical flow estimation struggles in boundary regions that move out of the image. We therefore use the estimated optical flow to help visual odometry and depth estimation handle dynamic objects, and use a rigid flow synthesized from the estimated pose and depth to help learn the optical flow of regions that move out of the image boundary due to camera motion. To further improve cross-task consistency, we introduce three-view geometric constraints and propose a three-view consistency loss. Experiments on the KITTI dataset show that our method effectively improves performance in occluded boundary regions and dynamic object regions, and achieves performance comparable to or better than other state-of-the-art monocular self-supervised methods on all three subtasks.
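The rigid-flow synthesis the abstract refers to is a standard pinhole-camera construction: each source pixel is back-projected with its predicted depth, transformed by the estimated relative camera pose, and re-projected into the target view; the displacement between the re-projected and original pixel coordinates is the flow induced by camera motion alone. The paper itself does not include code, so the following is a minimal NumPy sketch under those standard assumptions (function and variable names are ours, not the authors').

```python
import numpy as np

def synthesize_rigid_flow(depth, K, T):
    """Rigid (camera-motion-induced) optical flow from depth and pose.

    depth : (H, W) predicted depth of the source view
    K     : (3, 3) camera intrinsics
    T     : (4, 4) estimated relative pose mapping source-camera
            coordinates to target-camera coordinates
    Returns the rigid flow field, shape (H, W, 2).
    """
    H, W = depth.shape
    # Homogeneous pixel grid of the source view, shape (3, H*W).
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)]).reshape(3, -1).astype(np.float64)

    # Back-project each pixel to 3D camera coordinates: X = depth * K^{-1} p.
    cam = np.linalg.inv(K) @ pix * depth.reshape(1, -1)

    # Move the 3D points into the target camera frame.
    cam_h = np.vstack([cam, np.ones((1, cam.shape[1]))])
    cam_tgt = (T @ cam_h)[:3]

    # Re-project into the target view (guard against near-zero depths).
    proj = K @ cam_tgt
    proj = proj[:2] / np.clip(proj[2:3], 1e-6, None)

    # Rigid flow = displacement of each pixel induced by camera motion alone.
    return (proj - pix[:2]).T.reshape(H, W, 2)
```

In the same spirit, one common way to instantiate a three-view consistency constraint is to compose the flow from view 1 to view 2 with the warped flow from view 2 to view 3 and penalize its deviation from the directly estimated flow from view 1 to view 3; the paper's exact three-view consistency loss may differ in its details.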



Data availability

The datasets generated and analyzed during this study are available at http://www.cvlibs.net/datasets/kitti/index.php.


Acknowledgements

This work was supported in part by the STI 2030-Major Projects of China under Grant 2021ZD0201300 and by the National Science Foundation of China under Grant 62276127.

Author information

Corresponding author

Correspondence to Furao Shen.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Liu, X., Shen, F., Zhao, J. et al. Self-supervised learning of monocular 3D geometry understanding with two- and three-view geometric constraints. Vis Comput 40, 1193–1204 (2024). https://doi.org/10.1007/s00371-023-02840-y
