Abstract
Monocular depth estimation plays a crucial role in scene perception and 3D reconstruction. Supervised depth estimation requires vast amounts of ground-truth depth data for training, which severely restricts its generalization. In recent years, unsupervised learning methods that need no LiDAR point clouds have attracted increasing attention. In this paper, we design an unsupervised monocular depth estimation method that uses stereo pairs for training. We present a triaxial squeeze attention module and introduce it into our unsupervised framework to enrich the fine detail of the estimated depth map. We also propose a novel training loss that enforces mutual exclusion in image reconstruction, improving the performance and robustness of unsupervised learning. Experimental results on KITTI show that our method not only outperforms existing unsupervised methods but also achieves results comparable to several supervised approaches trained with ground-truth data. These improvements better preserve the details of the depth map and allow object shapes to be maintained more smoothly.
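The paper itself gives no code for the triaxial squeeze attention module, but its name suggests squeeze-and-excitation-style gating applied along each of a feature map's three axes (channel, height, width). The sketch below is an illustrative reading of that idea in plain NumPy, not the authors' exact architecture: the function name `triaxial_squeeze_attention` and the way the three gated tensors are combined are assumptions for demonstration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def triaxial_squeeze_attention(x):
    """Hypothetical sketch of triaxial squeeze attention on a C x H x W map.

    For each of the three axes, global-average-pool over the other two
    axes (the "squeeze"), pass the resulting vector through a sigmoid to
    obtain per-slice attention weights, and rescale the input (the
    "excitation"). The three gated tensors are averaged. A learned
    module would insert small fully connected layers before the sigmoid;
    they are omitted here to keep the sketch self-contained.
    """
    gc = sigmoid(x.mean(axis=(1, 2)))  # (C,) channel gate
    gh = sigmoid(x.mean(axis=(0, 2)))  # (H,) row gate
    gw = sigmoid(x.mean(axis=(0, 1)))  # (W,) column gate
    return (x * gc[:, None, None]      # broadcast each gate over the
            + x * gh[None, :, None]    # remaining two axes, then
            + x * gw[None, None, :]) / 3.0  # average the three branches

feat = np.random.default_rng(0).standard_normal((4, 6, 8))
att = triaxial_squeeze_attention(feat)
assert att.shape == feat.shape  # attention reweights, never reshapes
```

Because each sigmoid gate lies in (0, 1), the module reweights feature slices without changing tensor shape, so it can be dropped between any two layers of an encoder-decoder depth network.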
Acknowledgements
This work is partially supported by the National Natural Science Foundation of China (Grant Nos. 41774027, 41904022) and the Fundamental Research Funds for the Central Universities (2242020R40135).
Cite this article
Wei, J., Pan, S., Gao, W. et al. Triaxial Squeeze Attention Module and Mutual-Exclusion Loss Based Unsupervised Monocular Depth Estimation. Neural Process Lett 54, 4375–4390 (2022). https://doi.org/10.1007/s11063-022-10812-x