
Joint self-supervised learning and adversarial adaptation for monocular depth estimation from thermal image

  • Original Paper
  • Published: 2023
  • Machine Vision and Applications

Abstract

Depth estimation from thermal images is a potential route to reliable and robust perception across diverse weather, lighting, and environmental conditions. A self-supervised training method further improves scalability to scenarios in which ground-truth labels are usually impossible to collect, such as GPS-denied and LiDAR-denied conditions. However, self-supervision from thermal images alone is usually insufficient to train a network, owing to inherent properties of thermal images such as low contrast and lack of texture. Introducing an additional self-supervision source (e.g., RGB images) brings further hardware and software constraints, such as complicated multi-sensor calibration and synchronized data acquisition. Therefore, this manuscript proposes a novel training framework that combines self-supervised learning with adversarial feature adaptation to leverage additional modality information without such constraints. The framework trains a network that estimates a monocular depth map from a thermal image in a self-supervised manner. In the training stage, the framework exploits two self-supervision signals: image reconstruction of unpaired RGB and thermal images, and adversarial feature adaptation between unpaired RGB and thermal features. The trained network achieves state-of-the-art quantitative results and better edge-preserving depth estimates than previous methods. Our source code is available at www.github.com/ukcheolshin/SelfDepth4Thermal.
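To make the two training signals concrete, below is a minimal PyTorch sketch of a single training step: a photometric reconstruction loss on monocular thermal video, plus an adversarial loss that pushes the thermal encoder's features to be indistinguishable from RGB encoder features. This is an illustrative sketch under stated assumptions, not the authors' implementation: TinyEncoder, DepthHead, Discriminator, the identity "warp" placeholder, and the 0.01 loss weight are hypothetical stand-ins, and the repository linked above remains the reference implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TinyEncoder(nn.Module):
        """Stand-in feature encoder; the thermal and RGB branches each get one."""
        def __init__(self, in_ch):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True))

        def forward(self, x):
            return self.net(x)

    class DepthHead(nn.Module):
        """Stand-in decoder mapping encoder features to a bounded inverse-depth map."""
        def __init__(self):
            super().__init__()
            self.out = nn.Conv2d(64, 1, 3, padding=1)

        def forward(self, feat, size):
            disp = torch.sigmoid(self.out(feat))
            return F.interpolate(disp, size=size, mode="bilinear", align_corners=False)

    class Discriminator(nn.Module):
        """Patch classifier: did a feature map come from an RGB or a thermal image?"""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(64, 64, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
                nn.Conv2d(64, 1, 3, padding=1))

        def forward(self, feat):
            return self.net(feat)  # raw logits

    enc_thermal, enc_rgb = TinyEncoder(1), TinyEncoder(3)
    depth_head, disc = DepthHead(), Discriminator()
    opt_g = torch.optim.Adam(
        list(enc_thermal.parameters()) + list(depth_head.parameters()), lr=1e-4)
    opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)

    # One illustrative step on unpaired batches; random tensors stand in for data.
    thermal_t = torch.rand(2, 1, 128, 256)  # thermal target frame at time t
    thermal_s = torch.rand(2, 1, 128, 256)  # adjacent thermal source frame
    rgb = torch.rand(2, 3, 128, 256)        # RGB frame, unpaired with the thermal ones

    feat_t = enc_thermal(thermal_t)
    depth = depth_head(feat_t, thermal_t.shape[-2:])

    # (1) Reconstruction self-supervision: synthesize the target view from the
    # source frame using the predicted depth and relative pose, then penalize
    # the photometric error. Real view synthesis needs camera intrinsics and a
    # pose network; the identity placeholder below only keeps the sketch runnable.
    recon = thermal_s  # placeholder for warp(thermal_s, depth, pose, K)
    loss_recon = (recon - thermal_t).abs().mean()  # L1 photometric error

    # (2) Adversarial feature adaptation: update the thermal encoder so that the
    # discriminator labels its features as RGB-like ("real").
    logits_t = disc(feat_t)
    loss_adv = F.binary_cross_entropy_with_logits(logits_t, torch.ones_like(logits_t))

    opt_g.zero_grad()
    (loss_recon + 0.01 * loss_adv).backward()  # 0.01 is an arbitrary loss weight
    opt_g.step()

    # Discriminator step: RGB features are "real", detached thermal features "fake".
    with torch.no_grad():
        feat_rgb = enc_rgb(rgb)
    d_real, d_fake = disc(feat_rgb), disc(feat_t.detach())
    loss_d = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

Note that the RGB and thermal batches never need to be spatially paired, calibrated, or synchronized: the discriminator only compares feature distributions across the two modalities, which is what lets the framework sidestep the multi-sensor constraints mentioned above.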

Acknowledgements

This work was supported by the Police-Lab 2.0 Program funded by the Ministry of Science and ICT (MSIT, Korea) and the Korean National Police Agency (KNPA, Korea) [Project Name: AI System Development for an Image Processing Based on Multi-Band (visible, NIR, LWIR) Fusion Sensing/Project Number: 220122 M0500].

Author information


Corresponding author

Correspondence to Ukcheol Shin.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Shin, U., Park, K., Lee, K. et al. Joint self-supervised learning and adversarial adaptation for monocular depth estimation from thermal image. Machine Vision and Applications 34, 55 (2023). https://doi.org/10.1007/s00138-023-01404-3

