Skip to main content

Towards Explainable Navigation and Recounting

  • Conference paper
  • First Online:
Image Analysis and Processing – ICIAP 2023 (ICIAP 2023)

Abstract

Explainability and interpretability of deep neural networks have become of crucial importance over the years in Computer Vision, concurrently with the need to understand increasingly complex models. This necessity has fostered research on approaches that facilitate human comprehension of neural methods. In this work, we propose an explainable setting for visual navigation, in which an autonomous agent needs to explore an unseen indoor environment while portraying and explaining interesting scenes with natural language descriptions. We combine recent advances in ongoing research fields, employing an explainability method on images generated through agent-environment interaction. Our approach uses explainable maps to visualize model predictions and highlight the correlation between the observed entities and the generated words, to focus on prominent objects encountered during the environment exploration. The experimental section demonstrates that our approach can identify the regions of the images that the agent concentrates on to describe its point of view, improving explainability.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 69.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 89.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

  1. 1.

    https://spacy.io/.

References

  1. Anderson, P., et al.: Bottom-up and top-down attention for image captioning and visual question answering. In: CVPR (2018)

    Google Scholar 

  2. Anjomshoae, S., Najjar, A., Calvaresi, D., Främling, K.: Explainable agents and robots: results from a systematic literature review. In: AAMAS (2019)

    Google Scholar 

  3. Bigazzi, R., Cornia, M., Cascianelli, S., Baraldi, L., Cucchiara, R.: Embodied agents for efficient exploration and smart scene description. In: ICRA (2023)

    Google Scholar 

  4. Bigazzi, R., Landi, F., Cascianelli, S., Baraldi, L., Cornia, M., Cucchiara, R.: Focus on impact: indoor exploration with intrinsic motivation. RA-L 7(2), 2985–2992 (2022)

    Google Scholar 

  5. Bigazzi, R., Landi, F., Cornia, M., Cascianelli, S., Baraldi, L., Cucchiara, R.: Explore and explain: self-supervised navigation and recounting. In: ICPR (2020)

    Google Scholar 

  6. Bigazzi, R., Landi, F., Cornia, M., Cascianelli, S., Baraldi, L., Cucchiara, R.: Out of the box: embodied navigation in the real world. In: CAIP (2021)

    Google Scholar 

  7. Bolelli, F., Baraldi, L., Pollastri, F., Grana, C.: A hierarchical quasi-recurrent approach to video captioning. In: IPAS (2018)

    Google Scholar 

  8. Burda, Y., Edwards, H., Pathak, D., Storkey, A., Darrell, T., Efros, A.A.: Large-scale study of curiosity-driven learning. arXiv preprint arXiv:1808.04355 (2018)

  9. Chang, A., et al.: Matterport3D: learning from RGB-D data in indoor environments. In: 3DV (2017)

    Google Scholar 

  10. Chaplot, D.S., Gandhi, D., Gupta, S., Gupta, A., Salakhutdinov, R.: Learning to explore using active neural SLAM. In: ICLR (2019)

    Google Scholar 

  11. Chaplot, D.S., Gandhi, D.P., Gupta, A., Salakhutdinov, R.R.: Object goal navigation using goal-oriented semantic exploration. In: NeurIPS (2020)

    Google Scholar 

  12. Cornia, M., Baraldi, L., Cucchiara, R.: SMArT: training shallow memory-aware transformers for robotic explainability. In: ICRA (2020)

    Google Scholar 

  13. Cornia, M., Baraldi, L., Cucchiara, R.: Explaining transformer-based image captioning models: an empirical analysis. AI Commun. 35(2), 111–129 (2022)

    Article  MathSciNet  Google Scholar 

  14. Cornia, M., Baraldi, L., Fiameni, G., Cucchiara, R.: Universal captioner: inducing content-style separation in vision-and-language model training. arXiv preprint arXiv:2111.12727 (2022)

  15. Cornia, M., Stefanini, M., Baraldi, L., Cucchiara, R.: Meshed-memory transformer for image captioning. In: CVPR (2020)

    Google Scholar 

  16. Hendricks, L.A., Akata, Z., Rohrbach, M., Donahue, J., Schiele, B., Darrell, T.: Generating visual explanations. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp. 3–19. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46493-0_1

    Chapter  Google Scholar 

  17. Hessel, J., Holtzman, A., Forbes, M., Bras, R.L., Choi, Y.: CLIPScore: a reference-free evaluation metric for image captioning. In: EMNLP (2021)

    Google Scholar 

  18. Huang, L., Wang, W., Chen, J., Wei, X.Y.: Attention on attention for image captioning. In: ICCV (2019)

    Google Scholar 

  19. Karpathy, A., Fei-Fei, L.: Deep visual-semantic alignments for generating image descriptions. In: CVPR (2015)

    Google Scholar 

  20. Kim, S.S., Meister, N., Ramaswamy, V.V., Fong, R., Russakovsky, O.: HIVE: evaluating the human interpretability of visual explanations. In: Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2016. LNCS, vol. 13672, pp. 280–298. Springer, Heidelberg (2022). https://doi.org/10.1007/978-3-031-19775-8_17

    Chapter  Google Scholar 

  21. Kingma, D., Ba, J.: Adam: a method for stochastic optimization. In: ICLR (2015)

    Google Scholar 

  22. Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48

    Chapter  Google Scholar 

  23. Lovino, M., Bontempo, G., Cirrincione, G., Ficarra, E.: Multi-omics classification on kidney samples exploiting uncertainty-aware models. In: ICIC (2020)

    Google Scholar 

  24. Pathak, D., Agrawal, P., Efros, A.A., Darrell, T.: Curiosity-driven exploration by self-supervised prediction. In: ICML (2017)

    Google Scholar 

  25. Pennington, J., Socher, R., Manning, C.D.: GloVe: global vectors for word representation. In: EMNLP (2014)

    Google Scholar 

  26. Poppi, S., Cornia, M., Baraldi, L., Cucchiara, R.: Revisiting the evaluation of class activation mapping for explainability: a novel metric and experimental analysis. In: CVPR Workshops (2021)

    Google Scholar 

  27. Poppi, S., Sarto, S., Cornia, M., Baraldi, L., Cucchiara, R.: Multi-class explainable unlearning for image classification via weight filtering. arXiv preprint arXiv:2304.02049 (2023)

  28. Radford, A., et al.: Learning transferable visual models from natural language supervision. In: ICML (2021)

    Google Scholar 

  29. Ramakrishnan, S.K., Al-Halah, Z., Grauman, K.: Occupancy anticipation for efficient exploration and navigation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12350, pp. 400–418. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58558-7_24

    Chapter  Google Scholar 

  30. Ramakrishnan, S.K., Jayaraman, D., Grauman, K.: An exploration of embodied visual exploration. IJCV 129, 1616–1649 (2021)

    Article  Google Scholar 

  31. Sarto, S., Barraco, M., Cornia, M., Baraldi, L., Cucchiara, R.: Positive-augmented contrastive learning for image and video captioning evaluation. In: CVPR (2023)

    Google Scholar 

  32. Sarto, S., Cornia, M., Baraldi, L., Cucchiara, R.: Retrieval-augmented transformer for image captioning. In: CBMI (2022)

    Google Scholar 

  33. Savva, M., et al.: Habitat: a platform for embodied AI research. In: ICCV (2019)

    Google Scholar 

  34. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)

  35. Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-CAM: visual explanations from deep networks via gradient-based localization. In: ICCV (2017)

    Google Scholar 

  36. Sennrich, R., Haddow, B., Birch, A.: neural machine translation of rare words with subword units. In: ACL (2016)

    Google Scholar 

  37. Simonyan, K., Vedaldi, A., Zisserman, A.: Deep inside convolutional networks: visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034 (2013)

  38. Springenberg, J.T., Dosovitskiy, A., Brox, T., Riedmiller, M.: Striving for simplicity: the all convolutional net. arXiv preprint arXiv:1412.6806 (2014)

  39. Stefanini, M., Cornia, M., Baraldi, L., Cascianelli, S., Fiameni, G., Cucchiara, R.: From show to tell: a survey on deep learning-based image captioning. IEEE Trans. PAMI 45(1), 539–559 (2022)

    Article  Google Scholar 

  40. Sundararajan, M., Taly, A., Yan, Q.: Axiomatic attribution for deep networks. In: ICML (2017)

    Google Scholar 

  41. Vaswani, A., et al.: Attention is all you need. In: NeurIPS (2017)

    Google Scholar 

  42. Wang, H., et al.: Score-CAM: score-weighted visual explanations for convolutional neural networks. In: CVPR Workshops (2020)

    Google Scholar 

  43. Xu, K., et al.: Show, attend and tell: neural image caption generation with visual attention. In: ICML (2015)

    Google Scholar 

  44. You, Y., et al.: Large batch optimization for deep learning: training bert in 76 minutes. In: ICLR (2019)

    Google Scholar 

  45. Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8689, pp. 818–833. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10590-1_53

    Chapter  Google Scholar 

  46. Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Learning deep features for discriminative localization. In: CVPR (2016)

    Google Scholar 

Download references

Acknowledgements

This work has been supported by the “Fit for Medical Robotics” (Fit4MedRob) project, funded by the Italian Ministry of University and Research, and by the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 955778 for project “Personalized Robotics as Service Oriented Applications” (PERSEO).

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Marcella Cornia .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Poppi, S. et al. (2023). Towards Explainable Navigation and Recounting. In: Foresti, G.L., Fusiello, A., Hancock, E. (eds) Image Analysis and Processing – ICIAP 2023. ICIAP 2023. Lecture Notes in Computer Science, vol 14233. Springer, Cham. https://doi.org/10.1007/978-3-031-43148-7_15

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-43148-7_15

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-43147-0

  • Online ISBN: 978-3-031-43148-7

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics