
Vision-language navigation: a survey and taxonomy

  • Review
  • Published in Neural Computing and Applications

Abstract

Vision-language navigation (VLN) tasks require an agent to follow language instructions from a human guide and navigate in previously unseen environments using visual observations. This challenging field draws on natural language processing (NLP), computer vision (CV), and robotics, and has produced a large body of work addressing a variety of VLN tasks. This paper provides a comprehensive survey and an insightful taxonomy of these tasks based on the characteristics of their language instructions. Depending on whether navigation instructions are given once or multiple times, we divide the tasks into two categories, i.e., single-turn and multiturn tasks. We subdivide single-turn tasks into goal-oriented and route-oriented tasks based on whether the instructions designate a single goal location or specify a sequence of multiple locations. We subdivide multiturn tasks into interactive and passive tasks based on whether the agent is allowed to ask questions. These tasks require different agent capabilities and entail different model designs. We identify the progress made on these tasks and examine the limitations of existing VLN models and task settings. We hope that a well-designed taxonomy of the task family will enable comparisons among approaches to the same task across papers and clarify the advances made on each task. Finally, we discuss several open issues in this field and promising directions for future research, including incorporating external knowledge into VLN models and transferring them to the real physical world.
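The two-level taxonomy described above can be read as a small decision tree over properties of the language instructions. The Python sketch below is an illustration added for this summary, not code from the paper; the benchmark placements at the end are assumptions consistent with the abstract's description rather than an exhaustive mapping.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class InstructionTiming(Enum):
    """Are navigation instructions given once or over multiple turns?"""
    SINGLE_TURN = auto()
    MULTI_TURN = auto()


class SingleTurnStyle(Enum):
    """Single-turn split: does the instruction name one goal location
    or describe a route through a sequence of locations?"""
    GOAL_ORIENTED = auto()
    ROUTE_ORIENTED = auto()


class MultiTurnStyle(Enum):
    """Multiturn split: is the agent allowed to ask questions back?"""
    INTERACTIVE = auto()
    PASSIVE = auto()


@dataclass
class VLNTask:
    """One leaf of the taxonomy, e.g. a benchmark dataset."""
    name: str
    timing: InstructionTiming
    single_turn_style: Optional[SingleTurnStyle] = None
    multi_turn_style: Optional[MultiTurnStyle] = None


# Illustrative placements (assumptions for this sketch): R2R provides one
# instruction describing a route; REVERIE provides one instruction naming
# a remote target to be found.
r2r = VLNTask("R2R", InstructionTiming.SINGLE_TURN,
              single_turn_style=SingleTurnStyle.ROUTE_ORIENTED)
reverie = VLNTask("REVERIE", InstructionTiming.SINGLE_TURN,
                  single_turn_style=SingleTurnStyle.GOAL_ORIENTED)
```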


Data availability

No data, models, or code were generated or used in this review.

Notes

  1. https://arxiv.org/abs/2108.11544.

  2. http://ai2thor.allenai.org.

  3. https://vizdoom.cs.put.edu.pl/.

  4. http://gibsonenv.stanford.edu/.

  5. http://svl.stanford.edu/igibson/.

  6. https://github.com/facebookresearch/House3D.

  7. https://bringmeaspoon.org/.

  8. https://aihabitat.org/.


Acknowledgements

The work described in this paper was sponsored in part by the National Natural Science Foundation of China under Grant Nos. 62103420, 62103428, and 62102432, and by the Natural Science Fund of Hunan Province under Grant Nos. 2021JJ40702 and 2021JJ40697.

Author information


Corresponding author

Correspondence to Yue Hu.

Ethics declarations

Conflict of interest

All authors declare that no conflicts of interest exist.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Wu, W., Chang, T., Li, X. et al. Vision-language navigation: a survey and taxonomy. Neural Comput & Applic 36, 3291–3316 (2024). https://doi.org/10.1007/s00521-023-09217-1


Keywords

Navigation