
Vision-language navigation: a survey and taxonomy

  • Review
  • Published in Neural Computing and Applications

Abstract

Vision-language navigation (VLN) tasks require an agent to follow language instructions from a human guide and navigate in previously unseen environments using visual observations. This challenging field draws on natural language processing (NLP), computer vision (CV), and robotics, and has produced a large body of work addressing a variety of VLN tasks. This paper provides a comprehensive survey and an insightful taxonomy of these tasks based on the characteristics of their language instructions. Depending on whether navigation instructions are given once or multiple times, we divide the tasks into two categories, i.e., single-turn and multiturn tasks. We subdivide single-turn tasks into goal-oriented and route-oriented tasks based on whether the instructions designate a single goal location or specify a sequence of multiple locations. We subdivide multiturn tasks into interactive and passive tasks based on whether the agent is allowed to ask questions. These tasks require different agent capabilities and entail different model designs. We identify the progress made on these tasks and examine the limitations of existing VLN models and task settings. We hope that a well-designed taxonomy of the task family will enable comparisons among approaches to the same task across papers and clarify the advances made on each task. Finally, we discuss several open issues in this field and promising directions for future research, including incorporating external knowledge into VLN models and transferring them to the real physical world.
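The two-level taxonomy described above can be read as a small decision tree over properties of the language instructions. The Python sketch below is an illustration added for this summary, not code from the paper; the benchmark placements at the end are assumptions consistent with the abstract's description rather than an exhaustive mapping.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class InstructionTiming(Enum):
    """Are navigation instructions given once or over multiple turns?"""
    SINGLE_TURN = auto()
    MULTI_TURN = auto()


class SingleTurnStyle(Enum):
    """Single-turn split: does the instruction name one goal location
    or describe a route through a sequence of locations?"""
    GOAL_ORIENTED = auto()
    ROUTE_ORIENTED = auto()


class MultiTurnStyle(Enum):
    """Multiturn split: is the agent allowed to ask questions back?"""
    INTERACTIVE = auto()
    PASSIVE = auto()


@dataclass
class VLNTask:
    """One leaf of the taxonomy, e.g. a benchmark dataset."""
    name: str
    timing: InstructionTiming
    single_turn_style: Optional[SingleTurnStyle] = None
    multi_turn_style: Optional[MultiTurnStyle] = None


# Illustrative placements (assumptions for this sketch): R2R provides one
# instruction describing a route; REVERIE provides one instruction naming
# a remote target to be found.
r2r = VLNTask("R2R", InstructionTiming.SINGLE_TURN,
              single_turn_style=SingleTurnStyle.ROUTE_ORIENTED)
reverie = VLNTask("REVERIE", InstructionTiming.SINGLE_TURN,
                  single_turn_style=SingleTurnStyle.GOAL_ORIENTED)
```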


Data availability

No data, models, or code were generated or used in this review.

Notes

  1. https://arxiv.org/abs/2108.11544.

  2. http://ai2thor.allenai.org.

  3. https://vizdoom.cs.put.edu.pl/.

  4. http://gibsonenv.stanford.edu/.

  5. http://svl.stanford.edu/igibson/.

  6. https://github.com/facebookresearch/House3D.

  7. https://bringmeaspoon.org/.

  8. https://aihabitat.org/.


Acknowledgements

The work described in this paper was sponsored in part by the National Natural Science Foundation of China under Grant Nos. 62103420, 62103428, and 62102432, and by the Natural Science Fund of Hunan Province under Grant Nos. 2021JJ40702 and 2021JJ40697.

Author information


Corresponding author

Correspondence to Yue Hu.

Ethics declarations

Conflict of interest

All authors declare that no conflicts of interest exist.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Wu, W., Chang, T., Li, X. et al. Vision-language navigation: a survey and taxonomy. Neural Comput & Applic 36, 3291–3316 (2024). https://doi.org/10.1007/s00521-023-09217-1


Keywords

Navigation