RMM: A Recursive Mental Model for Dialog Navigation

Language-guided robots must be able to both ask humans questions and understand answers. Much existing work focuses only on the latter. In this paper, we go beyond instruction following and introduce a two-agent task where one agent navigates and asks questions that a second, guiding agent answers. Inspired by theory of mind, we propose the Recursive Mental Model (RMM). The navigating agent models the guiding agent to simulate answers given candidate generated questions. The guiding agent in turn models the navigating agent to simulate navigation steps it would take to generate answers. We use the progress agents make towards the goal as a reinforcement learning reward signal to directly inform not only navigation actions, but also both question and answer generation. We demonstrate that RMM enables better generalization to novel environments. Interlocutor modelling may be a way forward for human-agent dialog where robots need to both ask and answer questions.


Introduction
A key challenge for embodied language is moving beyond instruction following to instruction generation. This dialog paradigm raises a myriad of new research questions, from grounded versions of traditional problems like co-reference resolution (Das et al., 2017a) to modeling theory of mind to consider the listener (Bisk et al., 2020).
In this work, we develop end-to-end dialog agents to navigate photorealistic, indoor scenes to reach goal rooms, trained on human-human dialog from the Collaborative Vision-and-Dialog Navigation (CVDN) (Thomason et al., 2019) dataset. Previous work considers only the vision-and-language navigation task, conditioned on single instructions (Wang et al., 2019; Ma et al., 2019; Fried et al., 2018; Anderson et al., 2018) or dialog histories (Hao et al., 2020; Zhu et al., 2020; Wang et al., 2020; Thomason et al., 2019). Towards dialog, recent work has modelled question answering in addition to navigation (Chi et al., 2020; Nguyen and Daumé III, 2019; Nguyen et al., 2019). Closing the loop, our work is the first to train agents to perform end-to-end, collaborative dialogs with question generation, question answering, and navigation conditioned on dialog history.
Theory of mind (Gopnik and Wellman, 1992) guides human communication. Efficient questions and answers build on a shared world of experiences and referents. We formalize this notion through a Recursive Mental Model (RMM) of a conversational partner. With this formalism, an agent spawns instances of itself to converse with to posit the effects of dialog acts before asking a question or generating an answer, thus enabling conversational planning to achieve the desired navigation result.
Related Work

Instruction Following tasks an agent with interpreting a natural language instruction along with visual observations to reach a goal (Anderson et al., 2018; Chen and Mooney, 2011). These instructions describe step-by-step actions the agent needs to take. This paradigm has been extended to longer trajectories and outdoor environments (Chen et al., 2019), as well as to agents in the real world (Chai et al., 2018; Tellex et al., 2014). In this work, we focus on the simulated, photorealistic indoor environments of the MatterPort dataset (Chang et al., 2017), and go beyond instruction following to a cooperative, two-agent dialog setting.
Navigation Dialogs task a navigator and a guide to cooperate to find a destination. Agents can be trained on human-human dialogs, but previous work either includes substantial information asymmetry between the navigator and oracle (de Vries et al., 2018; Narayan-Chen et al., 2019) or only investigates the navigation portion of the dialog without considering question generation and answering (Thomason et al., 2019). The latter approach treats dialog histories as longer and more ambiguous forms of static instructions. No text is generated to approach such navigation-only tasks. Going beyond models that perform navigation from dialog history alone (Wang et al., 2020; Zhu et al., 2020; Hao et al., 2020), in this work we train two agents: a navigator agent that asks questions, and a guide agent that answers those questions.
Multimodal Dialog takes several forms. In Visual Dialog (Das et al., 2017a), an agent answers a series of questions about an image while accounting for dialog context in the process. Reinforcement learning (Das et al., 2017b) has proved essential to strong performance on this task, and such paradigms have been extended to producing multi-domain visual dialog agents (Ju et al., 2019). GuessWhat (de Vries et al., 2017) presents a similar paradigm, where agents use visual properties of objects to reason about which referent meets various constraints. Identifying visual attributes can also lead to emergent communication between pairs of learning agents (Cao et al., 2018).
Goal-Oriented Dialog Goal-oriented dialog systems, or chatbots, help a user achieve a predefined goal, like booking flights, within a closed domain (Gao et al., 2019; Vlad Serban et al., 2015; Bordes and Weston, 2017) while trying to limit the number of questions asked of the user. Modeling goal-oriented dialog requires skills that go beyond language modeling, such as asking questions to clearly define a user request, querying knowledge bases, and interpreting results from queries as options to complete a transaction. Most current task-oriented systems are data-driven and trained end-to-end using semi-supervised or transfer learning methods (Ham et al., 2020; Mrksic et al., 2017). However, these data-driven approaches may lack grounding between the text and the current state of the environment. Reinforcement learning-based dialog modeling (Su et al., 2016; Peng et al., 2017; Liu et al., 2017) can improve completion rate and user experience by helping ground conversational data to environments.

Algorithm 1: Dialog Navigation

    loc = p_0; hist = t_O
    a ~ N(hist)
    loc, hist = update(a, loc, hist)
    while a != STOP and len(hist) < 20 do
        q ~ Q(hist, loc)                        // Question
        s = path(loc, goal, horizon = 5)
        o ~ O(hist, loc, q, s)                  // Answer
        hist <- hist + (q, o)
        for a in N(hist) do
            loc <- loc + a                      // Move
            hist <- hist + a
        end
    end
    return dist(p_0, goal) - dist(loc, goal)    // Goal progress
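To make the loop concrete, a minimal Python sketch of Algorithm 1 follows. Every interface here (env.step, navigator.sample, and so on) is a hypothetical stand-in for the trained models and simulator, not an API from this work.

```python
# Schematic rendering of Algorithm 1 (Dialog Navigation); all interfaces
# are hypothetical stand-ins for the Navigator, Questioner, Guide, and env.
def dialog_navigation(env, navigator, questioner, guide, max_exchanges=20):
    loc, hist = env.start_location, [env.target_object]   # loc = p_0; hist = t_O
    action = navigator.sample(hist, loc)
    loc = env.step(loc, action)
    hist.append(action)
    while action != "stop" and len(hist) < max_exchanges:
        question = questioner.sample(hist, loc)                # Question
        preview = env.shortest_path(loc, env.goal, horizon=5)  # Guide's privileged view
        answer = guide.sample(hist, loc, question, preview)    # Answer
        hist += [question, answer]
        for action in navigator.rollout(hist, loc):            # Move until next question
            loc = env.step(loc, action)
            hist.append(action)
    # Goal progress: reduction in distance (meters) to the goal.
    return env.distance(env.start_location, env.goal) - env.distance(loc, env.goal)
```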

Task and Data
Our work creates a two-agent dialog task, building on the CVDN dataset (Thomason et al., 2019) of human-human dialogs. In that dataset, a human Navigator and Guide collaborate to find a goal room containing a target object, such as a plant. The Navigator moves through the environment, and the Guide views this navigation until the Navigator asks a question in natural language. Then, the Guide can see the next few steps a shortest path planner would take towards the goal, and writes a natural language response. This dialog continues until the Navigator arrives at the goal.
We model this dialog between two agents:
1. Navigator (N) & Questioner (Q)
2. Guide (G)
We split the first agent into its two roles: navigation and question asking. The agents receive the same input as their human counterparts in CVDN.
In particular, both agents (and all three roles) have access to the entire dialog and visual navigation histories, in addition to a textual description of the target object (e.g., a plant). The Navigator uses this information to decide on a series of actions: forward, left, right, look up, look down, and stop. The Questioner asks for specific guidance from the Guide. The Guide is presented not only with the navigation/dialog history but also the next five shortest path steps to the goal. Agents are trained on real human dialogs of natural language questions and answers from CVDN. Individual question-answer exchanges in that dataset are underspecified and rarely provide simple step-by-step instructions like "straight, straight, right, ...". Instead, exchanges rely on assumptions of world knowledge and shared context (Frank and Goodman, 2012; Grice et al., 1975), which manifest as instructions full of visual-linguistic co-references such as "should I go back to the room I just passed or continue on?"
The CVDN release does not provide any baselines or evaluations for this interactive dialog setting, focusing instead solely on the navigation component of the task. Its authors evaluate navigation agents by "progress to goal" in meters: the reduction in distance to the goal location between the agent's starting and ending positions.
Dialog navigation proceeds by iterating through the three roles until either the navigator chooses to stop or a maximum number of turns is played (Algorithm 1). Upon terminating, the "progress to goal" is returned for evaluation. We also report BLEU scores (Papineni et al., 2002) for evaluating the generation of questions and answers by comparing against human questions and answers.
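Both metrics are straightforward to compute; the sketch below assumes distances in meters and uses NLTK's sentence-level BLEU (the exact BLEU configuration used in the paper is not specified here, so the smoothing choice is an assumption).

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def progress_to_goal(start_dist_m, end_dist_m):
    """Distance reduction (meters) toward the goal over the episode."""
    return start_dist_m - end_dist_m

def qa_bleu(generated, human_reference):
    """BLEU of one generated question or answer against the human one."""
    smooth = SmoothingFunction().method1  # avoid zero scores on short outputs
    return sentence_bleu([human_reference.split()], generated.split(),
                         smoothing_function=smooth)
```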

Conditioning Context
In our experiments, we define three notions of dialog context (t_O, QA_{i-1}, and QA_{1:i-1}) to evaluate how well agents utilize, or are confused by, the generated conversations.

t_O The agent must navigate to the goal while knowing only what type of object they are looking for (e.g., a plant).
QA_{i-1} The agent has access to its previous question-answer exchange. It can condition on this information both to generate the next exchange and then to navigate towards the goal.
QA_{1:i-1} This is the "full" evaluation paradigm, in which an agent has access to the entire dialog when interacting. This context also affords the most potential distractor information.
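The three settings differ only in how much history is concatenated into the agent's input. A minimal sketch (the build_context helper and its mode names are ours, for illustration):

```python
def build_context(t_o, qas, mode):
    """Assemble the dialog context for exchange i.
    t_o: target object string; qas: list of (question, answer) pairs so far."""
    if mode == "t_O":                      # target object only
        return [t_o]
    if mode == "QA_prev":                  # t_O plus the previous exchange
        return [t_o] + list(qas[-1]) if qas else [t_o]
    if mode == "QA_full":                  # t_O plus the entire dialog history
        return [t_o] + [turn for qa in qas for turn in qa]
    raise ValueError(f"unknown mode: {mode}")
```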

Models
We introduce the Recursive Mental Model (RMM) as an initial approach to our full dialog task formulation of the CVDN dataset. Key to this approach is allowing component models (Navigator, Questioner, and Guide) to learn from each other and roll out possible dialogs and trajectories. We compare our model to a traditional sequence-to-sequence baseline, and we explore Speaker-Follower data augmentation (Fried et al., 2018).

Sequence-to-Sequence Architecture
The underlying architecture, shown in Figure 2, is shared across all approaches. The core dialog tasks are navigation action decoding and language generation for asking and answering questions. We present three sequence-to-sequence (Bahdanau et al., 2015) models to perform as Navigator, Questioner, and Guide. The models rely on an LSTM (Hochreiter and Schmidhuber, 1997) encoder for the dialog history and a ResNet backbone (He et al., 2015) for processing the visual surroundings; we take the penultimate ResNet layer as image observations.
Navigation Action Decoding Initially, the dialog context is a target object t_O that can be found in the goal room, for example "plant" indicating that the goal room contains a plant. As questions are asked and answered, the dialog context grows. Following prior work (Anderson et al., 2018; Thomason et al., 2019), dialog history words w are embedded as 256-dimensional vectors and passed through an LSTM to produce context vectors u and a final hidden state h_N. The hidden state h_N is used to initialize the LSTM decoder. At every timestep the decoder is updated with the previous action a_{t-1} and current image I_t. The hidden state is used to attend over the language u and predict the next action a_t (Figure 2a). We pretrain the decoder on the navigation task alone (Thomason et al., 2019) before fine-tuning in the full dialog setting we introduce in this paper. The next action is sampled from the model's predicted logits and the episode ends when either a stop action is sampled or 80 steps are taken.
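A sketch of one such decoding step in PyTorch follows; the hidden sizes, the action-embedding width, the 2048-dimensional ResNet features, and the single-head attention are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class NavDecoderStep(nn.Module):
    """One decoding step of the navigation policy: embed the previous action,
    fuse it with the current image feature, update the LSTM state, attend
    over encoded dialog tokens u, and predict logits over the six actions."""
    def __init__(self, n_actions=6, img_dim=2048, act_dim=32, hid=512):
        super().__init__()
        self.act_emb = nn.Embedding(n_actions, act_dim)
        self.rnn = nn.LSTMCell(img_dim + act_dim, hid)
        self.attn = nn.MultiheadAttention(hid, num_heads=1, batch_first=True)
        self.out = nn.Linear(2 * hid, n_actions)

    def forward(self, prev_action, image_feat, state, u):
        # prev_action: (B,) long; image_feat: (B, img_dim); u: (B, T, hid)
        x = torch.cat([self.act_emb(prev_action), image_feat], dim=-1)
        h, c = self.rnn(x, state)
        ctx, _ = self.attn(h.unsqueeze(1), u, u)   # attend over language
        logits = self.out(torch.cat([h, ctx.squeeze(1)], dim=-1))
        return logits, (h, c)
```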

Language Generation To generate questions and answers, we train sequence-to-sequence models (Figure 2b) where an encoder takes in a sequence of images and a decoder produces a sequence of word tokens. At each decoding timestep, the decoder attends over the input images to predict the next word of the question or answer. This model is also initialized via training on CVDN dialogs. In particular, question asking (Questioner) encodes the images of the current viewpoint where a question is asked, and then decodes the question asked by the human Navigator. Question answering (Guide) is encoded by viewing images of the next five steps the shortest path planner would take towards the goal, then decoding the language answer produced by the human Guide.
Pretraining initializes the lexical embeddings and attention alignments before fine-tuning in the collaborative, turn-taking setting we introduce in this paper. We experimented with several beam- and temperature-based sampling methods, but saw only minor effects; hence, we use direct sampling from the model's predicted logits in this paper.
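Direct sampling is the temperature-1 special case of temperature sampling; a minimal sketch:

```python
import torch

def sample_token(logits, temperature=1.0):
    """Sample the next action or word directly from predicted logits;
    temperature=1.0 recovers the direct sampling used here."""
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```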

Data Augmentation (DA)
Navigation agents can benefit from data augmentation produced by a learned agent that provides additional, generated language instructions (Fried et al., 2018). Data pairs of generated novel language with visual observations along random routes in the environment can help with navigation generalization. We assess the effectiveness of such data augmentation in our two-agent dialog task.
To augment navigation training data, we choose a CVDN conversation but initialize the navigation agent in a random location in the environment, then sample multiple action trajectories and evaluate their progress towards the conversation's goal location. In practice, we sample two trajectories and additionally consider the trajectory obtained by picking the top predicted action at each step without sampling. We give the visual observations of the best path to the pretrained Questioner model to produce a relevant instruction. This augmentation allows the agent to explore and collect alternative routes to the goal location. We down-weight the contributions of these noisier trajectories to the overall loss, so loss = λ · loss_generated + (1 - λ) · loss_human. We explored different ratios before settling on λ = 0.1. The choice of λ affects the fluency of the language generated, because a navigator too tuned to generated language leads to deviation from grammatically valid English and a lack of diversity (Section 6 and Appendix).
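A sketch of this augmentation step under the assumptions above; only the two-sampled-plus-one-greedy scheme and λ = 0.1 come from the text, while the model and environment interfaces are hypothetical stand-ins.

```python
def augment_and_weight(env, navigator, questioner, start, goal, lam=0.1):
    """Sample two routes plus a greedy decode from a random start, describe
    the best route with the pretrained Questioner, and down-weight the
    generated data in the training loss."""
    routes = [navigator.rollout(start, sample=True) for _ in range(2)]
    routes.append(navigator.rollout(start, sample=False))          # greedy top-1
    best = max(routes, key=lambda r: env.goal_progress(r, goal))   # most progress
    instruction = questioner.generate(best.observations)           # new data pair

    def mixed_loss(loss_generated, loss_human):
        # loss = lambda * generated + (1 - lambda) * human, lambda = 0.1
        return lam * loss_generated + (1.0 - lam) * loss_human

    return (best, instruction), mixed_loss
```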

Recursive Mental Model
We introduce the Recursive Mental Model agent (RMM), which is trained with reinforcement learning to propagate feedback through all three component models: Navigator, Questioner, and Guide. In this way, the training signal for question generation includes the training signal for answer generation, which in turn has access to the training signal from navigation error. Over training, the agent's progress towards the goal in the environment informs the dialog itself; each model educates the others (Figure 3). This model does not use any data augmentation but still explores the world and updates its representations and language.
Figure 3: The Recursive Mental Model allows for each sampled generation to spawn a new dialog and corresponding trajectory to the goal. The dialog that leads to the most goal progress is followed by the agent.

Each model among the Navigator, Questioner, and Guide may sample N trajectories or generations of max length L. These samples in turn are considered recursively by the RMM agent, leading to N^T possible dialog trajectories, where T is at most the maximum trajectory length. To prevent unbounded exponential growth during training, each model is limited to a maximum number of total recursive calls per run. Search techniques, such as frontiers (Ke et al., 2019), could be employed in future work to guide the agent.
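A sketch of the capped recursion, with schematic interfaces; budget is a mutable one-element list so the cap on total recursive calls is shared across the whole recursion tree, as described above.

```python
def rmm_rollout(state, models, depth, budget, n=3, horizon=5):
    """Expand n sampled question-answer pairs, roll the navigator forward,
    recurse, and keep the branch with the most goal progress (known during
    training).  `budget` caps total recursive calls to avoid N**T blowup."""
    if depth == 0 or budget[0] <= 0:
        return state
    budget[0] -= 1
    best = state
    for _ in range(n):
        q = models.questioner.sample(state)
        a = models.guide.sample(state, q)
        branch = models.navigator.roll_forward(state.extended(q, a), steps=horizon)
        branch = rmm_rollout(branch, models, depth - 1, budget, n, horizon)
        if branch.goal_progress > best.goal_progress:
            best = branch
    return best
```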
Training In the dialog task we introduce, the agents begin only knowing the name of the target object. The Navigator agent must move towards the goal room containing the target object, and can ask questions using the Questioner model. The Guide agent answers those questions given a privileged view of the next steps in the shortest path to the goal, rendered as visual observations.
We train using a reinforcement learning objective to learn a policy π_θ(τ|G) which maximizes the log-likelihood of the shortest path trajectory τ:

    max_θ Σ_t log π_θ(a_t | z_t, s_t)    (1)

where a_t = f_{θ_D}(z_t, s_t) is the action decoder, z_t = f_{θ_E}(w_{1:t}) is the language encoder, and w_{1:t} is the dialog context at time t.
We can calculate the cross-entropy loss between the generated action and the shortest path action at time t to perform behavioral cloning before sampling the next action from the Navigator predictions.

Reward Shaping with Advantage Actor Critic
As part of the Navigator loss, the goal progress can be leveraged for reward shaping. We use the Advantage Actor Critic (Sutton and Barto, 1998) formulation with regularization (Eq. 2). The RL agent loss can then be expressed as the sum of the A2C loss with regularization and the cross-entropy loss CE(τ̂, τ) between the ground-truth and generated trajectories. This loss is then propagated through the generation models Questioner and Guide as well, by simply accumulating the RL navigator loss on top of the standard generation cross entropy CE(Ŵ, W).
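A sketch of the combined navigator objective under these definitions; the entropy coefficient and the exact form of the regularizer are assumptions, as the paper's Eq. 2 is not reproduced here.

```python
import torch
import torch.nn.functional as F

def navigator_loss(logits, gt_actions, taken_actions, values, returns, beta=0.01):
    """Sum of an advantage actor-critic term (with entropy regularization)
    and the behavioral-cloning cross entropy against shortest-path actions.
    logits: (T, A); gt_actions, taken_actions: (T,); values, returns: (T,)."""
    log_probs = F.log_softmax(logits, dim=-1)
    advantage = (returns - values).detach()               # A(s_t, a_t) estimate
    taken_logp = log_probs.gather(-1, taken_actions.unsqueeze(-1)).squeeze(-1)
    actor = -(taken_logp * advantage).mean()              # policy-gradient term
    critic = F.mse_loss(values, returns)                  # value regression
    entropy = -(log_probs.exp() * log_probs).sum(-1).mean()
    bc = F.cross_entropy(logits, gt_actions)              # CE(tau_hat, tau)
    return actor + critic - beta * entropy + bc
```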
Inference During training, exact environmental feedback can be used to evaluate samples and trajectories. This information is not available at inference, so we instead rely on the navigator's confidence to determine which of several sampled paths should be explored. Specifically, for every question-answer pair sampled, the agent rolls forward five navigation actions, and the probabilities of all resulting navigation sequences are compared. The trajectory with the highest probability is used for the next timestep. Note that this does not guarantee that the model is actually progressing towards the goal, but rather that the agent is confident it is acting correctly given the dialog context and target object hint.
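A sketch of this inference-time selection; the rollout and scoring interfaces are hypothetical stand-ins for the trained Navigator.

```python
def select_by_confidence(navigator, state, qa_candidates, steps=5):
    """Roll each sampled question-answer pair forward `steps` navigation
    actions and keep the branch whose action sequence the navigator assigns
    the highest log-probability."""
    def rollout_logp(qa):
        branch, logp = state.extended(*qa), 0.0
        for _ in range(steps):
            action, action_logp = navigator.sample_with_logp(branch)
            branch = branch.step(action)
            logp += action_logp
        return logp, branch
    return max((rollout_logp(qa) for qa in qa_candidates), key=lambda t: t[0])[1]
```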

Gameplay
As is common in dialog settings, there are several moving pieces and a growing notion of state throughout training and evaluation. In addition to the Navigator, Questioner, and Guide, ideally there should also be a model which generates the target object and one which determines when it is best to ask a question. We leave these two components for future work and instead assume we have access to the human-provided target (e.g., a plant) and set the number of steps before asking a question to four, based on the human average of 4.5 in CVDN.
Setting a maximum trajectory length is required due to computational constraints, as the language context w_{1:j} grows with every exchange. Following Thomason et al. (2019), we use a maximum navigation length of 80 steps, leading to a maximum of 20 question-answer exchanges. The dialog history is delimited with <TAR> (target object), <NAV> (Navigator questions), and <ORA> (Guide answers to questions) tags (Figure 2a). During rollouts the model is reinitialized to prevent information sharing via the hidden units.

Results
In Table 1 we present gameplay results for our RMM model and competitive baselines. We report two main results and four ablations for seen and unseen house evaluations; the former are novel dialogs in houses seen at training time, while the latter are novel dialogs in novel houses.

Full Evaluation
The full evaluation paradigm corresponds to QA_{1:i-1} for goal progress and BLEU. In this setup, the agent has access to and is attending over the entire dialog history up until the current timestep, in addition to the original target object t_O. We present three models and two conditions for RMM (N = 1 and N = 3). N refers to the number of samples explored in our recursive calls, so N = 1 corresponds to simply taking the single maximum prediction while N = 3 allows the agent to explore. In the second condition, the choice of path/dialog is determined by the probabilities assigned by the Navigator (Section 4.3).
An additional challenge for navigation agents is knowing when to stop. Following previous work (Anderson et al., 2018), we report Oracle Success Rates, measuring the best goal progress the agents achieve along the trajectory rather than the goal progress when the stop action is taken.
In unseen environments, the RMM-based agent makes the most progress towards the goal and benefits from exploration during inference. During inference the agent is not provided any additional supervision, but still makes noticeable gains by evaluating trajectories based on learned Navigator confidence. Additionally, we see that, while low, the BLEU scores are better for RMM-based agents across settings.
Ablations We also include two simpler results: t_O, where the agent is only provided the target object and explores based on this simple goal, and QA_{i-1}, where the agent is only provided the previous question-answer pair. Both of these settings simplify the learning and evaluation by focusing the agent on search and less ambiguous language, respectively. There are two results to note. First, even in the simple case of t_O, the RMM-trained model generalizes best to unseen environments. In this setting, during inference all models have the same limited information, so the RL loss and exploration have better equipped RMM to generalize.
Second, several trends invert between the seen and unseen scenarios. Specifically, the simplest model with the least information performs best overall in seen houses. This high performance coupled with weak language appears to indicate that the models are learning a different (perhaps search-based) strategy rather than how to communicate via and effectively utilize dialog. In the QA_{i-1} and QA_{1:i-1} settings, the agent generates a question-answer pair before navigating, so the relative strength of the RMM model's communication becomes clear. We next analyze the language and behavior of our models to investigate these results.

Analysis
We analyze the lexical diversity and effectiveness of questions generated by the RMM, and present a qualitative inspection of generated dialogs.

Lexical Diversity
Both RMM and Data Augmentation introduce new language by exploring the environment and generating dialogs. In the case of RMM, an RL loss is used to update the models based on the most successful dialog. In the Data Augmentation strategy, the best generations are simply appended to the dataset for one epoch and weighted appropriately for standard, supervised training. The augmentation strategy leads to a small boost in BLEU performance and goal progress in several settings (Table 1), but the language appears to collapse to repetitive and generic interactions. We see this manifest rather dramatically in Figure 4, where DA is limited to only 22 lexical types. In contrast, the Recursive Mental Model continues to produce over 500 unique lexical types, much closer to the nearly 900 of humans.
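Here, lexical types are unique word types across all generated utterances; assuming simple whitespace tokenization, the counts plotted in Figure 4 can be computed as follows.

```python
def lexical_types(utterances):
    """Number of unique word types across a set of generated utterances."""
    return len({word for utt in utterances for word in utt.lower().split()})
```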

Effective Questions
A novel component of a dialog paradigm is assessing the efficacy of every speech act in accomplishing a goal. Specifically, the optimal question should elicit the optimal response, which in turn maximizes the progress towards the goal room. If agents were truly effective at modeling each other, we would expect the number of dialog acts to be kept to a minimum. We plot the percent of questions asked against the percent of goal progress in Figures 5a and 5b. Human conversations in CVDN always reach the goal location, usually with only 3-4 questions (Figure 5a). We see that the relationship between questions and progress is roughly linear, excusing the occasional lost and confused human teams. The final human-human question is often simply confirmation that navigation has arrived successfully at the goal room.
In Figure 5b, we plot dialogs for the Baseline, Data Augmentation, and RMM agents against percent goal progress. The RMM consistently outperforms the other two agents in terms of goal progress for each dialog act. We see an increase in progress for the first 10 to 15 questions before the model levels off. In contrast, the other agents exhibit shallower curves and fail to reach the same level of performance.

Qualitative Results
Figure 1 gives a cherry-picked example trajectory, and Figure 6 gives a lemon-picked example trajectory, from the unseen validation environments.
We discuss the successes and failures of the lemon-picked Figure 6. As with all CVDN instances, there are multiple target object candidates (here, "fire extinguisher") but only one valid goal room. Goal progress is measured against the goal room. When the Guide is shown the next few shortest path steps to communicate, those steps are towards the goal room. As can be seen in Figure 6, the learned agents have difficulty in deciding when to stop and begin retracing their steps. This distinction is most obvious when comparing the language generated. Table 2 shows generated conversations along with the Goal Progress (GP) at each point when a question was asked. Note that the generation procedure for all models is the same sampler, and they start training from the same checkpoint, so the relatively coherent nature of the RMM as compared to the simple repetitiveness of the Data Augmentation is entirely due to the recursive calls and RL loss. No model has access to length penalties or other generation tricks to avoid degenerating.

Table 2: Dialog samples for Figure 6 with corresponding Goal Progress (GP); see the appendix for complete outputs.

Human
  Q: in between the ropes to my right or straight forward?
  A: straight forward through the next room  (GP 0.0)
  Q: Should I proceed down the hall to the left of turn right?
  A: head down the hall to your right into the next room  (GP 13.3)
  Q: Should I go through the open doors that are the closest to me?
  A: You are in the goal room  (GP 29.1)

DA
  Q: should i go into the room?
  A: you are in the goal room.  (GP 5.7)
  Q: should i go into the room?
  A: you are in the goal room.  (GP 0.0)

RMM
  Q: should i head forward or bedroom the next hallway in front of me?
  A: yes, all the way down the small hall.  (GP 4.0)
  Q: should i turn left here?
  A: head into the house, then you will find a doorway at the goal staircase. go through the doors before those two small exit chairs, about half way down the hall.  (GP 5.7)
  Q: lots of sink in this house, or wrong did. ok which way do i go
  A: go down the hallway, take a left and go down the next hallway and up the stairs on the right.  (GP 8.8)

Figure 1: The RMM agent recursively models conversations with instances of itself to choose the right questions to ask (and answers to give) to reach the goal.
(a) Dialog and action histories combined with the current observation are used to predict the next navigation action.
(b) A Bi-LSTM over the path is attended to during decoding for question and instruction generation.

Figure 2: Our backbone Seq2Seq architectures are provided visual observations and have access to the dialog history when taking actions or asking/answering questions (Thomason et al., 2019).

Figure 4: Log-frequency of words generated by human speakers as compared to the Data Augmentation (DA) and our Recursive Mental Model (RMM) models.
(a) Normalized plot of goal progress and number of questions asked by humans. Note that even for long dialogs, most questions lead to substantial progress towards the goal.
(b) DA- and RMM-generated dialogs make slower but consistent progress (ending below 25% of total goal progress).

Figure 5: Effectiveness of human dialogs (left) vs. our models (right) at achieving the goal. The slopes indicate the effectiveness of each dialog exchange in reaching the target.

Figure 6: Generated trajectories in an unseen environment. The red stop-sign is the target, while the black stop-signs are distractors (other fire extinguishers) that may confuse the agents. The white dashed trajectory is the human path from CVDN, black is the baseline model, and green is our RMM with N = 3.

Table 1: Gameplay results on CVDN, evaluated when the agent voluntarily stops or at 80 steps. Full evaluations are highlighted in gray with the best results in blue; the remaining white columns are ablation results.