Effective questions in referential visual dialogue

An interesting challenge for situated dialogue systems is referential visual dialogue: by asking questions, the system has to identify the referent the user refers to. Task success is the standard metric used to evaluate these systems. However, it does not consider how effective each question is, that is, how much each question contributes to the goal. We propose a new metric that measures question effectiveness. As a preliminary study, we report the new metric for state-of-the-art publicly available models on GuessWhat?!. Surprisingly, successful dialogues do not have a higher percentage of effective questions than failed dialogues. This suggests that a system with high task success is not necessarily one that generates good questions.


Introduction
GuessWhat?! is a cooperative two-player referential visual dialogue game. One player (the Oracle) is assigned a referent object in an image; the other player (the Questioner) has to guess the referent by asking yes/no questions. The GuessWhat?! dataset contains games of different complexity, ranging from easy images with a referent and one distractor to images with 19 distractors.
Referential visual dialogue has a clear task success metric: whether the Questioner is able to correctly identify the referent at the end of the dialogue. The need to go beyond this metric to evaluate the quality of the dialogues has already been observed. So far, attention has focused on the linguistic skills of the models (Shukla et al., 2019; Shekhar et al., 2019) and on their dialogue strategies (Abbasnejad et al., 2018; Shekhar et al., 2018). Recently, Sankar et al. (2019) showed that current SOTA dialogue systems do not take the dialogue history into account, and new models were proposed to make questions more informative and consistent with the dialogue history (Shukla et al., 2019; Ray et al., 2019; Abbasnejad et al., 2019; Pang and Wang, 2020). Still, the models are mostly evaluated without considering how much each question contributes to the goal.

We propose a new metric that evaluates the effectiveness of a dialogue as the percentage of effective questions it contains. Intuitively, a question is effective if it eliminates at least one possible distractor from the set of candidate objects (Krahmer and van Deemter, 2012). Figure 1 gives a game played by humans as an example. The image contains 8 candidate objects: the referent is the cow marked in green, and the distractors are the other 6 cows and the wooden stick. The dialogue is highly effective: 80% of the questions eliminate at least one distractor.

Human question                         Answer   # D   Effective
1. is it a cow?                        yes      6     True
2. is it the big cow in the middle?    no       5     True
3. a cow on the left?                  no       3     True
4. on the right?                       yes      3     False
5. first cow near us?                  yes      0     True

Figure 1: Human-human dialogue on the GuessWhat?! referential task, extracted from the GuessWhat?! dataset. The target is highlighted in green. # D is the number of candidate distractors remaining after the question is answered. Four out of five questions eliminate distractors and, hence, are effective.
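For illustration only, the per-question effectiveness flags and the 80% figure above can be recovered directly from the # D column. The snippet below is ours, not part of the dataset or any released model:

```python
# Distractors remaining before the dialogue (7 = 8 candidates minus the
# referent) and after each of the five answered questions in Figure 1.
remaining = [7, 6, 5, 3, 3, 0]
flags = [after < before for before, after in zip(remaining, remaining[1:])]
print(flags)                              # [True, True, True, False, True]
print(100 * sum(flags) / len(flags))      # 80.0
```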
Despite recent progress in the area of vision and language, recent work (Jain et al., 2019) on the vision-and-language navigation (VLN) task argues that current research leaves unclear how much of a role language plays in this task. They point out that the dominant evaluation metrics have focused on goal completion rather than on how each action contributes to the goal (Anderson et al., 2018). The nature of the path an agent takes, however, is of clear practical importance: it is undesirable for a robotic agent in the physical world to reach its destination by taking large detours or by entering dangerous zones. Jain et al. (2019) propose alternative metrics that evaluate the intermediate steps of the VLN task. Similarly, as argued by Lowe et al. (2019), the vast majority of recent papers on emergent communication show that adding a communication channel leads to an increase in task success. This is a useful indicator, but it provides only a coarse measure of the agents' learned communication abilities. As we move towards more complex environments, it becomes imperative to have a set of finer-grained tools that allow qualitative and quantitative insights into the emergence of communication.
Following this idea of not only focusing on goal completion but also evaluating how much each step contributes to the goal, in this paper we propose a new metric for referential dialogue. We agree with Thomason et al. (2019) that incremental evaluation metrics such as ours should look further back into the dialogue history. We believe that language and vision systems should also be evaluated on aspects such as grammaticality, truthfulness, and diversity, as done in previous work (Lee et al., 2018; Ray et al., 2019; Xie et al., 2020; Murahari et al., 2019). In this paper we focus on whether a question is effective given the dialogue history and the visual context.
One of the motivations for referential visual dialogue is to provide robots with the ability to identify objects through dialogue with humans. The task we address in this paper is a simplification: in our setup, the view of the robot is static (i.e., a picture). For our work we use the GuessWhat?! dataset. We are particularly interested in models that generate questions by explicitly modelling the dialogue history (Shukla et al., 2019; Pang and Wang, 2020).

Effective questions
Our definition of effective question is based on the set of candidate objects: the reference set RS. We compute RS for each question q_t. The reference set before the dialogue starts, RS(q_0), contains all the objects in the image. At each dialogue turn t, RS(q_t) is defined as the set of objects in RS(q_{t-1}) for which the answer to q_t is the same as the answer to q_t on the referent r. Formally:

RS(q_t) = { o ∈ RS(q_{t-1}) : A(q_t, o) = A(q_t, r) },

where A(q, o) denotes the answer to question q about object o. We say that a question q_t is not effective iff RS(q_t) = RS(q_{t-1}), that is, if the question does not exclude any distractor. The effectiveness of a dialogue is given by the percentage of effective questions it contains.

Table 1 reports the average effectiveness (Global column) for humans and for the SOTA models for which either the code or the dialogues with suitable annotations have been released. We also distinguish the effectiveness of dialogues that end in Failure and in Success. The baseline model represents the Questioner as two independent modules, the question generator and the guesser, and trains them by supervised learning. RL further trains this baseline with a reinforcement learning phase. GDSE-SL differs from the baseline by having a joint encoder for the Questioner components, and GDSE-CL exploits this joint architecture by letting the two components cooperate with each other (Shekhar et al., 2019). Lastly, VDST (Pang and Wang, 2020) extends the Questioner with a probability distribution over each object being the referent and trains it with reinforcement learning.
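The reference-set update and the resulting dialogue-level score can be computed with a straightforward procedure. The following is a minimal sketch, not the authors' released code; it assumes a caller-supplied function oracle_answer(question, obj) that returns the answer the Oracle would give for the question about a given object (e.g., derived from the dataset's object annotations):

```python
def dialogue_effectiveness(questions, objects, referent, oracle_answer):
    """Percentage of questions that shrink the reference set.

    `oracle_answer(question, obj)` is assumed to return "yes"/"no" for
    `question` asked about `obj` (e.g. computed from object annotations).
    """
    rs = set(objects)                    # RS(q_0): all candidate objects
    n_effective = 0
    for q in questions:
        target_answer = oracle_answer(q, referent)
        # RS(q_t): objects whose answer matches the answer on the referent
        new_rs = {o for o in rs if oracle_answer(q, o) == target_answer}
        if len(new_rs) < len(rs):        # at least one distractor removed
            n_effective += 1
        rs = new_rs
    return 100.0 * n_effective / len(questions) if questions else 0.0
```

Applied to the human dialogue in Figure 1, this procedure yields the 80% reported above.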
The results suggest that the models ask more non-effective questions than one might expect. Surprisingly, successful dialogues generated by the models do not have a higher percentage of effective questions. Even for humans, effectiveness is not considerably higher in successful dialogues. Human effectiveness is higher in almost every column of the table, although the VDST model comes close. Note that humans do not see the list of annotated objects, as the Guesser models do: they rely on their view of the image, and they may ask questions that discard objects present in the image but not annotated in the dataset, and hence not part of the reference set we compute. All of these questions are marked as non-effective because they discard objects invisible to our metric and to the models. Hence, human effectiveness could be higher than what we have calculated using the GuessWhat?! dataset object annotations. Our manual inspection of human dialogues has shown that humans ask non-effective questions mostly at the end of the dialogue, to reinforce their belief before guessing. We only report results for the 5-question setup for VDST, as we only had access to those dialogues.

VDST                            GDSE-CL
1. is it food? yes              1. is it food? yes
2. is it in the left? yes       2. is it a cake? yes
3. is it in the front? yes      3. is it the dark brown? yes
4. is it in the top? no         4. is it the entire cake? yes
5. in the middle? no            5. so the most left of the brown ones? yes

Figure 2: Dialogues generated by VDST and GDSE-CL in a successful game. Non-effective questions in italics.

Figure 2 shows the dialogues generated by VDST and GDSE-CL in a game in which both models succeed. Effectiveness is 60% for VDST and 40% for GDSE-CL. Our definition of effectiveness not only accounts for question repetitions, but also captures paraphrases and context-dependent redundancies. Examples of context-dependent redundancy can be seen for both systems. In the VDST dialogue, question 4 is redundant because, in this image, there is no cake that is both in the front and in the top. In the GDSE-CL dialogue, question 2 is redundant because all the cakes in the image are dark brown.
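To make the notion of context-dependent redundancy concrete, the toy example below runs the dialogue_effectiveness sketch from the previous section on invented objects and attributes (they are ours, not taken from the actual image): once "is it in the front?" has been answered yes and no remaining candidate is in the top, asking about the top removes nothing.

```python
# Hypothetical objects and attributes, for illustration only.
objects = {
    "cake_a": dict(food=True,  left=True,  front=True,  top=False),  # referent
    "cake_b": dict(food=True,  left=False, front=True,  top=False),
    "cake_c": dict(food=True,  left=True,  front=False, top=True),
    "plate":  dict(food=False, left=False, front=True,  top=False),
}

def oracle_answer(question, obj):
    # In this toy setup each question simply asks about one attribute.
    return "yes" if objects[obj][question] else "no"

questions = ["food", "left", "front", "top"]
# "top" is not effective: after "front" = yes, no remaining candidate is in the top.
print(dialogue_effectiveness(questions, objects, "cake_a", oracle_answer))  # 75.0
```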

Conclusion and future work
We proposed a new metric for evaluating GuessWhat?! dialogues. Effectiveness, as we defined it, evaluates whether a question rules out at least one possible distractor; we consider a question effective if it makes the reference set smaller. We observe that effectiveness decreases as dialogues advance and reaches its lowest level in the last turn. We also find that successful dialogues do not have a higher percentage of effective questions than failed ones. This is surprising, and it hints at the fact that there are strategies other than asking effective questions for accomplishing reference identification. We believe that our metric could serve as a heuristic to guide the training of end-to-end models.