Giving Commands to a Self-Driving Car: How to Deal with Uncertain Situations?

Current technology for autonomous cars primarily focuses on getting the passenger from point A to B. Nevertheless, it has been shown that passengers are afraid of taking a ride in self-driving cars. One way to alleviate this problem is by allowing the passenger to give natural language commands to the car. However, the car can misunderstand the issued command or the visual surroundings, which could lead to uncertain situations. It is desirable that the self-driving car detects these situations and interacts with the passenger to solve them. This paper proposes a model that detects uncertain situations when a command is given and finds the visual objects causing them. Optionally, a question generated by the system describing the uncertain objects is included. We argue that if the car could explain the objects in a human-like way, passengers could gain more confidence in the car's abilities. Thus, we investigate how to (1) detect uncertain situations and their underlying causes, and (2) generate clarifying questions for the passenger. When evaluating on the Talk2Car dataset, we show that the proposed model, the Uncertainty Resolving System (URS), improves \gls{m:ambiguous-absolute-increase} in terms of $IoU_{.5}$ compared to not using URS. Furthermore, we designed a referring expression generator (REG), the Attribute-Referring Expression Generator (A-REG), tailored to a self-driving car setting, which yields a relative improvement of \gls{m:meteor-relative} METEOR and \gls{m:rouge-relative} ROUGE-L compared with state-of-the-art REG models, and is three times faster.


Introduction
Figure 1: Command: "There is mark on the bench! Pull over here and park so I can go grab lunch with him". The referred object is indicated with the yellow bounding box in the image and in bold font in the text. Best viewed in color.
In the last few years, we have witnessed a fast-growing interest in self-driving cars. Many researchers are involved in creating the first completely autonomous vehicle that drives in open realistic environments where a human driver becomes obsolete. However, surveys (Howard and Dai, 2014; Schoettle and Sivak, 2014; Richardson and Davies, 2018; Edmonds, 2019) show that people might feel uneasy or even scared to sit in a self-driving car where they no longer have control over the car's actions. For this study, we also held a survey on several social media platforms asking how much people trust self-driving cars. Out of the 254 respondents, 36.8% indicated that they would not trust riding in a self-driving car. As reported in previous surveys, the majority of these people explained that they were afraid of no longer being in control or had safety concerns about autonomous cars. Furthermore, we noted that with the ability to give commands to self-driving cars, this number decreased to 30.4%. This is in line with recent studies, in which datasets are proposed to give such commands to self-driving cars. Examples of such datasets are Talk2Car (Deruyttere et al., 2019), Talk2Nav (Vasudevan et al., 2019), and Touchdown (Chen et al., 2018), to name a few. The first investigates commands that should be executed in the car's visual field. An example of such a command is shown in Figure 1. The latter two investigate natural language navigation commands that contain directional guidance (e.g., "cross the intersection and then turn left, when you see a red building turn right") and require the car to drive through the city to achieve its end goal.
To the best of our knowledge, all studies ignore that the issued command and the visual scene might result in an uncertain or ambiguous interpretation by the system. We believe that it is essential for a system to recognize and resolve these ambiguous situations. An example of this is given in the Talk2Car setting. This dataset includes commands such as "Stop next to this guy with his blue shirt. I need to pick him up. He is my friend". Nevertheless, suppose multiple persons are wearing a blue shirt. In that case, it might not be apparent to the car which person is being referred to. We hypothesize that resolving these ambiguous situations can be achieved in two ways. Either the car displays the source of ambiguity on a screen or, in addition to the former, the car can ask confirming questions to the passenger in either textual form or by speech. In the example, the car could generate a clarifying question such as "Do you mean the man on the left or the man on the right?" to lift the ambiguity.

Figure 2: The Uncertainty Resolving System (URS) works by receiving the output of any visual grounding (VG) model that outputs a softmax distribution. The VG model is given an image $I$, a command $c$ (for instance: "Pick up that person"), and a set of objects $O_I$ (in this case, they are limited to the green and yellow bounding box) as inputs. Based on these inputs, it predicts which object is being referred to. In Step 1 of URS, we detect, based on the output of the VG model, whether there is an ambiguous situation and also which objects are causing this uncertainty. For Step 2 of URS, we have a visual output and optionally also a textual output. In the visual output, we display the uncertain objects as an overlay on the image, for instance, on a touch screen. The user can then indicate which object they are referring to. We can also have an additional textual output that shows a question explaining the uncertain objects to the passenger (in the figure: "Do you mean the person with black dress, or the person with blue shirt?"). Best viewed in color.
In our aforementioned survey, we proposed both options and asked respondents which of these would make them feel more confident in the car's abilities. A majority of 54.9% answered that they would feel more confident with a visual + textual/speech output, while 38.7% opted for only the visual output. The remaining 6.3% indicated that they would not feel more confident with either of these two options. Hence, based on the results of the survey, in this paper we propose and evaluate a novel model that can output the source of ambiguity on the screen (visual output) or, in addition to the former, also generate a question geared towards the passenger (textual/speech output). For simplicity, however, we create a model that outputs textual questions instead of speech. The extension from text to speech should be possible in the future given the large amount of research and improvements in this area (Taigman et al., 2017; Ren et al., 2019; Valle et al., 2020).
When passengers give a command to a self-driving car that refers to a specific object, a first step in the command understanding is to detect this object (Deruyttere et al., 2020b). This task is often called visual grounding (VG) in the literature. However, the issued command could introduce some uncertainty for the VG model. Hence, we propose a novel Uncertainty Resolving System (URS), to make the VG model capable of coping with uncertainty or ambiguity caused by the command and by the visual scene perceived by the car. A visualisation of the complete URS can be found in Figure 2. To use URS, the softmax output of the VG model is first obtained by giving it an image, a free-form natural language command, and a set of objects that have been extracted with a Region Proposal Network (RPN). Next, URS measures, based on the output of the VG model, whether the command can be easily interpreted (Step 1 in Figure 2), or whether the system is uncertain about some objects as being the referred object in the command. To detect this uncertainty, we evaluate many different methods (e.g., ensembles, temperature scaling) and combinations of these methods. Furthermore, we design a novel set of restrictions to limit the number of predicted uncertain objects while still maintaining a high increase in performance. In the case of uncertainty, URS can return a visual output, and optionally, a textual output too. We design a robust two-layer LSTM referring expression generator that can efficiently use image features for the textual output. Furthermore, we propose a varying set of attributes that can uniquely describe the objects (e.g., color, distance, action) and experiment with using them in the referring expression generator.
The proposed URS is extensively evaluated on the Talk2Car dataset (Deruyttere et al., 2019). For the textual output of URS we have extended the dataset with expressions that discriminatively refer to specific objects, also called referring expressions. Next to these expressions, we also annotate the objects with attributes to see if this helps in generating better expressions, for which we propose the Attribute-Referring Expression Generator (A-REG) (Step 2 in Figure 2). The Talk2Car dataset was chosen over other datasets since it already contains many different modalities such as radar, LiDAR, instance segmentation, video, map annotations, etc., some of which are used in this work. We separately evaluate each component of the system and assess the improvements in correctly understanding the commands of the Talk2Car dataset when based on the proposed human-machine interaction.
The contributions of this study are fourfold:
1. We evaluate different uncertainty detection methods and their combinations. We present a novel set of constraints valuable in a self-driving car situation that balance a small number of uncertain objects presented to the passenger against high accuracy in solving the uncertain situation.
2. Based on the detected uncertainty, we propose a method that generates questions given the visual characteristics of the objects that cause uncertainty.
3. We quantitatively evaluate the uncertainty detection model and the question generation approach.
4. Finally, human judges qualitatively evaluate the validity of the generated questions in solving the uncertainty.

In Section 2, we discuss the related work. In Section 3 we describe the URS by explaining the VG model used in Subsection 3.2, how uncertainty is detected in Subsection 3.3, how objects are described in Section 4, and more specifically how a question is generated in Subsection 4.3. Section 5 discusses the experiments and the used dataset. Finally, Section 6 presents the conclusion and possible future work.

Related Work
The main topics we discuss in our related work are object detection in terms of the command (Subsection 2.1), uncertainty detection (Subsection 2.2), and referring expression generation (Subsection 2.4).

Detection of the Referred Object of the Command
Several approaches for the VG task (i.e., finding the object referred to in the command) in a self-driving car setting have been proposed. One approach implements a multi-step reasoning neural network where the network's reasoning process is guided by focusing on the natural language command sub-parts in different reasoning steps (Yu et al., 2018; Deng et al., 2018; Deruyttere et al., 2020a). Another paradigm is using a graph neural network (Wang et al., 2019). The currently most successful approaches use pre-trained language models to encode the language command (Lu et al., 2020; Chen et al., 2020). For the current study, we use the model from Rufus et al. (2020), which uses a pre-trained Sentence-BERT by Reimers and Gurevych (2019) to encode commands, and a pre-trained EfficientNet-b2 by Tan and Le (2019) to encode objects detected in the image. However, the detected referred object is not always correct, hence the importance of accurate uncertainty detection and quantification.

Uncertainty Detection
Currently, we rely on numerous artificial intelligence systems to help us with decision making. Detecting when the system's prediction is uncertain and accurately quantifying this uncertainty are research topics of increasing interest. Especially when decisions are taken in real-world situations, uncertainty detection is of paramount importance. The probabilistic output of a classification model, for instance, obtained with a softmax function, is not always a reliable reflection of the model's confidence. Even with high softmax values, a model can still be uncertain about its prediction (Gal, 2016). This is because classifiers, and specifically deep neural networks, tend to be poorly calibrated and overconfident in their predictions (Guo et al., 2017).
To overcome this issue, temperature scaling can be applied to the softmax outputs a posteriori (Guo et al., 2017). Additionally, Bayesian Neural Networks could be used (MacKay, 1992), which detect uncertainty by specifying a prior distribution over the parameters of a neural network. After training the network on data, the posterior distribution is used to estimate the uncertainty (Lakshminarayanan et al., 2017). However, these networks are not well suited for many real problems as they are computationally complex and challenging to train (Gal and Ghahramani, 2016; Kendall and Gal, 2017; Lakshminarayanan et al., 2017; Ayhan et al., 2019). An alternative that approximates the Bayesian Neural Network's behavior is using an ensemble of networks to quantify uncertainty (Lakshminarayanan et al., 2017). An ensemble of networks can also be used to recalibrate an existing network. Instead of training and running multiple networks, Gal and Ghahramani (2016) propose to use Monte Carlo Dropout (MC Dropout), which applies dropout (Srivastava et al., 2014) during both training and inference. During inference, this method simulates an ensemble of networks by sampling different dropout masks.
Quantifying uncertainty has made a successful contribution to several applications. For instance, Ayhan et al. (2019) have created a system to detect diabetic retinopathy that also reports its uncertainty by using data augmentation during inference. Detecting uncertainty with MC Dropout in 3D object detection with LiDAR also resulted in improved accuracy (Feng et al., 2018). To generate better depth maps when jointly using LiDAR and images, Van Gansbeke et al. (2019) use a network with a branch for local information and another for global information. In addition, each branch learns a confidence map in an unsupervised manner. Based on these confidence maps, the network knows which branch is more certain and can give it more weight. Lee et al. (2018) use a Bayesian ensemble to cope with sensor failures in an autonomous vehicle. Xiao and Wang (2019) employ MC Dropout to measure uncertainty in natural language tasks and show that this improves accuracy.
In this paper, we focus on methods that detect and quantify uncertainty when the system understands a natural language command given in a certain visual context. We want the method to be agnostic of the underlying visual grounding (VG) system, hence our choice of temperature scaling and ensemble methods. An important novel goal of this paper is how to adapt uncertainty detection to the situation where a passenger interacts with his or her self-driving car and how to balance the accuracy of the uncertainty detection and the limitation of the cognitive overload for the passenger. Rupprecht et al. (2018) propose a method to improve image segmentation models by giving a user a first estimate of the segmentation mask. Then, the user can indicate the wrongly segmented parts of the image through speech/text. In a second pass, the segmentation model updates its prediction based on the received feedback. Instead of giving user feedback in the form of speech/text, Agustsson et al. (2019) propose to allow the user to draw a corrective scribble on top of the semantic segmentation to indicate the wrongly predicted parts. Branson et al. (2014) propose a hybrid human-machine vision system for fine-grained categorization. The goal is to predict which bird is shown in the image. The machine needs to ask questions to a human to reduce its uncertainty about the predicted bird species as quickly as possible. The human can then click on the image to indicate specific parts of the bird or tell the system which color certain parts are.

Resolving Uncertainty by a Natural Language Interface: Question Generation
Handling an uncertain or ambiguous situation by posing a natural language question to the user of an AI system is generally a good idea. For instance, Xu et al. (2019) generate template-based questions in a knowledge-base question answering setting, where a binary classifier first detects whether a question needs to be posed by the system. Specific parts of a question are generated relying on a Seq2Seq (Bahdanau et al., 2014) or a Transformer (Vaswani et al., 2017) architecture and are then joined based on a template. This paper will use a similar approach. A completely different approach is to extract text spans from a context instead of generating free-form questions or expressions.
Generating natural language questions about objects that cause uncertainty is related to referring expression generation (REG), where expressions uniquely describe objects compared to other objects in the same image. The task has similarities with the "dense image captioning" task (Johnson et al., 2016), which creates a non-unique caption for all perceived objects in the image. The REG task can be defined as follows: given an image and a set of objects, a model needs to generate a unique referring expression that describes each of the given objects. Mao et al. (2016) use a convolutional neural network (CNN) to encode the objects in the image and pass these features to a recurrent neural network (RNN) to generate a caption describing the objects. They use Maximum Mutual Information (MMI) training to penalize the model if it thinks that a certain expression could also be generated from another object in the same image. Liu et al. (2017) first train an attribute predictor to predict the attributes of objects. Then, they pass these attributes together with the extracted CNN object features to their generation module, which uses an LSTM to generate an expression. Yu et al. (2016b) embed object features and referring expressions in a joint semantic space used to generate expressions that discriminate between objects. The idea of a joint semantic space is followed by Tanaka et al. (2018), resulting in one of the top-performing methods in referring expression generation. The difference with the work from Yu et al. (2016b) is the use of a target-centered prior where they put a Gaussian distribution over the location of the object, the features of the target object, and the sentence context under generation. Tanaka et al. (2018) obtain results similar to or better than those of Liu et al. (2020a) on RefCOCO(+,g). In this paper, we use the models proposed by Yu et al. (2016b) and Tanaka et al. (2018) as baselines in the task of question generation.
Graph-based solutions have recently been proposed (Yao et al., 2018;Gu et al., 2019) in the frame of natural language generation. These solutions stand or fall with the quality of the generated graphs (Zellers et al., 2018). This paper proposes to condition the generated question on the object's label and its attributes. These can be reliably extracted from the visual scene through the use of attribute predictors. Additionally, we propose a method that enforces the use of these labels in the generated question.
State-of-the-art language models make use of a Transformer architecture (Vaswani et al., 2017), like GPT-3 (Brown et al., 2020). However, this is not the case for any of the state-of-the-art REG models. Furthermore, in our case, the generated questions that describe uncertain objects need to be short, and their generation should be fast in the self-driving car setting. While the state-of-the-art REG models use a simpler Long Short-Term Memory (LSTM) model design, we design a robust two-layer LSTM referring expression generator that can efficiently use all features for the textual output. Furthermore, we propose some varying sets of attributes that can uniquely describe the objects (e.g., color, distance, action) and integrate them in the referring expression generator.

Uncertainty Resolving System (URS)
In this section, we describe the proposed system (Figure 2), which we name Uncertainty Resolving System (URS). We start this section by describing the notation used in this paper (Subsection 3.1). Then, we discuss the VG model (Subsection 3.2) that detects the object that is referred to in the command in the visual scene. Next, we move to uncertainty detection and quantification, and identify the objects proposed by the VG model that cause the uncertainty (Subsection 3.3). Then, we describe the generation process of the question, which is used to ask the passenger for clarification (Section 4).

Notation Introduction
Following the notation standards from Goodfellow et al. (2016), vectors are written as $\mathbf{v} \in \mathbb{R}^{d}$, with $d$ their size, and matrices are written as $\mathbf{M} \in \mathbb{R}^{a \times b}$, where $a$ is the input dimension and $b$ the output dimension.

Visual Grounding (VG) Model
Before using our URS, we require a VG model that finds the target object $o^{*}$ referred to by the command $c$ in a certain image $I$. Such a model can be formulated as:

$$o^{*} = \arg\max_{o \in O_I} p(o \mid c, I, \theta) \tag{1}$$

with $O_I$ the set of all objects in the image and $\theta$ the model parameters. For readability, we notate the probability distribution over the set of objects as $p(O_I \mid \Phi, \theta)$, with $\Phi$ the set of all inputs. Although our model is agnostic of the underlying VG model for computing this probability distribution, in this paper we make use of the CMSVG model (Rufus et al., 2020) as the VG model, since it is one of the top-performing models on the Talk2Car dataset at the time of writing. This model uses CenterNet as an RPN to extract the set of objects $O_I$ from image $I$. For each object $o \in O_I$, EfficientNet-b2 (Tan and Le, 2019) extracts its L2-normalized visual feature vector of $\mathbb{R}^{1000}$. The command $c$ is encoded into a feature vector of $\mathbb{R}^{1024}$ by a RoBERTa-based Sentence-BERT (Reimers and Gurevych, 2019). The resulting vector is then mapped to a vector of $\mathbb{R}^{1000}$ using a fully-connected layer to obtain a sentence embedding with the same dimension as the visual feature vector. Next, the sentence embedding is combined with the visual feature vector through a dot product followed by the application of a softmax function to compute the probability distribution, as shown in Eq. 1. For a more detailed explanation, we refer the reader to the original paper by Rufus et al. (2020).
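The scoring step described above can be illustrated with a minimal sketch. It is not the CMSVG implementation: the function names `softmax` and `ground_command` and the toy 3-dimensional vectors are assumptions made for illustration, whereas the actual model works with vectors in $\mathbb{R}^{1000}$.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ground_command(command_vec, object_vecs):
    """Score every object against the command embedding.

    Returns the index of the most probable object and the full
    probability distribution over the objects.
    """
    # Dot product between the (already projected) command embedding
    # and each object's visual feature vector gives one logit per object.
    logits = [sum(c * v for c, v in zip(command_vec, obj)) for obj in object_vecs]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs

# Hypothetical toy features: three candidate objects, 3-d embeddings.
command = [1.0, 0.0, 0.5]
objects = [[0.9, 0.1, 0.4], [0.1, 0.9, 0.0], [0.7, 0.0, 0.6]]
best, probs = ground_command(command, objects)
print(best, [round(p, 3) for p in probs])
```

In this toy example the first and third objects receive similar scores, which is the kind of near-tie that the uncertainty detection step is meant to catch.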

Jointly Detecting Uncertainty and Uncertain Objects
In the first part of URS (Step 1 in Figure 2), the uncertainty when predicting the referred object $o$ in Eq. 1 is quantified. Furthermore, this step jointly creates a subset $O_c \subseteq O_I$ containing viable candidates for the target referred object, which can be used to limit the number of computations needed in the next step of URS. We examine several meta-classifiers that classify the VG model's output as correct or uncertain solely based on its probability distribution and also return a subset $O_c$ of candidates for the referred object. The examined classifiers all use a combination of the four elements described below: model calibration (Subsection 3.3.1), the output function of the VG model (Subsection 3.3.2), the uncertainty detection method (Subsection 3.3.3), and the number of objects in the visual scene (Subsection 3.3.4).

Calibrating the VG Model
Interpreting the pure softmax output of a classifier as its confidence is not always the best approach, as most deep networks tend to be overconfident (Guo et al., 2017). However, by recalibrating the model using an Ensemble or a posteriori Temperature Scaling, the softmax output can be made to correlate better with the confidence. We examine whether recalibrating with these methods aids uncertainty detection compared to the original softmax output.
Ensembles (Ens) have successfully been used to calibrate models, as well as for measuring the uncertainty of the prediction (Lakshminarayanan et al., 2017). Therefore, we create an ensemble of $E$ randomly initialized VG models. To compute the calibrated probability distribution for the ensemble, we use the following equation:

$$p_E = \frac{1}{E} \sum_{e=1}^{E} p_e \tag{2}$$

where $p_e$ is the probability output of the $e$-th VG model, and $p_E$ is the calibrated distribution of the full ensemble. In the experiments, we will use the notation Ens$_E$ where $E$ indicates the number of models in the ensemble. In addition to calibrating a model, using an ensemble can also increase the accuracy. Hence, we also test in our experiments how much can be gained in terms of accuracy when using an ensemble.
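As a small illustration of the ensemble calibration step, the sketch below assumes that the ensemble distribution is the mean of the member distributions, as is common for deep ensembles. The function name `ensemble_distribution` and the toy probabilities are hypothetical.

```python
def ensemble_distribution(member_probs):
    """Average the per-object probabilities of the ensemble members.

    member_probs: list of E distributions, each a list with one
    probability per object in the scene (assumed to be aligned).
    """
    n_members = len(member_probs)
    n_objects = len(member_probs[0])
    return [sum(p[i] for p in member_probs) / n_members for i in range(n_objects)]

# Three hypothetical VG models scoring the same three objects.
members = [[0.7, 0.2, 0.1], [0.4, 0.5, 0.1], [0.1, 0.6, 0.3]]
p_ens = ensemble_distribution(members)
print([round(p, 3) for p in p_ens])
```

Each member is individually quite confident, while the averaged distribution spreads its mass over the first two objects, which makes the disagreement visible to the downstream uncertainty detector.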
Temperature Scaling (TS) Guo et al. (2017) update the regular softmax function to include a hyper-parameter $\tau$:

$$\mathrm{softmax}(\mathbf{z})_i = \frac{\exp(z_i / \tau)}{\sum_{j=1}^{N} \exp(z_j / \tau)} \tag{3}$$

where $\mathbf{z}$ are the raw logits for some arbitrary classifier, $z_j$ is the $j$-th value in $\mathbf{z}$, and $N$ the number of classes. Guo et al. (2017) note that by examining Eq. 3, one can observe that with $\tau \to \infty$, the probability distribution approaches a uniform distribution, and with $\tau \to 0$, the probability distribution collapses to a point mass, representing maximum uncertainty and complete certainty respectively. When $\tau = 1$, we have the normal softmax function. Note that this method does not influence the model's accuracy in any way, only the calibration.
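The effect of the temperature can be checked numerically with a short sketch. The function name `softmax_with_temperature` and the example logits are assumptions for illustration; in practice $\tau$ would be tuned on held-out data rather than set by hand.

```python
import math

def softmax_with_temperature(logits, tau=1.0):
    """Softmax over logits divided by the temperature tau (tau > 0)."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    scaled = [z / tau for z in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
sharp = softmax_with_temperature(logits, tau=0.1)    # close to a point mass
normal = softmax_with_temperature(logits, tau=1.0)   # regular softmax
flat = softmax_with_temperature(logits, tau=100.0)   # close to uniform
```

Because dividing by a positive $\tau$ preserves the ordering of the logits, the highest-scoring object is the same in all three cases; only the spread of the distribution changes.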

Influence of other output functions
Here, we investigate whether we could replace the softmax function, which is used to transform the raw logits of the VG model into a probability distribution, with a sigmoid function. This operation does not affect the ordering of the classes or the output range (still between 0 and 1); however, the resulting scores no longer form a proper probability distribution.
Additionally, we investigate whether the VG model's raw logits could be used to detect uncertainty. This procedure requires additional constraints on the raw logits (i.e., limiting them to a pre-defined range) to allow efficient computation. To limit the raw logits' range to [-1,1], we use the cosine similarity of the object's feature vector and the sentence encoding.

Uncertainty Detection Methods
Uncertainty detection allows directly predicting whether the model was certain or uncertain about its output. In the latter case, the method selects the objects causing the uncertainty and collects them in the candidate set $O_c \subseteq O_I$. The ideal meta-classifier has a high accuracy for detecting when the object with the highest probability proposed by the VG model is also the correct referred object. This requirement is vital, as a wrong prediction that our meta-classifier classifies as certain cannot be recovered by our system.
Furthermore, the candidate set should remain small since we are in a time-critical environment. Confronting a car's passenger with many uncertain objects in the visual scene to choose from or generating a natural language question that entails the description of many objects in the visual scene would lead to a cognitive overload. Consequently, this situation would not allow a quick response by the passenger and the car's consequent timely action. Thus the method should balance between maximally exploiting uncertainty and restraining the size of O c on which the question generation will be based. The evaluated methods are described in the following paragraphs.
Softmax Addition (SA) relies on the softmax probability distribution adding up to 1. The top-$k$ objects from $O_I$ are selected based on their probability from distribution $p(O_I \mid \Phi, \theta)$, such that the sum of these $k$ probabilities is higher than the sum of the remaining $|O_I| - k$ probabilities, where $|O_I|$ is the number of objects in $O_I$. With $k = 1$, the model is classified as certain, and with $k > 1$ the model is uncertain with all top-$k$ objects in the candidate set $O_c$.
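A minimal sketch of Softmax Addition is given below. It assumes that $k$ is the smallest value for which the top-$k$ mass exceeds the remaining mass; the function name `softmax_addition` and the example distributions are hypothetical.

```python
def softmax_addition(probs):
    """Return (is_certain, candidate_indices) for a probability distribution.

    Objects are added in order of decreasing probability until their
    summed probability is strictly larger than that of the remaining objects.
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    total = sum(probs)
    mass = 0.0
    candidates = []
    for i in order:
        candidates.append(i)
        mass += probs[i]
        if mass > total - mass:
            break
    return len(candidates) == 1, candidates

print(softmax_addition([0.6, 0.3, 0.1]))    # one object already holds the majority
print(softmax_addition([0.4, 0.35, 0.25]))  # two objects are needed
```

With this rule the model is only called certain when a single object holds more than half of the probability mass.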
Centroid Agglomerative Hierarchical Clustering (CAHC) starts the clustering with every object forming its own cluster, with the probability of the object under $p(O_I \mid \Phi, \theta)$ representing it. Next, the method iteratively merges pairs of clusters if their centroids' absolute distance is smaller than a parameter $\delta$. The centroid of the merged cluster is computed as the average of the probabilities of its objects. When no more clusters can be merged, i.e., there are no clusters within a distance $\delta$ of each other, the one with the largest probability is selected. If there is only one object in the cluster, the model is classified as certain. Otherwise, it is uncertain with all the objects in the highest probability cluster being members of the candidate set $O_c$.
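The clustering procedure can be sketched in a few lines. The merge order is not specified in the text, so this sketch assumes that the closest pair of clusters is merged first; the function name `cahc` and the example values are hypothetical.

```python
def cahc(probs, delta):
    """1-D centroid agglomerative clustering of object probabilities.

    Returns (is_certain, candidate_indices), where the candidates are
    the members of the cluster with the highest centroid.
    """
    clusters = [[i] for i in range(len(probs))]

    def centroid(cluster):
        return sum(probs[i] for i in cluster) / len(cluster)

    while len(clusters) > 1:
        # Find the closest pair of clusters by centroid distance.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dist = abs(centroid(clusters[a]) - centroid(clusters[b]))
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        dist, a, b = best
        if dist >= delta:
            break  # no pair of clusters is within delta of each other
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    top = max(clusters, key=centroid)
    return len(top) == 1, sorted(top)

print(cahc([0.45, 0.40, 0.10, 0.05], delta=0.1))
print(cahc([0.90, 0.05, 0.05], delta=0.1))
```

In the first call the two high-probability objects end up in the same cluster and are returned as uncertain candidates; in the second call the top object stays alone and the model is classified as certain.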
Thresholding (SoftTr, SigmTr, RLT) makes use of a threshold $\eta$ for classifying the model as certain or not. In case of the softmax output of the model, the threshold (trained on the validation set) is applied over the probability distribution $p(O_I \mid \Phi, \theta)$ to create the candidate set $O_c$ as follows:

$$O_c = \{\, o \in O_I \mid p(o \mid \Phi, \theta) \geq \eta \,\} \tag{4}$$

In case only a single object is part of the candidate set, the model is classified as certain. Otherwise, it is uncertain. When the softmax output function is replaced with a function from Subsection 3.3.2, the probability distribution in Eq. 4 is replaced by the new output distribution. In the experiments we will refer to softmax thresholding as SoftTr, sigmoid thresholding as SigmTr, and raw logits thresholding as RLT.
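A minimal sketch of the thresholding rule follows, assuming that the candidate set contains every object whose score is at least $\eta$. The function name `threshold_candidates` and the example scores are hypothetical, and the same function applies to softmax, sigmoid, or bounded raw-logit scores.

```python
def threshold_candidates(scores, eta):
    """Return (is_certain, candidate_indices) for a list of object scores.

    An object becomes a candidate when its score reaches the threshold;
    the model counts as certain only if exactly one object does.
    """
    candidates = [i for i, s in enumerate(scores) if s >= eta]
    return len(candidates) == 1, candidates

scores = [0.50, 0.45, 0.05]
print(threshold_candidates(scores, eta=0.30))  # two objects pass the threshold
print(threshold_candidates(scores, eta=0.48))  # only the top object passes
```

Note that a threshold that is too high can leave the candidate set empty; how such a case should be handled is not stated in the text, and this sketch simply reports it as uncertain.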
Jenks Natural Breaks Optimization (Jenks) This algorithm, which can be seen as a 1-dimensional K-means clustering (Dent et al., 2008), tries to determine the best arrangement of values into different classes. It does this by minimizing each class's average deviation from the class mean, while maximizing each class's deviation from the means of the other classes. In other words, the method seeks to reduce the variance within classes and maximize the variance between classes (Jenks, 1967). We say that the model is certain if the cluster with the highest probability only contains one object and uncertain otherwise. In the latter case, the objects in this cluster with the highest probability score are the uncertain objects. The number of clusters $k$ is found by selecting the $k$-value that minimizes the squared deviations from the class means.
Ensemble Voting (EV) This method is only usable with an ensemble of VG models. Every VG model in the ensemble casts a vote for the command's referred object. If all the models vote for the same object, we say the ensemble is certain, and uncertain otherwise. In the latter case, the different objects that have received votes represent the uncertain objects.
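Ensemble Voting reduces to collecting the top-scoring object of each ensemble member, as in the sketch below. The function name `ensemble_vote` and the toy distributions are hypothetical.

```python
def ensemble_vote(member_probs):
    """Return (is_certain, candidate_indices) from per-member distributions.

    Each ensemble member votes for its highest-probability object; the
    ensemble is certain only when all votes agree.
    """
    votes = [max(range(len(p)), key=p.__getitem__) for p in member_probs]
    candidates = sorted(set(votes))
    return len(candidates) == 1, candidates

agreeing = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.5, 0.4, 0.1]]
disagreeing = [[0.7, 0.2, 0.1], [0.4, 0.5, 0.1], [0.1, 0.6, 0.3]]
print(ensemble_vote(agreeing))
print(ensemble_vote(disagreeing))
```

The size of the candidate set is bounded by the number of ensemble members, which keeps the number of objects shown to the passenger small.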

Influence of the number of objects in O I
The fourth and last category investigates the influence of the number of objects in $O_I$ given to the VG model. Remember, we want to limit the number of objects detected as uncertain when interacting with the car's passenger. Hence, we will investigate the influence of selecting the top-$k$ objects with the highest probability from the RPN.
Another way to reduce the number of objects in $O_I$ is to predict the class (or its superclass; for instance, "car" has as superclass "vehicle") of the object referred to by the command. We will refer to this predicted class based on the command $c$ as $r_c$. All objects whose predicted class from the RPN is not equal to $r_c$ can then be ignored. To investigate this, we create the following referred object class predictors that take as input the command $c$ and predict the referred object's class $r_c$. In the experiments we will refer to removing all the objects which do not belong to the same class as Class Filtering (CF) and removing all objects which do not belong to the same superclass as Superclass Filtering (SCF).
Bidirectional-LSTM (Bi-LSTM) Our first $r_c$ predictor is a Bi-LSTM that takes the full command as input and embeds each word as a vector of $\mathbb{R}^{512}$. Afterward, an embedding of $\mathbb{R}^{512}$ is created to represent the command using the Bi-LSTM. Finally, this sentence embedding is passed to a linear layer to predict the referred object class.
Bi-LSTM with attention (Bi-LSTM Att.) Some words are more important than others for determining which object the command refers to. To exploit this, we first encode the n word embeddings of size $\mathbb{R}^{512}$ from the command into an embedding $s \in \mathbb{R}^{512}$ with a Bi-LSTM. Then, we use the following equation to compute a softmax distribution over the word embeddings of the command:

$$a = \mathrm{softmax}\big(f(s \bullet W)\big) \tag{5}$$

where we define $\bullet$ as the operator that copies the first operand (s) to the dimensions of the second operand (W) and then applies the Hadamard product, $W \in \mathbb{R}^{n \times 512}$ represents the command's word embeddings, and f is a linear layer of $\mathbb{R}^{512 \times 1}$. The attention vector $a \in \mathbb{R}^{n}$ is used in the following manner to produce a weighted-sum sentence embedding in $\mathbb{R}^{512}$:

$$r = \sum_{i=1}^{n} a_i W_i \tag{6}$$

where $W_i$ is the i-th word embedding. The resulting vector r is passed through a linear layer to predict $r_c$. The idea behind the vector r is that the words that contain the most information for the classification will have a higher softmax score than other words and thus a higher weight in the sum.
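The attention pooling described above can be sketched in plain Python as follows. This is a toy illustration, assuming that the linear layer f has no bias term and is represented by a weight vector `f`; the tiny dimensions are chosen for readability and do not reflect the 512-dimensional embeddings of the model.

```python
import math


def attention_pool(s, W, f):
    """Weighted sum of word embeddings, guided by the sentence embedding.

    s: sentence embedding (list of d floats)
    W: word embeddings (list of n lists of d floats)
    f: weights of the linear scoring layer (list of d floats, bias omitted)
    Returns (a, r): attention weights over the n words and the pooled vector.
    """
    # Hadamard product of s with every word embedding, followed by f.
    scores = [sum(fj * sj * wj for fj, sj, wj in zip(f, s, w)) for w in W]
    # Numerically stable softmax over the n scores.
    m = max(scores)
    exps = [math.exp(x - m) for x in scores]
    z = sum(exps)
    a = [e / z for e in exps]
    # Attention-weighted sum of the word embeddings.
    r = [sum(a[i] * W[i][j] for i in range(len(W))) for j in range(len(s))]
    return a, r


a, r = attention_pool(s=[1.0, 1.0], W=[[1.0, 0.0], [0.0, 1.0]], f=[1.0, 1.0])
print(a, r)
```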
Sentence-BERT (Reimers and Gurevych, 2019) is the final model that will be explored to predict r c . Sentence-BERT is a BERT model (Devlin et al., 2018) trained as a siamese neural network with tied weights. The goal of the model is to create semantically meaningful sentence representations. For more information about this model, we refer the reader to the original paper.

Describing the Objects that Cause the Uncertainty
The next step in URS (part 2 in Figure 2) is displaying the uncertain objects to the passenger. Optionally, a textual question can also be provided. This question can be created by generating referring expressions for the uncertain objects and chaining them together. As we are only interested in describing the uncertain objects, we can leave out the other objects and thus save computations.
In this section our proposed referring expression model Attribute-Referring Expression Generator (A-REG) is described. Since the model makes use of attributes, we first describe the attribute predictor in Subsection 4.1. Then we will define A-REG and possible variations in Subsection 4.2. We finish this section with the baselines with which our models are compared.

Attribute Predictor
Our goal in this section is to generate high-quality referring expressions. Therefore, we investigate whether better expressions can be generated when using attributes of the objects extracted from an attribute predictor. We identify three types of attributes that are interesting for the task at hand: the location of the object relative to the car, the color of the object, and the action performed by the object. We experiment with predictors that predict each of these types separately and a predictor that jointly predicts action and color.
For predicting the object's action, we first create an embedding in $\mathbb{R}^{512}$ for the (predicted) class of the object. Then, we pass the image and its cropped object, after resizing both to 224 × 224, through ResNet-152 (He et al., 2016) pre-trained on ImageNet (Deng et al., 2009), with the last linear layer removed, resulting in a vector in $\mathbb{R}^{2048}$ for both the object and the image. We concatenate the object, image, and class embedding into a vector in $\mathbb{R}^{4608}$. This is then passed through a first linear layer of $\mathbb{R}^{4608 \times 1024}$ and a second linear layer of $\mathbb{R}^{1024 \times 11}$, where 11 is the number of actions.
For predicting the object's color, a network similar to the one for predicting the action is used. Now, only the object is cropped and passed through a pre-trained ResNet-152, resulting in a vector in $\mathbb{R}^{2048}$. This is passed through two linear layers of $\mathbb{R}^{2048 \times 1024}$ and $\mathbb{R}^{1024 \times 12}$, respectively, where 12 is the number of possible colors. For jointly predicting both action and color, we use the two models described above in one network by using the same ResNet-152 to extract region features and by having two output heads on top of the ResNet, one for the colors and one for the actions.
For predicting the location of the object, we experiment with six model options. The first is (1) a two-layer neural network with matrices $\mathbb{R}^{3 \times 100}$ and $\mathbb{R}^{100 \times 3}$. This network takes the vector [x/imgW, (x + w/2)/imgW, (x + w)/imgW] as input, where x and w are the bottom right coordinate and the width of the object, respectively, and imgW is the width of the image. The remaining options are (2) a Decision Tree, (3) a Random Forest with ten estimators, (4) a Support Vector Machine, (5) a Support Vector Machine with RBF kernel, and (6) Logistic Regression. For the experiments, see Subsection 5.8.
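The three-dimensional input of the location predictors can be computed as in the short sketch below. The function name `location_features` and the example values are hypothetical; x, w, and imgW are used exactly as named in the text.

```python
def location_features(x, w, img_w):
    """Normalized horizontal features of a bounding box.

    x: horizontal box coordinate as defined in the text (pixels)
    w: box width (pixels)
    img_w: image width (pixels)
    Returns [x/imgW, (x + w/2)/imgW, (x + w)/imgW].
    """
    return [x / img_w, (x + w / 2) / img_w, (x + w) / img_w]


# Hypothetical box: x = 100 px, width = 50 px, in a 1000 px wide image.
print(location_features(100, 50, 1000))
```

Any of the six classifiers listed above could then be trained on these three values to predict one of the three location labels.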

Generating Referring Expressions
The referring expression model's goal is to describe objects in such a way that they are uniquely distinguishable from other objects. Therefore, we hypothesize that knowledge regarding the attributes of an object can contribute to the generation of the expressions. Hence, we experiment with several ways of providing the attributes of an object, the distance count 2 , and the class label to our Attribute-Referring Expression Generator (A-REG). The different variations of A-REG are visualised in Figure 3.
We start by describing a basic Convolutional Neural Network and LSTM (CNN-LSTM) model in Subsection 4.2.1 that does not use attributes and will serve as a control model. Next, we introduce the A-REG that uses attributes and define its variations in Subsection 4.2.2.

Basic Control Model (CNN-LSTM)
This basic CNN-LSTM model is designed to validate the contributions of the object properties. Therefore, a minimal set of features is provided to the model.
Since we want this model to be powerful, we take inspiration from the well-known Bottom-Up Top-Down model (Anderson et al., 2018). This model was originally designed for image captioning, where a set of image features is extracted for each object in the image, and a two-layer LSTM with attention is used for the caption generation. However, we only have a single object, thus we drop the attention mechanism. The goal of the first layer is to process global information and combine it with the context from the expression generated so far. The goal of the second layer is to use all information received for predicting the next token in the expression, since the output of the second layer is directly passed to the fully-connected output layer.

Figure 3: The two attribute variations in the Attribute-Referring Expression Generator (A-REG). On the left is the A-REG-hot, where the one-hot features for the different attributes (action, color, and location) are the inputs to the first LSTM layer. On the right is the A-REG-att, where the embedding is extracted for each attribute. These embeddings are then processed through an attention layer guided by the hidden state of the first LSTM layer. The attention's output is then passed to the second LSTM layer. Best viewed in color.
Box Feature LSTM (CNN-LSTM-box) Given an image I and an object $o \in O_I$, we use ResNet-152 (He et al., 2016), without the final prediction layer, to extract the object feature vector $o \in \mathbb{R}^{2048}$. Next, we define the following two-layer LSTM setup: where $h^{(t)}_l$ is the hidden state of layer l at timestep t, and $i_l$ is the input of layer l. For the CNN-LSTM model, the inputs are defined as follows: where $x^{(t-1)}$ is the expression token at timestep t − 1, for which we extract the embedding using the function Emb, and ';' is the concatenation operator. We initialize the hidden states $h_l$ with zeros, and every sentence starts at t = 0 with a special start-of-sentence token.
Adding Global Image Feature Vector (CNN-LSTM-full) Adding the global image feature vector $v_I$ can help the model recognize the context in which the object appears in the image. Therefore, we extract the image feature vector $v_I \in \mathbb{R}^{2048}$ with the same ResNet-152 network as used for o. Now we replace Eq. 7 with Eq. 9:

Adding Difference Features (+diff) To help the model distinguish between different objects, an additional average of the object features from other boxes can be provided to the model (Yu et al., 2016b). The five closest objects of the same category as the object we are trying to describe are selected from the remaining candidates $O_I \setminus o$, with o the previously selected object. We extract the object features for the closest boxes, subtract the features of the original box from them, and take the average. This results in $o_d \in \mathbb{R}^{2048}$, which can be added to the input of the first LSTM layer by concatenating it to the input and updating Eq. 7 or Eq. 9.
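A minimal sketch of the difference feature is given below. It assumes that "closest" is measured as the Euclidean distance between box centers and that a zero vector is returned when no other object of the same class exists; both are assumptions for illustration, as is the dictionary layout of an object.

```python
import math


def difference_feature(target, others, k=5):
    """Average feature difference between the target and nearby same-class objects.

    target / others: dicts with 'cls', 'center' (x, y) and 'feat' (list of floats).
    """
    same = [o for o in others if o["cls"] == target["cls"]]
    # Assumption: closeness is the distance between box centers.
    same.sort(key=lambda o: math.dist(o["center"], target["center"]))
    nearest = same[:k]
    dim = len(target["feat"])
    if not nearest:
        # Assumption: no comparison objects yields a zero vector.
        return [0.0] * dim
    return [
        sum(o["feat"][j] - target["feat"][j] for o in nearest) / len(nearest)
        for j in range(dim)
    ]


target = {"cls": "car", "center": (0, 0), "feat": [1.0, 1.0]}
others = [
    {"cls": "car", "center": (1, 0), "feat": [3.0, 1.0]},
    {"cls": "car", "center": (2, 0), "feat": [1.0, 5.0]},
    {"cls": "truck", "center": (0, 1), "feat": [9.0, 9.0]},  # ignored: other class
]
print(difference_feature(target, others))
```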
Loss Function To define the loss function for our referring expression models, we use the Max-Margin Maximum Mutual Information (MMI-MM) loss from Mao et al. (2016). The loss consists of two parts, where the first is the cross-entropy loss often used for language generation tasks: where $\hat{X}$ is the predicted expression, $X^*$ the target ground-truth expression, I an image with $o^{(pos)}$ one of the candidate objects that forms a positive pair with the target expression, and θ are the model parameters.
To make sure that the expressions generated for each of the objects in a single image are unique, a Maximum Mutual Information (MMI) constraint is added to the loss: where the second term enforces a minimum difference of margin M between the expression generated for the positive candidate object $o^{(pos)}$ and some negative candidate object $o^{(neg)}$. The final loss is given by: The second term is multiplied by a weight $\lambda_{MMI}$ between zero and one. The negative object $o^{(neg)}$ is randomly selected from the set of object candidates without the positive object, $O_I \setminus o^{(pos)}$.
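To make the structure of such a loss concrete, the following sketch combines a cross-entropy term with a max-margin term in the spirit of Mao et al. (2016). It is an illustration under assumptions, not the exact formulation used here: the inputs are the log-likelihoods of the ground-truth expression given the positive and the negative object, and the default values of `margin` and `lambda_mmi` are hypothetical.

```python
def mmi_mm_loss(logp_pos, logp_neg, margin=1.0, lambda_mmi=0.5):
    """Cross-entropy plus a weighted max-margin term.

    logp_pos: log-likelihood of the target expression given the positive object
    logp_neg: log-likelihood of the same expression given a negative object
    margin, lambda_mmi: hypothetical values for M and the MMI weight
    """
    cross_entropy = -logp_pos
    # Penalize the model when the expression is not at least `margin`
    # more likely for the positive object than for the negative one.
    hinge = max(0.0, margin - logp_pos + logp_neg)
    return cross_entropy + lambda_mmi * hinge


print(mmi_mm_loss(-1.0, -5.0))  # margin satisfied, only cross-entropy remains
print(mmi_mm_loss(-1.0, -1.5))  # margin violated, hinge term is active
```

The hinge term is zero as soon as the expression is sufficiently more likely for the positive object, so it only affects expressions that would also fit the negative object.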

Attribute-Referring Expression Generator and its Variations (A-REG)
The simplistic CNN-LSTM model only uses the features directly obtained from the image pixels. In this section, we propose the Attribute-Referring Expression Generator (A-REG) that can efficiently use several extracted properties of the object:
• The bounding box location property vector $b_I$.
• The distance count property vector $d_I$.
• The attribute property vectors for actions $a_I$, colors $c_I$, and locations $l_I$.
• The class label embedding vector $k_I$.
To reiterate, the CNN-LSTM consists of two LSTM layers, with the first layer's goal to process global information and combine it with the context from the expression generated so far. The second layer's goal is to use all information received for predicting the next token in the expression.
Since the object is only a part of the image, it can be useful for the model to know where in the image the box is located, as well as the size of the box. Therefore, we construct the bounding box vector as follows: where x and y are the coordinates of the top left corner, w and h are the width and height of the box in pixels, and W and H are the width and height of the image.

For the distance count, we make use of the LiDAR data available in the Talk2Car dataset. This provides us with a point map defining the distance for every point. Based on several experiments, we decided to take the closest LiDAR point that lies in a bounding box to represent the distance of that bounding box. We sort the bounding boxes that belong to the same class, the same spatial location (left, right, in front), and the same color, based on distance, and assign integer labels for the object's count. We map the count to a one-hot vector $d_I \in \mathbb{R}^{6}$, where every count higher than five is mapped to the final index.
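The distance count can be sketched as follows. The sketch assumes that counts start at 1 for the closest object in a group, so that counts 1 to 5 occupy the first five indices and all higher counts share the final index; the dictionary layout and the key `lidar_distances` are hypothetical.

```python
from collections import defaultdict


def distance_count_onehots(objects, size=6):
    """One-hot distance counts per object.

    objects: dicts with 'cls', 'location', 'color' and 'lidar_distances'
             (distances of the LiDAR points that fall inside the box).
    """
    groups = defaultdict(list)
    for idx, o in enumerate(objects):
        groups[(o["cls"], o["location"], o["color"])].append(idx)

    onehots = [None] * len(objects)
    for members in groups.values():
        # A box is represented by its closest LiDAR point.
        members.sort(key=lambda i: min(objects[i]["lidar_distances"]))
        for count, i in enumerate(members, start=1):
            vec = [0] * size
            # Counts above five collapse into the final index.
            vec[min(count, size) - 1] = 1
            onehots[i] = vec
    return onehots


objects = [
    {"cls": "car", "location": "left", "color": "red", "lidar_distances": [10.0, 12.0]},
    {"cls": "car", "location": "left", "color": "red", "lidar_distances": [4.0, 9.0]},
    {"cls": "car", "location": "left", "color": "blue", "lidar_distances": [7.0]},
]
print(distance_count_onehots(objects))
```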
Since these properties are useful for expression generation, we add them as input to the second layer of the LSTM model, so we update Eq. 8 as follows:

There are several options for integrating object label and attribute label information. First, we define the feature vectors; next, we describe how we apply them in the models: A-REG-Att, A-REG-Hot, and A-REG-Full.
Attribute and Class Labels The methods for extracting the attribute labels are discussed in Subsection 4.1. These provide us with a probability distribution for the actions $a_I \in \mathbb{R}^{11}$, the colors $c_I \in \mathbb{R}^{12}$, and the locations $l_I \in \mathbb{R}^{3}$. While one could use these probability distributions as input, we found that converting them to one-hot encodings, for each attribute type, yields the best result. The conversion is done such that:

$$\mathrm{onehot}(x)_i = \begin{cases} 1 & \text{if } i = \arg\max_j x_j \\ 0 & \text{otherwise} \end{cases} \tag{15}$$

where x is some probability distribution vector. We further evaluate the integration of attributes in the question generation in two settings: using the one-hot encodings directly, or making use of an attention mechanism over the attributes. Similarly, we obtain the class label. In Subsection 3.2 we describe how we use CenterNet to predict the probability distribution over classes $k_I \in \mathbb{R}^{23}$, which we again convert to one-hot encodings using Eq. 15.
A-REG-Att The first option for integrating the attribute and class labels in the model is by using an attention mechanism, so the model can decide which of these features is important at what time step in the sequence. Therefore, we collect the GloVe embeddings (Pennington et al., 2014) for each of the attribute types, such that: where $A_I$ is the matrix with each row an embedding feature vector. Now we implement an attention mechanism identical to the one described in Equations 5 and 6. However, we replace W with the attribute embeddings $A_I$, and s is replaced with $h^{(t)}_1$, so that different embeddings are selected at every timestep. We shall rename the resulting output to $a_I$. Now we simply concatenate this to the input vector from Eq. 8, so we get:

A-REG-Hot Instead of directly inserting the attribute and class labels, we could also provide these as global information. Therefore, we design a variation of the model in which the one-hot encodings are entered as input in the first LSTM layer, by updating Eq. 9 to the following:

A-REG-Full The final option for the A-REG is to use the full set of features. This is a combination of both the attention mechanism from A-REG-att and the one-hot encodings from A-REG-hot, as discussed in the last two paragraphs. Both the updated equations for the input into layer one from Eq. 18 and the updated input into layer two from Eq. 17 are used.
Optional Class Guidance (+Cls) Because it is important that a generated expression clearly refers to a certain object, it could be beneficial to always provide the model with knowledge regarding the class label. Therefore, we add an option for adding the class label to the inputs for the second LSTM layer (in all equations 7, 14, and 17), similar to the following equation:

Switch for Forcing Attributes (+Switch) The generator must make expressions with clearly defined attributes, such that expressions become unique for the specific object. We hypothesize that guiding the model in generating the attribute words can improve the quality of the expressions. Therefore, we train a special gate that allows the final predictor to switch between predicting words from the entire vocabulary or only from the predicted attributes and class labels for the object. The switch that decides which predictor to use takes the hidden state from the second LSTM layer $h^{(t)}_2$:

$$\hat{\gamma}^{(t)}_I = \sigma\big(f(h^{(t)}_2)\big)$$

where we use a fully connected layer as the function f, and σ is the sigmoid function. During training we use $\hat{\gamma}^{(t)}_I$ in a switch loss (discussed in the next paragraph), but decide whether to predict an attribute based on the ground-truth expression. During inference, $\hat{\gamma}^{(t)}_I$ is rounded such that one of the attributes is forced when round($\hat{\gamma}^{(t)}_I$) = 1, and the regular vocabulary predictions are used when round($\hat{\gamma}^{(t)}_I$) = 0.

Switch Loss Because we do not use the output from the switch during training due to teacher forcing, we require a loss function to make sure that the switch improves over time. The switch loss can be formulated as a regression target, such that it gets close to either one or zero.
where $\gamma^{(t)}_I$ is the ground-truth value for the switch and is defined as:

$$\gamma^{(t)}_I = \begin{cases} 1 & \text{if } X^{(t)} \text{ belongs to the attributes vocabulary} \\ 0 & \text{otherwise} \end{cases}$$

where $X^{(t)}$ is the t-th word from the ground-truth expression X.
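The switch and its ground-truth targets can be sketched as follows. The sketch assumes that the regression loss is a mean squared error, which is one possible reading of "regression target" and not confirmed by the text; `logit` stands for the output of the fully connected layer f applied to the hidden state of the second LSTM layer, and all function names are hypothetical.

```python
import math


def switch_gate(logit):
    """Sigmoid applied to the fully connected layer's output."""
    return 1.0 / (1.0 + math.exp(-logit))


def switch_targets(expression, attribute_vocab):
    """Ground-truth switch values: 1 for attribute words, 0 otherwise."""
    return [1.0 if tok in attribute_vocab else 0.0 for tok in expression]


def switch_loss(gammas_hat, gammas):
    """Assumption: mean squared error between predicted and target switch values."""
    return sum((gh - g) ** 2 for gh, g in zip(gammas_hat, gammas)) / len(gammas)


def use_attribute_predictor(gamma_hat):
    """Inference-time decision: force an attribute word when the gate rounds to 1."""
    return gamma_hat > 0.5


targets = switch_targets(["the", "red", "car"], {"red", "car"})
preds = [switch_gate(x) for x in (-2.0, 3.0, 1.5)]
print(targets, switch_loss(preds, targets))
```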
Finally, if a model uses the switch loss, the full loss is given by:

State-of-the-Art Baselines
In this section, we discuss the baselines against which the A-REG model described above is compared. Our criterion for choosing a model is that its code is available.
SLR consists of three sub-networks: a Speaker that follows a CNN-LSTM architecture for generating referring expressions for the target object, a joint-embedding model, called Listener, trained to minimize the distance between paired object and expression representations, and a Reinforcer that guides the model to sample more discriminative expressions. This model uses three loss functions: a generation loss for the expression, such as the one used in this paper, an embedding loss, and a reward loss. For a more detailed explanation, we refer the reader to the work by Yu et al. (2016b).$^{3}$

SR This model is relatively similar to the SLR model, except that it consists of two sub-networks. The first sub-network is the Speaker that follows a CNN-LSTM architecture for generating referring expressions for the target object indicated by a Gaussian distribution over the image. The difference with the Speaker in SLR is that this Speaker uses a target-centered prior, the features of the target object, and the sentence context under generation. The second sub-network is the Reinforcer, which works in the same way as in SLR. This model uses two losses: a generation loss for the expression, such as the one used in this paper, and a reward loss. For a more detailed explanation, we refer the reader to Tanaka et al. (2018).

Generating a Question
Given the resulting expressions that describe each uncertain object, a possible way to generate the question for the passenger is by using the following template pattern: where [expr $o_n$] is a referring expression that describes the uncertain object $o_n \in O_c$, generated by a model from Section 4.

Experiments and Discussion
As mentioned in Section 1, we first want to detect which objects are causing uncertainty with regard to the referred object in the command for the VG model and then present these to the passenger. However, we are also interested in evaluating if we can describe these uncertain objects in a question geared towards the passenger. This question aims to give the passenger confidence in the car's abilities and provides insight into why the objects are causing uncertainty.
We start this section with some statistics of the used Talk2Car dataset. Afterward, we discuss how we have augmented the dataset with referring expressions as these are not included in the original dataset. Then, we report on the experiments and results for uncertainty detection and quantification, followed by experiments and results for the expression generation. We finish this section with the final accuracy of our URS and human evaluation results. For the reader's convenience, we have grouped experiments, results and discussion with regard to the specific steps of the URS.

Talk2Car Dataset
Talk2Car Dataset Statistics The Talk2Car dataset (Deruyttere et al., 2019) contains images of a self-driving car driving around in Boston and Singapore (right vs. left hand traffic), natural language commands referring to objects, bounding boxes for 23 data classes, and the bounding box of the referred object in the command. These images, provided by nuScenes (Caesar et al., 2019), are taken in different weather (rain or sun) and time conditions (night or day). In total, the Talk2Car dataset provides 11,959 commands for 9,217 images. The dataset is split into a training, validation, and test set that respectively contain 8,349 (69.8%), 1,163 (9.7%), and 2,447 (20.4%) commands. From the latter, four challenging subsets have been created. These are used to evaluate the proposed models for uncertainty detection and ambiguity resolution, their handling of different command lengths, and how they cope with referred objects that are far away. The ambiguity resolution in this dataset is defined as how well a model can cope with having multiple instances of the referred object class in the same image. On average, a command contains eleven words, of which 21% are nouns, 21% are verbs, and around 6% are adjectives. In each image, we can, on average, find eleven objects and more than four objects of the same class as the referred object. Since Talk2Car is built on nuScenes, it also provides a whole range of other modalities such as radar, LiDAR, and instance segmentation. For more information we refer the reader to https://www.nuscenes.org/ and https://talk2car.github.io/. An example of the Talk2Car dataset can be seen in Figure 1.

Augmenting Talk2Car With Ground-Truth Referring Expressions and Attributes
When training our models for generating referring expressions, we experimented with other existing datasets that contain descriptions or referring expressions for objects, such as RefCOCO (Yu et al., 2016a), RefCOCO+ (Yu et al., 2016a), RefCOCOg (Mao et al., 2016), and CityScapes-Ref (Vasudevan et al., 2018). However, when applying a model trained on these datasets to Talk2Car, we saw that the domain shift was too large and produced unusable results. To this end, the objects in the Talk2Car dataset are annotated (using Amazon Mechanical Turk) with both discriminative referring expressions and object attributes. We first discuss this annotation procedure; afterward, we analyze the resulting dataset, which we name Talk2Car-Expr.$^{5}$

Annotation Tool We developed the annotation tool with EasyTurk (Krishna, 2019). The annotation interface includes a short instruction summary explaining the annotation task's goal, a button that leads the annotators to a video explanation of the task, and a detailed textual explanation. We ask each annotator to enter a discriminative expression and to select the attributes for the indicated bounding box of an object. The three types of attributes are the color, the action the object is performing, and where the object is (left of us, in front, right of us).
Analysis of the Annotations The dataset consists of 11,959 referring expressions, one for each object with a command in the Talk2Car dataset. Each expression has, on average, 6.9 words, with a minimum of 4 and a maximum of 22. Each object also has three attributes: color, action, and location relative to the ego car. The training, validation, and test splits of the dataset follow the official splits of the Talk2Car dataset. In the appendix, the reader can find Figure B.11, where we display some statistics of the dataset.

Evaluation Metric for Visual Grounding
The Intersection over Union (IoU) between the bounding boxes of the predicted and the ground-truth object is used to measure the accuracy in the Visual Grounding task. If this IoU > 0.5, we say that the predicted object is correct. We refer to this as IoU .5 . The IoU is defined as follows:

$$IoU = \frac{\text{Area of Overlap of the two boxes}}{\text{Area of Union of the two boxes}}.$$
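As a small worked example of this metric, the sketch below computes the IoU of two axis-aligned boxes and applies the 0.5 threshold. The (x, y, w, h) box format with (x, y) as the top left corner is an assumption made for this illustration.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Width and height of the overlapping region (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def is_correct(pred_box, gt_box, threshold=0.5):
    """A prediction counts as correct when its IoU exceeds the threshold."""
    return iou(pred_box, gt_box) > threshold


print(iou((0, 0, 2, 2), (1, 0, 2, 2)))         # overlap 2, union 6
print(is_correct((0, 0, 2, 2), (0, 0, 2, 2)))  # identical boxes
```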

Evaluation Metric for Jointly Detecting Uncertainty and Uncertain Objects
To compare our methods for jointly detecting uncertainty and uncertain objects, we use the following measures and include (↑) or (↓) to indicate which direction is better:
• $CertIoU_{.5}$ (↑): The $IoU_{.5}$ of the used meta-classifier on the Talk2Car validation set when solely using the certain predictions. This is the lower bound accuracy of the meta-classifier.
• CertAcc (↑): The accuracy of the meta-classifier for classifying the output of the VG model as certain. It is measured by evaluating how many times the certain outputs contain the correct answer.
• CorrUnc (↑): The percentage of uncertain outputs where the correct object is amongst the uncertain objects.
• $Th.IoU_{.5}$ (↑): For this task, we would also like to know what the theoretical accuracy on the Talk2Car validation set for a certain meta-classifier can be. Remember that when outputs have been classified as uncertain and the correct referred object is in the set of uncertain objects, it can be recovered by interaction with the passenger. This measure is computed as $Th.IoU_{.5} = CertIoU_{.5} + (\text{total uncertain outputs} \times CorrUnc)$.
• AvgUncObj (↓): The average number of objects that cause uncertainty. We want this number to be as low as possible since the selected objects will be shown to the user and optionally also used to generate a question. Since giving a command to a self-driving car is a time-critical task, one would like to keep the interaction with the passenger and the generated question as short as possible to avoid wasting time.
• MaxUncObj (↓): The maximum number of objects that caused uncertainty.
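The measures above can be computed as in the following sketch. It rests on two interpretive assumptions: $CertIoU_{.5}$ is normalized by the size of the full evaluation set, and "total uncertain outputs" is taken as the fraction of outputs flagged as uncertain, so that the theoretical score stays between zero and one. The record layout is hypothetical.

```python
def uncertainty_metrics(samples):
    """Compute the uncertainty measures from per-command records.

    Each record is a dict with:
      'certain'         : bool, the meta-classifier's decision
      'correct'         : bool, the top prediction has IoU > 0.5
      'n_uncertain'     : int, number of uncertain objects (0 if certain)
      'gt_in_uncertain' : bool, referred object is among the uncertain objects
    """
    n = len(samples)
    certain = [s for s in samples if s["certain"]]
    uncertain = [s for s in samples if not s["certain"]]

    cert_correct = sum(1 for s in certain if s["correct"])
    cert_iou = cert_correct / n  # assumption: normalized by all samples
    cert_acc = cert_correct / len(certain) if certain else 0.0
    recovered = sum(1 for s in uncertain if s["gt_in_uncertain"])
    corr_unc = recovered / len(uncertain) if uncertain else 0.0
    # Assumption: "total uncertain outputs" is a fraction of all outputs.
    th_iou = cert_iou + (len(uncertain) / n) * corr_unc
    counts = [s["n_uncertain"] for s in uncertain]
    return {
        "CertIoU": cert_iou,
        "CertAcc": cert_acc,
        "CorrUnc": corr_unc,
        "ThIoU": th_iou,
        "AvgUncObj": sum(counts) / len(counts) if counts else 0.0,
        "MaxUncObj": max(counts, default=0),
    }


samples = [
    {"certain": True, "correct": True, "n_uncertain": 0, "gt_in_uncertain": False},
    {"certain": True, "correct": False, "n_uncertain": 0, "gt_in_uncertain": False},
    {"certain": False, "correct": False, "n_uncertain": 2, "gt_in_uncertain": True},
    {"certain": False, "correct": True, "n_uncertain": 3, "gt_in_uncertain": False},
]
print(uncertainty_metrics(samples))
```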

Evaluation Metrics of the Generated Expression
For evaluating the generated expressions, we will use the following three metrics.

METEOR (↑)
The first metric is the METEOR score (Denkowski and Lavie, 2014). It is based on the harmonic mean of unigram precision and recall. This method looks at all possible unigram matches between the candidate and reference sentence. Therefore, it also applies methods such as stemming and synonymy matching. METEOR works by first creating an alignment (i.e., a mapping of unigrams) between the candidate sentence and the reference sentence. It then computes precision and recall between the unigrams to finally compute a harmonic mean. To account for gaps and differences in word order, a fragmentation penalty is also computed by first counting the number of n-grams that are in contiguous and identical order in both expressions. Then, we divide this number by the total number of matched words. The more n-grams are not adjacent, the higher the penalty.

ROUGE-l (↑)
The second metric is ROUGE-l (Lin, 2004). This metric focuses on the longest common subsequence (LCS) between the candidate sentence and the reference sentence. The ROUGE-l metric first computes the length of the LCS relative to the length of the reference sentence (recall) and relative to the length of the candidate sentence (precision). Then it combines the two with an F-measure to get the final ROUGE-l score.
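The computation can be illustrated with the short sketch below. It tokenizes on whitespace and uses a balanced F-measure (`beta=1.0`); both are simplifying assumptions, and published ROUGE implementations may use a different tokenizer or weighting.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]


def rouge_l(candidate, reference, beta=1.0):
    """F-measure over LCS-based precision and recall."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)


print(rouge_l("the red car on the left", "the red car"))
```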

BLEU (↑)
The third metric is BLEU (Papineni et al., 2002), which uses a modified form of precision to compare the candidate sentence to the reference sentence. The reason to use a modified form of precision is that, under standard precision, a model can achieve a score of 1 by repeating a single word that occurs in the reference sentence. To combat this, BLEU clips the count of each n-gram in the generated sentence to the maximum number of times this n-gram occurs in the reference sentence. Additionally, BLEU applies a brevity penalty to short expressions. The used value of n is indicated as BLEU-n. In this paper, we use BLEU-4.

Table 1: The $IoU_{.5}$ scores on the Talk2Car test set of the used VG model with different top-k scoring objects as input. Inference speed is computed on the Talk2Car test set on an RTX TITAN.

Human Evaluation Metrics
In the human evaluation, we use the following metrics to determine the quality of the referring expression generators as part of URS:
• For human agreement we measure Krippendorff's alpha (Krippendorff, 2011), since it allows for any number of annotators, any number of categories or measures, and for missing data. In addition, there is no minimum sample size restriction, which makes it very applicable in our setting with a possibly different number of objects per image.
• The accuracy (Acc, ↑) measures how often the annotator correctly assigns a generated expression to its paired object.
• With SingleExpr (↓) we indicate the fraction of images where the used REG model generates the same sentence for all uncertain objects.
• We also use the F1 measure to compare the referring expression generators. It is taken over the assignment by annotators of generated expressions to their paired objects.

Uncertainty - Influence of Number of Objects in $O_I$ for the VG model
As described in Subsection 3.3.4, the number of objects might influence the number of uncertain objects. Yet, before we investigate this issue, we ask ourselves what the influence is on both $IoU_{.5}$ and inference speed when giving the VG model a varying number k of objects. To select these k objects, we sort the predicted objects from CenterNet, provided by Vandenhende et al. (2020), according to their softmax confidence and take the top-k objects. For this experiment, the VG model (Subsection 3.2) is trained anew for each of the top-k values. During training, we use SGD with a learning rate of 0.01 and Nesterov momentum of 0.9. The learning rate is divided by 10 after 4 and 8 steps, respectively. The used batch size is 8, except for top-64, where only a batch size of 2 fits on one RTX Titan. For top-64, we also change the learning rate to 0.005.
In Table 1 we see that using the top-32 scoring objects results in the highest $IoU_{.5}$ score. On the other hand, using only the top-8 objects results in the lowest score. This should come as no surprise, as with 8 predicted objects there is a lower chance of having the correct object amongst the predicted objects. On the other hand, we see that top-64 performs worse than top-32. We argue that when having that many boxes, some low-quality boxes can be included as well, which can eventually confuse the model.

Table 2: Results for the referred object (super)class predictors on the Talk2Car validation set. Inference time was measured on an RTX TITAN.

Uncertainty - Predicting the Referred Object Class
A second way to reduce the number of uncertain objects is by creating a classifier that takes as input the command c and outputs the (super)class of the object referred to by the command. This way, we can ignore all predicted objects that do not belong to the predicted class. For this experiment, we train the three models described in Subsection 3.3.4 and evaluate their predictive accuracy on the objects' classes in the Talk2Car validation set. Training the two LSTM models with Adam (Kingma and Ba, 2014), a learning rate of 0.0001, batch size 8, and weight decay $10^{-5}$ yields the best results. Training is stopped if there is no improvement on the validation set during ten epochs. For the Sentence-BERT model, AdamW (Loshchilov and Hutter, 2017) with a learning rate of $5 \times 10^{-5}$ and batch size 64 worked best. Additionally, we also compute the inference speed of these models.

Table 2 shows the accuracy and the inference speed of the referred object class predictors on the validation set. Due to space constraints, we abbreviate "Class Accuracy" to "Cls Acc." and "Superclass Accuracy" to "S-Cls Acc.". We see that Sentence-BERT performs better than an LSTM architecture for representing the command's content. However, it only performs marginally better than LSTM-Att, and at a significantly slower pace. The LSTM-Att model shows that by letting the model attend to certain words in the command, one can improve over the plain LSTM model. The results also indicate that predicting the superclass (e.g., vehicle) of the object instead of the class (e.g., truck) results in higher accuracy. In the following experiments, we will use the LSTM-Att model because of its good trade-off between speed and accuracy on the validation set.

Uncertainty - Influence of ensemble on IoU
As described in Subsection 3.3, ensembles can be used to calibrate a model, but they can also influence the accuracy. Hence, in this experiment we compute the $IoU_{.5}$ on the Talk2Car test set of an ensemble ranging from 2 to 5 models using 8, 16, 32, or 64 top-k boxes. We repeat the experiment from Subsection 5.4, where we first predict the (super)class of the referred object based on the command with the LSTM-Att model. Then, we remove all objects from the top-k scoring predicted objects that do not belong to the predicted (super)class. The ensemble's inference speed is equal to the inference speeds in Table 1, as these models can be parallelized. For this experiment, we train 5 VG models for each of the top-k values following the instructions in Subsection 5.3. We also wish to mention that we tested the recent work by Havasi et al. (2020) of combining multiple models of an ensemble into one model, yet we only achieved results of around 0.5 $IoU_{.5}$ on Talk2Car.
The results are visualised in Figure 4. Increasing the ensemble size positively influences the IoU .5 for all the displayed variations of the VG model. The top-32 model is consistently better than the other models in terms of IoU .5 for any ensemble size. It also reaches an IoU .5 of 0.70 when the ensemble size is equal to 5. We also see that using Class Filtering with top-8 and top-16 reduces the performance compared to not using it. We argue that this is because the used RPN tends to miss-classify some of the objects with a high confidence score. However, these miss-classified objects are still classified in the correct superclass. For top-32 and top-64, we see that using Class Filtering and Superclass Filtering improves performance over not using it.
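For reference, the $IoU_{.5}$ score used in this experiment can be computed as in the sketch below. The corner-based box layout `(x1, y1, x2, y2)` and the function names are assumptions made for illustration, and a prediction is counted as correct when its overlap with the ground-truth box exceeds 0.5.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlapping region (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def iou_at_05(predicted_boxes, ground_truth_boxes):
    """Fraction of predictions whose IoU with the ground truth exceeds 0.5."""
    hits = sum(iou(p, g) > 0.5 for p, g in zip(predicted_boxes, ground_truth_boxes))
    return hits / len(predicted_boxes)


score = iou_at_05(
    [(0, 0, 2, 2), (0, 0, 2, 2)],
    [(0, 0, 2, 2), (1, 0, 3, 2)],
)
```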

Uncertainty - Jointly Detecting Uncertainty and the Uncertain Objects
This experiment investigates which combination of the methods described in Subsection 3.3 produces the best meta-classifier to jointly detect the uncertainty of a VG model and the uncertain objects.
The following notation indicates the order in which the methods from Subsection 3.3.3 are used. "CF + TS + SoftTr" stands for first using Class Filtering to ignore the objects predicted by CenterNet whose class differs from the class predicted by LSTM-Att. Afterward, Temperature Scaling is applied to the output of the VG model, and finally, Softmax Thresholding is used to detect whether the model is uncertain and which objects are causing the uncertainty.
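The last two steps of such a chain can be illustrated with the sketch below. It shows one plausible reading of Temperature Scaling followed by Softmax Thresholding and is not the exact procedure from Subsection 3.3: the decision rule (the model is certain when the top probability reaches `threshold`), the cap `max_objects`, and all default values are assumptions.

```python
import math


def temperature_softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (T > 1 flattens the distribution)."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_threshold(logits, temperature=1.0, threshold=0.8, max_objects=5):
    """Return (is_uncertain, object_indices) for one command.

    Assumed rule: the model is certain if its top probability reaches the
    threshold; otherwise the highest-ranked objects are flagged as uncertain.
    """
    probs = temperature_softmax(logits, temperature)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if probs[ranked[0]] >= threshold:
        return False, ranked[:1]
    return True, ranked[:max_objects]


sharp = softmax_threshold([4.0, 0.0, 0.0], temperature=1.0)
flat = softmax_threshold([4.0, 0.0, 0.0], temperature=4.0)
```

With the same logits, a higher temperature lowers the top probability below the threshold, so the scaled model is flagged as uncertain where the unscaled one is not.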
Because of the many possible combinations, it is infeasible to display all of them in this paper. Therefore, we select the combinations that satisfy the following three restrictions: $MaxUncObj \leq 5$, $Th.IoU_{.5} > 0.75$, and $CertAcc > 0.8$, and display them in Table 3 in no particular order. When we allow for more than five objects, it is possible to achieve a $Th.IoU_{.5}$ of up to 0.9317. In our opinion, this is not useful because having so many uncertain objects will lead to a sensory overload for the passenger. From the table, we see that using an ensemble jointly with class filtering and ensemble voting is an effective strategy for detecting uncertainty and uncertain objects. Remember that, next to displaying the objects on a touch screen, we are also interested in generating questions for the passenger. Ideally, the number of objects flagged as uncertain is kept small, because this limits both the execution time of generating a question and the cognitive effort the passenger needs to understand it in a self-driving car setting. In this respect, Ensemble Voting introduces an upper limit on the number of possible uncertain objects, as this limit coincides with the number of models in the ensemble. From this experiment, we decide to use $Ens_4$+EV with top-16 objects and $Ens_5$+CF+EV with top-64 objects as our meta-classifier in the remainder of our experiments as both methods have high CertAcc and high $Th.IoU_{.5}$.

Figure 4: The influence on $IoU_{.5}$ when using an ensemble of size E with different top-k scoring predicted objects. At the top of each plot (a, b, c, d), we display the number of top-k scoring predicted objects that have been used. The y-axis shows the Intersection over Union score with threshold 0.5 ($IoU_{.5}$). The x-axis indicates the size of the ensemble (E). We also include the $IoU_{.5}$ score when E = 1 for convenience. These values are equal to the values in Table 1. Best viewed in color.

Table 3: The top-10 results of the uncertainty detection methods from Subsection 3.3 on the Talk2Car validation set. The measures used are explained in Subsection 5.6.
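The upper limit that Ensemble Voting places on the number of uncertain objects can be seen in the following sketch. It assumes that each ensemble member votes for its single highest-scoring object and that the situation is flagged as uncertain when the members disagree; the function name and the score layout are illustrative assumptions.

```python
def ensemble_vote(scores_per_model):
    """scores_per_model[e][k] is the score model e assigns to candidate object k.

    Each model votes for its top-scoring object. The distinct voted objects are
    the uncertain objects, so there can never be more of them than models.
    """
    votes = [max(range(len(s)), key=lambda i: s[i]) for s in scores_per_model]
    uncertain_objects = sorted(set(votes))
    return len(uncertain_objects) > 1, uncertain_objects


# Four models, three candidate objects: one model disagrees with the others.
disagree = ensemble_vote([
    [0.1, 0.7, 0.2],
    [0.2, 0.6, 0.2],
    [0.5, 0.3, 0.2],
    [0.1, 0.8, 0.1],
])
agree = ensemble_vote([[0.1, 0.9], [0.3, 0.7]])
```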

Visual Uncertainty Examples
We show some examples of the possible uncertain objects detected by the uncertainty models in Figure 5. In the examples, there are multiple objects that can cause confusion. Often, the confusion is between objects of the same class, which is especially clear in Figures 5a and 5c. In these examples, the car has to perform an action relative to an object. In the former, the car can resolve the uncertainty by asking about the color of the referred car. However, this is not possible for the cones in Figure 5c. Here, the car should ask about the position of the cone: "Do you mean the first, second, or last cone?" In the example in Figure 5b, the uncertainty is caused by colors that are difficult to tell apart, white and yellow. To resolve this, the car could ask a question regarding the location of the object, in front or on the right.

Attribute Prediction
We measure the accuracy of our proposed attribute predictors in Subsection 4.1 for the three different types of attributes: color, action, and spatial location. From the results shown in Table 4 we see that predicting locations is easier than predicting colors or actions. Predicting actions is the most challenging task as it involves reasoning about movement and pose from a 2D image. We see that for predicting the location, the decision tree has the highest accuracy. Although the Neural Network has a faster inference time, we decide to use the decision tree for our next experiments as the difference in inference time is relatively small. We display the best result obtained with a multitask network that jointly predicts action and color. The single task networks had fairly similar results.

Table 4: Results of the attribute prediction on the Talk2Car-Expr validation set, with NN a two-layer Neural Network, N. Neigh. is Nearest Neighbour, DT is the Decision Tree, RF is Random Forest, SVM is Support Vector Machine, RBF SVM is an SVM with a Radial Basis Function (RBF) Kernel, and LR is Logistic Regression. Inference time is measured on an RTX TITAN for the ResNet-152 models. For the other models, we measure their inference times on an Intel(R) Xeon(R) Silver 4208.

(a) The objects that cause uncertainty for the command: "Parallel park behind the car on the left".
(b) The objects that cause uncertainty for the command: "Change lanes and get behind the white car".
(c) The objects that cause uncertainty for the command: "After that signaling cone, turn left".

Figure 5: Uncertainty Examples. Examples of uncertain objects detected in different scenes. We see that the objects flagged as uncertain by URS are often from the same (super)class. Best viewed in color.
For the location neural network, we train using Adam with a learning rate of 0.0003 and a batch size of 16. The multitask network to jointly predict action and color was trained with Adam with a learning rate of $5 \times 10^{-5}$, batch size 16, and weight decay of 0.0001.

Referring Expression Generation - Quantitatively
For our referring expression generation, we perform two experiments: a quantitative one and a qualitative one. The former is performed by generating expressions of objects on the Talk2Car-Expr test set by giving each model an image together with the bounding box of an uncertain object. The generated expressions are evaluated with the metrics explained in Subsection 5.2.3: METEOR, ROUGE-l, and BLEU-4.
Hyperparameter Search for A-REG We had to finetune multiple settings to achieve optimal results for our CNN-LSTM and A-REG models. We train with three different random seeds during the hyperparameter search for a maximum of 75 epochs and select the best epoch based on METEOR (Denkowski and Lavie, 2014).
Based on the validation set, we found that the Adam optimizer (Kingma and Ba, 2014) with a learning rate of $5 \times 10^{-6}$ performs best. The optimal fully connected layer sizes are 512, except for the attention layers, where the optimal size is 256. We use a batch size of 16, and during inference, we use a beam search with a beam size of 10, where the generation is started with a start-of-sentence token. In the next experiments, we train with a maximum of 125 epochs and select the best epoch using METEOR.
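To illustrate the decoding step, the sketch below implements a generic beam search that starts from a start-of-sentence token. It is not the decoder of A-REG: the `step_fn` interface, the token names, and the toy bigram table `toy_step` are hypothetical stand-ins for the trained model.

```python
import math


def beam_search(step_fn, sos, eos, beam_size=10, max_len=15):
    """Generic beam search. step_fn(tokens) returns {next_token: log_prob}."""
    beams = [([sos], 0.0)]  # (token sequence, cumulative log-probability)
    finished = []
    for _ in range(max_len + 1):  # up to max_len tokens plus the end token
        candidates = []
        for tokens, score in beams:
            for tok, logp in step_fn(tokens).items():
                candidates.append((tokens + [tok], score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            # Completed hypotheses leave the beam; the others are expanded further.
            (finished if tokens[-1] == eos else beams).append((tokens, score))
        if not beams:
            break
    pool = finished or beams
    return max(pool, key=lambda c: c[1])[0]


def toy_step(tokens):
    """Hypothetical bigram model standing in for the trained decoder."""
    table = {
        "<sos>": {"the": math.log(0.6), "a": math.log(0.4)},
        "the": {"car": math.log(0.5), "truck": math.log(0.5)},
        "a": {"car": math.log(0.9), "truck": math.log(0.1)},
        "car": {"<eos>": 0.0},
        "truck": {"<eos>": 0.0},
    }
    return table[tokens[-1]]


best = beam_search(toy_step, "<sos>", "<eos>", beam_size=10)
greedy = beam_search(toy_step, "<sos>", "<eos>", beam_size=1)
```

In this toy example the wider beam finds the sequence starting with "a", whose overall probability (0.36) is higher than that of the greedy choice starting with "the" (0.30).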
All our expressions are cut to a maximum length of 15 tokens with an additional start-of-sentence token and an end-of-sentence token. All tokens are converted to pre-trained GloVe embeddings (Pennington et al., 2014). Tokens without a pre-trained embedding that occur more than five times are randomly initialized. The remaining tokens are cast to unknown tokens. We also tested a pre-trained BERT model for extracting the embeddings (Devlin et al., 2018). However, this significantly decreased performance and inference speed.
Tuning the Loss For the finetuning of the parameters for the MMI-MM-loss and the switch-loss on the validation set, we report our findings in Tables 5 and 6, respectively. The results are given for the METEOR metric.
For the MMI-MM-loss we had to tune the margin M from Eq. 11 and its weight $\lambda_{MMI}$. We see that a small margin is advantageous and that a higher margin decreases the model's performance. It is beneficial to use this loss, though only with a small weight. With the weight $\lambda_{MMI}$ set to 0.1 we achieve the best results. Therefore, for all further experiments we set the weight $\lambda_{MMI}$ to 0.1 and the margin M to 0.1.
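The token handling described above can be sketched as follows. The special token strings, the function names, and the representation of the pre-trained vocabulary as a plain set are assumptions made for illustration; loading the actual embeddings is left out.

```python
from collections import Counter

SOS, EOS, UNK = "<sos>", "<eos>", "<unk>"  # hypothetical special tokens
MAX_LEN = 15    # maximum expression length, excluding the two special tokens
MIN_COUNT = 5   # tokens must occur more than this often to get a random embedding


def build_vocab(tokenized_expressions, pretrained_tokens):
    """Return (vocabulary, tokens that need a randomly initialized embedding)."""
    counts = Counter(tok for expr in tokenized_expressions for tok in expr)
    vocab = {SOS, EOS, UNK}
    random_init = set()
    for tok, n in counts.items():
        if tok in pretrained_tokens:
            vocab.add(tok)
        elif n > MIN_COUNT:
            vocab.add(tok)
            random_init.add(tok)
    return vocab, random_init


def encode(tokens, vocab):
    """Truncate to MAX_LEN, map out-of-vocabulary tokens to UNK, add SOS and EOS."""
    body = [tok if tok in vocab else UNK for tok in tokens[:MAX_LEN]]
    return [SOS] + body + [EOS]


expressions = [["the", "zorb", "car"]] * 6 + [["a", "blip"]]
vocab, random_init = build_vocab(expressions, pretrained_tokens={"the", "car", "a"})
encoded = encode(["a", "blip"], vocab)
```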
As opposed to the MMI-MM-loss, we find that when the switch and its loss are used, it is essential to set its weight high. We decided not to make it larger than one since the main objective is still the generation of correct referring expressions. For all future experiments with the switch, we set its weight $\lambda_{switch}$ to 1. Furthermore, we tested several loss functions (Cross-Entropy, Smooth-L1, and MSE) and found that the MSE loss, as described in Eq. 21, performs best.

Table 5: Results for finding optimal weight for the MMI loss, as well as the margin value. Results for validation split with the A-REG-Full. The best score is bold with the second best underlined. Since the margin has no effect with a weight of zero, the same score is reported for all.

Table 6: Results for finding optimal loss function and the weight for the Switch Loss. Results for validation split with the A-REG-Full. The best score is bold, with the second best underlined. Since the margin has no effect with a weight of zero, the same score is reported for all.

Results of the Referring Expression Generation
Table 7 shows the results for different combinations of settings for generating expressions on the test set of Talk2Car [...] both the class label embeddings and the switch, and finally our model variations with the difference features.
For the baselines, we find that the SR model by Tanaka et al. (2018) performs best across all metrics. However, the inference time is more than twice as long compared to the SLR model by Yu et al. (2016b), which only performs a few points lower on some of the metrics (e.g., ROUGE-l). Both baselines also have a way to re-rank multiple simultaneously generated expressions by passing them to either the Listener in the case of SLR (Yu et al., 2016b) or the Reinforcer in the case of SR (Tanaka et al., 2018). The expressions are then ranked according to the scores generated by the Listener or Reinforcer. We found that this did not improve the results on Talk2Car.
For the implementations of our models, we note that A-REG performs much better than the simpler CNN-LSTM. This observation confirms our hypothesis that guidance by the attributes of the object aids in describing it. More specifically, when comparing A-REG with its variations (-hot, -att, and -full), we note that the attributes and the class label give a great improvement over just using the global image feature $v_I$, the bounding box feature $b_I$, and the distance count $d_I$.
Interestingly, according to most metrics the best variations of the A-REG model are those that use the additional class label embedding (+cls) on top of A-REG-hot or A-REG-att. This indicates that using the full model adds unneeded complexity and hurts performance across many of the metrics. We note that among the models that do not use the class label embedding as input, A-REG-hot is one of the worst-performing variations. This indicates how important the class label is for generating the referring expressions. Training the switch and forcing attributes are not as beneficial as expected. We expect this has to do with the same complexity issues mentioned before when using the full model, with the model becoming too complex to train properly. Finally, Table 7 shows an improvement in terms of METEOR, ROUGE-l and BLEU-4 when the difference features (+diff) are added as input (see paragraph Adding Difference Features (+diff) in Subsection 4.2.1). These features are obtained by subtracting the object feature from the features of the other objects. This shows that this kind of delta information about the other objects is beneficial for generating good referring expressions.
Overall, the best implementations of the A-REG model outperform the state-of-the-art models across all metrics. This improvement is most likely due to the use of specifically trained attribute predictors for the Talk2Car task and directly fine-tuning the model parameters for the same task. Furthermore, the inference time is greatly reduced due to a simpler model design. Figure 6a and Figure 6b show two example images with a referred object; their captions list the expressions generated by two of our top-performing models (A-REG-hot+cls+diff and A-REG-att+cls+diff) and the two state-of-the-art models (SLR and SR). For more examples we refer the reader to Appendix C.
Although the automated evaluation with metrics such as METEOR, ROUGE and BLEU gives a good indication of the quality of the generated expressions, it is hard to determine whether these metrics measure how well the expressions actually disambiguate the objects in the image. For instance, we note that our A-REG models tend to predict somewhat similar referring expressions. We believe that this can be resolved with better attribute predictors and a larger variety of attributes.
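The subtraction behind the difference features can be sketched as below. Only the subtraction itself is taken from the text; how the resulting vectors are aggregated and fed to the model is described in Subsection 4.2.1 and is not reproduced here. The function name and the toy feature vectors are illustrative assumptions.

```python
def difference_features(target_feature, other_features):
    """Subtract the referred object's feature from each other object's feature."""
    return [
        [o_k - t_k for o_k, t_k in zip(other, target_feature)]
        for other in other_features
    ]


target = [1.0, 2.0, 3.0]                       # feature of the object to describe
others = [[1.0, 2.0, 3.0], [2.0, 0.0, 5.0]]    # features of the other objects
diffs = difference_features(target, others)
```

An all-zero difference vector, as for the first object here, signals that the other object looks identical to the target in feature space, which is exactly the case where a distinguishing attribute is needed in the expression.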
Human Evaluation For our REG models, we also performed a human evaluation using Amazon Mechanical Turk. We gave the workers an image with a referred object indicated with a bounding box and a set of expressions generated by the referring expression models used in this paper. These expressions are presented as described in Subsection 4.3, where the workers have to indicate for each object expression $[expr_{o_n}]$ whether it corresponds with the indicated object. The set of generated expressions was created by using $Ens_4$+EV or $Ens_5$+CF+EV as uncertainty detection models, where images were kept if there is exactly one bounding box that indicates the referred object amongst the uncertain objects, to avoid too many overlapping boxes. This results in 534 uncertain situations for $Ens_4$+EV and 628 for $Ens_5$+CF+EV.
The results are shown in Table 8. For most metrics, we see that the results are better for A-REG compared to the state-of-the-art models. We note that this is not the case for SingleExpr, indicating that multiple objects received the same expression. However, when looking at the examples in Figure 6, the difference with the state-of-the-art baselines is that A-REG is able to correctly identify the objects in the box. Therefore, we believe that this issue is more acceptable for the end-user.
Method                     METEOR (↑)   ROUGE-l (↑)   BLEU-4 (↑)   Inf. Speed (ms)
SLR (Yu et al., 2016b)     0.268        0.597         0.245        663.90
SR (Tanaka et al., 2018)   0

Table 7: Quantitative evaluation of our proposed models and state-of-the-art baselines on the Talk2Car-Expr test set. CNN-LSTM-box only uses the ResNet-152 box features, and CNN-LSTM-full additionally integrates ResNet-152 features for the entire image. The A-REG models make use of the object label (A-REG), with optionally the attributes and object label as a one-hot vector (A-REG-hot), optionally an attention over the word embeddings for the attribute and the class label (A-REG-att), or both of them (A-REG-full). All options can be extended with the class label embedding inputted in the LSTM of the decoder at every timestep (+cls) or with a trained switch that decides if attributes should be forced in the expression (+switch). AttrExpr-full is the model using the full set of features: attention with both attributes and class embedding, the image feature vector, box feature vector, and the hot-vectors for attributes and classes. Inference speed is measured on an RTX TITAN. To indicate the best, second, and third best scores we use boldface, underline, and italics, respectively.

Table 8: Human Evaluation using several metrics. For the human agreement we use Krippendorff's alpha. Acc is the accuracy of correctly assigned expressions to objects. SingleExpr is the fraction of images where the used REG model generates the same expression for all uncertain objects.

Final Result
To finish this paper, we investigate the accuracy of the full URS. We hypothesize that the final accuracy of URS is equal to the theoretical accuracy column ($Th.IoU_{.5}$) from Table 3 regardless of the generated question for the uncertain objects. We argue that the passenger who has given the command will always select the object they refer to if it is part of the selected uncertain objects. Following this reasoning, we can say that the overall improvement of using the VG model with URS compared to only using the VG model is a 9% absolute increase whenever we use $Ens_4$+EV or $Ens_5$+CF+EV to detect uncertainty.
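The reasoning behind this theoretical accuracy can be made concrete with the sketch below. It is one reading of the argument above, with a hypothetical per-sample record layout: objects are identified by an index, a certain sample counts as correct when the model's prediction is the referred object, and an uncertain sample counts as correct when the referred object is among the flagged objects, since the passenger is assumed to select it.

```python
def theoretical_accuracy(samples):
    """Fraction of commands resolved correctly under the perfect-passenger assumption."""
    correct = 0
    for s in samples:
        if s["uncertain"]:
            # The passenger picks the referred object if it is offered as an option.
            correct += s["target"] in s["uncertain_objects"]
        else:
            correct += s["prediction"] == s["target"]
    return correct / len(samples)


samples = [
    {"uncertain": False, "prediction": 2, "target": 2, "uncertain_objects": []},
    {"uncertain": False, "prediction": 0, "target": 1, "uncertain_objects": []},
    {"uncertain": True, "prediction": 0, "target": 3, "uncertain_objects": [0, 3]},
    {"uncertain": True, "prediction": 4, "target": 1, "uncertain_objects": [4, 5]},
]
accuracy = theoretical_accuracy(samples)
```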
We are now interested to see how URS fares compared to the best performing VG model from Table 1 without the uncertainty system on the four challenging subsets present in Talk2Car (Subsection 5.1). As discussed above, Figure 7 shows that the combination of the VG model and the proposed uncertainty detection models largely improves the $Th.IoU_{.5}$ metric when confronted with difficult visual situations (for the depth subset we witness an increase of 12.1%), with long language commands (for the longest sentences subset we witness an increase of 7.8%), and with referred objects in the commands that have an ambiguous interpretation in the visual scene (for the ambiguous subset we witness an increase of 12.6%), while for easier situations the increase is less pronounced (for the short sentences subset we witness an increase of 5.9%). These results show that for the difficult situations, our uncertainty detection models are very beneficial. Finally, with A-REG-hot+cls+diff as REG model, the overall inference speed of URS is around 286.99ms when using $Ens_4$+EV as the uncertainty detection method or 336.05ms when using $Ens_5$+CF+EV. Using the state-of-the-art baseline SLR as the REG model, this would be 732.99ms and 782.05ms. For SR, the total inference speed would be 1447.55ms and 1495.66ms.

Figure 7: In this figure, we plot the $IoU_{.5}$ score of the used VG model (Subsection 3.2) and the VG model in conjunction with the top two uncertainty methods from Subsection 5.6: $Ens_4$+EV and $Ens_5$+CF+EV. The former model means that we have an ensemble of four VG models using ensemble voting (EV) to determine the uncertain objects. The latter indicates that we have five VG models that first use class filtering (CF) and then EV for detecting the uncertain objects. At the top of each plot, we give the name of the subset. Each plot also shows the easy examples on the left and increases the difficulty when moving along the plot's x-axis. Best viewed in color.

Expected Requirements and Maintenance for Implementing the Uncertainty Resolving System (URS) Pipeline
We assume the use of a modern self-driving car, which is already equipped with cameras and (possibly) LIDAR or radar sensors. Furthermore, most modern cars have microphones and a speech recognition function, which can be used to give commands regarding music and text messaging while driving.
We first discuss the capital expenditure of our system. We believe that adding our pipeline to a self-driving car would not require much work. We could directly couple our system with existing infotainment systems that can be controlled through speech (e.g., BMW ConnectedDrive$^6$, Ford Sync$^7$, ...). To do so, a system must be added that detects whether a given command concerns the car's internal systems or the surroundings of the car. This could be achieved with a simple classifier choosing between the two options. If the latter is detected, the command is forwarded to our pipeline. A pre-trained model is loaded into the car's internal computer, such that the model is small and quick to run. As shown by the results, the complete inference time of the entire URS pipeline with our proposed REG model is very low, always staying below 400 ms. The internal computer must have enough capacity to maintain a low inference time.
For the operating expenditure we need to make sure that the model is capable of improving and solving future issues. It is important to roll out updates of the pre-trained CU and A-REG models. Similar to current interactions of a driver with a Tesla car (e.g., during braking or steering), the interactions of a passenger of the self-driving car can be logged and used as feedback to further retrain and fine-tune our system. These updates can then be published in a similar fashion as Tesla's regular over-the-air updates$^8$. When technology advances and new discoveries are made, a major update can be released to the car, introducing a novel CU or REG model.

Conclusion
In this paper, we have proposed the Uncertainty Resolving System (URS) for a self-driving car. URS augments a Visual Grounding (VG) model with the ability to (1) detect if a given natural language command leads to uncertain situations and (2) find the objects causing the uncertainty. The best method for these two abilities is to use an ensemble of VG models in conjunction with Ensemble Voting (EV) and optionally Class Filtering (CF). Our contribution for detecting uncertainty lies in (1) evaluating many different methods and their combination for detecting said uncertainty and (2) proposing a novel set of constraints tailored for a self-driving setting. URS has a visual output and, in addition, it can present the passenger with a generated question that describes the uncertain objects. For this purpose, we have developed a new referring expression model that we called A-REG. It is designed as a robust two-layer LSTM referring expression generator that can efficiently use an object's attributes for the textual output. These attributes, specifically tailored for the Talk2Car dataset, include the bounding box, distance count, color, location, action, and the class label. If one would like to change the set of attributes, this can simply be done by retraining the attribute predictor based on data annotated with the new set of attributes and by retraining the A-REG model. Our method stands in contrast to the single layer network of the state-of-the-art baselines. The first layer processes all the global information (such as the image representation, object representation, and one-hot encodings for the attributes). The second layer receives all the inputs required to predict the next expression token (such as the object properties, embeddings for the class label, and the attributes).
In our experiments, we show that by using URS, a 9% absolute increase in terms of $IoU_{.5}$ is achieved compared to the VG model without URS on the Talk2Car dataset. Additionally, we show that URS helps the most in challenging situations such as objects being far away or multiple objects of the same class. In these cases, URS increases the $IoU_{.5}$ of the used VG system by 12.1% and 12.6%, respectively. We show that A-REG for referring expression generation beats existing state-of-the-art models on the Talk2Car dataset by leveraging the previously mentioned attributes. With all these features combined, A-REG outperforms the state of the art by a relative margin of up to 6% METEOR and 8% ROUGE-l while being nearly three times faster at inference.

Future Work
We identify several exciting possibilities as future work. First, in this work the questions posed to a passenger of the self-driving car are in textual format assuming their translation into speech by a text-to-speech engine. It would be interesting to study how speech signals could be generated, describing the visual scene's uncertain objects without any intermediate textual step. This might also benefit the latency when generating a question in speech format. Second, as mentioned in our experiments, we experimented with combining multiple models from an ensemble into one. However, the achieved results were worse than using the models of the ensemble separately. Concerning power consumption on a self-driving car, using an ensemble of four or five models might be expensive. Hence, it might be interesting to study how the computations of uncertainty detection and quantification can be put into one model. Although we demonstrated the validity of uncertainty detection when understanding natural language commands in a visual context, it might be interesting to investigate whether the proposed method performs well in other tasks executed by a self-driving car. For instance, in bad weather conditions it is important to detect how uncertain the visual recognition is, so that the car can take appropriate actions (e.g., pull over to avoid an accident). Finally, we note that the referring expression models sometimes generate the same sentence for multiple objects in the scene. One reason is the overlap between predicted boxes. Another reason is the level of detail of the object's class and its attribute labels. We believe there is a trade-off between how detailed the attributes should be, causing longer questions, versus creating short descriptions that can be quickly understood by the passenger. Future work could investigate the level of detail of attributes without them endangering the safety of the passenger.

Appendix A. Conducted Survey
For creating this application paper, we held a survey to know how people would like to have a self-driving car report back about its uncertainty. We present the survey results here, and we hope that some of these results could also lead to future work.
The conducted survey on social media (LinkedIn, Facebook, mailing lists, ...) received 254 responses amongst people of different ages, genders, and educations (see Figure A.8).
As shown in Figure A.9a, we found that a considerable share of the participants (36.8%) does not trust self-driving cars yet. We asked these participants why this is the case. Table A.9 contains an excerpt from the 91 responses.
We noted that most of the responses have to do with safety concerns and fear of giving away control to an autonomous vehicle. We also note that this often has to do with a lack of knowledge. Furthermore, we found that some of them have concerns regarding the legality and responsibility in case of an accident. Some of these concerns are caused by media and examples of incidents with self-driving cars.
Following these questions regarding trust, we asked whether the ability to give commands to the car, like those in the Talk2Car dataset, would improve this trust. We found that with the ability to give commands the percentage of people who would feel confident increases to 69.6% compared to the original 63.2% (Figure A.9). Moreover, when asking if people would use the ability to give commands, we found that a large majority of the participants (82.6%) would make use of this, as shown in Figure A.9c.
Finally, we asked a couple of questions regarding the format for uncertainty resolution, with the results shown in Figure A.10. We found that 71.9% of the participants would like the car to respond by speech (Figure A.10a). However, when asked how they would prefer the vehicle to report back about its uncertainty, either by only visually showing the uncertain objects or also with a textual/spoken question, we found that only 38.7% (Figure A.10c) would prefer the former and 54.9% would prefer the latter. The remaining participants indicated they do not have a preference. Finally, from the survey, we also found that people have a slight preference for responding to the car via a touch screen compared to via a touch screen augmented with speech (Figure A.10b).

Safety
It's about trust. Everything tech is never safe for me. it will take us years to test self-driving cars so they are at least equally safe as an experienced driver. Afraid of giving up control and not being certain that the car will be able to respond quickly to any given situation. I think the roads have too many exceptions for it to work well.

Control
Because I would hesitate to give away all control. I don't trust todays technology enough to depend my life on it. Unless I have 100% control, I wouldn't feel safe. I am a bit a control freak. I would trust a self-driving car IF I would still be able to take over control whenever I want to.

Examples
People have died in self-driving Teslas. Because of the uber incident. see Tesla crash vids... Reading the current news related to the technology, it seems that we are not quite there yet to have a fail-proof system for self-driving cars. Once we are there, then I would feel confident.

Human Skills
Complexity of human behavior in traffic + too abstract as a concept in this stage. Computers have not yet achieved the cognitive capabilities of humans.

Responsibility
Because if anything happens, I, as the owner of the car would have to take full responsibility.

Table A.9: An excerpt of some of the answers that participants of the survey gave for not trusting self-driving cars.

Figure A.8: In these plots we show the distribution of participants among different age categories (subfigure a), different genders (subfigure b), and different levels of education (subfigure c). Best viewed in color.

(c) Question: "Would you use a system that allows you to give commands to a self-driving car? Example commands could be: park in the shade, ..."
Figure A.9: Results for three questions regarding the trust in the self-driving car with or without option to give commands. Questions are reported in the captions below the subfigures. Best viewed in color.

(c) Question: "You are in a self-driving car. At a particular moment, you give the command "pick up that person." However, there are two persons in front of the vehicle. The car detects this and has the following two options to make you aware of this: (Option 1) It displays the uncertain objects on a touch screen with a rectangular box for each object, or (Option 2) in addition to showing the objects on a touch screen, the car also describes the objects in a question through speech. With which option would you feel more confident in the abilities of the car? If none of the options make you feel more confident, please indicate 'none of the above'."
Figure A.10: Results for three questions regarding the format of the uncertainty resolving system for commands given to the self-driving car. Questions are reported in the captions below the subfigures. Best viewed in color.

Appendix B. Talk2Car-Expr Dataset Statistics
In this section we show some of the statistics for the newly created Talk2Car-Expr dataset. In Subfigure B.11(a) we show the distribution of the expression lengths, Subfigure B.11(b) shows the distribution of the location attributes, Subfigure B.11(c) shows the distribution of the action attributes, and Subfigure B.11(d) shows the distribution of the color attributes.