Exploring the Potential of Multi-Modal AI for Driving Hazard Prediction

This paper addresses the problem of predicting hazards that drivers may encounter while driving a car. We formulate it as a task of anticipating impending accidents using a single input image captured by car dashcams. Unlike existing approaches to driving hazard prediction that rely on computational simulations or anomaly detection from videos, this study focuses on high-level inference from static images. The problem requires predicting and reasoning about future events based on uncertain observations, which falls under visual abductive reasoning. To enable research in this understudied area, a new dataset named the DHPR (Driving Hazard Prediction and Reasoning) dataset is created. The dataset consists of 15K dashcam images of street scenes, and each image is associated with a tuple containing the car speed, a hypothesized hazard description, and visual entities present in the scene. These are annotated by human annotators, who identify risky scenes and provide descriptions of potential accidents that could occur a few seconds later. We present several baseline methods and evaluate their performance on our dataset, identifying remaining issues and discussing future directions. This study contributes to the field by introducing a novel problem formulation and dataset, enabling researchers to explore the potential of multi-modal AI for driving hazard prediction.


I. INTRODUCTION
In this paper, we consider the problem of predicting future hazards that drivers may encounter while driving a car. Specifically, we approach the problem by formulating it as a task of anticipating an impending accident using a single input image of the scene in front of the car. An example input image is shown in Fig. 1, which shows a taxi driving in front of the car in the same lane, and a pedestrian signalling with their hand. From this image, one possible inference is that the pedestrian may be attempting to flag down the taxi, which could then abruptly halt to offer them a ride. In this scenario, our car behind the taxi may not be able to stop in time, resulting in a collision. This simple example shows that predicting hazards sometimes requires abductive and logical reasoning.
Thus, our approach formulates the problem as visual abductive reasoning (VAR) [1], [2] from a single image. As an underlying thought, we are interested in leveraging recent advances in multi-modal AI, such as visual language models (VLMs) [3]-[8]. Despite the growing interest in self-driving and driver assistance systems, little attention has been paid, to the best of our knowledge, to the solution we consider here. Existing approaches rely on predicting accidents through computational simulations using physics-based [9] or machine-learning-based models of the environment [10]. For instance, they predict the trajectories of pedestrians and other vehicles. Another approach formulates the problem as detecting anomalies from input videos [11], [12]. However, these methods, which rely only on a low-level understanding of scenes, may have limitations in predicting future events that occur over a relatively long time span, as demonstrated in the example above. If we expand our focus beyond self-driving and driver assistance systems, there are several studies [13], [14] employing deep learning to predict accidents and analyze factors contributing to accidents within the broader context of traffic systems. However, these studies are beyond the scope of this research.

Fig. 1: Example of driving hazard prediction from a single dashcam image. The pedestrian in the green box may be attempting to flag down a taxi, and the taxi may abruptly stop in front of our car to offer them a ride.
The present study is among the first to apply multi-modal AI, integrating vision and language, to the prediction of driving hazards in the ego-car context. Our study seeks to realize in vehicle intelligence the complex reasoning and prediction capabilities that human drivers exercise when driving. This line of research is important because human drivers naturally perform complex reasoning tasks, such as understanding the intentions of other road users, including drivers and pedestrians. Current systems in autonomous vehicles and ADAS (advanced driver-assistance systems), which mostly rely on simulation-based methods, do not possess these reasoning capabilities. We argue that further research in this area could substantially improve how autonomous driving technologies function, addressing their present limitations. It should be noted that our approach does not seek to replace existing simulation-based methods. Instead, it naturally complements them by integrating human-like abductive reasoning into vehicle intelligence.
The advantage of VAR over simulation is its capability for extended forecasting. As the prediction horizon extends, the complexity of forecasting increases with the exponential growth in potential hazards. VAR stands out by integrating logical reasoning into forecasts, aiding in prioritizing events according to their likelihood. Specifically, it evaluates the behaviors of various traffic participants, such as pedestrians and drivers, to anticipate potential incidents. In contrast, simulation-based approaches often find it challenging to handle such predictive tasks efficiently. Figure 1 demonstrates the difficulties in predicting a taxi's stopping behavior without specialized algorithms.
An important note is that in the present study, we use a single image as input, which may seem less optimal than using a video to predict hazards encountered while driving. There are two reasons for this simplifying choice. First, human drivers are capable of making accurate judgments even from a static scene image, as demonstrated in the example above. Our study considers this particular type of hazard. Humans are apparently good at anticipating the types of hazards that may occur and further estimating the likelihood of each one. Second, there are technical challenges involved in dealing with video inputs. Unlike visual inference from a static image (e.g., visual question answering [15]), there is currently no established approach in computer vision for performing high-level inference from dynamic scene videos; see [1], [2] for the current state of the art. While videos contain more information than single images, we believe that there remains much room to explore in using single-image inputs.
To investigate this understudied approach to driving risk assessment, we present a formulation of the problem and create a dataset for it. Since actual car accidents are infrequent, it is hard to collect a large number of images or videos of real accidents. To cope with this, we utilize existing datasets of accident-free images captured by dashcams, specifically BDD100K (Berkeley DeepDrive) [16] and ECP (EuroCity Persons) [17]; they were originally created for different tasks, e.g., object detection and segmentation. From these datasets, we have human annotators first identify scenes that potentially pose risks, in which an accident could occur a few seconds later. We then ask them to provide descriptions of the hypothesized accidents with mentions of entities (e.g., traffic signs, pedestrians, other cars, etc.) in the scene.
The proposed dataset, named DHPR (Driving Hazard Prediction and Reasoning), is summarized as follows. It contains 15K scene images, for each of which a tuple of the car's speed, a description of a hypothesized hazard, and the visual entities appearing in the image is provided; see Fig. 2. There are at least one and up to three entities in each scene, each represented by a bounding box with its description. Each entity is referred to as 'Entity #n' with n (= 1, 2, 3) in the hazard description.
Despite certain limitations, DHPR can contribute to the development of AI systems capable of predicting potential hazards in traffic scenes, similar to human drivers. The first limitation is that DHPR only includes single images and vehicle speed as inputs, utilizing a limited portion of the information available in real driving scenarios. The second limitation is that the annotated hazards are hypothesized rather than real. However, these limitations do not undermine the utility of DHPR for two main reasons. First, it is curated by highly skilled annotators, as will be explained later, ensuring a rich collection of realistic hypothesized hazards. Second, although the hazards are restricted to those identifiable from still images and a given vehicle speed, the diversity and frequency of these hazards are substantial.
Based on the dataset, we examine the task of inferring driving hazards using traffic scene images. This task involves making inferences based on uncertain observations and falls under the category of visual abductive reasoning, which has been the subject of several existing studies [1], [2]. These studies have also introduced datasets, such as Sherlock [1] and VAR [2]. However, our study differs in several key ways, detailed in Table I. While our study specifically targets traffic scenes, we tackle a broader visual reasoning challenge within this context. This involves recognizing multiple objects, understanding their interactions, and engaging in multi-step reasoning. Aiming to deal with these challenges, we introduce a novel method tailored to this task, which emphasizes efficiently extracting relevant information from both input images and texts.
It is also worth mentioning that numerous studies on traffic accident anticipation have been conducted, building datasets with similar dashcam imagery (e.g., [11], [12], [18], [19]). However, these studies only provide annotations for closed-set classes of accidents/causation or address relatively uncomplicated traffic scenarios involving singular objects in inference. In contrast, our study includes annotations for open-set driving hazards expressed in natural language texts and focuses on more complicated visual reasoning in forecasting potential traffic hazards.
The following section provides a more detailed discussion of related work (Sec. II). We then proceed to explain the process of creating the dataset (Sec. III). Next, we explore various task designs that can be examined using this dataset (Sec. IV). The experimental results, which evaluate the performance of the proposed method and existing methods for general vision and language tasks in predicting driving hazards, are presented in Sec. V. Finally, we conclude our study in Sec. VI.

II. RELATED WORK
A. Reasoning in Traffic Scenes
Traffic accident anticipation has received significant attention in the field. We focus here exclusively on studies that utilize a dashboard camera as the primary input source.
The majority of these studies employ video footage as input and formulate the problem as video anomaly detection, where methods predict the likelihood of an accident occurring within a short time frame based on the input video. While some studies consider supervised settings [18], [27]-[29], the majority consider unsupervised settings, given the diversity of accidents. Typically, moving objects, such as other vehicles, motorbikes, and pedestrians, are first detected in input videos, and then their trajectories or future locations are predicted to identify anomalous events; more recent studies focus on modeling object interactions [11], [30]-[32]. Some studies consider different problem formulations and/or tasks, such as predicting the driver's attention in accident scenarios [33], learning accident anticipation and attention using reinforcement learning [34], and understanding traffic scenes from multisensory inputs by the use of heterogeneous graphs representing entities and their relations in the scene [35].
Meanwhile, to establish trust in autonomous driving and driver assistance systems, it is crucial that these systems offer explanations of their reasoning in a format that humans can understand, specifically through natural language explanations. Building on this premise, recent studies have approached the task by framing it as a challenge of image/video question answering [21], [25] or generating descriptions of the ego vehicle's actions [23], [24] or of the objects that may affect the ego vehicle's driving [19], [22], accompanied by relevant datasets.
Our study aligns with the aims of these prior works but places a stronger emphasis on the prediction of future potential hazards. This requires the identification and interpretation of multiple objects and their interactions within a traffic scenario. Although some studies [21], [22] have tackled the prediction of potential risks, there has been a lack of focus on the interactions of multiple objects. Moreover, our dataset stands out by providing comprehensive explanations of potential hazards, which have been extensively annotated to encompass a wide range of accident scenarios, including the underlying reasons for these accidents and their possible outcomes. The comparisons between our dataset and those of previous works are outlined in Table I.

B. Visual Abductive Reasoning
Abductive reasoning, which involves inferring the most plausible explanation based on partial observations, initially gained attention in the field of NLP [1], [2], [36], [37]. While language models (LMs) are typically adopted for the task, some studies incorporate relative past or future information as context to cope with the limitation of LMs that are conditioned only on past context [38]. Other researchers have explored ways to enhance abductive reasoning by leveraging additional information. For example, extra event knowledge graphs have been utilized [39] for reasoning that requires commonsense or general knowledge, and general knowledge and additional observations have been employed to correct invalid abductive reasoning [40]. However, abductive reasoning using language models still significantly underperforms, particularly in spatial categories such as determining the spatial location of agents and objects [41].
Visual abductive reasoning extends the above text-based task to inferring a plausible explanation of a scene, or of events within it, based on the scene's image(s). This extends beyond mere visual recognition into the "beyond visual recognition" paradigm, in which the machine's ability to perform visual abductive reasoning is tested in general visual scenarios. In a recent study, the task involves captioning and inferring the hypothesis that best explains the visual premise, given an incomplete set of sequential visual events [2]. Another study formulates the problem as identifying visual clues in an image to draw the most plausible inference based on knowledge [1]. To handle inferences that go beyond the scene itself, the authors employ CLIP, a multi-modal model pre-trained on a large number of image-caption pairs [42].

III. THE DHPR DATASET
A. Specifications
The DHPR dataset¹ provides annotations to 14,975 scene images captured by dashcams inside cars running on city streets, sourced from BDD100K (Berkeley DeepDrive) and ECP (EuroCity Persons). Each image x is annotated with:
• Speed v: a hypothesized speed v (∈ R) of the car.
• Entities {e_n = (e_bbox,n, e_desc,n)}, n = 1, ..., N: up to three entities (1 ≤ N ≤ 3) leading to a hypothesized hazard, each annotated with a bounding box e_bbox,n and a description e_desc,n (e.g., 'green car on the right side of the road').
• Hazard explanation h: a natural language explanation h of the hypothesized hazard and how it will happen, utilizing the entities {e_n}, n = 1, ..., N, involved in the hazard; each entity appears in the format 'Entity #n' with index n.
The 14,975 images are divided into train/validation/test splits of 12,975/1,000/1,000, respectively.

¹ The dataset will be made public upon acceptance.
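For concreteness, a DHPR sample as specified above can be represented by a small data structure. This is our own illustrative schema; the class and field names are not the dataset's actual format:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Entity:
    bbox: Tuple[float, float, float, float]  # e_bbox,n as (x1, y1, x2, y2) pixel coords
    desc: str                                # e_desc,n, e.g., 'green car on the right side of the road'

@dataclass
class DHPRSample:
    image_path: str        # the scene image x
    speed: float           # hypothesized car speed v (km/h)
    entities: List[Entity] # 1 <= N <= 3 entities involved in the hazard
    hazard: str            # explanation h, referring to entities as 'Entity #n'

# hypothetical example matching the annotation format
sample = DHPRSample(
    image_path="scene_00001.jpg",
    speed=50.0,
    entities=[Entity((120, 240, 300, 420), "taxi ahead in our lane")],
    hazard="Entity #1 may stop abruptly to pick up a passenger, and our car may fail to brake in time.",
)
```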

B. Annotation Process
We employ Amazon Mechanical Turk (MTurk) to collect the aforementioned annotations [43]. To ensure the acquisition of high-quality annotations, we administer an exam resembling the main task to identify competent workers, and only qualified individuals are invited to participate in the subsequent annotation process. We employ the following multi-step process to select and annotate images from the two datasets, BDD100K and ECP. Each step is executed independently; generally, different workers perform each step on each image; see the supplementary material for more details.
In the first step, we ask MTurk workers to select images that will be utilized in the subsequent stages, excluding those that are clearly devoid of any hazards. This leads to the choice of 25,000 images from BDD100K and 29,358 images from ECP. For each image, the workers also select the most plausible car speed from the predefined set [10, 30, 50+] (km/h) that corresponds to the given input image.
In the second step, we assign workers to evaluate whether the car could be involved in an accident within a few seconds, assuming it travels at 1.5 times the speed noted in the first step. This increase over the initially annotated speed is because the original images were captured under normal driving conditions, with no accidents occurring subsequently. Our aim is that this adjustment will aid workers in more effectively assessing accident risks and formulating realistic scenarios. Images judged as safe are excluded, reducing the total count from 54,358 to 20,791.
In the third step, we assign workers to annotate the remaining images by identifying potential hazards involving up to three entities. They draw bounding boxes, describe each entity, and explain the hazard, ensuring the hazard explanation is at least five words long and references all entities as 'Entity #n'. See examples in Fig. 2.
Finally, we conduct an additional screening to enhance the quality of the annotations. In this step, we enlist the most qualified workers to evaluate the plausibility of the hazard explanations in each data sample. This process reduces the number of samples from 20,791 to 14,975.

C. Hazard Types
To facilitate proper evaluation and detailed analysis of reasoning models, we categorized the hazards annotated as above into multiple types; see Table II. Initially, these hazards were divided into two groups based on the entities involved. The first group encompasses hazards related to other vehicles on the road, i.e., cars, buses, and bikes. The second group consists of hazards related to entities other than these vehicles. The latter group, termed "Entity Related," is smaller and further subdivided into four types based on the entity involved: Pedestrian, Stationary Object, Traffic Indication (e.g., traffic signals), and Driving Condition (i.e., unusual conditions such as sun glare or inclement weather). The former group, labeled "Driving Scenario," is more numerous and is categorized into the following five types according to the scenario of the hazard: Speeding and Braking, which frequently result in rear-end collisions; Sideswipe incidents stemming from improper distance estimation; Merging Maneuvers that lead to conflicts or misunderstandings about right-of-way; Unexpected Events, such as sudden stops during turns or mid-motion changes of intent; and Chain Reactions, where multiple vehicles contribute to a series of accidents. Table II shows the statistics for these hazard types in the validation and test subsets. It also displays the number of selected samples for retrieval tasks, which will be explained later.

IV. TASK DESIGNS
A. Task Definition
Our dataset supports a variety of tasks with varying levels of difficulty. Each sample in our dataset consists of (x, v, h, {e_1, ..., e_N}), where x is an input image, v is the car's speed, h is a hypothesized hazard explanation, and e_n = (e_bbox,n, e_desc,n) are the entities involved in the hazard.
In this study, we consider two main tasks, image/text retrieval and text generation, aiming to conduct a comprehensive evaluation to better understand the core aspects of the problem. The image/text retrieval task involves ranking a set of candidate images or texts based on their relevance to given input texts or images. The text generation task entails creating a natural language explanation h that corresponds to a specific input image x. Each task presents its own advantages and challenges, including varying levels of task difficulty and complexity in evaluation.
In addition, we must also consider how to handle visual entities, as different methods impact task difficulty. The most challenging method requires models to autonomously detect and identify entities. A simpler alternative is to provide models with the correct entities already marked as boxes in the input image. We opt for the latter method to minimize ambiguity in hazard prediction. In any given scene, multiple potential hazards may be hypothesized. Specifying the entities helps to narrow down the options available for the models to evaluate.
We address two types of retrieval tasks: image-to-text retrieval (TR) and text-to-image retrieval (IR). In TR, we rank each hazard explanation h from a set S_h based on its relevance to an input image x and its auxiliary data. As explained above, the boxes of involved entities E = {e_bbox,1, ..., e_bbox,N} are assumed to be provided for each image x. The models then compute a score for each h using these inputs as s = s(h, x, E).
In IR, we rank each pair of an image and its associated boxes (x, E) from a set of candidate image-box pairs S x,E using the same score.
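Since both retrieval directions use the same score s(h, x, E), TR and IR reduce to sorting the rows or columns of one score matrix. A minimal sketch with cosine similarity over random stand-in embeddings (the actual encoders are described later; the arrays here are placeholders, not model outputs):

```python
import numpy as np

def cosine_scores(img_emb: np.ndarray, txt_emb: np.ndarray) -> np.ndarray:
    """Score matrix S[i, j] = s(h_j, x_i, E_i) from L2-normalized embeddings."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    return img @ txt.T

rng = np.random.default_rng(0)
img_emb = rng.normal(size=(100, 512))   # 100 image-box pairs (x, E)
txt_emb = rng.normal(size=(100, 512))   # 100 hazard explanations h

S = cosine_scores(img_emb, txt_emb)
tr_ranking = np.argsort(-S, axis=1)  # TR: for each image, rank all candidate texts
ir_ranking = np.argsort(-S, axis=0)  # IR: for each text, rank all candidate image-box pairs
```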
In the generation task, models directly generate a hazard explanation h as natural language text from a given input image x and its associated entity boxes E. Typically, these models use large language models (LLMs), which generate the text autoregressively.

B. Evaluation Procedure and Metrics
In the retrieval tasks, we evaluate the models' performance using a subset of 100 selected samples for each of the validation and test subsets, as detailed in Table II. Acknowledging the original data's imbalance in hazard types, we ensure a balanced distribution of these types for a more meaningful evaluation. For TR, we rank the 100 texts (hazard explanations) for each image-box pair within the same 100 samples. For IR, we perform a similar ranking but with the roles of the query and the database reversed. To assess the quality of these rankings, we use two metrics: the average rank of the correct entry and Recall@k, which measures the frequency of the correct entry appearing within the top k positions.
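Both metrics follow directly from a score matrix in which entry (i, i) corresponds to the correct image-text pair. A short sketch of this computation (our own helper, not code from the paper):

```python
import numpy as np

def avg_rank_and_recall(scores: np.ndarray, k: int = 1):
    """scores[i, j]: relevance of text j to the i-th image-box pair;
    pair (i, i) is the correct one. Returns (average 1-based rank of the
    correct entry, Recall@k) for the TR direction."""
    n = scores.shape[0]
    # rank of the correct entry = number of candidates scored at least as high
    ranks = np.array([(scores[i] >= scores[i, i]).sum() for i in range(n)])
    return ranks.mean(), (ranks <= k).mean()

# toy score matrix where every correct pair scores highest
scores = np.eye(3) * 0.9 + 0.1
avg_rank, recall_at_1 = avg_rank_and_recall(scores, k=1)
# here every correct entry ranks first: avg_rank = 1.0, recall_at_1 = 1.0
```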
In the generation task, we utilize four standard metrics commonly used in image captioning: BLEU-4 [44], ROUGE [45], CIDEr [46], and SPIDEr [47]. These metrics assess the similarity between the generated text and its ground truth, i.e., the annotated hazard explanation. However, they primarily measure formal similarity across various aspects, rather than semantic similarity. To address this, we incorporate GPT-4 [48] to compute a semantic similarity score, which ranges from 0 to 100. We craft a specific prompt for GPT-4 to align the output score closely with our understanding of semantic accuracy. This method assesses spatial and causal relationships and the accuracy of entity references. It is also designed not to penalize the generated text for including additional content that is not present in the ground truth. More detailed information is provided in the supplementary material. Unlike the retrieval tasks, we use all the samples in the validation/test sets.

Fig. 3: The proposed method for the retrieval and generation tasks.

V. EXPERIMENTS
A. Methods

a) How to Input Visual Entities?: As explained in Sec. IV-A, we assume that the entities involved in hazard prediction are given for each input image x; they are given in the form of bounding boxes E = {e_bbox,1, ..., e_bbox,N}. Given that multiple entities can exist within a single image, they are identified in the hazard explanation as 'Entity #1', 'Entity #2', etc. Then, inference models, whether for retrieval or generation, must discern the relationship between these textual references and their corresponding bounding boxes.
To enable this, we employ an approach that augments the input image x with color-coded bounding boxes, following [1], [49]. Specifically, an opaque color is used to represent a bounding box. As there are up to three entities, we employ a simple color-coding scheme, i.e., using purple, green, and yellow to indicate Entities #1, #2, and #3, respectively. We employ alpha blending (with 60% opaqueness) between boxes filled with the above colors and the original image; see the bottom of Fig. 3 for examples. We will use x to indicate the augmented image in what follows. In this setup, models will learn the color-coding scheme through training on the DHPR training set.
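The augmentation described above amounts to alpha-blending a filled colored rectangle per entity onto the image. A sketch with NumPy; the exact RGB values for purple, green, and yellow are our assumption, not taken from the paper:

```python
import numpy as np

# assumed RGB values for the entity colors described in the text
ENTITY_COLORS = {1: (160, 32, 240), 2: (0, 200, 0), 3: (255, 255, 0)}  # purple, green, yellow

def blend_entity_boxes(image: np.ndarray, boxes: dict, alpha: float = 0.6) -> np.ndarray:
    """Alpha-blend a filled color box for each entity index n onto the image.
    boxes: {n: (x1, y1, x2, y2)}; alpha = 0.6 matches the 60% opaqueness."""
    out = image.astype(np.float32).copy()
    for n, (x1, y1, x2, y2) in boxes.items():
        color = np.array(ENTITY_COLORS[n], dtype=np.float32)
        out[y1:y2, x1:x2] = alpha * color + (1 - alpha) * out[y1:y2, x1:x2]
    return out.astype(np.uint8)

img = np.zeros((100, 100, 3), dtype=np.uint8)   # stand-in for a dashcam frame
aug = blend_entity_boxes(img, {1: (10, 10, 50, 50)})
# inside the box: 60% purple blended with the original pixels; outside unchanged
```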
Note that a different method is applied when using GPT-4V for zero-shot generation, as will be explained later. While similar color codes are used, the bounding boxes are unfilled instead of filled. No learning process is involved in this case; the colors of the boxes are directly specified in text prompts [50].
b) Proposed Method: We introduce a new method designed to serve as a baseline for our task. This method is based on CLIP [42], a well-established method for image and text retrieval in general contexts. CLIP has also been applied to VAR [1]. However, our task, DHPR, demands advanced capabilities, including the recognition of multiple objects (via the color-coding scheme) and the identification of their spatial relationships. Thus, our method focuses on enhancing the ability to extract pertinent information from both text and images. The retrieval score continues to use the cosine similarity between the embeddings from CLIP's encoders. We use the pre-trained CLIP ViT-L/14 for the visual encoder and BERT-base for the text encoder. The two added auxiliary encoders have a simple design, each comprising two standard Transformer layers.
For the generation task, the vision encoder trained as above is repurposed and integrated with a pre-trained LLM to construct a visual language model (VLM). We employ LLaMA-2 7B. Recent studies predominantly employ a blend of a pre-trained LLM and a vision encoder, specifically aligning the output of the vision encoder with the textual input of the LLM. While there exist variations in this methodology, we adopt the strategy outlined in [8]. Specifically, we employ multi-modality adapters [8] injected into the Transformer layers of the vision encoder and the LLM. We extract three [CLS] embeddings after every eight Transformer layers of the vision encoder and project them into the space of the LLM's input tokens. A notable difference from [8] is that we use the [BOS] embedding as a routing token. We append the projected tokens with the tokens of a text prompt, as illustrated in Fig. 3. We then train the adapters while freezing the CLIP vision encoder and the LLM for efficient training and to mitigate overfitting.

TABLE III: Results for the image retrieval (IR) and text retrieval (TR) tasks on the DHPR test split. The retrieval tasks are evaluated by the average rank and Recall@1. For all metrics except the rank metric, higher values indicate better performance.

c) Compared Methods: In addition to the above method, we evaluate several existing methods. The first is the original CLIP, fine-tuned on DHPR, for the retrieval tasks. The cosine similarity between the input image and text provides their relevance score. Additionally, we evaluate two popular VLMs, BLIP [51] and BLIP2 [4], for both retrieval and generation tasks. BLIP employs separate Transformers, a ViT for extracting image embeddings and a BERT Transformer for text embeddings. Besides a ViT, BLIP2 uses an additional Q-Former for extracting image embeddings. For the retrieval task, we calculate the relevance score using cosine similarity between image and text embeddings from these models. For the generation task, BLIP employs classic BERT as a text decoder, and BLIP2 uses an LLM, OPT 6.7B, as a text decoder.
For the generation task, we also evaluate a SOTA VLM, LLaVA-1.5 [7], which utilizes LLaMA 7B as the underlying LLM. Furthermore, we evaluate GPT-4V's capability to generate hazard explanations in a zero-shot setting. Distinctively, GPT-4V can accurately relate each bounding box in images with input/generated texts, requiring only instructions like 'Entity #1 is highlighted by the magenta box' in the prompt, without prior training on the DHPR dataset. To test its maximum performance in a zero-shot scenario, we include the ego car's speed in the prompt for GPT-4V, a piece of information not available to the other models. We utilized a GPT-4V model, gpt-4-vision-preview, through the OpenAI API. Details about the prompt design are available in the supplementary material.

B. Training
All models mentioned (except GPT-4V) are fine-tuned on the DHPR training set, following their respective training protocols. For the proposed method, we train (or fine-tune) the retrieval model using the image-text contrastive (ITC) loss [42] and the image-text matching (ITM) loss [52] from the two auxiliary encoders, as explained earlier. We train it for 15 epochs. For the generation task, we employ the cross-entropy loss as the training objective and train the model for 20 epochs following the setting in [8]. All our experiments are conducted on 4 A100 GPUs (40GB each), taking approximately 80 minutes for generation tasks and about 60 minutes for retrieval tasks.

a) Entity Shuffle Augmentation: While a hypothesized hazard explanation can contain multiple visual entities, their order in the explanation is arbitrary, e.g., 'Entity #1' may appear after 'Entity #2' in the text. As explained earlier, we assign a color to each index (n = 1, 2, 3), and this assignment is fixed throughout the experiments, i.e., purple = 'Entity #1,' green = 'Entity #2,' and yellow = 'Entity #3.' To help the models learn this color-coding scheme, we augment each training sample by randomly shuffling the indices of the entities that appear in the explanation, while keeping the color coding unchanged.
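The entity-shuffle augmentation can be sketched as follows: permute the entity indices appearing in the text and remap the boxes accordingly, so the fixed color-per-index assignment pairs each color with different entities across training samples. The function name and data layout below are ours, not the paper's:

```python
import random
import re

def shuffle_entities(hazard: str, boxes: dict, rng: random.Random):
    """Randomly permute entity indices in a hazard explanation.
    boxes maps index n -> bounding box; the color per index stays fixed,
    so after shuffling, each color is paired with a different entity."""
    indices = sorted(boxes)
    permuted = indices[:]
    rng.shuffle(permuted)
    mapping = dict(zip(indices, permuted))
    # replace all 'Entity #n' mentions simultaneously to avoid chained swaps
    new_text = re.sub(r"Entity #(\d)",
                      lambda m: f"Entity #{mapping[int(m.group(1))]}", hazard)
    new_boxes = {mapping[n]: b for n, b in boxes.items()}
    return new_text, new_boxes

text = "Entity #1 may brake suddenly because of Entity #2."
text2, boxes2 = shuffle_entities(text, {1: "boxA", 2: "boxB"}, random.Random(0))
```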

C. Quantitative Results and Discussions
Table III presents the results of the compared methods for the retrieval tasks, while Table IV details the results for the generation tasks. From the retrieval task results, we observe that both CLIP and our model, a variant that extends CLIP, surpass BLIP and BLIP2 in performance. This superiority is likely attributable to CLIP's comprehensive pretraining on a wide range of image-caption pairs. Additionally, our model, enhanced with dual auxiliary encoders, achieves the highest performance.
The experimental results for the generation task yield several observations. Firstly, the VLMs, such as BLIP2, LLaVA, and ours, attain fairly good performance. This is largely attributable to their use of advanced LLMs, highlighting the critical role of LLMs in the generation task.
Secondly, in the comparison between LLaVA and our model, both utilizing the state-of-the-art language model LLaMA-2, it is evident that our model outperforms LLaVA across all metrics. This outcome is unexpected, especially considering that LLaVA (version 1.5) employs visual instruction tuning with a variety of datasets, leading to top-tier results in general vision and language tasks. The likely reason for LLaVA's underperformance may be its approach to processing image information. Our task demands specific capabilities, such as accurately identifying and understanding entities within color-highlighted boxes. These skills are closely linked to the method of extracting information from images, which depends on the model's architecture and learning techniques. The fact that LLaVA freezes its visual encoder and uses a simple transformation to align its outputs with text tokens could be a significant factor in its lower performance.
The third observation concerns the evaluation results using GPT-4. It is seen that the GPT-4 scores of the VLMs, excluding GPT-4V as an inference model, generally align with the traditional metrics. However, unlike traditional metrics that focus primarily on formal similarities, we can expect that the GPT-4 score provides a more nuanced measure, capturing closeness in meaning. The scores of 56-58% for LLaVA and our model indicate a commendable level of performance in capturing true meaning, though there is still considerable room for further improvement. It is noteworthy that our model outperformed the others in this respect.

Fig. 4, example outputs:

Ground truth: We did not expect that Entity #1 would stop, giving preference to Entity #2, and cross over to our lane. Due to this, we would fail to stop at the speed and hit Entity #1.

Our model: Entity #1 would stop nearly due to more cars in the front in queue, and Entity #2 would cross our way by taking the center lane. We would crash into either Entity #1 or Entity #2 at this speed.

GPT-4V: At a speed of over 75 km/h, our car approached Entity #1 (the magenta box), which is a stationary vehicle that unexpectedly opened its door or moved into traffic, potentially causing a side-swipe collision. Concurrently, Entity #2 (the cyan box), which appears to be a large bus merging into our lane, could result in our car being squeezed between Entity #1 and Entity #2, leading to an accident.
The final observation is about the performance of GPT-4V as an inference model. GPT-4V operates in a zero-shot manner, meaning it has not learned what the correct text should be. As a result, the text produced by GPT-4V often significantly deviates in form from the expected correct text, usually providing overly detailed explanations. This tendency leads to its lower performance on the traditional metrics. However, GPT-4V yields a fairly high GPT-4 score. Assuming that the GPT-4 score accurately assesses the correctness of meaning, GPT-4V's zero-shot generation performance is impressive and may open new avenues for future research.

D. Qualitative Results
We showcase hazard explanations produced by our baseline model and GPT-4V in Fig. 4. Figure 4(a) illustrates that our model is capable of generating hazard explanations that are not only more accurate but also more semantically aligned with those provided by human annotators, compared to GPT-4V.
In Fig. 4(b), a scenario is presented that poses a greater challenge, necessitating the recognition of interactions among multiple objects for accurate hazard prediction. In this example, both models struggle to correctly infer the causal relationships among the involved objects. Specifically, our model does not accurately identify that the large bus (Entity #2) is in the process of changing lanes, leading to Entity #1 coming to a stop. GPT-4V exhibits a similar limitation in recognizing the interplay between the objects. Additional examples of hazard explanations generated by the models can be found in the supplementary materials.

E. Ablation Studies
Our proposed method is composed of multiple components and configurations. We conduct ablation tests to validate their effectiveness. Table V shows the results. More ablation studies are provided in the supplementary material.
In the retrieval task (Tables V-a to V-c), we find that a superior vision encoder (ViT-L/14) significantly enhances performance over its less powerful counterpart (ViT-B/16), as evidenced by improved average rank and R@1 scores. The addition of auxiliary image-text matching losses further boosts these metrics, affirming their importance in fine-tuning when compared with the CLIP baseline, which relies solely on a contrastive loss. Lastly, using entity shuffle augmentation leads to retrieval improvement, showing that it enhances the model's proficiency in associating color codes with entity names.
For the generation task (Tables V-d to V-f), the choice of text decoder emerges as a pivotal factor. The advanced LLaMA 2 model outperforms its predecessor (LLaMA 1), leading to higher BLEU-4 and CIDEr scores. The optimal use of three CLS tokens from the vision encoder as inputs to the text decoder is also established, showing superior results over other configurations. Additionally, using a BOS token as a routing token in the adapters proves more effective than the indicator token proposed by [7].

VI. CONCLUSION AND DISCUSSIONS
We have introduced a new approach to predicting driving hazards that utilizes recent advancements in multi-modal AI to enhance methodologies for driver assistance and autonomous driving. Our focus is on predicting and reasoning about driving hazards using scene images captured by dashcams. We formulate this as a task of visual abductive reasoning.
To assess the feasibility and effectiveness of the approach, we curated a new dataset called DHPR (Driving Hazard Prediction and Reasoning), featuring around 15,000 dashcam-captured scene images, annotated through crowdsourcing with details such as car speed, hazard explanations, and visual entities marked by bounding boxes and text. The dataset was used to create specific tasks and evaluate model performances, including a CLIP-based baseline, VLMs, and GPT-4V, on image-to-text and text-to-image retrieval tasks and a text generation task. The results affirmed the approach's feasibility and efficacy, offering insights for future research.
This paper serves as a preliminary exploration of using multi-modal AI for abductive reasoning in car driving, mimicking human behavior. Due to technical constraints and data acquisition challenges, the current method relies solely on still images and vehicle speed for inference, limiting its ability to address all real-world driving hazards. Despite these limitations, our results affirm the framework's utility.
Future research will extend the framework to include video input and a broader set of vehicle data, enhancing hazard identification capabilities. Notably, our framework can adapt to incorporate additional vehicle information simply by altering the initial prompts fed into our models. For instance, introducing an instruction like 'As we turn right' allows the model to integrate steering data to anticipate hazards contextually, which will be explored further in subsequent studies. The Driving Hazard Prediction and Reasoning (DHPR) dataset will be publicly available at:

B. Notes on Availability and Maintenance of the Data
The DHPR dataset created in this study will be openly available for access from the URL given above. We plan to establish an evaluation server and leaderboard in the future. Any updates pertaining to the dataset will be communicated through the aforementioned repository, ensuring that users have access to the most up-to-date information.

C. Ethical and Responsible Use
The present study complies with ethical standards for responsible research practice. Our dataset is built upon images from two existing datasets, ECP and BDD100K. It is compliant with the GDPR for ECP [2] and with other data-related regulations for BDD100K [1]. We protected the anonymity of personal information by blurring identifiable details in the images used in both the main paper and this supplementary material. The datasets are sourced following the licensing regime of each dataset. Almost all of the annotated hazards pertain to the risk of the ego-vehicle potentially colliding with an object. Table I details the various entities that the ego-vehicle may collide with in the hazard explanations for each data split.

III. USING GPT-4 FOR EVALUATING GENERATED TEXTS
As stated in our main paper, our experiments used OpenAI's GPT-4 (not the GPT-4V(ision) model used for inference) as one of the evaluation metrics for the generation task. We explain this method in detail here. To evaluate a model, we measure how similar the hazard explanation texts generated by the model are to the correct texts, i.e., the annotated hazard explanations. Our aim is to supplement conventional metrics, which can measure formal similarity but not semantic similarity. Here, GPT-4 refers to the gpt-4-1106-preview model, used via the API provided by OpenAI.
Specifically, given the correct and generated texts, we input a prompt containing both into GPT-4 and have it respond with a score reflecting their similarity, as shown in Fig. 1. The prompt is constructed by placing a task instruction at the beginning, followed by the correct and generated texts as a query. The construction of the prompt, particularly the task instruction, is crucial to making the evaluation meaningful. We focused on several aspects, described below.
Firstly, we set the output score range to 0-100. A perfect semantic match with the correct answer yields 100, while no commonality (including incomplete generated texts) results in 0. Intermediate values should ideally represent semantic similarity aligned with human intuition.
Secondly, given the nature of the hazard prediction task, it is essential that the generated text correctly refers to the entities indicated by colored boxes in the images. This accuracy should be rigorously evaluated.
However, as this task involves hypothetical hazard reasoning, there is often no single correct answer. Therefore, we only require the generated text to mention the same content as the correct text, without penalizing extra content not included in the correct text. This policy significantly affects the evaluation of GPT-4V as an inference model, as it tends to generate texts with additional content not in the correct answer. This is reasonable, as GPT-4V infers in a zero-shot manner (without learning from the DHPR training data), as described in Sec. 4.2 of the main paper and Sec. IV of this supplementary material. Figure 2 illustrates how the presence or absence of this policy changes the evaluation of the same generated text from GPT-4V. The input prompt is designed with these considerations in mind; see Fig. 1.
For efficiency of evaluation processing through the API, we included 25 pairs of correct and generated texts, corresponding to 25 scenes (images), in a single prompt. Furthermore, to ensure the reproducibility of the evaluation, we set the temperature of GPT-4 to 0.
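As a concrete sketch, the batching and deterministic scoring described above might look as follows in Python; the helper name, the SYSTEM_INSTRUCTION placeholder, and the scene ID are our own illustrations, while the model name, the 25-scene batching, and temperature=0 follow the description above:

```python
import json

def build_eval_query(pairs):
    """Pack up to 25 (scene_id, ref_text, generated_text) triples into the
    JSON query format of Fig. 1: {scene_id: {"ref": ..., "ours": ...}}."""
    assert len(pairs) <= 25  # 25 scenes per prompt, as described above
    return json.dumps(
        {sid: {"ref": ref, "ours": gen} for sid, ref, gen in pairs},
        ensure_ascii=False,
    )

# The call itself (requires an OpenAI API key); temperature=0 for reproducibility:
# from openai import OpenAI
# client = OpenAI()
# resp = client.chat.completions.create(
#     model="gpt-4-1106-preview",
#     temperature=0,
#     messages=[{"role": "system", "content": SYSTEM_INSTRUCTION},
#               {"role": "user", "content": build_eval_query(pairs)}],
# )
# scores = json.loads(resp.choices[0].message.content)
```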

IV. HAZARD EXPLANATION GENERATION BY GPT-4V
In the experiments for the generation task, we evaluated GPT-4V along with the other models. Here, we detail how GPT-4V executed these tasks. Unlike traditional text-only models, GPT-4V processes both text and images as input. As reported in [3], it can perform a variety of image understanding tasks and is currently considered one of the most advanced VLMs. For the experiments, the gpt-4-vision-preview model was utilized via the OpenAI API.
For each scene, we feed two inputs to GPT-4V: an image and a text prompt. Examples of these inputs, along with their corresponding output results, are illustrated in Fig. 3.
As shown in the figure, the input image includes color-coded bounding boxes (BBs), explicitly specifying the entities in the scene involved in a hypothesized hazard. Specifically, Entity #1, #2, and #3 correspond to magenta, cyan, and yellow, respectively.
In the text prompt (explained later), GPT-4V is instructed to refer to these entities in the generated text using the format 'Entity #n', corresponding to the color-coded BBs in the image.

System Instruction Prompt
You are a highly accurate language assistant. Your task is to compare the sentences under 'ref' with those under various model predictions (e.g., 'model1', 'model2', etc.) for each key, and provide a semantic meaning similarity score for every described accident situation, with a particular focus on the spatial and causal relationships leading to the collision. Any prediction that forms an incomplete sentence must be assigned a score of 0. Scores should range from 0 to 100, where a higher score indicates greater similarity in accurately representing these relationships. Please calculate and provide the score for each model's prediction, comparing how well they reflect the accident event and the relationships between entities as described in 'ref'. The data for evaluation is presented in the following JSON format, where Entity #1, Entity #2, and Entity #3 refer to distinct objects and are not interchangeable. Situations involving different entities should be considered entirely dissimilar, warranting a similarity score of 0 only for the model(s) that reference an incorrect entity. Additional details that can still fit in the situation of the 'ref' are not considered as factors that reduce the similarity score. Additional context that fits in the 'ref' does not reduce the score. Be aware that some sentences speak from an ego-car perspective, so they will not mention themselves. Ignore the presence of 'possibility words' such as 'might', 'could', 'would', or 'may' in the prediction sentences and do not let these affect the similarity score. Provide answers in JSON format, including everything in the same JSON, with no additional text explanations. The format should be: {k1:{model1: score, ...}}.

An Example Query Prompt
{
  "07434289-7c2e8f71": {
    "ref": "Entity #1 is in green color, so I will not reduce my speed, now Entity #2 from the opposite slowed down its speed while turning, at this time, I could not stop my car in time.",
    "ours": "Entity #1 is in green, so I will not reduce my speed. At this time, as Entity #2 suddenly reduces its speed, my car will hit Entity #2."
  }
}

Fig. 1: Our prompt for GPT-4 to evaluate the similarity between a model-generated hazard explanation text and its corresponding ground truth text. The prompt includes a system instruction, followed by query prompts. Each query prompt contains the generated text alongside the ground truth text for comparison.
It is important to note that while the method of indicating entities with colored BBs is common across all the compared models, we adopted a different approach to coloring these boxes for GPT-4V than for the other models. Unlike the other models, for which each box is filled with an opaque color, for GPT-4V we simply color the outlines of the BBs, since this is a proven method for GPT-4V [4]. Even without training on images with embedded color-coded BBs, GPT-4V works sufficiently well by simply having each input text prompt explain which entities in the image correspond to Entity #1-3.

Hazard Evaluation Inspection

System Instruction: You are a highly accurate language assistant. Your task is to ... Additional details that can still fit in the situation of the 'ref' are not considered as factors that reduce the similarity score. Additional context that fits in the 'ref' does not reduce the score. Be aware that ...

Case 1: Complete System Instruction

User Input: "84ff92de-6ea2aced": { "ref": "Entity #2 and Entity #1 are about to crash into each other. As that happens, I may not be able to stop my car in time considering my current speed.", "gpt4v": "As our car was traveling at speeds over 75 km/h, it was unable to stop in time when Entity #2, the vehicle in the Cyan box, made a sudden lane change. The high rate of speed also reduced our reaction time, leading to a collision with Entity #1, the object in the Magenta box, which was stationary at the side of the road." }

GPT-4 Answer: { "84ff92de-6ea2aced": { "gpt4v": 70 } }

User Input:
Please explain the reason for the zero score.

GPT-4 Answer:
The zero score was assigned because the model's prediction introduces several elements that are not present in the reference sentence, thus changing the scenario significantly: 1. The reference does not specify the speed of the car ("over 75 km/h"). 2. The reference does not describe Entity #2 making a "sudden lane change," which is a new action introduced by the model. 3. The reference does not mention that Entity #1 is "stationary at the side of the road," which is a specific detail added by the model. 4. The reference does not specify the colors "Cyan" and "Magenta" for the entities, which the model includes.
These discrepancies alter the spatial and causal relationships of the accident scenario, leading to a score of 0 for semantic meaning similarity.

Prompt:
You are a highly accurate decision-making assistant. Your task is to create a hypothetical accident explanation involving our car. Given the speed of our car and the entities involved in the image from the dashcam view, each entity is highlighted by a specific color box. Refer to them as follows: Entity #1 for the Magenta box, Entity #2 for the Cyan box, and Entity #3 for the Yellow box. Up to three entities may be involved. Please do not address anything not visible in the image. Provide a short hypothetical accident paragraph, no longer than two sentences, considering that our car is traveling at a speed of 45 km/h, and there are two entities involved.

GPT-4V:
In the given scenario, if our car, traveling at a speed of 45 km/h, were to be involved in a hypothetical accident with Entity #1 (Magenta box) and Entity #2 (Cyan box), it could occur as follows: Our car could fail to maintain a safe following distance and rear-end Entity #2 due to a sudden stop or decreased speed, potentially causing Entity #2 to lurch forward and collide with Entity #1.

Fig. 3: Example of input text and image prompts for GPT-4V, and the output text.
The other input, the text prompt, consists of a fixed template, independent of the input scene, together with contextual information that changes with the scene and is embedded in the template; see Fig. 3. The template provides the task instructions, while the contextual information comprises two pieces of information: the ego-vehicle's speed and the number of entities involved in the hazard to predict, both of which are included in the DHPR dataset.
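To illustrate, embedding these two context fields into the fixed template could be sketched as follows (the template is abridged from Fig. 3; the function name and the number-to-word mapping are our own assumptions):

```python
TEMPLATE = (
    "You are a highly accurate decision-making assistant. Your task is to create a "
    "hypothetical accident explanation involving our car. [...] Provide a short "
    "hypothetical accident paragraph, no longer than two sentences, considering that our "
    "car is traveling at a speed of {speed} km/h, and there are {n} entities involved."
)

def build_gpt4v_prompt(speed_kmh, num_entities):
    """Embed the scene's context information into the fixed instruction template."""
    words = {1: "one", 2: "two", 3: "three"}  # DHPR involves up to three entities
    return TEMPLATE.format(speed=speed_kmh, n=words[num_entities])
```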
Note that this information is not provided (i.e., not included in the prompt) to models other than GPT-4V. This approach for GPT-4V, partly due to its zero-shot inference nature, is intended to prevent its outputs from exceeding the intended scope of the dataset. Figure 4 illustrates output examples in the absence of these context details: three different outputs for the same input.
As these examples demonstrate, without specifying the ego-vehicle's speed, GPT-4V contemplates various possibilities for the vehicle's behavior, such as driving in reverse or unexpected acceleration. Similarly, not specifying the number of entities can lead GPT-4V to fabricate non-existent entities based on the color-coding information in the instruction template, always referring to the maximum number of entities.
Examples of generated hazard explanations are presented in Sec. H.

GPT-4V No Context Information Prompt:
You are a highly accurate decision-making assistant. Your task is to create a hypothetical accident explanation involving our car. Given the speed of our car and the entities involved in the image from the dashcam view, each entity is highlighted by a specific color box. Refer to them as follows: Entity #1 for the Magenta box, Entity #2 for the Cyan box, and Entity #3 for the Yellow box. Up to three entities may be involved. Please do not address anything not visible in the image. Provide a short hypothetical accident paragraph, no longer than two sentences.

GPT-4V Answer 1:
In a hypothetical scenario where an accident occurs, Entity #1 (the vehicle in the magenta box) might unexpectedly reverse into the intersection while our car is proceeding forward, causing a collision. Simultaneously, Entity #2 (the vehicle in the cyan box) could misjudge the traffic flow and make an abrupt lane change without signaling, leading to a side impact with Entity #1 or our car. Entity #3 (marked by the yellow box), although not directly involved, may contribute to the confusion if it is illegally parked and obstructing the view of the drivers.

GPT-4V Answer 2:
In the given scenario, if our car were to suddenly accelerate or fail to stop at the intersection, a hypothetical accident might occur as follows: Our car could collide with Entity #1 (the car in the Magenta box), which appears to be crossing the intersection perpendicular to our direction of travel. If Entity #2 (the car in the Cyan box) were to simultaneously initiate a left turn or stop unexpectedly, our car might be unable to avoid a secondary collision with Entity #2 due to the close proximity and potential loss of control following the initial impact. Entity #3 (marked by the Yellow box) is not directly in the path of our car and appears less likely to be involved based on this single frame.

GPT-4V Answer 3:
In a hypothetical accident scenario, our car, while driving at a moderate speed, could be involved in a collision if Entity #2 (Cyan box), which appears to be a vehicle directly ahead, were to stop abruptly, leaving insufficient space for our car to come to a safe stop. Concurrently, if Entity #1 (Magenta box), another vehicle to the right, were to swerve unexpectedly into our lane while trying to avoid Entity #3 (Yellow box), a stationary object or a vehicle performing an action like turning, this could result in a multi-car collision involving our vehicle.
Fig. 4: Example of generated texts from GPT-4V when no context information is provided (i.e., ego-vehicle speed and the number of involved entities).

A. More Details of the Proposed Model
The proposed baseline model consists of a text encoder, a vision encoder, and an LLM, as explained in Sec. 5.1 of the main paper.
For the text and vision encoders, we utilize the pre-trained models sourced from OpenAI's CLIP repository. We augment each encoder by incorporating two additional Transformer layers, stacked on top of the existing structure. These added layers adhere to the standard Transformer architecture, with self-attention and cross-attention mechanisms. In terms of configuration, each layer is designed with a token dimensionality of 512 and is equipped with eight attention heads. Furthermore, we apply a dropout rate of 0.1. To enhance positional awareness, we employ relative position embeddings [5], with a maximum distance of 128 and a total of 32 buckets.
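For reference, the bucketing underlying such relative position embeddings, with 32 buckets and a maximum distance of 128, could be sketched as follows; this is a T5-style reading of [5] and our own simplification, not necessarily the authors' exact implementation:

```python
import math

def relative_position_bucket(rel_pos, num_buckets=32, max_distance=128):
    """Map a relative token offset to one of `num_buckets` embedding indices:
    half the buckets per direction, exact buckets for small offsets, and
    logarithmically spaced buckets out to `max_distance`."""
    n = -rel_pos
    num_buckets //= 2                       # split buckets between the two directions
    ret = num_buckets if n < 0 else 0
    n = abs(n)
    max_exact = num_buckets // 2
    if n < max_exact:                       # small offsets get their own bucket
        return ret + n
    log_bucket = max_exact + int(
        math.log(n / max_exact) / math.log(max_distance / max_exact)
        * (num_buckets - max_exact)
    )
    return ret + min(log_bucket, num_buckets - 1)
```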
For the LLM, we utilize the LLaMA-2 7B model.We follow the configuration specified in Luo et al.'s research on visual instruction tuning with LLMs [6].
Our model incorporates two loss functions to train the retrieval tasks: the Image-Text Contrastive (ITC) loss and the Image-Text Matching (ITM) loss. For the ITC loss, we follow the original CLIP [7]. For the ITM loss, we intentionally introduce a mismatched pair in half of the image-text pairs during training. Our model's architecture contains two auxiliary encoders: the Image-to-Text (I-2-T) and Text-to-Image (T-2-I) encoders. The ITM loss is calculated from their outputs as follows. Each of the two base encoders generates a [CLS] token. The ITM encoders transform the corresponding class token into a binary logit, on which the binary cross-entropy loss is calculated.
After completing training on the retrieval tasks, we proceed to fine-tune our model for the generation tasks. As outlined in Sec. 5.1 of the main paper, we repurpose the trained vision encoder and integrate it with the LLM. This phase of training uses the standard objective of next-token prediction and employs the DHPR training set. In this setup, the model receives a preprocessed image and a text prompt as inputs and generates hazard explanation text as its output. We follow the configuration parameters detailed in [6] for our model's fine-tuning. This includes setting an effective batch size of 32 and running a total of 20 training epochs. In accordance with their guidance, we configure the adapters within both the vision encoder and the LLM to have a dimensionality of 8. Notably, during the training phase, we exclusively optimize these adapter parameters.
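As an illustration, the mismatched-pair construction for the ITM loss might be sketched as below; in practice this operates on batch tensors inside the training loop, and the exact sampling scheme is our assumption:

```python
import random

def make_itm_batch(images, texts, seed=0):
    """Build ITM training triples (image, text, label): even-indexed pairs stay
    matched (label 1); odd-indexed pairs get a text sampled from another sample
    as a mismatched negative (label 0), so half of the batch is mismatched."""
    rng = random.Random(seed)
    triples = []
    for i, (img, txt) in enumerate(zip(images, texts)):
        if i % 2 == 0:
            triples.append((img, txt, 1))
        else:
            # sample the text of a different sample as the negative
            j = rng.choice([k for k in range(len(texts)) if k != i])
            triples.append((img, texts[j], 0))
    return triples
```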

B. Training of Compared Models
For the retrieval tasks, we apply a uniform training procedure across all models, including CLIP, BLIP, BLIP2, and our own model. We utilized the public repository for each model. Each model begins training with weights from its corresponding pretrained version and undergoes 15 epochs of weight updates. For the baseline, CLIP employs the ITC loss. BLIP adds the ITM loss; see also Sec. V-A.
BLIP2 utilizes the Image-Grounded Text Generation (ITG) loss within its Querying Transformer (Q-Former) architecture, in addition to the ITC and ITM losses. For training BLIP2, we adopt the stage-1 training method from the original BLIP2 study by Li et al. [8], using the DHPR training set. Thus, the ITG loss involves prompting the Q-Former to generate hazard explanation texts based on input images, which are then compared to their ground truths.
For the generation task, we adhere to standard fine-tuning procedures, specifically tailored to each model. In alignment with their respective original studies, we fine-tune the BLIP model for 5 epochs [9] and the BLIP2 model for 5 epochs [8]. However, we deviate from the recommended protocol for LLaVA-1.5, opting to fine-tune it for 12 epochs, as opposed to the suggested single epoch [10]. This adjustment is based on our finding that the standard setting for LLaVA-1.5 resulted in subpar performance, whereas extending the fine-tuning to 12 epochs, a strategy derived from LLaVA-1.0, significantly improves outcomes. We utilize the same next-token prediction loss for training; see also Sec. V-A for more details.

C. Image Preprocessing
In line with standard practices, we preprocess input images for all models under comparison as follows. For models other than those using ViT-L/14, images are first resized to a square of 224+16 pixels and then randomly cropped to 224×224. In contrast, for baseline models employing ViT-L/14, images are resized to a square of 336+16 pixels and subsequently cropped to 336×336. Additionally, we apply color jitter augmentation with settings of 0.5 for brightness, 0.3 for hue, and 0.3 for saturation, prior to accentuating the regions containing entities of interest. Notably, we avoid horizontal flip augmentation to preserve the spatial integrity of the images.
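The resize-then-crop geometry above can be made concrete with a small helper (the function is our own sketch; color jitter and entity highlighting are applied separately, as described):

```python
import random

def preprocess_geometry(use_vit_l14, rng=None):
    """Return (resize_size, crop_box): resize the image to an (S+16)-pixel
    square, then take a random S x S crop, with S = 336 for ViT-L/14 models
    and S = 224 otherwise. No horizontal flip is applied."""
    rng = rng or random.Random(0)
    s = 336 if use_vit_l14 else 224
    resized = s + 16
    left, top = rng.randint(0, 16), rng.randint(0, 16)  # crop offset in [0, 16]
    return resized, (left, top, left + s, top + s)
```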

VI. ABLATION TEST WITH IMAGE/TEXT INPUTS
As outlined in the main paper, the DHPR dataset allows us to design various tasks, each with a distinct level of difficulty. Selecting the text/image retrieval tasks as a testbed, we conducted further experiments to explore the impact of different input formats. In particular, we varied the format of each of a scene image and a hazard explanation to assess their effects on retrieval performance. In all the experiments, we used the proposed baseline.
a) Image Input: To test the effects of different formats of image inputs, we considered four types: 'position only', 'no entity', 'no context', and 'only context', as illustrated in Fig. 5. We trained and tested the model using each format.
The results are presented in Table III. Notably, when only the positions of entities were shown ('position only'), a significant decline in performance occurred, with the average rank worsening to over 16.7. This outcome is reasonable, as the model lacks visibility of the visual entities and their context. Furthermore, excluding any direct specification of entities ('no entity') also led to considerably poorer performance. This result highlights the necessity for the models to accurately identify the entities present in the given hazard explanations in order to estimate the similarity between an input image and an explanation.
Interestingly, when the context was omitted from the input images ('no context'), we observed relatively good results. The retrieval ranks varied between 11.8 and 12.0, and the recall at 1 (R@1) exceeded 20%. However, these results were marginally but noticeably inferior to those of the baseline. In contrast, using only the contextual information ('only context') led to significantly poorer outcomes than removing the context. These findings highlight the critical role of both the visual entities and their surrounding context in making accurate predictions.
b) Text Inputs: In terms of text input formats, we add the descriptions of visual entities to the hazard explanations. It is important to note that these descriptions have not been utilized in our previous experiments, although DHPR includes them as part of its annotations. Specifically, we enhance the hazard explanation for each sample by incorporating the descriptions of all visual entities into the corresponding sections of the explanation. This involves replacing the first occurrence of "Entity #n" with "Entity #n, <its description>." For example, the explanation "Entity #1 decides to go behind of Entity #2 to cross street misjudges my speed, can't stop in time and hits Entity #1" changes into "Entity #1, cyclist on right side by sidewalk, decides to go behind of Entity #2, white car in front of my car, to cross street misjudges my speed, can't stop in time and hits Entity #1." We conducted two experiments for training and testing the model. The first experiment involved training the model using the comprehensive format of explanations mentioned above and testing it on the original format of explanations (without the descriptions). This experiment aimed to improve the model's performance on the test split while maintaining the same experimental setting as before.
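Concretely, the description-insertion step can be sketched as follows (the helper name is ours; the entity descriptions come from the DHPR annotations):

```python
import re

def insert_entity_descriptions(explanation, descriptions):
    """Replace the first occurrence of each 'Entity #n' with
    'Entity #n, <its description>,' and leave later mentions unchanged."""
    seen = set()

    def repl(match):
        tag = match.group(0)  # e.g. "Entity #1"
        if tag in seen or tag not in descriptions:
            return tag
        seen.add(tag)
        return f"{tag}, {descriptions[tag]},"

    return re.sub(r"Entity #\d+", repl, explanation)
```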
The second experiment involved testing the same model on the comprehensive explanation format, which includes the additional descriptions. This experiment aimed to evaluate the impact of incorporating these descriptions. However, it is important to note that obtaining the descriptions requires inference and they are not freely available. Therefore, we consider this case an "oracle" scenario.
The lower block of Table III presents the results. When the model trained on comprehensive explanations is evaluated on the original explanations without the entities' descriptions, it performs significantly worse than the baseline. This decline in performance can be attributed to the disparity in format

Fig. 5: Illustrations of our image input ablations ('only context', 'no entity', 'no context', and 'position only'), which are conducted by drawing in pixel space directly, following [11]. The sample explanation reads: "Entity #1 decides to go behind of Entity #2 to cross street, misjudges my speed, can't stop in time and hits Entity #1."
between the training and test data, indicating that the intended aim was not achieved. However, when the same model is evaluated on the test data using the comprehensive format, there is a notable enhancement in its performance, with an average reduction of 6.5 in its ranking. We attribute this improvement mainly to the model's ability to associate the added entity descriptions with the contents of the images. This is why we termed the setting 'oracle.'

VII. QUALITATIVE ANALYSIS ON THE RETRIEVAL TASKS
We present here several results for the text retrieval task using our proposed baseline model. Figure 6 displays two instances of successful retrieval. In both cases, the ground-truth explanation appears first in the rankings, denoting accurate retrieval. Notably, in Fig. 6(b), the other top-ranked results are semantically similar to the ground-truth explanation positioned at the top rank.
Figure 7 shows challenging examples of the text retrieval task. In the scene shown in Fig. 7(a), the ground-truth explanation expects our car to turn left and collide with Entity #1. The explanations ranked 1st and 2nd propose that our car continues straight while Entity #1 makes a left turn, resulting in a collision. Both explanations present valid hazard hypotheses. Figure 7(b) showcases an example where the ground-truth explanation is ranked very low, specifically at the 25th position. This discrepancy might be due to the term 'red lights' being used to refer to a specific concept, the reverse lamp of a bus, rather than its typical meaning of a traffic signal. Nonetheless, the explanations ranked 1st, 2nd, and 3rd convey meanings highly similar to the ground-truth explanation: the ego-vehicle cannot stop and collides with the bus.
We found from the results that our model tends to favor shorter, similar versions of hazard explanations, leading to inaccurate ranking of longer and more complex sentences. A typical example is shown in Fig. 8(a), where the ground-truth explanation was assigned only the 6th rank.
Furthermore, we encountered a different type of challenge that may arise from using a single image instead of a video or multiple frames. In Fig. 8(b), the car in front of ours appears to be turning left towards our left side. However, our model seems to assume that the car is moving backwards and turning right in front of our car. Consequently, it retrieved hazard explanations that predict the convergence of both cars into the same lane, resulting in a collision.

VIII. QUALITATIVE ANALYSIS ON THE GENERATION TASK
We present multiple examples of hazard explanation texts produced by our baseline model and GPT-4V. Recall that our model has undergone fine-tuning on the DHPR training set, whereas GPT-4V has not and relies solely on zero-shot inference. However, GPT-4V is provided with extra context information, such as the speed of the ego-vehicle and the total number of entities involved in the hazard.
Figure 9 presents two examples pertaining to the hazard type Speeding & Braking. Both our model and GPT-4V generate hazard explanations that closely match the annotated example. Our model consistently follows a specific sentence structure, as outlined in the instructions. In contrast, while GPT-4V produces sensible explanations, it deviates from the instructed format. This deviation results in a failure to explicitly mention one of the two entities shown in Fig. 9(b).
Figure 10 displays examples for the Merging Maneuver hazard type. Our model generates explanations that are similar in content to the annotated ones, but they are overly simplified and could apply to a wide variety of situations. GPT-4V, on the other hand, incorrectly identifies the traffic sign in example (a) and the movement direction of the referenced vehicle in example (b). Moreover, it typically starts with descriptions of the specified entities before moving on to the reasoning aspect.
In the Pedestrian hazard type examples shown in Fig. 11, our model produces hazard explanations that align semantically with the annotated examples but, similar to previous instances, are overly simplistic. In contrast, GPT-4V offers detailed descriptions of the entities involved and bases its hazard predictions on these details. However, in example (a), the incorrect interpretation of Entity #1's action leads to the prediction of an implausible hazard.
Figure 12 presents additional examples of the Pedestrian hazard type. Our model accurately explains the hazard in example (a), but in example (b), it incorrectly predicts the pedestrian's motion direction. It is implausible that a pedestrian not facing the road would walk into the street. In contrast, GPT-4V correctly identifies the hazard in example (b), recognizing that the pedestrian is about to step off the curb, and its explanation aligns with the annotation. However, in example (a), GPT-4V incorrectly identifies a traffic sign as a taxi when recognizing Entity #2.
In examples involving the Unusual Condition hazard type shown in Fig. 13, our model delivers effective explanations, demonstrating a high degree of semantic similarity to the annotations. GPT-4V also performs well by accurately identifying the relevant entities and offering hazard explanations that closely match the annotations. Figure 14 presents Sideswipe hazard type examples. Both models provide reasonable explanations for example (a). However, in example (b), both models incorrectly identify the entity as a parked vehicle, highlighting the inherent difficulty in discerning whether vehicles shown in still images are moving or stationary.
The explanations for the Stationary Object hazard type examples in Fig. 15, provided by both models, are logical and align well with the annotations. Remarkably, GPT-4V demonstrates an accurate understanding of the sides of the road. This is surprising, considering that it struggles with more specific entities, such as traffic signs or traffic lights, as demonstrated in previous examples.
Figure 16 illustrates examples of the Traffic Signal hazard type. Our model successfully formulates hypotheses about accident causes, such as braking or lane changing. However, in example (b), the logic appears flawed; the connection between the speed limit sign (Entity #2) and the actions of another vehicle (Entity #1) is not well-defined. Conversely, GPT-4V does not accurately identify the traffic signals and signs in these examples, leading to implausible hazard explanations.
Figure 17 illustrates examples of Unexpected Events and Chain Reaction hazard types, where neither model offers fully accurate reasoning. Both models present explanations that do not correspond well with the given annotations. Although our model's explanation seems logical, it fails to clearly state the causal relationship between the events involved. Specifically, in example (b), our model does not clarify that Entity #1 leads to the actions of Entity #2, unlike the annotated explanation where the first entity causes the second to brake suddenly. Additionally, in example (b), GPT-4V incorrectly identifies a white truck as a pedestrian.
The behavior of our model and GPT-4V, as observed in the examples, can be summarized as follows. Our model, which is fine-tuned, excels in following specific instruction formats and produces consistent, patterned outputs, but it tends to oversimplify explanations of hazards. In contrast, GPT-4V is proficient in a descriptive-first, reasoning-based approach to inference. However, it often struggles with accurate entity recognition and full contextual understanding of images, leading to errors in hazard prediction. Notably, GPT-4V frequently fails to recognize traffic signs and signals. Both models exhibit limitations in spatial and temporal recognition of traffic scenes. These observations suggest substantial room and necessity for improvement, implying several directions for future research.

IX. LICENSE
The image assets from the BDD100K dataset are distributed under the BSD 3-Clause License, while the ECP dataset is governed by the EuroCity Persons Dataset Research Use License. Our usage of both datasets complies with their respective licenses, and we employ anonymization techniques, such as blurring identifiable faces and license plates, to adhere to regulations governing personal data processing.
The DHPR dataset created in this study is licensed under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license. This license allows others to use, adapt, and distribute the dataset, provided they give appropriate credit to the original creator and do not use the dataset for commercial purposes.

A. Overview
We used Amazon Mechanical Turk to create the dataset. Figure 18 provides an overview of the data collection process. The process involves two steps: a preliminary step for qualifying workers and the main pipeline for creating annotations. Only the workers who passed this qualification step were invited to participate in the main pipeline (Sec. X-B).
The main pipeline consists of three tasks:
• Task 1: The car's speed is estimated for each input scene image (Sec. X-C);
• Task 2: The possibility of an accident occurring is estimated so that risk-free scene images are removed (Sec. X-D);
• Task 3: A driving hazard is hypothesized and annotated for an input image (Sec. X-E).
We employ these three steps for the following reasons. In Task 3, workers are asked to annotate a hypothesized hazard for each input image. Since the original images are of accident-free scenes, it can often be challenging to identify hazards, even as hypotheses. If workers are given the freedom to choose whether or not to annotate an image based on its difficulty, we will struggle to collect a sufficient amount of annotations. However, it is also problematic to force workers to identify hazards in risk-free scenes. Therefore, we designed Task 2 to address this issue. In Task 2, we remove scene images with minimal risk and transfer the remaining images to Task 3. To avoid removing too many images, we ask workers to assess the risk of the scene images by assuming a 50% increase in the car's speed. We designed Task 1 to estimate the normal speed of the car.
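The flow of the three tasks can be sketched in a few lines of Python. This is a minimal, hypothetical illustration (the real pipeline runs as Amazon Mechanical Turk HITs; `assess_risk` and `annotate_hazard` stand in for the human judgments made in Tasks 2 and 3):

```python
def run_pipeline(images, assess_risk, annotate_hazard):
    """Sketch of the three-task annotation pipeline; all names are illustrative."""
    annotated = []
    for img in images:
        speed = img["speed_kmh"]  # Task 1: the car's speed annotated per image
        # Task 2: judge accident risk assuming a 50% higher speed, so that
        # borderline scenes are not discarded as risk-free too aggressively.
        if not assess_risk(img, speed * 1.5):
            continue
        # Task 3: hypothesize and annotate a hazard for the surviving images.
        annotated.append(annotate_hazard(img))
    return annotated

# Toy usage with stand-in judgments.
images = [{"id": 1, "speed_kmh": 30}, {"id": 2, "speed_kmh": 60}]
risky = lambda img, v: v >= 60          # pretend only fast scenes look risky
hazard = lambda img: {**img, "hazard": "Entity #1 brakes suddenly"}
print([a["id"] for a in run_pipeline(images, risky, hazard)])  # → [2]
```

Note that the 1.5× factor is applied only for the Task 2 judgment; the annotated speed itself is kept unchanged in the dataset.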
For each task in the qualification step and the main pipeline, we created HITs (Human Intelligence Tasks). All communication was conducted in English. The dataset was collected and evaluated from January through March 2023. Throughout the process, we paid the workers around $10 USD/hour. To be specific, we paid $0.02 USD per HIT for multiple-choice HITs such as Task 1 and Task 2. For Task 3, we paid $0.2 USD per HIT, and some workers may finish this within 40 seconds.
We input 54,358 images from two datasets, BDD100K and ECP, to the main pipeline, followed by an additional qualification step (Sec. X-B). Consequently, we acquired the annotations for a total of 14,975 images, which comprise the final DHPR dataset.

B. Preliminary Step: Qualification/Screening of Workers
As mentioned above, we utilized a qualifying test to identify competent workers. This test not only serves as an evaluation tool but also provides potential workers with an overview of the tasks discussed earlier. On the initial page, we present essential information in the form of clickable/expandable items within a menu. This includes a description of the qualifying task (Fig. 20), instructions on annotating visual entities (Fig. 21), guidelines for writing effective hazard explanations (Fig. 22), and examples of exemplary annotations (Fig. 23).
The qualifying task is designed to mimic Task 3. It serves three objectives: (i) to assess the workers' understanding of the instructions, (ii) to evaluate their experience and

GPT-4V:
Our car, traveling at a speed of 45 km/h, could potentially be involved in a rear-end collision with Entity #1 if it were to suddenly brake and we did not have sufficient time to react and stop.

GPT-4V:
In the hypothetical scenario, traveling at a speed of 75+ km/h in wet road conditions, our car could potentially lose traction and control, leading to an accident with Entity #1 (Magenta box: the car directly ahead). The likelihood of a collision increases if Entity #1 were to make a sudden stop and our car's stopping distance were insufficient due to the high speed and slippery road surface.

GPT-4V:
As our vehicle traveled at speeds exceeding 75 km/h, Entity #1 (magenta box) abruptly changed lanes in front of us without signaling, leading to a rear-end collision. Simultaneously, Entity #2 (cyan box) decelerated unexpectedly to avoid a separate incident, causing our car to swerve and further contributing to the unavoidable accident.

Annotation:
Entity #1 pulls out from parking spot, does not check mirror or factor my approaching speed, due to this, i can't stop in time and hits Entity #1

Our model:
Entity #1 pulls out, does not check mirrors, due to this my car hits Entity #1

GPT-4V:
In the provided dashcam image, a hypothetical accident might involve Entity #1 (the vehicle in the magenta box) suddenly reversing or backing out of the parking space without noticing our approaching car. Given that our car is traveling at 45 km/h, the driver may not have sufficient time to react and avoid a collision with Entity #1.

GPT-4V:
As our car approached the intersection at 45 km/h, Entity #2 (Cyan box) unexpectedly changed lanes without signaling, causing our driver to swerve to avoid a collision. This sudden maneuver led our car into the path of Entity #1 (Magenta box), a cyclist crossing the street, resulting in an accident.

(a)
Annotation: Entity #1 walks on the roadway on my left, does not factor lane is tight and my car is moving forward thus getting clipped by my car

Our model:
Entity #1 is in our way and we would hit Entity #1 at this speed too.

GPT-4V:
As our car proceeded at a speed of 15 km/h along the street, Entity #1, highlighted with the Magenta box, unexpectedly stepped off the curb and into the path of our vehicle, resulting in a collision. Despite the low speed, the proximity of Entity #1 to the car did not allow sufficient time to stop and avoid the accident.

Annotation:
I'm driving forward, Entity #1 goes against Entity #2, and due to this, did not anticipate someone crossing and hitting Entity #1 at my speed.

Our model:
Entity #1 crosses against Entity #2 and does not look for oncoming traffic, due to this, my car won't stop in time and hits Entity #1

GPT-4V:
As our car was traveling at a speed of 45 km/h, Entity #1 (highlighted with a Magenta box), a pedestrian, stepped into the crosswalk without noticing our approaching vehicle. Consequently, our car had to perform an emergency stop to avoid a collision, prompting Entity #2 (highlighted with a Cyan box), a taxi, to swerve unexpectedly and risk an incident with nearby traffic.

GPT-4V:
Traveling at 45 km/h in a narrow street crowded with pedestrians, our car could potentially cause an accident with Entity #1, who appears to be stepping off the curb into the roadway without noticing approaching traffic. If our vehicle does not slow down or Entity #1 does not step back onto the curb, there could be a collision.

Our model:
i'm driving my car in a very high speed, at this time the car is out of control because of the high speed, so i will hit the Entity #1.

GPT-4V:
Given the high speed of 75+ km/h of our vehicle and the situation depicted, a hypothetical accident could occur if Entity #1 (highlighted by the Magenta box), representing a barrier or median, were to be inadvertently approached too closely, leading to a potential side-swipe or loss of control if our car veered off course. If our car drifted toward this barrier while attempting to perform an evasive maneuver to avoid another vehicle or due to driver distraction, the high speed could exacerbate the severity of the resulting accident.

GPT-4V:
While travelling at 45 km/h, our car had to unexpectedly swerve to avoid colliding with Entity #1 (the Magenta box), which appeared to be a stationary police vehicle partially obstructing our lane. This sudden maneuver inadvertently led to our vehicle losing control and veering towards Entity #2 (the Cyan box), resulting in a collision with the roadside object within that highlighted area.

Annotation:
As per Entity #1, we need to stop immediately, but we are in the middle of an intersection, while Entity #2 is stopping for no reason. We rear-end Entity #2 at this speed.

Our model:
Entity #1 is in red color, so Entity #2 put the sudden brake, now i have to stop my car, because of less distance between us even if i applied the brake here my car will stop after hitting the Entity #2.

GPT-4V:
As our car proceeded down the street at 15 km/h, Entity #1 (the vehicle with the Cyan box) abruptly stopped to avoid a pedestrian stepping off the curb, leading to a rear-end collision. Meanwhile, the driver of our car, also attempting to avoid Entity #1, swerved but was unable to prevent a low-speed impact.

(a)
Annotation: Our car would hit Entity #1 as it was entering our lane, while we were way above Entity #2's limit speed.

Our model:
driving forward, following Entity #2, Entity #1 decides to change into my lane, does not check mirrors before doing so, due to speed and following Entity #2, my car hits Entity #1

GPT-4V:
Traveling at 75+ km/h, our car could potentially rear-end Entity #1 (magenta box), which appears to be changing lanes or stopping abruptly, while simultaneously Entity #2 (cyan box) could misjudge the speed of our car, attempting to merge into the same lane and resulting in a side collision.

Annotation:
We did not expect that Entity #1 would stop giving preference to Entity #2 and cross over to our lane. Due to this, we would fail to stop at the speed and hit Entity #1.

Our model:
Entity #1 would stop nearly due to more cars in the front in queue and Entity #2 would cross our way by taking the center lane. We would crash either Entity #1 or Entity #2 at this speed

GPT-4V:
At a speed of over 75 km/h, our car approached Entity #1 (the magenta box), which is a stationary vehicle that unexpectedly opened its door or moved into traffic, potentially causing a side-swipe collision. Concurrently, Entity #2 (the cyan box), which appears to be a large bus merging into our lane, could result in our car being squeezed between Entity #1 and Entity #2, leading to an accident.

(a)
Annotation: Entity #2 directly ahead of me is about to stop due to Entity #1 pulling into traffic. I will rear-end Entity #2.

Our model:
Entity #1 is moving slowly and suddenly Entity #2 applies the brake, now at the given speed of my car I could not stop my car without hitting the Entity #2.

GPT-4V:
While traveling at 15 km/h, our car may fail to maintain a safe distance from Entity #1 (the vehicle highlighted with a magenta box), resulting in a low-speed rear-end collision if Entity #1 were to brake suddenly. If Entity #2 (the pedestrian highlighted with a cyan box) unexpectedly steps off the curb and into the road while our car is passing, the driver may need to perform an emergency stop to avoid a pedestrian accident.

To ensure a diverse range of annotations, we invited over 500 workers worldwide to participate in the qualification test. In order to maintain quality control, we specifically targeted workers with a proven track record of more than 10,000 approved HITs and an approval rate of over 95%. Following the evaluation of their performance on the test, we manually selected 60 workers who met our criteria. These selected workers were then invited to participate in the main annotation pipeline.

C. Task 1: Estimating the Car's Speed
The workers were instructed to estimate the speed of the car based on the dashcam image of a scene. The web interface for the HIT of the task is shown in Fig. 24. A total of 54,358 scene images were used, with each assigned to a single worker. As a result of this task, we obtained annotations for all 54,358 images.

D. Task 2: Predicting Accident Possibility to Filter Images
The second task aims to assess the probability of the car being involved in an accident within a few seconds. In order to make Task 3, annotating hypothesized hazards, efficient, it was necessary to eliminate scene images with a very low likelihood of accidents. To achieve this, we introduced an increased speed that was 1.5 times faster than the annotated speed used in the first task. The intention behind this speed increase was to instill a stronger sense of the potential for an accident among the workers, given that the original speed was determined based on their perception of a safe speed. Figure 25 illustrates the instruction and annotation form provided for this task. Consequently, out of the initial set of 54,358 images, 20,791 images were retained and subsequently used in the following task.

Fig. 19: Examples of the qualifying test

E. Task 3: Hypothesizing and Annotating a Hazard
The final task involves hypothesizing and annotating a hazard for each of the previously filtered images. This task comprises two parts. The first part involves annotating the visual entities associated with the hazard. Workers are instructed to draw a bounding box around each entity and provide a brief description of it. The second part requires providing a natural language explanation of the hazard, using the term 'Entity #n' to refer to the involved entities. Figure 26 shows the instruction and annotation forms for this task. Task 3 is the most time-consuming, accounting for 80% of the total annotation time. Based on our statistics, a worker may spend up to three minutes per image and can complete a maximum of three hundred HITs per day. In order to enhance productivity and reduce inconsistencies in answers, we implemented data input validation and a user interface assistant⁵.

F. Post Process: Data Validation
We checked the results during and after Task 3 ourselves, and removed annotations with obvious errors. Additionally, we invited a small number of the most reliable workers to an additional task of eliminating Task 3 annotations of unsatisfactory quality. The web page design for the HIT is shown in Fig. 27. The workers were presented with annotations for each scene image, including bounding boxes, descriptions of visual entities, and hazard explanations. They were asked a binary (yes/no) question regarding the acceptability of the annotations. If necessary, the workers were also requested to correct minor mistakes such as grammatical errors or incorrect word choices. Following this screening step, we obtained annotations for a total of 14,975 scene images.

⁵ Our data input validation system ensures that submissions meet the following criteria: at least one box must be drawn; each box should have a corresponding entity description, and vice versa; only one bounding box per entity is allowed; when adding a box, a new entity must be used; and the hazard explanation must be at least five words long. If any of these criteria are not met in a submission, a warning prompt is displayed.
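The submission criteria enforced by the data input validation system can be captured in a short Python check. This is only a sketch of the rules, not the actual web UI code; the dictionary format for bounding boxes is an assumption made for illustration:

```python
def validate_submission(boxes, explanation):
    """Return a list of rule violations for a Task 3 submission (sketch)."""
    errors = []
    if len(boxes) < 1:
        errors.append("at least one bounding box must be drawn")
    if any(not b.get("description") for b in boxes):
        errors.append("each box needs a corresponding entity description")
    entities = [b.get("entity") for b in boxes]
    if len(entities) != len(set(entities)):
        errors.append("only one bounding box per entity is allowed")
    if len(explanation.split()) < 5:
        errors.append("hazard explanation must be at least five words long")
    return errors  # an empty list means the submission passes

ok = validate_submission(
    [{"entity": "Entity #1", "description": "taxi ahead"}],
    "Entity #1 brakes suddenly and my car rear-ends it",
)
print(ok)  # → []
```

In the real interface, a violation triggers a warning prompt rather than returning a list, but the rule set is the same.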
In addition, the "Word Assistant Pads" feature was provided to the workers to minimize the need for typing, as shown in Fig. 26(b) and (c). It automatically fills in the text prompt input form when buttons are clicked. This aid also serves as a reminder to workers regarding the expected content of the input form. Additionally, a brief guideline emphasizing the necessary components of the sentence was provided, including an accident-related entity, its relative position, and the resulting accident. Also, in close proximity to the hazard input form, there are reminders for the specific entities ('Entity #1', 'Entity #2', 'Entity #3'), as well as preposition words, to discourage the input of noun words for the entity.

Fig. 2: Illustration of the DHPR dataset with example annotations (left) and hazard explanations retrieved and generated by our model and GPT-4V (right).

Figure 3 shows the architecture of our model. In CLIP, the input text and image do not interact until the computation of the cosine similarity at the final stage. Text and image embeddings are independently calculated by two separate encoders. Aiming for a higher-level integration of both, we introduce two additional Transformer-based encoders. The first is a text-to-image encoder, designed to attend to the image embeddings from CLIP with the text embeddings from the same, extracting more relevant information. Similarly, we use an inverse image-to-text encoder. For the updated text and image embeddings, we compute two image-text matching (ITM) losses, each requiring an ITM head on its respective encoder: one for the image-to-text (I-2-T) encoder and one for the text-to-image (T-2-I) encoder. These losses are combined with the original contrastive loss of CLIP and used for training. The retrieval score continues to use the cosine similarity between the embeddings from CLIP's encoders. We use the pre-trained CLIP ViT-L/14 for the visual encoder and BERT-base for the text encoder. The two added encoders have a simple design, each comprising two standard Transformer layers. For the generation task, the vision encoder trained as above is repurposed and integrated with a pre-trained LLM to construct a visual language model (VLM). We employ LLaMA2 7B. Recent studies predominantly employ a blend of a pre-trained LLM and a vision encoder, specifically aligning the output of the vision encoder with the textual input of the LLM. While there exist variations in this methodology, we adopted the strategy outlined in [8]. Specifically, we employ multi-modality adapters [8] injected into the Transformer layers of the vision encoder and the LLM. We extract three [CLS] embeddings after every eight Transformer layers of the vision encoder and project them into the space of the LLM's input tokens. A notable difference from [8] is that we use the [BOS] embedding as a routing token. We append the
projected tokens with the tokens of a text prompt, as illustrated in Fig. 3. We then train the adapters while freezing the CLIP vision encoder and the LLM for efficient training and to mitigate overfitting.
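The training objective described above — CLIP's contrastive loss plus the two ITM losses — and the cosine-similarity retrieval score can be sketched as follows. This is a NumPy toy using random stand-ins for the encoder outputs and a linear placeholder ITM head, not the actual model:

```python
import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def retrieval_scores(img_emb, txt_emb):
    # Cosine similarity between CLIP image and text embeddings,
    # which remains the retrieval score at inference time.
    return l2_normalize(img_emb) @ l2_normalize(txt_emb).T

def contrastive_loss(scores, temperature=0.07):
    # Symmetric InfoNCE over matched (diagonal) image-text pairs.
    logits = scores / temperature
    n = logits.shape[0]
    log_sm_i2t = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_sm_t2i = logits - np.log(np.exp(logits).sum(axis=0, keepdims=True))
    diag = np.arange(n)
    return -(log_sm_i2t[diag, diag].mean() + log_sm_t2i[diag, diag].mean()) / 2

def itm_loss(fused_emb, labels, head_w, head_b):
    # Binary image-text matching loss from a linear placeholder ITM head
    # applied to the fused (cross-attended) embeddings.
    logits = fused_emb @ head_w + head_b
    probs = 1.0 / (1.0 + np.exp(-logits))
    return -np.mean(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))

rng = np.random.default_rng(0)
B, D = 4, 8                                  # toy batch size and embedding dim
img = rng.normal(size=(B, D))                # stand-in CLIP image embeddings
txt = rng.normal(size=(B, D))                # stand-in CLIP text embeddings

scores = retrieval_scores(img, txt)          # (B, B) similarity matrix
l_con = contrastive_loss(scores)

# Stand-ins for the outputs of the T-2-I and I-2-T cross encoders,
# over B matched and B mismatched pairs.
fused_t2i = rng.normal(size=(2 * B, D))
fused_i2t = rng.normal(size=(2 * B, D))
labels = np.array([1] * B + [0] * B)
w, b = rng.normal(size=D), 0.0
total = l_con + itm_loss(fused_t2i, labels, w, b) + itm_loss(fused_i2t, labels, w, b)
print(float(total) > 0)
```

In the actual model, the fused embeddings come from the two added Transformer encoders and the ITM heads are trained jointly; the sketch only shows how the three loss terms combine.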
Annotation: … turns too tight thus hitting my car
Our model: Entity #1 turns tight to my left side, thus hitting my car due to my car being on Entity #1 blindspot
GPT-4V: While our car was traveling at a speed of 45 km/h, an accident occurred involving Entity #1, highlighted by the Magenta box. Entity #1 unexpectedly entered our car's path, resulting in a collision despite attempts to brake and avoid the incident.

Fig. 4: Examples of hazard explanations generated by our baseline model and GPT-4V.
Mr. Korawat Charoenpitaks is currently pursuing a Ph.D. degree at the Graduate School of Information Sciences, Tohoku University. His research interests are in the fields of computer vision and natural language processing.

Dr. Van-Quang Nguyen received the Ph.D. degree from the Graduate School of Information Sciences, Tohoku University, in 2022. He is currently a postdoctoral researcher at RIKEN AIP, Japan. His research interests are in the intersection of computer vision and natural language processing.

Dr. Masanori Suganuma received the Ph.D. degree from the Graduate School of Environment and Information Sciences, Yokohama National University, in 2017. He is currently an Assistant Professor at Tohoku University. His research interests are in the fields of computer vision and machine learning.

Dr. Masahiro Takahashi received his Ph.D. in science (Physics) from the Graduate School of Natural Science and Technology at Okayama University in 2010. He is currently working as a Project Assistant Manager at DENSO CORPORATION. His current research interests lie in the field of machine learning.

Mr. Ryoma Niihara received the master's degree from Kyushu Institute of Technology in 2009. He is currently a manager at DENSO CORPORATION. His research interests are in the fields of machine learning and autonomous vehicles.

Dr. Takayuki Okatani earned his B.Eng., M.Sc., and Ph.D. degrees in Mathematical Engineering and Information Physics from the Graduate School of Engineering at the University of Tokyo in 1994, 1996, and 1999, respectively. He currently serves as a Professor in the area of computer vision at Tohoku University. In addition, he heads the Infrastructure Management Robotics Team at the RIKEN Center for Advanced Intelligence Project. With over 100 publications in peer-reviewed journals and conference proceedings, his work encompasses computer vision, deep learning, and multi-modal AI. He is an active member of several professional societies, including the IEEE Computer Society, the Information Processing Society of Japan, the Institute of Electronics, Information and Communication Engineers, and the Society of Instrument and Control Engineers.

Exploring the Potential of Multi-Modal AI for Driving Hazard Prediction: Supplementary Material

I. ACCESS TO THE DHPR DATASET
A. URLs

arXiv:2310.04671v4 [cs.CV] 1 Jul 2024

Fig. 2: Example of GPT-4 evaluation of a generated text from GPT-4V with and without the policy of not penalizing extra content.

Fig. 6: Example of image-to-text retrieval by our best-performing model, including the annotated hazard (GT) and its rank, alongside the other top three candidates. Each candidate rank is indicated as Rank n.

(a) GT: My car will destroy Entity #1 while turning left and blocking our path unexpectedly.
Rank 1: Entity #1 makes left turn without signaling, misjudges my speed going forward and we collide at intersection
Rank 2: Entity #1 decides to make a left turn instead but does not signal, due to this my car won't stop in time and hits Entity #1
Rank 3: Making my left turn, I misjudge Entity #1 speed, and due to this, my car Entity #1 collide
(b) GT at Rank 25: Entity #1 would stop showing red lights at the back. Due to speed, we would hit Entity #1
Rank 1: For the given low speed level, we're following Entity #1 at a close range. We would cause our car to hit Entity #1 in the back as it would apply the brakes unexpectedly.

Fig. 7: Examples of challenging cases of image-to-text retrieval by our best-performing model, including the annotated hazard (GT) and its rank, alongside the other top three candidates. Each candidate rank is indicated as Rank n.

Fig. 8: Examples of challenging cases of image-to-text retrieval by our best-performing model, including the annotated hazard (GT) and its rank, alongside the other top three candidates. Each candidate rank is indicated as Rank n.
(a) Annotation: … brakes, due to speed, my car won't stop in time and crash into the back of Entity #1
Our model: Entity #1 brakes, does not factor my speed, due to speed, can't stop in time and rear end Entity #1
(b) Annotation: … brakes, due to my speed and Entity #2, my car won't stop in time and rear ends Entity #1
Our model: Entity #1 brakes, due to speed and Entity #2, my car hits Entity #1

(a) Annotation: … steps back and does not see my car approaching, won't stop in time and hits Entity #1
Our model: Entity #1 decides to cross road, does not ensure road is clear, due to this my car hits Entity #1
(b) Annotation: … passing, I pulled more to the right to let Entity #1 pass, and due to this my car mounts Entity #2
Our model: i pull to the right to avoid Entity #1 but mounts Entity #2

Fig. 20: Instructions for the qualifying test, which also serves as an introduction to the real tasks in the main annotation pipeline.

Fig. 24: (a) Instruction and (b) the annotation form for Task 1, which requests the workers to estimate the car's speed.

Fig. 25: (a) Instruction and (b) annotation form for Task 2, which is to predict the possibility of an accident for an input image.
Fig. 26: (a) Instruction, (b) & (c) annotation forms for Task 3, which is to annotate visual entities involved in a hypothesized hazard and provide an explanation of the hazard.

Fig. 27: Annotation form for the data validation task

TABLE I: Comparison of DHPR with existing datasets. A checkmark (✓) indicates the presence of the feature, and a crossmark (✗) indicates the feature's absence.

TABLE II: Hazard types and their statistics. 'Retrieval' represents the candidates sampled for retrieval tasks.

TABLE IV: Results for the generation task on the DHPR test split, evaluated using BLEU (B4), ROUGE (R), CIDEr (C), SPIDEr (S), and the GPT-4 score. Higher values indicate better performance. For GPT-4V†, we perform a zero-shot evaluation on the test split; see text for details.

TABLE V: Results of ablation tests on the retrieval task (a-c) and the generation task (d-f).

Table II provides a detailed statistical breakdown of the dataset. It first shows the source of each image, either from the ECP or BDD dataset, for the training, validation, and test sets, respectively. The table offers further granularity for each set by showing the speed of our car, the average length of hazard descriptions, hazard type, and the common position words used. It also includes the hazard types, which are already shown in Table 2 (in the main paper).

TABLE I: Types of entities that the self-car is described as hitting in the hazard explanations.

TABLE II: Detailed Dataset Statistics

TABLE III: Ablation results for the retrieval tasks with different input data formats. The proposed model for retrieval tasks is utilized in the experiments.