Surrogate Safety Measures Prediction at Multiple Timescales in V2P Conﬂicts Based on Gated Recurrent Unit

: Improving pedestrian safety at urban intersections requires intelligent systems that should not only understand the actual vehicle–pedestrian (V2P) interaction state but also proactively anticipate the event’s future severity pattern. This paper presents a Gated Recurrent Unit-based system that aims to predict, up to 3 s ahead in time, the severity level of V2P encounters, depending on the current scene representation drawn from on-board radars’ data. A car-driving simulator experiment has been designed to collect sequential mobility features on a cohort of 65 licensed university students who faced different V2P conﬂicts on a planned urban route. To accurately describe the pedestrian safety condition during the encounter process, a combination of surrogate safety indicators, namely TAdv (Time Advantage) and T 2 (Nearness of the Encroachment), are considered for modeling. Due to the nature of these indicators, multiple recurrent neural networks are trained to separately predict T 2 continuous values and TAdv categories. Afterwards, their predictions are exploited to label serious conﬂict interactions. As a comparison, an additional Gated Recurrent Unit (GRU) neural network is developed to directly predict the severity level of inner-city encounters. The latter neural model reaches the best performance on the test set, scoring a recall value of 0.899. Based on selected threshold values, the presented models can be used to label pedestrians near accident events and to enhance existing intelligent driving systems.


Introduction
Modern innovations in car-sensing devices, along with the development of deep learning techniques and recognition algorithms, has allowed engineers and researchers worldwide to implement increasingly reliable Advanced Driver Assistance (ADAS) and Automated Driving (ADS) Systems, which are expected to lead toward an improvement in road users' safety levels (reducing the severity of injuries and/or preventing fatalities on roads), driving efficiency (e.g., vehicle fuel consumption), and, consequently, the sustainability of transportation infrastructures. The capabilities of modern cars to detect surrounding objects, represent traffic situations, and adapt their dynamic state have been enhanced to the point that some car manufacturers have first-ever presented driving automation systems capable of performing the entire dynamic driving task in a sustained manner under specific operating conditions. Nevertheless, deaths of vulnerable road users (VRUs) continue to make for a significant percentage of all road fatalities worldwide [1], and, consequently, more attentive actions are called for attaining the road infrastructure's sustainability in terms of protecting VRUs. Although active safety systems are not considered driving automation as and spatial interactions between pedestrians in a crowded space by combining an LSTM and a Graph Attention Network (GAT) to obtain "socially" plausible trajectories.
Although several literature studies have attempted to evaluate the trajectories and intentions of road users, AEB systems require a Risk Assessment Model (RAM, core element of the perception level) that can objectively and quickly capture the risk level of the encounter process between the ego-vehicle and a VRU. In fact, an RAM that can fully understand the relationships between behavior and risk is essential to adequately judge the AEB system's intervention timing and thus prevent collisions [18]. Risk (or severity) level is intended as the potential of an elementary traffic event to become an accident [19]. Specifically for vehicle-pedestrian (V2P) encounters in inner-city traffic (i.e., the simultaneous arrival of a driver and a pedestrian at the crosswalk or in a specific limited area), such process is a traffic event characterized by a continuous interaction over time and space between the two road users [20] The pedestrian's decision to enter the zebra crossing depends on the perceived speed and distance of the approaching vehicle; concurrently, the driver evaluates whether to grant or deny the priority to the pedestrian, based on the estimated arrival time at the crosswalk. Since the two traffic participants may enter a collision course during the encounter process, such a conflict has the potential to end up in a collision. For example, the latter would occur if the driver's attention levels, his/her ability to control the vehicle, or the vehicle's dynamic state were not adequate for a safe stopping behavior [21].
Many researchers have addressed the issue of risk assessment for pedestrian-vehicle interactions both for current and future encounter states (for the latter, on the basis of predicted trajectories) [7]. However, the application of traffic safety indicators, also referred to as Traffic Conflict Techniques (TCTs), has been very successful as a proactive surrogate approach (i.e., complementary to accident statistical analysis) for traffic event safety assessment, due to its efficiency and short analysis time [22]. There are various continuous or discrete TCTs for V2P conflicts [23]. Nevertheless, Laureshyn et al. [19] identified and developed a set of safety indicators to continuously describe the severity level of the encounter process and, thus, to relate "individual interactions to the general safety situation" of the event. In fact, a single indicator is not sufficient to accurately classify interaction patterns into severity categories (i.e., the RAM's purpose), as it cannot fully reflect the current safety situation [23]. Differently, a combination of at least two indicators should be considered to properly identify pedestrian "near-accident" [24] situations (i.e., the traffic conflicts between safe passages and collisions), which are the most relevant for pedestrian-AEB systems (PAEB), using appropriate threshold values on the selected TCTs. Among the indicators analyzed by Laureshyn et al. [19], Table 1 provides a detailed description of those most relevant to the discussion that follows: Time to Collision (TTC), Nearness of the Encroachment (T 2 ), Post-Encroachment Time (PET), and Time Advantage (TAdv).
Upon selecting safety indicators to classify near-accident events, the study by Kathuria and Vedagiri [25] showed that TTC and PET profiles are of equal importance to effectively categorize pedestrian-vehicle interactions at unsignalized intersections into severity levels when neither pedestrian nor vehicle takes an evasive action. Moreover, Zhang et al. [23] trained a GRU neural network to predict near-accident events at signalized intersections using PET and TTC indicators generated from videos captured by fixed cameras. The latter work is also of particular interest, as it represents the first attempt (although it was not intended for the development of PAEB systems) to implement a model capable of directly predicting the current severity level of V2P interactions, which were described by three categories: "serious conflict", "slight conflict", and "safe". However, as described previously, the encounter process may have crash potential even though the two road users are not on a collision course [19]. Conversely, the TTC calculation requires both users to be on a collision course and, thus, limits the events to be consider in safety analysis. These findings reveal that V2P encounters are extremely complicated and that, to improve the reliability of RAMs, safety indicators capable of describing the whole encounter process as a continuous interplay between vehicle and pedestrian should be considered.
The solution to this problem is offered by the supplementary indicators presented by Laureshyn et al. [19]: namely, the Nearness of the Encroachment (T 2 ) and the Time Advantage (TAdv), which broaden the concepts of TTC and PET, respectively, to situations in which the two road users are not on a collision course (Table 1). In addition, Borsos et al. [26] recently performed a comparison of collision course indicators with indicators that include crossing course interactions at signalized intersections, demonstrating that TTC and T 2 are transferable for crash probability estimation, with stricter threshold values for T 2 . Among all the pedestrian-vehicle front-end contact points in a collision, the one leading to the lowest TTC value should be selected.
The lower the TTC, the higher the risk.

Nearness of the Encroachment (T 2 )
The expected time that it takes for the second road user to arrive at the conflict zone.
A set of values continually calculated over time At the moment of transfer from crossing to collision course, T 2 provides a smooth transition between the two situations and equals the TTC.

Post-Encroachment Time (PET)
The time that would elapse between the passage of the first and the second user through the same conflict zone.

A discrete value
For an encounter, the TAdv has a single value that can be measured directly.

Time Advantage (TAdv)
At any time, it represents the expected PET value if the road users maintained their current trajectory and relative speed.
A set of values continually calculated over time Values above 2-3 s indicate that a user has a temporal advantage over his opponent in a competition over the same spatial zone and is likely to pass first.
This paper presents a GRU-based system that predicts, up to 3 s ahead in time, the severity level of V2P encounters in inner-city traffic (i.e., encounters between a car and a pedestrian on a pedestrian crossing), depending on the current scene representation drawn from on-board radars' data. A car driving simulator experiment has been designed to collect sequential mobility features on a cohort of 65 licensed university students and generate T 2 and TAdv indicators for accurately classifying pedestrian safety conditions during the whole encounter process. Based on selected threshold values, the presented model could be used to label in advance pedestrians near accident events and to enhance existing PAEB systems. So, this might be a relevant contribution to improving the transportation safety.
Compared to the existing literature that has mainly focused on predicting the trajectories (or intentions) of the ego-vehicle and/or other surrounding road users [9], the developed approach differs not only by modeling safety parameters directly related to the V2P interaction severity levels but also by the fact that the multi-step-ahead prediction depends on a low-dimensional representation of the current situation: the calibrated system depends only on six parameters related to driver mobility features and traffic scene properties. Indeed, Ortiz et al. [13] proved that there is no need to employ the state of the actuators as features for predicting future behavior: the authors have actually predicted with simple learning algorithms (i.e., multi-layer perceptron neural networks) the braking behavior of drivers approaching a traffic light with very good accuracy at time scales up to 3 s, using as input features only the ego-vehicle speed, state, and distance to the nearby traffic light. In addition, the approach presented in this paper has some aspects in common with the research of Zhang et al. [23], but the authors predict the instantaneous severity level (based on TTC and PET) of vehicle-pedestrian interactions whose dynamics are captured by fixed cameras placed at signalized intersections, whereas the aim of the current study is a multi-step-ahead prediction on a running vehicle.

Data Collection
To meet the purpose of the study, it would require the collection of mobility data on several vehicle-pedestrian encounters in real-world urban scenarios. In addition, these data must allow for the proper evaluation of interaction safety indicators (especially during an online system application) and be easily acquired by on-board sensors (i.e., cameras, pedal potentiometers, IMU, GNSS, or millimeter-wave radars). In particular, the analysis of the relevant literature [9] allowed us to identify three main groups of features for studying the problem under analysis: the actuators and steering wheel states, the information about the car dynamics, the pedestrian speed, and the direction vector. To collect this information while keeping drivers safe, a cohort of 65 licensed university students was recruited to participate in a driving simulation experiment at the Road Laboratory of the Polytechnic Department of Engineering and Architecture (DPIA) of the University of Udine.
The use of advanced driving simulators for the analysis of driver behaviors and the development of AEB systems is an accepted and widely recognized practice [27], since the driver's performance observed in driving simulation shows the same patterns as real-world driving (relative validity) [28]. Saito et al. [29] designed and validated with a driving simulator a PAEB system that controls vehicle subsystems (brake and accelerator) in potentially dangerous or uncertain situations to decelerate the vehicle and maintain a safe driving speed. Hou et al. [30], studying the braking behaviors of drivers in typical vehicle-to-bicycle conflicts with a driving simulator, proposed a method to improve the timing and braking phases of bicyclist-AEB systems. Bella et al. [27] recently used a fixedbase driving simulator to evaluate the functionality and effectiveness of two types of ADAS that provide the driver with an audible alarm and a visual alarm to detect a pedestrian crossing into and outside the crosswalk.

The Car-Driving Simulator Experiment
The car-driving simulator at the Road Laboratory (product name AutoSim 1000-M) has been already validated and successfully used for studying the drivers' braking behavior affected by cognitive distractions [21]. The simulator cockpit is made with real car parts (e.g., dashboard, steering wheel, pedals, gear lever, handbrake, driver seat, seatbelt) of an Italian city car. These are important components of the equipment to give more realistic sensations to the driver during the simulation experiment, along with the steering force feedback, the engine sound, and the two-degree-of-freedom motion base system that reproduces the vehicle's roll and pitch. Three 43-inch LCD screens allow the road scenario to be visually reproduced, showing a 180 • view.
In the Lab's AutoSim 1000-M driving simulator [21], it was possible to record a dataset for the GRU models' development by observing test participants' behavior toward two planned traffic encounters: a boy/girl entering a crosswalk from the curb. The simulated scenes (Figure 1) were set in a typical urban environment on a course that could take about 15 min to be completed, considering the 50 km/h speed limit. On the urban course, participants experienced many traffic light intersections, tight turns (90 • ), short straight streets, and occasional crossings of pedestrians, some of which occurred outside the crosswalks. The encounters that were used to record data (and compute surrogate safety indicators for each participant in an offline fashion, following the procedure described in Appendix A of the study by Laureshyn et al. [19]) were set on a four-lane road (two lanes in each direction), placing traffic signs and markings that met European standards (Figure 1). participants experienced many traffic light intersections, tight turns (90°), short straight streets, and occasional crossings of pedestrians, some of which occurred outside the crosswalks. The encounters that were used to record data (and compute surrogate safety indicators for each participant in an offline fashion, following the procedure described in Appendix A of the study by Laureshyn et al. [19]) were set on a four-lane road (two lanes in each direction), placing traffic signs and markings that met European standards ( The scenarios were designed as follows: (1) the participant arrives at a red traffic light approximately 200 m from the crosswalk, on which he/she has a clear view; (2) as the participant starts driving, the pedestrian (initially hidden) walks at a 90° angle toward the road and then stops at the edge of the curb; (3) at the moment the vehicle, based on its current speed, is about 3 s from the crosswalk, the pedestrian enters the zebra crossing and maintains a speed of 1.4 m/s. These conditions, consistent with relevant literature [31], require the participant to stop, giving priority to the pedestrian. In this way, the data collected refer to stopping maneuvers that avoided a collision with the pedestrian.

Partecipant Statistics and Experimental Procedure
The students recruited for the experiment were between the ages of 20 and 30, had a valid driver's license, and had driven at least 5000 km in the past year. All participants were properly trained in the use of the simulator actuators before the experimental driving and were able to test their abilities on a simulated suburban course that took approximately 5 min. Before the urban driving simulation, each participant filled in a questionnaire aimed at collecting his/her personal data (e.g., age, driving experience). Conversely, at the end of the simulation test, a second questionnaire about the discomfort perceived while driving was completed by each participant to identify and remove from the cohort those drivers who had experienced excessive annoyance. The simulation test procedure just presented has been drawn from relevant literature that provided for its validation [21,27,31]. The recruited cohort includes 19 females and 41 males, with a mean age of 24.0 (SD 3.43) and 24.1 (SD 1.91) years and a mean non-verbal intelligence quotient (IQ) [32] of 33.4 (SD 2.50) and 33.9 (SD 1.66), respectively. Therefore, the sample of males and females can be considered roughly balanced for age and IQ. However, the original cohort had an additional five participants who were excluded from the analyses, as they could not complete the experiment due to excessive discomfort. The scenarios were designed as follows: (1) the participant arrives at a red traffic light approximately 200 m from the crosswalk, on which he/she has a clear view; (2) as the participant starts driving, the pedestrian (initially hidden) walks at a 90 • angle toward the road and then stops at the edge of the curb; (3) at the moment the vehicle, based on its current speed, is about 3 s from the crosswalk, the pedestrian enters the zebra crossing and maintains a speed of 1.4 m/s. These conditions, consistent with relevant literature [31], require the participant to stop, giving priority to the pedestrian. In this way, the data collected refer to stopping maneuvers that avoided a collision with the pedestrian.

Partecipant Statistics and Experimental Procedure
The students recruited for the experiment were between the ages of 20 and 30, had a valid driver's license, and had driven at least 5000 km in the past year. All participants were properly trained in the use of the simulator actuators before the experimental driving and were able to test their abilities on a simulated suburban course that took approximately 5 min. Before the urban driving simulation, each participant filled in a questionnaire aimed at collecting his/her personal data (e.g., age, driving experience). Conversely, at the end of the simulation test, a second questionnaire about the discomfort perceived while driving was completed by each participant to identify and remove from the cohort those drivers who had experienced excessive annoyance. The simulation test procedure just presented has been drawn from relevant literature that provided for its validation [21,27,31]. The recruited cohort includes 19 females and 41 males, with a mean age of 24.0 (SD 3.43) and 24.1 (SD 1.91) years and a mean non-verbal intelligence quotient (IQ) [32] of 33.4 (SD 2.50) and 33.9 (SD 1.66), respectively. Therefore, the sample of males and females can be considered roughly balanced for age and IQ. However, the original cohort had an additional five participants who were excluded from the analyses, as they could not complete the experiment due to excessive discomfort.
It is worth pointing out that participation in the experiment was voluntary, there was no monetary reward, and all participants gave informed consent after being instructed on the simulated driving test procedure. Conversely, they were not informed about the objectives of the research. Finally, the study has been conducted in accordance with the Declaration of Helsinki, and the protocol was approved by the Local Ethics Committee of Udine's University (Progetto_Guida).

Problem Formulation
Formally, the severity prediction system has to perform a mapping between the situation representation at the current driving time t, i.e., the real-time scene properties and vehicle status described by a set of sensor-observed features, and the expected severity level of the vehicle-pedestrian encounter, which is defined by T 2 and TAdv, for times t + 1s, t + 2s, and t + 3s. In practice, a learning algorithm is trained to accurately predict the surrogate safety indicators at instants t−2s, t−1s, and t using the vehicle sensing system observations available at instant t−3s. Once the expected convergence between predicted outputs and ground-truth targets is reached, the trained model represents the desired medium-term prediction system and can be used to make forward-in-time predictions based on the current scene representation. It is assumed that the input features can all be acquired simultaneously and at a regular time interval given by the lowest sampling frequency among those of the involved sensors, since the goal is to perform real-time prediction on running vehicles. However, considering that the current study is based on a driving simulator experiment, the calculation of surrogate safety indicators for each participant and the training of the learning algorithm were both performed in an offline fashion (please refer to Appendix A of the study by Laureshyn et al. [19]) at the end of the driving experiments.

Low-Dimensional Input Representation
The coordinate system established for the vehicle sensing system is shown in Figure 2.
The following features form the six components of the learning model input vector and describe the V2P interaction process at the current time t for each participant in the experimental data set: driver's behavior primitive A(t), current T 2 (t) value, ego-vehicle's speed v v (t), pedestrian's speed v p (t), and position vector (r 0 (t), φ(t)).

Problem Formulation
Formally, the severity prediction system has to perform a mapping between the situation representation at the current driving time , i.e., the real-time scene properties and vehicle status described by a set of sensor-observed features, and the expected severity level of the vehicle-pedestrian encounter, which is defined by T2 and TAdv, for times + 1s, + 2s, and + 3s. In practice, a learning algorithm is trained to accurately predict the surrogate safety indicators at instants −2s, −1s, and using the vehicle sensing system observations available at instant − 3s. Once the expected convergence between predicted outputs and ground-truth targets is reached, the trained model represents the desired medium-term prediction system and can be used to make forward-in-time predictions based on the current scene representation. It is assumed that the input features can all be acquired simultaneously and at a regular time interval given by the lowest sampling frequency among those of the involved sensors, since the goal is to perform real-time prediction on running vehicles. However, considering that the current study is based on a driving simulator experiment, the calculation of surrogate safety indicators for each participant and the training of the learning algorithm were both performed in an offline fashion (please refer to Appendix A of the study by Laureshyn et al. [19]) at the end of the driving experiments.

Low-Dimensional Input Representation
The coordinate system established for the vehicle sensing system is shown in Figure  2. The following features form the six components of the learning model input vector and describe the V2P interaction process at the current time for each participant in the experimental data set: driver's behavior primitive ( ), current 2 ( ) value, ego-vehicle's speed ( ), pedestrian's speed ( ), and position vector ( 0 ( ), ( )). Behavior primitives are the set of elementary behaviors into which the driver's approaching maneuver (in the longitudinal dimension of the event) can be segmented based only on speed, gas pedal, and brake pedal information [13]. These data, easily acquired Behavior primitives are the set of elementary behaviors into which the driver's approaching maneuver (in the longitudinal dimension of the event) can be segmented based only on speed, gas pedal, and brake pedal information [13]. These data, easily acquired on modern vehicles (e.g., by means of pedal potentiometers, IMU, GNSS), are processed to define a categorical variable whose values, in the case of V2P conflicts, are 0 for "stopped behavior", 1 for "braking behavior", and 2 for "maintaining speed behavior". Based on the study by Ortiz et al. [13], the parts of the stream wherein the vehicle speed v v is less than 2 km/h (about 0.56 m/s) are labeled "stopped". The "braking" behavior begins at the moment the driver releases the gas pedal completely, since in urban traffic, this condition represents the beginning of slowing down in response to an event. All other moments in the stream that do not belong to the two illustrated categories are labeled as "maintaining speed". The categorical variable of behavior primitives represents important information for the GRU model, as it allows the sequence of mobility features to be segmented according to the current driver behavior. Similarly, the current value of T 2 , which can be calculated directly using the other features as shown by Laureshyn et al. [19], is of high importance, since neurosciences indicates that the human brain relies closely on judgments of TTC (and, consequently, of T 2 , since the two indicators are transferable [26]) to perform coordinated action [33].
Position vector components are the azimuth angle φ of the pedestrian with respect to the travel direction and the distance r 0 between the ego-vehicle (point A or C) and the pedestrian (point B, please see Figure 2). Specifically, v p , r 0 , and φ are the pedestrian state data acquired by millimeter-wave radars mounted on the vehicle front end (points A and C, Figure 2). In fact, since the detection of the pedestrian presence in the scene is usually performed using robust frame classification algorithms on vehicle camera videos [34], the status of the detected pedestrian can be easily acquired by radars: in this study, we assumed data acquisition equipment consisting of a long-distance millimeter-wave radar installed at the center of the vehicle front end and a mid-range radar on either side of the vehicle front, with maximum detection distances of 100 m and 50 m and azimuth angles of ±10 • and ±45 • , respectively, to ensure the optimal scene coverage [18]. Compared to video sensors that require object tracking and perspective transformation after object detection to generate the trajectory profiles of VRUs in time-series [23], radars allow the direct and continuous acquisition of pedestrian mobility features [18]. However, the use of cameras is essential for pedestrian detection in the traffic scene; in this regard, for the further online analysis of the proposed system, the use of the state-of-the-art Mask R-CNN (Region-based Convolutional Neural Network) [35] is recommended to ensure high performance of the automated object detection process.
Therefore, time sequences of participants' maneuvers begin with the detection of the pedestrian's status from the long-distance radar, approximately 100 m from the crosswalk, assuming the simultaneous recognition of the pedestrian's presence by the ego-vehicle detection model. Time sequences end when the first road user (the pedestrian) leaves the conflict zone or the second road user (the driver) comes to a complete stop, and none of the TCTs can be calculated (since the collision is no longer possible). Furthermore, although the sampling rate of radars (e.g., the current 77 GHz band millimeter-wave radar) is 20 Hz, we assumed that the encounter process is recorded at 10 Hz to account for limitations of other on-board equipment and raw data processing times.

Safety Indicators and Severity Classes Generations
Regarding the model output vector, we decided to split the severity level prediction problem into single-output learning tasks: T 2 continuous values and TAdv categories were the learning target of two distinct GRU models. Although both safety indicators can be calculated continually over time, TAdv does not provide the same smooth transfer between crossing and collision courses as T 2 : conversely, it quickly goes to zero and holds that value as long as the two road users remain on a collision course (since, by definition, TAdv cannot take on negative values). After the change from collision to crossing courses due to the driver's braking or slowing down, the TAdv value starts gradually growing. Such a behavior induces singularities in the TAdv pattern during the maneuver (i.e., if TAdv is plotted on a graph as a function of time, it will not make a continuous curve), which are difficult to capture with ML regression techniques, unless overfitting the model. To avoid running into such problems, TAdv values less than 1s have been labeled as "collision course" and those greater than 1s have been labeled as "crossing course", since this indicator can be interpreted as the minimal delay of the driver that, if applied, will result in a collision course [19]. Conversely, T 2 is defined by a continuous curve over time and provides information about how soon the encroachment will occur. Thus, the TAdv prediction is treated as a binary classification problem, whereas the T 2 prediction represents a regression problem.
After training these models separately, their predictions were used to classify conflict interactions ahead in time, distinguishing between "safe" and "unsafe" processes. These categories have been defined according to the static threshold values reported in the literature [23,26,36,37] and the conditions under which the simulation experiment was Sustainability 2021, 13, 9681 9 of 18 performed [31]: when the TAdv class is "collision course" and the T 2 value is less than 3 s over the same time horizon, the interaction is defined as "unsafe"; in all other cases, it is considered "safe". Additional intermediate classes have been avoided (e.g., the case wherein TAdv is reporting "collision course" but the T 2 value is greater than 3 s), since the main interest is to detect actual hazard situations or "serious conflict" iterations ahead of time (i.e., before they happen) [23]. By comparing the models' results with the ground-truth targets, it was possible to evaluate the effectiveness of the proposed system in classifying pedestrian's near-accident events. However, an additional GRU neural network has been developed to directly predict the severity level of V2P encounters to verify which of the proposed models would guarantee the best classification performance.
It is worth recalling that the procedure reported in Appendix A of the study by Laureshyn et al. [19] has been applied for the offline calculation of TCTs. In addition, the dimensions of the simulated vehicle were considered in calculations so that, among all the pedestrian-vehicle front-end contact points in a potential collision, the one leading to the lowest T 2 and TAdv values has been selected. Then, the obtained values were time-translated (by 1, 2, and 3 s backward) to compose the ground-truth target matrix.
Finally, before being inputted to the GRU, each feature (and target) was standardized; i.e., it was subtracted by its minimum value and divided by the distance between its minimum and maximum value computed on the training set, to remain in an acceptable range with respect to the activation functions.

Gated Recurrent Unit (GRU)
Due to their extended success in many time-series applications [38][39][40][41][42][43], RNNs were employed to implement the proposed learning system. In particular, the GRU architecture was selected because of its increased robustness with respect to vanilla RNNs and of its expressive power that matches the more sophisticated LSTM networks, but which is achieved with less training effort.
The working mechanism of GRU networks is now described generally [11,12]. Based on the input feature vector input and previous output, each GRU learning neuron performs different operations using so-called "gate" operators. The "update" gate decides the amount of past information to be forwarded to the future, while the "forget" gate focuses on which part of the past information to forget. In more detail, let us suppose x t to be the feature vector available at the t step of the time-series, h t to be the hidden state, and y t to be the output vector. These vectors are used by the GRU in the following operations: (1) which are applied for all the temporal states t happening during the series, i.e., from t = 0 to t = T with T being the series length (at t = 0, h 0 is set to be a zero vector). The before mentioned "update" gate and "forget" gate are implemented by Equations (1) and (2) respectively. W z , U z , W f , U f , W c , and U c are weight vectors that are optimized during the training phase of the GRU, σ(·) is an activation function that is implemented as a sigmoid, while is the element-wise product. For a better comprehension, the flow of operations is visualized in Figure 3.
which are applied for all the temporal states happening during the series, i.e., from = 0 to = with being the series length (at = 0, ℎ 0 is set to be a zero vector). The before mentioned "update" gate and "forget" gate are implemented by Equations (1) and (2) respectively. , , , , , and are weight vectors that are optimized during the training phase of the GRU, (•) is an activation function that is implemented as a sigmoid, while ⊙ is the element-wise product. For a better comprehension, the flow of operations is visualized in Figure 3. These particular operations allow a network to produce more meaningful gradients than vanilla RNNs during the learning phase, ultimately making GRUs acquire enhanced long-term relations between features. To additionally improve the quality and abstractness of such feature relations, the learning model can be organized as stacked layers of GRUs.
In the setting of this work, a GRU composed of two layers each with neuron cells has been used. The first GRU layer receives in input the feature vector = [ ( ), 2 ( ), ( ), ( ), 0 ( ), ( )], which is the sensory information (described in Section 2.4) available at the temporal step of the time-series representing the driver's maneuver. At the same temporal step, the second GRU layer gets the feature representation outputted by the first GRU layer. After those, a final dense output layer with as many neurons as the outputs was applied to predict the target values. Overall, the role of the GRU layers is to abstract a meaningful representation that summarizes the driver's maneuver up to time-step . In turn, such higher-level features are exploited by the output layer that produces the predicted future T2 and TAdv states.

Implementation Details
In this section, the details of the implemented procedure to train the proposed GRU model are given.
A grid search strategy was employed to determine the most important architectural and training hyperparameters such as the number of neurons of the GRU, the learning rate values, and the learning rate drop factor [44]. The first was tried across the values 64, 128, and 256, the second among 0.05, 0.01, and 0.005, and the third one considering the values 0.40, 0.60, and 0.80 [23]. For each iteration of the grid search, a k-fold cross-validation procedure with five folds was implemented to get the best configuration of the model across different distributions of the employed dataset. In each fold, the full dataset was split subject-wise in training and test sets using a 4:1 ratio, resulting in 68 subjects for training and 17 for testing. For the T2 task, the model has been trained to minimize the  These particular operations allow a network to produce more meaningful gradients than vanilla RNNs during the learning phase, ultimately making GRUs acquire enhanced long-term relations between features. To additionally improve the quality and abstractness of such feature relations, the learning model can be organized as stacked layers of GRUs.
In the setting of this work, a GRU composed of two layers each with N neuron cells has been used. The first GRU layer receives in input the feature vector , which is the sensory information (described in Section 2.4) available at the temporal step t of the time-series representing the driver's maneuver. At the same temporal step, the second GRU layer gets the feature representation outputted by the first GRU layer. After those, a final dense output layer with as many neurons as the outputs was applied to predict the target values. Overall, the role of the GRU layers is to abstract a meaningful representation that summarizes the driver's maneuver up to time-step t. In turn, such higher-level features are exploited by the output layer that produces the predicted future T 2 and TAdv states.

Implementation Details
In this section, the details of the implemented procedure to train the proposed GRU model are given.
A grid search strategy was employed to determine the most important architectural and training hyperparameters such as the number of neurons N of the GRU, the learning rate values, and the learning rate drop factor [44]. The first was tried across the values 64, 128, and 256, the second among 0.05, 0.01, and 0.005, and the third one considering the values 0.40, 0.60, and 0.80 [23]. For each iteration of the grid search, a k-fold cross-validation procedure with five folds was implemented to get the best configuration of the model across different distributions of the employed dataset. In each fold, the full dataset was split subject-wise in training and test sets using a 4:1 ratio, resulting in 68 subjects for training and 17 for testing. For the T 2 task, the model has been trained to minimize the RMSE between its predictions and the ground-truth T 2 targets. For the TAdv task instead, the GRU model was optimized by the minimization of the Binary Cross Entropy loss computed between the model's predictions and TAdv ground-truth categorizations. For all the experiments, the learning procedure has been conducted for 1000 epochs using the Adam optimizer [45]. A weight decay with a factor of 0.0005 was added as a regularization term. The initial learning rate was decayed by the considered values every 50 epochs. A Synthetic Minority Over-Sampling Technique (SMOTE) [46] has been implemented to improve the performance of the model with respect to the test data distribution. This procedure generates new samples of the minority class by interpolating their features. For this work, SMOTE allowed achieving a 1:1 ratio between majority and minority classes that initially was 3:1. Finally, after each epoch, the model was executed on the validation set to assess its generalization capabilities to new maneuvers. The model obtaining the lowest loss function score on such tests, hence the most general one, was retained as the final learned model.

Results
The aim of this study was to develop a GRU model that could predict, up to 3 s ahead in time, the level of severity of vehicle-pedestrian encounters in inner-city traffic. In this regard, two equivalent approaches have been identified [9], i.e., (1) using multiple GRUs to separately model the supplementary safety indicators (T 2 and TAdv) that allow the interaction severity to be estimated, or (2) using a single Recurrent Neural Network as a sequence classifier to directly label near-accident events, based on relevant mobility features of V2P encounters. It is worth pointing out that although the two approaches considered are equivalent from the standpoint of the final outcome (i.e., the prediction of severity levels ahead in time), Approach (1) would allow more flexibility in labeling the pedestrian's near-accidental events, since TCTs' threshold values that are different from those considered in this study could be selected to classify the risk level of V2P interactions (e.g., using specific threshold values for geographic contexts of system operation, which are derived from studies of local/national driving behaviors) [23]. In what follows, we refer by the acronyms GRU T2 and GRU TADV to the GRU models predicting T 2 and TAdv, respectively. Differently, to make the comparison between the two approaches presented previously, the severity classification model resulting from Approach (1) is presented as M-GRU SL , whereas that of Approach (2) is presented as S-GRU SL .
In this section, the generalization capabilities of the trained models are evaluated for each time horizon (i.e., 1 s, 2 s, 3 s ahead) based on commonly used evaluation metrics, which are different for classification (i.e., Accuracy, Precision, Recall, Specificity, False Alarm Rate (FAR), and Area Under the Curve (AUC)) and regression problems (i.e., Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), Root Mean Squared Error (RMSE)), and averaging their scores over the five test folds. These metrics are briefly presented throughout the discussion to make the models' evaluation clearer to the reader; however, more detailed descriptions can be found in [9,23,47]. Moreover, to prevent predictions from reacting with a delay especially for longer time horizons [8], an additional metric to evaluate T 2 's time-series regression, which is usually applied in multi-step-ahead prediction problems [47], has been considered, namely the Modified Index of Agreement (md) [48]. In fact, md is able to concurrently consider differences in observed and predicted means and variances, providing a better evaluation of model predictions than traditional metrics.
After comparing the models' performance under different hyperparameter combinations, the most appropriate values for each GRU model have been selected: the initial learning rate is set to 0.001 for the GRU TADV model and 0.005 for both GRU T2 and S-GRU SL models, whereas the learning rate drop factor is 0.8 for all considered models. The unit number N within the GRU memory is 64 for GRU T2 , 256 for GRU TADV , and 128 for S-GRU SL . Tables 2 and 3 present the evaluation metrics scores, validated on the five test folds, distinguishing by model and prediction time horizon (standard deviations on the five folds are shown in brackets). In addition, Figure 4 presents the empirical cumulative distribution probability (ECDP) of the absolute prediction error over each fold for the GRU T2 model. Conversely, Figures 5 and 6 show the Receiver Operating Characteristic curve (ROC) for the GRU TADV and S-GRU SL models, respectively. This curve, which plots recall as a function of FAR, is a comprehensive metric to evaluate classification models' performance [23], since the closer the area under the ROC curve (AUC) is to 1, the better the prediction quality. Finally, Table 4 summarizes the performance of the severity classification models, M-GRU SL and S-GRU SL , averaging the test results over folds and time horizons. As expected, the quality of predictions increases if the time horizon decreases, no matter which model is considered. This result suggests that there is a strong correlation between the selected mobility features at time t and the desired target at time t+1. In contrast, this correlation becomes weaker for longer time horizons, and as a result, distinguishing the expected target becomes more difficult. For example, considering the GRU T2 model (Table 2), all evaluation metrics get slightly worse as the time-step gets longer both in training and testing stages. However, focusing on the test RMSE (column 5, Table 2), i.e., the square root of the mean squared difference between predicted and observed values, its worsening is characterized by a deviation close to one-tenth of a second: the maximum deviation, equal to 0.138 s, is measured moving from the 1s (RMSE = 0.327 s) to the 2s (RMSE = 0.465 s) prediction time-frame, whereas the prediction quality worsens by 0.122 moving from 2 s to 3 s (RMSE = 0.587 s). Therefore, the results obtained are on the whole satisfactory, as also shown by the other parameters: the test MAE (column 3, Table 2), i.e., the mean absolute error, was on each time scale less than half a second; the 90th percentile of the absolute error (Figure 4), i.e., the value of the prediction error (in absolute value) which is exceeded in no more than 10% of time sequences, is equal to 0.472 s (averaged over the five folds) at the 1 s, 0.696 s at the 2 s, and 0.967 s at the 3 s prediction time-frame; the md parameter is always greater than 0.850 whatever the prediction time horizon, proving the close agreement between the observed and forecast curves. 0.122 moving from 2 s to 3 s (RMSE = 0.587 s). Therefore, the results obtained are on the whole satisfactory, as also shown by the other parameters: the test MAE (column 3, Table  2), i.e., the mean absolute error, was on each time scale less than half a second; the 90th percentile of the absolute error (Figure 4), i.e., the value of the prediction error (in absolute value) which is exceeded in no more than 10% of time sequences, is equal to 0.472 s (averaged over the five folds) at the 1 s, 0.696 s at the 2 s, and 0.967 s at the 3 s prediction timeframe; the parameter is always greater than 0.850 whatever the prediction time horizon, proving the close agreement between the observed and forecast curves.  whole satisfactory, as also shown by the other parameters: the test MAE (column 3, Table  2), i.e., the mean absolute error, was on each time scale less than half a second; the 90th percentile of the absolute error (Figure 4), i.e., the value of the prediction error (in absolute value) which is exceeded in no more than 10% of time sequences, is equal to 0.472 s (averaged over the five folds) at the 1 s, 0.696 s at the 2 s, and 0.967 s at the 3 s prediction timeframe; the parameter is always greater than 0.850 whatever the prediction time horizon, proving the close agreement between the observed and forecast curves.  Regarding the prediction quality of classification models, it is first necessary to remind the reader that in a binary classification problem, samples are labeled as positive and negative so that evaluation metrics can be computed through the confusion matrix. The latter is a representation of the model's classification accuracy, since it makes clear whether the system is mislabeling one class with another: each row of the confusion matrix represents the instances in an actual class, whereas each column represents those in a predicted class. Thus, the number of "true positives" (correctly classified positives), "true negatives" (correctly classified negatives), "false positives" (actual negatives classified incorrectly), and "false negatives" (actual positives classified incorrectly) can be defined. For a better understanding, a schematic representation of the confusion matrix is shown in Figure 7. Regarding the prediction quality of classification models, it is first necessary to remind the reader that in a binary classification problem, samples are labeled as positive and negative so that evaluation metrics can be computed through the confusion matrix. The latter is a representation of the model's classification accuracy, since it makes clear whether the system is mislabeling one class with another: each row of the confusion matrix represents the instances in an actual class, whereas each column represents those in a predicted class. Thus, the number of "true positives" (correctly classified positives), "true negatives" (correctly classified negatives), "false positives" (actual negatives classified incorrectly), and "false negatives" (actual positives classified incorrectly) can be defined. For a better understanding, a schematic representation of the confusion matrix is shown in Figure 7.
mind the reader that in a binary classification problem, samples are labeled as positiv and negative so that evaluation metrics can be computed through the confusion matrix The latter is a representation of the model's classification accuracy, since it makes clea whether the system is mislabeling one class with another: each row of the confusion ma trix represents the instances in an actual class, whereas each column represents those in predicted class. Thus, the number of "true positives" (correctly classified positives), "tru negatives" (correctly classified negatives), "false positives" (actual negatives classified in correctly), and "false negatives" (actual positives classified incorrectly) can be defined For a better understanding, a schematic representation of the confusion matrix is shown in Figure 7. To properly evaluate the TAdv classification model, samples in the "collision course class have been labeled as positives. Thus, the recall (column 5, Table 2) of the GRUTAD model, representing the proportion of actual positive samples classified correctly, prove that the system is able to predict very accurately (values are greater than 0.870) whethe the V2P interaction is moving or not moving on a collision course, whatever the prediction time-frame (despite the slight worsening discussed previously). Similarly, the AUC val ues (last column of Table 2), greater than 0.995 at each time horizon, confirm the high accuracy of the binary classifier. Differently, Figure 5 shows the effect of data distribution variability over the folds by ROC curves: as the time window gets longer, differences be tween folds are more evident, i.e., data distribution becomes sparser, and the model lose To properly evaluate the TAdv classification model, samples in the "collision course" class have been labeled as positives. Thus, the recall (column 5, Table 2) of the GRU TADV model, representing the proportion of actual positive samples classified correctly, proves that the system is able to predict very accurately (values are greater than 0.870) whether the V2P interaction is moving or not moving on a collision course, whatever the prediction time-frame (despite the slight worsening discussed previously). Similarly, the AUC values (last column of Table 2), greater than 0.995 at each time horizon, confirm the high accuracy of the binary classifier. Differently, Figure 5 shows the effect of data distribution variability over the folds by ROC curves: as the time window gets longer, differences between folds are more evident, i.e., data distribution becomes sparser, and the model loses slightly in generalization capability. However, the performance is still more than acceptable.
For the S-GRU SL model and the combined M-GRU SL model, samples within the "unsafe" severity class have been labeled as positives. In the following, since the AUC calculation for a model that results from the combination of two GRU subsystems cannot be performed, the comparison between Approaches (1) and (2) is mainly based on accuracy and recall metrics (columns 3 and 5 of Table 3). Analysis of results on the test set shows an improved ability of the S-GRU SL model to label serious conflict interactions in nearaccident events, especially over longer prediction timeframes. The accuracy of the S-GRU SL model, i.e., the proportion of samples (positive and negative) classified correctly among all samples, is always higher than 0.970 on each timescale in contrast to the M-GRU SL model, which only on the shortest time frame (1 s) reaches the value of 0.971. The situation gets even worse for the M-GRU SL model to the advantage of the S-GRU SL model in terms of recall (columns 5, Table 3). In fact, the percentage gain in recall score (+15.40%, fourth column of Table 4) by moving from the M-GRU SL to S-GRU SL model is noteworthy.
These results show that the combined model, due to the nonlinear error propagation from the single models (GRU T2 and GRU TADV ), fails to generalize satisfactorily (as also evidenced by the standard deviations over folds) and performs significantly lower than S-GRU SL . This finding is not trivial since, in many multi-step-ahead prediction problems, complex models (e.g., multiple GRUs) have performed better than equivalent simple models, such as a single GRU [9].
In conclusion, despite the good results obtained in individual modeling of the T 2 and TAdv safety parameters, the S-GRU SL model achieved the best performance, with an accuracy of 0.980, recall of 0.899 and AUC of 0.996 (averaged over the time windows and folds, Table 4) in predicting near-accidental events of V2P encounters.

Conclusions
In this study, it has been shown that a simple Deep Learning system, based on GRU cells, can predict ahead in time changes in the severity level of V2P encounters in inner-city traffic. The proposed system uses a gradient-descent optimizer to learn how to label V2P interaction severity by exploiting individual driving features and traffic scene properties. Such an approach differs from previous studies in the two TCTs considered to classify the pedestrian's near-accident events, namely T 2 and TAdv, which can be computed either when the road users are on a crossing or a collision course. The trained model, which directly predicts the severity level class (i.e., "safe" or "unsafe") up to 3 s ahead in time, provides satisfactory and promising results (accuracy = 0.980, recall = 0.899, and AUC = 0.996) for enhancing current PAEB systems. In fact, the proposed system could be applied to warn drivers of the anomalous or hazardous interaction with a pedestrian, to anticipate a braking maneuver, as well as to enhance vehicle deceleration, with the aim of pursuing an improvement in transportation safety, with regard to pedestrians, within the inner-city traffic.

Future Research
Future research perspectives opened by the current study include (1) comparing the presented multi-step-ahead prediction system to a physical trajectory prediction system; (2) generalizing the proposed GRU models to different and more complex encounter scenarios; (3) modifying the system for online learning and operation; and (4) validating the prediction system with data acquired during real vehicle-pedestrian encounters.
Indeed, a more thorough baseline for comparing the presented approach with those most widely used in the literature should first be established. Further research efforts are also needed on a wider cohort of drivers, as well as on different drivers' groups (such as young inexperienced drivers, experienced and expert drivers, older drivers, etc.), in different urban environments and driving situations to consider as many behavioral components (such as aggressiveness and anxiety) and interaction patterns as possible. Well before the online system validation (i.e., the validation of the calibrated system on data acquired in the real-world scenarios), it will be necessary to integrate the risk assessment model (e.g., the S-GRU SL ) with an automated object detection system (e.g., the Mask R-CNN) and test their coordinated real-time operation (along with the car sensing system) in collecting reliable mobility data during driving simulations in both simple and complex urban scenarios. At this stage, there may be a need to introduce specific learning subsystems for the evaluation of the more complex scenarios (e.g., the ego-vehicle's concurrent interaction with surrounding vehicles and pedestrians), along with a method to merge the predictions given by each subsystem. The further transition to a real vehicle application will require an additional functional requirement: although the risk assessment model (i.e., the presented prediction system) is essential for the good functional safety of an PAEB system, the key to achieving active pedestrian collision avoidance is to control the vehicle dynamics. For this purpose, it will be mandatory to design the upper-and lower-layer controllers which, after receiving the control signal (i.e., the likely occurrence of a collision) from the RAM, are in charge of outputting the deceleration value required for safe stopping and controlling the vehicle subsystems (i.e., throttle opening and brake line pressure regulation) to realize the control of the actual vehicle deceleration, respectively. The control strategy proposed by Yang et al. [18], based on fuzzy neural network and PID (Proportional Integral Derivative controller) theory, could represent a handy reference for the implementation of control modules. Thereafter, using a properly equipped vehicle, the developed automatic emergency braking pedestrian collision avoidance system will have to be online validated in vehicle-pedestrian test scenarios established by the Euro-NCAP standards [49].
In conclusion, this research represents a valuable contribution to the study, understanding, and knowledge of the interaction processes between road users in an urban environment and, consequently, to improving the sustainability of transportation infrastructures.

Data Availability Statement:
The data that support the findings of this study are available from the corresponding authors upon reasonable request.