A Dynamic Bayesian Network-Based Real-Time Crash Prediction Model for Urban Elevated Expressway

Traffic crashes are a complex phenomenon involving coupling interdependency among multiple influencing factors. Accounting for this interdependency is critical for predicting crash risk accurately and also helps reveal the underlying mechanism of crash occurrence. The present study therefore builds a Real-Time Crash Prediction Model (RTCPM) for urban elevated expressways that accounts for the dynamicity and coupling interdependency among traffic flow characteristics before crash occurrence, and identifies the most probable risk propagation path and the most significant contributors to crash risk. A Dynamic Bayesian Network (DBN) was the framework of the RTCPM. The Random Forest (RF) method was employed to identify the most important variables, which were used to build the DBN-based RTCPMs. The PC algorithm combined with expert experience was further applied to investigate the coupling interdependency among traffic flow characteristics in the DBN model. A comparative analysis was conducted among the improved DBN-based RTCPM considering the interdependency, the original DBN-based RTCPM without the interdependency, and a Multilayer Perceptron (MLP). In addition, sensitivity and strength of influences analyses were used to identify the most probable risk propagation path and the most significant contributors to crash risk. The results showed that the improved DBN-based RTCPM had better prediction performance than the original DBN-based RTCPM and the MLP-based RTCPM. The most probable risk influencing path was identified as follows: speed on current segment (V) (time slice 2)⟶V (time slice 1)⟶speed on upstream segment (U_V) (time slice 1)⟶Traffic Performance Index (TPI) (time slice 1)⟶crash risk on current segment. The most sensitive contributor to crash risk in this path was V (time slice 2), followed by TPI (time slice 1), V (time slice 1), and U_V (time slice 1). These results indicate that the improved DBN-based RTCPM has the potential to predict crashes in real time for urban elevated expressways. It also helps reveal the underlying mechanism of crashes and supports the formulation of real-time risk control measures.


Introduction
Predicting road crashes in real time has been a hotspot in road safety research under active traffic management (ATM) over the past two decades. Real-time crash prediction rests on the assumption that the probability of a crash occurring on a specific road segment can be predicted within a very short precrash time interval from instantaneous traffic flow characteristics [1][2][3]. The development of Intelligent Transportation Systems (ITS) and advanced transportation information systems (ATIS) makes it easy to collect traffic data in real time, promoting effective and accurate assessment of crash risk on highways and expressways by means of RTCPMs [4][5][6][7][8][9][10][11].
In general, numerous RTCPM studies establish a direct connection between traffic flow data (i.e., volume, speed, occupancy, and their combinations) and crash data. In these models, collinearity and correlation among the variables are avoided; thus, the independence of variables is guaranteed [12,13]. However, a road crash is a complex phenomenon involving coupling interdependency among multiple influencing factors. The concept of coupling interdependency can be used to express the interaction between various risks. Although these complex system factors can exhibit many characteristics on their own, in reality they interact and couple with each other in even more complex ways in terms of coupling direction and coupling strength [14][15][16]. This interaction is called coupling interdependency, and it can increase or decrease the risk of an accident [17]. Therefore, it is essential for RTCPMs to account for the interdependency among influencing factors. Additionally, a single time interval of traffic data is frequently adopted for real-time crash prediction in a number of RTCPMs [6,18]. However, on urban elevated expressways, merging and lane-changing behaviours are frequent because of the dense ramp setting. Traffic flow characteristics are therefore prone to dynamicity, varying over time in a way that is closely associated with crash risk [19]. Consequently, the dynamicity of traffic flow in the temporal dimension should be considered when implementing an RTCPM for predicting crashes on urban elevated expressways. A DBN, a particular form of Bayesian Network (BN), represents the dynamic evolution of a state space model through time [20]. It has been widely used to predict and assess the dynamically evolving process of risk in fields such as maritime accidents, tunnel construction, and ship-ice collision [21][22][23].
To express the dynamicity of traffic flow characteristics, some RTCPM studies apply time-series traffic data consisting of several time intervals [24][25][26][27], an approach that has proved feasible and robust. However, these studies do not investigate the interdependency among different traffic flow characteristics and simply connect each influencing factor directly to crash risk when constructing the graph structure. Structure learning, the most critical step in DBN construction, allows the interdependency among variables to be assessed within the DBN model. Neural network-based models (e.g., MLP) are also able to accommodate correlated variables. However, the whole model must be rebuilt and recalibrated whenever new variables or knowledge from new data are introduced, and the tuning process can be highly resource-demanding [13].
Furthermore, considering the interdependency among influencing factors also helps to reveal the underlying mechanism of crash occurrence. The present study estimates crash risk by quantifying the probability of crash occurrence. We hope this model can provide real-time countermeasures to mitigate risk when the probability of a crash is high. The formulation of countermeasures should be based on the identification of the risk propagation path and the significant risk contributors. However, there is a dilemma between predictive and explanatory models: models specialized in prediction are not the best at knowledge discovery, and vice versa [7,28]. The DBN model has the advantage of supporting uncertainty analysis and probabilistic reasoning, and of conducting bidirectional inference for both prediction and diagnostic analyses. Combining it with sensitivity and strength of influences analyses makes it possible not only to identify the most probable risk propagation path but also to recognise the most sensitive contributors along that path [29]. Once the most sensitive risk contributors in the whole propagation path are revealed, they indicate the sequence and emphasis of mitigation, which helps to formulate appropriate real-time countermeasures that cut off the risk propagation path and decrease the probability of crash occurrence. Existing DBN-based RTCPMs mainly emphasize the dynamicity of traffic flow characteristics and lack investigation of the coupling interdependency among them. The main contributions of this study are (1) to apply the DBN structure learning algorithm in an example to predict road crashes; (2) to compare the performance of two DBN-based RTCPMs (with and without the interdependency) and the MLP-based RTCPM; and (3) to identify the most sensitive risk contributors in the propagation path by means of sensitivity and strength of influences analyses.
The remainder of the manuscript is organized as follows. Section 2 presents the materials and methods. Section 3 presents the results and discussion. Section 4 provides concluding remarks.

Study Area and Data Preparation.
The study area consists of 40 segments of the Yan'an elevated expressway in Shanghai, China, linked sequentially along the westbound and eastbound carriageways (see Figure 1). All segments have three lanes, with detectors spaced approximately 300-500 m apart. The segments have similar road geometry and on/off-ramp arrangements; thus, road geometry and ramp location were not considered as influencing variables for crash risk. A total of 82 crashes occurred on the Yan'an expressway during August and September 2018. The dates, times, and segment IDs of the crashes were collected. Following a matched case-control design, three noncrash cases were randomly matched to each crash case for the same segment and time of occurrence (246 noncrash cases in total). Traffic flow characteristics and weather variables were also obtained as inputs of the RTCPM, which aims to classify crash and noncrash states based on the relationship between crash risk, traffic flow characteristics, and weather conditions. The existing dual-loop detectors in the study area provide the average speed (km/h) and the average volume of a single lane (pcu/h) for each segment. Hourly weather variables, including visibility (km) and weather type (rainy or sunny), were collected from the Shanghai Xujiahui Observatory, which is 7.5 km from the Yan'an expressway. In this study, the Traffic Performance Index (TPI), which varies between 0 and 1, was applied as an indicator of the degree of congestion, where 1 is a traffic jam state and 0 is a free-flow state:

$$\mathrm{TPI}_i = \frac{V_{\max} - V_i}{V_{\max}}, \quad (1)$$

where $V_{\max}$ is the maximum speed and $V_i$ is the average speed in the $i$th time period. The average speed data on the current, upstream, and downstream segments of the crash location and the TPI of the whole expressway were aggregated in 5-minute intervals.
The evolution of traffic flow leading up to a crash is a dynamic process; thus, the traffic flow characteristics of several time intervals before the crash should be combined to build the model. The intervals of 0-5 min (time slice 0), 5-10 min (time slice 1), and 10-15 min (time slice 2) prior to the crash were considered. Time slice 0 was excluded because a crash warning system needs some time to recognise crash states, and because the actual crash time and the recorded time are not always consistent. Because the raw weather data were updated only once an hour, the weather condition was regarded as a stable influencing variable across time slices. Finally, the traffic flow and weather data corresponding to the 82 crash cases and 246 noncrash cases were generated. In total, nine variables combining traffic flow characteristics on the current, upstream, and downstream segments of the crash location with weather conditions are shown in Table 1.
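The data preparation described above can be illustrated with a minimal Python sketch. It assumes that the TPI takes the relative speed-drop form $(V_{\max} - V_i)/V_{\max}$, which is consistent with the stated range of 0 (free flow) to 1 (traffic jam). The per-minute speed readings, the function names, and the 80 km/h maximum speed are hypothetical and serve only as illustration.

```python
from statistics import mean

def tpi(avg_speed, max_speed):
    """Traffic Performance Index in [0, 1]; assumed form (V_max - V_i) / V_max."""
    if max_speed <= 0:
        raise ValueError("max_speed must be positive")
    # Clamp so that speeds above V_max or below zero stay inside [0, 1].
    return min(1.0, max(0.0, (max_speed - avg_speed) / max_speed))

def time_slice_means(speeds_by_minute, slice_len=5, n_slices=3):
    """Average speeds into time slices counted backwards from the crash.

    speeds_by_minute[0] is the minute immediately before the crash, so
    slice 0 covers 0-5 min, slice 1 covers 5-10 min, slice 2 covers 10-15 min.
    """
    return {
        s: mean(speeds_by_minute[s * slice_len:(s + 1) * slice_len])
        for s in range(n_slices)
    }

# Hypothetical speed readings (km/h), most recent minute first.
speeds = [30, 32, 31, 29, 28, 45, 47, 46, 44, 48, 60, 62, 61, 59, 58]
slices = time_slice_means(speeds)

# Time slice 0 is excluded from the model inputs, as in the text.
model_speeds = {s: v for s, v in slices.items() if s != 0}
model_tpi = {s: tpi(v, max_speed=80) for s, v in model_speeds.items()}
```

Only time slices 1 and 2 reach the model, which mirrors the exclusion of the 0-5 min interval.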

Random Forest (RF).
The main purpose of constructing an RTCPM is to evaluate crash risk in real time. A high-dimensional variable space can increase the processing complexity of the RTCPM. Thus, Random Forest (RF), a widely used variable selection model, was implemented in this study to select influencing variables and reduce variable redundancy. The variable importance (VI) metric was used as the criterion to pick the most closely related variables [12,30]; it is determined with the following steps.
(1) Draw N samples from the learning set by bootstrap sampling to build a tree classifier; the remaining samples of the learning set are not used to grow the tree. These left-out samples, which form an effective internal test data set, are called out-of-bag (OOB) data and are used to obtain an unbiased error estimate. At each tree node, m variables are randomly selected from the original variable set of size M (m < M), and the best split variable among the m is used to split the node. Each tree grows naturally without pruning. Repeat this step k times to construct an RF consisting of k trees.
(2) Each tree classifier produces a classification result by voting for the binary target (crash or noncrash) on its OOB samples, and the classification error rate $R_i$ is calculated.
(3) Add a random noise disturbance to the values of a given variable in the OOB sample to produce a new OOB sample. Each tree is then tested for crash/noncrash classification on the new OOB sample to calculate the classification error rate $R_i'$.
(4) VI is calculated as the mean increase in the classification error rate of the trees after adding the random noise disturbance, as shown in the following equation:

$$VI = \frac{1}{k}\sum_{i=1}^{k}\left(R_i' - R_i\right). \quad (2)$$
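Steps (2)-(4) can be sketched in a few lines of Python. This is a minimal illustration rather than the study's implementation: it assumes that the "random noise disturbance" is realized by shuffling the values of one variable within each OOB sample (the usual permutation approach), and it treats each tree as an arbitrary callable. The names `error_rate` and `variable_importance` are hypothetical.

```python
import random

def error_rate(tree, rows, labels):
    """Fraction of OOB rows that the tree misclassifies."""
    wrong = sum(1 for row, label in zip(rows, labels) if tree(row) != label)
    return wrong / len(rows)

def variable_importance(trees, oob_sets, var_index, seed=0):
    """Mean increase in OOB error after disturbing one variable.

    trees     : list of callables mapping a row (tuple) to a class label
    oob_sets  : list of (rows, labels) pairs, one OOB sample per tree
    var_index : position of the variable whose values are shuffled
    """
    rng = random.Random(seed)
    increases = []
    for tree, (rows, labels) in zip(trees, oob_sets):
        base = error_rate(tree, rows, labels)              # R_i
        column = [row[var_index] for row in rows]
        rng.shuffle(column)                                # assumed noise: permutation
        noisy = [row[:var_index] + (value,) + row[var_index + 1:]
                 for row, value in zip(rows, column)]
        increases.append(error_rate(tree, noisy, labels) - base)  # R_i' - R_i
    return sum(increases) / len(increases)

# Hypothetical stand-in for a fitted tree: it only looks at variable 0.
toy_tree = lambda row: int(row[0] > 0.5)
rows = [(0.1, 5.0), (0.9, 1.0), (0.2, 3.0), (0.8, 2.0)]
labels = [0, 1, 0, 1]
oob = [(rows, labels), (rows, labels)]
vi_used = variable_importance([toy_tree, toy_tree], oob, var_index=0)
vi_unused = variable_importance([toy_tree, toy_tree], oob, var_index=1)
```

A variable that the trees never use receives an importance of exactly zero, because shuffling it cannot change any prediction.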

Dynamic Bayesian Network (DBN).
The Bayesian Network (BN) is a probabilistic graphical model that expresses the probabilistic relationships among a set of variables connected in a directed acyclic graph (DAG). The BN has advantages in learning causal relationships, predicting the consequences of intervention, and analyzing the most probable explanations of consequences. Some researchers have adopted BNs to evaluate and analyze traffic accident risk [31][32][33]. Most crashes cannot be attributed to a single point in time, but they can be described through multiple traffic states across a series of time slices. The DBN is a kind of BN that couples time-series data to express the evolution of risk as time moves forward [20]. For probabilistic inference, the critical step in generalizing the BN is to reveal the probabilistic dependencies of the random variables, which are the observed variables $X = \{x_1, x_2, \ldots, x_T\}$ and the hidden variables $Y = \{y_1, y_2, \ldots, y_T\}$, representing the traffic state variables and the crash likelihood, respectively. When a Markov model and a BN are integrated to construct a DBN model, there are a transition model $P(x_t \mid x_{t-1})$, an observation model $P(y_t \mid x_t)$, and an initial state distribution $P(x_1)$. The joint probability distribution can be expressed as follows:

$$P(X, Y) = P(x_1)\prod_{t=2}^{T} P(x_t \mid x_{t-1})\prod_{t=1}^{T} P(y_t \mid x_t). \quad (3)$$

There were three key steps to initialize a DBN model:
(1) The ChiMerge algorithm was adopted to discretize the continuous variables.
(2) Structure learning was applied to obtain the graphical dependencies among variables. In this step, the DBN not only estimated the dependencies between variables within one time slice but also examined them across different time slices. The PC algorithm was used to build the structure of the BN within one slice among the traffic state variables and crash likelihood. Then, the same variables in different slices were connected to build the structure of the DBN.
(3) Parameter learning was conducted to learn the conditional probability distributions of variables within one time slice and across time slices. Parameter estimation was performed with the Expectation Maximization (EM) algorithm.
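As a hedged illustration of how the initial state distribution, the transition model, and the observation model combine into a joint probability, the following Python sketch multiplies the three factors for one sequence of traffic states and crash outcomes. The state names and all probability values are hypothetical and are not taken from the fitted model.

```python
from itertools import product

def joint_probability(xs, ys, initial, transition, observation):
    """P(X, Y) = P(x_1) * prod P(x_t | x_{t-1}) * prod P(y_t | x_t)."""
    p = initial[xs[0]]
    for prev, cur in zip(xs, xs[1:]):
        p *= transition[prev][cur]          # transition model
    for x, y in zip(xs, ys):
        p *= observation[x][y]              # observation model
    return p

# Hypothetical conditional probability tables for a two-state example.
initial = {"free": 0.7, "congested": 0.3}
transition = {
    "free": {"free": 0.8, "congested": 0.2},
    "congested": {"free": 0.4, "congested": 0.6},
}
observation = {
    "free": {"crash": 0.1, "noncrash": 0.9},
    "congested": {"crash": 0.5, "noncrash": 0.5},
}

p = joint_probability(["free", "congested"], ["noncrash", "crash"],
                      initial, transition, observation)

# Sanity check: the joint distribution over all two-slice sequences sums to 1.
total = sum(
    joint_probability(list(xs), list(ys), initial, transition, observation)
    for xs in product(initial, repeat=2)
    for ys in product(["crash", "noncrash"], repeat=2)
)
```

Here `p` is the product 0.7 × 0.2 × 0.9 × 0.5, and the normalization check confirms that the factorization defines a valid distribution.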

ChiMerge Algorithm.
Continuous variables are usually problematic in DBNs because the networks fail to capture the relationships between them [34]. The classical way to deal with continuous variables in DBNs is to discretize them [35]. Discretization divides a continuous variable into a small number of intervals, where each interval is mapped to a discrete symbol.
There are two widely used simple methods: equal-width intervals, which divide the range between the minimum and maximum values into a number of intervals of equal size, and equal-frequency intervals, where the interval boundaries are chosen so that each interval contains the same number of samples. However, both methods ignore the class of the samples [36]. A good discretization has both intrainterval uniformity and interinterval difference. The ChiMerge algorithm merges intervals by using the $\chi^2$ statistic to test whether the relative class frequencies of adjacent intervals differ significantly. The ChiMerge algorithm consists of the following steps.
(1) Sort the samples according to their value.
(2) Calculate the $\chi^2$ value for each pair of adjacent intervals with the following equation:

$$\chi^2 = \sum_{i=1}^{m}\sum_{j=1}^{n}\frac{\left(A_{ij} - E_{ij}\right)^2}{E_{ij}}, \quad (4)$$

where $m = 2$ (the two intervals being compared), $n = 2$ (the number of classes, i.e., crash and noncrash), $A_{ij}$ is the number of samples in the $i$th interval and $j$th class, and $E_{ij}$ is the expected frequency of $A_{ij}$.
(3) Merge the pair of adjacent intervals with the lowest $\chi^2$ value, and repeat until the $\chi^2$ values of all pairs of adjacent intervals exceed the $\chi^2$ threshold. The threshold is determined by the desired significance level (the 0.95 percentile level in this study) and the number of degrees of freedom (one less than the number of classes). There are two classes (crash and noncrash); thus, the number of degrees of freedom is 1, and the $\chi^2$ threshold is 3.841.
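The three steps above can be sketched in Python as follows. This is a simplified illustration, not the study's code: it starts with one interval per distinct value, skips cells whose expected frequency is zero, and adds an optional cap on the number of intervals (the results section limits each variable to 10 states). The function names are hypothetical.

```python
def chi2(a, b):
    """Chi-square statistic for two adjacent intervals given per-class counts."""
    total = sum(a) + sum(b)
    stat = 0.0
    for row in (a, b):
        for j, observed in enumerate(row):
            expected = sum(row) * (a[j] + b[j]) / total   # E_ij
            if expected > 0:                              # skip empty classes
                stat += (observed - expected) ** 2 / expected
    return stat

def chimerge(values, labels, threshold=3.841, max_intervals=10):
    """Return the lower bounds of the merged intervals."""
    classes = sorted(set(labels))
    counts = {}
    for value, label in zip(values, labels):
        counts.setdefault(value, [0] * len(classes))[classes.index(label)] += 1
    # Step 1: sort; each distinct value starts as its own interval.
    intervals = [[value, counts[value]] for value in sorted(counts)]
    while len(intervals) > 1:
        # Step 2: chi-square for every pair of adjacent intervals.
        stats = [chi2(intervals[i][1], intervals[i + 1][1])
                 for i in range(len(intervals) - 1)]
        i = min(range(len(stats)), key=stats.__getitem__)
        # Stop once every pair exceeds the threshold and the cap is respected.
        if stats[i] > threshold and len(intervals) <= max_intervals:
            break
        # Step 3: merge the most similar pair.
        merged = [x + y for x, y in zip(intervals[i][1], intervals[i + 1][1])]
        intervals[i:i + 2] = [[intervals[i][0], merged]]
    return [lower for lower, _ in intervals]

# Hypothetical speeds (km/h): low speeds are crash cases (1), high speeds are not (0).
speeds = [20, 22, 25, 28, 60, 62, 65, 70]
crash = [1, 1, 1, 1, 0, 0, 0, 0]
cut_points = chimerge(speeds, crash)
```

On this toy data the algorithm keeps a single boundary between the low-speed and high-speed groups, since only that pair of intervals differs significantly in class frequencies.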

PC Algorithm.
The PC algorithm is an efficient and classical algorithm for structure learning in BNs [37,38]. The PC algorithm mainly consists of three steps:
(1) Determine the skeleton of the graph by conditional independence tests. Let $X = \{x_1, x_2, \ldots, x_k\}$ be a set of random variables and $V = \{v_1, v_2, \ldots, v_k\}$ be a set of nodes in a graph such that each node in $V$ represents a random variable in $X$. Construct an undirected graph $G$ in which all nodes are connected to each other. The PC algorithm then applies statistical tests to remove or keep the edge between adjacent nodes $x_i$ and $x_j$ given a conditioning set $x_c$ by calculating the cross entropy $CE(x_i, x_j \mid x_c)$:

$$CE(x_i, x_j \mid x_c) = \sum_{x_i, x_j, x_c} P(x_i, x_j, x_c)\log\frac{P(x_i, x_j \mid x_c)}{P(x_i \mid x_c)\,P(x_j \mid x_c)}. \quad (5)$$

The PC algorithm adopts the $G^2$ test statistic, which equals $2n\,CE(x_i, x_j \mid x_c)$ with $n$ indicating the sample size, to verify independence. The result of this first step is the skeleton of the graph.
(2) Search for the v-structures. If two variables $x_i$ and $x_j$ are not conditionally independent given $x_c$, then $v_c$ is determined to be a collider node and a v-structure $v_i ⟶ v_c ← v_j$ is drawn, while the other edges remain undirected.
(3) Combining with expert experience, some undirected edges between nodes are oriented based on the principle that no cycle and no other v-configuration is allowed.
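The conditional independence test in step (1) can be sketched as follows. The example assumes that the cross entropy takes its usual conditional mutual information form and estimates all probabilities from sample counts; the function name `g2_statistic` and the toy data are hypothetical.

```python
from collections import Counter
from math import log

def g2_statistic(xi, xj, xc):
    """G^2 = 2 * n * CE(xi, xj | xc), estimated from discrete samples."""
    n = len(xi)
    n_ijc = Counter(zip(xi, xj, xc))
    n_ic = Counter(zip(xi, xc))
    n_jc = Counter(zip(xj, xc))
    n_c = Counter(xc)
    ce = 0.0
    for (i, j, c), count in n_ijc.items():
        # P(i, j | c) / (P(i | c) * P(j | c)) = n_ijc * n_c / (n_ic * n_jc)
        ratio = count * n_c[c] / (n_ic[(i, c)] * n_jc[(j, c)])
        ce += (count / n) * log(ratio)
    return 2 * n * ce

# Hypothetical discretized samples with a single conditioning state.
condition = [0, 0, 0, 0]
independent = g2_statistic([0, 0, 1, 1], [0, 1, 0, 1], condition)
dependent = g2_statistic([0, 0, 1, 1], [0, 0, 1, 1], condition)
```

An edge would be removed when the statistic falls below the critical $\chi^2$ value for the relevant degrees of freedom. In the toy data, the first pair yields a statistic of zero and the second yields $8\ln 2$.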

Expectation Maximization (EM) Algorithm.
The EM algorithm is a general algorithm for maximizing the log likelihood, and it has proved effective in parameter learning for BNs [39]. The basic idea of the EM algorithm is to learn the dependence among the nodes by iterating the parameter estimation process [40]. The EM algorithm mainly consists of three steps:
(1) Initialize $\theta$: given a set of unknown parameters $\theta$, the value of the log likelihood is to be maximized. The objective function is

$$\ell(\theta) = \log P(X; \theta) = \log\sum_{Y} P(X, Y; \theta). \quad (6)$$

Introduce a distribution $Q(Y)$; a lower bound on the objective is then defined based on Jensen's inequality:

$$\ell(\theta) = \log\sum_{Y} Q(Y)\frac{P(X, Y; \theta)}{Q(Y)} \ge \sum_{Y} Q(Y)\log\frac{P(X, Y; \theta)}{Q(Y)}. \quad (7)$$

(2) E-step: calculate the distribution $Q(Y) = P(Y \mid X; \theta)$.
(3) M-step: optimize the parameters based on the estimate of the joint probability distribution:

$$\theta' = \arg\max_{\theta}\sum_{Y} Q(Y)\log\frac{P(X, Y; \theta)}{Q(Y)}. \quad (8)$$

Then $\theta'$ replaces $\theta$. The iteration is repeated until a local optimum of the estimated parameters is reached.
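The E-step and M-step can be illustrated with a deliberately small Python sketch. It is an assumption-laden toy rather than the study's model: there is one binary parent Y and one binary child X, and some values of Y are missing (marked `None`). The E-step fills in the posterior of the missing Y values, and the M-step re-estimates the probabilities from expected counts. The function name and the data are hypothetical.

```python
def em_binary(records, iterations=50):
    """EM estimates of P(Y=1) and P(X=1 | Y) when some Y values are None."""
    p_y1 = 0.5
    p_x1 = [0.4, 0.6]  # initial guesses for P(X=1 | Y=0) and P(X=1 | Y=1)
    for _ in range(iterations):
        weight = [0.0, 0.0]      # expected number of records with Y=0 / Y=1
        weight_x1 = [0.0, 0.0]   # expected number of those records with X=1
        for x, y in records:
            if y is None:
                # E-step: Q(Y=1) = P(Y=1 | x; theta) under current parameters.
                like1 = p_y1 * (p_x1[1] if x == 1 else 1 - p_x1[1])
                like0 = (1 - p_y1) * (p_x1[0] if x == 1 else 1 - p_x1[0])
                q1 = like1 / (like1 + like0)
            else:
                q1 = float(y)    # observed values keep full weight
            weight[1] += q1
            weight[0] += 1 - q1
            weight_x1[1] += q1 * x
            weight_x1[0] += (1 - q1) * x
        # M-step: re-estimate the parameters from the expected counts.
        p_y1 = weight[1] / len(records)
        p_x1 = [weight_x1[0] / weight[0], weight_x1[1] / weight[1]]
    return p_y1, p_x1

# Hypothetical (x, y) records; the last two have a missing Y.
complete = [(1, 1), (1, 1), (0, 1), (0, 0), (0, 0), (1, 0), (0, 0), (0, 0)]
partial = complete + [(1, None), (0, None)]
p_y1, p_x1 = em_binary(partial)
```

When no value is missing, a single iteration reduces to ordinary frequency counting, which is a convenient check on the update equations.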

Multilayer Perceptron (MLP).
The neural network, an effective function approximator, is often used to solve regression and prediction problems in various fields. A general multilayer perceptron model can be built with the following three steps.
(1) Initialize the MLP model. Assume that the original function can be approximated by a set of basis functions:

$$F(x) = \sum_{i=1}^{m} w_i\,\varphi_i(x) + e, \quad (9)$$

where $F$ is the original function, $x$ is the input vector, $m$ is the number of network synapses, $w$ is the weight of the synapses, $\varphi$ is the basis function allocated to the synapses (commonly a function with an S-shaped curve, such as tanh), and $e$ is the error.
(2) Load the sample point pair $(x, y)$ and calculate the error between the predicted value and the true value, where $y$ is the true value of the sample.
(3) Adjust the network synapse weights according to the error feedback. The adjustment $\Delta w$ of each weight is computed from this error, where $k$ denotes the layer after which the neuron to be adjusted is located. When the adjustment $\Delta w$ is less than a preset threshold value $\eta$, this step is terminated; otherwise, the weights are updated and the process goes back to step (2).

Results and Discussion
The DBN-based RTCPM was constructed on the training dataset (264 crash and noncrash cases) and validated on the validation dataset (64 crash and noncrash cases). Figure 2 shows the variable importance ranking determined by the RF method. The three most important variables were the TPI (0.151), V (0.144), and U_V (0.144) (VI > 0.14). The relative importance of the other variables was less than 0.12. Therefore, the TPI, V, and U_V were selected as influencing variables to construct the DBN model. It is surprising that visibility and weather played limited roles. This is probably explained by the collection period of the crash data (August to September), when daylight lasted longer and visibility varied less (mean = 21.32 km, SD = 10.90 km). According to the Horizontal Visibility Grading of the Chinese Standard (GB/T 33673-2017), the visual field is considered good when visibility is greater than or equal to 10 km. Therefore, the overall good visibility did not contribute much to crash risk in this study. In addition, a categorical weather variable, the weather type (rainy/sunny), was used as a proxy for the weather condition in this study, rather than a quantitative variable such as rainfall. We assume that the relationship between crash risk and rainfall might be more pronounced than that between crash risk and weather type.

DBN Model Construction.
DBN models with and without the interdependence among traffic flow characteristics (TPI, V, and U_V) were both constructed on the training dataset. The former (the improved DBN-based RTCPM) was the main focus of the study, and the latter (the original DBN-based RTCPM) was developed for comparison. Before constructing the graphical structure of the improved DBN-based RTCPM, the three traffic flow characteristics were discretized according to their corresponding crash cases using the ChiMerge algorithm. The number of discretization states of every variable was limited to 10 so that the computational complexity of the DBN models could be reduced. The discretization ranges of TPI, V, and U_V are presented in Figures 3-5, respectively. The results showed that adjacent discretization intervals of every variable had distinguishable crash/noncrash ratios, indicating that the ChiMerge algorithm produced a good discretization.
After discretization, the PC algorithm and expert assessment were used to investigate the interdependency among traffic flow characteristics within one time slice. The dynamicity of traffic flow characteristics was reflected by connecting the same variables from time slice 2 to time slice 1. The dynamicity and interdependency determined the graphical structure of the improved DBN-based RTCPM (Figure 6). The original DBN-based RTCPM did not consider the interdependency among traffic flow characteristics, and its graphical structure was determined directly by connecting the traffic flow variables to crash risk within one time slice and connecting the same variables between the two time slices (Figure 7).
Afterwards, the parameter learning process was implemented using the EM algorithm. The initial states of the improved DBN-based RTCPM and the original DBN-based RTCPM are presented in Figures 8 and 9, respectively. Their overall probabilities of a traffic flow state being associated with a crash differed (36% and 42%, respectively) when no new evidence was entered into the DBN. This difference suggested the importance of comparing the performance of the two types of DBN-based RTCPM.

Model Validation and Comparison.
The validation dataset was used to validate the DBN models. The marginal probability of the crash risk node in the initial state of the DBN model, i.e., when no new evidence was entered, was set as the classification threshold for evaluating model performance. Each validation case was then entered individually into the models. The crash risk, i.e., the posterior probability of the crash risk node given the observed traffic condition, was calculated from the prior probabilities. Several evaluation metrics based on the confusion matrix (Table 2) are presented in the following equations, where TP, FN, FP, and TN denote the numbers of true positives, false negatives, false positives, and true negatives, with crash as the positive class:

$$\text{accuracy} = \frac{TP + TN}{TP + FN + FP + TN}, \quad (12)$$

$$\text{sensitivity} = \frac{TP}{TP + FN}, \quad (13)$$

$$\text{specificity} = \frac{TN}{TN + FP}, \quad (14)$$

$$\text{precision} = \frac{TP}{TP + FP}, \quad (15)$$

$$\text{recall} = \frac{TP}{TP + FN}, \quad (16)$$

$$\text{G-means} = \sqrt{\text{sensitivity} \times \text{specificity}}, \quad (17)$$

$$\text{F-measure} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}. \quad (18)$$

Besides the overall accuracy from equation (12), the sensitivity from equation (13), G-means from equation (17), and F-measure from equation (18) were used to compare the performance of the two types of DBN and the MLP. For imbalanced classification, the overall accuracy metric is not sufficient because it cannot adequately examine the minority positive samples; thus, sensitivity was chosen as a supplementary metric to examine the crash classification accuracy. The F-measure is the harmonic mean of precision and recall and represents the ability to detect crashes. Furthermore, the balanced classification ability is reflected by G-means, which is the geometric mean of sensitivity and specificity. The comparison results are presented in Table 3. As illustrated by Table 3, all the models reached a good classification accuracy. Among them, the improved DBN-based RTCPM showed the best overall classification accuracy, followed by the original DBN-based RTCPM and the MLP-based RTCPM. For crash detection ability, the sensitivity metric indicated that the improved DBN-based RTCPM performed best, while the original DBN-based RTCPM performed relatively poorly. The F-measure also suggested that the improved DBN-based RTCPM had better crash prediction ability than the original DBN-based RTCPM and the MLP-based RTCPM. With respect to balanced classification ability, the G-means showed that the improved DBN-based RTCPM outperformed the other models. Across all comparisons, the results demonstrated that the improved DBN-based RTCPM achieved desirable overall prediction performance and an effective ability to monitor crashes in real time, while keeping the balance between crash and noncrash prediction. In summary, the prediction performance of the DBN-based RTCPM can be improved by accounting for the interdependence of traffic flow characteristics.
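The thresholding rule and the confusion-matrix metrics can be sketched in Python as follows. The confusion-matrix counts in the example are hypothetical: they assume that the 64 validation cases split into 16 crash and 48 noncrash cases and were chosen so that the resulting accuracy, sensitivity, and G-means agree with the values reported in the conclusions. The actual counts are not given in the text, and the function names are illustrative.

```python
from math import sqrt

def classify(posterior_crash_prob, threshold):
    """Label a case as crash when its posterior exceeds the threshold."""
    return posterior_crash_prob > threshold

def evaluation_metrics(tp, fn, fp, tn):
    """Metrics derived from the confusion matrix (crash = positive class)."""
    sensitivity = tp / (tp + fn)          # also called recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "g_means": sqrt(sensitivity * specificity),
        "f_measure": 2 * precision * sensitivity / (precision + sensitivity),
    }

# Threshold taken as the initial marginal crash probability of the improved DBN (36%).
is_crash = classify(posterior_crash_prob=0.50, threshold=0.36)

# Hypothetical counts (assumed 16 crash and 48 noncrash validation cases).
metrics = evaluation_metrics(tp=11, fn=5, fp=10, tn=38)
```

With these assumed counts, the accuracy, sensitivity, and G-means round to 76.6%, 68.8%, and 73.8%, respectively.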

Sensitivity and Strength of Influences Analysis.
Investigating the interdependency among the traffic flow characteristics also helps reveal the underlying mechanism of crash occurrence, which is useful for formulating real-time risk control measures. The sensitivity and strength of influences analyses were implemented in GeNIe, a professional DBN analysis software package, to identify the most significant contributors to crash risk and the most probable risk propagation path.

Sensitivity Analysis.
The sensitivity analysis in GeNIe identifies which nodes contribute most to the target node in the DBN. With crash risk set as the target node, the contribution degrees of the traffic flow characteristics to crash risk are presented in Figure 10 in descending order. The results showed that the TPI in time slice 2 was the most sensitive factor for crash risk, followed by V in time slice 2, TPI in time slice 1, etc.

Strength of Influences Analysis.
The strength of influences analysis was used to identify the most probable risk propagation path based on the improved interdependency structure. The strength of influence is calculated from the distance between the probability distributions of the child node conditional on the states of its parent node. As shown in Figure 11, the arcs have different values and thicknesses, representing the strength of influence between connected nodes. The largest cumulative value indicates that the most probable risk propagation path is V (time slice 2)⟶V (time slice 1)⟶U_V (time slice 1)⟶TPI (time slice 1)⟶crash risk on current segment.
Synthesizing the results of the sensitivity and strength of influences analyses makes it possible to identify the most probable risk propagation path as well as the most sensitive contributors along that path. The results suggested that real-time risk countermeasures should be prioritized in the following order: V (time slice 2), TPI (time slice 1), V (time slice 1), and U_V (time slice 1).

Conclusions
This study aimed to build an RTCPM for urban elevated expressways by using the DBN model to capture the dynamicity and coupling interdependency among traffic flow characteristics before crash occurrence. The model was built and validated with traffic flow data collected on the Yan'an elevated expressway. Based on the DBN-based RTCPM, sensitivity and strength of influences analyses were used to identify the most probable risk propagation path and the most sensitive contributors to crash risk. The main conclusions are as follows:
(1) In the model construction process, the interdependency in the DBN model was determined by the PC algorithm and expert experience, and the dynamicity of traffic flow characteristics was expressed by adopting data in time slices. In validation, the improved DBN-based RTCPM achieved an overall accuracy of 76.6%, with a crash prediction accuracy of 68.8% and a crash/noncrash balanced classification accuracy of 73.8%. The results indicated that the model can achieve effective crash prediction for urban elevated expressways.
(2) Comparisons with the original DBN-based RTCPM and the MLP-based RTCPM suggested that the improved DBN-based RTCPM can identify the interdependency among traffic flow characteristics before crash occurrence. The comparison results also indicated that the improved DBN-based RTCPM was more suitable for real-time crash prediction on urban elevated expressways.
(3) According to the results of the sensitivity and strength of influences analyses, the most probable risk propagation path is V (time slice 2)⟶V (time slice 1)⟶U_V (time slice 1)⟶TPI (time slice 1)⟶crash risk on current segment, and the most sensitive contributor to crash risk in this path is V (time slice 2), followed by TPI (time slice 1), V (time slice 1), and U_V (time slice 1). The results suggested that real-time risk countermeasures should address these contributors in this order.
There are two directions for future research. First, the model was built and validated on the same urban elevated expressway; thus, its transferability to other urban elevated expressways has not been examined. Second, specific real-time risk countermeasures such as variable speed limits (VSL) can be investigated to reduce crash risk.

Data Availability
The research data are available in .CSV format from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.