A New GNB Model of Crash Frequency for Freeway Sharp Horizontal Curve Based on Interactive Influence of Explanatory Variables

The results demonstrated the effective use of flexibility and elasticity in analyzing explanatory variables and in predicting crashes on freeway sharp horizontal curve segments. Among the six models, model 6 fits much better than the others according to the goodness-of-fit criteria. We also compared the observed crashes on 88 sharp horizontal curve segments with those predicted by models 1, 3, and 6. The comparison shows that model 6 is much more reasonable than the others.


Introduction
Accidents, and specifically highway-vehicle accidents, cost the lives of roughly one and a quarter million people worldwide every year. In addition, highway-traffic injuries are globally the leading cause of death among people 15 to 29 years old, with over 300,000 deaths [1]. Compared with other highways, a freeway is usually designed to provide a relatively good driving environment, characterized by high alignment standards, good pavement, full access control, absence of pedestrians, no low-speed interference, complete traffic safety devices, and so on. As a result, in developed countries the crash rate and death toll of freeways are on average 30%-51% and 43%-76%, respectively, of those of ordinary highways. In China, however, the average crash number, death toll, injury toll, and direct property loss of freeways are 3.2, 8.4, 7.2, and 24.3 times those of ordinary highways [2]. Therefore, it is important to determine, on the basis of reliable databases, the underlying patterns of crash occurrence on freeways and how different freeway environments influence the number of crashes.
Over the past several decades, surveys of the features and frequencies of freeway crashes have been actively pursued [3,4]. Researchers mainly focus on the observed factors that affect the number of crashes on roadway segments or at intersections over fixed time periods [5]. However, specialized crash databases and highway design databases for freeways in China are not available at present. Similarly, investigations that could clarify China's current situation have not been performed. Thus, Liande et al. [6], Zhuanglin [7], and other researchers developed crash prediction models with relatively small samples. To improve on this effort, this paper attempts to establish a model based on a much larger sample.
Mathematical statistics and regression analysis are common methods for predicting highway crashes. Other methods, such as fuzzy mathematics, grey theory, neural networks, and clustering analysis, have also been used to establish prediction models. The prediction models of the American Highway Safety Manual (HSM 2010) are based on statistical regression. IHSDM simulates crash prediction for American two-lane highways well (U.S. federal highway data). Chengcheng [8] studied crash prediction models for two-lane highways, focusing on low-grade highways in China.
The abovementioned methods explain how a single factor influences crashes. However, road crashes are complex events that involve a large variety of factors with multifaceted interactions, which makes them challenging to understand fully. Improved methodologies for road safety analysis, and their application to crash analysis, therefore continue to be investigated [9].
Many statistical models are applied to crash frequency analysis. Improved NB and GNB models were successfully used to predict the crash rate of freeway basic segments, tunnel entrances and exits, and so on [10,11]. However, most of these models have fixed parameters and fail to reveal the true interrelationship between explanatory variables. Flexibility and elasticity are concepts introduced to analyze the degree of interaction between variables when the translog function is extended to the field of traffic accident prediction. In our research, flexibility and elasticity are introduced to express the elastic influence of explanatory variables, and of their interaction, on crash rate. To demonstrate the effective use of flexibility and elasticity in predicting crashes on freeway SHCS, we evaluated goodness of fit by AIC, BIC, and Pseudo R² and compared our proposed models with traditional NB and GNB models. The results of the proposed prediction models were also compared with observed data.
Our research on crash prediction models divides the freeway into several types of segments, namely, basic segments, general segments, and special segments. Since we have discussed the crash prediction model of basic segments in a paper published in the Journal of Southeast University (Wang et al. 2014), we take the freeway sharp horizontal curve segment (SHCS) as the research object in this paper. In the crash prediction model, segment length, curve radius, and traffic volume are the explanatory variables, and the number of crashes per year is the dependent variable.
In the next section, we review the relevant literature. In Section 3, we present the data and the glossary, formulate the basic model, and discuss the variables. In Section 4, we discuss the statistical analysis used to determine the model parameters and the goodness-of-fit comparison, compare the actual crashes with the predicted ones, and present the final results. In Section 5, we present the conclusions.

Literature Review
A wide variety of advanced statistical count models have been applied to crash frequency analysis over the past years, and their strengths and weaknesses were well summarized by Lord and Mannering [5], Mannering and Bhat [12], and Mannering et al. [13]. Mohammadi et al. [14] provided a good summary of prediction models based on several representative references (Lord et al., 2005, 2007, 2010). "Different formulations were used based on the purpose of study and nature of available data. When the data are overdispersed, the Negative Binomial (NB) model structure with a log-link function is the most favored. When the data are not overdispersed, the Poisson structure is the most favored" [14]. They then developed a series of aggregate crash prediction models that relate to the modal split step of the conventional four-step demand models [14]. Yajie (2014) investigated the effect of different functional forms on the estimation of the weight parameter as well as the group classification of the finite mixture of NB regression models, using crash data collected on rural roadway sections in Indiana.
The above review shows that the general linear model, or the log-linear model obtained by logarithmic transformation into a linear equation, is one of the most commonly used methods of building a highway traffic crash prediction model. Although they have several limitations [15], many of the crash prediction models of the Highway Safety Manual 2010 are based on log-linear models and have proved reliable to a certain degree. The conventional NB or GNB model with fixed parameters may fail to capture the possible variability of the individual effects associated with the variables across observations, which may lead to biased parameter estimation and incorrect inferences [16].
An analysis of the common traffic crash prediction models shows that they are usually built on the basic assumption that the explanatory variables are mutually independent, which ignores the influence of the variables on one another. As a result, the modeled relationship between explanatory variables and traffic crashes is not fully in accordance with the actual situation. Although a considerable number of recent highway safety studies [17][18][19] considered the interaction among explanatory variables, most are based on the analysis of the relationship between driver, vehicle, highway [20], and environment [21]. The results of these studies show the different dangers of highway driving and the effect of division on the traffic flow, among others. Moreover, the results show that when the lengths of the analyzed segments differ, the effect of traffic flow on the predicted crashes also differs.
Thus, the traditional log-linear model has two limitations when used to analyze and predict the frequency of road traffic crashes. Firstly, the assumption that the elasticity coefficient is constant is not compatible with common logic [22]. Secondly, explanatory variables are simplified or idealized as independent variables, which makes it hard to reveal the true interrelationship between them [22]. The idea of flexibility has therefore been introduced in our models to overcome these limitations. In manufacturing, flexibility commonly describes the ability to cope with a changing environment or with the uncertainty that such change brings. The Cobb-Douglas production function, linear production function, Leontief production function, Variable Elasticity of Substitution (VES) production function, and translog production function are often used to analyze flexibility. Among these, the translog production function is the most widely used to analyze traffic problems [11]. Wei Huang (2007) and Li Li (2011) studied the generalized translog cost function (GTCF). António (2011), Lurong Wu (2010), Juan Zeng [23], Rong Li (2013), and Xiang Liu [24] introduced the translog function to analyze the loss and frequency of traffic crashes. Using an NB model with a translog functional form, Xiang Liu [24] and Rong Li (2013) established a frequency forecast model of highway traffic crashes in Ontario, Canada. Compared with the log-linear NB model, it proved to be more credible. Other results also show that the above limitations can be well overcome [22,25,26].
and their crash data of five years were selected for analysis. The statistics are shown in Table 2.

Definition.
In the study, the following segments are defined as sharp horizontal curve prediction segments: (1) Radius of horizontal curve: less than 1000 m

Modeling Method.
When building a road traffic accident frequency prediction model, a common way to keep the model stable (that is, to keep the variance as low as possible) is to transform the NB model into a log-linear model. Equations (1)∼(5) and all NB models in this paper are results of this transformation. They therefore do not look like the common NB formulation $Y = \exp(a + \sum_i \beta_i x_i)$. In this paper, we still refer to these formulations as NB regression models or GNB models. The basic form of the translog function model is shown in formula (1) [27,28].

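Table 3 (excerpt). Model 1 (NB), basic functional form: $\ln N = \beta_0 + \beta_1 \ln(T) + \beta_2 \ln(L) + \beta_3 \ln(R)$.

$$\ln Y = \beta_0 + \beta_K \ln K + \beta_L \ln L + \beta_{KK} (\ln K)^2 + \beta_{LL} (\ln L)^2 + \beta_{KL} \ln K \ln L \quad (1)$$

where Y is the dependent variable, K and L are the explanatory variables, and $\beta_0$, $\beta_K$, $\beta_L$, $\beta_{KK}$, $\beta_{LL}$, and $\beta_{KL}$ are the estimated parameters.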
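As a worked illustration of why a translog form implies a variable rather than constant elasticity, the derivative below assumes the standard two-variable translog in the symbols Y, K, and L used here. The coefficient labels $\beta_K$, $\beta_{KK}$, and so on are an assumed naming, not necessarily the authors' notation.

```latex
\begin{align}
\ln Y &= \beta_0 + \beta_K \ln K + \beta_L \ln L
        + \beta_{KK} (\ln K)^2 + \beta_{LL} (\ln L)^2
        + \beta_{KL} \ln K \ln L, \\
\varepsilon_K &= \frac{\partial \ln Y}{\partial \ln K}
        = \beta_K + 2\beta_{KK} \ln K + \beta_{KL} \ln L .
\end{align}
```

Under this assumed form, the elasticity of Y with respect to K depends on the levels of both K and L. Setting $\beta_{KK} = \beta_{LL} = \beta_{KL} = 0$ recovers the log-linear case, in which the elasticity is the constant $\beta_K$.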
All the basic formulas are formulated based on formula (1) by introducing second-order cross terms and using the translog cost function (TCF) form to extend the NB and Poisson models. Consequently, the interaction between the variables can be reflected. Couto et al. [25] established a translog function model based on AADT, segment length, density of access, and a time trend variable. In their formula, the dependent variable is the average number of accidents per year; the explanatory variables are AADT, segment length, density of access, and the time trend; and the coefficients indexed 0∼9, together with one further parameter, are estimated.
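The idea of extending a log-linear model with second-order and cross terms can be sketched as a design matrix. The function below is a minimal illustration, assuming a full second-order expansion in ln T, ln L, and ln R; which of these terms the paper's models actually retain is not stated here, and the function name and input values are hypothetical.

```python
import numpy as np

def translog_design(T, L, R):
    """Build a full second-order translog design matrix in ln T, ln L, ln R.

    Columns: intercept, three first-order logs, three squared logs,
    three pairwise cross terms (10 columns in total).
    """
    lt = np.log(np.asarray(T, dtype=float))
    ll = np.log(np.asarray(L, dtype=float))
    lr = np.log(np.asarray(R, dtype=float))
    return np.column_stack([
        np.ones_like(lt),         # intercept
        lt, ll, lr,               # first-order terms (the log-linear part)
        lt**2, ll**2, lr**2,      # squared terms
        lt * ll, lt * lr, ll * lr,  # cross terms carrying the interactions
    ])

# Hypothetical inputs: AADT, segment length, and curve radius for two segments.
X = translog_design(T=[20000, 35000], L=[400.0, 650.0], R=[600.0, 900.0])
print(X.shape)  # (2, 10)
```

Dropping the last six columns gives the design matrix of an ordinary log-linear model, which makes the comparison between constant-elasticity and translog specifications a matter of which columns are kept.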
In our paper, the freeway crash prediction model is built by selecting AADT, length of sharp horizontal curve segments, and curve radius as explanatory variables. The NB crash prediction model is set up based on the constant elasticity and flexibility of variables; see formulas (3) and (4).
where Ni is the estimated number of crashes on segment i in a specific year, Ti is the AADT of segment i in that year, Ri is the radius of curve i, Li is the length of segment i, and βj (j = 0, 1, 2, ..., 5) are the estimated parameters. Flexibility and elasticity are then defined to express the elastic influence of explanatory variables, and of their interaction, on crash rate. In total, we proposed 6 types of models to predict crash frequency; see Table 3. These 6 types include 2 NB models (models 1 and 2), 2 GNB models (models 3 and 4), and one NB model (model 5) and one GNB model (model 6) in which flexibility and variable elasticity are considered. As discussed in the literature review, NB and GNB models have 2 main limitations, which the translog transformation and variable flexibility are expected to overcome. The first 4 models (2 NB models and 2 GNB models) therefore serve as a baseline for checking whether the translog transformation and variable flexibility improve the prediction. The Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), and the Pseudo R² test were used to evaluate the goodness of fit of the crash prediction model (CPM) of SHCS.
For the NB distribution, the overdispersion parameter is a constant greater than zero. The higher the value is, the more dispersed the observations are. The GNB distribution also follows this rule but allows the overdispersion parameter to change with other variables. We selected the form of the overdispersion parameter by comparing the goodness of fit obtained with different explanatory variables for it. Then, we compared the goodness of fit of the six models below and ultimately determined the CPM of SHCS. The corresponding forms of each model and the estimated parameters are shown in Table 3.

Overdispersion Parameter.
The difference between the NB model and the GNB model is whether overdispersion parameter is a constant or not.
The first 2 types are NB models, in which the overdispersion parameter is a constant greater than zero. Models 2.1, 2.2, and 2.3 differ in which variable's coefficient is fixed. In model 2.1, we fix the coefficient of Ri at 1 as a reference in order to find the influence of Ti and Li relative to Ri. Models 2.2, 2.3, and 4.1∼4.3 serve similar purposes. The comparison gives a clear understanding of the relationship between the 3 variables. This method has also been applied elsewhere in the literature [10,11].
The overdispersion parameter of a GNB model is not a constant. We separately tried T, L, R, TR, TL, RL, and TLR as explanatory variables of the overdispersion parameter. Thus, each GNB model has 7 specific models.
AIC, BIC, and Pseudo R² are used to select, for each of the six main models, the specific model with the best goodness of fit. The overdispersion parameter of each model and its AIC, BIC, and Pseudo R² values are listed in Table 4.
The following standards are used to examine and verify the goodness of fit of the candidate specifications of the overdispersion parameter: (1) The Pseudo R² statistic is used to test the goodness of fit of the models. The bigger it is, the better the model is.
(2) AIC weighs the goodness of fit of a model against its number of parameters. The smaller it is, the better the model is.
(3) BIC also weighs the goodness of fit against model complexity, with a penalty that grows with the sample size. Thus, the smaller it is, the better the model is.
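The three criteria can be computed from a model's maximized log-likelihood. The sketch below uses the standard AIC and BIC definitions; the paper does not say which Pseudo R² it uses, so McFadden's version is assumed here, and the function names and numbers are hypothetical.

```python
import math

def aic(loglik, k):
    """Akaike Information Criterion: 2k - 2 ln(L); smaller is better."""
    return 2 * k - 2 * loglik

def bic(loglik, k, n):
    """Bayesian Information Criterion: k ln(n) - 2 ln(L); smaller is better."""
    return k * math.log(n) - 2 * loglik

def mcfadden_r2(loglik_model, loglik_null):
    """McFadden's pseudo R^2 (an assumed choice): 1 - ln(L_model) / ln(L_null)."""
    return 1.0 - loglik_model / loglik_null

# Hypothetical log-likelihoods for a fitted model and an intercept-only model.
ll_model, ll_null = -80.0, -100.0
k, n = 4, 88  # hypothetical parameter count and number of observations
print(aic(ll_model, k), bic(ll_model, k, n), mcfadden_r2(ll_model, ll_null))
```

Because BIC multiplies the parameter count by ln(n), it penalizes extra parameters more heavily than AIC whenever n exceeds about 7, which matters when comparing a translog specification against its log-linear counterpart.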
As shown in Table 4, among the 3 NB models, model 5 has relatively small AIC and BIC values and a large Pseudo R². This suggests that considering flexibility and elasticity improves the fit of the models' overdispersion parameter. Among the 3 GNB models, although the values are quite close for some specifications, when T is selected as the explanatory variable of the overdispersion parameter, the AIC and BIC values tend to be smaller and the Pseudo R² value tends to be larger than the others. The results indicate that these models fit better than the others. Thus, we chose T as the explanatory variable of the overdispersion parameter. That is, the overdispersion parameter α is modeled as $\alpha = e^{\gamma_0 + \gamma_1 \ln(T)}$, where $\gamma_0$ and $\gamma_1$ are estimated parameters.
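The difference between a constant and a variable overdispersion parameter can be made concrete with a log-likelihood sketch. The code below assumes the common NB2 parameterization, in which the variance is μ + αμ², and a log-linear form for α in ln T; the symbol names, function names, and coefficient values are assumptions for illustration, not the authors' estimates.

```python
import math
import numpy as np

def nb2_loglik(y, mu, alpha):
    """Sum of NB2 log-probabilities with mean mu and overdispersion alpha.

    alpha may be a scalar (NB) or an array with one value per observation (GNB).
    """
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    alpha = np.broadcast_to(np.asarray(alpha, dtype=float), y.shape)
    r = 1.0 / alpha  # inverse overdispersion
    lg = np.vectorize(math.lgamma)
    ll = (lg(y + r) - lg(r) - lg(y + 1.0)
          + r * np.log(r / (r + mu))
          + y * np.log(mu / (r + mu)))
    return float(ll.sum())

def gnb_alpha(T, g0, g1):
    """Observation-specific overdispersion: alpha_i = exp(g0 + g1 * ln T_i)."""
    return np.exp(g0 + g1 * np.log(np.asarray(T, dtype=float)))

# Hypothetical crash counts, predicted means, and AADT values.
y, mu, T = [0, 3, 1], [0.8, 2.5, 1.2], [15000, 40000, 22000]
print(nb2_loglik(y, mu, 0.5))                        # NB: constant alpha
print(nb2_loglik(y, mu, gnb_alpha(T, -6.0, 0.5)))    # GNB: alpha varies with T
```

Setting `g1 = 0` collapses the GNB case back to a constant α, so the NB model is nested in the GNB model under this assumed form.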

Model Result.
Based on the collected data mentioned in Section 2, we calibrated the estimated parameters of the six main models and the specific models cited above. The goodness of fit was also calculated. The results are shown in Table 5. Table 5 indicates a clear interactive influence between pairs of variables. Model 5 and model 6, which take the interactive influence into consideration, fit well, particularly when compared with models 2 and 4. Thus, we excluded models 2 and 4 from further consideration. According to the 3 evaluation criteria, model 6 is better than model 5. In addition, the Pseudo R² of model 6 is larger than those of models 1 and 3, indicating that model 6 is much better than models 1 and 3.
Based on the above analysis, we selected model 6 as the CPM. In the model and in the expression for its overdispersion parameter, N is the estimated number of crashes per year on the SHCS, T is the annual average daily traffic of the segment, L is the length of the SHCS, and R is the radius of the SHCS.

Prediction Analysis.
To demonstrate the effectiveness of the prediction, we predicted the crashes of a certain freeway with model 1, model 3, and model 6. Then, we compared the results with the real crash data we collected from the institutions. See Table 6.
As shown in Table 6, the mean crash numbers predicted by the 3 models are all close to the observed mean. However, the standard deviations differ. Because model 6 accounts for flexibility and variable elasticity, its results are much closer to the statistics of the real crash data than those of the other two models. For the maximum and minimum values, the forecast range of model 6 is also very close to the actual situation. Based on the above discussion, model 6 is the best among the six models.
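A comparison of this kind, by mean, standard deviation, minimum, and maximum, can be sketched as follows. The data and model names are hypothetical and are not the values of Table 6; the sample standard deviation (ddof=1) is an assumed choice.

```python
import numpy as np

def summarize(counts):
    """Mean, sample standard deviation, minimum, and maximum of a series."""
    a = np.asarray(counts, dtype=float)
    return {"mean": a.mean(), "std": a.std(ddof=1), "min": a.min(), "max": a.max()}

def summary_gaps(observed, predicted):
    """Absolute gap between predicted and observed summary statistics."""
    obs, pred = summarize(observed), summarize(predicted)
    return {name: abs(pred[name] - obs[name]) for name in obs}

# Hypothetical observed crashes and two sets of predictions per segment.
observed = [0, 2, 5, 1, 9]
model_a = [1.8, 2.9, 3.6, 2.4, 4.3]   # similar mean, compressed spread
model_b = [0.6, 2.2, 4.8, 1.5, 7.9]   # similar mean, similar spread
for name, pred in {"model_a": model_a, "model_b": model_b}.items():
    print(name, summary_gaps(observed, pred))
```

Two sets of predictions can match the observed mean about equally well and still differ sharply in spread and range, which is why the standard deviation and the extremes are checked alongside the mean.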

Conclusion and Discussions
The traditional log-linear model has two limitations when used to analyze and predict the frequency of road traffic crashes. One is the constant elasticity coefficient, and the other is the assumed independence of the variables. The translog transformation and the idea of flexibility have been introduced in our models to overcome these limitations. Flexibility and elasticity are defined to express the elastic influence of explanatory variables, and of their interaction, on crash rate. The analysis sheds light on crash prediction for SHCS of freeways. In total, six types of models, comprising 10 models, were proposed to predict crashes. These 6 types include 2 NB models (models 1 and 2), 2 GNB models (models 3 and 4), and one NB model (model 5) and one GNB model (model 6) in which flexibility and variable elasticity are considered. Among the models, model 6 is much better than the others. All parameter estimates in Table 5 satisfy the models with a confidence of more than 1%. Through the detailed analysis and study, the following conclusions have been drawn:
(1) Among the 3 NB models, model 5, which considers flexibility and elasticity, has relatively small AIC and BIC values and a large Pseudo R². This demonstrates that flexibility and elasticity help to improve the fit of the models' overdispersion parameter.
(2) When T is selected as the explanatory variable of the overdispersion parameter in GNB models, the AIC and BIC values tend to be smaller, and the Pseudo R² value tends to be larger than the others. The results indicate that these models fit better than the others.
(3) With sufficient samples and data, the effective use of the GNB model in analyzing the interactive influence of explanatory variables and predicting crashes on freeway sharp horizontal curve segments can be demonstrated. The predictions of the relatively good models (model 1, model 3, and model 6) have also been compared with real data. The results show that model 6 has the best prediction effect.
In summary, the results show that a suitable curve length (Li) allows drivers to adapt to the driving environment. A long sharp curve combined with a high traffic volume Ti results in a high crash rate. However, when Ti is low (e.g., free flow), the influence of Li on crash rate decreases. Likewise, a sharp curve with a high traffic volume Ti results in a high crash rate, and when Ti is low (e.g., free flow), the influence of Ri on crash rate decreases. On long sharp curves with a very small Ri, drivers have difficulty controlling the steering wheel and are also nervous, which points to an interaction between curve length and radius. Thus, the findings of this study can help enhance understanding of the relationship among traffic volume, highway horizontal radius, and curve length. Such understanding can help to develop crash prevention strategies for specific conditions. For example, the findings can provide an important guide for designers when choosing the horizontal radius and curve length. Moreover, the results could be used as a basis for implementing a variable speed limit on curves to reduce crash risk on hazardous roadway segments.
However, further efforts should be made to demonstrate the differences between the NB and GNB models. The experimental data were limited, so the model fit is still some way from ideal. The degree to which variable interaction affects prediction accuracy is also a topic for our future research. With flexibility and elasticity defined to express the elastic influence of explanatory variables, and of their interaction, on crash rate, GNB model 6 shows a good fit, which offers a certain reference value for crash prediction in general.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.