DEVELOPMENT OF MODELS TO STUDY TRAFFIC ACCIDENTS ON THE FINAL SECTIONS OF ACCESS ROADS TO THE CITIES: A CASE STUDY OF THREE MAJOR IRANIAN CITIES

The final sections of the main access roads to cities require special attention, as the frequency of accidents on these road sections is considerably higher than on other parts of interurban roads. These road sections operate as an interface between rural roads and urban streets. Previous research on this subject is limited and has mainly focused on a narrow range of factors contributing to accidents in these areas. The main contribution of this research is to consider a relatively comprehensive range of potential factors, and to examine their impacts through the development and comparison of both conventional probabilistic models and Artificial Neural Network (ANN) models. For this purpose, information related to the main access roads of three major Iranian cities was collected. This information consisted of accident frequency data together with field observations of the traffic characteristics, roadway conditions and roadside features of these roads. Various ANN and probabilistic models were developed. The frequency of accidents, with no distinction between fatal, injury, and property-damage accidents, was considered as the output of the developed models. The results indicated that a hybrid of ANN models, each comprising 10 input variables representing traffic, roadway and roadside conditions, outperformed several probabilistic models, namely the Poisson, Negative Binomial, Zero-truncated Poisson, and Zero-truncated Negative Binomial models, also developed under similar conditions in this study.
Moreover, effective roadway width, roadway lighting condition, the standard deviation of vehicle speeds, the percentage of drivers violating the speed limit, average annual daily traffic, the percentage of heavy goods vehicles, the density of roadside commercial and industrial land uses, the density of median U-turns, the density of local access roads, and the effective width of the left-side shoulder were identified as the most influential factors contributing to accidents in these areas. The developed ANN model can be used as a tool to predict accident rates on these road sections, and to estimate the potential reduction in accident rates following improvements in the major factors contributing to traffic accidents in these areas.


Introduction
Road safety has been one of the main concerns of road users and highway authorities over the years. The Global Status Report on Road Safety published by the World Health Organization in 2018 (WHO, 2018) indicated that the annual number of road accident fatalities was around 1.35 million in 2016, and that road accidents are the leading cause of death among children and young adults aged 5 to 29 years. It is anticipated that by 2030, road accidents will be the fifth most important cause of human fatality worldwide (WHO, 2018). Traffic accidents also impose a great burden on families, the economy, the health sector, insurance companies, etc. in each country. In Iran, road transportation has a national share of around 94% among all transportation modes, and therefore the provision of safety for this mode is essential (Khalili and Pakgohar, 2013). The spatial pattern of accident distribution in this country indicates that around 40% and 70% of traffic accidents on rural roads have occurred within 5km and 30km of the city entries, respectively. Considering that these main access roads to the cities constitute only around 17% of the rural roads in this country, it can be concluded that these road sections are more prone to traffic accidents than other rural road sections (Afandizadeh and Golshan Khavas, 2006). An access road to a city acts as an interface and transitional section between rural roads and urban streets. Therefore, its functionality, physical characteristics and surroundings are a mixture of these two different environments. This situation creates confusion and inappropriate driving behavior along these road sections, which in turn results in reduced safety and increased accidents (Davoodi and Ahmadi, 2015). Thus, the safety of such road sections should be treated with a higher priority than that of other rural road sections.
An initial but important issue in the study of traffic safety along the main access roads to the cities is to decide on the overall study length. Through a literature search, the authors could not find any concrete guidance or criteria for this. However, a road length of 30km is widely adopted by the Iranian officials in the Ministry of Roads and Urban Development and also the Police as the influence area. Thus, this length was adopted in this research as well.
Due to the transitional status and mixed function of the main access roads to cities, a wider range of factors, comprising both rural and urban related factors that may contribute to the occurrence of accidents along these road sections, should be considered. Pedestrians and two-wheeled motorized and non-motorized vehicles, such as motorcycles and bicycles, are heavily involved in urban mobility and, due to their vulnerability, are highly exposed to traffic accidents. On this basis, a number of studies have been conducted in the past to estimate the frequency and severity of accidents in urban areas (Das and Burger, 2015). However, the factors contributing to accidents on rural roads could be different from those on urban streets. This may be attributed to their high-speed driving environment and the much lower presence of pedestrians and other non-motorized traffic. Furthermore, the roadway and traffic characteristics of rural roads and their surrounding environment are different from those of urban streets. This is another reason for the apparent differences between the nature of accidents on urban and rural roads (Holdridge et al.). Afandizadeh and Golshan Khavas (2006), in their research on the safety of access roads to cities, identified six parameters, namely roadway width, Average Annual Daily Traffic (AADT), the number of local access roads along these road sections, the population of the destination city, the road longitudinal gradient, and the road section length, as the major factors contributing to the frequency of accidents on these road sections. Also, Khalili and Pakgohar (2013) identified road lighting, the density of retailing and other land use activities along these road sections, and the frequency of local access roads as the major factors contributing to accidents in these areas.
A number of other studies have reported that unexpected factors, such as the existence of various retailing, industrial and agricultural activities along these road sections, roadside parked vehicles, and turning movements of vehicles at local intersections, complicate the situation in these areas, further disrupting the through traffic flow and degrading the safety of road users. In terms of study approach, previous research has widely relied on the development of accident prediction models as a tool to investigate the underlying factors contributing to the occurrence of traffic accidents. For this purpose, various statistical and probabilistic models and, more recently, artificial intelligence based models have been used. Among these, probabilistic models are widely used by researchers for both urban and rural road areas. More recently, artificial intelligence based models, e.g. Artificial Neural Network (ANN) models, have also been used by some researchers as a potential approach to predict the number of traffic accidents (Dougherty, 1995). Considering the higher rates of traffic accidents on these road sections and the fact that the previous research available on this subject is insufficient, mainly focusing on a narrow range of factors contributing to accidents in these areas, further research in which a wider range of sites and factors is considered would be beneficial. Hence, a major contribution of this research is to consider a relatively comprehensive range of factors that may have a significant impact on the frequency of traffic accidents (crashes) along these road sections.
The factors considered in this research are: lane width, number of carriageway lanes, shoulder width, road lighting condition, road pavement condition, traffic composition, median width, actual driving speeds and deviation from speed limits, along with some other factors that describe the surrounding land use and road traffic characteristics. Another major contribution of this research is to examine a range of probabilistic accident prediction models for these areas, to identify the best-performing model, and to compare its performance with that of an appropriate ANN model also developed for these road sections. From the best-performing model, it would be possible to identify the dominant factors contributing to accident occurrence, to predict accident rates, and to estimate the potential reduction in accident rates following improvements in the major factors contributing to traffic accidents in these road areas. In addition to the parameters mentioned above, driver characteristics and driving behavior are also important features that should be taken into account. However, these factors are not considered in this research, as its main purpose was to investigate only the impacts of the traffic, geometric and surrounding environment characteristics of the main access roads to cities on the frequency of accidents on these road sections.

Data collection process
The data collection process used for the development of the accident prediction models is presented in this section. In this research, the data related to the main access roads of three cities, namely Isfahan, Kerman, and Yazd, were collected. All three cities are the capitals of their provinces in Iran, with populations of around 2.4, 0.6 and 0.6 million, respectively. Fig. 1 shows the bird's-eye views of these cities and their main access roads. The leading parts of the rural roads providing main access to each city were considered. A road length of 30km from the city entrance was considered for each access road unless there was a small town along this distance, in which case a shorter length was adopted for that access road. In total, 565km of road carriageways related to the access roads of these three cities were examined, and the required information was gathered through site visits and further data acquired from the corresponding road authorities. The data related to 325km of these roads were used for model development and the remaining data were used for the validation of the developed models. In order to consider the variations in the factors contributing to traffic accidents in these road areas, each 30km access road was subdivided into 6 road sections, each 5km long. Therefore, in total, 113 carriageway sections were investigated in this research. The details of the access roads examined in this study are presented in Table 1. As indicated in the previous section, one of the main objectives of this research was to identify the major factors contributing to traffic accidents on the access roads to cities. To achieve this objective, a set of 16 potential factors or variables was identified. These factors were initially extracted from the factors identified in previous research.
The overall variables considered in this study and their assigned codes for data analysis are shown in Table 2. The accident data related to each carriageway section were collected from the local road authorities and the police for a period of 3 years (2015 to 2017). An accident is defined in this study as any incident in which a vehicle collides with another vehicle, a pedestrian, an animal, road debris, or another stationary obstruction, such as a tree, pole or building. A traffic accident may result in injury, disability, death, and property damage. A summary of the descriptive statistics related to the data gathered for this study is presented in Table 3. It should be noted that, because of the differences in the nature of these data items, they have been normalized using the method presented in Equation 1 (Sukirty et al., 2018; Jiawei and Micheline, 2006):

$$X_{n} = \frac{X_{i} - X_{\min}}{X_{\max} - X_{\min}} \qquad (1)$$

Where: $X_{n}$ is the normalized variable; $X_{i}$ is the real value of the i-th variable; $X_{\min}$ is the real observed minimum value of the i-th variable; $X_{\max}$ is the real observed maximum value of the i-th variable.
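The normalization step can be illustrated with a minimal Python sketch, assuming the min-max form implied by the variable definitions of Equation 1. The function name and the AADT values below are hypothetical and are not taken from the study data.

```python
def min_max_normalize(values):
    """Scale the observations of one variable to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant variable has no spread; map every observation to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical AADT values (vehicles/day) for five carriageway sections.
aadt = [12000, 18000, 15000, 30000, 24000]
normalized_aadt = min_max_normalize(aadt)
```

Each variable is normalized separately, so that variables measured in different units (e.g. widths in meters and traffic volumes in vehicles per day) end up on a common scale.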

Model development
In this section, the methodology used for the development of the accident prediction models and for identifying the models with the best overall performance is presented. Then, using these models, the underlying traffic, roadway and roadside features contributing to traffic accidents along the main access roads to cities are identified. The normalized data, obtained from the procedure explained in the previous section, was used for the development and comparison of the probabilistic and ANN models described in the following sections. The probabilistic and ANN models were developed using the STATA Ver. 14 and MATLAB 2018b software, respectively. In order to compare the performance of the developed statistical models, Goodness Of Fit (GOF) measures including the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), McFadden pseudo R², McFadden adjusted pseudo R² and the Root Mean Square Error (RMSE) were used. These indicators have been widely used in the past, and they have been shown to be effective measures for comparing the fit of probabilistic models such as the Poisson and Negative Binomial family of models (Khishdari and Fallah Tafti, 2017). The RMSE is calculated using Equation 2:

$$RMSE = \sqrt{\frac{1}{t}\sum_{i=1}^{t}\left(O_{i} - E_{i}\right)^{2}} \qquad (2)$$

Where: $O_{i}$ is the average annual number of observed accidents in the i-th road carriageway section; $E_{i}$ is the average annual number of expected accidents in that section, as estimated by the developed model; $t$ is the number of road carriageway sections considered for each access road to the city. Moreover, overfitting control and sensitivity analysis were used to identify the input variables with a significant impact on the model output and to rank these variables. It would then also be possible to develop further models in which only variables with a significant impact are used, thereby reducing the number of input variables required by the model.
A model with a reduced number of input variables would be preferable as it would require fewer data items, provided that it offers a level of accuracy comparable to that of the original model. In this research, due to the difference in the nature of probabilistic and ANN models, their sensitivity analyses were performed differently. The procedures used for the sensitivity analysis of the probabilistic models and the ANN models are shown in Figs. 2 and 3, respectively. The sensitivity analysis of the probabilistic and ANN models was carried out using the step-backward analysis and the randomization test, respectively. Finally, based on the sensitivity of the model output to the input variables (e.g. the p values obtained for each input variable), the non-significant variables were eliminated, while still maintaining the same overall performance of the model. Through this process, the model with the minimum number of input variables and almost the same level of accuracy was selected as the final model.
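The goodness-of-fit measures named in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: it uses the standard textbook definitions of AIC, BIC and the McFadden pseudo R² measures, which are assumed (not confirmed) to match the values reported by the statistical software used in the study. All function names and numbers are hypothetical.

```python
import math


def rmse(observed, expected):
    """Root mean square error between observed and model-estimated counts."""
    t = len(observed)
    return math.sqrt(sum((o - e) ** 2 for o, e in zip(observed, expected)) / t)


def aic(log_lik, k):
    """Akaike Information Criterion for a model with k estimated parameters."""
    return 2 * k - 2 * log_lik


def bic(log_lik, k, n):
    """Bayesian Information Criterion for k parameters and n observations."""
    return k * math.log(n) - 2 * log_lik


def mcfadden_r2(log_lik_model, log_lik_null):
    """McFadden pseudo R-squared relative to the intercept-only model."""
    return 1 - log_lik_model / log_lik_null


def mcfadden_adjusted_r2(log_lik_model, log_lik_null, k):
    """McFadden pseudo R-squared penalized by the number of parameters k."""
    return 1 - (log_lik_model - k) / log_lik_null


# Hypothetical observed and estimated annual accident counts for four sections.
observed = [3, 5, 2, 7]
expected = [2.5, 5.5, 2.0, 6.0]
example_rmse = rmse(observed, expected)
```

Lower AIC, BIC and RMSE values indicate a better model, whereas higher pseudo R² values indicate a better fit; the adjusted form penalizes additional input variables, which is why it is useful for detecting overfitting.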

Results of probabilistic models development
Two probabilistic models that are widely used for traffic accident frequency prediction, namely the Poisson and Negative Binomial models, and their derivatives were developed in this research. The Poisson probability distribution can be derived from the Binomial distribution: as the number of binomial trials tends to infinity while the expected count is held fixed, the corresponding Binomial distribution approaches a Poisson distribution (Hilbe, 2011). The Poisson model is given by Equation 3:

$$P(y_{i}) = \frac{e^{-\lambda_{i}}\,\lambda_{i}^{y_{i}}}{y_{i}!} \qquad (3)$$

Where: $y_{i}$ is the number of accidents that occurred in the i-th carriageway section of the studied road; $\lambda_{i}$ is the average number of accidents in that section, which can be calculated using Equation 4:

$$\lambda_{i} = \exp\left(\sum_{j=1}^{n}\beta_{j}X_{ij}\right) \qquad (4)$$

Where: $\beta$ is a (1×n) vector of regression coefficients; $X_{ij}$ is the j-th study variable for the i-th section. The regression coefficients of the Poisson model are estimated by maximizing the log-likelihood function. One issue associated with the Poisson model is that it assumes that the variance and the mean of the dependent variable (model output) are equal (Hilbe, 2011). This assumption has rarely been observed for accident data in practice. Depending on the data types used in the modeling process, the variance of the dependent variable can be lower or higher than the mean (Cameron and Trivedi, 1998). To overcome this issue, the Negative Binomial model, as described by Cameron and Trivedi, is usually used (Cameron and Trivedi, 1998). In this model, the over-dispersion parameter (α) is used to account for the inequality of the mean and variance of the data. The α parameter can be calculated using Equation 5:

$$\mathrm{Var}(y_{i}) = \lambda_{i} + \alpha\lambda_{i}^{2} \qquad (5)$$
Where all parameters of this equation are previously defined. In practice, another type of over-dispersion or under-dispersion can occur due to outliers, i.e. parts of the data that are extremely higher or lower than the overall data, leading to over-dispersion and under-dispersion, respectively (Saffari et al. 2011). To resolve this issue, the Zero-truncated Poisson and the Zero-truncated Negative Binomial models were introduced. Following the development of the Poisson, Negative Binomial, Zero-truncated Poisson, and Zero-truncated Negative Binomial models, the sensitivity analysis was performed in accordance with the process presented in Fig. 4. Finally, Fig. 5 was plotted to determine the appropriate number of input variables and to overcome the overfitting issue. In order to address the overfitting issue, it is more appropriate to consider adjusted R² values rather than R² values, as the former can identify the number of input variables with a meaningful impact on the overall performance of the model. Furthermore, the number of input variables should be limited to the maximum number beyond which the adjusted R² value decreases. On this basis, the appropriate number of variables to prevent overfitting can be observed from the second-order polynomial curves fitted to the results in Fig. 5. According to this figure, the best performance of the Poisson and Negative Binomial regression models was achieved with 9 input variables, and the best performance of the Zero-truncated Poisson and Zero-truncated Negative Binomial models was achieved with 11 input variables. For each probabilistic model type, the appropriate input variables were selected according to the ranking list obtained from the sensitivity analysis to prevent the overfitting problem. New probabilistic models were developed based on these reduced numbers of input variables and compared using the measures presented in Table 4.
According to the values of the performance measures presented in Table 4, it can be concluded that the Zero-truncated Poisson model comprising 11 input variables demonstrated better overall performance than the other probabilistic models. The input variables of this model, in descending order of their significance, are: standard deviation of vehicle speeds, road lighting status, the density of commercial and industrial land uses, the density of local access roads, percentage of heavy goods vehicles, effective roadway width, AADT, the effective width of the left-side shoulder, the density of agricultural land uses, the effective width of the right-side shoulder, and the density of median U-turns.
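To make the difference between the Poisson and Zero-truncated Poisson models concrete, the following Python sketch evaluates both probability functions using their standard definitions. It assumes a log-link mean without an intercept term; the function names and the coefficient values are hypothetical and are not the coefficients estimated in this study.

```python
import math


def poisson_mean(beta, x):
    """Log-link mean: lambda = exp(sum_j beta_j * x_j)."""
    return math.exp(sum(b * xj for b, xj in zip(beta, x)))


def poisson_pmf(y, lam):
    """Probability of observing y accidents under a Poisson model."""
    return math.exp(-lam) * lam ** y / math.factorial(y)


def zero_truncated_poisson_pmf(y, lam):
    """Poisson probability conditioned on at least one accident (y >= 1)."""
    if y < 1:
        return 0.0
    return poisson_pmf(y, lam) / (1.0 - math.exp(-lam))


# Hypothetical coefficients and normalized inputs for one carriageway section.
beta = [0.8, 0.4, -0.3]
x = [0.6, 1.0, 0.2]
lam = poisson_mean(beta, x)
p_three = zero_truncated_poisson_pmf(3, lam)
```

The zero-truncated form rescales the Poisson probabilities by 1 - exp(-λ), so that the probabilities of the positive counts sum to one; it is suited to data in which sections with zero recorded accidents do not appear.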

Results of ANN models development
Generally, an ANN model consists of three layers: the input layer, the hidden layer, and the output layer.
Here, the input layer corresponds to the independent variables that affect the frequency of accidents, and the output layer corresponds to the average annual number of accidents in the i-th carriageway section of the road under study. The hidden layer provides an interface between the input and output layers. The feedforward backpropagation neural network is one of the most common types of ANN models, which is widely used for prediction-type models and was therefore used in this research. In this research, a hybrid structure was used for the ANN model. The hybrid structure consists of 10 ANN models, each with the same input variables and the same number of neurons in the hidden layer, but each trained with a different initial random seed number. The hybrid model calculates the average output of these 10 ANN models as the final model result. For the development of this model, the "trainlm" function in MATLAB 2018b, which applies the Levenberg-Marquardt optimization technique, was used to train the ANN models. The structure of the developed hybrid ANN model is shown in Fig. 6. During the ANN modeling process, the collected data is usually divided into three separate parts for the training, testing and validation of the model. Usually, a predefined percentage of the available data is used for each of these parts, and each part is randomly selected from the available data. To minimize the impact of the data ratio selection on the model performance, two commonly used combinations of data proportions were tested in this research as follows:
− 70% (train) + 15% (test) + 15% (validation)
− 60% (train) + 20% (test) + 20% (validation)
To develop the ANN model, 16 independent variables were considered as the initial input variables of the model.
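The two ideas described above, random splitting of the data and averaging the outputs of several networks, can be sketched as follows. This is only an illustration under stated assumptions: the member models are hypothetical stand-ins (seeded random linear maps) rather than trained Levenberg-Marquardt networks, and the figure of 65 sections is derived here from the 325km of road used for model development divided into 5km sections.

```python
import random


def split_sections(n_sections, train=0.70, test=0.15, seed=0):
    """Randomly assign section indices to train, test and validation subsets."""
    indices = list(range(n_sections))
    random.Random(seed).shuffle(indices)
    n_train = int(train * n_sections)
    n_test = int(test * n_sections)
    return (indices[:n_train],
            indices[n_train:n_train + n_test],
            indices[n_train + n_test:])


def hybrid_predict(members, x):
    """Average the outputs of all member models for one input vector x."""
    outputs = [member(x) for member in members]
    return sum(outputs) / len(outputs)


def make_stand_in_member(seed, n_inputs):
    """Hypothetical stand-in for a trained network: a seeded random linear map."""
    rng = random.Random(seed)
    weights = [rng.uniform(0.0, 1.0) / n_inputs for _ in range(n_inputs)]
    return lambda x: sum(w * xi for w, xi in zip(weights, x))


# Ten members that differ only in their random seed, as in the hybrid structure.
members = [make_stand_in_member(seed, n_inputs=16) for seed in range(10)]
train_idx, test_idx, val_idx = split_sections(65)
prediction = hybrid_predict(members, [0.5] * 16)
```

Averaging over members trained from different seeds reduces the dependence of the final prediction on any single random initialization, which is the motivation for the hybrid structure.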
As for the number of neurons in the hidden layer, it has been suggested that this number be chosen between the number of output variables and the sum of the numbers of input and output variables (Heaton, 2010). Thus, for each of the above-mentioned data proportions, the number of neurons in the hidden layer was increased in steps of 2 over the range of 1 to 17 neurons. In each step, the hybrid structure for the neural network shown in Fig. 6 was used to develop 10 similar ANN models, each trained with a different initial seed number. The best hybrid model was then identified by comparing the performance of these models. Table 5 shows the codes assigned to each of the developed hybrid models. As mentioned earlier, the differences between these models are associated with the number of neurons used in the hidden layer as well as the data proportions used to train, test and validate the model. The RMSE was used as the measure of performance to compare the developed hybrid ANN models with each other and to identify the best-performing model. This measure has also been used successfully in previous research and is calculated using the RMSE expression given in the model development section.
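The search over hidden-layer sizes and data proportions described above can be enumerated with a short Python sketch. The variable names are illustrative, and the ordering of the configurations is an assumption that is not meant to reproduce the model codes of Table 5.

```python
from itertools import product

N_INPUTS, N_OUTPUTS = 16, 1
MEMBERS_PER_HYBRID = 10

# Hidden-layer sizes from the number of outputs up to inputs + outputs, step 2.
neuron_options = list(range(N_OUTPUTS, N_INPUTS + N_OUTPUTS + 1, 2))

# The two train/test/validation proportions tested in this research (percent).
split_options = [(70, 15, 15), (60, 20, 20)]

# One hybrid model per combination of data proportion and hidden-layer size.
configurations = [
    {"neurons": n, "split": s}
    for s, n in product(split_options, neuron_options)
]

n_hybrid_models = len(configurations)
n_networks = n_hybrid_models * MEMBERS_PER_HYBRID
```

With 16 input variables this gives 9 hidden-layer sizes and 2 data proportions, i.e. 18 hybrid models, each of which averages 10 individually trained networks.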

Sensitivity analysis of the best Hybrid ANN model
In this section, the results of the sensitivity analysis performed on the best developed ANN model (NN6), using the randomization test procedure shown in Fig. 3, are presented. The results presented in Fig. 8 indicate the degree of influence of each input variable on the model output. In this figure, the variables shown on the horizontal axis are those previously defined in Table 2. The results of the sensitivity analysis were then used to evaluate the performance of new ANN models developed with a reduced number of input variables. In each step, the least influential variable was eliminated and a range of new ANN models, comprising different numbers of neurons in the hidden layer, was developed. The optimum ANN model for that step was identified as the model with the maximum R² value. The various conditions under which these ANN models were developed and evaluated are listed in Table 6. Moreover, the characteristics of the best ANN models developed with different numbers of input variables are presented in Table 7. The performance of the best ANN models in each step was then compared to identify the ANN model with the best overall performance, defined as the model with the lowest RMSE and no overfitting issue. The variation of the RMSE against the reduced number of input variables is shown in Fig. 9. The overfitting issue was investigated using the plots of R² and adjusted R² values against the number of input variables in accordance with Fig. 10. The input variables of the ANN model with the best overall performance, in descending order of their significance, are: effective roadway width, road lighting status, standard deviation of vehicle speeds, percentage of drivers violating the speed limit, AADT, percentage of heavy goods vehicles, density of roadside commercial and industrial land uses, density of median U-turns, density of local access roads, and effective width of the left-side shoulder.
A comparison of the two best-performing model types, namely the Zero-truncated Poisson model and the hybrid ANN model, indicates that these models have 9 input variables in common. These common input variables are: standard deviation of vehicle speeds, road lighting status, density of commercial and industrial land uses, density of local access roads, percentage of heavy goods vehicles, effective roadway width, AADT, effective width of the left-side shoulder and density of median U-turns.
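The idea behind a randomization test for sensitivity analysis can be sketched as follows. This is one common form (permutation of a single input column and measurement of the resulting increase in RMSE) and is an assumption, since the exact procedure of Fig. 3 is not reproduced here. The model and data are hypothetical stand-ins.

```python
import math
import random


def rmse(observed, expected):
    """Root mean square error between two equally long sequences."""
    t = len(observed)
    return math.sqrt(sum((o - e) ** 2 for o, e in zip(observed, expected)) / t)


def permutation_sensitivity(model, X, y, n_repeats=20, seed=0):
    """Mean increase in RMSE when each input column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = rmse(y, [model(row) for row in X])
    scores = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the data set with only column j randomized.
            shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(rmse(y, [model(r) for r in shuffled]) - baseline)
        scores.append(sum(increases) / n_repeats)
    return scores


# Hypothetical model that depends on the first input only.
def stand_in_model(row):
    return 2.0 * row[0]


data_rng = random.Random(1)
X = [[data_rng.random(), data_rng.random()] for _ in range(30)]
y = [stand_in_model(row) for row in X]
scores = permutation_sensitivity(stand_in_model, X, y)
```

Variables whose randomization causes the largest loss of accuracy are ranked as the most influential, and the variable with the smallest score is the first candidate for elimination in the step-wise reduction described in this section.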

Summary of results, discussion and conclusions
In this research, two types of models were developed to predict the frequency of traffic accidents along the road carriageway sections located at the leading parts of the main access roads to three major cities of Iran: probabilistic models and ANN models. A Zero-truncated Poisson model, comprising 11 input variables, was identified as the best probabilistic model. As for the ANN models, a Hybrid ANN model, consisting of 10 similar ANN models, each comprising 10 input variables and 11 neurons in the hidden layer, and developed with train-test-validation data proportions of 70-15-15 percent, demonstrated the best performance in comparison with the other developed ANN models. All of the input variables of both models are measured along each road carriageway section.

Table 6. Conditions under which the ANN models were developed and evaluated
No. of input variables | No. of neurons in the hidden layer | No. of data proportions | No. of developed ANN models
16 | 1-3-5-7-9-11-13-15-17 | 2 | 18
15 | 1-3-5-7-9-11-13-15 | 2 | 16
14 | 1-3-5-7-9-11-13-15 | 2 | 16
13 | 1-3-5-7-9-11-13 | 2 | 14
12 | 1-3-5-7-9-11-13 | 2 | 14
11 | 1-3-5-7-9-11 | 2 | 12
10 | 1-3-5-7-9-11 | 2 | 12
9 | 1-3-5-7-9 | 2 | 10
8 | 1-3-5-7-9 | 2 | 10
7 | 1-3-5-7 | 2 | 8
6 | 1-3-5-7 | 2 | 8
5 | 1-3-5 | 2 | 6
4 | 1-3-5 | 2 | 6
3 | 1-3 | 2 | 4
2 | 1-3 | 2 | 4
1 | 1 | 2 | 2
Total No. of developed ANN models: 160

The best ANN and probabilistic models were then validated with the data related to the main access roads of the city of Isfahan. This fresh data, which was not used in the model development process, was used to assess the generalization ability of the best Zero-truncated Poisson and Hybrid ANN models. For this purpose, the RMSE and R² values of these two models were calculated for the modeling and validation data sets. The results are presented in Table 8. Furthermore, a comparison of the normalized values of accident frequencies in different sections obtained from the final Zero-truncated Poisson and Hybrid ANN models with their corresponding observed values is presented in Fig. 11, indicating better overall performance of the Hybrid ANN model. The results indicate an accuracy of 92 to 94% for the best overall ANN model. Moreover, the R² value for the ANN model under the validation data set is even higher than the R² value obtained using the modeling data set, and it is higher than the corresponding value for the Zero-truncated Poisson model. Hence, it could be concluded that the performance of both models is satisfactory in terms of accuracy and transferability. However, the performance of the best overall ANN model is somewhat better than that of the Zero-truncated Poisson model. It should be noted that in this research, the frequency of accidents was considered as the dependent variable or the output of the developed models, meaning that no distinction was made between different types of accidents, i.e. fatal, injury, or property-damage accidents.
Thus, future research in which the severity of accidents is also taken into account would be very useful.