Data-Driven Urban Traffic Accident Analysis and Prediction Using Logit and Machine Learning-Based Pattern Recognition Models

Modeling accident severity on the basis of the most effective variables yields a high-precision model that gives the probability of occurrence of each category of future accident, which authorities can use to prioritize corrective measures. The purpose of this study is to identify the variables affecting the severity of injury, fatal, and property damage only (PDO) accidents in Rasht city, using information on urban accidents collected from March 2019 to March 2020. To this end, multiple logistic regression and the pattern recognition type of artificial neural network (ANN), as a machine learning solution, are used to identify the variables with the greatest influence on accident severity and the superior approach for accident prediction. Results show that multiple logistic regression with the forward stepwise method has an R² of 0.854 and a prediction accuracy of 89.17%. Accidents occurring between 18:00 and 24:00 and the KIA Pride vehicle have the highest effect on increasing accident severity. The most important result of the logit model accentuates the role of environmental variables, including poor lighting conditions and unfavorable weather, and the dominant role of unsafe, poor-quality vehicles in increasing accident severity. In addition, the machine learning model performs considerably better than the logit model, with a higher prediction accuracy (98.9%). The ANN model's greater power to predict and estimate future accidents is confirmed through performance and sensitivity analysis.


Introduction
Transportation, like any other industry, has disadvantages and limitations for road users alongside its advantages [1,2]. Traffic and its related problems have been on the rise all over the world and have had a detrimental effect on people's lives and property. Urban pollution, rising fuel and energy consumption, millions of hours wasted each day in traffic congestion, wasted community service facilities and national assets, and, ultimately, accidents resulting in injury, death, and property damage are the result of poor traffic facilities and conditions [3][4][5].
Road traffic accidents now represent the eighth leading cause of death globally. Road traffic deaths rose to 1.35 million a year in 2016, with up to 50 million people injured; that is, nearly 3700 people die on the world's roads every day. In Iran, the reported number of road traffic deaths in 2016 was 15932 [6]. The statistics also reveal a high number of injury and property damage only (PDO) accidents on urban and suburban roads in the country. Therefore, the factors affecting the severity of accidents should be investigated meticulously to provide practical solutions for improving safety and reducing the high number of accidents.
Various studies on traffic safety have been carried out in recent years using multiple logistic regression. Sherafati et al. explored road traffic fatalities after receiving emergency services using multiple logistic regression in Langarud. Results showed that males, motorbikes, and pedestrians had a positive and significant relationship with fatal accidents [7]. Intini et al. conducted a study to investigate the relationships between road familiarity/unfamiliarity and the occurrence of accidents using multiple logistic regression. Factor analysis is a vital step in many applications. The factors of minor intersections/driveways, autumn/winter, and speed limits below 80 km/h were more related to crashes involving familiar drivers, whereas head-on and rear-end accidents, summer, and heavy vehicles were more related to crashes involving unfamiliar drivers [8]. Also, Casado-Sanz et al. applied a logistic regression model to investigate the impact of age on accident severity on rural crosstown roads and concluded that female drivers and motorbikes had a negative impact on the likelihood of accidents [9].
Generally, metaheuristic algorithms and machine learning techniques have been widely used in different engineering studies, especially in transportation problems, which need complex and accurate solutions. Because they can handle more complex functions and classification problems, these techniques provide more accurate prediction models than statistical methods [10][11][12][13][14][15][16][17][18][19][20][21][22][23][24][25]. Pattern recognition tools and their accurate analysis using optimized prediction tasks have been a popular topic over the last two years [26][27][28][29][30]. In addition, various prediction methods have been applied to different engineering problems as new datasets have emerged [31][32][33][34][35]. Some studies have applied traditional methods such as regression models alongside machine learning approaches to validate ANN-based solutions as an effective alternative for predicting the severity of accidents with higher accuracy [36][37][38][39][40][41][42][43][44]. Because accidents are so closely connected to human life, machine learning approaches are applied to important tasks, such as predicting the type and severity of accidents, owing to their higher precision [45,46]. To understand the effects of the determining factors in an activity and to predict future incidents, nonlinear links between variables may be modeled with different forms of ANN [47][48][49].
Chimba and Sando investigated the severity of traffic accidents using ANN and compared this method's accuracy with that of the ordered probit model. Results indicated that the prediction accuracy of the ANN was higher [50]. Khair et al. presented an accident prediction model using a new framework of an artificial neural network (ANN). They affirmed that the predicted collisions were similar to the number of real incidents and on that basis endorsed the proposed model [51]. Omrani presented multiple logistic regression and ANN models to predict individuals' travel mode in Luxembourg. Results showed that the ANN models performed better [52]. To provide a sound prediction of traffic collisions in metropolitan areas of Nuevo León, Contreras et al. utilized an innovative ANN model. The programming feature of the Scilab software was used in that analysis to verify the highest sensitivity on the expected neural network [53]. Amin used the backpropagation ANN approach to investigate gender characteristics of older driver accidents and to model the variables of their accidents, and illustrated that journey purpose was the greatest contributor to accident risk for older drivers, with lighting condition the second most important factor [54]. Ghasedi et al. used the logit model, factor analysis, and machine learning approaches to identify the most effective variables in suburban accidents and to generate the most accurate model for predicting future accidents on the busiest suburban highway of Guilan Province, located in the north of Iran. Their studies showed the outstanding role of environmental factors, such as rainy weather and inadequate lighting conditions, in increasing the severity of pedestrian accidents [55].

Study Area and Methodology
Rasht is one of the most congested cities in Iran and the most populous city in the north of the country, with high traffic volumes on most days of the year. According to the official 2016 census, its population density was 414 people per square kilometer [56]. Moreover, more than 60% of the accidents in Rasht between 2019 and 2020 occurred on the city's inner ring road. The factors that increase the probability of accidents vary from city to city, and because of the congested traffic flows, even on ring roads, and the frequent conflicts between traffic streams in several parts of the city, a separate study of the Rasht metropolis is needed. Given these conditions, together with the very dense urban fabric and the numerous conflicts between nonmotorized users and passing vehicles, identifying the variables with the greatest influence on accident severity and providing the most accurate prediction model for future accidents are of great importance. As mentioned earlier, different approaches have been used worldwide to analyze the number and severity of accidents under specific crash conditions, including crashes of unknown severity, high-severity crashes, and crashes expected to occur at some point in the future. In this study, however, given the balance of the dataset across the three categories of fatal, injury, and PDO accidents, we decided to use logit and the pattern recognition type of ANN, as a machine learning approach, to analyze the data and provide a prediction model. Research on road safety improvements requires that accident information, including the time of the accident, human characteristics, environmental characteristics, and accident type, be collected meticulously.
Consecutive visits to the Rasht Traffic Police Statistic Center led to the collection of information on accidents over the 12 months from March 2019 to March 2020. In total, the statistical population includes 965 accidents that occurred in Rasht, 738 of which are urban accidents and the rest suburban accidents. The dependent variable in this study is the severity of accidents, which is classified into three categories: fatal, injury, and PDO accidents. Because the number of fatal accidents is small in comparison to the total, and the significance of the independent variables and the goodness of fit of a model cannot be satisfied with three categories of the dependent variable, fatal accidents have been merged with injury accidents, and the dependent variable has been split into two categories: PDO and injury/fatal accidents [55]. Table 1 classifies the independent variables influencing the occurrence of accidents in Rasht city and gives the coding used for each of them in modeling. The remainder of this article is structured as indicated in Figure 1.
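The merging of fatal and injury accidents into a single class can be illustrated with a minimal Python sketch. The label strings, the 0/1 coding, and the function name `encode_severity` are assumptions made for illustration only; the actual coding used in the study is the one given in Table 1.

```python
# Hypothetical mapping from the three recorded severity levels to the
# binary outcome used for modeling (0 = PDO, 1 = injury/fatal).
RAW_TO_BINARY = {"PDO": 0, "injury": 1, "fatal": 1}


def encode_severity(raw_labels):
    """Collapse three-level severity labels into the two modeling classes."""
    return [RAW_TO_BINARY[label] for label in raw_labels]


# Illustrative records only; these are not taken from the Rasht dataset.
sample = ["PDO", "injury", "fatal", "PDO"]
print(encode_severity(sample))
```

Merging the rare class keeps every fatal record in the analysis while avoiding a category too small to support stable coefficient estimates.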

Statistical and Machine Learning Forecasting Approaches

Multiple Logistic Regression.
Connecting a set of variables X to a dependent variable Y is a multivariate problem. Different types of mathematical models have been used to analyze such problems, depending on the complexity of the relationship. Logistic regression is a mathematical method used to describe the relationship between several variables x and a binary dependent variable. The function used in this method is an S-shaped function called the logistic function. The application of logistic analysis is not limited to the binary case, however: extending the logistic function allows multicategory problems to be handled, so logistic regression can also be used to model a variable Y with more than two categories. In the simplest case, the probability $P_i = P(Y = i)$ can be considered a linear function of the covariates,

$$P_i = x_i \beta,$$

where $\beta$ is the vector of regression coefficients. However, the probability $P_i$ on the left side of the equation must lie between zero and one, whereas the linear product $x_i \beta$ on the right side can take any real value. A simple way to solve this problem is to transform the probability so as to remove the range restriction and to model the transformed quantity as a linear function of the parameters. This conversion takes place in two stages [57]. First, the probability $P_i$ is converted into the odds of success:

$$\mathrm{odds}_i = \frac{P_i}{1 - P_i}. \tag{1}$$

In the second step, the logarithm of equation (1) is taken, which gives the logit, or log-odds of success:

$$\mathrm{logit}(P_i) = \ln\frac{P_i}{1 - P_i} = x_i \beta. \tag{2}$$

The inverse transfer function, called the antilogit, is used to recover the probability from the logit:

$$P_i = \frac{e^{x_i \beta}}{1 + e^{x_i \beta}}. \tag{3}$$

The logit is a transfer function that maps probabilities from the range (0, 1) onto all real numbers. Negative logits indicate probabilities of less than 50%, and positive logits indicate probabilities of more than 50%. Therefore, the logistic model is a generalized linear model with a logit transfer function. In other words, the logit of the probability $P_i$, rather than the probability itself, follows the linear model [58].
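The two-stage transformation described above (probability to odds, odds to log-odds) and its inverse can be sketched in a few lines of Python. The function names and the coefficient values passed in are illustrative assumptions, not the fitted model of this study.

```python
import math


def odds(p):
    """Odds of success for a probability p in (0, 1)."""
    return p / (1.0 - p)


def logit(p):
    """Log-odds of success: maps (0, 1) onto the whole real line."""
    return math.log(odds(p))


def inv_logit(eta):
    """Antilogit: maps a real-valued linear predictor back to (0, 1)."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)  # avoids overflow for large negative eta
    return z / (1.0 + z)


def predict_probability(x, beta):
    """Probability implied by the linear predictor x . beta."""
    eta = sum(xi * bi for xi, bi in zip(x, beta))
    return inv_logit(eta)


# A probability below 50% gives a negative logit, above 50% a positive one.
print(logit(0.2), logit(0.8))
# Hypothetical covariates and coefficients, for illustration only.
print(predict_probability([1.0, 1.0, 0.0], [-2.0, 1.5, 0.7]))
```

Whatever values the coefficients take, `predict_probability` returns a number strictly between zero and one, which is exactly the property the raw linear form lacks.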

Neural Network.
Neural networks have a significant ability to detect complicated relations in data and can be used to extract and classify patterns that humans and other computational methods find extremely difficult to discern [45]. ANNs have been interpreted as nonlinear systems modeled on the activity of the human brain [59,60]. Researchers can take advantage of these networks' valuable ability to recognize unknown relationships in natural and complex systems. Because of the close relationship between accidents and the well-being of society, and their direct and indirect influence on human lives, ANNs have been considered a powerful and highly precise approach to the problem of accidents [61]. Therefore, in this study, to find the most accurate prediction model, urban accidents in Rasht city have been modeled by the pattern recognition type of ANN, which can yield practical outputs for reducing the severity and number of accidents.

Modeling Using Multiple Logistic Regression.
In this section, multiple logistic regression is used to determine the effect of each independent variable on the dependent variable (severity of accidents). To model the severity of accidents in Rasht, 63 independent variables and a dependent variable with three categories are initially defined. Because of the small number of fatal accidents (7 cases), the dependent variable is reduced to two categories: fatal/injury and PDO accidents. The logit model can be built with the enter, forward stepwise, or backward stepwise method. It must therefore be examined which of the three methods gives the most suitable output, or in other words, which method provides the better model for accidents in Rasht.
To determine this, higher prediction accuracy and the goodness-of-fit index of the model are used as the criteria for identifying the best model. The coefficient of determination (R²) indicates the goodness of fit of the model. More specifically, the closer R² is to 1, the better the model's goodness of fit, and the higher the percentage of correct predictions, the greater the model's power to predict accidents. Table 2 summarizes the three methods in terms of the two criteria of prediction accuracy and goodness of fit (R²). The forward stepwise method, with a correct percentage of 89.17% and an R² value of 0.854, is selected as the best method for building the logit model of accident severity in Rasht. Table 3 gives the chi-square, degrees of freedom (df), and significance (sig) of the forward method in the modeling process. The chi-square statistic is used to determine the effect of the independent variables on the dependent variable and, in general, the fit of the model; it is comparable to the F-statistic in ordinary regression analysis. The chi-square of the model in Step 23 is equal to 61.038, with a significance value of less than 5%. Since the significance of the forward model is below 5 percent, the capability of the model to predict accidents is confirmed.
Thus, the independent variables affect the dependent variable, and the model shows a good fit. With the forward stepwise method and all 63 selected variables entered in the modeling process, the final model is obtained in 23 steps. In this model, 11 variables are identified as the most effective variables in the severity of accidents leading to injury/fatal and PDO accidents in Rasht. Table 4 shows the variables entered in the model and their statistical indexes, including the Wald statistic and its significance (sig). The Wald test examines the significance of each variable in the regression equation and is comparable to the t-statistic in ordinary regression.
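As a minimal sketch of how a Wald statistic and its significance are obtained for a single coefficient, the Python snippet below squares the ratio of the coefficient to its standard error and refers the result to a chi-square distribution with one degree of freedom. The coefficient and standard error used here are hypothetical; the fitted values belong to Table 4 and are not reproduced. The function names are also assumptions.

```python
import math


def wald_statistic(beta, std_error):
    """Wald chi-square for one coefficient: (beta / SE)^2."""
    return (beta / std_error) ** 2


def wald_p_value(wald):
    """Upper-tail probability of a chi-square variable with 1 df.

    For 1 df, P(chi2 > w) = P(|Z| > sqrt(w)) = erfc(sqrt(w / 2)).
    """
    return math.erfc(math.sqrt(wald / 2.0))


def odds_ratio(beta):
    """Multiplicative change in the odds per unit increase in the variable."""
    return math.exp(beta)


# Hypothetical coefficient and standard error, for illustration only.
beta, std_error = 1.2, 0.4
w = wald_statistic(beta, std_error)
print(w, wald_p_value(w), odds_ratio(beta))
```

A variable is retained as significant at the 5% level when its p-value falls below 0.05, which corresponds to a Wald statistic above roughly 3.84.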
According to Table 4, the most effective variables increasing the severity of vehicle accidents are, in order, accident time 18:00-24:00, accident time 12:00-18:00, the KIA Pride vehicle, and rainy weather. Conversely, the most influential variables with negative coefficients, which decrease the severity of accidents, are daytime, accident time 6:00-12:00, and sunny weather. The most important result of the logit model underlines the role of environmental factors, including poor lighting conditions and unfavorable weather, and the dominant role of unsafe, poor-quality vehicles in increasing the severity of accidents.

Modeling Using Artificial Neural Network.
An ANN prediction model can be created using a variety of neural networks. Given the qualitative data used in this analysis, the prediction model is created using a neural network with pattern recognition capabilities. Using either supervised or unsupervised classification, pattern recognition divides input data into objects or classes based on their main characteristics [55]. The machine learning method uses the same input attributes and output labels as the variables described in Table 1. It is worth noting that the dependent variable (output class) is the level of accident severity, as mentioned in the "Study Area and Methodology" section. For vehicle accidents, it has been split into two categories: fatal/injury and PDO. The ANN used in this analysis relies on an existing algorithm in the software. The neural network's input data are divided into three subsets: (1) Training: these samples are presented to the network during training for the learning process, and the network is adjusted according to its error. (2) Validation: these samples are used to measure network generalization and to halt training when generalization stops improving. (3) Testing: these samples have no effect on training and so provide an independent measure of network performance during and after training; in other words, the test set is the main criterion for judging how closely the neural network's outputs match the actual results. Table 5 shows the details of the accident data entered into the software, together with the Mean Squared Error and Percent Error. Since the number of accidents is adequate for the network training process, 70% of the data are used for network training, 15% for validation, and the remaining 15% for testing the developed network.
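The 70/15/15 division of the 738 urban accident records can be sketched as a random partition of record indices. This is a minimal Python illustration under stated assumptions: the function name `split_indices`, the fixed seed, and the rounding rule are choices made here, and the software used in the study may assign records to subsets differently.

```python
import random


def split_indices(n_records, train_frac=0.70, val_frac=0.15, seed=0):
    """Randomly partition record indices into training, validation, and test sets.

    The test set receives whatever remains after the first two subsets,
    so the three parts always cover every record exactly once.
    """
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_train = round(train_frac * n_records)
    n_val = round(val_frac * n_records)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


train, val, test = split_indices(738)
print(len(train), len(val), len(test))
```

Only the training indices are used to adjust the weights; the validation indices decide when to stop, and the test indices are held back for the final, independent accuracy estimate.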

The Results of the Confusion Matrix.
As can be seen in Figure 2, the overall ("all") confusion matrix combines the results of the three processes of training, validation, and testing of the network. Of the 632 property damage accidents, all 632 are predicted correctly by the model, and of the 106 injury/fatal accidents, 98 are predicted correctly. The prediction accuracy is therefore 100% for property damage accidents and 92.5% for injury/fatal accidents, so the model separates the two classes at a high level of accuracy. The accuracy of the whole model in determining the severity of accidents is 98.9% (730 of 738 cases); in other words, the model correctly classifies 98.9% of the accidents in terms of the effective parameters. Figure 3 shows that the training process stopped after 92 iterations. The point marked on the diagram is the point beyond which the results no longer improve; with a mean squared error (MSE) of 0.0231, it is the best point at which to end the calculations and finalize the ANN for the given data.
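The percentages quoted above follow directly from the four cells of the overall confusion matrix. The short Python check below recomputes them from the counts given in the text; treating injury/fatal as the positive class and the function name `classification_summary` are assumptions made for this sketch.

```python
def classification_summary(tp, fn, tn, fp):
    """Accuracy, sensitivity, and specificity from confusion-matrix counts."""
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # share of positives correctly found
        "specificity": tn / (tn + fp),  # share of negatives correctly found
    }


# Counts from the text, with injury/fatal taken as the positive class:
# 98 of 106 injury/fatal and 632 of 632 PDO accidents classified correctly.
summary = classification_summary(tp=98, fn=8, tn=632, fp=0)
for name, value in summary.items():
    print(f"{name}: {100 * value:.1f}%")
```

The eight misclassified cases are all injury/fatal accidents predicted as PDO, so the remaining error lies entirely in the more severe class.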

Sensitivity and Specificity Analysis of the Neural Network for the Given Accident Data.
Figure 4 shows the sensitivity analysis of the generated ANN model, plotting the true positive rate against the false positive rate in a receiver operating characteristic (ROC) graph for accidents in Rasht. The ROC graph is a technique for visualizing, organizing, and selecting classifiers based on their performance [62]. Its popularity comes from several well-studied characteristics, such as intuitive visual interpretation of the curve and easy comparison of multiple models [63]. The ROC curve is used when the performance of a multiclass classification problem needs to be checked or visualized. In this model, 70% of the data are considered for training, 15% for testing, and 15% for validation. As shown in Figure 4, Class 1 indicates the accuracy of the network prediction for existing accidents, and Class 2 indicates the network's accuracy for future accidents. The closer the curves lie to the top and left edges of the graph, the more powerful the network is in estimating and predicting accurately [64].
Mathematical Problems in Engineering
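How an ROC curve is traced can be shown with a small, self-contained Python sketch: the decision threshold is swept from the highest score downward, and each step records the false positive rate and true positive rate reached so far. The labels and scores below are invented for illustration and are not outputs of the network in this study; the function names are likewise assumptions.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs for an ROC curve.

    labels: 1 for the positive class, 0 for the negative class.
    scores: classifier outputs, higher meaning "more likely positive".
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Records sharing a score cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under the ROC curve (1.0 is a perfect classifier)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical labels and scores, for illustration only.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
curve = roc_points(labels, scores)
print(curve)
print(area_under_curve(curve))
```

A curve that rises quickly along the left edge and then runs along the top encloses an area close to 1, which is the numerical counterpart of the visual criterion described above.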

Conclusion and Safety Approaches
In this study, which aims to investigate the main causes and severity of urban accidents in Rasht city, two accident analysis approaches are adopted and compared in order to provide practical solutions for increasing overall safety and reducing the number of accidents within the city. Using multiple logistic regression and the pattern recognition type of ANN, the variables affecting the severity of accidents and the most powerful approach for predicting accidents in Rasht have been presented. The most important results and accident prevention strategies are as follows:
(1) Comparing the percentage of correct predictions of multiple logistic regression and the machine learning model shows that the ANN model performs better and has higher prediction power than the logit model. More specifically, the prediction accuracy of the ANN is 98.9%, while that of the logit model is 89.17%. In other words, the prediction error rate of the ANN model is 1.1%, against 10.83% for the logit model, which justifies the use of the ANN model. In addition, the sensitivity analysis (ROC) diagram of the ANN approach confirms the high power of this model in predicting urban accidents.
(2) The multiple logistic regression results show that the forward stepwise method is the best method for building the logit model of accident severity in Rasht city, considering the two criteria of goodness of fit (R² of 0.854) and prediction accuracy (89.17%).
(3) According to the logit model results, the variables of accident time (12:00-18:00 and 18:00-24:00), rainy weather, accident reason (lack of attention), head-on collision, and KIA Pride vehicle have positive coefficients and increase the severity of accidents. In particular, the significant role of the KIA Pride vehicle in the occurrence of accidents, especially in the 18:00-24:00 period (evening peak-hour traffic), is noticeable. Officials are therefore expected to make major improvements to lighting facilities and urban road infrastructure and, in collaboration with car manufacturers, to improve the quality of mass-produced vehicles.
(4) The logit model also shows that the interaction of darkness and head-on collision increases the likelihood of accidents, which may be due to drivers' poor visibility and cognitive distraction. A greater police presence and speed control strategies, especially at night (between 18:00 and 24:00) and on rainy days, are among the best ways to reduce traffic accidents. Last but not least, increasing penalties for mobile phone use to reduce accidents caused by lack of attention, and providing warning signs or pavement-based warning techniques, including pavement markers and rumble strips, may help reduce accident risk.
Since statistical analysis and programming models usually cannot consider all the required details of the problem, future studies are advised to utilize Geographic Information System (GIS) roadway profile data along with further deep learning and optimization techniques to obtain an in-depth analysis and find the most desirable solutions [65][66][67][68]. Moreover, given the importance of pedestrian accidents and the constant interaction between nonmotorized users and vehicle flows within a city, it is recommended that further analytical methods and the pattern recognition type of machine learning approach be used to support better decision making and to provide the most accurate prediction model specifically for pedestrian accidents in urban environments [55,69]. Finally, for a more in-depth analysis of pedestrian accidents and to find the most dangerous conflicts, the use of AI-based object detection and image processing approaches is strongly suggested.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.