Analysis of Factors Affecting the Severity of Automated Vehicle Crashes Using XGBoost Model Combining POI Data



Introduction
Autonomous vehicle (AV) technology has the potential to reduce crashes significantly. More than 30 thousand people die from traffic accidents every year in the US, with 2.2 million accidents resulting in injuries [1]. Traffic crashes cost the economy $277 billion a year, twice as much as congestion [2]. Over 40% of fatal accidents involve alcohol, distraction, drug addiction, or fatigue. Driver error is the leading cause of 90% of accidents. Even crashes caused mainly by vehicles, roadways, and environmental conditions are accompanied by some human factors (e.g., inattention, distraction, or speeding). With the popularization of AV technology, driver errors may disappear, indicating the possibility of reducing the fatal accident rate by at least 40% [3]. Therefore, clarifying how different influencing factors affect the severity of AV crashes is of considerable significance in comprehensively improving the safety of AVs.
Safety is the primary factor driving the development of AV technology. Previous literature has concentrated on the various advanced driver assistance systems (e.g., forward collision warning, vehicle collision warning, and lane departure warning systems), traffic signal control (e.g., actuated signal control and cooperative adaptive cruise control), and accident responsibility [4][5][6]. However, designing a system that can operate safely in any unexpected circumstances remains a daunting challenge. The existing AV technology still has certain limitations in terms of technical indicators and driving environment requirements: (1) The robustness of environmental perception and visual recognition needs to be improved [7]. (2) Multistrategy decision-making algorithms in AV technology lack measures against abnormal behavior [8]. (3) Although AV technology can assist the driver in completing the driving task to a certain extent, it may also affect the driver. Hence, further studies on driver behavior are necessary [9].
Road safety is a complicated issue and is influenced by a series of risk factors, such as driver, environment, and vehicle factors. AVs will play an essential role in future transportation safety. Given the uncertainty in the safety of AVs, this study applies the publicly available Traffic Collision Reports of AV crashes in California to predict the severity of crashes involving AVs and analyze the effects of the different factors on crash severity. The knowledge gained from this research could contribute to the assessment and improvement of the safety performance of current AVs. The rest of the paper is organized as follows: Section 2 reviews the literature related to our study; Section 3 describes the dataset and correlated variables; Section 4 introduces the proposed methodology in detail; Section 5 discusses the model results; and, finally, Section 6 presents the conclusion and limitations.

Literature Review
Most previous studies on AV technology safety rely mainly on evaluating drivers' performance and behavior in a simulated environment and on developing the performance of autonomous driving systems in a closed field environment. Some research focuses on the driving trajectory of AVs to avoid potential collisions. Hegedus et al. proposed a local trajectory optimization algorithm based on nonlinear optimization, which can provide a dynamic, feasible, comfortable, and customizable trajectory for highly automated vehicles [10]. Omidvar [11] developed an algorithm for trajectory optimization of AVs at signalized intersections on a closed course. This algorithm optimizes signal control and provides the best trajectory for AVs. As for simulation studies regarding AV safety, many researchers use driving simulators as experimental tools. They focus on the driver's physiological and psychological responses in an autonomous driving environment. Winter et al. [12] found that drivers can divert their attention to secondary tasks in a highly automated driving environment without affecting the driving performance of the vehicle.
The California Department of Motor Vehicles (DMV) releases crash data involving AVs, and many machine learning models (e.g., logistic regression models [13], Classification and Regression Tree (CART) [14], neural network [15], and random forest [16,17]) have been utilized to identify the factors that contribute to the severity of crashes involving AVs. To investigate the factors contributing to the severity of AV-involved crashes, Wang et al. [18] developed CART models using California's crash reports from 2014 to 2018. Highways were recognized as the locations where severe injuries are likely to happen. Crash severity significantly increases if the AV is responsible for the crash. Xu et al. [19] conducted a study based on the binary logistic regression model using California data. The driving mode of AVs, collision location, roadside parking, rear-end collision, and one-way road are the main factors that contributed to the severity level of AV-involved crashes. Boggs et al. [20] investigated factors contributing to AV-involved crashes using a hierarchical Bayesian heterogeneity-based approach. According to this study, clear weather could reduce the likelihood of injury crashes involving AVs.
Chen and Guestrin [21] proposed a relatively novel technique in 2016: eXtreme Gradient Boosting (XGBoost). It has high precision and fast processing speed as well as lower cost and complexity. Two studies [22,23] have shown that XGBoost is more accurate than other machine learning techniques (logistic regression, SVM, deep neural network, etc.) in predicting the likelihood of an accident. Meng et al. [24] used XGBoost to combine multiple data sources, including geometric road design, historical accident data, and weather data, to predict the occurrence and duration of accidents. Fan et al. [25] also used artificial neural networks to integrate multiple XGBoost models to predict the duration of the accident. Finally, as an ensemble algorithm, it is not affected by the multicollinearity of data.
However, the lack of reliable data and insufficient data sources have limited studies on accident analysis, especially for the accident mechanisms of AVs. Fortunately, reliable points-of-interest (POI) data can be collected from anywhere globally, providing a broad space for detailed accident detection [26]. Although POI data may not be among the typical factors used in traditional traffic accident analysis, they are specific data on land-use factors with precise location information [27]. Additionally, they are expected to be highly correlated with traffic accidents at both the macro and micro levels. The current study employs POI data to describe the built environment in place of traditional land-use data. POI data specify the city's infrastructure distribution and have much better statistical granularity [28]. Simpson's diversity index is selected as the POI diversity evaluation index to quantify the diversity of land-use patterns in the buffer zone.
The primary purpose of this study is to use the XGBoost model incorporating POI data to predict the severity of crashes involving AVs and investigate the effects of the different factors on crash severity. This study employed 94 crash reports involving AVs in California received in 2019. The Synthetic Minority Oversampling Technique (SMOTE) was applied to address the imbalanced data. Ultimately, the knowledge gained from this study could contribute to the assessment and improvement of the safety performance of the current AVs.

Data Sources.
With the implementation of California Senate Bill 1298, the Department of Motor Vehicles (DMV) required that crash reports involving AVs be provided within ten business days of the crash occurrence [19]. This study employed 94 crash reports involving AVs in California received in 2019. Information was manually extracted from crash reports submitted by various manufacturers for a comprehensive understanding of AV-related information (e.g., type of collision, manufacturer's name, crash severity, vehicle information, and weather). A large number of reports did not record vehicle speed before the crash. Thus, speed was not adopted for model development.
This paper analyzes the diversity of land-use patterns based on POI data because of the lack of traffic volume and land-use data. The integration of traffic accident data and POI data can enable a more accurate identification of the effect of land-use intensity on traffic safety [29]. The POI data were obtained from the Google Maps Application Programming Interface (API). The buffer analysis and cross summary toolboxes in ArcGIS were used to match the POI data according to the latitude and longitude of the accident site. Different types of POI may have different effects on traffic status, but some types have similar functions. Therefore, the POIs are divided into four major categories, as shown in Table 1.
Simpson's diversity index was selected as the POI diversity evaluation index to quantify land-use development intensity in the buffer zone, as shown in the following equation:

$$D = 1 - \sum_{i} \left(\frac{N_i}{N}\right)^{2},$$

where $N_i$ and $N$ represent the number of POIs of a specific type $i$ and the total number of POIs, respectively. The larger the value of $D$, the higher the diversity of POIs. Figure 1 illustrates the distribution of collision type. Rear-end collisions are the primary type of crash, accounting for 64%. The AV's road environment perception system might trigger emergency braking in these cases, although the reports do not state this explicitly.
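As a minimal sketch of how the diversity index could be computed for one buffer zone, the function below assumes the common form D = 1 - sum((N_i / N)^2), which agrees with the statement that a larger D means higher diversity. The function name, the category labels, and the handling of an empty buffer are illustrative assumptions, not taken from the study.

```python
from collections import Counter


def simpson_diversity(poi_types):
    """Return D = 1 - sum((N_i / N) ** 2) for a list of POI category labels."""
    total = len(poi_types)
    if total == 0:
        # Assumption: a buffer without any POI is treated as having no diversity.
        return 0.0
    counts = Counter(poi_types)  # N_i for each POI category
    return 1.0 - sum((n_i / total) ** 2 for n_i in counts.values())


# Hypothetical buffer zone; the labels are placeholders, not the Table 1 categories.
buffer_pois = ["commercial", "commercial", "residential", "residential"]
print(simpson_diversity(buffer_pois))
```

A buffer containing a single POI category yields D = 0, and D approaches 1 as POIs spread evenly over more categories.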

Statistical Analysis.
According to the statistical data, most crashes are conventional vehicles hitting the rear of AVs [30]. Furthermore, rear-end collisions usually occur at intersections because vehicle trajectories at intersections are more complicated than those on road segments [31]. The other common collision types are sideswipe (15%), broadside (12%), and head-on (9%). Crashes involving AVs are caused primarily by the complicated interaction between AVs and conventional vehicles [32]. Therefore, specific attention should be given to the adverse effects of mixed traffic flow composed of AVs and conventional vehicles on the autonomous driving system while the penetration rate of AVs is low [33]. No AV-pedestrian or hit-object collisions are reported, which indicates the benefit of the road environment perception and motion control systems of AVs. Figure 2 describes the proportion of crashes for each company. Cruise has the most crash reports in 2019, accounting for 58%, followed by Waymo (25%). Cruise is a representative company because it has launched many test vehicles in congested San Francisco. By contrast, Waymo's test site is in Arizona. The traffic environment in San Francisco is much more complicated than Arizona's, with many intersections, steep hilly roads, and aggressive driving. Therefore, the probability of an emergency occurring is higher. Moreover, because of the insufficient sample size, it cannot be established which company's test vehicles are more prone to accidents. Figure 3 indicates the vehicle movement preceding the collision. The most common states of AVs and conventional vehicles before a collision are stopped and proceeding straight, respectively. Unexpected situations in front (e.g., a pedestrian crossing the road) may cause the AV to brake suddenly, while a conventional vehicle behind cannot evade in time, resulting in a rear-end collision. This is consistent with previous studies [30], in which most crashes are conventional vehicles hitting the rear of AVs.
The second most common movements of AVs and conventional vehicles are proceeding straight and changing lanes, respectively. Taking effective emergency avoidance measures immediately when a conventional vehicle makes an unsafe lane change is challenging for the automated driving system. Researchers have pointed out that AV technology still needs to overcome many barriers to respond accurately in complex traffic environments. Figure 4 shows that collisions involving AVs are generally much less severe than conventional accidents, particularly with respect to severe injuries and fatalities. Specifically, 81% of crashes are property-damage-only (PDO) crashes, and 19% involve minor injuries. Similarly, 72% of AVs sustained only minor damage, suggesting that the collisions occurred at low speeds. Speed and speed variation have frequently been regarded as critical factors closely connected with injury crashes [34,35]. AVs are not subject to human failings. Driver error is the leading cause of 90% of accidents. AV technology reduces crash severity by overcoming driver error (e.g., speeding, aggressive driving, inexperience, slow reaction times, inattention, and various other driver shortcomings). The specific location of the collision can be collected from the crash report, while the approximate latitude and longitude of each accident can be obtained through OpenStreetMap. We use ArcGIS software to draw the heat map of AV crashes (shown in Figure 5). It visualizes the distribution of accident locations among counties. The accidents mainly occurred in San Francisco and Palo Alto because they were the main test sites for AVs. In the future, the use of AVs will be extended to any corner of any city in the United States, making it necessary to analyze further the effects of land-use intensity around the accident site on the crash.

Variable Collinearity Analysis.
Multicollinearity refers to the situation in which several explanatory variables in a regression model are highly linearly related. As an ensemble algorithm, XGBoost is not influenced by the multicollinearity of the data; however, introducing excessive variables may cause overfitting of the model. Moreover, excessive variables increase the complexity of the model and may significantly reduce its interpretability. The variance inflation factor (VIF), a common indicator of multicollinearity, was calculated using SPSS 26.0 [36][37][38].
Generally, a VIF value higher than 10 indicates severe collinearity among the independent variables, which suggests that one of the collinear variables should be eliminated [39]. Finally, nine categorical variables were retained. The descriptive statistics of the variables are shown in Table 2.
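The VIF screening described above can be illustrated with a short sketch. The study used SPSS 26.0; the code below is only an assumed, library-free equivalent that relies on the identity that the VIF of variable j equals the j-th diagonal element of the inverse correlation matrix. The function names and the numeric toy columns are hypothetical, and categorical variables would first need a numeric coding.

```python
import math


def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equally long numeric columns."""
    n = len(columns[0])
    means = [sum(col) / n for col in columns]
    centered = [[v - m for v in col] for col, m in zip(columns, means)]
    norms = [math.sqrt(sum(v * v for v in col)) for col in centered]
    p = len(columns)
    return [[sum(a * b for a, b in zip(centered[i], centered[j])) / (norms[i] * norms[j])
             for j in range(p)] for i in range(p)]


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    p = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(p)]
           for i, row in enumerate(matrix)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("perfectly collinear variables: VIF is unbounded")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(p):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[p:] for row in aug]


def vif(columns):
    """VIF of each variable, read from the diagonal of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[j][j] for j in range(len(columns))]


# Hypothetical numeric columns: x2 is almost a copy of x1, so both exceed the threshold of 10.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.1, 1.9, 3.2, 3.9, 5.1]
print(vif([x1, x2]))
```

With only two variables, both VIF values equal 1 / (1 - r^2), where r is their correlation, so one of the pair would be dropped under the rule of thumb quoted above.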

Methodology
In the current study, two classification models were trained on the data. This section provides the relevant concepts of these models. We applied the Scikit-learn (sklearn) library in Python 3.6. Overall, the proposed procedure consists of the following steps:
Step 1. We employed the SMOTE algorithm to deal with the imbalanced dataset.
Step 2. We randomly selected 70% of the data as the training set, and the remaining 30% were employed to test the model.
Step 3. We fed the training set into the XGBoost and CART models separately and used grid search to determine the best combination of parameters to prevent the models from overfitting. Cross-validation was used to measure the stability of the models.
Step 4. We compared the performance of the two models and chose the better-performing model to predict the severity of crashes involving AVs and analyze the effects of the different factors on crash severity.
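To make Step 1 more concrete, the sketch below shows the interpolation idea behind SMOTE in plain Python. It is an illustrative assumption, not the implementation used in the study: the function name, the default of five neighbors, and the toy points are hypothetical, and the sketch assumes numeric feature vectors (categorical variables would need an encoding or a categorical-aware variant).

```python
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic samples by interpolating between minority-class neighbors.

    minority: list of numeric feature vectors (at least two samples are required).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to the base sample.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbor = rng.choice(others[:k])  # one of the k nearest minority neighbors
        gap = rng.random()  # position on the segment from base to neighbor
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbor)])
    return synthetic


# Hypothetical minority-class samples (e.g., injury crashes after numeric encoding).
injury_samples = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(smote(injury_samples, n_new=4, k=2))
```

Every synthetic sample lies on a line segment between two real minority samples, so no value outside the observed range is created. The steps above apply SMOTE before the train/test split; a common alternative is to oversample only the training portion so that synthetic points derived from test records do not enter training.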

XGBoost Model.
The core of XGBoost is an ensemble algorithm based on gradient boosted decision trees. It utilizes a series of decision trees, where every tree learns from the prior tree and influences the following tree to improve model performance [40]. In this section, we explain the formulas and evaluation indicators behind XGBoost. Interested readers can refer to the study published by Chen and Guestrin [21] for further detailed information. Chen and Guestrin made some improvements to Gradient Boosting [41] and presented XGBoost in 2016. One notable improvement is the regularization of the loss function. The regularized objective $L^{k}$ for the $k$th iteration can be expressed as shown in the following equation:

$$L^{k} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(k)}\right) + \sum_{j=1}^{k} \Omega(f_j),$$

where $n$ is the number of samples, $y_i$ is the observed label of sample $i$, $\hat{y}_i^{(k)}$ is the prediction value of sample $i$ at iteration $k$, and $l$ is the original loss function. $\Omega$ represents the regularization term, as shown in the following equation:

$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^{2}.$$

Here, $T$ is the number of leaf nodes, $w$ is the vector of leaf weights, and $\gamma$ and $\lambda$ are two constants employed to constrain the degree of regularization.
Another development of XGBoost is the application of an additive learning approach [42] that adds the most reliable tree model $f_k(x_i)$ to the current classification model to provide the prediction result of the $k$th iteration, $\hat{y}_i^{(k)} = \hat{y}_i^{(k-1)} + f_k(x_i)$ [43]. Therefore, the regularized objective can be expressed further as follows:

$$L^{k} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(k-1)} + f_k(x_i)\right) + \sum_{j=1}^{k} \Omega(f_j).$$

Additionally, XGBoost applies the second-order Taylor expansion to the objective function, which can then be approximated as the following equation:

$$L^{k} \simeq \sum_{i=1}^{n} \left[ g_i f_k(x_i) + \frac{1}{2} h_i f_k^{2}(x_i) \right] + \Omega(f_k) + C.$$

Here, $g_i = \partial_{\hat{y}_i^{(k-1)}} l\left(y_i, \hat{y}_i^{(k-1)}\right)$ and $h_i = \partial^{2}_{\hat{y}_i^{(k-1)}} l\left(y_i, \hat{y}_i^{(k-1)}\right)$ are the first and second derivatives of the loss function, respectively, and $C$ represents the constant that collects the terms that do not depend on $f_k$. Finally, as an ensemble algorithm, XGBoost is not affected by the multicollinearity of the data. This advantage allows XGBoost to obtain more reliable results even if the variables have a strong linear correlation.
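As a small worked illustration of the second-order approximation, the sketch below computes g_i and h_i for the logistic loss and the leaf weight that minimizes the approximated objective for a single leaf. The closed forms w* = -G / (H + lambda) and -G^2 / (2 (H + lambda)) + gamma come from the XGBoost paper by Chen and Guestrin rather than from this text, the choice of logistic loss is an assumption, and all function names and sample values are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def grad_hess(y_true, raw_score):
    """First and second derivatives of the logistic loss with respect to the raw score."""
    p = sigmoid(raw_score)
    return p - y_true, p * (1.0 - p)


def leaf_weight(g_sum, h_sum, lam):
    """Weight w minimizing G*w + 0.5*(H + lam)*w**2 for the samples in one leaf."""
    return -g_sum / (h_sum + lam)


def leaf_objective(g_sum, h_sum, lam, gamma):
    """Approximate objective contribution of one leaf evaluated at its optimal weight."""
    return -0.5 * g_sum ** 2 / (h_sum + lam) + gamma


# Hypothetical leaf holding three samples (1 = injury, 0 = PDO), all with raw score 0.
labels = [1, 1, 0]
pairs = [grad_hess(y, 0.0) for y in labels]
G = sum(g for g, _ in pairs)
H = sum(h for _, h in pairs)
print(leaf_weight(G, H, lam=1.0), leaf_objective(G, H, lam=1.0, gamma=0.0))
```

Because two of the three samples are positive, G is negative and the resulting leaf weight is positive, pushing the raw score of this leaf toward the injury class; a larger lambda shrinks the weight toward zero.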

Classification and Regression Tree.
Classification and Regression Tree (CART) is a nonparametric decision tree learning method [15]. It can summarize decision rules from a series of data with features and labels and present them in a tree structure to solve classification and regression problems. The CART method usually consists of two main steps: tree growing and pruning. The tree extends from the root node, which includes all the data in the dataset. The root node is divided into two child nodes through a splitter (independent variable) to improve the purity of the two child nodes. The Gini index is used as the splitting criterion in the current study. If the root node $m$ is divided into two child nodes ($n_1$ and $n_2$) by the variable $\theta$, the Gini index of any child node is calculated as follows:

$$H(n(\theta)) = 1 - \sum_{k} p(k \mid n)^{2}.$$

Here, $H(n(\theta))$ represents the Gini index of the child node $n$, and $p(k \mid n)$ is the proportion of class $k$ records in node $n$. The impurity at node $m$ is calculated as follows:

$$G(m, \theta) = \frac{o_1}{N_m} H(n_1(\theta)) + \frac{o_2}{N_m} H(n_2(\theta)).$$

Here, $N_m$ is the total number of observations at node $m$, and $o_1$ and $o_2$ are the numbers of observations in child nodes $n_1$ and $n_2$. The method divides the root node $m$ by selecting the variable $\theta^{*}$ that minimizes this impurity:

$$\theta^{*} = \arg\min_{\theta} G(m, \theta).$$

Splitting stops when CART detects that no further gain can be made by growing the tree deeper or when predetermined stopping rules are met. Given the defined branches and nodes of the tree, each observation falls into a terminal node.
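The splitting rule can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the Gini index is taken as 1 minus the sum of squared class proportions, the node impurity as the size-weighted average of the two child Gini indices, and candidate splits are of the form "feature equals value" on categorical data. The function names, feature names, and records are hypothetical, and library implementations such as sklearn instead search thresholds on numerically encoded features.

```python
from collections import Counter


def gini(labels):
    """Gini index 1 - sum_k p(k|n)^2 of the class labels in one node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(left, right):
    """Size-weighted Gini index of two child nodes."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_split(rows, labels):
    """Return (impurity, feature, value) of the best binary split 'feature == value'."""
    best = None
    for feature in rows[0]:
        for value in {row[feature] for row in rows}:
            left = [y for row, y in zip(rows, labels) if row[feature] == value]
            right = [y for row, y in zip(rows, labels) if row[feature] != value]
            if not left or not right:
                continue  # a split must send records to both children
            score = split_impurity(left, right)
            if best is None or score < best[0]:
                best = (score, feature, value)
    return best


# Hypothetical crash records; feature names and values are illustrative only.
rows = [
    {"weather": "fog", "mode": "autonomous"},
    {"weather": "fog", "mode": "manual"},
    {"weather": "clear", "mode": "autonomous"},
    {"weather": "clear", "mode": "manual"},
]
labels = ["injury", "injury", "PDO", "PDO"]
print(best_split(rows, labels))
```

In this toy data the weather split separates the two classes perfectly (impurity 0), whereas splitting on driving mode leaves both children evenly mixed (impurity 0.5), so weather is selected.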

Model Evaluation.
The confusion matrix is a multidimensional measurement index system for binary classification problems that has been used widely in evaluating model performance (see Table 3) [44]. The overall accuracy is calculated as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FN + FP}.$$
However, this index may not be suitable for imbalanced data. Because the number of injury accidents in the current study is significantly smaller than the number of noninjury accidents, even if all minority instances are misclassified, the overall accuracy might still be very high. To address the limitations of the overall classification accuracy, the G-mean (geometric mean) is considered a reasonable index to evaluate imbalanced data. It has a high value only when the classification accuracies of the minority and majority instances are balanced [45]. The G-mean is calculated as follows:

$$\text{G-mean} = \sqrt{\frac{TP}{TP + FN} \times \frac{TN}{TN + FP}}.$$

The recall rate indicates the classification accuracy of minority instances, as shown in the following equation:

$$\text{Recall} = \frac{TP}{TP + FN}.$$

Finally, G-mean and recall are employed as indexes to measure model performance.

Figure 1: The distribution of collision type.
Journal of Advanced Transportation
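The following sketch computes the three indexes from confusion-matrix counts and reproduces the argument about imbalanced data. It assumes the standard definitions (recall on the minority class, G-mean as the square root of recall times the true-negative rate); the function name and the counts are hypothetical, chosen only to mimic an 81/19 class split.

```python
import math


def evaluate(tp, tn, fp, fn):
    """Return (accuracy, recall, G-mean) from binary confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn) if tp + fn else 0.0        # accuracy on the minority (positive) class
    specificity = tn / (tn + fp) if tn + fp else 0.0   # accuracy on the majority (negative) class
    g_mean = math.sqrt(recall * specificity)
    return accuracy, recall, g_mean


# Hypothetical classifier that labels every crash as PDO: 81 PDO and 19 injury cases.
print(evaluate(tp=0, tn=81, fp=0, fn=19))

# Hypothetical classifier that finds most injury crashes at the cost of some false alarms.
print(evaluate(tp=15, tn=60, fp=21, fn=4))
```

The first classifier reaches an accuracy of 0.81 while its recall and G-mean are both 0, which is why accuracy alone is misleading here; the second has lower accuracy but a far higher recall and G-mean.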

Model Results.
We use grid search to determine the best combination of parameters to prevent the model from overfitting. The optimal parameter values are shown in Table 4. The parameter "n_estimators" is the number of trees that are fitted in the model. The parameter "gamma" is the minimum loss reduction required to make a further partition on a leaf node of the tree. The learning rate is used to shrink the weights in an update to prevent overfitting. The maximum depth of the tree limits the number of successive splits; increasing the maximum depth can cause overfitting. The parameter "criterion" specifies the function used to measure the quality of a split. The parameter "min_samples_leaf" represents the minimum number of samples required to be at a leaf node. The optimal CART and XGBoost models are established after parameter tuning. Then, the two models are tested on the same testing data to compare the predicted results. Table 5 shows the estimation results of the crash severity model, including the accuracy, recall, and G-mean results for the two models. The XGBoost model performed better than the CART model, thereby reflecting the stability of XGBoost. Besides, the XGBoost model's accuracy is reduced by 26.1% after incorporating POI data, but the recall and G-mean increase by 100% and 11.1%, respectively. This indicates that including land-use mix information improves the identification of injury accidents. Additionally, as mentioned in Section 4.3, the G-mean and recall metrics are appropriate for imbalanced data because we need to identify as many injury accidents as possible. The recall and G-mean results of the calibration dataset in the XGBoost model are 84.6% and 69.9%, respectively; the recall and G-mean results of the validation dataset are 80% and 68.8%, respectively. The results for the calibration and validation datasets are relatively balanced, indicating that the model has good fitting performance and prediction ability.
In summary, the XGBoost model with POI data performs well in identifying injury crashes. Figure 6 illustrates the relationship between collision severity and potential contributing factors. Variables include the type of collision, the AV's movement preceding the crash, vehicle damage, accident location, driving mode, and weather.

Feature Analysis.
We can observe that the weather is the most critical feature in the model. Specifically, injury accidents are more likely to occur in extreme weather conditions (e.g., fog and snow) [46] because sensors have poor perception performance in extreme weather. Rain and fog are composed of small water droplets that block the reflector and produce false alarms during obstacle detection [47]. According to Hasirlioglu et al. [48], in foggy weather, the relationship between temperature and visibility is inversely proportional, and visibility represents the distance that the detector can detect in this case. A higher degree of vehicle damage corresponds to a higher accident severity. Crashes have a higher probability of occurring at intersections, whether signalized or nonsignalized [49].
The traffic environment at intersections is complex and changeable because vehicles, nonmotor vehicles, and pedestrians are highly mixed [50]. The crash reports do not cover the number of crossings made in autonomous mode. Therefore, it is necessary to study the stability and safety of AVs when crossing intersections in autonomous driving mode. Existing AVs do not take advantage of the convenience brought by infrastructure-to-vehicle (I2V) communication. These facilities can decrease spatial and temporal instabilities and promote the safety of drivers, cyclists, and pedestrians [51]. Injury crashes are also more likely to occur in areas with highly mixed land use. Chen [52] argued that areas with a high degree of mixed land use serve various functions, which significantly increases the conflict points between vehicles, bicycles, and pedestrians. Diverse land-use regions are prone to diversified traffic behaviors and increased regional traffic flow, affecting traffic safety substantially. This finding seems intuitive because mixed land-use patterns typically exhibit diverse land-use types leading to complex roadway layouts. The results suggest that areas with mixed land use require additional local-level research to develop effective target-oriented treatments to improve the safety performance of AVs.
Many rear-end accidents occur in the data set (i.e., 64%), but AVs are usually not the responsible party. Drivers of conventional vehicles are not accustomed to the AVs' characteristics, and, thus, crashes are often caused by conventional vehicles hitting the rear of AVs. Wang et al. [18] found that crash severity significantly increases if the AV is responsible for the crash. From the perspective of driving mode and vehicle movement before the accident, 56% of AVs' injury accidents occurred in autonomous driving mode. In February 2016, the AV produced by Google had its first accident when it changed lanes and collided with a bus at a low speed, causing no casualties. Three months later, a Tesla in the US was involved in a more serious, fatal crash while driving in autonomous mode. It was the first known fatal crash in the history of AV technology.
These two crashes were regarded as important events since the advent of AVs, indicating a new type of traffic crash. The main reason for these two crashes was that the driver ignored the warning to take over from the AV, which meant that the two drivers did not take over the driving in time to ensure driving safety. This also shows that AVs were responsible for these crashes. In conclusion, severe injuries can happen if the vehicle is in automated driving mode and is the crash's primary responsible party.

Summary and Conclusion
In the current study, the XGBoost model incorporating POI data is adopted to investigate the factors contributing to the severity of AV-involved crashes using reported crashes from California. Descriptive statistical analysis was employed to investigate the characteristics of AV-involved crashes in terms of the crash location, collision type, crash severity, and vehicle movement before crash occurrence. A total of 94 accident cases were employed to train the model, which reached a G-mean and recall of 68.8% and 80%, respectively. The recall and G-mean increased by 100% and 11.1%, respectively, after incorporating POI data. This indicates that including land-use mix information improves the identification of injury accidents.
We find that the degree of vehicle damage, accident location, and type of collision significantly affect the severity of the crash, which is consistent with previous research. The difference is that this study finds weather conditions to be the most critical factor. Extreme weather and intersection locations have a significant effect on the severity of an accident. Whether signalized or nonsignalized, intersections are the most likely places for rear-end collisions, mainly because crashes occurred when the vehicle was waiting at the intersection or driving slowly. We recommend using vehicle sensors with strong stability and high sensitivity. Besides, switching to manual driving is also a way to avoid severe accidents caused by automated driving. The results from the model can also provide a basis for policy decisions. For example, results reveal a significantly higher likelihood of injury crashes in mixed land-use settings. Hence, specific recommendations are made to promote the mixed use of land and reduce traffic accidents. First, urban planning should focus on developing small-scale and high-intensity diverse land use at the microlevel and constructing a multicenter urban pattern at the macrolevel to achieve a balanced population and transportation. Mixed land-use development needs to be based on the construction of rapid rail transit facilities, encourage walking and nonmotorized travel by improving road traffic conditions, and use public transportation as the core to reduce traffic accidents caused by the rapid growth in vehicles. Areas with highly mixed land use require additional local-level studies to develop effective treatments to improve the safety performance of AVs.
As a limitation, this study did not collect data on vehicle speed and driver characteristics before the accident to evaluate the safety performance of AVs. Despite the limited sample size, the collision database employed in this study includes all issued crash reports involving AVs in 2019. However, to gain a deeper understanding of the mechanism of AV crashes, future research should continue to collect AV crash data and apply multisource data fusion to enhance prediction accuracy. The model adopted in this study supplies an accepted method for investigating and understanding AV safety issues. Moreover, as the sample size increases in the future, this advantage can continue to grow. The knowledge gained from this research could contribute to the assessment and improvement of the safety performance of the current AVs.

Data Availability
The data used to support the findings of this study are publicly available at https://www.dmv.ca.gov/portal/dmv/detail/vr/autonomous/autonomousveh_ol316.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.