Analyzing Accident Injury Severity via an Extreme Gradient Boosting (XGBoost) Model

Crashes between vehicles and vulnerable road users (VRUs) account for a large proportion of traffic crashes in China, and crash injury severity analysis can help traffic managers understand the implicit rules behind these crashes. Therefore, 554 VRUs-involved crashes were collected from January 2017 to February 2021 in a city in northern China, including 322 vehicle-pedestrian crashes and 232 vehicle-bicycle crashes. First, a descriptive statistical analysis is conducted to investigate the characteristics of VRUs-involved crashes. Second, the extreme gradient boosting (XGBoost) model is introduced to identify the importance of risk factors (i.e., time of day, day of week, rush hour, crash position, weather, and crash involvements) of VRUs-involved crashes. The statistical analysis demonstrates that the risk factors are closely related to VRUs-involved crash injury severity. Moreover, the results of XGBoost reveal that time of day has the greatest impact on VRUs-involved crashes, and crash position shows the minimum importance among these risk factors.


Introduction
Crash injury severity analysis plays a crucial role in traffic crash analysis and can assist traffic management [1][2][3][4]. Crash injury severity is defined as the degree of injury and property damage caused by a crash event. The analysis aims to explore the correlation between crash injury severity and various contributing factors, such as road-user-related factors, temporal factors, environmental conditions, and crash types. The resulting universal rules help traffic managers better understand the contributions of these factors to crash injury severity and, by developing countermeasures, reduce crash severity and improve traffic safety [5][6][7].
Currently, research approaches to crash injury severity fall into two categories: statistical models and machine learning-based models. Statistical models assume that the contributing factors affecting crash injury severity follow a particular distribution, which must be specified carefully to capture the relationship between crash injury severity and the explanatory variables. Commonly used models include the multivariate Poisson regression model [8,9], ordered probit model [10,11], bivariate binary/ordered probit model [12,13], random parameter probit model [14], etc. Wang et al. focused on mountainous expressways and proposed a partial proportional odds model to identify the determinants of truck-involved crash injury severity [15]. Xu et al. investigated pedestrian-involved crash injury severity using a geographically and temporally weighted regression model that accounts for spatial-temporal correlation [16].
Statistical models can demonstrate and explain the correlation between crash severity and related variables clearly, with the help of explainable and logical theoretical deductions. However, because the relationship between crash injury severity and the contributing factors is nonlinear, these models have difficulty capturing the intrinsic correlations [17][18][19].
Machine learning-based models have a powerful inferential capability. They learn with few or no prior assumptions about the related factors, which makes them more flexible in describing the complex characteristics of crash events.
Previous studies employed logistic models (e.g., random parameter logit model and mixed/ordered logit model) [20,21], support vector machine (SVM) [22,23], random forest (RF) [24,25], Bayesian-related models [26,27], etc., to explore the complex relationship between crash injury severity and contributing factors and to identify the risk factors for crash injury severity. To account comprehensively for the observed heterogeneity, Behnood et al. introduced a random parameter multinomial logit model to compare the contribution of risk factors to crash injury severity in bicycle-vehicle crashes [28]. Liu et al. introduced an ordinal logistic regression model to examine the risk factors in pedestrian-motor vehicle collisions, taking into account the spatial-temporal correlation [29]. Li et al. introduced an SVM model to investigate the potential correlation between external factors and crash injury severity, but its performance was limited by the multiclass classification problem [30]. Li et al. analyzed the key factors affecting electric bicycle-related crash injury severity with the help of a random forest model [31].
Beyond that, Bayesian approaches, a classical family of machine learning models, have been widely used in crash injury severity modelling; they are referred to here as Bayesian-related models. For instance, the Bayesian binomial logistic model [32,33], Bayesian multivariate regression model [34,35], Bayesian spatial model [36,37], and Bayesian mixed logit model [38] have successfully demonstrated their applicability to correlation issues involving crash injury severity. Yuan et al. divided crash severity into two categories (property damage only and injury/fatality) and integrated a bivariate probit model with a Bayesian approach to identify the contributing factors associated with crash injury severity [39]. Haq et al. developed a binary logistic model with a Bayesian inference approach to investigate the effects of a comprehensive set of factors on truck-involved crashes, especially on occupant injury severity [40]. Guo et al. proposed a novel random parameter multivariate Tobit model to identify risk factors for crash severity under different crash types [41]. Zhang et al. used a Bayesian multinomial logit model with a conditional autoregressive prior to examine the hazardous factors that contribute to freeway crash injury severity [42].
In sum, previous studies focused on identifying risk factors for traffic injury severity with various statistical and machine learning-based models, and satisfactory results were obtained. However, decision tree ensemble-based models, also a type of machine learning model, including adaptive boosting (AdaBoost), gradient boosting decision tree (GBDT), light gradient boosting machine (LightGBM), and extreme gradient boosting (XGBoost), are less used for crash injury severity analysis. Additionally, vulnerable road users (VRUs), the vulnerable group among traffic participants, are prone to fatal injuries in crashes [43]. Pedestrians and bicyclists, the most common VRUs, have received much attention, but decision tree ensemble-based models are rarely used to investigate the potential universal rules behind VRUs-involved crashes. Hence, this paper attempts to describe the characteristics of VRUs-involved crashes and identify the contributing factors associated with crash injury severity. For this purpose, 554 VRU-vehicle crashes (232 bicycle-vehicle crashes and 322 pedestrian-vehicle crashes) were collected. Moreover, XGBoost is introduced to model crash injury severity and to rank the importance of the risk factors.
The contribution of this paper is twofold: (1) A descriptive statistical analysis is conducted to investigate the characteristics of VRUs-involved crashes from the perspective of six risk factors (i.e., time of day, day of week, rush hour, crash position, weather, and crash involvements), and the findings are transformed into universal rules to support traffic management. (2) XGBoost is adopted to identify the risk factors contributing to VRUs-involved crash injury severity, using a VRUs-involved crash dataset from police records, which helps determine the real causes and enhance traffic safety.
The rest of this paper is organized as follows. Section 2 introduces the data details and candidate variables analyzed in this paper. Section 3 describes the details of the XGBoost model. Section 4 provides the experimental results, which consist of crash severity characteristics and identified risk factors. Section 5 briefly concludes the study.

Data Source.
To explore the characteristics of VRUs-involved crashes, 554 crash samples were collected from police crash records for a city in northern China, covering a period of about four years. The dataset contains various information, such as crash time, position, involvements, and injury severity, and six factors are extracted to explain the characteristics of VRUs-involved crashes. Each crash involved a vehicle and either a bicycle or a pedestrian; bicyclists and pedestrians are defined as VRUs.
The crash dataset contains 232 vehicle-bicycle crashes and 322 vehicle-pedestrian crashes. Property-damage-only crashes are excluded because vehicle-bicycle and vehicle-pedestrian crashes are prone to injury or death and therefore fall into the injury or fatal categories. The dataset consists of 385 injury crashes and 169 fatal crashes, which caused 517 injuries and 173 deaths.

Candidate Variables.
Generally, if a crash involves fatally or otherwise injured persons, it can be regarded as a severe accident. Because the dataset contains only fatal and injury accidents, with no property-damage-only accidents, crash injury severity is divided into two categories: injury accident (only injured persons involved in the crash), coded as 0, and fatal accident (at least one fatality in the crash), coded as 1. Figure 1 shows the factors extracted from the dataset: time of day, day of week, rush hour, weather, crash position, and crash involvements. These six factors are used to investigate the characteristics of VRUs-involved crashes under the two injury categories (see Table 1).
To some extent, time of day indirectly reflects the lighting conditions, a crucial factor for traveling. The crash position is complex: it is mainly road section or intersection, with fewer crashes on sidewalks and roundabouts. For better modelling, crashes that happened on sidewalks were regarded as road-section crashes. The weather information was collected from a weather history website (see http://www.tianqihoubao.com/lishi) based on the date and time of each crash [26]. Note that this website provides weather information for only two periods, daytime and night, so it is not resolved to specific hours. Additionally, some weather types have a similar impact on the traveling environment, for instance, sunny and cloudy, or rainy and snowy. Therefore, the weather was divided into two categories: good and adverse.
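The coding rules described above can be sketched in a few lines of Python. This is only an illustration: the raw label strings (such as "sunny" or "sidewalk") and the record layout are assumptions, since the field names of the police records are not given in the paper.

```python
# Illustrative recoding of raw crash attributes into the binary categories
# described in the text. All raw label strings below are hypothetical.

GOOD_WEATHER = {"sunny", "cloudy"}      # grouped as "good" in the text
ADVERSE_WEATHER = {"rainy", "snowy"}    # grouped as "adverse" in the text


def code_severity(n_fatalities):
    """Return 1 for a fatal accident (at least one fatality), else 0."""
    return 1 if n_fatalities >= 1 else 0


def code_weather(raw):
    """Map a raw weather label to 'good' or 'adverse'."""
    label = raw.strip().lower()
    if label in GOOD_WEATHER:
        return "good"
    if label in ADVERSE_WEATHER:
        return "adverse"
    raise ValueError(f"unmapped weather label: {raw!r}")


def code_position(raw):
    """Map a raw position label to 'road section' or 'intersection'.

    Sidewalk crashes are treated as road-section crashes, as in the text.
    """
    label = raw.strip().lower()
    if label in {"road section", "sidewalk"}:
        return "road section"
    if label == "intersection":
        return "intersection"
    raise ValueError(f"unmapped position label: {raw!r}")


# One hypothetical record, recoded.
record = {"fatalities": 1, "weather": "Snowy", "position": "sidewalk"}
coded = {
    "severity": code_severity(record["fatalities"]),
    "weather": code_weather(record["weather"]),
    "position": code_position(record["position"]),
}
print(coded)
```

Raising an error on unmapped labels, rather than silently assigning a default, keeps rare categories visible until a mapping for them has been decided.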

Methodology
Extreme gradient boosting (XGBoost), a typical decision tree ensemble-based model, was proposed by Chen and Guestrin in 2016 [44]. XGBoost is an optimized form of GBDT that introduces second-order derivatives into the optimization process. Its advantages include parallel learning, high flexibility, and built-in cross-validation. Previous studies have demonstrated its successful use in traffic crash severity analysis and risk prediction [45,46].

Objective Function.
Given training data $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, the objective function is described as

\[ \mathrm{Obj} = \sum_{i=1}^{n} L(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k), \tag{1} \]

where $L(y_i, \hat{y}_i)$ is the training loss, which measures how well the model fits the training data, and $\Omega(f_k)$ denotes the regularization term, which controls the model complexity to prevent overfitting. $n$ is the size of the training dataset, and $K$ denotes the number of trees. Generally, for classification problems, the logistic loss is adopted as the loss function:

\[ L(y_i, \hat{y}_i) = y_i \ln\left(1 + e^{-\hat{y}_i}\right) + (1 - y_i) \ln\left(1 + e^{\hat{y}_i}\right), \tag{2} \]

where $y_i$ is the true value and $\hat{y}_i$ is the predicted value. In the XGBoost model, the predicted value is the sum of the scores of all trees:

\[ \hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F, \tag{3} \]

where $F$ denotes the functional space and $f_k$ is the function of the $k$-th tree.
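As a small illustration of the prediction and loss described above, the following sketch sums the scores of two toy trees and evaluates the logistic loss on the resulting raw score. The two stumps, their leaf values, and the field names `night` and `pedestrian` are hypothetical and are not taken from the fitted model; the loss is written in the form commonly used for a raw (log-odds) score, which is assumed to match the paper's expression.

```python
import math


def predict_score(trees, x):
    """Raw score of the ensemble: the sum of the outputs of all trees for x."""
    return sum(tree(x) for tree in trees)


def sigmoid(score):
    """Convert a raw (log-odds) score into a probability."""
    return 1.0 / (1.0 + math.exp(-score))


def logistic_loss(y, score):
    """Logistic loss for a label y in {0, 1} and a raw score."""
    return y * math.log1p(math.exp(-score)) + (1 - y) * math.log1p(math.exp(score))


# Two hypothetical depth-1 trees (stumps) with made-up leaf values.
trees = [
    lambda x: 0.4 if x["night"] else -0.3,
    lambda x: 0.2 if x["pedestrian"] else -0.1,
]

sample = {"night": True, "pedestrian": True}
score = predict_score(trees, sample)       # sum of the two leaf values
prob = sigmoid(score)                      # predicted probability of a fatal crash
loss_if_fatal = logistic_loss(1, score)    # loss if the true label is 1
print(round(score, 3), round(prob, 3), round(loss_if_fatal, 3))
```

Written on the raw score, this loss equals the usual cross-entropy evaluated on the sigmoid probability, which is why the trees can output unbounded scores.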

Additive Training.
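In the training process, it is intractable to learn all trees simultaneously. Instead, XGBoost uses an additive strategy, which keeps what has already been learned fixed and adds one new tree at a time:

\[ \hat{y}_i^{(0)} = 0, \qquad \hat{y}_i^{(t)} = \sum_{k=1}^{t} f_k(x_i) = \hat{y}_i^{(t-1)} + f_t(x_i), \tag{4} \]

where $\hat{y}_i^{(t)}$ and $f_t(x_i)$ denote the predictive value and the added predictive function (i.e., the new tree) at step $t$, respectively. To obtain the best tree at each step, the objective function at step $t$ is defined as

\[ \mathrm{Obj}^{(t)} = \sum_{i=1}^{n} L\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) + \mathrm{constant}, \tag{5} \]

where the constant collects the regularization terms of the trees from earlier steps, which are already fixed. For the training loss, the Taylor formula is applied, and the second-order expansion is expressed as

\[ \mathrm{Obj}^{(t)} \approx \sum_{i=1}^{n} \left[ L\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^{2}(x_i) \right] + \Omega(f_t) + \mathrm{constant}, \tag{6} \]

where $g_i$ and $h_i$ are the first-order and second-order partial derivatives of the loss function and can be expressed as

\[ g_i = \frac{\partial L\left(y_i, \hat{y}_i^{(t-1)}\right)}{\partial \hat{y}_i^{(t-1)}}, \qquad h_i = \frac{\partial^{2} L\left(y_i, \hat{y}_i^{(t-1)}\right)}{\partial \left(\hat{y}_i^{(t-1)}\right)^{2}}. \tag{7} \]

Generally, the constant terms in the objective function are ignored, and the simplified objective function can be obtained as

\[ \mathrm{Obj}^{(t)} \approx \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^{2}(x_i) \right] + \Omega(f_t). \tag{8} \]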
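For the logistic loss, the first- and second-order derivatives used in the expansion have simple closed forms. The sketch below assumes the loss is taken with respect to the raw score, in which case the gradient is p - y and the second derivative is p(1 - p), with p the sigmoid of the previous score; the numerical values are arbitrary and serve only to compare the second-order approximation with the exact loss after one small step.

```python
import math


def sigmoid(score):
    return 1.0 / (1.0 + math.exp(-score))


def logistic_loss(y, score):
    """Logistic loss for a label y in {0, 1} and a raw score."""
    return y * math.log1p(math.exp(-score)) + (1 - y) * math.log1p(math.exp(score))


def grad_hess(y, prev_score):
    """First and second derivatives of the logistic loss w.r.t. the raw score."""
    p = sigmoid(prev_score)
    return p - y, p * (1.0 - p)


def second_order_approx(y, prev_score, step):
    """Second-order Taylor approximation of the loss after adding `step`."""
    g, h = grad_hess(y, prev_score)
    return logistic_loss(y, prev_score) + g * step + 0.5 * h * step * step


# Arbitrary illustrative values: true label, previous score, new tree output.
y, prev_score, step = 1, 0.2, 0.1
exact = logistic_loss(y, prev_score + step)
approx = second_order_approx(y, prev_score, step)
print(exact, approx)
```

For small tree outputs the two values agree closely, which is what allows each new tree to be fitted using only the per-sample pairs (g, h).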

Model Complexity.
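To define the complexity of a tree, we refine the definition of the tree $f(x)$ as

\[ f_t(x) = w_{q(x)}, \quad w \in \mathbb{R}^{T}, \quad q(x) \in \{1, 2, \ldots, T\}, \tag{9} \]

where $w$ is the vector of leaf weights, $q$ is the mapping function that maps each data sample to the corresponding leaf, and $T$ is the number of leaves. Then, the complexity can be defined as

\[ \Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^{2}, \tag{10} \]

where $\gamma$ is the coefficient of the number of leaves and $\lambda$ denotes the coefficient of the L2 regularization term.
We define $I_j = \{ i \mid q(x_i) = j \}$, which denotes the set of data samples assigned to the $j$-th leaf. Then, substituting (9) and (10) into the simplified second-order objective, the objective function of the $t$-th tree can be rewritten as

\[ \mathrm{Obj}^{(t)} \approx \sum_{i=1}^{n} \left[ g_i w_{q(x_i)} + \frac{1}{2} h_i w_{q(x_i)}^{2} \right] + \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^{2} = \sum_{j=1}^{T} \left[ G_j w_j + \frac{1}{2} (H_j + \lambda) w_j^{2} \right] + \gamma T, \tag{11} \]

where $G_j = \sum_{i \in I_j} g_i$ represents the sum of the first-order partial derivatives on the $j$-th leaf and $H_j = \sum_{i \in I_j} h_i$ denotes the sum of the second-order partial derivatives on the $j$-th leaf. Setting the partial derivative of $\sum_{j=1}^{T} \left( G_j w_j + \frac{1}{2} (H_j + \lambda) w_j^{2} \right)$ with respect to $w_j$ to zero, the best $w_j$ and the corresponding objective value can be obtained as

\[ w_j^{*} = -\frac{G_j}{H_j + \lambda}, \tag{12} \]

\[ \mathrm{Obj}^{*} = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^{2}}{H_j + \lambda} + \gamma T. \tag{13} \]

To measure how good a tree is and find the best tree structure, a greedy algorithm is introduced. It starts from a single leaf, attempts to split each leaf into two leaves, and then calculates the information gain of the split (see equation (14)):

\[ \mathrm{Gain} = \frac{1}{2} \left[ \frac{G_L^{2}}{H_L + \lambda} + \frac{G_R^{2}}{H_R + \lambda} - \frac{(G_L + G_R)^{2}}{H_L + H_R + \lambda} \right] - \gamma, \tag{14} \]

where $G_L$, $H_L$ and $G_R$, $H_R$ are the derivative sums of the left and right leaves after the split. Gain denotes the reduction in the loss function achieved by the split. If Gain > 0, the split will be considered.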
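The leaf weight and split gain described above reduce to a few arithmetic operations on the summed derivatives. The sketch below assumes the standard XGBoost expressions (optimal weight -G/(H + λ) and a gain equal to half the improvement in the structure score minus γ); the gradient sums are made-up numbers, λ = 3 is the value reported as optimal in the parameter search, and γ = 0 is an assumption.

```python
def leaf_weight(G, H, lam):
    """Optimal weight of a leaf with gradient sum G and hessian sum H."""
    return -G / (H + lam)


def leaf_score(G, H, lam):
    """Contribution G^2 / (H + lam) of one leaf to the structure score."""
    return G * G / (H + lam)


def split_gain(G_L, H_L, G_R, H_R, lam, gamma):
    """Gain of splitting one leaf into a left and a right leaf."""
    parent = leaf_score(G_L + G_R, H_L + H_R, lam)
    children = leaf_score(G_L, H_L, lam) + leaf_score(G_R, H_R, lam)
    return 0.5 * (children - parent) - gamma


# Hypothetical derivative sums for the two candidate child leaves.
G_L, H_L = -2.0, 1.0
G_R, H_R = 3.0, 2.0
lam, gamma = 3.0, 0.0   # gamma = 0 is an assumption for illustration

gain = split_gain(G_L, H_L, G_R, H_R, lam, gamma)
w_left = leaf_weight(G_L, H_L, lam)
w_right = leaf_weight(G_R, H_R, lam)
print(gain, w_left, w_right)
```

A split that separates samples with opposite-signed gradients, as here, yields a positive gain because the two children can take weights of opposite sign, whereas the parent leaf must compromise.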

Results
Based on the 554 crash records, the time-related information, crash position, weather, and crash involvements are investigated. In this section, six risk factors are extracted to explore the characteristics of VRUs-involved crashes and to determine the risk factors contributing to crash injury severity.
The temporal statistics of the crashes are given in Table 2. Most people probably travel during the daytime, which makes daytime crashes more frequent. At night, however, the poor travel environment (i.e., poor lighting and visibility) makes crashes more likely to cause deaths. Additionally, the proportion of crashes on weekdays is larger than that on weekends/holidays, but the fatality rate shows the opposite pattern: 28.6% on weekdays versus 35.9% on weekends/holidays. The reason may be that people are less alert to safety when traveling on weekends/holidays than on weekdays. Moreover, VRUs-involved crashes happen more often in off-peak hours than in rush hours because the off-peak period is longer. The fatality rate in off-peak hours is also higher than that in rush hours (31.7% versus 28.0%).
The variation tendency of VRUs-involved crashes by day of the week is shown in Figure 3, and Table 3 provides the crash injury severity information for each day of the week. The largest number of crashes occurs on Thursday, while Sunday has the fewest. A possible reason is that Thursday, close to the weekend, is the busiest day for most people as well as for traffic, whereas Sunday is the last day of the weekend, when people are more likely to rest at home. However, the fatality rate is relatively high on Sunday (41.0%), possibly because of people's low safety awareness during leisure travel. Additionally, Monday has the minimum fatality rate, 19.7%. The reason may be that Monday is the first weekday, and people maintain a relatively high level of safety alertness while commuting to work.
The statistical information of VRUs-involved crashes by injury severity is shown in Table 4, and Figure 4 illustrates the variation tendency of crashes by hour of the day. Crashes are prone to occur in rush hours (i.e., 7:00-9:00 and 17:00-20:00), especially in the morning rush, with the highest peak at 7:00-8:00 (45 crashes in total). This is the time when people go to work and traffic is busy, which is likely to cause crashes. Moreover, most crashes happened between 6:00 and 23:00, the period of human activity, while few occurred between 23:00 and 6:00, the usual sleeping time. Overall, mortality at night is relatively higher than that in the daytime.

Spatial Characteristics.
In the raw crash dataset, the crash position is recorded in many forms, which makes the spatial characteristics hard to describe. Hence, we reorganized the crash environment into two types: road section and intersection. Table 5 provides the statistical information of crashes under the two types of positions. There are 169 fatal crashes, 110 on road sections and 59 at intersections. Crashes on road sections account for a higher proportion than those at intersections, and the mortality (the share of crashes at a position that are fatal) is 0.293 on road sections and 0.331 at intersections. Additionally, intersections account for a higher proportion of fatal crashes (34.9%) than of injury crashes (30.9%). Therefore, crashes are more likely to happen on road sections, but crashes at intersections are more often fatal.
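The two kinds of ratio quoted above, mortality within a position and the share of a position within a severity class, are easy to confuse, so a short sketch may help. The fatal counts (110 and 59) come from the text; the injury counts (266 and 119) are not stated in this section and are inferred here, as an assumption, from the reported rates and the totals of 385 injury crashes and 554 crashes.

```python
# Fatal counts are taken from the text; injury counts are inferred
# (assumption) so that totals match 385 injury crashes and 554 crashes.
counts = {
    "road section": {"fatal": 110, "injury": 266},
    "intersection": {"fatal": 59, "injury": 119},
}


def mortality(cell):
    """Share of crashes at one position that are fatal."""
    return cell["fatal"] / (cell["fatal"] + cell["injury"])


def share(table, position, severity):
    """Share of one position among all crashes of a given severity."""
    total = sum(cell[severity] for cell in table.values())
    return table[position][severity] / total


for position, cell in counts.items():
    print(position, "mortality:", round(mortality(cell), 3))

print("intersection share of fatal crashes:",
      round(share(counts, "intersection", "fatal"), 3))
print("intersection share of injury crashes:",
      round(share(counts, "intersection", "injury"), 3))
```

The same two functions apply unchanged to the weather and crash-involvement tables once their counts are filled in.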

Weather Characteristics.
Weather comes in many types, which makes it hard to describe the weather characteristics associated with crash injury severity. Hence, the weather is divided into good (including sunny and cloudy) and adverse (including rainy, snowy, etc.). Table 6 shows the statistical information of injury severity by weather. Most VRUs-involved crashes (82.7%) happened in good weather, because people prefer to travel in good weather rather than adverse weather. However, the mortality of crashes in adverse weather is higher than that in good weather, 0.323 versus 0.301. Similarly, adverse weather accounts for a higher proportion of fatal accidents (18.3%) than of injury accidents (16.9%). The results illustrate that VRUs-involved crashes rarely happen in adverse weather, but when one does happen, it is more likely to be fatal.

Crash Involvements' Characteristics.
In the crash dataset, the participants in each crash are a vehicle and a bicycle or a vehicle and a pedestrian; thus, the crash involvements are divided into vehicle-bicycle and vehicle-pedestrian. Vehicle-pedestrian crashes account for the larger share of both fatal and injury accidents (see Table 7), and their share of fatal crashes (68.6%) is higher than their share of injury crashes (53.5%). Additionally, the mortality of vehicle-pedestrian crashes is higher than that of vehicle-bicycle crashes, 0.360 versus 0.228. In sum, we can infer that vehicle-pedestrian crashes result in death more easily than vehicle-bicycle crashes, and most of these crashes may happen at intersections and crosswalks. This is probably because bicycles are larger targets than pedestrians and are more likely to attract the attention of vehicle drivers. In addition, the reaction distance of cyclists is longer than that of pedestrians, which can reduce the injury severity in crashes.
Figure 3: The variation tendency of crashes counted by different days of the week (series: fatal, injury).

Parameters Optimization.
In this section, XGBoost is used to identify the contributing factors influencing crash injury severity. The parameters of XGBoost are crucial for model performance, so a grid search algorithm is used to obtain the optimal parameters. For the binary classification problem in this study, the logistic loss and the area under the receiver operating characteristic curve (AUC) are used as the objective loss function and the evaluation metric, respectively. Four parameters, namely the number of estimators (n estimators), learning rate, maximum depth, and regularization coefficient (λ), are tuned by grid search, and the candidate values are given in Table 8. The number of estimators is the number of iterations (i.e., the number of decision trees), the learning rate controls the step size in weight updating, and the maximum depth is the maximum depth of a tree. All these parameters help prevent overfitting. The grid search shows that the best model is obtained when the number of estimators is 10, the learning rate is 0.05, the maximum depth is 4, and λ is 3; the corresponding AUC and accuracy are 0.675 and 0.706, respectively. Figure 5 shows how the AUC varies under different parameter settings. In Figure 5(a), the AUC first rises and then falls, reaching its maximum of 0.675 when the number of estimators is 10, which indicates that 10 is the optimal number of estimators. The optimal values of learning rate, maximum depth, and λ are 0.05, 4, and 3, respectively. Note that in Figure 5(a) the other three parameters are fixed at their optimal values (i.e., learning rate 0.05, maximum depth 4, and λ 3), and the other panels follow the same rule.
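The search procedure and the two evaluation metrics can be sketched without any modelling library. In the sketch below, the candidate values are hypothetical (the actual ones are listed in Table 8), the parameter names are chosen for illustration, and `toy_score` is only a stand-in for "train a model with these parameters and return its validation AUC"; it is not a model and reproduces no result from the paper.

```python
from itertools import product


def auc(labels, scores):
    """AUC as the probability that a positive sample outranks a negative one
    (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy(labels, probs, threshold=0.5):
    """Share of samples whose thresholded probability matches the label."""
    hits = sum(1 for y, p in zip(labels, probs) if (p >= threshold) == (y == 1))
    return hits / len(labels)


def grid_search(param_grid, score_fn):
    """Evaluate every parameter combination and keep the best-scoring one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical candidate values; the paper's actual candidates are in Table 8.
param_grid = {
    "n_estimators": [5, 10, 20],
    "learning_rate": [0.01, 0.05, 0.1],
    "max_depth": [3, 4, 5],
    "reg_lambda": [1, 3, 5],
}


def toy_score(params):
    """Artificial score with a single peak, used only to make the loop runnable."""
    peak = {"n_estimators": 10, "learning_rate": 0.05, "max_depth": 4, "reg_lambda": 3}
    return -sum(abs(params[k] - peak[k]) for k in peak)


best_params, best_score = grid_search(param_grid, toy_score)
print(best_params)
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]), accuracy([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

In practice `score_fn` would fit the classifier on training folds and return the mean validation AUC, so the cost of the search grows with the product of the candidate list lengths (81 fits per fold for the grid above).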

Risk Factors' Analysis.
The XGBoost model with optimal parameters is obtained after the parameter optimization procedure using the grid search algorithm. Then, the contributing factors are ranked to determine which have a greater impact on VRUs-involved crash injury severity. Figure 6 shows the importance of the risk factors from the XGBoost model based on information gain, which is defined as the average gain in the objective function across all splits in which the feature (i.e., factor) is used. Time of day plays the most important role in VRUs-involved crash injury severity, with an information gain score of 4.56. It reveals that time of day [...] (see Table 6). This may be because people do not like to travel in adverse weather and keep a relatively high safety awareness when they do. Additionally, VRUs-involved crashes that happened at different positions (i.e., road section and intersection) show similar results.
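The gain-based importance defined above, the average gain over all splits that use a feature, can be computed from a list of split records. The records below are hypothetical and do not reproduce the scores in Figure 6; the feature names are illustrative.

```python
from collections import defaultdict


def gain_importance(splits):
    """Average gain per feature over all splits in which the feature is used.

    `splits` is an iterable of (feature_name, gain) pairs, one per split
    across all trees of the ensemble.
    """
    totals = defaultdict(float)
    uses = defaultdict(int)
    for feature, gain in splits:
        totals[feature] += gain
        uses[feature] += 1
    return {feature: totals[feature] / uses[feature] for feature in totals}


# Hypothetical split records collected from a small ensemble.
splits = [
    ("time_of_day", 5.0),
    ("time_of_day", 4.0),
    ("crash_involvements", 3.0),
    ("weather", 1.0),
    ("weather", 2.0),
]

importance = gain_importance(splits)
ranking = sorted(importance, key=importance.get, reverse=True)
print(importance)
print(ranking)
```

Because the score is an average rather than a sum, a feature used in only one high-gain split can outrank a feature used in many low-gain splits, which matters when comparing factors with very different numbers of categories.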

Conclusions
VRUs-involved crash injury severity analysis transforms the relationships behind the crashes into universal rules and thereby supports traffic management. This paper presents a descriptive statistical analysis of the characteristics of VRUs-involved crashes based on 554 crash records collected in a city in northern China and uses XGBoost to identify the risk factors affecting crash injury severity.
The important conclusions are summarized as follows. (1) The risk factors (i.e., time of day, day of week, rush hour, crash position, weather, and crash involvements) are closely related to VRUs-involved crash injury severity. More specifically, vehicle-bicycle and vehicle-pedestrian crashes are prone to involve fatalities at intersections, on weekend nights, and in adverse weather. (2) Time of day plays a more important role in VRUs-involved crash injury severity than the other factors, which reveals that VRUs-involved crashes at night are prone to cause deaths. Additionally, the weather has little effect on VRUs-involved crash injury severity. (3) Compared to vehicle-bicycle crashes, vehicle-pedestrian crashes are prone to happen at intersections (especially at the crosswalk near the intersection), and these crashes readily cause deaths.
Although few factors were analyzed, the AUC and accuracy of XGBoost are 0.675 and 0.706, respectively, and the results are still acceptable for the purposes of the current study. To obtain more accurate and detailed characteristics of VRU-vehicle crash injury severity, several research directions are proposed. (1) More risk factors (e.g., lighting condition, drivers' age, gender, crash pattern, and crash location related factors) can be considered to better explain the characteristics of VRU-vehicle crash injury severity and to identify the crucial risk factors. The characteristics of VRU-vehicle crash injury severity are not fully and accurately captured because of the limited set of risk factors. However, too many risk factors may distort the described characteristics. How to extract an appropriate number of precise risk factors is therefore a crucial challenge. (2) Risk factor identification methods with high accuracy and robustness, such as random forest (RF) and nonparametric Bayesian approaches, can be developed for crash injury severity analysis to better explain the characteristics and determine the real causes of crashes. The XGBoost model facilitates the investigation of crash injury severity issues, but its accuracy is limited by the small sample size. Therefore, how to develop risk factor identification approaches that work with a small sample size is an active research topic. In addition, how to consider spatial-temporal correlations in the modelling process is a crucial challenge.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Journal of Advanced Transportation