Analysis and visualization of accident severity based on LightGBM-TPE



Introduction
The majority of the human population now lives in cities because of better health care, education, and employment opportunities [1,2]. Despite their advantages, cities suffer from many problems in terms of "social dynamics" [3][4][5][6][7][8][9]. Among the phenomena that perturb or threaten the modern way of living, traffic accidents are remarkably conspicuous and account for a massive number of accidental deaths [10][11][12][13][14]. Thus, it is essential and necessary to analyze the exact influence of diverse feature factors on accident severity.
Data mining and machine learning algorithms have offered an innovative and effective way to study traffic accident severity [15][16][17][18][19][20][21][22]. AlKheder applied Decision Tree, Bayesian Network and linear Support Vector Machine models to analyze traffic accident severity, and found that the Bayesian network has the best prediction accuracy and that road and accident type were the most important features for the model [15]. Hadjidimitriou applied different machine learning algorithms, such as random forest, support vector machine and deep networks, to classify the severity of accidents [16]. Ji compared multiple machine learning models and concluded that ensemble learning models outperformed individual models [17]. Shen analyzed hazardous-material road transport accidents in seven regions of China, and found that XGBoost performed better than all other models and that the impact of features on accident severity varied among regions [18]. Moreover, a severity causation network, which combines information entropy with a Bayesian network, was constructed to explore the relationship between risk features and crash severity, and the entropy weight method was utilized to estimate the factors' impacts by considering both the weight and the accessibility [19]. Zeng investigated the influence of real-time weather conditions on freeway crash severity by using a Bayesian spatial generalized ordered logit model [20]. Xing established a random-parameters ordered probit model to explore the impacts of contributing features on the severity of Hazmat accidents [21].
Furthermore, Parsa applied XGBoost to detect the occurrence of accidents and employed SHapley Additive exPlanation (SHAP) to interpret the influence of individual features on the accidents [23]. Moreover, a methodological framework based on XGBoost was proposed to explore the influential factors of accidents, and a grid analysis was established to analyze the important factors of traffic accident severity in Los Angeles County [24]. Umer compared tree-based ensemble models (Random Forest, AdaBoost, Extra Tree, and Gradient Boosting) and an ensemble of two statistical models (Logistic Regression and Stochastic Gradient Descent) as voting classifiers in predicting road traffic accident severity, and found that random forest had the best classification performance [25]. Ramya employed the random forest algorithm, in comparison with an artificial neural network (ANN) and a J48 decision tree, to predict the severity of accidents [26]. It is worth noting that most scholars believe the machine learning model with the best prediction accuracy should be selected to calculate feature importance, since the higher the performance of the model, the more convincing the calculated feature importance [17,23,25,26].
Although these researchers have paid attention to accident severity analysis and tried to find the key features in predicting injury severity, they mainly focus on the performance of the prediction model and usually ignore the interpretation of the causes of fatal or serious traffic accidents. Fortunately, data visualization may help solve this problem, because it can not only compare multiple features of the dataset to see what affects accident severity, but also visualize the variation over these features to characterize trends [27][28][29][30]. There are already many visualization tools and methods providing novel perspectives for different research fields [31,32]. However, traditional visualization methods cannot identify the important features of accident severity on their own and rely on the researchers' experience. Thus, the combination of machine learning algorithms and visualization methods is important and necessary to fill this gap.
Inspired by these works, and based on the traffic accident data of the United Kingdom in 2017, we combine a hybrid algorithm, namely LightGBM-TPE, with a data visualization method to investigate the feature importance of accident severity. LightGBM-TPE is most appropriate here because it achieves better prediction accuracy compared with other typical machine learning algorithms. By means of SHAP, we find that "Longitude", "Latitude" and "Hour" are three risk factors playing decisive roles in accident severity. Visualization of these features helps explain their exact influence on traffic accidents.
The rest of the paper is organized as follows: Section 2 gives a brief description of the data and the data preprocessing. Section 3 introduces the methodology, which contains the visualization design and algorithm details. Results and discussion are presented in Section 4, and the conclusion is given in Section 5.

Data description
We use a dataset of accidents that happened in the UK, downloaded from the popular machine learning competition platform Kaggle [33]. The file contains rich information about road accidents: each row represents one accident, and 34 columns describe the details of each accident. In the following, we provide a short overview of this dataset. Fatal accidents are the most serious threat to human life, directly threatening the existing human environment and affecting sustainable city development strategies. Thus, we use two classes, namely fatal and non-fatal (including serious and slight) accidents, to characterize all traffic accidents. Without loss of generality, only accidents that occurred in 2017 are taken into account. Furthermore, since the study focuses on identifying the leading causes of fatal traffic accidents, the classification target is whether a traffic accident is fatal. As a result, 1642 fatal accidents are labeled as positive cases, while 21,780 serious and 100,308 slight accidents are labeled as negative cases (non-fatal).
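As a minimal sketch of how this binary target can be constructed (the file name "Accident_Information.csv" and the column names "Year" and "Accident_Severity" below are assumptions about the Kaggle export, not confirmed by the paper):

```python
import pandas as pd

# Load the UK accident records; file and column names are assumptions about the
# Kaggle export described in the text.
df = pd.read_csv("Accident_Information.csv", low_memory=False)

# Keep accidents that occurred in 2017, as in the study.
df = df[df["Year"] == 2017]

# Binary target: fatal accidents are positive cases, serious and slight are negative.
df["is_fatal"] = (df["Accident_Severity"] == "Fatal").astype(int)
print(df["is_fatal"].value_counts())
```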

Data preprocessing
Before feeding the data into the machine learning model, it is usually necessary to preprocess the raw data. The preprocessing consists of three parts: data cleaning, data formatting, and data balancing.

Data cleaning
The raw data contains much information irrelevant to accident severity: for example, "Did_Police_Officer_Attend_Scene_of_Accident" represents information unnecessary for our research, and the information in "InScotland" is already contained in "Longitude" and "Latitude", so it is also unnecessary here. In the process of data cleaning, such useless data is excluded to achieve better performance. In this research, 10 features related to accident severity are considered, which can be divided into four groups: the location of the accident ("Longitude" and "Latitude"); the time of the accident ("Hour", "Month" and "Day_of_Week"); the environment ("Light Condition" and "Weather Condition"); and the road ("Speed Limit", "Road Type" and "Road Surface Condition"). We use these ten features as input and take accident severity as output. Table 1 shows the descriptive statistics of the categorical data used.
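A sketch of this selection step is given below; the column names are hypothetical stand-ins for the Kaggle schema (the actual file may use, e.g., "Light_Conditions" rather than "Light Condition"), and "Hour" and "Month" are derived from assumed raw date and time columns:

```python
# Hypothetical column names for the ten retained features.
features = [
    "Longitude", "Latitude",                                  # location
    "Hour", "Month", "Day_of_Week",                           # time
    "Light_Conditions", "Weather_Conditions",                 # environment
    "Speed_limit", "Road_Type", "Road_Surface_Conditions",    # road
]

# "Hour" and "Month" are assumed to be derived from raw "Time" and "Date" columns.
df["Hour"] = pd.to_datetime(df["Time"], format="%H:%M", errors="coerce").dt.hour
df["Month"] = pd.to_datetime(df["Date"], errors="coerce").dt.month

X = df[features]
y = df["is_fatal"]
```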

Data formatting
For better performance, the categorical variables are converted to dummy variables, which provide more information to the algorithm. In this study, we adopt one-hot encoding to convert the categorical variables. For example, Day_of_Week, a categorical variable in the raw dataset, is expanded into Sunday, Friday and so on, indicating the exact day of the week on which the accident occurred.
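A minimal sketch of this encoding, assuming the feature matrix `X` and the hypothetical column names from the cleaning step above:

```python
# One-hot encode the categorical features; numeric features such as Longitude,
# Latitude, Hour and Speed_limit are left unchanged.
categorical = ["Day_of_Week", "Light_Conditions", "Weather_Conditions",
               "Road_Type", "Road_Surface_Conditions"]
X = pd.get_dummies(X, columns=categorical)
print(list(X.columns)[:10])  # e.g. Day_of_Week_Sunday, Day_of_Week_Friday, ...
```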

Data balancing
In general, there are far more non-fatal accidents than fatal ones, so the accident dataset is imbalanced. To address this problem, many researchers apply oversampling or undersampling techniques to imbalanced data. Oversampling is preferred here because undersampling may lead to a decrease in prediction accuracy. In this study, we use the Synthetic Minority Oversampling Technique (SMOTE) [35], which has been widely employed in previous studies [24], to deal with the imbalanced data.
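A sketch of the balancing step with the imbalanced-learn implementation of SMOTE; splitting before resampling is a common precaution the paper does not spell out, so it is an assumption here:

```python
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# Hold out a test set first, then oversample only the training split so that
# synthetic samples do not leak into evaluation (assumed protocol).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

smote = SMOTE(random_state=42)
X_res, y_res = smote.fit_resample(X_train, y_train)
print(pd.Series(y_res).value_counts())  # classes are now balanced
```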

Methodology
In this section, we first introduce the candidate machine learning methods, namely Logistic Regression, Gradient Boosted Decision Tree (GBDT), eXtreme Gradient Boosting (XGBoost), LightGBM and LightGBM-TPE. The model with the best prediction accuracy is then used to calculate feature importance for the visualization process, since researchers consider that the higher the accuracy of the model, the more credible the results of the feature importance calculation [17,23,25,26].

Logistic regression
Logistic regression, also known as the logit model, belongs to statistical analysis and is often used to solve classification problems. In logistic regression, the output is the probability of each class, which is calculated by the sigmoid function:

$$g(z) = \frac{1}{1 + e^{-z}},$$

where $z$ is the output of the linear operation and $g(z)$ is the probability converted by the sigmoid function.
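A one-line sketch of the sigmoid mapping and a scikit-learn logistic regression baseline fitted on the resampled data from the previous section (illustrative only, not the paper's exact configuration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    # g(z) = 1 / (1 + e^{-z}) maps the linear score z to a probability in (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_res, y_res)
print(logreg.predict_proba(X_test)[:5, 1])  # probability of the positive (fatal) class
```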

GBDT
Gradient boosted decision tree (GBDT) [36] is a generalization of boosting to arbitrary differentiable loss functions. GBDT is an accurate and efficient off-the-shelf procedure for regression and classification problems in a variety of fields, including traffic accident analysis, due to its state-of-the-art performance on many machine learning tasks. However, the emergence of big data brought a new challenge for GBDT concerning the trade-off between efficiency and accuracy [38]. Therefore, an improved algorithm based on GBDT, called XGBoost, was created to address this problem.

XGBoost
The XGBoost algorithm was proposed by Chen [37]. The objective function of XGBoost is similar to that of GBDT, except that a regularization term is added to the objective. Given the training data $\{(x_i, y_i)\}_{i=1}^{N}$, the objective function of XGBoost consists of two parts:

$$Obj = \sum_{i=1}^{N} l(y_i, \hat{y}_i) + \sum_{k} \Omega(f_k), \qquad \Omega(f_k) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2,$$

where $Obj$ is the objective to minimize, $l(y_i, \hat{y}_i)$ is the per-sample loss function measuring the difference between the true value $y_i$ and the prediction $\hat{y}_i$, and $\Omega(f_k)$ is the regularization term that prevents overfitting; here $f_k$ is the $k$th tree model, $w$ and $T$ are the leaf weights and the number of leaf nodes of the $k$th tree, and $\lambda$ is the regularization penalty on the leaf weights.
Another important improvement in XGBoost is the use of both the first and second derivatives of the loss function to optimize the objective. Therefore, the XGBoost algorithm can handle sparse data better than GBDT.
The gain of a splitting node of the tree is computed as

$$Gain = \frac{1}{2}\left[\frac{\left(\sum_{i \in I_L} g_i\right)^2}{\sum_{i \in I_L} h_i + \lambda} + \frac{\left(\sum_{i \in I_R} g_i\right)^2}{\sum_{i \in I_R} h_i + \lambda} - \frac{\left(\sum_{i \in I} g_i\right)^2}{\sum_{i \in I} h_i + \lambda}\right] - \gamma,$$

where $g_i$ and $h_i$ are the first- and second-order partial derivatives of the loss function with respect to the prediction of the previous tree model, and $I_L$ and $I_R$ are the sets of samples in the left and right child nodes, respectively, with $I = I_L \cup I_R$.

LightGBM
LightGBM is an improved framework based on XGBoost, released by Microsoft in 2017 [38]. LightGBM and XGBoost [37] both support parallel computation, but LightGBM is more powerful than XGBoost, with faster training speed and lower memory occupation, thereby reducing the communication cost of parallel learning. LightGBM is mainly characterized by a decision tree algorithm based on gradient-based one-side sampling (GOSS), exclusive feature bundling (EFB), and a histogram-based, leaf-wise growth strategy with a depth limit. GOSS achieves a good balance between the number of samples and the accuracy of LightGBM's decision trees: during training, the downsampling pays more attention to the samples with larger gradients, which have a greater impact on the information gain. When the feature space is sparse, LightGBM can reduce the feature dimension by bundling mutually exclusive features into a new feature by means of EFB.
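A minimal LightGBM sketch; the hyperparameter values here are illustrative placeholders, not the tuned configuration reported later. GOSS is requested explicitly, whereas plain gradient boosting ("gbdt") is LightGBM's default:

```python
import lightgbm as lgb

# Minimal LightGBM classifier sketch (X_res, y_res assumed from the preprocessing above).
lgbm = lgb.LGBMClassifier(
    boosting_type="goss",   # gradient-based one-side sampling described above;
                            # newer LightGBM versions prefer data_sample_strategy="goss"
    num_leaves=63,          # leaf-wise growth is controlled mainly by num_leaves
    max_depth=-1,           # -1 means no depth limit
    n_estimators=200,
    learning_rate=0.05,
)
lgbm.fit(X_res, y_res)
```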

LightGBM-TPE
LightGBM uses a leaf-wise tree growth algorithm, in contrast to many other popular algorithms that use depth-wise tree growth. The leaf-wise algorithm converges much faster than depth-wise growth, but it may overfit without appropriate hyperparameters [38]. Hyperparameter optimization means adjusting the algorithm hyperparameters to achieve better model performance.
In this paper, the Tree-structured Parzen Estimator (TPE) is used to improve the performance of the LightGBM model by finding the best hyperparameter combination. TPE is based on the Sequential Model-Based Optimization (SMBO) algorithm, which constructs a response surface model and iteratively gathers additional data based on the current model [39,40]. In SMBO, there are several inputs for constructing the model: the target algorithm $f$, which is a mapping from hyperparameters $x$ to a cost metric $y$ (such as a loss function); the hyperparameter configuration space $\theta$; and the history list of observed variables $H = \{(x^{(1)}, f(x^{(1)})), (x^{(2)}, f(x^{(2)})), \ldots, (x^{(k)}, f(x^{(k)}))\}$.
The TPE algorithm adopts Kernel Density Estimation (KDE) to generate the surrogate function. In contrast to other SMBO methods that use regression to build the surrogate model, TPE applies a classification-like approach, splitting the observations into two groups to build the surrogate function. In general, the algorithm is composed of the following steps.
Step 1. Initialize the lists of observed variables H by the randomly selected set of hyperparameters.
Step 2. Model $p(x \mid y)$ using $H$ sorted by $y$, with the definition

$$p(x \mid y) = \begin{cases} l(x), & y < y^{*} \\ g(x), & y \geq y^{*} \end{cases}$$

where $y^{*}$ is some quantile of the observed $y$ values, $l(x)$ is the density formed by the observations $x_i$ whose loss $f(x_i)$ was less than $y^{*}$, and $g(x)$ is the density formed by the remaining observations.
Step 3. Draw many candidates from the tree-structured form of $l$ and $g$, and evaluate them according to the ratio $g(x)/l(x)$. On each iteration, the algorithm returns the candidate $x^{*}$ with the greatest expected improvement (EI), which corresponds to the smallest $g(x)/l(x)$.
Step 4. Add x * and f(x * ) to the history H and return to step 2 until the time limit is reached.
The search space is shown in Table 2, and the hyperparameters are explained as follows. max_depth is the maximum depth of the tree model; increasing max_depth may cause overfitting. min_data_in_leaf is a very important parameter for preventing LightGBM from overfitting. n_estimators is the number of trees, which represents the number of training iterations on the data; too small an n_estimators will lead to underfitting, while too large an n_estimators will cause overfitting. Moreover, num_leaves is the main parameter controlling the complexity of the tree model. To estimate the accuracy of the model, 5-fold cross-validation was carried out in the optimization process. Finally, the best hyperparameter combination was found by the TPE algorithm, and these four hyperparameters were set to max_depth=10, num_leaves=246, n_estimators=380 and min_data_in_leaf=20.
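The paper does not name the software used for TPE; the sketch below uses the hyperopt library, which implements the TPE algorithm. The search ranges are placeholders standing in for Table 2, and only the four hyperparameters discussed above are tuned:

```python
import lightgbm as lgb
from hyperopt import fmin, tpe, hp, Trials
from sklearn.model_selection import cross_val_score

# Placeholder search space; the exact bounds of Table 2 are not reproduced here.
space = {
    "max_depth":        hp.quniform("max_depth", 3, 15, 1),
    "num_leaves":       hp.quniform("num_leaves", 16, 300, 2),
    "n_estimators":     hp.quniform("n_estimators", 50, 500, 10),
    "min_data_in_leaf": hp.quniform("min_data_in_leaf", 10, 100, 5),
}

def objective(params):
    clf = lgb.LGBMClassifier(
        max_depth=int(params["max_depth"]),
        num_leaves=int(params["num_leaves"]),
        n_estimators=int(params["n_estimators"]),
        min_child_samples=int(params["min_data_in_leaf"]),  # sklearn alias of min_data_in_leaf
    )
    # 5-fold cross-validated F1; negated because fmin minimizes its objective.
    return -cross_val_score(clf, X_res, y_res, cv=5, scoring="f1").mean()

trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=100, trials=trials)
print(best)  # the paper reports max_depth=10, num_leaves=246, n_estimators=380, min_data_in_leaf=20

# Refit a final model with the reported tuned values for later analysis.
best_model = lgb.LGBMClassifier(max_depth=10, num_leaves=246,
                                n_estimators=380, min_child_samples=20).fit(X_res, y_res)
```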

Visualization design with SHAP
SHapley Additive exPlanation (SHAP) constructs an additive explanation model inspired by coalitional game theory [41]. In SHAP, all features in a sample are regarded as "contributors" to the prediction of the machine learning model [42]. For each prediction sample, SHAP produces an evaluation value assigned to each feature in that sample, referred to as the SHAP value. Following the additive feature attribution methods, SHAP constructs a linear function $g$ over binary variables to approximate the target function $f$. The explanation model is

$$g(z^{\prime}) = \phi_0 + \sum_{i=1}^{M} \phi_i z_i^{\prime},$$

where $z_i^{\prime} \in \{0, 1\}$ indicates whether feature $i$ participates in the prediction, $M$ is the number of features, and $\phi_i$ represents the contribution of feature $i$. The contribution of feature $i$ is defined as

$$\phi_i = \sum_{C \subseteq N \setminus \{i\}} \frac{|C|!\,(|N|-|C|-1)!}{|N|!}\left[f_x(C \cup \{i\}) - f_x(C)\right],$$

where $N$ is the set of all input features, $C$ is a subset of features not containing feature $i$, and $f_x(C)$ is the prediction of the model conditioned on the input instance $x$ restricted to the feature subset $C$. For TreeSHAP, $f_x(C)$ is defined as

$$f_x(C) = E\left[f(x) \mid x_C\right],$$

i.e., the expected value of the function $f$ given the input feature subset $C$ [43]. Traditional feature importance only tells which features are important, but does not reflect how a feature affects the prediction result. In contrast, the SHAP value not only reflects the influence of each feature in each sample, but also shows whether that influence is positive or negative. After feature importance is calculated through SHAP values, visualization results can be obtained. We display the SHAP dependence plots of the most important features to describe how they affect the prediction output. More importantly, the dependence of not only the total number of fatal accidents but also the ratio of fatal accidents to all traffic accidents on these features is investigated in this paper, which helps to uncover the fundamental factors associated with accident severity.
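A sketch of the SHAP computation with the shap package, assuming `best_model` is the tuned LightGBM classifier fitted above; note that for binary classifiers some shap versions return one array per class, in which case the positive-class array should be used:

```python
import shap

explainer = shap.TreeExplainer(best_model)
shap_values = explainer.shap_values(X_test)
if isinstance(shap_values, list):   # some shap versions return [class 0, class 1]
    shap_values = shap_values[1]    # keep the positive (fatal) class

# Global importance bar chart and beeswarm summary (cf. Figs. 1 and 2).
shap.summary_plot(shap_values, X_test, plot_type="bar")
shap.summary_plot(shap_values, X_test)

# Dependence plot for one feature, colored by an interacting feature (cf. Fig. 3).
shap.dependence_plot("Longitude", shap_values, X_test, interaction_index="Latitude")
```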

Performance comparison
To find the model with the best prediction accuracy, we compare the performance of LightGBM-TPE with four other machine learning models, namely LightGBM, Logistic Regression, Gradient Boosted Decision Tree and XGBoost, in terms of four key metrics: F1 score, accuracy, recall and precision. To verify the results, we conduct 5-fold cross-validation: the dataset is shuffled and split into 5 consecutive folds, each fold is treated once as the validation set with the remaining 4 folds as the training set, and the final result is obtained by averaging the five validations.
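A sketch of this comparison using scikit-learn's cross_validate over the four metrics; the candidate models below use default settings except for the tuned LightGBM-TPE configuration, which is an assumption about how the paper set up the baselines:

```python
from sklearn.model_selection import cross_validate, KFold
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
import lightgbm as lgb

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "GBDT": GradientBoostingClassifier(),
    "XGBoost": XGBClassifier(),
    "LightGBM": lgb.LGBMClassifier(),
    "LightGBM-TPE": lgb.LGBMClassifier(max_depth=10, num_leaves=246,
                                       n_estimators=380, min_child_samples=20),
}

cv = KFold(n_splits=5, shuffle=True, random_state=42)
for name, model in models.items():
    scores = cross_validate(model, X_res, y_res, cv=cv,
                            scoring=["f1", "accuracy", "recall", "precision"])
    print(name, {m: round(scores[f"test_{m}"].mean(), 4)
                 for m in ["f1", "accuracy", "recall", "precision"]})
```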
As shown in Table 3, the precision of LightGBM-TPE is better than that of any other algorithm, reaching 0.9268. LightGBM-TPE also has the best performance in terms of F1 score, accuracy and recall. In comparison, GBDT performs worst among these five machine learning methods. Thus, it is reasonable to select LightGBM-TPE as the prediction model for calculating SHAP values in the further analysis, which is consistent with previous work [23].

Feature importance and SHAP value
The mean magnitude of the SHAP value is utilized to characterize feature importance, as described in the previous section. The higher the feature importance, the more the feature contributes to the model's prediction. As shown in Fig. 1, the 10 features are sorted according to their importance. Clearly, Longitude ranks first, followed by Latitude and Hour.
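The importance measure itself is simply the mean absolute SHAP value per feature, e.g.:

```python
import numpy as np
import pandas as pd

# Mean |SHAP| per feature, i.e. the quantity plotted in Fig. 1 (sketch).
importance = pd.Series(np.abs(shap_values).mean(axis=0), index=X_test.columns)
print(importance.sort_values(ascending=False).head(10))
```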
Fig. 1 provides the relative importance on the training dataset, but it cannot reflect the range or distribution of a feature's impact on the output, nor the relationship between a feature's value and its exact impact. Thus, a SHAP summary plot is provided to concisely describe these aspects. As presented in Fig. 2, each dot denotes one accident, with its positions on the y-axis and x-axis corresponding to the feature value and the SHAP value, respectively. Redder dots indicate larger feature values. For Longitude and Latitude, the SHAP values are distributed roughly symmetrically around zero, implying that even two similar feature values might have completely opposite effects on accident severity. In this sense, it is the combination of Longitude and Latitude, rather than either alone, that is meaningful, whereas the other features can be analyzed independently. More importantly, it can be observed that the spatial features (Longitude and Latitude) play the most important role in LightGBM-TPE, the time features (Hour, Day_of_Week and Month) are on the second tier, while the other features (such as Weather_Condition) have only a slight impact on fatal accident occurrence (see Fig. 2). This observation is in line with previous research [23].

Analysis with visualization
Based on the calculated SHAP values, we use the visualization method, which helps to intuitively understand the results, to further explore how the decisive features influence accident severity. Without loss of generality, the features with the highest SHAP values are taken into consideration: Longitude, Latitude, Hour and Day_of_Week.

Longitude and latitude
As shown in Fig. 3, each point represents a traffic accident, with its longitude on the x-axis, ranging from −7 to 2. The y-axis indicates the SHAP value, and the color of each point denotes its latitude value. As before, a higher SHAP value implies a higher likelihood of a fatal accident. Interestingly, as Longitude changes, the SHAP value exhibits a nonlinear trend. In particular, when the Longitude value is between −4.5 and −3.5, or between 0 and 1.5, the SHAP value remains negative for most accidents, signaling that fatal accidents rarely occur in these ranges of Longitude, in contrast to the other ranges. Moreover, more intricate relations between Longitude and Latitude can be explored based on Fig. 3. For example, when the Longitude is around −3.9, the higher the Latitude, the higher the SHAP value, corresponding to more fatal accidents.
Then, the distribution of fatal accidents in the UK is explored, as shown in Fig. 4. To analyze the joint effect of Longitude and Latitude, the whole region is divided into an 18-by-20 grid of squares. Noticeably, three important regions are marked to represent major urban agglomerations: region A (London), region B (Birmingham) and region C (Liverpool, Manchester and Sheffield). As shown in Fig. 4(a), the total number of fatal accidents in the big cities (regions A, B and C) is significantly larger than in other areas. However, this conclusion does not carry over to the distribution of the proportion of fatal accidents. Instead, the fatal accident ratio is highest in the high-Latitude regions, rather than in region A, B or C (see Fig. 4(b)), which coincides with the conclusion drawn from Fig. 3. We can infer that effective traffic control and management strategies are already executed in these big cities, which has a dramatically beneficial effect in decreasing the fatal accident rate. It is therefore advisable for the authorities of the high-Latitude regions to focus more on the control of accident severity to facilitate traffic safety.
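The grid in Fig. 4 can be reproduced approximately by binning the coordinates, as sketched below with the hypothetical columns introduced earlier:

```python
# Bin accidents into an 18-by-20 Longitude/Latitude grid and compute, per cell,
# the number of fatal accidents and the fatal ratio (assumed column names).
df["lon_bin"] = pd.cut(df["Longitude"], bins=18)
df["lat_bin"] = pd.cut(df["Latitude"], bins=20)
grid = (df.groupby(["lon_bin", "lat_bin"], observed=True)["is_fatal"]
          .agg(fatal_count="sum", fatal_ratio="mean"))
print(grid.sort_values("fatal_ratio", ascending=False).head())
```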

Hour and Day_of_Week
The time features are also worth investigating. Fig. 5 displays the impact of both Hour and Day_of_Week on accident severity. Interestingly, the SHAP values are mostly negative from 1:00 to 3:00 and from 7:00 to 8:00, while they are mostly positive from 3:00 to 4:00 and from 19:00 to 23:00. At other times, the distribution of SHAP values is more balanced. It seems that fatal accidents are most likely to occur at night (19:00 to 23:00). It is also worth noting that 3:00 is a turning point at which the SHAP values increase substantially. Another interesting phenomenon is that there exists a "black Wednesday", on which most of the fatal accidents of the small hours (from 0:00 to 3:00) occur.
Last but not least, the influence of Hour on both the number and the proportion of fatal accidents is investigated. As shown in Fig. 6, as functions of Hour, the number curve and the proportion curve show significantly different trends. Specifically, the number of fatal accidents reaches its maximum (about 118) at 17:00, while the fatal accident ratio peaks at 4:00 (about 4.5%). In this sense, although few in total number, traffic accidents in the early morning seem to be the deadliest of the whole day. In addition, the fatal accident ratio gradually increases from 19:00 to 4:00. This is in agreement with the observation from Fig. 5. Based on these facts, we argue that a stricter traffic control policy should be implemented in the UK from night to early morning.
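Similarly, the two curves of Fig. 6 can be sketched by aggregating on Hour:

```python
import matplotlib.pyplot as plt

# Hourly number and proportion of fatal accidents (sketch for Fig. 6).
hourly = df.groupby("Hour")["is_fatal"].agg(fatal_count="sum", fatal_ratio="mean")

fig, ax1 = plt.subplots()
ax1.plot(hourly.index, hourly["fatal_count"], label="number of fatal accidents")
ax2 = ax1.twinx()
ax2.plot(hourly.index, hourly["fatal_ratio"], color="tab:red", label="fatal accident ratio")
ax1.set_xlabel("Hour of day")
plt.show()
```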

Conclusion
Understanding and predicting traffic accidents is important for social engineering in general and urban planning in particular. Recently, scientists have recognized vehicular traffic as an interesting complex system [44][45][46][47][48][49][50], which is worth studying with scientific methods such as machine learning [51][52][53]. Based on the traffic accident data of the UK in 2017, LightGBM-TPE, a hybrid machine learning model, is proposed. With the support of the visualization method, it is found that "Longitude", "Latitude", "Hour" and "Day_of_Week" play the most important roles in determining accident severity. Interestingly, this research can provide some practical advice for urban planners: for example, more effective traffic control and management strategies should be executed in the high-Latitude regions, where the fatal accident ratio is remarkably high, and a stricter traffic control policy should be implemented in the UK from night to early morning, since the fatal accident ratio keeps ascending during this period until it peaks at 4:00. We hope this work contributes to the understanding of how important risk factors influence accident severity, and further provides a new approach to effectively control the occurrence of fatal accidents. However, it is worth noting that the expected behavior of the "complex system" studied in this work can be reached without institutional incentives. Thus, the influence of the existence, and further the cost, of institutional incentives on the evolutionary dynamics of traffic accidents deserves deep investigation in the future [54,55].

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Fig. 3. SHAP dependence plot showing impact of Longitude and Latitude on model output.

Fig. 4. The distribution of (a) total number of fatal accidents and (b) proportion of fatal accidents to all accidents.

Fig. 5. SHAP dependence plot showing impact of Hour and Day_of_Week on model output.

Table 1
Data description.

Table 2
Search space and optimized hyperparameters.

Table 3
Comparison of 5 machine learning models.