Factors, Prediction, and Explainability of Vehicle Accident Risk Due to Driving Behavior through Machine Learning: A Systematic Literature Review, 2013–2023

Abstract: Road accidents are on the rise worldwide, causing 1.35 million deaths per year, thus encouraging the search for solutions. The promising proposal of autonomous vehicles stands out in this regard, although fully automated driving is still far from being an achievable reality. Therefore, efforts have focused on predicting and explaining the risk of accidents using real-time telematics data. This study aims to analyze the factors, machine learning algorithms, and explainability methods most used to assess the risk of vehicle accidents based on driving behavior. A systematic review of the literature produced between 2013 and July 2023 on factors, prediction algorithms, and explainability methods to predict the risk of traffic accidents was carried out. Factors were categorized into five domains, and the most commonly used predictive algorithms and explainability methods were determined. We selected 80 articles from journals indexed in the Web of Science and Scopus databases, identifying 115 factors within the domains of environment, traffic, vehicle, driver, and management, with speed and acceleration being the most extensively examined. Regarding machine learning advancements in accident risk prediction, we identified 22 base algorithms, with convolutional neural network and gradient boosting being the most commonly used. For explainability, we discovered six methods, with random forest being the predominant choice, particularly for feature importance analysis. This study categorizes the factors affecting road accident risk, presents key prediction algorithms, and outlines methods to explain the risk assessment based on driving behavior, taking vehicle weight into consideration.


Introduction
There are around 1.35 million deaths worldwide per year due to vehicle accidents [1]; in Europe, 60% of such deaths occur on two-lane roads [2]. In this regard, the United Nations Organization has proposed 17 sustainable development goals (SDGs) for the year 2030, where SDG-3, "Good health and well-being", aims to reduce deaths and injuries resulting from traffic incidents by 50% worldwide [3]. One potential option is the implementation of autonomous vehicles. Nevertheless, complete automation in driving is still a considerable distance away, making it unlikely in the foreseeable future [4]; furthermore, extensive research is still needed, especially in the field of prediction.
Since 1975, research has focused on predicting vehicle accident risk (VAR). Chipman and Morgan [5] studied various factors such as demerit points, age, gender, license class, and accident history. Their findings highlighted demerit points as the key factor influencing future accident risk, offering a chance to prevent accidents when linked with driving behavior (DB). Extensive research over time has led to modifications and regulations in environmental, vehicular, traffic, driver, and management domains to reduce risks. These measures include deceleration devices and central protection barriers on roads for risk mitigation [2]. Additionally, mechanisms have been implemented for collision prevention, pedestrian identification, lane change alerts, and detection of driver distraction and drowsiness with feedback to the driver, among other capabilities [6]. These advancements prompted governments to implement safety manuals, such as the Road Safety Manual, which includes a widely used VAR prediction model [7,8]; however, this model does not consider DB in its statistical analysis [9].
Computation 2024, 12, 131
The study problem in this article focuses on driving behavior (DB) and its impact on traffic accident incidence. DB refers to the actions and responses of a driver during various driving scenarios, encompassing the journey from an initial point to a final destination, taking into account factors such as travel time [10]. DB can be categorized into distinct groups with similar patterns, facilitating the estimation of driving risk levels [11]. These groups include normal, drowsy, and aggressive behaviors [12].
In the VAR context, DB holds utmost significance as it accounts for the highest incidence of traffic accidents, surpassing 70% of accidents in certain countries, such as Peru [13]. Accordingly, the vehicle accident risk due to driving behavior (DBVAR) refers to the probability of a traffic accident occurring due to actions taken by drivers behind the wheel, which can increase the chances of suffering an accident and endanger road safety. Identifying this risk is fundamental to protecting human lives, promoting road safety, reducing the costs associated with traffic accidents, and developing effective safety policies.
Research has been conducted to determine factors for predicting traffic incidents using machine learning (ML) methods. Xu et al. [14] found that there was a strong correlation between aggressive DB and aspects of the driver, vehicle, and environment. In a similar vein, Li et al. [15] included the environment, vehicle, driver, and traffic. Likewise, Niu et al. [16] and Yang et al. [17] included the management domain. It is important to study each of these factors, not only to find a better model but also to mitigate the risk of accidents and their consequences, in addition to improving road safety [18].
Regarding prediction, different artificial intelligence algorithms have been used to predict the risk of vehicle accidents. Geng et al. [2] presented an extensive modeling framework for evaluating truck safety on two-lane rural roads using extreme gradient boosting (XGBoost), achieving an accuracy of 96.67%. In the study by Peng et al. [19], it was noted that long short-term memory (LSTM) is suitable for extracting significant and continuous information from vehicles such as accelerations and decelerations, which they applied for DBVAR prediction and achieved a 93.5% accuracy.
On the other hand, various algorithms have also been used for DBVAR explainability. In the study by Masello et al. [20], Shapley additive explanations (SHAP) was applied, and it was found that the speed limit was a very relevant factor for the riskiest events. Similarly, the study by Alfai et al. [21], based on the random forest (RF) feature importance method, discovered that the most significant predictors for DBVAR were the mean speed of the vehicle, the vehicle's instantaneous speed, and its longitudinal acceleration.
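SHAP builds on Shapley values from cooperative game theory, which split a prediction among the input features. As a rough illustration, the sketch below computes exact Shapley values for a toy three-feature risk score by averaging each feature's marginal contribution over all feature orderings. The `risk_score` function, the feature names, and the sample values are hypothetical assumptions made for this example; they are not taken from Masello et al. [20], and the code does not use the SHAP library.

```python
from itertools import permutations

FEATURES = ["speed_kmh", "accel_ms2", "headway_s"]  # hypothetical feature names


def risk_score(x):
    """Hypothetical toy risk model, NOT taken from any reviewed study."""
    speed, accel, headway = x
    # Interaction term: harsh acceleration counts more at high speed.
    return 0.01 * speed + 0.2 * accel + 0.002 * speed * accel - 0.1 * headway


def shapley_values(model, x, baseline):
    """Exact Shapley values: mean marginal contribution over all orderings."""
    n = len(x)
    contrib = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)      # start with every feature at its baseline
        prev = model(current)
        for i in order:
            current[i] = x[i]         # reveal feature i
            new = model(current)
            contrib[i] += new - prev  # marginal contribution in this ordering
            prev = new
    return [c / len(orderings) for c in contrib]


baseline = [60.0, 1.0, 2.0]  # assumed "typical" driving sample
event = [110.0, 3.5, 0.8]    # assumed risky event
phi = shapley_values(risk_score, event, baseline)
for name, value in zip(FEATURES, phi):
    print(f"{name}: {value:+.3f}")
```

By construction, the contributions add up to the difference between the score of the event and the score of the baseline, which is the property that makes this kind of attribution easy to audit. Enumerating all orderings is only feasible for a handful of features; practical tools rely on approximations.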
The amount of research on DBVAR has motivated various researchers to perform state-of-the-art studies. In the study by Bouhsissin et al. [22], 93 articles published between 2015 and 2022 were reviewed, from which it was highlighted that ML algorithms occupied the predominant position with 60%, followed by deep learning (DL) algorithms and statistical methods (with 34.87% and 5.15%, respectively). The most-used algorithms were support vector machine (SVM), logistic regression (LR), LSTM, artificial neural network (ANN), k-nearest neighbors (KNN), RF, and convolutional neural network (CNN). In parallel, 39 relevant factors were identified in this area. In the study by Paredes et al. [23], 27 articles published between 2015 and 2020 were analyzed, finding 17 ML algorithms in which Bayesian algorithms and decision trees mainly stood out. In addition, 21 relevant factors were identified in this context, coinciding with the results of Bouhsissin et al. [22], where the most used were acceleration, deceleration, and speed. Likewise, in the research of Elassad et al. [24], 82 articles from the period 2009-2019 were reviewed, and the factors and prediction aspects were analyzed. A total of 14 general factors grouped into the dimensions of driver, vehicle, and environment were identified, and it was found that SVM, neural network (NN), Bayesian learners (BL), and ensemble learners (EL) were the four most used algorithms, present in 72% of the selected studies. On the other hand, in the study by Silva et al. [25], the prediction and explainability aspects were studied in relation to the frequency of accidents and severity classification, based on 26 articles from the period 2003-2020, and it was found that the main techniques were KNN and decision tree (DT); however, ANN was found to be the most suitable for predicting accident frequency. Furthermore, they highlighted the road environment, human behaviors, accident characteristics, and vehicle-related elements as the main contributors to the elucidation of accident causes.
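To make one of the recurring algorithms in these reviews concrete, the following is a minimal sketch of KNN classification on synthetic telematics-like data, using only the Python standard library. The two features (speed and acceleration magnitude), the class centers, the spreads, and the value of k are assumptions chosen for illustration; they do not reproduce any data set or result from the reviewed studies.

```python
import math
import random
from collections import Counter

random.seed(0)  # fixed seed so the synthetic data are reproducible


def make_samples(n, speed_mu, accel_mu, label):
    """Draw synthetic (speed km/h, |acceleration| m/s^2, label) samples."""
    return [(random.gauss(speed_mu, 10.0), random.gauss(accel_mu, 0.5), label)
            for _ in range(n)]


# Assumed class centers: 0 = normal driving, 1 = risky driving.
data = make_samples(200, 60.0, 1.0, 0) + make_samples(200, 110.0, 3.5, 1)
random.shuffle(data)
train, test = data[:300], data[300:]


def standardizer(rows):
    """Return a function that z-scores both features using training statistics."""
    stats = []
    for j in range(2):
        col = [r[j] for r in rows]
        mu = sum(col) / len(col)
        sd = math.sqrt(sum((v - mu) ** 2 for v in col) / len(col))
        stats.append((mu, sd))
    return lambda r: tuple((r[j] - stats[j][0]) / stats[j][1] for j in range(2))


scale = standardizer(train)
train_pts = [(scale(r), r[2]) for r in train]


def knn_predict(point, k=5):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    nearest = sorted(train_pts, key=lambda p: math.dist(p[0], point))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


correct = sum(knn_predict(scale(r)) == r[2] for r in test)
accuracy = correct / len(test)
print(f"accuracy on synthetic hold-out: {accuracy:.2%}")
```

Features are standardized before computing distances so that speed, which has the larger numeric range, does not dominate the vote. The accuracy printed here only reflects how well separated the synthetic classes are and says nothing about real driving data.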
Studies in the field have revealed that a wealth of knowledge exists that needs to be inventoried, analyzed, and classified. However, in the context of ML, there is a tendency to use algorithms that evaluate risk based on accident frequency and DBVAR, without differentiating between light and heavy vehicles or associated factors related to vehicle trip management. These factors include the estimated delay time to the destination or whether a heavy vehicle is loaded or empty. Furthermore, current approaches focus on contributing factors that explain the frequency or severity of accidents but do not identify the factors contributing to DBVAR. This gap is crucial as regulations increasingly mandate the incorporation of mechanisms for reading trajectory and security data. Through analyzing these data, conducting prediction in real time, and explaining the causes, we can significantly mitigate the number of accidents.
This study aims to systematically review all the important developed aspects related to the factors, prediction, and explainability of DBVAR based on ML and aims to answer the following research question: Which factors, ML advances for prediction, and explainability methods have been investigated in relation to DBVAR?
The main contributions of this article are as follows:
• Providing a comprehensive catalog of traffic accident risk factors, classified into five dimensions;
• Identifying the various prediction algorithms, data sets used, and performance metrics employed in the analysis;
• Compiling the various studies utilizing multiple methods to explain factors contributing to DBVAR;
• Providing the reader with a wide range of bibliographic references that they can utilize to delve deeper into understanding the models based on ML that facilitate prediction and explanation of DBVAR.
This article is organized into five sections, as follows. In Section 2, the methodology followed for the systematic review of the literature is presented. Section 3 presents the results, focused on answering the research questions, the discussion of which is presented in Section 4. Finally, the conclusions follow in Section 5.

Methodology
For this article, a systematic review of the literature was carried out based on the model applied by Silva et al. [25] and Shiguihara et al. [26] to ensure scientific rigor, which consisted of the following phases:

• Planning: Define the research questions to be addressed, establish the sequence of steps for searching and identifying primary studies in indexed databases, and set the inclusion/exclusion criteria used for the selection of articles.
• Development: Select the primary studies in accordance with the planning, then evaluate their quality and extract and synthesize the data.
• Results: Present statistics on publications and answer the research questions, in Sections 2.3 and 3, respectively.

Planning
Three research questions were proposed in order to determine the aspects developed to understand the factors, prediction, and explainability of the DBVAR.
In order to address the research questions, we conducted a review of primary publications in journals indexed in the Scopus and Web of Science (WoS) databases, using the following search string: ("vehicle accident risk" OR "car accident risk" OR "car following" OR "driving behavior" OR "driving style" OR "driver behavior" OR "driving risk" OR "driver risk" OR "road safety") AND ((factors OR features OR causes) OR (predicti* OR forecast* OR progno*) OR (explainability OR explainable OR interpretabl* OR xai)) AND ("machine learning" OR "deep learning" OR lstm).
As shown in Table 1, the string was applied in "title-abs-key" format for Scopus and "topic" format for WoS, considering the period from January 2013 to July 2023. Additionally, the search was limited to publications in journals with a SCImago Journal Rank impact factor. Finally, the inclusion and exclusion criteria established in Table 2 were applied.

Database | Search string
Scopus | TITLE-ABS-KEY (("vehicle accident risk" OR "car accident risk" OR "car following" OR "driving behavior" OR "driving style" OR "driver behavior" OR "driving risk" OR "driver risk" OR "road safety") AND ((factors OR features OR causes) OR (predicti* OR forecast* OR progno*) OR (explainability OR explainable OR interpretabl* OR xai)) AND ("machine learning" OR "deep learning"))
WoS | Results for ("vehicle accident risk" OR "car accident risk" OR "car following" OR "driving behavior" OR "driving style" OR "driver behavior" OR "driving risk" OR "driver risk" OR "road safety") AND ((factors OR aspects OR causes) OR (predicti* OR forecast*) OR (explainability OR explainable OR interpretable OR xai)) AND ("machine learning" OR "deep learning") (Topic)
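As a rough illustration of how such a string is structured, the sketch below assembles the Scopus query from its three term groups (topic, aspect, and method), joined with AND. The helper names `quote` and `or_group` are hypothetical and exist only for this example; the terms themselves are copied from the Scopus search string.

```python
def quote(term: str) -> str:
    """Wrap multi-word phrases in double quotes; leave single tokens bare."""
    return f'"{term}"' if " " in term else term


def or_group(terms) -> str:
    """Join terms with OR inside one pair of parentheses."""
    return "(" + " OR ".join(terms) + ")"


topic = ["vehicle accident risk", "car accident risk", "car following",
         "driving behavior", "driving style", "driver behavior",
         "driving risk", "driver risk", "road safety"]
aspects = [
    ["factors", "features", "causes"],
    ["predicti*", "forecast*", "progno*"],
    ["explainability", "explainable", "interpretabl*", "xai"],
]
methods = ["machine learning", "deep learning"]

topic_clause = or_group(quote(t) for t in topic)
aspect_clause = or_group(or_group(a) for a in aspects)  # nested OR groups
method_clause = or_group(quote(m) for m in methods)

query = f"TITLE-ABS-KEY ({topic_clause} AND {aspect_clause} AND {method_clause})"
print(query)
```

Keeping the term groups as lists makes it easier to see where the database-specific variants differ, for example in the wildcard terms of the aspect group.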

Development
The possible original investigations found during the search were subjected to a selection procedure based on the criteria detailed in Table 2, covering both inclusion and exclusion criteria. To achieve this, it was necessary to carry out a prior review of the content, in order to determine its relevance for the present study and find those studies related to the factors, prediction, or explainability of DBVAR using ML. Most of the works were discarded as they corresponded to unrelated topics such as driver identification, energy consumption, autonomous vehicles, vehicles with fewer than four wheels, racing cars, pollution, level of accident severity, traffic study, or time and cost optimization. Figure 1 explains the applied process and identifies the activities carried out to select or reject studies.

Potentially eligible studies and selected studies
The systematic review search conducted in Scopus and WoS resulted in 1674 articles, of which 80 were selected (see Table 3).

Trend of studies by year
The number of publications in the aspects of factors, prediction, or explainability of DBVAR showed exponential growth both in potential articles (see Figure 2a) and in selected articles (see Figure 2b). This could be explained by the increasing number of traffic accidents and the introduction of ML technologies for accident prediction and explainability.

Articles selected by journal quality factor
Regarding the journal quality factor, 60% (48) of the articles were categorized in quartile Q1 and 35% (28) in quartile Q2, indicating that 95% of the articles fell within the top two quartiles (see Figure 4). This highlights the quality of the studies.

Articles selected by journal
Figure 5 illustrates that the two most prominent journals, Accident Analysis and Prevention and IEEE Access, were situated in the Q1 quartile and collectively accounted for 25% of the publications. Notably, there were 27 other journals categorized under "Others", each contributing a single article.
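The selection and quartile figures reported in this section can be cross-checked with a few lines of arithmetic. The snippet below is only a consistency check on the published counts (1674 retrieved, 80 selected, 48 in Q1, 28 in Q2); the variable names are chosen here for readability.

```python
potentially_eligible = 1674  # articles retrieved from Scopus and WoS
selected = 80                # articles retained after screening
q1, q2 = 48, 28              # selected articles in quartiles Q1 and Q2

selection_rate = selected / potentially_eligible
q1_share = q1 / selected
q2_share = q2 / selected
top_two_share = (q1 + q2) / selected
outside_top_two = selected - q1 - q2  # articles not in Q1 or Q2

print(f"selection rate: {selection_rate:.1%}")
print(f"Q1: {q1_share:.0%}, Q2: {q2_share:.0%}, Q1+Q2: {top_two_share:.0%}")
print(f"articles outside Q1/Q2: {outside_top_two}")
```

The shares of 60%, 35%, and 95% follow directly from the counts, and roughly one in twenty retrieved articles was retained.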

Results
This section addresses the research questions posed in Section 2.1 based on the selected studies.
A. RQ1: What are the factors considered in predicting DBVAR?
DB encompasses a driver's actions, awareness, and adherence to road regulations. These factors can directly impact a driver's behavior or prompt changes, and comprehending them aids in enhancing safety standards [28]. In this context, 115 factors were found in 48 studies. They were classified using three of the four categories from Silva et al. [25], separating the factors related to traffic from the environment category and adding a management category. The accident category (characteristics of the type of accident that occurred) was excluded because it describes a result rather than a risk and therefore does not correspond to DBVAR. The resulting categories were as follows:
(1) Environment: environment and geographical distribution.
(2) Traffic: related to the vehicles surrounding the one being studied.
(3) Vehicle: static or moving mode features.
(4) Driver: related to the human who drives the vehicle.
(5) Management: efficient control and coordination of the vehicle fleet and drivers.
Environment factors: A total of 20 factors were found from 23 articles, where the weather was the most used (in 9), followed by date-time and slope (in 8 and 5, respectively; see Table 4).
Traffic factors: A total of 17 were identified, where the most studied were the distance between two vehicles, the time to collision, and the traffic density, in 13, 10, and 9 studies out of 25, respectively (see Table 5).
Vehicle factors: A total of 44 factors were identified in 39 articles, where the most used were speed, acceleration, and steering angle, in 27, 23, and 9 studies, respectively (see Table 6).
Driver factors: A total of 25 were identified, where the most used were heart rate and eye, in 4 studies each, representing 39% of the 18 studies (see Table 7).
Management factors: A total of nine were identified in two studies (see Table 8). One of them is the concentration of storage locations in an area, which influences traffic patterns and accident risks through truck and delivery vehicle flow, potentially increasing congestion and interactions with other traffic [17].
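The per-domain counts reported for RQ1 can be tallied to confirm that they add up to the 115 factors stated at the start of this subsection and to see each domain's share. This is a small consistency check on the published numbers; the dictionary keys are simply labels chosen for this example.

```python
factors_by_domain = {
    "environment": 20,
    "traffic": 17,
    "vehicle": 44,
    "driver": 25,
    "management": 9,
}

total = sum(factors_by_domain.values())
shares = {domain: count / total for domain, count in factors_by_domain.items()}
largest = max(factors_by_domain, key=factors_by_domain.get)

print(f"total factors: {total}")
for domain, share in shares.items():
    print(f"{domain}: {share:.1%}")
print(f"largest domain: {largest}")
```

The vehicle domain accounts for more than a third of all catalogued factors, while management is the smallest domain.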
On the other hand, the DBVAR prediction studies considered four variables of interest that described the accident, which were presented in two studies (see Table 9) and were used as a prediction object.

Table 9. Variables that describe the accident used in DBVAR prediction.
Studies | Factor
[31] | Severity, number of accidents
[32] | Accident type, accident causes

B. RQ2: What are the advances of ML in DBVAR prediction?
Prediction based on statistical or ML methods makes it possible to anticipate probable future outcomes, such as DB or traffic accidents, from observed events [30]. These models use the factors as input to make predictions; however, once the result is obtained, the reasoning behind the decision remains unknown, and it is not possible to determine which of the factors contributed most significantly to the generated effect [65]. For this reason, they are called "closed box" techniques, and additional explainability techniques are necessary to fully understand them.
Explainable artificial intelligence (XAI) allows for adequate interpretation of the prediction process [17]: models are used to analyze the importance and dependence of the factors that contribute to explaining the result [40]. In this way, confidence and transparency in the predictions can be supported, such that they can reasonably be applied in the field of transportation safety [14].
Regarding explainability, 18 studies were found that used six methods to explain the factors with the greatest contribution to the DBVAR. In this context, RF and GB feature importance were the most used (in 50% of these studies), followed by SHAP (in 33%). The studies mainly focused on explaining DB in accident risk, and China and the United States were the main countries where the studies were applied (see Table 13).

Discussion
The result of this systematic literature review is a catalog of factors, prediction algorithms, and methods used to explain the importance of the factors. Researchers can use the different results to understand progress in the field and provide new approaches to reduce the risk of accidents to protect human lives, promote road safety, reduce the costs associated with traffic accidents, and develop effective safety policies. The relevance of this information is supported by the fact that 95% of the studies were published in journals within the first two quartiles. The research questions are discussed below.

About Factors
In this study, it was observed that the factors were classified into five dimensions (vehicle, environment, traffic, driver, and management), where vehicle was the most studied. Speed, acceleration, and distance between two vehicles stood out as the most-used factors due to their direct influence on the driver's ability to control the vehicle in various risk situations. In addition, they also determine the level of severity of an accident. Additional crucial factors include the geographical location, determined through the Global Positioning System (GPS), as it enables us to comprehend other related elements, such as the geographical environment. The increasing prevalence of cost-effective sensors and cameras in vehicles is driving a trend toward greater data acquisition in real time, consequently enhancing the precision of models. At present, China leads research on DBVAR factors, probably due to the growth, leadership, and expansion of its automotive industry.
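Speed and the distance between two vehicles combine naturally into the time to collision, one of the traffic factors identified in this review. It is commonly computed as the gap divided by the closing speed of the following vehicle. The sketch below assumes both vehicles hold a constant speed; the function name and the sample values are illustrative assumptions rather than a definition taken from any of the reviewed studies.

```python
import math


def time_to_collision(gap_m: float, v_follow_ms: float, v_lead_ms: float) -> float:
    """Seconds until the follower reaches the leader at constant speeds.

    Returns infinity when the follower is not closing the gap.
    """
    closing_speed = v_follow_ms - v_lead_ms
    if closing_speed <= 0:
        return math.inf
    return gap_m / closing_speed


# Assumed sample: 30 m gap, follower at 25 m/s, leader at 20 m/s.
ttc = time_to_collision(30.0, 25.0, 20.0)
print(f"time to collision: {ttc:.1f} s")

# A follower that is slower than the leader never closes the gap.
print(time_to_collision(30.0, 18.0, 20.0))
```

A shorter time to collision leaves the driver less time to react, which is one way to read why these factors recur in the reviewed studies.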
Some studies have considered the accident domain; however, this refers to the results and not the causes. Furthermore, they tend to focus on accident characterization, and so, they have not been considered as factors; however, they could be considered as an object to predict. Likewise, it is important to highlight that there were no factors associated with trip management, such as delay in delivery, the driver's experience on the route, or whether the vehicle was loaded. Therefore, it is important to consider management-related factors (i.e., those in the management dimension) to evaluate commercial vehicles and improve the understanding of vehicle accidents as a whole.

About Explainability
In this study, six explainability methods were identified in 18 studies, where the most studied was "RF feature importance," with influencing factors related to the environment, such as road shape, road network, and weather. The increasing adoption of deep learning algorithms has highlighted the importance of understanding and trusting model decisions, driving the use of explainability methods to identify influential risk factors that might not be obvious to humans. Although the reviewed research barely addressed management factors, it is relevant to study their importance in explainability. Furthermore, there exist very successful methods, such as local interpretable model-agnostic explanations (LIME), which could provide good results in this context.

Conclusions
For this study, we conducted a systematic literature review related to DBVAR through ML. Out of the 1674 articles identified, 80 research papers were selected through analysis, enabling the discovery of advancements in the field with respect to factors, prediction, and explainability. Within this review, we identified 115 factors across 48 studies, 22 prediction algorithms within 76 studies, and 6 explainability methods across 18 studies, all of which elucidated the influence of certain factors on prediction outcomes. Unlike other state-of-the-art studies on DBVAR, this work considered three crucial aspects: the influencing factors, accident prediction, and explainability. In relation to factors, we identified five dimensions: environment (20 factors), traffic (17 factors), vehicle (44 factors), driver (25 factors), and management (9 factors). In particular, speed, acceleration, and distance between two vehicles were the most-studied factors. In the realm of ML advancements, CNN and GB emerged as the most commonly employed algorithms. Moreover, there is a growing trend in leveraging deep learning and hybrid models for enhanced precision. Notably, XGBoost achieved the highest accuracy at 100% on a DBD data set of Turkish origin. It is worth noting that the majority of studies focused on light vehicles, with limited research conducted on heavy vehicles and rural roads. In reference to advances in explainability, it was found that the most-used method was the RF algorithm with feature importance. Additionally, the most studied models were MLP, CNN, GB, LSTM, and RF, and the common factors influencing their performance were speed, acceleration, and heading angle.
This study had some limitations that should be considered. Only studies in English were included, and only the WoS and Scopus databases were used as sources of information. Based on our findings, future research should focus on developing practices and strategies to address DBVAR factors in order to reduce the occurrence of traffic accidents, as well as extending this study to include other languages and additional databases.

Figure 2. Number of publications per year: (a) potentially eligible and (b) selected studies.

Study trends across different countries
Figure 3 illustrates the distribution of studies based on the authors' country of affiliation, with China and the United States representing 45% of the total concentration.

Figure 3. Studies by authors' country of affiliation.

Figure 4. Articles by quality factor.

Table 1. Database search string.

Table 2. Inclusion and exclusion criteria.

Table 3. Potentially eligible studies and selected studies.
a 26 studies removed from WoS for being duplicates in Scopus.

Table 4. Environmental factors used in DBVAR.

Table 5. Traffic factors used in DBVAR.

Table 6. Vehicle factors used in DBVAR.

Table 7. Driver factors used in DBVAR.

Table 8. Management factors used in DBVAR.

Table 10. Algorithms used in the DBVAR.

Table 12. Studies applied by type of driving risk.

Table 13. Methods used in the explainability of the DBVAR.