Prioritizing Highway Safety Manual’s crash prediction variables using boosted regression trees

doi:10.1016/j.aap.2015.03.011

Accident Analysis & Prevention

Volume 79, June 2015, Pages 133-144

https://doi.org/10.1016/j.aap.2015.03.011

Highlights

• Boosted regression trees (BRT) is a data mining method that gives interpretable results.
• We developed BRT models to evaluate variables’ influence on crash predictions.
• Models showed non-linear and complex relation between variables and crash predictions.
• BRT models with higher tree complexity levels resulted in better fitted models.

Abstract

The Highway Safety Manual (HSM) recommends using the empirical Bayes (EB) method with locally derived calibration factors to predict an agency’s safety performance. However, the data needs for deriving these local calibration factors are significant, requiring very detailed roadway characteristics information. Many of the data variables identified in the HSM are currently unavailable in the states’ databases. Moreover, the process of collecting and maintaining all the HSM data variables is cost-prohibitive. Prioritizing the variables by their impact on crash predictions would therefore help to identify the influential variables for which data could be collected and maintained for continued updates. This study aims to determine the impact of each independent variable identified in the HSM on crash predictions. A relatively recent data mining approach called boosted regression trees (BRT) is used to investigate the association between the variables and crash predictions. The BRT method can effectively handle different types of predictor variables, identify very complex and non-linear associations among variables, and compute variable importance. Five years of crash data from 2008 to 2012 on two urban and suburban facility types, two-lane undivided arterials and four-lane divided arterials, were analyzed to estimate the influence of variables on crash predictions. Variables were found to exhibit non-linear and sometimes complex relationships with predicted crash counts. In addition, only a few variables were found to explain most of the variation in the crash data.

Introduction

The Highway Safety Manual (HSM), published by the American Association of State Highway and Transportation Officials (AASHTO) in 2010, is designed to “assist agencies in their effort to integrate safety into their decision-making processes” (AASHTO, 2010). Part C of the HSM presents predictive models to estimate predicted average crash frequency at individual sites on different roadway facilities including rural two-lane two-way roads, rural multilane highways, and urban and suburban arterials. The general form of the predictive models in the HSM can be expressed as follows:

$N_{predicted,i} = N_{spf,i} \times (CMF_{1,i} \times CMF_{2,i} \times \cdots \times CMF_{n,i}) \times C_{i}$ (1)

where $N_{predicted,i}$ is the predicted average crash frequency for a specific year for site type $i$; $N_{spf,i}$ is the predicted average crash frequency for a specific year for site type $i$ for base conditions; $CMF_{1,i}, \ldots, CMF_{n,i}$ are crash modification factors for $n$ geometric conditions or traffic control features for site type $i$; and $C_{i}$ is the calibration factor to adjust the SPF for local conditions for site type $i$.
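The multiplicative structure of Eq. (1) can be illustrated with a minimal Python sketch. The function name and the numeric inputs below are hypothetical and chosen only for illustration; they are not values from the HSM or from this study.

```python
from math import prod


def predicted_crash_frequency(n_spf, cmfs, calibration_factor=1.0):
    """Evaluate Eq. (1) for one site: N_spf * (CMF_1 * ... * CMF_n) * C.

    n_spf              -- base-condition SPF prediction (crashes per year)
    cmfs               -- iterable of crash modification factors
    calibration_factor -- local calibration factor C (1.0 means uncalibrated)
    """
    # math.prod returns 1 for an empty iterable, so a site that matches
    # the base conditions (no CMFs) reduces to N_spf * C.
    return n_spf * prod(cmfs) * calibration_factor


# Hypothetical inputs, for illustration only.
n_predicted = predicted_crash_frequency(
    n_spf=2.0, cmfs=[1.10, 0.95], calibration_factor=1.2
)
print(round(n_predicted, 3))
```

Because every term enters as a multiplier, a CMF above 1 raises the prediction and a CMF below 1 lowers it, independently of the other factors.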

As shown in Eq. (1), there are three components of the predictive models: base safety performance functions (SPFs), crash modification factors (CMFs), and calibration factors. Base SPFs are statistical models that are used to estimate predicted average crash frequency for a facility type with specified base conditions. CMFs are used to account for the effects of non-base conditions on predicted crashes. Calibration factors are required “to account for differences between the jurisdiction and time period for which the predictive models were developed and the jurisdiction and time period to which they are applied by HSM users” (AASHTO, 2010). The calibration factor is estimated as the ratio of the total number of observed crashes to the total number of predicted crashes calculated using the SPFs and CMFs provided in the HSM. The predictive models are most effective when calibrated to local conditions (Findley et al., 2012; Lu, 2013; Sun et al., 2006; Young and Park, 2013).
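The ratio definition of the calibration factor can be sketched in a few lines of Python. The function name and the site-level counts are hypothetical placeholders, not data from the study; the predicted values stand for uncalibrated SPF-and-CMF predictions at the same sites.

```python
def calibration_factor(observed, predicted):
    """Return total observed crashes divided by total predicted crashes.

    observed  -- observed crash counts, one per calibration site
    predicted -- uncalibrated predictions (SPF x CMFs) for the same sites
    """
    total_predicted = sum(predicted)
    if total_predicted <= 0:
        raise ValueError("total predicted crashes must be positive")
    return sum(observed) / total_predicted


# Hypothetical counts for four sites, for illustration only.
observed_crashes = [3, 0, 5, 2]            # totals 10
predicted_crashes = [2.0, 1.0, 3.0, 2.0]   # totals 8.0
c_local = calibration_factor(observed_crashes, predicted_crashes)
print(c_local)
```

A value above 1 indicates that the sites experienced more crashes than the uncalibrated models predict, and a value below 1 indicates fewer.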

Very detailed roadway geometry, traffic, and crash characteristics data are needed to derive local calibration factors. Several of the variables are often unavailable in the states’ databases. Collecting and maintaining all the data variables on the entire road network for the purpose of implementing the HSM is not cost-feasible. Therefore, a process to streamline the data requirements that minimizes the potential impacts to the quality of analysis is desirable. The objective of this study is to investigate the impact of the variables identified in the HSM on crash predictions. The study used five years of crash data from 2008 to 2012 on urban and suburban two-lane undivided arterials and urban and suburban four-lane divided arterials in Florida. Boosted regression tree (BRT), a data mining approach, is applied to evaluate variables’ importance and analyze their marginal effects on crash predictions.

Section snippets

Literature review

Traditionally, statistical regression models are developed in highway safety studies to associate crash frequency with the most significant variables (for example, Hadi et al., 1995; Abdel-Aty and Radwan, 2000; Sawalha and Sayed, 2001; Hauer et al., 2004; Caliendo et al., 2007; Cafiso et al., 2010). These models, however, were limited in their ability to evaluate the influence of predictor variables on crash outcome. Few studies identified and ranked the influence of predictor variables on

Data collection and preparation

Table 1 provides the list of variables identified in the HSM for urban and suburban roadway facilities. The roadway characteristics inventory (RCI) database maintained by the Florida Department of Transportation (FDOT) is the primary source of information for the data variables on Florida roadways. Data were extracted from the RCI for urban and suburban arterials that are part of the state highway system in Florida. However, data were available only for three variables, AADT, median width, and

Methodology

This section discusses the methodology of the BRT technique. It includes the underlying principle of the BRT method, the algorithm for fitting a BRT model, and a synopsis about regularization parameters for optimizing BRT models.

Analysis setup

A series of BRT models was generated with combinations of shrinkage (0.05, 0.01, 0.005, 0.001, and 0.0005) and tree complexity (1, 5, 10, and 15) values by fitting a total of 20,000 trees for both urban and suburban two-lane undivided and four-lane divided arterial segments. The analysis was carried out using the gbm package of the statistical software R (R Core Team, 2014). Since crashes are random, non-negative, and discrete events, the models were built using a Poisson distribution, where the

Analysis and results

This section presents the study results. The BRT model outputs are first presented to show model performance and parameter optimization. Based on the optimal parameter values, variable importance and the marginal effect of variables on crash prediction are evaluated.

Summary and conclusions

Calibration factors are required to adjust crash frequencies predicted using the HSM default safety performance functions (SPFs) to local site conditions. The HSM requires very detailed roadway geometry, traffic, and crash characteristics data to derive local calibration factors, and unfortunately, several of the variables are often not available in the states’ databases. Agencies are required to collect the missing data to generate calibration factors to be able to implement the HSM. As such,

References (48)

M. Abdel-Aty et al. Analyzing angle crashes at unsignalized intersections using machine learning techniques. Accid. Anal. Prev. (2011)
M. Abdel-Aty et al. Modeling traffic accident occurrence and involvement. Accid. Anal. Prev. (2000)
M. Ahmed et al. A data fusion framework for real-time risk assessment on freeways. Transp. Res. C: Emerg. Technol. (2013)
S. Cafiso et al. Development of comprehensive accident models for two-lane rural highways using exposure, geometry, consistency and context variables. Accid. Anal. Prev. (2010)
C. Caliendo et al. A crash-prediction model for multilane roads. Accid. Anal. Prev. (2007)
L.-Y. Chang et al. Analysis of traffic injury severity: an application of non-parametric classification tree techniques. Accid. Anal. Prev. (2006)
Y.L. Cheong et al. Assessment of land use factors associated with dengue cases in Malaysia using boosted regression trees. Spat. Spatio-temporal Epidemiol. (2014)
Y.-S. Chung. Factor complexity of crash occurrence: an empirical demonstration using boosted regression trees. Accid. Anal. Prev. (2013)
A. Das et al. Using conditional inference forests to identify the factors affecting crash severity on arterial corridors. J. Saf. Res. (2009)
A.R. Ellis et al. Confounding control in a nonexperimental study of STAR*D data: logistic regression balanced covariates better than boosted CART. Ann. Epidemiol. (2013)
N. Elmitiny et al. Classification analysis of driver’s stop/go decision and red-light running violation. Accid. Anal. Prev. (2010)
A. Esther et al. Correlations between weather conditions and common vole (Microtus arvalis) densities identified by regression tree analysis. Basic Appl. Ecol. (2014)
A. Etter et al. Regional patterns of agricultural land use and deforestation in Colombia. Agric. Ecosyst. Environ. (2006)
J.H. Friedman. Stochastic gradient boosting. Comput. Stat. Data Anal. (2002)
J.T. Froeschke et al. Spatio-temporal predictive model based on environmental factors for juvenile spotted seatrout in Texas estuaries using boosted regression trees. Fish. Res. (2011)
M. Gellrich et al. Combining classification tree analyses with interviews to study why sub-alpine grasslands sometimes revert to forest: a case study from the Swiss Alps. Agric. Syst. (2008)
R. Hale et al. Separating the effects of water physicochemistry and sediment contamination on Chironomus tepperi (Skuse) survival, growth and development: a boosted regression tree approach. Aquat. Toxicol. (2014)
R. Harb et al. Exploring precrash maneuvers using classification trees and random forests. Accid. Anal. Prev. (2009)
A. Jafari et al. Spatial prediction of soil great groups by boosted regression trees using a limited point dataset in an arid region, southeastern Iran. Geoderma (2014)
M.G. Karlaftis et al. Effects of road geometry and traffic volumes on rural roadway accident rates. Accid. Anal. Prev. (2002)
A.T. Kashani et al. Analysis of the traffic injury severity on two-lane, two-way rural roads based on classification tree models. Saf. Sci. (2011)
B. Lemercier et al. Extrapolation at regional scale of local soil knowledge using boosted classification trees: a two-step approach. Geoderma (2012)
D. Müller et al. Comparing the determinants of cropland abandonment in Albania and Romania using boosted regression trees. Agric. Syst. (2013)
A. Neumann et al. Measuring performance in health care: case-mix adjustment by boosted decision trees. Artif. Intell. Med. (2004)


¹: Tel.: +1 305 348 1896.

²: Tel.: +1 305 348 3116.

Published by Elsevier Ltd.
