Prioritizing Highway Safety Manual’s crash prediction variables using boosted regression trees
Introduction
The Highway Safety Manual (HSM), published by the American Association of State Highway and Transportation Officials (AASHTO) in 2010, is designed to “assist agencies in their effort to integrate safety into their decision-making processes” (AASHTO, 2010). Part C of the HSM presents predictive models to estimate the predicted average crash frequency at individual sites on different roadway facilities, including rural two-lane two-way roads, rural multilane highways, and urban and suburban arterials. The general form of the predictive models in the HSM can be expressed as follows:

$$N_{predicted,i} = N_{spf,i} \times \left(CMF_{1,i} \times CMF_{2,i} \times \cdots \times CMF_{n,i}\right) \times C_i \tag{1}$$

where $N_{predicted,i}$ is the predicted average crash frequency for a specific year for site type $i$; $N_{spf,i}$ is the predicted average crash frequency for a specific year for site type $i$ for base conditions; $CMF_{1,i} \ldots CMF_{n,i}$ are crash modification factors for $n$ geometric conditions or traffic control features for site type $i$; and $C_i$ is the calibration factor to adjust the SPF for local conditions for site type $i$.
As shown in Eq. (1), the predictive models have three components: base safety performance functions (SPFs), crash modification factors (CMFs), and calibration factors. Base SPFs are statistical models that are used to estimate the predicted average crash frequency for a facility type with specified base conditions. CMFs are used to account for the effects of non-base conditions on predicted crashes. Calibration factors are required “to account for differences between the jurisdiction and time period for which the predictive models were developed and the jurisdiction and time period to which they are applied by HSM users” (AASHTO, 2010). The calibration factor is estimated as the ratio of the total number of observed crashes to the total number of predicted crashes calculated using the SPFs and CMFs provided in the HSM. The predictive models are most effective when calibrated to local conditions (Findley et al., 2012; Lu, 2013; Sun et al., 2006; Young and Park, 2013).
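The arithmetic behind Eq. (1) and the calibration-factor ratio can be illustrated with a minimal Python sketch. The site values below (base SPF predictions, CMFs, and observed crash counts) are hypothetical numbers chosen only for illustration, and the function names are assumptions of this sketch, not part of the HSM.

```python
from math import prod


def predicted_crashes(n_spf, cmfs, calibration=1.0):
    """Eq. (1): N_predicted = N_spf * (CMF_1 * ... * CMF_n) * C."""
    return n_spf * prod(cmfs) * calibration


def calibration_factor(observed, uncalibrated_predicted):
    """Ratio of total observed crashes to total predicted crashes."""
    total_predicted = sum(uncalibrated_predicted)
    if total_predicted <= 0:
        raise ValueError("total predicted crashes must be positive")
    return sum(observed) / total_predicted


# Hypothetical sites: (base SPF prediction, list of CMFs, observed crashes)
sites = [
    (2.0, [1.10, 0.95], 3),
    (4.0, [1.00, 1.20], 4),
    (1.5, [0.90], 2),
]

# Step 1: predictions from the SPFs and CMFs alone (calibration factor of 1)
uncalibrated = [predicted_crashes(n_spf, cmfs) for n_spf, cmfs, _ in sites]
observed = [obs for _, _, obs in sites]

# Step 2: one calibration factor for the whole set of sites
c_local = calibration_factor(observed, uncalibrated)

# Step 3: calibrated predictions for each site
calibrated = [predicted_crashes(n_spf, cmfs, c_local) for n_spf, cmfs, _ in sites]

print(f"calibration factor: {c_local:.3f}")
print("calibrated predictions:", [round(p, 2) for p in calibrated])
```

By construction, the calibrated predictions sum to the total number of observed crashes, which is what the ratio definition implies.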
Very detailed roadway geometry, traffic, and crash characteristics data are needed to derive local calibration factors. Several of the variables are often unavailable in the states’ databases. Collecting and maintaining all the data variables on the entire road network for the purpose of implementing the HSM is not cost-feasible. Therefore, a process to streamline the data requirements that minimizes the potential impacts to the quality of analysis is desirable. The objective of this study is to investigate the impact of the variables identified in the HSM on crash predictions. The study used five years of crash data from 2008 to 2012 on urban and suburban two-lane undivided arterials and urban and suburban four-lane divided arterials in Florida. Boosted regression tree (BRT), a data mining approach, is applied to evaluate variables’ importance and analyze their marginal effects on crash predictions.
Literature review
Traditionally, highway safety studies have developed statistical regression models to associate crash frequency with the most significant variables (for example, Hadi et al., 1995; Abdel-Aty and Radwan, 2000; Sawalha and Sayed, 2001; Hauer et al., 2004; Caliendo et al., 2007; Cafiso et al., 2010). The models, however, were limited in their scope to evaluate the influence of predictor variables on crash outcome. Few studies identified and ranked the influence of predictor variables on
Data collection and preparation
Table 1 provides the list of variables identified in the HSM for urban and suburban roadway facilities. The roadway characteristics inventory (RCI) database maintained by the Florida Department of Transportation (FDOT) is the primary source of information for the data variables on Florida roadways. Data were extracted from the RCI for urban and suburban arterials that are part of the state highway system in Florida. However, data were available only for three variables, AADT, median width, and
Methodology
This section discusses the methodology of the BRT technique. It includes the underlying principle of the BRT method, the algorithm for fitting a BRT model, and a synopsis about regularization parameters for optimizing BRT models.
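As a rough illustration of how a boosted regression tree model is fitted, the following Python sketch boosts single-split trees (tree complexity of 1) under a Poisson loss and scales each tree's contribution by a shrinkage value. It is a toy sketch under stated assumptions, not the study's implementation and not the gbm package: the data are hypothetical, the leaf update is a single Newton step (an assumption of this sketch), and every function name is illustrative.

```python
import math


def fit_stump(X, residuals):
    """Return (feature index, threshold) of the split with the lowest squared error."""
    best = None
    for j in range(len(X[0])):
        values = sorted({row[j] for row in X})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [r for row, r in zip(X, residuals) if row[j] <= thr]
            right = [r for row, r in zip(X, residuals) if row[j] > thr]
            mean_l, mean_r = sum(left) / len(left), sum(right) / len(right)
            sse = sum((r - mean_l) ** 2 for r in left) + sum((r - mean_r) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, j, thr)
    return best[1], best[2]


def fit_brt(X, y, n_trees=200, shrinkage=0.05):
    """Boost stumps on the log scale; each tree is fitted to the residuals y - mu."""
    f0 = math.log(sum(y) / len(y))          # start from the log of the mean count
    scores = [f0] * len(y)
    stumps = []
    for _ in range(n_trees):
        mu = [math.exp(s) for s in scores]
        residuals = [yi - mi for yi, mi in zip(y, mu)]
        j, thr = fit_stump(X, residuals)
        steps = {}
        for go_left in (True, False):
            idx = [i for i, row in enumerate(X) if (row[j] <= thr) == go_left]
            # Assumed leaf update: one Newton step for the Poisson log-likelihood
            steps[go_left] = sum(residuals[i] for i in idx) / sum(mu[i] for i in idx)
        for i, row in enumerate(X):
            scores[i] += shrinkage * steps[row[j] <= thr]   # shrinkage damps each tree
        stumps.append((j, thr, steps[True], steps[False]))
    return f0, shrinkage, stumps


def predict(model, row):
    """Predicted mean count: exp of the initial score plus the shrunken tree outputs."""
    f0, shrinkage, stumps = model
    total = sum(left if row[j] <= thr else right for j, thr, left, right in stumps)
    return math.exp(f0 + shrinkage * total)


def poisson_deviance(y, mu):
    """Poisson deviance, with the y*log(y/mu) term taken as 0 when y is 0."""
    return 2 * sum((yi * math.log(yi / mi) if yi > 0 else 0.0) - (yi - mi)
                   for yi, mi in zip(y, mu))


# Hypothetical segments: [AADT in thousands, median width]; y = crash counts
X = [[5, 0], [8, 0], [12, 10], [15, 10], [20, 20], [25, 20], [30, 30], [35, 30]]
y = [1, 2, 2, 3, 4, 5, 6, 8]

model = fit_brt(X, y, n_trees=200, shrinkage=0.05)
fitted = [predict(model, row) for row in X]
baseline = [sum(y) / len(y)] * len(y)
print("deviance, constant model:", round(poisson_deviance(y, baseline), 3))
print("deviance, boosted model: ", round(poisson_deviance(y, fitted), 3))
```

A smaller shrinkage value makes each tree contribute less, so more trees are needed to reach the same fit; this trade-off is the reason shrinkage and the number of trees are tuned together.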
Analysis setup
A series of BRT models was generated with combinations of shrinkage (0.05, 0.01, 0.005, 0.001, and 0.0005) and tree complexity (1, 5, 10, and 15) values by fitting a total of 20,000 trees for both urban and suburban two-lane undivided and four-lane divided arterial segments. The analysis was carried out using the gbm package of the statistical software R (R Core Team, 2014). Since crashes are random, non-negative, and discrete events, the models were built using a Poisson distribution, where the
Analysis and results
This section presents the study results. The BRT model outputs are first presented to show model performance and parameter optimization. Based on the optimal parameter values, variable importance and the marginal effect of variables on crash prediction are evaluated.
Summary and conclusions
Calibration factors are required to adjust crash frequencies predicted using the HSM default safety performance functions (SPFs) to local site conditions. The HSM requires very detailed roadway geometry, traffic, and crash characteristics data to derive local calibration factors, and unfortunately, several of the variables are often not available in the states’ databases. Agencies are required to collect the missing data to generate calibration factors to be able to implement the HSM. As such,
References (48)
- et al. Analyzing angle crashes at unsignalized intersections using machine learning techniques. Accid. Anal. Prev. (2011)
- et al. Modeling traffic accident occurrence and involvement. Accid. Anal. Prev. (2000)
- et al. A data fusion framework for real-time risk assessment on freeways. Transp. Res. C: Emerg. Technol. (2013)
- et al. Development of comprehensive accident models for two-lane rural highways using exposure, geometry, consistency and context variables. Accid. Anal. Prev. (2010)
- et al. A crash-prediction model for multilane roads. Accid. Anal. Prev. (2007)
- et al. Analysis of traffic injury severity: an application of non-parametric classification tree techniques. Accid. Anal. Prev. (2006)
- et al. Assessment of land use factors associated with dengue cases in Malaysia using boosted regression trees. Spat. Spatio-temporal Epidemiol. (2014)
- Factor complexity of crash occurrence: an empirical demonstration using boosted regression trees. Accid. Anal. Prev. (2013)
- et al. Using conditional inference forests to identify the factors affecting crash severity on arterial corridors. J. Saf. Res. (2009)
- et al. Confounding control in a nonexperimental study of STAR*D data: logistic regression balanced covariates better than boosted CART. Ann. Epidemiol. (2013)
- Classification analysis of driver’s stop/go decision and red-light running violation. Accid. Anal. Prev.
- Correlations between weather conditions and common vole (Microtus arvalis) densities identified by regression tree analysis. Basic Appl. Ecol.
- Regional patterns of agricultural land use and deforestation in Colombia. Agric. Ecosyst. Environ.
- Stochastic gradient boosting. Comput. Stat. Data Anal.
- Spatio-temporal predictive model based on environmental factors for juvenile spotted seatrout in Texas estuaries using boosted regression trees. Fish. Res.
- Combining classification tree analyses with interviews to study why sub-alpine grasslands sometimes revert to forest: a case study from the Swiss Alps. Agric. Syst.
- Separating the effects of water physicochemistry and sediment contamination on Chironomus tepperi (Skuse) survival, growth and development: a boosted regression tree approach. Aquat. Toxicol.
- Exploring precrash maneuvers using classification trees and random forests. Accid. Anal. Prev.
- Spatial prediction of soil great groups by boosted regression trees using a limited point dataset in an arid region, southeastern Iran. Geoderma
- Effects of road geometry and traffic volumes on rural roadway accident rates. Accid. Anal. Prev.
- Analysis of the traffic injury severity on two-lane, two-way rural roads based on classification tree models. Saf. Sci.
- Extrapolation at regional scale of local soil knowledge using boosted classification trees: a two-step approach. Geoderma
- Comparing the determinants of cropland abandonment in Albania and Romania using boosted regression trees. Agric. Syst.
- Measuring performance in health care: case-mix adjustment by boosted decision trees. Artif. Intell. Med.