Elsevier

Accident Analysis & Prevention

Volume 79, June 2015, Pages 133-144

Prioritizing Highway Safety Manual’s crash prediction variables using boosted regression trees

https://doi.org/10.1016/j.aap.2015.03.011

Highlights

  • Boosted regression trees (BRT) is a data mining method that gives interpretable results.

  • We developed BRT models to evaluate variables’ influence on crash predictions.

  • Models showed non-linear and complex relationships between variables and crash predictions.

  • BRT models with higher tree complexity levels produced better-fitting models.

Abstract

The Highway Safety Manual (HSM) recommends using the empirical Bayes (EB) method with locally derived calibration factors to predict an agency’s safety performance. However, the data needs for deriving these local calibration factors are significant, requiring very detailed roadway characteristics information. Many of the data variables identified in the HSM are currently unavailable in the states’ databases. Moreover, the process of collecting and maintaining all the HSM data variables is cost-prohibitive. Prioritization of the variables based on their impact on crash predictions would, therefore, help to identify influential variables for which data could be collected and maintained for continued updates. This study aims to determine the impact of each independent variable identified in the HSM on crash predictions. A relatively recent data mining approach called boosted regression trees (BRT) is used to investigate the association between the variables and crash predictions. The BRT method can effectively handle different types of predictor variables, identify very complex and non-linear associations among variables, and compute variable importance. Five years of crash data from 2008 to 2012 on two urban and suburban facility types, two-lane undivided arterials and four-lane divided arterials, were analyzed to estimate the influence of variables on crash predictions. Variables were found to exhibit non-linear and sometimes complex relationships with predicted crash counts. In addition, only a few variables were found to explain most of the variation in the crash data.

Introduction

The Highway Safety Manual (HSM), published by the American Association of State Highway and Transportation Officials (AASHTO) in 2010, is designed to “assist agencies in their effort to integrate safety into their decision-making processes” (AASHTO, 2010). Part C of the HSM presents predictive models to estimate predicted average crash frequency at individual sites on different roadway facilities including rural two-lane two-way roads, rural multilane highways, and urban and suburban arterials. The general form of the predictive models in the HSM can be expressed as follows:

$$N_{\text{predicted},i} = N_{\text{spf},i} \times \left(CMF_{1,i} \times CMF_{2,i} \times \cdots \times CMF_{n,i}\right) \times C_i \qquad (1)$$

where $N_{\text{predicted},i}$ is the predicted average crash frequency for a specific year for site type $i$; $N_{\text{spf},i}$ is the predicted average crash frequency for a specific year for site type $i$ for base conditions; $CMF_{1,i}, \ldots, CMF_{n,i}$ are crash modification factors for $n$ geometric conditions or traffic control features for site type $i$; and $C_i$ is the calibration factor to adjust the SPF for local conditions for site type $i$.

As shown in Eq. (1), there are three components of the predictive models: base safety performance functions (SPFs), crash modification factors (CMFs), and calibration factors. Base SPFs are statistical models that are used to estimate predicted average crash frequency for a facility type with specified base conditions. CMFs are used to account for the effects of non-base conditions on predicted crashes. Calibration factors are required “to account for differences between the jurisdiction and time period for which the predictive models were developed and the jurisdiction and time period to which they are applied by HSM users” (AASHTO, 2010). The calibration factor is estimated as the ratio of the total number of observed crashes to the total number of predicted crashes calculated using the SPFs and CMFs provided in the HSM. The predictive models are most effective when calibrated to local conditions (Findley et al., 2012; Lu, 2013; Sun et al., 2006; Young and Park, 2013).
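The arithmetic behind the predictive model and the calibration factor can be illustrated with a minimal Python sketch. The site values below (base SPF predictions, CMFs, and observed crash counts) are made-up numbers for illustration only, and the function names are hypothetical; they are not taken from the HSM or from this study.

```python
from math import prod

def predicted_crashes(n_spf, cmfs, calibration=1.0):
    """Base SPF prediction times the product of CMFs times the calibration factor."""
    return n_spf * prod(cmfs) * calibration

def calibration_factor(observed, predicted_uncalibrated):
    """Ratio of total observed crashes to total uncalibrated predicted crashes."""
    total_predicted = sum(predicted_uncalibrated)
    if total_predicted <= 0:
        raise ValueError("total predicted crashes must be positive")
    return sum(observed) / total_predicted

# Hypothetical sites: base SPF prediction, CMFs, and observed crashes (illustrative only).
sites = [
    {"n_spf": 2.0, "cmfs": [1.10, 0.95], "observed": 3},
    {"n_spf": 1.5, "cmfs": [1.00, 1.20], "observed": 1},
    {"n_spf": 4.0, "cmfs": [0.90, 1.05], "observed": 5},
]

# Step 1: predictions from SPFs and CMFs alone (calibration factor of 1).
uncalibrated = [predicted_crashes(s["n_spf"], s["cmfs"]) for s in sites]

# Step 2: a single calibration factor for the facility type.
c = calibration_factor([s["observed"] for s in sites], uncalibrated)

# Step 3: calibrated predictions; by construction their total equals the observed total.
calibrated = [predicted_crashes(s["n_spf"], s["cmfs"], c) for s in sites]
```

Because the factor is a ratio of totals, it rescales every site by the same amount: it corrects the overall level of the predictions for a jurisdiction but does not change the relative ranking of sites.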

Very detailed roadway geometry, traffic, and crash characteristics data are needed to derive local calibration factors. Several of the variables are often unavailable in the states’ databases. Collecting and maintaining all the data variables on the entire road network for the purpose of implementing the HSM is not cost-feasible. Therefore, a process to streamline the data requirements that minimizes the potential impacts to the quality of analysis is desirable. The objective of this study is to investigate the impact of the variables identified in the HSM on crash predictions. The study used five years of crash data from 2008 to 2012 on urban and suburban two-lane undivided arterials and urban and suburban four-lane divided arterials in Florida. Boosted regression tree (BRT), a data mining approach, is applied to evaluate variables’ importance and analyze their marginal effects on crash predictions.

Section snippets

Literature review

Traditionally, statistical regression models are developed in highway safety studies to associate crash frequency with the most significant variables (for example, Hadi et al., 1995; Abdel-Aty and Radwan, 2000; Sawalha and Sayed, 2001; Hauer et al., 2004; Caliendo et al., 2007; Cafiso et al., 2010). The models, however, were limited in their scope to evaluate the influence of predictor variables on crash outcomes. Few studies identified and ranked the influence of predictor variables on

Data collection and preparation

Table 1 provides the list of variables identified in the HSM for urban and suburban roadway facilities. The roadway characteristics inventory (RCI) database maintained by the Florida Department of Transportation (FDOT) is the primary source of information for the data variables on Florida roadways. Data were extracted from the RCI for urban and suburban arterials that are part of the state highway system in Florida. However, data were available only for three variables, AADT, median width, and

Methodology

This section discusses the methodology of the BRT technique. It includes the underlying principle of the BRT method, the algorithm for fitting a BRT model, and a synopsis about regularization parameters for optimizing BRT models.
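As a rough illustration of the general boosting idea, the sketch below fits a sequence of one-split trees (stumps) to count data in plain Python. It is a textbook-style simplification under stated assumptions, not the authors' procedure or the algorithm of any particular package: it assumes a Poisson loss with a log link, a single predictor, stumps only (which would correspond to a tree complexity of 1 under the usual convention), and positive total counts. The data values and all function names are hypothetical.

```python
import math

def poisson_deviance(y, mu):
    """Poisson deviance; the y*log(y/mu) term is taken as 0 when y == 0."""
    total = 0.0
    for yi, mi in zip(y, mu):
        term = yi * math.log(yi / mi) if yi > 0 else 0.0
        total += 2.0 * (term - (yi - mi))
    return total

def best_threshold(x, residuals):
    """Return the split point on x that minimizes squared error of the residuals."""
    best_sse, best_thr = None, None
    values = sorted(set(x))
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2.0
        sse = 0.0
        for keep_left in (True, False):
            group = [r for xi, r in zip(x, residuals) if (xi <= thr) == keep_left]
            mean = sum(group) / len(group)
            sse += sum((r - mean) ** 2 for r in group)
        if best_sse is None or sse < best_sse:
            best_sse, best_thr = sse, thr
    return best_thr

def fit_boosted_stumps(x, y, n_trees=200, shrinkage=0.05):
    """Add stumps one at a time, each scaled by the shrinkage (learning rate)."""
    f0 = math.log(sum(y) / len(y))          # start from the log of the mean count
    scores = [f0] * len(y)
    stumps = []
    for _ in range(n_trees):
        mu = [math.exp(s) for s in scores]
        residuals = [yi - mi for yi, mi in zip(y, mu)]  # negative gradient of the Poisson loss
        thr = best_threshold(x, residuals)
        leaf = {}
        for is_left in (True, False):
            idx = [i for i, xi in enumerate(x) if (xi <= thr) == is_left]
            sum_y = max(sum(y[i] for i in idx), 1e-9)   # guard against log(0) in all-zero leaves
            leaf[is_left] = math.log(sum_y / sum(mu[i] for i in idx))
        stumps.append((thr, leaf[True], leaf[False]))
        scores = [s + shrinkage * leaf[xi <= thr] for s, xi in zip(scores, x)]
    return {"f0": f0, "shrinkage": shrinkage, "stumps": stumps}

def predict(model, xi):
    """Predicted mean count: exponentiate the sum of the shrunken stump outputs."""
    score = model["f0"]
    for thr, left, right in model["stumps"]:
        score += model["shrinkage"] * (left if xi <= thr else right)
    return math.exp(score)

# Hypothetical data: traffic volume (thousands of vehicles per day) and crash counts.
aadt_thousands = [5, 8, 12, 15, 20, 25, 30, 35, 40, 45]
crashes = [0, 1, 1, 2, 2, 4, 3, 6, 7, 9]

model = fit_boosted_stumps(aadt_thousands, crashes)
baseline = [sum(crashes) / len(crashes)] * len(crashes)
fitted = [predict(model, xi) for xi in aadt_thousands]
dev_before = poisson_deviance(crashes, baseline)
dev_after = poisson_deviance(crashes, fitted)
```

In this sketch a smaller shrinkage value makes each stump contribute less, so more trees are needed to reach the same fit. That trade-off is the reason shrinkage and the number of trees are normally tuned together.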

Analysis setup

A series of BRT models were generated with combinations of shrinkage (0.05, 0.01, 0.005, 0.001, and 0.0005) and tree complexity (1, 5, 10, and 15) values by fitting a total of 20,000 trees for both urban and suburban two-lane undivided and four-lane divided arterial segments. The analysis was carried out using the gbm package of the statistical software R (R Core Team, 2014). Since crashes are random, non-negative, and discrete events, the models were built using Poisson distribution, where the

Analysis and results

This section presents the study results. The BRT model outputs are first presented to show model performance and parameter optimization. Based on the optimal parameter values, variable importance and the marginal effect of variables on crash prediction are evaluated.

Summary and conclusions

Calibration factors are required to adjust crash frequencies predicted using the HSM default safety performance functions (SPFs) to local site conditions. The HSM requires very detailed roadway geometry, traffic, and crash characteristics data to derive local calibration factors, and unfortunately, several of the variables are often not available in the states’ databases. Agencies are required to collect the missing data to generate calibration factors to be able to implement the HSM. As such,
