A New Spatial Count Data Model with Bayesian Additive Regression Trees for Accident Hot Spot Identification

The identification of accident hot spots is a central task of road safety management. Bayesian count data models have emerged as the workhorse method for producing probabilistic rankings of hazardous sites in road networks. Typically, these methods assume simple linear link function specifications, which, however, limit the predictive power of a model. Furthermore, extensive specification searches are precluded by complex model structures arising from the need to account for unobserved heterogeneity and spatial correlations. Modern machine learning (ML) methods offer ways to automate the specification of the link function. However, these methods do not capture estimation uncertainty, and it is also difficult to incorporate spatial correlations. In light of these gaps in the literature, this paper proposes a new spatial negative binomial model, which uses Bayesian additive regression trees to endogenously select the specification of the link function. Posterior inference in the proposed model is made feasible with the help of the Polya-Gamma data augmentation technique. We test the performance of this new model on a crash count data set from a metropolitan highway network. The empirical results show that the proposed model performs at least as well as a baseline spatial count data model with random parameters in terms of goodness of fit and site ranking ability.


Introduction
The identification of accident-prone locations (so-called hot spots) is a core task of road safety management (Cheng et al., 2020;Huang et al., 2009;Lee et al., 2020). Of the various approaches for accident hot spot identification (see Wang et al., 2018;Zahran et al., 2019, for a comparison), crash frequency analysis is the most widely employed method. Crash count data models are used to produce model-based rankings of hazardous sites and to predict crash counts at hot spots under counterfactual traffic flow and road design scenarios (Deacon et al., 1975). Recent work also uses multivariate analysis for the joint modelling of road fatality and injury counts (Besharati et al., 2020), single and multi-vehicle crashes (Wang and Feng, 2019), and crash frequency by travel modes (Huang et al., 2017).
Crash counts are typically modelled using Poisson log-normal or negative binomial regression models (Lord and Mannering, 2010). Accommodating flexible representations of unobserved heterogeneity in model parameters and accounting for correlations between spatial units are central themes of the recent crash count modelling literature (Cai et al., 2019a;Cheng et al., 2020;Dong et al., 2016;Heydari et al., 2016;Mannering et al., 2016;Ziakopoulos and Yannis, 2020). However, these flexible representations of unobserved heterogeneity are achieved at the cost of a restrictive linear specification of the link function. Whilst linear-in-parameters link functions are appealing from an interpretability perspective, an over-simplification of the relationship between predictors and the explained variable may negatively affect the predictive performance of a model (Li et al., 2008;Huang et al., 2016).
Since the predictive performance of a model is of paramount importance in hot spot identification and site ranking applications, the specification of the link function should be carefully selected.
However, in practice, the space of possible link function specifications is prohibitively large, which precludes exhaustive specification searches. Modern machine learning (ML) methods offer a remedy to this challenge, as they enable automatic specification searches. A few studies have adopted kernelbased regression (Thakali, 2016), neural networks (Chang, 2005;Huang et al., 2016;Xie et al., 2007;Zeng et al., 2016a,b), support vector machine (Dong et al., 2015;Li et al., 2008), and deep learning architectures (Cai et al., 2019b;Dong et al., 2018) for crash count modelling.
Modern ML methods are shown to surpass the traditional count data models in terms of predictive accuracy but succumb to four limitations. First, with exception of the work by Dong et al. (2015), none of the existing ML studies account for spatial correlations between observations. It is important to note that a non-linear link function specification does not inherently account for spatial correlations, and ignoring these correlations can deteriorate the robustness of predictions (Dong et al., 2015). Second, unlike traditional count data models, ML methods do not provide a quantification of estimation uncertainty and thus do not offer straightforward ways to construct confidence or credible intervals.
Third, ML methods are fully nonparametric, with no easy ways to integrate interpretable components in link functions. In other words, if a user is interested in inferring the relationship between selected explanatory variables and crash counts, there is no straightforward way to specify a semiparametric link function in the above-mentioned ML-based count data models. Fourth, ML-based crash count studies benchmark the performance of their methods against simplistic parametric models which do not account for unobserved and spatial heterogeneity, which is not a fair comparison. More specifically, none of the previous studies address an important question-whether a count data model with a nonparametric link function can outperform a model with a linear link function that also accounts for unobserved heterogeneity.
We emphasise that the before-mentioned ML methods adopt classical inference approaches, which yield point estimates of parameters of interest. On the contrary, the Bayesian approach facilitates accounting for various sources of uncertainty in model formulation and inference. For instance, posterior draws of site rankings at each iteration of the Gibbs sampler can directly provide ranking estimates with credible intervals (Miaou and Song, 2005). Hence, the fully Bayesian approach has emerged as the workhorse method in the site ranking literature.
Along the same lines, this paper proposes a Bayesian negative binomial regression model, which not only addresses the first three limitations of the above-discussed ML methods by retaining all advantages of the statistical crash count models (interpretability, inference, accounting for random parameters and spatial correlations) but also allows for an additional nonparametric component in the link function with an endogenous selection of the specification. This nonparametric component is specified as sum-of-trees using the Bayesian Additive Regression Trees (BART) framework (Chipman et al., 2010).
The sum-of-trees specification inherently partitions the support of each explanatory variable during the estimation, resulting in a sum of step functions of individual predictors. Furthermore, if a tree depends only on one predictor, then this specification captures the main effect of the predictors; it represents an interaction effect, if a tree depends on more than one predictor. This process is equivalent to creating categorical variables from continuous variables, as it inherently accounts for interaction effects between predictors, while cut-offs and functional forms of interaction effects are endogenously selected during estimation based on predictive accuracy. This is particularly relevant in the context of the site ranking, where continuous predictors like speed limit and shoulder width are often converted into categorical variables by manually selecting cut-offs before entering into linear link functions because such explanatory variables are unlikely to have a constant marginal effect over the entire support.
In the Gibbs sampler for the proposed model, we adopt the Pólya-Gamma augmentation method to deal with the non-conjugacy of the negative binomial likelihood (Polson et al., 2013). The key idea of this data augmentation approach is to translate the negative binomial likelihood into a Gaussian likelihood by introducing auxiliary Pólya-Gamma-distributed random variables into the model. Finally, we address the last limitation of recent ML studies by providing a fair comparison of the proposed model with its linear-link counterpart, while also incorporating random parameters and spatial random effects. The proposed approach is a full Bayes (FB) method and the superiority of FB over empirical Bayes (EB) methods is well established in the literature (Guo et al., 2019). Nonetheless, we still benchmark our approach against EB in site ranking analysis.
The remainder of this paper is organised as follows: In Section 2, we introduce the formulation of the proposed model framework and in Section 3, we outline the estimation approach. In Section 4, we present the empirical analysis and in Section 5, we conclude and discuss avenues of future research.

Model formulation
Let y i denote an observed crash count on road segment i = {1, 2, . . . , N }. We consider the distribution of y i to be negative binomial (NB) with probability parameter p i and shape parameter r. The link function ψ i = log p i 1−p i depends on road-segment-specific attributes F i and X i . While F i enters in the link function in a linear, interpretable form, the effect of X i is specified using a sum of trees m j=1 g i (X i ; T j , M j ) (Chipman et al., 2010). Here, T j is a binary regression tree, and M j denotes the parameters associated with its terminal nodes. To account for spatial correlations between road segments, we also include a spatial random effect φ i into the link function. The collection of spatial random effects φ = (φ 1 , . . . φ N ) follows a matrix exponential spatial specification (MESS) of dependence that ensures exponential decay of influence over space (LeSage and Pace, 2007).
non-negative spatial weight matrix, τ captures the magnitude of spatial association, and σ is an error scale.
The proposed model is summarised below: Equation 3 can be rewritten in vector form. We have such that Chipman et al. (2010) introduce BART as a nonparametric prior over a regression function to capture non-linear relationships and interaction effects between predictors. BART has been widely applied to a wide range of regression and classification problems (see Hill et al., 2020) but the current study presents the first application of BART in spatial count data modelling.

Bayesian additive regression trees (BART)
BART specifies the regression function as a sum of trees. Each tree T j consists a group of decision nodes with splitting rules and a set of terminal leaf nodes. Figure 1 illustrates one such tree, whose splitting rules at the decision nodes are x 1 < 0.7 and x 2 < 0.4, and whose leaf nodes are M j = {µ j1 , µ j2 , µ j3 }. A tree and its associated decision rules create partitions of the predictor space such that each partition corresponds to a leaf node. In the illustrative example, three partitions are tree inherently accounts for the interaction between x 1 and x 2 . However, if a tree depends only on one predictor, the main effect of that predictor is captured. The function for the illustrative tree is: g(X ; T j , M j ) is a step function of a subset of predictors and thus, BART is a sum of step functions.
The key idea of BART is to regularise the fit by keeping the effect of each individual tree small such that each of tree can explain a different portion of the regression function. In other words, each tree is constrained to be a weak learner and BART is an ensemble of weak learners. The sum of trees model is a flexible specification, which results in an excellent predictive accuracy (Chipman et al., 2010). Furthermore, splitting rules and leaf nodes are estimable parameters in BART, which in turn facilitates automated specification searches through endogenous partitioning of the predictor space.

Spatial error dependence
In this study, we adopt the matrix exponential spatial specification (

Pólya-Gamma data augmentation
Irrespective of the link function specification, conditional posterior distributions of the parameters of the negative binomial model do not belong to a known family of distributions (Buddhavarapu et al., 2016;Park and Lord, 2009). In such situations, data augmentation techniques, which involve intermediate or additional latent random variables, are used to derive tractable conditional posteriors of model parameters (Van Dyk and Meng, 2001). BART was originally developed for Gaussian models and therefore, its application in non-Gaussian models can be operationalised by transforming the original model into a Gaussian model via data augmentation. For example, Chipman et al. (2010) extend BART for a binary probit model with a flexible link function and estimated it by adopting the data augmentation technique suggested by Albert and Chib (1993). Similarly, Kindo et al. (2016) use BART to specify indirect utilities in a multinomial probit model and estimated it by augmenting indirect utilities as latent Gaussian random variables.
In this study, we adopt the Pólya-Gamma data augmentation technique which transforms the negative binomial model into a heteroskedastic Gaussian model (Polson et al., 2013). This augmentation is not only appropriate for any negative binomial specification, as shown by Buddhavarapu et al. (2016), but also enables seamless integration of the Gibbs sampler of the heteroskedastic BART  into that of the proposed model.
Pólya-Gamma data augmentation introduces auxiliary Pólya-Gamma random variates ω i into the model. Conditional on ω i , the negative binomial likelihood is translated into a heteroskedastic Gaussian likelihood, while closed-form posteriors for the augmentation variables are retained (see Polson et al., 2013, for a detailed proof). where

Prior specification
We adopt the strategy used by Chipman et al. (2010) to specify priors on T j and M j |T j . Other prior distributions are summarised below: Here Hyper-parameters} is a set of hyper-parameters and Θ = γ, φ, T , M, σ 2 , ω, r, h, τ is a set of latent variables of the models. Thus, the joint distribution of observed and latent variables is:

Posterior updates
To infer the posterior distributions of the model parameters of interest, we construct Markov chains by generating samples from the conditional distributions of individual coordinates of the parameters space. One iteration of the Gibbs sampler is described below: • Update r by sampling from a Gamma distribution (see Zhou et al., 2012, Section 4.1.1).
• Update T and M as illustrated for the heteroskedastic BART by , where the dependent variable is Z − φ − F γ and the error covariance matrix is Ω −1 . The updates of BART parameters are based on an iterative Bayesian backfitting algorithm (Hastie et al., 2000), where one tree is updated conditional on all other trees.
In this study, we present the estimation procedure for a specification with only non-random parameters in the interpretable part of the link function. However, a modeller might expect unobserved heterogeneity in some elements of γ. For completeness, we note that the proposed estimation procedure can be extended for such specifications of unobserved heterogeneity to facilitate flexibility at no compromise in interpretability. Thanks to Pólya-Gamma data augmentation, the extension is as simple as deriving a Gibbs sampler for a linear mixed-effects model.

Empirical analysis 4.1 Data
The proposed model framework is empirically validated using detailed crash frequency data from 1,158 contiguous road segments of eleven highway facilities in the Greater Houston metropolitan area in the United States of America. The data were compiled by geographically fusing information retrieved from accident and road databases for an observation period covering four consecutive calendar years in the period from 2007 to 2010. For each road segment, the considered data include the annual crash count aggregated over all accident types and severity levels as well as other segmentspecific characteristics, namely the type of highway facility the segment in question is associated with, the traffic volume, the truck traffic percentage, the road condition and roadway design attributes. The identified Table 1 enumerates summary statistics of the annual crash counts and the segment features for the considered road segments for each year of the observation period. More details about the the data collection and processing are presented in Buddhavarapu (2015). In the subsequent analysis, we treat the data from each year as a separate sample and evaluate the empirical performance of the proposed model across the different years of the observation period.

Model specifications
We contrast the performance of four different model specifications. The proposed negative binomial Bayesian additive regression trees (NB-BART-I, henceforth) model is benchmarked against negative binomial regression models with spatial error terms and linear-in-parameters link functions. We consider one model with fixed parameters (NB-fixed, henceforth) as well as a model whose linear-inparameters link function contains both fixed and random parameters (NB-random, henceforth). The specifications NB-BART-I, NB-fixed and NB-random use a restricted predictor space, as indicated in column "Predictor space: I" of Table 1. In the restricted predictor space, some continuous predictors are converted into dummy variables, because assuming a constant marginal effect over the entire support of the predictors is not meaningful. For example, left shoulder width is a continuous variable but its support is partitioned at 10 feet. Column "Predictor space: I (random)" in the same table indicates the predictors that are associated with fully-correlated random parameters in the link function. We conducted an extensive specification search, during which we closely monitored model tractability, to determine which predictors to associate with random parameters. We also test another specification of the proposed model (NB-BART-II, henceforth), which considers all predictors in their original form, and allow BART to automatically partition the predictor space. Column "Predictor space: II" of Table 1 indicates which predictors are included in specification NB-BART-II.
For the definition of the spatial weight matrix W , we exploit the inherent neighbourhood structure of the road segments. First, we create a N × N contiguity-based proximity matrix C , which is informed by the neighbourhood relationships of the road segments. An element C i j of C is given by where d i ( j) takes a value of k if i and j are kth-order neighbours and zero otherwise. k * ∈ {x ∈ |x ∈ [1, N ]} is a truncation point, which is set to 3 in the current application. The intuition underlying the definition of C is that distant road segments exhibit weaker spatial correlation than proximal ones. Ultimately, we obtain W by row-normalising C , i.e. W i j =

Implementation and estimation practicalities
We implement the MCMC algorithm outlined in Section 3 by writing our own Python code. 1 Our implementation of the heteroskedastic BART updates follows the documentation provided by Bleich and Kapelner (2014) and Kapelner and Bleich (2016). For the generation of the Pólya-Gamma random variates, we rely on an existing Python implementation (Linderman et al., 2015(Linderman et al., , 2016a of the appropriate sampling methods (Polson et al., 2013;Windle et al., 2014).
For each of the considered model specifications, the MCMC algorithm is executed with four parallel Markov chains and 10,000 draws for each chain, whereby the initial 5,000 draws of each chain are discarded for burn-in. After the burn-in period, a thinning factor of 2 is applied. The step size of the Metropolis-Hastings algorithm for the generation of the posterior draws of the spatial association parameter τ is adaptively tuned with a target acceptance rate of 44%, which is the recommended 1 The Python code is publicly available at https://github.com/RicoKrueger/nb_bart. acceptance ratio for a one-dimensional target density (see Roberts et al., 1997). Convergence of the MCMC simulation is monitored with help of the potential scale reduction factor (Gelman et al., 1992).

Assessment of model fit
We evaluate the goodness of fit of the considered methods using the log pointwise predictive density (LPPD; Gelman et al., 2014) and the root mean square error (RMSE): • LPPD is a strictly proper scoring rule (Gneiting and Raftery, 2007), which corresponds to the logarithm of the pointwise likelihood integrated over the posterior distribution of the relevant model parameters. It is given by where θ i = {ψ i , r}. A larger value of LPPD indicates superior goodness of fit.
• RMSE is informed by the discrepancy between the predicted accident countŷ i and the observed accident count y i , whereby the former is given by the posterior mean of the conditional expectation of the crash count at site i. It is given by

Assessment of site ranking ability
Numerous techniques for the model- Different hot spot identification methods can be compared by assessing the temporal consistency of the produced site rankings (Cheng and Washington, 2008;Montella, 2010). In this study, we employ the site consistency, method consistency and total rank differences tests of Cheng and Washington (2008) to evaluate site ranking consistency We use H α,t to denote the set of sites that are identified as hazardous in time period t at risk level α and define the site ranking consistency tests as follows: • The site consistency test ( SC ) measures the ability of a method to consistently identify a site as high risk over two consecutive time periods. It is based on the assumption that an unsafe site should remain hazardous provided that no safety improvements are implemented. SC for period t is obtained by averaging the predicted accident countsŷ h,t+1 for period t + 1 of all sites h ∈ H α,t : A larger value of SC,t indicates superior site ranking consistency.
• The method consistency test ( MC ) measures the ability of a method to consistently identify the same hazardous sites over two consecutive time periods. MC for period t is given by the cardinality of the intersection of H α,t and H α,t+1 , i.e.
A larger value of MC,t indicates superior site ranking consistency.
• Finally, the total rank differences test ( TRD ) measures the average absolute differences in ranks of hazardous sites over two consecutive time periods. TRD for period t is given by where R h,t denotes the rank of site h in period t. A smaller value of TRD,t indicates superior site ranking consistency. and 11.46, respectively. Furthermore, we observe that NB-BART-I closely matches the performance of NB-random in the considered samples. As expected, NB-fixed exhibits the poorest performance among the four considered model specifications in all four samples due to its rigid link function specification.

Model fit
In sum, Table 2 shows that the endogenous partitioning of the predictor space induced by the BART component in NB-BART-II results in a substantial improvement in goodness of fit. We acknowledge that a modeller could exogenously partition the support of the continuous predictors in manifold ways to eventually achieve a comparable or better goodness-of-fit. However, the computation time of such an iterative hit-and-trial specification search would be prohibitive, as the space of possible specifications is enormously large. NB-BART-II automates this search process and achieves the optimal partitioning of the predictor space and thus obviates the need for expensive specification searches.
Consequently, NB-BART-II with an unrestricted predictor space should be preferred over NB-BART-I with a restricted predictor space.

Site ranking performance
Next  can be seen that the posterior mean probability plots corresponding to the specifications NB-fixed, NB-random, NB-BART-I, NB-BART-II are virtually indistinguishable from each other. This insight corroborates our earlier finding that the automated selection of an optimal link function specification in BART obviates the need for random parameters in a link function. Notwithstanding significant differences in goodness-of-fit (see Table 2 Next, we contrast the site ranking ability of the EB approach as well as of NB-fixed, NB-random, NB-BART-I and NB-BART-II using the site consistency ( SC ), method consistency ( MC ) and total rank differences ( TRD ) tests of Cheng and Washington (2008). Table 3 shows the results of the site ranking consistency tests for three reference period. We observe that NB-BART-II, followed by NB-BART-I, performs best in respect to site consistency for all reference periods. In terms of method consistency and total rank differences, none of the considered methods display superior performance across the three reference periods and the relative difference in the test score of the methods are not substantial. Therefore, we conclude that NB-BART affords similar site ranking consistency as NB-fixed and NB-BART. In addition, we note the EB approach underperforms in terms of site consistency but remains competitive with respect to the other methods in terms of method consistency and total rank differences.

Variable importance and parameter estimates
As a byproduct of the estimation, BART provides a statistic about the variable importance (see . In the context of BART, the notion of "importance" is different from elasticity. To be precise, variable importance of BART predictors denotes the proportion of times a variable of the predictor space is included in a splitting rule. Figure 3 shows the inclusion proportions of the predictors used in the model specifications NB-BART-I and NB-BART-II for the whole observation period. For both NB-BART specifications, we observe that the importance of the different variables is stable across the different years of the observation period. This insight shows that BART can consistently identify the variation explained by each predictor. Furthermore, the importance of several predictors (e.g. road quality index and left shoulder width) is higher in NB-BART-II than in NB-BART-I. This is because these variables are included as dummy variables in NB-BART-I, and variables with binary support only admit at most one split. In contrast, these variables are included as continuous predictors in NB-BART-II, and variables with continuous support inherently provide more ways to be incorporated into decision rules. This analysis suggests that the exogenous conversion of a continuous predictor into a dummy variable may reduce its importance. Yet, the endogenous partitioning of the predictor space in NB-BART-II allows to bypass the shortcomings of a manual specification selection. Furthermore, Figure 3 shows the estimated posterior distribution of the negative binomial shape parameter r (whose inverse is also referred to as the dispersion parameter) and the spatial association parameter τ for year 2010 of the observation period for each of the considered model specifications.
For both model parameters, it can be seen that the posterior distributions produced by the different model specifications closely overlap. For the spatial association parameter τ, we observe that the posterior distributions produced by NB-BART-I and -II are slightly shifted to the right relative the posterior distributions produced by the other model specifications. A possible explanation for this right shift is that the BART component absorbs some of the spatial dependence that is left unexplained by the other model specifications.
Finally, Table 4 displays the parameters estimates for NB-fixed and NB-random for all four years of the observation period. Along with the variable inclusion proportions shown in Figure 3, Table 4 suggests that the importance of the predictors is stable across time periods. Consequently, temporal instability (Mannering, 2018) is not a serious concern in the current application.

Conclusions
In this paper, we propose a spatial negative binomial Bayesian additive regression trees (NB-BART) model for the identification of accident hot spots in road networks. The contribution of our work is threefold: • First, the proposed model re-conceptualises the specification of the link function in count data models. Since NB-BART specifies the link function as a sum-of-step-functions, it can flexibly account for interactions and non-linear relationships between predictors. This flexibility comes at no expense in interpretability, because a modeller can still specify a semiparametric link function in combination with a BART component and random parameters in a linear component. Furthermore, a modeller does not require to exogenously create dummy variables from continuous covariates before entering them into the link function, because BART endogenously partitions the predictor space and automatically extracts the maximum possible information from the predictors.
• Second, we derive a Gibbs sampler for NB-BART by employing the state-of-the-art Pólya-Gamma data augmentation technique. The sampler ensures conjugacy of the conditional posteriors of all non-BART parameters with the exception of a scalar spatial correlation parameter, which can be updated in a Metropolis-Hastings step. The Bayesian inference approach allows for the construction of credible intervals and other derived quantities such as site rankings.
• Third, we benchmark the performance of NB-BART against a baseline negative binomial regression model with a linear link function and spatial correlations. Goodness-of-fit and site ranking measures indicate that NB-BART performs as well as or better than the baseline model with random parameters in the linear-in-parameters link function. Our results suggest that if the objectives of a study are centred around prediction, a modeller may be better off considering a flexible link function with a BART component in lieu of a linear-in-parameters link function with random parameters. Nonetheless, if heterogeneity in some parameters is of interest to the modeller, it can be incorporated in NB-BART with a semiparametric link function. At the same time, we acknowledge that it is possible that a count data model with a sophisticated mixing distribution such as a finite mixture-of-normals or Dirichlet process mixture-of-normals mixing distribution (see e.g. Buddhavarapu et al., 2016;Cheng et al., 2020;Heydari et al., 2016) could effectively control for non-linearities in the predictors and could be more competitive with NB-BART in certain empirical applications. However, such models are time-consuming to estimate and necessitate post-hoc model selection for determining the dimensionality and complexity of the mixing distribution, whereas NB-BART endogenously partitions the predictor space.
The hierarchical nature of NB-BART offers opportunities for extensions in two directions. First, a multivariate extension of NB-BART can facilitate the joint modelling of multiple crash frequency variables such as crash counts by crash type, severity level, and vehicle type (Dong et al., 2014;Yasmin et al., 2018). Second, the proposed NB-BART formulation accounts for only spatial random effects but can be extended to accommodate temporal variation in model parameters as well as spatiotemporal random effects (Li et al., 2019;Liu and Sharma, 2017).
Recent developments in BART can also be incorporated in the above-discussed extensions. First, in the case of sparse and large predictors spaces, the original BART framework is vulnerable to the curse of dimensionality. Recently proposed soft trees can be adopted in order to handle this challenge (Linero and Yang, 2018). Second, Bayesian regression trees have recently been used for observational causal inference (Hahn et al., 2020), which is a critical avenue for future research in the accident analysis literature (Mannering et al., 2020).