Elsevier

Accident Analysis & Prevention

Volume 61, December 2013, Pages 78-86
Goodness-of-fit testing for accident models with low means

https://doi.org/10.1016/j.aap.2012.11.007

Abstract

The modeling of relationships between motor vehicle crashes and underlying factors has been investigated for more than three decades. Recently, many highway safety studies have documented the use of negative binomial (NB) regression models. On rare occasions, the Poisson model may be the only alternative, especially when the crash sample mean is low. Pearson's X² and the scaled deviance (G²) are two common test statistics that have been proposed as measures of goodness-of-fit (GOF) for Poisson or NB models. Unfortunately, transportation safety analysts often deal with crash data that are characterized by low sample mean values. Under such conditions, the traditional test statistics may not perform very well.

This study has three objectives. The first objective is to examine all the traditional test statistics and compare their performance for the GOF of accident models subjected to low sample means. Secondly, this study proposes a new test statistic for the Poisson regression model that, unlike the grouped G² method, does not depend on the sample size. The proposed method is easy to use and does not require grouping data, which is time-consuming and may not be feasible if the sample size is small. Moreover, the proposed method can be used for lower sample means than documented in previous studies. Thirdly, this study provides guidance on how and when to use appropriate test statistics for both Poisson and negative binomial (NB) regression models.

Highlights

► We examined test statistics and compared their performance for the GOF of crash models subjected to low sample means.
► We proposed a new test statistic for the Poisson regression model that does not depend on the sample size.
► We provided guidance on how and when to use appropriate test statistics for both Poisson and negative binomial regression models.

Introduction

The modeling of relationships between motor vehicle crashes and underlying factors, such as traffic volume and highway geometric features, has been investigated for more than three decades. The statistical models (sometimes referred to as crash prediction models) from which these relationships are developed can be used for various purposes, including predicting crashes on transportation facilities and determining which variables significantly influence crashes. Recently, many highway safety studies have documented the use of Poisson-gamma (also referred to as negative binomial – NB) regression models (Miaou and Lum, 1993; Poch and Mannering, 1996; Miaou and Lord, 2003; Maycock and Hall, 1984; Lord et al., 2005; Miaou, 1994; Maher and Summersgill, 1996). On rare occasions, the Poisson model has been found to be suitable for modeling crashes, especially when the crash sample mean is low (Joshua and Garber, 1990; Miaou et al., 1992; Ivan et al., 2000; Lord and Bonneson, 2007). Although the Poisson model is not used as commonly as the NB model, it is frequently used when the sample mean is low due to the nature of count data (see, e.g., Lord and Bonneson, 2007). With the Poisson or NB models, the relationships between motor vehicle crashes and explanatory variables can then be developed by means of the generalized linear model (GLM) framework.

Pearson's X² and the scaled deviance (G²) are two common test statistics that have been proposed as measures of GOF for Poisson or NB models (Maher and Summersgill, 1996). Statistical software (e.g., SAS) also uses these two statistics for assessing the GOF of a GLM (SAS Institute Inc., 1999). Unfortunately, transportation safety analysts often deal with crash data that are characterized by low sample mean values. Under such conditions, the traditional test statistics may not perform very well. This has been referred to in the highway safety literature as the low mean problem (LMP). The study by Sukhatme (1938) concluded that, “for samples from a Poisson distribution with mean as low as one, Pearson's X² test for goodness of fit is not good.” In the field of traffic safety, this issue was first raised by Maycock and Hall (1984) and further discussed by Maher and Summersgill (1996), Fridstrom et al. (1995), and Agrawal and Lord (2006). Wood (2002) proposed a more complex technique, the grouped G² method, to solve this problem. The grouped G² method is based on the knowledge that through grouping, the data become approximately normally distributed and the test statistics follow a χ² distribution. Some issues (e.g., sample size) regarding this method are discussed in Section 3. It should be noted that different models can be compared by means of Akaike's information criterion (AIC) (Akaike, 1974) or the Bayesian information criterion (BIC) (Schwarz, 1978). However, similar to the previous studies (Maher and Summersgill, 1996; Wood, 2002; Agrawal and Lord, 2006), this research studies statistics for the GOF of a given model (either a Poisson model or an NB model), that is, it investigates how well a developed model fits the observed data.

This study has three objectives. The first objective is to examine all the traditional test statistics and compare their performance for the GOF of GLMs subjected to low sample means. Secondly, this study proposes a new test statistic for the Poisson regression model that, unlike the grouped G² method, does not depend on the sample size. The proposed method is easy to use and does not require grouping data, which is time-consuming and may not be feasible if the sample size is small. Moreover, as will be shown in this study, the proposed method can be used for lower sample means than documented in previous studies. The third objective is to provide guidance on how and when to use appropriate test statistics for both Poisson and NB regression models, especially the grouped G² method, which is complex and may not be ready for practitioners.

Statistical models

GLMs represent a class of fixed-effect regression models for dependent variables (McCullagh and Nelder, 1989), such as crash counts in traffic accident models. Common GLMs include linear regression, logistic regression, and Poisson regression. Given the characteristics of motor vehicle collisions (i.e., random, discrete, and non-negative independent events), stochastic modeling methods need to be used over deterministic methods. The two most common stochastic modeling methods utilized for

Goodness-of-fit test statistics

GOF tests use the properties of a hypothesized distribution to assess whether or not observed data are generated from a given distribution (Read and Cressie, 1988). The best-known GOF test statistics are Pearson's X² and the scaled deviance (G²). Pearson's X² is generally calculated as follows: $X^2 = \sum_{i=1}^{n} \left[(y_i - \mu_i)/\sigma_i\right]^2$, where $y_i$ is the observed data, $\mu_i$ is the true mean from the model, and $\sigma_i$ is the error and is usually represented by the standard deviation of $y_i$. The scaled deviance is
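The Pearson statistic defined above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names `pearson_chi2` and `nb_variance` are hypothetical, the Poisson case uses σᵢ² = μᵢ, and the NB case assumes the common quadratic variance form μ + αμ² with a dispersion parameter α that does not appear in the text above. The observed counts and fitted means are made-up values chosen to mimic a low-mean dataset.

```python
def pearson_chi2(observed, fitted, variance=None):
    """Pearson's X^2 = sum(((y_i - mu_i) / sigma_i)^2) over all observations.

    `variance` maps a fitted mean mu_i to sigma_i^2. When omitted, the
    Poisson assumption sigma_i^2 = mu_i is used.
    """
    if len(observed) != len(fitted):
        raise ValueError("observed and fitted must have the same length")
    if variance is None:
        variance = lambda mu: mu  # Poisson: Var(y_i) = mu_i
    return sum((y - mu) ** 2 / variance(mu) for y, mu in zip(observed, fitted))


def nb_variance(alpha):
    """Assumed NB variance function: Var(y_i) = mu_i + alpha * mu_i^2."""
    return lambda mu: mu + alpha * mu ** 2


# Hypothetical low-mean crash counts and fitted means (illustrative only).
y = [0, 1, 0, 2, 0, 1]
mu = [0.4, 0.8, 0.3, 1.1, 0.5, 0.9]

x2_poisson = pearson_chi2(y, mu)
x2_nb = pearson_chi2(y, mu, nb_variance(0.5))
print(f"Poisson X^2 = {x2_poisson:.3f}, NB X^2 = {x2_nb:.3f}")
```

Passing the variance as a function keeps the same summation usable for either model; only the assumed σᵢ² changes.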

Discussion

The results of this study show that the Pearson's X² statistic tends to overestimate GOF values for low μ values, since the V(X²) values are larger than 2. This is because the components (i.e., $(y_i - \mu_i)^2/\mu_i$ for Poisson models) will be inflated when the predicted values ($\mu_i$) are low. For instance, with the observed crash dataset in the first case, the Poisson model predicted 1.02 crashes per year for one of the intersections. However, 4 crashes were observed at that intersection. The contribution to X² would
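The inflation of individual components at low means can be checked numerically. The sketch below is an illustration rather than the paper's procedure: it computes the exact mean and variance of a single Poisson component (y − μ)²/μ by summing over the Poisson probability mass function, and compares the variance with 2 + 1/μ, which follows from standard Poisson moment algebra and is not quoted from the text above. The helper names and the truncation point `k_max` are assumptions. The last lines evaluate the component for the intersection mentioned above (4 observed crashes against 1.02 predicted).

```python
import math


def poisson_pmf(k, mu):
    """Poisson probability P(Y = k), computed on the log scale for stability."""
    return math.exp(-mu + k * math.log(mu) - math.lgamma(k + 1))


def component_moments(mu, k_max=200):
    """Mean and variance of one Pearson component (Y - mu)^2 / mu, Y ~ Poisson(mu).

    The infinite sum is truncated at k_max, which is ample for small mu.
    """
    mean = 0.0
    second_moment = 0.0
    for k in range(k_max + 1):
        p = poisson_pmf(k, mu)
        c = (k - mu) ** 2 / mu
        mean += p * c
        second_moment += p * c * c
    return mean, second_moment - mean ** 2


for mu in (0.1, 0.5, 1.0, 5.0):
    mean, var = component_moments(mu)
    # The mean stays at 1, while the variance exceeds 2 by 1/mu.
    print(f"mu={mu}: mean={mean:.4f}, variance={var:.4f}, 2 + 1/mu={2 + 1 / mu:.4f}")

# Single component for 4 observed crashes against 1.02 predicted.
contribution = (4 - 1.02) ** 2 / 1.02
print(f"contribution = {contribution:.2f}")  # about 8.71
```

The expected value of each component is 1 regardless of μ, so a single component near 8.7 illustrates how one low-mean site can dominate the statistic.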

Conclusions and future work

The NB regression model is the most common type of model used for analyzing traffic crashes. Depending on the characteristics of the data, the Poisson model has been found to be a suitable model on rare occasions. These models help establish the relationship between traffic crashes (response variable) and traffic flow, highway geometrics, and other explanatory variables. To evaluate their statistical performance, GOF tests need to be used. Since crash data are often characterized

Acknowledgements

The authors gratefully acknowledge the comments provided by Dr. Graham Wood from the Department of Statistics at Macquarie University in Sydney, Australia. This study is supported by the National Natural Science Foundation of China (Grant No. 51128802) and the Jiangsu Province leading disciplines of universities program.

References (42)

  • American Association of State Highway and Transportation Officials

    Highway Safety Manual

    (2010)
  • R. Agrawal et al.

    Effects of Sample Size on the Goodness-of-Fit Statistic and Confidence Intervals of Crash Prediction Models Subjected to Low Sample Mean Values. CD-ROM

    (2006)
  • H. Akaike

    A new look at the statistical model identification

    IEEE Transactions on Automatic Control

    (1974)
  • F.J. Anscombe

    The statistical analysis of insect counts based on the negative binomial distribution

    Biometrics

    (1949)
  • K.A. Baggerly

    Empirical likelihood as a goodness-of-fit measure

    Biometrika

    (1998)
  • J. Baglivo et al.

    Methods for exact goodness-of-fit tests

    Journal of the American Statistical Association

    (1992)
  • C. Carota

    A family of power-divergence diagnostics for goodness-of-fit

    The Canadian Journal of Statistics

    (2007)
  • N.A.C. Cressie et al.

    Multinomial goodness-of-fit tests

    Journal of the Royal Statistical Society, Series B

    (1984)
  • N.A.C. Cressie et al.

    Goodness-of-Fit Statistics for Discrete Multivariate Data

    (1988)
  • N.A.C. Cressie et al.

    Pearson's X² and the log likelihood ratio statistic G²: a comparative review

    International Statistical Review

    (1989)
  • R.A. Fisher

    The negative binomial distribution

    Annals of Eugenics

    (1941)