Logistic Regression Approach to Modelling Road Traffic Casualties in Ghana

In this study, we derive a logistic regression model for predicting the annual distribution of the proportion of road traffic casualties who die as a result of road traffic accidents in Ghana. Road traffic casualties are defined as road traffic victims who are injured or killed within 30 days of the accident. With 1991 as our reference year, we considered ten independent variables, one for each of the 10 years from 1992 to 2001. Using a significance level of 0.05, we show that the logistic regression coefficients for the years 1993, 1998, 1999, 2000 and 2001 are significantly different from zero, while those of the remaining five years are not. That is, there is little statistical justification for including coefficients for those five years in the model. The proposed model was used to estimate the number of road traffic fatalities from 2002 to 2011, a period of ten years, and the results were compared with the actual fatalities. All the calculated figures corresponding to coefficients that were significantly different from 0 were within 10% of the actual figure, and only one of the five coefficients that were not significant estimated road traffic fatalities within 10% of the actual value.


INTRODUCTION
The increasing population size, with a corresponding increase in the number of registered vehicles accompanied by a rapidly expanding road network, has resulted in an increase in Road Traffic Fatalities (RTFs). Shirley (2006) found that unsafe human behaviour is a major risk factor in accounting for Road Traffic Injuries (RTIs), especially in developing countries, where it is estimated that 64 to 95% of casualties are due to improper human activity by a driver, passenger or pedestrian. Unlike many fatal diseases, road traffic accidents kill people from all age groups, including young and middle-aged people in their active years. Hesse and Ofosu (2014a) reported that a cumulative total of 17 436 fatalities was recorded over the 10-year period from 2001 to 2010, with the highest number of fatalities during this period in the 26-35 year age group. Road traffic accidents are responsible for a far higher rate of death among men, by an approximate ratio of 3:1 (Hesse and Ofosu, 2014b).
Road traffic casualty refers to any road traffic accident victim injured or killed within 30 days of the accident. It should be pointed out that the European Economic Commission (EEC) and the World Health Organization (1979) have recommended a definition for road traffic accident fatalities which includes only deaths which occur within 30 days following the accident, since 93-97% of these fatalities take place within a one-month period. A number of countries have not yet adopted this definition (World Health Organization, 1979). For example, in some countries, a road traffic fatality is recorded only if the victim dies at the site or is dead upon arrival at the hospital. In order to make comparison of accident statistics between countries reasonable, figures obtained from countries which have not adopted the 30-day fatality definition should be properly adjusted. No adjustment is required for figures from countries such as Ghana, U.S.A and Great Britain, which have adopted the standard fatality definition.
Table 1, adapted from the National Road Safety Commission (NRSC) of Ghana, shows the annual distribution of road traffic injuries and fatalities in Ghana from 1991 to 2013. According to the NRSC of Ghana report, the number of road traffic crashes in 2013 (i.e., 9 200) represents a decrease of 23.9% and 18% over the 2012 and 2001 figures, respectively. The number of fatal crashes and the resulting fatalities also decreased relative to the previous year: compared with the 2012 figures, fatal crashes decreased in 2013 by 17% and fatalities by 15.3%. There was also a decrease of 17.9% in the overall number of casualties in 2013 compared with 2012. Relative to the year 2001, the 2013 figures for fatal accidents and fatalities recorded increases of 24.7% and 44.5%, respectively, whilst overall casualties recorded a decrease of 15.6%.
In the logistic regression analysis of these data, road traffic casualty is considered as the response or dependent variable of interest and the year as the predictor. The response has two categories: fatality and injury. The general objective of this analysis is to describe the way in which the distribution of road traffic fatalities among casualties varies by year and to use this variation to predict the future distribution. Logistic regression was proposed as an alternative to ordinary least squares in the late 1960s and early 1970s (Cabrera, 1994), and it became routinely available in statistical packages in the early 1980s. Since that time, the use of logistic regression has increased in the social sciences (e.g., Chuang, 1997; Janik and Kravitz, 1994; Tolman and Weisz, 1995) and in educational research, especially in higher education (Austin et al., 1992).
Other studies have been conducted in the area of road traffic casualties in Ghana. Hesse et al. (2014a) derived a Bayesian model for predicting the annual regional distribution of the number of road traffic fatalities in Ghana. The study showed that population and the number of registered vehicles are predominant factors affecting road traffic fatalities in Ghana. Similar conclusions were reached when a least squares regression method (Hesse et al., 2014b) and a multilevel random coefficient method (Hesse et al., 2014c) were used to derive models for predicting road traffic fatalities in Ghana.

MATERIALS AND METHODS
Let $n_i$ denote the number of road traffic casualties in the $i$th year in Ghana and let $y_i$ denote the number of Road Traffic Fatalities (RTFs) in the $i$th year in Ghana. We view $y_i$ as a value of a random variable $Y_i$ that takes the values $0, 1, \ldots, n_i$. If we assume that the $n_i$ observations for each year are independent, and that they all have the same probability $p_i$ of dying as a result of road traffic accidents (RTAs), then $Y_i$ has the binomial distribution with parameters $n_i$ and $p_i$, i.e. $Y_i \sim B(n_i, p_i)$. The probability mass function of $Y_i$ is given by:

$$P(Y_i = y_i) = \binom{n_i}{y_i} p_i^{y_i} (1 - p_i)^{n_i - y_i}, \qquad y_i = 0, 1, \ldots, n_i.$$

It can be shown that the expected value and variance of $Y_i$ are (Ofosu and Hesse, 2010):

$$E(Y_i) = n_i p_i \quad \text{and} \quad \operatorname{var}(Y_i) = n_i p_i (1 - p_i).$$
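The mean and variance formulas above can be checked numerically. The following is a minimal sketch in R, assuming hypothetical values of n and p that are not taken from the NRSC data; it sums over the probability mass function returned by base R's `dbinom` and compares the results with the closed forms.

```r
# Hypothetical values for illustration only (not taken from Table 1).
n <- 50    # number of casualties in a year
p <- 0.1   # probability that a casualty dies

y   <- 0:n
pmf <- dbinom(y, size = n, prob = p)   # P(Y = y) for y = 0, 1, ..., n

mean_from_pmf <- sum(y * pmf)
var_from_pmf  <- sum((y - mean_from_pmf)^2 * pmf)

# Each pair should agree up to floating-point error.
print(c(from_pmf = mean_from_pmf, closed_form = n * p))
print(c(from_pmf = var_from_pmf,  closed_form = n * p * (1 - p)))
```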
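The odds for the $i$th year is the ratio of the probability to its complement, or the ratio of favourable to unfavourable cases. Thus:

$$\text{odds}_i = \frac{p_i}{1 - p_i}.$$

We take logarithms, calculating the logit or log-odds:

$$\operatorname{logit}(p_i) = \ln\left(\frac{p_i}{1 - p_i}\right).$$

If the logit of the underlying probability $p_i$ is a linear function of the predictors, then we can write

$$\operatorname{logit}(p_i) = \mathbf{x}_i' \boldsymbol{\beta} \qquad (5)$$

where $\mathbf{x}_i'$ is the transpose of a vector of covariates and $\boldsymbol{\beta}$ is a vector of regression coefficients. Exponentiating Eq. (5), we find that the odds for the $i$th unit are given by:

$$\frac{p_i}{1 - p_i} = \exp(\mathbf{x}_i' \boldsymbol{\beta}).$$

Solving for the probability $p_i$ in the logit model gives:

$$p_i = \frac{\exp(\mathbf{x}_i' \boldsymbol{\beta})}{1 + \exp(\mathbf{x}_i' \boldsymbol{\beta})}.$$

Maximum likelihood estimation: The p.d.f. of $Y_i$ is:

$$f(y_i; p_i) = \binom{n_i}{y_i} p_i^{y_i} (1 - p_i)^{n_i - y_i}.$$

The likelihood function is given by:

$$L(\boldsymbol{\beta}) = \prod_i \binom{n_i}{y_i} p_i^{y_i} (1 - p_i)^{n_i - y_i}.$$

The first derivative of $\ln L(\boldsymbol{\beta})$ with respect to $\beta_j$ is

$$\frac{\partial \ln L}{\partial \beta_j} = \sum_i (y_i - n_i p_i)\, x_{ij},$$

where $p_i = \exp(\mathbf{x}_i' \boldsymbol{\beta}) / [1 + \exp(\mathbf{x}_i' \boldsymbol{\beta})]$. The methods for solving the resulting likelihood equations are iterative in nature and have been programmed into logistic regression software. The interested reader may consult the text by McCullagh and Nelder (1989) for a general discussion of the methods used by most programs. The second derivatives used in computing the standard errors of the parameter estimates, $\hat{\boldsymbol{\beta}}$, are:

$$\frac{\partial^2 \ln L}{\partial \beta_j \, \partial \beta_k} = -\sum_i n_i p_i (1 - p_i)\, x_{ij} x_{ik}.$$

The comparison of observed to predicted values using the likelihood function is based on the following expression:

$$D = 2\left[\ln(\text{likelihood of the saturated model}) - \ln(\text{likelihood of the fitted model})\right]. \qquad (11)$$

The log-likelihood of the fitted model can be written as:

$$\ln L = \sum_i \left[ y_i \ln\left(\frac{\hat{y}_i}{n_i}\right) + (n_i - y_i) \ln\left(\frac{n_i - \hat{y}_i}{n_i}\right) \right] + \sum_i \ln\binom{n_i}{y_i}. \qquad (12)$$

For the saturated model, we replace $\hat{y}_i$ in Equation (12) by $y_i$. Equation (11) then becomes:

$$D = 2 \sum_i \left[ y_i \ln\left(\frac{y_i}{\hat{y}_i}\right) + (n_i - y_i) \ln\left(\frac{n_i - y_i}{n_i - \hat{y}_i}\right) \right]$$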
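The iterative solution of the likelihood equations can be sketched with a Newton-Raphson loop. The example below is an illustrative assumption rather than the procedure used in the study: the counts are made up, the variable names are hypothetical, and the result is cross-checked against base R's `glm`. The score vector and information matrix in the loop follow the standard first and second derivatives of the binomial log-likelihood under the logit link.

```r
# Hypothetical grouped data: four years, the first being the reference year.
n <- c(1000, 1200, 1100, 1300)   # casualties per year (made up)
y <- c(100, 150, 90, 160)        # fatalities per year (made up)
X <- cbind(1, diag(4)[, -1])     # intercept column plus three year indicators

beta <- rep(0, ncol(X))
for (iter in 1:50) {
  p     <- plogis(drop(X %*% beta))              # inverse logit of x_i' beta
  score <- crossprod(X, y - n * p)               # first derivatives of ln L
  info  <- crossprod(X, X * (n * p * (1 - p)))   # minus the second derivatives
  step  <- drop(solve(info, score))
  beta  <- beta + step
  if (max(abs(step)) < 1e-10) break
}
se <- sqrt(diag(solve(info)))   # approximate standard errors

# Cross-check against R's own fitting routine.
fit <- glm(cbind(y, n - y) ~ X - 1, family = binomial)
print(cbind(newton = beta, glm = unname(coef(fit)), se = se))
```

Because this toy model has one parameter per year, the intercept converges to the logit of the observed proportion in the reference year, which gives a simple sanity check.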
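where $y_i$ is the observed value and $\hat{y}_i$ is the fitted value for the $i$th observation. In particular, to assess the significance of an independent variable, we compare the value of $D$ with and without the independent variable in the equation. The change in $D$ due to the inclusion of the independent variable in the model is:

$$G = D(\text{model without the variable}) - D(\text{model with the variable}) = -2 \ln\left[\frac{\text{likelihood of the model without the variable}}{\text{likelihood of the fitted model}}\right].$$

It can be shown that, when the variable is not in the model, the maximum likelihood estimate of $\beta_0$ is $\ln(n_1 / n_0)$, where

$$n_1 = \sum_i y_i \quad \text{and} \quad n_0 = \sum_i (n_i - y_i).$$

The statistic $G$ can then be written as

$$G = 2\left\{ \sum_i \left[ y_i \ln \hat{p}_i + (n_i - y_i) \ln(1 - \hat{p}_i) \right] - \left[ n_1 \ln(n_1) + n_0 \ln(n_0) - n \ln(n) \right] \right\},$$

where $n = n_1 + n_0$. If the hypothesis that $\beta_j = 0$, $j = 1, 2, \ldots, k$, is true, then $G$ has the chi-square distribution with $k$ degrees of freedom (Hosmer et al., 2013).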
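The deviance and the statistic G can be computed by hand and compared with what `glm` reports. This is a minimal sketch on made-up counts (not the NRSC figures); the object names are hypothetical, and the fitted counts of the intercept-only model are taken to be n_i times the pooled proportion n1 / n.

```r
# Hypothetical grouped data (not the NRSC figures).
n    <- c(1000, 1200, 1100, 1300)   # casualties per year
y    <- c(100, 150, 90, 160)        # fatalities per year
year <- factor(1991:1994)           # 1991 is the reference level

fit <- glm(cbind(y, n - y) ~ year, family = binomial)

n1 <- sum(y); n0 <- sum(n - y); N <- n1 + n0
beta0_null <- log(n1 / n0)          # MLE of beta_0 when no year terms are present

# Deviance of the intercept-only model, with fitted counts n_i * n1 / N.
y_hat  <- n * n1 / N
D_null <- 2 * sum(y * log(y / y_hat) + (n - y) * log((n - y) / (n - y_hat)))

# G from the closed form, using the fitted proportions of the year model.
p_hat <- fitted(fit)
G <- 2 * (sum(y * log(p_hat) + (n - y) * log(1 - p_hat)) -
          (n1 * log(n1) + n0 * log(n0) - N * log(N)))

print(c(G = G, glm_G = fit$null.deviance - fit$deviance, D_null = D_null))
```

With one coefficient per year the fitted model is saturated, so its deviance is essentially zero and G coincides with the null deviance.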

RESULTS AND DISCUSSION
In this section, we illustrate the use of statistical packages in R to fit logistic regression models as a special case of a generalized linear model with family binomial and link logit. We begin the analysis with the nlme package loaded in R, although the glm function used in Listing (2) is provided by R's base stats package. First, the data set on road traffic casualties in Ghana from 1991 to 2001 is loaded for analysis, as shown in Listing (1).

Listing (2):
> logistic <- glm(Y ~ year, family = binomial, data = rtf)

The results of the application of the R function 'summary(logistic)', which presents the parameter estimates and standard errors for the model, are simplified in Table 2.
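Since Listing (1) is not reproduced here, the layout of the `rtf` data frame is not known. The sketch below shows one plausible arrangement, under the assumption that the response is a two-column count of fatalities and injuries and that `year` is a factor with 1991 as its first level. The column names `fatal` and `injured` and all counts are hypothetical.

```r
# Hypothetical counts; the NRSC figures of Table 1 are not reproduced here.
rtf <- data.frame(
  year    = factor(1991:1994),   # factor, so 1991 becomes the reference level
  fatal   = c(100, 150, 90, 160),
  injured = c(900, 1050, 1010, 1140)
)

# Two-column response: (number of fatalities, number of injuries).
logistic <- glm(cbind(fatal, injured) ~ year, family = binomial, data = rtf)

# Columns: Estimate, Std. Error, z value, Pr(>|z|).
print(coef(summary(logistic)))
```

An equivalent specification passes the observed proportion as the response together with `weights` equal to the number of casualties per year.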
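The fitted logistic equation, for the $i$th year, is therefore given by:

$$\operatorname{logit}(\hat{p}_i) = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \hat{\beta}_2 x_{i2} + \cdots + \hat{\beta}_{10} x_{i,10}, \qquad (15)$$

where the estimates $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_{10}$ are those reported in Table 2.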
Note further that, in computing $\hat{p}_i$ for a year $i$ other than the reference year, $x_{ij}$ takes the value one (1) for $i = j$ while the remaining 9 predictors assume the value zero (0). Thus, from the fitted equation,

$$\hat{p}_i = \frac{\exp(\hat{\beta}_0 + \hat{\beta}_i)}{1 + \exp(\hat{\beta}_0 + \hat{\beta}_i)}.$$

The values of $\hat{p}_i$ are given in Table 2. The method for specifying the design variables involves setting all of them equal to 0 for the reference year (1991), and then setting a single design variable equal to 1 for each of the other groups.
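The design-variable coding described above corresponds to R's default treatment contrasts for a factor. As a brief sketch, `model.matrix` can be used to display the coding for the eleven years; the coefficient values used afterwards are made up for illustration and are not those of Table 2.

```r
year <- factor(1991:2001)    # 1991 is the first level, hence the reference year
X <- model.matrix(~ year)    # intercept plus 10 indicator columns

print(X[1:3, 1:4])           # rows for 1991, 1992 and 1993

# With hypothetical estimates, p_hat for year j is plogis(beta0 + beta_j).
beta_hat <- c(-2.2, (1:10) / 100)   # made-up values, not Table 2
p_hat <- plogis(drop(X %*% beta_hat))
print(round(p_hat, 4))
```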
The significance of the logistic regression relationship can be assessed by using the null deviance to test the hypotheses:

$$H_0: \beta_1 = \beta_2 = \cdots = \beta_{10} = 0 \quad \text{against} \quad H_1: \beta_j \neq 0 \text{ for at least one } j$$

at the 0.05 level of significance. The test statistic is:

$$G = 2\left\{ \sum_i \left[ y_i \ln \hat{p}_i + (n_i - y_i) \ln(1 - \hat{p}_i) \right] - \left[ 12402 \ln(12402) + 110148 \ln(110148) - 122550 \ln(122550) \right] \right\}.$$

When $H_0$ is true, $G$ has the chi-square distribution with 10 degrees of freedom (Hosmer et al., 2013). We reject $H_0$ at significance level 0.05 if the computed value of $G$ is greater than $\chi^2_{0.05,10} = 18.31$.

From the R function 'summary(logistic)', the value of the test statistic is $G = 74.182$.

Since 74.182, the calculated value of $G$, is greater than 18.31, the test is significant at the 5% level. We therefore reject the null hypothesis in this case and conclude that at least one of the 10 coefficients is different from zero.
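The critical value and the decision above can be reproduced with base R's chi-square functions. The sketch uses only the reported statistic, G = 74.182, and the 10 degrees of freedom stated in the text; the variable names are illustrative.

```r
G     <- 74.182   # test statistic reported in the text
df    <- 10       # number of year coefficients tested
alpha <- 0.05

critical <- qchisq(1 - alpha, df)               # about 18.31
p_value  <- pchisq(G, df, lower.tail = FALSE)   # upper-tail probability

print(c(critical = critical, p_value = p_value))
print(G > critical)   # TRUE means H0 is rejected at the 5% level
```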
Since the analysis indicates that the null hypothesis should be rejected at the 5% level, some of the coefficients are significantly different from zero, but the analysis does not specify which. Before concluding that any or all of the coefficients are nonzero, we may look at the univariable Wald test statistics (Hosmer et al., 2013):

$$W_j = \frac{\hat{\beta}_j}{\widehat{SE}(\hat{\beta}_j)}.$$

By this test, the coefficients for five of the ten years, namely those other than 1993, 1998, 1999, 2000 and 2001, are not significantly different from zero. According to Hosmer et al. (2013), the decision to include variables in a model cannot be based entirely on tests of statistical significance. The choice of variables in the model may be influenced by other considerations. It is possible for the coefficients of some variables to be individually indistinguishable from zero at a given level of significance while, taken collectively, considerable confounding is present in the data (Rothman et al., 2008; Maldonado and Greenland, 1993; Greenland, 1989; Miettinen, 1976).
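The Wald statistic and its two-sided p-value are straightforward to compute once estimates and standard errors are available. The sketch below uses made-up estimates and standard errors for three coefficients (they are not the values of Table 2) and flags those significant at the 0.05 level.

```r
# Made-up estimates and standard errors for three coefficients (not Table 2).
estimate  <- c(year1992 = 0.020, year1993 = 0.150, year1994 = -0.030)
std_error <- c(0.050, 0.050, 0.050)

W       <- estimate / std_error   # Wald statistic for each coefficient
p_value <- 2 * pnorm(-abs(W))     # two-sided p-value under N(0, 1)

print(data.frame(estimate, std_error, W, p_value,
                 significant = p_value < 0.05))
```

These are the same quantities that `summary` reports for a binomial `glm` fit in its "z value" and "Pr(>|z|)" columns.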
The purpose of analysing these data is not the determination of the parameters. Interest is centred on how well the model estimates future road traffic fatality values using these estimates. At this stage, we use the model in Eq. (15) to estimate the number of road traffic fatalities for the years 2002 to 2011, a period of ten years. To do this, a single design variable $x_{ij}$, for year $i$, is set equal to 2 when $i = j$ and all remaining variables are set equal to 0, where $i$ represents any of the years from 2002 to 2011. We use $x_{i1}, x_{i2}, \ldots, x_{i,10}$ in Eq. (15) as our design variables for the years 2002, 2003, ..., 2011, respectively. For instance, in year 3 (i.e., the year 2004), the design variables together with the corresponding parameter estimates are given in Table 3.
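Thus, a point estimate of the proportion of road traffic casualties who died in 2004 is given by Eq. (16):

$$\hat{p} = \frac{\exp(\mathbf{x}' \hat{\boldsymbol{\beta}})}{1 + \exp(\mathbf{x}' \hat{\boldsymbol{\beta}})}, \qquad (16)$$

evaluated at the design variables and parameter estimates of Table 3. From Table 1, the total number of road traffic casualties in 2004 is 18 445. Thus a point estimate of the total number of road traffic fatalities in 2004 is (to the nearest whole number) $\hat{D} = 18\,445 \times \hat{p}$. The actual road traffic fatalities $D$ together with the values of $\hat{D}$ calculated from Eq. (16) are given in Table 4. The percentage differences between the calculated and actual values are also given in Table 4.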
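The point estimate described above amounts to applying the inverse logit to the linear predictor and scaling by the year's casualty count. A minimal sketch follows: the function name `estimate_fatalities` and the coefficient values are hypothetical, while the casualty count of 18 445 for 2004 and the design-variable value of 2 are taken from the text.

```r
# Point estimate of fatalities from a fitted logit, rounded to a whole number.
# beta0, beta_j: intercept and year coefficient (hypothetical inputs here);
# x: value given to the single non-zero design variable;
# casualties: total casualties in the target year.
estimate_fatalities <- function(beta0, beta_j, x, casualties) {
  p_hat <- plogis(beta0 + x * beta_j)   # estimated proportion who die
  round(casualties * p_hat)
}

# 18 445 casualties in 2004 comes from the text; the coefficients are made up.
d_hat_2004 <- estimate_fatalities(beta0 = -2.2, beta_j = 0.05, x = 2,
                                  casualties = 18445)
print(d_hat_2004)
```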
It can be seen that all the calculated figures, $\hat{D}$, corresponding to the coefficients, $\beta_j$, that were significantly different from 0 are within 10% of the actual figure, and only one (i.e., 0.07179) of the five coefficients that were not significantly different from 0 estimated $D$ within 10% (i.e., 3.9%) of its actual value.
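The 10% comparison can be expressed in a few lines. The sketch assumes that the percentage difference is measured relative to the actual figure, which the text does not state explicitly, and it uses made-up numbers rather than the entries of Table 4.

```r
# Made-up actual and estimated fatalities for three years (not Table 4).
actual    <- c(1600, 1700, 2200)
estimated <- c(1550, 1950, 2150)

pct_diff  <- 100 * (estimated - actual) / actual   # signed percentage difference
within_10 <- abs(pct_diff) <= 10

print(data.frame(actual, estimated, pct_diff = round(pct_diff, 1), within_10))
print(sum(within_10))   # number of estimates within 10% of the actual figure
```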

CONCLUSION
Logistic regression analysis of road traffic fatalities in Ghana has been performed using road traffic accident data from the National Road Safety Commission. The data span from 1991 to 2001. The formula for predicting the proportion of road traffic casualties who die in the $i$th year using a logistic regression approach is given by Eq. (15), with the parameter estimates shown in Table 2. Using the model to estimate the number of road traffic fatalities from 2002 to 2011 in Ghana, it was noted that, of the 10 calculated figures, 6 are within 10% of the actual figure.

Table 1: Annual distribution of road traffic fatalities and injuries in Ghana from 1991 to 2013. The road traffic accident statistics in 2013 represent a reduction of 15.3% in fatalities over the 2012 figure. The fatality figure of 1 898 in 2013 is the lowest since the year 2007.

Table 2: Parameter estimates for logistic model of road traffic fatalities in Ghana from 1991 to 2001. The Wald statistics $W_i$ are shown in the seventh row of Table 2, labeled z. Under the hypothesis that the $i$th coefficient is zero, $W_i$ has the standard normal distribution. The eighth row of Table 2 shows the p-values which are computed under this hypothesis. The coefficients for the years 1993, 1998, 1999, 2000 and 2001 are different from zero at the 0.05 level.