1 Introduction

Fatality resulting from motor vehicle crashes is the fifth leading cause of death in the United States. Data from the National Highway Traffic Safety Administration indicates that since 1949, more than 30,000 (40,000 average) fatalities result from motor vehicle crashes every year. However, the current trend shows this number is declining. For example, a 1.9 % decrease in crash-related fatalities was observed in 2011 as compared to 2010. Nonetheless, crash-related injuries are still large in number. In 2011, estimated 2.22 million people were injured in motor vehicle traffic crashes and 2.24 million in 2010 [1]. Fatal crashes on highway-rail grade crossings (HRGCs) contributed to 261 deaths in 2010 and 251 in 2011 [2].

HRGCs are conflict points between highway users and rail equipment (e.g., locomotive, freight car, caboose, or service equipment car operated by a railway company), which has contributed to a considerable amount of crashes in the U.S. history. Although the trend of highway user crashes with rail equipment is showing a decrease in numbers, much has yet to be done to improve the safety of HRGCs. Unlike highway traffic crashes, a significantly high percentage of vehicle-rail crashes lead to fatality and injury to vehicle users. For example, data in the past 8 years (2005–2012) indicate that 8.55 % of vehicle-rail crashes were fatal and 26.68 % resulted in injury [2]. However, in the case of highway traffic crashes, the percentage of fatal crashes is not more than 2 % [1].

Despite the fact that highway user-rail crashes have a significant effect on highway user safety, the subject (of examining the injury severity levels in such crashes) still receives little attention. An understanding of the factors contributing to the levels of injury severity is an important step toward making the transportation system safer and more attractive. Responsible jurisdictions may use the results of this research to derive road user safety measures and policies.

One of the most important tasks in improving road safety is to uncover influential factors and then to develop countermeasures. The relationship between the injury severity of traffic crashes and factors such as driver and passenger characteristics, pedestrian age and gender, vehicle type, environmental conditions, and traffic and geometric conditions has attracted much attention. A better understanding of this relationship is necessary and very important for improving facility design so that accidents can be reduced. It is important to note that reducing crash frequency and/or reducing crash–injury severity may necessitate different strategic approaches. The development of effective countermeasures requires a thorough understanding of the factors that affect the likelihood of a crash occurring or, given that a crash has occurred, the characteristics that may mitigate or exacerbate the degree of injury sustained by crash-involved road users. To gain such an understanding, safety researchers have applied a wide variety of methodological techniques over the years.

Numerous studies have applied statistical models for crash–injury severity. Among them, the unordered logit, ordered logit models, and their variations are the most often used models. Savolainen et al. [3] briefly discussed and summarized a wide range of methodological tools applied to study the impact of various factors on motor vehicle crash–injury severity. As presented in the paper, ordered logit and probit, multinomial logit, binary logit and binary probit, and nested logit are some of the frequently used statistical methodologies. In particular, logistic regression has been widely applied to model crash severity levels. Variables such as elements of geometric design, traffic operational measures, and environmental conditions are considered as independent variables to estimate the severity. In particular, it is important to note that modeling ordinal outcome-dependent variable using nominal variable will lead to loss of efficiency as a result of ignoring information. In the reverse, modeling nominal variable using ordinal variable will give biased or sometimes irrational estimates [4].

As discussed, crashes occurring at HRGCs have a significant effect on highway user safety and the importance of conducting research in such areas is evident. However, this subject receives less attention and little research efforts have been made in this particular area (except [5]) in which a multinomial logit model was developed to analyze the severity of vehicle crashes at HRGCs). As such, the objective of this research is to apply and compare the multinomial logit and ordered logit models to explore the impacts of various factors contributing to different levels of crash severity to vehicle users as a result of vehicle-rail crashes on HRGCs.

The remainder of this paper is organized as follows. First, Sect. 2 presents a literature review of existing studies regarding vehicle crash severity modeling. Then, Sect. 3 describes the multinomial logit (MNL) and ordered logit (ORL) modeling methodology. Section 4 discusses the data assembly and analysis of the research. Section 5 presents numerical results and discussion. Finally, conclusions and recommendations are made in Sect. 6.

2 Literature review

Several studies have been conducted to model crash severity and investigate the impacts of various factors involved in the crashes. Mercier et al. [6] conducted a study and tested the hypothesis that older drivers and passengers would suffer more severe injuries than younger adults in the presence of broadside and angle collisions of automobiles on rural highways. Logistic modeling, Hierarchical Regression Analysis, and Principal Components Regression were analysis tools applied. Injury severity levels, fatal, major, and minor were considered as dependent categorical variables. Some of the independent variables which were considered included occupant age, occupant position relative to point of impact and protection. According to the study, age is reported as a significant predictor of injury severity and is slightly higher for females than males. It was also identified that the use of lap and shoulder restrains, reduces injury severity, and is less certain for females. For females only, air bags deployed were reported as significant injury severity predictors.

By using sequential binary logistic regression, Dissanayake and Lu [7] modeled crash severity for single-vehicle fixed object crashes involving young drivers. The five crash severity categories which were considered including no injury, possible injury, noncapacitating injury, incapacitating injury, and fatal. As reported in the study, factors such as alcohol or drug influence, ejection in the crash, point of impact, rural crash locations, existence of curve or grade at the crash location, and speed of vehicle significantly increased the probability of more severe crashes. On the other hand, restraint device usage and drivers being of male gender were reported to reduce the chance of high severity crashes. It was also indicated that factors such as weather condition, residence location, and physical condition had no significant relation in the model.

Duncan et al. [8] conducted a study to investigate car occupant injury severity in two-vehicle passenger car-truck rear-end collisions by an ordered probit model. As reported in the study, factors such as darkness, large speed differentials, high speed limits, grades, being in a car struck to the rear, driving while drunk, and being female increased the passenger vehicle occupant injury severity. On the other hand, factors such as snowy or icy roads, being in a child restraint, and congested roads decreased the severity level. It was also indicated that interaction effects of cars being struck to the rear with large speed differentials and car rollovers were significant.

Donnell and Mason [9] conducted a study and developed median-related crash severity models. Three crash severity classes, fatal, injury, and property damage only (PDO) were considered as independent variable outcome. Both ordinal and nominal response logistic regression models were developed in the study. As indicated in the report, the ordinal response model gave more attractive results for cross-median crashes. On the other hand, the nominal response model gave better result for median-barrier crashes. Furthermore, variables such as highway surface conditions, use of drugs or alcohol, presence of an interchange entrance ramp, horizontal alignment, crash type, and average daily traffic volume were reported to have effect on crash severity.

By using paired comparison analysis and ordered probit model, Renski et al. [10] conducted a study to test the hypothesis that a speed limit increase will result in an increase in driving speed and produce higher crash severity. The study was focused on single-vehicle crashes on interstate roadways in North Carolina. As reported in the study, increasing speed limits from 89 to 97 km/h and from 89 to 105 km/h increased the probability of sustaining minor and noncapacitating injuries. However, the study indicated that increasing speed limits from 105 to 113 km/h did not show significant effect on crash severity [10].

Huang et al. [11] investigated effects of road diets in which four-lane undivided roads were converted into three lanes. Twelve road diets and 25 comparison sites in California and Washington cities were considered in the study. A “yoked comparison” study was applied to a “before” and “after” analysis and it was reported that road-diet crashes were observed to be lower by 6 percent than that of the comparison sites before road-diet countermeasures were made. Khattak [12] conducted a study that investigated the effect of vehicle technologies on crash–injury severity. The North Carolina 1994–1995 HSIS crash data were used for the analysis. Three separate ordered probit models were developed for the three drivers, Driver 1 (leading), Driver 2 (striking), and Driver 3 (striking in a three-vehicle crash). As indicated in the study, in a two-vehicle rear-end collision the leading driver is more likely to be injured, whereas in a three-vehicle collision, the driver in the middle is more likely to be injured. It was also stated that being in a newer vehicle protects the driver in rear-end collisions. Moreover, the study showed the benefit of technological improvements on driver safety.

Mercier et al. [13] performed a study and tested the hypothesis that older drivers and passengers would suffer more severe injuries than younger adults in the presence of head-on collisions of automobiles on rural highways. Logistic modeling, Hierarchical Regression Analysis, and Principal Components Regression were applied. Injury severity levels fatal, major, and minor were considered as dependent categorical variable. The independent variables considered included, among others, occupant age, occupant position relative to point of impact, and level of protection. As stated in the study, age was an important factor in predicting injury severity for both men and women. The study concluded that older drivers and passengers experienced more severe injuries than any of other age groups. The use of lap and shoulder devices was reported to be more important for men than women while the reverse is true for deployed air bags.

Chira-Chavala et al. [14] investigated the characteristics and probable causes of light rail transit system accidents and developed a crash severity model for the Santa Clara County Transit Agency. A binary logit model was applied to predict the probability of injury accident as a function of explanatory variables such as speeds before collision of light rail vehicles and motor vehicles, movement of the motor vehicle before collision, etc. As reported in the study, left-turn vehicle movements, higher speeds of the motor vehicle, or the LRV and accident occurring during peak hours increased the probability of injury accidents.

Chen and Jovanis [15] developed and tested the variable-selection procedure that avoids problems occurring due to the presence of the large number of potential factors, the complex nature of crash causes and outcomes, and the large number of categories in crash severity modeling. Bus-involved crash data for Freeway 1 in Taiwan from 1985 through 1993 were used. The procedure consisted of the χ 2 automated interaction detection (CHAID) method to collapse categories. Pearson χ 2 test was used to assess the relationship between dependent and independent variables, and log-linear modeling techniques. As indicated in the study, the log-linear model showed that late-night or early-morning driving increased the risk of severe injury crashes for bus drivers. It was also stated that bus crashes involving a large truck or tractor-trailers increased the risk of severe injury crashes.

By using an ordered probit model, Khattak et al. [16], explored factors contributing to more severe older driver (age of 65 and above) crash–injury severity by analyzing 1990–1999 crash data from Iowa. According to the study, older male drivers are more prone to injury as compared to older female drivers. It was stated that older drivers under the influence of alcohol experienced more severe injuries. It was also indicated that older driver injuries involving farm vehicles are more severe as compared to other vehicle types. Xie et al. [17] conducted a study that demonstrated application of a Bayesian ordered probit model in drivers’ injury severity analysis. In the Bayesian probit model, prior distributions such as means and variances were included (reflecting the analysts’ prior knowledge about the data). Comparisons were made between Bayesian ordered probit and conventional ordered probit models. As reported in the study, for large data size, model fitting results obtained from the Bayesian and the conventional probit model have no significant differences. It was also reported that for small sample size, a Bayesian probit model produced parameter estimates with better prediction performance than the conventional ordered probit model. Most recently, Zhao and Khattak [18] modeled motor vehicle drivers’ injuries in train-motor vehicle.

The purpose of this study is to analyze severity of vehicle crashes on HRGCs and make comparison between ordered and unordered logistic regression models in predicting vehicle crash severities at HRGCs. MNL and ORL models are developed to model the impact of various factors which include vehicle driver characteristics, environmental factors, railroad crossing characteristics, highway characteristics, land use type, and more, using the same dataset. Since the coefficients cannot be compared directly, marginal effects/values are computed for both models. The three levels of responses considered are fatality, injury, and no injury. The SAS PROC LOGISTIC procedure is used to develop the models.

3 Modeling methodology

3.1 MNL model

The MNL model formulation is well discussed by Long [4]. If y is the response variable with J nominal (i.e., categorical) outcomes (which takes on one of a limited number of possible values), then the assumption of the multinomial logit model is that category 1 through J are not ordered (i.e., not arranged in an increasing or decreasing order). Also, let Pr(y = m|x) be the probability of observing outcome m given the independent variable x. The model for y is constructed as follows:

  • Assume that Pr(y = m|x) is a linear combination m . The vector β m  = (β 0m β km β km ) contains the intercept β 0m and coefficients of β km for the effects of x k on outcome m.

  • To ensure nonnegativity for the probabilities, the exponential of m is used.

  • For the probabilities to sum to 1, divide exp( m ) by \(\mathop \sum \limits_{j = 1}^{J} { \exp }({x}_{i}{\beta}_{j} ).\)

$$\Pr \left( {y_{i} = m |x_{i} } \right) = \frac{{{ \exp }({x}_{i}\varvec{\beta}_{\varvec{m}} )}}{{\mathop \sum \limits_{j = 1}^{J} { \exp }({x}_{i}{\beta}_{j} )}}.$$
(1)

Although the probability sum is 1, the set of parameters that generate the probabilities is not identified since more than one set of parameters can generate the same probabilities. In order to identify the set of parameters that generate the probabilities, a constant must be imposed. By imposing one of the parameter estimates to be equal to zero (assume β 1 = 0), the model can be written as follows:

$$\Pr \left( {y_{i} = 1 |x_{i } } \right) = \frac{1}{{1 + \mathop \sum \limits_{j = 2}^{J} { \exp }({x}_{i}{\beta}_{j} )}}.$$
(2)
$$\Pr \left( {y_{i} = m |x_{i} } \right) = \frac{{{ \exp }({x}_{i}\varvec{\beta}_{\varvec{m}} )}}{{1 + \mathop \sum \limits_{j = 2}^{J} { \exp }({x}_{i}{\beta}_{j} )}} {\text{for }}\,\,m > 1.$$
(3)

The parameter estimates are determined using maximum likelihood estimation. If the observations are independent, the likelihood Eq. (4) is given by

$$L\left( {\beta_{2} , \ldots .\beta_{J} |y,x} \right) = \mathop \prod \limits_{i = 1}^{N} P_{i},$$
(4)

where P i is the probability of observing whether values of y was actually observed for the ith observation. Combining the Eq. (1) with this Eq. (4) in place of P i the likelihood Eq. (5) can be written as

$$L\left( {\beta_{2} , \ldots \beta_{J} |y,x} \right) = \mathop \prod \limits_{m = 1}^{J} \mathop \prod \limits_{{y_{i = m} }} \frac{{{ \exp }({x}_{i} \varvec{\beta}_{\varvec{m}} )}}{{\mathop \sum \limits_{j = 1}^{J} { \exp }({x}_{i}{\beta}_{j} )}},$$
(5)

where \(\mathop \prod \limits_{{y_{i = m} }}\) is the product over all cases for which y i is equal to m. Taking logs, we may obtain the log-likelihood function which can be maximized with numerical methods to estimate the β’s.

The overall model fitness can be compared using the model’s log-likelihood at convergence with the log-likelihood of a naive model (model with all coefficients set to zero which is equivalent to assigning equal probability for all outcomes). It is also possible to compare a model with only alternative constants (assigning probability to outcomes equal to the observed share of the outcomes in the dataset).

$$\rho^{2} = 1 - \frac{{{\text{LL(}}\beta )}}{\text{LL(0)}},$$
(6)

where LL(β) represents the log-likelihood at model convergence, LL(0) represents the log-likelihood of a naïve model (without information). The ρ 2 goes from 0 (for no improvement in the log-likelihood) to 1 for a perfect fit. A value for ρ 2 larger than 0.1 indicates meaningful improvement [4].

The marginal effect or partial change can be determined by taking derivative of Eq. (1) with respect to x k as described in the following Eq. (7).

$$\frac{\partial {\text{Pr}}(y = m|x)}{{\partial x_{k} }} = \Pr \left( {y = m |x} \right)\left[ {\beta_{km} - \mathop \sum \limits_{j = 1}^{J} \beta_{kj} { \Pr }(y = j|x)} \right].$$
(7)

Marginal effect is the slope of the curve relating xk to Pr(y = m|x), holding other variables constant. Variables are held at their means, possibly with dummy variables at 0 or 1. Although the computation of the change in the probability is important to interpret the effects of the MNL model, there is limitation in that it measures the discrete change which does not indicate the changes among the dependent outcome due to infinitely small changes in independent variables [4].

Odds ratio can also be used in the interpretation of the developed model. The odds ratio is defined as the ratio of the odds of those with the risk factor to the odds for those without the risk factor. Generally, the odds ratio associated with a one-unit increase in the risk factor can be computed by the exponential function of the regression coefficient of that risk factor [19].

3.2 Ordered logit (ORL) model

When the absolute distance between categories of a variable is unknown, yet there is a clear ordering of the categories, the variable is considered ordinal. The ordered response logistic regression formulation is presented as discussed by Long [4]. An ordinal logistic regression model is derived from a measurement model in which a latent variable y* is mapped to an observed variable y. These variables are related according to the following equation:

$$y_{i} = m \;{\text{if}} \tau_{m - 1} \le y_{i}^{*} < \tau_{m }\; {\text{for}}\; m = 1 \,to\,\,J.$$
(8)

The τ’s are cutpoints on the measurement scale that are used to distinguish the ordinal categories. In the case of crash severity models, the ordinal response categories are fatality, injury, and no injury crashes. Category 1 (fatal) is defined by the open-ended interval on the lower end of the measurement scale; Category 3 (no injury), is defined as the portion of the scale above cut point \(\tau_{2}\) and Category 2 (injury) is the portion between the two cutpoints. Note that the crash severities are divided into 3 categories instead of 5 levels due to the data availability issues in this paper. In other words, if 5 levels are used, all the 3 injury categories (except fatal and no injury categories) will each involve only a very limited amount of data for modeling, which may cause some data underrepresentation issues. As such, to avoid such issues, the crash severities are classified into only 3 categories, which are used for modeling in this paper.

The observed y is related to y* according to the measurement model:

$$y_{i} = \left\{ {\begin{array}{lll} {1 = {\text{Fatal}} \quad {\text{if}} \;\tau_{o } = - \infty \le y_{i}^{*} < \tau_{1} } \\ {2 = {\text{Injury}} \quad {\text{if }}\;\tau_{1} \le y_{i}^{*} < \tau_{2} } \\ {3 = {\text{No injury}} \quad {\text{if}} \;\tau_{2} \le y_{i}^{*} < \tau_{3} = \infty } \\ \end{array} } \right..$$
(9)

The regression equation used for an ordinal response is \(y_{i}^{*}\)  = x i β + ϵ i , where x i is a row vector (with 1 in the first column for the intercept), β is a column vector of structural coefficients (with the first element being the intercept βo), and ϵ i is an error term.

Maximum likelihood (ML) estimation can be used to estimate the regression of y* on x. In order to use ML, assumption of a specific type of error (ϵ) distribution is required. For the ordered logit model, the error term has a logistic distribution with mean zero and a variance of π 2/3. The probability density function (p.d.f) of the logistic distribution is given as shown in Eq. (10).

$$\lambda \left( \varepsilon \right) = \frac{{{ \exp }(\varepsilon )}}{{[1 + \exp \left( \varepsilon \right)]^{2} }}.$$
(10)

The cumulative distribution function (c.d.f) of the logistic distribution is given as shown in Eq. (11):

$$\varLambda \left( \varepsilon \right) = \frac{{{ \exp }(\varepsilon )}}{{1 + { \exp }(\varepsilon )}}.$$
(11)

After specification of the error term, the probabilities of observing values of y given x can be computed. The probability of any observed outcome y = m given x is the difference between the c.d.f evaluated at these values:

$$\Pr \left( {y_{i} = m |X_{i} } \right) = F\left( {\tau_{m} - \varvec{x}_{\varvec{i}}\varvec{\beta}} \right) - F(\tau_{m - 1} - \varvec{x}_{\varvec{i}}\varvec{\beta}).$$
(12)

To estimate the model, let β be the vector with parameters from the structural model, with the intercept βo in the first row and let τ be the vector containing the threshold parameters. Either βo or τ 1 is constrained to 0 to identify the model. Program such as SAS’s LOGISTIC procedure assumes βo and estimates τ1. From Eq. (12), the following can be obtained:

$$\Pr \left( {y_{i} = m |X_{i} ,\varvec{\beta} ,\tau } \right) = F\left( {\tau_{m} - \varvec{x}_{\varvec{i}}\varvec{\beta}} \right) - F(\tau_{m - 1} - \varvec{x}_{\varvec{i}}\varvec{\beta}).$$
(13)

The probability of observing whatever value of y was actually observed for the ith observation is

$$P_{i} = \left\{ {\begin{array}{*{20}c} {\Pr \left( {y_{i} = 1 |x_{i} ,\varvec{\beta} ,\tau } \right)\; {\text{if}}\; y = 1} \\ . \\ . \\ . \\ {\Pr \left( {y_{i} = m |x_{i} ,\varvec{\beta} ,\tau } \right) \;{\text{if}}\; y = m} \\ . \\ . \\ . \\ {\Pr \left( {y_{i} = J |x_{i} ,\varvec{\beta} ,\tau } \right) \;{\text{if}}\; y = J} \\ \end{array} } \right..$$
(14)

If the observations are independent, the likelihood equation over the population of N observations is

$$L\left( {\varvec{\beta} ,\tau |y,X} \right) = \mathop \prod \limits_{i = 1}^{N} P_{i}.$$
(15)

Combining Eqs. (13), (14), and (15),

$$L\left( {\beta ,\tau |y,X} \right) = \mathop \prod \limits_{j = 1}^{J} \mathop \prod \limits_{{y_{i} = j}} { \Pr }(y_{i} = j |x_{i,} \varvec{\beta} , \tau )= \mathop \prod \limits_{j = 1}^{J} \mathop \prod \limits_{{y_{i} = j}} [{F}(\tau_{j} - \varvec{x}_{\varvec{i}}\varvec{\beta}) - F(\tau_{j - 1} - \varvec{x}_{\varvec{i}}\varvec{\beta})].$$
(16)

Here, \(\mathop \prod \limits_{{y_{i} = j}}\) indicates multiplying over all cases, where y is observed to equal j. Taking logs, the log-likelihood can be written as follows:

$${\text{ln}}L\left( {\beta ,\tau |y,X} \right) = \mathop \sum \limits_{j = 1}^{J} \mathop \sum \limits_{{y_{i} = j}} { \ln }[F(\tau_{j} - \varvec{x}_{\varvec{i}}\varvec{\beta}) - F(\tau_{j - 1} - \varvec{x}_{\varvec{i}}\varvec{\beta})].$$
(17)

Model estimation involves maximizing Eq. (17) using numerical methods to estimate the τ’s and the β’s.

A measure of the model goodness of fit (ρ 2) can be calculated as

$$\rho^{2} = 1 - \left[ {\frac{{{ \ln }L_{b} }}{{{ \ln }L_{o} }}} \right],$$
(18)

where \(\ln L_{b}\) is the log-likelihood at convergence and \(\ln L_{o}\) is the restricted log-likelihood. The ρ 2 measure is bound by zero and one. Values of ρ 2 closer to one indicate better fit of the model.

Interpretation of ordinal response variables can be performed according to odds ratios. In this paper, the proportional odds model is used to interpret odds ratios for cumulative probabilities.

The cumulative probability that the outcome is less than or equal to m is

$$\Pr \left( {y \le m |x} \right) = \mathop \sum \limits_{j = 1}^{m} \Pr \left( {y = j |x} \right) {\text{for}}\,m = 1, \ldots ,J - 1.$$
(19)

The odds that an outcome is m or less versus greater than m given a set of explanatory variables x are

$$\varOmega_{m} \left( x \right) = \frac{{{ \Pr }(y \le m|x)}}{1 - Pr(y \le m|x)} = \frac{{{ \Pr }(y \le m|x)}}{{{ \Pr }(y > m|x)}} = { \exp }(\tau_{m} - \varvec{x\beta }).$$
(20)

Taking the log results in the logit equation,

$${\rm{ln}}\varOmega_{m} \left( x \right) = \tau_{m} - x\beta.$$
(21)

The marginal effects of variables x on the underlying crash severity propensity can be evaluated by taking the partial derivative of Eq. (11) with respect to x k , resulting in

$$\frac{{\partial { \Pr }(y = m|x)}}{{\partial x_{k} }} = \frac{{\partial F(\tau_{m} - \varvec{x\beta })}}{{\partial x_{k} }} - \frac{{\partial {F}(\tau_{m - 1} - \varvec{x\beta })}}{{\partial x_{k} }},$$
(22)

or

$$\frac{{\partial { \Pr }(y = m|x)}}{{\partial x_{k} }} = \beta_{k} \left[ {f\left( {\tau_{m - 1} - \varvec{x\beta }} \right) - f\left( {\tau_{m} - \varvec{x\beta }} \right)} \right].$$
(23)

The marginal effect is the slope of the curve relating x k to Pr(y = m|x), holding all other variables constant and is usually computed at the mean values of all variables. For a dummy independent variable, the derivative while treating it as a continuous variable provides an approximation.

4 Data assembly and model description

Vehicle-rail crash data on the USDOT public crossing sites from 2005 to 2012 are used in this study. In order to acquire more explanatory variables, the USDOT highway-rail crossing inventory is also included. The crash data and the crossing inventory data are merged based on the USDOT identification number. The SAS PROC SQL is used to merge and clean the data. After the data merging and cleaning process, a total of 7,414 records are obtained and used in the modeling stage from the Federal Railroad Administration (FRA) database. The data used to create the dataset are obtained from the FRA [2].

Table 1 presents the descriptive statistics of some of the variables from such HRGC crash and inventory data. As shown, the distribution of vehicle-rail crash severity is 6.80 %, 26.63 %, and 66.58 % for fatal, injury, and no injury, respectively. This distribution of crash severity indicates around 33.43 % of vehicle crashes at HRGC sites lead to fatality or injury, in which the figures are much higher as compared to those of multi-vehicle crashes in highway traffic. The majority (78.64 %) of vehicle-rail crashes at HRGC sites occurred when the rail equipment struck the vehicle while the remaining (21.36 %) were when vehicle struck the rail equipment. It is shown in the table that a majority (53.09 %) of vehicles involved in the vehicle-rail crashes are cars. It is also shown that the majority (71.01 %) of vehicle crashes had occurred in clear weather conditions.

Table 1 Descriptive statistics of variables from HRGC crash and inventory data

The HRGC sites where crashes occurred are located in different development areas. As one can see from Table 1, 32.37 % of the crossings are located in open space areas, 21.51 % in residential areas, and 28.10 % in commercial areas. The rest are found in industrial and institutional development areas. The majority (74.99 %) of the HRGCs cross two-lane highways. Descriptive statistics of other variables are also shown in the table. As many variables as possible are considered in this study. Some of the continuous variables are converted into categorical variable and the multinomial logit model and ordered logit models are developed and compared to estimate the model parameters.

5 Results and discussion

Many variables obtained from the crossing inventory and crash data were used in developing the MNL and ORL models. During the final preferred model development process, some of the variables were found to be statistically insignificant and hence removed in a stepwise manner. PROC LOGISTIC procedure was applied with significance level being 0.1 to retain some of the variables.

Tables 2 through 5 present the MNL and ORL model results obtained from this study. In both models, three vehicle-rail crash severity levels (Fatal crashes, Injury crashes, and No Injury crashes) were considered as the dependent variable. In particular, in the MNL model, no injury crashes were considered the base case among the three crash severity levels. Therefore, coefficients estimated for the explanatory variables are values representing the relative effect of contributing factors on fatal or injury crashes compared to no injury crashes. Positive estimates in the models indicate that the chance of injury or fatal crash increases as the value of the independent variables increases. On the other hand, the interpretation of the coefficients in the ORL model is different and can be presented as follows: A positive coefficient in the model indicates that an increase in the value of a variable will increase the probability of the highest severity level (fatal) and decrease the probability of the lowest severity level (no injury). On the other hand, a negative coefficient indicates that a decrease in the variable will increase the probability of the highest severity level and decrease the probability of the lowest severity level. For the intermediate severity level (injury), an increase in the value of a variable may decrease or increase the probability of it occurring.

Table 2 MNL model results

As shown in Tables 2 and 3, some of the variables are not statistically significant. However, for the convenience of interpretation, those variables were still retained in the model if at least one of variables/factors in the same parameter category was significant in at least one of the models (injury and/or fatality), though this may actually induce reduction in efficiency of the model. Furthermore, a 90 % confidence level was considered instead of 95 %.

Table 3 ORL model results

It is important to note that the assessment and comparison of the two models cannot be performed simply based on the estimated coefficients of the models. Marginal effects of the variables on the probability of severity levels are computed for the two models in Tables 4 and 5 and used for comparison purpose. The positive sign in estimated marginal effect indicates that the probability of a given crash severity level increases when the variable changes and the converse is true for a negative sign. And the value of the number indicates the magnitude of shift in the probability.

Table 4 Marginal effects results for the MNL model
Table 5 Marginal effects results for the ORL model

The shifting direction of the probability in the two models was used for comparison of the impacts of each variable on the probability of injury severity outcomes as shown in Table 6. As the results indicate, most variables are consistent which include the variable crash circumstance for the case of intermediate severity level (injury). Empty cell indicates that the variable is not significant even at the 90 % confidence level.

Table 6 Comparison of Marginal Effects on Variables for ORL and MNL Models

Some of the variables in the ORL model, including pick-up vehicles, crash circumstances when rail equipment struck vehicle, Foggy weather, unconsolidated, and concrete & rubber surface type and traffic volume (AADT of 10,000–20,000) are found to be statistically insignificant while they all are statistically significant in the case of MNL model. On the other hand, rainy weather condition was found to be statistically significant in the MNL model. In addition, some of the other variables are found inconsistent at both fatality and no injury severity levels, all of which are highlighted with red color and “+” or “−” signs. However, a majority of the variables have shown similar effects on the probability of the three different severity levels. The Akaike information criterion (AIC) of the two models is 9,667 and 10,479 for the MNL and ORL model, respectively. This indicates that the MNL model is better than the ORL model in predicting vehicle crash severity at HRGCs in this paper.

6 Conclusion

Comparison between the MNL and ORL model in predicting the vehicle crash severity on HRGCs was conducted using the USDOT public crossing sites data. The three vehicle crash severity levels, fatality, injury, and no injury, were considered as dependent variable. Train characteristics, environmental characteristics, types of development areas, highway-rail crossing characteristics, highway traffic characteristics, vehicle driver characteristics, and vehicle characteristics were the explanatory variables used in predicting the vehicle crash severity levels. The analysis was conducted using SAS PROC LOGISTIC procedure.

As discussed in the result part of this paper, in the ORL model, some variables were found to be statistically significant while they were not in the MNL model and vice versa. A majority of the variables have shown similar effects on the probability of the three different severity levels. In addition, based on the AIC, it was found that the MNL model is better than the ORL model in predicting the vehicle crash severity levels on HRGCs in this study. Therefore, the researcher recommends the MNL model to be applied rather than the ORL model in predicting severity levels of vehicle-rail crashes on highway-rail at-grade crossings.