Prediction of Fatalities at Northern Indian Railways’ Road–Rail Level Crossings Using Machine Learning Algorithms

Abstract: Highway railway level crossings (HRLCs) pose a significant threat to the safety of all road users, including pedestrians attempting to cross them. With HRLC accidents rising globally, more studies proposing new solutions are needed. Research is required to understand driver behaviour, user perceptions, and potential conflicts at level crossings, as well as to implement preventative measures. The purpose of this study is to conduct an in-depth investigation of the HRLCs involved in accidents in the northern zone of the Indian railway system. The accident information maintained by the divisional and zonal offices of the northern railways of India is used for this study. The accident data revealed that 225 crossings experienced at least one incident between 2006 and 2021. In this study, logistic regression and multilayer perceptron (MLP) methods are used to develop an accident prediction model based on various factors from the incidents at HRLCs. The two models were compared, and the MLP gave better accident predictions than logistic regression. According to the sensitivity analysis, train speed has the highest relative importance and weekday traffic the lowest.


Introduction
India has the world's third largest rail network, trailing only the United States and China [1]. The Indian rail network has a route length of approximately 68,442 km, of which 64,891 km are broad gauge. It serves over 13,500 daily passenger trains (including 5125 suburban EMU trains) and over 9100 daily freight trains [2]. The Indian railway network recorded 685 accidents at level crossings between 2006 and 2021, of which 611 occurred at unmanned crossings and 74 at manned crossings, causing 2639 deaths and 4991 non-fatal injuries between 2006 and 2020 [3]. Road traffic accidents account for 43% of all accidents in India [4]. Accidents at unmanned crossings show a downward trend over the period 2006-2021, owing to new safety policies introduced by the Indian government. The purpose of this study is to conduct an in-depth investigation of the HRLCs involved in accidents in the northern zone of the Indian Railways system. The pertinent information was collected from the records retained by the divisional and zonal offices of the Northern Zone, covering a total of 225 road-rail intersections where at least one accident occurred between 2006 and 2021. Unlike some earlier work, this research investigates a wide variety of factors. Both the structural and functional aspects of crossings are evaluated. In addition to the vehicle and train details, the study includes many other pieces of relevant data, such as time, place, driver behaviour, the geometry of crossings, and intersection type. Results from this study will shed light on the primary factors that contribute to HRLC accidents among the roadway and traffic-related parameters.
The developed ANN was able to predict the 85th percentile speed with an average accuracy of approximately 96%. To analyse and predict traffic accidents in Sudan, Ali and Bakheit [24] used an artificial neural network (ANN) model.
The ANN models were compared with principal component regression models. The results show that the ANN models fit the data more closely (as measured by the coefficient of determination), but the predictions are otherwise very similar. Delen et al. [25] used police reports of 30,358 car accidents between 1995 and 2000 to create eight binary multilayer perceptron (MLP) neural network models, with different levels of injury (from no injury to death) as the dependent variable. Their models help to identify the most important factors that explain each dependent variable. Using an ANN, Jadaan et al. [26] developed a model for predicting accidents by examining the relationship between injuries and the factors that affect them. The model produced good results for Jordanian traffic. Alkheder et al. [27] trained an ANN model to predict the injury severity (minor, moderate, severe, and fatal) of road traffic accidents using data from 5973 incidents that occurred in Abu Dhabi between 2008 and 2013. Overall, their model achieved an average prediction accuracy of 74.6 percent on the testing dataset. Sameen and Pradhan [28] built a recurrent neural network (RNN) to predict the severity of injuries. The RNN model was compared with MLP and Bayesian logistic regression (BLR) models and was found to be more accurate than both. Borja et al. [29] proposed a method for building an accident risk prediction model. They developed models with ANNs and settled on a final ANN structure, using accident count data for the Swiss national roads (2009 to 2012). Their work indicates that ANNs are a workable approach for predicting the frequency of road accidents. In addition, the emergence of various datasets has led to the use of different prediction methods for various engineering problems [30][31][32][33][34].
Pattern recognition tools and the proper evaluation of their use for optimized prediction tasks have been topical subjects in recent years [35][36][37][38][39]. The above literature review shows that HRLC accidents occur due to various factors. Establishing a relationship between these factors requires a mathematical tool such as logistic regression or an ANN, both of which are widely used in accident prediction; however, very little research has been carried out on HRLC accident prediction. The literature review also showed that most studies were conducted in developed economies, where advanced and intelligent infrastructure is available. In addition, the education level in those settings is high, so people better understand the importance of safety and its effects. Hence, this study concentrates on a developing economy, where a lack of advanced infrastructure and lower education levels may lead to results that differ from those of previous studies.

Selection of Study Area and Data Collection
This study collected data from the Northern Railways, North Central Railways, North Western Railways, North Eastern Railways, and Northeast Frontier Railways in India. The total length of railways in these zones is 23,319 km [40], spanning parts of several Indian states. Data were collected from the databases of the zonal and divisional head offices via the Right to Information (RTI) Act and through direct visits to the offices. The accident data cover the period from 2006 to 2021 and contain the following: place of accident, time, date, type of train involved, type of vehicle involved, number of fatalities and injuries, type of injuries, and whether the level crossing was manned or unmanned. A sample of the dataset is shown in Table 1.

Primary Investigation of the Accident Data of Northern Railways
This study primarily examines the characteristics of RRLCs that experienced accidents in the northern zone of Indian Railways between 2006 and 2021. Out of 355 unmanned level crossings in the northern zone, a total of 225 crossings were found where at least one accident occurred between 2006 and 2021. The data illustrate that the number of RRLC accidents increased from 2006 to 2014, then decreased over the next seven years. Fatalities were highest in 2011, whereas 2019 and 2020 had no accidents at level crossings. The drastic decrease in accidents is attributed to major road safety enhancement policies and planning carried out by the Government of India. By 2025, the Indian government intends to remove 2500 unmanned level crossings from national highways [41]. The majority of level crossings are regularly maintained. The primary objective of the Indian government is to improve the existing railway infrastructure through the routine monitoring of level crossings, road signs and signals, and surface types. Another cause of the reduction in accidents in 2020-21 was the lockdown imposed due to the spread of COVID-19. Indian Railways also plans to remove all unmanned level crossings from major national highways by the year 2022, which is a further reason for the reduction in accidents [40]. Railroad level crossing (RRLC) casualties in the northern zone depend on the type of crossing, the presence of light, the surface of the intersection area, the type of warning system deployed, traffic characteristics, driver characteristics, and environmental factors. A total of 87 RRLCs, those where the lighting is inadequate, are among the most threatened crossings. Between 2006 and 2021, non-gated RRLCs accounted for 86.7% of accidents. Most of the crossings have crossbucks or stop signs that are not properly maintained, and some road signs are broken or faded in colour.
According to the data, 20% of RRLCs have inadequate protection, which is one of the causes of accidents. According to Figure 1, the majority of accidents (88%) occur during the day because trains and road traffic interact more during the day than at night. The predominantly daytime work culture in countries such as India also contributes to daytime accidents. As shown in Figure 2, most accidents occur at unmanned rather than manned level crossings. These accidents occurred because unmanned level crossings are not protected by gates. In the study area, several unmanned level crossings lacked proper signs, had broken stop signs, or had road signs that had faded in colour.
Passenger trains are involved in more accidents than goods trains, as shown in Figure 3. This is because goods trains move at a slower speed than passenger trains. In the northern zone of railways, passenger trains also outnumber goods trains, so road vehicles interact more often with passenger trains (which is a major reason for passenger train accidents). Accidents at level crossings are also influenced by the geometry of the crossings. More accidents occur at skewed-geometry level crossings than at linear ones, because the motorist has less available sight distance.

[Figure: Manned and unmanned level crossing accidents (legend: manned level crossing, unmanned level crossing)]
In the study area, the trains run from major cities where traffic volume is very high; as such, more train-traffic interaction takes place. Due to this, more accidents occur in urban areas compared to rural areas, as shown in Figure 4. The dry season has fewer accidents than the wet season. The dry season is summer, while the wet season is autumn and winter.
In the wet season, visibility is disturbed by heavy rain and fog. For the study area, a similar result shows up in Figure 5. Peak hours experience more accidents compared to non-peak hours due to the increased interaction of vehicles and trains. This is also valid for the study area, as shown in Figure 6.

[Figure: Peak and non-peak hour accidents (legend: peak hour, non-peak hour; shares shown: 63% and 37%)]

Descriptive Statistics of the Variable
The maximum, minimum, mean, standard deviation, and variance of all variables were calculated and are tabulated in Table 2. The train speed varies from 22 km/h to 120 km/h, with an average of 64.9 km/h. The variation in speed is also shown in Figure 7.
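As a minimal sketch of how the summary measures listed above could be computed, the following Python snippet uses only the standard library. The speed values are hypothetical and are not taken from the study data; the sample (n - 1) forms of the standard deviation and variance are assumed here.

```python
import statistics

def describe(values):
    """Return max, min, mean, sample standard deviation and sample variance."""
    return {
        "max": max(values),
        "min": min(values),
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),          # sample form (n - 1), an assumption
        "variance": statistics.variance(values),  # sample form (n - 1), an assumption
    }

# Hypothetical train speeds in km/h (illustrative only, not the study data).
speeds_kmh = [22, 45, 60, 75, 90, 120]
summary = describe(speeds_kmh)
print(summary)
```

The same function could be applied to each numeric variable in turn to build a table of descriptive statistics.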


Models
Two methods were used to analyse the data in this study, and their results were compared:
I. Logistic regression;
II. Artificial neural network.

Logistic Regression
Logistic regression was used to fit the accident data. Logistic regression techniques model probabilistic systems in order to predict future events. These direct probability models do not require the distributions of the explanatory variables or predictors [41]. If P is the probability that a binary response variable Y = 1 when the input variables X = x, then the logistic response function is modelled as
$$P = P(Y = 1 \mid X = x) = \frac{e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \cdots + \beta_n X_n}}{1 + e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \cdots + \beta_n X_n}} \tag{1}$$
This function is non-linear and represents an S-shaped curve. Here, $\beta_0$ is the intercept and $\beta_1, \ldots, \beta_n$ are the coefficients of the predictors (input variables) $X_1, \ldots, X_n$ used in the regression equation.
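The logistic response function in Equation (1) can be illustrated with a short Python sketch. The intercept, coefficients, and predictor values below are hypothetical and are not estimates from this study; the function name is likewise chosen only for illustration.

```python
import math

def logistic_probability(intercept, coefficients, x):
    """P(Y = 1 | X = x) for a logistic model: e^z / (1 + e^z)."""
    z = intercept + sum(b * xi for b, xi in zip(coefficients, x))
    # Numerically stable evaluation of e^z / (1 + e^z).
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Hypothetical values (not from the study): intercept, two coefficients,
# and two predictors, e.g. train speed in km/h and an unmanned-crossing flag.
beta_0 = -2.0
betas = [0.03, 1.1]
x = [80.0, 1.0]
p_fatal = logistic_probability(beta_0, betas, x)
print(round(p_fatal, 4))
```

With a 0.5 cut-off, a probability above 0.5 would be classified as y = 1; the cut-off is itself an assumption of this sketch.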

Artificial Neural Networks
Neural network machine-learning models have been used extensively in predictive applications. Warren McCulloch, a neurophysiologist, and Walter Pitts, a logician, based the first artificial neuron on a biological neuron in 1943 [42]. In artificial neural networks, feedforward networks and feedback networks are the two main architecture types. Feedforward, or multi-layer, networks have been used quite often when constructing neural models. Such models may include one or more hidden layers and one output layer. The general mathematical expression of an ANN model is represented in Equation (2).
$$A_N = \phi\left( b_o + \sum_{k=1}^{m} \left[ w_k \, \phi\left( b_k + \sum_{i=1}^{n} w_{ik} x_i \right) \right] \right) \tag{2}$$
where $A_N$ = normalized output of the model; $\phi$ = activation function; $b_o$ = bias at the output layer neuron; $w_k$ = weight between the output layer neuron and the kth neuron of the hidden layer; $b_k$ = bias associated with the kth neuron of the hidden layer; $w_{ik}$ = weight between the ith neuron of the input layer and the kth neuron of the hidden layer; $x_i$ = normalized ith variable (neuron) of the input layer; n = number of input variables; and m = number of neurons in the hidden layer.
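The feedforward computation described by the symbols defined above can be sketched in Python for a network with a single hidden layer. This is an illustrative sketch only: the weights, biases, and inputs are hypothetical, and the choice of a hyperbolic tangent in the hidden layer with a sigmoid at the output is an assumption made for the example.

```python
import math

def sigmoid(z):
    """Numerically stable logistic sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def mlp_forward(x, w_in, b_hidden, w_out, b_out):
    """Forward pass of a one-hidden-layer network.

    x        : n normalized inputs x_i
    w_in     : w_in[k][i] is the weight from input i to hidden neuron k
    b_hidden : bias b_k of each of the m hidden neurons
    w_out    : weight w_k from hidden neuron k to the output neuron
    b_out    : bias b_o of the output neuron
    """
    hidden = [
        math.tanh(b_k + sum(w * xi for w, xi in zip(w_k, x)))
        for w_k, b_k in zip(w_in, b_hidden)
    ]
    return sigmoid(b_out + sum(w * h for w, h in zip(w_out, hidden)))

# Hypothetical network with n = 3 inputs and m = 2 hidden neurons.
x = [0.5, -0.2, 1.0]
w_in = [[0.4, -0.6, 0.1], [-0.3, 0.8, 0.5]]
b_hidden = [0.1, -0.2]
w_out = [1.2, -0.7]
b_out = 0.05
a_n = mlp_forward(x, w_in, b_hidden, w_out, b_out)
print(round(a_n, 4))
```

Because the output passes through a sigmoid, the result lies between 0 and 1 and can be read as a normalized output.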

Preparation of Model Data
In this paper, in order to establish a predictive model for railroad level crossing accidents, the accident outcome (fatal or non-fatal) was selected as the dependent variable. A fatal accident is coded as y = 1 and a non-fatal accident as y = 0. The other variables are shown in Table 1 with their coded values. All variables were coded as shown in Table 3.

Result of Logistic Regression Model
The binary logistic regression model was built and analysed in the IBM SPSS Statistics 22 environment. The findings of the statistical analysis are summarized in Table 4, including the following information for each predictor: (1) estimate; (2) standard error; (3) Wald; (4) degrees of freedom; (5) p-value; and (6) Exp (B).
According to the findings of the statistical analysis, the majority of the considered predictor variables were statistically significant, with p-values of less than 0.05 (see Table 4). Furthermore, the p-value for the entire model was less than 0.001, which shows that the model was statistically significant. However, some of the predictor variables considered had relatively high p-values. This is because certain aspects were shared by all the RRLCs that had at least one accident during the 15-year study period. The road surface at crossings was not significant because the crossing covers a very short distance and thus has little impact. The majority of the manned RRLCs in the northern zone of the Indian railways have active warning devices installed. The p-value for gauge of track was 0.210, which is not statistically significant because the majority of RRLCs in the northern zone have the same track gauge. The area under the curve (AUC) for the logistic regression model is 0.94, which is more than 0.90; hence, the model can distinguish between fatal and non-fatal accidents very well. The accuracy of the model is 0.97, which is close to 1.0; this implies that the model can correctly predict fatalities 97 times out of 100 under the given level crossing conditions. This is shown in Table 5.
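For a predictor with one degree of freedom, the Wald statistic, p-value, and Exp (B) columns of such a table are related to the estimate and its standard error in a simple way. The sketch below illustrates that relationship under this one-degree-of-freedom assumption; the estimate and standard error are hypothetical and are not values from Table 4.

```python
import math

def wald_summary(estimate, std_error):
    """Wald chi-square, p-value (1 df) and odds ratio for one coefficient."""
    wald = (estimate / std_error) ** 2
    # Upper tail of a chi-square distribution with 1 degree of freedom:
    # P(chi2 > w) = erfc(sqrt(w / 2)).
    p_value = math.erfc(math.sqrt(wald / 2.0))
    return {
        "wald": wald,
        "df": 1,
        "p_value": p_value,
        "exp_b": math.exp(estimate),  # odds ratio for a one-unit increase
    }

# Hypothetical coefficient and standard error (illustrative only).
row = wald_summary(estimate=0.8, std_error=0.25)
print(row)
```

An Exp (B) above 1 indicates that the odds of the outcome coded 1 increase with the predictor, and a value below 1 indicates that they decrease.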

Logistic Regression Model Validation
Four goodness-of-fit statistics were used to assess the fitness of the proposed model, giving satisfactory results for all tests. First, the -2 log-likelihood (or -2LL) test was performed (this is also referred to as the model deviance). The lowest possible value of -2LL is zero, which signifies a perfect predictive performance (increasing values relative to zero indicate a worse model fit [43]). This indicator is typically not very insightful regarding the characteristics of a poorly fitted model. For the proposed model, the value of -2LL was 0.089, which is near zero and indicates that the model is fit for prediction. The second test used was Cox and Snell's R-square, which is based on the log likelihood of the model compared to the log likelihood of a baseline model. With categorical outcomes, however, it has a theoretical maximum value of less than 1, even for a "perfect" model. Larger values indicate a better fit, and decreasing values signify a worse model fit [43]. For the proposed model, this value was 0.943, which signifies a good fit. The third statistical test performed was the Nagelkerke R-square, which rescales the Cox and Snell statistic so that its largest value is 1; this value indicates a perfect fit, and decreasing values relative to 1 signify a worse model fit [44]. For the proposed model, this value was 0.973, which is near one and signifies a good fit. The fourth test used was McFadden's pseudo-R-square, which lies between 0.2 and 0.4 for a good-fit model. For the proposed model, this value was 0.31, which shows that the model is a good fit. The Hosmer and Lemeshow [45] test provides an additional global fit test, comparing the estimated model to one with a perfect fit. If this test is not significant, it indicates that the model is well specified. If it is significant, then there is evidence that the model is misspecified or does not fit the data.
Here, the Hosmer and Lemeshow test was not statistically significant [χ²(8) = 0.286, p = 1.000], suggesting that the model fits adequately. From the above details, it was found that the proposed model satisfied all statistical fitness tests.
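To make the fit statistics discussed above concrete, the following Python sketch computes -2LL and the Cox and Snell, Nagelkerke, and McFadden pseudo-R-square values from the log-likelihoods of a fitted model and an intercept-only baseline. The outcomes and predicted probabilities are hypothetical, and the standard textbook definitions of these statistics are assumed.

```python
import math

def log_likelihood(y, p):
    """Bernoulli log-likelihood of outcomes y (0/1) under probabilities p."""
    return sum(math.log(pi) if yi == 1 else math.log(1.0 - pi)
               for yi, pi in zip(y, p))

def pseudo_r_squares(ll_model, ll_null, n):
    """-2LL and three pseudo-R-square statistics for n observations."""
    cox_snell = 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)
    max_cox_snell = 1.0 - math.exp(2.0 * ll_null / n)  # ceiling below 1
    return {
        "minus_2ll": -2.0 * ll_model,
        "cox_snell": cox_snell,
        "nagelkerke": cox_snell / max_cox_snell,
        "mcfadden": 1.0 - ll_model / ll_null,
    }

# Hypothetical outcomes and fitted probabilities (illustrative only).
y = [1, 1, 1, 0, 0, 0, 1, 0]
p_model = [0.9, 0.8, 0.7, 0.2, 0.1, 0.3, 0.6, 0.4]
base_rate = sum(y) / len(y)                       # intercept-only baseline
ll_model = log_likelihood(y, p_model)
ll_null = log_likelihood(y, [base_rate] * len(y))
fit = pseudo_r_squares(ll_model, ll_null, len(y))
print(fit)
```

In this sketch the Nagelkerke value is the Cox and Snell value divided by its ceiling, which is why it can reach 1 while the Cox and Snell value cannot.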

Results of the ANN Model
The ANN model used in this study was developed using 15 independent variables, as per Table 2, with the accident outcome (fatal or non-fatal) as the dependent variable. The model was optimized with the gradient descent algorithm. The activation functions for the hidden and output layers were hyperbolic tangent and sigmoid, respectively, and these gave the maximum accuracy of the model, as shown in Table 4. The accuracy of training and testing was 100% for fatal accident prediction. Non-fatal accident accuracy for training and testing was 96.9% and 94.7%, respectively. There are seven hidden layers used in this model. The confusion matrix for the ANN model is shown in Table 6. The area under the curve (AUC) for the MLP model is 0.986, which is more than 0.90; hence, the model can distinguish between fatal and non-fatal accidents very well [46]. This is calculated using the ROC curve, which is drawn with 1 - specificity (the false positive rate) on the X axis and sensitivity on the Y axis.
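The quantities reported above (confusion matrix, accuracy, sensitivity, specificity, and AUC) can be illustrated with a small Python sketch. The labels and scores are hypothetical and are not the study's predictions; the AUC is computed here with the pairwise ranking formulation, which is one standard way of obtaining the area under the ROC curve.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels coded 1 = fatal, 0 = non-fatal."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def classification_rates(tp, tn, fp, fn):
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
    }

def auc_from_scores(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if sp > sn else 0.5 if sp == sn else 0.0
               for sp in pos for sn in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and model scores (illustrative only).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.95, 0.80, 0.40, 0.30, 0.10, 0.55, 0.20, 0.70]
y_pred = [1 if s >= 0.5 else 0 for s in scores]   # 0.5 cut-off is an assumption

counts = confusion_counts(y_true, y_pred)
rates = classification_rates(*counts)
auc = auc_from_scores(y_true, scores)
print(counts, rates, auc)
```

Accuracy depends on the chosen cut-off, whereas the AUC summarizes performance over all cut-offs.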

Sensitivity Analysis for the ANN Model
A sensitivity analysis was carried out to determine which of the many possible factors had the greatest impact. The connection weights of the neural network model were partitioned using the formulas proposed by Garson [47]. Shahin et al. [48] later applied this approach in the field of civil engineering. This analysis made it possible to determine the relative importance (RI) of each independent variable. In Table 7, the variables are ranked in decreasing order of their relative importance values. From the ranks assigned to the variables in Table 7, it was observed that the variable 'speed of the train' had the highest RI of 32.1%. The 'level crossing type' variable was the second most important variable. Similarly, 'weekday and weekend traffic' was found to have the lowest impact on the model output.
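The weight-partitioning idea behind this kind of sensitivity analysis can be sketched as follows. This is one commonly cited formulation of Garson's approach for a network with a single hidden layer and a single output, not necessarily the exact procedure used in this study, and the weights shown are hypothetical.

```python
def garson_relative_importance(w_in, w_out):
    """Relative importance (%) of each input from connection weights.

    w_in  : w_in[k][i] is the weight from input i to hidden neuron k
    w_out : w_out[k] is the weight from hidden neuron k to the output
    Assumes every hidden neuron has at least one non-zero input weight.
    """
    n_inputs = len(w_in[0])
    shares = [0.0] * n_inputs
    for w_k, v_k in zip(w_in, w_out):
        total = sum(abs(w) for w in w_k)
        for i, w in enumerate(w_k):
            # Share of hidden neuron k attributed to input i,
            # scaled by that neuron's absolute output weight.
            shares[i] += abs(w) / total * abs(v_k)
    grand_total = sum(shares)
    return [100.0 * s / grand_total for s in shares]

# Hypothetical weights: 3 inputs, 2 hidden neurons (illustrative only).
w_in = [[0.9, 0.1, 0.3], [-1.2, 0.4, 0.2]]
w_out = [0.7, -0.5]
ri = garson_relative_importance(w_in, w_out)
ranking = sorted(range(len(ri)), key=lambda i: ri[i], reverse=True)
print([round(r, 1) for r in ri], ranking)
```

Because absolute weights are used, the result ranks the strength of each input's influence but does not show its direction.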

Discussion
In urban India, especially in large cities, concerns have been raised about the safety of vehicle users at level crossings and other such intersections. While statistical analysis and modelling have been widely used to assess pedestrian safety at traffic junctions [49], there is a need for a more holistic approach. This research has presented a new approach based on logistic regression and artificial neural network techniques for determining the connection between level crossing characteristics and fatal and non-fatal accidents. In this investigation, efforts were made to construct ANN-based models for determining the frequency of fatal pedestrian collisions. Nevertheless, several researchers have elucidated the importance of statistical models for predicting accident frequencies [50,51]. Chakraborty and Mitra [52] created a negative binomial model to predict the fatal pedestrian accident frequency in Kolkata; they demonstrated that the statistical model's prediction performance was nearly 50%. However, the accuracy of an ANN model is very high when compared to statistical models [53][54][55][56]. Alkheder et al. [27] used an ANN model to predict the degree of injury (minor, moderate, severe, and death) of road traffic accidents. Their model had an average overall prediction performance of 74.60%. In this investigation, the accuracy level for logistic regression was 96%, while the accuracy level for the ANN model that uses hyperbolic tangent as its activation function for the hidden layer and SoftMax for the output layer was 98%. Both models have better accuracy than the statistical models used by various researchers. In addition to analysing data on the frequency of road-railway events, it is also important to analyse the data on the triggers that contribute to fatalities and the extent of their impact. Most of the previous studies in which ML was used to examine the factors affecting accidents were smaller in scope than the present study.
Some important factors, such as average daily traffic, age of the driver, sight distance, and frequency of trains, may be included in future studies. These data were not available for this study, but they are very important for accident prediction.

Conclusions
In this study, the accident data from the past 15 years on railroad level crossings in the northern zone of Indian railways were analysed. The data showed that level crossing accidents have decreased due to initiatives by Indian railways. Unmanned level crossings experience more accidents than manned level crossings. Most of the unmanned level crossings have road markings and signs that are either faded or not properly installed. The speed of the trains, day and night driving, weather, rural and urban areas, the number of railway tracks, and the surface type of the pavement at highway-rail grade crossings were found to be the factors significantly affecting the severity of driver injuries at both manned and unmanned level crossings. Some of the factors, such as the availability of signboards, road markings, and average annual daily traffic, were not significant for the prediction of accidents, as per the proposed model. The multilayer perceptron ANN has an accuracy of 98%, while logistic regression has an accuracy of 97%. As per the sensitivity analysis, the speed of the train had the greatest relative importance (32.1%) for accidents, while the gauge of track had the least relative importance (1.2%). This paper includes only 15 independent variables, but further variables may be included in future work, such as driver and pedestrian behaviours, sight distance, delays at the level crossing, automatic and manual gate operations, and the width of the road near crossings. In Indian conditions, fatal and non-fatal accidents at RRLCs can be reduced by increasing driver and pedestrian awareness and by improving safety standards. The government must impose severe penalties on drivers who violate traffic laws at intersections such as RRLCs. Intelligent signal management and monitoring systems must be adopted to reduce accidents at level crossings effectively.
Recommendations are made for addressing traffic engineering, road, and construction concerns to enhance the safety of road-railway infrastructure. In addition, creating and enforcing more stringent laws, particularly regarding identified causes of fatal accidents, and increasing penalties are recommended. State and federal governments should set aside money for the development of national and local databases that compile data on road-railway collisions, including the frequency of rail links, the average age of drivers, their levels of education and income, the types of property damage sustained in accidents, and more.