Influence of meteorological conditions on road accidents. A model for observations with excess zeros

Article citation info: Borucka A, Pyza D. Influence of meteorological conditions on road accidents. A model for observations with excess zeros. Eksploatacja i Niezawodnosc – Maintenance and Reliability 2021; 23 (3): 586–592, http://doi.org/10.17531/ein.2021.3.20.
a Military University of Technology, Faculty of Security, Logistics and Management, ul. gen. Sylwestra Kaliskiego 2, 00-908 Warsaw, Poland
b Warsaw University of Technology, Faculty of Transport, pl. Politechniki 1, 00-661 Warsaw, Poland

Road accidents are one of the basic road safety determinants. Most research covers large territorial areas. The results of such research do not take into account the differences between individual regions, which often leads to incorrect results and their interpretation. What makes it difficult to conduct analyses in a narrow territorial area is the small number of observations. The narrowing of the research area means that the number of accidents in time units is often very low. There are many zero observations in the data sets, which may affect the reliability of the research results. Such data are usually aggregated, which leads to information loss. The authors have therefore applied a model that addresses such problems. They proposed a method that does not require data aggregation and allows for the analysis of sets with an excess of zero observations. The presented model can be implemented in different territorial areas.

Introduction
Road accidents are one of the basic sources of data for road safety analysis [1,5,10]. However, it is very difficult to find their causes as there are numerous factors that affect them [25,26]. The basic road safety analyses carried out in most countries concern the general trends in the number of accidents and casualties in relation to data characterizing a given area. However, determination of the exact causes of accidents requires much more advanced methods. The studies presented in the literature are of varied nature [10,12]. Some of them are limited to evaluation of the impact of single variables such as driver drowsiness [22], driving speed before the incident [27], traffic jams [19,29], driver's gender [2], driving under the influence of alcohol or drugs [16], etc. In other publications, many variables are analyzed simultaneously. Singh [23], for example, evaluates the impact of inexperience and lack of skills characteristic of young drivers, while in the group of older drivers he emphasizes impairment of sight, cognitive functions and motor skills. Ashraf et al. [4] also take into account many different elements, considering, among other things, driver's gender, experience, time of incident, observance of traffic rules [20].
There are a lot of publications on road accidents. All of them analyze a limited number of factors, as it is not possible to take into account all variables that affect the number of such incidents. Moreover, not all of them are identifiable or measurable, and some data are difficult to obtain. These include, for example, detailed weather data, which in publicly available form concern only average measurement values for larger administrative areas and sometimes the whole country. Such aggregated values are useless, as meteorological conditions may vary dramatically among distant regions.
Another problem in the analysis of road accidents is the availability of information in this area. In many countries, no accurate records are compiled for areas smaller than the whole country [6,7], or the available information is not complete [9,28], so that only country-wide analyses are possible. Examples include the research conducted in Saudi Arabia [3], South Korea [4], India [21] or Poland [8,11,24].
The results of such research, however, do not take into account the differences between individual regions, which may occur even within a single country or region. These differences may result (e.g. when comparing small towns and large agglomerations) from different lifestyles, traffic volumes at different times of the day, different numbers of road users, the condition of road infrastructure, and even driver experience or driving culture. Analyses carried out within different areas make it possible to compare them and to find similarities in the factors conducive to accidents, as well as measures that improve road safety and that, once they bring the expected results in one region, can be implemented elsewhere. Systematic research, conducted in parallel in different locations, is therefore desirable. However, in addition to data availability, the nature of the data poses a significant obstacle in this respect. A significant narrowing of the research area results in a very small number of accidents per time unit and a large number of zero observations in the data sets, which may affect the reliability of research results. Such data are therefore often aggregated prior to analysis [6], which in turn may lead to a significant loss of information.
This paper contributes to the analysis of issues related to the continuous improvement of road safety, carried out through monitoring of hazard levels and ongoing evaluation of the factors that shape it. The authors adopted the research hypothesis that meteorological factors significantly influence the number of road accidents. Due to the high variability of weather conditions with geographical location, only the city of Warsaw was analyzed. As a result, in addition to the main research objective, i.e. to indicate the meteorological factors that significantly influence the number of accidents, there was an additional objective: to show that a data set with excess zeros can be analyzed mathematically without aggregating the measurements and incurring the related loss of information. Factors related to the time of the incident were also taken into account, i.e. time of day, day of the week and month.
The research was conducted using data (including meteorological data) on road accidents by hour that occurred in 2018 and 2019 in Warsaw. Data on accidents were obtained from the Polish Road Safety Observatory (operating at the Motor Transport Institute in Warsaw), while meteorological data were made available by the Warsaw-Okęcie Airport.
The article consists of an introduction, methodological and practical parts and a summary. The introduction presents the research objective and justifies the necessity to conduct it. The methodological part presents the applied analysis methods dedicated to the empirical data gathered. In the practical part, the research sample and the mathematical model of road accidents are characterized in detail. The whole article ends with the summary of the research carried out and the final conclusions.

Methodology
A count variable takes non-negative integer values. Linear regression is the most common way of studying the influence of independent factors on the explained variable [17], but using the classic model when the endogenous variable is a count variable can lead to serious inferential errors, especially when the expected value of the variable is small.
Poisson regression is a popular approach to modeling count data [18,29]. It assumes that the observations follow a Poisson distribution whose mean depends on the predictors. A problem arises if the empirical data deviate from the assumptions of this model. In many applications, for example, overdispersion occurs, and the assumption that the expected value and the variance of the distribution are equal is not fulfilled. Therefore, in place of Poisson regression, other models are adopted that take into account two types of zeros, i.e. "true zeros" and "excess zeros", by estimating two equations, one for the count model and one for the excess zeros. The most commonly used are the zero-inflated model and the hurdle model [1,13,15,30,31].
The article includes an estimation of the parameters of four models: the zero-inflated Poisson model (ZIP), the zero-inflated negative binomial model (ZINB), the Poisson logit hurdle model (PLH), and the negative binomial logit hurdle model (NBLH). The best model was selected using the Akaike information criterion (AIC). A method for simplifying the extended model was then presented, and the associated loss of information was shown to be negligible.
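The difference between the two model families can be illustrated with their probability mass functions. The following is a minimal sketch for the Poisson case only; the function names and the parameter values (`lam`, `pi`, `p_zero`) are illustrative assumptions and are not taken from the study.

```python
import math

def poisson_pmf(y, lam):
    """Probability of observing y events under a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** y / math.factorial(y)

def zip_pmf(y, lam, pi):
    """Zero-inflated Poisson: with probability pi the observation is an excess zero,
    otherwise it is drawn from Poisson(lam), which can itself produce a true zero."""
    base = (1.0 - pi) * poisson_pmf(y, lam)
    return pi + base if y == 0 else base

def hurdle_poisson_pmf(y, lam, p_zero):
    """Poisson hurdle: all zeros come from one binary process with probability p_zero;
    positive counts follow a zero-truncated Poisson(lam)."""
    if y == 0:
        return p_zero
    return (1.0 - p_zero) * poisson_pmf(y, lam) / (1.0 - poisson_pmf(0, lam))

# Illustrative (assumed) parameter values, not estimates from the paper.
lam, pi, p_zero = 0.5, 0.8, 0.9
for y in range(4):
    print(y, poisson_pmf(y, lam), zip_pmf(y, lam, pi), hurdle_poisson_pmf(y, lam, p_zero))
```

In the zero-inflated form a zero can arise from either component, whereas in the hurdle form the count component is truncated at zero and contributes only positive values. The negative binomial variants replace the Poisson component with a negative binomial one.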

Research sample
The presented research is based on data on road accidents that occurred in the years 2018-2019 in the Polish capital city, Warsaw, archived on an hourly basis. The research sample consisted of 17,520 observations. The narrowed area of research strongly influenced the number of events recorded in each hour. The maximum number of accidents per hour in the analyzed period was as low as 4, and the average value was 0.11. Other descriptive statistics are presented in Table 1.
The reason behind such results of descriptive statistics is that the vast majority of observations are zeros. There are as many as 15,723 of them in the whole set, which represents more than 89% of the measurements. The remaining numbers are presented in Table 2.
The distribution of the data gathered, sorted in ascending order, is shown in Fig. 1. The form of the dependent variable dictated the use of mathematical models dedicated to data with excess zeros. Since, according to the assumed research hypothesis, the research objective was to analyze the influence of meteorological factors on the number of accidents, additional information describing the prevailing weather conditions was collected for each hour. Detailed data concerning Warsaw were obtained from the Warsaw-Okęcie Airport, from Meteorological Aerodrome Reports (METAR). METAR is a coded weather report format used in aeronautical meteorology and weather forecasting. It contains information about ambient temperature, dew point temperature, pressure, wind speed and direction, precipitation, cloud cover, cloud base height and visibility. It may also contain other important annotations, concerning for example the condition of runways.
The set of factors used in the study contained information on visibility, wind speed, pressure, temperature, precipitation, type of clouds and mist. The original data set adopted for the study thus contained 7 variables that could occur in the fixed effect category and that were used for preliminary model construction; their descriptive statistics are presented in Table 3. Additionally, calendar variables derived from the date, i.e. month, day of the week and time of the incident, were included.
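To illustrate how such a coded report maps to the variables listed above, the sketch below extracts a few fields from a METAR string. It is a simplified example: the function name `parse_metar`, the output keys and the sample report are hypothetical, many METAR groups (e.g. CAVOK, gust details, runway state) are ignored, and this is not the preprocessing used by the authors.

```python
import re

def _signed(value):
    """METAR marks negative temperatures with a leading 'M' (e.g. M02 means -2)."""
    return -int(value[1:]) if value.startswith("M") else int(value)

def parse_metar(report):
    """Extract selected weather fields from a METAR string (simplified sketch)."""
    fields = {}
    wind = re.search(r"\b(?:\d{3}|VRB)(\d{2,3})(?:G\d{2,3})?KT\b", report)
    if wind:
        fields["wind_speed_kt"] = int(wind.group(1))
    visibility = re.search(r"(?<!\S)(\d{4})(?!\S)", report)
    if visibility:
        fields["visibility_m"] = int(visibility.group(1))
    temp = re.search(r"(?<!\S)(M?\d{2})/(M?\d{2})(?!\S)", report)
    if temp:
        fields["temperature_c"] = _signed(temp.group(1))
        fields["dew_point_c"] = _signed(temp.group(2))
    pressure = re.search(r"\bQ(\d{4})\b", report)
    if pressure:
        fields["pressure_hpa"] = int(pressure.group(1))
    # Cloud groups: cover code plus base height in hundreds of feet.
    layers = re.findall(r"\b(FEW|SCT|BKN|OVC)(\d{3})\b", report)
    fields["cloud_layers"] = [(cover, int(height) * 100) for cover, height in layers]
    if re.search(r"\bNSC\b", report):
        fields["cloud_layers"] = [("NSC", None)]
    fields["mist"] = bool(re.search(r"\bBR\b", report))  # BR is the METAR code for mist
    return fields

# Hypothetical report for illustration; EPWA is the ICAO code of the Warsaw airport.
sample = "EPWA 011200Z 27010KT 9999 BKN030 M02/M05 Q1015"
print(parse_metar(sample))
```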

Mathematical model of road accidents
The parameters of four models were estimated: the zero-inflated Poisson (ZIP) model, the zero-inflated negative binomial (ZINB) model, the Poisson logit hurdle (PLH) model and the negative binomial logit hurdle (NBLH) model. Since some of the variables had no significant impact on the number of accidents, the following variables were used for the final estimation of model parameters: clouds, precipitation, mist, temperature, month, day of the week and hour. For the models constructed in this way, the value of the AIC information criterion was calculated and on its basis the best of them was selected. This turned out to be the negative binomial hurdle model, for which the AIC value was the lowest (Table 4).
Thus, the number of accidents can be presented as a two-part model (see Appendix 1 for estimated parameter values). The first part is a logit model of the probability of a zero count, $y_i = 0$. The second part concerns positive values and is modeled as a variable with a negative binomial distribution, taking into account selected predictors. The resulting model can help determine which conditions are conducive to road accidents. The model is interpreted as two separate processes. The first is a process that generates zero counts of road accidents. The constructed model indicates that the probability of no incident is significantly influenced by the cloud and mist variables, which increase this probability. Among the individual categories, overcast turned out to be significant, which is probably due to the increased caution of drivers during such unfavorable weather conditions, as did cloudless sky and no mist, which in turn increase visibility and facilitate safe driving. The days of the week (Sunday and Tuesday) also proved to be significant, as they increase the probability of accidents. The second part of the model is a process that generates the number of road accidents, given the occurrence of at least one accident. The stimulants in this case are overcast (OVC) and temperature, as well as the following hours: 4:00 a.m. and from 6:00 a.m. to 10:00 p.m. The destimulants are: no precipitation, the months of July, August and November, and the following days of the week: Tuesday and Wednesday.
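For reference, the two-part structure described above can be written in the standard form of a negative binomial logit hurdle model. The notation ($\pi_i$, $\mu_i$, $\theta$, $z_i$, $x_i$, $\gamma$, $\beta$) is assumed here for illustration and is not taken from the article.

```latex
% Part 1: logit model for the probability of a zero count
P(Y_i = 0) = \pi_i, \qquad
\log\frac{\pi_i}{1 - \pi_i} = z_i^{\top}\gamma

% Part 2: zero-truncated negative binomial for positive counts
P(Y_i = y) = (1 - \pi_i)\,
  \frac{f_{NB}(y;\, \mu_i, \theta)}{1 - f_{NB}(0;\, \mu_i, \theta)},
  \qquad y = 1, 2, \ldots, \qquad
\log \mu_i = x_i^{\top}\beta

% Negative binomial probability mass function with mean \mu and dispersion \theta
f_{NB}(y;\, \mu, \theta) =
  \frac{\Gamma(y + \theta)}{\Gamma(\theta)\, y!}
  \left(\frac{\theta}{\theta + \mu}\right)^{\theta}
  \left(\frac{\mu}{\theta + \mu}\right)^{y}
```

Here $z_i$ and $x_i$ denote the predictor vectors of the zero part and the count part, which need not contain the same variables.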
Not all the factors for individual predictors in groups are statistically significant. Moreover, the model is extensive, due to a large number of independent variables. It was therefore analyzed whether it would be possible to combine variables in individual groups in order to simplify the model.

Analysis of qualitative variables
To simplify the model, an analysis was made of the possibility of combining the variables in each group. For this purpose, the Kruskal-Wallis test was used to see if there were differences between the variables in the group and then the Wilcoxon rank sum test was used to determine which variables in the group were significantly different [14]. Tests were conducted for each group of variables.
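The first step of this procedure can be sketched as follows: a plain-Python computation of the tie-corrected Kruskal-Wallis statistic for counts split by category. The function names and the sample counts are hypothetical, and the pairwise Wilcoxon rank sum step is not shown. The tie correction matters for data of this kind, because most observations share the value zero.

```python
from collections import defaultdict

def average_ranks(values):
    """Rank the values from 1..n, assigning tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def kruskal_wallis_h(groups):
    """groups: dict mapping a category label to a list of observed counts.
    Returns the Kruskal-Wallis statistic with the standard correction for ties."""
    pooled, labels = [], []
    for label, observations in groups.items():
        pooled.extend(observations)
        labels.extend([label] * len(observations))
    n = len(pooled)
    rank_sums = defaultdict(float)
    for label, rank in zip(labels, average_ranks(pooled)):
        rank_sums[label] += rank
    h = 12.0 / (n * (n + 1)) * sum(
        rank_sums[label] ** 2 / len(groups[label]) for label in groups
    ) - 3.0 * (n + 1)
    tie_sizes = defaultdict(int)
    for value in pooled:
        tie_sizes[value] += 1
    correction = 1.0 - sum(t ** 3 - t for t in tie_sizes.values()) / float(n ** 3 - n)
    return h / correction if correction > 0 else 0.0

# Hypothetical hourly accident counts for two categories (illustration only).
example = {"OVC": [0, 0, 1, 2, 0, 1], "FEW": [0, 0, 0, 0, 1, 0]}
print(kruskal_wallis_h(example))
```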
Analysis of individual categories of the cloud group using the Kruskal-Wallis test showed that there are significant differences between at least two categories. The Kruskal-Wallis test statistic is T = 22.433 (p-value = ...). This is confirmed by the interaction plot presented in Figure 2. If the influence of each category in the group were the same, the lines in the plot would be parallel.
In order to find the categories that are significantly different from each other, Wilcoxon rank sum test was used, the results of which are presented in Table 5.
Based on the Wilcoxon rank sum test results, three groups were distinguished. The first one includes cloudless sky and NSC, FEW and SCT clouds. Consistency within the group was again confirmed by the Kruskal-Wallis test (T = 7.059, p-value = 0.07). The second group contained only clouds of the BKN type, and the third only clouds of the OVC type. For the months, groups of similar months were likewise distinguished on the basis of the Wilcoxon rank sum test results (Table 6).
The following were distinguished: Group 1, which included April, May, June, September and October (consistency within the group was confirmed again by the Kruskal-Wallis test, T = 1.027, p-value = 0.906), and Group 2, which included the months of January, February, March, July, August, November and December (T = 11.564, p-value = 0.0724).
The analysis of individual days of the week also revealed the existence of significantly different groups (T = 61.524, p-value = $2.2 \times 10^{-11}$), which were created on the basis of the Wilcoxon test results (Table 7).
Two groups of days of the week were created: Group 1, which included Monday, Thursday, Friday, Saturday and Sunday (consistency within the group was confirmed by the Kruskal-Wallis test, T = 5.096, p-value = 0.278), and Group 2, which included Tuesday and Wednesday (T = 0.723, p-value = 0.395). The last variable studied was the time of the incident, for which the null hypothesis of equal distribution in groups was also rejected (T = 723.01, p-value < 2.

Fig. 2. Interaction plot of individual categories in the cloud group
For the time of the incident variable, consistency within groups was confirmed by the Kruskal-Wallis test only at the significance level α = 0.01; therefore, a chi-squared test, which is also used to compare distributions in groups, was performed as well. It confirmed consistency at the significance level α = 0.05 (Table 8).
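The p-values reported in this section for the months and the days of the week can be cross-checked from the test statistics. The sketch below assumes the usual chi-square approximation of the Kruskal-Wallis statistic with degrees of freedom equal to the number of categories minus one; the function name `chi2_sf` is hypothetical, and the finite-sum formulas hold for integer degrees of freedom only.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) of a chi-square variable with integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) times a finite sum of (x/2)^j / j!.
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= half / j
            total += term
        return math.exp(-half) * total
    # Odd df: two-sided normal tail plus a finite correction sum.
    root = math.sqrt(x)
    tail = math.erfc(root / math.sqrt(2.0))
    term, total = root, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    density = math.exp(-half) / math.sqrt(2.0 * math.pi)
    return tail + 2.0 * density * total

# (T, number of categories) as reported in the text for months and days of the week.
reported = {
    "months, group 1": (1.027, 5),
    "months, group 2": (11.564, 7),
    "days, all": (61.524, 7),
    "days, group 1": (5.096, 5),
    "days, group 2": (0.723, 2),
}
for name, (t, k) in reported.items():
    print(name, chi2_sf(t, k - 1))
```

Under this assumption the computed values agree with the reported p-values (0.906, 0.0724, $2.2 \times 10^{-11}$, 0.278 and 0.395) to the precision given in the text.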

Estimation of simplified model parameters
Grouping of the variables made it possible to construct a simplified negative binomial hurdle model. Estimates of the parameters of the first and second part of the model are presented in Table 9.
The constructed model is simpler and thus more transparent. The direction of influence of the individual variables is the same as in the extended model. The AIC value is 11,897, compared to AIC = 11,766 obtained for the model before grouping, which means a slight loss of quality in exchange for a significant simplification of the model. The fit of the proposed model to the empirical data is presented in Fig. 3.
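The size of this loss can be made explicit with a short calculation. The two AIC values below are those reported above; the helper `aic` states the general definition of the criterion (k parameters, log-likelihood ln L) and is included for reference only, since the parameter counts and log-likelihoods of the fitted models are not given in the text.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: AIC = 2k - 2 ln(L); lower is better."""
    return 2 * n_params - 2 * log_likelihood

aic_full = 11766        # reported for the model before grouping
aic_simplified = 11897  # reported for the simplified model

difference = aic_simplified - aic_full
relative_increase = difference / aic_full
print(difference, round(100 * relative_increase, 2))
```

The simplified model has an AIC that is higher by 131, i.e. by about 1.11% of the value for the extended model.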

Summary
In order to address the problems presented in the introduction, the article proposes a mathematical model for estimating the number of road accidents that includes a correction for the random effect (i.e. it is robust to excess zeros in the data set) and eliminates the problem of overdispersion by applying the negative binomial distribution. The application of such a model to traffic accidents is virtually absent from the literature. This is because traffic accident data are usually aggregated to a lower frequency, or such events are considered for large areas. While this provides a sufficient number of observations for analysis, it is associated with a significant loss of information, or even with obtaining a model that is inadequate for the individual component areas. Therefore, this article proposes a model that solves these problems while providing a reliable assessment of the factors affecting accidents for a narrow area and a high frequency of observations. In this study, meteorological factors were the main focus; however, other variables that were not used in this case, e.g. terrain characteristics, traffic conditions or vehicle type, can also be studied in this way.
The authors focused on meteorological factors because they are often considered a cause of accidents, yet there are few studies that support this. Some of the numerous variables analyzed turned out not to significantly influence the occurrence of road hazards. Temperature, precipitation, type of cloud cover and mist turned out to be significant. Moreover, the impact of variables related to the date of the event, i.e. calendar month, day of the week and time of the accident, was also significant.
The presented study shows that selected weather factors influence the number of accidents. This may be due to their impact on the condition of road users, which is an important topic for further work in this area. Furthermore, the results obtained prompt consideration of other factors not taken into account here, such as traffic volume, which may be correlated with weather conditions (cloudy, rainy days may be conducive to vehicle use) as well as with the date of the incident (peak traffic hours, varying traffic volumes depending on the day of the week or month). The above assumptions will be the subject of further research by the authors.