Analysis of historical road accident data supporting autonomous vehicle control strategies

It is expected that most accidents caused by human mistakes will be eliminated by autonomous vehicles. Their control is based on real-time data obtained from various sensors, processed by sophisticated algorithms, and on the operation of actuators. However, it is worth noting that this process flow cannot handle unexpected accident situations such as a child running out in front of the vehicle or an unexpectedly slippery road surface. A comprehensive analysis of historical accident data can help to forecast these situations. For example, it is possible to localize areas of the public road network where the number of accidents related to careless pedestrians or bad road surface conditions is significantly higher than expected. This information can help the control of the autonomous vehicle to prepare for dangerous situations long before the real-time sensors provide any related information. This manuscript presents a data-mining method working on already existing road accident database records to find the black spots of the road network. As a next step, a further statistical approach is used to find the significant risk factors of these zones; the result can be built into the control strategy of self-driven cars to prepare them for these situations and to decrease the probability of further incidents. The evaluation part of this paper shows that the robustness of the proposed method is similar to that of already existing black spot searching algorithms, while it provides additional information about the main accident patterns.


Human drivers have many disadvantages compared to autonomous vehicles (slower reaction time, inattentiveness, variable physical condition) (Kertesz and Felde, 2020). Nevertheless, they can often perform better (Chatterjee et al., 2002) in some unexpected situations, such as a child running out in front of the vehicle, because beyond the information gained in real time, they may have specific knowledge about a given location (to continue the previous example, the human driver may know that there is a playground without a fence near the road; therefore, the appearance of a child is not unexpected). Drivers also have some incomplete but useful historical knowledge about accidents, and they can build this information into their driving behavior. If they know that there were several pedestrian collisions somewhere, they will decrease their speed and try to be more attentive without any triggering real-time signals. Thanks to this behavior, they can prepare for and avoid some types of accidents, which would not be possible without this historical data. Another example might be a road section that is usually extremely slippery on rainy days. Real-time sensors can detect slipping only when it is too late to avoid the consequences. Historical accident data can help to prepare the car for these unexpected situations.

We propose the following consecutive steps to integrate historical data into the control algorithm for autonomous devices:

[...] and equipment of the self-driven car. For example, in the case of dangerous areas, is it possible to increase the power of the lights to make the car more visible? Or in the case of large changes in pedestrian accidents, is it possible to increase the volume of the artificial engine sound to avoid careless road crossing? Can the car change the suspension settings to prepare for potentially dangerous road sections?
The scope of this paper is the development of the theoretical background to support these preliminary protection activities. The appropriate preliminary actions may significantly decrease the number and severity of [...] road sections is the first step to prevent further accidents or to decrease the seriousness of these. It is a heavily researched area, and there are several theoretical methods for this purpose.

Although it has a long tradition in traffic engineering, interestingly, there is no generally accepted definition of road accident black spots (also known as hot spots); the official definition varies by country. It follows that the method used to find these hazardous locations also varies by country. For example, by the definition of the Hungarian government, outside built-up areas, black spots are defined as road sections no longer than 100 meters where the number of accidents during the last three years is at least 3. Accordingly, road safety engineers use simple threshold-based methods (for example, the traditional sliding window technique) to find these areas. Switzerland uses a significantly different definition, as black spots are sections of the road network (or intersections) where the number of accidents is "well above" the number of accidents at comparable sites. The key difference is the term "comparable sites" because these advanced comparative methods do not try to classify each road segment by itself but compare it to similar areas.

There are some general attributes of accident black spots to overcome the conceptual confusion. These [...]. Much data about the road network is also available (road layout, speed limits, tables, etc.). As a result, road safety engineers can use several procedures from various fields (statistics, data mining, pattern recognition) to localize accident black spots in these databases.

It is a common assumption that the number of accidents is significantly higher at these locations compared to other sections of the road network. However, this alone is neither a necessary nor a sufficient condition. The variation of the average yearly accident count of road sections is relatively high compared to the number of accidents. Because of this, the regression to the mean effect can distort the historical data.

A given section with more accidents than average is not necessarily an accident black spot. The converse is also true, as there may be true black spots with relatively few accidents for a given year. Although this deficiency has been proven theoretically, most black spot identification methods are based on the accident numbers of the last few years, simply because this is the best place to start a detailed analysis.

Nevertheless, it is always worth keeping in mind that these locations are just black spot candidates, and further examination is needed to make the right decision concerning them. The best way to do this is via a detailed scene investigation, but this is very expensive and time-consuming. Another theoretical approach can be the analysis of accident data to find some irregular patterns and identify one or more risk factors causing these accidents. Without these, it is possible that the higher frequency of accidents is purely coincidental at a given location and time.

Computer Science
To localize potential accident black spots, the most traditional procedure is the sliding window method (Lee and Lee, 2013; Elvik, 2008; Geurts et al., 2006). The input parameters of the process are the section length and a threshold value. The method is based on the following:

1. divide the selected road into small uniform sized sections;
2. count the number of accidents that have occurred in the last few years for each section;
3. flag the segments where this number is higher than a given threshold as potential black spots.
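The three steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the cited works: the function name, the representation of accidents as positions in meters along a single road, and the sample values are all assumptions made for the example.

```python
from collections import Counter


def sliding_window_candidates(positions_m, section_length_m, threshold):
    """Flag fixed-length road sections whose accident count exceeds a threshold.

    positions_m: accident positions along one road, in meters from its start
                 (hypothetical input format).
    Returns a list of (section_start_m, section_end_m, accident_count).
    """
    # Steps 1-2: assign every accident to a uniform section and count per section.
    counts = Counter(int(p // section_length_m) for p in positions_m)
    # Step 3: keep only the sections whose count is higher than the threshold.
    return [
        (idx * section_length_m, (idx + 1) * section_length_m, n)
        for idx, n in sorted(counts.items())
        if n > threshold
    ]


# Illustrative positions only (for example, 12450.0 stands for 12+450 km+m).
accidents = [120.0, 130.0, 180.0, 450.0, 12450.0, 12460.0, 12470.0, 12490.0]
print(sliding_window_candidates(accidents, 100, 2))
```

The sketch uses non-overlapping sections; sliding the window with a smaller step or varying its length would need an additional loop over window start positions and lengths.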

There are many variants of the traditional sliding window method (Anderson, 2009; Szénási and Jankó, 2007). A potential alternative is to use a variable window length. One of its advantages is that it is unnecessary to set the exact window length; it is sufficient to give a minimal and a maximal value. The method can try several window lengths to find the largest black spots possible. Due to this modification, it can find small local black spots and larger ones too. The traditional sliding window method uses non-overlapping segments, but it is also possible to slide the window with smaller steps than the window size. This leads to a more sensitive method, which can find more black spot candidates. However, it is also necessary to manage the overlapping black spots (considering these as one big cluster or as multiple distinct ones). It is worth mentioning that the method has some additional advantages: it has very low computational demand (compared to the alternatives) and is based only on the road accident database.

The sliding window method is one of the first widely used procedures; therefore, it is based on the traditional road number + section number positioning system (for example, the accident location is Road 66, 12+450 kilometer+meter). This traditional positioning system was the only real alternative in the past. However, in the last decades, the spread of GPS technology has made it possible to collect the spatial coordinates of accidents. This step has several benefits (faster and more accurate localization) but also requires the rethinking of the already existing methods. It is possible to extend the sliding window method to a two-dimensional procedure, but this is not widely used. It is preferable to seek out more applicable methods that fit the spatial system given by the GPS coordinates.

[...] The procedure has several parameters, like the search radius distance from the reference point (bandwidth or kernel size) and the kernel function.

Several researchers recommend the use of empirical Bayesian methods combining the benefits of the predicted and historical accident frequencies. These models usually analyze the distributions of the already existing historical data from several aspects and give predictions about the expected accident state. In the empirical Bayesian method, the existing historical accident count and the expected accident count predicted by the model are added using different weights (Ghadi and Török, 2019). Because of this, this process requires an accurate accident prediction model.
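The weighted combination described above can be illustrated with a minimal sketch. It assumes the common form in which a single weight between 0 and 1 is applied to the model prediction and its complement to the observed count; how the weight itself is derived from the prediction model is not covered here, and the function name and numbers are illustrative only.

```python
def empirical_bayes_estimate(observed, predicted, weight):
    """Blend an observed accident count with a model prediction.

    weight is the share given to the model prediction (assumed to lie in [0, 1]);
    the remaining share goes to the historical count of the site itself.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    return weight * predicted + (1.0 - weight) * observed


# Illustrative values: 6 recorded accidents, 2 predicted for comparable sites.
print(empirical_bayes_estimate(observed=6, predicted=2, weight=0.25))
```

A weight close to 1 trusts the prediction model almost entirely, which is why the accuracy of that model matters so much for this family of methods.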

Another group of already available methods is based on clustering techniques. These procedures come from the field of data mining, where clustering is one of the widely used unsupervised learning methods. In this context, a cluster is a group of items that are similar to each other and differ from items outside the cluster. Applying this concept to black spot searching, accidents with similar attributes (where the properties can be the location and/or other risk factors) can be considered as one cluster. Most studies use the basic K-means clustering method (Mauro et al., 2013), but there are also some fuzzy C-means based solutions.

As already mentioned, the results of these methods are just a set of black spot candidates. Further analysis is needed to make a final, valid decision as to whether a candidate is a real accident black spot or not.

Manuscript to be reviewed
[...] existence of accident black spots and the potential safety mechanisms, which may help to avoid further crashes. As a second difference, from the road safety engineers' point of view, it is not necessary that the accidents of a given black spot have common characteristics. The hot spot definition of this paper assumes that accidents of a given cluster have similar attributes because this pattern will be the basis of the preventive actions.

The localization of accident black spot candidates is a heavily researched area, and there are several fully automated methods to find these. Nevertheless, the further automatic pattern analysis of these is not as well developed. This phase usually needs a great deal of manual work by human road safety experts (they must travel to the scene and investigate the environment to support their decisions about recommended actions). This process can be supported by some general rules but is mostly done manually, using the pattern matching capability of the human mind. To make this method applicable to self-driven cars, it is necessary to fully automate it.

In line with this objective, this paper focuses on helping autonomous vehicles to take the appropriate preventive actions to avoid accidents:

• localize black spot candidates using the historical accident database;
• make assumptions about the common risk factors and patterns of these accidents;
• according to these preliminary results, the autonomous device will know where the dangerous areas are and what preventive actions to take.

Autonomous vehicles will have several ways to avoid accidents; therefore, accident avoidance is a hot, widely researched topic. Nevertheless, most papers deal with options existing only in the far future, when autonomous devices will be a part of a densely connected network without any human interference. Real-world implementations are far from this point, but some technologies already exist, although they are not closely related to autonomous vehicles. Currently implemented accident prevention systems are built into traditional cars (braking assistants, etc.). However, it is worth considering these because such methods will be the predecessors of the future techniques applicable to self-driven vehicles.

The two main classes of accident prevention systems are passive and active methods. Passive systems send warnings to the driver but do not perform any active operations. In contrast, active methods have the right to perform interventions (braking, steering, etc.) to avoid accidents.

It seems obvious that these prevention systems have a large positive impact on accident prevention, and it has already been proven by Jermakian (2011) that passive methods have significant benefits. There are more than one million vehicle crashes prevented in the USA each year. As Harper proved (Harper et al., 2016), the cost-benefit ratio of these systems is also positive. [...]

The objective of Nitsche et al. (2017) is similar; they propose a novel data analysis method to detect pre-crash situations at various (T- and four-legged) intersections. The purpose of this work is also to support the safety tests of autonomous devices. They clustered accident data into several distinct partitions with the well-known k-medoids procedure. Then, an association rules algorithm was applied to each cluster to specify the driving scenarios. The input was a crash database from the UK (containing one thousand junction crashes). The result of the paper contains thirteen crash clusters, describing the main pre-accident situations.

[...] however, it is one of the most efficient density-based clustering methods from the field of data mining.

The main objective of density-based clustering tasks is the following: the density of elements within a cluster must be significantly higher than between separate clusters. This principle distinguishes two distinct classes of elements: items inside a cluster and outliers (elements outside of any cluster).

In the road safety task, the elements are the accidents on the public road network. These are identified by spatial GPS coordinates and have several additional attributes (time, accident nature, etc.).

The general DBSCAN method needs a definition for distance calculation between two elements. In the case of road accidents, the Euclidean distance between the two GPS coordinates was used (black spots are usually spread over a small area; therefore, it is a good estimate of the real road network distances).

The DBSCAN method requires two additional parameters:

• ε: a radius type variable (meters);
• MinPts: the lower limit for the number of accidents in a cluster (accidents).

The main definitions of the DBSCAN algorithm are as follows:

• the ε environment of a given x element is the space within the ε radius of the x element;

[...]

The result of the presented procedure is a set of black spot candidates.
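For readers unfamiliar with DBSCAN, the following is a minimal, self-contained sketch of the generic textbook algorithm, not a reproduction of the exact procedure of this paper. It assumes that accident coordinates have already been projected to a planar system in meters, that the Euclidean distance is used, and that MinPts is interpreted as the minimum number of accidents in the ε environment of a core point; the function names and sample points are illustrative.

```python
import math


def region_query(points, idx, eps):
    """Indices of all points within eps of points[idx] (the point itself included)."""
    px, py = points[idx]
    return [j for j, (qx, qy) in enumerate(points)
            if math.hypot(px - qx, py - qy) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for outliers."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE              # may later be claimed as a border point
            continue
        cluster_id += 1                    # i is a core point: start a new cluster
        labels[i] = cluster_id
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id     # border point: reachable but not core
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster_id
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:   # j is a core point: keep expanding
                queue.extend(j_neighbours)
    return labels


# Illustrative planar coordinates in meters: two dense groups and one outlier.
points = [(0, 0), (10, 0), (0, 10), (10, 10),
          (1000, 1000), (1010, 1000), (1000, 1010),
          (500, 500)]
print(dbscan(points, eps=20, min_pts=3))
```

The quadratic neighbour search keeps the sketch short; a real accident database would normally use a spatial index for the ε environment queries.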

The prerequisites of Step 4 can be one or more of the following:

• The number of accidents should be more than a given threshold.
• The accident density of the given area should be more than a given threshold.

If the number of accidents is less than three, the proposed area concept is not applicable. However, clusters with one or two accidents are usually not considered black spot candidates; therefore, this is not a real limitation. In the case of clusters with more than two accidents, the accident rate is calculated as (2).

where

• ρ(C): the accident density of the C cluster;
• |C|: the number of accidents in the C cluster.
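As a hedged illustration of the density calculation, the sketch below assumes that the area in the denominator is the area of the boundary polygon of the cluster, computed with the Gauss (shoelace) area formula from the ordered corner points. The function names, the planar coordinates in meters, and the sample rectangle are assumptions made for the example.

```python
def polygon_area(corners):
    """Area of a simple polygon from its ordered corner points (Gauss/shoelace formula)."""
    twice_area = 0.0
    n = len(corners)
    for k in range(n):
        x1, y1 = corners[k]
        x2, y2 = corners[(k + 1) % n]      # wrap around to close the polygon
        twice_area += x1 * y2 - x2 * y1
    # The sign depends on the corner order; abs() makes the result order-independent.
    return abs(twice_area) / 2.0


def accident_density(accident_count, boundary_corners):
    """Accidents per unit area; only defined for at least three accidents."""
    if accident_count < 3 or len(boundary_corners) < 3:
        raise ValueError("the area concept needs at least three accidents/corners")
    area = polygon_area(boundary_corners)
    if area == 0.0:
        raise ValueError("degenerate polygon (all corner points on one line)")
    return accident_count / area


# Illustrative 100 m x 50 m rectangle, corners listed clockwise.
corners = [(0, 0), (0, 50), (100, 50), (100, 0)]
print(polygon_area(corners))
print(accident_density(5, corners))
```

Taking the absolute value is a design choice of this sketch; keeping a fixed corner order, as the text describes, gives the same area with a known sign.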

The formula requires the sequence of corner coordinates of the polygon in a given order (in this case, a [...]

• In the case of the first ($P_1$) and second ($P_2$) items, the concept of "polygon" cannot be interpreted. Hence, these are automatically marked as further corner points of the polygon.
• With the third point ($P_3$), the items already form a polygon. The $P_3$ point must be on the right side of the vector $\overrightarrow{P_1 P_2}$, which can be checked using a scalar multiplication to ensure the clockwise direction requested by the Gauss formula. If this is not the case, it is necessary to change the order of $P_1$ and $P_2$. After that step, $P_1$, $P_2$ and $P_3$ will be the corner points of the polygon in a clockwise direction.
• For every additional point ($P_5, P_6, \ldots, P_n$), it must be checked that the additional point is inside the [...] must be a sequence of one or more consecutive vectors breaking the rule. Let k and l be the first and last vectors of this sequence. It is possible to substitute the $P_{k-1}, P_k, P_{k+1}, \ldots, P_{l-1}, P_l, P_{l+1}$ part of the boundary vector list with $P_{k-1}, P_{new}, P_{l+1}$. Because of the convexity of the original polygon, the $P_{k-1}, P_{new}, P_{l+1}$ triangle contains all the $P_k, P_{k+1}, \ldots, P_{l-1}, P_l$ points, and the transformation also [...]

[...] given by the previous methods needs further examination to find the real hazardous sites.

At this point, the methodology of this paper significantly differs from the work of road safety engineers. Their objective is to find hazardous sites and take the appropriate actions to decrease the probability of [...] to make the road network better. Nevertheless, as a passive participant, it should be able to localize the problematic areas, analyze these, and take the necessary preliminary steps to avoid further accidents.

Another difference between the methods of these fields is that, from the perspective of road safety engineers, it is not necessary that the accidents of a given black spot have any special patterns or common characteristics. For the self-driven car, the localization of high-risk areas where the number of accidents is significantly higher than expected is not enough because this fact does not help to take the appropriate preliminary steps. This is the reason why this paper focuses on the identification of accident reasons.

The result of this further investigation can be one of the following:

[...] as a type of "catching-up accident", but this does not give any information about why the accident occurs.

It is also typical that most of the accidents in the Hungarian road network are caused by "incorrect choice of speed". However, it is obvious that speeding itself was not the only triggering reason for these accidents. There must be other factors (although it is indisputable that speeding increases the effects of other factors and makes certain accidents unavoidable).

Based on these experiences, this paper does not try to assign all accidents to mutually exclusive accident reason classes. On the contrary, the proposed method defines several potential accident reasons, which are not mutually exclusive. These factors can be complementary and have different weights and roles in the occurrence of the accident. Only the reasons with potential preventive operations are discussed because these carry valuable information for the self-driven car.

The proposed method is based on the following consecutive steps:

[...] for the given factor.

The independent accident reason factors, like "slippery road", "bad visibility", "careless pedestrians", [...] where $W^{i}_{attr=value}$ shows the score for the $R_i$ accident reason when the attr attribute equals value. Accordingly, the cumulative score of the $R_i$ reason for the x accident is (3):

\[ S_i(x) = \sum_{y \in A(x)} W^{i}_{y = x.y} \tag{3} \]

where $S_i(x)$ is the score value of the $R_i$ reason for accident x. The x.y corresponds to the value of the specific y attribute of the x accident, and $A(x)$ contains all the available known attributes of x.

It is also possible to calculate the same value not just for a single accident but for all accidents of a black spot candidate. The $H_i(C)$ set contains the $S_i(x)$ score values for all x accidents in the C cluster, as visible in (4):

\[ H_i(C) = \{ S_i(x) \mid x \in C \} \tag{4} \]
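The scoring described above can be sketched as a table lookup summed over the known attributes of an accident. The attribute names, values, and weights below are purely hypothetical and are not taken from the database or the weight tables of this paper; accidents are assumed to be dictionaries of attribute-value pairs.

```python
def reason_score(accident, weights):
    """Cumulative score of one reason for one accident: the sum of the weights
    of all (attribute, value) pairs of the accident that appear in the table."""
    return sum(weights.get((attr, value), 0.0) for attr, value in accident.items())


def cluster_scores(cluster, weights):
    """Score values of every accident in a black spot candidate."""
    return [reason_score(accident, weights) for accident in cluster]


# Hypothetical weight table for a "slippery road" reason (illustrative only).
SLIPPERY_WEIGHTS = {
    ("road_surface", "wet"): 0.5,
    ("road_surface", "icy"): 1.0,
    ("accident_nature", "skidding"): 1.0,
}

cluster = [
    {"road_surface": "wet", "accident_nature": "skidding", "light": "daylight"},
    {"road_surface": "dry", "accident_nature": "rear-end", "light": "night"},
    {"road_surface": "icy", "light": "night"},
]
print(cluster_scores(cluster, SLIPPERY_WEIGHTS))
```

Attributes that are missing from an accident record simply contribute nothing, which mirrors the restriction of the sum to the available known attributes.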

As a further step, it is necessary to determine whether there is any significant reason which proves that the C set is a real hot spot. For a well-founded decision, it is necessary to analyze all the accidents in the database to determine the main characteristics of the distributions of all R reasons. Based on these results, it is possible to compare the distributions of the $H_i(C)$ values for the examined C hot spot candidate and the reference $\hat{H}_i$ values for the whole accident database (D) for a given $R_i$ reason (5).

\[ \hat{H}_i = \{ S_i(x) \mid x \in D \} \tag{5} \]

Hypothesis tests can show if the mean value of a given accident reason score ($R_i$) in a given cluster is higher than the same mean for all accidents in the database. The alternative hypothesis states that the mean score of the cluster minus the mean score of the whole population is greater than zero (7). The null hypothesis covers all other possible outcomes (6).

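\[ H_0 : \overline{H_i(C)} - \overline{\hat{H}_i} \le 0 \tag{6} \]

\[ H_1 : \overline{H_i(C)} - \overline{\hat{H}_i} > 0 \tag{7} \]

According to Welch's method, the t statistic is given by (8):

\[ t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{v_1}{n_1} + \frac{v_2}{n_2}}} \tag{8} \]

where: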

• $\bar{x}_1$ is the mean of the first sample;
• $\bar{x}_2$ is the mean of the second sample;
• $v_1$ is the variance of the first sample;
• $v_2$ is the variance of the second sample;
• $n_1$ is the size of the first sample;
• $n_2$ is the size of the second sample.

The degrees of freedom ($\nu$) are calculated by (9):

\[ \nu = \frac{\left( \frac{v_1}{n_1} + \frac{v_2}{n_2} \right)^2}{\frac{v_1^2}{n_1^2 (n_1 - 1)} + \frac{v_2^2}{n_2^2 (n_2 - 1)}} \tag{9} \]

Based on the previously calculated t and $\nu$ values, the t-distribution can be used to determine the probability (P). The one-tailed test is applied because it answers the question of whether the mean of the cluster is significantly higher than the mean of the entire population. Based on P and a previously defined level of significance (α), it is possible to reject or not reject the null hypothesis.

In the case of rejection, it can be assumed that the examined accident reason is related to the accidents as one of the possible causal factors. If the null hypothesis cannot be rejected, there is no evidence for this.
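The test statistic and the degrees of freedom can be computed with the Python standard library alone, as in the sketch below. It uses the standard Welch formulas with unbiased sample variances; the function name and the two small samples are illustrative and do not come from the accident database.

```python
import math
from statistics import mean, variance


def welch_t_test(sample1, sample2):
    """Return (t, dof): Welch's t statistic and the Welch-Satterthwaite
    degrees of freedom for two samples with possibly unequal variances."""
    n1, n2 = len(sample1), len(sample2)
    x1, x2 = mean(sample1), mean(sample2)
    v1, v2 = variance(sample1), variance(sample2)   # unbiased sample variances
    se1, se2 = v1 / n1, v2 / n2                     # squared standard errors
    t = (x1 - x2) / math.sqrt(se1 + se2)
    dof = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t, dof


# Illustrative score samples: a black spot candidate versus a reference sample.
cluster_sample = [1, 2, 3, 4, 5]
reference_sample = [0, 1, 2]
t_value, dof = welch_t_test(cluster_sample, reference_sample)
print(round(t_value, 4), round(dof, 4))
```

Each sample needs at least two elements, and at least one of them must have a non-zero variance. The one-tailed probability would then be taken from the upper tail of the t-distribution with these degrees of freedom, for example with `scipy.stats.t.sf(t_value, dof)` if SciPy is available, and compared with α.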

The practical evaluation presented in this paper focuses on one specific accident reason (N = 1): the slippery road condition factor ($R_1$).

The accident database used contains more than two hundred fields, in four categories:

[...] presented in Section . Considering the $R_1$ slippery road condition factor, the $S_1(x)$ value is calculated for all x accidents. Most of these are not related to slippery road surface reasons, so the $S_1$ value for these is 0. [...]

Table 1 shows the black spot candidates of the t interval where the null hypothesis was rejected because the mean of the $R_1$ score for the given black spot candidate was significantly higher than the expected average. It can be assumed that these black spots are affected by the examined $R_1$ factor.

[...] 1 shows the environment and the accidents of the first black spot from this list. As is visible in the satellite image, it is a part of a long straight road; consequently, there is no reason for the autonomous car to decrease its speed. Table 2 contains detailed information about the accidents from the historical database.

As is visible, there is a high number of accidents affected by one or more slippery road-related attributes.

Site consistency test
This test assumes that any site identified as a black spot in the t time period should also reveal high risk in the subsequent $\hat{t}$ time period.

Let Π(C) be the convex boundary polygon of the C cluster given by the algorithm presented in Section , and let Π be the union of these regions identified in the t time period (10).

As the next step, we collect all accidents of the consecutive $\hat{t}$ time period that are inside the clusters identified in the prior t time period. The $T_1$ attribute shows the number of these accidents divided by the total area of these clusters. Thus, this is the accident density of these clusters in the consecutive time period (11).

Accident reason factor consistency test
As this paper goes further by revealing the accident reason factors, it is also worth checking whether the accidents in the $\hat{t}$ time period inside the region identified in the t time period have the same attributes. This leads to the introduction of the $T'_1$ value, which shows the average score value for these accidents (12).

Method consistency test
It can also be assumed that a black spot area identified in the t time period will also be identified as a black spot in the consecutive $\hat{t}$ time period. A given black spot searching method can be considered consistent if the number of black spots identified in both periods is large, while the number of black spots identified in only one of the examined periods is small. It is possible to use (13) to calculate this method consistency:

\[ T_2 = \frac{\left| \{C_1, C_2, \ldots, C_n\} \cap \{\hat{C}_1, \hat{C}_2, \ldots, \hat{C}_{\hat{n}}\} \right|}{\left| \{C_1, C_2, \ldots, C_n\} \,\triangle\, \{\hat{C}_1, \hat{C}_2, \ldots, \hat{C}_{\hat{n}}\} \right|} \tag{13} \]

where $T_2$ is the ratio of the number of clusters existing in both search results and the number of clusters given by only the search in the t or only in the $\hat{t}$ time period (△ stands for the symmetric difference of sets). A pair of clusters from the t and $\hat{t}$ periods is considered identical if the distance between them is less than 300 m.
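A possible way to compute this ratio is sketched below. Several details are assumptions of the example rather than statements of the paper: each cluster is represented by a single center point in planar meter coordinates, the distance between clusters is the Euclidean distance of these centers, and pairs are matched greedily in input order, each cluster being used at most once.

```python
import math


def method_consistency(centres_t, centres_t_hat, max_distance_m=300.0):
    """Ratio of clusters found in both periods to clusters found in only one."""
    matched_t, matched_hat = set(), set()
    for i, (x1, y1) in enumerate(centres_t):
        for j, (x2, y2) in enumerate(centres_t_hat):
            if j in matched_hat:
                continue                       # each cluster is matched at most once
            if math.hypot(x1 - x2, y1 - y2) < max_distance_m:
                matched_t.add(i)
                matched_hat.add(j)
                break
    both = len(matched_t)
    only_one = (len(centres_t) - both) + (len(centres_t_hat) - both)
    if only_one == 0:
        return math.inf                        # every cluster was found in both periods
    return both / only_one


# Illustrative cluster centers (meters) for the two periods.
period_t = [(0, 0), (1000, 0), (5000, 5000)]
period_t_hat = [(100, 0), (1200, 0), (9000, 9000), (20000, 0)]
print(method_consistency(period_t, period_t_hat))
```

Here two clusters are found in both periods and three in only one of them, so the ratio is 2/3; a different matching rule or distance definition could change which pairs count as identical.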

Rank difference test
The rank difference test is based on black spots identified in both the t and $\hat{t}$ periods. The black spots of both periods are sorted by accident density, and the rank difference test shows the difference in the positions of the same cluster in the two lists. The smaller the value, the more consistent the examined method is, because the sequence of clusters is similar. Large numbers show that the examined method was able to identify the same black spot in both intervals but with a different severity relative to each other.

Let $O$ and $\hat{O}$ be the sequences of black spots identified in both periods (both sequences contain the items of the $\{C_1, C_2, \ldots, C_n\} \cap \{\hat{C}_1, \hat{C}_2, \ldots, \hat{C}_{\hat{n}}\}$ set) ordered by accident density in the t time period ($O$) and in the $\hat{t}$ time period ($\hat{O}$). The $T_3$ value shows the rank difference of the examined method (14). Obviously, $|O| = |\hat{O}|$, and Rank(x, Y) is the rank of the x black spot in the Y sequence.
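The rank comparison can be illustrated with the sketch below. It assumes that the rank difference is aggregated as the sum of the absolute rank differences of the common clusters and that higher density means a better (smaller) rank; whether the original formula sums or averages these differences is not recoverable from the text here, so the aggregation is an assumption, as are the cluster identifiers and densities.

```python
def rank_difference(density_t, density_t_hat):
    """Sum of absolute rank differences of the clusters present in both periods.

    Both arguments map a cluster identifier to its accident density in one period.
    """
    common = set(density_t) & set(density_t_hat)
    # Rank 1 is the cluster with the highest accident density in each period.
    order_t = sorted(common, key=lambda c: density_t[c], reverse=True)
    order_hat = sorted(common, key=lambda c: density_t_hat[c], reverse=True)
    rank_t = {c: r for r, c in enumerate(order_t, start=1)}
    rank_hat = {c: r for r, c in enumerate(order_hat, start=1)}
    return sum(abs(rank_t[c] - rank_hat[c]) for c in common)


# Illustrative densities: clusters A, B and C are found in both periods.
density_t = {"A": 0.05, "B": 0.03, "C": 0.02, "D": 0.01}
density_t_hat = {"A": 0.04, "B": 0.01, "C": 0.03, "E": 0.02}
print(rank_difference(density_t, density_t_hat))
```

In this example A keeps the first place while B and C swap places, so the total difference is 2.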

First, the proposed method was compared to the traditional sliding window (SW) method using dynamic window length. The minimal window length parameter was 250 m, the minimal accident number was 5, and the minimal accident density was 0.01 accidents/m. As a further step, the novel method was also compared to raw DBSCAN-based clustering (without the accident factor scoring). Its parameters were the same as presented above. The proposed method is presented in the comparison under the name DARF (DBSCAN with Accident Reason Factor determination).

Table 3 shows the overall results for "Győr-Moson-Sopron" county. As visible, the number of black spots recognized by the DARF method is significantly lower than that of its alternatives. This was expected because the SW and DBSCAN methods list all clusters where the accident density is higher than a given threshold. In contrast, the DARF method returns only black spots affected by the $R_1$ accident factor.

The difference between the SW and DBSCAN methods is also significant and is caused by the fact that SW uses road name + road section positioning, which is not available in built-up areas. In comparison, the DBSCAN method is based on GPS coordinates and can find the black spots of municipal roads (which is one advantage of this approach).

The $T_1$ result is similar in the case of the DBSCAN and DARF methods, and it is significantly lower in the case of SW. The $T_2$ results are almost the same for all algorithms. The third general metric shows that the proposed method performs very well on the rank difference test. However, it is worth noting that the number of black spots is significantly lower in this case, which can be an advantage.

The $T'_1$ metric shows the real strength of the proposed method. [...]

Table 4 shows the same values for another county ("Heves") as a control dataset to check the robustness of the method. As visible, the main characteristics of the results are very similar. In this case, the $T_1$ and $T_3$ results are better compared to the alternatives. The $T'_1$ value is slightly lower but still significantly higher than the population average.

Table 4. Results of the comparison of the SW, DBSCAN, and DARF methods based on the slippery road condition. Precision is the ratio of the number of confirmed black spots (identified in both intervals) and the number of all black spots (identified in at least one of the intervals). Results are based on the personal injury accidents that occurred in "Heves" county.

This work presents a novel, fully automated method updating autonomous vehicles concerning potential [...]

considers the distribution of these score values for the full population (all accidents of the given county) and each black spot candidate. Using hypothesis tests (one-tailed Welch test), it is possible to select clusters in which the mean of the score values is significantly higher than the expected value (calculated by statistical methods based on the entire accident database). These can be considered as black spots affected by the given factor.

The output of this process is a sequence of risky locations on the public road network and a prediction [...]

Another direction of further development is to make the method more sensitive to real-time environmental conditions. For example, if the autonomous car has to plan a route at night in wet weather, then it should pay more attention to historical accidents that have occurred under similar conditions. This also confirms that it is necessary to make simple and fully automatic algorithms for this purpose to make fast recalculations possible.

As another further development, an artificial intelligence based approach should be used to extend the database to solve the problems raised by the limitations of the dataset.