STUDY OF USING PROBE VEHICLE DATA FOR SPEEDING ANALYSIS

Speed is a critical transportation concept: it is one of the most important factors that road users consider in relation to route convenience and efficiency, and at the same time it has been recognized as the most influential risk factor. To improve speeding analyses, an emerging data source, probe vehicle data (also known as floating car data), may be used. This data provides information on vehicle speeds without being limited in time and space. To test the feasibility of using this data, a study was conducted on a sample of Prague expressways and collector roads. Firstly, the validity of the probe data sample was checked through comparison to a traditional speed measurement technique, average speed control. Secondly, a descriptive analysis of speeding was performed, focusing on speeding differences across homogeneous road segments in individual hour intervals. Thirdly, statistical models were developed to explain which road parameters contribute to speeding. The analysis utilized cross-section and geometry parameters that may be related to speed choice and speeding. In general, the applied concept proved feasible: night time in particular was found to be more prone to speeding, and the rates differed significantly between segments. The statistical models indicated a statistically significant influence of the following factors on speeding: lower speed limit, lower number of lanes, absence of roadside activities, and presence of horizontal curves. Information on these factors may be generalized and used for planning adequate speeding countermeasures. The final discussion also identifies and describes several challenges for future research, including free-flow speed estimation uncertainty, the quality of speed-safety models, and potential multicollinearity of explanatory variables.


Introduction
Speed is a critical transportation concept. It has been described as one of the most important factors that road users consider in relation to the convenience and efficiency of a certain route (TRB, 2011), as well as a key consideration in geometric design and the road life cycle (Porter et al., 2012). At the same time, speed has been recognized as the most influential risk factor (OECD/ITF, 2018): e.g., in the US, speeding was a contributing factor in 27 percent of all fatal crashes (NHTSA, 2018). The proportion may be even higher: for example, on Czech roads speeding has been attributed to approximately 40 percent of fatal crashes in recent years, making it the most frequent cause of road deaths (Police of the Czech Republic, 2018). Among various speed management measures, speed enforcement has a high potential (Gaca and Pogodzińska, 2017): it was estimated that reaching full compliance with speed limits would reduce the number of fatalities by up to 50 percent (Hydén, 2018).
To make speed enforcement operations as effective as possible, they should target the critical locations and conditions. To identify these, GPS data collected during driving by so-called probe vehicles (also known as floating car data) may be used. This data provides information on vehicle speeds without being limited in time and space (Bessler and Paulin, 2013). Probe data has been used mainly for purposes of navigation and traffic monitoring; nevertheless, its coverage is progressively increasing with the advent of connected vehicles (Saponara, 2018). Probe data has also been used in various safety-related studies, including identification of critical manoeuvres and hazardous road locations (Kamla et al., 2019), investigation of driving activity patterns (Jun et al., 2007) or safety performance of self-explaining roads. Probe vehicle data has been used less often in studies related to speed enforcement. For example, Bar-Gera et al. (2017) used probe vehicle speeds to evaluate the effect of Israeli enforcement cameras on speed distributions; Remias and Brennan Jr. (2018) used probe data to create congestion diagrams and to identify high speed areas on two Interstate Highways in Michigan and New Jersey. While both studies were positive about the feasibility of applying probe vehicle data, they also indicated some potential challenges, such as sample size and representativeness, or varying approaches to data aggregation. In addition, neither of the two studies attempted to validate the obtained speeds, i.e. to compare them to a traditional measurement technique (ground truth).
Some studies built on GPS data also attempted statistical modelling. The explanatory variables they used usually comprised behavioral characteristics: age, gender, trip purpose, attitudes, motivations, etc. (Familar et al., 2011; Richard et al., 2013). For the present study, which focuses on speed and environment, it would be more practical to model speeding based on observable characteristics, i.e. road characteristics. In fact, many such models have been developed (TRB, 2011; Boodlal et al., 2015), but their response variable was usually operating speed, not speeding. Some studies also used logistic regression (Gargoum et al., 2016), with the response defined as the probability of speed limit compliance or non-compliance, which unfortunately cannot quantify the amount of speeding. Therefore, there is a lack of studies that model speeding in a measurable way, based on tangible characteristics.
The current paper aims to contribute to the previous research by studying the feasibility of using probe vehicle data from the perspective of speed and environment. Firstly, the validity of a sample of probe vehicle speed data was checked through comparison with average speed control data. Next, a descriptive analysis was performed, focusing on speeding on road segments in individual hour intervals. Statistical models were also developed to explain which road parameters contribute to speeding. In sum, the feasibility study aimed to find out whether probe vehicle data can help answer where and when drivers speed, which is useful for planning speeding countermeasures. Section 2 describes the study, including data, sample validation, descriptive analysis and explanatory modeling of speeding. Section 3 provides discussion and conclusions.

Study description

Data
The feasibility study focused on five road corridors in Prague (see Figure 1), which the Traffic Police Directorate had identified as prone to speeding. Their lengths varied between approximately 1 and 7 km. The roads mostly had two lanes in each driving direction, divided by a median; some parts had one lane in each driving direction (1+1), without a median. Speed limits were 50, 70 or 80 km/h. Traffic volumes (annual average daily traffic, AADT) were between 10,000 and 50,000 veh/day. All corridors were in relatively flat terrain. Three illustrative photographs are shown in Figure 2.
Probe vehicle data covering the selected corridors from January to December 2017 was obtained from the private company Princip a.s. The data was sourced from a fleet of approximately 10,000 company vehicles. Due to privacy policies, no information on specific vehicles and drivers was available, but the sample was estimated to have a roughly 80/20 split between passenger vehicles and heavy goods vehicles. The records consisted of GPS positions logged at intervals of approximately 10 to 60 seconds, together with speed. According to the data provider, accuracy was 2.5 m and 2 km/h for GPS position and speed, respectively. Table 1 shows the data structure, consisting of vehicle ID, time, geographical position (longitude and latitude in degrees) and speed (in km/h).
The presented study consists of three analyses (sample validation; descriptive analysis of speeding; explanatory models of speeding), which are presented in the following paragraphs.
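To make the record structure described above concrete, the following minimal Python sketch represents one probe record and parses it from a text line. The field list follows the description of Table 1; the semicolon delimiter, the timestamp format, the field names and the sample values are assumptions made for illustration only and do not reflect the provider's actual file format.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ProbeRecord:
    """One probe vehicle record: ID, time, position (degrees), speed (km/h)."""
    vehicle_id: str
    time: datetime
    lon_deg: float
    lat_deg: float
    speed_kmh: float


def parse_record(line: str) -> ProbeRecord:
    """Parse one line; delimiter and timestamp format are assumed, not sourced."""
    vehicle_id, time_text, lon, lat, speed = line.strip().split(";")
    return ProbeRecord(
        vehicle_id=vehicle_id,
        time=datetime.strptime(time_text, "%Y-%m-%d %H:%M:%S"),
        lon_deg=float(lon),
        lat_deg=float(lat),
        speed_kmh=float(speed),
    )


# Hypothetical sample line (values invented for illustration).
record = parse_record("veh-001;2017-04-03 22:15:40;14.4378;50.0755;63.0")
```

Keeping the timestamp as a `datetime` makes the later aggregation into hourly intervals straightforward (`record.time.hour`).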

Sample validation
The idea behind validating the sample of probe vehicle speed data was to check its representativeness by comparison to a traditional speed measurement technique; average speed control was chosen for this purpose. Average speed control (ASC; also known as section control or point-to-point control) measures the average speed over a road section, based on camera identification of vehicles when entering and leaving the enforcement section. ASC has been applied internationally and found to be effective in reducing both speeds and crashes.
The study utilized a partial overlap between the analyzed road corridors and ASC sections. Following consultations with the company TSK Praha, which manages ASC in Prague, four sections were selected for validation, using 2 months of ASC data (April and November 2017). Given the spatiotemporal focus of the study, the comparison was conducted in 1-hour intervals, aggregated from two datasets:
1) ASC data: average hourly speeds, provided by TSK Praha.
2) Probe vehicle speeds: since the sample size was considerably smaller than that of the ASC data (approximately 6%), and thus often influenced by outliers, the median was used to characterize speeds in hourly intervals.
An example comparison of both datasets on a specific section in one month is presented in Figure 3. Differences were tested with two non-parametric statistical tests, the Kolmogorov-Smirnov test (equality of two probability distributions) and the Wilcoxon signed-rank test (comparison of two samples by a paired difference test), at the 95% confidence level. The tests indicated no significant differences between the distributions of the two samples. Even so, there were some differences in the data:
− In the two shorter sections (up to 1 km), differences were on average up to 4 km/h.
− In the two longer sections (over 1 km), differences were on average 15 km/h.
Several previous studies found comparable differences in speeds between different measurement methods and saw them as acceptable. For example, Smith et al. (2003), when comparing probe data to point video data, estimated differences of 10-15 km/h. The I-95 VPP study (INRIX, 2019), known as "The World's Largest Independent Traffic Data Validation", reported differences within 10 mph (i.e. 16 km/h) of actual traffic speeds. Nevertheless, given the focus on accurate speeding estimations, the differences in the longer sections were not seen as satisfactory, and it was decided to keep the length of analyzed segments below 1 km.
Fig. 3. Example comparison of both speed datasets (mean speeds from average speed control in red, median speeds from probe vehicles in blue)
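The hourly aggregation used in the comparison can be sketched as follows. This is a minimal illustration, not the processing actually used in the study: probe speeds are reduced to hourly medians and compared with hourly ASC means through a mean absolute difference. The function names and all numeric values are hypothetical.

```python
from collections import defaultdict
from statistics import mean, median


def hourly_medians(records):
    """records: iterable of (hour_of_day, speed_kmh); returns {hour: median speed}."""
    by_hour = defaultdict(list)
    for hour, speed in records:
        by_hour[hour].append(speed)
    return {hour: median(speeds) for hour, speeds in by_hour.items()}


def mean_abs_difference(probe_medians, asc_means):
    """Mean absolute difference over the hours present in both datasets."""
    common = sorted(set(probe_medians) & set(asc_means))
    if not common:
        raise ValueError("no common hourly intervals to compare")
    return mean(abs(probe_medians[h] - asc_means[h]) for h in common)


# Invented values: the 90 km/h record is an outlier in a small hourly sample.
probe = [(8, 50.0), (8, 52.0), (8, 90.0), (9, 60.0), (9, 62.0)]
asc = {8: 50.0, 9: 64.0}  # hypothetical hourly ASC means

medians = hourly_medians(probe)
difference = mean_abs_difference(medians, asc)
```

In the invented example, the hourly mean for 8 o'clock would be pulled to 64 km/h by the single outlier, while the median stays at 52 km/h, which illustrates why the median suits a small sample. The two tests named above are available in SciPy as `scipy.stats.ks_2samp` and `scipy.stats.wilcoxon`, should one wish to apply them to the hourly series.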

Descriptive analysis of speeding
For the descriptive analysis, the road corridors were divided into segments. The idea was to define homogeneous segments with constant values of parameters that may be related to speed choice and speeding. The cross-section and geometry parameters were selected based on previous reviews and experience (TRB, 2011; Boodlal et al., 2015; Ambros et al., 2017). After splitting between driving directions and excluding some non-typical cases, 71 segments were obtained, with lengths between 100 and 500 m.
In addition to total speeding (i.e., the number of all records that exceeded the speed limit, divided by the total number of records), the following speeding categories, based on Czech legal definitions, were used:
− small speeding (up to 5 and 10 km/h over the speed limit on urban and rural roads, respectively)
− medium speeding (up to 20 and 30 km/h over the speed limit on urban and rural roads, respectively)
− high speeding (up to 40 and 50 km/h over the speed limit on urban and rural roads, respectively)
Speeding rates were calculated and visualized in polar graphs, which enable looking up the values in specific hourly intervals (in 24-hour clock format, i.e. 1 = between midnight and 1 am, … , 24 = between 11 pm and midnight). Since rates of high speeding were relatively low (below 10%), they were not included in the graphs.
Figure 4 presents an example comparing speeding rates in six expressway segments. Each colored line in the graph corresponds to one of the six segments (D, E, F in one driving direction; G, H, I in the other), and values change across the hourly intervals. The graphs illustrate differences between segments (higher rates in segments F and G), as well as differences between driving directions and between daytime and nighttime values.
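The speeding rates defined above can be computed along the following lines. The sketch uses the urban thresholds quoted in the list (5, 20 and 40 km/h over the limit) and the 1 to 24 interval numbering. Two points are assumptions of this sketch rather than statements from the study: the upper bounds are treated as inclusive, and records more than 40 km/h over the limit are counted only in total speeding. The function names and sample records are hypothetical.

```python
from collections import Counter, defaultdict

# Upper bounds of excess speed (km/h) on urban roads, per the categories above.
URBAN_BOUNDS = (("small", 5), ("medium", 20), ("high", 40))


def speeding_category(speed_kmh, limit_kmh, bounds=URBAN_BOUNDS):
    """Return the speeding category, or None if the limit is not exceeded."""
    excess = speed_kmh - limit_kmh
    if excess <= 0:
        return None
    for name, upper in bounds:
        if excess <= upper:  # assumption: bounds are inclusive
            return name
    return "above_high"  # assumption: counted in total speeding only


def hourly_speeding_rates(records, limit_kmh):
    """records: iterable of (hour_of_day 0..23, speed_kmh).

    Returns {interval 1..24: {"small", "medium", "high", "total": rate}}.
    """
    counts = defaultdict(Counter)
    for hour, speed in records:
        interval = hour + 1  # interval 1 = between midnight and 1 am
        counts[interval]["records"] += 1
        category = speeding_category(speed, limit_kmh)
        if category is not None:
            counts[interval][category] += 1
            counts[interval]["total"] += 1
    return {
        interval: {key: c[key] / c["records"]
                   for key in ("small", "medium", "high", "total")}
        for interval, c in counts.items()
    }


# Invented records for one segment with a 50 km/h limit.
sample = [(23, 48.0), (23, 54.0), (23, 65.0), (23, 95.0), (0, 50.0)]
rates = hourly_speeding_rates(sample, limit_kmh=50)
```

The resulting dictionary, keyed by hourly interval, maps directly onto the angular axis of a polar graph.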

Explanatory models of speeding
To provide more insight into speeding performance, explanatory models were developed. Speeding was used as the response variable; road parameters collected during the previously mentioned segmentation were used as potential explanatory variables. Approximate AADT was also added, based on the 2017 census by TSK Praha (TSK, 2017). An overview of the variables is given in Table 2.
The models were developed using IBM Statistical Package for the Social Sciences (SPSS), specifically with backward elimination. In the model form, speeding on segment i is the response, expressed through the explanatory variables, a regression constant (intercept) and coefficients to be estimated. During modeling, some categories (with relative frequencies below 10%) were combined in order to strengthen the modeled relationships. Nevertheless, modeling was not successful for small speeding as a response variable; results are thus presented for medium and total speeding; see Table 3. In some cases, the achieved statistical significance did not quite reach the 5% level (Sig. values in bold), missing it by no more than 2 percentage points, so the results were deemed satisfactory. According to goodness-of-fit (R²), the models of medium and total speeding explained 60 and 51% of the systematic variance of speeding, respectively, which is comparable to similar previous studies.
Nevertheless, no reference was found to support the mentioned higher speeding in curves. In terms of speed, rather the opposite may sound logical. However, relationships related to speed and speeding may not be identical; in fact, they may even contradict each other, as evidenced by the contradictory relationship between speed limit and speeding.
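For readers who want to experiment with the modeling idea, the sketch below fits an ordinary least-squares model and removes predictors by backward elimination. It is a simplified stand-in under stated assumptions, not a reproduction of the SPSS procedure or of the study's model form: a linear model is assumed, the removal criterion is a fixed threshold on the absolute t-value (the hypothetical `t_min=2.0`) rather than a probability-based criterion, and the data and variable names are synthetic.

```python
import math


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(map(float, row)) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix: predictors are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def ols(rows, y):
    """Fit y = b0 + sum(bj * xj); returns (coefficients, t-values, R squared)."""
    X = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    n, k = len(X), len(X[0])
    xtx = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(k)]
           for a in range(k)]
    xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(k)]
    inv = invert(xtx)
    beta = [sum(inv[a][b] * xty[b] for b in range(k)) for a in range(k)]
    sse = sum((y[i] - sum(beta[a] * X[i][a] for a in range(k))) ** 2
              for i in range(n))
    mean_y = sum(y) / n
    sst = sum((v - mean_y) ** 2 for v in y)
    sigma2 = sse / (n - k)  # residual variance; requires n > k and sse > 0
    t_values = [beta[a] / math.sqrt(sigma2 * inv[a][a]) for a in range(k)]
    return beta, t_values, 1.0 - sse / sst


def backward_eliminate(names, rows, y, t_min=2.0):
    """Drop the predictor with the smallest |t| until all reach t_min."""
    keep = list(range(len(names)))
    while True:
        beta, t_values, r2 = ols([[r[j] for j in keep] for r in rows], y)
        if not keep:
            break
        # t_values[0] belongs to the intercept, which is always retained.
        weakest = min(range(len(keep)), key=lambda a: abs(t_values[a + 1]))
        if abs(t_values[weakest + 1]) >= t_min:
            break
        del keep[weakest]
    return [names[j] for j in keep], beta, r2


# Synthetic data: y depends on x_informative only; x_irrelevant is pure noise.
x1 = [0, 1, 2, 3, 4, 5, 6, 7]
x2 = [1, -1, 1, -1, -1, 1, -1, 1]
noise = [0.5, -0.5, -0.5, 0.5, 0.5, -0.5, -0.5, 0.5]
y = [10 + 2 * a + e for a, e in zip(x1, noise)]
selected, beta, r2 = backward_eliminate(
    ["x_informative", "x_irrelevant"], [list(p) for p in zip(x1, x2)], y)
```

On the synthetic data the irrelevant predictor is removed and the informative one is kept with a slope close to 2. For real analyses, an established statistics package is preferable to this hand-written solver.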

Discussion and conclusions
The goal of the presented study was to answer where and when drivers speed. To this end, probe vehicle data was analyzed on a sample of Prague expressway and collector road segments. After checking data validity through comparison to average speed control data, a descriptive analysis of speeding was performed, focusing on homogeneous road segments in individual hour intervals. In general, the applied concept proved feasible: night time in particular was found to be more prone to speeding, and the rates differed significantly between segments, which shows the importance of a location-specific approach. Statistical models were also developed to explain which road parameters contribute to speeding: lower speed limit, lower number of lanes, absence of roadside activities, and presence of horizontal curves. Information on these factors may be generalized and used in planning speeding countermeasures.
During the study, several issues emerged, which are described in the following paragraphs:
− Traditionally, most speed and speeding related studies used free-flow speed in their analyses, defined as the speed of vehicles exceeding specific headway values (TRB, 2011). However, there is no consensus on these values (Ambros and Kyselý, 2016); what is more, the concept is not applicable to probe vehicle studies, where data is collected from individual vehicles only, with no way to check whether they were influenced by other vehicles. A compromise solution is to restrict data collection to off-peak hours (Bekhor et al., 2013); however, this would not be practical when the study objective is to study and compare behavior without time restrictions, i.e. including peak hours.
− In order to prove the sample representativeness, average speed control (ASC) was chosen as the ground truth. It would be ideal to use speeding for the comparison; however, it was not available in the given data, so speed was used as an indicator. The comparison results were found to diverge for longer segments, which is logical, given that ASC averages the speed over distance, and thus the bias may increase with distance. Several previous studies used speeds from inductive loops ( son, 2004), connecting changes in speeds with changes in road crashes at various levels of injury severity, must hold. In this regard, the model needs to be updated: it should, for example, consider that the impact of speed depends not only on the relative change of speed, but also on the initial speed (Elvik, 2013); additionally, it should be expanded to reflect more specific conditions (for example, Gitelman et al. (2018) noted that it does not include separate estimates for night crashes, although these may be especially severe). The emerging use of speeds from probe vehicles in safety analyses also indicates that these may be used to develop a new generation of speed-safety models (Jurewicz et al., 2018). In the meantime, care needs to be taken when estimating crash changes based on changes in speeds obtained from probe vehicle data.
− It is known that various road design parameters are correlated with each other, as well as with speed limit and AADT (Hauer, 2004). This means that using such explanatory variables leads to multicollinearity, which is considered a bias. In fact, it is possible that most of the identified influential road characteristics have in common that they imply lower speed limits, which are in turn associated with higher speeding. When using Cramér's V measure of association between categorical variables (Field, 2018; University of Toronto, 2019) in the studied sample, 75% of pairs were labelled as strongly associated; had all correlated variables been excluded, the analysis would not have been possible. On the other hand, some authors claimed that multicollinearity does not necessarily mean that specific variables need to be discarded (Fridstrøm, 2015). For example, Mannering (2018) states that "multicollinearity (...) should never be used as a basis for not considering a variable in model estimation (a variable should only be excluded after it has been found to produce a statistically insignificant parameter)" (p. 273).
The feasibility study may thus be considered successful: it showed that speeds from probe vehicles provide a practical source for identifying where and when drivers speed. This finding is relatively consistent with previous studies (Bar-Gera et al., 2017; Remias and Brennan Jr., 2018); however, some limitations were encountered, which were not often considered by other authors:
− In theory, probe data spatial coverage is unlimited, but in practice, it may be limited on lower-volume roads. The amount of data may be compared, for example, by the number of data points divided by segment length and traffic volume. Specifically, the analyzed collectors (see Figure 2) had about 40% less data than the analyzed expressways. This means that data collection on lower-volume roads requires extended time, or possibly data from additional sources.
− In order to gain knowledge on network-wide speed(ing) performance, including the mentioned roads with lower volumes, generalization is also possible. In this paper, influential road characteristics were identified through exploratory modelling. However, quality models require detailed descriptive information on the analyzed road network (i.e., digital maps). For example, in the presented exercise, some of the variables could be defined more quantitatively (using widths instead of the presence of a median or shoulder lanes; or quantifying roadside activities through pedestrian exposure, density of pedestrian crossings, etc.).
Both points could be examined through sensitivity analysis, which would indicate the necessary data collection periods, as well as a sufficient level of detail of the network description and its sample size.
This information will be valuable for planning adequate speeding countermeasures. Future research should focus on the mentioned challenges, such as free-flow speed estimation, validation (possibly against a different "ground truth"), and the relationship to crashes. Further studies could focus on temporal variations (day of the week, conditions, traffic volume variations, etc.) and their effect on speeding. The feasibility of the concept could also be tested on roads outside urban areas, possibly using more detailed segmentation, especially in curves.
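The multicollinearity check discussed above relies on Cramér's V, which for two categorical variables is V = sqrt(χ² / (n · min(r − 1, c − 1))), where χ² is the chi-squared statistic of the contingency table, n the number of observations, and r and c the numbers of categories. A minimal Python sketch follows; it applies no bias correction, the variable values are invented, and the threshold for labelling a pair as "strongly associated" is not reproduced here because the text does not state it.

```python
import math
from collections import Counter


def cramers_v(xs, ys):
    """Cramér's V between two equally long sequences of category labels."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(xs)
    row_totals = Counter(xs)
    col_totals = Counter(ys)
    observed = Counter(zip(xs, ys))
    k = min(len(row_totals), len(col_totals)) - 1
    if k == 0:
        raise ValueError("each variable needs at least two observed categories")
    chi2 = 0.0
    for r, n_r in row_totals.items():
        for c, n_c in col_totals.items():
            expected = n_r * n_c / n  # count expected under independence
            chi2 += (observed[(r, c)] - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * k))


# Invented segment attributes: speed limit class vs. presence of a median.
limit = ["50", "50", "70", "70"]
median_same = ["no", "no", "yes", "yes"]   # fully determined by the limit
median_mixed = ["no", "yes", "no", "yes"]  # unrelated to the limit

v_strong = cramers_v(limit, median_same)
v_none = cramers_v(limit, median_mixed)
```

Values range from 0 (no association) to 1 (one variable fully determines the other), as the two invented pairs illustrate.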
Richard Andrášik and Robert Zůvala with data processing and analysis, as well as valuable consultations with Sabina Burdová, Pavel Fiala and Michal Hodboď (Traffic Police Directorate). The study was supported by the Ministry of Education, Youth and Sports' National Sustainability Programme I project of Transport R&D Centre (LO1610), using the research infrastructure of Operation Programme Research and Development for Innovations (CZ.1.05/2.1.00/03.0064).