Assessing Driving Risk Using Internet of Vehicles Data: An Analysis Based on Generalized Linear Models

With the major advances made in internet of vehicles (IoV) technology in recent years, usage-based insurance (UBI) products have emerged to meet market needs. Such products, however, critically depend on driving risk identification and driver classification. Here, ordinary least squares and binary logistic regressions are used to calculate a driving risk score on short-term IoV data without accidents and claims. Specifically, the regression results reveal a positive relationship between driving speed, braking times, revolutions per minute and the position of the accelerator pedal. Different risk classes of drivers can thus be identified. This study stresses both the importance and feasibility of using sensor data for driving risk analysis and discusses the implications for traffic safety and motor insurance.


Introduction
With the advances made in information technology, the internet of things (IoT) is gradually revolutionizing the way people live. An important part of this advanced network is the internet of vehicles (IoV) which, based on telecommunications and informatics technology, collects data about vehicles, roads, and the environment. These data, collected from multiple sources, are then processed and shared. For example, data can be analysed by insurance providers to improve products and services. In short, the IoV facilitates information sharing through vehicle-to-vehicle (V2V), vehicle-to-person (V2P) and vehicle-to-road (V2R) interconnectivity [1].
Our objective here is to show that IoV data can be an important source of driving risk indicators, which can be used by drivers as inputs for improving their driving habits and by insurers for calculating insurance premiums. We also propose that risk indicators can be incorporated into the internal architecture of a vehicle's sensors. We do not consider autonomous vehicles. Ryan et al. address the quantification of risk for autonomous vehicles compared to humans [2].
IoV data are typically collected by means of on-board devices, with pre- and post-installed devices and smart phones being the three most common data acquisition terminals. A pre-installed device is the electronic component that an automobile is equipped with when it leaves the factory; a post-installed device is added by users according to their specific business demands once the vehicle has left the factory; and a smart phone is carried by a driver and uses a satellite-positioning module, acceleration sensor, magnetic sensor, proximity sensor, gyroscope and other sensors to collect data [...]
Sensors 2020, 20, 2712
[...] in Section 5. These results are discussed in Section 6 and the conclusions we draw are presented in Section 7.

Literature Review
Generalized linear models are typically used to study relationships between variables and a response. For example, Jin et al. built two binary logistic models to examine the effects of telematics-related variables on predicting the probability of traffic accidents [7]; Verbelen et al. used generalized additive models and compositional predictors to study the effect of different telematics variables on the expected claim frequency [3]; and Ma et al. used Poisson regression to analyze the relationship between several risk factors and traffic accident probability [8]. Conventional generalized linear regression has always played an important role in the study of driving risk because of the interpretability of its variables. Note, however, that in many countries, premium calculation is regulated and, as a result, the authorities prefer non-black-box methods [9].
With the rise of machine learning, various studies have employed different ML methods to identify driving risks. For example, Carfora et al. assessed driver aggressiveness on urban and highway roads using cluster analysis [10] and Jafarnejad et al. showed that AdaBoost classifiers can provide good predictions of driver identification based on telematics data [11]. Paefgen et al. compared the performance of various ML methods, including logistic regression, neural network, and decision tree classifiers, for driving risk prediction and insurance pricing, and found that logistic regression models had advantages in terms of both accuracy and interpretability [12]. Finally, Bian et al. employed the ensemble learning method to build a driver risk classification model [13]. In most ML studies, logistic regression is selected as the method for driver scoring or driving risk classification and it has been reported to perform well.
The number of indicators of driving risk is increasing with the development of sensor technology. Traditional automobile insurance often relies on quite basic information about the customer and the vehicle. In some advanced markets, prices usually include experience rating systems, which reward good drivers, as well as vehicle type or regional classes. With the development of UBI, however, pay-as-you-drive (PAYD) products require the collection of mileage and fuel consumption data, as recorded on the odometer, with some studies indicating that the greater the mileage, the higher the probability of suffering an accident [14]. Pay-how-you-drive (PHYD) products have also begun to take into account driving behavior data (including speeding, harsh acceleration, sudden braking and sharp turns) and driving condition data (including road types and night-time driving), which have been shown to be associated with driving risk; however, as yet, there are not many mature PHYD products on the market [4,15,16].
In practice, the use of IoV data is not without its problems. While on-board devices can collect data indicative of many key features, they are not universally accepted due to privacy or installation issues [17,18]. Moreover, different on-board devices have different characteristics [16], and while many UBI studies have been unable to obtain all these key features, driving risks and insurance pricing can still be assessed with these limited characteristics [3,19]. Smart phones, on the other hand, while offering advantages of portability, high efficiency and flexibility, have seen their development checked by a lack of data integrity compared to that offered by other on-board devices. Additionally, smartphone data are often missing braking and fuel consumption measurements; even so, UBI can still be attempted [20]. Although the data used in this study do not include accidents and claims, they do contain what can be considered key driving behavior data. Many studies report a positive correlation between accidents and excessive acceleration and braking and so, here, we use these data to undertake our analysis of driving risk [21].
Zuo et al. developed a vehicle terminal that can accurately obtain the vehicle's behavior information, including acceleration, deceleration, sharp turns, and sudden lane changes [22]. They suggested that statistical analyses of these data have broad application prospects in the field of UBI. Recently, Li et al. proposed a method showing that physiological and vehicle information (average speed) have a significant influence on driving risk [23]. However, better pricing based on driving behavior is only achievable if internal vehicle data are combined with map/street and weather data, and it might also be necessary to have additional data from other sources (other vehicles, traffic jam cameras, speed limits in areas of road work, etc.).

Data Description
The IoV data used in our study were obtained from an IoV information service provider in China. The IoV information service provider is called 'China Satellite Navigation and Communications Co., Ltd.' which is a high-tech enterprise providing intelligent network technology services, headquartered in Beijing. The IoV information service gathers data from their own intelligent network vehicles. Data acquisition, transmission, storage, and application procedures are shown in Figure 1. Each vehicle was pre-installed with a telematics box (T-box), which included a GPS sensor, vehicle condition sensor, and wireless transmission unit. When the vehicle was started up, information was updated second-by-second and aggregated at the device level to reduce data transmission and storage costs. Every 30 s, the T-box transmitted the last data item to the database. When the vehicle was shut down, the on-board device automatically restarted every 30 min and transmitted a data item to the database. The data from the database include unique vehicle identification, time-varying GPS trajectory data, and vehicle condition information (for example, mileage, fuel consumption, speed, accelerator pedal position, brake times and so on); they do not include accidents and claims information, which are unavailable for privacy and legal reasons. However, based on the assumption that some indicators are associated with the frequency of accidents, the information can be used to score drivers and to assess accident risks.
In Figure 2, we display the speed. We see that the car is idle during the night. During this period, driving speeds varied from 0 to 110 km per hour. Acceleration and braking actions occur at different speeds. Figure 3 presents an example of the trajectory associated with a vehicle in the course of one day.
Our dataset comprises 309 data files for 309 unique vehicles over a five-day period. These files were obtained in two groups according to the time span: that of the first group runs from 0:00 h on 27 June 2018 to 10:00 h on 2 July 2018, while that of the second group runs from 0:00 h on 3 July 2018 to 22:00 h on 8 July 2018. Since the timespan of both groups is about five days and these measurement periods correspond to two adjacent weeks, it is reasonable to combine the two groups into one. We assume that weather and other relevant conditions are almost identical in these two periods for the two groups. Some data files are empty due to transmission errors or for reasons unknown. Therefore, the total number of effective vehicle data files after excluding the empty data files is 253.
Each data file contains 62 variables, but not all of them can be used directly. First, some variables have many null or meaningless invariant values, due to either sensor failure or extraction error. Second, some variables are redundant: some of them are redefined and calculated by the information provider as transformations of an original variable. In addition, some variables have outliers. After further data processing, the following key variables were selected for this study (see Table 1).

Determination of Dependent Variables
As discussed, our data do not contain any accident or claims information that can be used as dependent variables or labels. However, with the exception of a few unsupervised ML algorithms (such as clustering algorithms) that do not require label data, both GLMs and supervised machine learning algorithms require a dependent variable or label data; otherwise, they cannot be implemented. Therefore, it is imperative at the outset to define a dependent variable to represent driving risk.
Excessive braking while driving, especially at high speeds, indicates that the vehicle is likely to be exceeding the safe speed limit. When driving at an excessive speed and foreseeing danger, a driver is obliged to brake, as opposed to releasing the accelerator pedal, to avoid a collision. Thus, to some extent, braking can be considered a high-risk driving behavior and, so, for the purposes of this study, a driver's brake count was selected as the dependent variable.
Acceleration is an essential part of driving, but different drivers have different habits as regards acceleration. Drivers with good driving habits tend to minimize unnecessary harsh acceleration, thus controlling the vehicle's speed more effectively and saving fuel in the process. Poor drivers, on the other hand, do not have the same control over the use of the accelerator, resulting in more braking to curb their speed. Excessive harsh acceleration increases driving risk. Therefore, the average position of the accelerator pedal (the higher the value, the greater the frequency of harsh accelerations) was chosen as a second dependent variable.
In the logistic regression model, the dependent variables, both "Brakes" and "Accelerator", were dichotomized as follows. Value levels above the median were identified as events equal to one, while those below the median were identified as non-events equal to zero.
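The median split described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name and the sample brake counts are hypothetical, and values exactly equal to the median are assumed here to be non-events, a case the text does not specify.

```python
from statistics import median

def dichotomize_at_median(values):
    """Return 1 (event) for values above the sample median, else 0 (non-event).

    Assumption: values equal to the median are coded as non-events.
    """
    cutoff = median(values)
    return [1 if v > cutoff else 0 for v in values]

# Hypothetical per-vehicle brake counts (illustrative numbers only).
brakes = [12, 40, 7, 55, 23, 31]
labels = dichotomize_at_median(brakes)
print(labels)
```

The same function would apply to the average accelerator pedal position when "Accelerator" is the dependent variable.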

Determination of Independent Variables
As discussed, we were left with a number of effective variables after data preprocessing. Among these, driving distance, speed and RPM were selected as the independent variables for our driving risk evaluation. Driving distance and fuel consumption are two of the earliest factors used to study PAYD. Longer distances and higher fuel consumption mean a higher risk of accidents. Because of the strong correlation between driving distance and fuel consumption, we selected only driving distance as an independent variable. In addition, apart from harsh acceleration and sudden braking, speed is often selected as a driving risk factor: the higher the speed, the harsher the acceleration a driver needs, and the more braking they need to bring the vehicle to a stop [24]. In this study, the average speed of each vehicle is used as one of the independent variables. Finally, RPM, an infrequently employed variable in these studies, reflects, to some extent, driving intensity, and so the average RPM of each car is selected as our third independent variable.
Additionally, we defined a new measure (Range) of driving patterns in this instance, to capture the fact that a driver always remains in the same region (see Figure 4). It was calculated as follows, based on the available GPS trajectory:
$$\mathrm{Range} = (\mathrm{long}_{\max} - \mathrm{long}_{\min})^2 + (\mathrm{lat}_{\max} - \mathrm{lat}_{\min})^2 \tag{1}$$
where $\mathrm{long}_{\max}$ and $\mathrm{long}_{\min}$ represent the maximum and minimum observed Longitude values, respectively, and $\mathrm{lat}_{\max}$ and $\mathrm{lat}_{\min}$ represent the maximum and minimum observed Latitude values, respectively.
Theoretically speaking, if drivers brake or accelerate a lot because of traffic congestion or because of having to move in a limited area, this is not indicative of high-risk driving situations. However, if the driver drives over a wide area per unit of time (that is, at high speeds), a high number of brakes or accelerations may represent a safety hazard.
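As a small illustration of the Range measure in Equation (1), the sketch below computes it from a list of GPS fixes. It follows the equation as printed, that is, the sum of the squared longitude and latitude spans in degrees; the function name and the coordinates are hypothetical.

```python
def gps_range(longitudes, latitudes):
    """Range as in Equation (1): squared longitude span plus squared latitude span."""
    d_long = max(longitudes) - min(longitudes)
    d_lat = max(latitudes) - min(latitudes)
    return d_long ** 2 + d_lat ** 2

# Hypothetical GPS fixes for one vehicle (degrees, illustrative only).
longs = [116.0, 116.5, 116.25]
lats = [39.5, 40.0, 39.75]
print(gps_range(longs, lats))
```

Because only the extreme coordinates enter the formula, the measure describes the bounding box of the trajectory rather than the distance actually driven.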
The summary statistics of our original summary data files are presented in Table 2. The important thing to bear in mind is that "Accelerator" ("Brakes") enters the regression model as an independent variable if "Brakes" ("Accelerator") is the dependent variable. Note also that, for example, the RPM average is low because we include the time when the car is idle.

Feature Creation and Extraction
Polynomial expansion can be used to increase the dimension of the data and transform part of the relationship from nonlinear to linear, thus allowing linear regression to handle nonlinear data. However, excessively high degrees hinder subsequent analyses and increase collinearity among variables. For this reason, the polynomial degree is limited to two in this study. Two-by-two plots are displayed in Figures 5 and 6.
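A second-order expansion of this kind might look like the sketch below. It assumes that "limited to two" means squares and pairwise products of the original variables; the function name, variable names and values are hypothetical.

```python
from itertools import combinations_with_replacement

def degree_two_features(row, names):
    """Expand one observation into linear terms plus all squares and pairwise products."""
    features = dict(zip(names, row))
    pairs = combinations_with_replacement(list(zip(names, row)), 2)
    for (name_a, a), (name_b, b) in pairs:
        features[f"{name_a}*{name_b}"] = a * b
    return features

# Hypothetical observation with two of the study's variables.
expanded = degree_two_features([2.0, 3.0], ["Speed", "RPM"])
print(expanded)
```

With k original variables this yields k + k(k + 1)/2 features, so five variables already produce 20 candidate regressors, which illustrates why the degree is capped.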

In order to facilitate interpretation of the data analysis results, ordinary least squares (OLS) regression and logistic regression were conducted. However, even when controlling for the degree of the polynomial expansion, the increased number of features still hampered regression analysis. Hence, stepwise regression was conducted by bidirectional elimination. In this way, we eventually obtained the combination of independent variables that best explains the dependent variables.
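To make the idea of bidirectional stepwise selection concrete, here is a self-contained sketch for the OLS case. The selection criterion is not stated in the text, so AIC is assumed; all function names and the toy data are hypothetical, and the small linear solver stands in for a statistics library.

```python
import math

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols_aic(columns, y):
    """Fit OLS with an intercept and return the Gaussian AIC (up to a constant)."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(k)]
    beta = solve(xtx, xty)
    rss = sum((yi - sum(b * x for b, x in zip(beta, r))) ** 2
              for r, yi in zip(design, y))
    return n * math.log(rss / n) + 2 * k

def stepwise(data, y):
    """At each step, add or drop the single variable that lowers AIC the most."""
    selected = []
    best = ols_aic([], y)
    while True:
        candidates = []
        for name in data:
            if name in selected:
                trial = [s for s in selected if s != name]  # backward move
            else:
                trial = selected + [name]                   # forward move
            candidates.append((ols_aic([data[s] for s in trial], y), trial))
        aic, trial = min(candidates, key=lambda c: c[0])
        if aic >= best - 1e-9:  # no move improves the criterion: stop
            return selected
        best, selected = aic, trial

# Toy data: y depends on x1 only; x2 is unrelated to the noise.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [1.0, 0.0, -1.0, 0.0, 0.0, 1.0, 0.0, -1.0]
noise = [0.1, -0.2, 0.1, 0.0, 0.0, 0.1, -0.2, 0.1]
y = [2 * a + e for a, e in zip(x1, noise)]
print(stepwise({"x1": x1, "x2": x2}, y))
```

The sketch assumes the candidate regressors are not perfectly collinear and that no model fits the data exactly; a production analysis would rely on an established statistics package instead.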

Brakes as Dependent Variable
First, OLS and logistic regressions were carried out by taking "Brakes" as the dependent variable and "Accelerator" as one of the independent variables. The stepwise regression results after bidirectional elimination are shown in Table 3. The adjusted R² statistic of the OLS regression (0.3420) and the pseudo-R² statistic of the logistic regression (0.2472) are both small.

Accelerator as Dependent Variable
Second, "Accelerator" was selected as the dependent variable, and "Brakes" was selected as one of the independent variables. Then, OLS and logistic regressions were again carried out. The regression results are shown in Table 4. The adjusted R² statistic of the OLS regression (0.7070) and the pseudo-R² statistic of the logistic regression (0.5321) are greatly improved, indicating that the model offers a more accurate reflection of the relationship between variables. Thus, the best choice of dependent variable for the regressions in this study is "Accelerator".
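For readers less familiar with the two fit statistics compared here, the sketch below shows how they are commonly computed. The text does not say which pseudo-R² variant was used, so McFadden's definition is assumed; the function names and inputs are hypothetical.

```python
import math

def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared: penalizes R-squared for the number of predictors."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

def mcfadden_r2(labels, probs):
    """McFadden pseudo-R-squared: 1 - logL(model) / logL(intercept-only model)."""
    ll_model = sum(math.log(p if y == 1 else 1 - p) for y, p in zip(labels, probs))
    base = sum(labels) / len(labels)  # event rate predicted by the null model
    ll_null = sum(math.log(base if y == 1 else 1 - base) for y in labels)
    return 1 - ll_model / ll_null

# Hypothetical binary outcomes and fitted probabilities (illustrative only).
labels = [1, 0, 1, 0]
probs = [0.8, 0.3, 0.7, 0.2]
print(adjusted_r2(0.5, 11, 2), mcfadden_r2(labels, probs))
```

Both statistics increase as fit improves, but they are on different scales, so the OLS and logistic values should each be compared across the two choices of dependent variable rather than with one another.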

Discussion
The regression model allows us to score and classify drivers based on our input data. The independent variables were substituted into the regression model to obtain the estimates. Taking these results as the ordinate value and the observed dependent variable as the abscissa value, the median of the predicted scores and the median of the observed values are used as classifiers (see the scatter diagram in Figure 7). The x-axis presents the intuitive judgment of the original dependent variable, where zero is "Good", the median of the observed data is "Median", and the maximum is "Bad". Likewise, the y-axis presents the predicted results, where zero is "Good", the median of the predictions is "Median", and the maximum is "Bad". In general, the smaller the value of "Accelerator", the more cautious the driving style and the better the driver. In most cases, the predicted value corresponds closely to the expected value, or lies close to the observed value. However, a few predictions clearly do not meet expectations.
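The use of the two medians as classifiers can be sketched as follows, with observed values on the x-axis and predicted scores on the y-axis as in the scatter diagram. The function name and the numbers are hypothetical, and points lying exactly on a median are assumed to fall on the lower side.

```python
from statistics import median

def quadrant_labels(observed, predicted):
    """Assign each driver to a quadrant of the observed-vs-predicted scatter plot."""
    obs_cut, pred_cut = median(observed), median(predicted)
    labels = []
    for obs, pred in zip(observed, predicted):
        horizontal = "right" if obs > obs_cut else "left"   # observed value (x-axis)
        vertical = "top" if pred > pred_cut else "bottom"   # predicted score (y-axis)
        labels.append(f"{vertical} {horizontal}")
    return labels

# Hypothetical observed and predicted "Accelerator" values (percent).
observed = [10, 20, 30, 40]
predicted = [12, 35, 18, 38]
print(quadrant_labels(observed, predicted))
```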
The points in the bottom left and top right of the scatter diagrams are as expected. Drivers with low "Accelerator" values, whose expected driving risk is also small, are unquestionably good drivers. In contrast, drivers with high "Accelerator" values, whose expected driving risk is correspondingly high, are undeniably risky drivers. Similarly, drivers in the top left are high-risk drivers and, while they did not drive aggressively for a short period, they present a tendency to drive aggressively based on their driving behavior. In addition, drivers in the bottom right also need special attention. For example, a driver who has an average accelerator pedal position of 30%, but whose risk model predicts 20% or lower, should be warned, because their level of acceleration activity is above the risk level predicted by the model. Therefore, those in the bottom right are also identified by our models as risky drivers.
Table 5 shows the classification results of the two models used in this study. The results of the two classifications are generally consistent, but there are a few inconsistencies that can affect our assessment of risky drivers. Thus, to improve fault tolerance in the predictions, it seems both models should be used, so as to take into consideration their respective high-risk judgments. Part of the data should be held out for out-of-sample testing to show how the model predicts new data. We split the sample into a training dataset (75% of instances) and a test dataset (25% of instances). The training and test samples were chosen randomly. The comparison of predictive results and classification is shown in Figure 8 and Table 6. The conclusions do not change.
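A random 75%/25% split of this kind can be sketched as below. The seed, the function name and the assumption that the split is made at the vehicle level (253 effective vehicles) are illustrative choices, not details given in the text.

```python
import random

def train_test_split(items, test_fraction=0.25, seed=0):
    """Shuffle the items reproducibly and hold out a fraction of them for testing."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical vehicle identifiers for the 253 effective data files.
vehicle_ids = list(range(253))
train, test = train_test_split(vehicle_ids)
print(len(train), len(test))
```

The model would then be fitted on the training vehicles only and its scores and classifications compared on the held-out vehicles.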

Insurance companies seek to charge premiums that reflect a standard calculation of each driver's risk. In the case of low-risk drivers, insurance companies need to offer certain premium discounts to encourage drivers to maintain their current driving habits. However, in the case of high-risk drivers, insurers would be better off charging them punitive premiums and giving them feedback to help them drive more safely.
Our work has some limitations. As suggested by one reviewer, some additional discussion around the speed of the vehicle at the time is important. For example, a harsh braking event can occur when the vehicle is travelling at 5 km/h with no safety implications, while the same harsh braking at 50 km/h is obviously a different proposition. We need to analyze the detailed data instead of averages. Various outliers that clearly do not obey the model can be identified. Explanations for these outliers might include problems with vehicle sensors or with the data being collected, transmitted, or stored. However, giving feedback on these outliers to the information providers should enable them to identify the source of the problem and improve the quality of their sensors and information services.

Conclusions
This study has demonstrated that IoV data without accidents and claims can be used for undertaking driving risk analyses and driver ratings. As mentioned in the literature review, accidents and the indicators of driving risk are strongly related [14][15][16]. Here, we use variables that are theoretically related to driving risk to replace these dependent variables in a regression analysis and find that when the mean of the accelerator pedal position is used as the dependent variable, the regression model captures the relationship between variables well. This means that insurance companies can use limited data resources for regression analysis when claims data for vehicles are unavailable, but premiums need to be priced.
Using regression analysis, we identified the relationship between our independent and dependent variables that indirectly reflects the influence of various factors on driving risk. Speed is the main factor influencing driving risk: the higher the speed, the more the driver needs to accelerate, the higher the number of RPM, and the more the driver needs to brake, the greater the driving risk. At high speeds, braking action is significantly reduced to avoid sideslip and rollaway accidents caused by braking. However, the range variable specifically created in this study does not appear to play a relevant role. Moreover, the mileage variable performs differently in the two regression analyses, but it does at least show that this variable is related to braking and acceleration, which, in turn, are related to driving risk.
To analyze telematics data and accident data, a comprehensive view of risky driving is needed. Insurers tend to have a complete vision of accidents and, in particular, accident severity combined with the driver's past experience. Telematics service providers collect information on conditions and driving style. Optimally, these two sources should be put together [25,26].
It is also worth mentioning that driving risk indicators such as speed, braking, acceleration or RPM should be accurately measured; otherwise, they could be misleading in a prediction system, especially if such a system is designed as a real-time risk warning tool. In our study, we analyzed daily averages and provided a summary characterization of the drivers; we expect that measurement error could have widened the confidence intervals of our model parameters. A number of weaknesses need to be addressed in this study. For example, we only use OLS and binary logistic regressions, but other linear regression and multiple logistic regression methods should be tried. In addition, GPS data are not especially well exploited here. In subsequent studies, GPS track data can be usefully employed to distinguish the driving road grade (urban road or highway) and driving time (day vs. night), so as to analyze driving risks under different driving conditions. Thanks to the detailed processing and analysis of the data, a driving risk event can be defined quickly so as to explore the risks faced.