How much information is lost when sampling driving behavior data? Indicators to quantify the extent of information loss


Purpose
Individuals’ driving behavior data are becoming widely available through Global Positioning System devices and on-board diagnostic systems. The incoming data can be sampled at rates ranging from one Hertz (or even lower) to hundreds of Hertz. Failing to capture substantial changes in vehicle movements over time by “undersampling” can cause loss of information and misinterpretation of the data, but “oversampling” can waste storage and processing resources. The purpose of this study is to empirically explore how micro-driving decisions to maintain speed, accelerate or decelerate can best be captured without substantial loss of information.


Design/methodology/approach
This study creates a set of indicators to quantify the magnitude of information loss (MIL). Each indicator is calculated as a percentage to index the extent of information loss (EIL) in different situations. An overall information loss index named EIL is created to combine the MIL indicators. Data from a driving simulator study collected at 20 Hertz are analyzed (N = 718,481 data points from 35,924 s of driving tests). The study quantifies the relationship between information loss indicators and sampling rates.


Findings
The results show that marginally more information is lost as data are sampled down from 20 to 0.5 Hz, but the relationship is not linear. With four indicators of MILs, the overall EIL is 3.85 per cent for 1-Hz sampling rate driving behavior data. If sampling rates are higher than 2 Hz, all MILs are under 5 per cent for information loss.


Originality/value
This study contributes by developing a framework for quantifying the relationship between sampling rates and information loss. Depending on the objective of their study, researchers can choose the sampling rate necessary to obtain the required accuracy.



Introduction
In 2017, the National Highway Traffic Safety Administration of the United States announced its decision to move forward with vehicle-to-vehicle (V2V) communication technology for all new light-duty vehicles (NHTSA, 2017). Newly manufactured vehicles will likely be equipped with dedicated short-range communication (DSRC) devices by regulation. With the roll-out of the V2V environment, diagnostic sensors will be installed on vehicles to collect data, and the data will be transmitted wirelessly between vehicles and nearby infrastructure. Data collection would no longer have to rely on conventional equipment, such as loop detectors or video detection, and it would yield much more information than the conventional methods (Liu, 2015; Liu and Khattak, 2016; Liu and Khattak, 2019; Studer et al., 2019). Measurements that were previously unavailable can now be obtained, including but not limited to vehicle speeds, positions, arrival rates, rates of acceleration and deceleration, queue lengths and stopped time. With the increasing amount of data collected from DSRC-equipped vehicles, it is now possible to explore micro-level driver behaviors. Instantaneous driving decisions are of particular interest, because they are the foundation of monitoring energy consumption, emissions and safety on a real-time basis. Driving decisions consist of a collection of maneuvers: accelerating, decelerating, maintaining speed, altering acceleration/deceleration, etc. Driving reflects a chain of instantaneous driving decisions made by drivers according to changes in surrounding circumstances, e.g. adjacent vehicles, roadway conditions, geometric changes in the roadway and weather conditions (Wang et al., 2015).
Intuitively, data sampled at higher rates can capture more information about instantaneous driving decisions. Current data collection in industry can go as high as 800 MHz (Linear Technologies, 2014). However, driving data do not necessarily need to be sampled at such high rates in the transportation context. One problem of high sampling rates is cost, particularly in the context of big data-driven intelligent transportation systems: the extra storage and processing time required is called oversampling (Chawla, 2010). Another problem with high sampling rates is data accuracy. The Next Generation Simulation Program (NGSIM) collected detailed vehicle trajectory data at 10 Hz to develop behavioral algorithms in support of microscopic traffic simulation modeling (Punzo et al., 2011), and the Safety Pilot Model Deployment (SPMD) sampled the safety messages (e.g. motion and location data) transmitted between connected vehicles and infrastructure at 10 Hz (Henclewood, 2014). The accuracy of NGSIM data is estimated at 2-4 ft (Kovvali et al., 2007). For NGSIM data, the distance traveled in 0.1 s by a vehicle moving at 60 mph is about 8.8 ft, but it is measured with a 2-4 ft error. Therefore, the accuracy of NGSIM data might be jeopardized at high sampling rates, because the measurement error is a large fraction of the distance traveled between consecutive samples.
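The NGSIM comparison can be checked with a short unit conversion. The symbol d (distance traveled between two consecutive 10-Hz samples) is introduced here only for illustration and does not appear in the original study.

```latex
d = 60\,\tfrac{\text{mi}}{\text{h}} \times \frac{5280\ \text{ft/mi}}{3600\ \text{s/h}} \times 0.1\ \text{s} = 8.8\ \text{ft},
\qquad
\frac{2\ \text{ft}}{8.8\ \text{ft}} \approx 0.23,
\qquad
\frac{4\ \text{ft}}{8.8\ \text{ft}} \approx 0.45
```

Under these figures, a 2-4 ft position error corresponds to roughly 23 to 45 per cent of the distance covered between two samples at 60 mph.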
However, this does not mean that low sampling rates are always desirable; undersampling/inadequate sampling may cause loss of critical information (Meade et al., 1991). Jackson et al. (2005) discussed the validity of using in-vehicle GPS second-by-second (1 Hz) velocity data to track the 1-s driving operation modes, including acceleration and deceleration. Their results imply that the 1-s operation modes can be successfully measured by using GPS data sampled at 1 Hz (Jackson et al., 2005), whereas the driving operation modes within 1 s are unknown. For example, if a driving sequence "acceleration → deceleration → acceleration" occurs within 1 s, the 1-Hz sampled data may lose the information about the deceleration.
Current driving data are usually continuously sampled by rates from 0.2 to 10 Hz (Int Panis et al., 2006;Ahn and Rakha, 2008;Campbell, 2012;Wang et al., 2008;Hung et al., 2007;Lyons et al., 1986;Boriboonsomsin et al., 2010;Simpson and Markel, 2012;TSDC Secure Transportation Data Project, 2014). Note that the continuous driving data are different from the traffic data collected by loop detectors (Bikowitz and Ross, 1985;Oh et al., 2002). The focus of this study is the continuous driving data used to explore micro-driving behavior. The key question to be answered is what sampling rates are appropriate to capture micro-driving behavior without losing much information (i.e. by undersampling).
In the field of signal processing, the Nyquist-Shannon sampling theorem gives the appropriate sampling rate for a continuous signal. The Nyquist criterion for the sampling rate is twice the bandwidth of a bandlimited signal or a bandlimited channel.
The key question is then to find the bandwidth of a signal (Landau, 1967). However, driving behavior does not fulfill the features of a bandlimited signal. Driving behavior varies according to the decisions a driver makes to respond to instantaneous driving circumstances. This study aims to find appropriate sampling rates for driving behavior data by exploring the nature of drivers' micro-driving behavior.

Data description
Data used in this study come from the University of Tennessee Driving Simulator Lab (DSL). This driving simulator, Drive Safety DS-600c, is fully integrated and immersive for driving test subjects, with visual and audio effects in the front half cab of a Ford Focus sedan. It provides a 300° horizontal field-of-view via five projectors and back sight via three rear mirror liquid crystal displays (Yang et al., 2013). The cab base is able to mimic pitch and 30 longitudinal motions. Since 2009, over 10 simulator studies have been conducted in DSL. The equipment has been recognized as a high-fidelity driving simulator and is qualified for research on driving behaviors. The data on driver responses (e.g. speed) gathered from simulator driving tests can be used as surrogate measures of driving behavior (Bédard et al., 2010; Wang et al., 2010). The driving data used in this study were collected from 24 subjects (13 males, 11 females; average licensed years: 17.6, standard deviation: 7.87). Note that the scope of this study is to introduce the indicators to quantify the extent of information loss (EIL) when sampling driving behavior data. The influences of driving conditions on driving behavior are not examined in this study. Subjects were tested in a simulated driving scenario designed with various driving conditions, covering most possible driving conditions as a whole, including urban and rural environments, as well as freeways and local streets. Each subject completed the driving test in 22-29 min, depending on their travel speed and responses to traffic controls. The driving speed was sampled at 20 Hz. The final dataset used in this study includes 718,481 data points from 35,924 s (598 min) of driving tests.

Methodology
A fundamental question is "how much information is lost in going to lower sampling rates?" Driving can be volatile as drivers make driving decisions (e.g. accelerating and braking) according to the instantaneous changes of surrounding circumstances, e.g. adjacent vehicles, roadway conditions, geometric changes in the roadway and weather conditions (Wang et al., 2015). Using the 20-Hz simulator driving data, this study creates a set of indicators to quantify the magnitude of information loss (MIL):
MIL1: instantaneous driving decision loss (based on combined direct and indirect "detectability" explained below), equations (1)-(3);
MIL2: percentage of out-of-range observations during driving, equation (4);
MIL3: ratio of sampled to actual range in driving data, equation (5); and
MIL4: relative speed deviation from linear interpolation of undersampled data (based on observed speed deviation over the undersampled data), equations (6) and (7).
An index, called Extent of Information Loss (EIL), is created for a sampling rate, as shown in equation (8). The overall methodological framework for this study is shown in Figure 1 and explained in more detail below. There are two groups of indicators: micro-driving decision indicators and magnitude-related indicators. The micro-driving decision indicators capture the micro-driving decisions that are missed when sampling data, and the magnitude-related indicators quantify the magnitude errors between the sampled values and the ground truth values.
Each indicator is calculated as a percentage to index the EIL in different situations. The EIL is an overall indicator of information loss that combines the above indicators. The study quantifies the relationship between information loss indicators and sampling rates. A user can then select thresholds, e.g. 5 or 1 per cent of information loss may be acceptable, and find the appropriate sampling rate.

Direct detectability of driving decisions
Driving decisions can be altered at any time and frequently when a vehicle is being operated. If the frequency of driving decision alteration is considerably high and the data sampling rate is very low, then some driving decisions may be lost. As shown in Figure 2(a), the decision alteration "acceleration to deceleration" between n and n + 1 s is missed by the 1-Hz sampled data (red points), as the speeds at n and n + 1 s are identical. In this case, undersampling causes information loss of micro-driving decisions. The information about going from "acceleration to deceleration" between n and n + 1 s is lost, whereas the information on "deceleration" or "no decision alternation" between n + 1 and n + 2 s is detected directly by the sampled data.
This study uses the 20-Hz simulator driving data to count the number of decisions made in a given time interval, and then computes the probability of intervals in which no decision is made, termed direct detectability of driving decisions. The formula uses the following terms:
N = T × f, the number of time slices during the total data duration T in seconds;
f = target sampling frequency/rate, e.g. 1 Hz;
an indicator for micro-driving decision alternation during the i-th time interval;
t = 1/f, the length of a time interval;
i = 1, 2, 3, ..., N;
v_ij = speed at the j-th location in the i-th time interval, j = 1, 2, 3, ..., n;
n = F/f, the number of available data points in a given time interval; and
F = sampling rate of the original dataset, 20 Hz in this study.

Figure 1 Study steps and indicators
In this study, time intervals in which no decision is made belong to Case 0 (this includes constant acceleration or deceleration), as shown in Figure 2(b); intervals with one micro-decision are referred to as Case 1, and intervals with two or more decision alternations are referred to as Case 2. Case 1 is discussed further below.
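The case counting described above can be sketched as follows. This is a minimal illustration rather than the study's own implementation: the function names, the tolerance `tol`, the treatment of constant-speed steps as "no new decision" and the toy speed trace are all assumptions made here.

```python
def sign(x, tol=1e-9):
    """Return +1 for acceleration, -1 for deceleration, 0 for no speed change."""
    if x > tol:
        return 1
    if x < -tol:
        return -1
    return 0


def count_alternations(window):
    """Count switches between accelerating and decelerating inside one window."""
    last, count = 0, 0
    for prev, curr in zip(window, window[1:]):
        s = sign(curr - prev)
        if s == 0:
            continue  # constant speed: treated as no new decision (assumption)
        if last != 0 and s != last:
            count += 1
        last = s
    return count


def case_shares(speeds, n):
    """Shares of intervals in Case 0, Case 1 and Case 2 (two or more alternations).

    speeds: speed records at the original rate F; n = F / f steps per interval.
    Each window holds n + 1 points, so consecutive windows share a sampled point
    and a switch that falls exactly on a sampled point is not counted as lost.
    """
    num_windows = (len(speeds) - 1) // n
    if num_windows < 1:
        raise ValueError("trace is shorter than one sampling interval")
    counts = [0, 0, 0]
    for i in range(num_windows):
        window = speeds[i * n:(i + 1) * n + 1]
        counts[min(count_alternations(window), 2)] += 1
    return [c / num_windows for c in counts]


# Toy trace (not study data): 12 steps split into 3 windows of n = 4 steps.
toy = [0, 1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 3, 2]
shares = case_shares(toy, 4)
direct_detectability = shares[0]  # share of windows with no alternation
```

In this toy trace each of the three windows falls into a different case, so the direct detectability is one third.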

Indirect detectability of driving decisions
Direct detectability gives the chance of detecting micro-driving decisions directly with the sampled data. Next, this study discusses the chance of detecting driving decisions in Case 1. Driving speed is assumed to change only continuously, without sharp changes. A sine wave is an example of continuous change, whereas square waves and sawtooth waves are examples of sharp changes (Elmore and Heald, 2012).
This study takes the 1-s interval (corresponding to a 1-Hz sampling rate) as the example for illustrating detection of driving decision alternation. Figure 3(a) presents six possible types of micro-driving behavior of Case 1 within 1 s. Types (a) and (c) show that there is a micro-decision made from accelerating to decelerating between n and n + 1 s. Types (b) and (d) show that there is a micro-decision made from decelerating to accelerating between n and n + 1 s.
For Type (a), there is a micro-decision made from accelerating to decelerating between n and n + 1 s, and the speed measurements at n and n + 1 s imply a deceleration during that second. Therefore, the micro-decision made within this second can be observed by using the sampled data points at n and n + 1 s, though the amount/intensity of the driving decision change is not necessarily accurate. In the same fashion, Type (b) illustrates information detection for the micro-decision made from decelerating to accelerating. Therefore, for Types (a) and (b), the micro-decision change can be detected but with an error.
Types (c) and (d) do not meet the situations in Types (a) and (b), because the sampled data do not show the correct micro-decision made between two sampled observations. Types (c) and (d) also include the cases in which the speed at n s is equal to the speed at n + 1 s, as shown in Figure 2(a), because in these cases the sampled observations cannot reveal the micro-decision correctly.

Figure 2 Example of information loss in instantaneous driving decisions

Therefore, we turn to the next second, as shown in Figure 3(b). In Type (c1), the sampled speeds at n + 1 and n + 2 s give a deceleration which uncovers the lost micro-decision made between n and n + 1 s, but with a temporal error. The time stamp for the micro-decision using sampled data is n + 1 s, but it actually occurred between n and n + 1 s. Type (d1) is similar to Type (c1), but for detecting a micro-decision from decelerating to accelerating.
Types (c2) and (d2) illustrate two types of micro-decisions which cannot be easily detected, because there are two distinct micro-decisions (acceleration and acceleration) made in two sequential sampling intervals. Besides, for cases with two or more micro-decisions made within one particular time interval, there is no way to detect them by the above methods. This study mainly discusses Case 1 with one micro-decision made and tries to find the probabilities of having Types (a), (b), (c1) and (d1) given a time interval. The resulting indicator is the indirect detectability of driving decisions. Its formula uses the following terms:
N = T × f, the number of time slices during the total data duration T in seconds;
f = target sampling frequency/rate, e.g. 1 Hz;
an indicator for whether two consecutive micro-decisions are the same (either acceleration or deceleration);
v_ij = speed at the j-th location in the i-th time interval, j = 1, 2, 3, ..., n;
n = F/f, the number of available data points in a given time interval;
F = sampling rate of the original dataset, 20 Hz in this study; and
indicator terms for the individual types, including Type (a), Type (b) and Type (c1).

Instantaneous driving decision loss
With the direct and indirect detectability of driving decisions, we can determine how many micro-driving decisions are detected at a particular sampling rate. The feasibility of detecting micro-driving decisions combines the two, and the instantaneous driving decision loss (MIL1) is its complement:

$$\mathrm{MIL}_1 = 100\% - \left(\text{direct detectability} + P(\text{Case 1}) \times \text{indirect detectability}\right)$$

where P(Case 1) is the share of time intervals with exactly one micro-decision. Empirical results are shown later. Theoretically, higher sampling rates lower the possibility of missing critical decisions, but they increase the possibility of "noise" in the data and the data storage and processing requirements. The challenge is to not lose decision information while reducing the noise in the data.
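The type logic for Case 1 and the combination into the decision-loss indicator can be sketched as below. This is an illustration under stated assumptions, not the paper's formula: the function names and the string labels are hypothetical, Types (c2) and (d2) are identified here only from the sampled speeds of the next interval, and the numbers passed to `decision_loss` are the 1-Hz values reported in the results of this study.

```python
def classify_case1(direction_change, v_n, v_n1, v_n2):
    """Classify a Case 1 interval [n, n+1] from three consecutive sampled speeds.

    direction_change: "acc_to_dec" or "dec_to_acc", the single micro-decision
    made inside the interval (known from the high-rate data).
    v_n, v_n1, v_n2: sampled speeds at n, n + 1 and n + 2 s.
    """
    if direction_change == "acc_to_dec":
        if v_n1 < v_n:
            return "a"  # samples already show a deceleration
        # Otherwise look at the next interval (simplifying assumption).
        return "c1" if v_n2 < v_n1 else "c2"
    if direction_change == "dec_to_acc":
        if v_n1 > v_n:
            return "b"  # samples already show an acceleration
        return "d1" if v_n2 > v_n1 else "d2"
    raise ValueError("unknown direction_change")


def decision_loss(direct, share_case1, indirect):
    """Decision loss in per cent; all inputs are percentages."""
    feasibility = direct + share_case1 * indirect / 100.0
    return 100.0 - feasibility


# 1-Hz values reported in this study: 89.90, 9.20 and 93.92 per cent.
mil1_1hz = decision_loss(89.90, 9.20, 93.92)
```

With the reported 1-Hz values the function returns roughly 1.46 per cent, matching the figure given in the results.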

Indicators concerning magnitudes
It is important to know whether sampled values represent the population and, if not, the magnitude of the errors. In other words, can one point (e.g. 1-Hz data) represent the 20 data points (20-Hz data) during the same second? If the 20 data points provide only marginally more information (such as constant speed during 1 s), one data point might be sufficient for sampling this second. Figure 4(a) shows an example using 20-Hz simulator data, along with two 1-Hz sampled points at n and n + 1 s. The speed is 10 mph at n s and 12 mph at n + 1 s. The question is whether all speed values between n and n + 1 s are within the micro-speed range 10-12 mph. The example shows that, within the 1-s time interval, six data points, or 30 per cent (6 out of 20), have speed values outside the range 10-12 mph. In this case, the two data points with records of 10 and 12 mph cannot fairly represent the driving behavior from n to n + 1 s. The percentage of out-of-range observation (MIL2) is an indicator that captures how many data points are outside the sampled micro-speed range. Theoretically, the value of MIL2 can range from zero to extremely close to 100 per cent.
The formula for the percentage of out-of-range observation (MIL2), equation (4), uses an indicator for out-of-range observation, i.e. whether a data point lies outside the sampled micro-speed range.
The ratio of the sampled micro-speed range to the actual micro-speed range during the same second is another indicator of information loss, termed ratio of sampled to actual range (MIL3), equation (5):

$$\mathrm{MIL}_3 = \frac{\text{sampled micro-speed range}}{\text{actual micro-speed range}}$$

In the example, the sampled micro-speed range is 12 − 10 = 2 mph, whereas the actual micro-speed range is 12.3 − 9.6 = 2.7 mph. The ratio is 2/2.7 = 0.74, or 74 per cent.
A further indicator of information loss is based on speed deviations. The deviations are measured as the distance between the observed speeds and the speeds obtained by linear interpolation of the sampled data: sampled data can be used to linearly interpolate the data points between two time stamps, and the interpolated values can be compared with data observed at a higher frequency (20 Hz in this case). Figure 4(b) uses 20-Hz driving simulator data and measures the observed speed deviation, which is the mean of absolute deviations within time intervals. The related indicator is relative speed deviation (MIL4), which is the average deviation over the interpolated speed values and provides the extent of the deviations. These two quantities are given by equations (6) and (7).
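The magnitude indicators for a single sampling interval can be sketched as follows. This is a hedged illustration, not the study's code: the function name, the choice to average deviation-to-interpolated-speed ratios point by point, the handling of flat windows and the toy 21-point window are assumptions. The toy window is constructed only to reproduce the 10-12 mph example in the text (six of 20 points out of range, actual range 9.6-12.3 mph) and is not study data.

```python
def magnitude_indicators(window):
    """Magnitude indicators for one sampling interval.

    window: high-rate speeds from one sampled point to the next (n + 1 values),
    so window[0] and window[-1] are the two low-rate samples.
    Returns (out-of-range share, sampled/actual range ratio,
             mean absolute deviation, mean relative deviation).
    """
    n = len(window) - 1
    v_start, v_end = window[0], window[-1]
    lo, hi = min(v_start, v_end), max(v_start, v_end)

    # Share of the n high-rate points that fall outside the sampled range.
    out_of_range = sum(1 for v in window[:-1] if v < lo or v > hi) / n

    # Sampled range over actual range; a flat window counts as fully
    # represented (assumption).
    actual_range = max(window) - min(window)
    range_ratio = (hi - lo) / actual_range if actual_range > 0 else 1.0

    # Deviation of observed speeds from the straight line between the samples.
    interpolated = [v_start + (v_end - v_start) * j / n for j in range(n + 1)]
    deviations = [abs(v - p) for v, p in zip(window, interpolated)]
    speed_deviation = sum(deviations) / len(deviations)

    # Relative deviation: deviation divided by interpolated speed, averaged
    # over points with non-zero interpolated speed (assumption).
    ratios = [d / p for d, p in zip(deviations, interpolated) if p > 0]
    relative_deviation = sum(ratios) / len(ratios) if ratios else 0.0
    return out_of_range, range_ratio, speed_deviation, relative_deviation


# Toy window mimicking the example in the text (samples of 10 and 12 mph).
example = [10.0, 9.8, 9.6, 9.9, 10.1, 10.3, 10.5, 10.7, 10.9, 11.1, 11.3,
           11.5, 11.7, 11.9, 12.1, 12.3, 12.2, 11.9, 11.95, 11.98, 12.0]
mil2, mil3, speed_dev, rel_dev = magnitude_indicators(example)
```

For this window the out-of-range share is 30 per cent and the range ratio is about 0.74, as in the worked example; the two deviation values depend on the made-up shape of the toy trace and carry no empirical meaning.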

Index for magnitude of information loss
The instantaneous driving decision loss, percentage of out-of-range observation, ratio of sampled to actual range and relative speed deviation quantify the MIL from different angles. All these indicators are finally expressed as percentages of information loss. These indicators can then be combined (weighted equally) to create the EIL index for a given sampling rate. The formula is as follows:

$$\mathrm{EIL} = \frac{\mathrm{MIL}_1 + \mathrm{MIL}_2 + (100\% - \mathrm{MIL}_3) + \mathrm{MIL}_4}{4} \quad (8)$$

where MIL1 = instantaneous driving decision loss; MIL2 = percentage of out-of-range observations; MIL3 = ratio of sampled to actual range, which enters as 100 per cent minus the ratio so that it measures loss; and MIL4 = relative speed deviation.
Users of data in the transportation context can either choose a threshold for information loss and find the appropriate sampling rate or vice versa.
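The equally weighted index and the threshold check can be illustrated with a few lines of code. The function names are hypothetical; the input values are the 1-Hz indicator values reported in the results of this study.

```python
def extent_of_information_loss(mil1, mil2, mil3_ratio, mil4):
    """Equally weighted EIL in per cent.

    mil3_ratio is the sampled-to-actual range ratio in per cent, so its
    contribution to the loss is 100 - mil3_ratio.
    """
    return (mil1 + mil2 + (100.0 - mil3_ratio) + mil4) / 4.0


def all_losses_below(threshold, mil1, mil2, mil3_ratio, mil4):
    """True if every individual loss indicator is below the threshold."""
    losses = (mil1, mil2, 100.0 - mil3_ratio, mil4)
    return all(loss < threshold for loss in losses)


# 1-Hz values reported in this study (per cent).
eil_1hz = extent_of_information_loss(1.46, 8.77, 95.71, 0.87)
meets_5_per_cent = all_losses_below(5.0, 1.46, 8.77, 95.71, 0.87)
```

At 1 Hz the overall index stays below a 5 per cent threshold (about 3.85 per cent), while the per-indicator check fails because the out-of-range percentage is 8.77 per cent. This is why the two threshold rules discussed in the results lead to different minimum sampling rates.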

Direct detectability of driving decisions
To capture alternations between acceleration and deceleration within the given time interval (e.g. 1 s) corresponding to a sampling rate (e.g. 1 Hz), the number of alternations was counted by using the 20-Hz data. All possible alternations within the data, given different time intervals and starting locations, were counted. If all decisions occur exactly at the sampled points, no information will be lost. For example, in Figure 2, if the data were sampled at n + 0.5 s and n + 1.5 s instead of n and n + 1 s, then the driving decision from accelerating to decelerating could be detected accurately, even though the data are still sampled at 1 Hz. The example in Figure 2 shows that there are 20 possible locations at which to start sampling the 1-Hz data.

Figure 4 Quantifying magnitude errors in sampled data

Figure 5(a) presents the direct detectability, i.e. the probability that no decision is made (Case 0), given a specific time interval, and Figure 5(b) presents the distribution of the probabilities of the three cases (discussed above) for different time intervals. For short time intervals, the starting location does not have a significant influence on the data sampling. Specifically, for a time interval of 1 s (1-Hz sampling rate), the direct detectability is around 89.9 per cent, i.e. Case 0 or no micro-decision made during 1-s intervals. The reason is probably related to the driver reaction time, which is usually more than 1 s (AASHTO, 2011).
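The effect of the starting location can be sketched by repeating the Case 0 count for every possible sampling offset. This is an illustrative sketch only: the function names, the tolerance and the toy triangle-wave trace are assumptions and not part of the study.

```python
def has_alternation(window, tol=1e-9):
    """True if the window contains a switch between speeding up and slowing down."""
    last = 0
    for prev, curr in zip(window, window[1:]):
        diff = curr - prev
        s = 1 if diff > tol else (-1 if diff < -tol else 0)
        if s == 0:
            continue  # constant speed is not treated as a decision (assumption)
        if last != 0 and s != last:
            return True
        last = s
    return False


def direct_detectability_by_offset(speeds, n):
    """Share of Case 0 windows for each of the n possible starting locations."""
    results = []
    for offset in range(n):
        trace = speeds[offset:]
        num_windows = (len(trace) - 1) // n
        if num_windows == 0:
            break
        quiet = sum(
            1 for i in range(num_windows)
            if not has_alternation(trace[i * n:(i + 1) * n + 1])
        )
        results.append(quiet / num_windows)
    return results


# Toy triangle wave (not study data): turning points fall on even indices.
toy = [0, 1, 2, 1, 0, 1, 2, 1, 0]
by_offset = direct_detectability_by_offset(toy, 2)
```

In the toy trace, sampling that starts on a turning point sees every decision at a sampled point, whereas shifting the start by one step hides every decision inside an interval. Real driving data are far less regular, which is consistent with the observation that the starting location matters little for short intervals.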
In Figure 5(b), the percentages of the three cases (i.e. no decision, one decision and two or more decisions made within the sample interval) are provided. Shorter time intervals (higher sampling rates) are related to lower information loss in terms of instantaneous driving decisions, as expected. For a time interval of 1 s (1-Hz sampling rate), Case 1 accounts for 9.2 per cent and Case 2 accounts for 0.9 per cent of sampling intervals. Figure 6(a) shows the percentages of Types (a), (b), (c1) and (d1) in Case 1 (one decision change). Specifically, given a 1-s time interval (or 1-Hz sampling rate), Types (a), (b), (c1) and (d1) constitute 30.99, 25.37, 21.42 and 16.14 per cent of Case 1, where only one micro-decision is made between two sampled data points. These four types of patterns contain detectable driving information. The indirect detectability is the sum of these probabilities, shown in Figure 6(b). For a 1-s time interval (or 1-Hz sampling rate), the indirect detectability is around 30.99 per cent + 25.37 per cent + 21.42 per cent + 16.14 per cent = 93.92 per cent. As the time interval gets longer, the indirect detectability decreases.

Instantaneous driving decision information loss
The combined results of instantaneous driving decision loss are shown in Table I. There is an 89.90 per cent chance that there is no micro-decision (Case 0) within 1 s (1-Hz sampling data, highlighted in Table I) and a 9.20 per cent chance that there is one micro-decision (Case 1). For Case 1 with only one micro-decision, there is a 30.99 per cent chance that the Type (a) decision pattern would occur, and 25.37, 21.42 and 16.14 per cent for Types (b), (c1) and (d1), respectively. These four types include micro-decisions that can be detected. Therefore, in summary, the feasibility of detecting micro-driving decisions for 1-Hz sampling data is 89.90 per cent + 9.20 per cent × (30.99 per cent + 25.37 per cent + 21.42 per cent + 16.14 per cent) = 98.54 per cent, and 1.46 per cent of information about micro-decisions would be lost. Data sampled at rates higher than 0.5 Hz can reflect more than 95 per cent of micro-decisions, and the instantaneous driving decision loss is less than 5 per cent.
Table II shows that lower sampling rates (or longer time intervals) are associated with larger percentages of out-of-range points, smaller ratios of sampled-to-actual range, and larger speed deviations and relative speed deviations, as expected. The percentage of out-of-range points concerns the sampled micro-speed range within a time interval. The sampled micro-speed range is determined by two sequential recorded data points, as shown in Figure 4. The results show that, on average, about 1.75 points (or 8.77 per cent) are outside the sampled micro-speed range for a 1-s time interval (or 1-Hz data), because there is a high probability that no micro-decision change occurs during 1 s. This is consistent with the above finding that, for the time interval of 1 s, the average probability of no micro-decision change is 89.90 per cent (see Figure 5).
For 1-Hz data, the ratio of sampled to actual micro range is 0.957, which means that the 1-Hz data represent about 95.7 per cent of the 20-Hz data in terms of magnitude. Though some data points may fall outside the recorded micro ranges, these points do not deviate broadly. Further, 1-Hz data have an observed speed deviation of about 0.076 mph. Note that the 1st percentile of the 718,481 20-Hz speed records is 0.493 mph, and thus the deviation of 0.076 mph is not substantial in the distribution of speed records. This is consistent with EPA drive cycle data, which are based on 10-Hz data (EPA, 2013). Further, the relative speed deviation, i.e. the ratio of deviation over interpolated speeds, shows that 1-Hz data have a relative speed deviation from the 20-Hz speed records of 0.87 per cent, substantially lower than the 5 per cent threshold.

Extent of information loss
The overall EIL is an equally weighted indicator, calculated using equation (8). The results are shown in Table II. If the sampling rate is 1 Hz, the percentage of out-of-range points is 8.77 per cent, the ratio of sampled to actual range is 95.71 per cent, the relative speed deviation is about 0.87 per cent and the instantaneous driving decision loss is about 1.46 per cent. So, the overall EIL is (8.77 per cent + (100 per cent − 95.71 per cent) + 0.87 per cent + 1.46 per cent)/4 = 3.85 per cent. Thus, overall, about 3.85 per cent of the driving information, including the micro-driving decisions and speed magnitude, might be lost if the sampling rate is 1 Hz instead of 20 Hz. If 5 per cent of information loss is the threshold, a sampling rate higher than 0.8 Hz can be acceptable when only the EIL is considered. If all MILs need to be under 5 per cent for information loss, then 2 Hz might be the lowest sampling rate that meets the information loss threshold. Figure 7 presents the final results quantifying the various information loss indicators at different sampling rates. The results show that different indicators have different levels of information loss at a given sampling rate and that the relationship is nonlinear. At sampling rates higher than 2 Hz, all MILs are under 5 per cent for information loss. MIL2, the percentage of out-of-range observations, tends to have higher values than the other MILs across sampling rates. This indicator may be critical for some purposes, e.g. crash reconstruction and reporting. Therefore, for studies dealing with crashes, especially crash reconstruction studies that are highly sensitive to speed magnitude, higher sampling rates can be beneficial. The curves, including the overall information loss indicator, show that information loss becomes rather high at around the 1- to 2-Hz level.

Limitations
The data used in this study come from a simulator driving test, i.e. they are from a hypothetical but controlled test environment. Having few test subjects is recognized as a limitation, though it is not very germane to this study. The data were sampled at 20 Hz. It is possible that micro-driving decisions between the 20-Hz time-stamped data points were lost. This study assumes that the chance of having micro-decision changes within 0.05 s is very small, given a perception reaction time of about 1 s. In the future, driving data sampled at even higher rates can be used to verify the results of this study. The proposed indicators can be used for analysis of information loss with any range of sampling frequency. The scope of this study is to develop the concept of MILs and the EIL index, which can be used to quantify the extent of information loss when sampling driving behavior data. This study introduced a limited number of indicators, and more indicators can be developed to quantify the information loss. In addition, the results of quantified information loss may vary significantly across different traffic conditions, e.g. urban and rural environments. The road configurations would also have a significant impact on driving behavior. Therefore, the recommended sampling rates for collecting driving behavior data may need to be specified for particular driving conditions of interest.

Conclusions
The key question investigated in this study is: what sampling rates are appropriate to capture micro- or short-term driving decisions? Oversampling can result in noisy data, and waste storage and processing resources. Undersampling can result in loss of information about important instantaneous driving decisions. This study developed indicators of information loss and quantified their relationship with sampling rates. It discussed driving behavior information from two angles: instantaneous driving decisions and speed magnitudes. Four main indicators were created to quantify the magnitudes of driving behavior information loss:
MIL1: instantaneous driving decision loss (combined direct and indirect "detectability");
MIL2: percentage of out-of-range observations;
MIL3: ratio of sampled-to-actual range; and
MIL4: relative speed deviation from linear interpolation of sampled data (based on observed speed deviation over interpolated speed).
These indicators quantify the magnitude of information loss. With these four indicators, an overall index was generated by weighting them equally. The index, termed EIL, simply tells us how much information might be lost at a given sampling rate.
The results show that shorter time intervals (i.e. higher sampling rates) are associated with larger direct detectability of instantaneous driving decisions. In other words, there is a smaller chance of having cases with micro-driving decisions between two sampled data points. Drivers typically keep constant acceleration/deceleration rates during a short time. Specifically, for the time interval of 1 s (i.e. 1-Hz sampling rate), the direct detectability is 89.90 per cent. The high probability of no micro-decision in 1 s may be because of the driver reaction time. The reaction time includes the time for driver perception, identification, judgment and reaction (TRB, 1998). The whole process usually takes more than 1 s (AASHTO, 2011). This study further observed cases of one micro-driving decision made within a particular time interval and discussed the possibility of detecting such micro-driving decisions. By defining the six possible micro-driving decision patterns, the study found that four of the six patterns include micro-driving decisions that can be detected indirectly by using the sampled data points. These four patterns dominate the cases in short time intervals (less than 3 s). Specifically, the indirect detectability for a 1-s time interval (or 1-Hz sampling rate) is around 93.92 per cent. The feasibility of detecting micro-driving decisions combines direct detectability and indirect detectability. Thus, the feasibility of detecting micro-driving decisions with 1-Hz data is 89.90 per cent + 9.20 per cent × 93.92 per cent = 98.54 per cent, and 100 per cent − 98.54 per cent = 1.46 per cent of information about micro-decisions (MIL1) will be lost with 1-Hz data.
The indicators of information loss magnitude reveal that lower sampling rates or longer time intervals are related to more data points being missed because their values are too large or too small. Though there are some data points outside the micro-speed ranges (about 8.77 per cent of points for 1-Hz data, MIL2), these points do not deviate broadly when sampling rates are equal to or higher than 1 Hz.
Specifically, the ratio of sampled to actual ranges (MIL3) is 95.7 per cent for 1-Hz data, and 1-Hz data have an average speed deviation of about 0.076 mph. The small deviation supports the assumption that driving behavior within 1 s shows nearly constant acceleration (EPA, 2013). Further, the relative speed deviation (MIL4) of 1-Hz data from 20-Hz data is around 0.87 per cent. With the four MIL indicators, the overall EIL can be calculated. For a 1-Hz sampling rate, the EIL is about 3.85 per cent.
This study proposed indicators to quantify the MIL regarding longitudinal driving behavior. The indicators can be used individually or combined to create an index. The calculation results are not intended to be used directly by all other driving behavior studies, as the results may vary significantly across different traffic conditions and driver behaviors. The trends of the MILs and the EIL may help researchers understand how information might be lost because of low sampling rates. The calculation process can be easily replicated by other researchers who aim to determine an appropriate sampling rate for their data collection, or to evaluate the extent of information loss for driving behavior data that have been collected at a known sampling rate. The results show that lower sampling rates are associated with greater information loss, but the relationship is nonlinear. This study contributes by developing a methodology to quantify the relationship between sampling rates and information loss. Depending on the objective of their study, researchers can choose the sampling rate necessary to obtain the required accuracy. For some studies, e.g. quantifying energy consumption or emissions, a 2-Hz sampling rate may be sufficient, whereas for safety studies, higher sampling rates may be required. In addition, different indicators may capture different aspects of the information loss while sampling data to study driving behavior. The indicators introduced in this study are for longitudinal driving behavior. Indicators for lateral behavior, such as steering angle, need to be developed in future research.