On the Effects of Various Measures of Performance Selections on Simulation Model Calibration Performance

Objective. This paper examines the effects of various measures of performance (MOP) selections on simulation model calibration performance, in terms of reflecting actual traffic conditions and vehicle interactions. Method. Two intersections in Shanghai were selected for simulation model calibration, one for testing and another for validation. Three effective MOPs were utilized: average travel time (i.e., time passing the intersection), average queue length, and vehicle headway distribution. The counts of three types of traffic conflicts (i.e., crossing, rear-end, and lane change) were used as safety MOPs. Those MOPs, as calibration objectives, were examined and compared. Results. The results of the testing site showed that different effective MOPs had their own advantages: average travel time appeared to be the best in reflecting lane change and rear-end conflicts, while headway distribution achieved the best consistency between simulated and actual crossing conflicts. Compared to the safety MOPs, average travel time and headway distribution still performed better, in that they produced simulated conflict metrics (e.g., TTC, PET) more similar to the actual ones. A multicriteria calibration strategy based on average travel time and headway distribution generally performed better in reflecting actual traffic conditions and vehicle interactions than any single effective or safety MOP. Similar results were found for the validation site. Conclusion. To simulate actual traffic conditions and vehicle interactions, multiple effective MOPs could be considered simultaneously for model calibration, instead of using safety MOPs.


Introduction
Microscopic traffic simulation has been widely utilized for traffic design, control, and management, since it is able to simulate actual traffic conditions so that the unknown impacts of various traffic design and control scenarios can be properly evaluated. In developing a reliable microscopic simulation model, calibration is considered the most critical step, as it matches simulated traffic conditions with the actual environment. To calibrate a simulation model, a measure of performance (MOP) usually needs to be selected as the calibration objective, such as average queue length [1], volume [2], speed [3], travel time (or time passing the intersection) [4], control delay [5], and headway distribution [6]. However, there seems to be no benchmark MOP for microscopic simulation model calibration. According to Park and Qi [4], an MOP was usually selected for the ease of data collection in the field. Such a practice raises the question of whether all MOP selections perform comparably in reflecting actual traffic conditions. To our knowledge, there is limited research on the effects of MOP selections on simulating actual traffic conditions.
To note, most conventional MOPs are effective measures, since simulation was initially conducted for traffic operational studies. Recently, microscopic simulation has also been recognized for traffic safety analysis. For this purpose, it is important to reflect actual vehicle interactions. Previous research found that by incorporating a safety MOP (e.g., conflict counts) for calibration, simulation models appeared to better reflect actual vehicle interactions than models calibrated on effective MOPs [7-11]. However, such a calibration strategy requires additional effort for safety-related data collection. For instance, in order to calibrate a simulation model based on traffic conflicts, conflicts need to be identified and detailed conflict data need to be collected in the field. Collecting conflict data requires either a large amount of human observation or solid knowledge of and skill in computer vision techniques. In practice, such a requirement is very difficult to fulfill, not to mention that video techniques often suffer from adverse weather, occlusions, and algorithm limitations. In general, safety-based calibration is difficult and requires considerable effort. If a calibration strategy based on effective MOPs can perform reasonably well in simulating actual vehicle interactions, safety-related calibration efforts may not be required. Duong et al. [3] discussed this topic by proposing a calibration strategy based on two effective MOPs for freeway segments. The result showed the potential of using effective MOPs for simulating actual vehicle interactions. However, for urban intersections, no literature has addressed this issue.
In light of this, it is necessary to find out how various MOPs (as calibration objectives) affect the consistency between simulated and actual traffic conditions, as well as vehicle interactions. In this paper, we examine the effects of various effective/safety MOP selections on simulation model calibration, in terms of reflecting actual traffic conditions and vehicle interactions at signalized intersections. The aim of this research is to identify a promising calibration strategy that can reasonably reflect both actual traffic conditions and vehicle interactions.

Data Collection
Two signalized intersections in Fengxian District in Shanghai were selected in this study, one for testing and another for validation. At the initial stage of this study, video clips were collected by portable cameras set up on roadside high-rise buildings [11]. This set-up provided enough view coverage to collect most of the data needed, including traffic conflicts, volume, and headway. However, average queue length is an important effective MOP widely used in previous literature, and it had to be included to examine the effects of various MOPs on simulating actual traffic conditions. Thus, to get enough viewing height to cover the whole intersection area (shown in Figure 1), a drone was used to collect video clips during afternoon peak hours (4:30-6:30) on two different weekdays.
Conflicts were identified by four trained observers based on traditional traffic conflict technique (TCT) methods. The intra-observer and inter-observer reliability of those observers had been examined beforehand by having them watch other video clips [11]. A clear evasive action (e.g., braking, swerving) was used as the sign of a conflict. A grid system was placed in the video to match global coordinates with local coordinates, by which vehicle positions and distances could be manually estimated. Video clips were viewed at 25 frames per second, using VideoStudio, to estimate the moving velocity of vehicles. Furthermore, conflict metrics were calculated by the observers, including Time-to-Collision (TTC) and Post-Encroachment Time (PET).
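As an illustration of how such metrics follow from manually estimated positions and frame counts, the sketch below uses the common textbook definitions of TTC (gap divided by closing speed) and PET (time between one vehicle leaving a conflict area and the next one entering it). The function names, the car-following form of TTC, and the numeric inputs are assumptions made for this example; they are not the observers' actual worksheet.

```python
def speed_from_frames(distance_m, n_frames, fps=25.0):
    """Speed (m/s) from the distance covered over a number of video frames."""
    return distance_m * fps / n_frames


def time_to_collision(gap_m, follower_speed_mps, leader_speed_mps):
    """TTC (s) for a following pair: gap divided by closing speed.

    Returns None when the follower is not closing in on the leader,
    because no collision would occur at constant speeds.
    """
    closing_speed = follower_speed_mps - leader_speed_mps
    if closing_speed <= 0:
        return None
    return gap_m / closing_speed


def post_encroachment_time(first_exit_s, second_entry_s):
    """PET (s): time between the first vehicle leaving the conflict area
    and the second vehicle entering it."""
    return second_entry_s - first_exit_s


# Hypothetical readings taken from a 25 fps clip.
follower_speed = speed_from_frames(distance_m=6.0, n_frames=10)   # 15.0 m/s
leader_speed = speed_from_frames(distance_m=3.6, n_frames=10)     # 9.0 m/s
ttc = time_to_collision(12.0, follower_speed, leader_speed)
pet = post_encroachment_time(first_exit_s=10.2, second_entry_s=11.7)
```

Both example values fall under the 3-second range in which most field observations of this study were reported to lie.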
It should be noted that conflicts could also be identified by automated video techniques, including unsupervised or semisupervised learning algorithms [12]. Unsupervised learning algorithms can quickly differentiate conflicts without manually labeled data. However, they often suffer from low detection accuracy [12]. On the other hand, supervised or semisupervised learning algorithms can better identify conflicts but require much effort to collect training data. In general, there is an obvious trade-off between efficiency and accuracy in video-based conflict identification and data collection. Moreover, those algorithms still inherently depend on human judgment of conflicts for either data labeling or validation. Thus, from any perspective, conflict identification could be a difficult task for middle- to large-scale microscopic traffic simulation calibration. The derivation of vehicle speed and conflict metrics may be improved by extracting high-resolution vehicle trajectories based on automated video techniques [13,14]. However, the reliability of such methods still suffers from adverse weather, varying light conditions, and occlusion.
Thus, in our study, more than ten hours of manual calculation effort were spent for each intersection to obtain those detailed conflict data. This data collection process was believed to adequately capture actual traffic conditions and vehicle interactions in the real world. The labor-assisted conflict identification approach was believed to be a more straightforward and reliable way to identify conflicts in this study. Admittedly, when a larger number of intersections is considered, identifying conflicts by human effort will become labor-intensive. This also highlights the necessity of finding effective MOPs that are comparable to safety MOPs (e.g., conflicts) for microscopic traffic simulation calibration.
Figure 1 presents the aerial photos of the two intersections, as well as their sketches with numbered lanes. All effective MOPs were calculated for each lane (i.e., lane-based) and aggregated into one-hour intervals. Thus, for the testing site, there were 14 (lanes) × 4 (hours) = 56 samples. For the validation site, there were 12 (lanes) × 4 (hours) = 48 samples. Conflicts were not treated as lane-specific, because many are difficult to assign to a single lane (e.g., crossing and lane change conflicts). Instead, they were analyzed by total counts, PET, and TTC. Table 1 presents the summary statistics of the traffic and conflict data collected in this study. Detailed lane-specific traffic data are included in Appendices A and B.

Simulation Model Development
In this study, a commercial microscopic simulation package, VISSIM, was used to develop simulation models for the two intersections. VISSIM has been extensively used in simulation studies [9-11, 15-17], due to its high flexibility in simulating actual traffic conditions, especially microscopic driver behaviors [18,19]. Geometric data were collected based on aerial photographs captured by drone and field observations. Technical drawings were also provided by the Fengxian Highway Administration, with detailed roadway geometrics including the number and width of lanes, length of storage bays and tapers, and details of lane utilization.
Traffic control details were obtained from the Fengxian Traffic Police Department and field observation, including posted speed limits, signal timing plans, and movement permissions. Traffic flow data, including traffic counts and vehicle composition, were mostly collected from videos. Those data were used as the input for developing a base simulation model for each intersection.
As for simulation calibration, many MOPs have been used in previous studies [7], including average delay, average travel time, number of stops, headway distribution, and average queue length. In this study, average travel time (time passing an intersection), headway distribution, and average queue length were utilized as effective MOPs. Each was examined for its effect on reflecting actual traffic conditions. In previous literature, conflict counts were often used as a safety MOP to reflect actual vehicle interactions. Thus, in this study, conflict counts were used for model calibration, while TTC and PET were used to verify the calibrated simulation models.

Simulation Model Calibration for Testing Site
The testing site was used to examine the effects of various MOP selections on reflecting actual traffic conditions and vehicle interactions. Effective MOPs include average travel time, average queue length, and headway distribution. Conflicts were classified into three types: crossing, rear-end, and lane change. The counts of each type were used as a safety MOP, resulting in three safety MOPs. The Surrogate Safety Assessment Model (SSAM) was used to extract the three types of simulated conflicts, with the suggested angle thresholds (30° and 85°) [20]. As for the PET/TTC thresholds, 3 seconds was used because most actual observations fell within this range (as shown in Table 1). First, a set of ANOVA tests was applied to identify the parameters sensitive to each effective and safety MOP. Since the two intersections are in an urban area, parameters of the Wiedemann 74 model were selected for model calibration. The list and the acceptable ranges of driver behavior parameters in VISSIM were initially determined based on previous literature, and a Latin Hypercube Sampling (LHS)-based approach was adopted beforehand to ensure that the full range of parameter combinations was sampled [21]. Based on the ANOVA tests, a set of parameters was found to be sensitive. Table 2 presents the results.
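The sampling step can be sketched as follows. This is a minimal standard-library Latin Hypercube sampler: each parameter range is cut into as many equal strata as there are samples, one value is drawn per stratum, and the strata are shuffled independently per parameter. The parameter names echo those discussed in this paper, but the numeric ranges and the function name `latin_hypercube` are placeholders for illustration, not the ranges used in the study.

```python
import random

# Hypothetical ranges; the study took its ranges from previous literature.
PARAM_RANGES = {
    "average_standstill_distance": (1.0, 3.0),
    "additive_part_of_safety_distance": (1.0, 5.0),
    "multiplicative_part_of_safety_distance": (1.0, 6.0),
}


def latin_hypercube(ranges, n_samples, seed=0):
    """Return n_samples parameter sets covering every stratum of every range."""
    rng = random.Random(seed)
    columns = {}
    for name, (low, high) in ranges.items():
        width = (high - low) / n_samples
        # One uniformly drawn point inside each of the n_samples strata.
        points = [low + width * (i + rng.random()) for i in range(n_samples)]
        # Shuffle so strata of different parameters are paired at random.
        rng.shuffle(points)
        columns[name] = points
    return [{name: columns[name][i] for name in ranges} for i in range(n_samples)]


samples = latin_hypercube(PARAM_RANGES, n_samples=10)
```

Each of the resulting parameter sets would be run through the simulation, and the simulated MOPs would then feed the ANOVA tests.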
From Table 2, some parameters were found to be sensitive to both effective and safety MOPs. For example, minimum gap time was found to be sensitive to both headway distribution and crossing conflict counts. This result was consistent with previous literature [7]. Average standstill distance and the additive and multiplicative parts of desired safety distance affected average travel time, average queue length, and rear-end conflict counts. Those parameters reflect interactions between leading and following vehicles. The safety distance reduction factor and min headway were found to be sensitive to average travel time and lane change conflict counts. This is reasonable since those parameters mainly control lane change behaviors. Another finding is that different effective MOPs have different sensitive parameters.
Then, a Genetic Algorithm (GA) was applied to calibrate the simulation model based on each MOP selection, respectively. GA is an adaptive heuristic search algorithm that has been commonly used in simulation calibration to find the driving behavior parameters that reflect actual traffic conditions to the largest extent [7]. GA includes four main steps: (1) population generation; (2) selection; (3) crossover; and (4) mutation. Normally, the initial population needs to be randomly generated. In this case, a random combination of sensitive parameters was considered as an individual of the GA. Ten randomly generated individuals were included in the initial population. A selection operator gives preference to better individuals, allowing them to pass on their genes to the next generation. Normally, a fitness function is created to determine the goodness of individuals. Regarding simulation calibration, various indicators can be used, such as mean absolute percentage error (MAPE), root mean square error (RMSE), mean square error (MSE), mean absolute error (MAE), and the Pearson correlation coefficient. In our study, MAPE was used for developing the fitness functions, while all other listed indicators were calculated for verifying calibration performances. The equations of MAPE, RMSE, MSE, MAE, and the correlation coefficient can be written as

$$\mathrm{MAPE} = \frac{100\%}{N} \sum_{k=1}^{N} \left| \frac{C_k - A_k}{A_k} \right|$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{k=1}^{N} (C_k - A_k)^2}$$

$$\mathrm{MSE} = \frac{1}{N} \sum_{k=1}^{N} (C_k - A_k)^2$$

$$\mathrm{MAE} = \frac{1}{N} \sum_{k=1}^{N} \left| C_k - A_k \right|$$

$$r = \frac{\sum_{k=1}^{N} (C_k - \bar{C})(A_k - \bar{A})}{\sqrt{\sum_{k=1}^{N} (C_k - \bar{C})^2 \, \sum_{k=1}^{N} (A_k - \bar{A})^2}}$$

where $N$ is the total number of sample points; $C_k$ is the calibrated MOP for the $k$th sample point; $A_k$ is the actual MOP for the $k$th sample point; and $\bar{C}$ and $\bar{A}$ are the average values of the calibrated and actual MOP over all sample points, respectively.
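For readers who wish to reproduce the indicators, the short sketch below computes MAPE, RMSE, MSE, MAE, and the Pearson correlation coefficient from paired calibrated and actual values, using their standard definitions. The function name `calibration_errors` and the three sample values are made up for this example.

```python
import math


def calibration_errors(calibrated, actual):
    """Return the five verification indicators for paired MOP values."""
    n = len(actual)
    diffs = [c - a for c, a in zip(calibrated, actual)]
    mse = sum(d * d for d in diffs) / n
    mean_c = sum(calibrated) / n
    mean_a = sum(actual) / n
    cov = sum((c - mean_c) * (a - mean_a) for c, a in zip(calibrated, actual))
    var_c = sum((c - mean_c) ** 2 for c in calibrated)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    return {
        # MAPE is undefined if any actual value is zero.
        "MAPE": 100.0 / n * sum(abs(d) / abs(a) for d, a in zip(diffs, actual)),
        "RMSE": math.sqrt(mse),
        "MSE": mse,
        "MAE": sum(abs(d) for d in diffs) / n,
        "r": cov / math.sqrt(var_c * var_a),
    }


# Hypothetical lane-based travel times (s): actual versus calibrated.
actual_values = [10.0, 20.0, 40.0]
calibrated_values = [12.0, 18.0, 44.0]
errors = calibration_errors(calibrated_values, actual_values)
```

For these sample values the relative errors are 20%, 10%, and 10%, so the MAPE is about 13.3%, and the MSE is 8.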
After that, a crossover operator needs to be applied to create one new "offspring" individual based on two "parent" individuals. That is, two sets of parameter values were used to create a new parameter set, which took each value from either of them. The last step is mutation. A portion of the new individuals (i.e., parameter combinations) is randomly selected and modified, to maintain diversity within the population and inhibit premature convergence. This process was repeated until the calibration converged. In this study, population size, crossover rate, and mutation rate were set as 10, 0.6, and 0.3, respectively. All calibration processes converged within 80 generations.
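The four GA steps can be sketched in a few lines. The version below is a simplified stand-in, not the implementation used in this study: it assumes binary tournament selection, uniform crossover, single-gene random-reset mutation, and elitism (the best individual is always carried over), none of which is specified in the text. The fitness function `toy_mape` is a hypothetical placeholder for the MAPE that would be obtained from a simulation run with the given parameters.

```python
import random


def genetic_search(fitness, bounds, pop_size=10, crossover_rate=0.6,
                   mutation_rate=0.3, generations=80, seed=0):
    """Minimise `fitness` over box-bounded parameters with a simple GA."""
    rng = random.Random(seed)

    def tournament(population):
        # Selection: the better of two randomly picked individuals.
        a, b = rng.sample(population, 2)
        return a if fitness(a) <= fitness(b) else b

    # Population generation: random parameter combinations.
    population = [[rng.uniform(lo, hi) for lo, hi in bounds]
                  for _ in range(pop_size)]
    best = min(population, key=fitness)
    for _ in range(generations):
        offspring = [best[:]]  # elitism (an assumption of this sketch)
        while len(offspring) < pop_size:
            parent_a, parent_b = tournament(population), tournament(population)
            if rng.random() < crossover_rate:
                # Crossover: each value comes from either parent.
                child = [rng.choice(genes) for genes in zip(parent_a, parent_b)]
            else:
                child = parent_a[:]
            if rng.random() < mutation_rate:
                # Mutation: redraw one randomly chosen parameter.
                i = rng.randrange(len(bounds))
                child[i] = rng.uniform(*bounds[i])
            offspring.append(child)
        population = offspring
        best = min(population, key=fitness)
    return best, fitness(best)


def toy_mape(params, target=(2.0, 3.0)):
    """Placeholder fitness: percentage deviation from a made-up optimum."""
    return 100.0 * sum(abs(p - t) / t for p, t in zip(params, target)) / len(target)


BOUNDS = [(1.0, 3.0), (1.0, 5.0)]  # hypothetical parameter ranges
best_params, best_score = genetic_search(toy_mape, BOUNDS)
```

In an actual calibration, each fitness evaluation would require running the simulation model, which is why a small population (10) and a limited number of generations matter in practice.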
Table 3 shows the optimal parameter values after GA convergence, for each MOP selection. Note that a multicriteria calibration result based on headway and average travel time is also reported in Table 3. Based on the results of each single effective MOP, a multicriteria calibration strategy based on effective MOPs was further applied. Each MOP was assigned a weight and the calibration objective was set as

$$\min_{\theta} \; \sum_{i} w_i \, z_i(\theta)$$

where $w_i$ is the weight for the $i$th MOP; $z_i$ is the performance indicator of the $i$th MOP (e.g., MAPE of queue length); and $\theta$ is the vector of parameters to be calibrated.
Multicriteria optimization problems have been studied for years [22,23]. In this study, equal weights were used for GA calibration. Among the multicriteria scenarios (i.e., travel time and queue length, travel time and headway, headway and queue length), the scenario based on headway and travel time showed the best overall calibration performance.
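A weighted-sum objective with equal weights can be written compactly, as in the sketch below. Whether the weights in the study were normalized to sum to one is not stated, so the normalization here is an assumption, and the function name and MAPE values are illustrative only.

```python
def multicriteria_objective(mape_by_mop, weights=None):
    """Weighted sum of per-MOP performance indicators (lower is better)."""
    if weights is None:
        # Equal weights, normalised to sum to one (an assumption).
        weights = {name: 1.0 / len(mape_by_mop) for name in mape_by_mop}
    return sum(weights[name] * mape_by_mop[name] for name in mape_by_mop)


# Hypothetical MAPEs (%) from one simulation run.
run_mapes = {"average_travel_time": 10.0, "headway_distribution": 20.0}
combined = multicriteria_objective(run_mapes)
unequal = multicriteria_objective(
    run_mapes, {"average_travel_time": 0.75, "headway_distribution": 0.25})
```

With equal weights the two MAPEs average to 15; passing explicit weights shows how an unequal weighting would shift the objective toward one MOP.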
Figure 2 illustrates the whole calibration procedure applied in the study.
To examine the effects of different MOPs on calibration performance, both effective and safety perspectives were considered. The effective measures used for performance verification are the same as those used for calibration. For each conflict type, three safety-based measures were selected to verify calibration performance: conflict counts, average TTC, and average PET. In most literature, only conflict counts were used for model verification. However, to accurately reflect actual vehicle interactions, detailed conflict metrics were also believed to be important. Table 4 presents the verification of calibration performances, in terms of MAPE. Note that, for safety-based calibration, only the best performance for each conflict type is presented.
For crossing conflicts, the safety-based calibration strategy resulted in the lowest MAPE (29%) for total conflict counts. This is reasonable since conflict counts are the direct calibration objective of safety-based calibration. Among effective MOPs, headway distribution appeared to be a better calibration objective than average travel time and average queue length, yielding a 34% MAPE for crossing conflict counts. Moreover, it yielded an 18% MAPE for average TTC and a 23% MAPE for average PET, which is better than the safety MOPs. For rear-end conflicts, calibration based on the safety MOP (i.e., conflict counts) resulted in the lowest MAPE (21.4%) for conflict counts, followed by average travel time (24.2%). However, as for average TTC and PET, average travel time appeared to be the best MOP among all, with the lowest MAPEs. For lane change conflicts, average travel time was shown to be the best MOP for calibration, with the lowest MAPE for conflict counts, average TTC, and average PET. On the other hand, calibrating on safety MOPs appeared to be unable to reflect actual vehicle interactions, as it produced relatively high MAPEs for the detailed safety-based verification measures (i.e., average TTC and PET).
It is reasonable that safety-based calibration generally resulted in lower MAPEs for conflict counts. Previous literature also reported similar results and claimed the advantage of such strategies over conventional effective-based calibration strategies. However, in this study, safety-based strategies were not found to yield lower MAPEs for average TTC and PET.
Headway distribution appeared to be the best calibration objective for crossing conflicts, presenting the lowest MAPEs for average TTC and PET. However, it did not perform very well in reflecting the other two types of conflicts. Average travel time better reflected actual vehicle interactions in terms of rear-end and lane change conflicts. However, it produced relatively high MAPEs for crossing conflicts. Average queue length generally resulted in high MAPEs for all three conflict types. Thus, headway distribution and average travel time appeared to be complementary in matching simulated vehicle interactions with actual vehicle interactions. More importantly, they are comparable to or even better than the safety MOPs used for calibration in this study. Thus, it is reasonable to expect that a multicriteria calibration based on two effective MOPs can be better than a single effective MOP or safety MOP. This was supported by the multicriteria calibration results (i.e., headway distribution and average travel time), which provided reasonable MAPEs for all verification measures. Even though some MAPEs are slightly higher than those of calibration based on single effective/safety MOPs, the results can still be considered acceptable given the random nature of simulation as well as the potential trade-off between the two effective MOPs.

Calibration Results for Validation Site
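To examine the transferability of the effects of MOP selections on simulation calibration performances, another intersection was examined based on the same calibration procedure. First, the sensitive parameters were found to be comparable to those of the testing site. Then, calibration efforts based on single effective MOPs and safety MOPs were conducted. Calibration details are shown in Table 5. Similarly, safety-based calibration was unable to capture detailed vehicle interactions, in terms of average TTC and PET. As for effective MOPs, average travel time appeared to be the best calibration objective for lane change conflicts, and headway distribution outperformed other MOPs in terms of having the lowest MAPE for crossing conflicts. For rear-end conflicts, average queue length and average travel time had comparable MAPEs. Average queue length had slightly better MAPEs for average PET and TTC. Due to those effects, a multicriteria calibration effort was conducted based on average travel time and headway distribution. The results were consistent with the testing site. Again, the multicriteria calibration appeared to find a balance between the two single effective MOPs by providing acceptable performances in reflecting all types of actual vehicle interactions (i.e., crossing, rear-end, and lane change). It also outperformed the safety-based calibration strategies. Calibration performances are reported in Table 6.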
To further examine calibration performances, a lane-based test was also applied. Figure 3 shows the simulated observations (based on the multicriteria calibration) versus actual observations, for each lane with a one-hour aggregation. All performance indicators were reported, including MAPE, correlation, RMSE, MSE, and MAE. Generally, the multicriteria calibration reasonably reflected lane-based MOPs, for both sites.
Additionally, the TTC/PET distributions of actual conflicts, safety-based calibration results, and the multicriteria calibration were compared for the two sites. Details are shown in Appendices C and D. The multicriteria calibration resulted in simulated TTC and PET distributions that are more similar to the actual TTC/PET distributions than those of the safety-based calibrations. Chi-square tests were also conducted. For the multicriteria calibration strategy, no significant difference was found between simulated and actual TTC/PET in most cases.
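A comparison of this kind can be illustrated with a chi-square goodness-of-fit statistic on binned TTC (or PET) counts. The sketch below treats the field distribution as the expected one, scaled to the simulated sample size; this test design, the bin edges, the counts, and the 5% significance level are all assumptions for the example, since the exact set-up of the tests is not described here. The critical values are the standard chi-square table entries at the 0.05 level.

```python
# Chi-square critical values at the 0.05 level, indexed by degrees of freedom.
CHI2_CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070}


def chi_square_statistic(simulated_counts, field_counts):
    """Goodness-of-fit statistic of simulated counts against field proportions."""
    if len(simulated_counts) != len(field_counts):
        raise ValueError("both histograms must use the same bins")
    total_sim = sum(simulated_counts)
    total_field = sum(field_counts)
    statistic = 0.0
    for sim, field in zip(simulated_counts, field_counts):
        # Expected simulated count if simulation followed the field distribution.
        expected = field * total_sim / total_field
        if expected == 0:
            raise ValueError("merge bins so that every expected count is positive")
        statistic += (sim - expected) ** 2 / expected
    return statistic


# Hypothetical conflict counts in TTC bins 0-1 s, 1-2 s, and 2-3 s.
field_ttc = [10, 30, 20]
simulated_ttc = [12, 27, 21]
statistic = chi_square_statistic(simulated_ttc, field_ttc)
degrees_of_freedom = len(field_ttc) - 1
differs = statistic > CHI2_CRITICAL_05[degrees_of_freedom]
```

Here the statistic is 0.75, well below the critical value of 5.991 for two degrees of freedom, so the made-up simulated distribution would not be judged significantly different from the field distribution.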

Conclusion
This paper examines the effects of various MOP selections on simulation model calibration performances, in terms of reflecting actual traffic flow conditions and vehicle interactions at signalized intersections. Two intersections in Fengxian, Shanghai, were selected in the study: one site was used for testing those effects and another site was used for validation. Average travel time (i.e., time passing the intersection), average queue length, and headway distribution were used as effective MOPs for model calibration, which were also commonly used in previous literature. Actual vehicle interactions were represented by traffic conflict data collected in the field. Conflicts were carefully determined by four reliable trained observers, while detailed conflict metrics (e.g., TTC, PET) were manually calculated. The counts of three conflict types (i.e., crossing, rear-end, and lane change) were selected as safety MOPs for simulation calibration. A commercial simulation package, VISSIM, was utilized to develop base simulation models for the two intersections, based on fundamental geometric, traffic, and control data.
For the testing site, driving behavior parameters in VISSIM were examined for their sensitivity to the effective MOPs, as well as the safety MOPs. An LHS-based approach was adopted beforehand to ensure that the full range of parameter combinations was sampled, so that ANOVA tests could be properly applied to examine the sensitivity of each parameter. A genetic algorithm (GA) was utilized to calibrate the microscopic simulation model, and different MOPs were compared by examining the consistency between simulated outputs and actual data. The results showed that average travel time outperformed other effective MOPs for rear-end and lane change conflicts, presenting the lowest MAPEs of average TTC and PET. Headway distribution was found to better reflect crossing vehicle interactions, producing lower MAPEs of average TTC and PET for crossing conflicts. On the other hand, using only conflict counts for model calibration did not result in higher consistency between simulated and actual vehicle interactions, especially for detailed conflict metrics (i.e., average TTC and PET). A multicriteria calibration strategy based on headway and average travel time showed considerable consistency between simulated outputs and actual data. Although it did not outperform single effective MOPs or safety MOPs in every aspect, it was considered a promising selection for simulating actual traffic conditions and vehicle interactions. To further examine the transferability of those effects, the same calibration process was applied to another intersection. Similar results were found.
Based on the results, some conclusions can be drawn as follows: (1) different effective MOPs may have different sensitivity parameters for model calibration; (2) different effective MOPs have their own advantages in reflecting actual vehicle interactions; (3) conflict counts may not be the optimal safety MOP for simulation calibration, in terms of reflecting actual vehicle interactions; (4) compared to conflict counts, using multiple MOPs resulted in better calibration performances in reflecting vehicle interactions, especially for detailed conflict metrics (i.e., TTC and PET). Thus, a multicriteria calibration strategy based on effective MOPs could be considered to simulate both actual traffic conditions and vehicle interactions.
Admittedly, this research has some issues that need to be further addressed. First, only two intersections were considered in this study, which may not be enough to determine the transferability of the results. This was partly due to the difficulty of collecting safety MOPs, which also highlights the necessity of calibrating proper effective MOPs as surrogates. Future validation studies could incorporate more intersections and more data collection efforts. Second, more complicated calibration strategies could be examined in the future. Currently, equal weights were used for calibration, and it would be interesting to identify the effect of unequal weights on calibration. Other safety-based calibration methods could also be compared, such as the two-stage calibration strategy [7]. Third, more effective MOPs could be examined, such as average number of stops. In addition, traffic facilities other than signalized intersections could be examined to validate the transferability of the effects of various MOPs as calibration objectives. We recommend that future research focus on these topics.

Figure 1: Aerial photos and sketches of the testing site (left) and the validation site (right).

Figure 2: A general simulation calibration procedure applied in the study.

Figure 3: Simulated lane-based observations versus actual observations at each time interval at testing site (left) and validation site (right).

Figure 4: TTC distribution of multicriteria calibration, safety-based calibration, and the field.

Figure 5: PET distribution of multicriteria calibration, safety-based calibration, and the field.

Table 1: Summary statistics of collected traffic and conflict data.

Table 2: ANOVA tests on sensitive driving behavior parameters to MOPs.

Table 3: Calibrated driving behavior parameters for testing site.

Table 4: The verification of model calibration performance for testing site (in MAPE).

Table 5: Calibrated driving behavior parameters for validation site.
* indicates calibrated values.

Table 6: The verification of model calibration performance for validation site (in MAPE).

Table 7: Lane-based measures of performance at the testing site.

Table 8: Lane-based measures of performance at the validation site.