Adopting Machine Learning and Spatial Analysis Techniques for Driver Risk Assessment: Insights from a Case Study

Traffic violations usually caused by aggressive driving behavior are often seen as a primary contributor to traffic crashes. Violations are either caused by an unintentional or deliberate act of drivers that jeopardize the lives of fellow drivers, pedestrians, and property. This study is aimed to investigate different traffic violations (overspeeding, wrong-way driving, illegal parking, non-compliance traffic control devices, etc.) using spatial analysis and different machine learning methods. Georeferenced violation data along two expressways (S308 and S219) for the year 2016 was obtained from the traffic police department, in the city of Luzhou, China. Detailed descriptive analysis of the data showed that wrong-way driving was the most common violation type observed. Inverse Distance Weighted (IDW) interpolation in the ArcMap Geographic Information System (GIS) was used to develop violation hotspots zones to guide on efficient use of limited resources during the treatment of high-risk sites. Lastly, a systematic Machine Learning (ML) framework, such as K Nearest Neighbors (KNN) models (using k = 3, 5, 7, 10, and 12), support vector machine (SVM), and CN2 Rule Inducer, was utilized for classification and prediction of each violation type as a function of several explanatory variables. The predictive performance of proposed ML models was examined using different evaluation metrics, such as Area Under the Curve (AUC), F-score, precision, recall, specificity, and run time. The results also showed that the KNN model with k = 7 using manhattan evaluation had an accuracy of 99% and outperformed the SVM and CN2 Rule Inducer. The outcome of this study could provide the practitioners and decision-makers with essential insights for appropriate engineering and traffic control measures to improve the safety of road-users.


Introduction
Road transport is considered the backbone of the nation's economy. In China, rapid economic growth during the past three decades has brought a revolution in the transportation industry. The motorization rate has witnessed exponential growth, particularly in urban areas. Though this rapid expansion of urban transport infrastructure has inarguable benefits for various businesses, it has caused serious agony in the form of extreme traffic congestion, limited parking facilities, increases air pollution, and noise pollution, as well as safety concerns. For example, a study reported that a total Tselentis et al. utilized Data Envelopment Analysis (DEA) based framework for evaluation and benchmarking of drivers' safety efficiency under a naturalistic driving environment [33]. The driver population (N = 56) was divided into three categories, i.e., less efficient, weakly efficient, and most efficient. Eboli et al. in their study, proposed a novel framework considering kinematic parameters, such as speed, lateral, and longitudinal acceleration profiles to establish whether driver behavior was safe or unsafe [34]. Jovanovic et al. investigated the speeding violation factors and assessed the predictive validity of the adjusted Theory of Planned Behavior (TPB) model in association with speed violation [35]. Warner and Aberg attempted to explore about drivers' perspective of speeding violation and predicted the speeding intent. A total of 162 car owners were selected for the analysis. The findings revealed that the indicators, including mood, attitude, social norms, and perceived behaviors, seem to be more effective in predicting the speeding intentions of the drivers [36]. Likewise, Kim et al. introduced a method for regulating parking violations using camera captured images through computer vision techniques. The result showed that the illegal parking was determined by the conformity of the lanes and the vehicle's shadow [37]. De Winter and Dodou conducted a detailed meta-analysis that revealed that age and annual mileage have a great deal of association with drivers' errors and violations. The authors found that young driving age appeared to be associated positively with violations and errors. Toledo et al. assessed the potential for an in-vehicle recorder system to monitor road-driver behavior [38]. Implementation of the proposed system using the Drive Diagnostics system showed that short-term rates and risk indices might be reduced. In another study, researchers designed a combined method considering both objective and subjective parameters to identify crash risk levels [39]. Based on the study results, the authors classified three ranges for being involved in a crash, i.e., low, medium, and high.
In recent years, major Chinese metropolitans witnessed an increasing trend in traffic violations. These violations account for 75% of total crashing occurring in the country [40]. Traditionally, statistical modeling or simulation-based approaches have been widely used for examining aggressive driving behavior and analysis of traffic violations. However, these methods have several underlying assumptions and are unable to estimate associations between predictor variables in a realistic fashion. Further, a vast majority of such studies have focused on analyzing the patterns of traffic violations in the urban metropolitan, whereas factors contributing to traffic violations along expressways have been scarcely explored. To fill this gap, we utilized state-of-the-art Machine Learning (ML) models to predict violations taking into account various spatio-temporal attributes. The main contributions of current work are: (i) we advocate the application of Inverse Distance Weighted (IDW) method of interpolation in ArcMap (Geographic Information System (GIS)) to identify violation hotspots along expressways; (ii) proposed a systematic ML framework, including SVM, CN2 Rule Inducer, and K Nearest Neighbors (KNN) to classify and predict traffic violations considering a number of available explanatory variables; (iii) performed comprehensive comparative analysis for proposed ML algorithms based on several classification evaluation metrics; and (iv) our results showed that KNN (with k = 7) outperformed other models.
The remainder of the paper is structured as follows. Section 2 presents description of study area, data collection, and detailed methods for hotspot analysis and violation prediction using ML. Section 3 provides study results and discussion, highlighting key descriptive anlaysis, mapping of violation hotspots, ML models' prediction comparions, and Spearman correlation analysis. Finally, Section 4, provides conclusions, study limitations, and outlook for future studies.

Selection of Study Area
The city of Luzhou (shown in Figure 1) was selected as the study area. It is a prefecture-level municipality with an area of 12,246 km 2 and a population over 1 million and is located in the southeast of Sichuan Province, China. Located at the combination of the Tuo River and Yangtze River, the Luzhou port on the Yangtze River is the major port of Sichuan since the Chongqing Province in 1997 [41].
As per the National Bureau of the Statistics People's Republic of China (PRC), by 2017, the country had 4.77 million of paved roads and over 300 million registered vehicles [1]. Luzhou port on the Yangtze River is the major port of Sichuan since the Chongqing Province in 1997 [41]. As per the National Bureau of the Statistics People's Republic of China (PRC), by 2017, the country had 4.77 million of paved roads and over 300 million registered vehicles [1].

Figure 1.
Locations of S219 and S308 in the study area (from Google Maps).

Data Collection
Traffic violation data for the year 2016 was obtained from the department of Sichuan traffic police in Luzhou city and collected from off-site traffic enforcement cameras system. Therefore, traffic violations along expressways road segments may be missing. It is worth stating that the license plate number of vehicles was not available in the dataset, which bounds its application to record a particular driver's history of the violation. It is a significant shortcoming that could be addressed in future research for better analysis of violations by various vehicle types' drivers.

GIS-Based Analysis for Violation Hotspots
Spatial Statistic toolbox in ArcMap was used for distribution and identification hotspots along the two expressways. Four steps were carried out to find the hotspots in the study area. In the first step, traffic violation point data was loaded in GIS. Each point represented a traffic violation caused by any of the four types, i.e., Violation of prohibited markings, Wrong-way driving, Illegal parking, or Overspeeding. In the second step, Data clusters were created in order to convert the data to a weighted point. A spatial statistics tool (Collect Event) was used in ArcMap in order to convert the data to a weighted point. Weights were assigned based on the frequency of the traffic violation at a particular point. Collect Event tools combines all coincident points that have the same X and Y centroid coordinates. During the third step, the Getis-Ord statistic was used to identify traffic violation hotspots. A high value of Getis-Ord statistic shows that a cluster is having high index values (hot spots), and a low value of Getis-Ord statistic represents a cluster of low index values (cold spots). Mathematically, the Getis-Ord statistic and its z-score are expressed by the following equations.

Data Collection
Traffic violation data for the year 2016 was obtained from the department of Sichuan traffic police in Luzhou city and collected from off-site traffic enforcement cameras system. Therefore, traffic violations along expressways road segments may be missing. It is worth stating that the license plate number of vehicles was not available in the dataset, which bounds its application to record a particular driver's history of the violation. It is a significant shortcoming that could be addressed in future research for better analysis of violations by various vehicle types' drivers.

GIS-Based Analysis for Violation Hotspots
Spatial Statistic toolbox in ArcMap was used for distribution and identification hotspots along the two expressways. Four steps were carried out to find the hotspots in the study area. In the first step, traffic violation point data was loaded in GIS. Each point represented a traffic violation caused by any of the four types, i.e., Violation of prohibited markings, Wrong-way driving, Illegal parking, or Overspeeding. In the second step, Data clusters were created in order to convert the data to a weighted point. A spatial statistics tool (Collect Event) was used in ArcMap in order to convert the data to a weighted point. Weights were assigned based on the frequency of the traffic violation at a particular point. Collect Event tools combines all coincident points that have the same X and Y centroid coordinates. During the third step, the Getis-Ord statistic was used to identify traffic violation hotspots. A high value of Getis-Ord statistic shows that a cluster is having high index values (hot spots), and a low value of Getis-Ord statistic represents a cluster of low index values (cold spots). Mathematically, the Getis-Ord statistic and its z-score are expressed by the following equations.
where G * i represent spatial dependency of the incident i, x j is feature value for j, w ij is the spatial weights for i, and j stands for distance d. n is the total number of features.
Inverse Distance Weighted (IDW) method of interpolation was used on these hotspots to estimates traffic violations along the expressways. IDW helps in estimating the neighboring values by averaging the values of sample data points. The principle of IDW is, the closer a point is to the center of the cell being estimated, the more influence or weight it has in the averaging process.

Traffic Violation Prediction Using ML
Three different ML algorithms, including K Nearest Neighbors (KNN), Support Vector Machine (SVM), and CN2 Rule Inducer, were implemented for predicting traffic violations using the available violation data collected from the study area. The three algorithms were implemented using the orange data mining toolbox in python. The preliminaries and detailed methodology of proposed ML methods are discussed in the following passages.

K Nearest Neighbors (KNN)
The K-nearest neighbor (KNN) classifier is a conventional non-parametric classifier initially proposed by Cover and Hart in 1967 [42]. It is a low computation complexity method for object recognition and classification tasks, such as character, face, and other objects. The KNN principle is based on an intuitive idea that the data points of the same class should be nearer to the feature space. The very first step of implementation is the collection of traffic violations observations and is classified C = {C 1 , C 2 , . . . , C N }. Firstly, KNN determines the distances for each training sample and the target point, then chooses the closest k-samples to the target. Such k-samples assess together the target point class. The distance measurement of the attributes is a simple way of expressing the point's resemblance. Afterward, the shortest k distance D = {d 1 , d 2 , d 3 , . . . . . . , d k } is chosen where each neighbor belongs among N classes to a specific class Y = y 1 , y 2 , . . . . . . , y n . Given a data set labeled with observations (x i , y i ) and i = 1, 2, . . . , n to capture the relationship between x data and y label. More explicitly, to know a function g : X → Y such that g (x) can accurately predict the corresponding output class Y given an unknown observation X. Moreover, the distance between the components to be recognized, and each class is then computed by euclidean, manhattan, mahalanobis and weighted by uniform and distance techniques. The distance can be defined as the nearer query neighbors point have a more significant impact than neighbors further away, while uniform described all points in each neighborhood are weighted equally. The formula for euclidean, manhattan, and mahalanobis are given in below equations, where D denotes distance, and S −1 is the covariance of the matrix of → x i and → y i . The weight distribution is achieved through distance and uniform techniques, and the k values assigned for these different metrics are given in Table 1. The Support Vector Machine (SVM) method is a supervised learning method proposed by Boser et al. in 1992 [43]. It seeks to find the ideal hyperplane, which divides two or more classes by seeking the maximal margin distances (e.g., positive vs. negative). In the classification scenario, the SVM seeks for the curve that can separate and classify the training data, ensuring that the difference between the curve, as well as other training class observations (support vectors), is as high as possible. This separation can be achieved in the same space or an ample space by mapping the input space into a feature space through a kernel function (radial basis function (RBF), polynomial, sigmoid, etc.). The training dataset of n points with observations x i i = 1, 2, . . . , n is defined as the vectors relative to the class observations y i i = 1, 2, . . . , n. In particular, SVM adjusts the balance between the margin as well as the error by adjusting a C parameter. With a higher optimum performance due to less computational difficulties and reasonable precision that reduced overfitting, the RBF kernels have been selected for the ultimate model of our research. The SVM model was implemented using the Orange python scripting tool. The formula for the RBF kernel is given below.
where K represents kernel, and γ can be termed kernel 'spread', as well as the decision region. The values of γ and C are 10 and 1.

CN2 Rule Inducer
The rule learning model for traffic violations and classification were discussed in this study. The CN2 Rule Inducer is a classification method designed to generate simple output, if condition then forecasts class, even in conditions where noise can occur. Moreover, CN2 Rule Inducer generates a class distribution based on the number of instances covered and distributed over the classes. In other words, it indicates the total number of representatives of the class. In our study, we employed a statistical significance check to determine whether the new rule has a valid correlation between features and classes. In addition, rules are pre-determined using two methods: (i) statistical likelihood ratio (SLR or LRS) tests and (ii) minimum threshold for rules coverage. The LRS test further shows two tests: firstly, the minimum level of relevance of a rule α 1 , and the second LRS test is equivalent to its parent rule, since it examines whether the last classification of rule is of adequate significance α 2 . In our implementation, we introduced exclusive coverage at the upper stage, such as an unordered rule, while Laplace estimation was used at the lower level for function evaluation. Laplace estimation has described as an alternative measure of the quality of the rule to correct undesirable entropy (downward bias) as follows: where R is the rule, p refers to the number of positive examples defined in the training set covered by the rule R , n refers to the number of negative examples by R , and k is the number of classes included in the training set. The values used for LRS tests are shown in Table 2, whereas the CN2 Rule Inducer viewer listed in Table A1 in Appendix A was obtained using stratified 10-fold cross-validation.

Performance Evaluation Metrics for ML Models
Different classification evaluation metrics were used to assess and compare the predictive performance of proposed ML methods. These include; precision, recall, F-score, accuracy, and specificity. Precision quantifies the number of positive predictions that are made correctly, while the recall quantifies the number of correct positive predictions that could have been made from all the positive predictions. The formula for calculating precision and recall could be found in Equations (8) and (9). The F-score comprises both the recall and the precision and is calculated from Equation (10). Accuracy is the proportion of the correct sample to the total number of samples and can be calculated from Equation (11). Similarly, specificity can be calculated from Equation (12).

Analysis of Descriptive Statistics
Traffic violations experienced by different vehicles, like private cars, taxis, vans, buses, and small trucks, were included in this analysis since they hold a large proportion of total occurrences of violations. A total number of 2003 violations by different vehicles were used for the analysis once the data was pre-processed. The occurren'ce of violations, comprised of time, date, month, day of the week, and time of the day, was assessed. The location of the violation was just to approximate the position of the occurrence of a traffic violation. In the present work, this detail was used to consider the spatial distribution of various type of violations occurred on urban expressways. Key descriptive statistics for data used in this research are summarized in Table 3. Violations committed by taxi drivers and private cars were more prevalent compared to other vehicles. Wrong-way driving (74.94%) comprised the highest proportion of observed violation followed by violation of prohibited road marking (17.87%), overspeeding (4.54%), and illegal parking (2.65%). The main reason for a significantly high percentage of wrong-way driving violations may be attributed to the fact that vehicles (taxis and private cars drivers in particular) tend to use the wrong way to avoid long travel to the next entrance or exit ramp to save time. In comparison, a relatively low percentage of overspeeding violations may be attributed to the presence of speed surveillance cameras at multiple locations along both expressways. Since wrong-way driving usually results in more severe crashes due to head-on collisions, it is essential to identify high-risk areas and factors that are likely to encourage wrong-way driving. Considering the distribution of violations caused by different vehicle types, it may be noted from Table 3 that private cars were involved in approximately three-fifths (58%) of the total violations, while violations caused by buses accounted for only 2.55% of total violations. Considering the temporal distribution of violations, the Spring season had the highest percentage of reported violations (54.72%), followed by winter (34.40%), whereas Autumn had the lowest proportion (3.10%) of total violations. The large proportion of Spring violation may be associated with frequent travel during this season. Similarly, weekdays and peak periods accounted for almost 70% and 48% of reported violations, respectively. Figure 2 shows the mapping of violations based on frequency-based clustering and the IDW method in ArcMap GIS. A total of 2003 traffic violations were observed along two expressways S219 and S308. S219 connects two residential districts, Lantian Residential District and Mianhuapozhen, in Luzhou, Sichuan, China. This expressway passes through various residential zones, one commercial zone, and one public facilities zone. S308 connects Lantian Residential District, Linyu Residential District, and Naxi District in the study area. As shown in Figure 2, both expressways had one major hotspot. The hotspot is located along the commercial and public facility zones. Wrong-way driving was the most observed traffic violation along both expressways. This observation may be attributed to the fact that drivers usually tend to drive the Wrong-way to avoid long travel along the expressway to reach their destinations or nearby residential areas to maximize profit. Expressway S219 had more number of hotspots compared to S308 due to densely populated residential zones on both sides in the vicinity of observed hotpots. Instead, S308 is occupied by coldspots, as shown in Figure 2. This observation is intuitive because this expressway runs along the rivers, and it does not divide any residential areas. Secondly, there are no commercial or public facilities along this expressway. Hence, very few traffic violations are observed along S308. The hotspot along this road is near to the airport along the curve (shown in Figure 2). More number of violations in these areas could be attributed to the presence of airport and sharp curve along the expressway. In general, hotspots frequency analysis along both expressways was dominated by wrong-way driving, followed by overspeeding, illegal parking, and violation of prohibited road markings. both sides in the vicinity of observed hotpots. Instead, S308 is occupied by coldspots, as shown in Figure 2. This observation is intuitive because this expressway runs along the rivers, and it does not divide any residential areas. Secondly, there are no commercial or public facilities along this expressway. Hence, very few traffic violations are observed along S308. The hotspot along this road is near to the airport along the curve (shown in Figure 2). More number of violations in these areas could be attributed to the presence of airport and sharp curve along the expressway. In general, hotspots frequency analysis along both expressways was dominated by wrong-way driving, followed by overspeeding, illegal parking, and violation of prohibited road markings.

ML Model's Comparison for Violation Prediction
A detailed comparative analysis was conducted to examine the applicability and efficacy of applied models. The model's performance is checked in terms of area Under the curve (AUC), accuracy, precision, recall, log loss, specificity, and F-1 score. Amongst these models, the KNN model outperformed the CN2 Rule Inducer and SVM model. As shown in Figure 3, we considered KNN with different k neighbors 3, 5, 7, 10, and 12 of different metrics and weights. The accuracy achieved for these different k neighbors is 99, 98, 98, 87, and 98 percent, respectively. Besides, all these k neighbors achieved higher accuracy and performed better except k = 10 nearest neighbor of uniform euclidean with obtained accuracy 87 percent. Hence, the average KNN model accuracy is 97.3 percent, which indicates that the predictive performance of KNN is more robust compared to SVM and CN2 Rule Inducer. Figure 3 also shows the obtained AUC, accuracy, precision, recall, F-1 score,

ML Model's Comparison for Violation Prediction
A detailed comparative analysis was conducted to examine the applicability and efficacy of applied models. The model's performance is checked in terms of area Under the curve (AUC), accuracy, precision, recall, log loss, specificity, and F-1 score. Amongst these models, the KNN model outperformed the CN2 Rule Inducer and SVM model. As shown in Figure 3, we considered KNN with different k neighbors 3, 5, 7, 10, and 12 of different metrics and weights. The accuracy achieved for these different k neighbors is 99, 98, 98, 87, and 98 percent, respectively. Besides, all these k neighbors achieved higher accuracy and performed better except k = 10 nearest neighbor of uniform euclidean with obtained accuracy 87 percent. Hence, the average KNN model accuracy is 97.3 percent, which indicates that the predictive performance of KNN is more robust compared to SVM and CN2 Rule Inducer. Figure 3 also shows the obtained AUC, accuracy, precision, recall, F-1 score, and specificity for SVM are 0.95, 0.964, 0.963, 0.978, 0.961, and 0 KNN (k = 1, 3, 5, 7, 10, and 12). Additionally, the KNN model takes less training time than the SVM and CN2 rule inducer, which therefore validates the efficiency of the KNN model. The models' run times are shown in Figure 4.  1, 3, 5, 7, 10, and 12). Additionally, the KNN model takes less training time than the SVM and CN2 rule inducer, which therefore validates the efficiency of the KNN model. The models' run times are shown in Figure 4.

Spearman Correlation Analysis
After data pre-processing, we estimate the rank correlation coefficient of the Spearman between the two features and obtain the matrix of the correlation coefficient. It aims to assess how well a monotonic function could be used to represent the relation between two variables. The correlation matrix describes the correlation and no correlation variance of a range of features through the values created in matrix ranging from +1 to −1. +1 means high and positively correlated, while −1 means less and negatively correlated.

Spearman Correlation Analysis
After data pre-processing, we estimate the rank correlation coefficient of the Spearman between the two features and obtain the matrix of the correlation coefficient. It aims to assess how well a monotonic function could be used to represent the relation between two variables. The correlation matrix describes the correlation and no correlation variance of a range of features through the values created in matrix ranging from +1 to −1. +1 means high and positively correlated, while −1 means less and negatively correlated. The features which are correlated are month and date, month and type of violation, latitude, and longitude. On the other hand, features highly less correlated are season and

Spearman Correlation Analysis
After data pre-processing, we estimate the rank correlation coefficient of the Spearman between the two features and obtain the matrix of the correlation coefficient. It aims to assess how well a monotonic function could be used to represent the relation between two variables. The correlation matrix describes the correlation and no correlation variance of a range of features through the values created in matrix ranging from +1 to −1. +1 means high and positively correlated, while −1 means less and negatively correlated. The features which are correlated are month and date, month and type of violation, latitude, and longitude. On the other hand, features highly less correlated are season and month. Additionally, the less correlated features also include season and date, season and type of violation, as shown in Figure 5. This indicates that month, latitudes, and longitude have a notable impact on the type of violation. The violations vary throughout the year as the month changes. Longitudes and latitudes show that the location of the violation also has a strong positive association with observed violations.

Conclusions
Traffic violations often caused by aggressive driving behavior are considered as significant indicators for crashes. Existing studies on aggressive driving behavior and violation analysis have mostly employed statistical regression and simulation-based techniques to explore factors responsible for such uncivilized driving behavior [40,[44][45][46]. However, it is well-known that methods based on statistical analysis have a number of underlying assumptions and are unable to capture hidden correlation among explanatory variables [41,47]. Further, the low prediction accuracies obtained by these methods are also not highly reliable. Hence, in this study, we designed a systematic framework by first identifying violation hotspots using GIS, followed by classification and prediction of the different violations, using state-of-the-art ML methods. During the first phase of the study, a detailed descriptive analysis was conducted that showed that wrong-way driving had the highest proportion (75%) of total violations, whereas illegal parking had the lowest (2.10%). It was also noted that private cars and taxis were frequent violators. Similarly, temporal distribution analysis revealed that violations were more prevailing during the spring season, weekdays, and off-peak periods. The relationship between temporal attributes and occurrence of violation is consistent with a previous study [41]. Another recent study conducted by Liu et al. also indicated the relationship of time (peak hours/Off peak hours), month, and locations with a different type of violation occurrence [48]. Previous studies have also focused on the relationship between land-use and observed violations. During the second phase of the study, the Inverse Distance Weighted (IDW) method of interpolation in ArcMap GIS was used for the identification of hotspots for traffic violations along both expressways. Violation hotspots were mostly concentrated along commercial and public facility zone on S219, whereas, along S308 expressways, they were located mostly near the horizontal curve. Accurate identification of hotspots is vital for carrying subsequent treatment activities more efficiently within limited time and budget constraints. Studies suggest that the frequency of violations at a particular location may be associated with a range of factors, such as land-use, area

Conclusions
Traffic violations often caused by aggressive driving behavior are considered as significant indicators for crashes. Existing studies on aggressive driving behavior and violation analysis have mostly employed statistical regression and simulation-based techniques to explore factors responsible for such uncivilized driving behavior [40,[44][45][46]. However, it is well-known that methods based on statistical analysis have a number of underlying assumptions and are unable to capture hidden correlation among explanatory variables [41,47]. Further, the low prediction accuracies obtained by these methods are also not highly reliable. Hence, in this study, we designed a systematic framework by first identifying violation hotspots using GIS, followed by classification and prediction of the different violations, using state-of-the-art ML methods. During the first phase of the study, a detailed descriptive analysis was conducted that showed that wrong-way driving had the highest proportion (75%) of total violations, whereas illegal parking had the lowest (2.10%). It was also noted that private cars and taxis were frequent violators. Similarly, temporal distribution analysis revealed that violations were more prevailing during the spring season, weekdays, and off-peak periods. The relationship between temporal attributes and occurrence of violation is consistent with a previous study [41]. Another recent study conducted by Liu et al. also indicated the relationship of time (peak hours/Off peak hours), month, and locations with a different type of violation occurrence [48]. Previous studies have also focused on the relationship between land-use and observed violations. During the second phase of the study, the Inverse Distance Weighted (IDW) method of interpolation in ArcMap GIS was used for the identification of hotspots for traffic violations along both expressways. Violation hotspots were mostly concentrated along commercial and public facility zone on S219, whereas, along S308 expressways, they were located mostly near the horizontal curve. Accurate identification of hotspots is vital for carrying subsequent treatment activities more efficiently within limited time and budget constraints. Studies suggest that the frequency of violations at a particular location may be associated with a range of factors, such as land-use, area type, time of the day, roadway design, weather conditions, and drivers' socio-demographic attributes [40,[49][50][51]. For example, another study, conducted by Zahid et al., suggested that risks of committing traffic violations are relatively more inside the central business districts (CBDs), dense residential landscape, the area with public facility services, and near urban intersections [41]. Lastly, during the third phase of the study, three different ML algorithms, i.e., KNN, SVM, and CN2 Rule Inducer, were applied for prediction and classification of traffic violations, considering spatio-temporal attributes of available data. The efficacy and predictive performance of proposed ML models were investigated using several classification evaluation metrics such as AUC, accuracy, precision, recall, F score, and specificity. Study results showed that KNN (k = 7) model using manhattan evualation had an accuracy of 99% outperformed SVM, and CN2 Rule Inducer. KNN model also showed increased predictive efficiency with reference to AUC, accuracy, precision, recall, F-score, and specificity. The outcome of this study could provide useful guidance to safety managers and practioners to initiate sound policy recommedations to enhance road safety.
The current study has a few limitations that must be' acknowledged and may be adressed in future studies. First, detailed demographic characteristics of drivers (such as gender, age, education, etc.) could be considered in future studies. Unfortunately, they were not available for this study. Second, the license plate record of the vehicles were not available in the current dataset, which limits its application to record the detailed history of violations comitted by individual vehicle/driver. This is a another major drawback that could provide valuable insights to in-depth violation analysis. Finally, it would be intresting to explore the impacts of operating styles, working hours, features of built environment, attributes of roadway geometric and daily driving distances on agressive driving behavior, and traffic violations in forthcoming studies.

Conflicts of Interest:
The authors declare no conflict of interest.