Severity analysis of powered two wheeler traffic accidents in Uttarakhand, India

Powered two wheeler (PTW) vehicles are one of the preferred modes of transport in India. PTW accidents are also more frequent than other types of road accident, and the factors that influence them differ from those that affect other accident types. The objective of this study is to analyze newly available PTW road accident data from Uttarakhand state in India and to reveal the factors that affect the severity of these accidents in the various districts of Uttarakhand. To this end, we first compared three popular classification algorithms, i.e. decision tree (CART), naïve Bayes and support vector machine, on the PTW accident data set. The classification accuracy of the decision tree algorithm (CART) was found to be better than that of the other two techniques. Hence, we used the CART algorithm to extract the factors that affect the severity of PTW accidents in the whole of Uttarakhand state and in each of its 13 districts separately. The analysis reveals that every district has different factors associated with PTW accident severity. Some districts in Uttarakhand have similar PTW accident patterns, whereas a few districts have different patterns. These results are useful for understanding the pattern of PTW accidents in Uttarakhand state and can help in reducing the PTW accident rate there.


Introduction
A traffic accident can be considered as an incident in which one or more vehicles collide with another vehicle, a person, an animal or a fixed object. Traffic accidents involve not only loss of human life but also property damage. The World Health Organization (WHO) reported that traffic accidents cause 1.2 million deaths and around 4 million injuries every year around the world [1]. Rising vehicle purchases are increasing the number of vehicles on the road day by day, and hence the chances of a traffic accident are also increasing.
A traffic accident affects not only the lives of the victims involved but also the lives of the people associated with them, i.e. family members, business associates etc. Every road accident leaves a record in a police or hospital database. This record contains important information about the accident, i.e. its time, date and location, weather information, road characteristics and traffic conditions at the time of the accident. Proper analysis of this information can reveal the factors behind road accidents, so that preventive measures can be taken.
Traffic accident analysis is a well-known research area, and a rich literature describes different techniques and their outcomes in road accident analysis. Abdalla et al. [2] analyzed road accident data from Scotland and established the relationship between the location of a traffic accident and its distance from residential areas. Their findings reveal that traffic accidents are more frequent near residential areas than in areas that are not in close proximity to them. Mussone et al. [3] used a neural network model to analyze road accidents that occurred at intersections in Milan, Italy. Their results showed that pedestrian-hit accidents at night at non-signalized intersections had the highest frequency in that region. Several other studies focused on traffic accident severity analysis using traditional statistical techniques and reported good results [4][5][6][7][8][9][10][11][12]. However, [13,14] showed that traditional statistical techniques have certain limitations in analyzing road accident data. Further, several studies applying data mining techniques to road accident analysis have shown that data mining provides more productive results than traditional statistical techniques. Data mining techniques [15] have also been used to categorize road accident locations and to identify the factors that affect accidents at those locations [16]. Some authors raised the issue that road accident data is heterogeneous in nature and suggested that clustering the data prior to analysis can remove this heterogeneity [17][18][19]. Some studies also used data mining techniques to analyze crash counts using time series analysis [20,21].
Powered two wheelers (PTWs) are among the vehicles most often involved in road accidents. This is directly related to the larger number of PTWs purchased in comparison to other vehicles. PTWs are purchased so widely because they are more affordable, smaller, lighter, more flexible and faster than other vehicles in heavy traffic conditions. In other words, a PTW is a vehicle driven by people of all economic conditions (rich, middle class and poor) on both urban and rural roads. Various studies used traditional approaches [22][23][24][25][26] to analyze the crash severity of PTW accidents in developed countries. One study [27] used classification trees to generate rules that predict the crash severity of PTW accidents.
An important point about PTW riders is that they are more prone to road traffic accidents than the occupants of other vehicles such as cars, SUVs, vans and buses. The motivation behind this study is to identify the factors that affect the severity of PTW road accidents in Uttarakhand state. We applied a decision tree classifier, a support vector machine and a naïve Bayes classifier to the PTW road accident data of the 13 districts of Uttarakhand state in order to identify the factors that affect accident severity. The severity of accidents is categorized into KSI (killed or severely injured) and SI (slightly injured). In this study, we identified several factors that affect the severity of PTW accidents in Uttarakhand, India, which can help in reducing the accident rate.

Data set used
The data set used in this study was obtained from GVK-EMRI [28], Dehradun, for Uttarakhand state and covers all PTW accidents from January 2010 to December 2014. We use these 5 years of PTW road accident data for the severity analysis. The data consist of 14,709 accident records with 11 attributes from the 13 districts of Uttarakhand. The data set and its attributes are described in Table 1, and the distribution of PTW accidents across the 13 districts is illustrated in Fig. 1.

Classification techniques
In the domain of data mining [29], classification is a supervised learning technique that can be defined as follows: given a set of observations, we are interested in extracting rules that can be used to predict the class of each new observation. The set of observations used to extract the rules is known as the training set. Another set of observations, known as the test set, is used to verify the quality and accuracy of the rules. Initially, the training data and the test data are both part of the available data set. Classification is a widely used technique that has shown its importance in various fields such as bioinformatics, pattern recognition and image classification. To achieve the best prediction, a suitable classification technique must be selected, and this selection depends on the type and nature of the data. As our data is mostly categorical, we evaluate the prediction accuracy of three suitable classification techniques on it, i.e. the decision tree algorithm [30], the naïve Bayes algorithm [31] and the support vector machine algorithm [32]. The technique with the highest prediction accuracy is then used for the analysis.

K-fold cross-validation
A common problem with classification techniques is how to partition the data into training and test data [33]. Often the split is chosen by the user, with the training data usually kept larger than the test data. Some choose 70%-30%, 60%-40%, 80%-20% and so on for the training and test sets and check which gives the better accuracy. However, dividing the data on the basis of the user's choice is a time-consuming and complex process. This approach also fails in the case of imbalanced data, where the class values to be predicted are not equally frequent or differ by a large ratio. K-fold cross validation [34] is a statistical technique that divides the entire data set into k groups, where k is any integer greater than 1. Out of the k subgroups, a single subgroup is retained as the test data and the remaining k-1 subgroups are used as training data. The process is then repeated k times, with each of the k subgroups used as the test set exactly once. The k outcomes can then be averaged to produce a single estimate. The value of k is not fixed in general, but k = 10 is a standard and widely accepted choice. This study used k-fold cross validation with k = 10 to partition the data into training and test sets.
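The partitioning step can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used in the study: the function names `k_fold_indices` and `cross_validate`, the shuffling seed and the `evaluate` callback are assumptions introduced here.

```python
import random

def k_fold_indices(n_records, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross validation."""
    if not 2 <= k <= n_records:
        raise ValueError("k must satisfy 2 <= k <= n_records")
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

def cross_validate(evaluate, n_records, k=10):
    """Average a score over the k folds.

    `evaluate(train_idx, test_idx)` is a hypothetical callback that trains a
    classifier on the training indices and returns its score on the test indices.
    """
    scores = [evaluate(train, test) for train, test in k_fold_indices(n_records, k)]
    return sum(scores) / len(scores)
```

With the 14,709 records of this data set and k = 10, each test fold holds 1,470 or 1,471 records, and every record is used for testing exactly once.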

Classifier accuracy measures
One of the most important aspects of the classification process is how well the classifier predicts unobserved instances. This is known as the accuracy of a classifier. Sometimes accuracy by itself is not a good measure of classifier quality. Here, we describe some additional measures that help in assessing how good a classifier is.

Confusion matrix
A confusion matrix (or error matrix) [35] is a contingency table that allows the performance of a classifier to be visualized. Each column of the confusion matrix denotes the instances of a predicted class and each row represents the instances of an actual class. To understand the confusion matrix, consider a data sample of 10 animals with 4 lions and 6 tigers. If a classification algorithm is trained to distinguish between lions and tigers, a confusion matrix summarizes the results of the algorithm for the given sample. The confusion matrix for this example is given in Table 2.
In the above confusion matrix, out of 4 actual lions, the classifier predicted 3 correctly as lions and 1 as a tiger. Out of 6 tigers, 2 were predicted as lions. All correct predictions are located on the diagonal of Table 2. Using this contingency table, the other measures can be evaluated.
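The lion and tiger example can be reproduced with a short Python sketch. The helper name `confusion_matrix` and the ordering of the sample are assumptions made for illustration; only the counts come from the example above.

```python
from collections import Counter

def confusion_matrix(actual, predicted, labels):
    """Return a matrix with rows = actual class and columns = predicted class."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]

# 4 actual lions (3 predicted lion, 1 predicted tiger) followed by
# 6 actual tigers (2 predicted lion, 4 predicted tiger).
actual = ["lion"] * 4 + ["tiger"] * 6
predicted = ["lion"] * 3 + ["tiger"] + ["lion"] * 2 + ["tiger"] * 4

matrix = confusion_matrix(actual, predicted, ["lion", "tiger"])
# Correct predictions sit on the diagonal; accuracy is their share of all records.
accuracy = sum(matrix[i][i] for i in range(len(matrix))) / len(actual)
```

Here `matrix` is `[[3, 1], [2, 4]]`, so 7 of the 10 animals are classified correctly.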

True positive rate (TPR) and false positive rate (FPR)
TPR measures the fraction of positive instances that are correctly identified. It is also known as the sensitivity of a classifier and can be calculated from the entries of the contingency table using Eq. 1. FPR, also known as the false alarm ratio, refers to the probability of falsely rejecting the null hypothesis. It is calculated as the ratio of the number of negative events that are mistakenly categorized as positive to the total number of actual negative events. The formula is given in Eq. 2.
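Eq. 1 and Eq. 2 are not reproduced in this text. The standard definitions that match the description above are sketched here, where TP, FN, FP and TN (true positives, false negatives, false positives and true negatives) are the usual symbols for the four cells of a two-class confusion matrix and are an assumption about the original notation.

```latex
\begin{align}
\mathrm{TPR} &= \frac{TP}{TP + FN} \tag{1}\\
\mathrm{FPR} &= \frac{FP}{FP + TN} \tag{2}
\end{align}
```

Taking lions as the positive class in the lion and tiger example gives TP = 3, FN = 1, FP = 2 and TN = 4, so TPR = 3/4 = 0.75 and FPR = 2/6 ≈ 0.33.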

Specificity
The specificity of a classifier is its ability to correctly predict the negative cases in the data set. It can be calculated as the fraction of actual negative cases that are predicted as negative.
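The formula for specificity is missing from this text. The standard definition, written with the usual confusion-matrix counts TN (true negatives) and FP (false positives), which are assumed symbols, is:

```latex
\begin{equation}
\mathrm{Specificity} = \frac{TN}{TN + FP} = 1 - \mathrm{FPR}
\end{equation}
```

In the lion and tiger example with lions as the positive class, TN = 4 and FP = 2, so the specificity is 4/6 ≈ 0.67.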

Precision and recall
Precision and recall are the most commonly used metrics for measuring the performance of a classification algorithm. Precision is a measure of exactness, i.e. of all the instances predicted as a given class X, how many were correctly classified. Recall, which is the same as sensitivity or TPR, is a measure of completeness, i.e. of all the data instances with class value X, how many were correctly captured. The formula for precision is given in Eq. 4, and the formula for recall is the same as that for TPR in Eq. 1.
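The precision formula is not reproduced in this text. The standard definitions, with TP, FP and FN denoting true positives, false positives and false negatives for class X (assumed symbols), are:

```latex
\begin{align}
\mathrm{Precision} &= \frac{TP}{TP + FP}\\
\mathrm{Recall} &= \frac{TP}{TP + FN}
\end{align}
```

For the lion class in the lion and tiger example, 5 animals are predicted as lions and 3 of them are lions, so the precision is 3/5 = 0.6, while the recall is 3/4 = 0.75.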

F-measure and MCC
The F-measure [35], also known as the F-score, is a measure of a classifier's accuracy. Both precision and recall are considered when calculating the F-score: it is defined as the harmonic mean of precision and recall. The best value of the F-score is 1 and the worst value is 0. The F-score can be calculated using Eq. 5.
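Eq. 5 is not reproduced in this text. Since the F-score is the harmonic mean of precision and recall, the standard form is the following, where TP, FP and FN are the usual confusion-matrix counts and are assumed symbols:

```latex
\begin{equation}
F = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
  = \frac{2\,TP}{2\,TP + FP + FN}
\end{equation}
```

For the lion class in the lion and tiger example (precision 0.6, recall 0.75), this gives F = 0.9/1.35 = 2/3 ≈ 0.67.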
MCC, or the Matthews correlation coefficient [36], is a measure of the quality of a binary classification, in which the variable to be predicted has only two values. In our case, the target attribute has two class values, i.e. KSI (killed or severely injured) and SI (slightly injured). MCC is considered a balanced metric that can be used even if the classes are not balanced. Its value ranges between −1 and +1: a value of +1 indicates a perfect prediction, 0 a prediction no better than random, and −1 total disagreement between prediction and observation. MCC can be calculated from the values in the confusion matrix using Eq. 6.
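Eq. 6 is not reproduced in this text. A minimal Python sketch of the standard MCC definition follows; the function name `mcc` and the convention of returning 0 when the denominator vanishes are assumptions made here, not details taken from the study.

```python
import math

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from two-class confusion-matrix counts.

    MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # A whole row or column of the matrix is empty; 0 is a common convention.
        return 0.0
    return (tp * tn - fp * fn) / denominator

# Lion and tiger example with lions as the positive class.
example_mcc = mcc(tp=3, fp=2, fn=1, tn=4)
```

For the lion and tiger example this evaluates to 10/√600 ≈ 0.41, a moderate positive correlation between predicted and actual classes.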

Receiver operating characteristic (ROC) curve
ROC [37] is an important measure for checking the accuracy of a classifier. It was originally used in signal detection theory to depict the tradeoff between hit rates and false alarm rates over a noisy channel. It is now widely used in machine learning as a technique for visualizing the performance of a classifier. The ROC curve is a plot of TPR against FPR. To evaluate the performance of the classifier, the AUC (area under the ROC curve) is calculated. An AUC value close to 1 represents very good performance, a value of 0.5 corresponds to random guessing, and a value below 0.5 is considered poor performance.
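The construction of the ROC curve and its AUC can be sketched as follows. This is an illustrative sketch under stated assumptions: the classifier is assumed to output a numeric score per record (higher meaning more likely positive), and the function names `roc_points` and `auc` are introduced here.

```python
def roc_points(labels, scores):
    """Return ROC points (FPR, TPR), one per distinct score threshold.

    labels: 1 for the positive class, 0 for the negative class.
    scores: classifier scores, higher meaning "more likely positive".
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    ranked = sorted(zip(scores, labels), reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        tp += label
        fp += 1 - label
        # Emit a point only after the last record in a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

A classifier that ranks every positive record above every negative one yields an AUC of 1, while a classifier that gives all records the same score yields the diagonal line and an AUC of 0.5.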

Results and discussion
This section presents the experimental analysis of the PTW road accident data and its results.

Performance of classification techniques on PTW data
Initially, we applied the Classification and Regression Trees (CART) algorithm for decision tree classification, together with the naïve Bayes and support vector machine techniques, to evaluate the prediction accuracy on the accident data. The prediction accuracy obtained for CART is higher than that of the naïve Bayes classifier and the support vector machine. Hence, we selected the CART decision tree algorithm to analyze our road accident data. Table 3 lists the prediction accuracy of all three classifiers on the PTW accident data set.

CART performance analysis
The PTW road accident data of the 13 districts of Uttarakhand state are considered for analysis. We built decision trees using CART for each of the 13 districts and for the entire data set (EDS). The confusion matrices obtained after building the decision trees for all districts and the EDS are shown in Table 4. The classifier accuracy measures that illustrate the performance of the decision tree classifier on the 13 districts and the EDS were calculated from these confusion matrices and are shown in Table 5.
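The text does not state which split criterion was used when building the trees. CART classification trees commonly choose binary splits that minimize Gini impurity, and the sketch below illustrates that idea only. The six records are made up for illustration, the key `severity` is a hypothetical name for the target attribute, and the LIG codes are borrowed from the attribute values mentioned in this paper.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(records, attribute, value, target="severity"):
    """Weighted Gini impurity of the binary split `attribute == value` vs. the rest."""
    left = [r[target] for r in records if r[attribute] == value]
    right = [r[target] for r in records if r[attribute] != value]
    n = len(records)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Made-up toy records, NOT taken from the accident data set.
records = [
    {"LIG": "DUS", "severity": "KSI"},
    {"LIG": "DUS", "severity": "KSI"},
    {"LIG": "DLT", "severity": "SI"},
    {"LIG": "DLT", "severity": "SI"},
    {"LIG": "DLT", "severity": "KSI"},
    {"LIG": "NLT", "severity": "SI"},
]

before = gini([r["severity"] for r in records])  # impurity with no split
after = split_impurity(records, "LIG", "DUS")    # impurity after splitting on LIG == DUS
```

On these toy records the impurity falls from 0.5 to 0.25, so a tree builder following this criterion would regard the split on LIG == DUS as informative. Repeating the search over all attributes and values at each node, and reading each root-to-leaf path as an if-then statement, is how decision rules arise from such a tree.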
The values shown in Table 5 indicate how well CART predicts the severity of PTW accidents. For the Dehradun, Haridwar, Nainital and Udham Singh Nagar districts, which have a high PTW accident rate in Uttarakhand state, the decision tree classifier's accuracy is better than for the remaining districts. In the other districts, the classifier is less accurate. One reason may be the small number of accident records: if the data set is not sufficiently large, the decision tree algorithm may not be as accurate as desired. Another reason for the low accuracy is that the same attribute values occur for both KSI and SI accidents. The ROC plots showing the performance of the decision tree classifier for all 13 districts and the EDS are given in Fig. 1.1 to Fig. 1.14.
The AUC (area under the ROC curve) is shown in each figure. The AUC indicates that the decision tree classifier performs worst for Bageshwar district and best for the Dehradun, Nainital, Haridwar and Udham Singh Nagar districts.

Decision rules extraction and description
Further, decision rules were extracted from the decision trees built for all districts and the EDS. The relevant and interesting rules were chosen to describe the patterns of each district and of the EDS, and are described below.
The decision rules for the Almora, Bageshwar and Chamoli districts indicate that NOI, TOD, SUA and LIG are the main contributing attributes in PTW accidents. The rules reveal that PTW accidents that occurred at night in no-light conditions were KSI accidents, whereas locations where road lighting was present at night had SI accidents only. In other conditions it is difficult to distinguish between KSI and SI accidents, because similar attribute values were present for both. Attributes that were not available in the data, such as speed and weather information, may be responsible factors for PTW accidents in these districts.
The severity of PTW accidents in the Champawat and Haridwar districts was mainly affected by the NOI, TOD, ROF and LIG attributes. The decision rules for Champawat district reveal that intersections were mainly involved in KSI accidents for the TOD values T1 and T6, whereas for other TOD values the accidents were SI. The decision rules for Haridwar district indicate that intersections in no-light conditions were more prone to KSI accidents. Other road features such as curves and slopes were found to have a similar effect in all lighting conditions, being associated with SI accidents with 2 or more victims involved. Some PTW accidents on slopes in daylight conditions with 1 victim injured were KSI accidents.
Dehradun district, which has the highest number of PTW road accidents in Uttarakhand state, was mainly affected by the NOI, TOD, ROF, SUA and LIG attributes. According to the decision rules, most of the KSI accidents occurred in no-light conditions at intersections near markets, residential areas and agricultural land. Curves on roads near forest areas were also KSI-prone locations for PTW accidents with 1 victim involved. Other attribute values were usually associated with SI accidents.
For Nainital district, the factors that affect the severity of PTW accidents include, in addition to those of the previously mentioned districts, a few more attributes, i.e. age of victim and ROT. The rules reveal that curves on roads are the main factor contributing to KSI accidents at night and in the early morning. Also, in the evening, KSI accidents on highway roads involved minor victims, i.e. victims less than 18 years of age.
For Udham Singh Nagar district, the factors that affect the severity of road accidents were quite similar to those in Dehradun district. Colonies and market areas were the major locations where many accidents occurred, but most of these were SI accidents. KSI accidents mainly occurred on highways passing through agricultural land or forest areas. Victims in the YNG and ADU age groups were mainly involved in KSI accidents, and very few KSI accidents involved victims in the SNR and CHD groups.
The Rudraprayag, Tehri and Uttarkashi districts were not mainly affected by ROT, ROF and the other important factors found for the previous districts. One common factor revealed by the decision rules is the LIG condition. Most of the KSI accidents in these districts occurred in the DUS (dusk) lighting condition, whereas other lighting conditions usually involved SI accidents. As the number of PTW accident records for these districts is comparatively low, some other factors remain hidden.
The decision rules for the Pauri and Pithoragarh districts revealed that these two districts have similar patterns of PTW accidents. In both districts, the KSI accidents mainly involved the AGE groups CHD and SNR and the LIG condition DUS. These accidents also mainly happened in the Q1 and Q4 months of the year. The SI accidents mainly involved the AGE group ADU, whereas the YNG age group was equally involved in both SI and KSI accidents.
Further, the rules for the EDS were analyzed. It was found that for the EDS almost all attributes except the MON (month) attribute were involved in KSI and SI accidents. Most of the KSI accidents involved an NOI value of 1, and very few involved NOI = +2. For the AGE attribute, the values YNG and ADU were mainly involved in KSI accidents, whereas the number of CHD victims was comparatively low. SNR victims were involved in both KSI and SI accidents, but these accidents were fewer than those involving other victims. The major road locations where most of the KSI accidents occurred were intersections on highways. Curves on highways were also found to be dangerous, as they involved more KSI accidents than SI accidents. The SUA attribute values MAR and HIL are the locations where most of the accidents occurred, but SI accidents outnumbered KSI accidents at these locations. The SUA values FOR and AGL were found to be dangerous for PTW accidents on local roads. For the LIG attribute, around 10% of accidents occurred in the DUS condition, of which 46% were KSI; hence the DUS condition could be dangerous for PTW riders. Many accidents occurred in the DLT condition, but most of them were SI accidents. In the RLT condition, most of the PTW accidents were KSI accidents. Some PTW accidents also occurred in NLT conditions, but most of them were SI accidents.
Therefore, a separate analysis of each district's data and a complete analysis of the entire data set reveal different but important information that can be used to understand the factors involved in PTW road accidents. The accident attributes have a different impact on PTW accidents in each district. It can be concluded that the analysis of the entire data set gives a broad overview of the factors involved in PTW road accidents, whereas a

Conclusion
This study used the decision tree classification technique to analyze 5 years of PTW road accident data from the 13 districts of Uttarakhand state in India. The decision tree algorithm was selected because its prediction accuracy on our data was better than that of naïve Bayes and the support vector machine. A total of 14,709 PTW accident records with 11 attributes were analyzed. The decision tree is a popular data mining technique that is widely used for the analysis of road accident data, and we used it for the severity analysis of PTW accidents in each district of Uttarakhand. The accident severity in our data is classified into the KSI (killed or severely injured) and SI (slightly injured) class values. The distribution of PTW road accidents differed among the districts. Dehradun, Nainital, Haridwar and Udham Singh Nagar were the districts with a high number of PTW accidents, whereas the remaining districts had comparatively fewer. The severity analysis revealed different factors contributing to the severity of accidents in different districts. The decision tree classifier performed very well in the districts with a good number of accident records (illustrated in Fig. 2), whereas its performance was weaker for the districts with very few accident records. Some districts such as Uttarkashi, Tehri, Rudraprayag, Almora, Bageshwar and Chamoli contain very few accident records; therefore, neither was the classifier's accuracy good nor did the decision rules generated reveal very useful information about the factors that cause PTW accidents to occur. Further, this information can be used to develop policies to prevent and reduce PTW accidents in Uttarakhand state and its districts. The study presented a classification-based approach to PTW accident data from an Indian state.
The quality of the experiments and results in this study is subject to the quality and attributes of the data available in India. European countries, mostly in western Europe, have well-maintained road accident data sources with quality information. The methodology adopted in this study could be used to analyze quality PTW data from European countries, which would provide better results and help to reveal further important factors for PTW accidents.