Data based violated behavior analysis of taxi driver in metropolis in China

Abstract: The violation probability of taxi drivers in a metropolis is far higher than that of ordinary drivers, because their work is labor-intensive, they are overconfident in their own driving skills, and they are always searching for potential customers, sometimes even picking up or dropping off passengers at random locations. In this paper, four types of violation behavior of taxi drivers in a metropolis were first summarized, and a corresponding scale table was initially designed with social statistical methods. Then, with a certain number of samples, item analysis, exploratory factor analysis, validity analysis and reliability analysis were conducted to verify the validity of the initial scale table, and some improvements were made accordingly. The modified scale table has a high degree of fit and good reliability and validity, so it can detect the violation behavior of taxi drivers accurately. Finally, large-area questionnaire survey data of taxi drivers in Shanghai were collected with the modified scale table. The results showed that, among the four types of violation behavior of taxi drivers in a metropolis, the probability of over-speed is as high as 89.57%, and the probabilities of "driving over-speed at midnight" and "accelerating to cross the intersection during the yellow signal" are as high as 64.2% and 58.2% respectively, which is meaningful for improving the specification of taxi drivers' behavior and traffic safety regulation.

can find violation behavior of taxi drivers simply and accurately, especially with the assistance of classical statistical analysis software. In this study, we first summarized the violation behaviors of taxi drivers via interviews and a questionnaire survey. Then the initial scale table was designed, and item analysis, exploratory factor analysis, validity analysis and reliability analysis of the scale were conducted to verify the validity of the initial scale table, based on which some improvements were made. Finally, with the survey data collected in Yangpu District, Shanghai, China, some meaningful conclusions were reached with the modified scale table. The detailed research roadmap is shown in Fig. 1.

Initial scale table design
To understand the characteristics of the violation behavior of taxi drivers, we designed the initial scale table with reference to a related scale [Yang (2007)] and with the guidance of psychological analysts. Furthermore, 10 taxi drivers were invited at random to judge whether they could understand the questions of the scale table. After these steps, the initial scale was established, which included 4 dimensions (4 classes of usual violation behaviors of taxi drivers): ignorance of traffic sign and marking, dangerous car following, over-speed driving, and illegal lane change and over-taking.
Every dimension also includes several sub-items, so there were 15 items in the initial scale table. Every item has 4 options: "Never", "Occasionally", "Frequently" and "Always", which represent 0, 1, 2 and 3 points respectively. The structure of the initial scale table is shown in Tab. 1.

Tab. 1 (excerpt; items 14 and 15)
14  Too many cars on the road, I have to change lanes to avoid congestion.
15  No car in the bus lane, so changing lanes to it to overtake cars in front.

Scale table analysis

Item analysis
Item analysis was carried out first to improve the validity and reliability of the scale, because good internal consistency between an item and the total scale means good discriminability of the item, and a close correlation between an item and the scale means that the item can distinguish samples at different levels.
If an item's score is high when the total score is high, and low when the total score is low, the item is consistent with the total scale, which means that the discriminability of this item is very good. The product-moment formula is selected for the analysis as follows:

$$\gamma = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\sum X^2 - \left(\sum X\right)^2\right]\left[n\sum Y^2 - \left(\sum Y\right)^2\right]}}$$

where γ is the product-moment correlation coefficient, X is the item score of a sample, Y is the total score of a sample, and n is the total number of samples. If the correlation between the item scores and the total score of the scale table is relatively high, the internal consistency between the items and the whole scale is better and the whole scale table is more reliable; in addition, a high correlation between an item and the scale table indicates that the item can distinguish respondents at different levels, in other words, the effectiveness of the test is very good [Jin (2007)]. Therefore, the item discrimination analysis is important for the reliability and validity of the scale table.
In this scale table, the total score of a dimension is the sum of the scores of its related items. The purpose of the item analysis is to delete items that are inconsistent with the total score of the dimension: an item that scores low when the total score of the dimension is high has very poor discrimination and should be deleted. The method of Principal Component Analysis (PCA) was then used in the exploratory factor analysis, in which case there is no need to consider the correlation between the total score and the scores of the related items.
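The item-total correlation described above can be sketched in a few lines. This is a minimal illustration rather than the SPSS procedure used in the study: the function names and the five-driver response matrix are hypothetical, and the correlation is taken against the total over all columns supplied.

```python
import math

def product_moment(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    num = n * sxy - sx * sy
    den = math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
    return num / den

def item_total_correlations(responses):
    """responses: one row per driver, each row a list of item scores (0-3)."""
    totals = [sum(row) for row in responses]
    n_items = len(responses[0])
    return [product_moment([row[j] for row in responses], totals)
            for j in range(n_items)]

# Hypothetical answers of five drivers to three items (not survey data).
demo = [
    [0, 1, 3],
    [1, 1, 2],
    [2, 2, 2],
    [3, 2, 1],
    [3, 3, 0],
]
gammas = item_total_correlations(demo)
# The third item moves against the total score, so it would be a deletion candidate.
```

In this toy matrix the third item scores low exactly when the total is high, which is the pattern the item analysis is meant to detect.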

Exploratory factor analysis
Exploratory factor analysis allows the scale table to condense the original items into several main factors with the least loss of information, and makes the factors nameable and interpretable [Xue (2013)].
The main purpose of factor analysis is to reflect the majority of the information in the original variables with a smaller number of factors. The matrix form of the mathematical model can be expressed as follows:

$$X = AF + \varepsilon$$

where X is the vector of original variables, A is the factor loading matrix, F is the common factor, and ε is the special factor, i.e. the unexplained part of the original variables, whose mean value is 0. Factor analysis can be solved by the method of Principal Component Analysis, in which, by means of a coordinate transformation, the original correlated variables are converted into another set of uncorrelated variables through linear combinations. Before exploratory factor analysis, Bartlett's Test of Sphericity and the Kaiser-Meyer-Olkin (KMO) test have to be carried out to check whether there is strong correlation among the original variables.
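The coordinate transformation behind PCA can be illustrated with an eigendecomposition of the item correlation matrix. The sketch below assumes NumPy and uses synthetic data; `principal_components` is a hypothetical helper, no rotation is applied, and it is not the SPSS routine used in the study.

```python
import numpy as np

def principal_components(data):
    """Eigenvalues, loadings and component scores of the item correlation matrix.

    data: 2-D array, rows = respondents, columns = items.
    """
    z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)  # standardize items
    corr = np.corrcoef(data, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)        # symmetric matrix, so eigh
    order = np.argsort(eigvals)[::-1]              # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    loadings = eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))
    scores = z @ eigvecs                           # uncorrelated linear combinations
    return eigvals, loadings, scores

rng = np.random.default_rng(0)
# Synthetic data: 200 respondents, 4 items driven by one shared trait.
trait = rng.normal(size=(200, 1))
data = trait + 0.8 * rng.normal(size=(200, 4))
eigvals, loadings, scores = principal_components(data)
explained = eigvals / eigvals.sum()                # variance contribution rates
```

The covariance matrix of `scores` is diagonal, with the eigenvalues on the diagonal, which is the sense in which the new variables are uncorrelated.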
The null hypothesis of Bartlett's Test of Sphericity is that the correlation coefficient matrix of the original variables is an identity matrix. The test statistic is calculated from the determinant of the correlation coefficient matrix and approximately follows a chi-squared distribution. If the observed value of the statistic is relatively large and the corresponding P value is less than the given significance level α, we reject the null hypothesis and conclude that the correlation coefficient matrix is unlikely to be an identity matrix, so the original variables are suitable for factor analysis; if the null hypothesis cannot be rejected, the original variables are not suitable for factor analysis.
The KMO test statistic is an index that compares the simple correlation coefficients and the partial correlation coefficients among the variables, which can be expressed as follows:

$$KMO = \frac{\sum\sum_{i \neq j} r_{ij}^2}{\sum\sum_{i \neq j} r_{ij}^2 + \sum\sum_{i \neq j} p_{ij}^2}$$

where $r_{ij}$ is the simple correlation coefficient between variables i and j, and $p_{ij}$ is their partial correlation coefficient. When the sum of squares of the simple correlation coefficients among all the variables is far more than the sum of squares of the partial correlation coefficients, the KMO value is close to 1. A KMO value approaching 1 means strong correlation among the variables, so the original variables are suitable for factor analysis. When the KMO value approaches 0, the correlation among the variables becomes weaker, and the original variables are not suitable for factor analysis.
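Both suitability checks can be computed directly from the correlation matrix. The following is a sketch under stated assumptions: it uses NumPy, the commonly used Bartlett approximation -(n - 1 - (2p + 5)/6) ln|R| with p(p - 1)/2 degrees of freedom, and partial correlations obtained from the inverse correlation matrix. The data are synthetic and the function names are hypothetical.

```python
import numpy as np

def bartlett_sphericity(data):
    """Bartlett's sphericity statistic and its degrees of freedom."""
    n, p = data.shape
    corr = np.corrcoef(data, rowvar=False)
    _, logdet = np.linalg.slogdet(corr)            # log of the determinant of R
    statistic = -(n - 1 - (2 * p + 5) / 6) * logdet
    dof = p * (p - 1) // 2
    return statistic, dof                          # compare with chi-squared(dof)

def kmo(data):
    """Overall Kaiser-Meyer-Olkin measure of sampling adequacy."""
    corr = np.corrcoef(data, rowvar=False)
    inv = np.linalg.inv(corr)
    d = np.sqrt(np.diag(inv))
    partial = -inv / np.outer(d, d)                # partial correlation coefficients
    off = ~np.eye(corr.shape[0], dtype=bool)       # mask selecting i != j
    r2 = np.sum(corr[off] ** 2)
    p2 = np.sum(partial[off] ** 2)
    return r2 / (r2 + p2)

rng = np.random.default_rng(1)
# Synthetic data: items sharing one trait versus mutually independent items.
correlated = rng.normal(size=(200, 1)) + 0.8 * rng.normal(size=(200, 4))
independent = rng.normal(size=(200, 4))
stat_c, dof_c = bartlett_sphericity(correlated)
stat_i, dof_i = bartlett_sphericity(independent)
kmo_c, kmo_i = kmo(correlated), kmo(independent)
```

The correlated items give a much larger Bartlett statistic and a KMO value closer to 1 than the independent items, matching the interpretation given in the text.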

Validity analysis
Validity is the degree to which the measurement results from the scale table reflect the research content. If the measurement results match the research content, the scale has high validity; otherwise the validity is low. Validity can be divided into three types: content validity, criterion validity and construct validity.

Reliability analysis
Reliability is an estimate of the degree of consistency of a measurement. A higher reliability coefficient means that the test results of the scale table are more consistent, stable and reliable, and internal consistency is usually used to express the level of test reliability [Xue (2013)]. Internal consistency reliability is divided into split-half reliability and homogeneity reliability. To compute the split-half reliability, the scale is split into two parts, odd and even, according to the item number, and the correlation coefficient between the two halves is calculated with the Pearson product-moment formula. If the variances of the two parts are equal, the coefficient can be adjusted with the Spearman-Brown formula as follows:

$$r = \frac{2r'}{1 + r'}$$

where r is the reliability value of the total test and r' is the correlation coefficient between the two half scores. When the variances of the two parts are not homogeneous, the Flanagan formula can be used as follows:

$$r = 2\left(1 - \frac{S_1^2 + S_2^2}{S_X^2}\right)$$

where $S_1^2$ and $S_2^2$ are the variances of the test scores of the two parts, and $S_X^2$ is the variance of the total scores of all participants on the whole test.
In the calculation of the homogeneity reliability, the Cronbach's α coefficient is usually used, and the calculation formula is as follows:

$$\alpha = \frac{k\bar{r}}{1 + (k - 1)\bar{r}}$$

where k is the number of items assessed and $\bar{r}$ is the mean value of the correlation coefficients among the k items.
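The three reliability coefficients can be written as short functions. This is an illustrative sketch, assuming the Spearman-Brown, Flanagan and standardized Cronbach's α forms; the helper names and the small response matrix are hypothetical and unrelated to the survey data.

```python
import math
from statistics import variance

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def split_halves(responses):
    """Sum odd-numbered and even-numbered items separately for each respondent."""
    odd = [sum(row[0::2]) for row in responses]    # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in responses]   # items 2, 4, 6, ...
    return odd, even

def spearman_brown(r_half):
    """Full-test reliability from the half-test correlation (equal variances)."""
    return 2 * r_half / (1 + r_half)

def flanagan(half1, half2):
    """Split-half reliability when the two halves have unequal variances."""
    total = [a + b for a, b in zip(half1, half2)]
    return 2 * (1 - (variance(half1) + variance(half2)) / variance(total))

def cronbach_alpha_standardized(k, mean_r):
    """Standardized alpha from k items and their mean inter-item correlation."""
    return k * mean_r / (1 + (k - 1) * mean_r)

# Hypothetical answers of five drivers to four items (0-3 points each).
demo = [[0, 1, 0, 1], [1, 1, 2, 1], [2, 2, 1, 2], [3, 2, 3, 3], [3, 3, 3, 2]]
odd, even = split_halves(demo)
r_sb = spearman_brown(pearson(odd, even))
r_fl = flanagan(odd, even)
```

With k = 2 the standardized α reduces to the Spearman-Brown formula, which is a convenient consistency check on the two expressions.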

Validity verification of scale table

Survey respondent information
According to the relevant study, the ratio between the sample size and the number of items in the scale table should be greater than 5:1, and the total sample size should be no less than 100. Therefore, 200 questionnaires were distributed at No. 26 Road East Guoshun and No. 609, Road Guohe, Yangpu District, Shanghai, China, which are rest sites of taxi drivers, and 187 valid questionnaires were returned, so the valid return rate is 93.5%. Among these drivers, 161 are male and 26 are female. Their ages range from 27 to 59 years, and their driving experience ranges from 3 to 26 years.

Statistical analysis
The sample data were analyzed with the classical statistical software SPSS 22.0 in four aspects: item analysis, exploratory factor analysis, validity analysis and reliability analysis.

Item analysis
When the correlation is highly significant (P-value<0.01) and the correlation coefficient is more than 0.4, the homogeneity between the item and the scale is high and the tested psychological characteristics are closer to the facts. The correlation analysis between the items and the total score is shown in Tab. 2, from which we can see that all correlations are highly significant (P-value<0.01) except those of Item 6 and Item 15, whose correlation coefficients are 0.331 and 0.273 respectively. Therefore, Item 6 and Item 15 should be deleted and the other thirteen items retained. It can also be concluded that the retained items in the scale table of taxi drivers' traffic violation behavior are highly effective.

Exploratory factor analysis
After Principal Component Analysis (PCA) of the 13 items above in SPSS, the KMO of the sample data was 0.87 (more than 0.80), which means that the correlation among the variables is high and the original variables are suitable for factor analysis. Bartlett's Test of Sphericity is highly significant (p<0.001), which demonstrates that the matrix of correlation coefficients is unlikely to be an identity matrix and the variables are suitable for factor analysis. With PCA followed by Varimax orthogonal rotation, four factors whose eigenvalues are more than 1 were extracted on the basis of the characteristic roots and the scree plot, and the results show that no items need deleting. The accumulated variance contribution rate is 64.437%, which is more than the acceptable accumulated variance contribution rate of 40% and the applicable rate of 50% [Xue (2013)]. The four factors are named ignorance of traffic sign and marking, dangerous car following, over-speed, and illegal lane changing, and their variance contribution rates are 18.289%, 15.491%, 15.338% and 15.319% respectively. The factor loadings of all dimensions and items are shown in Tab. 3. The loading is measured by the coefficient of each variable on the factors, and if the value is over 0.4, the item belongs to that factor.

Validity analysis

1) Construct validity analysis
[... (2013)]. For the evaluation of the model's goodness of fit, the better the fit indices, the better the model's applicability. The criteria are as follows: (1) if $\chi^2/df$ is less than 3, the model's goodness of fit is high; (2) the closer GFI (Goodness of Fit Index) is to 1, the better the model's fitting degree; normally, GFI should be more than 0.9; (3) the closer RMSEA (Root Mean Square Error of Approximation) is to 0, the better the model's fitting degree; normally, RMSEA should be less than 0.1; (4) the other indices, NFI (Normed Fit Index), CFI (Comparative Fit Index) and IFI (Incremental Fit Index), are interpreted in the same way as GFI: the closer to 1, the better the model's fitting degree.
The analysis results show that all the fit indices reached the ideal standard: GFI = 0.91, NFI = 0.95, CFI = 0.97, IFI = 0.97, RMSEA = 0.054. Therefore, the construct validity of the scale table is good.
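The fit criteria listed above can be applied mechanically to the reported indices. In the sketch below, the cut-offs of 3 for χ²/df, 0.9 for GFI and 0.1 for RMSEA follow the text, while using 0.9 for NFI, CFI and IFI as well is an assumption made only for illustration; `check_fit` is a hypothetical helper.

```python
def check_fit(indices, chi2_df_max=3.0, close_to_one_min=0.9, rmsea_max=0.1):
    """Return a pass/fail flag for each supplied fit index."""
    results = {}
    for name, value in indices.items():
        if name == "chi2/df":
            results[name] = value < chi2_df_max        # smaller is better
        elif name == "RMSEA":
            results[name] = value < rmsea_max          # closer to 0 is better
        else:
            results[name] = value > close_to_one_min   # GFI, NFI, CFI, IFI
    return results

# Values reported in the text; chi2/df is left out because its value is not given.
reported = {"GFI": 0.91, "NFI": 0.95, "CFI": 0.97, "IFI": 0.97, "RMSEA": 0.054}
verdict = check_fit(reported)
all_pass = all(verdict.values())
```

Under these thresholds every reported index passes, in line with the conclusion that the structure of the scale is supported.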

2) Content validity analysis
To check whether the sampled content is suitable, a correlation test between every dimension and the total score of the scale of taxi drivers' traffic violation behavior was carried out, as shown in Tab. 4. All correlation coefficients in the matrix are more than 0.75 and their p-values are less than 0.01, which indicates that the content tested by the four dimensions is consistent with the scale and that the modified scale has good content validity.

Tab. 4 (excerpt; last row)
Total points  0.776**  0.789**  0.862**  0.796**  1
** The correlation is significant at the level of 0.01 (two-tailed).

Reliability analysis
Reliability analysis was also carried out in SPSS. The split-half reliability and Cronbach's α are shown in Tab. 5. The split-half reliability value is 0.827 and Cronbach's α is 0.836, both of which are more than 0.8. Meanwhile, Cronbach's α still satisfies the requirements of the reliability test for all sub-scales and after some items were deleted. Therefore, the scale has high reliability.

Case study and discussion
The modified and verified scale table was employed to conduct a questionnaire analysis of taxi drivers' behavior over a larger area of Shanghai. A total of 588 questionnaires were distributed, 556 were returned, and 544 were valid, giving an effective rate of 92.35%. Among the respondents, 498 are male and 46 are female. Their ages range from 25 to 58 years, and their driving experience ranges from 3 to 29 years.
From the statistical analysis of the survey data above in SPSS, we obtained the results shown in Tab. 6. Among the 4 types of taxi driver violation behavior, the probability of over-speed occasionally or frequently is as high as 89.27%, and its mean, median and mode are also very high. Both item 10, "night over-speed", and item 11, "speeding up to cross intersection during yellow light period", have negative skewness (less than 0) and higher kurtosis (peakedness). In addition, "the frequent appearance" of these two violation behaviors accounted for 64.20% and 58.20% respectively, both more than 50%, which means that over-speed is one of the most serious traffic violation behaviors of taxi drivers, with night over-speed and speeding up to cross the intersection during the yellow light period being especially prominent. As for item 7, "following behind the car moving slowly", listed in "dangerous car following", and item 12, "changing lane over confidence in driving skills", listed in "illegal lane change and over-taking", "the frequent appearance" of these two violation behaviors accounted for 29.70% and 26.40% respectively, which demonstrates that more than 1/4 of taxi drivers show these violation behaviors. Additionally, "never appearance" of item 4, "red light violation at intersection without surveillance in safety situation", listed in "ignorance of traffic sign and marking", has a high proportion of 93.40%, which means that most drivers would not run a red light.
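The per-item descriptive statistics discussed above (mean, median, mode, skewness and the share of each answer option) can be reproduced with the standard library. The sketch below uses a simple moment-based skewness, which may differ slightly from the sample-adjusted value reported by SPSS; the ten answers are hypothetical and `describe_item` is an illustrative name.

```python
from collections import Counter
from statistics import mean, median, mode

def describe_item(scores):
    """Summary statistics for one item scored 0-3 (Never ... Always)."""
    n = len(scores)
    m = mean(scores)
    m2 = sum((s - m) ** 2 for s in scores) / n     # second central moment
    m3 = sum((s - m) ** 3 for s in scores) / n     # third central moment
    counts = Counter(scores)
    return {
        "mean": m,
        "median": median(scores),
        "mode": mode(scores),
        "skewness": m3 / m2 ** 1.5,                # negative: mass at high scores
        "share": {option: counts.get(option, 0) / n for option in range(4)},
    }

# Hypothetical answers of ten drivers to one over-speed item (not survey data).
answers = [3, 3, 3, 2, 3, 2, 1, 3, 0, 3]
stats = describe_item(answers)
```

A negative skewness together with a high mode indicates that answers cluster at the high-frequency end of the scale, which is how the over-speed items are read in the text.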

Conclusion
In this paper, we first summarized 4 types of violation behavior of taxi drivers in a metropolis and designed the initial survey scale table. Then item analysis, exploratory factor analysis, validity analysis and reliability analysis were conducted to verify the validity of the scale table, and some improvements were made accordingly. Based on the work above, a large-area survey was carried out with the modified scale table, from which we can draw the following conclusions:
1) Among the four types of violation behavior of taxi drivers in a metropolis, over-speed is the most serious problem, with a probability of 89.57%, and "night over-speed" and "speeding up to cross intersection during yellow light period" are the most serious phenomena.
2) Among the violation behaviors of "illegal lane changes and over-taking", "changing lane over-confidence in driving skills" accounts for 26.4%, more than 1/4, and should also be given attention.
3) Among the violation behaviors of "ignorance of traffic sign and marking", "running red light at intersection without monitor" has a low proportion of 6.6%, which indicates that most taxi drivers do not run red lights.
It should be pointed out that the case study in this paper is limited to taxi drivers in Shanghai. The validity and general applicability of the designed scale table need to be verified further in other cities. Additionally, the scale table designed in this paper is only used to test the violation behaviors of taxi drivers; the internal impact factors and mechanism of these behaviors may be an interesting topic for future research.