Gender Forecast Based on the Information about People Who Violated Traffic Principle

Abstract: User portraits have become a popular concept in the big data industry in recent years as a direct way to reconstruct users' information. User portraits are usually associated with precision marketing and operations, and commercial use remains their most widely accepted application, but they can also be applied in many other industries. The goal of this paper is to predict gender with user portraits and to make the result useful for transportation safety. We extract information about people who violated traffic rules, identify their features, and then predict their gender. Finally, we analyze the prediction using feature correlations and the model's results to verify whether gender has an obvious influence on traffic violations. We also hope that this research can offer some advice to drivers and traffic departments.


Introduction
User portraits have been widely used in different areas as an efficient way of describing target users and understanding their requirements. In the era of big data, the internet is developing quickly, so users' information is scattered everywhere [1,2]. A user portrait turns each specific piece of user information into a label and then visualizes users through these labels. It builds relationships between data that seem irregular and makes them clear, vivid and easy to understand, so it can be accepted and used in all walks of life. It has become a research hotspot in recent years [3].
The main research topic of this paper is gender prediction with user portraits. Gender is one of the essential attributes of users [4] and one of the most distinguishing characteristics among human beings. Gender influences users' behaviors, hobbies and requirements. Conversely, we can predict users' gender from characteristics extracted from all kinds of user information in order to provide specific services to users, and this also helps us know our users better [5].
Gender prediction with user portraits has attracted researchers' interest all over the world because of potential applications such as precision marketing, targeted advertising and personalized recommendation. Ways of describing users based on their behaviors, moods, hobbies, topics and personalities are important research hotspots in both academia and industry [6]. Gender can be predicted with various methods, such as Random Forest (RF), Decision Tree (DT), Logistic Regression (LR) and Support Vector Machine (SVM), which are common classification and prediction models in machine learning. It should be pointed out, however, that there are only a few research results on user portraits and related topics, and most of them depend on text documents. Most textual information is private, which makes it difficult to obtain, and its textual characteristics are hard to determine [7]. In addition, most studies are based on user behaviors such as e-commerce shopping and social networking, while traffic safety and violations are rarely involved. Although the major factors influencing traffic accidents involving drivers of different genders have been studied, gender prediction using information about drivers who violated traffic rules is rarely mentioned.
This study is based on information about traffic violators in Maryland, USA. On the one hand, the public has different impressions of male and female drivers, but it is generally believed that men have an innate advantage in driving [8]. In recent years especially, female drivers have become synonymous with the "road killer" in all kinds of media reports. One of the key issues in the debate is whether there is a correlation between gender and traffic accidents [9]. This paper intends to verify and summarize this through prediction. On the other hand, unlike the more common projections of traffic violation rates, this paper starts from another angle and studies the correlation between various factors of traffic violations and gender, which can provide a theoretical basis and data support for traffic accident prevention and is also of great significance for traffic safety management [10].
According to the characteristics of the chosen data, we decided to use the Decision Tree model. During data pretreatment, we removed redundant and noisy data and then extracted new characteristics. We performed word segmentation and word frequency statistics on the text fields and converted all convertible categorical variables to numerical categories. Finally, we calculated the correlation between each characteristic and gender to obtain the final table of characteristics.
For the visual analysis of the results, a bar chart is used to count the people who violated traffic rules, a word cloud shows the result of text segmentation and word frequency statistics directly, and a heat map reflects the correlation between the features. Finally, a confusion matrix and a classification report evaluate the model's results and indicate how well the model performs.

Overview of Decision Tree (DT) Model
As mentioned above, many models can be used to classify and predict users' gender; they differ from each other and each has its own strengths. We select the Decision Tree model according to the features of the data, which facilitates the research that follows.
A decision tree is a graphical decision analysis method that applies probability analysis directly. Given the known probabilities of all possible situations, it constructs a decision tree to estimate the probability that the expected value of a project is greater than or equal to 0, and thereby evaluates the project's risk and judges its feasibility. It is called a decision tree because the decision branches are drawn like the branches of a tree. Entropy measures how disordered a system is, and this notion of entropy comes from information theory; the algorithms ID3, C4.5 and C5.0 use entropy to generate a decision tree [11]. Decision trees are divided into two categories: classification trees and regression trees. Regression trees are aimed at continuous variables and classification trees at discrete variables. A decision tree has a tree shape: every internal node represents a test on an attribute, every branch represents an outcome of the test, and each leaf node represents a category.
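As a minimal illustration of the entropy mentioned above, the following Python sketch computes the Shannon entropy of a list of labels and the information gain of one split, which is the selection criterion used by ID3. The toy labels and the function names are assumptions made for illustration and are not taken from the data of this paper.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(labels, feature_values):
    """Entropy reduction obtained by splitting `labels` on `feature_values`."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Toy data (hypothetical): gender labels and two categorical features.
gender = ["F", "F", "M", "M"]
feature = ["a", "a", "b", "b"]  # separates the two classes perfectly
noise = ["a", "b", "a", "b"]    # carries no information about gender

print(entropy(gender))                    # 1.0
print(information_gain(gender, feature))  # 1.0
print(information_gain(gender, noise))    # 0.0
```

A feature whose information gain is close to 0, like `noise` here, gives the tree nothing useful to split on.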
Generating a decision tree involves three steps. First, feature selection: one characteristic is chosen from the characteristics in the training data as the splitting criterion for the current node. There are many quantitative criteria for choosing characteristics, which give rise to different decision tree algorithms. Second, tree generation: child nodes are generated recursively from the top down according to the chosen criterion. The recursive structure is the easiest one to understand in a decision tree. Third, pruning: a decision tree overfits easily, so it needs to be pruned to reduce its size and alleviate overfitting. Pruning includes pre-pruning and post-pruning [12].
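The three steps above can be sketched in a few lines of Python. The sketch below builds an ID3-style classification tree on categorical features, with a `max_depth` limit acting as a simple form of pre-pruning. It is an illustrative assumption and not the implementation used in this paper; the toy rows, feature names and function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())


def split(rows, labels, feature):
    """Group rows and labels by the value each row has for `feature`."""
    groups = {}
    for row, label in zip(rows, labels):
        sub_rows, sub_labels = groups.setdefault(row[feature], ([], []))
        sub_rows.append(row)
        sub_labels.append(label)
    return groups


def gain(rows, labels, feature):
    """Step 1: information gain, used here as the feature selection criterion."""
    groups = split(rows, labels, feature)
    remainder = sum(
        len(sub_labels) / len(labels) * entropy(sub_labels)
        for _, sub_labels in groups.values()
    )
    return entropy(labels) - remainder


def build_tree(rows, labels, features, max_depth):
    """Step 2: recursive top-down generation. Step 3: pre-pruning via max_depth."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not features or max_depth == 0:
        return majority  # leaf node: a class label
    best = max(features, key=lambda f: gain(rows, labels, f))
    remaining = [f for f in features if f != best]
    children = {
        value: build_tree(sub_rows, sub_labels, remaining, max_depth - 1)
        for value, (sub_rows, sub_labels) in split(rows, labels, best).items()
    }
    return {"feature": best, "children": children, "default": majority}


def predict(tree, row):
    """Walk from the root to a leaf; unseen values fall back to the majority class."""
    while isinstance(tree, dict):
        tree = tree["children"].get(row[tree["feature"]], tree["default"])
    return tree


# Toy data (hypothetical feature names and values).
rows = [
    {"belts": "yes", "accident": "no"},
    {"belts": "yes", "accident": "yes"},
    {"belts": "no", "accident": "no"},
    {"belts": "no", "accident": "yes"},
]
labels = ["F", "F", "M", "M"]

tree = build_tree(rows, labels, ["belts", "accident"], max_depth=2)
print(tree["feature"])                                    # belts
print(predict(tree, {"belts": "no", "accident": "yes"}))  # M
```

Post-pruning, which removes branches after the full tree has been grown, is not shown in this sketch.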
Using a decision tree also has some requirements: there is a goal that the decision maker wants to achieve; there are at least two feasible alternatives for the decision maker to choose from; there are at least two uncertain factors that the decision maker cannot control; the gains and losses of the different alternatives under the different factors can be calculated; and the decision maker can estimate the probabilities of the uncertain factors [13].
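These requirements describe the classical decision-analysis use of a decision tree. A minimal Python sketch, assuming two hypothetical alternatives, two uncertain states with made-up probabilities, and made-up payoffs, chooses the alternative with the highest expected value:

```python
# Probabilities of the uncertain states (assumed to be estimated by the decision maker).
state_probabilities = {"high_demand": 0.6, "low_demand": 0.4}

# Gain (positive) or loss (negative) of each alternative under each state (hypothetical).
payoffs = {
    "expand": {"high_demand": 100.0, "low_demand": -40.0},
    "hold": {"high_demand": 30.0, "low_demand": 10.0},
}


def expected_value(alternative):
    """Probability-weighted payoff of one alternative (a chance node of the tree)."""
    return sum(
        probability * payoffs[alternative][state]
        for state, probability in state_probabilities.items()
    )


expected = {name: expected_value(name) for name in payoffs}
best = max(expected, key=expected.get)
print(expected)
print(best)  # expand
```

Each requirement appears in the sketch: a goal (maximize expected value), two alternatives, two uncertain states, computable payoffs, and estimated probabilities.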

The Advantages and Disadvantages of This Algorithm
A decision tree is easy to understand and implement. It can handle both numerical and categorical attributes and can produce feasible and effective results for large data sources in a relatively short time. The reliability of the model is easy to measure by static testing. Given an observed model, logical expressions can be derived easily from the resulting decision tree.
However, continuous fields are harder to predict. Time-ordered data requires a lot of preprocessing. When there are too many categories, errors may increase quickly. In addition, typical algorithms split on only one field at a time.

The Scope of Application
A decision tree can help the decision makers of an enterprise analyze operating risks and directions easily. The accuracy of a decision depends on a scientific decision method.
Enterprises and analysts who want to classify users can easily apply a decision tree once they have accumulated enough customer data and resources.
A decision tree is always tied to the analysis goal and background. For example, it can estimate the risk of debt in the financial industry, support the promotion of certain kinds of insurance in the insurance industry, and generate auxiliary diagnosis models in the medical industry [14].

Global Mutual Interference Coefficient Based on Matrix
The

Data Pretreatment
After reading the data, we extract some useful characteristics to build a new table, because the original data is irregular and contains both text fields and Boolean fields. In addition, during data cleaning, text data and noisy data that cannot be converted into a numerical type are deleted, which reduces the steps and difficulty of subsequent machine processing [15]. The next step is to check for null values, delete all rows that contain a null value, and reset the index. Some characteristics of the original data need to be transformed to yield more useful information. First, the year is taken from the date of the violation to create a "date" attribute. Then the year in which the car was manufactured is subtracted from it to obtain how long the car had been in use. Finally, this result is added to the table.
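The cleaning and feature construction steps above can be sketched as follows. This is a minimal pure-Python sketch on a few made-up records. The field names (`date_of_stop`, `vehicle_year`, `vehicle_age`) and the date format are assumptions, since the column names are not listed here, and a real pipeline would more likely use a data-frame library.

```python
from datetime import datetime

# Made-up records standing in for the raw violation table.
raw_rows = [
    {"date_of_stop": "09/24/2013", "vehicle_year": 2008, "gender": "F"},
    {"date_of_stop": "03/01/2016", "vehicle_year": None, "gender": "M"},  # null value
    {"date_of_stop": "12/20/2017", "vehicle_year": 2012, "gender": "M"},
]


def drop_null_rows(rows):
    """Delete every row that contains a null value; list positions act as the reset index."""
    return [row for row in rows if all(value is not None for value in row.values())]


def add_vehicle_age(rows):
    """Add the year of the violation ("date") and the age of the vehicle at that time."""
    result = []
    for row in rows:
        stop_year = datetime.strptime(row["date_of_stop"], "%m/%d/%Y").year
        result.append({**row, "date": stop_year, "vehicle_age": stop_year - row["vehicle_year"]})
    return result


clean_rows = add_vehicle_age(drop_null_rows(raw_rows))
for index, row in enumerate(clean_rows):
    print(index, row["date"], row["vehicle_age"], row["gender"])
```

The second record is dropped because of its null value, and the two remaining records receive the new "date" and vehicle-age fields.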
To understand the specific circumstances of the violations, we extracted the field containing the specific description of the violation separately, performed English text segmentation on it, and counted word frequencies. The text can be split into words directly on spaces because all of the English words are in capital letters. When deleting the stop words, however, all letters need to be converted to lowercase for analysis. Fig. 2 shows part of the descriptions after conversion and deletion [16].

Figure 2: Part description of violation
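The segmentation and word frequency statistics described above can be sketched in Python as follows. The descriptions and the short stop-word list are made up for illustration; a real analysis would use the actual "Description" field and a fuller stop-word list.

```python
from collections import Counter

# A tiny illustrative stop-word list (assumption); real lists are much longer.
STOP_WORDS = {"a", "an", "the", "of", "on", "in", "to", "with", "by", "for"}


def word_frequencies(descriptions):
    """Split upper-case descriptions on whitespace, lowercase, drop stop words, count."""
    counts = Counter()
    for text in descriptions:
        for token in text.split():
            word = token.lower().strip(".,;:()")
            if word and word not in STOP_WORDS:
                counts[word] += 1
    return counts


# Made-up descriptions in the style of violation records.
descriptions = [
    "EXCEEDING THE POSTED SPEED LIMIT OF 40 MPH",
    "DRIVING VEHICLE ON HIGHWAY WITH SUSPENDED LICENSE",
    "EXCEEDING MAXIMUM SPEED",
]

frequencies = word_frequencies(descriptions)
print(frequencies.most_common(2))  # [('exceeding', 2), ('speed', 2)]
```

The resulting counts are the kind of input a word cloud is drawn from, with more frequent words shown larger.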
When choosing characteristics, the correlation coefficient method makes it convenient to measure the correlation between each characteristic and the gender attribute once the categorical values of each characteristic are replaced with numerical values [17]. We can then select the characteristics one by one. The Pearson correlation coefficient reflects the degree of linear correlation between two variables, but the results show that the correlation between each characteristic and the gender attribute is not very high, and that both positive and negative values occur. We therefore also calculated the Kendall coefficient, a rank-based measure suited to ordered categorical variables, and the Spearman coefficient, a rank-based measure that also captures monotonic nonlinear relationships, and found that the results are similar to Pearson's. Therefore, for the sake of the subsequent prediction model, only the positively correlated characteristics are kept.
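To make the comparison of the three coefficients concrete, here is a minimal pure-Python sketch of the Pearson, Spearman and Kendall (tau-b) coefficients on a made-up numerically encoded feature. The data and function names are illustrative assumptions; in practice a library routine would normally be used, for example the `method` argument of pandas `DataFrame.corr`.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """Kendall tau-b, which corrects for ties in either variable."""
    concordant = discordant = ties_x = ties_y = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: excluded from every count
            if dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    pairs = concordant + discordant
    return (concordant - discordant) / math.sqrt((pairs + ties_x) * (pairs + ties_y))


# Made-up data: gender encoded as 0/1 and one numeric feature.
gender = [0, 0, 1, 1, 0, 1]
vehicle_age = [3, 5, 4, 8, 2, 7]

for name, func in [("pearson", pearson), ("spearman", spearman), ("kendall", kendall_tau_b)]:
    print(name, round(func(vehicle_age, gender), 3))
```

All three functions divide by zero for a constant column, so constant features would need to be removed first.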
After this screening we obtain the final table (feature.csv), whose shape is (1034594, 8); its characteristics are detailed in Tab. 1. With the final feature table in hand, we begin to build the model and make predictions. We use scikit-learn, a third-party Python library. Scikit-learn is built on the NumPy, SciPy and Matplotlib libraries and can be used for classification, regression, dimensionality reduction and cluster analysis. The efficiency of machine learning can be improved quickly if the advantages of these four modules are used correctly [18]. The task of this paper is to predict the gender of violators, which is a binary classification problem. Therefore, according to the characteristics of the data, the decision tree algorithm is selected to model the training set, compare attributes from top to bottom, and predict the gender of the test set.
First, the data is divided into a training set (80%) and a test set (20%) using train_test_split, with "gender" as the dependent variable y and the other characteristics as the independent variables x. Second, the decision tree model is trained on the training set, and the test set is then fed into the model to obtain the predicted gender (X_text). Finally, the predictions are compared with the real values (Y_text) to obtain a score and the model's results. At last, the results are processed and presented visually.
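A minimal sketch of this workflow with scikit-learn is given below. It runs on synthetic random data rather than on feature.csv, so the printed score says nothing about the real experiment. The variable names (`X_train`, `X_test`, `y_test`, `y_pred`), the sample size and the fixed `random_state` are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for feature.csv: 7 numeric features plus a binary gender label.
n_samples = 1000
X = rng.integers(0, 5, size=(n_samples, 7))
y = rng.integers(0, 2, size=n_samples)  # 0/1 gender codes, unrelated to X by construction

# 80% training set, 20% test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Train the decision tree on the training set.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Predict the gender of the test set and compare with the real values.
y_pred = model.predict(X_test)
accuracy = model.score(X_test, y_test)  # fraction of correct predictions
print(X_train.shape, X_test.shape, accuracy)
```

Because the synthetic labels are independent of the features, the score is expected to stay near chance level, which illustrates how a decision tree behaves when the features carry little information about the target.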

Analyze the Visual Experiments Results
Gender is predicted from the correlation between the various factors and characteristics and the gender attribute, so gender is indispensable in this study. We first count the numbers of females and males in the data set and draw a bar chart (Fig. 3). The chart shows that the two genders are not balanced among the people who violated traffic rules from 2012 to 2018. There are about 700000 females, about twice as many as males, which suggests that, in this data set, women are more likely to violate traffic rules.
After performing text segmentation on "Description", we counted word frequencies and drew the following word cloud.

Figure 4: Word cloud map of violation description
As Fig. 4 shows, apart from "Driving", "Vehicle", "Motor", "Highway" and other words that describe the vehicle and road conditions, words such as "Posted", "Mph", "Maximum Speed", "License" and "obey" appear larger. It can be inferred that the reasons for the violations may also be related to speeding, driving licenses and failure to obey traffic rules.
When examining the correlation coefficients among the characteristics, the most important thing is to check the correlation between gender, in the first column, and the other characteristics. We therefore draw a heat map (see Fig. 5).

Figure 5: Heat map of feature correlations
The correlation values of all features are very low, close to 0 or even negative. The Pearson correlation coefficient therefore shows basically no linear correlation between the features, and the Kendall and Spearman coefficients likewise confirm that there is no obvious correlation between any feature and gender.
We obtain the final result after building and training the decision tree model. Comparing the real gender values with the predicted values gives the accuracy of the model on the test set (see Fig. 6).

Figure 6: Diagram of experimental results
Fig. 6 shows the results on the 20% of the data used as the test set: there are 139997 correct predictions and 66922 wrong ones, which means the accuracy of the model is 67.66%. This is not ideal.
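The reported figures can be checked with a few lines of arithmetic. The sketch below uses only the numbers stated in the text (the counts of correct and wrong predictions and the number of rows in feature.csv); the variable names are ours.

```python
correct = 139997
wrong = 66922
total_rows = 1034594  # rows in feature.csv

test_size = correct + wrong
accuracy = correct / test_size

print(test_size)                         # 206919
print(f"{test_size / total_rows:.1%}")   # 20.0%
print(f"{accuracy:.2%}")                 # 67.66%
```

The test set therefore contains 206919 records, which is 20% of the 1034594 rows, and the accuracy follows directly from the two counts.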
A confusion matrix is an analysis table that summarizes the predictions of a classifier in machine learning. It gathers the records of the data set in a matrix according to two criteria: the real category and the category predicted by the classification model. Fig. 7 is the confusion matrix of real and predicted values; the sum of (0,0) and (1,0) is the number of positive samples, and the sum of (0,1) and (1,1) is the number of negative samples. Precision is the proportion of samples predicted as positive that are truly positive. Recall is the proportion of positive samples that are correctly predicted among all positive samples. F1 is the harmonic mean of precision and recall and reflects the two indexes jointly [19]. The three indexes are calculated from the confusion matrix (see Fig. 8).

Figure 8: The classification of decision tree
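The relationship between a confusion matrix and the three indexes can be sketched as follows. The counts are hypothetical and are not the values of Fig. 7; they only show how precision, recall and F1 are derived for one class treated as positive.

```python
def classification_metrics(tp, fp, fn):
    """Precision, recall and F1 for the class treated as positive."""
    precision = tp / (tp + fp)  # correct positives among predicted positives
    recall = tp / (tp + fn)     # correct positives among real positives
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Hypothetical counts: true positives, false positives, false negatives, true negatives.
tp, fp, fn, tn = 80, 20, 40, 60

precision, recall, f1 = classification_metrics(tp, fp, fn)
accuracy = (tp + tn) / (tp + fp + fn + tn)
print(round(precision, 3), round(recall, 3), round(f1, 3), round(accuracy, 3))
```

Swapping the roles of the two classes (treating the other class as positive) gives the second row of a classification report.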
Combining the two figures above, we can see that the prediction is correct far more often for females (96%) than for males (10%), so the model does not fit the prediction of males well. In terms of recall, the proportion of positive samples correctly predicted is not high for either men or women. Therefore, from the perspective of the F1 score, the overall index is low and the model's performance is not ideal.

Conclusions and Future Work
We find that, when the test set is not fixed at 20%, the accuracy of the model does not increase as the test set shrinks, nor does it decrease as the test set grows. The accuracy varies irregularly and has no correlation with the sizes of the training and test sets, but it never exceeds 69%. We can therefore deduce that there is no direct correlation between gender and the other factors of a violation. Gender cannot be determined directly from these factors alone, which means that gender differences take no specific form in traffic violations.
In general, the unsatisfactory results can have many causes. First, there was no mature process for selecting and extracting features during data pretreatment, so we did not obtain more valuable feature attributes. Second, the chosen data set is not well suited to classification prediction, because, as mentioned above, there is no obvious correlation between gender and any characteristic. The decision tree therefore has no good criterion for comparing attributes at its internal nodes, which makes classification difficult. This also supports our research conclusion from another side.
The experiment also shows that, although the number and proportion of traffic violations are higher for women than for men in this data set, this is likely to be affected by the local population, the traffic environment or other factors and cannot be attributed to gender differences. Nor can gender be regarded as the cause of traffic violations; it may have an influence, but it is not a direct factor. Therefore, in daily travel, all drivers, both men and women, should take driving seriously and strictly abide by traffic laws and regulations, out of responsibility to themselves and others. Nor should accidents be blamed on gender weakness, since the law does not discriminate on the basis of sex.
This research is still relatively simple and has many shortcomings. Since only the decision tree model is applied in this paper, we will work to understand prediction models of various categories and optimize them on the basis of this study, in order to compare whether different models give different prediction results. In addition, in view of the shortcomings in feature selection, we will continue to analyze the features in the original data and study feature engineering further, so as to make the research more rigorous.
We have learned a lot from gender prediction with user portraits [20]. Gender prediction requires not only the gender attribute but also the correlation between it and other characteristics, to check whether they can be used for prediction. Gender prediction is also just a small part of user portraits, so more research is needed to understand them better and use them well [21]. We believe that user portraits can bring much convenience and many information resources to people in the future.