Vehicle Emission Detection with Data-Driven Methods

Environmental protection is a fundamental policy in many countries, and vehicle emissions stand out as a main component of the pollution tracked in environmental monitoring. Remote sensing technology has recently been widely used for vehicle emission detection, mainly due to the speed, authenticity, and large scale of the detection data it retrieves. In the remote sensing process, the fuel type and registration time of new cars and nonlocally registered vehicles usually cannot be accessed, so vehicle pollution cannot be assessed directly by analyzing emission pollutants. To handle this problem, this paper applies data mining methods to the remote sensing data to predict fuel type and registration time. The paper makes full use of decision tree, random forest, AdaBoost, XgBoost, and their fusion models to predict these two essential pieces of information precisely and further employs them in an essential application: vehicle emission evaluation.


Introduction
The popularization of vehicles in daily life has continuously grown with the expansion of urbanization around the world. Gasoline-engine vehicles remain far more widespread than new energy ones, and their exhaust gases, such as carbon dioxide, carbon monoxide, hydrocarbons, and nitrogen oxides, have become the main contaminants in urban atmospheric pollution [1]. Efficient vehicle pollution detection has therefore become an urgent task that attracts more and more attention. Exhaust emission detection methods have evolved from periodic detection at environmental monitoring stations to daily road detection with remote sensing technology. This paper studies vehicle emission detection in cities of China, one of the largest developing countries.
In the USA, the EPA (Environmental Protection Agency) proposed the MOVES model [2] to calculate vehicle emission rates at fixed locations and periods of time. The Japanese government enforces a nationwide vehicle exhaust emission monitoring system, and the emission behaviour of each vehicle in Japan can be checked on the official website of the national transportation authority [3]. In order to capture emission detection results rapidly, a French transport agency collects emission-pollution-related information from different places and aggregates it into a shared network for vehicle emission detection [4]. Related research in this area started somewhat later in China. In 2011, Cheng et al. [5] systematically analyzed the harm caused by vehicle emissions, verifying the necessity of exhaust emission control. The next year, Wu [6] collected the values of CO2, HC, CO, and NO exhausted by 1092 vehicles in Xianyang city using a simplified loaded mode. They established regression equations between the emission value and vehicle information and found that the average emission value was highly related to the vehicle's acceptability and age. Referring to the local standards, they further gave a systematic explanation of the rationality of the local standard mean emission value based on their research. With the development of remote sensing technology, a large amount of practical exhaust emission data can be obtained by environmental protection agencies in China. This paper applies data mining technology to these valuable data to extract useful information for vehicle exhaust emission detection. This research has great potential to help environmental protection departments accurately identify unqualified vehicles and to provide a theoretical basis for policymakers.
The first successful vehicle emissions demonstration system was probably the across-road vehicle emissions remote sensing system (VERSS) proposed by Gary Bishop and colleagues at the University of Denver in the late 1980s [7,8]. The first instrument, a liquid-nitrogen-cooled nondispersive infrared (NDIR) analyzer, could only measure CO and CO2. Over the next two decades, their team continuously refined the system: they added hydrocarbon, H2O, and NO channels to the NDIR system [9,10], integrated and improved an ultraviolet spectrophotometer to enhance NO measurement [11,12], and removed the dependence on liquid nitrogen cooling [13].
The Denver group designed another commonly used remote sensing device, known as the fuel efficiency automobile test (FEAT), which provided some of the earliest reports on across-road particulate measurement [14].
There are also many other sensing systems, typically based on multiple spectrometric approaches, proposed for the detection of passing vehicle emissions [15][16][17]. More recently, Hager Environmental and Atmospheric Technologies introduced an infrared laser-based VERSS named the Emission Detection and Reporting (EDAR) system, which incorporates several new functions, making it a particularly interesting system for vehicle emission detection.
Important information is buried in the vehicle emission remote sensing data.
This paper exploits data mining methods to process these data and extract valuable knowledge from them. There are three main directions in data mining: improvements of classical data mining algorithms, ensemble learning algorithms, and data mining with deep learning. Improvements on classical algorithms are usually developed for specific application scenarios by taking additional information into consideration. Ensemble learning is the integration of multiple learners within a certain structure, completing learning tasks by constructing and combining different learners. Its general procedure is as follows: first generate a set of individual learners and then combine them with some strategy. The combining strategies mainly include the averaging method, the voting method, and the learning method. Bagging and boosting [18] are the most commonly used ensemble learning algorithms, improving the accuracy and robustness of prediction models. With the rapid development and popularization of deep learning, it plays an increasingly important role in data learning with the support of big data and high-performance computing. Many traffic engineering-related studies focus on analyzing relevant data, such as traffic diversion [19], traffic safety monitoring [20], engine diagnosis [21], road safety [22], traffic accidents [23], and remote sensing image processing [24][25][26][27][28][29][30][31][32][33][34][35], extracting useful information and digging out valuable knowledge. Few works address vehicle emission evaluation with data mining, which is the key subject of this paper. Xu et al. [36] used XgBoost to develop trip-level prediction models for CO2eq and PM2.5 emissions. In [37], Ferreira et al.
applied online analytical processing (OLAP) and knowledge discovery (KD) techniques to handle the high volume of their dataset, determine the major factors that influence average fuel consumption, and then classify the drivers involved according to their driving efficiency. Chen et al. [38] proposed a driving-events-based eco-driving behaviour evaluation model, which proved to be highly accurate (96.72%).
In China, relevant environmental policies define different emission limit standards based on vehicle fuel type and registration time. The vehicle license plate number, plate color, speed, acceleration, VSP (vehicle specific power), etc., are captured by the surveillance system when vehicles pass the remote survey stations. Laser gears at the stations simultaneously analyze the smoke plume generated by gas emission, from which the exhaust emission value can be calculated. With the fuel type and registration time information derived from vehicle plate numbers, the gas emission standard value can be obtained to judge whether the vehicle emission is eligible. However, registration information of nonlocal vehicles and some local vehicles is not recorded in the official database due to the limitations of environmental policies, so the fuel type and registration time information cannot be provided for vehicle emission detection. According to the National Telemetry Standard in China, relevant departments treat the information-missing vehicles as diesel-fueled ones; this leaves the emission limit criteria of some vehicles unknown, so the evaluation of these vehicles cannot be carried out.
Therefore, precise information on vehicle fuel type and registration time is an essential prerequisite for finding pollution-exceeding vehicles. This paper adopts multiple data mining methods to learn the fuel type and registration information of vehicles from remote sensing data and further utilizes a cascaded classification framework to make accurate predictions of vehicle emission-related information, providing valuable reference standards for the evaluation of different vehicles.

Data Mining Models for Analysis
This section gives detailed descriptions of the models and the dataset used in this study [39].

Decision Tree Model.
The decision tree model [40] is a commonly used data mining method based on information theory and a greedy algorithm-like framework, proposed for classification or prediction. The model divides the whole dataset into branch-like parts to construct an inverted tree with a root node, internal nodes, and leaf nodes. Its nonparametric design enhances the efficiency and generalization ability of the decision tree when processing a large and complex dataset. Five core components make up the decision tree: (1) Nodes: root node, internal nodes, and leaf nodes are the three types, representing different choice operations for data distribution. (2) Branches: they represent the splitting process of nodes in the decision tree, and each branch from the root node to a leaf node represents a corresponding decision rule for classification. (3) Splitting: a procedure that generates child nodes from parent nodes, terminating when a predetermined homogeneity or stopping criterion is met. (4) Stopping: stopping rules are applied to prevent overfitting and inaccuracy. (5) Pruning: an alternative approach that establishes a large tree first and then prunes it to an optimal structure by removing useless nodes. This paper uses decision trees to classify vehicle fuel type; specifically, it calculates the information gain of the corresponding attributes to generate the rule model for fuel type prediction. The attributes that greatly affect the final results can thus be shown in a quite intuitive way. Figure 1 illustrates the decision tree model for fuel type prediction.
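As an illustrative sketch only (not the authors' code), the information-gain-based fuel type classifier described above can be built with scikit-learn's DecisionTreeClassifier; the synthetic features below are placeholders for the non-public remote sensing fields:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for the remote sensing features (the real fields --
# speed, acceleration, pollutant readings, etc. -- are not public).
n = 2000
X = rng.normal(size=(n, 6))
# Hypothetical rule: label 1 ("diesel") when the first two readings are high.
y = (X[:, 0] + 0.8 * X[:, 1] > 0.5).astype(int)

# criterion="entropy" makes each split maximize information gain, matching
# the paper's information-gain-based rule generation.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=5, random_state=0)
tree.fit(X[:1600], y[:1600])
acc = tree.score(X[1600:], y[1600:])
importances = tree.feature_importances_  # per-attribute influence, as in Figure 4
```

The `feature_importances_` vector gives the intuitive per-attribute view of what drives the prediction that the paragraph mentions.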

Random Forest Model.
The random forest model is another classic and efficient data mining method that belongs to bagging learning. In 2001, Leo Breiman combined ensemble learning theory [41] with the random subspace method [42], proposing the well-known machine learning methodology: random forest. It is a data-driven nonparametric model that requires no a priori knowledge, tolerates noise and abnormal values well, and offers excellent extensibility and parallelism for high-dimensional data classification. The ensemble learning structure enables random forest to overcome, to some extent, the performance bottleneck and overfitting of single classifiers such as SVM.
Given a dataset D = {(X_i, Y_i)}, X_i ∈ R^K, Y_i ∈ {1, 2, ..., C}, a random forest is essentially a combined classifier made up of M decision trees g(D, θ_m), m = 1, 2, ..., M. The classification result is decided by a vote of every decision tree and depends on two vital randomizations: sample bagging and the random feature subspace. The sample bagging process randomly draws M training sets with replacement, each the same size as the original dataset, and constructs a corresponding decision tree from each. When a node in a decision tree is split, the model randomly selects a feature subset from the whole K features (usually of size log2 K), from which an optimal splitting feature is chosen to grow the tree. These features constitute the random feature subspace and contain a more discriminative feature combination for classification. Since the random selection of training data and feature subspace is independent for each decision tree and the constructions are procedurally identical, θ_m, m = 1, 2, ..., M is a sequence of independent, identically distributed random variables.
This property makes random forest applicable and efficient to implement in a parallel computing way and simultaneously ensures its high extensibility. The structure and construction of random forest are shown in Figure 2.
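The two randomizations above map directly onto scikit-learn parameters; the following is a minimal sketch on synthetic data (the dataset and labels are assumptions for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
n = 2000
X = rng.normal(size=(n, 9))
y = (X[:, 0] - X[:, 1] + 0.5 * X[:, 2] > 0).astype(int)

# bootstrap=True gives the sample-bagging randomization; max_features="log2"
# gives the log2(K)-sized random feature subspace described above.
# n_jobs=-1 exploits the parallelism the text mentions: trees are independent.
forest = RandomForestClassifier(
    n_estimators=100, max_features="log2", bootstrap=True,
    n_jobs=-1, random_state=1,
)
forest.fit(X[:1600], y[:1600])
acc = forest.score(X[1600:], y[1600:])
```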

AdaBoost Model.
Similar to the bagging method, boosting also belongs to the ensemble learning family; it continuously enhances the classification/prediction accuracy of a base learner through an iterative update process. AdaBoost [43], based on boosting learning, was proposed by Freund and Schapire in 1995. The algorithm has been widely used in various classification/prediction fields due to its outstanding performance. The central idea of AdaBoost is to continuously update the sample weights. The weights of samples misclassified by the previous base classifier are increased while the weights of correctly classified samples are decreased, and the reweighted samples are used to train the next base classifier. At the same time, a new weak classifier is added to the cascaded classifiers in each iteration, and the final strong classifier is not determined until a predetermined sufficiently small error rate or the prespecified maximum number of iterations is reached.
The concrete procedure is as follows. Given the training set {(x_1, y_1), ..., (x_n, y_n)}, y_i ∈ {−1, 1}, i = 1, ..., n, x_i is the i-th training sample with its label y_i, where y_i = 1 and y_i = −1 denote the positive and negative labels, respectively, and w_i is the weight of the i-th sample. First, initialize the weights with the uniform distribution: D_1 = (w_11, ..., w_1n), w_1i = 1/n. Then, for m = 1, 2, ..., M iterations: the base classifier G_m(x) is trained under the weight distribution D_m, and the accumulated weight of the samples misclassified by G_m is the classification error rate E_m, defined as E_m = Σ_i w_mi · I(G_m(x_i) ≠ y_i). The coefficient α_m of the base classifier G_m(x) is calculated as α_m = (1/2) ln((1 − E_m)/E_m). In the next iteration, the weight distribution D_(m+1) is updated as w_(m+1)i = (w_mi/Z_m) exp(−α_m y_i G_m(x_i)), where the normalization factor Z_m is defined as Z_m = Σ_i w_mi exp(−α_m y_i G_m(x_i)). The final strong classifier G(x) is the combination of the trained base classifiers, denoted as G(x) = sign(Σ_m α_m G_m(x)).
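The weight-update formulas above can be implemented directly; the sketch below (a toy one-dimensional problem with decision stumps as base classifiers, all values illustrative) follows them term by term:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400
x = rng.uniform(-1, 1, size=n)
y = np.where(x > 0.1, 1, -1)          # labels in {-1, +1}, as in the derivation

def stump(threshold, sign):
    """A decision stump G(x) = sign * sign(x - threshold)."""
    return lambda x: sign * np.where(x > threshold, 1, -1)

# Candidate base classifiers: stumps over a grid of thresholds.
candidates = [stump(t, s) for t in np.linspace(-1, 1, 41) for s in (1, -1)]

w = np.full(n, 1.0 / n)               # D_1: uniform initial weights w_1i = 1/n
alphas, chosen = [], []
for m in range(5):                    # M = 5 boosting rounds
    # Pick the stump with the smallest weighted error E_m.
    errors = [np.sum(w * (g(x) != y)) for g in candidates]
    E_m = min(errors)
    G_m = candidates[int(np.argmin(errors))]
    # Coefficient alpha_m = (1/2) ln((1 - E_m) / E_m).
    alpha_m = 0.5 * np.log((1 - E_m) / max(E_m, 1e-12))
    # Weight update: w <- w * exp(-alpha_m * y * G_m(x)) / Z_m.
    w = w * np.exp(-alpha_m * y * G_m(x))
    w /= w.sum()                       # dividing by Z_m keeps D_{m+1} a distribution
    alphas.append(alpha_m)
    chosen.append(G_m)

# Final strong classifier G(x) = sign(sum_m alpha_m G_m(x)).
F = lambda x: np.sign(sum(a * g(x) for a, g in zip(alphas, chosen)))
train_acc = np.mean(F(x) == y)
```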

XgBoost Model.
XgBoost [44] is a modified algorithm based on GBDT (gradient boosting decision tree) [45]. Both follow the boosting methodology: in each iteration, the current decision trees are learned from the previous iteration's results and move in the direction that diminishes the residual. For multiclassification problems, the logarithmic likelihood loss function is defined as L(y, p(x)) = −Σ_k y_k log p_k(x), where y is the label of the input data x, k denotes the class value, and y_k is an indicator such that y_k = 1 if the label of x is k and y_k = 0 otherwise. The prediction probability p_k(x) is given by the softmax p_k(x) = exp(f_k(x)) / Σ_l exp(f_l(x)). At iteration t, if the label of sample i is l, the current negative gradient can be calculated as r_til = y_il − p_(l,t−1)(x_i). As a classic algorithm, GBDT has advantages such as high accuracy, robustness, and convenience. Yet it uses only the first-order partial derivative to compute the negative gradient, which may cause a relatively large error. To overcome this shortcoming, XgBoost expands the loss function with a second-order Taylor expansion, and the results become closer to the ground truth when this expansion is used to calculate the leaf node weights. In addition, XgBoost pre-sorts the sample data and stores the records in blocks, so its speed is much faster than GBDT on the same training data.
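The first-order versus second-order distinction can be made concrete with numpy. For the softmax log-loss above, the gradient is p_k − y_k and the (diagonal) Hessian is p_k(1 − p_k); the sketch below (illustrative scores, single sample, L2 regularization λ = 1 assumed) shows how XgBoost's leaf weight uses both:

```python
import numpy as np

def softmax(f):
    e = np.exp(f - f.max())
    return e / e.sum()

# Current raw scores f_k(x) for one sample over K = 3 classes; true label l = 1.
f = np.array([0.2, 1.0, -0.5])
y = np.array([0.0, 1.0, 0.0])          # one-hot indicator y_k
p = softmax(f)                          # p_k(x) = exp(f_k) / sum_l exp(f_l)

# First-order term used by GBDT: the negative gradient y_k - p_k(x).
neg_grad = y - p
# Second-order term added by XgBoost: the diagonal Hessian p_k (1 - p_k).
hess = p * (1 - p)

# XgBoost's optimal leaf weight per class (with regularization lambda = 1)
# is w* = -g / (h + lambda): first-order statistics corrected by second-order.
lam = 1.0
g = p - y                               # gradient of the loss w.r.t. f_k
leaf_weights = -g / (hess + lam)
```

Dividing the gradient by the Hessian is what the text means by the second-order correction bringing the update closer to the ground truth than a plain gradient step.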

Model Fusion.
Although the single models mentioned in the above subsections have satisfactory performance on data classification and prediction, model fusion methods, which exploit the differences between the characteristics of different single models, can further improve the accuracy and robustness of the final results. Voting is a common method for model fusion. This paper adopts two fusion methods to construct the combined model: hard voting and soft voting [46]. The hard voting classifier follows the simple idea that the minority obeys the majority. Given the classification/prediction results, each classifier's label for the same instance is treated as a vote, and the most voted label is assigned to the instance. This process can be defined as H(x) = argmax_c Σ_j lab_c(x, j, c), where x denotes the instance and lab_c(x, j, c) is an indicator showing whether the j-th classifier deems the label of x to be c: lab_c(x, j, c) = 1 when the probability p(x, j, c) that x belongs to label c, as calculated by the j-th classifier, exceeds a threshold value; otherwise, lab_c(x, j, c) = 0. The soft voting classifier is another fusion strategy, which uses the average of the probabilities assigned by all classifiers to a certain label as the criterion. The label with the highest average probability is the final result, and the voting can be written as H(x) = argmax_(c ∈ {1, ..., k}) (1/n_cf) Σ_j p(x, j, c), where n_cf is the total number of classifiers and k is the number of labels. This paper combines the different models of Section 2 with both methods, denoted the hard voting model and the soft voting model, respectively, to compare the final results.
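The two voting rules can disagree, which is why the paper compares both. A minimal numpy sketch (the probability table is a made-up example of three hypothetical classifiers) shows the difference:

```python
import numpy as np

# Predicted class probabilities p(x, j, c) from three hypothetical classifiers
# (rows: classifiers j, columns: labels c) for a single instance x.
probs = np.array([
    [0.45, 0.55],    # classifier 1 votes label 1
    [0.40, 0.60],    # classifier 2 votes label 1
    [0.90, 0.10],    # classifier 3 votes label 0, but very confidently
])

# Hard voting: each classifier casts one vote for its argmax label;
# the majority label wins.
votes = probs.argmax(axis=1)
hard_label = int(np.bincount(votes, minlength=2).argmax())

# Soft voting: average the probabilities over the n_cf classifiers and
# take the label with the highest mean probability.
soft_label = int(probs.mean(axis=0).argmax())
```

Here hard voting returns label 1 (two votes to one), while soft voting returns label 0 because the third classifier's confidence dominates the average.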

Data Description
The data source is collected by means of remote sensing. The vehicle remote sensing system is a complicated synthetic system made up of several subsystems, including tail gas analysis, environmental information monitoring, traffic condition monitoring, and vehicle identification. When vehicles pass the remote sensing devices, the equipment detects the smoke plume produced by the diffusion of vehicle exhaust. The specific process can be summarized as follows: the probe light from the remote sensing device passes through the air mass and then returns to the detection unit through the right-angle displacement unit, completing the detection of carbon dioxide, carbon monoxide, hydrocarbons, nitrogen oxides, etc. The intensity of opaque smoke exhausted by diesel engine vehicles can also be monitored thanks to the gasoline-diesel integrated design. As parameters of the vehicles' running status, the instantaneous velocity and acceleration of vehicles are obtained synchronously by the remote sensing system. The exhaust and running data together form the final remote sensing results.
Vehicle exhaust includes water vapor, oxygen, hydrogen, nitrogen, carbon dioxide, carbon monoxide, hydrocarbons, nitrogen oxides, sulfur dioxide, and particulate matter. There are two main on-road remote sensing methods for vehicle exhaust analysis: the nondispersive infrared (NDIR) analyzer, which measures carbon monoxide, carbon dioxide, and hydrocarbons, and the dispersion of ultraviolet (DUV) method, which measures nitric oxide, smoke factors, and particulate matter (including opacity). The exhaust diffuses and dilutes in the air immediately after being discharged, and the variation of the dilution concentration is affected by factors such as air disturbance, wind direction, and wind speed. Direct measurement of the concentration of each pollutant in the exhaust plume may therefore not reflect the vehicle emissions accurately and efficiently, so carbon dioxide is adopted as the reference gas when measuring the various exhaust pollutants in vehicle remote sensing technology. A single exhaust remote sensing optical path (whether erected horizontally or vertically) is incapable of remotely measuring the exhaust of multiple vehicles at the same time. Vehicles must pass one by one, and the time slot between passing vehicles should be greater than one second so that enough time is set aside for the remote sensing device to measure the exhaust of the preceding vehicle.
This also allows the exhaust of the preceding vehicle to disperse in time without affecting the remote sensing of the vehicles behind. The remote sensing data used in the experiments come from real detected data in the database of the Environmental Protection Bureau of a certain city and consist of three parts. The first part is the vehicle identification information; the second part is the running-state data, namely the instantaneous velocity and acceleration (v, a); the third part is the remote sensing result C⃗. It includes the detection values of carbon dioxide C_co2, carbon monoxide C_co, hydrocarbon C_hc, nitric oxide C_no, and smoke intensity C_op, together with the environmental detected values: wind speed C_ws, wind direction C_wd, temperature C_ot, humidity C_h, and atmospheric pressure C_ps. The remote sensing data are thus recorded as C⃗ = (C_co2, C_co, C_hc, C_no, C_op, C_ws, C_wd, C_ot, C_h, C_ps). Each record used in the research is composed of the above three parts of information, and the vehicle fuel type and the registration time range are the prediction targets.

Fuel Type Prediction.
The data for fuel type prediction come from two tables: the vehicle information table and the remote sensing record table. The ID and fuel type fields are extracted from the former, and the latter contains (a) vehicle running conditions (speed, acceleration, passing time, etc.); (b) environmental meteorological conditions (lane, wind direction, wind speed, temperature, humidity, and atmospheric pressure); (c) remote sensing results (detection values of carbon dioxide, carbon monoxide, hydrocarbons, nitrogen oxides, and the intensity of smoke). This paper joins these two tables by vehicle ID; the specific process is described as follows: (1) Data analysis. A preliminary statistical analysis of the number of vehicles with each fuel type is made, and the result (shown in Figure 3) demonstrates that vehicles with fuel types other than gasoline and diesel, such as mixed oil and natural gas, account for a very low proportion. The ratio between gasoline and diesel cars in the data is about 4.6 : 1, meaning this is an unbalanced dataset.
(2) Feature layering. For an unbalanced dataset, the sparse data at the tail of the distribution are merged into a single level, which concentrates the data and reduces the number of levels. A temporary feature is generated for hierarchical sampling, dividing the dataset into different sections. This paper divides the vehicle passing time into three periods: morning, afternoon, and evening.
(3) Data cleaning. Handling the character-valued vehicle data and the missing values are the main components of the data cleaning procedure. Vehicle character features include license plate number, license plate color, and the test results, and they are usually converted to one-hot codes in machine learning.
There are three common methods for dealing with missing values: ignoring the record with the missing value, ignoring the missing feature, and median/average padding. If the license plate number, license plate color, or detection result is missing, the record is directly discarded; in other situations, continuous-valued parameters are filled with the mean value.
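The cleaning rules above can be sketched with pandas; the column names and toy values are illustrative, not the real schema:

```python
import numpy as np
import pandas as pd

# Toy records mimicking the cleaning rules; the column names are assumptions.
df = pd.DataFrame({
    "plate_number": ["A123", None, "C789", "D012"],
    "plate_color":  ["blue", "blue", None, "yellow"],
    "speed":        [42.0, 35.0, 50.0, np.nan],
    "co2":          [1.2, np.nan, 0.9, 1.1],
})

# Rule 1: drop any record missing a key character field (here: plate fields).
df = df.dropna(subset=["plate_number", "plate_color"]).copy()

# Rule 2: fill remaining continuous-valued gaps with the column mean.
for col in ["speed", "co2"]:
    df[col] = df[col].fillna(df[col].mean())

# Character features such as plate color are one-hot encoded for the models.
df = pd.get_dummies(df, columns=["plate_color"])
```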
(4) Feature selection. The main work is to find correlations and feature combinations and to generate a correlation matrix for the original data. We check which features in the original data are most relevant to the fuel type and use correlation analysis to find the positively correlated features, such as license plate color and nitric oxide; features showing an obvious negative correlation include validity, transit time, and carbon dioxide. We then check which features are most relevant to the registration date of diesel vehicles and find that the features showing a clear positive correlation include license plate color, while the attributes showing a clear negative correlation include validity, detection line, and nitric oxide. In terms of feature combination, because the concentration of the plume is affected by the wind speed and the vehicle's own speed during the remote sensing process, the CO2 gas concentration is used as a reference when recording the exhaust pollutant concentrations: the new combined features are the ratios of the other pollution items to CO2. The amount of data in this study is in the trillions; after filtering, the number of features is about 30. The decision tree algorithm is used to calculate the feature importance for fuel type prediction, with the results shown in Figure 4. The feature importance for the gasoline vehicle registration time period, calculated with random forest, is shown in Figure 5, and that for the diesel vehicle registration time period is shown in Figure 6.
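The pollutant-to-CO2 ratio combination can be sketched in a few lines of numpy (the readings below are made-up values, not measurements from the paper's dataset):

```python
import numpy as np

# Raw remote sensing readings for a batch of vehicles (illustrative values).
co2 = np.array([1.50, 2.10, 0.90])
co  = np.array([0.030, 0.055, 0.012])
hc  = np.array([0.008, 0.015, 0.004])
no  = np.array([0.020, 0.040, 0.010])

# Combined features: ratios of each pollutant to CO2. The ratio cancels out
# the plume dilution caused by wind and vehicle speed, so it is a more
# stable emission descriptor than the raw concentrations.
eps = 1e-9                       # guard against a zero CO2 reading
co_ratio = co / (co2 + eps)
hc_ratio = hc / (co2 + eps)
no_ratio = no / (co2 + eps)
features = np.column_stack([co_ratio, hc_ratio, no_ratio])
```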

(5) Feature scaling. The experiments showed that the value ranges of different features differ greatly, which creates huge obstacles for the decision process, so the features are scaled to comparable ranges. (6) Dataset division. This paper randomly divides the source dataset into two parts: 85% as the training dataset and 15% as the validation dataset. (7) Model training. The processed data are fed into the three models for training and verification. The comparison among the results of these models is presented in the experiment part.
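Steps (5) and (6) can be sketched with scikit-learn; the paper does not name its scaler, so standardization is assumed here for illustration, and the two synthetic columns mimic the range mismatch (e.g. speed versus a pollutant ratio):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
# Features with wildly different ranges, e.g. speed (~0-120) vs. a ratio (~0-0.1).
X = np.column_stack([rng.uniform(0, 120, 1000), rng.uniform(0, 0.1, 1000)])
y = (X[:, 0] / 120 + X[:, 1] / 0.1 > 1.0).astype(int)

# 85% / 15% split, as in step (6).
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.15, random_state=0)

# Standardization removes the range mismatch noted in step (5); the scaler is
# fit on the training split only, then applied to the validation split.
scaler = StandardScaler().fit(X_tr)
X_tr_s = scaler.transform(X_tr)
X_val_s = scaler.transform(X_val)
```

Fitting the scaler on the training split alone avoids leaking validation statistics into training.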
When the period division is done, the prediction of registration time can be treated as a multiclassification problem.
The remote sensing record table contains three parts of information: (a) remote sensing data of vehicle running conditions, including speed, acceleration, and passing time; (b) environmental meteorological conditions, including lane condition, wind direction, wind speed, temperature, humidity, and atmospheric pressure; (c) exhaust remote sensing results, containing the detection values of carbon dioxide, carbon monoxide, hydrocarbons, nitrogen oxides, and the intensity of smoke. In the classification by the national and local standards mentioned above, a large error exists because vehicles registered shortly before and after a segmentation point hardly differ from each other. Therefore, this paper discards the record data of the three months before and after each segmentation point.

Decision Tree Tuning.
There are mainly three parameters to adjust in the decision tree algorithm. D_max_depth represents the maximum depth of the decision tree; by the principle of decision trees, deeper layers give the tree more power to divide attributes thoroughly and mine deeper relationships in the data.
This paper experimentally sets max_depth to range from 1 to 32; the F1 curve is shown in Figure 7. D_min_samples_split denotes the minimum number of samples required to split an internal node. At least one sample is required for each node to perform splitting. When the required number of internal node samples increases, more samples must participate in each split of the tree, so the decision tree is more constrained, which can also affect the speed of model execution. This paper sets the minimum number of samples for internal nodes to range from 10 to 500; the F1 curve is shown in Figure 8. D_min_samples_leaf is the minimum number of samples required at a leaf node; a leaf node is pruned if its number of samples is less than this minimum.
In the experiments, D_min_samples_leaf is set to range from 1 to 100, and the F1 curve is plotted in Figure 9.
According to Figures 7-9, the optimal parameters of the decision tree algorithm obtained experimentally in this paper are as follows: D_max_depth is 12, D_min_samples_split is 150, and D_min_samples_leaf is 10.
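The kind of parameter sweep behind Figures 7-9 can be reproduced as a simple validation loop; the sketch below (synthetic data, sweeping only D_max_depth with the other two parameters fixed at the paper's chosen values) is illustrative:

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(4)
X = rng.normal(size=(3000, 8))
y = (X[:, 0] * X[:, 1] + X[:, 2] > 0).astype(int)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.15, random_state=0)

# Sweep D_max_depth from 1 to 32 and record the validation F1, reproducing
# the kind of curve shown in Figure 7.
scores = {}
for depth in range(1, 33):
    clf = DecisionTreeClassifier(
        max_depth=depth, min_samples_split=150, min_samples_leaf=10,
        random_state=0)
    clf.fit(X_tr, y_tr)
    scores[depth] = f1_score(y_val, clf.predict(X_val))

best_depth = max(scores, key=scores.get)
```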

Random Forest Tuning.
There are four main parameters to adjust in the random forest model. n_estimators is the number of decision trees in the random forest, which plays a significant role in the performance of the model. A small value of n_estimators means fewer base classifiers participate in the decision process, decreasing the prediction accuracy, while a large number of decision trees adds computational burden and running time. This paper sets n_estimators from 10 to 100 and plots the F1 curve, as shown in Figure 10. max_features limits the number of features considered in the splitting process: all N features are considered when max_features = None/Auto, and no more than log2 N or √N features when max_features = log2 or sqrt, respectively, where N is the total number of features. There are few sample attributes in the experiment, so this paper sets max_features to Auto. R_max_depth is the maximum depth of the decision trees in the random forest; this paper sets R_max_depth to range from 1 to 100, and the F1 curve is shown in Figure 11. R_min_samples_leaf is the minimum number of samples in a leaf node for splitting; this paper sets it to range from 1 to 100, and the F1 curve is shown in Figure 12.
According to the results in Figures 10-12, this paper sets n_estimators to 100, max_features to 5, R_max_depth to 24, and R_min_samples_leaf to 2.

AdaBoost Parameters Setting.
The default base classifier of AdaBoost is the decision tree, making it a classic ensemble learning algorithm with a boosting structure. The base classifier parameters follow the optimal decision tree parameters above: max_depth = 12, min_samples_split = 150, and min_samples_leaf = 10. Two significant parameters of AdaBoost are the number of base classifiers N_tune and the learning rate L_rate. The model easily overfits if N_tune is too large and underfits if it is too small. This paper sets N_tune = 50 and leaves L_rate at its default value of 1.
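The configuration above maps directly onto scikit-learn's AdaBoostClassifier; the sketch below uses synthetic data and passes the base tree positionally, which works across scikit-learn versions (the keyword is `estimator` in newer releases and `base_estimator` in older ones):

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(5)
X = rng.normal(size=(2000, 6))
y = (X[:, 0] + X[:, 1] ** 2 > 1.0).astype(int)

# Base classifier with the decision tree optimum found above.
base = DecisionTreeClassifier(
    max_depth=12, min_samples_split=150, min_samples_leaf=10)

# N_tune = 50 base classifiers, L_rate = 1.0 (the default learning rate).
ada = AdaBoostClassifier(base, n_estimators=50, learning_rate=1.0,
                         random_state=0)
ada.fit(X[:1600], y[:1600])
acc = ada.score(X[1600:], y[1600:])
```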

Fuel Type Prediction Model.
This section uses the decision tree, random forest, and AdaBoost algorithms to make fuel type predictions. Vehicle fuel type prediction is a typical two-class classification task. Five classification models are used in the experiment: decision tree, random forest, AdaBoost, the hard voting fusion model, and the soft voting fusion model. After parameter optimization of all models, the single classifier models are compared with the fusion ones. The whole technological process is illustrated in Figure 13. More details are given in the following subsections.
Section 4.2 optimized the parameters of each single model. Although the performance of the single models is already very good, model fusion, which exploits the differences between the single models, remains very meaningful.
It can be seen from Table 1 that random forest performs best among the single models. The fusion models obtained by voting also perform very well; compared with most single models, the fusion models have higher prediction accuracy. In predicting the vehicle fuel type, this paper obtains good prediction results, with the random forest algorithm and the fusion models performing best. The F1 value of random forest is 90.41%, and the F1 value of the soft fusion model is 90.3%. Because random forest predicts faster and its prediction model is easier to interpret, the random forest model is selected as the final model for fuel type prediction.

Registration Time Prediction Model: Mixed Fuel Type.
Vehicle registration time prediction is divided into diesel vehicle registration time prediction and gasoline vehicle registration time prediction. From the statistical analysis of the data, diesel vehicles are mainly divided into those registered between 2009 and 2013 and those registered after 2013; the proportion registered before 2009 is very low. Gasoline vehicles are mainly vehicles registered after 2008, accounting for about 90%. The purpose of predicting the registration date and fuel type is to find the limit value of unknown vehicles in order to judge whether the emission is qualified. The earlier the registration, the higher (more lenient) the limit value standard. Misclassifying a car registered before 2008 as one registered after 2008 is equivalent to selecting a lower limit value, making it easy to judge a vehicle with qualified emissions as unqualified, which wastes the resources of car owners and environmental inspection workstations. Based on the above analysis, random forest and XgBoost are used in this section to create the registration time prediction model. The classification periods are as follows: gasoline + after 2001-10-1, gasoline + before 2001-10-1, diesel + between 2008-7-1 and 2013-7-1, diesel + before 2008-7-1, and diesel + after 2013-7-1. Since this is a multiclassification problem, this paper uses random forest and XgBoost to perform the prediction, with the results shown in Table 2. In order to reduce the randomness of the learning algorithms, the results report the mean and variance of 10 independent runs. This paper finds that when the data are divided into five categories, the verification results are unbalanced: the number of gasoline-powered vehicles after 2001 is higher, and the verification accuracy is much lower than the training accuracy. The random forest model is superior to the XgBoost model in both training accuracy and verification accuracy.
Its training accuracy reaches 99.0%, and its verification accuracy is about 91.7%, which indicates that some overfitting occurs, decreasing the verification accuracy.

Registration Time Prediction Model: Gasoline Vehicle.
The gasoline cars are classified as follows: gasoline + after 2001-10-1 and gasoline + before 2001-10-1. The results are shown in Table 3. Both the training and verification accuracies exceed 99% with the random forest model; this is mainly caused by the severely unbalanced data distribution and the insufficient amount of data.

Conclusions
Environmental protection has been a hot topic in academic and industrial communities.
This paper focuses on predicting the missing basic information of vehicles from telemetry data in order to monitor vehicle emissions. A variety of data mining methods are adopted to perform predictions based on the vehicle telemetry data provided by an environmental protection agency in a certain city, and precise inferences are successfully made for fuel type and gasoline-powered vehicle registration time. For the registration time of diesel vehicles, the prediction accuracy only reaches about 70%, because the division of registration time is set artificially and the status of vehicles varies greatly across users. Further work will be carried out with more related data and improved algorithms to make more precise predictions of vehicle emission-related information.

Data Availability
The SQL data used to support the findings of this study have not been made available because the data provider is the Municipal Bureau of Ecology and Environment of a certain city and the data are not public.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.