Environment Pollution and Climate Change

Conventional models for estimating idling emissions are mainly based on ambient temperature and the status of the vehicle itself, such as vehicle type/size, age, accumulated mileage, and fuel type. Instantaneous vehicle activity information is seldom taken into account. In this research, a machine learning approach is proposed to dynamically estimate vehicle emission rates while idling, based on real-world driving tests on more than 1,600 km of highways in the State of Texas, USA. One driver drove a dedicated light-duty gasoline vehicle on various types of roads, including interstate freeways, farm roads, state highways, and arterial roads. During each idling episode, the rates of vehicle exhaust emissions, including carbon dioxide (CO2), carbon monoxide (CO), hydrocarbons (HC), and nitrogen oxides (NOx), were measured by a Portable Emission Measurement System (PEMS). Meanwhile, real-time engine information of the test vehicle, such as revolutions per minute and intake air temperature, and environmental information (e.g., ambient temperature) were collected through the On-Board Diagnostics II (OBD-II) port. Five machine learning algorithms were applied to build idling emission models that illustrate the nature of the emission patterns. The results show that the model based on Boosted and Bagged Decision Trees (BBDT) was the best fit for dynamic idling emissions, with the best prediction performance.


Introduction
Idling refers to vehicle operation in which the engine is running but the vehicle is not in motion [1,2]. Although an individual idling episode usually accounts for a very small part of a driving trip, the cumulative impacts of idling are enormous. In the United States, more than 6 billion gallons of fuel are spent each year on avoidable idling [3], leading to a total of 5,988.66 billion tons of carbon dioxide (CO2) emissions. Moreover, the large quantities of by-products, namely nitrogen oxides (NOx), carbon monoxide (CO), and hydrocarbons (HC), are toxic to the environment and to humans [4]. Insight into idling emission patterns is essential for preventing unnecessary toxic exhaust emissions.
Idling can be discretionary or non-discretionary. Discretionary idling occurs when the driver chooses to stop and idle, while non-discretionary idling takes place during normal driving because of traffic signals, signs, and congestion [5]. Non-discretionary idling normally lasts for a shorter time and occurs under hot-start conditions [6], often during congestion. Many strategies have been implemented to reduce congestion-related non-discretionary idling emissions, such as optimized signal control and eco-driving [7] and improved configuration of drive-through facilities [8]. Although many previous studies have explored idling emission patterns [9,10], most of them focus on the emission impacts of the vehicle itself, such as vehicle type/size, age, accumulated mileage, fuel type, and maintenance condition, as well as ambient temperature [11]. The exhaust emissions are mostly monitored under static conditions. In fact, most idling modes occur within a series of vehicle operations, and the transitions between them can produce engine activity, such as revolutions per minute (rpm), intake air temperature (IAT), and manifold absolute pressure (MAP), that differs from the activity monitored during laboratory idling tests.
Besides, during modeling, idling emission estimation is usually embedded into the exhaust emissions attributed to other vehicle operations, such as acceleration and braking. The most commonly used model is the MOtor Vehicle Emission Simulator (MOVES), developed by the U.S. Environmental Protection Agency (EPA). Other emission models include the EMission FACtors (EMFAC) model developed by the California Air Resources Board (CARB), the International Vehicle Emissions (IVE) model applied mostly in developing countries, and the COmputer Program to calculate Emissions from Road Transport (COPERT) model, whose development is coordinated by the European Environment Agency (EEA) [2]. These macroscopic models make statistical use of some types of microscopic emission information based on standard emission measurements. For example, MOVES estimates idling emissions from emission rates obtained from drive cycles [2,11]. These emission rates undoubtedly simplify exhaust emission estimation at a regional scale. However, such estimates fail to capture the exhaust emission patterns during an idling mode. In addition, the statistical models rely on a number of assumptions and often end up overfitting.
By comparison, recently developed machine learning techniques, such as the K-Nearest Neighbor (KNN) model, Neural Networks, and Boosted and Bagged Decision Trees (BBDT), can potentially provide more reliable and repeatable decisions and results. Machine learning techniques learn from measured data without rules-based programming and make predictions on the fly. Their main advantage is that the algorithms impose no continuity or boundary requirements, and the distributions of the dependent and independent variables do not need to be specified [12]. This research attempts to identify a machine learning algorithm for training a best-fit idling emission model, based on field driving tests and real-time environmental information. The best-fit model can illustrate the nature of exhaust emission patterns during idling and provide reliable and highly accurate estimates.

Machine learning algorithms
Five machine learning algorithms, namely KNN, Neural Network, BBDT, CHAID, and Support Vector Machine (SVM), were applied to build idling emission models and were screened by comparing their relative errors. The relative error is the ratio of the sum of squared errors for the dependent variable to the sum of squared errors for the null model. A smaller relative error indicates higher prediction accuracy. The two models with the lowest relative errors were further analyzed in terms of the Root Mean Square Error (RMSE) and the correlation coefficient (R) of the fitted regression line for each emission index. The better-fit model was identified by a lower relative error, a lower RMSE, and a higher absolute R.
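As an illustration of the screening metric, the relative error can be computed as in the minimal Python sketch below. It assumes that the null model predicts the mean of the observed values, which the text does not state explicitly; the function name and the sample values are hypothetical and are not measured data.

```python
def relative_error(observed, predicted):
    """Ratio of the model's sum of squared errors to that of the null model.

    Assumption: the null model predicts the mean of the observed values.
    """
    mean_obs = sum(observed) / len(observed)
    sse_model = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sse_null = sum((o - mean_obs) ** 2 for o in observed)
    return sse_model / sse_null


# Hypothetical emission rates (illustrative values only)
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 5.0]
print(relative_error(observed, predicted))  # model SSE 1.0 / null SSE 5.0 = 0.2
```

A value near 0 means the model explains almost all of the variation that the null model leaves unexplained, while a value near 1 means the model is no better than predicting the mean.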

KNN model:
As an instance-based (i.e., lazy) learning method, the KNN algorithm is one of the simplest machine learning algorithms. KNN predicts the value of an output variable based on the values of its nearest neighbors [13]. More specifically, it recognizes patterns in the data without requiring an exact match to any stored pattern or case. Similar cases lie close to each other, and the distance between cases is the measure of similarity. Each case has many neighbors. The best number of neighbors, called k, is selected by cross-validation of the error log (el) presented in Equation (1).
In Equation (1), $y_j$ is the measured output, $\hat{y}_j$ is the estimated output, which is a weighted average of the k nearest neighbors, and j denotes the j-th nearest neighbor.

BBDT model:
The BBDT model combines regression trees or classification trees with bagging, which is an ensemble learning method. Multiple decision trees are generated and bagged into an ensemble. For idling emission estimation, each tree is grown deep as a regression tree. Bagging trains each tree on resampled data. For each resampling, the unique observations are divided into two groups: bootstrap samples for training and out-of-bag samples for validation. The predictive power of the trained ensemble is indicated by the average error on the out-of-bag samples. The prediction algorithm is expressed as Equation (2) [14].
In Equation (2), $\hat{y}_t$ is the prediction from tree t in the ensemble, S is the set of indices of the selected trees that comprise the prediction, and $a_t$ is the weight of tree t.

Neural network:
Artificial neural networks are based on simple mathematical models of the brain. The Levenberg-Marquardt algorithm is one of the typical methods for training the networks (structure, weights, and biases) using the multilayer perceptron procedure [15,16]. The training process stops automatically when generalization no longer improves, as indicated by an increase in the mean square error of the validation samples. In this research, a structure of 1 hidden layer with 10 neurons was used.
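Returning to the BBDT model described above, the bagging and out-of-bag validation steps can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: each tree is reduced to a single-split regression stump on one input rather than a deep tree, all trees receive equal weight, no boosting is performed, and the function names and data are hypothetical.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-split regression tree; return (threshold, left_mean, right_mean)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((y - left_mean) ** 2 for y in left)
        sse += sum((y - right_mean) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # all inputs identical: predict the mean everywhere
        mean = sum(ys) / len(ys)
        return (float("inf"), mean, mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def bagged_oob_error(xs, ys, n_trees=50, seed=0):
    """Train a bagged ensemble; return (trees, out-of-bag mean squared error)."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    oob_preds = [[] for _ in range(n)]
    for _ in range(n_trees):
        sample = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        stump = fit_stump([xs[i] for i in sample], [ys[i] for i in sample])
        trees.append(stump)
        in_bag = set(sample)
        for i in range(n):
            if i not in in_bag:  # observation left out of this tree's sample
                oob_preds[i].append(predict_stump(stump, xs[i]))
    errors = [(ys[i] - sum(p) / len(p)) ** 2 for i, p in enumerate(oob_preds) if p]
    oob_mse = sum(errors) / len(errors) if errors else float("nan")
    return trees, oob_mse


def predict_ensemble(trees, x):
    """Equal-weight average of the individual tree predictions (an assumption)."""
    return sum(predict_stump(t, x) for t in trees) / len(trees)


# Hypothetical data: one input (idling duration, s) and a step-shaped response
durations = list(range(20))
rates = [1.0 if d < 10 else 3.0 for d in durations]
trees, oob_mse = bagged_oob_error(durations, rates)
print(predict_ensemble(trees, 2), predict_ensemble(trees, 17), oob_mse)
```

Because each observation is scored only by trees that did not see it during training, the out-of-bag error acts as a built-in validation measure without a separate hold-out set.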
SVM model:
SVM models are supervised learning models whose algorithms analyze data for classification and regression [17]. In the idling emission case, SVM maps the emission data to a high-dimensional feature space for classification, regardless of whether the data are linearly separable. The data are transformed by a mathematical function called the kernel, after which the boundary between categories can be defined by a hyperplane. The response of new data can be predicted by classifying them into categories based on their features [18].

CHAID model:
CHAID stands for CHi-squared Automatic Interaction Detection. It is a type of decision tree technique that can be used for prediction as well as classification [19]. The optimal splits are identified by chi-square tests of independence. The CHAID algorithm consists of three steps: merging the pairs of categories that show the least significant difference, splitting to grow the tree deeper, and stopping when all categories differ at the specified testing level. A tree keeps growing by repeating these three steps at each node, starting from the root node [20].
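The merging step of CHAID can be illustrated with a short Python sketch that computes the Pearson chi-square statistic for each pair of predictor categories and picks the pair with the least significant difference. The bin names and counts are hypothetical, and the continuous emission response is assumed to have been binned into "low" and "high" classes for the purpose of the example.

```python
from itertools import combinations


def chi_square(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat


def least_different_pair(counts):
    """Return the pair of predictor categories with the smallest chi-square.

    `counts` maps each predictor category to its response-class counts.
    The pairwise tables are assumed to share the same degrees of freedom,
    so the smallest statistic is the least significant difference.
    """
    return min(
        combinations(counts, 2),
        key=lambda pair: chi_square([counts[pair[0]], counts[pair[1]]]),
    )


# Hypothetical counts of (low, high) emission episodes per idling-duration bin
counts = {"short": [30, 10], "medium": [28, 12], "long": [8, 32]}
print(least_different_pair(counts))  # ('short', 'medium')
```

In a full CHAID implementation the pair would be merged only if its p-value exceeded the merge threshold, and the procedure would repeat until every remaining pair differed significantly.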

Accuracy estimation of predicted responses
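The accuracies of the five machine learning based idling emission models are compared by the Root-Mean-Square Error (RMSE) and the Pearson product-moment correlation coefficient R. The RMSE is commonly used to measure the difference between the observed values and the values predicted by a model, and is expressed in Equation (3).
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(X_{obs,i} - X_{model,i}\right)^{2}} \qquad (3)$$
where $X_{obs,i}$ = the i-th observed value, $X_{model,i}$ = the modeled value at the i-th data prediction, and n = the number of data predictions.
On the other hand, how closely the predicted values fit the observed values is measured by the R value, which is obtained by
$$R = \frac{\sum_{i=1}^{n}\left(X_{obs,i} - \bar{X}_{obs}\right)\left(X_{model,i} - \bar{X}_{model}\right)}{\sqrt{\sum_{i=1}^{n}\left(X_{obs,i} - \bar{X}_{obs}\right)^{2}}\,\sqrt{\sum_{i=1}^{n}\left(X_{model,i} - \bar{X}_{model}\right)^{2}}}$$
where $\bar{X}_{obs}$ and $\bar{X}_{model}$ are the means of the observed and modeled values, respectively.
The idling emission model that provides predicted responses with the lowest RMSE and the highest absolute R value is identified as the best-fit model.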
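The two accuracy measures described above can be computed with a few lines of Python. The sketch below uses the standard definitions of the RMSE and the Pearson correlation coefficient; the function names and sample values are hypothetical.

```python
import math


def rmse(observed, modeled):
    """Root-mean-square error between observed and modeled values."""
    n = len(observed)
    return math.sqrt(sum((o - m) ** 2 for o, m in zip(observed, modeled)) / n)


def pearson_r(observed, modeled):
    """Pearson product-moment correlation coefficient."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_m = sum(modeled) / n
    cov = sum((o - mean_o) * (m - mean_m) for o, m in zip(observed, modeled))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_m = sum((m - mean_m) ** 2 for m in modeled)
    return cov / math.sqrt(var_o * var_m)


# Hypothetical emission rates: the model is offset by a constant 0.5
observed = [1.0, 2.0, 3.0, 4.0]
modeled = [1.5, 2.5, 3.5, 4.5]
print(rmse(observed, modeled), pearson_r(observed, modeled))
```

The example shows why both measures are needed: a constant offset leaves the correlation essentially perfect while the RMSE still reports the 0.5 bias.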

Test plan and data collection
This study addresses the non-discretionary idling emission pattern, which is produced by temporary idling at traffic signals and in congestion. The vehicle engine in this case can be regarded as already fully warmed up. The structure and parameters of the model were calibrated from input-output data pairs obtained from on-road driving tests. Figure 1a shows the dedicated light-duty test vehicle, a 2004 Subaru Forester with a four-cylinder, 2.5-liter engine and an automatic transmission. The vehicle weight was 3,100 lb and the test weight was 3,500 lb; the engine delivers 165 horsepower at 5,699 rpm and a torque of 225 Nm at 4,000 rpm. The fuel type is gasoline, and the mileage at the start of the test was 16,496 km (10,250 miles). Figure 1b shows the PEMS placed on the back seat of the test vehicle. A plastic tube from the PEMS is connected to the tailpipe of the vehicle to draw in exhaust continuously for measurement. A global positioning system (GPS) receiver was placed on top of the vehicle to record instantaneous geolocation information. The sampling rate of both the PEMS and the GPS is 1 Hz (once per second).
The test vehicle was driven over approximately 1,600 km of highways of different road types in the State of Texas, USA, including interstate freeways, farm roads, state highways, and arterial roads. The specific idling measurements on each highway are listed in Table 1. Table 1 shows a total driving duration of P34H14M01S (i.e., 34 h 14 min 1 s) on these highways, while the 221 idling episodes lasted P02H23M56S in total, with an average of 39.07 s per episode. The test sites cover a geographically wide range of the State of Texas.
The vehicle activity and engine information were recorded by connecting the PEMS to the On-Board Diagnostics (OBD) II port of the test vehicle during each idling period caused by congestion or traffic controls, such as traffic signals or stop signs. The collected information, combined with each idling duration, serves as the input variables: revolutions per minute (rpm), Intake Air Temperature (IAT), Manifold Absolute Pressure (MAP), Ambient Air Temperature (AAT), and Idling Duration (ID). Meanwhile, the PEMS was used to measure real-time exhaust emission rates of CO, CO2, NOx, and HC, which are the output variables of the modeling. Figure 2 shows a screenshot of the PEMS records at a sampling frequency of 1 Hz. Column A is the recording time, columns B-E are part of the OBD II information fed into the PEMS, column F indicates the gas analyzer used to measure emissions (analyzer 1, analyzer 2, or both), columns G to K are the measured emissions and fuel consumption, column L is the Coordinated Universal Time (UTC), columns O-Q are GPS information, and column R is the real-time driving speed from OBD II.
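To illustrate how idling episodes and their durations could be extracted from such 1 Hz records, a minimal Python sketch follows. It assumes that a sample counts as idling when the OBD II speed is zero while the engine rpm is positive; the study's actual extraction rule is not given in the text, and the function name and record values are hypothetical.

```python
def idle_episodes(records):
    """Return (start_index, duration_s) for each run of idling samples.

    `records` is a list of (speed_kmh, rpm) tuples sampled at 1 Hz, so the
    number of consecutive idling samples equals the duration in seconds.
    Assumption: idling means zero speed with the engine running (rpm > 0).
    """
    episodes = []
    start = None
    for i, (speed, rpm) in enumerate(records):
        idling = speed == 0 and rpm > 0
        if idling and start is None:
            start = i  # an idling episode begins
        elif not idling and start is not None:
            episodes.append((start, i - start))  # the episode has ended
            start = None
    if start is not None:  # records end while still idling
        episodes.append((start, len(records) - start))
    return episodes


# Hypothetical 1 Hz records of (speed in km/h, engine rpm)
records = [(30, 1500), (0, 700), (0, 690), (0, 710), (12, 1200), (0, 700), (0, 705)]
print(idle_episodes(records))  # [(1, 3), (5, 2)]
```

Each episode's duration can then be paired with the rpm, IAT, MAP, and AAT readings recorded during that episode to form one set of model inputs.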
The input-output data pairs were divided into three parts for training, validation, and testing. Seventy percent of the data pairs were used to train the five algorithms. During the training process, the classification, network, and regression models are adjusted according to their errors. Another 15% of the data pairs were used as validation samples to measure generalization and to halt training when generalization stops improving. The last 15% of the data pairs serve as testing samples, which have no effect on training and provide an independent measure of modeling performance during and after training.
The data were obtained from the real driving tests, which were all recorded at hot engine status. Five machine learning based idling emission models were developed. The relative errors for each emission index are listed in Table 2.
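The 70%/15%/15% partition of the data pairs into training, validation, and testing samples can be sketched in Python as follows. Random assignment with a fixed seed is an assumption made for illustration, since the text does not say how the pairs were allocated; the function name is hypothetical.

```python
import random


def split_data(pairs, seed=0):
    """Split data pairs into 70% training, 15% validation, and 15% testing.

    Assumption: pairs are assigned at random; a fixed seed keeps the
    partition reproducible.
    """
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = n * 70 // 100
    n_val = n * 15 // 100
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    testing = shuffled[n_train + n_val:]  # remainder, about 15%
    return train, validation, testing


# Hypothetical data set of 100 input-output pairs, represented by their indices
pairs = list(range(100))
train, validation, testing = split_data(pairs)
print(len(train), len(validation), len(testing))  # 70 15 15
```

Keeping the testing samples completely separate from training and validation is what allows them to serve as an independent measure of model performance.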

Comparison of relative errors among models
In Table 2, the relative errors of the four emission indexes obtained by KNN are close to those obtained by BBDT, ranging from 1% (for CO2 by BBDT) to 15% (for CO by KNN). These two algorithms yield clearly smaller relative errors than the Neural Network, CHAID, and SVM models. Thus, the KNN and BBDT based idling emission models were selected for further analysis in the next section.
Besides, it is worth noting that, except for the CHAID algorithm, the other four algorithms predict the CO2 emissions with smaller errors, whereas all five algorithms provide CO and NOx estimates with comparably higher relative errors. This implies that the emission patterns of CO and NOx could differ from those of CO2 and HC.
Figure 3 illustrates the fitted regression lines between the observed and estimated emissions by the KNN models from the validation tests. A greater absolute value of R indicates a higher correlation between the estimated and observed emissions. Figures 3a and 3c show that the estimated CO2 and HC emissions are highly correlated with their corresponding observed values, with R values of 0.94 and 0.85, respectively. The R of 0.58 for NOx can be regarded as only a moderate correlation between the estimated and observed values, whereas the R of 0.33 for CO indicates that the relationship is relatively weak. Similar fitting results are shown in Figure 4 for the testing phase, in which the correlation coefficients for CO, HC, and NOx decline slightly to 0.17, 0.78, and 0.53, respectively.
In the testing phase shown in Figure 6, although R for the NOx emissions decreases slightly to 0.44, the R value for CO increases to 0.59. As a whole, the exhaust emission values estimated by the BBDT algorithm are more closely correlated with the observed emission values than those estimated by the KNN algorithm.
Table 3 lists the RMSEs of the validation and testing results for the four emission indexes. Generally speaking, for both the KNN and BBDT algorithms, the RMSEs differ only slightly between the validation and testing results, which means that the two idling emission models provide reliable estimates.
Comparing the RMSEs of the two algorithms for each emission index in the two phases, the RMSEs of the KNN algorithm for the CO2, CO, and HC emissions are slightly greater than those of the BBDT algorithm. The NOx emission estimates by the two algorithms are similar to each other. This implies that the BBDT based idling emission model delivers better prediction performance. Therefore, the BBDT emission model was identified as the best-fit model among the five machine learning emission models because of its lower relative errors, higher absolute R values, and lower RMSEs. In addition, the average idling emission rates estimated by the best-fit BBDT based emission model were compared with the observed values measured by the PEMS and with the values estimated by MOVES for a light-duty gasoline vehicle. Table 4 shows the comparison results.

Comparison of RMSE between KNN and BBDT based models
Note: N/A = the emission rate is not available from source [11].
In Table 4, the MOVES estimated values are the average emissions of all light-duty vehicles. The BBDT estimated emission rates are clearly very close to the observed ones, whereas both the observed and the estimated emission rates are quite different from the MOVES estimates for average light-duty vehicles. In other words, the BBDT based idling emission model shows better predictive power for this specific test vehicle.

Conclusion
Field vehicle idling emission tests were conducted in several different cities in the State of Texas. Vehicle activity information, engine information, and real-time exhaust emissions during each idling period were recorded and analyzed to characterize the pattern for modeling. A total of five machine learning based idling emission models were developed. Among the five models, the BBDT and KNN based idling emission models gave better predictions, with lower relative errors ranging from 1% for CO2 to 15% for CO. The prediction performance of the two models was compared by their RMSEs for each emission index. The RMSEs of the BBDT based idling emission model for the CO2, CO, and HC exhaust emissions in the validation phase as well as the testing phase were overall smaller than those of the KNN based emission model. Therefore, the BBDT based idling emission model was identified as the best-fit model. In addition, the emission rates estimated by the best-fit model were very close to the emission rates observed by the PEMS.
The BBDT based model can accurately and dynamically estimate vehicle idling emissions. Such a model could be embedded into a smartphone or tablet through a suitably developed application, so as to promptly display vehicle idling emissions while the vehicle is halted at a red light or in a congestion queue.