A Heuristic Approach for Quantifying Household Travel GHG Emissions Using GPS Survey and Spatial Correlations-The Cincinnati Case Study

Introduction
The United States Environmental Protection Agency (USEPA) reported that the historical increase of CO2 emissions from the transportation end-use sector is largely attributable to the increased and imbalanced demand for land use and travel activities [1]. The current state of the practice for estimating GHG emissions relies on the integration of two isolated modeling processes: travel demand forecasting and emission estimation. The procedure employs an ad hoc approach that uses average link-based speeds and traffic volumes from the travel demand model as the transportation activity inputs to the MOVES (Motor Vehicle Emission Simulator) model [2][3][4]. Climate change, land use, and socioeconomic development are principal variables that define the need and scope of adaptive engineering and management to sustain infrastructure development. It is in the best interests of the federal government (e.g., U.S. EPA) and state governments (e.g., California Air Resources Board) to investigate research questions such as: Are the changes tangible? What are the actionable sciences for decision-making? What adaptation changes can be made within the planning horizon? Are tools and models available to test those adaptive changes?
From the emission modeling perspective, accurate and detailed traffic operational activity inputs to the MOVES model are crucial to maximizing its capability to reflect the greenhouse gas emissions associated with travel. Previous research [5][6][7][8][9] has shown that on-road traffic-related emissions vary with traffic operating conditions (i.e., speed, acceleration, or deceleration). Recent studies [6,7,10,11] indicate potential deficiencies in converting travel demand outputs into emission model inputs. Emission models often rely on traditional travel demand models for vehicle activity inputs, but traditional travel demand models are mostly calibrated and validated using aggregated total traffic data [12]. As a result, hourly emission estimates may not be accurate, because hourly VMT and speed variations are underrepresented and aggregated inputs are used in the emission models [12,13]. In addition, real-world traffic data, especially location-based trip generation data, are spatial in nature and therefore contain unknown effects due to spatial correlation [14,15]. Figure 1 compares the traditional link-based "bottom-up" approach (left) with the proposed "top-down" approach (right) for estimating GHG emissions in Hamilton County, Ohio. The link-based "bottom-up" approach clearly maps out the interstate freeway network, since the interstates are heavily loaded with traffic. It accounts for all the emissions released on the county's roadway network but does not measure where those emissions originate. Adaptation planning for climate change impacts requires a data-driven, location-based analysis capability that can estimate the spatial distribution of the sources contributing to travel GHG emissions.
Therefore, household GHG emission generation modeling is viewed as a pressing need in order to provide data-driven and location-driven decision support for the aforementioned research questions and analysis capabilities. However, the challenge remains in the theoretical representation of the sensitive interactions between spatially dependent land use and traffic activities, as well as in providing location-based GHG emission information to decision makers. Without data support, it is difficult for planners to connect land use with household travel-associated GHG emissions, and it is almost impossible to trace those emissions back to their origin. In addition, since household travel survey data analyses are cross-sectional and spatially dependent, the effectiveness of incorporating spatial information into the research is not clear. A method of modeling household travel-associated GHG emissions that accounts for spatial effects is needed.
The goal of this research is to develop a spatial regression-based GHG emission modeling approach at the TAZ level using GPS household travel survey data. By introducing spatial information for decision support, the method is expected to enable analysis of the sensitive interactions among land use changes, household travel characteristics, and GHG emissions. To fulfill this goal, the following objectives are designated:
• To identify the contributing variables for household travel GHG emissions through statistical analysis of the high-resolution (second-by-second) GPS household travel survey;
• To quantitatively reveal household travel GHG emissions at the TAZ level, illustrating the socioeconomic and demographic characteristics of household GHG emissions with "ground-truth" traffic activity data inputs;
• To utilize spatial information in the GHG emission generation model, bypassing the issues in Ordinary Least Squares (OLS) regression-based modeling assumptions;
• To compare model goodness of fit using an information-based measure of fit approach.
The spatial cross-sectional regression method is based upon previously extracted travel and GHG emission characteristics of households as well as the spatial contiguity among TAZs.

Summary of Existing Studies
Spatial panel data typically refers to data containing time series observations over a type of spatial unit such as TAZs, zip codes, regions, states, and countries. It is generally recognized that panel data are more informative since they contain more variation and less collinearity among the variables. The use of panel data provides more degrees of freedom and hence increases efficiency in the estimation [16]. A large body of literature [17][18][19] has shown that incorporating spatial factors into integrated land use and transportation applications is feasible and yields reliable results [20][21][22]. The spatial and temporal correlation characteristics, which were originally introduced to the transportation field from econometrics, treat traffic activities, like their generating sources, as spatially correlated. Several recent studies at the University of Cincinnati [23-26] indicate that the spatial modeling approach is capable of achieving improved accuracy in both truck volume and particulate matter (PM2.5) emission predictions. Hall et al. [27] identified that current land use land cover (LULC) models fail to incorporate and integrate spatial and temporal correlations in urban systems. To fill this gap, they introduced spatial linear and logistic regression models for panel data and used downtown population data for Austin, TX over multiple years to predict the population in 2020. They concluded that spatial and temporal effects were highly statistically significant, suggesting that their recognition and formal inclusion in the models is likely to be of great value. Parent and LeSage [22] applied a spatial panel model with random effects to predict commuting times. They collected travel time to work, travel expenditures, traffic volume, lane miles, and gas taxes to forecast the mean travel time to work for each state.
The findings showed evidence of substantial spatial spillovers and relatively weaker time dependence, leading to much smaller time impacts accruing over future periods. A more recent article by Chakir and Le Gallo [28] investigates how the introduction of spatial effects and individual heterogeneity in an aggregated land use share model affects the predictive accuracy of land use models. They considered agricultural, forest, urban, and other land uses in their investigation. One of their conclusions is that controlling for both unobserved individual heterogeneity and spatial autocorrelation outperforms any other specification in which spatial autocorrelation and/or individual heterogeneity are ignored. Perugu et al. [29] applied a spatial panel model to model truck factors and improve PM2.5 estimation in a regional roadway network. The proposed methodology enables plotting the spatiotemporal distribution of PM2.5 emissions in a subarea. They also reported that the methodology is scalable and transferable and holds technical promise for application across different regions and pollutants.
In summary, a gap exists between the current practice of estimating household travel GHG emissions at an aggregated level and the data-informed and spatially informed needs of adaptive planning. The proposed research is expected to fill this gap by connecting zonal-level socioeconomics with household travel GHG emissions using spatial regression and high-resolution GPS household travel survey data. This paper extends previous work on modeling household travel GHG emissions in three ways: 1) building the capability to estimate a TAZ-level GHG emission generation model, which is highly desirable for adaptive planning; 2) developing a spatial regression-based modeling approach that adds to the currently practiced approach; and 3) testing the role of spatial information in modeling regional-level household travel GHG emissions from large GPS-based household travel survey datasets.

Methodology
To fill the research gap identified above, an integrated approach is proposed based on the Greater Cincinnati Household Travel Survey data. The purpose of the methodology is to link household travel-related GHG emissions to land use, socioeconomic, demographic, spatial, and temporal factors. An additional methodological goal is to rapidly quantify GHG emissions by simulating scenario-based land use and socioeconomic changes. Figure 2 illustrates the heuristic framework of this research. The household travel data processing procedure extracts household travel characteristics based on the survey database. The purposes are threefold. First, the GHG emissions of each location-specific household are calculated using the vehicle specific power (VSP) approach, whose inputs are traditionally unavailable, and the EPA-approved MOVES model. Second, the trip features extracted on the basis of household socioeconomic data are used to update the trip rates table for the customized travel demand model. Module two, the contributing variables module, produces the contributing variables for spatial cross-sectional modeling, including TAZ-level attributes, trip-level attributes, and spatial weights. The spatial cross-sectional model is then estimated. Third, the spatial model calibration module provides justified land use patterns and the associated spatial distribution of households. The last part of this research measures the goodness of fit of the OLS model and the proposed spatial regression models.
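The VSP step in the first module can be illustrated with a short sketch. It assumes the widely cited generic light-duty VSP expression (speed in m/s, acceleration in m/s², result in kW per metric ton); these coefficients, the function names, and the bin edges are illustrative assumptions and are not taken from this study, which relies on MOVES for the emission calculation.

```python
import numpy as np

def vsp_kw_per_tonne(speed_mps, grade=0.0, dt=1.0):
    """Second-by-second VSP from a speed trace.

    Assumption: generic light-duty coefficients (1.1, 9.81, 0.132, 0.000302),
    not values reported in this paper.
    """
    v = np.asarray(speed_mps, dtype=float)
    a = np.gradient(v, dt)  # acceleration estimated from the speed trace
    return v * (1.1 * a + 9.81 * grade + 0.132) + 0.000302 * v**3

def vsp_bin_shares(vsp, edges):
    """Share of time spent in each VSP bin; edges are hypothetical bin limits."""
    idx = np.digitize(vsp, edges)  # bin 0 is below edges[0]
    counts = np.bincount(idx, minlength=len(edges) + 1)
    return counts / counts.sum()

# Example: five seconds of steady cruising at 10 m/s on level ground
trace = np.full(5, 10.0)
vsp = vsp_kw_per_tonne(trace)
shares = vsp_bin_shares(vsp, edges=[0.0, 3.0, 6.0])
```

A time-in-bin distribution of this kind is the sort of operating-mode input that a second-by-second GPS trace makes available and that link-average speeds cannot provide.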

Spatial autocorrelation of the variables
The first law of geography, according to Waldo Tobler, is that "everything is related to everything else, but near things are more related than distant things" [30]. This observation is embedded in the gravity model of trip distribution. It is also related to the law of demand, in that interactions between places are inversely proportional to the cost of travel, much as the probability of purchasing a good is inversely proportional to its cost. Spatial autocorrelation refers to the correlation of a variable with itself through space. If there is any systematic pattern in the spatial distribution of a variable, the variable is said to be spatially autocorrelated. OLS regressions assume that observations have been selected randomly. However, if the observations are spatially clustered to a certain degree, the estimates obtained from the correlation coefficient or the OLS estimator will be biased and overly precise. The bias arises because areas with higher concentrations of events have a greater impact on the model estimation. The precision is overestimated because events tend to be concentrated, so there are actually fewer independent observations than assumed.
The most common measure of spatial autocorrelation is Moran's autocorrelation coefficient (often denoted as I). It is an extension of the Pearson product-moment correlation coefficient to a univariate series [31,32]. Recall that Pearson's correlation (denoted as ρ) between two variables x and y, both of length n, is:

$$\rho = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2\,\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$

where $\bar{x}$ and $\bar{y}$ are the sample means of the two variables. ρ measures whether, on average, $x_i$ and $y_i$ are associated. In the study of spatial patterns and processes, it is logically expected that close observations are more likely to be similar than those far apart. It is common to associate a weight with each pair $(x_i, x_j)$ that quantifies this expectation [33]. In its simplest form, these weights are 1 for close neighbors and 0 otherwise. The weights are sometimes referred to as a neighboring function, with $w_{ii}$ set to 0. Moran's I can be interpreted as the correlation between a variable x and the "spatial lag" of x, formed by averaging the values of x over the neighboring areal units (i.e., polygons).
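The binary neighbor weights and the spatial lag described above can be sketched as follows. A regular grid with rook contiguity stands in for the TAZ polygons; that grid, and the function names, are assumptions made only for illustration.

```python
import numpy as np

def rook_weights(n_rows, n_cols):
    """Binary contiguity matrix for a grid: 1 for cells sharing an edge, else 0."""
    n = n_rows * n_cols
    w = np.zeros((n, n))
    for r in range(n_rows):
        for c in range(n_cols):
            i = r * n_cols + c
            if r + 1 < n_rows:  # neighbor below
                j = (r + 1) * n_cols + c
                w[i, j] = w[j, i] = 1.0
            if c + 1 < n_cols:  # neighbor to the right
                j = r * n_cols + c + 1
                w[i, j] = w[j, i] = 1.0
    return w  # diagonal stays 0, i.e. w_ii = 0

def spatial_lag(w, x):
    """Average of x over each unit's neighbors (row-standardized weights times x)."""
    row_sums = w.sum(axis=1, keepdims=True)
    return (w / row_sums) @ np.asarray(x, dtype=float)

# Example: a 2 x 2 grid, where every cell has exactly two neighbors
w = rook_weights(2, 2)
lag = spatial_lag(w, [1.0, 2.0, 3.0, 4.0])
```

For real zones the same matrix would be built from polygon adjacency rather than grid position; the spatial lag calculation itself would not change.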
Moran's autocorrelation coefficient I is measured by:

$$I = \frac{n}{S_0}\,\frac{\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}(x_i-\bar{x})(x_j-\bar{x})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}$$

where $w_{ij}$ is the weight between observations i and j, and $S_0 = \sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}$ is the sum of all weights. Moran's I varies on a scale between [-1,1]. A value close to -1 indicates high negative spatial autocorrelation; a value close to 0 indicates no or minimal autocorrelation; a value close to 1 suggests high positive spatial autocorrelation.
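As a numerical check on the definition of Moran's I, the following sketch evaluates it on a hypothetical chain of four units in which each unit neighbors the next. The data values are made up for illustration.

```python
import numpy as np

def morans_i(x, w):
    """Moran's I: (n / S0) * sum_ij w_ij z_i z_j / sum_i z_i^2, with z = x - mean(x)."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    z = x - x.mean()
    s0 = w.sum()  # sum of all weights
    return (len(x) / s0) * (z @ w @ z) / (z @ z)

# Hypothetical chain 1-2-3-4 with binary neighbor weights and zero diagonal
w_chain = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

trend = morans_i([1, 2, 3, 4], w_chain)         # similar values sit next to each other
alternating = morans_i([1, -1, 1, -1], w_chain)  # neighbors always differ
```

The smoothly increasing series gives a positive value (1/3), while the alternating series gives exactly -1, matching the interpretation of the two ends of the scale.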
The null hypothesis of the spatial autocorrelation (Moran's I) test is that the data are completely spatially random. If the p-value is not statistically significant, the null hypothesis cannot be rejected. If the p-value is statistically significant and the z-score is positive, the null hypothesis is rejected in favor of spatial clustering. Table 1 shows Moran's I and its statistical testing results. Almost all the zonal attributes are determined to be spatially dependent.
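The logic of the test can be sketched with a permutation approach, in which the observed values are reshuffled across locations to approximate the distribution of Moran's I under spatial randomness. This is a swapped-in illustration: the text does not state how the z-scores and p-values in Table 1 were computed, and the clustered example data below are hypothetical.

```python
import numpy as np

def morans_i(x, w):
    """Moran's I for values x and weights matrix w."""
    z = np.asarray(x, dtype=float) - np.mean(x)
    return (len(z) / w.sum()) * (z @ w @ z) / (z @ z)

def moran_permutation_test(x, w, n_perm=999, seed=0):
    """Return observed I, a permutation z-score, and a one-sided pseudo p-value."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    observed = morans_i(x, w)
    # Reference distribution: I recomputed after shuffling values over locations
    sims = np.array([morans_i(rng.permutation(x), w) for _ in range(n_perm)])
    z_score = (observed - sims.mean()) / sims.std(ddof=1)
    p_value = (np.sum(sims >= observed) + 1) / (n_perm + 1)
    return observed, z_score, p_value

# Hypothetical chain of 20 units: low values clustered at one end, high at the other
n = 20
w = np.zeros((n, n))
for i in range(n - 1):
    w[i, i + 1] = w[i + 1, i] = 1.0
x = np.array([0.0] * 10 + [1.0] * 10)
observed, z_score, p_value = moran_permutation_test(x, w)
```

With strongly clustered values such as these, the z-score is positive and the pseudo p-value is small, so the null hypothesis of spatial randomness would be rejected.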

Candidate spatial cross-sectional models
The general form of the spatial cross-sectional model is:

$$Y = \rho W Y + X\beta + W X\theta + u, \qquad u = \lambda W u + \varepsilon$$

where:
• WY denotes the endogenous interaction effects among the dependent variables,
• WX the exogenous interaction effects among the independent variables,
• Wu the interaction effects among the disturbance terms of the different spatial units,
• ρ is called the spatial autoregressive coefficient,
• λ the spatial autocorrelation coefficient, and
• θ represents a K × 1 vector of fixed but unknown parameters.
Here W is the spatial weights matrix, β is the vector of coefficients on the independent variables X, and ε is the independent error term.

Figure 3 shows the variations of spatial cross-sectional models that result from different assumptions about the error distribution and the above parameters. Since no assumption about the error term distribution can be made in advance, this study tested all of the spatial cross-sectional models in Figure 3 and selected the model that best fits the data.

Table 2 shows the variables with their coefficient estimates. The R2 (coefficient of determination) gives information about the goodness of fit of a model. In regression, R2 is a statistical measure of how well the regression line approximates the real data points. An R2 of 1 indicates that the regression line perfectly fits the data. The linear model has an R2 of 0.8002, which suggests that the model is a good fit.

The scale-location plot is similar to the plot of residuals versus fitted values, but it uses the square root of the standardized residuals. A well-fitting linear model should show randomness in this plot. The last plot, residuals versus leverage, uses Cook's distance to identify points that have more influence than other points. Generally these are points that are distant from other points in the data, either in the dependent variable or in one or more independent variables. Each observation is represented as a line whose height indicates the value of Cook's distance for that observation. There are no hard and fast rules for interpreting Cook's distance, but large values (which are labeled with their observation numbers) represent points that may require further investigation.
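The two OLS diagnostics discussed above, R2 and Cook's distance, can be computed directly from the fitted residuals and the leverage values. The sketch below uses a made-up data set with one deliberately distorted observation; it is a generic illustration and does not reproduce the model in Table 2.

```python
import numpy as np

def ols_diagnostics(x, y):
    """Fit y on x with an intercept; return coefficients, R^2 and Cook's distances."""
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(y)), x])  # design matrix with intercept
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    # Leverage: diagonal of the hat matrix X (X'X)^-1 X'
    h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
    mse = (resid @ resid) / (n - p)
    cooks = resid**2 / (p * mse) * h / (1.0 - h) ** 2
    return beta, r2, cooks

# Hypothetical data: an exact line except for the last point, which is shifted up
x = np.arange(10, dtype=float)
y = 1.0 + 2.0 * x
y[9] += 20.0
beta, r2, cooks = ols_diagnostics(x, y)
most_influential = int(np.argmax(cooks))
```

The shifted point combines a large residual with high leverage, so it has the largest Cook's distance, which is the pattern the residuals versus leverage plot is meant to expose.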

K-fold cross-validation of the OLS model
K-fold cross-validation is one way to improve on the holdout method. The data set is divided into k subsets, and the holdout method is repeated k times. Each time, one of the k subsets is used as the test set and the other k-1 subsets are put together to form a training set. Then the average error across all k trials is computed. The advantage of this method is that it matters less how the data are divided. Every data point is in a test set exactly once and in a training set k-1 times. The variance of the resulting estimate is reduced as k is increased. The disadvantage of this method is that the training algorithm has to be rerun from scratch k times, which means it takes k times as much computation to make an evaluation. A variant of this method is to randomly divide the data into a test set and a training set k different times. The advantage of doing this is that the size of each test set and the number of trials to average over can be chosen independently. A common value of k for model cross-validation is 10. However, since there are 693 TAZs in our dataset, k = 9 is used so that the folds are of equal size (77 TAZs each).
The data are randomly assigned to the folds. Each fold is removed in turn, while the remaining data are used to refit the regression model and predict the removed observations. Table 3 shows the residual sum of squares and mean square. Figure 5 is the validation plot showing the removed (folded) versus fitted data. The plot indicates a good validation, since the removed versus fitted data for each fold follow a similar 45-degree line. Overall, the OLS model is validated and is a good fit.
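The fold assignment and refitting procedure can be sketched as follows. The synthetic regression data are hypothetical and serve only to exercise the procedure; the observation count of 693 and k = 9 follow the text.

```python
import numpy as np

def kfold_indices(n_obs, k, seed=0):
    """Randomly assign observation indices to k folds of (near) equal size."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_obs), k)

def kfold_mse(x, y, k, seed=0):
    """Average out-of-fold mean squared error of an OLS fit with intercept."""
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(y)), x])
    errors = []
    for test_idx in kfold_indices(len(y), k, seed):
        train = np.ones(len(y), dtype=bool)
        train[test_idx] = False  # remove the current fold
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        pred = X[test_idx] @ beta  # predict the removed observations
        errors.append(np.mean((y[test_idx] - pred) ** 2))
    return float(np.mean(errors))

# 693 observations split into 9 folds gives 77 observations per fold
folds = kfold_indices(693, 9)
fold_sizes = [len(f) for f in folds]

# Hypothetical data: y = 1 + 2x plus small noise
rng = np.random.default_rng(42)
x = rng.normal(size=693)
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=693)
cv_mse = kfold_mse(x, y, k=9)
```

Because 693 is divisible by 9, every fold holds the same number of observations, so each fold's error contributes equally to the average.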

Spatial regression analysis results
The spatial regression models are estimated using the maximum likelihood method. Table 4 shows the variable coefficients for the OLS, SAR, SEM, SDM, SDEM, KPM, and MAM models. The coefficients of the variables that are not spatially dependent (i.e., Avg_CarbEM, Avg_TRIPSP) are quite similar across models, whereas the coefficients of the spatially dependent variables vary more. This is expected because each model has different assumptions and a different form, as shown in Figure 3.
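To illustrate what maximum likelihood estimation involves for one member of this family, the sketch below fits the SAR model (θ = 0, λ = 0) by maximizing its concentrated log likelihood over a grid of ρ values. The grid search, the simulated data, and the lattice weights are all assumptions made for illustration; the text does not state which software or optimizer was used.

```python
import numpy as np

def rook_weights_row_standardized(n_rows, n_cols):
    """Row-standardized rook contiguity on a grid (a stand-in for TAZ contiguity)."""
    n = n_rows * n_cols
    w = np.zeros((n, n))
    for r in range(n_rows):
        for c in range(n_cols):
            i = r * n_cols + c
            if r + 1 < n_rows:  # cell below
                w[i, i + n_cols] = w[i + n_cols, i] = 1.0
            if c + 1 < n_cols:  # cell to the right
                w[i, i + 1] = w[i + 1, i] = 1.0
    return w / w.sum(axis=1, keepdims=True)

def simulate_sar(W, beta, rho, seed=0):
    """Draw y from y = rho*W*y + X*beta + eps with standard normal eps."""
    rng = np.random.default_rng(seed)
    n = W.shape[0]
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    eps = rng.normal(size=n)
    y = np.linalg.solve(np.eye(n) - rho * W, X @ beta + eps)
    return y, X

def concentrated_loglik(rho, y, X, W):
    """SAR log likelihood with beta and sigma^2 replaced by their ML values given rho."""
    n = len(y)
    A = np.eye(n) - rho * W
    Ay = A @ y
    beta, *_ = np.linalg.lstsq(X, Ay, rcond=None)
    resid = Ay - X @ beta
    sigma2 = resid @ resid / n
    _, logdet = np.linalg.slogdet(A)  # log |I - rho W|
    loglik = logdet - 0.5 * n * (np.log(2.0 * np.pi * sigma2) + 1.0)
    return loglik, beta

def fit_sar(y, X, W, grid=None):
    """Pick the rho on the grid with the highest concentrated log likelihood."""
    if grid is None:
        grid = np.linspace(-0.95, 0.95, 191)
    logliks = [concentrated_loglik(r, y, X, W)[0] for r in grid]
    rho_hat = float(grid[int(np.argmax(logliks))])
    loglik, beta_hat = concentrated_loglik(rho_hat, y, X, W)
    return rho_hat, beta_hat, loglik

W = rook_weights_row_standardized(15, 15)
y, X = simulate_sar(W, beta=np.array([1.0, 2.0]), rho=0.5, seed=1)
rho_hat, beta_hat, loglik = fit_sar(y, X, W)
```

The log-determinant term is what distinguishes this from OLS on the spatially filtered variable: it penalizes values of ρ that would otherwise shrink the residuals artificially. The maximized log likelihood is also the ingredient needed for information-based comparisons such as AIC.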

Goodness of fit measures for candidate models
Measuring goodness of fit in spatial regression models is slightly more complex because standard measures such as R2 are not available. Instead, information-based measures are commonly used. The information-based approach evaluates several model performance measures and ranks the models on each of them. The model with the lowest summed rank is considered a better fit than the others. Table 5 shows the information-based measures and their ranks for the OLS, SAR, SEM, SDM, SDEM, KPM, and MAM models. The ranking uses AIC, log likelihood, and Moran's I on residuals as measures. For all three criteria, smaller values are better. The SDEM model has the lowest sum of ranks and therefore fits the data best.
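The rank-sum comparison can be sketched as below. Every number in the example is an invented placeholder for three anonymous models and is not taken from Table 5. The sketch also assumes that log likelihood is ranked through its negative and Moran's I on residuals through its absolute value, so that smaller is better for each criterion; those conventions are assumptions, since the text does not spell them out.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood

def rank_values(values):
    """Rank 1 goes to the smallest value; tied values share the lowest rank."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def rank_sum(measures):
    """measures maps criterion -> {model: value}, smaller is better for each."""
    models = list(next(iter(measures.values())))
    totals = dict.fromkeys(models, 0)
    for by_model in measures.values():
        ranks = rank_values([by_model[m] for m in models])
        for model, rank in zip(models, ranks):
            totals[model] += rank
    return totals

# Placeholder inputs (hypothetical, for illustration only)
loglik = {"model_a": -120.0, "model_b": -110.0, "model_c": -105.0}
n_params = {"model_a": 3, "model_b": 4, "model_c": 6}
moran_resid = {"model_a": 0.30, "model_b": 0.05, "model_c": 0.08}

measures = {
    "aic": {m: aic(loglik[m], n_params[m]) for m in loglik},
    "neg_loglik": {m: -loglik[m] for m in loglik},
    "abs_moran_resid": {m: abs(moran_resid[m]) for m in loglik},
}
totals = rank_sum(measures)
best_model = min(totals, key=totals.get)
```

Summing ranks weights the three criteria equally regardless of their scales, which is why a model does not need to win on every criterion to come out on top.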

Discussion
A spatial regression-based modeling framework was developed based on finding the minimal model residuals and on multiple information-based measures of fit. Measuring goodness of fit in spatial regression models is slightly more complex because standard measures such as R2 are not available, so information-based measures are commonly used instead. The OLS model has an R2 (coefficient of determination) of 0.8, which indicates a good fit. However, examination of the residuals in the diagnostic plots showed that they are still spatially correlated. This suggests that spatial models can fit the data better and reduce the residual spatial correlation. After the spatial regressions were performed, the information-based measures of fit (AIC, log likelihood, and Moran's I on residuals) were compared, and the model that best fits the given dataset is the Spatial Durbin Error Model (SDEM). The SDEM has the lowest AIC and the lowest Moran's I on residuals among the candidate models.
This study has provided a proof of concept for the proposed methodology and a solid foundation for modeling land use changes and analyzing GHG emissions. The results indicate that the proposed method is capable of revealing the dynamic linkage between land use, transportation, and emissions. The findings provide insights into how land use planning alternatives built on adopted policies and enforced development regulations correlate with travel patterns and their consequent GHG emissions. The level of specificity of the land use change and GHG emission analysis presented in this study enables more data and indicators to be developed. Such data and indicators can be incorporated into decision makers' plans, policies, and ultimately regulations, and can possibly be integrated with project-level review processes.

Conclusion
While the results from this study offer specific recommendations as to which types of land use planning policy practices are most highly associated with higher VMT and GHG emissions, there is also potential to reveal policy impacts that can be applied to integrated land use and transportation sustainability practices. The results of this research are expected to add to the existing body of knowledge by enabling faster and easier methods of examining the impact of adaptive planning strategies on alleviating the effects of household travel GHG emissions. The spatial cross-sectional regression model is developed through the integration of actual and scenario-based land use visioning and planning, demographic changes, transportation emission analysis, and computer forecasting and evaluation of future scenarios. This research makes it possible to assess the household travel GHG footprint and provides models and data for possible GHG emission mitigation through land use policies and changes. Although the results pertain to the specific dataset, they help transportation decision makers to better connect land use development and its related household socioeconomics with their GHG emission characteristics. In particular, the household travel GHG emission quantification results contribute to the current body of knowledge in two ways: (1) they provide accurate GHG emission results by using the best available traffic activity data inputs (VSP distributions) for emission modeling; and (2) they provide connections between household socioeconomics and the household travel GHG footprint. The research shows important potential to provide solid grounds for analyzing and modeling sustainable community strategies, adaptive planning policies, and many other policy-making applications.

Table 5: Information-based measure of fit for spatial models.