Improving resolution of a spatial air pollution inventory with a statistical inference approach

Horabik, Joanna; Nahorski, Zbigniew

doi:10.1007/s10584-013-1029-4

Improving resolution of a spatial air pollution inventory with a statistical inference approach

Open access
Published: 28 January 2014

Volume 124, pages 575–589, (2014)
Cite this article

Download PDF

You have full access to this open access article

Climatic Change Aims and scope Submit manuscript

Improving resolution of a spatial air pollution inventory with a statistical inference approach

Download PDF

Joanna Horabik¹ &
Zbigniew Nahorski¹

2969 Accesses
14 Citations
1 Altmetric
Explore all metrics

Abstract

This paper presents a novel approach to allocation of spatially correlated data, such as emission inventories, to finer spatial scales, conditional on covariate information observable in a fine grid. Spatial dependence is modelled with the conditional autoregressive structure introduced into a linear model as a random effect. The maximum likelihood approach to inference is employed, and the optimal predictors are developed to assess missing values in a fine grid. An example of ammonia emission inventory is used to illustrate the potential usefulness of the proposed technique. The results indicate that inclusion of a spatial dependence structure can compensate for less adequate covariate information. For the considered ammonia inventory, the fourfold allocation benefited greatly from incorporation of the spatial component, while for the ninefold allocation this advantage was limited, but still evident. In addition, the proposed method allows correction of the prediction bias encountered for the upper range emissions in the linear regression models.

Improving resolution of a spatial air pollution inventory with a statistical inference approach

Spatial model for daily air quality high resolution estimation

Article Open access 18 April 2024

European Population Exposure to Airborne Pollutants Based on a Multivariate Spatio-Temporal Model

Article 18 July 2016

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

The development of high-resolution emission inventories is essential for designing suitable abatement measures. Spatial distributions of emissions can serve as an input for atmospheric dispersion models, which in turn may produce concentration maps of pollutants contributing to the adverse health effects, like ammonia emissions. For other air pollutants, such as greenhouse gasses (GHG), spatial patterns become helpful in improving identification of distributed emission sources.

Numerous issues underlying preparation of spatially resolved GHG inventory were discussed e.g. in Boychuk and Bun (this issue), Bun et al. 2010 or Thiruchittampalam et al. 2010. In general, the task crucially depends on availability of spatially distributed activity data. For instance, at present in Poland the activity data relevant to GHG emissions can be obtained at a level of country regions (voivodships). Information of higher spatial resolution can be often obtained only for some proxy data related to GHG emissions, such as land use and linear emission sources. Recently, also nighttime lights observed by a satellite have been used for more accurate estimation of spatial distribution of CO₂ emissions (Ghosh et al. 2010; Oda and Maksyutov 2011).

Typically, the regression models have been applied for spatial allocation of emission data (Dragosits et al. 1998; Oda and Maksyutov 2011). However, emissions in general tend to be spatially correlated, which provides opportunity for potential improvements. This idea motivated us to develop a more advanced approach for accurate disaggregation of air pollution data.

Making inference on variables at points or grid cells different from those of the data is referred to as the change of support problem (Gelfand 2010). Several approaches have been proposed to address this issue. The geostatistical solution for realignment from point to a real data is provided by block kriging (Gotway and Young 2002). Areal weighting offers a straightforward approach if the data are observed at a real units, and the inference is sought at a new level of spatial aggregation. Some improved approaches with better covariate modeling were also proposed e.g. in Mugglin and Carlin 1998 and Mugglin et al. 2000.

In this study we propose to apply methods of spatial statistics to produce higher resolution emission inventory data, taking advantage of more detailed land use information. The approach resembles to some extent the method of Chow and Lin (1971), originally proposed for disaggregation of time series based on related, higher frequency series. Here, a similar methodology is employed to disaggregate spatially correlated data.

Regarding an assumption on residual covariance, we apply the structure suitable for areal data, i.e. the conditional autoregressive (CAR) model. Although the CAR specification is typically used in epidemiology (Banerjee et al. 2004), it was also successfully applied for modelling air pollution over space (Kaiser et al. 2002; McMillan et al. 2010). Compare also Horabik and Nahorski (2010) for another application of the CAR structure to model spatial inventory of GHG emissions. The maximum likelihood approach to inference is employed, and the optimal predictors are developed to assess missing concentrations in a fine grid.

The application part of the study concerns an ammonia (NH₃) emission inventory in a region of Poland. Ammonia is emitted mainly by agricultural sources such as livestock production and fertilized fields. Its high concentrations can lead to acidification of soils, forest decline, and eutrophication of waterways. Ammonia emissions are also recognized for their importance in contributing to fine particulate matter; hence its spatial distribution is of great importance. However, agricultural emission sources cannot be measured directly, and spatial emission patterns need to be assessed otherwise. This issue was addressed, among others, by Dragosits et al. 1998, where agricultural and land cover data were used to disaggregate the national NH₃ emission totals across Great Britain. We demonstrate that the straightforward approaches based on linear dependences might be improved by introducing a spatial random effect.

Nevertheless, the proposed approach is of wider applicability, and can be used in numerous situations where higher resolution of spatial data is needed. In the context of the greenhouse gasses, the method might be particularly adequate to improve resolution of these activity data which tend to be spatially correlated. The plausible sectors include agriculture, transportation and forestry. Improved resolution may in turn contribute to reduction in uncertainties underlying GHG inventories.

2 Disaggregation framework

This section presents the statistical approach to the issue of spatial disaggregation. We have available data on a spatially distributed variable (inventory of emissions) integrated in a coarse grid. The aim is to estimate the distribution of this variable in a fine grid, conditional on some explanatory variables observable in this grid. It is assumed that the variable of interest is spatially correlated. Its residual covariance structure is set and the conditional autoregressive model is applied. An additional important assumption of the method is that the covariance structure of the variable in a coarse grid is the same as that in a fine grid.

Below we specify the model and provide details on its estimation in the coarse grid as well as on prediction in the fine grid.

2.1 Model

Fine grid

We begin with the model specification in a fine grid. Let Y _i denote a random variable associated with a missing value of interest y _i defined at each cell i for i = 1,…,n of a fine grid (n denotes the overall number of cells in a fine grid). Assume that each random variable Y _i follows the Gaussian distribution with the mean μ _i and variance σ ²_Y

$$ {Y}_i\left|{\mu}_i\right.\sim Gau\left({\mu}_i,{\sigma}_Y^2\right). $$

(1)

Given the values μ _i and σ ²_Y , the random variables Y _i are assumed independent, thus the joint distribution of Y = (Y ₁ ,…, Y _n)^T conditional on the mean process μ = (μ ₁, …,μ _n)^T is the Gaussian

$$ \left.\boldsymbol{Y}\right|\boldsymbol{\mu} \sim Ga{u}_n\left(\boldsymbol{\mu}, {\sigma}_Y^2{\boldsymbol{I}}_n\right), $$

(2)

where I _n is the n × n identity matrix; the superscript ^T stands for the transpose.

The mean μ represents the true process underlying emissions, and the (missing) observations are related to this process through a measurement error with the variance σ ²_Y . The model for the mean process is formulated as a sum of the regression component with available covariates, and a spatially varying random effect. For this, the conditional autoregressive model is used. The CAR model is given through the specification of the full conditional distribution functions of μ _i for i = 1,…,n (Cressie 1993; Banerjee et al. 2004)

$$ \left.{\mu}_i\right|{\boldsymbol{\mu}}_{-i}\sim Gau\left({\boldsymbol{x}}_i^T\boldsymbol{\beta} +\rho {\displaystyle \sum_{\begin{array}{l}j=1\\ {}j\ne i\end{array}}^n\frac{w_{ij}}{w_{i+}}\left({\mu}_j-{\boldsymbol{x}}_j^T\boldsymbol{\beta} \right),\frac{\tau^2}{w_{i+}}}\right), $$

(3)

where μ _− i denotes all elements in μ but μ_i, w _ij are the adjacency weights (w _ij = 1 if j is a neighbour of i and 0 otherwise, also w _ii = 0); w _i + = ∑ _j w _ij is the number of neighbours of an area i; x _i is a vector containing 1 as its first element (for the intercept β ₀) and k explanatory covariates of an area i as the next elements; β = (β ₀,β ₁, …,β _k)^T is a vector of regression coefficients. For calculation of the adjacency weights we use the Queen Method, i.e. two cells are considered neighbours if they share a side or a vertex. The CAR structure follows an assumption of similar random effects in adjacent cells; this is reflected in the second summand of the conditional expected value of μ _i, which is proportional to the average values of remainders μ _j − x ^T_j β for neighbouring sites (i.e. when w _ij = 1). This proportion is calibrated with the parameter ρ. Thus ρ reflects the strength of spatial association. The variance of the full conditional distribution of μ _i is inversely proportional to the number of neighbours w _i +, and τ ² is a variance parameter.

Given (3), the joint probability distribution of the process μ is as follows, see e.g. Banerjee et al. (2004)

$$ \boldsymbol{\mu} \sim Ga{u}_n\left(\boldsymbol{X}\boldsymbol{\beta }, {\tau}^2{\left(\boldsymbol{D}-\rho \boldsymbol{W}\right)}^{-1}\right), $$

(4)

where X is the matrix whose rows are the vectors x ^T_i

$$ \boldsymbol{X}=\left[\begin{array}{llll} 1\hfill & {x}_{1 1}\hfill & \cdots \hfill & {x}_{1k}\hfill \\ {}\vdots \hfill & \vdots \hfill & \ddots \hfill & \vdots \hfill \\ {} 1\hfill & {x}_{n 1}\hfill & \cdots \hfill & {x}_{nk}\hfill \end{array}\right]; $$

D is an n × n diagonal matrix with w _i + on the diagonal; and W is an n × n matrix with adjacency weights w _ij. Equivalently we can write (4) as

$$ \begin{array}{ll}\boldsymbol{\mu} =\boldsymbol{X}\boldsymbol{\beta } +\boldsymbol{\varepsilon}, \hfill & \boldsymbol{\varepsilon} \sim Ga{u}_n\left(\boldsymbol{0},\Omega \right),\hfill \end{array} $$

(5)

where Ω = τ ²(D − ρ W)^− 1.

Coarse grid

The model for a coarse grid (aggregated) observed data is obtained by multiplication of (5) with the N × n aggregation matrix C consisting of 0’s and 1’s, indicating which cells have to be aggregated together

$$ \begin{array}{ll}\boldsymbol{C}\boldsymbol{\mu } =\boldsymbol{CX}\boldsymbol{\beta } +\boldsymbol{C}\boldsymbol{\varepsilon } \hfill & \boldsymbol{C}\boldsymbol{\varepsilon } \sim Ga{u}_N\left(\boldsymbol{0},\boldsymbol{C}\Omega {\boldsymbol{C}}^T\right)\hfill \end{array} $$

(6)

where N is the number of observations in a coarse grid. Now, suppose that the random variable λ = Cμ is the mean process for random variables Z = (Z ₁, …,Z _N)^T associated with observations z = (z ₁, …,z _N)^T of the aggregated model

$$ \left.\boldsymbol{Z}\right|\boldsymbol{\lambda} \sim Ga{u}_N\left(\boldsymbol{\lambda}, {\sigma}_Z^2{\boldsymbol{I}}_N\right). $$

(7)

Thus, random variables Z _i, i = 1, …, N are conditionally independent

$$ \left.{Z}_i\right|{\lambda}_i\sim Gau\left({\lambda}_i,{\sigma}_Z^2\right) $$

(8)

where λ _i is the i-th element of the vector λ.

2.2 Estimation and prediction

Having available observations of Z _i in the coarse grid, we can estimate parameters β, σ ²_Z , τ ² and ρ with the maximum likelihood (ML) method. First, from (6) and (7) the joint unconditional distribution of Z is derived

$$ \boldsymbol{Z}\sim Ga{u}_N\left(\boldsymbol{CX}\beta, \boldsymbol{M}+\boldsymbol{C}\Omega {\boldsymbol{C}}^T\right), $$

(9)

where M = σ ²_Z I _N, I _N is the N × N identity matrix; see e.g. Lindley and Smith (1972). Next, the log likelihood function associated with (9) is formulated

$$ \begin{array}{c}\hfill L\left(\boldsymbol{\beta}, {\sigma}_Z^2,{\tau}^2,\rho \right)=-\frac{1}{2} \log \left|\boldsymbol{M}+\boldsymbol{C}\Omega {\boldsymbol{C}}^T\right|-\frac{N}{2} \log \left(2\pi \right)\hfill \\ {}\hfill -\frac{1}{2}{\left(\boldsymbol{z}-\boldsymbol{CX}\boldsymbol{\beta } \right)}^T{\left(\boldsymbol{M}+\boldsymbol{C}\Omega {\boldsymbol{C}}^T\right)}^{-1}\left(\boldsymbol{z}-\boldsymbol{CX}\boldsymbol{\beta } \right),\hfill \end{array} $$

where | ⋅ | denotes the determinant. With fixed σ ²_Z , τ ² and ρ, the above log likelihood is maximised for

$$ \boldsymbol{\beta} \left({\sigma}_Z^2,{\tau}^2,\rho \right)={\left[{\left(\boldsymbol{CX}\right)}^T{\left(\boldsymbol{M}+\boldsymbol{C}\Omega {\boldsymbol{C}}^T\right)}^{-1}\boldsymbol{CX}\right]}^{-1}{\left(\boldsymbol{CX}\right)}^T{\left(\boldsymbol{M}+\boldsymbol{C}\Omega {\boldsymbol{C}}^T\right)}^{-1}\boldsymbol{z}, $$

which substituted back into the function L(β,σ ²_Z ,τ ²,ρ) provides the profile log likelihood

$$ \begin{array}{c}\hfill L\left({\sigma}_Z^2,{\tau}^2,\rho \right)=-\frac{1}{2} \log \left|\boldsymbol{M}+\boldsymbol{C}\Omega {\boldsymbol{C}}^T\right|-\frac{N}{2} \log \left(2\pi \right)\hfill \\ {}\hfill -\frac{1}{2}{\left[\boldsymbol{z}-\boldsymbol{CX}{\left[{\left(\boldsymbol{CX}\right)}^T{\left(\boldsymbol{M}+\boldsymbol{C}\Omega {\boldsymbol{C}}^T\right)}^{-1}\boldsymbol{CX}\right]}^{-1}{\left(\boldsymbol{CX}\right)}^T{\left(\boldsymbol{M}+\boldsymbol{C}\Omega {\boldsymbol{C}}^T\right)}^{-1}\boldsymbol{z}\right]}^T\hfill \\ {}\hfill \times {\left(\boldsymbol{M}+\boldsymbol{C}\Omega {\boldsymbol{C}}^T\right)}^{-1}\hfill \\ {}\hfill \times \left[\boldsymbol{z}-\boldsymbol{CX}{\left[{\left(\boldsymbol{CX}\right)}^T{\left(\boldsymbol{M}+\boldsymbol{C}\Omega {\boldsymbol{C}}^T\right)}^{-1}\boldsymbol{CX}\right]}^{-1}{\left(\boldsymbol{CX}\right)}^T{\left(\boldsymbol{M}+\boldsymbol{C}\Omega {\boldsymbol{C}}^T\right)}^{-1}\boldsymbol{z}\right].\hfill \end{array} $$

Further maximisation of L(σ ²_Z ,τ ²,ρ) is performed numerically, including checks on ρ to ensure that the matrix D − ρ W is non-singular, see Banerjee et al. (2004).

To obtain the standard errors of the estimated parameters, one needs to derive the Fisher information matrix. The asymptotic variance-covariance matrix of the ML estimators is obtained by inverting the expectation of the negative of the second derivatives (the Hessian) of the log likelihood function, and the expectation is evaluated at the ML estimates. In other words, the expected Fisher information matrix is used to obtain the standard errors of parameters. Calculation of the Hessian with respect to the regression coefficients is relatively straightforward, but it becomes more burdensome for the covariance parameters. A detailed derivation of the explicit formulas for the expected Fisher information matrix will be provided elsewhere; here we report the standard errors of the parameter estimators obtained in the case study.

To estimate the required values in a fine grid, the following prediction procedure is applied. Note that our primary interest is the underlying emission inventory process μ. The predictors optimal in the minimum mean squared error sense are given by E(μ|z). The joint distribution of (μ,Z) is given by

$$ \left[\begin{array}{l}\boldsymbol{\mu} \hfill \\ {}\boldsymbol{Z}\hfill \end{array}\right]\sim Ga{u}_{N+n}\left(\left[\begin{array}{l}\boldsymbol{X}\boldsymbol{\beta } \hfill \\ {}\boldsymbol{C}\boldsymbol{X}\boldsymbol{\beta} \hfill \end{array}\right],\left[\begin{array}{ll}\Omega \hfill & \Omega {\boldsymbol{C}}^T\hfill \\ {}\boldsymbol{C}\Omega \hfill & \boldsymbol{M}+\boldsymbol{C}\Omega {\boldsymbol{C}}^T\hfill \end{array}\right]\right). $$

(10)

The distribution (10) allows for full inference, yielding both the predictor $ \widehat{\boldsymbol{\mu}}=\widehat{E}\left(\left.\boldsymbol{\mu} \right|\boldsymbol{z}\right) $ and its error $ {\widehat{\sigma}}_{\mu}^2=V\widehat{a}r\left(\left.\boldsymbol{\mu} \right|\boldsymbol{z}\right) $

$$ \widehat{\boldsymbol{\mu}}=\boldsymbol{X}\widehat{\boldsymbol{\beta}}+\widehat{\Omega}{\boldsymbol{C}}^T{\left(\widehat{\boldsymbol{M}}+\boldsymbol{C}\widehat{\Omega}{\boldsymbol{C}}^T\right)}^{-1}\left[\boldsymbol{z}-\boldsymbol{CX}\widehat{\boldsymbol{\beta}}\right] $$

(11)

$$ {\widehat{\sigma}}_{\mu}^2=\widehat{\Omega}-\widehat{\Omega}{\boldsymbol{C}}^T{\left(\widehat{\boldsymbol{M}}+\boldsymbol{C}\widehat{\Omega}{\boldsymbol{C}}^T\right)}^{-1}\boldsymbol{C}\widehat{\Omega}, $$

(12)

where $ \widehat{\cdot} $ denotes the estimated values.

3 Case study

3.1 Data

The proposed procedure is illustrated using a real dataset of gridded inventory of NH₃ (ammonia) emissions from fertilization (in tonnes per year) reported in the northern region of Poland (the Pomorskie voivodship). The inventory grid cells are of a regular size 5 km × 5 km, and the whole of cadastral survey compiles n = 800 cells, denoted y = (y ₁, …,y _n)^T, see Fig. 1. For explanatory information we use the CORINE Land Cover Map for this region, available at the European Environment (2010). Specifically, for each grid cell we calculate the area of these land use classes, which are related to ammonia emissions. The following CORINE classes were considered (the CORINE class numbers are given in brackets):

Non-irrigated arable land (211), denoted x ₁ = (x _1,1, …,x _n,1)^T;
Fruit tree and berry plantations (222), denoted x ₂ = (x _1,2, …,x _n,2)^T;
Pastures (231), denoted x ₃ = (x _1,3, …,x _n,3)^T;
Complex cultivation patterns (242), denoted x ₄ = (x _1,4, …,x _n,4)^T;
Principally agriculture, with natural vegetation (243), denoted x ₅ = (x _1,5, …,x _n,5)^T.

Performance of the proposed disaggregation framework depends on a few factors. Perhaps the most crucial ones are the following two: (i) explanatory power of covariates available in the fine grid, and (ii) an extent of disaggregation, which is connected with preservation of the spatial correlation. The impact of both these features will be evaluated in our case study.

Regarding the first factor, we will examine models with all the above land use classes (set 1), and compare the results with models including only two of them: non-irrigated arable land and complex cultivation patterns (set 2). This subset of land use classes was chosen on the basis of its explanatory power. When limiting the number of explanatory variables, these two covariates provided the best results. Secondly, we compare linear regression with independent (iid) errors versus spatially correlated errors modelled by the CAR process. We consider the following models:

Model CAR1: - CAR errors, set 1 of covariates;
Model LM1: - iid errors, set 1 of covariates;
Model CAR2: - CAR errors, set 2 of covariates;
Model LM2: - iid errors, set 2 of covariates.

This setting of four models is intended to enable the analysis of extent to which a limited number of explanatory information can be compensated by spatial modelling.

Regarding the second factor, we test the disaggregation from 10 × 10 km to 15 × 15 km (coarse) grids into a 5 km × 5 km (fine) grid. To examine performance of the disaggregation procedure, first the original fine grid emissions are aggregated into respective coarse grid cells. Next, the proposed model is fitted and ammonia emissions are predicted for a 5 km × 5 km (fine) grid. Finally, the obtained results are checked with the original inventory emissions of a 5 km × 5 km (fine) grid. Thus, our simulation study tests the cases of a fourfold and ninefold disaggregation. The aggregated values of the two coarse grids as well as the actual inventory data in the fine grid are shown in Fig. 1.

3.2 Results of disaggregation from the 10 km grid

This subsection presents the model testing results for disaggregation from the 10 km grid. Table 1 (the upper part) displays the maximum likelihood estimates (denoted by Est.) and standard errors (denoted by Std.Err.) of the parameter estimators for each model. Note that in the models with set 1 of covariates (CAR1, LM1) the regression coefficient β₀ was dropped as it was statistically insignificant. In the table, we can observe that the ML estimates of the regression coefficients are similar for all the models. From the ratio of regression coefficients and its respective standard errors (i.e. the t-test statistic), we can roughly conclude that all the considered land use classes are statistically significant; in fact, in each case respective p-values proved to be less than 0.05 (not shown). Next, let us turn our attention to the error part of the models. Significantly lower values of σ ²_Z estimates under both the CAR models, as compared with their linear regression counterparts, indicate that greater variability is explained by the models with spatially correlated errors than by the corresponding models with independent errors. As expected, among the spatially correlated models, both variance parameters σ ²_Z and τ² are higher for CAR2 than for CAR1 model with five land use classes as explanatory variables. Furthermore, the parameter ρ reflects strength of the spatial correlation. Note that ρ = 0 corresponds to a model with independent errors, see also Banerjee et al. (2004) for more details. A value of parameter ρ is higher for CAR2 model, which illustrates that in the models of limited explanatory power, the importance of spatial correlation becomes more pronounced.

Table 1 Maximum likelihood estimates

Full size table

The results of the four models are also summarized using the Akaike criterion (AIC). The idea of AIC is to favour a model with a good fit and to penalize it for a number of parameters; models with smaller AIC are preferred to models with larger AIC. Table 2 (the upper part) displays AIC for each model, and additionally it reports the negative log likelihood (-L). Naturally, the models with set 1 of covariates provide much better results than the models with another set. Among these respective sets, the models with the spatial structure considerably improve results obtained with the models of independent errors. Note, that this improvement is higher for the models with set 2 of covariates (797.6–742.8 = 54.8) than for the models with set 1 of covariates (685.1–640.7 = 44.4).

Table 2 Model comparison and analysis of residuals (d _i = y _i − y ^*_i )

Full size table

The values of ammonia emissions predicted in a 5 km × 5 km grid (y ^*_i ) are featured in Fig. 2. Differences between the four models are negligible, although a visual comparison with the original emissions in Fig. 1 (the left-hand-side plot) suggests that the both models based on set 1 of covariates (CAR1, LM1) provide slightly better results. Since the mapped emission values are classified into just 9 bins, therefore some features might not be easily distinguishable on the maps in Fig. 2. To remedy this, Fig. 3 presents the model residuals (d _i = y _i − y ^*_i ). Now the difference in prediction results among the models is evident—the best results are obtained for CAR1 model and the worst for LM2 model.

At this point it must be stressed that the values predicted in a fine grid (y ^*_i ) are calculated with the formula (11) based on the aggregated values of 10 km grid; the calculations are made as if the true emissions were unknown. On the other hand, recall that these true emissions in the fine grid (y _i) are available; see the left-hand-side map in Fig. 1. From now on, our analysis is based on a comparison between the prediction results obtained with the proposed technique and the original fine grid ammonia emissions (observations).

Figure 4 presents, for each model, a scatterplot of predicted values y ^*_i versus observations y _i. The straight line has slope 1, thus if the predicted values are close to the original data, points are close to the straight line. This setting, once again, illustrates much better explanatory power of models based on all the land use classes (set 1 of covariates). It also illustrates the importance of the spatial structure component. In the case of models CAR2 and LM2, the introduction of spatial dependence has evidently improved the accuracy of prediction. Whereas in the case of models CAR1 and LM1, the applied spatial structure considerably limited a number of highly overestimated predictions (points below the straight line). Furthermore, we note that for a prevailing number of cases in the high emission range (emissions over 1.5 tonnes) the linear regression LM1 provides biased (underestimated) predictions, while CAR1 model allows overcoming this deficiency. This is due to the fact that the analysed emissions are spatially correlated, that is, cells located nearby tend to have similar values. In particular, the majority of high emission values are located in the eastern part of the voivodeship as well as in the north-west stripe along the coastline (compare Fig. 1). The covariates applied in the linear regression LM1 explain emission variability only to some extent, and the point is that the unexplained variability remains spatially correlated. This can be noticed on the map in Fig. 3 for LM1 model, where clusters of residual values (0.2–0.5) in the mentioned areas indicate underestimated predictions. The autocorrelation term in the model CAR1 allows for this feature. In Fig. 4 it can be seen as a slope of a dotted line, which is visibly higher than 1 for LM1 model, while for CAR1 it lines up with the one of slope 1.

The residuals d _i are further analysed in Table 2 (the upper part). Namely, the mean squared error (MSE) is calculated

$$ MSE=\frac{1}{n}{\displaystyle {\sum}_i{\left({y}_i-{y}_i^{*}\right)}^2}, $$

and it should be as low as possible. The mean squared error reflects how well a model predicts data. In Table 2 we report also the minimum and maximum values of d _i, and the sample correlation cofficient r between the predicted y ^*_i and observed y _i values. In terms of both the mean squared error and the coefficient r, the best model is CAR1 and the poorest one is LM2, following the previous assessments. Interestingly, the remaining two models changed their ranks compared with the AIC criterion. That is, CAR2 model has lower MSE = 0.158 and higher coefficient r = 0.901 than the linear model based on set 1 of covariates (LM1 model with MSE = 0.186 and r = 0.882). This proves that the model with a limited number of covariates but having a spatial component (CAR2) can provide better disaggregation results than the model based solely on linear regression, even though its covariate information is richer (LM1). Note that the analysis based on residuals is more robust than the AIC rating, which basically tests a model fit to the aggregated data.

Following the formula (12), we also calculate the prediction error. Since in the present case study the correct values of predicted emissions are known, we are in a position to compare the prediction error with the actual residuals (more precisely, with its absolute values). In Fig. 5 these values are presented for CAR2 model. It is noticeable that the prediction error is significantly underestimated, and moreover, it does not reflect the diversification of the actual residuals properly. Note that in the both maps the highest errors are reported on the border of the domain; this fact is known in spatial modelling as the edge effect.

3.3 Results of disaggregation from the 15 km grid

Next, we present the results of disaggregation from the 15 km grid. The conducted analysis is similar to the one of the 10 km grid and, where appropriate, both settings are compared.

The lower part of Table 1 contains the maximum likelihood estimates for the 15 km grid data. In the models with set 1 of covariates, the regression coefficient β₀ was again dropped. Moreover, in all the models at this level of aggregation the land use class “Fruit tree and berry plantations” (β₂) was statistically insignificant, and thus it was also dropped. The remaining land use classes were informative, with respective p-values lower than 0.05.

As regards the error part, all the comments reported for 10 km disaggregation remain valid also here, although their degree is significantly lower. Both CAR models provide lower values of σ ²_Z than their linear regression counterparts. However, the reduction of unexplained variability between the models, for instance, LM1 and CAR1 is only 1.5 (3.5/2.339), while it was over 3 (1.165/0.334) for respective models of 10 km disaggregation. This suggests that the spatial correlation strength of the 15 km grid model is smaller than the 10 km grid one. Thus, here the CAR models are less competitive than the LM models, as compared to the former grid.

The values of AIC criterion and of the negative log likelihood (-L) are reported in the lower part of Table 2. Similarly as for the disaggregation from a 10 km grid, also in this case the models based on set 1 of covariates provide better results. The CAR structure improves obtained linear regression results of both respective covariate sets. Note, however, that in the setting of 15 km disaggregation, the impact of the spatial component is not that substantial anymore as it was previously. Again, a bigger improvement is noted for the models with a limited number of covariates (504.4–492.8 = 11.6 in terms of the AIC criterion), and the gain from incorporation of the spatial component is only marginal for the models with set 1 of covariates (455.9–455.3 = 0.6).

For the four considered models, the maps of ammonia emissions disaggregated from the 15 km grid and predicted in the fine grid provided visually similar results (not shown). The residual maps proved to be more informative, see Fig. 6. While for the 10 km disaggregation the residual maps clearly indicated discrepancies among the models, here it is not easily visible. The models based on set 1 of covariates (CAR1, LM1) provide smaller residuals. However, the differences between the spatial models and their linear regression counterparts seem to be negligible.

Again, Table 2 (the lower part) provides further analysis of residuals. The mean squared error MSE and the correlation coefficient r yield a consistent ranking of the models. Obviously the best model is CAR1 with r = 0.915 and MSE = 0.136, while the poorest one is LM2 with r = 0.807 and MSE = 0.295. When it comes to the remaining two models, LM1 slightly outperforms CAR2 (in terms of the mean squared error). Note that this order is reversed when compared with the results of the 10 km grid disaggregation (the upper part of the table). Therefore, when disaggregating from the 10 km grid, the spatial structure is more informative than some of the covariates, but this is not true anymore when disaggregating from the 15 km grid. From this we conclude that in this particular case study, the proposed framework offers an efficient tool for a quadruple and nine-times disaggregation, but it may become less adequate for higher order allocations.

The actual interplay among the four models is illustrated on the scatterplots in Fig. 7. In general, the 15 km disaggregation preserves the features reported previously—the performance of respective models is analogous as for the 10 km disaggregation. It means that for the models based on set 2 of covariates, the spatial correlation significantly improves prediction quality. Also for the other two models, the introduction of spatial structure is still beneficial as it allows correction of the prediction bias and a slight reduction in the number of overestimates. We highlight the difference between the models CAR2 and LM1 that yield almost the same MSE and coefficient r, but provide completely distinct plots, see Fig. 7. The residuals of CAR2 model are more dispersed owing to a limited set of explanatory covariates. On the other hand, improved covariate modelling of LM1 leads to the residuals gathered close to the diagonal, but a lack of spatial averaging results in larger amount of overestimated values. Altogether, the assessment of residuals for both models becomes the same.

4 Discussion and conclusions

The major objective of this study was to demonstrate how a variable of interest (here, emissions) available in a coarse grid plus information on covariates available in a finer grid can be combined together to provide the variable of interest in a finer grid, and therefore to improve its spatial resolution. We proposed a relevant disaggregation model and illustrated the approach using a real dataset of ammonia emission inventory. The idea is conceptually similar to the method of Chow and Lin (1971), originally designed for time series data; see also Polasek et al. (2010). It was applied to the spatially correlated data, and spatial dependence was modelled with the conditional autoregressive structure introduced into a linear model as a random effect.

The model allows for this part of a spatial variation which has not been explained by available covariates. Thus, if the covariate information does not correctly reflect a spatial distribution of a variable of interest, there is potential for improving the approach with a relevant model of a spatial correlation. The underlying assumption of the method is that the covariance structures of the variable in a coarse grid and in a fine grid are the same. In the present study of ammonia emissions examined in 5 km, 10 km, and 15 km grids, this assumption proved to be reasonable.

Performance of the proposed framework was evaluated with respect to the following two factors: explanatory power of covariates available in a fine grid, and the extent of disaggregation. The results indicate that inclusion of a spatial dependence structure can compensate for less adequate covariate information. For the considered ammonia inventory, the fourfold allocation benefited greatly from the incorporation of the spatial component, while for the ninefold allocation this advantage was limited, but still evident. In addition, the proposed method allowed to correct the prediction bias encountered for upper range emissions in the linear regression models.

We note that in this case study we used the original data in a fine grid to assess the quality of resulting predictions. For the purpose of potential applications, we developed also a relevant measure of prediction error (the formula 12). Although not entirely faultless, it is the first attempt to quantify the prediction error in situations, where original emissions in a fine grid are not known.

Other approaches, such as a geostatistical model, might be potentially used in the case of spatial allocation. Application of the geostatistical approach brings us to the concept of block kriging (Gelfand 2010). However, it should be stressed that geostatistics is more appropriate for point referenced data, while our proposition is dedicated to the case of emission inventories which involve a real data. Thus, the choice between these two options should be considered on a case by case basis.

Another possibility to deal with the issue of spatial disaggregation could be to use some expert knowledge and logical inference; compare Verstraete (this issue) for a fuzzy inference system to the map overlay problem.

The described method opens the way to uncertainty reduction of spatially explicit emission inventories, hence the future work will also include testing the proposed disaggregation framework for inventories of greenhouse gasses.

References

Banerjee S, Carlin BP, Gelfand AE (2004) Hierarchical modeling and analysis for spatial data. Chapman & Hall/CRC, Boca Raton
Boychuk K, Bun R (this issue) Regional spatial cadastres of GHG emissions in energy sector: accounting for uncertainty
Bun R, Hamal K, Gusti M, Bun A (2010) Spatial GHG inventory at the regional level: accounting for uncertainty. Clim Chang 103(1–2):227–244
Article Google Scholar
Chow GC, Lin A (1971) Best linear unbiased interpolation, distribution, and extrapolation of time series by related series. Rev Econ Stat 53(4):372–375
Article Google Scholar
Cressie NAC (1993) Statistics for spatial data. John Wiley & Sons, New York
Google Scholar
Dragosits U, Sutton MA, Place CJ, Bayley AA (1998) Modelling the spatial distribution of agricultural ammonia emissions in the UK. Environ Pollut 102(S1):195–203
Article Google Scholar
European Environment Agency (2010) Corine land cover 2000. http://www.eea.europa.eu/data-and-maps/data. Cited August 2010
Gelfand AE (2010) Misaligned spatial data: the change of support problem. In: Gelfand AE, Diggle PJ, Fuentes M, Guttorp P (eds) Handbook of spatial statistics. Chapman & Hall/CRC, Boca Raton
Ghosh T, Elvidge CD, Sutton PC et al (2010) Creating a global grid of distributed fossil fuel CO₂ emissions from nighttime satellite imagery. Energies 3:1895–1913
Article Google Scholar
Gotway CA, Young LJ (2002) Combining incompatible spatial data. J Am Stat Assoc 97:632–648
Article Google Scholar
Horabik J, Nahorski Z (2010) A statistical model for spatial inventory data: a case study of N₂O emissions in municipalities of southern Norway. Clim Chang 103(1–2):263–276
Article Google Scholar
Kaiser MS, Daniels MJ, Furakawa K, Dixon P (2002) Analysis of particulate matter air pollution using Markov random field models of spatial dependence. Environmetrics 13:615–628
Article Google Scholar
Lindley DV, Smith AFM (1972) Bayes estimates for the linear model. J Roy Stat Soc B 34:1–41
Google Scholar
McMillan AS, Holland DM, Morara M, Fend J (2010) Combining numerical model output and particulate data using Bayesian space-time modeling. Environmetrics 21:48–65
Google Scholar
Mugglin AS, Carlin BP (1998) Hierarchical modeling in geographical information systems: population interpolation over incompatible zones. J Agric Biol Environ Stat 3:111–130
Article Google Scholar
Mugglin AS, Carlin BP, Gelfand AE (2000) Fully model-based approaches for spatially misaligned data. J Am Stat Assoc 95:877–887
Article Google Scholar
Oda T, Maksyutov S (2011) A very high-resolution (1 km × 1km) global fossil fuel CO₂ emission inventory derived using a point source database and satellite observations of nighttime lights. Atmos Chem Phys 11:543–556
Article Google Scholar
Polasek W, Llano C, Sellner R (2010) Bayesian methods for completing data in spatial models. Rev Econ Anal 2:194–214
Google Scholar
Thiruchittampalam B, Theloke J, Uzbasich M et al. (2010) Analysis and comparison of uncertainty assessment methodologies for high resolution Greenhouse Gas emission models. In: Proceedings of the 3rd International Workshop on Uncertainty in Greenhouse Gas Inventories, Lviv Polytechnic National University, Ukraine, 22–24 Sept 2010
Verstraete J (this issue) Solving the map overlay problem with a fuzzy approach

Download references

Acknowledgments

The study was conducted within the 7FP Marie Curie Actions IRSES project No. 247645. J. Horabik acknowledges support from the Polish Ministry of Science and Higher Education within the funds for statutory works of young scientists. This contribution is also supported by the Foundation for Polish Science under International PhD Projects in Intelligent Computing; project financed from The European Union within the Innovative Economy Operational Programme 2007–2013 and European Regional Development Fund. Z. Nahorski was financially supported by the statutory funds of the Systems Research Institute of Polish Academy of Sciences.

The authors gratefully acknowledge the provision of data for the case study from Ekometria – Biuro Studiów i Pomiarów Proekologicznych in Gdańsk, Poland.

Author information

Authors and Affiliations

Polish Academy of Sciences, Systems Research Institute, ul. Newelska 6, 01-447, Warsaw, Poland
Joanna Horabik & Zbigniew Nahorski

Authors

Joanna Horabik
View author publications
You can also search for this author in PubMed Google Scholar
Zbigniew Nahorski
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Joanna Horabik.

Additional information

This article is part of a Special Issue on “Third International Workshop on Uncertainty in Greenhouse Gas Inventories” edited by Jean Ometto and Rostyslav Bun.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.

Reprints and permissions

About this article

Cite this article

Horabik, J., Nahorski, Z. Improving resolution of a spatial air pollution inventory with a statistical inference approach. Climatic Change 124, 575–589 (2014). https://doi.org/10.1007/s10584-013-1029-4

Download citation

Received: 02 January 2013
Accepted: 01 December 2013
Published: 28 January 2014
Issue Date: June 2014
DOI: https://doi.org/10.1007/s10584-013-1029-4

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Improving resolution of a spatial air pollution inventory with a statistical inference approach

Abstract

Similar content being viewed by others

Improving resolution of a spatial air pollution inventory with a statistical inference approach

Spatial model for daily air quality high resolution estimation

European Population Exposure to Airborne Pollutants Based on a Multivariate Spatio-Temporal Model

1 Introduction