Estimating ambient air pollutant levels in Suzhou through the SPDE approach with R-INLA

Spatio – temporal models of ambient air pollution can be used to predict pollutant levels across a geographical region. These predictions may then be used as estimates of exposure for individuals in analyses of the health effects of air pollution. Integrated nested Laplace approximations is a method for Bayesian inference, and a fast alternative to Markov chain Monte Carlo methods. It also facilitates the SPDE approach to spatial modelling, which has been used for modelling of air pollutant levels, and is available in the R-INLA package for the R statistics software. Covariates such as meteorological variables may be useful predictors in such models, but covariate misalignment must be dealt with. This paper describes a flexible method used to estimate pollutant levels for six pollutants in Suzhou, a city in China with dispersed air pollutant monitors and weather stations. A two-stage approach is used to address misalignment of weather covariate data.


Introduction
Research into the health effects of ambient air pollution requires long-term measurements of pollutant exposure at the individual level. Studies on longer-term effects of air pollution exposure in China (and elsewhere) have used averaged concentrations from static ambient air pollution monitors (Cao et al., 2011;Chen et al., 2016;Dong et al., 2012;Li et al., 2014;Zhang et al., 2011;Zhou et al., 2014) or satellite data (Peng et al., 2017;Yin et al., 2017) for exposure information. For analyses that require individual exposure levels for study participants, spatio-temporal models of ambient air pollution may be used to predict pollutant levels across a geographical region. Predictions at individuals' residential or employment locations can then be used as estimates of ambient air pollution exposure.
Bayesian inference offers a practical method for applying such spatio-temporal models and producing predictions. Integrated nested Laplace approximations (INLA) (Rue et al., 2009) allow fast computation for Bayesian inference and enable the use of the SPDE approach for spatial modelling (Lindgren et al., 2011). These methods have been used in modelling of air pollutant levels in Italy Fioravanti et al., 2021) and England (Blangiardo et al., 2016).
Meteorological variables can be useful predictors in models of ambient air pollution, but weather station locations may not coincide with the pollutant monitor locations or locations where predictions are sought. This is a case of the problem of covariate misalignment, where covariate data are not available at the same locations as observed dependent data. Joint modelling (Barber et al., 2016) or error models can be used to incorporate such covariates while accounting for uncertainty.
We obtained estimated pollutant exposure levels for participants in Suzhou in the China Kadoorie Biobank study. Data were available from static monitors for six pollutants: fine (PM 2.5 ), and coarse (PM 10 ) particulate matter, sulphur dioxide (SO 2 ), nitrogen dioxide (NO 2 ), carbon monoxide (CO), and ozone (O 3 ). We used Bayesian spatial-temporal models to predict monthly levels of each pollutant at all clinic locations in the area. These predictions can be used as proxies for individual pollution exposure in analyses of health outcomes. This method exploits the spatial information from having monitors in different locations, providing localised exposure estimates that are not available by averaging pollution levels across the study area. Weather data were also available from stations in the city. However, the locations of weather monitors did not coincide with the pollutant monitors. Previous work (Blangiardo et al., 2016;Cameletti et al., 2013;Fioravanti et al., 2021) has used covariates that are fixed over space, or used the geographically closest available measurements. Given the limited number and placement of weather monitors, we used a two-stage approach to address misalignment of weather covariate data, and compare four models for including weather covariates in the pollutant models.

China Kadoorie Biobank study
The China Kadoorie Biobank study (Chen et al., 2005) recruited 512, 726 participants between 2004 and 2008, from ten diverse areas of China. Participants are followed up for a wide range of health outcomes via linkages with health insurance systems, established disease surveillance systems and death registries. Details of the study design and methods have been reported previously (Chen et al, 2005. In Suzhou, 53,269 study participants were recruited each of whom is linked to one of 77 local clinics. One clinic located outside the urban area of Suzhou was excluded from this analysis.

INLA and SPDE spatial models
Integrated nested Laplace approximations (INLA) (Rue et al., 2009;Wang et al., 2018) is a fast alternative to Markov chain Monte Carlo (MCMC) methods for Bayesian inference from latent Gaussian models. The method uses numerical integration and Laplace approximations for approximate Bayesian inference and is implemented in the R package R-INLA (R-INLA, 2020). The package includes many latent models, including SPDE spatial models (Lindgren et al., 2011), error models (Muff et al., 2013), and auto-regressive models. Posterior predictive distributions produced by fitting Bayesian models can be used to generate point or ranges of predictions.
The SPDE approach to spatial modelling, implemented in the R-INLA package, involves representing a continuously indexed Gaussian field with Matérn covariance as a discretely indexed Gaussian Markov random field (GMRF). This is achieved by means of a basis function representation defined on a triangulation of the domain. The GMRF has a sparse precision matrix and so computationally efficient methods for matrix factorisation, and INLA methods for Bayesian inference, can be used (Bakka et al., 2018;Bivand et al., 2015;Blangiardo et al., 2013;Blangiardo and Cameletti, 2015;Gomez-Rubio, 2020;Krainski et al., 2019;Lindgren et al., 2011;Lindgren and Rue, 2015;Moraga, 2019). The SPDE approach to spatial modelling and the Matérn covariance function and its parameters are well described in Chapter 6 of Blangiardo and Cameletti (2015). Separable space-time models, defined by the Kronecker product between the two precision matrices, can be constructed using the group feature in R-INLA. This allows spatial and temporal correlations to be jointly modelled. This form of spatio-temporal model is well described in Chapter 7 of Blangiardo and Cameletti (2015) and Chapter 10 of Moraga (2019). These methods have previously been used for spatio-temporal modelling of PM 10 levels in Italy Fioravanti et al., 2021) and NO 2 levels in England (Blangiardo et al., 2016).

Pollution and weather data
The data included daily average measurements of six pollutants: particulate matter with diameter of 2.5 μm or less (PM 2.5 ), particulate matter with a diameter between 2.5 and 10 μm (PM 10 ), sulphur dioxide (SO 2 ), nitrogen dioxide (NO 2 ), carbon monoxide (CO), and ozone (O 3 ). Measurements were available between January 2013 and December 2015 from up to 10 pollution monitors situated in Suzhou (as shown in Fig. 1). Daily weather data, including temperature, pressure, precipitation and wind speed, were available from five monitors in the region from January 2013 to June 2016. The locations of the weather monitors are also shown in Fig. 1 and do not coincide with the locations of the pollution monitors or clinics. Five geographic covariates were available for all locations: elevation; distance to nearest major road; distance to nearest motorway; total length of major roads and motorways in a 1 km radius; and land use (a binary variable representing "urban" or "nonurban"). Elevation values were interpolated from the values of the four nearest raster cells.

Methods
In order to address the misalignment problem and use weather data variables as covariates in the pollutant models, two stages of models were used. Firstly, models for each weather variable were used to obtain predictions of each weather variable at pollution monitor and clinic locations. These predictions were then included as covariates in the models for the pollutants. Approximate Bayesian inference was performed using INLA with the R-INLA software package (R-INLA, 2020) for R. This analysis used pollutant and weather variables aggregated to monthly means. This reduced the size of the data being used and number of values to be estimated, and made distributions approximately normal. For example, daily rainfall data are highly skewed with many zeroes, but the observed monthly average rainfall has a symmetric distribution. Observed daily pollutant levels of zero were set to missing, as these were believed to indicate errors in the data.

Meshes for SPDE spatial model
A mesh (triangulation) of the region was required to apply SPDE spatial models. The same mesh was used for all weather variable and pollutant models. Latitude and longitude coordinates of all locations were converted to Universal Transverse Mercator (UTM) coordinates. All clinic and monitor locations are in UTM zone 51. These coordinates were then re-scaled with centre equal to the midpoint of all monitor and clinic locations and so that 1 unit equals approximately 1 km. The R-INLA inla.mesh.2d function, which employs constrained refined Delaunay triangulation, was then used to construct a triangular mesh on the region. The domain was formed by the convex hull of all weather station, pollutant monitor and clinic locations, with an inner extension of 5 km and an outer extension of 15 km. The locations of all pollution monitors and weather monitors were used as initial triangulation nodes. The maximum edge ledge was set to 5 km (10 km in the outer extension), the minimum triangle angle was set to 28 • (18 • in the outer extension), and the minimum distance between points was set to 0.1 km. A denser mesh was also constructed using the locations of all clinics, as well as weather stations and pollutant monitors, as initial triangulation nodes. The weather and pollutant prediction models were additionally fit using this mesh, and point predictions (median of the posterior predictive distribution) of the pollutants were compared between using either mesh by Pearson correlation. The main mesh had 432 nodes and the denser mesh 711 nodes. The meshes are shown in Fig. 2.

Weather models
From the weather data, four variables were selected representing temperature (daily average temperature), humidity (daily average humidity), wind speed (daily 10 min maximum wind speed), and precipitation (total 24 h precipitation). The wind speed variable was log transformed. Each of the four weather variables was aggregated to monthly means and then re-scaled to have a mean of zero and a variance of one.
Each of the four weather variables was modelled as a Gaussian response. Model predictors with and without linear effects for space and time trends were compared using the Watanabe-Akaike (or "Widely Applicable") information criterion (WAIC) (Gelman et al., 2014;Watanabe, 2010). Calendar month was included as a factor. No level was dropped but the intercept term was dropped, so that prior distributions were exchangeable for levels of this factor. All models included a space-time model, using an SPDE spatial model (i.e. an approximation to a Gaussian field with Matérn covariance) for spatial correlations and a first order auto-regressive model for temporal correlations. Details and formulae for the models are provided in a supplementary file.

Pollutant models
After aggregation to monthly means, pollutant levels were log transformed and then modelled as Gaussian responses. The model predictors included spatial trends, a linear time trend, calendar month as a factor, five geographic covariates, and a space-time model with an SPDE spatial model (i.e. an approximation to a Gaussian field with Matérn covariance) and a first order auto-regressive model for temporal correlations. Continuous covariates were re-scaled to have mean zero and variance one for the pollution monitor and clinic locations. Spatial trends were included using terms for the x-and y-coordinates and the square of x-and y-coordinates. Allowing for a quadratic shape of trends (on the log scale for pollutants) prevented simple linear trends from being extrapolated in predictions for clinic locations far from the centre of the region. In particular, including only a simple linear trend led to extreme, implausible predictions for SO 2 levels at clinic locations in the far West of the region.
Four different approaches to include the standardised weather covariates in the pollutant models were compared: 1. Exclude weather covariates from the model predictor. 2. Include the mean of the values from each of the weather monitors, so that the same value is used for every location at the same time point.     There is collinearity between the weather variables and calendar month, which complicates the interpretation of individual coefficients. However, the aim of these models is prediction of pollutant levels, rather than inference for individual coefficients, so this is not a concern and the prediction ability of the models is not affected (Shmueli, 2010).
Models were also fit for each pollutant excluding the SPDE model, but including the temporal first order auto-regressive random effects for stations. This allowed comparison between models which account for spatial correlation and models which ignore spatial correlation.
Details and formulae for the models are provided in a supplementary file.

Prediction models
Given a model with response variable Y and the predictor η, the posterior distribution of the predictor η is the posterior distribution of the mean response, not the posterior predictive distribution of the response itself. To obtain posterior predictive distributions of the response (including uncertainty due to all sources of errormodelled by σ 2 e ) an adapted formulation of the model was used: The precision of the Gaussian response was fixed to be very large (e 20 ) so that the response Y is (effectively) equal to the value of the predictor θ. An independent and Gaussian distributed random effect was then added to the predictor, so θ = η + ε and ε ∼ Normal(0, σ 2 e ). This strategy means that the posterior distribution of the predictor θ is the posterior predictive distribution of the response.
Posterior predictive distributions for pollutants were summarised by medians and 95% equal tailed intervals. Examples of the R code used for prediction models, for humidity and PM 10 , are provided in a supplementary file.

Priors
The prior distributions for calendar month and other fixed effect parameters were Normal with mean 0 and precision 0.001. The priors for the precision of the responses were Gamma with a shape parameter of 1 and an inverse scale parameter of 5 × 10 − 5 . The priors for the coefficient, a, of the first order auto-regressive model were given by log ((1 + a) /(1 − a)) ∼ Normal(0, 1/ 0.15 ). Normal priors were used for the coefficients of Berkson error models with mean 1 and precision 0.001. The mean and precision parameters for the error models were fixed values and therefore do not have prior or posterior distributions.
For the SPDE spatial models, penalised complexity (PC) priors were used with P(r < 10) = 0.5 and P(σ > 1) = 0.5, where r is the range and σ the standard deviation of the field.

Posteriors and model fit statistics
Posterior distributions for parameters and hyperparameters were transformed to the original scale of the dependent variable as applicable, and then summarised by medians and 95% highest posterior density (HPD) intervals. As pollutant levels were log transformed in the models, the exponentiated covariate coefficients are interpretable as ratios. For the SPDE models, hyperparameters were transformed to the range and variance. Models were compared using the Watanabe-Akaike (or "Widely Applicable") Information Criterion (WAIC), a Bayesian approach for  estimating out-of-sample prediction error (Gelman et al., 2014;Watanabe, 2010).
To further asses the performance of the modelling approach, pollutant prediction models were also applied after excluding a sample of 50 pollutant observations. The sample was a simple random sample from all 348 combinations of monitor location and month (in which observed data were available). The RMSE and Pearson correlation between predicted (median of posterior predictive distribution) and observed values were then calculated.

Results
Observed weather and pollutant data are summarised in Table 1, using daily values and monthly means calculated for each monitor.
Precipitation data are missing for 458 daily observations. There are at least 10,209 observations for each pollutant across the three year period.

Weather models
Including linear trends for time or space did not consistently decrease the WAIC values, so trends were not included in the weather prediction models. Medians and 95% HPD intervals of posterior distributions for the calendar month intercepts and hyperparameters of the four weather models are given in Table 2. Seasonal patterns are present for each of the weather variables. In particular, temperature and precipitation have much higher intercepts during the summer months (June to September) as in the observed data. The ranges of the SPDE models are large (posterior medians from 108 km to 379 km). Temperature,

Pollutant models
WAIC values for five models for each pollutant are given in Table 3. The models including error models for weather variables have lower WAIC values -indicating a better fit to the data -than models which use other methods to incorporate weather covariates, for all pollutants. For all pollutants, except PM 10 , the WAIC for models excluding an SPDE spatial model is larger indicating that accounting for spatial correlations with an SPDE model improves the fit of the models. This also allows for individual predictions of pollutant levels at locations across the region. The following results and predictions use models which include error models for the weather covariates.
Medians and 95% HPD intervals of posterior distributions for the parameters and hyperparameters of the six pollutant models are given in Tables 4 and 5. Collinearity between calendar month intercepts and weather variables inhibits clear interpretation of these parameters. The particulate matter pollutants have the largest range for the spatial model (posterior medians 42 km and 45 km), followed by CO and SO 2 (30 km Fig. 7. Posterior medians for pollutant levels after using the main mesh and the denser mesh, and Pearson correlation coefficients. and 14 km), and then NO 2 and O 3 (8 km and 6 km).
Predicted levels (posterior medians) of PM 2.5 and SO 2 for January 2014 at all clinic locations are shown in Figs. 3 and 4. Posterior medians and 95% predictive intervals for PM 2.5 and SO 2 at four selected clinic locations (shown in Fig. 2) are given in Fig. 5 After fitting the pollutant prediction models while excluding a random sample of fifty observations, posterior medians and 95% predictive intervals for pollutant levels are shown in Fig. 6. RMSE and correlations between predicted values (posterior medians) and observed values are given in Table 6. Correlations range from 0.80 for CO to 0.98 for PM 2.5 .
Predicted pollutant levels are very similar (Pearson correlation coefficients greater than 0.99) when using the denser mesh for both weather variable and pollutant models. Posterior medians for pollutant levels after using either mesh are shown in Fig. 7.

Discussion
We have used Bayesian spatio-temporal models to predict levels of six pollutants at clinic locations in Suzhou, China. Inference was performed using the approximate INLA method and spatial models used the SPDE approach. The application of the SPDE approach for modelling pollutant levels has previously been reported by Cameletti et al. (2013) and Blangiardo et al. (2016). These analyses used covariates measured at or aligned to the same locations as the observed pollutant measurements. We extended this approach using a two-stage method to address misalignment of covariates. After using spatio-temporal models to produce predictions for four meteorological variables at all relevant locations, we used error models to add these as predictors in the models for pollutants. This ensured that the pollutant models incorporated the uncertainty in the predicted weather covariate values. To obtain predictions for pollutant levels at the set of clinic locations we extended the pollutant models so that posterior predictive distributions were obtained directly from R-INLA function calls.
The models and methods described in this paper provide a flexible approach to modelling ambient air pollutant levels in a region with dispersed monitors. The analysis incorporates fixed and time-varying covariate data from several sources, including misaligned covariates for which error models were used to ensure appropriate error propagation. This approach could be adapted for other scenarios, and models can be expanded with comparative ease.
These results are based on monthly pollutant levels, which were aggregated from daily data to monthly means before developing prediction models. However, the models could be adapted to use daily average values for meteorological variables and pollutant levels to enable more detailed time-series analyses. To capture dependencies over time, splines could be used with auto-regressive models for the values at knot locations.
We suggest pollutant levels at clinic locations could be used as proxies for individual exposure. It would be desirable to have individual participant residence, employment and other common locations to estimate exposure, however only clinic location (anticipated to be close to residence) is available in the given data. The methods described could be used to predict pollutant levels at any locations in the city, and if more extensive location data were available more specific estimates of exposure could be calculated.
The limited number of pollutant and weather monitors did not allow for detailed modelling of pollutants and weather variables across the city. It would be preferable to have data from more monitors throughout the city to allow better predictions of levels across the city. Given the available data, we have leveraged the geographic information available to predict pollutant levels at each clinic location. This is an alternative to ignoring the locations of pollutant monitors by using city-wide means in time-series analyses of health outcomes, or using pollutant levels are the nearest monitor as estimated levels at a clinic location.
The ranges of the SPDE models in the weather models are much larger than the extent of the area over which the models were applied. In such cases the model is usually indistinguishable from intrinsic random fields (Lindgren and Rue, 2015), but we do not expect that this affects the utility of predicted weather variables as covariates in the pollutant models.
The narrow locations (East to West) of the pollutant monitors caused a problem with including overall spatial trends. Extrapolating simple linear trends to out-of-sample x-coordinates caused predictions to be implausibly high (with small precision) in some models, but this was tempered by including quadratic terms for spatial trends. Ideally, observed pollutant data would be more geographically diverse. Alternatively, there may be better methods for ensuring reasonable out-ofsample predictions, and this potential problem should be considered when planning this type of analysis.
As an alternative to the error models used here misaligned covariates could be jointly modelled with the pollutant variables of interest. Further, health outcome data could be jointly modelled with pollutant levels. This would allow a single modelling framework for exposure, covariate, and outcome data, at the cost of more complex models and the time and resources for computation. However, the use of INLA as an efficient alternative to MCMC methods could make such an approach feasible.

Declaration of competing interest
None.