Europe-wide air pollution modeling from 2000 to 2019 using geographically weighted regression

Five-fold cross-validation per year showed that GWR and GTWR explained similar spatial variations in annual average concentrations (average R 2 = NO 2 : 0.66; O 3 : 0.58; PM 10 : 0.62; PM 2.5 : 0.77), which are better than SLR (average R 2 = NO 2 : 0.61; O 3 : 0.46; PM 10 : 0.51; PM 2.5 : 0.75) and RF (average R 2 = NO 2 : 0.64; O 3 : 0.53; PM 10 : 0.56; PM 2.5 : 0.67). The GTWR predictions and a previously-used method of back-extrapolating 2010 model predictions using CTM were overall highly correlated (R 2 > 0.8) for all pollutants. Including spatially-varying relationships using GWR modestly improved European air pollution annual LUR models, allowing time-varying exposure-health risk models.


Introduction
Ambient air pollution contributes to around 7 million deaths mainly from non-communicable diseases (World Health Organization, 2021). To better understand the health effects of air pollution exposure, cohort studies are increasingly conducted in large study areas (Brauer et al., 2019;Di et al., 2017;Stafoggia et al., 2022). These cohort studies had follow-up periods of 10-20 years, with recruitment dating back to mid 1990s. To allow time-varying exposure analyses, these studies require annual exposure estimates harmonized for the study area during the follow-up period.
For the ELAPSE project (Effects of Low-Level Air Pollution: A Study in Europe) (de Hoogh et al., 2018a), we developed Europe-wide land use regression (LUR) models for nitrogen dioxide (NO 2 ), ozone (O 3 ), particulate matter less than 2.5 µm (PM 2.5 ), and black carbon for a single year (2010) using supervised linear regression (SLR). Chemical transport model (CTM) estimates at a large spatial resolution (50 km × 50 km) for multiple years were used to back-and forward-extrapolate estimates from the LUR model built for the year 2010 to earlier and later years where LUR models were unavailable. An important assumption, in the Europe-wide model (de Hoogh et al., 2018a) and other large-scale models (Bechle et al., 2015;Chen et al., 2019;Knibbs et al., 2014;Larkin et al., 2017;Lu et al., 2020), is that the relationship between air pollution and predictors such as traffic, population density, and meteorology can be modeled with fixed coefficients. However, the relationships between air pollution concentrations and predictors could differ across countries. For example, the same road density might be associated with higher NO 2 concentrations in southern Europe than in northern Europe, because the road density only serves as a proxy for traffic intensity and/or traffic-related emission sources, which can be varying across countries/regions. Geographically Weighted Regression (GWR) is a technique that allows for spatiallyvarying relationships between the predictors and air pollution concentrations. In previous work in North America (Van Donkelaar et al., 2015) and the globe (Hammer et al., 2020), GWR was used to adjust for spatial heterogeneity in the bias between surface concentrations calculated from satellite-derived observations and a large-scale chemical transport model (geophysical PM 2.5 estimate) and ground-based PM 2.5 observations. These studies have shown that GWR greatly improved the five-fold cross-validation (CV) accuracy compared to unadjusted predictions. Building on these studies, we evaluate whether it would improve air pollution predictions by incorporating spatial heterogeneity directly in the relationship between ground-based observations and several spatial variables in Europe.
Air pollution concentrations have generally been reduced in the past decades in Europe (Ortiz and Guerreiro, 2020). To obtain air pollution predictions for multiple years, explicit models for these years would likely improve upon back-extrapolating a single-year model with modeled spatio-temporal trends. It is however not clear whether developing specific models for every single year (single-year model) or developing models for groups of multiple years (multi-year model) is the most effective approach. We explored the application of geographically and temporally weighted regression (GTWR) (Fotheringham et al., 2015;Huang et al., 2010;Wu et al., 2014) to develop models for multiple years. GTWR is an extension of GWR, allowing both spatially-and temporally-varying relationships between air pollution and predictors. Incorporating temporally-varying relationships is important as we would avoid assuming that the quantitative relationship between a predictor and air pollution concentrations remains fixed over decades. Previously, GTWR was used at a regional scale in China to estimate ground-level daily PM 2.5 concentrations using satellite-derived aerosol optical depth (AOD) values and other spatial variables (Bai et al., 2016;He and Huang, 2018). They showed that GTWR performed better than GWR at a daily scale. Building on these studies, we evaluate whether model performance is also improved in modeling annual average concentrations for a long (20-year) time period at the continental (European) scale.
Recently, machine-learning methods, including random forest (RF), have been applied in developing LUR models (Chen et al., 2019;Kerckhoffs et al., 2019;Lu et al., 2020). RF models relax the linear assumption in linear regression methods and allow interactions between variables. Potentially, RF could therefore also include spatial heterogeneity, e.g. by using kriging to explain its residuals (Zhan et al., 2018) or by including climate zones or geographical coordinates as a predictor (Hengl et al., 2018).
We built annual LUR models in Europe from 2000 to 2019 for four regulated air pollutants: NO 2 , O 3 , particulate matter < 10 µm (PM 10 ), and PM 2.5 . Our first aim was to evaluate potential improvements by including spatially-varying relationships in Europe-wide LUR models. Our second aim was to compare single-year and multi-year modeling approaches. Our third aim was to compare our annual LUR models with our previous back-extrapolation methods (de Hoogh et al., 2018a;Gulliver et al., 2013). The novelty of our work is to evaluate model performance of GWR and GTWR in modeling relationships between ground-based observations and spatial predictors; to include multiple major pollutants with different spatio-temporal patterns; to compare with a machine-learning method; and the spatial resolution of 25 m at the continental European scale.

Materials and methods
Our models extended our previous Europe-wide models (de Hoogh et al., 2018a) by evaluating the spatial variations in the coefficients of the LUR model; by developing models for a large number of years; by improving spatial resolution from 100 m × 100 m to 25 m × 25 m; by extending the domain to include Eastern European countries; by incorporating time-varying predictors when available; and by incorporating more predictors such as meteorology.
We collected routine-monitoring data of air pollution and spatiotemporal predictors to estimate air pollution concentrations across Europe from 2000 to 2019. We developed models to estimate annual average air pollution exposure because we aimed to apply the model predictions in studying health effects of long-term air pollution exposure. As the health studies started recruitment in the past, the most important objective was to develop models for current and historical long-term exposure at a fine spatial resolution. Some recent studies have developed daily air pollution models for multiple years with good performance (de Hoogh et al., , 2018bDi et al., 2019;Liu et al., 2020;Requia et al., 2020;Shtein et al., 2018) at a spatial of 1 km × 1 km for the US studies and about 12 km × 12 km for the Chinese studies. As our objective was to develop long-term exposure at a fine spatial resolution and as we do not need daily maps, we did not develop daily models.
The data served as input to LUR models built by four algorithms in two temporal settings: 1) SLR for single-year and multi-year, 2) GWR for single-year, GTWR for multi-year, and 3) RF for single-year and multiyear. We refer to single-year as training a single model per year using data from that specific year and refer to multi-year as training a single model for consecutive years using all data from 2000 to 2019. We also evaluated shorter multi-year periods, specifically 3-to-6-year periods. Model performance was evaluated by 5-fold cross-validation (CV) and by an external validation dataset obtained in the ESCAPE project (European Study of Cohorts for Air pollution Effects) collected across Europe in 2010 Eeftens et al., 2012).
We built the models from 2000 to 2019, because of two reasons. Firstly, some potential predictors were only available after 2000 such as satellite-derived data. Secondly, the monitoring observations were highly clustered in specific countries and were limited in quantity before 2000 especially for PM 2.5 (with less than 10 observations) and PM 10 (with less than 450 observations), as shown in Figs. S1 & S2.

Air pollution monitoring observations
We collected routine monitoring data for NO 2 , O 3 , PM 10 , and PM 2.5 from the European Environmental Agency (EEA). We collected the observations before 2012 from the Airbase database v8 (European Environmental Agency, 2020a) and after 2012 from the Air Quality e-Reporting database (European Environmental Agency, 2020b).
For PM 10 and PM 2.5. , we aggregated daily observations to yearly averages, because most observations for PM were only available at a daily scale (integrated filter-based methods or continuous methods with insufficient hourly precision). NO 2 and O 3 were measured with continuous methods and available as hourly averages. For NO 2 , we aggregated hourly observations to annual averages. For O 3 , we calculated the daily maximum 8-hour mean for each day from hourly observations and then aggregated the daily values to annual averages. This statistic of daily maximum 8-hour mean was chosen because it is used in regulatory guidelines (WHO, 2021). All annual statistics were only used if more than 75% of the daily or hourly observations were valid as defined by the EEA. We did not train the single-year model if the annual average observations were available from less than 200 monitoring sites across Europe for that specific year. This only applied to the PM 2.5 observations before 2006. From 2000 to 2019, the number of monitoring sites with more than 75% annual validity grows from 1533 to 3176 for NO 2 , from 1243 to 1954 for O 3 , from 445 to 2937 for PM 10 , and from 11 to 1433 for PM 2.5 (Table S2). PM monitoring in the earlier years was also limited to specific countries such as Germany, the Netherlands, and UK (Figs. S1 & S2). After around 2010, monitoring sites were spread sufficiently and widely for all pollutants (Figs. S1 & S2). The observed air pollution concentrations have decreased substantially in the past decade (Table S2), except for O 3 . The decreasing trend in annual averages is not linear for PM, because the complex interactions between emission sources and meteorological conditions might affect the trend in annual averages across Europe. Especially for PM 2.5 the trend could also be affected by the different set of monitors included in different years.

Predictor variables
We calculated several road, land-use, satellite retrievals, and chemical transport model estimates to use them as predictors in the LUR models (Table S1). Predictors were similar to the ones used in the ELAPSE study (de Hoogh et al., 2018a), supplemented with extra timevarying spatial predictors such as meteorology chemical transport model estimates, and satellite retrievals. These time-varying spatial predictors could help capturing regional changes in annual average concentrations over time (Gulliver et al., 2013). We did not include industrial emission data as predictors, because emission data was already included in the chemical transport models (Brandt et al., 2012;Marécal et al., 2015) and we used the model estimates as a predictor.
To capture the dispersion of the air pollution, several variables were calculated in several circular buffer sizes (ranging from 50 m to 10,000 m), such as land use data and road data (Table S1). After the buffered values of land cover data and road data were calculated, all predictor maps were regridded to a 25-m resolution using nearest neighbor resampling. Most of these predictors were processed in Google Earth Engine (GEE) (Gorelick et al., 2017) except for road data.
We calculated road predictors from OpenStreetMap data (Open-StreetMap contributors, 2020). OSM is a crowd-sourced project, and the robustness of the OSM data highly depends on the data contributed by the users. We obtained the OSM road data in 2020 because the robustness and validity of OSM show large spatial discrepancies back in time (more than 5-8 years ago) (Girres and Touya, 2010;Neis et al., 2011). Although road network changes over time, we focused on capturing the representative of the road network instead of the temporal changes in the network, as most of the network stays unchanged in most developed countries in Europe. From OSM, we extracted motorways, primary, and local roads. Each road segment in each road type was reprojected to ETRS89-LAEA and then was intersected with 25-m grids. Road lengths per grid cell were calculated and then summed to obtain the buffered road predictors. We finally calculated predictors in buffer sizes ranging from 50 m to 10,000 m (Table S1). Other detailed descriptions of all predictor variables are in Table S1.

Algorithms for training LUR models
We used four algorithms (i.e., SLR, GWR, GTWR, RF) to train our annual LUR models. Details on the technical implementation of GWR and GTWR are in the Supporting Information.

Supervised linear regression (SLR)
SLR is a standardized approach for including the most informative variables stepwise and it has been widely used for LUR modeling (Chen et al., 2021;de Hoogh et al., 2018a;Eeftens et al., 2012a). We followed the ESCAPE protocol (Eeftens et al., 2012a) to train SLR. In short, firstly, a linear regression model was built using the predictor variable that explains the most variation in the concentrations (highest R 2 value) among all variables. In subsequent steps, additional predictor variables were allowed to enter the model if they improved the adjusted R 2 ; if the sign of the coefficient met the predefined direction of effect (stated in Table S1); and if the coefficient value was statistically significant (p < 0.1). Finally, a predictor variable was excluded if its variance inflation factor (VIF) was larger than 3 to avoid multicollinearity and this exclusion step stopped when no variables had VIF larger than 3. We built the single-year SLR model for each year, and therefore the model for each year could have different variables selected. We built one multiyear SLR model using data from 2000 to 2019, and after variables were selected, year was included as a predictor to vary the intercept value of the regression model for each year. Year was only included as a predictor after the variables were selected because year would not necessarily be selected in the step-wise procedure in SLR.

Geographically weighted regression (GWR)
GWR (Brunsdon et al., 1996) includes spatial heterogeneity by using spatially-varying coefficient values. The coefficient values were estimated by using a weight function. The weight function gives higher weights to observation points closer to the estimate points than points further away. The function decays with spatial Euclidean distance according to a predefined kernel function and bandwidth. The kernel function determines the shape of the relation between distance and weight, and the bandwidth controls how fast the kernel function decays in space. For the kernel function, we used an exponential function to give an abrupt decrease in weight with increasing distance. Due to the heterogeneous spatial distribution of the air pollution observations across Europe, we chose an adaptive bandwidth (nngb) to ensure sufficient local information is included. We used 200 km × 200 km grids for the spatially-varying coefficient values of the linear regression. We used functions in the GWmodel package version 2.2-4 (Gollini et al., 2015;Lu et al., 2014) in R (R Core Team, 2020) to train GWR. GWR can tackle the spatial heterogeneity in the relationships between predictors and air pollution concentrations. The detailed technical implementation of GWR is in the Supporting Information.

Geographically and temporally weighted regression (GTWR)
GTWR (Fotheringham et al., 2015;Huang et al., 2010;Wu et al., 2014) is an extension of GWR, allowing both spatial and temporal heterogeneity of the relationships between air pollution and predictors. GTWR uses a weight function to estimate the spatiotemporally-varying coefficient values of the regression model, and the weight function includes both spatial and temporal distances. In short, it assumes that observation points closer in the spatiotemporal distance have a higher impact on local coefficient estimates than observation points further away, and one-meter spatial distance has a different impact compared to one-year temporal distance. We used functions in the GWmodel package version 2.2-4 (Gollini et al., 2015;Lu et al., 2014) in R (R Core Team, 2020) to train GTWR. We first used SLR to select the informative variables and then used GTWR to estimate the potential spatially-and temporally-varying coefficient values. Some of the original settings in GTWR are unsuitable for air pollution modeling (Wu et al., 2014) and therefore we adjusted some parameter settings. First, the original GTWR only uses the observations from prior time steps to estimate the GTWR coefficients for the current time step. But we used observations from both prior time steps and subsequent time steps to train GTWR for the current time step, because we assumed that observations from the past and subsequent years could improve the concentration predictions. Second, we included more flexibility in modeling spatial and temporal distances by adding a conversion factor. We tuned the parameters using 5-fold CV. The detailed technical implementation of GTWR is in the Supporting Information.

Random forest (RF)
RF is an ensemble tree-based machine learning method (Breiman, 2001). Previous studies showed that RF gave similar or superior model performance for air pollution compared to linear regression or spatial interpolation methods (Chen et al., 2020(Chen et al., , 2019Kerckhoffs et al., 2021Kerckhoffs et al., , 2019Li et al., 2011;Lu et al., 2020). RF can deal with highly correlated variables by randomly selecting a subset of variables in each split node of a tree. The regression trees are built using bootstrapping training data (i.e., by sampling training data with replacement), and thus not all training data is used in each regression tree. The unused data is often referred to as out-of-bag (OOB) data. We used the overall OOB root mean squared error (RMSE) to optimize RF hyperparameters.
The optimized hyperparameters that led to the least OOB RMSE were used in our RF. We optimized the hyperparameters using a grid search method. We only optimized the number of trees (ntree) and the number of variables being split at each node (mtry) due to their higher effect on model performance than other hyperparameters (Lu et al., 2020). We optimized the hyperparameters for every year (single-year) or between 2000 and 2019 (multi-year) using the grid searching process with ntree ranging from 200 to 800 with an increment of 200 and mtry ranging from the squared number of the predictors to the total number of the predictors with an increment of 20. We used functions in the ranger package version 0.12.1 (Wright and Ziegler, 2017) in R (R Core Team, 2020) to train RF.
Because RF can tackle highly correlated variables, we used all variables (shown in Table S1) in the RF and evaluated the variable importance. Variable importance was measured by averaging a variable's total decrease in the remaining mean square errors (MSE) left after the variable was used as the node split. The variable importance was calculated using the ranger function (Wright and Ziegler, 2017) by setting the variable importance mode as 'impurity'.

Comparison with previous approach using back-extrapolation
We compared predictions from the newly developed annual LUR models with predictions from back-extrapolation methods used previously (de Hoogh et al., 2018a;Gulliver et al., 2013). We added this comparison to evaluate how large the reduction was in model performance of back-extrapolation compared to year-specific LUR. As previous studies have used the back-extrapolation methods, it is important for air pollution epidemiology to evaluate how different exposure predictions were from the new models and the previously applied backextrapolation models. The back-extrapolation methods extrapolated predictions from the annual LUR model built for the year 2010 (t 1 = 2010) to another time (t 2 ) using the surfaces from the Danish Eulerian Hemispheric Model (DEHM) described in Table S1. The DEHM is a 3D long-range atmospheric chemical transport model (Brandt et al., 2012;Christensen, 1997). Following de Hoogh et al (2018) (de Hoogh et al., 2018a), we used two extrapolation methods: ratio (Eq (1)) and differencing (Eq (2)) (Gulliver et al., 2013): where y extrap is the extrapolated value, y LUR is the estimate from a LUR model, and y DEHM is the DEHM estimate. The DEHM surfaces (50 km × 50 km) were resampled to the LUR surfaces (25 m × 25 m) using nearest neighbor resampling. The reference time was set to be 2010, the midpoint of our study period and the year in which our previous Europewide ELAPSE LUR model was built (de Hoogh et al., 2018a). The comparison in model performance was done by using 5-fold CV and calculating correlations between extrapolated concentrations and the new annual LUR predictions. We calculated the correlations at random points. We generated 77,675 random points in populated areas (Fig. S3). The comparison was done for the full study domain and per NUTS1 area. The number of random points in each NUTS1 region was proportional to the population and each region had at least 300 points generated. We ensured these random points were randomly distributed in the populated areas defined by the impervious density data for the year 2018 and the population data for the year 2011 (see Table S1).

Model performance evaluation
For internal validation, we used 5-fold CV to evaluate the model performance in explaining the variability of observed concentrations for each year. For single-year modeling, the observation points were divided randomly into 5 folds stratified by the climate zone (see Fig. S4) and routine monitoring station type (i.e., background, industrial, traffic). For multi-year modeling, because some stations stopped or started monitoring during the period (2000-2019), we first obtained the number of annual averages available during the period for each station. Then, the observation points were divided randomly into 5 folds stratified by the number of annual averages, climate zone, and station type. For multi-year modeling, we ensured that the observations from the same station could only be in either the training fold or the validation fold across the years.
For external validation, we used ESCAPE measurements Eeftens et al., 2012) to evaluate model performance in 2010 for NO 2 , PM 10 , and PM 2.5 . The ESCAPE measurements consist of 1396 longterm NO 2 measurements and 415 PM 10 and PM 2.5 measurements collected in specific clustered study areas across Europe in 2010 (see Fig. S5).
Two performance metrics were used: the coefficient of determination (R 2 ) and root mean square error (RMSE). We calculated the MSE-based R 2 , affected by both systematic bias and random differences between model predictions and observations. The predictions from all five validation folds were combined to obtain one value of each performance metric described in Eq (3) and Eq (4): where y i is the observation of the annual average concentration at station i, ŷ i is the estimate of the annual average concentration at station i, y is the average observations from all N stations.

Model performance of algorithms with and without spatially-varying coefficients
Sections 3.1.1 to 3.1.3 document the performance of single and multi-year models. Section 3.1.4 illustrates the predictor variables in the different models. Section 3.1.5 compares our findings with previous studies and discusses the importance of including spatially-varying coefficients.

GWR: spatially-varying coefficients in single-year models
GWR modestly improved the explained variance R 2 values of singleyear models compared to SLR and RF for all four pollutants (see columns in 'Difference in CV R 2 between single-year models' in Table 1). Table 1, Fig. 1 (for selected years), and Tables S3-S6 (for all years) document that GWR improved predictions compared to SLR and RF consistently across years. The largest improvement in R 2 was found for O 3 and PM 10 (about 10% on average) and the smallest for PM 2.5 (about 2% on average) compared to SLR (see column 'GWR vs SLR' in Table 1). The difference between the performance for the two particle metrics may be because the ratios between PM 2.5 and PM 10 concentrations are different across Europe . RF improved model performance compared to SLR modestly for NO 2 , O 3 , and PM 10 , but not PM 2.5 for which it performed modestly worse (Table 1). The small improvement of model improvement by RF compared to SLR is consistent with previous work for European models of the year 2010 (Chen et al., 2019) and models based upon mobile monitoring data (Kerckhoffs et al., 2019).
The model performance of the annual GWR LUR models was quite stable over time for NO 2 (  (Table S5) and PM 2.5 (Table S6), model performance improved in recent years compared to earlier years, probably related to the significant increase in the number of sites and possibly harmonization of monitoring methods across Europe. The trends and improvement in R 2 (Fig. S6) and RMSE values (Fig. S7) were similar.
to RF00-19 by 3%, 9%, 8%, 7% on average for NO 2 , O 3 , PM 10 , and PM 2.5 respectively (see columns 'Difference in CV R 2 between multi-year models' in Table 1).  (Tables S3-S6). The scatterplots of GTWR predictions against ground-based monitoring observations are shown in Fig. S8. We evaluated whether using observations from shorter multi-year periods would lead to different model performance and model predictions for each year. We used observations from shorter multi-year periods (ranging from 3 to 8 years) to train GTWR. These GTWR models with shorter multi-year periods gave similar 5-fold CV R 2 values (Table S7) in 2010, 2015, and 2019. The correlations were high between predictions from GTWR00-19 (our default GTWR model using observations from 20 years) and other GTWR models built by observations from multi-year periods shorter than the 20-year period (as shown in row 'cor' in Table S7).

Comparison between single-year and multi-year models
When we compared single-year models and multi-year models, we noted that GTWR00-19 estimated spatial variation in annual average observations as good as the annual GWR models (column 'GWR vs GTWR00-19 ′ in Table 1, Fig. 1 for selected years, and Tables S3-S6 for all years). For PM 2.5 , because the number of observations was too small to train GTWR (between 2000 and2003) for some folds and GWR (between 2000 and 2005) for all folds, the values of 5-fold CV R 2 were unavailable. The performance of multi-year models with the SLR and RF algorithm was slightly less than the performance of the single-year models with these algorithms (column 'SLR vs SLR00-19 ′ and 'RF vs RF00-19 ′ in Table 1).
Using the ESCAPE data as external validation data available for the year 2010 for only NO 2 , PM 2.5 , and PM 10 , we observed that GWR and GTWR00-19 performed similarly compared to other algorithms for NO 2 . For PM 2.5 and PM 10 , GWR and GTWR00-19 performed similarly compared to SLR and SLR00-19 but slightly worse than RF (indicated by the plus icon in Fig. 1). As the ESCAPE monitoring data is much more spatially clustered and covers fewer countries than the routine monitoring data (Fig. S5), we gave more importance to the 5-fold crossvalidation when comparing modeling approaches.
GWR and GTWR both improved the performance of regression-based methods by incorporating spatial-varying coefficients in the LUR models. The GWR models used slightly different predictors with spatially-varying coefficient values each year, whereas the GTWR model had a fixed model structure in terms of predictors and buffer sizes but with spatially-and temporally-varying coefficients. Although the annual GWR models have different predictors across years, these mostly reflect the same determinant (e.g., nearby road traffic represented by different buffer sizes) (Fig. S9). Although GTWR and GWR showed similar 5-fold CV accuracies, GTWR had the extra advantage of estimating PM2.5 concentrations when the number of observations was too small to train an annual LUR model for years before 2006. GTWR includes observations in other years to inform the model for a specific year. Thus, we will use the GTWR models as our default model. Fig. S9 shows the variables selected by and used SLR and used in GWR in each year. The variables selected for NO 2 and PM 10 were quite similar across years with some different variables selected in some years, whereas the variables for O 3 and PM 2.5 were quite different across years. Road variables were selected with slightly different road types and buffer sizes in all years for all pollutants. The chemical transport model estimates were also selected for all pollutants: estimates from the MACC-II ensemble model for NO 2 (no2_10MACC), estimates from the DEHM model for O 3 (O3_dehm), and estimates from the satellite retrievals converted by the GEOS-Chem model for PM 2.5 and PM 10 (gwr_sat). Impervious densities with different buffer sizes were selected in all years for NO 2 and O 3 . Population and altitude were selected almost every year for O 3 . Different meteorological variables were selected almost every year for all pollutants, which are wind speed for NO 2 and temperature for O 3 , PM 10 , and PM 2.5 . Fig. S11 shows the optimized GWR adaptive bandwidth (nngb) for each pollutant and each year. Figs. S14-S17 illustrate spatially-varying coefficients estimated by GWR for all pollutants for the year 2010, indicating noticeable spatial variability in the coefficient values across Europe. We do not have a clear interpretation of the spatial pattern of coefficients related to known sources or dispersion characteristics. Fig. S10 gives the top-12 variables in reducing MSE the most for each RF model. Most variables selected in SLR were also the top-12 variables in RF. The optimized RF hyperparameters were similar in some RF models (Fig. S12, S13). Fig. S9 (in the columns of'00-19 ′ for four pollutants) shows the variables selected by and used in the multi-year SLR model (SLR00-19) and used in the multi-year GTWR model (GTWR00-19). Figs. S18-S25 illustrates the spatially-and temporally-varying coefficient values of the GTWR00-19 multi-year model in the year 2010 and 2015. In Figs. S18-S25, the legend scales remain fixed for the same predictors and pollutants in the two years to observe changes in coefficient values over time. We observed fair consistent spatial pattern in the coefficient values over time. In other words, as observed from the spatially-and temporally-varying coefficients in the GTWR00-19 (Figs. S18-S25), the spatial variability in the coefficient values was larger than the temporal variability. Table S12 shows the optimized GTWR parameters for all pollutants.

Discussion of the importance of allowing spatially-varying coefficients
Overall, the modest improvement of model performance by allowing spatially-varying coefficients over a large spatial area, such as Europe, is consistent with the previous modeling at the North-American and global scale (Hammer et al., 2020;Lu et al., 2020;Van Donkelaar et al., 2015). Incorporating spatial variations in the relationships between predictors and air pollution is important for estimating air pollution over a large spatial extent. The GTWR00-19 models were used to estimate Europewide annual average concentrations for NO 2 , O 3 , PM 10 , and PM 2.5 at a 25 m × 25 m spatial resolution, a considerable improvement compared to our earlier work on a 100 m × 100 m spatial resolution (de Hoogh et al., 2018a). The high spatial resolution (25 m) provided by the road predictors could improve capturing traffic-related emission sources, especially for NO 2 .
Our previous Europe-wide model ( . But these lower R 2 values for O 3 were because we did not include the X and Y coordinates as predictors in our SLR model, whereas the previous SLR included the X and Y coordinates. Our GWR model, however, had higher R 2 values (0.6 in 2000, 0.56 in 2005, 0.6 in 2010) than the previous SLR model. The improvement compared to the previous Europe-wide model could be because in this study we included more variables (e.g., meteorological variables) and variables with improved spatial resolution (e.g., road variables improved from 100-m to 25-m).
Our GTWR model also outperformed another European model created in a global study in 2011 (Larkin et al., 2017) for NO 2 (0.64 vs 0.57). Although in these previous studies the study areas and validation methods were slightly different from ours, our higher R 2 values could be because of two factors. Firstly, we used spatially-varying regression coefficients instead of spatially-fixed linear regression. Secondly, the predictors we used here were more spatially resolved and we also included some additional CTM, SAT, and meteorological data as predictors to capture spatial variations in air pollution (Fig. S9). For PM 2.5 , the second factor was especially important; for O 3 and NO 2 both factors were important. The improvement in model performance by including spatial variations in linear regression was larger than by including spatial variations in non-linear models as achieved by using climate zone in RF to capture the spatial variations in air pollution (shown in columns 'GWR vs SLR' and 'GTWR00-19 vs RF00-19 ′ in Table 1). This may be specific to modeling at the annual time scale, where the relationships between predictors and concentrations do not materially deviate from linear (Chen et al., 2019). Moreover, the similar model performance between GWR and GTWR and the relatively stable temporal variability in GTWR coefficient values indicate that capturing spatial variability in linear regression is more important than capturing temporal variability at an annual scale in Europe.
We note that the improvement of model performance by GTWR and GWR compared to SLR and RF was generally fairly modest, suggesting that models derived with these methods provide reasonable model predictions as well. Fig. 2 shows Europe-wide annual average concentration maps of Fig. 2. Europe-wide annual average ground-level NO 2 , O 3 , PM 10 , and PM 2.5 concentrations (µg/m 3 ) estimated by GTWR00-19 in 2000GTWR00-19 in , 2005GTWR00-19 in , 2010GTWR00-19 in ,2015GTWR00-19 in , and 2019 (Base map source: Google Maps). For maps for all years from 2000 to 2019 for all pollutants, readers are referred to https://youchenshenuu.users.earthengine.app /view/expanse-air-pollution-20-yr-maps. NO 2 , O 3 , PM 10 , and PM 2.5 estimated by GTWR00-19 and Fig. 3 shows the zoom-in maps in Paris. All annual maps from 2000 to 2019 are available on the website (https://youchenshenuu.users.earthengine. app/view/expanse-air-pollution-20-yr-maps).

Modeled spatial patterns over 20 years
For NO 2 , the large-scale spatial patterns were similar over time with a decreasing trend. High concentrations were in cities and in specific areas such as the Italian Po-Valley, the surroundings of the Netherlands, and eastern Europe, related to population density (de Hoogh et al., 2018a). For O 3 , the large-scale spatial patterns were similar over time with a small increasing trend. Overall, the Alps, southern Europe, and the Balkans had higher annual average O 3 concentrations than the rest of the study area.
For PM 2.5 , the spatial patterns were also similar over time with a decreasing trend, but the patterns varied more in time compared to NO 2 and O 3 . Overall, the northern part of Italy, eastern Europe and the Balkans had higher annual average PM 2.5 concentrations than the rest of Europe. For PM 10 , the concentrations were decreasing over time, but the spatial patterns were more dynamic compared to NO 2 and O 3 . In the early 2000 s, the PM 10 concentrations were above 20 µg/m 3 in all regions except in the Alps, Ireland, and northern Europe. In the recent 5 years, only the northern part of Italy, eastern Europe, and the Balkans had higher PM 10 concentrations (>25 µg/m 3 ) than the rest of Europe. For both PM 10 and PM 2.5 , some distinct dividing lines in northern Europe were because areas above these lines had no values in the satellite-derived product (gwr_sat in Table S1) (Hammer et al., 2020;Van Donkelaar et al., 2019), and we replaced these missing values with zero values for this product.
The overall spatial patterns for NO 2 remained quite stable from year to year with a decreasing trend, but for PM 10 , PM 2.5 , and O 3 some regional changes over time were visible. The ambient air pollution concentrations are mostly driven by anthropogenic emission sources and are also influenced by meteorological factors and long-range transport of precursor gases for some pollutants. For NO 2 , the ambient concentrations are mainly driven by the anthropogenic emission sources (i.e., mostly from road transport and energy combustion) reduced by policy regulation and improved efficiencies in energy combustion (Colette and Rouïl, 2020;EEA, 2021aEEA, , 2021bEEA, , 2021c. Thus, the declining estimated NO 2 concentrations were spatially stable. For O 3 , PM 10 , and PM 2.5 , the ambient concentrations are influenced by not only emission sources but also meteorological factors and long-range transport of precursor gases (such as nitrogen oxides) (Colette and Rouïl, 2020;EEA, 2021b). The annual average of the daily maximum 8-hour mean for O 3 was the only pollutant with an increasing trend over time, as found in both our study and ground-based observations (Colette and Rouïl, 2020).

Back-extrapolation
In section 3.3.1 we first discuss the performance of back-

Performance of back-extrapolated models
The 5-fold CV performance of the back-extrapolation methods generally decreased when the extrapolation year (t 2 in Eq (1) and (2)) became more distant from the reference year (t 1 = 2010 in Eq (1) and (2)), as shown in Fig. S6 and Tables S8-S11 (in the 2nd-4th columns). For NO 2 (Table S8), the differencing method had higher CV R 2 values than the ratio method, with only a small decreasing trend for backward extrapolation (before 2010) and with a large decreasing trend for forward extrapolation (after 2010). For O 3 (Table S9), both backextrapolation methods had similar performance that decreased with time especially for backward extrapolation and that decreased slightly for forward extrapolation (except 2018 & 2019). For PM 10 (Table S10) and PM 2.5 (Table S11), both back-extrapolation methods had similar performance that decreased with time especially for backward extrapolation and that decreased slightly for forward extrapolation (except 2019). The large decreasing in CV R 2 values for backward extrapolation for PM 10 and PM 2.5 could be because of the significant decrease in the available observations and the quality of the observations in the early 2000 s. The number of observations for the earliest was small, especially for PM 2.5 (with less than 200 observations available before 2006).
Overall, the average annual predictions from the back-extrapolation were less similar to the annual average observations in years more distant to the year 2010 (for years both before and after 2010) for all pollutants.

Comparison between GTWR LUR model and back-extrapolation
The 5-fold CV performance of the back-extrapolation methods was lower compared to the GTWR00-19 models for all pollutants when going further away in time from 2010. For NO 2 (Table S8), the CV R 2 was 8% lower for back-extrapolated values in 2000 using the differencing method compared to GTWR00-19. For O 3 and especially PM 10 and PM 2.5 , the difference in CV R 2 values of GTWR00-19 and backextrapolation increased away from the year 2010 for both differencing and ratio methods in this study.
The overall correlations were above 0.86 between the GTWR00-19 annual average predictions and the differencing back-extrapolated Fig. 4. Correlation between predictions from GTWR00-19 (GTWR built for period from 2000 to 2019) and predictions extrapolated from the 2010 GTWR00-19 predictions to other years by the differencing method using DEHM data described in Table S1 (t 1 = 2010 in Eq (2)) for NO 2 , O 3 , PM 10 , and PM 2.5 . Correlation values shown in the upper corner of each graph are overall correlations. predictions at random points for all years (Fig. 4). For NO 2 , the correlations were high also within all NUTS1-regions (>0.8). For O 3 , PM 10 , and PM 2.5 , the correlations were above 0.8 in most regions but were below 0.6 in some regions and years.
Between the GTWR00-19 LUR models and back-extrapolated models for all years, the decreasing difference in CV R 2 (calculated as MSEbased R 2 ) and the high overall correlation could be explained by the slightly increasing differences between absolute average levels estimated by the large-scale DEHM model and the average levels estimated by the GTWR00-19 models (as shown in the 5th-8th columns of Tables S8-S11). Boxplots of the predicted values showed that the outliers and the average of the back-extrapolated values were slightly higher than the outliers and the average of the GTWR00-19 predictions (Fig. 5).
Overall, our result is similar to previous studies for NO 2 . In Great Britain (Gulliver et al., 2016), a decrease of 8% in R 2 was found between a 1991 LUR model and a back-extrapolated LUR (extrapolated from 2009 LUR). In Vancouver in Canada, a decrease of 3% in the R 2 was found between a 2003 LUR model and a 2010 LUR model recalibrated to 2003 (Wang et al., 2013). The moderate to high correlations indicate that the GTWR predictions and the back-extrapolated predictions may give similar relative exposure ranking/classification across the population in Europe-wide cohort studies, but the difference in absolute values, as indicated by the low MSE-based R 2 values, could be high. Thus, GTWR would be preferred especially to study the shape of the exposure-response relationships between health effects and air pollution (i. e., at what level of air pollution health effects occur) because accurate absolute values are required. The back-extrapolated values with systematic bias would be a reasonable approximation in health studies in which the occurrence of specific health outcomes is compared in linear models or compared between low and high exposed subjects.

Limitations and strengths
An important limitation of this study is the lack of ground-level observations for PM 2.5 before 2006. This limitation was mitigated by the multi-year modeling method, but we were unable to perform 5-fold CV for GTWR00-19 before 2004 because of the limited observations, with the number of observations in each of those years less than 100. This limited number of ground-level observations however could make our cross-validation less reliable in the early 2000s than in the later years for PM 2.5 . With GTWR00-19 we were able to estimate PM 2.5 concentrations with limited observations and to increase 5-fold CV R 2 values by 7% in NO 2 , 13% in O 3 , 12% in PM 10 , and 3% in PM 2.5 compared to SLR.
We did not have traffic intensity data available across Europe and instead used buffered road length of different road types from Open-StreetMap (OSM). We used the road type data from the year 2020 only because we judged that using historical data was associated with too large methodological issues (improved quality of OSM over time).
Despite these limitations, our GTWR00-19 and GWR have captured the overall spatial variations in air pollution every year with satisfactory 5-fold CV R 2 values against the ground-level observations. Our output gives harmonized exposure estimates in European cohort studies at a high spatial resolution (25 m × 25 m).
Our annual average air pollution exposure maps will be applied in health effect studies of long-term exposure to air pollution. For studies of short-term exposure, a more refined temporal resolution is needed, following the methodology of some recent studies that have developed daily pollution maps for multiple years at a resolution of 1 km or coarser with good performance (de Hoogh et al., , 2018bDi et al., 2019;Liu et al., 2020;Requia et al., 2020;Shtein et al., 2018). For our current objective, the spatial resolution of 1 km × 1 km is not sufficient.
All code to build LUR models in R programming language is available on Github for single-year (Shen et al., 2021a) and multi-year modeling (Shen et al., 2021b).

Conclusions
We showed the importance of including spatially-varying relationships in LUR models to improve long-term air pollution estimates in Europe. The spatially-varying linear regression models (GWR, GTWR) explained a modestly larger amount of spatial variation in air pollution concentrations across Europe than the spatially-fixed linear regression (SLR) and the machine learning method (RF). Our harmonized annual Fig. 5. Boxplots of predictions at random points from GTWR00-19, differencing back-extrapolation (diff) and ratio back-extrapolation (ratio) using GTWR00-19 for year 2010 extrapolated to other years using DEHM data described in Table S1 (t 1 = 2010 in Eq (1) and (2)). estimates available for NO 2 , O 3 , PM 10 , and PM 2.5 from 2000 to 2019 will allow time-varying exposure-health risk models for Europe-wide health analysis studies.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability
Data will be made available on request.

Acknowledgement
This work was supported by EXPANSE and EXPOSOME-NL projects. The EXPANSE project is funded by the European Union's Horizon 2020 research and innovation programme under grant agreement No 874627. The content of this article is not officially endorsed by the European Union. The EXPOSOME-NL project is funded through the Gravitation programme of the Dutch Ministry of Education, Culture, and Science and the Netherlands Organization for Scientific Research (NWO grant number 024.004.017). The authors declare no competing financial interest.