Effect of monitoring network design on land use regression models for estimating residential NO₂ concentration

• Land-use regression (LUR) models for NO₂ were evaluated using a dispersion model.
• Increasing the number of monitoring sites improved LUR model performance, but not beyond ~30 sites.
• Networks including sites in populated areas better estimated NO₂ across residential locations.
• Roadside sites were needed to better characterise the high end of residential NO₂.
• No specific monitoring site design estimated both overall and high NO₂ levels well.
Land-use regression (LUR) models are increasingly used to estimate exposure to air pollution in urban areas. An appropriate monitoring network is an important component in the development of a robust LUR model. In this study concentrations of NO₂ were simulated by a dispersion model at 'virtual' monitoring sites in 54 network designs of varying numbers and types of site, using a 25 km² area in Edinburgh, UK, as an example location. Separate LUR models were developed for each network. The LUR models were then used to estimate NO₂ concentration at all residential addresses, and these estimates were evaluated against the dispersion-modelled concentration at the same addresses. The improvement in predictive capability of the LUR models was insignificant above ~30 monitoring sites, although more sites tended to yield more precise LUR models. Monitoring networks containing sites located within highly populated areas better estimated NO₂ concentrations across all residential locations. LUR models constructed from networks containing more roadside sites better characterised the high end of residential NO₂ concentrations but had increased errors when considering the whole range of concentrations. No particular composition of monitoring network resulted in good estimation simultaneously of NO₂ concentrations across all residential locations and of the highest NO₂ levels.
This evaluation with dispersion modelling has shown that previous LUR model validation methods may have been optimistic in their assessment of the model's predictive performance at residential locations. © 2016 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).


Introduction
The assessment of long-term exposure to air pollution for epidemiological and health burden studies has been a challenge because of the high spatial variation of pollutant concentration in the urban environment, particularly for nitrogen dioxide (NO₂) (Briggs, 2005; Jerrett et al., 2005). Over the years, land use regression (LUR) modelling has demonstrated better or equivalent performance to other geostatistical methods, and has therefore become popular in health studies to estimate long-term exposure to ambient NO₂ (Beelen et al., 2014; Jerrett et al., 2009). LUR modelling is a stepwise multiple regression method that regresses the pollutant concentration at the measurement sites against the land-use variables within buffer areas around the measurement sites (Jerrett et al., 2007). The derived empirical relationship between pollutant concentration and surrounding land use is then applied to un-sampled locations to provide a spatially-resolved seasonal or annual average pollution field.
The selection of monitoring sites to build the LUR model has been identified as one of the factors affecting the quality of the LUR model, but a rigorous method to determine the number and distribution of monitoring sites is lacking. One study aimed to develop a formal method to locate air quality monitors for LUR model development. However, the method has rarely been applied due to its complexity and the extensive prior knowledge required on the population and pollutant distributions. A few studies (Basagaña et al., 2012; Johnson et al., 2010; Wang et al., 2012) evaluated the effect of the number of monitoring sites on LUR model performance, but the effect of the distribution of monitoring sites remains to be investigated.
Evaluation of an LUR model has always been limited to the measurements available in a monitoring campaign. The ultimate goal of exposure assessment is to accurately predict the exposure of hundreds or thousands of study subjects, but evaluation of an LUR model at this level through measurements is practically impossible. However, with the use of a dispersion model it is possible to simulate a pseudo-measured concentration at every residential address, which can then be compared with an LUR-model estimated concentration to assess the validity of the latter.
The aim of this study was to evaluate a large suite of LUR models built from different monitoring network designs by using dispersion modelled concentration at each home address to assess the predictive power of the LUR models. This modelling study used as its basis the city of Edinburgh (population ~460,000) in the east of Scotland, UK (55.94° N, 3.18° W). The intended outcome of the evaluation is to recommend sampling strategies and to highlight how particular monitoring network designs may lead to potential exposure misclassification.

Method
The evaluation of the performance of a monitoring network for constructing an LUR model for estimating concentration at home addresses was carried out in four stages. An overview of the methodology is presented first, with details described in subsequent sub-sections. A schematic of the overall workflow is shown in Supporting Information (SI) Fig. S1.
The ADMS-Urban model v3.4 (CERC, 2015) was used to simulate NO₂ concentrations for each of the population home addresses (centre points of residential buildings) in a 5 km × 5 km study area in Edinburgh (Fig. 1). This area covers the commercial (city centre) and residential areas of the city and encompasses 7445 residential buildings housing a total population of 144,715. The dispersion modelled NO₂ concentration is considered to be the reference, on which the subsequent LUR model development and evaluation are based. Next, three different types of monitoring networks were designed (comprising different numbers of monitoring sites) based on household density and proximity to road. The NO₂ concentration at each monitoring site was modelled with ADMS-Urban using the same setup as the modelling of residential NO₂ concentration. The third stage was to develop a separate LUR model for each monitoring network, which was then applied to the residential addresses to provide an LUR-model estimate. Finally, the LUR-model-estimated residential concentration was compared with the dispersion modelled residential concentration. The extent of agreement between the two indicates the performance of the LUR model and, in turn, the performance of the monitoring network from which the LUR model was constructed.
Stage 1 – dispersion modelling of residential NO₂ concentration
Data preparation
All the Geographic Information System (GIS) data, including buildings and road networks for the City of Edinburgh, were obtained from EDINA Digimap Ordnance Survey Service (Ordnance Survey, 2015) as ESRI Shapefiles. Annual average daily traffic (AADT) data for the major roads in 2013 were downloaded from the Department for Transport (DfT, 2015). The population of each postcode area for the 2011 census was distributed to the buildings within the polygon area based on the volume of the building (building polygon area × building height). The centre of the building polygon with assigned population was used as the home address.
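As an illustration only, the volume-weighted allocation of postcode population to buildings described above could be sketched in Python as follows. The function name, the input layout and the example numbers are hypothetical and are not taken from the study, in which all GIS calculations were carried out in FME.

```python
def allocate_population(postcode_population, buildings):
    """Split one postcode's census population across its buildings by volume.

    buildings: list of (building_id, footprint_area_m2, height_m) tuples
    (a hypothetical layout). Returns {building_id: allocated population};
    fractional people are kept rather than rounded.
    """
    # Building volume = polygon area x building height, as in the text.
    volumes = {bid: area * height for bid, area, height in buildings}
    total_volume = sum(volumes.values())
    if total_volume <= 0:
        raise ValueError("postcode has no building volume to allocate to")
    return {bid: postcode_population * vol / total_volume
            for bid, vol in volumes.items()}


# Hypothetical postcode with 120 residents and three buildings.
example = allocate_population(
    120,
    [("b1", 100.0, 10.0), ("b2", 200.0, 10.0), ("b3", 50.0, 20.0)],
)
```

In this made-up example the building with twice the volume of the others receives half of the postcode population.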

ADMS-Urban setup
The model domain (12 km × 12 km) covered most of the City of Edinburgh where all the emissions of NOₓ and NMVOC (non-methane volatile organic compounds) were modelled (Fig. 1). Within this larger domain, a 5 km × 5 km subset was chosen to output the concentration at each home address. To allow the receptors on the edges of the inner domain to be modelled smoothly, a 1 km buffer zone was added to the 5 km × 5 km output area, within which all the major and minor roads were explicitly modelled as road sources, whereas emissions outside the buffer zone were modelled as 1 km × 1 km gridded area sources. NOₓ and NMVOC emissions were downloaded from the UK National Atmospheric Emissions Inventory (NAEI) for 2012 (NAEI, 2015) with a resolution of 1 km². Road emissions were calculated by dividing the total emissions for the major or minor road subsector by the total length of the corresponding roads within each 1 km² grid. For grids in which road emissions were explicitly modelled, the road emissions were subtracted from the grid total emission. Measured meteorological data for the model, including wind speed/direction, cloud cover and temperature, were obtained from a WMO station to the west of the model domain (Gogarbank: 55.93° N, 3.35° W) (Met Office, 2012). An urban canopy file was prepared to account for the variation in the vertical profiles of wind speed and turbulence caused by the presence of buildings. Background concentrations of hourly-average NO₂, NOₓ and O₃ were obtained from a rural national-network monitoring station to the south of the model domain (Bush Estate: 55.86° N, 3.21° W). For 2012, 0.8% and 23% of O₃ and NOₓ measurements, respectively, were missing. These were replaced by the average concentration for that particular hour over the whole year. Monthly average concentration was calculated at each receptor in ADMS-Urban, from which the annual-average concentration was calculated and used as the metric in the subsequent analysis.
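The gap-filling of the background series can be illustrated with a short sketch. It assumes that "that particular hour" means the same hour of the day, and that the series starts at midnight; both are interpretations rather than details stated in the text, and the function name is hypothetical.

```python
def fill_missing_hourly(values):
    """Replace None entries in an hourly series by the mean for that hour of day.

    values: hourly concentrations with index 0 = 00:00 on the first day and
    None marking a missing measurement (assumed layout).
    """
    sums = [0.0] * 24
    counts = [0] * 24
    # First pass: accumulate the valid values for each hour of the day.
    for i, v in enumerate(values):
        if v is not None:
            sums[i % 24] += v
            counts[i % 24] += 1
    # Second pass: fill gaps with the corresponding hour-of-day mean.
    filled = []
    for i, v in enumerate(values):
        if v is not None:
            filled.append(v)
        elif counts[i % 24] > 0:
            filled.append(sums[i % 24] / counts[i % 24])
        else:
            filled.append(None)  # no valid data exist for this hour of day
    return filled


# Three made-up days; the 00:00 value on day 2 is missing.
series = [float(h) for h in range(24)] * 3
series[0], series[24], series[48] = 10.0, None, 20.0
filled_series = fill_missing_hourly(series)
```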

Stage 2 – sampling network design
Three different types of sampling networks were investigated. The aim in the selection of monitoring sites was to investigate how network selection criteria and number of sites influence the representation of the spatial variation of NO₂ at the residential home addresses. Specifically, the exposure study area was first disaggregated into 25 m × 25 m grid cells. The following GIS variables were then calculated for the centroid of each grid cell: total home addresses within a 100 m buffer (HH100) and distance to major/minor road edge (MJRDDIST/MNRDDIST). Three types of monitoring sites were then defined:
• High household density sites (HH sites): centroids of the cells with HH100 falling in the top 10% of all the 25 m grid cells;
• Roadside sites: centroids of the cells with MJRDDIST between 0 and 5 m;
• Background sites: centroids of the cells with both MJRDDIST and MNRDDIST > 200 m.
A subset of each of the total set of HH and roadside sites was then randomly selected, subject to a minimum distance of 300 m between any pair of sites, to form two pools of potential HH and roadside monitoring sites to be used in the sampling networks. The locations of these sets of monitoring sites are shown in Fig. 1. The purpose of adding a minimum distance constraint to the random selection was to ensure that potential network sites were distributed across the range of localities in the study area. A third subset of sites was randomly selected from the total set of background sites, but with a minimum distance constraint of 500 m, since the background concentrations in this modelling study are mainly determined by the gridded emissions, which have a resolution of 1 km², rather than by the road network, which has a finer resolution. Due to the minimum distance constraint added to the random selection, the number of monitoring sites of each type from which a monitoring network could subsequently be selected comprised 54, 70 and 50 for roadside, HH and background sites, respectively. From these potential monitoring sites, the following three types of sampling networks were designed by randomly selecting different numbers of sites from each type of monitoring site.
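The three site definitions can be expressed as simple rules on the grid-cell variables. The sketch below is illustrative only: the dictionary layout, the function name and the handling of ties at the top-10% threshold are assumptions, and the example values are invented.

```python
def classify_cells(cells, top_fraction=0.10):
    """Return candidate site indices by type from per-cell GIS variables.

    cells: list of dicts with keys "HH100", "MJRDDIST" and "MNRDDIST"
    (distances in metres). A cell may qualify for more than one type.
    """
    # HH100 value of the cell ranked at the top_fraction boundary;
    # cells tied with that value are all kept (an assumption).
    n_top = max(1, int(len(cells) * top_fraction))
    threshold = sorted((c["HH100"] for c in cells), reverse=True)[n_top - 1]

    sites = {"HH": [], "roadside": [], "background": []}
    for i, c in enumerate(cells):
        if c["HH100"] >= threshold:
            sites["HH"].append(i)
        if 0 <= c["MJRDDIST"] <= 5:
            sites["roadside"].append(i)
        if c["MJRDDIST"] > 200 and c["MNRDDIST"] > 200:
            sites["background"].append(i)
    return sites


# Ten made-up grid cells.
mjr = [3, 50, 250, 300, 10, 2, 400, 120, 80, 60]
mnr = [1, 20, 260, 150, 5, 30, 500, 90, 40, 10]
cells = [{"HH100": 5 * i, "MJRDDIST": mjr[i], "MNRDDIST": mnr[i]}
         for i in range(10)]
candidate_sites = classify_cells(cells)
```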
• Household density based network (HH network): randomly selecting from the HH sites only;
• Proximity to road based network (Road network): randomly selecting equal numbers of roadside sites and background sites;
• Mixed network: randomly selecting equal numbers of roadside sites and HH sites.
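A minimal sketch of the two random-selection steps is given below: thinning candidate sites subject to a minimum separation, and then drawing a network of a given type from the resulting pools. The greedy thinning algorithm, the function names, the synthetic candidate grid and the treatment of odd network sizes (the extra site goes to the second pool) are all assumptions made for illustration; the text does not specify how these steps were implemented.

```python
import math
import random


def select_with_min_distance(candidates, min_dist, rng):
    """Greedy random thinning: visit candidates in random order and keep a
    site only if it is at least min_dist (m) from every site already kept."""
    order = list(candidates)
    rng.shuffle(order)
    kept = []
    for x, y in order:
        if all(math.hypot(x - kx, y - ky) >= min_dist for kx, ky in kept):
            kept.append((x, y))
    return kept


def draw_network(kind, n_sites, pools, rng):
    """Draw one network of type "HH", "Road" or "Mixed" from the site pools."""
    if kind == "HH":
        return rng.sample(pools["HH"], n_sites)
    if kind not in ("Road", "Mixed"):
        raise ValueError("unknown network type: " + kind)
    partner = "background" if kind == "Road" else "HH"
    half = n_sites // 2  # assumption: any odd site goes to the partner pool
    return (rng.sample(pools["roadside"], half)
            + rng.sample(pools[partner], n_sites - half))


rng = random.Random(0)
# Synthetic candidates on a 50 m lattice over a 3 km x 3 km square.
grid = [(x, y) for x in range(0, 3001, 50) for y in range(0, 3001, 50)]
pools = {
    "roadside": select_with_min_distance(grid, 300, rng),
    "HH": select_with_min_distance(grid, 300, rng),
    "background": select_with_min_distance(grid, 500, rng),
}
road_network = draw_network("Road", 10, pools, rng)
mixed_network = draw_network("Mixed", 10, pools, rng)
```

In practice each pool would be thinned from its own set of candidate cells rather than from a common lattice as in this toy example.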
Eleven different numbers of monitoring sites were tested for each type of network design, ranging from 10 to 60 (in steps of N = 5). Random sampling of each number of monitoring sites was repeated 30 times to obtain a statistical distribution of a particular network configuration, resulting in 990 unique networks (3 network designs × 11 network sizes × 30 random replications). Table S1 summarises the configurations of all the networks examined. As a further network sensitivity test, different proportions of roadside and HH sites within the Mixed network were investigated to evaluate the effect of network composition in estimating residential NO₂ concentration. Table S2 summarises the different network compositions investigated.
The HH networks were designed in the anticipation that such networks would more accurately estimate concentrations at most residential addresses. However, this sampling design might underpredict concentrations for the small fraction of the population who live close to roads. The Road networks, being a mixture of roadside and background sites, should capture the greatest NO₂ variation in the study area; this is the network site selection design used in many monitoring campaigns (Beelen et al., 2013). The Mixed networks of roadside and HH sites aimed to capture similar spatial variation of NO₂ to the Road network, but also to represent where most of the population live. This sampling design resembles the concept of a formal methodology for locating monitoring sites, namely locating monitors where the expected pollution spatial variability and density of the study subjects are high. Unlike the formal methodology, however, the sampling design here does not require prior knowledge of the pollutant concentration surface, and therefore the application of this sampling design is less restricted.
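The count of network configurations follows directly from the design, as the short enumeration below shows. The variable names are illustrative only.

```python
from itertools import product

designs = ["HH", "Road", "Mixed"]   # three network designs
sizes = list(range(10, 61, 5))      # 10, 15, ..., 60 monitoring sites
replicates = range(30)              # 30 random repetitions per configuration

# One entry per (design, size, replicate) combination.
configurations = list(product(designs, sizes, replicates))
n_networks = len(configurations)    # 3 x 11 x 30
```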

Predictor variables
A total of 15 predictor variables were selected for model development (Table 1). These variables were chosen based on prior knowledge that they may correlate with the input emissions in ADMS-Urban and on their inclusion in previous LUR models for NO₂ (Beelen et al., 2013). As shown in Fig. S2, NOₓ emissions for each of the 1 km² grids in the study area are mostly dominated by road transport and combustion in commercial/residential sectors. The total road length, population counts and building plan area within a buffer radius are considered to reflect these emissions. In addition, in some areas, NOₓ emissions from 'other' transport (most likely resulting from railways) are also significant. Therefore total railway length within a buffer was also included as a predictor variable. Since the emissions apart from major roads and some minor roads were modelled as 1 km² grid sources in ADMS-Urban, the buffer radii for the relevant predictor variables were chosen to be comparable with the resolution of input emissions, namely 0.5 and 1 km (Table 1). The rest of the predictor variables attempt to account for the increase in NO₂ concentration close to road sources (Table 1).

LUR model development and diagnostics
The development of the LUR models followed the method used in the ESCAPE project (Beelen et al., 2013). The method is a supervised forward stepwise procedure which aims to maximise the adjusted R² of the model while also ensuring that the included variables are associated with coefficients with pre-defined directions (Table 1).
First, all variables were individually regressed against the NO₂ concentrations in that monitoring network. The variable with the highest adjusted R² and a coefficient with the pre-defined direction formed the initial model. Second, the remaining variables were successively added to the initial model and the change in adjusted R² recorded. The variable resulting in the highest increase in adjusted R² was added to the model if: (i) the increase in adjusted R² was greater than 1%; and (ii) the coefficients of this variable and the variables already in the model conformed to the pre-defined direction. The selection process was continued until no variable fulfilled the above criteria. At the final step, variables with a p-value greater than 0.1 were sequentially removed from the model, starting from the variable with the highest p-value.
Diagnostic tests were performed on the final model. Multicollinearity in the variables was checked using the Variance Inflation Factor (VIF). Predictors with a high VIF value (>3) were excluded from the model one at a time, starting with the variable with the highest VIF. Potential influential observations were investigated using Cook's D value. An influential observation (indicated by a Cook's D > 1) was generally caused by including a variable with extreme values or many zero values. A sensitivity test was therefore conducted on a model with an influential observation problem by fitting a new model without using the observation with Cook's D > 1. If the change in the coefficient for that variable was large (over 100% of the coefficient derived from using all the observations), a new LUR model was developed following the above procedure but excluding that specific variable from the outset. For the LUR model validation, leave-one-out cross-validation (LOOCV) was used to assess the generalisability of the LUR model. LOOCV uses the variables in the final model to develop a regression model using N − 1 observations (N = total number of observations in a monitoring network), which is then applied to the left-out site. The procedure is repeated N times, at which point all the predicted concentrations are compared with the observations to test the validity of the model within the dataset. Values of R² and Root Mean Squared Error (RMSE) calculated from LOOCV were used to assess the LUR model's capability to predict the concentrations within a monitoring network.
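The forward selection steps above can be sketched with NumPy as follows. This is one reading of the procedure rather than the code used in the study (the statistical analyses were conducted in R): the 1% criterion is taken as an absolute gain of 0.01 in adjusted R², only candidates satisfying the sign constraints are compared at each step, and the final p-value based removal is omitted. The function names, predictor names and synthetic data are hypothetical.

```python
import numpy as np


def fit_ols(X, y):
    """Ordinary least squares with intercept; returns (beta, adjusted R2)."""
    n, p = X.shape
    if n - p - 1 <= 0:
        raise ValueError("too few observations for this many predictors")
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1.0 - float(resid @ resid) / float(((y - y.mean()) ** 2).sum())
    return beta, 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


def forward_select(X, y, names, directions, min_gain=0.01):
    """Supervised forward selection; directions[name] is +1 or -1."""
    selected, current_adj = [], None
    while True:
        best = None  # (column index, adjusted R2) of the best valid candidate
        for j in range(X.shape[1]):
            if j in selected:
                continue
            cols = selected + [j]
            beta, adj = fit_ols(X[:, cols], y)
            # Every slope (beta[0] is the intercept) must have its
            # pre-defined sign, including variables already in the model.
            signs_ok = all(beta[k + 1] * directions[names[c]] > 0
                           for k, c in enumerate(cols))
            if signs_ok and (best is None or adj > best[1]):
                best = (j, adj)
        if best is None:
            break
        # The gain threshold applies from the second variable onwards.
        if current_adj is not None and best[1] - current_adj <= min_gain:
            break
        selected.append(best[0])
        current_adj = best[1]
    return [names[j] for j in selected], current_adj


# Synthetic example: two informative predictors and one irrelevant one.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = 20 + 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.5, size=40)
names = ["ROADLEN", "POP", "DIST"]
directions = {"ROADLEN": 1, "POP": 1, "DIST": -1}
chosen, adj_r2 = forward_select(X, y, names, directions)
```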
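The LOOCV procedure can be illustrated with a brief NumPy sketch. R² is computed here as the squared Pearson correlation between left-out predictions and observations, which is an assumption because the text does not define it; the function name and the synthetic data are hypothetical.

```python
import numpy as np


def loocv(X, y):
    """Leave-one-out cross-validation for a fixed set of predictors.

    Refits the regression N times on N - 1 sites and predicts the left-out
    site each time. Returns (predictions, R2, RMSE).
    """
    n = len(y)
    A = np.column_stack([np.ones(n), X])  # intercept plus predictors
    preds = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        beta, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
        preds[i] = A[i] @ beta
    rmse = float(np.sqrt(np.mean((preds - y) ** 2)))
    r2 = float(np.corrcoef(preds, y)[0, 1] ** 2)  # assumed R2 definition
    return preds, r2, rmse


# Synthetic network of 30 sites with two predictors.
rng = np.random.default_rng(1)
X_sites = rng.uniform(0, 10, size=(30, 2))
y_sites = 5 + 2 * X_sites[:, 0] + 0.5 * X_sites[:, 1] + rng.normal(size=30)
loo_pred, loo_r2, loo_rmse = loocv(X_sites, y_sites)
```

Note that only the coefficients are refitted in each fold; the variable selection itself is not repeated, which matches the description above.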

Stage 4 – evaluation of LUR model's capability at estimating simulated NO₂ concentrations at residential addresses
This aspect of LUR model evaluation compares the LUR modelled concentration at residential addresses with that modelled by ADMS-Urban. In essence this is similar to the concept of holdout validation (HV) in regression model validation, where the training data and testing data are completely independent. However, the validation dataset is based on ADMS-Urban output and is of constant size and much larger (7445 residential addresses) than the traditional HV validations based on measurement data. In this context, the evaluation results not only reflect the performance of the LUR model but also indicate the relative effectiveness of the underlying monitoring sites used to build the LUR model. R², RMSE and Mean Bias (MB) were used here to evaluate the LUR modelled concentration for all population addresses and for different concentration ranges.
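As a sketch of how the three evaluation statistics could be computed overall and by concentration range, the following pure-Python example takes MB as the mean of (LUR estimate minus dispersion-modelled value), R² as the squared Pearson correlation, and groups addresses by their dispersion-modelled concentration using the 20 and 30 µg m⁻³ thresholds reported in the results. These definitions, the function names and the example values are assumptions for illustration.

```python
import math


def evaluate(pred, ref):
    """R2 (squared Pearson correlation), RMSE and mean bias (pred - ref)."""
    n = len(ref)
    errors = [p - r for p, r in zip(pred, ref)]
    mb = sum(errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mean_p, mean_r = sum(pred) / n, sum(ref) / n
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(pred, ref))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_r = sum((r - mean_r) ** 2 for r in ref)
    r2 = cov * cov / (var_p * var_r) if var_p > 0 and var_r > 0 else float("nan")
    return {"R2": r2, "RMSE": rmse, "MB": mb}


def evaluate_by_range(pred, ref, low=20.0, high=30.0):
    """Statistics for all addresses and for three reference-concentration bands."""
    groups = {"all": [], "low": [], "mid": [], "high": []}
    for p, r in zip(pred, ref):
        groups["all"].append((p, r))
        band = "low" if r < low else ("high" if r > high else "mid")
        groups[band].append((p, r))
    return {name: evaluate([p for p, _ in pairs], [r for _, r in pairs])
            for name, pairs in groups.items() if pairs}


# Made-up concentrations (µg m-3) at six addresses.
reference = [10.0, 15.0, 25.0, 28.0, 35.0, 40.0]   # dispersion model
estimated = [12.0, 16.0, 24.0, 27.0, 30.0, 33.0]   # LUR model
stats = evaluate_by_range(estimated, reference)
```

With this sign convention a negative MB indicates underestimation by the LUR model.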
All GIS calculations were conducted in the Feature Manipulation Engine (FME) (Safe Software Inc., 2015). Statistical analyses were conducted in R software (R Core Team, 2015).

ADMS-Urban model validation
ADMS-Urban was evaluated against measurements taken by both reference chemiluminescence analysers and passive diffusion tubes (PDTs). Comparison between the modelled annual average concentration for 2012 and the measurements by reference analyser at three monitoring stations in the study area showed that the bias was small at the urban background (ED3) and minor roadside (ED7) stations (Table 2). The relatively large underestimation at the major roadside station (ED5) could be associated with the known issue of under-reporting of NOₓ from diesel vehicles (Carslaw and Rhys-Tyler, 2013).
A network of 30 PDT sampling sites within this study area was deployed weekly for 6 weeks during summer and winter periods of 2013/2014. Details of the site locations and characteristics can be found in Table S3 and in Lin et al. (2016). Seasonal average concentration (i.e. mean of 6 weekly average NO₂ concentrations) was compared with ADMS-Urban output. Sites with more than one missing weekly NO₂ value were excluded from the model evaluation. Some PDTs were located next to bus stops or at traffic junctions, where additional emissions from buses and traffic queueing are considered to be substantial but are not modelled in the current model setup. These sites therefore do not reflect the general predictive ability of the ADMS-Urban model and were also excluded from the evaluation. Fig. 2 shows the relationship between modelled and PDT-measured NO₂ concentrations during different seasons. Overall the model underestimated NO₂ concentrations compared to the PDT measurements. However, the spatial variation in the measured NO₂ was explained very well by the model (R² = 73% and 77% for summer and winter, respectively) and was comparable to a previous ADMS model evaluation study (Dėdelė and Miškinytė, 2014). This indicates that although there is bias between modelled and PDT-measured NO₂ concentration, the spatial pattern predicted by the model is consistent with the measurements. The bias could result from both the errors in the model and the errors in the PDT measurements. A large discrepancy (55% for summer and 82% for winter) between PDT measurement and reference analyser was observed during the deployment period at one co-location site (Table S4). This partly explains the general underestimation in the modelled NO₂ compared to the PDT measurements.
Given the good agreement between the model and real-time analyser measurements at the urban background and minor roadside monitor locations, and the very good capture of spatial pattern indicated by the dense PDT network, it can be deduced that the dispersion model here fulfils the purpose of this study; that is, to simulate a realistic pollution surface of NO₂ for the evaluation of the LUR model validity and of the monitoring sites used to build the LUR model.

Evaluation of the LUR models constructed from different monitoring networks
The distributions of NO₂ concentrations at the locations of each type of monitoring site, and at all the population addresses, are summarised in Fig. S3. Consistent with the expectations underpinning the network design principles, Fig. S3 shows that a Road network (roadside sites + background sites) is likely to cover the whole range of concentration across the modelled domain, whereas an HH network (only HH sites) matches most closely the interquartile range of residential NO₂ concentration. Fig. 3 summarises the following statistics evaluating LUR model performance as a function of network design and size: (i) the percentage of variance explained within the data used to build the LUR model (LUR R²); (ii) the ability of the LUR model to predict the observed concentrations at the virtual monitoring sites (LOOCV R² and LOOCV RMSE); and (iii) the effectiveness of the monitoring networks at predicting concentrations at all the residential addresses (Residential R² and Residential RMSE). Fig. 3 shows that LUR R² and LOOCV R² slightly decreased with increasing network size, while LOOCV RMSE slightly increased. In contrast, the effectiveness of the monitoring networks at predicting residential NO₂ concentration improved with increasing network size, as shown by the increasing Residential R² and decreasing Residential RMSE (Fig. 3). The improvement in the prediction of residential concentration (Residential R² and RMSE) was, however, insignificant between LUR models constructed with >30 monitoring sites, as indicated by the overlap of the inter-quartile ranges of the statistics calculated from the 30 random repetitions. The fact that the LOOCV R² was significantly higher than the Residential R² across the network sizes for Road and Mixed networks (Fig. 3a) suggests that using LOOCV to evaluate the LUR model's predictive ability might be overly optimistic.
The contrast between the performance of the LUR model and its ability to predict residential NO₂ concentration was especially large for the Road network design (comprising a mixture of roadside and background sites), and for the other network designs when there were only 10 or 15 monitoring sites (Fig. 3a). The most effective type of monitoring network was the Mixed network, as indicated by the highest Residential R² limit and lowest Residential RMSE limit. The variability of Residential R² and RMSE in the 30 random repetitions of each network configuration (whiskers in Fig. 3) decreased with increasing network size, suggesting that a larger number of monitoring sites better captures the actual relationship between predictor variables and NO₂ concentration and hence reduces the variability between LUR models.
Fig. 2. Comparison of modelled and measured NO₂ concentration for summer (a) and winter (b) seasons. The cross markers denote the sites excluded from regression analysis due to special local effects as described in the text. Site 8 PDTs were co-located with the reference analyser at ED3, marked by the red triangle. This site is also marked on Fig. 1.
Fig. 3. Diagnostic statistics for LUR models as a function of network design and size for simulating network site concentrations (LUR R² and LOOCV R² shown in (a), LOOCV RMSE shown in (b)); and for predicting residential NO₂ concentration (Residential R² in (a); Residential RMSE in (b)). The points represent the median of the statistics for the 30 random repetitions of each network configuration. The whiskers extend to the 25th and 75th percentiles of the statistics for the 30 random repetitions of each network configuration. The horizontal dashed lines denote the Residential R² and RMSE if all the potential monitoring sites (70, 104 and 124 for HH, Road and Mixed networks, respectively) are used for calculation.
The performances of the LUR models in estimating residential concentration within three separate ranges of NO₂ concentration are compared in Fig. 4. At the low end of NO₂ concentration (<20 µg m⁻³), RMSE was similar between Mixed and HH networks, but both HH and Road networks significantly overestimated (MB) the overall residential NO₂ concentration. For NO₂ concentrations between 20 and 30 µg m⁻³, the HH networks generally underestimated the residential concentration (Fig. 4b). The most distinctive difference between the three network designs was observed at the high end of NO₂ concentration (>30 µg m⁻³). For these NO₂ concentrations, the prediction errors (RMSE) and the extent of overall underestimation (MB) were significantly higher for the HH networks (Fig. 4). Mixed and Road networks performed similarly at high NO₂ concentration, although they both still, on average, underestimated (Fig. 4b). Similar to the statistics in Fig. 3, the variability of estimation errors also reduced with increasing network size. Overall, considering the results shown in Figs. 3 and 4 together, the Mixed networks were most effective in estimating residential NO₂ concentration when considering both all residential addresses together and subsets of addresses in different ranges of NO₂ concentrations. Fig. 5 shows the results of the investigation of the different proportions of HH sites and roadside sites within the Mixed network on the LUR model predictions of residential NO₂ concentrations. There was no significant trend in the R² values across the different proportions of roadside sites in the Mixed network (Fig. 5a). The prediction error (RMSE) increased with increasing proportions of roadside sites (Fig. 5b). The variability of the RMSE resulting from networks consisting only of roadside sites was the largest, whereas the LUR models derived from networks with more HH sites were relatively more precise (Fig. 5b).
The performance of networks containing different percentages of roadside sites at different NO₂ concentration ranges is compared in Fig. 6. For NO₂ concentration <30 µg m⁻³, the RMSE increased with increasing percentage of roadside sites (Fig. 6a) and the MB suggested overestimation for networks with high roadside site composition (Fig. 6b). However, at NO₂ concentration >30 µg m⁻³, the RMSE decreased with increasing percentage of roadside sites (Fig. 6a) and higher roadside site composition led to reduced bias (Fig. 6b). Fig. 6a shows that LUR models constructed with only roadside sites resulted in high variability in the RMSE at NO₂ concentration <30 µg m⁻³, whereas LUR models constructed with all HH sites resulted in high variability in the RMSE at high NO₂ concentrations (>30 µg m⁻³).

Discussion
In most LUR studies, LOOCV and/or hold-out validation (dividing monitoring sites into two independent sets for model development and validation) have been used to validate the LUR models. LOOCV tests how well the LUR model predicts the observation within the training dataset. Hold-out validation evaluates the predictive ability of the LUR model at locations that were not used in model development. The latter evaluation is of more interest for an LUR model; in practice, however, there is always a trade-off between building a more robust LUR model using a larger training dataset and giving more power to the evaluation using a larger validation dataset. A limited number of monitoring sites in many studies makes the division of the dataset even more difficult. Evaluation of LUR models on all potential exposure subjects has been unfeasible in reality. However, this can be achieved by using a dispersion model to provide a realistic spatial field of urban ambient NO 2 concentration. Although there may be uncertainties in the dispersion-modelled concentrations, the nature of the errors should be similar at the virtual monitoring sites and at the residential addresses.
As expected, more monitoring sites yielded better estimation of residential NO 2 concentration by the LUR model (Fig. 3). For all three network types, however, the improvement in the estimation was insignificant for networks with more than~30 monitoring sites. Although the improvement was insignificant, higher number of monitoring sites increased the stability of the developed LUR models as shown by the very small inter-quartile range for the statistics at larger network sizes in Figs. 3 and 4. As the number of monitoring sites increased, the number of unique variables appearing in the LUR models decreased (Table S5), indicating that a greater number of monitoring sites was more effective at eliminating insignificant predictor variables. This is consistent with the findings of Basagaña et al. (2012) using actual NO 2 measurements. In our work, it was found that~30 observations are sufficient to capture the spatial variation of the residential NO 2 concentrations in a dispersion modelled pollution surface of an urban area of 25 km 2 , but this number is expected to be larger in reality due to local effects (e.g. street canyon effect and traffic queueing) that were not modelled by ADMS-Urban and for larger areas than simulated in this study. Basagaña et al. (2012) showed that the  improvement of R 2 in hold-out validation was minor after~60 monitoring sites in a study area of 45.7 km 2 . In a national wide Netherlands study, LUR models constructed with over~90 monitoring sites seemed to result in similar prediction ability. Collectively these results suggest that a minimum optimal number of monitoring exists but depends on the actual study area.
In the ESCAPE study (Beelen et al., 2013) and many other LUR studies (Aguilera et al., 2008;Madsen et al., 2007), urban background sites were selected in conjunction with roadside sites to build the LUR model. Urban background sites are usually defined with respect to the distance to road source or traffic activity within a certain buffer, irrespective of the distribution of the exposure study subjects, as was represented by the Road network design in this study. Fig. 3a shows that LUR models derived from such networks were generally poorer at estimating NO 2 concentration at residential addresses than LUR models derived from networks with sites selected on the basis of household density. LUR models derived from Mixed networks were better at estimating residential NO 2 concentration than those derived from Road networks (Fig. 3) and also gave comparable errors to Road network-derived LUR models for estimating concentrations at the high end of the distribution (Fig. 4). This observation emphasises the importance of characterising both the concentration and population distribution in the study area when designing a monitoring network.
The composition of different types of measurement sites in most monitoring networks used to construct LUR models to date has been rather arbitrary. Some researchers (Cyrys et al., 2012) followed the principle of over-representing roadside sites relative to the fraction of addresses close to roads, as this captures the spatial variation of NO 2 . In our work, LUR models constructed from networks containing 0–30% of roadside sites (compared with 0.2% of addresses within 10 m of roads in the study area) showed lower estimation errors (Fig. 5b) than other network compositions for all three network sizes tested. When the estimated residential concentrations were examined at different NO 2 levels, however, LUR models constructed from networks containing 0–30% of roadside sites resulted in larger errors at high NO 2 concentrations than networks containing higher proportions of roadside sites (Fig. 6). These results suggest that a greater proportion of roadside sites in a monitoring network yielded LUR models that better characterised the higher end of the residential NO 2 concentration (Fig. 6) but also introduced greater prediction error when considering the population as a whole (Fig. 5), and vice versa for LUR models derived from networks containing a greater proportion of HH sites. No particular network composition was able to provide an LUR model capable of both good overall estimation of the residential NO 2 concentrations and good estimation of the higher-end concentrations. This illustrates the limitation of LUR models in capturing the spatial contrast in residential NO 2 concentration predicted by the dispersion model.
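The distinction between overall error and error at the high end of the distribution can be made concrete with a short Python sketch. It is a minimal illustration under stated assumptions: the "high end" is taken here as addresses at or above the 90th percentile of the reference concentration, which is an arbitrary choice for the example and not the definition used for Fig. 6, and the data are synthetic. The function name `rmse_overall_and_high_end` is hypothetical.

```python
import numpy as np

def rmse(obs, pred):
    """Root-mean-square error between two arrays."""
    return float(np.sqrt(np.mean((np.asarray(obs) - np.asarray(pred)) ** 2)))

def rmse_overall_and_high_end(reference, estimate, high_quantile=0.9):
    """Return (RMSE over all addresses, RMSE over addresses whose reference
    concentration is at or above the chosen quantile).
    The 0.9 default is an assumption made for this sketch."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    threshold = np.quantile(reference, high_quantile)
    high = reference >= threshold
    return rmse(reference, estimate), rmse(reference[high], estimate[high])

# Synthetic, right-skewed reference concentrations (assumed values) and an
# estimate that compresses the spatial contrast towards the mean.
rng = np.random.default_rng(0)
reference = rng.lognormal(mean=3.0, sigma=0.5, size=5000)
estimate = reference.mean() + 0.5 * (reference - reference.mean())

overall, high_end = rmse_overall_and_high_end(reference, estimate)
print(f"overall RMSE = {overall:.2f}, high-end RMSE = {high_end:.2f}")
```

In this toy case the estimate understates the spatial contrast, so the error among the highest-concentration addresses exceeds the overall error even though the overall figure looks acceptable, which is why the two statistics are worth reporting separately.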
As a common LUR model evaluation method, the LOOCV R 2 statistic was found to overestimate the LUR predictive ability, consistent with the limited number of other studies on the same topic (Basagaña et al., 2012; Johnson et al., 2010; Wang et al., 2012). Collectively, the results from these studies highlight the limited predictability of empirical NO 2 LUR models that are highly dependent on the measurement sites. Dispersion modelling, as demonstrated in this study, is a potentially useful tool to design an effective monitoring network and to better evaluate the LUR models in a way that is otherwise unfeasible in reality.
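For readers less familiar with the two evaluation statistics being contrasted, the following Python sketch shows how a leave-one-out cross-validated R 2 at the monitoring sites and an R 2 at all addresses might each be computed. It only illustrates the computations and rests on several assumptions: R 2 is defined as 1 minus the ratio of residual to total sum of squares, the model is ordinary least squares with a fixed predictor set, and the data are synthetic. Because no variable selection is performed, the sketch is not expected to reproduce the overestimation reported here. The function names `loocv_r2` and `external_r2` are hypothetical.

```python
import numpy as np

def _design(X):
    """Design matrix with an intercept column prepended."""
    return np.column_stack([np.ones(len(X)), X])

def r2(obs, pred):
    """R2 defined as 1 - SS_res / SS_tot (an assumed definition)."""
    obs = np.asarray(obs, dtype=float)
    pred = np.asarray(pred, dtype=float)
    ss_res = np.sum((obs - pred) ** 2)
    ss_tot = np.sum((obs - obs.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def loocv_r2(X_sites, y_sites):
    """Leave-one-out R2: each site is predicted by a model fitted to the
    remaining sites."""
    n = len(y_sites)
    preds = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        beta, *_ = np.linalg.lstsq(_design(X_sites[keep]), y_sites[keep], rcond=None)
        preds[i] = (_design(X_sites[i:i + 1]) @ beta)[0]
    return r2(y_sites, preds)

def external_r2(X_sites, y_sites, X_addr, y_addr):
    """R2 at all addresses for a model fitted to the monitoring sites only."""
    beta, *_ = np.linalg.lstsq(_design(X_sites), y_sites, rcond=None)
    return r2(y_addr, _design(X_addr) @ beta)

# Synthetic stand-in for concentrations at all addresses (assumed values),
# with 30 addresses sampled as 'virtual' monitoring sites.
rng = np.random.default_rng(2)
X_addr = rng.normal(size=(3000, 2))
y_addr = 25 + X_addr @ np.array([6.0, 2.0]) + rng.normal(scale=3.0, size=3000)
site_idx = rng.choice(len(y_addr), size=30, replace=False)

r2_loocv = loocv_r2(X_addr[site_idx], y_addr[site_idx])
r2_external = external_r2(X_addr[site_idx], y_addr[site_idx], X_addr, y_addr)
print(f"LOOCV R2 at sites = {r2_loocv:.2f}, R2 at all addresses = {r2_external:.2f}")
```

The second statistic requires a reference concentration at every address, which is available from a dispersion model but not from measurements.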
We acknowledge that the area of our domain (25 km 2 ) is smaller than that of some LUR studies (Aguilera et al., 2008; Fernández-Somoano et al., 2011). This choice was mainly constrained by the intensive computational requirement of the dispersion model when calculating concentrations at the large number of residential addresses.
Clustering addresses with similar characteristics would reduce the calculation time and facilitate dispersion modelling over a larger area. In this study, a dispersion model provided the NO 2 concentrations for development of the LUR models and for evaluation of their predictive capabilities. Whilst accepting potential discrepancies between dispersion-modelled and measured concentrations, this work shows that more comprehensive evaluation of LUR models and their underpinning monitoring networks is needed. Although we only evaluated the LUR models for NO 2 , the results for the effect of the number and type of monitoring sites on LUR model performance should be transferable to other traffic-related air pollutants such as black carbon and ultrafine particle number, given their high mutual correlations.

Conclusions
Using a greater number of sites to build an LUR model improved its ability to estimate residential NO 2 concentrations. However, the improvement in LUR model predictive capability was not significant beyond a certain number of monitoring sites: the predictive capability achieved using ~30 monitoring sites was similar to that achieved using 70–100 monitoring sites, although a greater number of monitoring sites tended to improve precision. LUR models constructed from a network design incorporating both high household density areas and roadside sites better characterised the full range of residential concentrations and, specifically, those at the highest concentrations. It is therefore recommended to incorporate monitoring sites representing most of the study subjects when designing a monitoring network aimed at studying the health effects of air pollutants. The more roadside sites included in a monitoring network used to construct an LUR model, the larger the RMSE for the estimation of residential NO 2 concentrations, but the lower the estimation error for high NO 2 concentrations. The fact that no particular proportion of roadside sites within the network design estimated well both the overall residential concentrations and the higher NO 2 concentrations suggests a lack of spatial contrast in the LUR-modelled pollution surface. A dispersion model has been shown to be a useful tool both for designing a monitoring network for LUR models and for evaluating the LUR models.