External validation for statistical NO 2 modelling: A study case using a high-end mobile sensing instrument

Statistical learning models have been applied to study the spatial patterns of ambient Nitrogen Dioxide (NO 2 )


Introduction
Chronic exposure to air pollution poses a threat to public health.The World Health Organisation estimated air pollution to contribute to 7 million deaths worldwide in 2016 (Organization et al., 2018).From a medical perspective, common air pollutants such as particulate matter and Nitrogen Dioxides (NO 2 ) damage the cardiovascular and respiratory systems (Anderson et al., 2012;Pascal, 2009).In the Netherlands, the NO 2 concentration limit set in 2017 was exceeded several times in areas with busy traffic (Atlasleefomegeving, 2020).Spatial prediction of NO 2 is needed for making scientific-based recommendations to reduce NO 2 emissions and meet the Goal 13 (topic: Atmosphere) of the Sustainability Development Goals 2030 (SDGs, 2017).Emissions from traffic can be direct and indirect.Atmospheric NO 2 is mostly trafficrelated as an indirect secondary emission of an oxidation result of emitted NO, while the direct primary emission as NO 2 is minor (UK et al., 2017) or fused with ground monitoring stations (Gressent et al., 2020) for more accurate prediction (than using ground monitoring stations alone) of air pollution in space-time.Meanwhile, the deficits of low-cost sensors are obvious.Measurements from low-cost sensors are subject to sensor drift and interference effects.Sensor drift denotes the growing bias of sensor response due to the ageing electrochemical cell.Interference effects refer to the sensors' response to other pollutants, gases, temperature and relative humidity, leading to the measurements being presented lower (negative interference effects) or higher (positive interference effects) than the actual concentrations (van Zoest et al., 2019).For instance, an application of low-cost sensors in the Netherlands carried out in Amsterdam showed a significant signal drift over a period of two months.Calibration with temperature and relative humidity however improved the fit with ground stations (Mijling et al., 2018).
Methods for spatiotemporal mapping of NO 2 include dispersion models (Holmes and Morawska, 2006;Health Effects Institute, 2010), statistical models (Chen et al., 2019b), or hybrid models (Mölter et al., 2010b;Marshall et al., 2008;Beelen et al., 2010;Dijkema et al., 2010;Akita et al., 2014).Dispersion models simulate the emission, transformation, transportation and deposition of atmospheric particles, but require detailed emission inventory data and are computationally intensive; scaling a dispersion model towards larger areas can be at the cost of computational intractability.Statistical models aim at finding relationships between ground monitor station observations, satellite measurements, and ancillary NO 2 data, called geospatial predictors (Rivera et al., 2013;Park and Kwan, 2017;Kharol et al., 2015;Isiugo et al., 2018;Chen et al., 2019b;Lu et al., 2020a;Li et al., 2020;Chan et al., 2021).The geospatial predictors are variables that relate to the emission sources (e.g., transport network) and dispersion processes (e.g., meteorological data) of the pollutants (Briggs et al., 2000).The statistical approach has been used to predict high-resolution NO 2 at various spatial scales.Conventionally, linear regression methods have been used (Hoek et al., 2008a;Larkin et al., 2017) for mapping and analysing the pollutant sources.More recently, ensemble tree-based (Dou et al., 2021;Song et al., 2021) and neural networksbased (Li et al., 2020;Shams et al., 2021) machine learning methods have shown to be capable of further improving the prediction accuracy for the spatial and spatiotemporal prediction of NO 2 .Environmental epidemiological studies have shown that mapping temporally resolved, long-term NO 2 climate, e.g.NO 2 of each hour of a day aggregated over multiple years (Lu et al., 2020b), to be preferable to account for human space-time activities in personal exposure assessment (Lu et al., 2019).However, validation thus far is mostly done against ground station measurements, which does not enable extensive validation of the spatial pattern of air pollution predictions.To validate spatiotemporal predictions regarding spatiotemporal patterns, external measurements providing more detailed spatiotemporal information is necessary.
Several studies compared the accuracy of high-resolution NO 2 mapping of various statistical models, including standard and regularised linear regression (Briggs et al., 2000), ensemble tree-based models (dos Santos et al., 2020), transformation-based (e.g.support vector machine), and artificial neural networks (Chen et al., 2019b;Kerckhoffs et al., 2019;Lu et al., 2020a).Kerckhoffs et al. (2019) modelling UFP (Ultra Fine Particles) and Chen et al. (2019b) modelling NO 2 over several European countries, concluded that the ensemble treebased and linear regression models obtain similar prediction accuracy.However, neither of these two studies showed or evaluated the gridded prediction (i.e. the predicted continuous surface of NO 2 ).Also, in the global NO 2 modelling study of Lu et al. (2020a), it is found that various ensemble tree-based methods obtaining almost the same accuracy based on Cross-Validation on Ground Stations (CVGS) may give very different prediction patterns.There is thus a need to evaluate prediction models by comparing modelled and observed spatial patterns of NO 2 and preferably use observed patterns also in model building additional to CVGS.
Using external air pollution measurements that measures local spatial variation for validating air pollution models (both statistical and numerical) has been shown to be necessary for a reliable model evaluation (Khreis et al., 2018), however, remains a challenge due to limited air pollution measurements and the selection of an external validation dataset (Liu et al., 2018;Khreis et al., 2018).Most studies used external ground monitoring measurements for external validation.Liu et al. (2018) used ground station measurements to correct numerical models through the application of a correction factor to the ground station measurements.Mölter et al. (2010a) used air dispersion model output to LUR models and ground monitoring stations for validation.Ren et al. (2020) used monitoring stations that do not fulfil the 75% completeness criteria (i.e. less than 25% of data is missing) as external validations additional to other cross-validation strategies for a comparison between machine learning and LUR models.Similarly, Chen et al. (2019b) used external ground station measurements to evaluate different statistical air pollution models for comparison and concluded the strengthened model evaluation by the use of an external evaluation dataset.Khreis et al. (2018) evaluated the impact of using different ground measurements to validate statistical and numerical to the model evaluation results.The use of low-cost sensors in model validation are less frequent, likely due to the challenge in quality control and assurance.Lu et al. (2019b) used low-cost sensors (PurpleAir sensors) to evaluate a LUR (Land Use Regression) statistical PM 2.5 model.They compared block-level LUR predictions with PurpleAir sensor measurements and found higher values of the PurpleAir sensors and that the LUR predictions explain on average 20% of the variations from the PurpleAir sensor observations.
The objectives of our study are to (1) design a study to collect high quality measurements of air pollutants that are suitable for validating statistical models, and (2) use these measurements to evaluate and compare statistical NO 2 models built using the official national ground air quality monitoring network.We will answer the question if we can better optimise hyperparameters (parameters whose values are used to control the model training process and are not obtained from model training) in the models using these measurements and whether model comparison results differ from those based on CVGS.
We monitor NO 2 concentrations by installing a mobile air quality station onboard a cargo-bike to evaluate temporally resolved NO 2 models based on different statistical algorithms.Specifically, we focused on comparing three methods, Lasso (Tibshirani, 1994), Random Forest (RF, Breiman, 2001), and eXtreme Gradient Boosting (XGBoost, XGB, Chen et al., 2019a).These methods are selected as they are representative for the most recent spatial prediction techniques for NO 2 mapping, with Lasso and Random Forest compared in Chen et al. (2019b), Kerckhoffs et al. (2019) and Lu et al. (2020a).We compared the Lasso, RF, and XGB models of the corresponding hours with the cargo-bike measurements to understand the amount of spatial variability a statistical model could capture, and to compare different models and hyperparameter settings beyond CVGS accuracy of the models.

Cargo-bike and instruments
The cargo-bike (Fig. 1) was designed to carry reference apparatus in a compact package, to operate with optimal freedom of movement while providing reliable measurement data.It was equipped with two 12,8 V/100 Ah LiFePo4 batteries (Victron), a high-efficiency MultiPlus Compact inverter/charger (Victron) and a 150Wp solar panel.The cargo-bike and the instruments onboard weight 160 kg, has an e-bike support motor, and can operate 3-5 h continuously.
The air quality monitoring apparatus were: 42i NO  monitor, 49i O3 monitor (Thermo Fisher Scientific), model 3321 APS (TSI), WXT536 climate sensor (Vaisala), MI70 + probe GM70 CO2 sensor (Vaisala), GPS, a PC and several low-end air quality sensors.The housing was custom-made from sandwich panels and extruded aluminium profiles.
The cargo-bike covered a route shown in Fig. 2, in Nijmegen, the Netherlands, from July 16 to 19, 2019.The total route length is 29 km.The cargo-bike measured NO 2 every morning from 8 am to 11 am, the route was repeated daily.A two-point calibration was performed on the Thermo 42i NO  monitor using synthetic zero gas (0 ppb) and calibration gasses of NO (410 ppb) and NO 2 (366 ppb).
We compared the cargo-bike measurements with ground station sensor measurements for validation.Along the cargo-bike route, there are two national ground stations established, one is the Nijmegen-Graafseweg station (called Graafseweg station, Latitude: 51.941372, Longitude 5.857777), which monitors mainly air pollution from traffic.The other is Nijmegen-Ruyterstraat station (called Ruyterstraat station, Latitude: 51.838221, Longitude: 5.856938), which monitors mainly background pollution.The stations are managed by RIVM.Over the period of the cargo-bike measurements considered, the mean NO 2 level is 23.25 μg/m 3 and 14.03 μg/m 3 , respectively, for the Graafseweg and the Ruyterstraat stations.We acquired minutely ground station measurements and did the comparison using the cargo-bike and ground station measurements 2 min before and after when the cargo-bike was at the national ground stations (we cycled slower when we were close to the two ground stations), to account for differences in sampling duration and frequency between the cargo-bike and the ground stations.To facilitate the comparison, we also averaged the ground stations and cargo-bike measurements, respectively.Fig. 3 shows that the cargo-bike measurements matches better with the measurements from the Ruyterstraat station than the Graafseweg station.This could be explained by the much higher traffic intensity around the Graafseweg whereas the Ruyterstraat is in a neighbourhood away from the larger busy street.The cargo-bike is closer to the emission outlets of vehicles and has a higher sampling rates.These may explain that they show higher variations compared to the ground station measurements.A notably high value of a cargo-bike measurement on 17-07 may be caused by a heavy-emission vehicle passing by.
To use the cargo-bike measurements as external validations of statistical models, which are trained on ground station measurements in unit μg/m 3 (micrograms per cubic metre).The cargo-bike measurements are converted from ppb to μg/m 3 under the assumption of the ideal gas behaviour.Specifically, we firstly use the ideal gas law to determine the molar volume at the pressure and temperature that were measured during the bike-ride sampling: where n is 1 mol, R is the gas constant (0.082057366080960), T is temperature in Kelvin and P is pressure in atm.Then the NO 2 in μg/m 3 is calculated as: The cargo-bike measurements are aggregated in space-time to every minute and in each (25 m) grid cell of the prediction map.The measurements are also aggregated over the four days, to eliminate effects from random vehicles to represent general emission patterns.

Ground monitor stations used for statistical modelling
In the Netherlands (41,543 km 2 ), the national air quality ground monitoring network consists of 66 ground stations.These ground stations are managed by the National Institute for Public Health and the Environment (RIVM, National Institute for Public Health and the Environment, 2017).We further incorporated the ground monitor network from Germany (357.386 km 2 , 376 stations) to better identify NO 2predictors relationships using machine learning models.The ground monitoring station measurements are from the European Environment Agency (2021) for the Netherlands and the Umweltbundesamt (2021) for Germany.Three stations with inadequate measurements (i.e.missing values at certain hours) are neglected.The measurements are  downloaded for the same days as the cargo-bike measurements and from 7:00 am-11:59 am.
The ground station measurements used for prediction are aggregated in two ways, one is the mean of all hours and days measured by cargo-bike, called NEDL-avg dataset, and the other the mean of each hour of the days corresponding to the time of cargo-bike measurements, called NEDL-hr dataset.The NEDL-avg is used for XGB and RF hyperparameter optimisation and to identify the grid resolution for mapping.The NEDL-hr is used for building hourly statistical models.

Geospatial predictors
The geospatial predictors (Table 1) were calculated at 25 m resolution.They are either spatial attributes aggregated within a circular ring centred at each sensor or prediction location, called buffered predictors, or values of the spatial attribute at the observation or prediction location, called gridded variables.The buffered predictors include industry areas, roads, VIIRS (Visible Infrared Imaging Radiometer Suite) Nighttime Day/Night Band (DNB) radiance values (nightlight, NOAA, 2021) and population.Gridded variables include wind speed and temperature (Dee et al., 2011), elevation (Amante and Eakins, 2009), mean of the NO 2 column density from TROPOMI level 3 product (Google Earth Engine, 2019) in July 2019 (3.5 × 7 km 2 resolution).The buffered predictors of road and industry are calculated from OpenStreetMap (OpenStreetMap contributors, 2019).For detailed descriptions of the sources of geospatial predictors and how they are calculated please refer to Lu et al. (2020a).

Research design
The methodological framework is designed as follows: 1.A set of statistical learning methods is applied to DENL-avg to identify hyperparameters for DENL-hr and to identify grid resolution for mapping based on variable importance.2. The statistical learning methods and optimised hyperparameters are applied to DENL-hr to model hourly NO 2 .These statistical learning models are evaluated using CVGS.
3. Hourly NO 2 is predicted for the study area with each of the statistical models using DENL-hr.XGB hyperparameters are adjusted while observing the prediction pattern.4. The predictions from hourly models using the hyperparameters optimised in step (1) and for XGB additionally the parameters optimised in step (3) are compared to the cargo-bike measurements.

Random forest, extreme gradient boosting, Lasso
Lasso is a linear regression algorithm that uses an L1 norm to shrink variable coefficients to zero to reduce model variance.RF and XGB are ensemble tree-based methods, which mitigate two negative effects of a single tree model: instability and coarse separation.Large trees are subject to instability, while small trees are inaccurate for their piece-wise constant approximations.Bagging overcomes these two constraints by using small trees to add stability and avoid coarse approximation by averaging over small trees (Friedman, 2001).RF is based on Bagging, which grows trees independently while XGB is based on gradient boosting, which grows trees subsequently based on current model residuals.RF extends from bagging by choosing the variable to split from a subset a of variables.XGB is a scalable gradient boosting algorithm, which enables multiple penalisation paths to control model complexity to prevent model over-fitting, including regularisation on tree width and terminal node values, as well as dropping trees.
The variable importance is calculated for XGB and RF.We use the averaged ranking in 20 times bootstrapping for each method (Lu et al., 2020a).For the XGB the gain scores (Chen and Guestrin, 2016) are used and for the RF the permutation test is used to calculate the variable importance.

RF and XGB hyperparameter optimisation
The NLDE-avg dataset is used for hyperparameter optimisation, through grid search, with 5-fold CVGS.For XGB, the learning rate (eta), number of iterations (rounds), maximum tree depth (max.tree.depth)and gamma are tuned, each time 70% of data is drawn from the training set.The search grid for the number of iterations (rounds) was from to 3000, with a step of 200; maximum tree depth (max-depth) from to 6 with a step of 1, learning rate (eta) from 0.001 to 0.1 with a step of 0.05, the penalty term gamma (Chen et al., 2019a) from 1 to 5 with a step of 1.The 5-fold CVGS result indicates the optimal hyperparameters to be eta = 0.051, rounds = 200, max-depth = 3, gamma = 1.For RF, the minimum number of trees on the end nodes (min.node.size),and number of variables that are randomly drawn for each tree (mtry) are optimised.The optimal setting for RF is min.node.sizeequals 5 and mtry equals 12, the number of trees is set to 1000 for random forest, which is a safe choice as the high number of trees will not negatively affect model performance.

High setting of XGB hyperparameters
As the spatial pattern of XGB can vary greatly with different hyperparameter settings despite the CVGS accuracy remaining the same (Lu et al., 2020a), we observed spatial prediction patterns from multiple XGB hyperparameter settings.We tested increased learning rates of 0.002, 0.001, and 0.0005, and estimators (i.e. the number of trees) of 300, 3000, and 5000.We found altering the learning rate from 0.002 to 0.0005 only affect the prediction patterns subtly but setting the learning rate to 0.002 and 0.051 (original setting optimised by CVGS) makes a considerable difference in their predicted NO 2 patterns.The CVGS accuracy is optimised at 3000 trees with this new learning rate, which remains approximately the same compared to the original setting (Table 4).We also increased the L1 norm (lambda) from 2 to 10, with a step of 2, and gamma (Chen and Guestrin, 2016) to 5, to further control respectively extreme values at the terminal node and model complexity.

Predictor
Variable With these settings the spatial prediction patterns of XGB, as well as their correlations with cargo-bike measurements change subtly.We will show the result of XGB with the maximum tree depth set to 5, learning rate 0.002, number of estimators 3000, lambda 2, and gamma 5. We refer to this hyperparameter setting of XGB as ''high setting'' (XGB hs); compared to the original setting, it searches the gradient much more slowly, correspondingly with more iterations, and uses additional penalties to control model over-fitting.

Accuracy assessment
The RMSE (Root Mean Squared Error,  = √ ) provides a general insight into the variance and magnitude of the error.In addition, we calculated the MAE (Mean Absolute Error,  = (|  −   |)) for the magnitude of the error and the IQR (Inter-Quartile Range,  =  3 −  1 ) for the variance of the error.To make the accuracy assessed at different times comparable, we calculated the  2 (R-squared,  2 = 1 − ((  −   ) 2 )∕(  )), where var(.)indicates the variance function.We used 80% of the dataset for modelling and 20% for validation.A 20-time bootstrapped cross-validation was used.Thus, the accuracy metrics described above were calculated on validation datasets 20 times, and the median of these 20 results was used as the final accuracy measure.

Models trained on NLDE-avg
Table 2 compares the variable importance obtained by the RF and XGB models trained on the NLDE-avg dataset.The top three ranked variables are the same, which consists of the emission-related variables primary road length within 25 m buffers.This indicates the maps can capture spatial variability at 25 m resolution.Therefore, the statistical modelling is based on 25 m resolution grids.Other emission-related variables include primary roads in 50 m and 100 m buffers and local roads in 100 m buffers.The variables are ranked similarly between RF and XGB.Over the study area, the Pearson's correlation coefficients between XGB and RF predictions is 0.97, between RF and Lasso 0.94, XGB and Lasso 0.90.

Hourly models
The results of XGB, XGB hs, RF, and Lasso models built respectively for 8-9, 9-10, and 10-11 am (referred to as 8 am, 9 am, and 10 am, respectively), averaged over 4 days, using the NLDE-hr dataset, and a comparison of variability with the cargo-bike measurements at corresponding hours are shown in Tables 4 and 6 and Figs. 5 and 7.In all the time slots, the XGB (i.e. with the hyperparameters tuned based on grid search using k-fold CV), XGB hs and RF obtained similar CVGS  accuracy, and both outperformed Lasso.Fig. 4 shows the impact of learning rate and compared XGB and XGB hs prediction patterns.It could be observed that a too high learning rate leads to sporadic spatial patterns and the spatial prediction from XGB is noisier compared to XGB hs.The R 2 of the linear regression between model predictions and the cargo-bike measurements (Table 6) also indicated the XGB hs having a higher correlation with the cargo-bike measurements compared to the XGB at 8 am and other times similar.Therefore, for the rest of the comparisons between models and with the cargo-bike measurements, we used XGB hs instead of XGB.The spatial predictions of the three models at the three time slots and the corresponding cargo-bike measurements are shown in Fig. 5.All the models predicted highest NO 2 along the primary road and show a decreasing trend away from the city centre to the suburban areas.The Lasso prediction shows the least spatial variation and the XGB the most.At 9 am, the highest NO 2 is measured near the river Waal (Fig. 2), this is captured by XGB and RF predictions but is completely missed by the Lasso prediction.At 10 am, the cargo-bike measurements are higher along the roads, and this is consistent across the model predictions.
The XGB hs, RF, and Lasso predictions along the cargo-bike track are compared to the cargo-bike measurements (Fig. 6).The mean (Table 5) at 8 am between model predictions and the cargo-bike measurements are the closest.The mean of model predictions at 9 and 10 am are higher than the cargo-bike measurements (less than 25% higher).To facilitate visualising the spatial variations of the statistical model predictions compared to cargo-bike measurements, we regressed three model predictions along the cargo-bike track each against the cargo-bike measurements (Fig. 7).The model predictions along the cargo-bike track vary more similarly between each other compared to the variations in the cargo-bike measurements.The best match between the cargo-bike measurements and the three model predictions occurs at 10 am, all the models are capable of predicting the peaks between points 7500-9000 at 10 am, when the cargo-bike left the main road.This may be the reason that the  2 (Table 6) between model predictions and cargo-bike measurements are the highest at 10 am.Compared with RF and XGB, the Lasso model prediction obtained the highest correlation with cargo-bike measurements at 10 am despite the lowest CVGS R 2 .At 8 am, the cargo-bike is mostly in an area with less traffic.The R 2 of the XGB hs against cargo-bike measurements (0.27) is notably higher compared to that of Lasso (0.00) and RF predictions (0.1).At 9 am, the cargo-bike is mostly on the main road.The R 2 are similarly low in all model comparisons.

Discussion
In this study, we used a cargo-bike with various air quality sensors and high-end instrumentation for gathering spatially more detailed ground validation data.The high-end monitors we used are equivalent  We demonstrated that it is important to consider spatial prediction patterns when evaluating model predictions of NO 2 .In addition, we have shown that spatially-continuous measurements are a valuable source of information to improve the hyperparameter tuning and model evaluation.Importantly, external measurements provide us with an objective, quantitative measure to involve the spatial prediction pattern in the hyperparameter tuning and modelling processes.The route track taken at 8 am by the cargo-bike is mostly far away from major traffic roads (further than 500 m away from ''primary roads'' defined by OpenStreetMaps), and the route track taken at 9 am and 10 am are mostly in a traffic area.Among the prediction models, XGB hs obtained the highest  2 with the cargo-bike measurements at 8 am, which shows the least variations compared to other times (Fig. 6).This may indicate that the XGB is less prone to over-fitting when its hyperparameters are properly set.The best match between all the models and the cargo-bike measurements is at 10 am.This indicates that the statistical models are capable of capturing the traffic emission related variations.The inconsistency between the R 2 of CVGS (Table 3, i.e. the lowest for Lasso and for 10 am) and when comparing with the cargo-bike measurements (Table 6 the highest for Lasso and 10 am) may be explained by that Lasso captured an "on and far-away from the main road" pattern with ground station measurements and this pattern is presented in this part of the trip.This pattern is not everywhere as evidenced by the low CVGS obtained by the Lasso.Future studies need to include traffic volumes to better understand the amount of local variations that could be captured with statistical models as well as using the designed mobile sensing technique to sample in more areas to better understand the spatial heterogeneity in model predictions.
The hyperparameter that affects the prediction pattern of XGB the most is the learning rate.If the learning rate is too high, the model may miss the minimum of the objective function and not fully learn the patterns in the data, causing sporadic effects.When the learning rate is optimal, the sporadic effects diminish and will only change minimally when the learning rate is further reduced.In this study, further reducing the learning rate from the optimal learning rate identified by cross-validation led to a clearer pattern (Fig. 4).Moreover, other model over-fitting control strategies, such as increasing gamma, lambda, and reducing sub-sampling do not considerably alter the prediction patterns and the CVGS accuracy, which may indicate that the model is not subject to over-fitting.
An important difference between the validation on cargo-bike and CVGS is that the CVGS measurements include all scales of spatial variations (short range, medium range, large range), while the cargobike measurements include only local variations.This means testing with CVGS evaluates the model also on its capability to predict patterns of different scales, while testing on the cargo-bike measurements mainly evaluates the model on its capability to predict detailed local variations.A model trained on a dataset that includes all scales of variation (based on CVGS) is likely not the model that performs best when mapping the detailed local variations.To the other side, satellite imagery (e.g.Tropomi instrument measurements) can be used to evaluate how the model is scalable to regions where dense ground monitoring networks are not available.

Conclusion
In this study, we designed a novel high-end instrument to measure detailed NO 2 not only along the primary roads but also in the city centre and used the measurements to further evaluate and compare high-resolution statistical NO 2 models.The spatially dense measurements from the cargo-bike allow us to compare different statistical methods and hyperparameter settings accounting for their spatial patterns.This is complementary to the CVGS-based accuracy assessment.As the ground truths are scattered, ignoring the spatial prediction patterns in hyperparameter optimisation and model evaluation may lead to one-sided model evaluation and comparison.The accuracy of model predictions is spatially heterogeneous.We showed that while the CVGS accuracy stays the same, the XGB model predictions vary non-trivially with different hyperparameter settings, particularly the learning rate.With advanced mobile sensing techniques, this study provides an approach for a more in-depth look into NO 2 statistical modelling and model comparison and highlights the possible pitfall of exclusively depending on CVGS accuracy in model hyperparameter optimisation and comparison.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Fig. 1 .
Fig. 1.The cargo-bike and instruments that are used to sensor the NO 2 in our study.

Fig. 2 .
Fig. 2. Routes taken by the cargo-bike, the background map is from OpenStreetMap (OpenStreetMap contributors, 2019).The start and end of routes were at the same location, indicated by the point ''A''.

Fig. 3 .
Fig. 3. Comparing the cargo-bike measurements with the Dutch national ground monitoring stations in the region.(a) Locations of the two ground stations, blue block is around the Graafseweg station and yellow the Ruyterstraat station.Bounding boxes (min latitude, max latitude, min longitude, max longitude) for the Graafseweg station are (51.84126,51.84151, 5.85753, 5.85786) and for the Ruyterstraat station are (51.83810,51.83830, 5.85701, 5.85719).The red line indicates the cargo-bike route.Within the two blocks the cargo-bike measurements are compared with the ground station measurements at the corresponding times.(b) The cargo-bike and ground station measurements while the cargo-bike was close to the ground stations.(c) Average values of the ground stations and the cargo-bike measurements.
M.Lu et al.

Fig. 5 .
Fig. 5. Maps of cargo-bike measurements and predictions from XGB hs, RF, Lasso, hourly models.The dashed lines in the prediction maps indicate the cargo-bike routes.

Fig. 6 .
Fig.6.Cargo-bike measurements and the XGB (high setting), RF, Lasso predictions and the cargo-bike measurements, visualised in the 1-D view.The point_id on the -axis refers to the position on the track followed by the cargo bike.

Fig. 7 .
Fig. 7. Cargo-bike measurements and the fitted values of a linear regression between respectively the XGB (high setting), RF, Lasso predictions and the cargo-bike measurements, visualised in the 1-D view.The point_id on the -axis refers to the position on the track followed by the cargo bike.

Table 2
Ranking of the top 15 most important variables of XGBoost and Random Forest, using NLDE-avg.Please refer to Table1for the variable description.

Table 3
20 times bootstrapped cross-validation results of the XGB, RF, and Lasso using NLDE-avg.

Table 5
Mean NO 2 of cargo-bike, RF, Lasso, high-setting XGB (XGB hs) and cargo-bike measurements.for official air quality monitoring.With high-end monitors, i.e. apparatus, we can measure as accurate as reasonably possible in (most) circumstances, not or minimally affected by temperature, humidity and other cross-reactivity (e.g., NO 2 sensors can be cross reactive to NO and O 3 ).As these monitors are usually quite large (19 ′′ rack form factor or comparable), until now this equipment (as far as we