Estimating surface NO 2 concentrations over Europe using Sentinel-5P TROPOMI observations and Machine Learning

Satellite observations from instruments such as the TROPOspheric Monitoring Instrument (TROPOMI) show significant potential for monitoring the spatiotemporal variability of NO 2 , however they typically provide vertically integrated measurements over the tropospheric column. In this study, we introduce a machine learning approach entitled ‘S-MESH ’ ( Satellite and ML-based Estimation of Surface air quality at High resolution) that allows for estimating daily surface NO 2 concentrations over Europe at 1 km spatial resolution based on eXtreme gradient boost (XGBoost) model using primarily observation-based datasets over the period 2019 – 2021. Spatiotemporal datasets used by the model include TROPOMI NO 2 tropospheric vertical column density, night light radiance from the Visible Infrared Imaging Radiometer Suite (VIIRS), Normalized Difference Vegetation Index from the Moderate Resolution Imaging Spectroradiometer (MODIS), observations of air quality monitoring stations from the European Environment Agency database and modeled meteorological parameters such as planetary boundary layer height, wind velocity, temperature. The overall model evaluation shows a mean absolute error of 7.77 μ g/m 3 , a median bias of 0.6 μ g/m 3 and a Spearman rank correlation of 0.66. The model performance is found to be influenced by NO 2 concentration levels, with the most reliable predictions at concentration levels of 10 – 40 μ g/m 3 with a bias of < 40%. The spatial and temporal error analyses indicate the spatial robustness of the model across the study area, with better prediction accuracy during the winter months and the associated higher NO 2 concentrations. Despite the complexity and the continental scale of the study area, the XGBoost-based model shows fast execution potential in providing daily estimates of surface NO 2 concentrations over Europe. The Shapley Additive exPlanations (SHAP) value analysis highlights TROPOMI NO 2 tropospheric column density as the main source of information in deriving surface NO 2 concentrations, indicating its significant potential for such studies. The SHAP values also indicate the importance of anthropogenic emission proxy inputs such as VIIRS night lights, in complementing TROPOMI NO 2 values for deriving higher resolution and detailed spatial patterns of NO 2 variations.


Introduction
Nitrogen Dioxide (NO 2 ) is one of the major air pollutants affecting human health.Long-term exposure to NO 2 leads to respiratory and cardiovascular diseases, and causes irritation of the nose, eyes, and throat (Manisalidis et al., 2020).Furthermore, the presence of NO 2 can lead to the formation of secondary pollutants such as tropospheric ozone and aerosol nitrates, resulting in acid rain and a reduction in visibility (Seinfeld and Pandis, 2016).NO 2 also has a climate impact due to its indirect effect on methane concentrations, ozone, and secondary inorganic aerosol (Cohen et al., 2010;Peng et al., 2022;Wang et al., 2020a).High levels of NO 2 can damage vegetation and reduce crop yields and plant growth via nitrogen deposition, which can impact negatively on natural ecosystems by fertilizing otherwise poor soils (Hettelingh et al., 2015).Despite a decreasing trend in NO 2 concentration over the past few years in Europe, in the year 2019 a total of 22 of the countries that report air quality monitoring data to the European Environment Agency (EEA) showed annual NO 2 concentrations above the European Union (EU) annual average limit value of 40 μg/m 3 .Furthermore, all the reporting countries exceeded the stricter WHO annual average guideline level of 10 μg/m 3 during 2019 (EEA, 2021a).To help mitigate the pollution risks through decision making and policy planning, it is important to monitor and understand the spatiotemporal distribution of NO 2 concentrations.The sources of NO 2 pollution are mostly anthropogenic with main contributions from traffic, industry and, power plants (EEA, 2022;Khomenko et al., 2023;Liu et al., 2016), and to some extent from soil fertilization activities in agriculture, particularly over Europe (Huber et al., 2020;Mielcarek-Bocheńska and Rzeźnik, 2021;Wang et al., 2021).The NO x -ozone chemistry influenced by the presence of sunlight further controls the NO 2 levels.Ground monitoring stations equipped with reference instruments continuously monitor these surface pollutants and provide hourly average observations.These measurements are highly accurate and are used for policy-based emission regulation and monitoring pollution levels.However, they are pointbased observations with inhomogeneous spatial distribution and variable spatial representativity.Chemistry Transport Models (CTM) on the other hand provide spatially complete information about the surface NO 2 concentrations by simulating the underlying physical and chemical processes.CTMs are computationally expensive, and this complexity increases with the spatial and temporal resolutions of the model, thus demanding heavier computational infrastructure and time.In addition, CTMs can be subject to systematic biases and require extensive inputs such as meteorology and emissions.
Though satellite observations can be regarded as a source of global measurements with good spatial distribution, they often exhibit data gaps due to cloud cover or retrieval issues (Schneider et al., 2021).Instruments in low earth orbit further exhibit limited temporal sampling with typically one overpass per day.Satellite measurements of NO 2 represent the tropospheric column density and not directly the concentration of NO 2 at the surface.Information on surface NO 2 concentrations is required for assessing human exposure and its impact on human health and is further needed for making the link with regulatory monitoring.Hence there is an increase in studies that focus on deriving surface NO 2 concentration from satellite measurements using various approaches.The most prominent approach is using a scaling factor derived from CTM models that relates the tropospheric column density to surface NO 2 (Bechle et al., 2013;Lamsal et al., 2008).Other methods use regression-based approaches such as land use regression (LUR) that combine satellite observations, CTM data, ground-based measurements along with land use and emission sources as dependent variables (de Hoogh et al., 2016;Hoek et al., 2015;Yang et al., 2017) or Geographically Weighted regression (GWR) that additionally take into account the spatial locations (Qin et al., 2017;Robinson et al., 2013;Song et al., 2019).However, the interactions between NO 2 and its impacting factors are non-linear and can be complex.
Lately, machine learning (ML) models such as eXtreme gradient boosting (XGBoost) (Chen and Guestrin, 2016), are being explored for deriving surface NO 2 concentrations using satellite datasets.Some studies, such as Chi et al. (2021) train an XGBoost model to derive daily surface NO 2 concentrations over China using TROPOMI and study the spatial and seasonal NO 2 variations across different regions in mainland China.Others, such as Chen et al. (2019) take a slightly different approach by combining universal kriging with the XGBoost model to fill the temporal information lost due to low (usually 1-2) satellite overpasses.While most studies have derived daily concentration maps from single satellite observation, a recent study by Kim et al. (2021) uses the XGBoost model on inputs that are spatially and temporally gap-filled using approaches such as normalized convolution and interpolation to derive hourly surface NO 2 maps at 100 m resolution.Other studies also employ the more complex neural networks (Chan et al., 2021;Ghahremanloo et al., 2021) or an ensemble of machine learning models (Di et al., 2020;Huang et al., 2022).All these studies show good agreement with station observations, indicating the potential of satellites for surface NO 2 estimations.Furthermore, these studies also demonstrate the capability of ML models in mapping non-linear, complex relationships.This feature makes ML models a good candidate for further exploring the potential of deriving surface NO 2 concentrations from satellite measurements.
Despite their potential, ML models are considered a black-box due to their lack of transparency in explaining the predictions.This has resulted in the application of explainable artificial intelligence (XAI) (Angelov et al., 2021;Minh et al., 2022) approaches such as Shapley Additive Explanation (SHAP) (Lundberg and Lee, 2017), Linear Interpretable Model-Agnostic Explanations (LIME) (Dieber and Kirrane, 2020) that allow the interpretability of model predictions, thus offering important insights and understanding of the ML model decisions (Zednik and Boelsen, 2022).
Our approach in this study, entitled S-MESH (Satellite-and ML-based Estimation of Surface air quality at High resolution) for NO 2 , trains and evaluates an XGBoost model for estimating high-resolution surface NO concentrations using TROPOMI observations over the large, continentalscale study area of Europe.XGBoost is known to be computationally fast and efficient.The objective is to take advantage of the prediction speed and evaluate the robustness of the model across temporal and spatial dimensions.While there are previous studies using ML models for NO estimation at local or regional levels, in this study we derive daily NO information at 1 km resolution at the continental scale for Europe between the years 2019-2021.We further focus on using satellite observation-based input features in the ML model as much as possible and evaluate their importance in estimating surface NO 2 using an XAI approach.

Study area
The study domain covers all areas over land ranging between 26 • W-45 • E and 34-72 • N (Fig. 1) for those countries in Europe that report country-specific air pollution data to the European Environmental Agency (EEA).Thus, the study area is made up of 39 countries which include 30 nations under the European Economic Area consisting of European Union member states along with Iceland, Liechtenstein, and Norway.Other countries include Switzerland, United Kingdom, Turkey and six West Balkan countries.With a population of ~550 million inhabitants, there is strong presence of metropolitan areas, industries and dense road networks over Europe influencing the outdoor air quality.

Datasets
This study uses various datasets representing factors that influence NO 2 such as the spatiotemporal distribution of NO 2 (station measurement and tropospheric NO 2 column density), location and pattern of anthropogenic activities (night light radiance), meteorological conditions (Planetary Boundary Layer Height (PBLH), wind speed, and temperature), impact of solar radiation (solar zenith angle), type of land use and change (normalized difference vegetation index (NDVI)), and topography (elevation).These datasets are obtained from different sources such as satellite instruments, reanalysis products, and station measurements.The Google Earth Engine (GEE) data platform (Gorelick et al., 2017) is used as the single data source to obtain all input datasets except Reprocessed (RPRO) TROPOMI data and station observations.The study period ranges from 2019 to 2021 with high resolution daily surface NO 2 maps derived for all three years.Datasets from 2019 to 2021 are segregated such that data from 2019 to 2020 are used for training and testing the model.To independently evaluate the model's sensitivity to the training years, the 2021 test dataset is evaluated separately at overall level.
Table 1 provides an overview of all input features used here.All of them are discussed in more detail in the following sections.

Sentinel-5P TROPOMI observations
The S5P satellite carrying TROPOMI was launched on 13 October 2017 as a Copernicus mission with the aim of monitoring the atmosphere at higher spatial resolution than its predecessors such as the Ozone Monitoring Instrument (OMI) on the Aura satellite.The TRO-POMI instrument onboard the S5P satellite measures top-of-atmosphere radiance from which Level-2 products of atmospheric trace gases such as NO 2 , ozone, methane, formaldehyde, and UV radiation are produced.TROPOMI, being a nadir-viewing push-broom instrument, covers a swath of 2600 km with an ascending node equator overpass time of approximately 13:30 local solar time (Veefkind et al., 2012).TROPOMI provides daily global coverage of NO 2 columns at a spatial resolution of 7.5 × 3.5 km 2 until 6 August 2019 and at an improved spatial resolution of 5.5 × 3.5 km 2 thereafter.
The TROPOMI spectrometer measures in the ultraviolet (UV), UV-visible (UV-VIS), near-infrared and shortwave infrared (SWIR) wavelengths, of which the UV-VIS bands ranging between 405 and 465 nm are used to retrieve NO 2 slant column densities using a Differential Optical Absorption Spectroscopy (DOAS) method (van Geffen et al., 2020).Separation of stratospheric and tropospheric columns is achieved using the Stratospheric Estimation Algorithm from Mainz (STREAM).The slant column densities of NO 2 comprising both the stratosphere and troposphere, are converted to vertical column densities using air mass factors, which are obtained from the global chemical transport model

Table 1
List of input features used in the ML model along with their original data source.The temporal features such as day of the year and weekend indicator are derived based on the S5P measurement timestamp.(Eskes et al., 2023).To maintain consistency across years, a RPRO TROPOMI NO 2 dataset based on version 2.4 is made available.This study uses RPRO version of TROPOMI NO 2 data available from dataspace.copernicus.eu.Here we filter TROPOMI data based on quality assurance flag and an uncertainty threshold.Quality assurance (QA) flag (QA > 0.75) as recommended by the S5P retrieval team (Eskes et al., 2022) is used, which ensures only good-quality retrievals for further analysis.Uncertainties in the retrieval are higher under cloudy conditions and this filtering removes scenes covered by clouds (with a cloud radiance fraction >0.5), scenes partially covered by snow and ice, as well as other erroneous or problematic retrievals.

Air quality station measurements
Observations of ambient NO 2 from approximately 4100 ground monitoring stations distributed across Europe are reported regularly to the EEA by all cooperating countries.Approximately 62% of the stations are situated in urban regions.From Fig. 1, which shows the distribution of stations, we can observe comparatively lower densities of stations over Latvia and around most of the Nordic countries like Norway, Sweden, Finland, and Iceland.We obtained the validated hourly average measurements of surface NO 2 from the EEA e-Reporting database (EEA, 2021b).After filtering the data to obtain only valid and verified observations based on the quality flags and those matching the satellite overpass time, data from 3044 stations are considered.The level of uncertainty in NO 2 reference measurements conducted at air quality monitoring stations across Europe exhibits variation, stemming from differences in the equipment utilized and the underlying measurement principles employed.Nonetheless, all measurements carried out in member countries must adhere to the guidelines stipulated in the EU Air Quality Directive (EU 2008/50/EC), which mandates a maximum relative expanded uncertainty of 15% for regulatory NO 2 measurements.In practice, it can be reasonably assumed that air quality monitoring stations typically maintain uncertainties well below this threshold (European Parliament and the Council of the European Union, 2008).

VIIRS Nighttime lights
The low-light imaging data obtained from the day/night band (DNB) detector onboard the Suomi National Polar-orbiting Partnership (Suomi-NPP) satellite's Visible Infrared Imaging Radiometer Suite (VIIRS) instrument operated jointly by the National Aeronautics and Space Administration (NASA) and the National Oceanic and Atmospheric Administration (NOAA), provides a measure of night light radiance.This represents artificial lights present on the Earth surface, mostly from the electric lights of human settlements and human activities (Elvidge et al., 2017).The spatial patterns of nighttime lights are strongly correlated with population density and industrial activity and thus this dataset acts as a spatial proxy for various anthropogenic activities causing NO 2 emissions (Levin et al., 2020;Liu et al., 2020).The VIIRS nighttime lights observations are measured during night at a spatial resolution of 500 m and with a daily local overpass time of approximately 1:30 a.m.Because of infiltration of lights from unwanted sources, the night light radiance is corrected to remove background noise, the effect of solar and lunar illumination, and auroras to some extent.During summer solstice, the data is affected by strong solar illuminations resulting in a decrease in its availability towards the northern pole.To improve the data coverage, to reduce the effect from ephemeral lights, and because we assume that nighttime light values typically do not change rapidly with time, we use monthly cloud-free night light radiance composites which are corrected for stray lights and provided by the Earth Observation Group (Elvidge et al., 2013).The stray light corrected version improves the availability of datasets near the northern part of the study area (Mills et al., 2013).

MODIS NDVI
The Normalized Difference Vegetation Index (NDVI) is an indicator that generally quantifies the presence and health of green vegetation (Pettorelli et al., 2005).NDVI is derived using the reflections of an entity in the near infrared and red wavelengths.Due to the different absorption and surface reflectance properties of land cover in these wavelengths, NDVI is also a good indicator of different land cover types such as builtup versus vegetated areas (Akbar et al., 2019) and as such can act as an indirect indicator for NOx emissions.NDVI information can be derived using radiance from satellite sensors that have measurements in the required wavelengths such as the NASA MODIS sensors on-board the Terra and Aqua satellites.For this study, the 16-day global NDVI product from MODIS-Aqua at 1 km spatial resolution is used (Didan, 2021).Since we intend to identify land cover types and seasonal vegetation changes from NDVI, which indirectly refer to NOx sources but has lower temporal changes (Eisfelder et al., 2023), monthly average NDVI composites from this dataset are derived for providing a temporally dynamic distinction between urban and rural areas.

Meteorological parameters
Meteorological parameters influence the concentration and transport of surface NO 2 concentrations to a great extent.This study uses the meteorological parameters temperature, wind velocity, wind speed, and planetary boundary layer height (PBLH).Hourly temperature and wind information are obtained from ERA5-Land (Muñoz-Sabater et al., 2021), which is a reanalysis product generated from the land components of the European Center for Medium-Range Weather Forecast (ECMWF) ERA5 climate reanalysis (Hersbach et al., 2020(Hersbach et al., , 2023)).ERA5-Land is available at an enhanced spatial horizontal resolution of 0.1 • (~11 km) compared to ERA5, which is provided as gridded dataset with a spatial horizontal resolution of 0.2-0.25 • .
PBLH usually varies from a few meters to several kilometers depending on the location and weather conditions (Garratt and Hess, 2003).Since GEE does not provide the ERA5 PBLH product, this study uses PBLH data from National Centers for Environmental Prediction (NCEP) Climate Forecast System Reanalysis (CFSR).The NCEP-CFSR is a global coupled atmosphere-land-surface-sea ice system that provides data at a global horizontal resolution of CFSR, which is ~38 km (Saha et al., 2010).As both the reanalysis products are based on similar assimilated observational data, we assume consistency between the two sources (Guo et al., 2021) (see supplementary table 1).

Topographical features
Topography can impact transport and trapping of surface pollutants.Europe has a varied topography ranging from flatlands to mountainous regions.To consider the impact of topography, elevation of a location and the variation of elevation in terms of standard deviation within a 5 km neighborhood of a location is used in this study (Mutlag et al., 2020).Elevation data is obtained from the space-based Multi-Error-Removed Improved-Terrain (MERIT) Digital Elevation Model (DEM) datasets which is generated at a spatial resolution of ~90 m by correcting the errors from previous DEMs such as Japan Aerospace Exploration Agency ALOS World 3D DEM, NASA Shuttle Radar Topography Mission-3 DEM, and Viewfinder Panoramas DEM (Yamazaki et al., 2017).

Other auxiliary factors
The concentration of NO 2 is highly influenced by anthropogenic activities and is further subject to photochemical processes in response to changes in sunlight throughout the year.The anthropogenic activities, especially traffic density has a strong variation between the weekdays (Monday-Friday) and the weekends (Saturday-Sunday) with an increased activity during the weekdays.To capture these variations and the effect of photochemistry, temporal indicators such as a weekend indicator, day of the year, and solar zenith angle from S5P TROPOMI observations are also used as input features.

Input feature extraction and pre-processing
The study uses 12 parameters as input features to the ML model (see Table 1) and ground monitoring station measurements as target features.For the purpose of training and testing, the corresponding feature values for the years 2019-2021 are extracted over the station locations.Datasets are segregated such that data from 2019 to 2020 are used mainly for training and testing the model in this study.To independently evaluate the model's sensitivity to the training years, the 2021 test dataset is evaluated separately at generic level.To generate daily surface NO 2 maps, single day input features are tabulated into a twodimensional structure and exported as raster to be used for prediction.For this study we use the GEE cloud platform to extract different input feature datasets from GEE data catalog, which acts as a single point of access for all data sources except RPRO TROPOMI data and station observations (Gorelick et al., 2017).GEE also provides tools for resampling and regridding the dataset into the required resolution.Input features derived from GEE as described in Section 2 are resampled and regridded to 1 × 1 km 2 resolution using the GEE javascript APIs.Features such as the MERIT DEM, NDVI, and VIIRS nighttime lights, which were originally at finer spatial resolution than the target grid size, are spatially aggregated to a common 1 km grid.On the other hand, features with originally coarser spatial resolutions than the 1 km target grid size (see Table 1) such as planetary boundary layer height, temperature, wind speed, and wind direction are bilinearly interpolated to match the common target grid of 1 km spatial resolution.RPRO TROPOMI data is regridded into the above target grid generated from GEE using the HARP tool (S[&]T, 2022).In addition to the recommended filter of 0.75 on quality assurance value, TROPOMI tropospheric column densities are also filtered to exclude any values that are below 0.5 × 10 15 molec/cm 2 , which is the approximate uncertainty in individual TROPOMI retrieval (van Geffen et al., 2022).This ensures the removal of negative and excessively small TROPOMI values, thereby eliminating potential observations that may be attributed to retrieval or measurement errors from the TROPOMI dataset (Qin et al., 2023).Due to gaps in VIIRS night light datasets during the months of May-August, an exponential smoothening as described in Zhao et al. (2017) is applied on the dataset for these months to generate more spatially complete information.Since ML models mostly assume a Gaussian distribution of the features, data transformations are performed on certain features based on their distribution (see supplementary table 2) to help in better mapping, improved speed and accuracy (Ahmed et al., 2010;Vishnu and Rajput, 2020).

Data preparation for station measurements
Measurements of NO 2 from air quality monitoring stations were obtained from the EEA database and filtered to retain only observations which are marked as validated and verified by the reporting countries.While the station measurements are available as hourly averages, TROPOMI provides one to two instantaneous measurements per day at a given location.Since the NO 2 values can vary strongly within a day, we focus on obtaining the station measurement which is most representative of the surface NO 2 concentrations during the satellite overpass time.Hence, we calculated a weighted average of those two consecutive hours of station measurements that are temporally closest to the satellite overpass time.This is shown in eq. 1, where G wa represents the weighted average of station measurement, sm i and sm i+1 are the station measurements of two consecutive hours starting at i and i + 1, sat ot is the satellite overpass time, and t i/2 and t i+1/2 are the central time of the consecutive station measurements.

Sampling strategy and XGBoost model training
To avoid bias towards data from training stations, ground monitoring stations are divided into two sets such that 80% of the stations are reserved for training and 20% stations are reserved for evaluating the model.Hence, out of the total 3044 stations, 2435 stations are reserved for training and 609 stations for testing.To choose a non-overlapping station split, the stations are randomly grouped into training and test stations and evaluated based on spatial overlap extent and empirically for non-overlapping and distributed representation of the station types (Hengl et al., 2018;Pohjankukka et al., 2017).For a given split of train and test station, a model is trained and evaluated (using Mean Absolute Error (MAE)) using the corresponding training and test station datasets.The process of splitting stations and evaluating them is repeated multiple times with data from different sets of train and test stations to find the bias and variance and model's sensitivity to different datasets.If the variance between the group's performance is not high and spatial overlap between train-test station locations are low, one of these groups is chosen as the final set of train-test stations (Kohavi and Edu, 1993).The data from the final train and test station grouping then form the final train and test datasets.This consists of 12 input features and the corresponding station observations of NO 2 as the target feature.
Further, the data from each training station is stratified and equally sampled based on four seasons (winter, spring, summer, autumn).Season is chosen as a base for stratum to have a representative data of the whole year and to avoid bias towards any particular seasonal or pollution pattern (Huang et al., 2022).This is based on the knowledge that NO 2 has a strong seasonal variation (also seen in supplementary fig. 1, showing NO 2 along with other input features depicting seasonal variations).Furthermore, seasonal changes also have an impact on the TROPOMI data availability, especially due to the cloud cover during winters (Goldberg et al., 2021) .
The training datasets are provided to the XGBoost regression algorithm for learning the non-linear relationship between the 12 input features and the target surface NO 2 station measurements.XGBoost, an ensemble machine learning algorithm, follows a gradient boosting technique to build an ensemble of trees and is designed to be computationally efficient (Chen and Guestrin, 2016).XGBoost works by sequentially learning the error in prediction from the previous trees and minimizing the loss function while constructing the next sequence of trees.This is realized by performing a first and second order derivative that minimizes the sequential prediction errors.The amount of learning or overfitting of the ML models are controlled by hyperparameter tuning (Yang and Shami, 2020).In this study, the hyperparameter values for the XGBoost model are obtained using a Bayesian optimization (Joy et al., 2016) and grid search cross validation methods (Bergstra et al., 2011).The Bayesian Optimization approach is used to define the initial range of values for the grid search method.The internal evaluation metrics of the objective function during hyper parameterization are set to evaluate the predictions and station observations in the original dimension (back transformed from log) as this defines the actual model improvements.Moreover, the hyperparameters are not finely adjusted to enable building a more generalized model over all of Europe.Supplementary table 3 shows the different variables used in the hyperparameter tuning.The practical implementation was carried out using the Python library 'xgboost' version 1.6.1 and its corresponding regression instance (Chen and Guestrin, 2016).

Model evaluation and interpretation
Evaluation of the trained XGBoost model is performed on the test datasets from 2019 to 2020 across different dimensions such as concentration levels, time, space, and overall performance.The error measures Absolute Error (AE) and Bias (B) are used to evaluate the model using statistical estimators like mean and median.Relative error (AE / observed value) or Relative Bias (B / observed value) is further used as measures when comparing the model performance across different dimensions.Additionally, Spearman's rank correlation coefficient (ρ) is used to measure the correlation between the model prediction and actual surface NO 2 measurements.To understand the performance in terms of spatial patterns and current air quality models, a comparison of annual average predictions of the year 2020 is made with the annual average NO 2 from CAMS European air quality reanalysis.
To interpret the behavior of the model in terms of its features and their interactions, we use an XAI framework called SHAP (Lundberg and Lee, 2017).The SHAP method follows a game theory approach to calculate the impact of a feature in estimating the final output for every prediction and explains them in terms of both magnitude and direction.The impacts are fairly distributed by calculating the difference between the model prediction with and without the feature value (Lundberg et al., 2020).Thus, the SHAP value of a feature indicates the marginal contribution of the feature to a prediction when compared to the base prediction (average prediction) of the dataset.Averaging the absolute SHAP values for each feature across all predictions provides an overall impact of the feature and is used in this study to evaluate the feature importance.SHAP allows for both individual and global interpretation of feature contribution.This study uses SHAP values to further understand the contribution of features, verify the correctness of the model based on domain knowledge and evaluate the importance of input features, mainly satellite observations in modelling the surface NO concentrations.

Model performance evaluation
The S-MESH model is built and tested using the dataset spanning the time period 2019-2020.The train and test datasets were obtained from non-overlapping train-test stations as described in Section 3.3.Low variance and low spatial overlap of train-test station segregations are observed in Supplementary fig. 2. Fig. 2 shows the overall performance of the final trained and hypertuned XGBoost model in predicting surface NO 2 concentrations over Europe.The model has an MAE of 7.77 μg/m 3 , median AE of 4.43 μg/m 3 , RMSE of 13.07 μg/m 3 , mean (relative mean) and median (relative median) bias of − 2.82 μg/m 3 (32%) and 0.59 μg/ m 3 (6.38%)respectively and, ρ of 0.66 (Fig. 2a).The histogram of the model bias distribution is mostly Gaussian with majority of the bias values centered around zero and an interquartile range of 9.6 μg/m (Fig. 2b).Data from the test stations corresponding to the year 2021, which is not used in training the model, is additionally evaluated to verify model's unbiased performance towards the year of training.Similar performance is observed when validating against the test station datasets for the year 2021 with a slightly better MAE of 6.46 μg/m 3 and a median bias of − 0.7 μg/m 3 (Supplementary fig.3).concentration intervals as violin plots along with orange points/lines representing the corresponding median relative biases.At very low NO 2 levels of the range 0-10 μg/m 3 , the model has a low positive median bias of 2.97 μg/m 3 which represents a high median relative bias of 61%.As the concentration levels increase, there is an increase in negative bias.However, the median relative bias is within 40% for concentration levels <40 μg/m 3 .Specifically, at low pollution levels of 10-20 μg/m 3 the median bias is also − 1.17 μg/m 3 .But it represents a relative median bias of − 8.5%, which is much lower than the relative bias at all concentration groups.Supplementary table 4 summarizes the bias distribution values for all concentration intervals.
The station types and the areas they are located in indirectly represent the NO 2 pollution levels based on their proximity to pollution sources.The correlation of surface NO 2 values between the model prediction and ground measurements is depicted in Fig. 2d for different station types.Correlation is found to be comparatively higher over generally polluted urban and suburban regions with similar values ranging from 0.64 to 0.68, except at traffic station types where the NO 2 variations are high.Highly polluted traffic regions show relatively low correlations at suburban traffic stations where the correlation is lowest at 0.39.In cleaner background regions, the correlation values are around 0.57-0.68.
In terms of computational speed, the final XGBoost model with 97 trees requires ~10 s for training with ~1million data points and 12 input features on an 8-core (32GB RAM) AMD Ryzen 7 PRO 475 U with Radeon Graphics 1.70 GHz processor.The prediction of European-scale surface NO 2 as raster maps are performed on a computer cluster.Using 2 CPUs with 14 cores each, the trained model takes ~39 s and requires 150 GB RAM to predict surface NO 2 for a single map consisting of 37 million pixels measuring ~500 MB in file size.The overall execution time including reading the inputs in the form of raster files, performing the model predictions, and creating the corresponding output raster file is approximately 6 mins.

Model performance across space and time
Fig. 3a shows the median relative error in model prediction at all test stations based on day of the year aggregated over latitudes, where the latitudes are grouped based on the climatic zones as described in supplementary table 5. Most latitudes have lighter shades of blue representing relative errors <50% across different days.The shades of blue get darker towards spring, summer period, indicating an increase in relative error of upto 75%.At lower latitudes between 35 and 43 • N, few occurrences in darkest blue are also observed during the beginning of summer, representing some days when the relative errors are high and close to 100%.Such occurrences are distributed throughout the period for latitudes between 55 • -58 • N, including winter and autumn months.Fig. 3b shows a time series of the relative error over all test stations across days of the year and corresponding mean station NO 2 .The blue line shows the median relative error with the shaded region depicting the 95% confidence interval.It can be observed that the relative error ranges between 22% and 73% with few peaks in the spring/summer period showing relatively higher errors.The mean NO 2 concentrations in orange depict higher values >40 μg/m 3 during colder winter days and low NO 2 during warmer spring/summer days.Both the lines also highlight a seasonal error pattern and dominant inverse relationship between relative error and NO 2 concentration.Supplementary fig. 4 shows a similar relationship across latitudes with higher winter and lower summer NO 2 concentrations at higher latitudes.
To understand how the generalized model performs across the study area, the relative model error for all stations over Europe for the years 2019-2020 is aggregated into ~10kmx10km grid cells, which then represents the median prediction error of all stations within the grid (Fig. 4a).For comparison, corresponding station measurements of NO are averaged within the above grid cells (Fig. 4b

SHAP analysis for feature importance and interpretation
To assess the importance and impact of input features in the model prediction, this study uses SHAP values to rank and interpret the extent of the contribution of each feature (Fig. 5).Fig. 5a shows the mean of absolute SHAP values for the 12 input features.The mean SHAP values indicate the overall impact of each feature towards surface NO 2 prediction and are listed in the order of importance.According to the mean SHAP value ranking, TROPOMI satellite observations representing tropospheric column NO 2 have the highest impact on the prediction.Night light radiance from VIIRS, which are a measure of artificial lights on the earth, have the second highest impact on the prediction output.The NDVI, representing vegetation phenology and by extension land cover, is the input feature with the third largest impact on the prediction.Meteorological parameter such as planetary boundary layer height and the temporal weekend indicator, appear next in the impact ranking, followed by wind speed and elevation variation.The satellite-based datasets -TROPOMI tropospheric NO 2 column density, VIIRS night light radiance, and the MODIS NDVI, as well as the modeled meteorological PBLH parameter have much higher impact than the rest of the input features, i.e., over parameters such as day of the year, wind direction and temperature.
Fig. 5b allows us to understand the relationship of input features to model predictions through a "beeswarm" plot that depicts the distribution of local (individual) SHAP values of input features.Every dot/ point represents the SHAP value (local contribution) of a feature in an individual prediction.The vertical extent of the points indicates the density of a particular SHAP value for that feature, and the color variation represents whether the feature value is low (towards blue) or high (towards red).The locations of higher density point indicate the general impact of the feature which is reflected in the feature ranking.The positive and negative SHAP values indicate whether the input feature value contributes to higher or lower model prediction.It can be observed that, as would be expected, lower values of TROPOMI tropospheric NO 2 have a negative impact in prediction whereas higher values of the TROPOMI NO 2 observations contribute towards increasing the value of the prediction.A similar pattern can be observed for night light radiance which has a higher impact at higher radiance values, usually corresponding to regions with urban areas, road traffic and industries including power plants.
For the weekend feature, a clear distribution of positive and negative SHAP value can be observed for days representing weekday (blue dots) and weekend (red dots), respectively.Although higher individual SHAP values from day of the year can be observed compared to weekend feature in some instances, the latter has a higher average SHAP ranking due to higher density at higher SHAP values, thus indicating higher contribution in most scenarios from the weekend feature.The NDVI feature has an inverse effect on model predictions where higher values (associated with green vegetation) tend to decrease the prediction values and lower values (often associated with paved and built-up areas) lead to increase in predictions.A similar inverse effect can also be observed for windspeed.The meteorological parameter PBLH also shows an opposite correlation pattern with the model prediction, in that low boundary layers increase the predicted surface NO 2 .For other features such as elevation and day of the year, such clear monotonic correlations are not observed.Since NO 2 has a seasonal pattern during the year and day of the year indirectly represents these changes, observing a monotonic relationship of the SHAP value with the model prediction output would have indicated the incorrect learning of the relationship.Such local interpretations of the model help in understanding the To further understand the monotonic interaction of highly influential features, SHAP value dependency plots of TROPOMI tropospheric NO 2 and VIIRS night light radiance are shown (Fig. 5 (c-d)).The values of the input features are on the x-axis and the corresponding SHAP values are on the y-axis.The values of the most influential features are color coded, i.e., temperature and elevation for TROPOMI NO 2 and VIIRS night light radiance, respectively.The most influential feature for a given feature is obtained based on the shapely interaction index which considers the effect of features together after removing the individual effects (Molnar, 2020).These dependency plots of SHAP values provide two insights: 1) The change in local impact towards model prediction as the value of input feature changes 2) the interaction of a feature with its most influential feature.From Fig. 5c it can be observed that when the TROPOMI NO 2 values <45 μmol/m 2 , the impact on the model prediction is negative.Between 45 μmol/m 2 and 180 μmol/m 2 , there is a steep increase in SHAP value indicating a large positive increase in impact on the model prediction for small changes in TROPOMI values.This range represents the region of highest sensitivity of the input feature.These TROPOMI NO 2 values are representative of moderate pollution levels (Verhoelst et al., 2021).For higher TROPOMI NO 2 values >180 μmol/ m 2 , the sensitivity of the model to change in TROPOMI observations is lower but the impact on the model prediction is higher and in the positive direction.Beyond ca.200 μmol/m 2 , which usually represents very highly polluted regions (Verhoelst et al., 2021), the data density is low and SHAP value variation is almost constant.Furthermore, for a given value of TROPOMI tropospheric NO 2 column, the impact on the model output can be different based on the values of other input features.Here, we study the interaction of TROPOMI NO 2 observation with its most influential feature -temperature.In Fig. 5c it can be seen that, except for low values where SHAP is negative, for a given TROPOMI NO 2 concentration, higher temperatures have an influence towards a higher magnitude of the model output.This can be observed from the vertical blue (lower feature value) to red (higher feature value) color gradient for a particular value of TROPOMI NO 2 observation.Fig. 5d shows a similar dependency plot for VIIRS night light radiance, which depicts an increase in SHAP values for increased values of night light radiance.The most influential feature for night light radiance is elevation.No clear pattern of interaction is observed because most of the elevation features are in blue, indicating that most stations are located at lower elevation levels.Few high elevation data occur both over high and low night light radiance values, with higher SHAP values over regions with high night light radiance indicating urban regions at higher elevations.These higher elevation data are mostly found over Spain and Italy.
Fig. 6 shows the fractional contribution of each input feature in terms of average SHAP value at different concentration levels of NO 2 .It can be observed that the TROPOMI NO 2 feature dominates most concentration levels, accounting for ca.35% to 40% of contribution towards deriving the model output prediction.It can also be observed that the influence from the TROPOMI NO 2 and, similarly PBLH feature, tends to slightly increase with the concentration levels.On the other hand, contribution from VIIRS night light radiance decreases with concentration and is at an average of 18.13% (SD ± 4.2%).Similar contribution patterns are observed for wind speed and temporal variables such as day of the year and weekend indicator.Contributions of NDVI feature is at an average ~13%.
The feature values vary spatially, and consequently this results in a spatially varied magnitude of impact from these features.Here we focus on the spatial distribution of SHAP values for the top two ranked satellite input features, namely tropospheric column NO 2 density from TROPOMI and night light radiance from VIIRS.Fig. 7 shows the monthly average of SHAP values corresponding to the values of input features during November 2020 over central Europe.Though comparatively higher SHAP values are observed for TROPOMI NO 2 features in Fig. 7a (and also seen in Fig. 5), night light radiance exhibits significantly more detailed spatial patterns.The added value that can be derived from the higher resolution information of night light radiance can be particularly observed around road networks and urban centers such as, for example, Dijon, France (shown as rectangles in Fig. 7).Such spatial SHAP analysis helps in understanding the level of detail in prediction contributed by  the various input features, which cannot be solely obtained by feature ranking.

Comparison with CAMS reanalysis
In order to evaluate the performance of the ML-based model against an independent reference, we carried out a preliminary comparison against the CAMS regional reanalysis (Marécal et al., 2015).Fig. 8 depicts a comparison of the annual average ML model prediction of surface NO 2 with the annual average CAMS regional reanalysis surface NO 2 data for 2020 (CAMS data available at ads.atmosphere.copernicus.eu).Fig. 8a and Fig. 8b shows the annual average surface NO 2 patterns over the city of Warsaw in Poland at original spatial resolution of S-MESH model predictions (~1 km) and CAMS (~10 km) respectively.The ML-based maps show significantly more spatial detail around urban areas and along major roads when compared to the CAMS reanalysis.Fig. 8c shows a histogram of data distributions for annual averaged 2020 surface NO 2 at all station locations in Europe obtained from three different sourcesstation observations, S-MESH ML model prediction and CAMS Reanalysis.On comparing the annual averaged ML model prediction against corresponding station observations, a relative error of 24% is observed, while comparing the CAMS reanalysis against station observations the relative error is 37%.Performance of CAMS estimations are better than ML at low concentrations of <10 μg/m 3 .ML predictions are better than CAMS at medium concentration levels of 10-20 μg/m 3 and closely follow station observations.Beyond 25 μg/m 3 , underestimation of higher values can be observed in both CAMS Reanalysis and S-MESH model predictions.

ML Model derived surface NO 2 maps
Using input features such as S5P TROPOMI NO 2 satellite observations together with auxiliary datasets as well as measurements from ground monitoring stations as a reference, we employed the trained XGBoost model to derive daily map of surface NO 2 over Europe at ~1 km spatial resolution.Due to high errors of the model over Turkey (supplementary fig.6), predictions over this region are not considered.Fig. 9a shows the annual average of the RPRO TROPOMI tropospheric vertical NO 2 column density data over Europe during 2019 and Fig. 9b shows the model predicted annual average surface NO 2 map for the same year.It can be observed that the spatial patterns of NO 2 in Fig. 9a and Fig. 9b are similar but not the same.The TROPOMI-derived surface NO 2 map (Fig. 9b) provides significantly more spatial details of NO 2 .The annual average surface NO 2 map in Fig. 9b shows large number of hotspots over Germany, Netherlands, Poland, United Kingdom, and over the Po Valley region in northern Italy.We can observe strong NO 2 hotspots with annual concentrations above 25 μg/m 3 in and around major urban regions as expected.At a closer view around the city of Stuttgart in Germany (Fig. 9c and Fig. 9d), significant spatial variation of TROPOMI and surface NO 2 can be observed in and around the region containing various pollution sources.High surface NO 2 levels can be observed at locations such as airports, along the road networks, and over  urban clusters.Pollution levels are low in places with less anthropogenic activities such as forests and protected areas.Fig. 9c also shows the extent up to which the current 1 km resolution of the map can, in principle, capture the NO 2 variations within a city.This is limited to identifying a larger emission source or a 1 km section of the road but needs higher resolution maps to identify any finer emission sources.Fig. 9e shows the time series of daily averaged predicted surface NO 2 for the years 2019 and 2020 over Stuttgart for the days TROPOMI data is available.The predicted NO 2 for both years shows a seasonal pattern with low concentrations during spring and summer months (day of year from ~60-244).The annual average concentration during 2019 and 2020 is 16.9 μg/m 3 and 15.6 μg/m 3 respectively.The concentrations during 2020 are generally lower than during 2019 starting from February.During the months of March, April, May average concentrations in 2019 are 18.94 μg/m 3 ,14.53μg/m 3 , 13.65 μg/m 3 respectively, while in 2020 the corresponding monthly average concentrations are at 15.65 μg/m 3 , 11.07 μg/m 3 , 11.36 μg/m 3 , a clear indication of the effect of the Covid-19 lockdown impact on NO x emissions, which has been reported in many previous studies (see Section 1 for references).Fig. 9f shows the SHAP value distribution of features such as TROPOMI observations, VIIRS night lights, wind speed and weekend indicator over Stuttgart during the years 2019-2020.Higher SHAP values can be observed for TROPOMI input while, overall, decreased SHAP values for activity related features such as TROPOMI observation, night lights and weekend indicator are observed in 2020, during the lockdown period (also see supplementary table 6).For weekend indicator, more frequent dips in SHAP values are observed in 2020.
Surface NO 2 concentrations are strongly influenced by anthropogenic activities and meteorological conditions.Hence, they are known to exhibit strong seasonal and weekly variations.In Fig. 10, the TROPOMI-derived surface NO 2 concentrations are averaged across weekdays and weekends for the year 2020 and over the four different seasons in Europe.The TROPOMI-derived surface NO 2 dataset clearly demonstrates the typical annual and weekly cycle of NO 2 .At seasonal level, a gradual decrease in surface NO 2 concentrations can be observed from the spring (Fig. 10a, Fig. 10e) to the summer months (Fig. 10b, Fig. 10f) across all parts of Europe (except Iceland).In autumn (Fig. 10c, Fig. 10g), a large increase in surface NO 2 concentrations can be observed in certain regions, partially exhibiting average concentrations >25 μg/ m 3 .In winter (Fig. 10d, Fig. 10h), higher pollution patterns can also be observed in most parts of central and eastern Europe, while there is an ever more significant increase in NO 2 pollution over northern Italy.As would be expected, throughout the seasons the surface NO 2 concentrations are higher during the weekdays (Fig. 10a-d) when there is more anthropogenic activity compared to the weekends (Fig. 10e-h).The seasonal and weekly variations are captured by the trained ML model and can be clearly observed in Fig. 10, with high pollution episodes during winter and autumn and the overall highest pollution levels during winter weekdays (Fig. 10d).The lowest pollution levels can be found during weekends in the summer where at most locations the average NO 2 concentrations are <10 μg/m 3 .

Discussion
Using primarily satellite-based measurements such as TROPOMI NO 2 , VIIRS nighttime light data, and MODIS NDVI, as well as modelbased meteorological parameters and temporal variables, in S-MESH approach we trained an XGBoost-based machine learning model to estimate surface NO 2 concentrations over Europe for the years 2019 through 2021 at 1 km spatial resolution during the satellite overpass time.The model was trained using 12 input features and measurements from air quality monitoring stations as reference for the years 2019-2020.This resulted in a single generic model with applicability over large parts of the continental-scale study area.

Model performance
Evaluation of the S-MESH model predictions with test datasets from the year 2019-2020 shows that the overall MAE is 7.77 μg/m 3 , the median AE is 2.43 μg/m 3 , the mean and median bias are − 2.82 μg/m 3 and 0.59 μg/m 3 respectively, and the Spearman rank correlation is 0.66.
The larger MAE compared to median AE is mainly driven by higher error rates at high and very low concentration levels (>40 μg/m 3 & <10 μg/ m 3 ) while the low median error represent the model's general performance which is less influenced by the presence of few extreme errors (Fig. 4a).Similar overall performance is observed when evaluating against test datasets for the year 2021, further indicating the robustness of the model for the periods not used in training.This makes the model more suitable for exposure assessment studies and less suitable for identifying cleaner regions.The error rates of our model are comparable to those obtained by other studies that use machine learning models for this purpose (e.g., Chan et al., 2021;Chi et al., 2021;Ghahremanloo et al., 2021;Kim et al., 2021;Li et al., 2022).However, most of these previous studies are implemented over smaller regional domains compared to this study.
From the model error analysis, it can be observed that the model is overall negatively biased, and its predictive power depends on the concentration level of surface NO 2 (Fig. 2b, Fig. 2c).The negative bias in the model is partly due to underestimations in TROPOMI NO 2 (Wang et al., 2020b) and the fact that the TROPOMI NO 2 dataset with its comparatively coarse spatial resolution cannot capture the spatial detail of the often very local representativity point measurements by the monitoring stations (in particular traffic sites).The reprocessed version of the TROPOMI tropospheric NO 2 used here, which mainly influences the model predictions as seen from SHAP feature importance (Fig. 5a, Fig. 6), is overall negatively biased (median bias − 23%) with a positive bias in tropospheric retrievals of 13% over clean regions and negative bias of − 40% over highly polluted regions (Ialongo et al., 2020;Lambert et al., 2022;Verhoelst et al., 2021).The S-MESH model estimates for NO 2 follow a similar uncertainty pattern with highest accuracy between concentration levels of 10 μg/m 3 to 40 μg/m 3 , and a decrease in prediction accuracy at very high concentrations (>50 μg/m 3 ).SHAP analysis at high concentration levels confirms this dependency where the model derives most information from TROPOMI observations (see Fig. 5c and Fig. 6).But the predictions at higher concentrations that have higher relative errors (>50%) usually have lower values of TROPOMI NO 2 , thus leading to lower predictions (Supplementary fig.7a).Another reason for low model performance at higher concentration levels can be seen in the statistical distribution of NO 2 measurements from stations, where values <40 μg/m 3 are dominant.Though log transformations helped in improving the data distribution from skewed to nearly normal, lack of scenarios for very high concentration levels cannot be completely avoided.A similar argument can be made for the comparatively low correlation between model predictions and traffic station measurements in rural areas.One way to improve the model predictions towards less representative datasets is to artificially generate the underrepresented samples but this is an area that needs more research.For very low concentration levels (<10 μg/m 3 ), the model's relative bias is positive.A common pattern of overestimation can be observed in all previous studies at very low concentration levels.In our case, the bias is mostly high for concentration levels <4 μg/m 3 (See supplementary fig.7b).
While this is partly due to the positive bias in unpolluted regions introduced by TROPOMI observations, the effect of 'regression towards mean' on the ML model is also observed where very low concentration values that are away from the average training station observations are overpredicted.These can also further arise from lower station counts, especially at northern latitudes and measurement uncertainty from station observations.Though quality-controlled station observations are considered in the study, this uncertainty cannot be directly verified.However, these overestimated predictions are still reliable in terms of low median bias.
A spatial error analysis of the model shows relative errors <30% across most of the study area except over few distributed regions where relative errors are close to 100% (Fig. 4a).These high error regions also exhibit very low concentrations from station observations i.e., <4 μg/m 3 (Fig. 4b).This can also be observed in supplementary fig. 5 which makes a comparison between grid cells that show low and high relative errors.It can be observed that the high error grid cell has low average concentration of around 1-4 μg/m 3 which is in the region of the S-MESH model's low predictive power.The relatively low number of stations over Nordic regions lead to an incomplete spatial error evaluation over these regions but an overall low relative error across the study area shows the model's robustness over different regions.Across the latitudes, a seasonal error pattern is observed over time (days of the year) with higher prediction errors during low NO 2 pollution months of spring and summer and, lower prediction errors during higher pollution months of autumn and winter (Fig. 3b).This is due to the higher bias of the model at very low concentrations, which is further highlighted in the spring and mid-summer months.Countries such as Denmark, Lithuania, southern parts of Sweden which majorly form the region between latitudes 55-58 • N, generally have low NO 2 concentration levels ranging between 1 and 8 μg/m 3 and are not well predicted by the model (median absolute error of 5.5 μg/m 3 ).Such low concentrations commonly appear at this latitude, especially during spring, summer months.Similar patterns can be explained for >58 • N latitude regions and summer periods in 35-40 • N regions.Despite the high relative errors, these scenarios have lower median absolute errors (ranging between 4 and 7 μg/m 3 ) which still make these predictions useful for understanding pollution levels.
Machine learning models can be biased towards the more representative dataset in the training.In order to avoid such bias, we initially separated the ground monitoring stations into training and test station sets and then sampled an equal number of training points based on season.Segregating the datasets at the level of training and test station itself is advantageous as it helps in avoiding any location-specific or spatial bias during evaluation.Here we used the bias-variance tradeoff to understand the sensitivity of the model to different station groups.The relatively higher bias and low variance of the model against different station groups indicates its capability for generalization and robustness for different test sets.We followed a pattern-based approach to stratify the training datasets equally based on seasons as opposed to the traditional way of proportionally stratifying based on station types.This way, the stratification is focused on equally representing the seasonal patterns and not station area, which we observe has similar data distribution patterns (Supplementary fig.8) and can increase the bias towards urban stations.Our results show similar correlations across station types (Fig. 2d) but the seasonal bias could not be completely avoided due to the existing bias of the input features at different concentration levels.We are not aware of a best or a preferable approach from the literature for stratifying such datasets for ML regression-based surface concentration studies and this is an area that needs further exploration.

SHAP analysis of features to evaluate their importance
The study performs an XAI analysis using SHAP to evaluate the feature importance and to understand the model behavior.Since SHAP values consider the marginal contributions of each feature in every prediction, they provide a better insight in deriving feature importance compared to other traditional methods that are based on overall model performance.The mean SHAP values reveal that TROPOMI tropospheric NO 2 column density, contributing on average 23%, holds the highest importance among features for predicting surface NO 2 concentrations.This significance persists across various concentration levels, increasing linearly (Fig. 6) and indicating that the TROPOMI NO 2 data acts as the main source of NO 2 information for determining the surface NO 2 concentrations.VIIRS night light radiance is the second most important feature according to the SHAP value ranking.The feature plays an important role in deriving finer spatial patterns in and around the anthropogenic sources, as it provides a good spatial proxy for NO x emissions at a high level of detail and is available at significantly higher spatial resolution than the TROPOMI NO 2 data (as seen in Fig. 7).Since the night light feature provides the intensity of light at different locations, it represents a good spatiotemporal alternative to more static inputs such as population density, road networks or density, and temporal changes of anthropogenic activities.The NDVI feature representing a proxy for land cover and the amount of vegetation, and indirectly to NOx emissions to some extent, plays the third most important role in determining the model predictions, followed closely by meteorological feature such as PBLH and temporal feature indicating weekday or weekend.The ranking also represents the higher importance of spatiotemporal variables over purely temporal variables but a SHAP value of temporal variable such as weekend parameter on a similar level as that for PBLH highlights the possible importance of these single temporal indicators.While the spatiotemporal variables include indirect sources (TROPOMI NO 2 and VIIRS night light) and factors influencing NO 2 distribution (NDVI, PBLH and windspeed), the temporal variable (weekend) captures important patterns in NO 2 variations.The other temporal variable 'Day of the year' has comparatively lower contribution.This could be attributed to its correlation with solar zenith angle, which was included to capture the effect of photochemistry.However, removing the 'Day of the year' feature decreases the MAE between model prediction and actual observation by ~2%.A similar impact is observed by removing the temperature feature (see supplementary table 7).The static topographic features such as elevation variation play an important role considering the large topographical variations of the study area and ranks higher than some spatial-temporal variables.While the model error evaluation helps in understanding the model performance, the local interpretability of SHAP values (see Fig. 5b) helps in verifying the correctness of the model behavior.The positive correlation of features such as TROPOMI NO 2 , VIIRS night lights with the model prediction explains the direction of influence.Higher intensity from night lights indicates higher anthropogenic activities and hence a higher impact in the model prediction.A similar argument can be made about the TROPOMI NO 2 values.However, the partial dependency plots for these features (see Fig. 5c-d) also reveal that correlation is not linear but monotonic with higher sensitivity of TROPOMI NO 2 over regions that represent medium concentration levels.The least sensitivity is observed for TROPOMI NO 2 at very high pollution levels and very low concentration levels explaining the lower capacity of the feature in capturing the NO 2 variations at these levels, in line with the higher bias pattern of the TROPOMI retrieval.
An inverse correlation is observed for some variables such as NDVI and meteorological parameters such as PBLH and windspeed.This can be associated with what the range of NDVI values represent i.e., low NDVI values between 0 and 0.1 usually represent urban areas or barren lands and higher index values indicate the presence of vegetation and forests which are typically associated with low NO 2 levels due to the absence of NO x sources in the immediate vicinity.The interaction of night lights with elevation (Fig. 5d) does not reveal a strong interaction pattern.This indicates that elevation as a single feature in the model does not directly influence NO 2 predictions but becomes influential when combined with other factors.This is also an imitation of real-world conditions where depending on emission sources and meteorology, topography can influence the trapping or dispersing of NO 2 .The dominance of blue dots also indicates that the majority of the stations are situated in low-elevation regions.The role of the solar zenith angle as an indirect indicator of photochemical activity can also be established from the impacts observed towards higher prediction during higher values of the feature.Such an XAI analysis based on SHAP values increases the transparency of the trained model and allows for verifying its correctness by better understanding the input features and ensuring that meaningful physical relationships, rather than spurious statistical correlations, are the primary drivers of the model.SHAP value analysis further helps in understanding the capability of TROPOMI NO 2 and other input features in capturing the changes in NO 2 which can deviate from the usual pattern due to unexpected events such as lockdowns associated with the Covid-19 pandemic (discussed further in Section 5.4).

Model-derived spatial predictions
An initial comparison of annual averaged surface NO 2 patterns for 2020 between the S-MESH model predictions and the CAMS regional reanalysis shows similar spatial patterns over the city of Warsaw in Poland.However, the predictions from ML model capture substantially more spatial variations thus highlighting more detailed pollution patterns around urban areas and road networks than can be seen from the CAMS regional reanalysis (Fig. 8).Besides the TROPOMI observations, S-MESH derived spatial patterns are heavily influenced by high resolution input features such as night lights (also deduced from Fig. 7) and NDVI.This indicates the potential of using such high-resolution input features in ML models for further exploration in air quality studies.But it should be noted that CAMS reanalysis provides data at higher temporal resolutions.The data distribution comparison strongly highlights the strength of CAMS Reanalysis estimates at low background concentrations and weakness of S-MESH model at these low concentration levels of <10 μg/m 3 .Alignment of S-MESH model with station observations is good at medium concentration levels where CAMS estimates have more errors.This comparison helps identify the suitability of the models and highlights their drawbacks based on pollution level.
The model-generated surface NO 2 maps of Europe highlight many known hotspots with high NO 2 pollution levels (annual average > μg/m 3 ).These are distributed mostly over central Europe, particularly over countries such as Germany, France, Italy, Poland, Netherlands, Belgium, and the United Kingdom.The seasonal and weekly average maps show higher NO 2 concentrations during the colder months of the autumn and winter seasons, demonstrating the capability of the method in reproducing the typical temporal cycles of NO 2 .The model also captures the influence of different anthropogenic activities and thus NO x emissions between weekdays and weekends, where higher NO 2 concentrations can be observed during weekdays.Additionally, the seasonal maps show consistently >25 μg/m 3 concentrations of NO 2 over Paris, Milan and some regions over Germany during weekdays.As an example, the 1 km by 1 km map over the city of Stuttgart in Germany (Fig. 9c, d) shows the fine-scale spatial variability of NO 2 with higher concentrations over large roads, urban centers, airports, and lower concentrations of NO 2 over forests and similar natural areas without major NO x sources.Through these maps, we also establish that the model is suitable to capture NO 2 variations at an urban scale.The model time series predictions (Fig. 9e) over Stuttgart predicted lower concentrations during 2020 in comparison to 2019.This is due to the general impact of the Covid-19 lockdown in 2020 which resulted in reduced anthropogenic activities.Lockdowns were implemented in Stuttgart during the months of March, April, and May with slow reopening of normal activities from June.Stuttgart, known for its automobile industry and high traffic density, shows a decrease in predicted NO 2 concentration in 2020, especially during the lockdown months.SHAP value distribution of features such as TROPOMI NO 2 strongly demonstrates the contribution of this feature in monitoring unexpected events such as the Covid-19 lockdown that have significant impacts on the surface NO 2 concentrations.Fig. 9f shows lower SHAP value of TROPOMI observations in 2020, during the lockdown months of Stuttgart resulting in lower NO predictions than 2019.A similar impact can also be derived from decreases in the SHAP values of night lights and NDVI (Supplementary table 7).Based on SHAP value variations, especially seen during 2020, it can be said that the weekend indicator alongside other features in the model, acts not only as a temporal indicator but also as an indicator of NO 2 variation.High traffic density and industries are the main source of NO 2 in Stuttgart and the above four features seem to predominantly detect the changes due to these reduced activities from the source.However, this impact is not as strongly detected by night light radiance during the lockdown periods as expected.Wind velocity in Stuttgart is generally low and thus contributes to enhancing the NO 2 pollution levels from traffic (Petry et al., 2020).Therefore, this parameter plays an important role in terms of SHAP value in this scenario.The model predictions and SHAP analysis thus indicate that the model can capture unexpected events such as Covid lockdowns based on the current set of input features.Similar outcomes of the model can also be observed in prediction maps related to the Covid-19 lockdown (supplementary fig.9) and wildfire events (supplementary fig.10).Model predictions and SHAP time series also show a seasonal variation of NO 2 in prediction.

Limitations and recommendations
The study also identifies some limitations and recommendations that can be considered in future work.Some variations in input features could potentially help in further improvement of model accuracies.The meteorological parameters wind and temperature are taken from ERA5-Land at 10 m and 2 m height, respectively.We consider this to be sufficient keeping in mind that the study estimates surface NO 2 .However, if available, an integrated or average value of these parameters throughout the boundary layer can be considered but needs further exploration in terms of its added value on the model accuracy.Further, the study uses temporally averaged information from VIIRS night lights and MODIS NDVI.While monthly averaged NDVI products are representative of land use, weekly NDVI information could additionally help in better capturing some extreme events (Cohen et al., 2017).On the other hand, higher biases at very low concentrations could potentially be improved by using higher temporal resolution of the night light datasets, with the caveat the data could then be impacted by less coverage and urban glow.Other ensemble ML methods and utilization of TROPOMI NO 2 uncertainty information could also potentially help in decreasing the overestimations at low concentrations and needs to be explored.It should also be noted that the model estimates NO 2 variations corresponding to the satellite overpass time (usually between 9:30-14:30 local solar time).This was done to keep the model outputs conceptually as close as possible to the original satellite observations, however this also means that the diurnal variability cannot be captured.Further, limiting satellite retrievals of TROPOMI to cloud-free observations results in decrease of spatial coverage and underestimation of TROPOMI NO 2 .Filtering is required to ensure good quality datasets.Despite these limitations, S5P TROPOMI observations are an important spatiotemporal source for surface NO 2 estimations.In synergy with ML models such as XGBoost, single day prediction of surface NO 2 concentrations for the whole study area can be estimated within ~39 s, which is a substantial advantage, particularly for such large-scale applications.Furthermore, with the availability of higher resolution atmospheric information from satellites such as Sentinel-3 and Sentinel-5P, input features such as aerosol optical depth, ultraviolet aerosol index and other factors impacting NO 2 concentration can be used as potential inputs for exploration into ML models for similar studies (Li et al., 2023).With the advent of geostationary satellites such as Sentinel-4, the higher temporal NO 2 information from satellites and other input variables can be put into best use in ML models for capturing diurnal variations and further improving the accuracy of the model.

Conclusions
This study demonstrates the importance of satellite observations and the potential of machine learning models in estimating surface NO 2 over continental-scale study areas such as Europe.Through an explainable AI approach using SHAP values, we establish the reliability of the S-MESH model and understand the feature importance.The study shows that the most important contributor for deriving surface NO 2 concentrations is the TROPOMI NO 2 observations.Despite the 3.5 km × 5.5 km spatial resolution of the TROPOMI NO 2 product, the study demonstrates the possibility to derive higher resolution spatial maps of surface NO 2 (such as 1 km) by synergistically utilizing the higher spatial resolution of representative proxy features such as VIIRS night lights.The resulting predictions are able to provide significantly more spatial detail than the CAMS regional reanalysis but also highlight a weakness in capturing very low NO 2 concentrations using satellite NO 2 .Additional datasets and a model-based synergistic approach need to be further explored to improve this limitation.Using machine learning models such as XGBoost and its potential for fast estimations on complex datasets, the study demonstrates the applicability of ML models in such large-scale applications.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Fig. 1 .
Fig. 1.Study area and distribution of ground monitoring stations considered in the current study.The ground monitoring stations are divided into training (black circular points) and test station (orange circular points) sets for the purpose of unbiased validation of the ML model.The map background is made with natural earth.

Fig
Fig. 2c provides the distribution of model bias at different NO Fig. 3. Relative error distribution (%) across space and time evaluated using test datasets from 2019 to 2020 a) Heatmap of relative error distribution across latitude over time with lighter shades of blue representing lower errors and darker shades of blue representing higher errors.The white areas over the latitude 58 • -72 • N are days with no data due to cloud cover b) Relative error in percentage and observed NO 2 from test stations based on day of the year.The dark blue line represents the median relative error while the blue shaded region represents the 95% confidence interval.The orange line represents the average station observation of NO 2 .A seasonal pattern in error and NO 2 concentration can be observed.(For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 4 .
Fig. 4. Gridded spatial distribution of relative error and average station NO 2 concentration across Europe aggregated to a grid of ~10kmx10km for the years 2019-2020.To obtain better spatial coverage of the error distribution, both training and test stations are included.Due to limited station distribution, there are grids with no stations and hence no data.These are shown in grey.The black lines outline the study area and the countries within while the dotted line depicts the neighboring territory.a) Spatial error patterns in terms of median relative error of stations across Europe b) Average NO 2 concentration from all stations in each grid for the years 2019-2020.

Fig. 5 .
Fig. 5. Global and local interpretation of SHAP values for various input features a) Average absolute impact of input features on the model output based on average SHAP values.The x-axis represents the SHAP values of input features, and the y-axis has the input features ordered in descending order of impact b) SHAP value distribution of input features for every prediction.The x-axis shows the SHAP values of the features and y-axis has the input features arranged in the order of highest to lowest average SHAP values with input feature labels referenced from panel a.Every point in the horizontal row represents an instance of input feature from the training data and the vertical spread represents the higher density of features with a particular SHAP value.The value of features is indicated with colors where a color of blue represents lower feature value and a gradient towards red representing higher values of the features.The points that are distributed towards the direction of negative SHAP value contribute to decrease in value of prediction (c-d) Dependency plot of input feature and their interaction with the most influential feature.X-axis represents the value of the input feature and Y-axis the corresponding SHAP value.Each point represents an instance of input feature in training data.The points in panel c and d are colored based on the value of the interacting feature and not that of input feature.Points towards blue represent lower and towards red represent higher values of the interacting feature c) TROPOMI tropospheric NO 2 (μmol/m 2 ) vs the SHAP values.The color of the points represents the values of the most influential featurein this case temperature.d) VIIRS night lights (NWatts/cm 2 /sr) values vs. their corresponding SHAP values.The color of the points represents the values of the most influential featurein this case Elevation.(For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 6 .
Fig. 6.Input feature contribution in terms of percentage of total SHAP values for different NO 2 concentration levels grouped by increasing concentration range.

Fig. 7 .
Fig. 7. Monthly average of SHAP value for November 2020 of two input features over a particular region.The area includes parts of different countries such as Germany (labelled DE), Switzerland (CH), France (FR), Luxembourg (LU) and Belgium (BE) (a) shows SHAP values derived for TROPOMI Tropospheric NO 2 .Country boundaries (bold black lines) and main highways (black lines) are shown in the figure for reference.The black rectangle shows an urban cluster -Dijon in France and its surroundings (b) shows SHAP values for VIIRS night light radiance.The white box refers to the same place as that in panel athe city of Dijon and its surroundings.

Fig. 8 .
Fig. 8. Surface NO 2 concentration distribution of annual average 2020 from station observations, S-MESH model predictions and CAMS Reanalysis a) Map of annual average surface NO 2 from the S-MESH model predictions for 2020 around the city of Warsaw, Poland b) Map of annual average surface NO 2 from the CAMS Reanalysis.The black curves in panels (a) and (b) show the primary and secondary road networks as obtained from Open Street Maps.c) Histogram showing data distribution for 2020 from station observations (shown in yellow), S-MESH model predictions (blue) and CAMS Reanalysis (green).RE SMESH and RE CAMS here indicate relative error for S-MESH model predictions and CAMS reanalysis respectively on validating the 2020 annual average data with corresponding annual averaged station observations.(For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 9 .
Fig. 9. NO 2 concentration maps of Europe and Stuttgart in Germany for the year 2019.a) Shows the annual average of tropospheric NO 2 column density obtained from TROPOMI observations b) Annual average of surface NO 2 concentrations derived by averaging the daily predictions of the S-MESH model c) Annual average of tropospheric NO 2 column density obtained from TROPOMI observations over Stuttgart.A reference to this location is shown in panel b with a triangular marker.d) Shows the annual spatial pattern of S-MESH model derived surface NO 2 concentrations over a part of Germany, close to Stuttgart.High concentrations of NO 2 can be observed over major cities, towns and along the road networks.e) Time series of predictions over stations in Stuttgart for the years 2019,2020.The x-axis shows the day of the year and y-axis is the surface NO 2 concentration.f) SHAP value distribution of input features for predictions over Stuttgart.SHAP values for TROPOMI observation, VIIRS night lights, wind speed, weekend indicator across day of the year for 2019 and 2020 are shown.The change in SHAP values is observed in input features during the lockdown months (period highlighted in blue).Note the different y-axis scales of each feature to better capture the variations in SHAP values.(For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 10 .
Fig. 10.Seasonal average of S-MESH model derived surface NO 2 across weekdays and weekends for 2020/2021.The panels (a)-(d) show seasonal averages for the weekdays (Monday-Friday) and panels (e)-(h) show the seasonal averages over the weekend (Saturday-Sunday).Panels (a) and (e) represent the spring season, (b) and (f) represent the summer season, (c) and (g) represent the autumn season and, (d) and (h) represent the winter season.The white gaps during winter over northern Europe are caused by missing TROPOMI data due to extensive cloud cover and high solar zenith angles limiting the retrieval.The spring season is defined as March, April, May (MAM) of 2020; summer season as June, July, August (JJA) of 2020; autumn as September, October, November (SON) and winter months spread across December 2020, January-February 2021 (DJF).
van Geffen et al. (2022)rvations have an approximate uncertainty of 0.5 × 10 15 molec/cm 2 for individual retrievals.The detailed retrieval algorithm from the initial measurement to the final tropospheric column density product is described invan Geffen et al. (2022).The final measurement products are available as offline and near-real-time Level-2 data products from Copernicus Data Space.Changes in TROPOMI NO 2 version 2.2 led to a significant spike in tropospheric NO 2 over cloud-free polluted regions.Later changes in version 2.4 derives surface albedo climatology from TROPOMI observations, leading to substantial changes in NO 2 over vegetation