Parsimonious Random-Forest-Based Land-Use Regression Model Using Particulate Matter Sensors in Berlin, Germany

Machine learning (ML) methods are widely used in particulate matter prediction modelling, especially through use of air quality sensor data. Despite their advantages, these methods’ black-box nature obscures the understanding of how a prediction has been made. Major issues with these types of models include the data quality and computational intensity. In this study, we employed feature selection methods using recursive feature elimination and global sensitivity analysis for a random-forest (RF)-based land-use regression model developed for the city of Berlin, Germany. Land-use-based predictors, including local climate zones, leaf area index, daily traffic volume, population density, building types, building heights, and street types, were used to create a baseline RF model. Five additional models, three using the recursive feature elimination method and two using a Sobol-based global sensitivity analysis (GSA), were implemented, and their performance was compared against that of the baseline RF model. The predictors that had a large effect on the prediction as determined using both methods are discussed. Through feature elimination, the number of predictors was reduced from 220 in the baseline model to eight in the parsimonious models without sacrificing model performance. A comparison of the model metrics showed that the GSA_parsimonious model performs better than the baseline model and reduces the mean absolute error (MAE) from 8.69 µg/m³ to 3.6 µg/m³ and the root mean squared error (RMSE) from 9.86 µg/m³ to 4.23 µg/m³ when the trained model is applied to reference station data. The better performance of the GSA_parsimonious model is made possible by the curtailment of the uncertainties propagated through the model via the reduction of multicollinear and redundant predictors. The parsimonious model validated against reference stations was able to predict the PM2.5 concentrations with an MAE of less than 5 µg/m³ for 10 out of 12 locations.
The GSA_parsimonious model performed best in all model metrics and improved the R² from 3% in the baseline model to 17%. However, the predictions exhibited a degree of uncertainty, making the model unreliable for regional-scale modelling. The GSA_parsimonious model can nevertheless be adapted to local scales to highlight the land-use parameters that are indicative of PM2.5 concentrations in Berlin. Overall, population density, leaf area index, and traffic volume are the major predictors of PM2.5, while building type and local climate zones are the less significant predictors. Feature selection based on sensitivity analysis has a large impact on the model performance. Optimising models through sensitivity analysis can enhance the interpretability of the model dynamics and potentially reduce computational costs and time when modelling is performed for larger areas.


Introduction
Around 10% of the people living in Berlin, Germany, reside in areas with a very low or low development index [1], which is directly associated with a higher risk of exposure to particulate matter (PM) and other air pollutants [2]. Located at the mid-latitude ranges in the Northern European plateau, Berlin is the largest German city and the most populous city in the European Union [3]; thus, the local population is subject to significantly increasing health risks due to poor air quality. Berlin's climate is dominated by a modified maritime air mass originating from the Atlantic from a southwest to northwest wind direction [4]. Regularly occurring weather episodes with air flow from the east are typically associated with lower wind speed and overall elevated levels of PM concentrations [5]. In the last decade, the annual mean concentration of PM2.5 in Germany has decreased from approximately 20 µg/m³ to approximately 7 µg/m³. However, the World Health Organisation's recommended limit value of 5 µg/m³ (annual mean) is exceeded in 99.5% of the stations in Germany [6]. Globally, within urban areas, the sources of PM2.5 are a mix of local sources such as traffic, cooking, construction activities, power generation [7], formation of secondary particles [8], and long-range transport of dust [9]. PM concentrations in Berlin are attributed to both long-range transport and local sources [10,11]. In general, the largest contribution to the PM2.5 concentration in Berlin is from locally produced sources such as traffic-related emissions, households, and industry, followed by long-range transportation of PM from other German cities and transboundary sources [11]. Knowledge of the microscale variability of PM, and thus the quantification of the exposure risk, could significantly reduce health risks and increase the quality of life and well-being of citizens [12,13]. This can be achieved through city-planning approaches that are informed by the knowledge of the relationship between urban structures and PM concentration [14,15]. In this study, we aimed to develop a prediction model for PM2.5 that can be used at the local and regional scales. We combined the advantages of machine learning (ML) and sensitivity analysis to develop a parsimonious random-forest-based land-use regression model.
Modelling of air pollution is carried out through land-use-based models [15-17], Bayesian network probability models [18], multi-objective scheduling models [19], satellite-based models using aerosol optical depth [20,21], meteorology-based models with regression and ML [22,23], and urban-scale models [24,25]. Land-use-based regression models (LURMs) are commonly used to understand how urban design, land use, and socioeconomic factors [15-17,26] interact with air pollution. LURMs are especially useful in locations where air pollution data are unavailable [27]. The spatial distributions can be derived by correlating air pollution with land-use-based predictors [28,29], local sources of emissions [28,29], and meteorology [30]. The review articles by Azmi et al. (2023) [16], Hoek et al. (2008) [15], and Ryan et al. (2007) [17] provide a comprehensive overview of LURM studies for air pollution parameters such as volatile organic compounds, nitrogen oxides, ozone, and PM. The coefficient of determination (R²) is used in most of the studies to compare the models and evaluate the model performance. Mean error and cross-validation techniques are also used as model evaluation metrics in studies that have employed ML methods to develop LURMs [31-33].
The applicability of LURMs beyond the extent of the official measurement networks has been the focus of many studies. For example, the study by Merbitz et al. (2012) [29] aimed to create a statistical model for PM10 and PM2.5 using a small number of predictors, including the traffic-emissions-based concentration profile (relative concentration to the source as a function of distance) and a land-use classification simplified to two categories: building density (all building classes grouped together) and green area (urban green and forest). They did not discriminate between residential and commercial buildings. Although all three predictors correlated with PM, it was concluded that the results were better for PM10 than for PM2.5, as PM10 is more directly influenced by local sources than is PM2.5. The model tended to largely underestimate concentrations in open areas, temporary traffic hotspots, and street canyons. Hoogh et al. (2016) [34] developed a LURM for PM2.5 and nitrogen dioxide data incorporating satellite and chemical transport modelling data. They used data from the European AIRBASE network and the European Study of Cohorts for Air Pollution Effects measurement sites air quality database as ground-based data sets. They used 80% of the monitoring sites to train and 20% to test and validate and achieved an R² of 0.6. However, the AIRBASE data set is now deprecated. The study by Ge et al. (2023) [32] used a least absolute shrinkage and selection operator (LASSO)-enhanced random forest (RF) model, called LASSO-RERF, which combines two ML models, to extrapolate PM2.5 data from the regulatory monitoring network and low-cost sensor network to sparsely monitored areas. Using this method, Ge et al. (2023) [32] could improve the R² from 0.49 to 0.65 and the RMSE from 3.56 µg/m³ to 2.96 µg/m³. Kumar et al. (2020) [35] also combined two ML models, extra-trees regression and AdaBoost, to develop a LURM using meteorological variables to estimate PM2.5 concentrations in Delhi, India.
Due to microscale variations in PM concentrations, it is a challenge to develop LURMs that are transferable to other areas [15]. For example, the LURM developed for 20 European cities by Eeftens et al. (2012) [31] could not be generalised for all cities and had to be locally optimised for each study area. The simplicity of air quality sensors (AQSs), together with their portability and longer battery life, helps in capturing the microscale variation in PM concentrations in more detail [36]. These kinds of studies are not feasible with conventional stationary measurements using reference-grade devices due to the higher costs involved in procuring and safeguarding the devices [37]. The usage of sensors to measure air quality has spread rapidly in the last decade due to the advances in microsensor technology, modern production facilities, and reduced development costs [38,39]. In contrast to reference-standard devices, AQSs are simple in design, lightweight, and easier to deploy in large numbers [40,41]. AQSs are also used in mixed networks in combination with satellite images and/or monitoring stations [42,43]. AQSs perform well under laboratory and stationary conditions but have a low temporal resolution of 30 to 60 min [44,45]. AQSs on mobile platforms need to operate with a higher temporal resolution in order to compensate for the vehicle speed and capture as many data points as possible. As a result, mobile AQSs are subject to higher uncertainties due to road conditions, vibrations, and wind [46]. Nevertheless, with good measurement practices as outlined in [39], AQSs offer the capability to generate high volumes of data at reasonably low cost. A LURM based on data collected from mobile sampling could potentially offer a highly cost-effective approach to modelling and mapping air pollution concentration levels in spatially continuous ways [47].
ML methods are widely used in PM prediction models using air quality data from officially run stations [48], AQSs [30,49], satellite images [21], meteorological data such as temperature and precipitation [50], and urban structures [30]. Several ML methods can be used for prediction of PM depending on the availability, type (images or text), quality, and quantity of data. Furthermore, the type of study, use of a time-series or cross-sectional design, and the relationship between PM and the variables used to predict it play an important role in determining the ML model to be used [51]. Since each ML model has its own strengths and weaknesses, many studies such as those by Murugan et [57] have used multiple models to compare and contrast the ML models against each other in order to find the model that fits best to the particular study case. The majority of these studies have concluded that random forest (RF) performs best. Tian et al. (2021) [57] concluded that tree-based models (RF and gradient boosting) and neural network models (back propagation neural network and Elman neural network) produce similar estimations but that RF has the best estimation accuracy. Support-vector machine and generalized additive models were also examined but were found to result in worse performance. Similar results were found in studies by Mandal et al. (2020) [58] and Chen et al. (2019) [59]. XGBoost has a good predictive accuracy [60] and can perform better than RF models [61,62]. However, these studies used stationary measurements for training and testing their model.
RF is a nonlinear estimator which fits an ensemble of decision trees on subsets of the data and calculates a mean over all decision trees [63]. By creating multiple subsets, RF improves the predictive accuracy, particularly for nonlinear data [64]. Compared to single decision trees, RF is better suited to handling high-dimensional data and preventing overfitting [65]. The RF algorithm is less sensitive to outliers and noisy data as compared to gradient boosting models (GBMs), such as XGBoost and LightGBM, which require the careful tuning of hyper-parameters [66,67]. The RF estimator function in the scikit-learn [68] package introduces randomness into the model through bootstrapping and through the splitting of each node during the construction of a tree. The quality of a split is based on criteria such as mean squared error (MSE), mean absolute error (MAE), and Poisson deviance. The overall variance of the model is therefore reduced by the randomness introduced to the model and by combining individual trees [68].
Despite its advantages, the RF estimator is a black-box model. The split criterion of the RF estimator helps in determining the most important predictors or features for the model but does not reduce the number of predictors used. This can affect the quality of the model, especially if there are uncertainties in the predictor data, as these uncertainties will be propagated through the model [69]. The second problem with such a model is the multicollinearity of the predictors. Although multicollinearity does not affect the predictive capacity of an RF model, it is a problem when the features need to be interpreted. The features determined as important by the model do not necessarily reflect reality, thereby limiting its value for generating interpretations and decisions. The study by Berrocal et al. (2020) [70] compared statistical and ML methods for creating national daily maps for ambient PM2.5 concentration in the continental United States and concluded that the statistical methods (such as kriging and statistical downscaling) outperform ML methods, including RF, since they explicitly account for spatial dependence while ML methods do not. The third problem is the computational time and cost of the model: the more predictors used, the more time needed for the model to run. This is especially exacerbated when the model needs to be run for larger regions.
By removing the inputs that have negligible influence on the output, the model can be simplified and made parsimonious [71]. Recursive feature elimination (RFE) and recursive feature elimination with cross-validation (RFECV) are two types of feature selection methods that are commonly used for reducing the number of predictors for ML models. RFE works by training the model multiple times and recursively removing the features of low importance until the desired number of features remains [72]. The desired number of features is chosen manually before the model is trained. RFECV, on the other hand, performs RFE in a cross-validation loop to produce the optimal number of features [73]. Despite their effectiveness, these methods do not provide information on the interactions between the predictors. Sensitivity analysis can be an important tool here to understand the dynamic behaviour of the model [74]. Saltelli et al. (2008) [69] define sensitivity analysis as the study of how uncertainty in the output of a model (predicted values) can be apportioned to different sources of uncertainty in the model inputs (predictors). Sensitivity analysis is performed either locally, where the sensitivity of the input parameters is analysed one at a time [75], or globally, wherein the effect of an input parameter by itself and through its interactions with the other input parameters is analysed. Several studies have used sensitivity analysis to enhance their models [76-78]. The studies by Todorov et al. (2023) [76] and Wang et al. (2023) [78] are particularly interesting, as they used Sobol-based sensitivity analysis to improve their air quality prediction models. The regional model developed by Todorov et al. (2023) [76] has a resolution of 10 km × 10 km but does not account for PM. Wang et al. (2023) [78] applied global sensitivity analysis (GSA) to an ML model combining a convolutional neural network and a long short-term memory model to study pollution trends before and during the COVID-19 outbreak.
In this paper, we address predictor selection methods in the context of an RF-based LURM for the prediction of maximum PM2.5 in Berlin, Germany. This was achieved by applying RFE, RFECV, and GSA using the Sobol method [69] to a baseline RF model to create a parsimonious RF model with fewer inputs, with validation being conducted via hold-out validation (HOV) [79]. The parsimonious RF model was also applied to selected locations spread across Berlin that correspond to where the regulatory measurement stations run by the Senate Department for Urban Mobility, Transport, Climate Action and the Environment, Berliner Luftgütemessnetz (BLUME), are located [80]. The predicted concentration using the parsimonious RF model was compared against the BLUME station data, which allowed us to examine the possibility of exclusively using AQS data for the LURM. The features of the parsimonious models and their influence on PM2.5 concentration in Berlin are discussed, since the World Health Organisation's recommended daily mean concentration of 15 µg/m³ (with a maximum of three to four exceedances per year) for PM2.5 [81] is exceeded in all of the BLUME stations.

Methodology

Data Acquisition and Preparation
For the model development, three types of data sets were needed, namely, PM data, land-use data for model development, and PM data for validating the developed model. The acquisition and preparation of each of the three data sets is described in the following sections. Figure 1 shows the methodology for this study.
PM Training Data Acquisition: Three different suburbs in the city of Berlin, Germany, namely Berlin-Hermsdorf (Hmd), Berlin-Charlottenburg in the vicinity of Ernst Reuter Platz (ERP), and Berlin-Adlershof (Adl), were chosen to collect data for the study (see Figure 2). Berlin-Hermsdorf is located on the northwestern edge of the city bordering Brandenburg. It is characterised by residential areas with a mix of old farm houses, villas and new residential buildings, green areas, and bodies of water. The streets Hermsdorfer Damm and Heinsestrasse are characterised by higher traffic volumes due to the former's proximity to the highway A9 and the latter acting as the commercial hub of Hermsdorf [82]. Berlin-Charlottenburg is located in the centre of the city. It comprises a mix of residential areas, tall buildings, high commercial activity, wide streets, and large park areas. ERP is a major traffic junction where five major streets (Hardenbergstrasse, Strasse des 17. Juni, Marchstrasse, Otto-Suhr-Allee, and Bismarckstrasse) converge. The Kurfuerstendamm area in Berlin-Charlottenburg is a commercial hub with a large traffic volume and a residential area with high population density [83]. Berlin-Adlershof is a suburban area located in the southeastern part of the city. Adlershof consists mostly of large residential areas. The street Rudower Chaussee and the vicinity around it consist mainly of research institutes and university and office buildings. The main federal road, Adlergestell (B96), and highway A113 are prone to high traffic volumes [84].
The measurements were carried out from 15 June 2018 to 15 October 2018 as part of an intensive observation measurement campaign in the Urban Climate Under Change [UC]² project [86]. The measurements at Hmd continued over the winter between 2018 and 2019 until 1 March 2019. A total of 7 rounds in ERP, 15 rounds in Hmd, and 16 rounds in Adl were carried out. However, only the measurements taken during the summer months (June to October) were considered for this study, limiting the measurements in Hmd to 2 rounds. By limiting the measurements to summer, variations in pollutant levels that would generally occur due to seasonal variations were removed. Additionally, the measurement rounds that were incomplete due to missing PM2.5 or GPS data, a change in weather conditions (sudden rain), or device malfunction were removed from the analysis. PM data were collected using an optical particle counter, OPC-N2, from Alphasense Ltd, Essex, United Kingdom [87]. Temperature and relative humidity (RH) were measured using two SHT35 sensors from Sensirion Ltd, Staefa, Switzerland [88]. The sensors, along with the data acquisition system, were built into a metal housing as described in [46]. The sensor ensemble was mounted inside the front basket of a bicycle. The measurements were carried out with a maximum bicycle speed of 15 km/h and a logging interval of 2 s. The sensor ensemble was calibrated in the laboratory and in the field [46] using an aerosol spectrometer from Grimm Aerosol Technik Ltd, Ainring, Germany [89]. The measured temperature and RH are used to account for the meteorological influences on the measured PM concentration. RH is used to correct for hygroscopic growth of PM via Koehler's theory, as shown in Equations (1) and (2). Equation (1) calculates the correction factor C using Koehler's factor (κ) and particle density (ρ). The correction factor is used to recalculate the PM concentrations, with RH being taken into account [90,91].
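Equations (1) and (2) are not reproduced in this section. As a hedged reconstruction only, a widely used form of the κ-Koehler humidity correction for optical particle counters is sketched below. RH is taken in percent, and the symbols PM_raw and PM_corr are notation assumed here; the exact expressions used in the study should be taken from [90,91].

```latex
\begin{align}
C &= 1 + \frac{\kappa / \rho}{-1 + \dfrac{100}{\mathrm{RH}}} \tag{1}\\[4pt]
\mathrm{PM}_{\mathrm{corr}} &= \frac{\mathrm{PM}_{\mathrm{raw}}}{C} \tag{2}
\end{align}
```

In this form, C approaches 1 for dry air and grows as RH approaches 100%, so the corrected concentration is reduced most strongly under humid conditions.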
The measured data of each round are individually handled during pre-processing. First, outliers, here defined as all data below the 5th percentile or above the 95th percentile, are removed, and a temporal median of 30 s is calculated. Then, the 5th percentile of the data is assumed to be the background concentration and subtracted from every measurement, and thus only the local concentration of PM is considered for further analysis [92,93]. Additionally, this step eliminates some of the ageing or sensor-drift-induced bias in the data between rounds, with a linear drift being assumed.
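The pre-processing steps above can be sketched as follows. This is a minimal illustration rather than the code used in the study: the function name, the use of non-overlapping 30 s windows, taking the background from the median-filtered series, and clipping negative local concentrations to zero are all assumptions of this sketch.

```python
import numpy as np

def preprocess_round(t_s, pm, window_s=30.0, pct=5.0):
    """Hypothetical per-round pre-processing: trim, 30 s median, background removal."""
    t_s = np.asarray(t_s, dtype=float)
    pm = np.asarray(pm, dtype=float)

    # 1) Drop values below the 5th or above the 95th percentile of the round.
    lo, hi = np.percentile(pm, [pct, 100.0 - pct])
    keep = (pm >= lo) & (pm <= hi)
    t_s, pm = t_s[keep], pm[keep]

    # 2) Median over consecutive, non-overlapping 30 s windows (assumed windowing).
    bins = np.floor((t_s - t_s[0]) / window_s).astype(int)
    labels = np.unique(bins)
    t_med = np.array([t_s[bins == b].mean() for b in labels])
    pm_med = np.array([np.median(pm[bins == b]) for b in labels])

    # 3) Treat the 5th percentile as background and subtract it;
    #    negative residuals are clipped to zero (assumption of this sketch).
    background = np.percentile(pm_med, pct)
    local = np.clip(pm_med - background, 0.0, None)
    return t_med, local, background
```

With a 2 s logging interval, each 30 s window holds about 15 raw readings, so a 10 min round yields roughly 20 median values.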
The PM2.5 data are time and GPS tagged. With the GPS information, an idealised route is created via calculation of the mean of all rounds in a given measurement area. The PM2.5 data are then interpolated to the ideal path, with all points within a radius of 25 m being considered. The interpolated point contains the information on the maximum, minimum, mean, and standard deviation of PM2.5 concentration within a 25 m buffer zone.
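The buffer aggregation can be illustrated with a short sketch. It assumes that route points and observations have already been projected to planar coordinates in metres (for example, UTM); the function name and the returned dictionary layout are hypothetical.

```python
import numpy as np

def buffer_stats(route_xy, obs_xy, obs_pm, radius_m=25.0):
    """For each route point, summarise the PM2.5 observations within radius_m."""
    route_xy = np.asarray(route_xy, dtype=float)
    obs_xy = np.asarray(obs_xy, dtype=float)
    obs_pm = np.asarray(obs_pm, dtype=float)
    out = []
    for p in route_xy:
        # Planar distance from this route point to every observation.
        d = np.hypot(obs_xy[:, 0] - p[0], obs_xy[:, 1] - p[1])
        vals = obs_pm[d <= radius_m]
        if vals.size == 0:
            out.append({"n": 0, "max": np.nan, "min": np.nan,
                        "mean": np.nan, "std": np.nan})
        else:
            out.append({"n": int(vals.size), "max": vals.max(), "min": vals.min(),
                        "mean": vals.mean(), "std": vals.std()})
    return out
```

Route points without any observation inside the buffer are returned with NaN statistics so that they can be dropped before model training.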
For each of the predictors, the maximum (max), minimum (min), and mean value for each buffer are calculated. For categorical predictors such as LCZ, LUC, and BT, the most frequently occurring value (cat_max) and the least frequently occurring value (cat_min) are calculated. This is because we use regular grid-point data for categorical variables. In the subsequent analysis, the predictors are named according to the combination of their abbreviation, the statistic used, and the buffer size, in the form [abbreviation of predictor name]_[statistic used]_[buffer size].

BLUME Station Data:
The validation of a trained ML model is usually carried out with a subset of the data set used, which is also known as hold-out validation (HOV) [79]. However, since this study used AQS data, we examined whether the model can be transferred and generalised to the official measurement stations run by the Berlin Senate Department for Urban Mobility, Transport, Climate Action and the Environment, Berliner Luftgütemessnetz (BLUME) [80]. Twelve locations in Berlin that correspond to the official measurement stations were chosen for validation. The BLUME stations that collect PM2.5 data are classified into three categories: urban background (Wedding, Neukoelln, Mitte), suburban (Grunewald, Buch, Friedrichshagen), and traffic (Messwagen-Leipziger-Strasse, Schildhornstrasse, Mariendorfer Damm, Silbersteinstrasse, Frankfurter Allee, Karl-Marx-Strasse). The LURM predictors within the 10 buffers are assigned to all 12 locations (BLUME_LURM). Similar to the AQS data, the local concentration of PM2.5 for each station is calculated by taking the lower 5th percentile as the background concentration and subtracting it. The local maximum is determined by calculating the median over the maximum concentrations of the days when the bicycle measurements took place (BLUME_val).

Model Development

Model Estimator
RF regression is an ensemble technique that makes use of multiple decision trees in determining the final output. Each decision tree is constructed by recursively splitting the training data into subsets based on the values of the model attributes until a criterion is met [63]. RF then performs a bootstrap aggregation, wherein the predictions of all the decision trees are combined in such a way that the overall variance of the model decreases [102,103].
The models were implemented using the scikit-learn package [68]. Each categorical variable was numerically coded by assigning a numerical value to each of its categories. The points in which any of the predictors had missing data were removed before the analysis. The predictors were not scaled via standardization or normalization, as the RF algorithm is insensitive to scaling [104]. The data set is split randomly into 70% (584 observations) and 30% (251 observations). The models are trained with the 70% split of the data and tested with the 30% split of the data. The data are split between testing and training in such a way that the distribution is maintained and data from both extremes are included. Figure 3 shows the data distribution for the whole data set (left), the training data (centre), and the testing data (right). The chosen 70% for training covers the entire range covered by the whole data set. The testing data cover the range from 1 µg/m³ to 100 µg/m³.
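As an illustration of the split and the baseline fit, the following sketch uses scikit-learn on synthetic data of the same size (835 observations). The synthetic predictors, the quintile-based stratification used to keep the target distribution similar in both splits, and the hyper-parameters are assumptions of this sketch and not the settings of the study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(size=(835, 6))                 # synthetic stand-in for the predictors
y = 20 * X[:, 0] + 10 * X[:, 1] ** 2 + rng.normal(scale=1.0, size=835)

# Bin the target into quintiles so that both splits cover the full range
# (stratifying a regression target this way is an assumption of this sketch).
bins = np.digitize(y, np.quantile(y, [0.2, 0.4, 0.6, 0.8]))
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=bins, random_state=0
)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
r2_test = model.score(X_test, y_test)          # R² on the hold-out set
```

With 835 observations and `test_size=0.3`, this split produces 584 training and 251 testing observations, matching the counts given above.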

Feature Selection
The aim of this study was to develop a parsimonious LURM that is comparable to or better than a baseline model consisting of all predictors. Model predictor selection is critical to developing such a parsimonious model. Two methods were used and compared in this study: recursive feature elimination with and without cross-validation and sensitivity analysis using the Sobol method. Both methods work by reducing the variance of the model.
Recursive feature elimination: Recursive feature elimination (RFE) is a feature selection method in which the predictors that do not reduce the "impurity" or the chosen criterion of the model, such as the mean squared error or absolute error, are removed. This is done by the iterative training of the model with the chosen estimator, RF in our case, until the desired number of predictors is reached [68]. RFE can be carried out in two ways: one where the number of predictors is given before training, and one using cross-validation (RFECV), where the data are divided into multiple subsets consisting of training and testing data sets and where the optimal predictors are derived as a mean of all the models.
In RFE, the model is trained using the training data as a whole. First, the model is trained with all the features. The features are then ranked, from 1 to the total number of predictors, based on how much influence they have on the model's prediction. The least important feature is then eliminated, and the model is trained again until the desired number of features is reached.
RFECV, on the other hand, uses a cross-validation technique wherein the training data are split into k subsets. In each of k iterations, the model is trained on k − 1 subsets and tested on the remaining subset. By using cross-validation, the model utilises the data efficiently, and the resultant model is more reliable [68]. To account for uncertainty, the Monte Carlo method [105] is applied, wherein the model is trained 1000 times with the random state parameter of the estimator set from 0 to 1000 (Monte Carlo runs). The features with a mean importance rank of less than 2 are chosen as the important predictors, and the model is retrained and validated.
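The two variants can be sketched with scikit-learn's `RFE` and `RFECV` classes. The synthetic data (in which only the first two columns carry signal), the number of trees, the five folds, and the MAE-based scoring are assumptions of this sketch, not the configuration used in the study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE, RFECV

rng = np.random.default_rng(1)
X = rng.uniform(size=(300, 8))
# Only columns 0 and 1 influence the synthetic target.
y = 10 * X[:, 0] + 5 * X[:, 1] + rng.normal(scale=0.5, size=300)

rf = RandomForestRegressor(n_estimators=50, random_state=0)

# RFE: the number of retained predictors is fixed in advance.
rfe = RFE(estimator=rf, n_features_to_select=2, step=1).fit(X, y)

# RFECV: the number of retained predictors is chosen by k-fold cross-validation.
rfecv = RFECV(estimator=rf, step=1, cv=5,
              scoring="neg_mean_absolute_error").fit(X, y)

selected_rfe = np.flatnonzero(rfe.support_)   # column indices kept by RFE
ranking_rfecv = rfecv.ranking_                # rank 1 marks a selected predictor
```

A Monte Carlo variant in the spirit of the text would repeat the `RFECV` fit with different `random_state` values for the estimator and average `ranking_` over the runs before applying the rank threshold.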
The three models, RFE, RFECV_baseline (all predictors), and RFECV_parsimonious (RFECV filtered predictors), are assessed and validated. Neither the RFE nor RFECV method provides information on higher-order effects.
Global sensitivity analysis: Global sensitivity analysis (GSA) is the process of apportioning the uncertainty of the output to the uncertainty of all the input factors of interest, thereby quantifying the importance of model inputs [76,106-108]. This allows for the identification of a parameter or a set of parameters that have the largest influence on the model output. For this study, a variance-based sensitivity analysis, also known as the Sobol method [69,109,110], was used to quantify the effect of each of the inputs on the model output variance. We refer to studies such as those by Sauter et al. (2011) [107], Todorov et al. (2023) [76], and Zhang et al. (2015) [77], who describe the principle behind the Sobol method in detail. The Sensitivity Analysis Library (SALib) [111] is used for the analysis.
In order to carry out the GSA, a parameter data set is created using the maximum and minimum value of each predictor. A large enough data set should be created in order to get the best results. The number of simulation runs (N) is determined using the desired coefficient of variation (CV) of 0.1 and a confidence interval width (w) of 0.01, taken from Table 2 of Byrne et al. (2013) [112] under the assumption of normally distributed data. Since the Sensitivity Analysis Library [111] recommends that the number of simulations ideally be a power of 2, N is rounded up to the next power of two, 2048. The number of data points (S) is then calculated using the number of predictors (D), as shown in Equation (3).
In this paper, we assume a uniform distribution for the model inputs. This is because we do not know the exact distribution of the model inputs that may exist in reality. By assuming a uniform distribution, we can generate a data set which has information covering the entire input space. When Equation (3) is applied, a data set (Sobol data) containing 442,368 data points is generated. The Sobol data set is used to predict the PM2.5 concentration, and the results are then analysed.
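Equation (3) is not reproduced in this section. As a plausibility check only, and assuming the Saltelli sampling scheme provided by SALib was used, the sample size is S = N(D + 2) when only first- and total-order indices are requested and S = N(2D + 2) when second-order indices are also computed. The short calculation below shows which number of predictors D each variant would imply for the reported values; which variant applies here is not stated in the text.

```python
N = 2048          # base sample size, a power of two
S = 442_368       # size of the Sobol data set reported in the text

# Number of generated rows per base sample point.
rows_per_base_sample = S // N

# Implied number of predictors D under each sampling variant.
D_without_second_order = rows_per_base_sample - 2        # from S = N (D + 2)
D_with_second_order = (rows_per_base_sample - 2) // 2    # from S = N (2D + 2)
```

The reported size is an exact multiple of N (216 rows per base sample point), which is consistent with either variant.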
Analysis of the inputs using the Sobol method offers insights into the interaction within and between the inputs to predict an output. Sobol analysis provides the sensitivity of individual inputs on the output as a first-order sensitivity (FOS) index and the sensitivity of an input including its interaction with all of the other inputs as the total-order sensitivity (TOS) index. Both indices range between 0 and 1, with 0 denoting no effect and 1 indicating a high effect on the model output variance. FOS equal to TOS indicates that the input does not interact with other inputs, whereas a large difference between FOS and TOS indicates higher-order effects (the input has a strong interaction with the other input parameters). A value of FOS or TOS equal to zero indicates that the input has no impact on the prediction of the model. Sensitivity analysis only identifies the impact of input variability on the model output but not its cause [69].
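To make the meaning of FOS and TOS concrete, the following sketch estimates both indices with plain NumPy for a toy function. It is not the SALib implementation used in the study: the function name, the pseudo-random (rather than quasi-random) sampling, and the specific estimators are assumptions chosen for a compact, self-contained illustration. The scheme requires N(D + 2) model evaluations for D inputs.

```python
import numpy as np

def sobol_indices(f, bounds, n_base=4096, seed=0):
    """Estimate first-order (FOS) and total-order (TOS) indices of f on a box."""
    rng = np.random.default_rng(seed)
    bounds = np.asarray(bounds, dtype=float)
    lo, width = bounds[:, 0], bounds[:, 1] - bounds[:, 0]
    d = len(bounds)

    # Two independent uniform sample matrices A and B over the input ranges.
    A = lo + width * rng.random((n_base, d))
    B = lo + width * rng.random((n_base, d))
    fA, fB = f(A), f(B)
    f_all = np.concatenate([fA, fB])
    mean, var = f_all.mean(), f_all.var()

    fos, tos = np.empty(d), np.empty(d)
    for i in range(d):
        AB = A.copy()
        AB[:, i] = B[:, i]                 # A with column i taken from B
        fAB = f(AB)
        # First-order estimator; centring fB reduces the estimator variance.
        fos[i] = np.mean((fB - mean) * (fAB - fA)) / var
        # Total-order estimator based on squared differences.
        tos[i] = 0.5 * np.mean((fA - fAB) ** 2) / var
    return fos, tos

# Toy model: additive in x0 and x1, independent of x2.
toy = lambda X: X[:, 0] + 2.0 * X[:, 1]
fos, tos = sobol_indices(toy, [(0, 1)] * 3, n_base=20000, seed=1)
```

For this additive toy model, the analytical values are FOS = TOS = 0.2 for x0, 0.8 for x1, and 0 for x2; equal FOS and TOS reflect the absence of interactions, and the zero indices of x2 mark an input that could be removed.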
Categorical variables are not directly supported in the SALib library. Therefore, the indices of categorical variables were rounded to the closest integer, as used in the general probabilistic framework [113]. One thousand RF models that were fit for the baseline model were applied to the Sobol data to predict PM2.5 concentrations. Each of the 1000 runs was analysed to produce FOS and TOS. The mean of the FOS and TOS over 1000 model runs was then calculated.
Through GSA, two models were developed, one with predictors with an FOS index greater than 0.01 (GSA_parsimonious) and a second one with the predictors from the GSA_parsimonious and street types (GSA_streets). The RF models were retrained with both GSA models 1000 times each and then validated.

Validation
Model validation is an essential part of model development for assessing the general performance and stability of the model. All six models (baseline, RFE, RFECV_baseline, RFECV_parsimonious, GSA_parsimonious, and GSA_streets) were compared to one another to identify the best method for this type of LURM study.
Hold-out Validation: Hold-out validation (HOV) is the simplest type of validation of ML models. The data set is divided into two sets: one for training and one for testing. The test data set is used to validate the model [79]. The features of the test data set are used to predict PM2.5, and the predictions are compared against the observed PM2.5 concentrations. HOV may be problematic if the test data set is not representative of the training data set. However, in this work, the range of the test data set was similar to that of the training data. Additionally, by means of Monte Carlo runs, the uncertainty of the model could be accounted for.
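A repeated hold-out scheme of this kind can be sketched as follows. The 70/30 split matches the proportions given in the caption of Figure 3; everything else (the function names, the `fit` callback, and the use of MAE as the collected metric) is an assumption made for illustration, and the `fit` callback stands in for training an RF model.

```python
import random
from statistics import fmean, pstdev


def holdout_split(rows, test_fraction=0.3, rng=random):
    """Shuffle the rows and return (train, test) lists."""
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def monte_carlo_holdout(rows, fit, n_runs=100, test_fraction=0.3, seed=0):
    """Repeat split, fit, and test; return mean and spread of the test MAE.

    rows: list of (features, target) pairs.
    fit: callable that takes training rows and returns predict(features).
    """
    rng = random.Random(seed)
    maes = []
    for _ in range(n_runs):
        train, test = holdout_split(rows, test_fraction, rng)
        predict = fit(train)
        maes.append(fmean(abs(y - predict(x)) for x, y in test))
    return fmean(maes), pstdev(maes)
```

The spread returned by `monte_carlo_holdout` is one simple way to express how much the test error depends on the particular split.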

BLUME Validation:
The training of the RF models uses AQS data. Since AQS data involve a degree of uncertainty regarding quality, we used data from the official air quality stations run by the city of Berlin as the model validation data (BLUME_val).
BLUME_LURM data were applied to each location, and the six RF models were used to predict the PM2.5 concentration. The predicted concentration was compared to the BLUME_val data for assessing and validating the models.

Model Assessment
The assessment of predictive models is crucial for evaluating the model performance and understanding its strengths and weaknesses. Different regression metrics, namely the coefficient of determination (R²), mean squared error (MSE), root mean squared error (RMSE), scatter index or normalised root mean squared error (NRMSE) [114], and mean absolute error (MAE), were used in this study to evaluate the models. The specific meaning of each of the metrics is detailed in Garreta et al. (2017) [115]. RMSE, MSE, and MAE evaluate the model by examining the absolute errors between the predicted and the true or observed values. MSE provides insight into outliers by squaring the difference between true and predicted values. RMSE is the square root of the MSE, which is easier to interpret as it has the same unit as the predicted variable. RMSE places emphasis on the larger errors. It is therefore an important metric for assessing a model that is biased in the upper extremes, which matters because there are legally set limit values above which there are penalties due to increased health risk. MAE provides a straightforward understanding of a model's prediction errors. It gives the average error magnitude of the model by calculating the mean absolute difference between the true and predicted values [115]. NRMSE normalises the RMSE by the range of observations, which allows the different models to be compared [114]. Lower NRMSE, RMSE, MSE, and MAE values indicate less residual variance for a model, thus suggesting a better model, with a value of zero indicating a perfect fit.
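The metrics described above can be computed with a few lines of Python. This is a minimal sketch under two assumptions that the text does not spell out: NRMSE is taken as RMSE divided by the range of the observations, and R² is taken as one minus the ratio of the residual to the total sum of squares. The function name is hypothetical.

```python
from math import sqrt


def regression_metrics(observed, predicted):
    """Return MAE, MSE, RMSE, NRMSE, and R2 for paired observations."""
    n = len(observed)
    errors = [o - p for o, p in zip(observed, predicted)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = sqrt(mse)
    # Assumption: normalise by the range of the observed values.
    nrmse = rmse / (max(observed) - min(observed))
    mean_obs = sum(observed) / n
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    # Assumption: R2 = 1 - SS_res / SS_tot.
    r2 = 1.0 - (mse * n) / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "NRMSE": nrmse, "R2": r2}
```

For errors of 1, 1, 1, and 9 µg/m³, the MAE is 3 µg/m³ while the RMSE is about 4.6 µg/m³, which illustrates how RMSE places more weight on the single large error.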

Feature Selection
The RFECV method is used to ascertain which features are important for developing the model. Monte Carlo runs are carried out, and the features with a mean importance rank of less than two are chosen as the important predictors. Figure 4 shows the coefficient of determination of the model when the model is trained with the particular predictor as its important feature. Each point represents the mean R² of 1000 model runs, and the error bars show the spread across the runs. Categorical data are indicated by "cat" after the name of the predictor. The number at the end of the predictor name indicates the buffer size in metres, and the term before it indicates whether the mean, minimum (min), or maximum (max) of the predictor within the buffer is considered.
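The selection rule can be sketched as below. The sketch assumes that each run yields one rank per feature and that rank 1 marks a selected feature, which is the convention of the `ranking_` attribute of scikit-learn's `RFECV`; whether the study used exactly this output is not stated. The function name is hypothetical.

```python
from statistics import fmean


def select_by_mean_rank(rankings, names, max_mean_rank=2.0):
    """Keep features whose rank, averaged over all runs, is below the limit.

    rankings: one list of ranks per Monte Carlo run, aligned with names.
    """
    mean_rank = {
        name: fmean(run[i] for run in rankings)
        for i, name in enumerate(names)
    }
    return [name for name in names if mean_rank[name] < max_mean_rank]
```

With this rule, a feature that is eliminated in a minority of runs can still be kept, as long as its rank averaged over all runs stays below two.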
To select features using GSA, the Sobol method is used. The mean over 1000 runs is calculated, and all inputs with an FOS greater than 0.01 (see Figure 5) are considered significant for the model. The higher-order effects are indicated by the difference between FOS and TOS. Population density shows the largest influence on the model output, followed by LAI and DTV. The error bars (black lines) indicate the spread of the sensitivity indices of each predictor over the 1000 model runs. The LURM features selected using RFECV, as shown in Figure 4, are also selected by GSA (Figure 5).
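The screening step amounts to a threshold on the mean FOS, with the gap between TOS and FOS reported alongside it. The following sketch uses a hypothetical function name, and the predictor names and index values in the usage example are invented purely for illustration; they are not results from the study.

```python
def screen_predictors(mean_fos, mean_tos, threshold=0.01):
    """Keep predictors with mean FOS above the threshold.

    mean_fos, mean_tos: dicts mapping predictor name to its mean index.
    Returns a dict with FOS, TOS, and the interaction share (TOS - FOS).
    """
    selected = {}
    for name, fos in mean_fos.items():
        if fos > threshold:
            tos = mean_tos[name]
            selected[name] = {"FOS": fos, "TOS": tos, "interaction": tos - fos}
    return selected


# Illustrative values only (not taken from Figure 5).
example_fos = {"predictor_a": 0.30, "predictor_b": 0.12, "predictor_c": 0.004}
example_tos = {"predictor_a": 0.45, "predictor_b": 0.20, "predictor_c": 0.010}
kept = screen_predictors(example_fos, example_tos)
```

In this example, `predictor_c` is dropped because its mean FOS is below 0.01, even though its TOS is non-zero.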

Model Validation
Six RF models, baseline (M1), GSA_parsimonious (M2), GSA_streets (M3), RFE (M4), RFECV_baseline (M5), and RFECV_parsimonious (M6), were trained and validated using HOV (Figure 6) and BLUME data (Figure 7). Each of the Q-Q plots in Figures 6 and 7 shows a 45° line in red, which indicates the ideal distribution line (1:1), and the metrics of the evaluated model. The model metrics show the overall performance of the models.
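A Q-Q plot compares the distributions of observed and predicted values rather than the individual pairs, which is why a model can follow the 1:1 line while still having a low R². The sketch below shows one way to compute the plotted points; the function name and the linear interpolation between order statistics are assumptions made for illustration.

```python
def qq_pairs(observed, predicted, n_quantiles=100):
    """Return (observed quantile, predicted quantile) pairs for a Q-Q plot."""

    def quantile(sorted_vals, q):
        # Linear interpolation between neighbouring order statistics.
        pos = q * (len(sorted_vals) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(sorted_vals) - 1)
        return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])

    obs, pred = sorted(observed), sorted(predicted)
    qs = [k / (n_quantiles - 1) for k in range(n_quantiles)]
    return [(quantile(obs, q), quantile(pred, q)) for q in qs]
```

Points above the 1:1 line indicate that the predicted distribution is shifted towards higher values than the observed one at that quantile, and points below it indicate the opposite.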
Figure 6 shows that all six models similarly follow the 1:1 line. The concentrations below 5 µg/m³ appear to be overestimated for the GSA_parsimonious (Figure 6 (M2)) and GSA_streets (Figure 6 (M3)) models, while those below 7 µg/m³ appear to be overestimated for the remaining models, from the baseline (Figure 6 (M1)) to the RFECV_parsimonious (Figure 6 (M6)) model. The models that contain all the predictors appear to follow the 1:1 line better than the parsimonious models. For concentrations above 30 µg/m³, all the models appear to underestimate the concentration. The model metrics MAE, R², and NRMSE show similar results for all six models, with a slightly better performance being observed for the GSA_parsimonious model (Figure 6 (M2)).
Figure 7 shows the data distributions of the BLUME station data and model predictions at the respective BLUME location for all the models. All the models show a bias towards higher values and a low R². The parsimonious models GSA_parsimonious (Figure 7 (M2)), GSA_streets (Figure 7 (M3)), and RFECV_parsimonious (Figure 7 (M6)) have distributions that are closer to the 1:1 line. As opposed to HOV, in the BLUME validation, the GSA_parsimonious model performs best in all the metrics, with an RMSE, MAE, NRMSE, and R² of 4.23 µg/m³, 3.61 µg/m³, 0.38, and 0.17, respectively, as compared to 9.86 µg/m³, 8.69 µg/m³, 0.9, and 0.03, respectively, in the baseline model (Figure 7 (M1)).
Figure 8 shows the residual plot of the observed BLUME PM2.5 concentration and the predicted concentration of the six models: baseline (M1), GSA_parsimonious (M2), GSA_streets (M3), RFE (M4), RFECV_baseline (M5), and RFECV_parsimonious (M6). All of the models generally appear to overestimate the PM2.5 concentration except at two stations, Mitte and Schildhornstrasse, where the concentrations are underestimated. The GSA_parsimonious model (M2) is able to predict concentrations with an MAE of less than 10 µg/m³ for all 12 stations and with less than 5 µg/m³ at 8 out of 12 locations.
At the urban background stations, all the models perform similarly, with a mean RMSE of 3 µg/m³ to 5 µg/m³ at Wedding and 2 µg/m³ to 4 µg/m³ at Mitte. Station Neukoelln, on the other hand, has a mean RMSE between 6 µg/m³ and 15 µg/m³, with the GSA_parsimonious model performing the best and the baseline model the worst.
The GSA_parsimonious model shows the best performance for traffic stations at Mariendorfer Damm, Silbersteinstrasse, Frankfurter Allee, and Karl-Marx-Strasse, demonstrating significant improvements over the baseline model. On the other hand, for traffic station Schildhornstrasse, all the models show a similar performance, with a mean RMSE between 1 µg/m³ and 4 µg/m³. At the traffic station Messwagen-Leipziger-Strasse, the RFE model (RMSE = 3 µg/m³) performs significantly better than the baseline model (RMSE = 7 µg/m³). However, the station at Messwagen-Leipziger-Strasse is a mobile station and was therefore not considered. This is the only station where the RFE model performs better than the other models.

Table 2. Root mean squared error (RMSE) of the predicted PM2.5 concentration at each Berlin air quality station. The predictions were carried out using six models: baseline (M1), GSA_parsimonious (M2), GSA_streets (M3), RFE (M4), RFECV_baseline (M5), and RFECV_parsimonious (M6).

Including the street predictors improves the predicted PM2.5 concentration by approximately 1 µg/m³ at the stations Schildhornstrasse and Mitte.

Feature Selection
Figures 4 and 5 show the results of the feature selection process under the RFECV and Sobol methods, respectively. As the features selected using RFECV overlap those chosen by Sobol, the features are discussed together.
The PM2.5 concentration depends on local and regional factors (see Figure 5). Four out of the seven inputs selected by RFECV and eight selected by GSA, shown in Figure 4 and Figure 5, respectively, are related to population density and daily traffic volume. This is in line with the generally accepted sources of PM2.5 according to the German Environment Agency, wherein approximately 60% of the emissions result from combustion processes, with the largest shares coming from households, small consumer operations such as restaurants, and road traffic [116,117]. Similar results were found in the study of population density and PM by Borck et al. (2021) [117], who reported that an increase in population density had a direct effect on the increase in PM. The study by Casallas et al. (2024) [118] highlights that elevated PM levels can be attributed to vehicles and industries, similar to the results found in our study.
LAI is the second most important predictor in both the RFECV and GSA models. Since the model pertains to summer months, LAI plays a crucial role. A study on five different urban sites in the United Kingdom by Beckett et al. (2000) [119] indicated that trees of various sizes and ages play a significant role in particulate matter reduction by capturing significant quantities in their foliage. However, the rate of PM uptake can vary between species. A similar study by Nowak et al. (2013) [120] in ten different cities in the United States of America also showed that trees remove fine particles from the air, thereby reducing the urban PM2.5 concentration.
LCZs and land-use types are determined by infrastructure planning. It is therefore important to design cities in a way that can effectively reduce the concentration of PM2.5 [121]. Building type refers to the age (pre-war, post-war, and current) and usage (residential or office) of the building. Although it is unclear whether the age of buildings or their usage determines the effect on PM2.5 concentrations, the building type parameter still has a significant influence on the output of the model.
The GSA also shows that the predictors LAI, LCZ, PD, DTV, and BT have higher-order effects, i.e., a difference between FOS and TOS. This may be because these predictors not only have a direct effect on the predicted PM2.5 concentration but can also interact with each other and indirectly affect the predicted PM2.5. For example, the type of LCZ has a direct effect on the predicted PM2.5 concentrations. However, a change in LCZ, for instance, due to infrastructural developments, can alter PD and DTV as well as BT. Similarly, a change in PD can affect the traffic volume, which in turn affects the LCZ. One such scenario could be when new sealed surfaces (roads) are constructed in barren or green areas to accommodate increased traffic flow. BT, depending on the presence of a residential or commercial type, can affect the local PD, which in turn can affect DTV.
The higher-order effects show the predictors that have the most influence on the model outcome. Domain knowledge is, however, necessary to understand and interpret these effects. This is due to the fact that the sensitivity analysis lacks any information pertaining to the causal directions and cannot differentiate causes from effects. For example, PD and DTV are directly proportional to PM2.5 concentration. However, an increase in LAI leads to a decrease in PM2.5. Similarly, LCZ is an effect of urban and infrastructure planning and not a cause of it. Therefore, the importance of LCZ as a predictor lies in analysing the underlying infrastructural and land-use plans.
All six models validated with the HOV data (Figure 6) follow the 1:1 line of the Q-Q plot, showing that the predicted data follow the distribution of the observed data and are comparable. The general bias for concentrations above 30 µg/m³ shows that the model cannot be used for modelling extreme cases, as the model would underestimate concentrations in such cases. The model, however, can be used to obtain a general PM2.5 spatial distribution, as the annual mean concentration in Berlin is less than 20 µg/m³. The parsimonious models using GSA and RFECV capture situations with around 100 µg/m³, which occur only rarely, when the concentration is measured right next to combustion sources such as vehicles or smoking. The R² between 0.76 and 0.81 in models M2-M5 as compared to the baseline (M1) (Figure 6) shows that the feature selection process does not reduce the predictive power of the model. The RFE model has the best MAE of 5.20 µg/m³ as compared to the other models. A lower MAE indicates a better model since the absolute errors are low. However, since MAE calculates errors linearly, the errors due to outliers are not pronounced. A lower RMSE of a model shows that the model has a lower variance of the residuals. This indicates that the GSA_parsimonious model is able to predict without strong outliers and is therefore better suited for the data set used. RMSE is a deciding factor here since PM2.5 has threshold values that, when exceeded, can have health risks and legal consequences. Exaggerating such extremes limits the usability of such a model for applications that include information to the general public. The GSA models perform best with a lower MSE/RMSE as compared to the other models. All six models have a low NRMSE of 0.07 to 0.08, indicating that the models perform consistently and are robust to outliers.
Figure 7 shows that the models trained using the AQS data do not generalise well to the BLUME station data. The lower R² in BLUME_val as compared to HOV_val across all models indicates that the model has poor performance when transferred to BLUME locations and that the model does not explain the variance when compared to stationary measurements. The higher NRMSE of 0.38 and low R² of 0.17 of the validation against BLUME station data show that the model is affected by micro-scale structures and therefore cannot be generalised at the regional scale. The sparsity of BLUME stations and the non-availability of AQS measurements at the locations of BLUME stations make it difficult to make a direct comparison. However, the distribution and R² of the GSA_parsimonious model predictions are better than those of the baseline model, thus indicating that the feature selection using the Sobol method helps to reduce the total uncertainties and improves the explained variance of the model.
Figure 8 shows a direct case-to-case comparison that provides insight into where the model excels and where it falls short. Below, we compare the results of the GSA_parsimonious model (M2) at each BLUME station.
The predicted PM2.5 concentration at suburban station Buch is overestimated by 5 µg/m³. This is perhaps due to its location at the northern border of the State of Berlin. Due to its location and the model input parameters being confined to the area of the State of Berlin, the land-use parameters within the buffer zone but outside the State of Berlin are not considered in the model. The predicted concentration at the suburban station Grunewald has the highest mean error at 9 µg/m³. The overestimation could be attributed to the station's location amidst a large green area and away from urbanised areas, unlike the other suburban stations, which are located closer to urbanised areas (see Figure 2). Moreover, the station is located in a dense green area, whereas the training data for the overall model contain limited information about PM2.5 concentrations in green areas.
The predicted PM2.5 concentration at the urban background station Mitte is also underestimated. There have been road works and construction since 2016 in the vicinity of the station Mitte [122,123]. The RF model, however, does not consider road and construction works as a parameter, which might explain why the model underestimates the PM2.5 concentration at this location. The predicted PM2.5 concentration at the urban background station Neukoelln is overestimated by 6 µg/m³. The urban background station at Neukoelln is located in a residential area characterised by medium traffic volume. DTV and PD_ha both being significant predictors of the M2 model would explain the overestimation. Neukoelln, along with the stations at Silbersteinstrasse, Karl-Marx-Strasse, and Frankfurter Allee, has some of the highest exceedances of limit values in Germany [124].
The traffic station Mariendorfer Damm is located in a densely built residential area with a high traffic volume. As with the station at Neukoelln, DTV and PD_ha are both significant predictors of the M2 model, and thus the predicted concentration is overestimated. The traffic station Messwagen-Leipziger-Strasse is a temporary station set up in a measurement vehicle parked on the road Leipziger Strasse. The station supposedly did not stay at the same location and was moved around [125] to locations along the street with the highest concentrations. It is therefore difficult to pinpoint the exact location of the station along the street, thereby potentially causing the misrepresentation of the predicted PM2.5 concentration.
The station Schildhornstrasse is a traffic station located near a roadside parking area on the highly trafficked street Schildhornstrasse in the neighbourhood of Steglitz. The PM2.5 concentration at station Schildhornstrasse is underestimated by the model. The vicinity of Schildhornstrasse is classified as "residential use" in the land-use classification. However, it is also less than 1000 m away from the highways A100 and A103 and from the controversial motorway bridge at Breitenbachplatz, which diverts traffic from the motorway into Schildhornstrasse. The street type classification is disregarded as a significant parameter for the RF model after the sensitivity analysis, as the input parameters are selected based on the mean sensitivity of 1000 model runs. This may result in the contribution of the parameter being overlooked for such a specific case. However, adding the street type predictors to the parsimonious model does not have a major effect on the predicted concentration, as seen in Figure 8.
The case-by-case analysis of the predicted PM2.5 shows that the GSA_parsimonious model may be able to explain the model metrics at a local scale. However, with a low R² of 0.17, it cannot be used for regional-scale modelling. Other studies have found similar results wherein the LURMs were non-transferable to other areas [29,31,126]. However, the better performance of the GSA_parsimonious model as compared to the baseline model (R² = 0.03) shows that the Sobol GSA helps to improve the model metrics better than the feature elimination or importance techniques offered by the RF algorithms. This goes to show that the higher-order effects of predictors for a LURM play a significant role in the model performance. The ability of GSA to improve models is consistent with studies by Todorov et al. (2023) [76] and Wang et al. (2023) [78].
Overall, the GSA_parsimonious model performs best (Figure 7 (M2)). This could be because the GSA_parsimonious model has fewer inputs, which could reduce the uncertainty propagated through the model. By removing inputs which do not contribute significantly to the model, issues such as unstable coefficients and data overfitting that arise due to multicollinearity are reduced. Moreover, considering that there are eight predictors in the GSA_parsimonious model as compared to 220 in the baseline model, we can largely reduce the computational time and costs by using the GSA_parsimonious model. The RFECV_parsimonious model uses the same features as those of the GSA_parsimonious model, except for DTV_min_750 and building_type_cat_max_1000. The better performance of the GSA_parsimonious model could be due to the larger data set, with 442,368 data points used to perform feature selection as compared to 835 in the RFE and RFECV methods.

Conclusions
AQSs provide a unique opportunity to collect large quantities of data and capture microscale variations in PM2.5 concentrations. Several studies have used AQS data in combination with satellite images and monitoring stations to develop predictive models [42]. In our work, we assessed the potential of using AQS data to develop a LURM incorporating different feature selection methods for a random forest model. The baseline model consisting of all the predictors was subjected to the feature selection methods RFE, RFECV, and Sobol GSA. RFE and RFECV optimise the model by removing features with the least influence on the model output. While Sobol GSA, RFE, and RFECV perform variance-based decomposition to highlight the most important features for model prediction, the higher-order effects that can be quantified with Sobol GSA provide an insight into the model dynamics. The developed models were validated using AQS data (hold-out validation) and reference station data (BLUME validation). HOV showed that the baseline, GSA_parsimonious, GSA_streets, RFE, RFECV_baseline, and RFECV_parsimonious models have similar performances, with the GSA_parsimonious model performing slightly better. BLUME validation showed that the GSA_parsimonious model performs best across all metrics (RMSE = 4.23 µg/m³, MSE = 17.93, NRMSE = 0.38, MAE = 3.6 µg/m³, and R² = 0.17) as compared to the baseline model, which had significantly poorer metrics (RMSE = 9.86 µg/m³, MSE = 97.16, NRMSE = 0.9, MAE = 8.69 µg/m³, and R² = 0.03).
The data used for this study included little data from green and blue spaces. Although the LCZs in this study included green spaces under the categories of dense trees, scattered trees, low plants, and water, the measurements were carried out at the border of these LCZs and do not represent them. The GSA_parsimonious model is able to predict PM2.5 concentration with an MAE of 9 µg/m³ at the suburban station Grunewald and with an MAE of less than or equal to 5 µg/m³ at the stations Buch and Friedrichshagen. The GSA analysis does not show any impacts of street type classification on the output of the model. Although including the street information reduces the RMSE at two stations, Mitte and Messwagen-Leipziger-Strasse, it increases the RMSE at other stations, thereby not providing any significant addition to the GSA process.
The trained GSA model performs well on unseen data from the test sample that uses AQS data, with an MAE of 5.24 µg/m³, R² of 0.81, RMSE of 9.9 µg/m³, and NRMSE of 0.07. However, the model does not transfer and generalise to data from the reference stations. Therefore, the model can only be used for local analysis and needs to be adapted for regional-scale use with data that can capture regional conditions. Other studies have also found problems with transferability using LURM models in general due to differences in training data, data availability, and complex urban structures [29,31].
AQSs can be a good source of training data for predictive models if the sensors are calibrated and if the data are correctly processed. However, the inherent problems with the technology associated with AQSs and the added uncertainty due to a mobile platform make them a suboptimal choice for regional modelling. Nevertheless, the sensitivity analysis is able to improve the model from an R² of 0.03 in the baseline model to 0.17 in the GSA_parsimonious model, while reducing the number of predictors significantly. The same effect is not as pronounced with the use of RFE and RFECV techniques. This is a proof of concept that GSA is a highly useful technique for predictor screening when using AQSs.
RF is one of the ML methods commonly used in regression analysis for air quality. However, with the increasing popularity of ML techniques, other methods such as neural networks [127], coupled deep learning models [78], and stacked ensemble methods [47] show further promising capabilities for predictive modelling.
Feature elimination methods enable the creation of a simpler, faster, and more parsimonious model, which can reduce the computational intensity and the uncertainty propagated through the model. Although RF has the ability to handle noisy data and has feature importance and elimination techniques available in its algorithm, the GSA Sobol method outperforms these in its feature selection techniques. Potentially, such an analysis can be used to model large study areas with reduced computational intensity and without compromising the predictive capacity.
The higher-order effects show that population density and traffic volume have the largest impact on the outcome of the model. This is in line with other studies suggesting that the combustion processes from households and traffic contribute the most to PM2.5 concentrations. These are followed by the leaf area index, local climate zones, and building type, which emphasises the importance of efficient city-planning measures that take PM2.5 into consideration, as PM2.5 has been reported to have an adverse impact on the cardiovascular and pulmonary health of the population [128][129][130].
The higher-order effects of the sensitivity analysis of our study are valid for local scales and can provide key information for urban planners. With fewer factors to consider, urban planners can make informed decisions to optimise land-use planning. However, sensitivity analysis lacks insight into causal direction, and therefore domain knowledge needs to be applied to the higher-order effects. Combining the information of the predictors with the highest sensitivity and domain knowledge on the interactions of these predictors can enhance this study through the inclusion of a causal analysis. This analysis could be performed using packages such as DoWhy [131,132] and provide information that can be used for risk management by means of scenario analysis by policy makers, urban planners, and health officials.

Figure 1. Flowchart showing the methodology of the study.

Figure 2. Mobile measurement transects in Berlin, Germany, at Hermsdorf, Charlottenburg-Ernst-Reuter-Platz, and Adlershof. The triangular points mark the locations of the official Berlin air quality measurement stations.

Bicycle routes covering a distance of approximately 18 km in Hmd, 21 km in ERP, and 27 km in Adl were designed with 11 out of the 14 local climate zones (LCZs) that are present in Berlin being taken into account [85]: compact high rise, compact mid-rise, open high rise, open mid-rise, open low rise, large low rise, heavy industry, dense trees, scattered trees, low plants, and water. Three LCZs (sparsely built, bare rock or paved, and bare soil or sand) were not covered within the measurement routes. The measurements in the LCZs of dense trees and water took place at the border of the areas and not in the midst of them. The bicycle routes and their location in Berlin are shown in Figure 2. The details of the bicycle rounds are summarized in Table 1. The measurements were carried out from 15 June 2018 to 15 October 2018 as a part of an intense observation measurement campaign in the Urban Climate Under Change [UC]² project [86]. The measurements at Hmd continued over the winter between 2018 and 2019 until 1 March 2019. A total of 7 rounds in ERP, 15 rounds in Hmd, and 16 rounds in Adl were carried out. However, only the measurements taken during the summer months (June to October) were considered for this study, limiting the measurements in Hmd to 2 rounds. By limiting the measurements to summer, variations in pollutant levels that would generally occur due to seasonal variations were removed. Additionally, the measurement rounds that were incomplete due to missing PM2.5 or GPS data, change in weather conditions (sudden rain), or device malfunction were removed from the analysis.

Figure 3. Histogram and density distribution of the PM2.5 data used in the study. Left: Distribution of the whole data set. Centre: Distribution of the training data that constitute 70% of the whole data set. Right: Distribution of the test data that constitute 30% of the whole data set.

Figure 4. The mean test accuracy of predictors selected using RFE with cross-validation. The Y-axis shows the mean test accuracy, and the X-axis shows the selected predictors: leaf area index (LAI), local climate zone (LCZ), population density per hectare (PD_ha), and daily traffic volume (DTV). Categorical data are indicated by "cat" after the name of the predictor. The number at the end of the predictor name indicates the buffer size in metres, and the term before it indicates whether the mean, minimum (min), or maximum (max) of the predictor within the buffer is considered.

Figure 5. Variance decomposition of the random forest model through attribution of the input to the model output variance using the Sobol method. The X-axis shows the sensitivity indices: the total-order sensitivity (blue) and the first-order sensitivity (orange) of the predictors. The Y-axis shows the predictors used: leaf area index (LAI), local climate zone (LCZ), population density per hectare (PD_ha), daily traffic volume (DTV), and building type. Categorical data are indicated by "cat" after the name of the predictor. The number at the end of the predictor name indicates the buffer size in metres, and the term before it indicates whether the minimum (min) or maximum (max) of the predictor within the buffer is considered.

Figure 6. Q-Q plot showing the data distributions and model metrics of six random forest models assessed using hold-out validation. The red line shows the best-fit line and the blue circles show the data points. The top left corner of each Q-Q plot shows the MSE, MAE, RMSE, NRMSE, and R² metrics for the baseline (M1), GSA_parsimonious (M2), GSA_streets (M3), RFE (M4), RFECV_baseline (M5), and RFECV_parsimonious (M6) models.

Figure 7. Q-Q plot showing the data distributions and model metrics of six random forest models assessed using BLUME station data for validation. The red line shows the best-fit line and the blue circles show the data points. The top left corner of each Q-Q plot shows the MSE, MAE, RMSE, NRMSE, and R² metrics for the baseline (M1), GSA_parsimonious (M2), GSA_streets (M3), RFE (M4), RFECV_baseline (M5), and RFECV_parsimonious (M6) models.

Figure 8. Box plot showing the absolute error at each BLUME station for each of the six models: baseline (M1), GSA_parsimonious (M2), GSA_streets (M3), RFE (M4), RFECV_baseline (M5), and RFECV_parsimonious (M6). Each plot includes the absolute errors of all the 1000 predicted PM2.5 concentrations on the Y-axis and the model used on the X-axis.

Table 1. Summary of bicycle rounds carried out for the study in Berlin-Hermsdorf, Berlin-Adlershof, and Berlin-Charlottenburg-Ernst-Reuter-Platz. For locations see Figure 1.