Evaluation of white-box versus black-box machine learning models in estimating ambient black carbon concentration


Air quality prediction with black-box (BB) modelling is gaining widespread interest in research and industry. Such data-driven models generally perform better in terms of accuracy, but their capacity to capture physical, chemical and meteorological processes is limited, and so is their accountability for interpretation. In this paper, we evaluated different white-box (WB) and BB methods that estimate atmospheric black carbon (BC) concentration from a suite of observations at the same measurement site. This study involves data for the period 1 January 2017 to 31 December 2018 from two measurement sites in Helsinki, Finland: a street canyon site in Mäkelänkatu and an urban background site in Kumpula. At the street canyon site, the WB models performed (R² = 0.81–0.87) in a similar way to the BB models (R² = 0.86–0.87). The overall performance of the BC concentration estimation methods at the urban background site was much worse, probably because of a combination of smaller dynamic variability in the BC values and longer data gaps. However, the difference between the WB (R² = 0.44–0.60) and BB models (R² = 0.41–0.64) was not significant. Furthermore, the WB models are closer to physics-based models, and it is easier to spot the relative importance of the predictor variables and to determine whether the model output makes sense. This feature outweighs the slightly higher performance of some individual BB models, and the WB models are inherently a better choice owing to the transparency of their model architecture. Among the WB models, IAP and LASSO are recommended for their flexibility and their efficiency, respectively. Our findings also ascertain the importance of temporal properties in statistical modelling. In the future, the developed BC estimation model could serve as a virtual sensor and complement current air quality monitoring.

Measurement sites

The data used for this purpose come from two measurement sites in Helsinki, which is situated on relatively flat land on the coast of the Gulf of Finland.
According to the National Land Survey of Finland (2019) and Official Statistics of Finland (2020), Helsinki, together with Vantaa and Espoo, forms the Helsinki Metropolitan Area (HMA), which has a total area of 1484 km² and a population of about 1.18 million inhabitants as of 2019. The two measurement stations, Mäkelänkatu and Kumpula, are located approximately 1 km from each other (Fig. 1) and represent a street canyon and an urban background environment, respectively. The two environments possess features common to urban areas and give feedback on the model performance in both settings.
Street canyon site: The Mäkelänkatu street canyon site (street address Mäkelänkatu 50; 60°11′N, 24°57′E) is operated by the Helsinki Region Environmental Services Authority (HSY). The station (Hietikko et al., 2018; Kuuluvainen et al., 2018) is located 3 km from the city centre in a street canyon in the immediate vicinity of one of the main roads leading to downtown Helsinki. The street consists of six lanes and two tramlines, and the annual mean traffic volume is approximately 28,000 vehicles per workday. The proportion of heavy-duty vehicles is 11% and the speed limit is 50 km h⁻¹. The traffic loads are especially high during the rush hours around 8 a.m. and 5 p.m. Both sides of the street are lined with rows of buildings, which weaken the dispersion of the direct vehicular emissions (Dos Santos-Juusela et al., 2013). All the inlets of the measuring devices are positioned at a height of approximately 4 m above ground level.
Urban background site: The Station for Measuring Ecosystem-Atmosphere Relations III (SMEAR III) in Kumpula (60°12′N, 24°57′E, 26 m a.s.l.) is built on a rocky hill about 4 km northeast of the Helsinki centre (Järvi et al., 2009). The surroundings of this urban background station are heterogeneous in different directions, consisting of patchy forest and low vegetation (180°–325°), buildings and parking lots (325°–35°) and a main road (35°–180°) at 150 m from the site (Järvi et al., 2009). The main road, Hämeentie, leads to the centre of Helsinki; its average daily traffic intensity is 50,000 vehicles and the proportion of heavy-duty vehicles on it is considerable. Secondary aerosol formation, traffic, wood combustion and long-range transport account for a significant portion of the atmospheric pollutant sources (Timonen et al., 2013). Trace gases are measured at a 31-m-high triangular lattice tower, while the aerosol measurements are taken in a measurement container next to the tower.
In addition, traffic rates are monitored by the City of Helsinki. The nearest continuous counter station is about 500 m northwest and 1 km southwest from the Mäkelänkatu and Kumpula measurement sites, respectively (Fig. 1). The traffic monitoring does not distinguish between light- and heavy-duty vehicles. It logs the number of vehicles along Mäkelänkatu, but the actual traffic at the Mäkelänkatu site is estimated to be about 30% lower because traffic may be diverted to Kumpulankatu before reaching the site.

Data collection
In this study, the BC mass concentration (μg m⁻³) is measured optically by a multi-angle absorption photometer (MAAP; Thermo Scientific model 5012) with a PM1 inlet at a 1-min time resolution (Petzold & Schönlinner, 2004). Using measurements of both transmittance and reflectance at different angles, the MAAP determines the light absorbance of particles deposited on the filter. The absorbance is converted to BC mass concentration by using a fixed mass absorption coefficient of 6.6 m² g⁻¹ at a wavelength of 637 nm (Müller et al., 2011).
Mass concentrations of particles with diameters less than 2.5 μm (PM2.5) and less than 10 μm (PM10) are measured continuously with an ambient particulate monitor, a TEOM 1405 at the Mäkelänkatu street canyon site and a TEOM 1405-D at the Kumpula urban background site. Both are recorded in μg m⁻³.
A differential mobility particle sizer (DMPS), which is a combination of a differential mobility analyser (DMA) and a condensation particle counter (CPC), measures the aerosol size distribution. A Vienna DMA and an Airmodus A20 CPC are used at the street canyon site, while at the urban background site a twin DMPS (Hauke-type DMA with a TSI Model 3025 CPC and Hauke-type DMA with a TSI Model 3010 CPC) is used. The two DMPS systems measure particles with diameters of 6–800 nm and 3–950 nm, respectively. The technique makes use of the bipolar charging of aerosol particles, followed by classification of the particles into size classes according to their electrical mobility (Aalto et al., 2001; Järvi et al., 2009).

[Fig. 2. The simplified workflow and the architecture of the different methods used in this paper. White-box models include IAP, LASSO and DT, which stand for input-adaptive proxy, least absolute shrinkage and selection operator and decision tree, respectively. Black-box models are RF, SVR, SNN and LSTM, which denote random forest, support vector regression, shallow neural network and long short-term memory, respectively.]
From the DMPS size distributions, we calculated the lung-deposited surface area (LDSA) in μm² cm⁻³ with a particle size-dependent deposition fraction conversion, first introduced in the human respiratory tract model of the ICRP report by Bair (1994). The conversion involves calculating the alveolar deposition efficiency and summing the LDSA concentrations over the corresponding sizes.
The total particle number concentration (Ntot) is extracted from the DMPS size distribution data; it represents the total number concentration of particles with diameters of 6–800 nm and 3–950 nm at the street canyon and at the urban background site, respectively. We also extracted three particle modes from the DMPS data: the nucleation, Aitken and accumulation modes. The nucleation mode sums the particle number concentrations of the size bins below 25 nm (PN0.025). The Aitken mode is defined to include particles with diameters between 25 and 90 nm (PN0.09–0.025), and the accumulation mode comprises particles with diameters equal to or larger than 90 nm (PN1–0.09) at both sites. All the modal aerosol particle number concentrations are reported in cm⁻³.
Gaseous variables (in mass concentration, μg m⁻³, at the street canyon site and in parts per billion, ppb, at the urban background site), namely nitric oxide (NO), NO2, their sum NOx, O3 and CO, are measured with standard instruments at both sites. In addition, meteorological variables, including air temperature (Temp), relative humidity (RH), air pressure (P), wind speed (WS) and wind direction (WD), are measured at both locations. Since the wind direction is a circular variable, it is resolved into North–South and East–West vector components by trigonometric functions (Fernández-Guisuraga et al., 2016; Järvi et al., 2009), labelled WD-N and WD-E, respectively. Solar radiation, including photosynthetically active radiation (PAR), is measured only on the rooftop of a university building 100 m away from the urban background site. In summary, a total of 21 variables, comprising 1 output variable (BC) and 20 predictor variables, are utilised in the analysis and modelling.
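The trigonometric decomposition of the circular wind direction can be sketched as follows; the sign convention (components positive towards north and east) is our assumption, as the paper does not state it:

```python
import math

def wind_components(wd_deg):
    """Resolve a circular wind direction (degrees, 0 = North, increasing
    clockwise) into North-South (WD-N) and East-West (WD-E) components."""
    theta = math.radians(wd_deg)
    return math.cos(theta), math.sin(theta)  # (WD-N, WD-E)
```

For example, a wind direction of 90° then maps to WD-N ≈ 0 and WD-E ≈ 1, so the discontinuity at 0°/360° disappears from the predictor space.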

Methods
A simplified workflow of this study is outlined in Fig. 2. The raw data stored in our database first went through pre-processing procedures, including outlier removal, normalisation and data partitioning, which are described in detail in Section 3.1. The pre-processed data were then used to train the WB and BB models. We selected and showcase seven methods (three WB and four BB) owing to their proven performance reported by several researchers (Cabaneros et al., 2019; Fung et al., 2020; Kisi et al., 2017; Van Roode et al., 2019; Yu et al., 2016). The architecture of the methods and their optimisation criteria are elaborated in Sections 3.2 and 3.3 for the WB and BB models, respectively, and the characteristics and benefits of the individual models are stressed in the same sections. Finally, the optimised models are tested and compared by the evaluation attributes explained in Section 3.4.

Data pre-processing
We first synchronised the data to an hourly median time resolution, which is less influenced by extreme values than the mean, and subsequently converted the data to the same units at both the Mäkelänkatu and Kumpula sites for better comparison. We identified and eliminated outliers by calculating their z-scores with a threshold of |z| = 3, such that 99.7% of the data points were retained. We then substituted values below the detection limit (DL), including negative values, with 0.5 × DL. For the ML regression, the data have to follow a normal distribution; we therefore examined this by plotting histograms of each variable and applied appropriate transformations. In this case, the frequently used logarithmic transformation was applied to all aerosol and gaseous variables (Maciejewska et al., 2015; Teinilä et al., 2019). No transformation was applied to the meteorological data.
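These steps (z-score outlier removal, detection-limit substitution, log transformation) can be sketched as below; the detection-limit value is illustrative, not from the paper:

```python
import numpy as np

def preprocess(x, dl=0.01):
    """Sketch of the pre-processing chain for an aerosol/gas variable:
    |z| > 3 outlier removal, substitution of sub-DL values with 0.5*DL,
    and a log transform. `dl` is an illustrative detection limit."""
    x = np.asarray(x, dtype=float)
    z = (x - np.nanmean(x)) / np.nanstd(x)
    x = np.where(np.abs(z) > 3, np.nan, x)   # drop outliers beyond 3 SD
    x = np.where(x < dl, 0.5 * dl, x)        # substitute sub-DL values
    return np.log(x)                         # log-transform the variable
```

The outliers become NaN here rather than being deleted, so the hourly time base of the series is preserved for the later gap-filling step.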
The mean BC concentrations are 1.03±0.88 μg m⁻³ and 0.47±0.46 μg m⁻³ at the street canyon and at the urban background site, respectively, and the distribution of the BC concentration is illustrated in Fig. 3. The overall data coverage of all parameters used is 94% and 87% at the two sites. For the operation of the WB and BB models, missing data were gap-filled with two methods, linear interpolation and the nearest-neighbour method, chosen according to the length and extent of the gaps in the datasets (Junninen et al., 2004). Both datasets were then normalised (mean = 0, standard deviation (SD) = 1) (Cabaneros et al., 2019; Järvi et al., 2009), as the ML processes are sensitive to variable scales. Some of the selected models take autocorrelation into account themselves; therefore, to keep the comparison simple, no autoregressive terms were included.
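A simplified version of the gap-length-dependent filling might look like this; the three-point threshold separating linear interpolation from nearest-neighbour filling is our assumption (the paper follows Junninen et al., 2004), and the sketch assumes valid values at both ends of the series:

```python
import numpy as np

def gap_fill(x, max_linear=3):
    """Fill short gaps by linear interpolation and longer gaps by the
    nearest valid neighbour. `max_linear` (here 3 points) is an
    illustrative threshold, not the paper's setting."""
    x = np.asarray(x, dtype=float)
    idx = np.arange(x.size)
    good = np.where(~np.isnan(x))[0]
    out = x.copy()
    for a, b in zip(good[:-1], good[1:]):    # walk over gaps between valid points
        gap = b - a - 1
        if gap == 0:
            continue
        if gap <= max_linear:                # short gap: linear interpolation
            out[a + 1:b] = np.interp(idx[a + 1:b], [a, b], [x[a], x[b]])
        else:                                # long gap: nearest neighbour
            mid = (a + b) / 2.0
            out[a + 1:b] = np.where(idx[a + 1:b] <= mid, x[a], x[b])
    return out
```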
To increase the reliability of our comparison, we partitioned the first 80% of the time series into a training set and the last 20% into a testing set. Flag vectors were created for workdays and for weekends (including holidays) for comparison. Seasons were also classified thermally (Järvi et al., 2009). Since the testing set consists of the last 20% of the measurement period, i.e., 7 August–31 December 2018, only winter, summer and autumn are represented in the testing set.
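The chronological partition is a one-liner; unlike a random split, it keeps the testing period strictly in the future of the training period, which matters for the temporal properties stressed in this paper:

```python
def chrono_split(data, train_frac=0.8):
    """Chronological partition: the first `train_frac` of the time
    series trains the model, the remainder tests it (no shuffling)."""
    n_train = int(len(data) * train_frac)
    return data[:n_train], data[n_train:]
```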

White-box (WB) models
WB models refer to transparent ML processes that produce accountable predictions. They put emphasis on introspection and on the modelling process: users are able to interpret the models and identify the relative importance of the corresponding predictor variables, which allows them to spot possible errors when the model coefficients contradict physics-based knowledge.

Input-adaptive proxy (IAP)
IAP was first described by Fung et al. (2020). The method managed to estimate continuous BC concentration with a coefficient of determination (R²) higher than 0.8. IAP first examines the correlation of the output variable with the input features by the Pearson correlation coefficient. It pre-selects the most correlated input features and creates sub-models with at most three input features using ordinary least-squares (OLS) linear regression (Fernández-Guisuraga et al., 2016):

ŷ = β0 + Σi βi xi + ε,

where ŷ is the output variable, β0 is the intercept of the model, βi and xi correspond to the i-th regression coefficient and the i-th input variable of the model, and ε is the random error. The model applies an extra regularisation through the 'bisquare' weight function, which depends on the residuals, the leverages from the OLS fits and the estimates of the standard deviation of the error terms, with a tuning factor of 4.685 (Gross, 1977). This robust fitting serves as an alternative to traditional OLS regression (Holland & Welsch, 1977) when the data are contaminated with outliers, as often happens in field measurements. We run the regression, evaluate every sub-model and, according to the evaluation metrics used, rank the sub-models by performance in descending order. In practice, datasets are typically incomplete; when data are missing, some inputs of the best sub-model may be unavailable and the imputation cannot be fully achieved with it alone. IAP then imputes the missing data with the second-best performing sub-model, and so on, until all voids are filled. Since IAP is designed to fill the missing data itself, no missing-data imputation is done before modelling, unlike for the other models we use.
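A minimal sketch of the IAP idea under stated simplifications (plain OLS without the bisquare robust weighting, R² as the only ranking metric, and exhaustive enumeration of feature combinations, which is our own shortcut):

```python
import numpy as np
from itertools import combinations

def iap_estimate(X, y, max_feats=3):
    """Rank OLS sub-models of up to `max_feats` inputs by R^2 on the
    complete rows, then fill each target value with the best-ranked
    sub-model whose inputs are available at that time step."""
    n, p = X.shape
    complete = ~np.isnan(X).any(axis=1)
    submodels = []
    for k in range(1, max_feats + 1):
        for feats in combinations(range(p), k):
            cols = list(feats)
            A = np.column_stack([np.ones(complete.sum()), X[complete][:, cols]])
            beta, *_ = np.linalg.lstsq(A, y[complete], rcond=None)
            resid = y[complete] - A @ beta
            r2 = 1.0 - resid.var() / y[complete].var()
            submodels.append((r2, cols, beta))
    submodels.sort(key=lambda s: -s[0])            # best sub-model first
    yhat = np.full(n, np.nan)
    for r2, cols, beta in submodels:               # fall through the ranking
        usable = np.isnan(yhat) & ~np.isnan(X[:, cols]).any(axis=1)
        yhat[usable] = beta[0] + X[np.ix_(usable, cols)] @ beta[1:]
    return yhat
```

The fall-through loop is the defining trait: each target value is estimated by the best sub-model whose inputs happen to be present at that hour, which is why IAP needs no prior imputation.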

Least absolute shrinkage and selection operator (LASSO)
This method was first introduced by Tibshirani (1996) and later extensively used in air pollutant prediction (Van Roode et al., 2019) and epidemiological studies (Davalos et al., 2017; Roberts & Martin, 2005). It is a multiple linear regression method that uses a regularisation term to avoid over-fitting. The regularisation imposes a penalty on the parameters of the model to reduce the model freedom and eliminate redundant predictor variables; in this way, the model improves its generalisation capacity. LASSO makes use of an L1-norm penalty, evaluated over a geometric sequence of λ, as below (Tibshirani, 1996):

β̂ = argmin(β0, β) { (1/2N) Σi=1..N (yi − β0 − xiᵀβ)² + λ Σj=1..p |βj| },

where xi = (xi1, xi2, …, xip)ᵀ are the input variable vectors for i = 1, 2, …, N. N and p represent the total number of data points and the number of predictors, respectively. β0 is the intercept of the model, and β = (β1, β2, …, βp)ᵀ and βj are the model coefficient vector for all predictors and the coefficient of the j-th predictor, respectively. λ is a hyperparameter that controls the penalty term. Five-fold cross-validation was performed to obtain the optimised value of λ by minimising the mean square error. Selecting a suitable λ is essential to the performance of LASSO since it controls the strength of the shrinkage and variable selection, which, in moderation, can improve both prediction accuracy and accountability. However, if the regularisation becomes too strong, important variables may be left out of the model and the coefficients may be shrunk excessively.
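For illustration, the coordinate-descent (soft-thresholding) solution of this objective for a single λ fits in a few lines; the paper instead selects λ by five-fold cross-validation over a geometric sequence, which would wrap this routine in an outer loop:

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate-descent LASSO for one penalty value `lam`, minimising
    (1/2N)*||y - X b||^2 + lam*||b||_1. Assumes standardised columns of X
    and a centred y, so the intercept is omitted (a simplification)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]          # partial residual
            rho = X[:, j] @ r_j
            # soft-thresholding update: small correlations shrink to zero
            beta[j] = np.sign(rho) * max(abs(rho) - n * lam, 0.0) / col_sq[j]
    return beta
```

The soft-thresholding step makes the variable selection explicit: any predictor whose partial correlation falls below the threshold n·λ receives an exactly zero coefficient.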

Decision tree method (DT)
DT is a binary tree model in which each branch node represents a choice between two alternatives and each leaf node represents a numerical decision. The topmost decision node in a tree, also called the root node, is usually associated with the best predictor. DT is a supervised learning technique which uses a predictive model to map observations about an item (represented by the branches) to conclusions about the item's target value (represented by the leaves). The core algorithm for building decision trees, called Iterative Dichotomiser 3, was introduced by Quinlan (1986) and employs a top-down, greedy search through the space of possible branches with no backtracking.
This WB method, which is able to display how the tree selects important predictor variables, has been used for decades (Singh et al., 2013; Yu et al., 2016). In this paper, a curvature test, which makes use of the computation of residuals (Loh, 2002), is used to determine the split decision at each node. The best split predictor is the one that minimises the p-values of the curvature tests between each predictor and the output variable; such a selection is robust to the number of levels in the individual predictors. We control the tree depth by optimising hyperparameters, such as the minimum leaf size and the maximum number of splits, using grid search.
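A hedged sketch of such a hyperparameter grid search, using scikit-learn's DecisionTreeRegressor as a stand-in (its CART splitter uses variance reduction rather than the curvature test described above, and the grid values are illustrative):

```python
import numpy as np
from itertools import product
from sklearn.tree import DecisionTreeRegressor

def tune_tree(X, y, min_leaf_sizes=(1, 5, 20), max_depths=(2, 4, 8)):
    """Grid search over leaf size and depth; for brevity the training
    error is used for selection here, whereas cross-validation or a
    held-out set should be used in practice."""
    best, best_mse = None, np.inf
    for leaf, depth in product(min_leaf_sizes, max_depths):
        tree = DecisionTreeRegressor(min_samples_leaf=leaf, max_depth=depth,
                                     random_state=0).fit(X, y)
        mse = np.mean((tree.predict(X) - y) ** 2)
        if mse < best_mse:
            best, best_mse = tree, mse
    return best
```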

Black-box (BB) models
BB models describe the functional relationships between system inputs and system outputs with higher computational complexity. They tend to emphasise the speed and quality of prediction: BB models generally perform better in terms of accuracy, at the cost of accountability for interpretation.

Random forest -bagging ensemble method (RF)
RF aggregates the results generated from individual decision trees, each constructed on a random subset of the data (Kamińska, 2019; Kang et al., 2018; Masih, 2019), because a single decision tree tends to over-fit (Yu et al., 2016). There are different ways to perform the aggregation: bagging (Breiman, 1996), boosting (Drucker et al., 1994) or stacking (Ting & Witten, 1997). In the bagging approach, different random subsets are drawn from the original dataset with replacement, the same learning method is trained on each sample, and the outputs of the individual models are finally weighted. This bootstrap-aggregated ensemble method reduces bias and error variance and improves generalisation (Van Roode et al., 2019). The same curvature test as in DT is used for determining the split decisions, and Breiman's random forest algorithm is applied to determine the number of variables selected at random for each decision split (Breiman, 1996). The out-of-bag error is also calculated as a metric of inaccuracy. Because the outcome is obtained by aggregation, the complete set of decision splits and a straightforward feature importance are not easily depicted, unlike for DT. Using RF as an explanatory model is therefore almost impossible, and the method is regarded as a BB model.
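The bagging ensemble with an out-of-bag error estimate can be sketched with scikit-learn; `max_features='sqrt'` stands in for Breiman's rule for the number of predictors tried per split, and the synthetic data are ours, not the paper's:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression problem: one informative predictor plus noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = X[:, 0] + 0.1 * rng.normal(size=300)

# Bagged forest; oob_score=True evaluates each tree on the samples it
# never saw, giving a built-in generalisation estimate (OOB R^2).
rf = RandomForestRegressor(n_estimators=100, max_features="sqrt",
                           oob_score=True, random_state=0).fit(X, y)
print(rf.oob_score_)  # out-of-bag R^2
```

The out-of-bag score is the practical advantage of bagging highlighted above: a validation-like error estimate comes for free, without a separate hold-out set.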

Support vector regression (SVR)
SVR is based on the statistical learning theory introduced by Vapnik et al. (1997) and has been extensively implemented in air quality prediction (Kisi et al., 2017; Leong et al., 2019; Moazami et al., 2016; Siwek & Osowski, 2016). SVR performs regression by finding the kernel function that maximises the margin of tolerance ε of the regression fit; the vectors that define the kernel function are the support vectors. The goal of the SVR model is to find a function f(x) that deviates from the output variable in the training data by no more than ε for each training point x, while keeping f(x) as flat as possible. The flatness is governed by the box constraint C, a positive value that controls the penalty imposed on observations lying outside the predefined ε. The regression function is expressed with the two Lagrange multipliers of the support vectors:

f(x) = Σi=1..N (ai − ai*) K(xi, x) + b,

where ai and ai* are the two Lagrange multipliers of the support vectors at the upper and lower bounds, respectively. N and b are the total number of data points and the bias of the model, respectively. K(xi, x) represents the radial basis kernel function, recently used by Leong et al. (2019):

K(xj, xk) = exp(−‖xj − xk‖²),

where xj and xk represent two inputs to the kernel function.
In this paper, we use sequential minimal optimisation to perform a series of two-point optimisations. In each iteration, a pair of points is chosen based on a selection rule that uses second-order information. The final output is the sum over the support-vector terms plus the bias.
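An ε-SVR with an RBF kernel can be sketched with scikit-learn, whose libsvm backend uses an SMO-type solver as described above; the values of C and ε here are illustrative, not the paper's tuned settings:

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic one-dimensional regression target.
rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel()

# RBF-kernel epsilon-SVR: C is the box constraint, epsilon the margin
# of tolerance within which errors are not penalised.
svr = SVR(kernel="rbf", C=10.0, epsilon=0.05).fit(X, y)
print(svr.support_vectors_.shape)  # only the support vectors define f(x)
```

Only the points lying on or outside the ε-tube become support vectors, so the fitted function is typically defined by a fraction of the training data.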

Shallow neural network (SNN)
Artificial neural network models have been utilised in predicting air quality (Cabaneros et al., 2019; Freeman et al., 2018; Maleki et al., 2019). The architecture of a neural network consists of nodes which generate a signal or remain silent according to an activation function. The activation function in each layer determines the output value of each neuron, which becomes the input value for the neurons connected to it in the next hidden layer. In this paper, a feed-forward multilayer perceptron network is used. This type of network consists of a series of layers: the first layer has a connection from the network input, each subsequent layer has a connection from the previous layer, and the final layer produces the network's output. According to the review by Cabaneros et al. (2019), a shallow neural network (SNN) with one hidden layer and enough neurons in the hidden layer can fit any finite input-output mapping problem with a non-linear relationship. To keep the model simple, one hidden layer of five neurons is implemented.
A neuron can be thought of as a combination of two parts. The first part computes a weighted sum of its inputs,

zn = Σi wni xi + bn, (5)

where zn and bn are the intermediate output and the bias term of the n-th neuron, respectively, and wni is the weight of the n-th neuron for input xi. The second part applies the sigmoid activation function to zn to give the final output ŷn of the neuron:

ŷn = 1 / (1 + e^(−zn)). (6)

We initialise the weights randomly, and the weights are then updated through gradient-descent optimisation, which might cause a vanishing gradient problem. We perform five training cycles, with several iterations within each cycle, to minimise the training loss with Bayesian regularisation.
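The forward pass of such a one-hidden-layer network, equations (5) and (6), reduces to a few lines of NumPy (training with Bayesian regularisation is not reproduced here):

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation, equation (6)."""
    return 1.0 / (1.0 + np.exp(-z))

def snn_forward(x, W1, b1, w2, b2):
    """Forward pass of a one-hidden-layer network: a weighted sum per
    hidden neuron (equation (5)), the sigmoid activation, then a linear
    read-out. With five hidden neurons, W1 has shape (5, n_inputs)."""
    z = W1 @ x + b1      # intermediate outputs z_n, one per neuron
    h = sigmoid(z)       # neuron activations y_n
    return w2 @ h + b2   # linear output layer
```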

Long short-term memory (LSTM)
LSTM was first described by Hochreiter and Schmidhuber (1997). Since then, LSTM has been investigated intensively for estimating air quality (Cabaneros et al., 2019; Freeman et al., 2018; Kim et al., 2019; Wang & Song, 2018). LSTM units partially solve the vanishing gradient and long-term dependency problems because they allow gradients to flow unchanged (Wang & Song, 2018). A common architecture is composed of a cell (the memory part of the LSTM unit) and three regulators of the flow of information inside the unit: an input gate, an output gate and a forget gate. The cell is responsible for keeping track of the dependencies between the elements of the input sequence. The input gate controls the extent to which a new value flows into the cell, the forget gate determines which information is discarded from the cell state, and the output gate controls the extent to which the value in the cell is used to compute the output activation of the LSTM unit at timestamp t. The activation function of the three LSTM gates is the logistic sigmoid function, as in equation (6).
There are connections into and out of the LSTM gates, a few of which are recurrent. The weights of these connections, which need to be learned during training, determine how the gates operate, as follows (Hochreiter & Schmidhuber, 1997):

c̃t = tanh(wc,xt xt + wc,ht−1 ht−1 + bc), (7)
ct = ft ⊙ ct−1 + it ⊙ c̃t, (8)
ht = ot ⊙ tanh(ct), (9)

where c̃t is the candidate for the cell state (memory vector) at the current timestamp t, computed from the weights wc,xt and wc,ht−1 of the layer for its connections to the input vector xt and to the previous short-term state ht−1, respectively; a bias term bc is also added in equation (7). ct is the cell state at the current timestamp, which is the sum of the outcome of the forget gate (ft) times the cell state at the previous timestamp and the outcome of the input gate (it) times the candidate at the current timestamp. Lastly, the cell state is filtered: the outcome of the output gate (ot) multiplies the cell state passed through another activation function, tanh, which determines what portion appears as the output of the current LSTM unit at timestamp t. The final outcome (ht) of the current LSTM block can be passed through two further layers, a fully connected layer and a regression layer, to obtain the predicted output of the block. In this paper, we use adaptive momentum ('adam') (Freeman et al., 2018) as the optimiser to minimise the loss function and reduce generalisation errors.
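A single LSTM time step following the gate and state equations above can be written directly in NumPy; the packing of the parameters (stacked input/forget/output/candidate weights) is our own convention for the sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U, b stack the parameters for the
    [input, forget, output, candidate] transforms; each gate sees the
    input x_t and the previous short-term state h_{t-1}."""
    i_t = sigmoid(W[0] @ x_t + U[0] @ h_prev + b[0])      # input gate
    f_t = sigmoid(W[1] @ x_t + U[1] @ h_prev + b[1])      # forget gate
    o_t = sigmoid(W[2] @ x_t + U[2] @ h_prev + b[2])      # output gate
    c_tilde = np.tanh(W[3] @ x_t + U[3] @ h_prev + b[3])  # candidate state
    c_t = f_t * c_prev + i_t * c_tilde                    # new cell state
    h_t = o_t * np.tanh(c_t)                              # new short-term state
    return h_t, c_t
```

The additive update of the cell state (forget-gated old memory plus input-gated candidate) is what lets gradients flow largely unchanged across timestamps, as noted above.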

Evaluation attributes
In order to evaluate and compare the models, the coefficient of determination (R²), together with the mean absolute error (MAE) and the root mean square error (RMSE), is used as diagnostic evaluation attributes, as follows (Lee Rodgers & Nicewander, 1988):

R² = 1 − Σi=1..N (yi − ŷi)² / Σi=1..N (yi − ȳ)²,
MAE = (1/N) Σi=1..N |yi − ŷi|,
RMSE = √[(1/N) Σi=1..N (yi − ŷi)²],

where yi and ŷi are the i-th measured and the i-th estimated output variable, respectively, and ȳ is the expected value of the measured variable. N is the number of complete data points input to the model. R² is a measure of how close the data lie to the fitted regression line; it does not, however, consider biases in the estimation. Therefore, we further validate the models with MAE and RMSE, where MAE measures the arithmetic mean of the absolute differences between the members of each pair, whilst RMSE is the square root of the average squared difference between the forecast and observation pairs. RMSE is more sensitive to large errors than MAE (Fernández-Guisuraga et al., 2016; Zaidan, Dada, et al., 2019).
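The three evaluation attributes translate directly into NumPy:

```python
import numpy as np

def r2(y, yhat):
    """Coefficient of determination."""
    ss_res = np.sum((y - yhat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

def mae(y, yhat):
    """Mean absolute error."""
    return np.mean(np.abs(y - yhat))

def rmse(y, yhat):
    """Root mean square error; penalises large errors more than MAE."""
    return np.sqrt(np.mean((y - yhat) ** 2))
```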

Results and discussion
This section first illustrates the general characteristics of BC at both locations and the relative importance of the predictor variables in the WB models in Section 4.1. Section 4.2 analyses the model performance with respect to accuracy and sensitivity. In the latter part of that section, we examine the possible influence of seasonal variation, weekend effects and diurnal patterns on the model performance.

Variable characterisation
Before moving on to the model evaluation, it is important to familiarise ourselves with the behaviour of the output variable, BC. We also briefly describe the correlation of BC with the input variables and sort out their relative importance in the ML process. Table 2 gives a statistical description of the measured BC concentration: the mean and median are presented together with the SD and the 25th and 75th percentiles, respectively, as measures of the dispersion of the dataset. Figs. 3, 4a and 5 portray the BC concentration distribution, its trend and its distribution by month of the year and hour of the day at both sites, respectively. The typical features of BC in this study are in alignment with Luoma et al. (in review, 2020), who performed a long-term spatio-temporal trend analysis of the BC concentration in the HMA. Overall, the BC concentration at the street canyon site in Mäkelänkatu (1.03±0.88 μg m⁻³) is almost twice as high as that at the urban background site in Kumpula (0.47±0.46 μg m⁻³) because the street canyon site is in the proximity of a heavily trafficked road, while the urban background site records sources from a residential area situated 500 m to the north and from roads with moderate traffic separated by a forest. The longer distance from the sources increases the dilution of BC and hence lowers the BC levels. When we consider the seasonal variation, the BC concentration in summer at the street canyon site appears to be only 10% higher than in the other seasons. This small variation agrees with Helin et al. (2018), who observed a lack of seasonal BC variability in traffic environments. Also, Teinilä et al. (2019) suggested that the seasonally changing mixing height does not correlate with the dilution of local pollution in the street canyon. In the urban background environment, the BC concentration in winter appears to be the highest (0.65±0.60 μg m⁻³) due to the lower mixing height (Teinilä et al., 2019) and elevated wood combustion for domestic heating (Hellén et al., 2017).
Seasonal meteorological conditions can be seen in Fig. 6.

Typical features of BC concentration at the two sites
In the street canyon, the mean BC concentration on weekdays (1.17±0.95 μg m⁻³) is 60% higher than on weekends (0.72±0.57 μg m⁻³). The discrepancy between weekdays and weekends at the urban background site is smaller, less than 30%, because the traffic emissions there are less pronounced than at the street canyon site.
At both sites, the diurnal cycle is bimodal on weekdays owing to the elevated traffic counts (Fig. 7) during the peak hours at 7–9 a.m. and 4–6 p.m. in all months, as typically observed in Helsinki. The evening peak on workdays is smaller than the morning peak because of the higher mixing height in the evening (Teinilä et al., 2019), which dilutes the BC pollutants from the surface. In addition, there is a clear seasonal variation in the diurnal pattern at the street canyon site: owing to the low boundary layer height in summer mornings, BC concentrations are higher in summer mornings than in winter mornings (Luoma et al., in review, 2020). During weekends, only one peak is observed, in the late evening; it appears at 5–8 p.m. in wintertime and at 8–10 p.m. in summertime. The elevated nocturnal BC concentration might be attributed to traffic rates increasing through the day, peaking towards sunset when residents of the city return home.

Table 2
Mean, standard deviation (SD) and 25th, 50th and 75th percentiles (P25, P50 and P75) of the BC concentration (μg m⁻³) for different time separations at the Mäkelänkatu street canyon site and at the Kumpula urban background site. The time separations are winter, spring, summer, autumn, weekdays, weekends and the whole measurement period. The data coverage in percentage is presented in the last column of the table.

The completeness of the dataset in Mäkelänkatu is close to 99%, while only 70% of the data in Kumpula are valid over the whole measurement period. The missing data are owing to instrument failures caused by connectivity issues, data corruption or human error in data acquisition (Junninen et al., 2004; Zaidan, Dada, et al., 2019).
The trends of the other air pollutants are illustrated in Fig. 4b-e. Without taking diurnal patterns into account, the seasonal variations are not obvious for NOx and PM2.5. CO appears to have higher concentrations in winter and lower in summer for the same reasons as BC, i.e. the variation in boundary layer height and the amount of local combustion. O3 varies in a rather narrow range; however, it fluctuates more in winter. These observations serve as additional information for reviewing the site conditions.

Relative importance of the different predictors to the modelled BC concentration
In this study, the database comprises 20 predictor variables for BC. The importance of each predictor variable is determined by different metrics for the different WB models (Fig. 8). Because we limited the number of predictor variables to at most three in all sub-models for IAP, there is no simple way to compare all 20 variables, so we used the simple Pearson correlation for IAP. For LASSO, since the dataset had been normalised, we took the coefficients of each standardised predictor variable as the relative importance. DT uses curvature tests to split at the nodes and to decide which predictor variable is preferred for the next branch. Since different metrics were used, we normalised them to the range 0–1 for comparison; the resulting values indicate the relative importance with respect to the most important variable. A similar analysis was carried out by Singh et al. (2013). The most important predictor variables identified by IAP are NOx and LDSA at the street canyon and urban background site, respectively. Many of the other aerosol and gaseous predictors also have relatively strong importance at both sites because they are all influenced by combustion, which emits gaseous and aerosol pollutants at the same time (e.g. Helin et al., 2018; Rönkkö & Timonen, 2019). However, IAP assigns low importance to most meteorological variables. At the street canyon site, wind speed influences the model most, with a relative importance of about 0.4. In the urban background, on the other hand, the meteorological variables generally have a stronger influence than at the street canyon site. This is because the urban background site is affected by more varied BC sources at a distance than the street canyon site, and the longer distance from the sources allows meteorological factors, for instance through dispersion and dilution, to impose a relatively strong influence on the estimation of BC.
Wind speed, pressure and temperature are the three strongest predictor variables, with relative importance of about 0.6, 0.4 and 0.3, respectively.
The LASSO regularisation eliminated most of the meteorological variables, except temperature. LASSO assigns the highest relative importance to NO x at the street canyon site and to the accumulation mode (PN 1-0.09 ) at the urban background site. The second strongest variables, the accumulation mode at the street canyon site and temperature at the urban background site, have a relative importance of about 0.5. The optimised λ in the LASSO regularisation is 0.0082 at the street canyon site and 0.265 at the urban background site; the higher the λ, the smaller the number of input predictors. The coefficients and intercepts of both the IAP and LASSO linear models are given in Table 1 for the street canyon and urban background sites.
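The role of λ can be illustrated with a LASSO fit on standardised synthetic data; sklearn's `alpha` parameter plays the part of λ here, and the data and coefficient values are illustrative, not those of this study.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                      # five stand-in predictors
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)

X_std = StandardScaler().fit_transform(X)          # standardise as in the paper
model = Lasso(alpha=0.265).fit(X_std, y)           # larger alpha -> fewer predictors
n_active = int(np.sum(model.coef_ != 0))           # predictors surviving the shrinkage
```

With a large `alpha`, only the predictors with genuinely strong association keep non-zero coefficients, which is the mechanism behind the variable elimination described above.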
As for DT, the strongest predictor variables are LDSA at the street canyon site and NO 2 at the urban background site. In general, the relative importance given by DT resembles that of the other WB models. However, PM 2.5 and PM 10 unexpectedly have almost zero influence at the urban background site, which might be due to the high data incompleteness. The meteorological variables have the least impact on the fitting; PAR and wind speed are the relatively strongest among them at the street canyon and urban background sites, respectively.
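Tree-based importance of the kind discussed above can be read directly from a fitted regression tree. Note that this sketch uses sklearn's impurity-based `feature_importances_` rather than the curvature test used in this study, and the data are synthetic.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 3))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=400)   # only feature 0 matters

tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)
importance = tree.feature_importances_                # impurity-based, sums to 1
```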
Combining the relative importance given by the three WB models, BC at the street canyon site correlates strongly with NO x and LDSA, which mainly come from vehicular combustion; the BC concentration measured there thus appears to share its sources with NO x . At the urban background site, the same parameters, NO x and LDSA, correlate well with BC for the same reason. PM 10 , which is believed to come from longer-range transport, and temperature also influence the BC concentration there. The larger number of predictor variables with high relative importance further indicates that BC at the urban background site originates from a wider variety of sources.

Model evaluation
WB and BB models are evaluated by the attributes described in Section 3.4. Scatter plots of the fitting, histograms of the residuals and diurnal cycle plots are presented to delineate the model accuracy and sensitivity. The latter part of this section further analyses the variation in model performance across seasons, days of the week and hours of the day at both sites. Table 3 displays the overall performance on the training and testing sets. R 2 for all the WB and BB models is higher than 0.8 (WB: R 2 = 0.81-0.87; BB: R 2 = 0.86-0.87) at the street canyon site in Mäkelänkatu. Among all methods, DT has a relatively low R 2 and high MAE. The scatter plots show that the DT fitting produces many discrete levels due to its model architecture: a single tree with a limited number of decision splits cannot produce a continuous spectrum of estimates. Increasing the tree depth might solve this problem, but over-fitting might then arise (Kang et al., 2018). RF, on the other hand, grows many decision trees to generalise the model, and its R 2 is clearly higher than that of DT. In spite of the higher R 2 , the fitting of RF, as well as that of SVR, diverges from the 1:1 line at the lower tail (Fig. 9a).
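The accuracy attributes reported here (R 2 , MAE, RMSE) can be computed as in this small sketch; the observed and modelled values are made up for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

y_obs = np.array([1.0, 2.0, 3.0, 4.0])    # made-up observed BC values
y_mod = np.array([1.1, 1.9, 3.2, 3.8])    # made-up modelled BC values

r2 = r2_score(y_obs, y_mod)
mae = mean_absolute_error(y_obs, y_mod)
rmse = float(np.sqrt(mean_squared_error(y_obs, y_mod)))
```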

General analysis of model evaluation
The model performance at the urban background site in Kumpula is generally worse (WB: R 2 = 0.44-0.60; BB: R 2 = 0.41-0.64) and shows more marked variation among the WB and BB models than in Mäkelänkatu (Table 3). LSTM performs best (R 2 = 0.64, RMSE = 0.2646 μg m⁻³) and LASSO has the lowest MAE. SVR has the lowest R 2 (R 2 = 0.41), while IAP generates the highest MAE. As at the street canyon site, DT fails to produce a continuous spectrum of estimates, and the discretisation is more severe than at the street canyon site. LASSO, RF, SNN and SVR tend to underestimate the extreme values, and their fitting slopes are less than 1 (Fig. 9b). In addition, IAP generally overestimates the data points.
RF performs well on the training data but is not able to maintain a high R 2 on the testing data; with the hyperparameters used, RF over-fits the datasets at both sites. DT is the single-tree version of RF and, as expected, shows poorer performance than RF. LASSO, on the other hand, is the only model with a lower R 2 for the training data than for the testing data; shrinking the number of predictor variables seems to work well in avoiding over-fitting. Overall, LSTM performs well in all cases because it also uses data from the previous time steps to cope with the time dependency. SNN and LSTM have resembling architectures, but SNN has only one hidden layer and does not consider short-term memory, and it turns out to perform generally worse than LSTM. IAP and SVR have fair fitting accuracy, but IAP is still recommended because it manages to fill in missing data with other predictor variables without extra interpolation. The difference in overall performance between the Mäkelänkatu street canyon site and the Kumpula urban background site indicates that the models are location-specific because of the different pollutant sources. This location-specific property limits the use of the models as virtual sensors, which is elaborated in Section 4.3. The street canyon has been suggested to have a relatively constant BC source (Helin et al., 2018; Luoma et al., in review, 2020), so the regression performance at the street canyon site is overall better than at the urban background site, where the temporal variation in the impact of different BC sources, such as local traffic and residential wood combustion as well as long-range transported pollutants, is large. The model performance also depends on the missing data patterns, such as the completeness of the dataset and the length of the data gaps, as suggested by Junninen et al. (2004).
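The over-fitting diagnosis used above, a large gap between training and testing R 2 , can be checked as in this sketch on synthetic data with illustrative hyperparameters.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 4))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

r2_train = rf.score(X_tr, y_tr)
r2_test = rf.score(X_te, y_te)
gap = r2_train - r2_test          # a large positive gap signals over-fitting
```

With noisy data and deep trees, an RF typically fits the training set almost perfectly while the testing score is bounded by the noise, which is the pattern reported for RF in Table 3.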
In short, the difference between the WB and BB models is insignificant at both sites, which is further supported by the time series of measured BC and of BC modelled by the WB and BB models (Fig. 10). The average absolute difference between WB (represented by LASSO) and BB (represented by LSTM) is 0.08 μg m⁻³ at both sites, and the difference is less than 0.2 μg m⁻³ for more than 80% of the data points. Owing to the transparency of the WB models, one can relate them to physics-based knowledge and spot possible errors; based on the relative importance of the predictors described in Section 4.1.2, one can recognise the variables that alter the results. The estimates of the BB models, on the contrary, might contain misconceptions that are hard to discover (Rudin, 2019).

Verification of model performance
We then investigate how the seasonal variation, the weekend effect and the diurnal cycle affect the model performance. Although the strong deviation in summer is found in many of the aforementioned models, the SD in summer is the lowest compared to the other seasons. The more varied performance of all models might result from the lower completeness of the data at the urban background site. Although we interpolated the missing data, the data gap length was not considered, which makes the interpolation unsatisfactory and leads to the poorer model performance. Despite the clear difference in BC concentration between workdays and weekends, a weekend effect on the model performance is not observable at the street canyon site (Fig. 9a). At the urban background site, the fitting of LASSO, RF, SNN and LSTM on weekdays differs from that on weekends: their underestimation of extreme values is more severe on weekdays than on weekends (Fig. 9b). Fig. 12a shows the diurnal pattern of the measured BC and of the featured WB models (IAP and LASSO), together with the BB models (RF and LSTM), at the street canyon site. Apart from the seasonal variation and the weekend effect, the models follow the diurnal pattern quite well. Most of them catch the two peaks on workdays at 6-8 a.m. and at 3-4 p.m. However, none of the featured models perfectly predicts the sharp increasing trend in the morning at 3-6 a.m. LASSO slightly underestimates the period between the two peaks. On weekends, the featured models overestimate the declining trend after the only peak at 6-8 p.m.
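The diurnal cycles compared above can be derived by grouping the hourly series by hour of day. A minimal pandas sketch on a synthetic BC-like series with a morning peak (illustrative only, not data from this study):

```python
import numpy as np
import pandas as pd

# synthetic hourly "BC" series with a rush-hour peak near 7 a.m.
idx = pd.date_range("2017-01-01", periods=24 * 14, freq="h")
hour = idx.hour.to_numpy()
bc = 1.0 + 0.8 * np.exp(-((hour - 7) ** 2) / 4.0)

series = pd.Series(bc, index=idx)
diurnal = series.groupby(series.index.hour).median()   # median per hour of day
peak_hour = int(diurnal.idxmax())
```

Splitting the series into workday and weekend subsets before grouping yields the separate diurnal curves of Fig. 12.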
At the urban background site, IAP tends to overestimate at all diurnal time stamps on workdays (Fig. 12b). The other models also fail to catch the bottom of the trough and the peak at 4-9 a.m. Among the featured models, RF performs best and manages to fit the diurnal cycle tightly; LSTM is the second best for workdays. During weekends, the diurnal curve is comparatively flat, and the medians of the values calculated by the WB and BB models flatten the curve further. Again, compared to the abovementioned methods, RF best estimates the characteristics of the diurnal pattern.

Model evaluation in other aspects
Apart from the model accuracy (Table 3), other evaluation aspects, such as flexibility, complexity and efficiency, are also taken into consideration. The architecture of the WB models is transparent, which allows users to understand and, if necessary, modify the influencing variables. Among the WB models, IAP takes at most three input predictors in an adaptive way, the number of input predictors for LASSO depends on λ, and DT takes all input predictors. For the BB models, the inner components and logic are inaccessible; in case of anomalies, it is difficult to inspect the model structure and locate the problems. Some BB models, for example SNN and LSTM, have many hyperparameters, which requires extensive effort for model optimisation. In terms of efficiency, LASSO, DT and RF require the least computational resources among all the models tested in this study. SVR and SNN take 30-50% longer for the modelling process. Since IAP has to search for the best combination of input predictors and fill in missing data, its computational time can be five times higher than that of the most efficient models. LSTM takes the longest computational time, still under 10 min, because it also deals with the time delay in the modelling process.
Combining all the evaluation criteria, we deduce that IAP could be the best model despite its slightly higher computational time. It has relatively good accuracy at both sites, its model structure is transparent so that the inner components can be inspected easily, and its flexibility in selecting input predictors is a further advantage. The second best model could be LASSO, since it performs evenly at both locations and could be considered the most generalised model; its fast modelling process and transparent model structure further justify its high rank.

BC models as virtual sensors
This section discusses the possibility of turning the BC models into virtual sensors that provide additional information in support of physical sensing data. We further discuss the benefits and limitations of using BC models as virtual sensors and how the virtual sensors could supplement the current AQI. Virtual sensors use physical sensing data and calculate outputs with a suitable model to capture information about areas of interest without placing real sensors there. Wilson (1997) suggested that virtual sensors should be able to provide 1) virtual measurements, 2) predictive capability, 3) continuous output based on periodic real measurements, and 4) robustness in case of physical sensor failure. These four criteria are used to assess the feasibility of using the BC models as virtual sensors (Liu et al., 2009; Woo et al., 2016). On the grounds that IAP and LASSO were deduced to be the two best models, the assessment in this section considers only these two. Based on the model evaluation of the BC models in Section 4.2, and provided with adequate training data in advance, virtual measurements can be given continuously with fairly high accuracy and predictive capability at the street canyon site. In case of physical sensor failure, models that take all the predictors might not work, whereas IAP and LASSO still manage to predict BC from fewer input predictors. In extreme situations, where only meteorological parameters are available, IAP is still able to work thanks to its input adaptability.
However, there are limitations in using the BC models as virtual sensors. In this paper, the two measurement sites are characterised as street canyon and urban background, and each characterisation represents places with similar air pollutant sources. The corresponding BC models are also location-specific, but it might be possible to extend the models from one street canyon site to another with a calibration factor; similar ideas could be applied to urban background sites. Once more training data are gathered from different locations, a network of databases can be established within which the selection of predictors can be interchanged. These calibration factors and the estimates from the virtual sensors might drift due to environmental changes or physical sensor degradation, and the models might no longer be accurate in case of occasional events, such as wildfires or extreme weather. Therefore, the application of virtual sensors requires long-term drift monitoring, and online-adaptive models could be used to adjust the models accordingly.
Once the virtual sensors are validated and approved, the continuous measurements can be utilised to update the current AQI. While there is no universally accepted method, most organisations consider PM 10 , PM 2.5 , NO 2 , O 3 , SO 2 and CO as the parameters for AQI calculation. From the scientific point of view, these parameters are insufficient to capture the association with the health risks of aerosol particles, especially the cardiovascular effects (Geng et al., 2013). Although the World Health Organization (2012) has recommended the inclusion of BC as one of the components in the AQI alongside the other air quality parameters, this has not yet been put into action due to the unavailability of continuous BC measurements, which is attributed to the lack of national environmental legislation (Kutzner et al., 2018), instrument failure or data corruption (Junger & De Leon, 2015; Zaidan, Wraith, et al., 2019). For this reason, Monteiro et al. (2017) suggested using modelling data to complement non-existent physical sensing data in the AQI computation. Our BC models can then provide continuous modelling data with fairly good accuracy as virtual sensors. Similarly, other aerosol parameters with data availability problems, such as LDSA, could also be estimated by virtual sensors.

Conclusion
This paper presents the evaluation of white-box (WB) and black-box (BB) model performance for air quality monitoring, focusing in particular on black carbon (BC) estimation. The evaluation involves data from the period 1 January 2017 to 31 December 2018 from two measurement sites, the street canyon site in Mäkelänkatu and the urban background site in Kumpula, in Helsinki, Finland. The two selected environments represent features common in urban areas and provide complementary feedback on the model performance. A total of 20 variables, including aerosol particle, gaseous and meteorological parameters, are used as predictor variables.
At the street canyon site, the WB models perform (R 2 = 0.81-0.87) in a similar way as the BB models do (R 2 = 0.86-0.87). The overall performance at the urban background site is much worse, probably because of the greater variation in BC sources (Helin et al., 2018; Luoma et al., in review, 2020) and the greater extent of missing data (Junninen et al., 2004). Similarly, the difference between the WB (R 2 = 0.44-0.60) and BB models (R 2 = 0.41-0.64) is not noticeable at the urban background site. As for the seasonal factor, the fitted data at the street canyon site tend to overestimate in summer and autumn, whereas at the urban background site this occurs in winter. The standard deviation of the residual distribution is the highest in summer at the street canyon site but the lowest at the urban background site. In terms of the weekend effect, particular models, namely the Least Absolute Shrinkage and Selection Operator (LASSO), Random Forest (RF), Shallow Neural Network (SNN) and Long Short-Term Memory (LSTM), underestimate extreme values at both ends to a larger extent on weekdays than on weekends. RF appears to follow the diurnal pattern most closely, followed by LSTM, at both locations, whereas the Input-Adaptive Proxy (IAP) shows the worst diurnal fitting.
Among all the models used in this paper, LSTM surpasses the others in terms of accuracy at both the street canyon (R 2 = 0.87) and the urban background site (R 2 = 0.64), probably due to its ability to consider the time dependency. RF has a high R 2 for the training sets and manages to fit the diurnal pattern well. IAP performs fairly compared with the other methods; it is nevertheless recommended to serve as a virtual sensor owing to its flexibility and its ability to fill in missing data with other variables in the same dataset, and in spite of its longer computational time it is easy to use, taking at most three input predictors. LASSO could be considered the most generalised model because it performs evenly at both locations; it also has a high modelling speed and could therefore be the second best model. Overall, despite the slightly higher performance of some individual BB models, their predictions might contain and perpetuate critical misconceptions that go unnoticed, and the widespread use of machine learning models that are not comprehensible could have a lasting negative impact on the fields in which they are applied (Rudin, 2019). The WB models are still preferred because of the transparency of their architecture. Moreover, the WB models are closer to physics-based models, so we can easily recognise the relative importance of the predictor variables and determine whether the model output makes sense.
As a future step, we could combine the benefits of both model types and improve the overall accuracy. Since BC concentrations exhibit a mild seasonal cycle and a clear diurnal cycle, and since LSTM, which includes short-term memory, outperforms the other models, it is of great interest to see how time dependency improves the precision of the WB models. We could consider extracting the time dependency capability of LSTM and applying it to our selected WB models by, for instance, adding an autoregressive term. With the transparency of the WB models, we could determine the lag of the autoregressive term and verify its use.
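A minimal sketch of this idea, adding a one-step-lagged BC column to a LASSO design matrix, on a synthetic AR(1) series (all names, coefficients and data are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
n = 600
x = rng.normal(size=n)                 # stand-in exogenous predictor (e.g. NOx)
bc = np.zeros(n)
for t in range(1, n):                  # AR(1) "BC" process driven by x
    bc[t] = 0.6 * bc[t - 1] + 0.4 * x[t] + rng.normal(scale=0.05)

# augment the design matrix with the one-step-lagged BC column
X = np.column_stack([x[1:], bc[:-1]])
y = bc[1:]
model = Lasso(alpha=0.01).fit(X, y)
lag_coef = float(model.coef_[1])       # weight learned for the autoregressive term
```

Because the lag term appears as an ordinary coefficient, its magnitude can be inspected directly, preserving the transparency that motivates the WB approach.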
Another possible improvement could be to utilise the proposed models to emulate existing physics-based models, such as ENFUSER (Johansson et al., 2015), which are computationally expensive. The developed emulator could speed up some modules in those physics-based models, and a reliable spatio-temporal air quality map could then be created.
Since the extent of missing data might be a factor that degrades the fitting performance, the existing network of several measurement stations within the Helsinki metropolitan region (Luoma et al., in review, 2020) is beneficial. In case of missing data, we could fill the voids with information from the other stations after conscientious calibration. For example, in this paper the two measurement sites are characterised as street canyon and urban background; in a different setup, we may assume the similarity of the same type of environment and utilise those measurements as replacements.
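A minimal sketch of such gap filling with a calibrated sister station; the function name, the calibration factor and the series values are hypothetical.

```python
import numpy as np

def fill_from_sister_station(target, sister, cal_factor):
    """Fill NaN gaps in `target` with calibrated values from a
    sister station measuring a similar environment."""
    filled = target.copy()
    mask = np.isnan(filled)
    filled[mask] = cal_factor * sister[mask]
    return filled

target = np.array([1.0, np.nan, 3.0])    # series with one missing hour
sister = np.array([0.9, 2.0, 2.8])       # sister-station series
filled = fill_from_sister_station(target, sister, 1.1)
```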
Due to the comparatively scarce amount of continuous urban BC measurements (Kutzner et al., 2018), this BC estimation can be used as a virtual sensor in support of the reference measurements. The BC virtual sensor enables reliable continuous BC measurement, which could be useful in updating the current air quality indices so that they are better associated with the health effects of air pollutants (World Health Organization, 2012).

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.