A data-driven approach to the forecasting of ground-level ozone concentration

The ability to forecast the concentration of air pollutants in an urban region is crucial for decision-makers wishing to reduce the impact of pollution on public health through active measures (e.g. temporary traffic closures). In this study, we present a machine learning approach applied to the forecast of the day-ahead maximum value of the ozone concentration for several geographical locations in southern Switzerland. Due to the low density of measurement stations and to the complex orography of the use case terrain, we adopted feature selection methods instead of explicitly restricting relevant features to a neighbourhood of the prediction sites, as is common in spatio-temporal forecasting methods. We then used Shapley values to assess the explainability of the learned models in terms of feature importance and feature interactions in relation to ozone predictions; our analysis suggests that the trained models effectively learned explanatory cross-dependencies among atmospheric variables. Finally, we show how weighting observations helps in increasing the accuracy of the forecasts for specific ranges of ozone's daily peak values.


Introduction and motivations
Ground-level ozone (O3), which is formed in the troposphere by photochemical reactions in the presence of sunlight and precursor pollutants, such as the oxides of nitrogen (NOx) and volatile organic compounds (VOCs) (Calvert, Orlando, Stockwell and Wallington, 2015), is known to be a pollutant particularly dangerous to human health (World Health Organization, 2003; Stewart, Saunders, Perea, Fitzgerald, Campbell and Stockwell, 2017). The forecast of ozone concentrations is an important task to ensure the protection of outdoor workers who are exposed to polluted air during the most dangerous hours of the day, as well as sensitive people such as children or the elderly.
In this paper, we tested a number of machine learning algorithms that forecast the maximum hourly ozone concentration of a given day by performing the prediction at two different times: the evening before and the early morning of the target day. We are particularly interested in the days in which the ozone concentration is significantly higher than usual, due to their potential impact on public health. The choice of the target variable is based on Swiss legislation, which states that the 1-hour mean of 120 µg/m³ must not be exceeded more than once per year (The Swiss Federal Council, 1985). We use an empirical, data-centric approach that leverages a large data set of air quality and weather station measurements and weather forecasts. Data are collected for seven sites in southern Switzerland, for which the forecast is performed. We use Shapley values and a genetic algorithm to select the most important features for each model, and we calculate various forecasts with the help of cutting-edge forecasting algorithms. The work is organized as follows: Section 1.1 contains an overview of similar research in the scientific literature, while Section 1.2 highlights the contributions of this work. Section 2 introduces the dataset and the nomenclature used throughout the paper to refer to the different features, while Section 3 presents the peculiarities of the forecasting problem and the problem formulation. Section 4 describes the two feature selection methods that have been tested, namely a custom genetic algorithm and a feature selection based on Shapley values. Section 5 outlines the regression algorithms used to perform the analysis. Section 6 introduces the deterministic and probabilistic KPIs that we have used to evaluate the different forecasters. Section 7.1 presents the results of the two tested feature selection algorithms.
In Section 7.2 we study how different features and features' interactions affect the final predictions of the forecast, using Shapley values. In Section 7.3, we show the numerical results for the tested forecasting algorithms, while in Section 7.4 we focus on the prediction of extreme events. Finally, Section 8 concludes the paper with a summary of our main findings.

Related works
Tropospheric ozone concentration has been the subject of several studies, both for prediction (the task of finding a map from a set of covariates to a target) and for forecasting (predicting the values of the target in advance, at future time steps). In Al Abri, Edirisinghe, Nawadha and Kingdom (2015) different non-parametric models from the WEKA toolkit are tested to derive the ozone concentration from a set of 8 different gaseous chemicals and atmospheric conditions measured at a single location. Similarly, WEKA is used in Mohan and Saranya (2019) to fit models mapping atmospheric conditions to the ozone concentration at ground level, which shows that even summer ozone peaks can be accurately predicted if the atmospheric conditions are known. In Feng, Zhang, Sun and Zhang (2011), meteorological data from a site near Beijing were used to predict the hourly ozone concentration at that point by using a neural network whose weights were trained using a genetic algorithm. In addition, different models were fitted for different times of the day. In Sheta, Faris, Rodan, Kovač-Andrić and Al-Zoubi (2018), a nonlinear state-space model using PM10, temperature, wind speed, and relative humidity as input is identified by using a neural network to predict ozone concentration. The model is then compared with linear models and a multilayer perceptron. In Siwek and Osowski (2016) the authors used a data set of 55 characteristics (meteorological conditions and their statistical transformations) collected in Warsaw to predict various air pollutants. They showed how, by reducing the number of features with a pre-selection step, the final accuracy of the prediction could be increased. Two pre-selection methods were compared: one using a genetic algorithm and one using a stepwise greedy strategy for linear models.
The task of forecasting PM2.5 and ozone concentrations for three large Chinese cities is considered in Lv, Cobourn and Bai (2016). Like in our study, the authors considered multiple monitoring stations, but the final values of the relevant atmospheric variables were weighted averages over neighbours of the target cities. The forecasts were obtained by fitting knowledge-based empirical formulae using historical data. No systematic investigation of variable interactions was carried out. The authors showed that the maximum daily temperature is the single most relevant variable in predicting (and forecasting) ozone concentration. They consistently found a strong correlation between the numerical weather prediction error for this variable and the error in ozone prediction. In Eslami, Choi, Lops and Sayeed (2019) the authors proposed a deep convolutional neural network (CNN) to forecast the hourly ozone concentrations for the day-ahead, over 25 monitoring sites. Despite the ability of the CNN to correctly predict daily ozone trends, the authors found that it under-predicted high ozone peaks during the summer. In Gong and Ordieres-Meré (2016), the authors focused on the forecast of extreme ozone concentrations, which are also the most useful to predict. Forecasting extreme events is, in fact, more complicated than predicting them, as found in Mohan and Saranya (2019). When one is mostly interested in predicting these tail events, sampling techniques can be applied in order to mitigate the class imbalance problem (rare events are under-represented in the training data). The authors of this study applied different sampling methods to increase the classification accuracy of ozone concentration, considering three different classes. They found that under-sampling can indeed increase the classification performance.
Unfortunately, a drawback of this technique is that much of the data of the most represented class is discarded, which could lead to a lack of generalization of the model, due to over-fitting or a reduction of cross-learning (learning patterns from data in a given class which are also present in a second, unobserved class, which could increase the prediction accuracy).

Contributions
In the presence of a high number of relevant features, the task of forecasting the next day peak in the concentration of ozone becomes highly challenging, due to the low number of observations on which a forecasting algorithm can be trained. In fact, having a dataset consisting of a few years of observations could result in having a number of features higher than the number of observations, as in the presented case. On the other hand, observations further back in time may not be representative of the current situation, as the mixture of nitrogen oxides in the air has changed over time following vehicle fleet renewal. As a consequence, due to the scarce number of instances, we couldn't apply undersampling techniques, as done in Mohan and Saranya (2019). The only effective way of training a model is by applying some dimensionality reduction techniques. Our first contribution consists of evaluating two different methods to perform feature selection. First, we tested a genetic algorithm, as was done in Siwek and Osowski (2016) for the pollutant prediction task. In this case, we crafted custom mutation and crossover functions tailored to the forecasting task. The second approach we tested is based on Shapley values (Lundberg and Lee, 2017). We then evaluated and compared the two feature selection methods. To show that the feature selection step is beneficial in increasing the accuracy of the predictive algorithm, we compared our models with two control cases: one in which the model uses all the available features and one in which we pick the features entirely at random. Our second contribution is to compare the performances of different popular learning algorithms trained on the selected features. Thirdly, we investigate the effect of imposing weights on the observations with the highest daily ozone concentration on the algorithms' forecasting quality of extreme values. Our final contribution is an a-posteriori explanation of feature importance. 
We investigate the most relevant feature interactions in predicting the ozone peak and explain our findings in terms of atmospheric physics.

Geographical context and data acquisition
In this study, we focused on the Canton of Ticino, the southernmost canton of Switzerland. In this region, the concentration of air pollutants is generally higher than in the rest of the country and is influenced both by the orography and the level of urbanization and industrialization. The natural shield provided by the Alps makes Ticino the region with the highest solar radiation rate in Switzerland. Ticino is characterized by a densely populated and heavily trafficked southern region, and by a sparsely populated and more mountainous northern region. It also borders Lombardy to the south, the most industrialized region in Italy. In this study, we used data acquired from several air quality and weather stations distributed in the region. In addition, the Swiss Federal Office of Meteorology and Climatology MeteoSwiss 1 numerical weather prediction (NWP) service provided weather forecasts for some of these locations. Fig. 1 shows the position of the monitoring stations and the locations for which the weather forecasts are available. Tables 1 and 2 describe the geographical context for monitoring stations and the weather forecasting locations, respectively. Due to its photochemical origin, O3 shows a strong seasonal pattern, with higher concentrations in summer. For this reason, we focused our analysis only on the period from May to September in the years between 2015 and 2019. This period was chosen to take into account a significant number of measurements, i.e. enough to train the algorithms correctly, while simultaneously avoiding the use of previous years, when emissions of the precursors NO, NO2 and NOx in Switzerland were more intense than today. We used the data from the first 4 years to train the forecasting algorithms and the data from 2019 to test them. A variety of signals covering weather and air quality with hourly resolution were considered for the model, as shown in Table 3.
Most monitoring stations record the full set of signals specified in Table 3 on site. The few exceptions, where some signals were not collected locally, were managed with data from the nearest available stations. When training forecasting algorithms, we considered data measured up to 72 hours in the past (except for the ARIMAX model) and weather forecasts up to 33 hours in the future. The 72-hour value was chosen based on preliminary results, in which we considered a history length of up to 7 days and systematically shortened it. As we didn't experience significant accuracy improvements with a history length of more than 72 hours, we fixed this value for all the experiments. Table 3 shows the available signals. For various reasons, such as station maintenance, data transmission failure or power outages, part of the data in the time series of the years 2015 to 2019 is missing. During the training period, data completeness is above 99%, but two stations have substantial data holes in 2019: in Tesserete and Sagno, respectively, 50 and 15 full days of measurements are missing during the test period. All the missing data in the training set have been filled using a random forest with surrogate splits, trained to predict the missing data using the station itself and its neighbours.
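The paper's gap filling uses Matlab's TreeBagger with surrogate splits. As an illustrative approximation of the same idea (not the exact procedure used by the authors), a random forest can be trained to predict a station's signal from its neighbours and used to fill the holes; all data below are synthetic:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Toy hourly series: column 0 = target station, columns 1-2 = neighbouring stations.
X = rng.normal(50.0, 10.0, size=(500, 3))
X[:, 1] = X[:, 0] + rng.normal(0.0, 2.0, 500)  # neighbours correlate with the station
X[:, 2] = X[:, 0] + rng.normal(0.0, 3.0, 500)

series = X[:, 0].copy()
missing = rng.choice(500, 50, replace=False)
series[missing] = np.nan                       # simulate data holes

observed = ~np.isnan(series)
rf = RandomForestRegressor(n_estimators=50, random_state=0)
rf.fit(X[observed, 1:], series[observed])      # learn the station from its neighbours
series[~observed] = rf.predict(X[~observed, 1:])
assert not np.isnan(series).any()
```

Surrogate splits additionally let a tree handle missing values in the predictors themselves, which scikit-learn's plain random forest does not replicate here.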

Feature engineering
The large number of signals and their high granularity resulted in a high dimensionality of the dataset. To reduce the overall number of features and minimize the computational effort, we partly replaced the hourly values of the measured and forecasted signals with basic statistical aggregations, i.e. minimum, maximum and average value over a longer time period, as illustrated in Table 4.
Based on suggestions from experts in the field of atmospheric physics 2 , we further manipulated some of the signals available in the dataset to create additional features. The engineered features are listed in Table 4. In addition, we also included a categorical feature describing the general situation of the weather in Switzerland for the prediction day, using 12 weather types.
For each location, separate forecasting models are trained using a subset of the matrix of all features. This subset contains the data specific to the location and information from the neighboring stations. For NWP, hourly values are used for the specific location, while bins are used for the data of the neighboring stations, as summarized in Table 4. For example, the dataset of Chiasso contains the hourly NWP from Chiasso itself and binned aggregations from Sagno. Given the different number of stations involved each time, the number of features for each model is variable, ranging between 1700 and 2100.
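The binned aggregation of hourly signals into minimum, maximum and mean features can be sketched as follows; the bin edges and signal names here are hypothetical stand-ins, since the actual bin structure is given in Table 4:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Toy 72-hour measurement history for two signals at one station (row 0 = oldest hour).
df = pd.DataFrame({"O3": rng.normal(80.0, 20.0, 72),
                   "T": rng.normal(22.0, 4.0, 72)})

# Hypothetical bin edges over the 72-hour window (24-hour bins for illustration).
bins = {"h00_23": slice(0, 24), "h24_47": slice(24, 48), "h48_71": slice(48, 72)}

features = {}
for bin_name, sl in bins.items():
    chunk = df.iloc[sl]
    for col in df.columns:
        features[f"{col}_{bin_name}_min"] = chunk[col].min()
        features[f"{col}_{bin_name}_max"] = chunk[col].max()
        features[f"{col}_{bin_name}_mean"] = chunk[col].mean()

# 2 signals x 3 bins x 3 statistics = 18 aggregated features replace 144 hourly values.
assert len(features) == 18
```

This is how the dimensionality reduction works in principle: each bin compresses a block of hourly values into three summary statistics.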

Nomenclature
The ozone forecast at any station for any given day D is computed twice: the first time at 18:30 (16:30 UTC) of the previous day D − 1, which we call EVE forecast, and the second time at 06:30 (04:30 UTC) of the same day D, here called MOR forecast. This is because the weather forecasts issued by the NWP services are published twice a day, at 05:00 and at 14:00 local time. Fig. 2 illustrates the time window for a generic day. For each station, we tested 8 different prediction methods at the respective prediction times EVE and MOR, for a total of 16 models per station.
When labeling the aggregated data in Table 4, we use the following conventions. Measured quantities are denoted by the letter and weather forecasts provided by NWP services are denoted by the letter . The index is the difference in hours between the last available data point and the acquisition time. Following this convention, 0 refers to the last measured data point available, i.e., the value measured at 06:00 for MOR and at 18:00 for EVE. 1 refers to the value measured at 05:00 and 17:00 respectively, and so on up to 23 . The same temporal indexing applies to values provided by NWP services. For MOR we call 0 the forecasted value at 06:00, 1 the value for 07:00 and so on. In the EVE case we call 0 the predicted value at 18:00, 1 the predicted value at 19:00 and so on. The structure of the aggregation bins is shown in Table 4. To better refer to each specific component of the models, we denote the features based on the location of the measurement and time to which it is referred, combining the codes of Tables 3, 4, 1 and 2. For example, 1 1 designates the global irradiance measured in Locarno at 05:00 in the MOR model and at 17:00 in the EVE model. Similarly,̂ 3 10 is the forecasted temperature in Bioggio at 16:00 in the MOR model, and at 04:00 of the following day in the EVE model. is the mean value of all the measured NO concentrations up to 72 hours before the prediction, in Brione.

Problem formulation
The problem of forecasting the daily maximum ozone signal presents the following characteristics: 1. The signal is strongly seasonal due to the presence of annual patterns in both anthropogenic and non-anthropogenic processes governing the ozone generation. 2. The signal is non-stationary since its variance is subject to inter- and intra-annual fluctuations. 3. The forecasts' dependence on the features is non-linear, as described in the literature and as further detailed in Section 7.2. 4. The forecasted values are physically bounded by the photo-chemistry and advective phenomena regulating the formation and transport of ozone in the troposphere and atmosphere. 5. In our use case, monitoring stations providing measurements of relevant features for the ozone forecasting, such as temperature, past ozone and NOx values, are not dense enough (nor at similar distances from prediction points) to provide a regular mesh, as can be seen in Fig. 1. In this case, the use of spatio-temporal Gaussian processes Kupilik and Witmer (2018), Gaussian Markov Random Fields Cameletti, Lindgren, Simpson and Rue (2013) or other graph-based spatio-temporal techniques Carrillo, Leblanc, Schubnel, Langou, Topfel and Alet (2020) can lead to poor results.
Given the above considerations, we chose to model daily ozone maxima using separate predictors for each location, while still taking into account relevant features from nearby locations. The neighbouring stations for each prediction point are illustrated in Fig. 1, where the whole region is divided into three macro zones. In any case, we let the feature selection processes described in Section 4 discriminate whether a given measurement station is relevant or not. Calling $n$ the number of observations and $n_f$ the number of features, we define a training and a test dataset, $\mathcal{D}_{tr} = \{X_{tr}, y_{tr}\}$ and $\mathcal{D}_{te} = \{X_{te}, y_{te}\}$, respectively, where $X_{tr} \in \mathbb{R}^{n_{tr} \times n_f}$ and $X_{te} \in \mathbb{R}^{n_{te} \times n_f}$ are matrices of features, and $y_{tr} \in \mathbb{R}^{n_{tr}}$ and $y_{te} \in \mathbb{R}^{n_{te}}$ are the targets' vectors, containing the maximum hourly ozone concentration of the same days. In this paper, $n_{tr}$ and $n_{te}$ are equal to 587 and 151, respectively, that is, the number of available days between May and September for the 2015-2018 period and for 2019. On the other hand, the number of features, $n_f$, has been kept fixed to 30 for all the numerical experiments and raised to 100 for the study on the high peaks prediction in Section 7.4. These numbers were chosen experimentally by systematically increasing them and choosing the value beyond which the predictors' performance no longer increased significantly. We train a model $f(X, \Theta)$, where $\Theta$ is a set of the model's parameters, in order to produce the forecasts $\hat{y}_{te} = f(X_{te}, \Theta)$ for unseen data $X_{te} \in \mathbb{R}^{n_{te} \times n_f}$. In order to compare the results across the different approaches, we used regression-specific key performance indicators (KPIs), as classification scores can only be compared while using the same bins for the choice of the classes. Different values for the bins' edges are used in the ozone prediction literature, since those are typically chosen based on the local legislation.
As such, we trained the model $f(X, \Theta)$ minimizing the L2 loss, $\hat{\Theta} = \arg\min_{\Theta} \sum_{i=1}^{n_{tr}} \big(y_i - f(x_i, \Theta)\big)^2$. We highlight that this notation must be slightly adapted for the ARIMAX model introduced in Section 5; in this case the model can be described as $f(X, y, \Theta)$, where the endogenous input signal $y$ is then opportunely shifted with the use of the backshift operator, as further explained in Section 5.

Feature selection methods
Given the large number of features in each model, if we were to train the prediction algorithms using all the variables, whose number largely exceeds the number of available observations, we could potentially incur numerical problems of solution non-uniqueness and multicollinearity that would corrupt the prediction process. Moreover, even if the dataset contained a proportional number of observations, an excessive number of features would still result in a long computational time, which is justified only if the forecasting performance is better than that of an algorithm trained on a subset of the features. In this paper, we decided to perform feature selection using a custom implementation of a genetic algorithm (GA), as well as using a procedure originating from game theory that exploits Shapley values. The effectiveness of these two approaches is compared in Section 7.1 against a model composed of features picked at random and a model composed of all the available features.

Feature selection using genetic algorithm
In our implementation of the GA, an individual is defined as a subset of the entire feature set with cardinality $k$, where $k$ is the number of retained features. As anticipated in Section 3, in this study we set $k = 30$.
We have defined a crossover function that ensures that the offspring that emerge from it still retain $k$ features from their parents, with no repetitions. Formally, the offspring $o$ is a subset of the union of the sets of its parents $p$ and $q$, with cardinality $k$: $o \subset p \cup q$, $|o| = k$. We defined a custom mutation function so that each feature of the offspring is either the original feature of its parent with probability 95%, or a new feature from $r \subset (F \setminus o)$ with probability 5%, and such that $o$ has only unique features. In practice, we generate two sequences $a$ and $b$ from the sets $o$ and $r$, by randomly fixing their order, and iterate on them to generate the new set, which is composed of the elements of the sequence $c$, where the $i$-th element of $c$ is equal to $a_i$ with probability 0.95 and to $b_i$ with probability 0.05. The fitness function is defined as the out-of-bag mean squared error (MSE) of a random forest composed of 30 bootstrap-aggregated (bagged) decision trees, trained on the active features of the individuals. We selected a population size of 10 individuals, a crossover fraction of 80%, and an elite count of 5% of the population size. The GA stops after 100 stall generations and the feature set of the best individual is selected.
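The custom crossover and mutation operators described above can be sketched as follows; the feature pool size, feature names and helper names are invented for illustration:

```python
import random

random.seed(0)
ALL_FEATURES = [f"f{i}" for i in range(200)]  # stand-in for the full feature pool
K = 30                                        # retained features per individual

def crossover(p1, p2):
    """Offspring = K unique features drawn from the union of the two parents."""
    return set(random.sample(sorted(p1 | p2), K))

def mutate(child, p_mut=0.05):
    """Each feature is kept with prob. 95%, else swapped for an unused feature."""
    outside = [f for f in ALL_FEATURES if f not in child]
    random.shuffle(outside)
    mutated = set()
    for feat in child:
        if random.random() < p_mut and outside:
            mutated.add(outside.pop())  # replacement is disjoint from the child
        else:
            mutated.add(feat)
    return mutated

p1 = set(random.sample(ALL_FEATURES, K))
p2 = set(random.sample(ALL_FEATURES, K))
child = mutate(crossover(p1, p2))
assert len(child) == K          # cardinality preserved, no repetitions
assert child <= set(ALL_FEATURES)
```

Because replacements are drawn from outside the child's feature set, uniqueness and cardinality $k$ are preserved by construction, as the text requires.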

Feature selection using Shapley values
Another method to assess the importance of each feature is presented in Lundberg and Lee (2017). This method assigns feature importance scores using Shapley values, which originated in the field of game theory, where they are used to estimate the contribution of various agents in increasing the welfare of a community. These are expressed as: $\phi_j = \sum_{S \subseteq F \setminus \{j\}} \frac{|S|!\,(M - |S| - 1)!}{M!} \left[ f(x_{S \cup \{j\}}, \Theta) - f(x_S, \Theta) \right]$, where $f(x, \Theta)$ is a regression model (in our specific case the NGBoost algorithm, which will be introduced in Section 5), $F$ is the feature set on which the model has been trained, $\Theta$ is a model-specific set of parameters, $M$ is the number of variables in the training set, and $F \setminus \{j\}$ denotes the minus set operation, that is, the subtraction of the $j$-th feature from the feature set $F$. The authors in Lundberg and Lee (2017) have shown that such coefficients have highly desirable properties, which favourably affect their ability in the (local) explanation of the models, and that they are consistent and more robust than other, more widespread methods for the evaluation of feature importance. Furthermore, the authors in Lundberg, Erion, Chen, DeGrave, Prutkin, Nair, Katz, Himmelfarb, Bansal and Lee (2020) recently proposed a computationally efficient algorithm specifically tailored for tree-based models, and made it available through the shap python package. The shap package provides an exact computation of Shapley value explanations for tree-based models. This yields local explanations with theoretical guarantees of local accuracy and consistency, which increases the robustness of the method, since it doesn't rely on the random sampling that would be required to find the Shapley values using approximate algorithms.
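The Shapley value definition above can be illustrated with a brute-force computation over all feature coalitions of a toy value function; the three features and their coalition scores are invented for illustration only (the paper uses the shap package's exact tree algorithm instead of enumeration):

```python
from itertools import combinations
from math import factorial

FEATURES = ["temp", "irradiance", "NOx"]

# Hypothetical "quality of a model trained on this coalition" scores.
SCORES = {frozenset(): 0.0,
          frozenset({"temp"}): 0.50, frozenset({"irradiance"}): 0.30,
          frozenset({"NOx"}): 0.20,
          frozenset({"temp", "irradiance"}): 0.70,
          frozenset({"temp", "NOx"}): 0.65,
          frozenset({"irradiance", "NOx"}): 0.40,
          frozenset({"temp", "irradiance", "NOx"}): 0.80}

def value(subset):
    return SCORES[frozenset(subset)]

def shapley(feature):
    """Weighted average of the feature's marginal contributions over coalitions."""
    n = len(FEATURES)
    others = [f for f in FEATURES if f != feature]
    phi = 0.0
    for r in range(n):
        for S in combinations(others, r):
            weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
            phi += weight * (value(S + (feature,)) - value(S))
    return phi

phis = {f: shapley(f) for f in FEATURES}
# Efficiency property: the contributions sum to value(all) - value(empty) = 0.8.
assert abs(sum(phis.values()) - 0.8) < 1e-9
```

Enumeration costs $2^M$ model evaluations, which is why the exact tree-based algorithm of the shap package is needed for the 1700-2100 features of the real models.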

Regression models
After the features that best explain the data have been selected, we use them to create the regression matrix and produce the test forecasts. For this work, we studied the output of several parametric models, such as linear regression, Ridge regression, LASSO and ARIMAX, as well as more complex non-parametric tree-based algorithms, such as random forests, XGBoost, NGBoost and LSBoost, described below.

Penalized linear regression algorithms
Ridge regression is a method designed to avoid collinearity issues and near-singular matrix inversions when solving linear regression problems, especially in the case in which the number of features is large compared to the number of observations. In this case, the regression coefficients $\beta \in \mathbb{R}^{n_f}$ are quadratically penalized with parameter $\lambda$, such that the closed form solution becomes $\hat{\beta} = (X^T X + \lambda I)^{-1} X^T y$, where $I$ is the identity matrix of size $n_f$. In this work, $\lambda$ is tuned in cross validation on the training set. Instead of penalizing $\beta$ using the L2 norm, the LASSO (Least Absolute Shrinkage and Selection Operator) regression (Tibshirani, 1996) penalizes $\beta$ using the L1 norm, such that some of the elements of $\hat{\beta}$ can be set to zero. Unlike Ridge regression, LASSO does not have a closed form solution, but it has to be approximated through numerical methods.
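The closed-form Ridge solution described above can be sketched in numpy on toy data (the data and the fixed penalty value here are invented; in the paper the penalty is tuned by cross-validation):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 587, 30                      # observations and selected features, as in the paper
X = rng.normal(size=(n, m))
beta_true = rng.normal(size=m)
y = X @ beta_true + rng.normal(scale=0.5, size=n)

lam = 1.0                           # illustrative penalty value
# Closed-form Ridge estimate: solve (X'X + lam*I) beta = X'y.
beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ y)

# The solution satisfies the penalized normal equations: X'(y - X beta) = lam * beta.
assert np.allclose(X.T @ (y - X @ beta_hat), lam * beta_hat)
```

Solving the linear system directly is preferred over forming the explicit inverse, for numerical stability.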

ARIMAX
The well known Autoregressive Integrated Moving Average with explanatory variables (ARIMAX) model is defined as $y'_t = c + \sum_{i=1}^{p} \phi_i y'_{t-i} + \sum_{j=1}^{q} \theta_j \epsilon_{t-j} + X_t \beta + \epsilon_t$, where $B$ is the backshift operator, i.e. $B(y_t) = y_{t-1}$, $\epsilon_t$ is an additive white Gaussian noise and the time series $y'_t$ is the result of differencing $y_t$ $d$ times. The matrix $X$ contains the daily values of the selected features up to the day before the prediction, and $\beta$ are the regression coefficients as usual. The models are created and calculated with statsmodels's SARIMAX function and the parameters $p$, $d$, $q$ are chosen via grid search and fitted using maximum likelihood estimation; we considered only $d = 0$ since we don't have important trends, while we set 7 and 3 as maximum values for $p$ and $q$, respectively. We stress that, even if the features contained in $X$ refer to the last 72 hours, the ARIMAX model has been left free to extend the endogenous signal's influence on the forecast up to 7 previous days, that is $p = 7$. However, for all the considered locations, the grid search returned $p \le 3$.

Random Forests and Quantile Random Forests
The random forest (RF) algorithm independently fits several decision trees, each trained on a different dataset, created from the original one through random re-sampling of the observations and keeping only a fraction of the overall features, chosen at random (Hastie, Tibshirani and Friedman, 2009). The final prediction of the RF is then a (possibly weighted) average of the trees' responses. One important variant of RF algorithms is Quantile Regression Forests (QRF); the main difference from RF is that QRF keeps the values of all the observations in the fitted trees' nodes, not just their mean, and assesses the conditional distribution based on this information. In this paper, we have used the Matlab TreeBagger class, which implements the QRF algorithm described in Meinshausen (2014).

Tree-based boosting algorithms
Boosting algorithms employ additive training: starting from a constant model, at each iteration a new tree, or any other so-called "weak learner" $h_m(x)$, is added to the overall model $F_m(x)$, so that $F_{m+1}(x) = F_m(x) + \eta\, h_m(x)$, where $\eta \le 1$ is a hyper-parameter denoting the learning rate, which helps reduce over-fitting. The least-squares gradient boosting (LSBoost) algorithm applies boosting in functional space: each weak learner $h_m$ tries to learn the gradient (with respect to the previous model $F_m(x)$) of the least-squares loss function. In other words, $h_m$ is fitted on the overall prediction error at iteration $m - 1$. A different approach is used by the XGBoost algorithm (Chen and Guestrin, 2016), which fits the additive model $F(x)$ in parameter space, that is, using a second-order approximation of the loss as a function of the parameters of the weak learners (decision trees). This approximation and other techniques used by XGBoost (like approximate histogram search for selecting splitting points in the trees) result in a speedup of the training process with respect to LSBoost or RF algorithms, without sacrificing accuracy. At the same time, the algorithm introduces quadratic penalization on the parameters' values and on the overall complexity of the trees, whose parameters can be tuned to further mitigate over-fitting. In addition to the QRF algorithm, in this paper we used a second algorithm that is able to assess the conditional probability distribution of the predictions: Natural Gradient Boosting (NGBoost) (Duan, Avati, Ding, Thai, Basu, Ng and Schuler, 2019). While none of the previously introduced algorithms makes assumptions on the probability distribution of the observations, NGBoost explicitly fits the parameters of a parametric probability distribution for each observation. This is made possible by exploiting the tree structure of the underlying weak learner, since observations in the same leaves share the same probability distribution parameters.
The algorithm is fitted in functional space, but instead of directly learning the maximum likelihood gradient, the authors propose to correct it with the Fisher information. This results in fitting the so-called natural gradient, which makes the learning process invariant to reparametrization of the underlying probability distribution. We fitted LSBoost models using Matlab's fitrensemble function, tuning its hyper-parameters via Bayesian optimization with 5-fold cross-validation. The XGBoost and NGBoost algorithms were fitted using their official xgboost and ngboost python packages, respectively, while hyper-parameters were selected using grid search, again with a 5-fold cross-validation strategy. We highlight that tuning the hyper-parameters in cross-validation mitigates over-fitting issues of the regression algorithms.
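The additive least-squares boosting scheme described above can be sketched with depth-1 trees (stumps) fitted to residuals on toy data; this is an illustrative reimplementation of the generic LSBoost idea, not the Matlab fitrensemble code used in the paper:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

def fit_stump(x, residual):
    """Best single-split constant model (depth-1 tree) on one feature."""
    best = None
    for s in np.quantile(x, np.linspace(0.05, 0.95, 19)):
        left, right = residual[x <= s], residual[x > s]
        if len(left) == 0 or len(right) == 0:
            continue
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, s, left.mean(), right.mean())
    return best[1:]                       # (split, left value, right value)

eta = 0.1                                 # learning rate, shrinks each update
F = np.full(300, y.mean())                # constant initial model
for _ in range(100):
    s, lv, rv = fit_stump(X[:, 0], y - F) # each stump fits the current residuals
    F += eta * np.where(X[:, 0] <= s, lv, rv)

mse0 = ((y - y.mean()) ** 2).mean()       # error of the constant model
mse = ((y - F) ** 2).mean()               # error after boosting
assert mse < 0.5 * mse0
```

Each iteration adds a shrunken correction in the direction that most reduces the squared error, which is exactly the functional-space gradient step the text describes.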

Key performance indicators
The performance of the forecasting algorithms introduced in Section 5 has been evaluated using the following standard performance indicators:
$\mathrm{RMSE} = \sqrt{\tfrac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$, $\mathrm{MAE} = \tfrac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$, $\mathrm{MAPE} = \tfrac{100}{n}\sum_{i=1}^{n}\tfrac{|y_i - \hat{y}_i|}{y_i}$, $\mathrm{Skill} = 1 - \tfrac{\mathrm{RMSE}}{\mathrm{RMSE}_{pers}}$, $\mathrm{Accuracy} = \tfrac{1}{n}\sum_{i=1}^{n} \mathbb{1}\left[C(y_i) = C(\hat{y}_i)\right]$
where the acronyms, respectively, stand for Root Mean Squared Error, Mean Absolute Error, Mean Absolute Percentage Error, forecast Skill and Accuracy. $\mathrm{RMSE}_{pers}$ is the RMSE of the persistence model, i.e. the model where the prediction at day $D + 1$ is equal to the measured value at day $D$. $C(\cdot)$ is the function that associates every measured or forecasted value to the respective class, whose thresholds correspond to classes of increasing severity of air pollution as indicated by the Swiss society of air protection officers (Cercl'Air) (Swiss society of air protection officers, 2019). Class 3 is especially narrow compared to the other classes; as a result, it will be harder for the regression forecasting algorithms to correctly predict this class. Finally, we evaluated those algorithms which also return conditional distributions, that is, QRF and NGBoost, using two additional KPIs. The first one is the reliability (Pinson, McSharry and Madsen, 2010), defined as $R(\tau) = \tfrac{1}{n}\sum_{i=1}^{n} \mathbb{1}\left[y_i \le \hat{y}_{\tau,i}\right]$, where $\hat{y}_{\tau,i}$ is the quantile predicted by the algorithm at the level $\tau \in [0, 1]$. This KPI calculates how many of the measured values are indeed lower than the quantile predicted on the same observations. If the forecasting algorithm were perfect, the $R(\tau)$ curve would lie on the bisector of the first quadrant. The second probabilistic KPI is the average quantile loss function, also known as pinball loss (Bentzien and Friederichs, 2014): $\bar{q}(\tau) = \tfrac{1}{n}\sum_{i=1}^{n} L_\tau(y_i, \hat{y}_{\tau,i})$, where the function $L_\tau(y, \hat{y})$ is defined as $\tau\,(y - \hat{y})$ if $y \ge \hat{y}$, and $(1 - \tau)(\hat{y} - y)$ otherwise. This KPI measures how narrow the predicted probability density function is around the observations. It can be shown that this loss is minimized, independently of the underlying distribution which generated the data, when the predicted quantiles are the true ones. It should be noted that for $\tau = 0.5$, the corresponding value $\bar{q}(\tau)$ is half the value of the MAE statistic.
To further summarize the performance of the quantiles in a single score, we also integrate L̄_τ over the [0, 1] interval, as outlined in Gneiting and Raftery (2007). Thus we define

QS = ∫₀¹ L̄_τ dτ.
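The integral over τ can be approximated by averaging the pinball loss over a discrete grid of quantile levels. The sketch below is our own illustration; `quantile_fn` stands in for a quantile predictor such as QRF or NGBoost.

```python
import numpy as np

def quantile_score(y, quantile_fn, taus=np.linspace(0.05, 0.95, 19)):
    """Approximate the integral of the average pinball loss over tau in [0, 1]
    by averaging over a discrete grid of quantile levels."""
    losses = []
    for tau in taus:
        q_hat = quantile_fn(tau)  # predicted tau-quantiles, broadcastable to y
        diff = y - q_hat
        losses.append(np.mean(np.where(diff >= 0, tau * diff, (tau - 1) * diff)))
    return np.mean(losses)
```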

Feature number and feature selection methods comparison
To investigate the impact of model complexity on the overall quality of the results, we varied the number of features from 6 to 36 in steps of 6. Due to the relatively high computational time required to run this experiment, the models were computed only for the 3 stations of Bioggio, Chiasso and Locarno, on the data from 2015 to 2018, using a 7-fold cross-validation. Fig. 3 shows the computational time for all the models, as a function of the number of features and of the feature selection method, for the three selected stations.
The time needed to perform the feature selection and the prediction grows exponentially with the number of features for the GA-based feature selection, while it shows a linear trend for the feature selection based on Shapley values. The CV computation could take up to 4 hours on our machine, a 16-core Intel i9-7960X CPU @ 2.80 GHz with 128 GB of RAM. Using features selected with SHAP, we then compared the performance of a subset of forecasting algorithms while gradually increasing the number of features (Fig. 4). The errors of the models do not appear to decrease beyond about 24 features. Consequently, we decided to build our models for the main analysis with n = 30 features, a good compromise between model quality and computation time. To investigate whether feature selection effectively improves the quality of the prediction, and to determine which technique is the most efficient, we compared the performance of the algorithms across all locations using four methods: GA, Shapley values (SHAP), random features and all available features (no feature selection). For the first three methods, 30 features are either selected according to the corresponding method or picked at random, while in the last case all the available features are used to build the model. To compare the performance of the different models, we used the Nemenyi test (Hollander and Wolfe, 1999), a post-hoc pairwise test used to compare a set of models on a group of independent experiments. First, a matrix R ∈ ℝⁿˣᵏ, whose element r_{i,j} is the rank of model j in experiment i, is obtained; then the mean rank of each model is retrieved through the column-wise averages of R.
The performance of two models is identified as significantly different by the Nemenyi test if the corresponding average ranks differ by at least the critical difference

CD = q_α √( k(k + 1) / (6n) ),

where q_α is the quantile of the Studentized range statistic for k models divided by √2, here evaluated at the confidence level α = 0.9, and n is the number of experiments. We implemented the Nemenyi test in Python following the implementation in the tsutils R package (Kourentzes, Svetunkov and Schaer, 2020). The Nemenyi test is usually performed after a Friedman test, a non-parametric analysis of variance for a randomized block design, which can be considered a non-parametric version of a one-way ANOVA with repeated measures. More details on the differences between the two tests and on their implementation can be found in Dale (2006). The results of the Nemenyi test can be observed in Fig. 5. In Table 5, we show the KPIs calculated for a few selected forecasting algorithms at the locations with the best, average and worst RMSE statistics, namely Bioggio MOR, Chiasso MOR and Mendrisio EVE.
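A minimal sketch of the mean-rank and critical-difference computation behind the Nemenyi test (our own code, not the tsutils port used in the paper; the raw Studentized-range quantile q_α is passed in, e.g. from scipy.stats.studentized_range):

```python
import numpy as np
from scipy.stats import rankdata

def nemenyi(results, q_alpha):
    """results: (n experiments, k models) matrix of scores, lower is better.
    q_alpha: raw Studentized-range quantile for k models; with the raw
    quantile the critical difference is CD = q_alpha * sqrt(k*(k+1)/(12*n)),
    equivalent to (q_alpha/sqrt(2)) * sqrt(k*(k+1)/(6*n)).
    Returns the mean rank of each model and the critical difference."""
    n, k = results.shape
    ranks = np.vstack([rankdata(row) for row in results])  # rank within each experiment
    mean_ranks = ranks.mean(axis=0)
    cd = q_alpha * np.sqrt(k * (k + 1) / (12.0 * n))
    return mean_ranks, cd
```

Two models are then declared significantly different whenever their mean ranks differ by more than `cd`.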
The results show that the Shapley-value approach obtains the best rank for all the ensemble methods across all the locations. For Ridge and LM, we essentially have a tie between SHAP and GA. In the case of LASSO, the approach that uses all the features obtains the best rank; this is likely explained by the fact that LASSO inherently performs feature selection by design. We can conclude that a preliminary feature selection based on Shapley values is preferable in terms of both KPI quality and computation time: it is faster than GA and, compared to using all the features, it avoids needlessly slowing down the predictive algorithms.

Feature importance and interactions
To assess the importance of each feature in predicting the maximum ozone concentration, we applied once again the shap library introduced in Section 4.2, using the NGBoost algorithm as the regression model f(x, Θ). We then analyzed the most relevant features for the 7 locations and their interactions. Table 6 summarizes the 3 most relevant features for each considered location. In 5 of the considered cases, the most important feature was a forecasted temperature value, for both the MOR and EVE predictions; in the other cases, it was a past value of the ozone concentration. For all the locations, the three most relevant features include forecasted values of the temperature, either local or at a nearby station, and past measured ozone concentrations.
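Given the per-sample Shapley values returned by the shap library, the "most relevant features" of Table 6 amount to ranking features by mean absolute Shapley value. A small illustrative helper (our own, assuming `shap_values` has already been computed, e.g. with shap.TreeExplainer):

```python
import numpy as np

def top_features(shap_values, feature_names, k=3):
    """Rank features by mean absolute Shapley value (global importance).
    `shap_values` is the (n_samples, n_features) array produced e.g. by
    shap.TreeExplainer(model).shap_values(X)."""
    importance = np.abs(shap_values).mean(axis=0)
    order = np.argsort(importance)[::-1][:k]
    return [(feature_names[i], float(importance[i])) for i in order]
```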

Temperature influence
In Fig. 6, we show the influence of single observations on the ozone forecast, divided by feature, for the location of Chiasso. The most relevant predictor is the forecasted temperature at a nearby location. Several chemical reactions and pathways influence the chemistry of ozone formation and removal in the troposphere (Lu, Zhang and Shen, 2019; Crutzen, Lawrence and Pöschl, 1998; Monks, Archibald, Colette, Cooper, Coyle, Derwent, Fowler, Granier, Law, Mills, Stevenson, Tarasova, Thouret, Von Schneidemesser, Sommariva, Wild and Williams, 2015), so that attributing the influence of temperature to just one of them is hardly satisfactory, and other factors and processes must be considered. The photochemical processes, together with thermal and radiative conditions, are the dominant factors in determining the ozone peak concentrations. Other, more physical processes, such as local wind systems and diurnal boundary-layer dynamics, are of secondary importance in our study, since it focuses on the generally well-mixed diurnal boundary layer: we aim to forecast the peak concentrations under well-mixed afternoon conditions rather than the ozone concentration variations within the diurnal cycle. We stress that some phenomena are not directly considered by the forecasting algorithms, and that our analysis is limited to the variables selected by the feature selection procedure; examples are the role of the effective height of the boundary layer for the air-mass characteristics (precursor mixture, photochemical reactivity), which determines the volume available to dilute the emissions, and the role of large-scale advection. However, since the temperature is causally correlated with the height of the boundary layer, we are indirectly accounting for some of its effect by using temperature as a predictor. Fig.
7 shows the Shapley values for the mean of the temperatures forecasted in Sagno between 14:00 and 23:00, and their interaction with the mean forecasted temperature in Chiasso at 17:00. We can observe that the influence of the forecasted temperature on the maximum O3 peak does not follow the typical exponential form of a chemical rate constant, but rather a sigmoid. A similar sigmoid-like functional relation between these two variables can be observed in all the other locations. This functional form can be explained by explicitly modeling the main chemical reactions involved in the formation and destruction of O3, with their temperature-dependent rate coefficients. The authors of (Pusede, Steiner and Cohen, 2015; Pusede, Gentner, Wooldridge, Browne, Rollins, Min, Russell, Thomas, Zhang, Brune, Henry, Digangi, Keutsch, Harrold, Thornton, Beaver, St. Clair, Wennberg, Sanders, Ren, Vandenboer, Markovic, Guha, Weber, Goldstein and Cohen, 2014) modeled the interdependence of the O3 production rate, NOx and temperature. First, they modeled the main chemical tropospheric O3 formation processes. Then, they replaced the volatile organic compound (VOC) reactivity with a functional relation, learned from observations, between the latter and the daily maximum temperature. Finally, they found a sigmoid-like influence of the daily maximum temperature on O3 production for concentrations of NOx greater than 6 ppb. A similar sigmoid-like dependence of the O3 formation rate on temperature was found in Walcek and Yuan (1995), both when considering and when disregarding a linear correlation between maximum daily temperature and solar irradiance (Fig. 4a and Fig. 5 in the reference, respectively). In all these studies, the authors kept the influence of other variables, such as VOC concentration and irradiance, fixed, or treated them as parameters. This is equivalent to performing a sensitivity analysis on the rate of change of O3, which is exactly the scope of the Shapley values.
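To make the sigmoid-like shape concrete, one can fit a four-parameter logistic curve to the Shapley values of the temperature feature. The sketch below uses scipy.optimize.curve_fit and is our own illustration, not part of the paper's pipeline.

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(t, a, b, t0, s):
    # generalized logistic: lower asymptote a, upper asymptote a + b,
    # midpoint t0, slope scale s
    return a + b / (1.0 + np.exp(-(t - t0) / s))

def fit_sigmoid(temp, shap_vals):
    """Fit a sigmoid to the per-observation Shapley values of the
    forecasted temperature, as a crude check of the shape in Fig. 7."""
    p0 = [shap_vals.min(), np.ptp(shap_vals), np.median(temp), 1.0]
    popt, _ = curve_fit(sigmoid, temp, shap_vals, p0=p0, maxfev=10000)
    return popt
```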
In this sense, the findings of the aforementioned authors are compatible with Fig. 7. We stress that the sigmoid-like importance of temperature in predicting the O3 concentration cannot be directly extracted from the raw data. To better explain this, we show in Fig. 8 the partial dependence plot of the measured O3, that is, the target variable, against the measured temperature. As we can see, the raw data show an exponential-like relation, very similar to what was found in Walcek and Yuan (1995), Fig. 1a, where measurements of maximum hourly O3 concentration and temperature from the New Jersey urban region are plotted.

NO2 importance and interactions
Solar irradiance and the NOx concentration play a major role in the photochemistry of tropospheric ozone (Crutzen et al., 1998; Walcek and Yuan, 1995). In Fig. 9 and 10, we show the influence of the measured NO2, and its interaction with the forecasted temperature, on the MOR prediction of O3 in Bioggio and Sagno. In both cases, the sigmoid-like NO2 importance becomes more pronounced at higher forecasted temperatures. This means that at high concentrations of NO2, higher temperatures accelerate the O3 formation rate. On the other hand, Fig. 9 and 10 suggest that at low concentrations of NO2, O3 generation from NO2 becomes increasingly important with increasing temperature with respect to other concurrent O3 formation processes. As a result, low NO2 concentrations are more significant predictors of low O3 concentrations at high ambient temperatures. These conclusions cannot be reached with the kind of analysis carried out, for example, in Pusede et al. (2015), where only the O3 formation rate with respect to NOx and temperature is investigated, and not its relative importance over other concurrent photochemical pathways.
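The interactions shown in Fig. 9 and 10 are based on pairwise SHAP interaction values. A small helper (our own naming) for extracting interaction and main-effect profiles from the tensor returned by shap.TreeExplainer(...).shap_interaction_values(X) could look like:

```python
import numpy as np

def interaction_profile(inter_vals, i, j):
    """Pairwise SHAP interaction between features i and j, given the
    (n_samples, n_features, n_features) tensor of interaction values.
    The off-diagonal contribution is split symmetrically, so the two
    halves are summed."""
    return inter_vals[:, i, j] + inter_vals[:, j, i]

def main_effect(inter_vals, i):
    # diagonal entry: the feature's effect with interactions removed
    return inter_vals[:, i, i]
```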

Performance of the regression models
In Table 7 we present the results of our study with respect to the KPIs introduced in Section 6, obtained by first applying the SHAP feature selection approach described in Section 4.2.
In general, the two algorithms that perform best overall are LSBoost and NGBoost, since most of the lowest MAE and RMSE values and the highest accuracy values are concentrated in these two algorithms. At the station of Tesserete, nearly one third of the total ozone measurements of summer 2019 were not available, so its results are not directly comparable with those of the other stations. Table 7 shows that, for the same algorithm, the MOR results are generally better than their EVE counterparts, although in some cases the difference is slight. This is not surprising, since the MOR predictions also use the data gathered overnight, which are not available to the EVE forecasters. However, there is an exception in Brione, where the MAE statistic is slightly higher at MOR for all the ensemble algorithms. The best results are obtained in Bioggio, where both the MOR and EVE forecasts have the best KPIs across all the stations. The relatively small differences between MOR and EVE seem to indicate that the data gathered during the night are not particularly important. In fact, as Table 6 shows, the three most important features for each model essentially revolve around the ozone measured during the peak of the previous day and the forecasted temperature. All models assign high importance to the forecasts for the upcoming afternoon and to the measured ozone values of the previous afternoon. Arguably, the NWP forecasts are more precise in the morning, thus leading to more precise ozone predictions as well. Fig. 11, 12 and 13 graphically illustrate the results for the best (Bioggio MOR), average (Chiasso MOR) and worst (Mendrisio EVE) cases. Each figure is composed of 4 plots. The top one shows a comparison between the main forecasting algorithms and the measured values. The second plot shows the prediction intervals at levels 20%, 40%, 60% and 80% issued by the RF algorithm, as well as the RF prediction.
The third and fourth plots further investigate the goodness of fit of the quantiles of the RF and NGBoost algorithms.

High peaks prediction
The analysis presented so far focused on the dataset in its entirety, aiming to provide the best KPIs over all the data, independently of the air-pollution severity class. However, when predicting ozone concentrations, it is generally more important to correctly predict high concentrations, as they can pose a health risk. For this reason, we decided to prompt the predictor to focus more on high concentrations, in our case classes 4, 5 and 6 as defined in Eq. (14), by introducing weighted training. We assigned different weights to the observations depending on their severity class. We found it beneficial to assign a weight w1 ∈ [2, 20] to observations in class 6, w2 ∈ [2, 20] to observations in class 5, w3 ∈ [1, 2] to observations in class 4, and a fixed weight of 1 to all the other observations. The key idea is to find, for each location and case, the optimal set of weights w1, w2, w3 that improves the forecasting quality at high concentrations. The same set of weights is used both during feature selection with SHAP and NGBoost and to train the prediction algorithm with the selected features; applying weights during feature selection should help select the features that are most important for recognizing the highest ozone concentrations. We sought to increase the classification accuracy on three particular subsets of our data, evaluating the combination of weights that optimizes the prediction accuracy for:
• values in classes 4, 5 and 6 (O3 > 135 µg/m³);
• values in classes 5 and 6 (O3 > 180 µg/m³);
• values in class 6 (O3 > 240 µg/m³).
We analyzed the four stations with the highest number of extreme measurements in 2019: Bioggio, Mendrisio, Locarno and Chiasso. All these stations registered at least one event of class 6 and many events of class 5, as shown in Table 8. In particular, Chiasso registered 27 measurements above 180 µg/m³, of which 4 above 240 µg/m³.
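A minimal sketch of the weighting scheme (the helper and its name are ours; the thresholds are the class boundaries quoted above). The resulting vector can be passed as a sample_weight argument to regressors that support it, such as NGBoost's fit method.

```python
import numpy as np

def class_weights(y, w1, w2, w3):
    """Per-observation training weights by severity class: w1 for class 6
    (O3 > 240 ug/m3), w2 for class 5 (> 180), w3 for class 4 (> 135),
    and 1 for all other observations."""
    w = np.ones_like(y, dtype=float)
    w[y > 135] = w3
    w[y > 180] = w2
    w[y > 240] = w1
    return w
```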
In contrast to what we observed with unweighted training, we noticed that with weighted training, increasing the number of selected features above 30 improves the prediction accuracy at high ozone concentrations. Therefore, we increased the number of features that the algorithm can use for its prediction to 100. We used SHAP as the feature selection method and NGBoost as the regressor, and calculated the KPIs of the models for each combination of the weights with w1, w2 ∈ {2, 4, 6, …, 20} and w3 ∈ {1, 2}. Fig. 14 shows the aggregated accuracy of the prediction for the three classes of interest. Fig. 14a and 14b illustrate the distribution of the accuracy when considering only the measured ozone values above 135 µg/m³; similarly, Fig. 14c, 14d and 14e, 14f show the accuracy when restricting ourselves to values above 180 and 240 µg/m³, respectively. The continuous iso-lines are obtained with cubic interpolation. It is difficult to infer which weights give the best results when trying to maximise the accuracy on observations above 135 µg/m³; for the other two cases, high w2 and variable w1 give the best results for observations above 180 µg/m³, while high w1 and low w2 appear to be the best combination for correctly predicting observations above 240 µg/m³. Table 9 shows the results of weighted training for the considered stations. We report the KPIs and the three sets of weights that gave the best accuracy for each fraction of the dataset, compared with the results obtained when no weights are applied. We can see that the introduction of the weights does not unduly affect the KPIs, which in fact improve in some cases. For Chiasso MOR, we also show in Figure 15 the complete confusion matrices. It can be seen that, when actively trying to enhance the recognition of observations above 135 µg/m³, the correct classification of these values rises from about 50-75% to about 80-85%.
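The exhaustive search over weight combinations can be sketched as follows (our own helper; `evaluate` stands in for the full SHAP feature selection plus NGBoost cross-validation loop of the paper, returning the accuracy on the target class subset for a given weight triple):

```python
import itertools
import numpy as np

def best_weights(evaluate,
                 w1_grid=range(2, 21, 2),
                 w2_grid=range(2, 21, 2),
                 w3_grid=(1, 2)):
    """Exhaustive search over the weight grid used in the paper
    (w1, w2 in {2, 4, ..., 20}, w3 in {1, 2}); keeps the combination
    with the highest accuracy reported by `evaluate`."""
    best, best_acc = None, -np.inf
    for w1, w2, w3 in itertools.product(w1_grid, w2_grid, w3_grid):
        acc = evaluate(w1, w2, w3)
        if acc > best_acc:
            best, best_acc = (w1, w2, w3), acc
    return best, best_acc
```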
Similarly, for the observations above 180 µg/m³, the correct recognition rises from 50-70% to 80-90% at all stations but Locarno, where it stops at 70%. Finally, in Mendrisio and Chiasso we could correctly predict all values above 240 µg/m³, which was not achieved in the unweighted analysis. This is not the case for Bioggio and Locarno, where the only class 6 value is never recognized.

Conclusions
In this paper, we forecasted the day-ahead maximum of the ground-level ozone concentration during the summer period of 2019 in 7 localities in Southern Switzerland, using a physics-agnostic, data-driven approach. Due to the high number of signals potentially affecting the predictions, we performed a preliminary feature selection, using and comparing two methods. The selected features were then used to train different state-of-the-art forecasting algorithms. Analyzing feature importance and interactions using Shapley values suggests that the models trained through our learning pipeline effectively learned explanatory cross-dependencies among atmospheric variables that are described in the ozone photochemistry literature. Our analysis showed that gradient-boosting algorithms, and in particular Least Squares Boosting and Natural Gradient Boosting, consistently performed better than the other tested forecasting methods. Where possible, we further compared our results with those of other papers, and we were able to conclude that our results are similar to previous analyses and, in some cases, even better. We then evaluated the effect of weighted training for increasing the accuracy of predictions at high ozone concentrations. Our analysis shows that this method is viable, as it increases the forecast accuracy without compromising the overall forecast quality. Future directions for this work include the formulation of probabilistic techniques for the robust estimation of annual ozone concentration peaks, which are the most difficult events to predict due to their scarcity in the training set. In this view, training the forecasters with ad-hoc generated adversarial examples could result in a better forecast of the conditional probability distributions.