Improving classification‐based nowcasting of radiation fog with machine learning based on filtered and preprocessed temporal data

Radiation fog nowcasting remains a complex yet critical task due to its substantial impact on traffic safety and economic activity. Current numerical weather prediction models are hindered by computational intensity and knowledge gaps regarding fog-influencing processes. Machine-learning (ML) models, particularly those employing the eXtreme Gradient Boosting (XGB) algorithm, may offer a robust alternative, given their ability to learn directly from data, swiftly generate nowcasts, and manage non-linear interrelationships among fog variables. However, unlike recurrent neural networks, XGB does not inherently process temporal data, which is crucial in fog formation and dissipation. This study proposes incorporating preprocessed temporal data into the model training and applying a weighted moving-average filter to regulate the substantial fluctuations typical in fog development. Using an ML training and evaluation scheme for time series data, we conducted an extensive bootstrapped comparison of the influence of different smoothing intensities and trend information timespans on model performance at three levels: overall performance, fog formation, and fog dissipation. The performance is checked against one benchmark model and two baseline models. A significant performance improvement was noted for the station in Linden-Leihgestern (Germany), where the initial F1 score of 0.75 (prior to smoothing and trend information incorporation) was improved to 0.82 after applying the smoothing technique and further increased to 0.88 when trend information was incorporated. The forecasting periods ranged from 60 to 240 min into the future. This study offers novel insights into the interplay of data smoothing, temporal preprocessing, and ML in advancing radiation fog nowcasting.


INTRODUCTION
Radiation fog continues to be a difficult forecasting task (Boutle et al., 2022). Ongoing efforts are being made to enhance fog forecasting (Bartók et al., 2022; Castillo-Botón et al., 2022; Jayakumar et al., 2021; Pauli et al., 2020), which is crucial given that fog is a rapidly developing system that can result in high risks and fatalities in both traffic (Wu et al., 2018) and public transport (Brázdil et al., 2022). Inaccurate fog forecasts also contribute to substantial economic costs due to delays in air traffic (Kulkarni et al., 2019).
The uncertainty in fog forecasting in general arises from the complexity of the fog system. Fog development and intensity result from a highly sensitive interplay of various atmospheric variables (Wainwright & Richter, 2021). Fog is considered to have formed when visibility drops below 1 km (American Meteorological Society, 2012; German Weather Service, 2023). Visibility reduction stems from a non-linear and temporally interrelated process (Steeneveld & De Bode, 2018). Turbulence and wind speed mix the air layer, affecting droplet size and concentration, which in turn determine the extent of visibility reduction (Pérez-Díaz et al., 2017; Steeneveld & De Bode, 2018; Thies et al., 2017).
Radiation fog, specifically, frequently occurs during spring, autumn, and winter nights when strong radiative cooling in high-pressure areas causes the dewpoint to be reached (Bendix, 2002; Pérez-Díaz et al., 2017). A radiation fog event can last from two to twelve hours (Bott & Trautmann, 2002; Román-Cascón et al., 2016) and is characterized by three phases: formation, maintenance, and dissipation (Maier et al., 2011; Nakanishi, 2000). The formation phase involves high fluctuations in many variables (Boutle et al., 2022), making accurate forecasting challenging. However, the duration of a fog event is heavily influenced by the transition from the formation to the maintenance phase (Bergot & Lestringant, 2019), which is a crucial step for accurate forecasting. Radiation fog typically dissipates after sunrise with increasing air temperature (Gültepe et al., 2007; Price, 2019), as warmer air can hold more moisture, causing fog droplets to evaporate (Gültepe et al., 2007; Price, 2019). A decrease in supersaturation can also be caused by the entrainment of dry air, leading to the dissipation of fog (Li et al., 2023). All this complex information must be made available to the model for fog forecasting.
There are two primary methods for forecasting radiation fog. One long-standing approach is using numerical weather prediction (NWP) models (Boutle et al., 2022; Xinmei et al., 1990), which are based on mathematical models with partial differential equations. To achieve accurate forecasts, fog-related processes and their interactions must be precisely represented in the NWP model (Román-Cascón et al., 2019). However, some fog-influencing processes are not yet fully understood or implementable in their entirety (Lakra & Avishek, 2022), hindering the accurate modeling of such complex processes (Steeneveld & De Bode, 2018; Stolaki et al., 2015). Moreover, computationally expensive NWP models (Krasnopolsky & Fox-Rabinovitz, 2006) can take several hours to produce a forecast, covering only part of the nowcasting period of up to six hours (Urbich et al., 2020), which is the most important period, for example, for an aviation forecaster. Promising efforts are ongoing to reduce the computational costs, for example by post-processing a single deterministic NWP forecast (Kamangir et al., 2021). However, up to now the nowcasting gap remains.
The second and strongly emerging approach to forecast fog is machine-learning (ML) algorithms. ML models can help to bypass the long computation time and the current knowledge gap. ML models are very suitable for nowcasting fog (Lakra & Avishek, 2022) since they can generate a rapid forecast within seconds (Palvanov & Cho, 2019). The increasingly popular ML models offer a heuristic approach, partially circumventing the knowledge gap and the parameterization of process interactions. ML algorithms can learn information directly from data without relying on a predetermined model (Mahesh, 2020; Samuel, 1959). The models in this study are trained using ML algorithms only.
For an ML model to effectively learn from data, the following requirements should be met. The dataset provides the information or problem for the algorithm to learn and must represent fog-influencing and fog-characterizing properties through the selected variables. The algorithm must be capable of learning the problem from the data. As fog variables are interconnected and do not react linearly to one another (Jacobs et al., 2007), the chosen learning algorithm must process these non-linearities. The underlying, longer-term temporal development of the variables is also valuable information regarding fog formation or dissipation. For example, a rise or fall of the temperature or relative humidity affects the chance of fog formation or dissipation. A machine-learning-based forecast is ultimately based on calculated probabilities. Consequently, the temporal development should be made available to the algorithm during training, either through the algorithm architecture or the input variables.
One example of handling temporal developments in fog forecasting involves using different types of recurrent neural networks (RNN) (Jonnalagadda & Hashemi, 2020; Miao et al., 2020; Pan et al., 2019; Park et al., 2022). These networks have already incorporated the concept of considering temporal development into their algorithm's architecture. However, neural networks require large datasets, high computational power to train the model initially, and time-intensive fine-tuning (Memon et al., 2019), which also has an impact on the environment in terms of carbon emissions (Lacoste et al., 2019). Furthermore, the use of complex ML (deep learning) algorithms compared to simpler ML algorithms does not always yield significantly better results at much higher computational costs (Peláez-Rodríguez, Marina, et al., 2023; Peláez-Rodríguez, Pérez-Aracil, et al., 2023). In this study, we therefore aim to use the eXtreme Gradient Boosting (XGB) algorithm. XGB is a tree-based algorithm that can handle non-linear relationships. XGB is used in various fields to achieve state-of-the-art results (Chen & Guestrin, 2016), including weather forecasting (Cai et al., 2020; Fan et al., 2021; Kumari & Toshniwal, 2021). It has been shown to outperform other tree-based models, such as random forest, and can achieve similar forecasting accuracy to neural networks (Dewi et al., 2020; Kumari & Toshniwal, 2021; Sheridan et al., 2016). It is robust to smaller datasets (Luckner et al., 2017) and requires much less computational power (Li et al., 2022), making large-scale applicability for fog forecasting realistic.
However, with the XGB algorithm, unlike RNNs, the past temporal development is not inherently available. One approach to make past information available is to include time-shifted versions of the variables in the dataset (Bartók et al., 2022). With this approach, XGB has to learn the correct temporal order implicitly. During model training, however, only a subset of the available variables is selected in order to avoid overfitting (Chen & Guestrin, 2016). Coincidentally, the development of fog is subject to high temporal dynamics, especially during fog formation (Bendix, 1998; Mason, 1982). Thus, implicit learning could pose a challenge to the algorithm. This study therefore seeks to investigate whether predefined and temporally ordered trend information will have a positive impact on the forecasting performance.
However, to provide short-term trend information, the high dynamics during fog development must also be addressed. These dynamics, caused by physical atmospheric conditions, are further amplified by noise occurring during sensor measurement of time series data (Trenberth, 1984; Kostelich & Schreiber, 1993). This likely has implications for providing information on temporal dynamics, for example, through calculated trend variables. An NWP-related study by Rémy and Bergot (2010) showed that using a filtering algorithm can improve the fog forecast on real observations within the first three hours. Therefore, we want to investigate whether the performance increase after smoothing the data also appears with the ML algorithms used in this study. By filtering the variables, we aim to better represent information about the short-term temporal trends during model training.
Building on the recently introduced ML training and evaluation scheme for time series data (Vorndran et al., 2022), the objective of this study is to investigate the effects of smoothing and the integration of additional temporal development-related variables on classification-based model performance. Therefore, an extensive bootstrapped comparison of different smoothing intensities and timespans of trend information was conducted for the station in Linden-Leihgestern (Germany) to be able to make a precise statement on the influence of the respective procedure. The impact of different smoothing levels on the forecasting performance will be evaluated, as well as the influence of combining smoothing levels with past development information. This diversity of datasets offers a broad insight into how different factors influence the model's performance. Forecasting performance is evaluated for three fog-relevant forecasting segments: overall model performance, fog formation, and fog dissipation. The correct forecast of fog formation and dissipation, that is, the transitions, is essential to an operational fog forecasting model and allows for more accurate insights into the model's performance (Bartók et al., 2022; Vorndran et al., 2022). Four different evaluation scores, a benchmark logistic regression model, and two baseline models were used to analyze the XGB model performance.

DATA AND PREPROCESSING
The data used in this study were collected from November 15, 2009 to February 25, 2016. The data sources and the methods used to prepare the data are described in this chapter.

Data source
The data were collected at the ground truth and profiling station located in Linden-Leihgestern, Germany (50°31′58.67″ N; 8°41′3.85″ E). The station is situated in a location with one of the highest fog occurrences in Germany (Bendix, 2002). Continental radiation fog is predominant. The data were collected by the Laboratory of Climatology and Remote Sensing (LCRS) and the Hessian Agency for Nature Conservation, Environment and Geology (HLNUG). The continuous and extensive measurement of several fog-related variables at this site provides a strong foundation for radiation fog forecasting and analysis (Egli et al., 2015; Maier et al., 2011; Vorndran et al., 2022). Table 1 shows the variables used for model training and evaluation. The variable wind direction was converted from degrees to radians and combined with the horizontal wind speed to derive the wind components u and v, which were then used instead of horizontal wind speed and wind direction.

Data preprocessing
Radiation fog most likely occurs during the colder months, when the temperature difference between the ground and the air is greatest. During these months, the cooling of the ground can cause the moisture in the air to condense into fog (Gültepe et al., 2007). The data basis for this study was therefore limited to the months from September to April.
The data underwent quality assessment to ensure accuracy and reliability. The data were analyzed for outliers and trends to ensure that the training process was not influenced by outliers. Short data gaps within the time series of each variable were filled using a Gaussian-based interpolation, considering the autocorrelation of the variables. Due to a very limited extent of data gaps (Figure 1), the training and evaluation were conducted on seven well-represented fog seasons.
For model training, the data points must be organized into categories based on a specific characteristic. The data points for this study were classified into two categories: fog and non-fog points. This categorization allows the model to learn the patterns and relationships between the variables and the likelihood of fog occurrence. The model can then make forecasts about future fog events based on the learned information. The classification in this study is based on visibility, with points having a visibility equal to or below 1.1 km classified as fog points and points with visibility above 1.1 km classified as non-fog points. The LCRS sensor used for the visibility measurement (Biral VPF-730) has a measurement error of no more than 1.8% in the fog-relevant visibility range of up to 1.5 km. When classifying data points for radiation fog forecasting, we consider an overestimation that incorrectly leads to a non-fog point to be worse than an underestimation that incorrectly leads to a fog point. Therefore, a slight upward shift of the fog threshold (to 1.1 km) was applied to account for small amplifications of the signal near the official fog threshold of 1 km due to natural noise or unfavorable weather conditions (Lakra & Avishek, 2022).
Temporal resampling and scaling are two key steps in preparing data for ML-based model training.
Resampling is the process of changing the time resolution of a set of data points for each variable. For this study, the original data points of the visibility measurement were aggregated to 30-min intervals. This corresponds to the resolution of the HLNUG measurement. Each new data point represents the average of a block of original data points covering a 30-min interval. Afterwards, the data were scaled using scikit-learn's RobustScaler. This ensured that the different variables were on a common scale and that extreme values did not dominate the results, regardless of the algorithm chosen.
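These two steps can be sketched as follows; the column name and the toy minute-resolution input are illustrative assumptions, not the study's actual data:

```python
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Toy minute-resolution series with a datetime index
idx = pd.date_range("2010-01-01", periods=120, freq="min")
raw = pd.DataFrame({"visibility": range(120)}, index=idx)

# Aggregate to 30-min intervals: each new point is the block average
resampled = raw.resample("30min").mean()

# Scale with scikit-learn's RobustScaler (median/IQR based, so
# extreme values do not dominate the common scale)
scaled = pd.DataFrame(
    RobustScaler().fit_transform(resampled),
    index=resampled.index,
    columns=resampled.columns,
)
```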
The final dataset comprises 67,997 data points in total; 5.29% of the data points represent fog events, while the remaining 94.71% are non-fog points. Thus, the dataset is strongly imbalanced. However, the ratio of the two classes, fog and non-fog, was not balanced by sampling techniques to keep the natural order of the data points (cf. Vorndran et al., 2022). Instead, we penalized an incorrect prediction of the fog class twice as much as an incorrect prediction of the non-fog class during model training. We assume that, with the high fluctuations of the time series reduced, the underlying longer-term development might be easier for the algorithm to detect. To test this assumption, different dataset variants were created (see Figure 3) by smoothing and by calculating trend information.

Smoothing
For this study, a moving average with a centered Gaussian window was used for smoothing. A moving average is a type of filter that takes the average of a certain number of neighboring data points. The Gaussian window used here refers to a specific weighting function used to calculate the moving average. The Gaussian function, scaled with a standard deviation of two, gives more weight to data points closer to the center of the window. This effectively reduces the amount of noise, while preserving the overall shape of the data (Holt, 2004). Seven additional datasets with different smoothing intensities from 2 to 14 (see Table 2) were generated to study the influence on the forecasting performance. According to the temporal resolution of the dataset, one time step in Table 2 also represents a 30-min interval. Therefore, a moving-average window with two time steps corresponds to a one-hour smoothing window, a moving-average window of four time steps corresponds to a two-hour smoothing window, and so on. The result of the smoothing is exemplarily shown for the variable visibility during one fog event in Figure 2 (colored solid lines).
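A minimal sketch of such a Gaussian-weighted centered moving average, using pandas' windowed rolling mean on an illustrative noisy series (the window size of 8 steps shown here is one of the tested intensities):

```python
import numpy as np
import pandas as pd

# Illustrative noisy "visibility" series (not real station data)
rng = np.random.default_rng(0)
vis = pd.Series(1000 + 50 * rng.standard_normal(200))

# Centered moving average with a Gaussian weighting window
# (standard deviation of two, as described in the text)
smoothed = vis.rolling(window=8, win_type="gaussian", center=True).mean(std=2)
```

Note that `win_type="gaussian"` requires SciPy, from which pandas draws its window functions.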

Trend information
Additional datasets were created by compiling further variants with pre-calculated temporal information for each of the seven moving-average datasets (see Figure 3, steps 3.1 and 3.2). The temporal information was calculated for each of the four predictors temperature, relative humidity, dewpoint depression, and visibility as follows: trend(v, t, n) = v(t) - v(t - n), where v is the variable, t is the time point, and n is the number of time steps back [2-6] (one time step corresponds to the 30-min resolution).
After calculating the trend variables for each of the four predictors, a new dataset variant is created by adding the four trend variables to the moving-average dataset on which they were calculated. The trend variable datasets correspond to the full variable set as shown in Table 1. The trend window length remains constant within a dataset variant. Different trend variable datasets with window sizes from 2 to 6 (see Table 2) were created. Each trend window size was combined with each moving-average window dataset, resulting in 35 new datasets (see Figure 3, steps 3.1 and 3.2). The variety of this new collection of datasets enables us to examine the impact of different smoothing levels (illustrated in Figure 3, point 2) and the combination of smoothing levels with past development information (illustrated in Figure 3, points 3.1 & 3.2) on the model's performance, providing a comprehensive understanding of the predictors affecting performance.
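The trend-variable construction can be sketched as below, assuming the trend is the difference between the current value and the value n time steps back (one step = 30 min); the helper name and the toy DataFrame are illustrative, not from the study:

```python
import pandas as pd

def add_trend_variables(df, predictors, n):
    """Append one trend column per predictor: v(t) - v(t - n)."""
    out = df.copy()
    for col in predictors:
        out[f"{col}_trend_{n}"] = df[col] - df[col].shift(n)
    return out

# Toy example with a single predictor and trend window n = 2
data = pd.DataFrame({"temperature": [1.0, 1.5, 2.5, 2.0, 3.0]})
with_trend = add_trend_variables(data, ["temperature"], n=2)
```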

TRAINING AND EVALUATION SCHEME
The performance of a model must be generally tested with an independent dataset. For this purpose, the entire dataset must be split into a training and an independent test dataset. Given that the fog time series consists of autocorrelated variables, a temporal proximity of training and test data points can cause an information flow between them. Thus, to minimize this influence and maintain the independence of datasets, a training and evaluation scheme for fog forecasting called the Expanding Window Approach (EWA) was previously introduced (Vorndran et al., 2022), which is used in this study. The EWA will be explained first. The models to which the EWA is applied are presented in Section 4. The model evaluation process is described in detail in Section 5.
Figure 4 illustrates the application of the EWA. In the first step (step 1, Training and Test Split), the dataset is split into a training dataset for model training and a test dataset for model evaluation. To assess the reliability of the performance scores (Section 5.1), the split between training and test data was varied, resulting in five training and test splits. Since the first step involved five training and test splits, five models per algorithm (see Section 4) and forecasting period were trained. The final evaluation scores (see Section 5.1) represent an average of the five iterations per algorithm and forecasting period. The performance of the XGB model is set in context by comparing it with the Logistic Regression (LR) benchmark model and a persistence baseline model (PM). Additionally, the scores of the transition evaluation are set in the context of a climatology model (CM). All the models will be explained in the following sections.
The Expanding Window Approach has been developed to train and rigorously evaluate the performance of the fog forecasting models. By applying the EWA, an application-oriented training with minimal influence of autocorrelation is achieved. By keeping the temporal order of the data points during the whole evaluation process, the performance of an operational model is simulated. The evaluation is two-part: both the overall performance of XGB and LR and their forecast of the transitions are analyzed, allowing a precise analysis of the model performance.

MODELS
To put the XGBoost model results into context, a logistic regression benchmark model, a persistence baseline model, and a climatology baseline model were trained for each dataset and each of the four forecasting periods (60, 120, 180, and 240 min). The four models are explained in the following.

XGBoost
XGBoost (XGB) stands for eXtreme Gradient Boosting and is a decision tree ensemble algorithm used for classification tasks (Chen & Guestrin, 2016). XGB operates by building a set of decision trees that link the input variables in various configurations to produce a forecast. This approach allows the algorithm to capture complex interactions among the variables, making it well-suited for forecasting radiation fog. During the training process, each weak decision tree is constructed iteratively by correcting the mistakes of the previous tree. The Python package XGBoost was used for training the XGB models.

Logistic regression
An LR model is employed as a benchmark model since the study aims to investigate two aspects. First, to determine if the more complex XGB algorithm outperforms the simpler LR algorithm and justifies its application. Second, to examine whether changes in model performance with different datasets are attributable to a change in information or are influenced by the choice of algorithm. During training, the LR model learns the probability of an event by examining the relationship between variables and the event's likelihood. The function "LogisticRegression" from sklearn.linear_model (Pedregosa et al., 2011) was used.

Persistence model
The persistence model (PM) serves as a baseline model for the overall fog forecast by determining the number of correctly predicted data points. To make a forecast, the PM model repeats the input data point, assuming that the future weather will be the same as the present. If the fog state remains the same, the PM will accurately predict the outcome. However, when there is a transition from fog to non-fog, or vice versa, the PM forecast fails. The PM represents a straightforward solution that the XGB model might learn if it fails to consider the underlying dynamics of the data and only memorizes the data points.
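A persistence baseline of this kind can be sketched in a few lines; the toy fog series is illustrative:

```python
import pandas as pd

def persistence_forecast(fog_state, horizon_steps):
    """Forecast for time t + horizon is simply the observed state at t."""
    return fog_state.shift(horizon_steps)

# 0 = non-fog, 1 = fog; a 60-min forecast at 30-min resolution is 2 steps
obs = pd.Series([0, 0, 1, 1, 1, 0])
fcst = persistence_forecast(obs, horizon_steps=2)
```

As the text notes, the forecast is correct while the fog state persists, but fails exactly at the transitions (here, at the last point the fog has already dissipated while the persistence forecast still predicts fog).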

Climatology model
To establish an additional baseline for evaluating the formation forecast, we implemented the CM. CM is a model that simulates guessing based on the frequency of fog points. CM assigns the fog class based solely on the distribution of the fog class in the training dataset. For the dataset used in this study, the proportion of fog points is 5%. Therefore, CM assigns the fog class in 5% of the cases, which in turn means that CM can correctly forecast the fog data points with a 5% probability.
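A climatology baseline of this kind can be sketched as a random guesser that predicts fog with the climatological fog frequency (about 5% in this dataset); the implementation details are an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
p_fog = 0.05       # climatological fog frequency from the training data
n_points = 10000   # illustrative number of test points

# True where the CM guesses "fog"; on average 5% of all points
cm_forecast = rng.random(n_points) < p_fog
```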

PERFORMANCE EVALUATION
The different components for evaluating the model performance are explained below, namely scoring (Section 5.1), bootstrapping (Section 5.2), significance testing (Section 5.3), feature importance (Section 5.4), and predictor importance (Section 5.5).

Scoring
For each forecasting period (60, 120, 180, and 240 min), five XGB, LR, and baseline models were trained and evaluated. The performance scores of XGB, LR, and PM in Section 6 are mean values from the five iterations of the EWA (Figure 4). The F1 score provides a balanced assessment of the overall performance of XGB and LR by considering both precision and recall simultaneously. This is because we consider both false alarms (false positives) and overlooked fog events to be equally important. By using the F1 score, we can determine which model has the best overall performance before going into details with the transitions. The transitions, that is, fog formation and fog dissipation, were evaluated separately. Fog formation was defined as the forecast of a fog point when based on a non-fog point. Fog dissipation was defined as the forecast of a non-fog point when based on a fog point. For the transition evaluation, the scores precision and accuracy were used.
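The evaluation scores can be computed directly with scikit-learn; the toy labels below are purely illustrative:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score

# Toy truth/forecast pair: 1 = fog, 0 = non-fog
y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 1, 1, 0, 0]

f1 = f1_score(y_true, y_pred)        # balances precision and recall
prec = precision_score(y_true, y_pred)
acc = accuracy_score(y_true, y_pred)
```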

Bootstrapping
Bootstrapping is a statistical resampling technique that allows for an uncertainty estimation of the performance scores. We employed overlapping block bootstrapping (Berkowitz & Kilian, 2000) to test the significance of differences in the performance scores of XGB and LR for every dataset variant (see Figure 5). The dataset was split into a training part used for training and a test part used for evaluation using an 80/20 split. Equally sized blocks of a consecutive time period are randomly drawn from the training dataset with replacement. To retain data dependencies and to fully represent at least one coherent weather condition, a block size of one week was selected for this study. This process was repeated until the resampled training dataset reached the same length as the original training dataset.
The resampled training dataset is used for training and evaluation just like in the EWA scheme in Figure 4 (see step 2, Training, and step 3, Evaluation). This process is repeated 1000 times, resulting in 1000 resampled training datasets per dataset variant. The resulting distributions of 1000 performance values for each dataset are compared to determine if there are significant differences between the dataset variants and algorithms.
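The block-resampling step can be sketched as follows; at the study's 30-min resolution a one-week block corresponds to 336 points, and the toy array stands in for the training data:

```python
import numpy as np

def block_bootstrap(data, block_len, rng):
    """Draw overlapping blocks with replacement until the resample
    matches the original length, then truncate to that length."""
    n = len(data)
    blocks = []
    while sum(len(b) for b in blocks) < n:
        start = rng.integers(0, n - block_len + 1)
        blocks.append(data[start:start + block_len])
    return np.concatenate(blocks)[:n]

rng = np.random.default_rng(1)
train = np.arange(1000)  # stand-in for the training time series
resampled = block_bootstrap(train, block_len=336, rng=rng)
```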

Significance testing
Following the training of the models on the bootstrapped datasets and the evaluation of their performance, a distribution of 1000 performance scores is obtained for each dataset variant. To compare these distributions and determine their significance, a significance test with interval estimation is conducted. Specifically, the 95% confidence interval is calculated per dataset variant. The distribution of one dataset can be tested for overlap with the distribution of another dataset. If the two confidence intervals do not overlap, the two distributions are significantly different. An example of this procedure is shown in Figure 6. Distribution 1 is an example of the distribution of the 1000 evaluation F1 scores from the bootstrapping of one dataset variant. Distribution 2 is the distribution of the 1000 evaluation F1 scores from the bootstrapping of another dataset variant. The aim is to test whether these two distributions are significantly different. Since the 95th percentile of distribution 1 does not overlap with the 5th percentile of distribution 2, the distributions are considered significantly different. This bootstrapping and testing workflow enables a robust comparison of the model performance across the different datasets and their impact (Brandstätter & Kepler, 1999).
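This overlap test can be sketched as below, using the 95th/5th percentile bounds described in the text; the two normal toy distributions stand in for the bootstrap score distributions:

```python
import numpy as np

def significantly_different(scores_a, scores_b, upper=95, lower=5):
    """Two score distributions are considered significantly different
    when their percentile bounds do not overlap in either direction."""
    hi_a, lo_a = np.percentile(scores_a, [upper, lower])
    hi_b, lo_b = np.percentile(scores_b, [upper, lower])
    return hi_a < lo_b or hi_b < lo_a

# Toy stand-ins for two bootstrap F1-score distributions
rng = np.random.default_rng(7)
dist1 = rng.normal(0.75, 0.01, 1000)
dist2 = rng.normal(0.82, 0.01, 1000)
```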

Feature importance
Feature importance is used to determine the importance of the individual variables for the training process of the XGB model. The feature importance of XGB measures the contribution of each feature to the model's predictive power by aggregating the improvements in accuracy brought by a feature to the branches it is on. This measure is called gain (Chen & Guestrin, 2016). It reflects how much more accurate the new branches become after adding a split on a particular feature, compared to the accuracy before the split. In this study, the feature importance of the variables for the forecasting period of 60 min was demonstrated using the dataset with the highest performance scores.

Predictor importance
This metric is a supplement to the feature importance analysis, as it provides an assessment of feature stability when the model encounters the unseen test data. It enables an explanatory perspective on how the model makes its forecasts.
The predictor importance is based on the performance scores of the test dataset. In this study, the importance of each predictor in the XGB model is assessed by examining the effect of permuting each predictor's values on the model's performance. This is done for the 60-min forecast using the dataset with the highest performance scores. For each predictor, 1000 iterations are executed, during which the predictor's values are randomly permuted. In each iteration, the F1 score and transition metrics from Section 5.1 are calculated to evaluate the model's performance after permuting the predictors. After completing the 1000 iterations for a predictor, the mean values of the performance metrics are computed. This process is repeated for each predictor in the model, resulting in a comprehensive evaluation of predictor importance based on the changes in the model's performance metrics.
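A compact sketch of this permutation procedure is given below; a logistic regression on synthetic data stands in for the trained XGB model, only the F1 score is tracked, and 50 permutations replace the study's 1000:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Synthetic data where only feature 0 carries information
rng = np.random.default_rng(3)
X = rng.standard_normal((400, 3))
y = (X[:, 0] > 0.5).astype(int)

model = LogisticRegression().fit(X, y)
base_f1 = f1_score(y, model.predict(X))

# Permute each predictor in turn and record the mean drop in F1
drops = {}
for j in range(X.shape[1]):
    scores = []
    for _ in range(50):  # 1000 iterations in the study
        Xp = X.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])
        scores.append(f1_score(y, model.predict(Xp)))
    drops[j] = base_f1 - np.mean(scores)
```

A large drop after permutation marks a predictor the model relies on; here feature 0 should dominate by construction.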

RESULTS AND DISCUSSION
This chapter starts with the overall evaluation. First, the performance of XGB using the smoothing windows is shown and discussed (Section 6.1). This is followed by the performance analysis of the absolute and encoded trend windows (Section 6.2), calculated on the most suitable dataset (Moving Average [MAvg] 8) determined in Section 6.1. The performance using the smoothing windows and trend windows is tested for significance in Section 6.3. Section 6.4 provides a detailed analysis of the performance in terms of transitions (fog formation and dissipation) for the MAvg 8 dataset and the different trend windows. Section 6.5 deals with the feature importance of the MAvg 8 dataset and its different trend windows. Section 6.6 analyses the predictor importance of the MAvg 8 dataset using trend window 2 (best combination).

Overall evaluation -smoothing windows
The overall performance analysis provides information on how well the model performs across all data points. By smoothing the dataset, an improvement in the overall performance could be achieved. Figure 7a (top) illustrates the overall evaluation of the performance of the XGB model for forecasting periods of 60, 120, 180, and 240 min. PM serves as a benchmark for performance comparison. The performance of the XGB model trained with the base dataset is very close to that of the PM, with slightly higher performance scores for the 180- and 240-min forecasting periods. Smoothing the variables leads to an increase in XGB model performance. For the 60- and 120-min forecasting periods, a performance increase for XGB is achieved with smoothing windows 2 to 8, which corresponds to a window of one to four hours. A consistent level is maintained for XGB from smoothing window 8 onwards. From the 180-min forecasting period onwards, the influence of smoothing is generally minimal. Rémy & Bergot (2010) discovered the same effect for the first three hours of their NWP-related forecast when including an Ensemble Kalman Filter. They argue that errors in initial conditions have a stronger impact on forecasting times of less than three hours, whereas errors in mesoscale driving forces have a stronger impact on longer forecasting times (Roquelaure & Bergot, 2007; Rémy & Bergot, 2010). Applied to our forecast using an ML model, the problem is very likely the same: due to the strong fluctuations of the variables, the model may not capture the actual underlying trend of the variables (rising/falling) accurately. This leads to inaccuracies in the model's predictions. Inaccuracies in the assessment of short-term dynamics already affect performance in the short-term forecasting range, since the model fails to accurately estimate the future temporal development and therefore the probability of fog development. By applying the moving-average filter, short-term fluctuations are smoothed and the noise in the data is filtered, which leads to an improvement of the forecast.
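The filtering step described above can be sketched in a few lines. The following Python sketch applies a trailing weighted moving average to a visibility series; window sizes are given in half-hour time steps (e.g. window 8 = four hours), and the linear weighting toward the most recent observation is an assumption for illustration, as the exact filter weights are specified earlier in the paper. The visibility values are purely illustrative.

```python
def weighted_moving_average(series, window):
    """Trailing weighted moving average over `window` half-hour time steps.

    Weights increase linearly toward the most recent observation
    (assumed weighting scheme; a plain moving average would use equal
    weights). Trailing windows avoid leaking future data into a nowcast.
    """
    smoothed = []
    for i in range(len(series)):
        chunk = series[max(0, i - window + 1):i + 1]
        weights = range(1, len(chunk) + 1)
        smoothed.append(
            sum(w * v for w, v in zip(weights, chunk)) / sum(weights)
        )
    return smoothed

# Half-hourly visibility (m) around a fog event, illustrative values only.
visibility = [900, 850, 200, 150, 180, 950, 1000]
print(weighted_moving_average(visibility, 8))  # "Moving Average 8" = 4 h
```

Note that the trailing window keeps the filter causal, so the smoothed series can be computed in real time as new observations arrive.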
This observation is not algorithm-related. Figure 7b shows the overall performance evaluation for LR, which serves as a benchmark model. The development is very similar to that of the XGB model (Figure 7a) for the 60- and 180-min forecast periods. For the 180-min forecast, the baseline is only just reached with a moving-average window of 8 and higher. For the 240-min forecast, no smoothing strength leads to the baseline being reached. Overall, LR also shows good performance for the first two forecasting periods, with only slightly worse overall performance scores compared to XGB. For the 60- and 240-min forecasts, the performance of XGB is significantly better than that of LR. This result was to be expected: XGB and LR are both noise-prone algorithms, with XGB being more robust than LR (Atla et al., 2011; Dwidarma et al., 2021). However, the analysis of the trends and transitions will show that it might be beneficial to run XGB and LR in an ensemble for the 60- and 180-min forecasting periods.
Since moving-average window sizes 10-12 show no further improvement, we continue the analysis of the influence of the trend variables with the Moving Average 8 dataset (best result) and its different trend window variations. The trainings with the remaining trend windows also show a performance that is significantly higher than that of the training without trend variables. However, the XGB training based on the Moving Average 8 dataset with trend window 2 achieves the highest overall performance. As the trend window expands, the model performance decreases again. The same applies to an increasing forecasting period.

Overall evaluation - absolute and encoded trend windows
The performance development of LR closely mirrors that of XGB for the initial two forecasting periods, as illustrated in Figure 7d. The same development of XGB and LR suggests that the influence of the trend variables is not algorithm-related. Up to the 180-min forecasting period, the LR performance far exceeds that of the PM when trained with the Moving Average 8-Trend Window 2 dataset. However, although the trend variables have a positive impact on LR performance, LR performance is significantly worse than XGB performance for the 60- and 240-min forecasting periods.
We suspect that implementing the trend variables has a similar effect as the smoothing windows. With the trend calculation, we pick up a very short-term and simple dynamic, which consequently also has value for a short time span into the future. However, this short time span covers the very short-term nowcasting range, which is of great interest in current research (Bartók et al., 2022; Kneringer et al., 2019; Ribaud et al., 2021). This is a time span in which a model can achieve good results by memorizing without generalizing and which is therefore difficult to forecast (Vorndran et al., 2022).
Figure 7e shows the performance comparison of the encoded trend variables for XGB. The same pattern is observed as for the trends in absolute values, but the influence of the encoded trend variables is much smaller. The difference between the XGB training with the smoothed dataset without trend variables and the XGB training with the encoded-variable datasets is hardly recognizable for the 120-min forecast period. The performance of LR when trained with the encoded trend variables again shows the same pattern as XGB (Figure 7f). In contrast to training LR with absolute trend windows, the performance of LR, just like that of XGB, suffers from training with encoded trend windows. The persistence baseline is exceeded by LR only up to the 120-min forecast period when trained with encoded trend windows.
Consequently, not only the direction of the development of the variables, represented by the encoded trend variables, is relevant information for XGB and LR, but also the intensity, represented by the absolute trend variables. This result reflects the current knowledge of fog dynamics, according to which the intensity of individual variables is decisive for fog development (Mazoyer et al., 2017). With each of the four trend variables, in combination with the variable from which it was derived (relative humidity, dewpoint depression, visibility, and temperature), the probability of fog or non-fog can be calculated much more accurately. The model knows the current state of the atmosphere and the rate at which it is changing. Using this information, the model can estimate more precisely whether the fog-class or non-fog-class conditions will be reached in the period to be forecasted.

Significance testing
XGB's overall evaluation performance scores for the smoothing windows (Figure 7a) and the trend windows (Figure 7c) were tested for significant differences, as the differences from one value to the next are predominantly rather small. The significance test was carried out on the basis of the bootstrapped results. Since the nowcasting periods of 60 and 120 min benefit the most from the trend variables, they are analyzed in detail.
The top row of Figure 8 shows the significance test for the different moving-average window sizes. Except for moving-average window size 2, there is no significant difference between successive values for the 60-min forecast. For the 120-min forecast, there are in some cases even three consecutive smoothing steps that show no significant difference. From a moving-average window size of 8 onwards, no significant difference or improvement is obtained with increasing smoothing intensity for either the 60- or 120-min forecast. Therefore, a moving-average window size of 8 is sufficient for the present dataset. While in the 60-min forecast there is already a difference from the base dataset from smoothing window 4, in the 120-min forecast there is only a significant difference from a smoothing window of 6 upwards. Thus, there is a decreasing positive effect of smoothing between the two forecast periods. This underlines our theory that smoothing highlights short-term dynamics. Consequently, smoothing only has a short-term positive effect on the forecast.
The bottom row of Figure 8 shows the significance test for the different trend window sizes used with the Moving Average 8 dataset. For both forecasting periods, the optimal result is attained with a trend window size of 2, which corresponds to one hour. Although significant differences persist among the subsequent trend window sizes, the overall F1 scores decline concurrently, contributing to the significant differences observed. The positive effects are confined to a brief time window. We attribute this observation to the fact that fog formation and, in particular, fog dissipation are highly dynamic processes of short duration (Mazoyer et al., 2017; Román-Cascon et al., 2016). Consequently, larger trend windows fail to capture crucial details.

Transition evaluation
It has been shown that a significant increase in overall performance can be attained by incorporating smoothing and trend information in absolute values. The influence of the trend variables remains consistent across all four forecast periods, though the gain diminishes from the 60-min to the 240-min forecast period. As highlighted in Sections 6.1 and 6.2, the optimal overall results for the station in Linden are obtained using a four-hour smoothing window and a one-hour trend window. The analysis of the transitions shows the same pattern as the overall evaluation, with a decreasing positive influence of the trend variables in the one- to three-hour forecasting range and a minor to no positive effect in the four-hour forecasting period. The results will be demonstrated using the 60-min forecasting period and a smoothing intensity of four hours. Since the best overall performance for XGB was achieved with a trend window of 2 (= one hour), this performance value is highlighted and forms the basis for comparing the performance of formation and dissipation. Figure 9 illustrates the effect of the different moving-average window sizes (x-axis) in comparison to the different trend window sizes (y-axis) for fog formation (upper row) and fog dissipation (lower row). Enhanced forecasting of fog onset and dissipation is achieved through smoothing alone. When combined with trend variables, a further significant improvement is observed. The increase from the base dataset to the best dataset in terms of overall performance (moving-average window of 8 and trend variable window of 2) is depicted in red. The forecasting accuracy for fog onset improves from 13% to 46%, with a concurrent increase in precision from 45% to 69% and a consistently low POFD. The same observation holds true, to an even greater extent, for fog dissipation: the accuracy increases from 13% to 54%, with a concurrent increase in precision from 58% to 81% and a consistently low POFD.
The same development applies to the encoded trend values in Figure 10. However, the improvements are much smaller compared to the absolute trend values. Consequently, it is not only the direction of a variable's development that is important, but also the intensity of the development.
LR exhibits a similar development under the influence of the trend variables. With an accuracy of 59%, LR even outperforms XGB (accuracy 46%) in fog formation accuracy for the 60-min forecast, but with lower precision (53%) than XGB (precision 69%). Both XGB and LR outperform the formation forecast of CM (accuracy 5%; precision 5%; POFD 5%) by far. For the 60-min forecast of fog dissipation, XGB (accuracy 54%; precision 81%) outperforms LR (accuracy 28%; precision 42%), and also in the longer-term forecast. The comparison between absolute and encoded training runs yields somewhat different results for LR than for XGB (see Table 3). The encoded fog formation performance of LR (accuracy 50.15%) is inferior to the absolute formation performance of LR (accuracy 58.84%), just as for XGB. Surprisingly, the fog dissipation performance of LR is better for the encoded training (38.43%) than for the absolute training (28.12%). However, with LR the POFD (7.98%/9.07%) is very high, and the prediction is thus much less accurate than that of XGB. Overall, XGB is the model with the higher performance. But since LR shows good results regarding the transitions, especially fog formation, it might be gainful to make an ensemble forecast with XGB and LR for the 60-min and potentially 120-min forecasts.

Feature importance analysis
The feature importance for the dataset with a moving-average window of four hours and a trend window of one hour showed that all of the variables included in the model were important for training. Performance improves with each variable. Visibility has by far the largest impact. This is to be expected, since ML models tend to learn simple and clear patterns first (Arpit et al., 2017). The classes to be predicted are defined by visibility; visibility therefore has a high information density for the model, without the model having to learn complex physical relationships. However, the performance of the model is not based on visibility alone. Air pressure, relative humidity, dewpoint depression and the visibility trend also lead to a strong improvement in performance. These variables are key factors for forecasting (Miao et al., 2020; Negishi & Kusaka, 2022; Pauli et al., 2020). They describe the development of fog (Gültepe et al., 2007) by carrying information about the physical evolution of the atmosphere that is important for a more accurate forecast.
The feature importance of the remaining variables is comparatively low. However, variables with a low feature importance can still be important. Firstly, interaction effects might not be fully captured by the individual feature importance of XGBoost (Chen & He, 2022). A variable with low individual importance might have a positive impact when interacting with other variables. Thus, for a process as complex as fog development, a low feature importance does not necessarily mean that the variable is unimportant. Secondly, a variable may have low overall importance but could be highly relevant for specific subgroups within the dataset, such as the transitions fog formation and fog dissipation.
Figure 11 also shows that the trend variables contribute to an improvement. Among the trend variables, the relative humidity trend and the visibility trend are the most important. A higher improvement is achieved with the temperature trend than with the temperature itself. The dewpoint depression trend is the only trend variable that becomes more important as the trend window expands, while the other three trend variables decrease in importance as the trend window expands.

Predictor importance analysis
The predictor importance analysis (Figure 12) shows that the variables relative humidity, temperature trend, visibility trend and visibility are the core variables of the model. If the humidity and temperature trends are missing, the forecasting performance for fog formation decreases. If visibility and the visibility trend are missing, the model no longer produces a reliable forecast. Apart from the temperature trend, these variables (humidity, the visibility trend and visibility) already had a high value in the feature importance analysis. The temperature trend is important for the forecast but had a comparatively low gain in the feature importance analysis. Overall, however, the model is very robust against individual sensor failures. This is a great advantage for the application, as sensor failures can occur but the forecast should continue to run. The high dependence on the visibility trend and visibility in both the feature importance analysis and the predictor importance analysis suggests that the model generates its forecast almost exclusively based on these variables. However, this is not the case, as Figure 13 shows. For this figure, a new model was trained and evaluated using the EWA scheme. This time the training dataset consists only of the variables visibility and visibility trend from the Moving Average 8-Trend Window 2 dataset. In the left column, the F1 score of 0.85 initially appears quite high. However, the weaknesses become apparent when looking at the transitions. All transition performance scores are significantly worse compared to the performance scores of the training with the complete dataset (Figure 12). The small change in the F1 score from 0.88 for the complete dataset (Figure 12) to 0.85 for the reduced dataset (Figure 13) can be explained by the fact that the transitions make up a very small proportion of the dataset. Large changes in the transition performance therefore result in only a small change in the overall performance. A separate evaluation of the transitions is therefore particularly important for unbalanced datasets, as shown in our previous study (Vorndran et al., 2022).
Looking at the two variables, one can see that if either the visibility trend or visibility fails, the forecast deteriorates further. Visibility is more important than the visibility trend, but in both cases a reliable forecast is no longer possible. A stable forecast with high accuracy can only be achieved in interaction with the other variables from the complete dataset. While visibility is undeniably a very important variable for the model, and one that is prone to persistence, XGB has also learned to incorporate the fog-relevant physical interactions (Bergot & Lestringant, 2019; Boutle et al., 2022) by including the other variables. The question for the future is how the dynamics of the atmosphere can be represented by additional variables to further improve fog prediction with XGB, possibly together with LR in an ensemble.

CONCLUSION
In summary, this study has shown that the use of the XGB algorithm along with data pre-processing techniques holds considerable potential for radiation fog nowcasting. Our key findings are described next.

Data smoothing
It has been demonstrated that data smoothing with a four-hour moving-average window provides significant benefits in maximizing forecast accuracy within our dataset. Smoothing has been shown to be particularly beneficial in capturing short-term dynamics, especially for the 60- and 120-min forecast periods.

Overall performance analysis
The incorporation of trend variables, especially within the one-hour forecasting period, results in a significant improvement in XGB's and LR's forecasting performance.
The importance of trend variables remains significant over the 120-min forecasting period but diminishes over longer prediction periods.

Transition performance analysis
The trend variables show remarkable benefits in predicting the formation and dissipation of fog, with the optimal configuration for our station being a four-hour smoothing window and a one-hour trend window.

Key features
The examination of feature and predictor importance underscores the importance of core variables including visibility, visibility trend, relative humidity, relative humidity trend, dewpoint depression, and temperature trend. Furthermore, while visibility and its trend proved to be of high importance, the exclusion of the other variables significantly reduced the performance of XGB to such an extent that it no longer provides reliable forecasts. This underlines the importance of the variables' interactions in the fog development process and the ability of XGB to learn those interactions. In addition, XGB is robust to single sensor outages, enhancing its practical applicability.
In conclusion, this study demonstrates that XGB provides a robust and reliable framework for fog nowcasting. The incorporation of atmospheric trends significantly improved the nowcasting performance, underscoring the importance of this information for the model to estimate the potential for fog development. The use of rather standard variables measured at several stations in Germany and the robustness of XGB to single sensor failures make XGB a promising model for area-wide fog forecasting.
XGB has outperformed LR and the other baselines. This shows that the relationships of the atmospheric dynamics can be identified by XGB. The higher performance over LR justifies the use of the more complex XGB algorithm. Due to the inferior but still good performance of LR compared to XGB, an ensemble setting might increase the nowcasting performance even further and is considered as a next step.
We also see potential to expand the dataset by including additional satellite data as a next step, as these resources are readily available nationwide in Germany. We expect this to improve the representation of the fog-relevant atmospheric conditions and thus the model performance. For a final variable selection, especially regarding the extension of the dataset, further explainable AI methods are necessary to gain a deeper understanding of the different interactions during model training and the final forecast (Kamangir et al., 2022).

FIGURE 1 67,997 data points (upper row) in 30-min resolution across seven seasons from 2009 to 2016. Missing values (lower row): 6641 data points.

Fog events are characterized by high fluctuations, as can be seen in Figure 2 (black dashed line). By eliminating

FIGURE 2 The effect of different smoothing intensities on the fluctuations of a fog event. The visibility variable serves as a representative example. The boxes outlined in grey show an enlarged section. Fluctuations occur not only before and after the fog event but also during the event.

FIGURE 3 Datasets used for training and evaluation. A baseline dataset on the basis of which seven further moving-average datasets were generated. Different moving-average window sizes in brackets and Table 2. Based on each moving-average dataset, 70 (35 + 35) additional datasets with different trend window sizes were generated. Trend window sizes in brackets and Table 2. In each window a time step equals half an hour.
Just like with the moving-average windows, a time step of two corresponds to a trend window of one hour into the past, and so on. The different dataset variants are shown in Figure 3. To provide an example: in the end, the Moving Average 2 dataset has five additional variants, one for each trend window (Table 2). Since there are seven moving-average datasets and five trend window sizes, the entire procedure results in 7 × 5 = 35 new datasets. Two versions of the trend variable datasets were created. The first version contains the absolute values of the trend variables in the model training. The second version contains encoded values, where positive values were changed to 1, negative values to −1, and all other values to 0 to indicate increases, decreases, and no change.
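The trend variable construction described above can be sketched directly from these definitions. A minimal Python sketch, with illustrative relative humidity values (the variable names are ours; the half-hour time step and the 1/−1/0 encoding follow the description):

```python
def trend(series, window):
    """Absolute trend: change over the past `window` half-hour time steps;
    None where there is not enough history."""
    return [None if i < window else series[i] - series[i - window]
            for i in range(len(series))]

def encode(trends):
    """Encoded trend: 1 for an increase, -1 for a decrease, 0 otherwise."""
    return [None if t is None else (1 if t > 0 else -1 if t < 0 else 0)
            for t in trends]

relative_humidity = [80.0, 82.5, 85.0, 85.0, 83.0]  # illustrative values
absolute = trend(relative_humidity, 2)              # trend window 2 = 1 h
print(absolute)          # [None, None, 5.0, 2.5, -2.0]
print(encode(absolute))  # [None, None, 1, 1, -1]
```

The absolute version preserves the intensity of the change, while the encoded version keeps only its direction, which is exactly the distinction evaluated in the results.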

FIGURE 4 Expanding Window Approach (EWA): applied training and evaluation scheme for XGBoost (XGB) and Logistic Regression (LR), including the two baseline models persistence (PM) and climatology (CM). The training and test dataset split was varied five times from 70%/30% to 90%/10%. After each training and test split, the training dataset is passed on to the second step (step 2, Training). One model was trained per training and test split. During model training the model needs to evaluate its performance in order to find the best fit. Therefore, the training dataset is further divided into a training dataset (80%) and a validation dataset (20%). In the third step (step 3, Evaluation), the trained model's overall performance and the transition performance are tested with the independent test dataset. For further details of this evaluation process see Section 5.1.
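The split scheme of the EWA can be sketched as chronological index splits. A minimal Python sketch: the caption states only the 70%/30% and 90%/10% extremes over five splits, so the 5% increments are an assumption.

```python
def expanding_window_splits(n, train_fractions=(0.70, 0.75, 0.80, 0.85, 0.90)):
    """Chronological train/test index splits for the Expanding Window
    Approach: the training window grows, the remainder is the test set.

    The 5% increments between the five splits are an assumption; only
    the 70%/30% and 90%/10% endpoints are stated.
    """
    splits = []
    for frac in train_fractions:
        cut = int(n * frac)
        # Training data always precedes test data: no temporal leakage.
        splits.append((list(range(cut)), list(range(cut, n))))
    return splits

for train_idx, test_idx in expanding_window_splits(1000):
    print(len(train_idx), len(test_idx))
```

Keeping the splits chronological is the essential point: a random split would let the model see the future of the very fog events it is later tested on.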

The F1 score,

F1 = 2 × (Precision × Recall) / (Precision + Recall),

with

Precision = True positives / (True positives + False positives),
Recall = True positives / (True positives + False negatives),

was used to optimize and validate the overall model performance.
used. By utilizing these three scores together, the evaluation of the transitions provides a more comprehensive understanding of XGB's and LR's performance across different aspects of fog formation and dissipation.
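The transition scores reported in the results (accuracy, precision and probability of false detection) can be computed from the confusion counts of transition events. A minimal Python sketch, assuming "accuracy" here denotes the hit rate for the transition class, which matches the low CM values reported; the exact definition is given earlier in the paper:

```python
def transition_scores(tp, fp, fn, tn):
    """Accuracy, precision and POFD for transition events (fog formation
    or fog dissipation) from confusion counts.

    'Accuracy' is interpreted as the hit rate of the transition class
    (an assumption about the exact definition used in the paper).
    """
    accuracy = tp / (tp + fn)    # fraction of actual transitions detected
    precision = tp / (tp + fp)   # fraction of predicted transitions correct
    pofd = fp / (fp + tn)        # false alarms among non-transitions
    return accuracy, precision, pofd

# Illustrative counts for a handful of fog-formation events.
print(transition_scores(tp=8, fp=2, fn=2, tn=88))
```

A high POFD inflates the apparent skill of a model that simply predicts transitions often, which is why all three scores are needed together.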

FIGURE 5 Bootstrapping procedure of the different dataset variants. Part 1: the dataset is split into 80% training data and 20% test data. Part 2: bootstrapping the training dataset by randomly drawing equally sized time periods and putting them together into a resampled training dataset. Part 3: training with the resampled training dataset and evaluation with the 20% test dataset, as in the Expanding Window Approach (EWA) in Figure 4. Parts 2 and 3 are repeated 1000 times.
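Part 2 of this procedure is a block bootstrap: contiguous time periods are drawn with replacement so that the temporal autocorrelation within each block is preserved. A minimal Python sketch; the block length in time steps is an assumed parameter, as the value is fixed elsewhere in the paper.

```python
import random

def block_bootstrap(train, block_len, seed=0):
    """Resample a training time series by drawing equally sized,
    contiguous blocks with replacement and concatenating them.

    Drawing whole blocks (rather than single points) preserves the
    short-term temporal structure that fog development depends on.
    `block_len` is an assumed parameter of this sketch.
    """
    rng = random.Random(seed)
    n_blocks = len(train) // block_len
    resampled = []
    for _ in range(n_blocks):
        start = rng.randrange(len(train) - block_len + 1)
        resampled.extend(train[start:start + block_len])
    return resampled

sample = block_bootstrap(list(range(1000)), block_len=48)  # 48 steps = 1 day
print(len(sample))
```

Repeating the resample-train-evaluate loop 1000 times, as in the figure, yields a distribution of F1 scores per dataset variant, which the significance test below operates on.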

FIGURE 6 Example of the significance testing procedure. Distribution 1 has a mean F1 score of 0.6. Distribution 2 has a mean F1 score of 0.7. The calculation of the 95% confidence intervals shows that the two distributions are significantly different from each other, as the 95% boundary of the confidence interval of distribution 1 (right blue dotted line) does not exceed the 5% boundary of the confidence interval of distribution 2 (left orange dotted line).
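The confidence-interval criterion described above can be stated compactly: two bootstrapped F1 distributions differ significantly if the upper 95% bound of the lower-scoring distribution stays below the lower 5% bound of the higher-scoring one. A minimal Python sketch (the linear-interpolation percentile is an implementation choice):

```python
def percentile(values, p):
    """Linear-interpolation percentile, with p in [0, 1]."""
    s = sorted(values)
    k = (len(s) - 1) * p
    f = int(k)
    c = min(f + 1, len(s) - 1)
    return s[f] + (s[c] - s[f]) * (k - f)

def significantly_different(lower_dist, higher_dist):
    """Non-overlap criterion on the bootstrapped F1 distributions:
    the 95% bound of the lower-scoring distribution must stay below
    the 5% bound of the higher-scoring one."""
    return percentile(lower_dist, 0.95) < percentile(higher_dist, 0.05)

print(significantly_different([0.55, 0.60, 0.65], [0.68, 0.70, 0.72]))  # True
print(significantly_different([0.50, 0.70], [0.60, 0.80]))              # False
```

In practice each distribution would hold the 1000 bootstrapped F1 scores of one dataset variant, so the comparison is between adjacent smoothing or trend window sizes.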

Figure 7c shows the XGB performance of the training based on the Moving Average 8 smoothed dataset compared to the training based on the Moving Average 8 smoothed dataset with trend windows of 2-6, respectively. Trend windows 2-6 correspond to one to three hours of trend information into the past. The trend variables are in absolute values. Compared to the training based on the Moving Average 8 dataset without a trend window, the highest improvement is achieved with the training based on the Moving Average 8 dataset with a trend window of 2. The trainings with the remaining trend windows also show a performance that is significantly higher than the performance of the training without trend variables. However, the XGB training based on the Moving Average 8 dataset with trend window 2 achieves the highest overall performance. As the trend window expands, the model performance decreases again.

FIGURE 7 Overall performance comparison of the XGBoost (XGB) and Logistic Regression (LR) model runs when trained with the different dataset variations. Comparison of the performance for the different smoothing windows for XGB (a) and LR (b), for the absolute trend window datasets for XGB (c) and LR (d), and for the encoded trend window datasets for XGB (e) and LR (f). The four values in each box show the F1 score of the training with the Moving Average (MAvg) 8-Trend Window 2 dataset. *Significant difference between XGB and LR.

FIGURE 8 Significance testing for the different moving-average window sizes (upper row) and trend windows (lower row) for XGBoost. The F1 performance scores are plotted on the diagonal.

FIGURE 9 Detailed analysis of the forecast performance of XGBoost for the 60-min forecast period with absolute trend values. The analysis covers the performance regarding the transitions in terms of accuracy, precision and probability of false detection for fog formation (upper row) and fog dissipation (lower row). The red boxes highlight the performance improvement between the base dataset and the Moving Average 8-Trend Window 2 dataset (best overall combination).

FIGURE 10 Detailed analysis of the forecast performance of XGBoost for the 60-min forecast period with encoded trend values. The analysis covers the performance regarding the transitions in terms of accuracy, precision and probability of false detection for fog formation (upper row) and fog dissipation (lower row). The red boxes highlight the performance improvement between the base dataset and the Moving Average 8-Trend Window 2 dataset (best combination).

FIGURE 11 Values of the median feature importance of the training with the Moving Average 8 dataset and associated trend window datasets compared to the base dataset without the trend variables.

FIGURE 12 Sixty-min forecast. On the left, the overall, formation and dissipation performance scores of the complete dataset (all variables) are displayed. From left to right are the new performance scores with simulated outage of the respective variable. The arrows and boxes highlight the variables that cause the greatest performance change in case of outage.

FIGURE 13 Predictor importance analysis for the training with visibility trend and visibility only. Sixty-min forecast. On the left are the overall, formation and dissipation performance scores with the reduced dataset unchanged. From left to right are the new performance scores with simulated outage of the respective variable.

TABLE 1 Predictor and target variables used for model training and evaluation. The dataset was recorded at the research station in Linden-Leihgestern (Germany) by the Hessian Agency for Nature Conservation, Environment and Geology (HLNUG) and the Laboratory of Climatology and Remote Sensing (LCRS).
TABLE 2 Different window sizes used for creating the moving-average datasets and trend variable datasets. Each window consists of a fixed number of time steps per dataset variant. A time step in each window equals half an hour.

TABLE 3 Transition evaluation for Logistic Regression (LR) with absolute and encoded trend values. Sixty-min forecast. Training with the Moving Average 8-Trend Window 2 dataset. The table shows the increase in performance for the transitions fog formation and fog dissipation from training with the base dataset to training with absolute and encoded trend variables.