Smoothed Bernstein Online Aggregation for Day-Ahead Electricity Demand Forecasting

We present a winning method of the IEEE DataPort Competition on Day-Ahead Electricity Demand Forecasting: Post-COVID Paradigm. The day-ahead load forecasting approach is based on online forecast combination of multiple point prediction models. It contains four steps: i) data cleaning and preprocessing, ii) a holiday adjustment procedure, iii) training of individual forecasting models, iv) forecast combination by smoothed Bernstein Online Aggregation (BOA). The approach is flexible and can quickly adopt to new energy system situations as they occurred during and after COVID-19 shutdowns. The pool of individual prediction models ranges from rather simple time series models to sophisticated models like generalized additive models (GAMs) and high-dimensional linear models estimated by lasso. They incorporate autoregressive, calendar and weather effects efficiently. All steps contain novel concepts that contribute to the excellent forecasting performance of the proposed method. This holds particularly for the holiday adjustment procedure and the fully adaptive smoothed BOA approach.


I. INTRODUCTION
T HE COVID-19 pandemic led to lockdowns and shutdowns all over the world in 2020 and 2021 to reduce the spread of the corona virus SARS-CoV-2 and the resulting COVID-19 disease. Obviously, mentioned lockdowns and shutdowns impacted substantially the behaviour the people. Thus, also the consumption of electricity changed dramatically during those periods, [1]. Electricity load forecasting during lockdowns and shutdown periods is a challenging task, but even months afterwards the forecasting task is still complicated. One reason is that is not obvious which of the changed behavioral pattern during the lockdown observed in many countries (e.g. increased remote work, getting up later) will persist months and years after the lockdown. Another problematic aspect is the disruption of annual seasonalities during the lockdown periods.
The IEEE DataPort Competition Day-Ahead Electricity Demand Forecasting: Post-COVID Paradigm focuses on Post-COVID aspects in electricity load forecasting [2]. The dayahead load forecasting competition was based on real data and run over a test period of 30 days. This manuscript describes one of the winning method that scored 3rd in the competition 1 . The prediction approach is based on smoothed Bernstein Online Aggregation (BOA) applied on individual load forecasting models. The full model flow is depicted in Figure 1.
The manuscript is organized as follows. First we introduce the data set and the forecasting task in more detail and discuss inital data preprocessing steps. Afterwards, we explains a holiday-adjustment procedure to deal adequately with holidays in the data. Section IV introduces multiple individual forecasting models that are mainly (high-dimensional) statistical forecasting models that are sometimes referred as experts or base learners. Then, we descripe the expert aggregation Florian Ziel is with the House of Energy Markets and Finance, University of Duisburg-Essen, Germany (e-mail: florian.ziel@uni-due.de) 1 According to significance test conducted by the organizers, the top 3 positions where not significantly different from each other. procedure BOA with a smoothing extention. We conclude with some final remarks.

II. DATA AND PREPROCESSING
The load forecasting competition contains initially hourly load data from 2017-03-18 00:00 to 2021-01-17 07:00, visualized in Figure 2. According the organizers the load data corresponds to one city, but the origin of the load data to predict was disclosed.
The daily forecasting task is to predict the next days hourly load, which corresponds to forecast 24 values 17 to 40 hours ahead. Thus, the first forecasting task was aiming for the hourly load for 2021-01-18 from 00:00 to 23:00. The second task was to predict the load on 2021-01-19. This rolling forecasting procedure was continued over 30 days in the competition. In the bottom chart of Figure (2) you see clearly the structural break due to the COVID-19 lockdown in March 2020. The overall load level dropped and the weekly profile got disturbed dramatically. In the proceeding months we observe some slowly increasing recovery of the electricity consumption. However, even in 2021 we observe that especially the peak hours have a lower load level than the previous years. Next to the actual load data, also weather input data was provided. This was actual data on humidity, pressure, cloud cover, temperature, wind speed such as day-ahead forecasts of all meteorologic features except humidity were provided, Figure 3 for last years data. The day-ahead weather forecasts were in fact 48-hours ahead forecast. Thus, for the first day, weather forecasts data up to 2021-01-19 07:00 was provided. During the competition the actual load and weather data, and the weather forecast data for the next 24 hours were released, leading to a typical rolling forecasting study design.
The weather data contained some obvious reporting problems which were cleaned using linear interpolation and the R-package tsrobprep, see [3], [4]. Afterwards, we transformed the wind direction data to the north-south (NS) and east-west (EW) component by evaluating the cosine and sine of the wind direction data. Thus, Figure 3 shows the cleaned data for the available weather forecasts and actuals. For further analysis, we extend the weather data input space by adding rolling daily means of all weather inputs.
The evaluation metric is the mean absolute error (MAE) which corresponds to point forecasting. More precisely, median forecasts are required to minimize the MAE, see [5].

III. HOLIDAY ADJUSTMENT PROCEDURE
As the origin of the data was disclosed and no holiday calendar was provided a specific solution for dealing with holidays is required. Handling holidays adequately is an important task and may improve the forecasting accuracy substantially even for the non-holidays, see e.g. [6].
By eyeballing, it is easy to spot some obvious date-based public holidays in the data (12Jan, 17Apr, 1Aug, 18Sep, 11Dec, 18Dec). But there are also a couple days which behave like holidays but the pattern of occurrence seems to be different. We consider a holiday adjustment procedure to take into account the holiday impact appropriately. The procedure is based on a high-dimensional time series model, similarly used in the GEFCom2014 (Global Energy Forecasting Competition 2014), see [7]. The result of the considered procedure is illustrated for the period from October to December in Figure  4.
To introduce the holiday adjustment procedure formally, we require some notations. Denote t = log(L t ) the logarithm of the load L t at time point t. Let T be the number of observations that is currently available for model training. The considered model is a high-dimensional linear model for t containing the following components in the input matrix: = max{x t − q p (x), 0} with q p (x) for p ∈ P as p-quantile of x for weather input feature x = (x 1 , . . . , x T ) . iii) All weather data interactions, i.e. x inter xt,yt,t = x t y t for inputs x t and y t . iv) Daily and weekly deterministic effects. The daily and weekly effects are modeled by standard and cumulative dummies: x cday x week where HoD k (t) and HoW k (t) are the hour-of-the-day and hour-of-the-week dummies. v) Annual deterministic effects described by periodic cubic B-splines with annual periodicities (A = 24 × 365.24 hours). Precisely, we consider 12 basis functions on a equidistant grid on [0, A). For more details on periodic cubic B-splines in energy forecasting see [8]. vi) Impact-adjusted holiday dummies on days which were identified in advance as potential holidays.
The lagged log load in i) describes the autoregressive impact on a specific day for the surrounding 3 weeks of information without using nearby information of the surrounding week, to exclude any impact from bridging effects. Note that ReLU-transformed weather input in ii) is relevant to capture non-linear weather impacts. However, for p = 0 the linear effect is modelled. Component iii) is motivated from the second order Taylor approximation. Considering all weather data interactions allows us to capture relevant nonlinear information. In fact, components ii) and iii) may be regarded as a manual application of the kernel trick to the input data to enlarge the feature space.
Further, in iv) the standard dummies with '='-sign in the definition (see (1) and (3)), have the job of detecting demand effects that happen only at the day or week period (e.g. if the load is high only at a certain hour of the day). In contrast, the cumulative dummies (see (2) and (4)) have the purpose to describe effects that persists over multiple hours in the day or week period. The component vi) models the holiday effect and is crucial for the holiday adjustment procedure. Its design corresponds to the holiday modeling approach used in see [7]. However, next to the impact multiplication also a scaling of the impact. Precisely it is scaled by the difference of rolling quantiles at probabilities 90% and 37% of the previous week. The idea is that the upper quantile is an estimate standard activity in a working week and the lower quantile and estimate for the Sunday peak. This adjustment procedure is required to deal with the strong structural breaks during the COVID-19 shutdown. This, effect can be seen in Figure 4 as well. We observe that the absolute holiday impact of 11th December is smaller in 2020 than the years before.
The model for the log-load t with all inputs i) to vi) is estimated using lasso (least absolute shrinkage and selection operator) on scaled input data. The tuning parameter is chosen by minimizing the Bayesian information criterion (BIC), see e.g. [9]. Now, we take the fitted parameter vector β and set all estimated parameters which correspond to the holiday impacts vi) to zero, to receive β hldadj . The fitted values with respect to β hldadj is the holiday-adjusted log-load time series˜ t , as illustrated in Figure 4 in blue.
Note that for the inital and final three weeks (exactly 510 hours as the maximum in I pos ) the procedure can not be applied as t+k is not available all the time. Therefore, we train for the inital three weeks the same model without I neg and for the last three weeks the model without I pos .
The complete lasso training procedure including tuning parameter selection on the full data set takes around half a minute on the authors laptop using glmnet of R on a single core. However, it is important to use sparse matrix support to reduce computation time.

IV. TRAINING OF INDIVIDUAL FORECASTING MODELS
Given the holiday adjusted log-loadl t and the resulting holiday adjusted load L t we train many forecasting models to create a big pool of forecasters (or experts). The considered models range from simple time series model more advanced statistical learning procedures. Also several non-linear models gradient boosting machines (GBM) (using the R packages gbm and lightgbm) and neural networks (using the R packages nnet and keras) were tested. But the forecasting accuracy was rather low and they did not improve the forecasting performance in the forecasting combination method described in Section V. The reason might be that the major impacts are linear, esp. autoregressive and seasonal effects.
The considered models, can be categorised into four types. This is A) STL-decomposed exponential smoothing → Sec. IV-A B) AR(p) models → Sec. IV-B C) Generalized additive models (GAMs) → Sec. IV-C D) Lasso estimated high-dimensional linear regression models → Sec. IV-D The lasso type model had best individual prediction accuracy. Further, all models are applied to the holiday adjusted load time series and the holiday adjusted log-loadl t and the holiday adjusted load L t . For convenience, we introduce the notation Y t ∈ {l t , L t }. When considering a log-load model, the exponential function is applied to the point forecasts T +h for the forecasting horizon h ∈ H = {h min , . . . , h max } = {17, 18, . . . , 40} to predict the load at T + h. All models were estimated using a calibration window size of C ∈ {28, 56, 77, 119, 210, 393, 758, 1123} days minus 16 hours (as the last available data point was at 8am). The general idea behind this is quite simple, models with short calibration windows (e.g. 4, 8, 12 weeks) shall adjust better to more recent data, models with larger windows have more data to learn better about rare event like the annual effects. Moreover, several forecasting studies in energy forecasting have shown that combining short and long calibration windows, may lead to substantial gain in forecasting performance, see e.g. [10], [11].
The described forecasting procedure was applied in a rolling forecasting study to all days starting from 1st June 2020 as first day to predict. This date was chosen by manual inspection the historic data, as the hard COVID-19 shutdown effects seem to be vanished.

A. STL decomposition with Exponential Smoothing
This approach applies first an STL decomposition on Y t . STL acronym represents the decomposition into to trend, seasonal and remainder components by loess (locally weighted scatterplot smoothing).
On the remainder component an additive exponential smoothing model is fitted. This is done using the stlf function of the forecast package in R, [12]. The seasonality of the time series are set to 168. Forecasting is done recursively for forecasting horizon up to h max , and report h min , . . . , h max .

B. AR(p) time series model
Here, Y t is modeled by a simple autoregressive process (AR(p)) where p, sometimes used in energy forecasting [13], [14]. The only tuning parameter p is selected by minimizing the Akaike information criterion (AIC) with p max = 24×22 = 528 (3 weeks plus 1 day). This done using the R function ar of the stats package in R, see [15]. Again, the forecasting is done recursively to h max , and report h min , . . . , h max .

C. Generalised additive models (GAMs)
This procedure utilized generalised additive models which are popular in load forecasting, see e.g. the winning method of the Global Energy Forecasting Competition 2014 in the load track [16].
In fact we consider 2 separate GAM model designs due to the limited accessibility of the Y t−24 for forecasting horizons h ∈ H. For hour the first 8 horizons h ∈ {17, . . . , 24} the GAM model is The autoregressive terms capture the dependency structure of the past for the corresponding hour. Note that the yesterdays load Y 24 and previous weeks load Y 168 is regarded as very important and therefor non-linear effects are considered. Preliminary analysis showed that the weather variables temperature and cloud cover are more relevant to explain the load behavior than other weather variables. There, we included next plain non-linear effects on each individual variable which potentially varies over the week also interaction effects. The remaining weather variables enter with non-linear smoothing effects.
The models are trained by considering only the data of the corresponding target hours. Obviously, the forecasting is done directly. The implementation is done using the gam function of the R-package mgcv, see [17].

D. Lasso based high-dimensional regression models
The lasso based models are very similar to the model used for the holiday adjustment in Section IV-D. Therefore, we only highlight the differences which concerns the autoregressive design and details on the estimation procedure.
The high-dimensional linear models are trained for each forecasting horizons h ∈ H separately. Additionally, the lag sets I h are adjusted to I h = I h,day ∪ I h,week ∪ I h,year with I h,day = ·{h, . . . , 24 . . . 15 + h} − h, I h,week = 24 · {21, 28, . . . , 56}−h and I h,year = 24·{350, 357, 364, 371}−h, for h ∈ H to incorporate daily, weekly and annual autoregressive effects. The high-dimensional regression model is trained by lasso on an exponential tuning parameter grid of size 20. In detail the grid for the regularization parameter α is 2 L where L is an equidistant grid form 6 to −1 of size 20.

V. FORECAST COMBINATION BY SMOOTHED BERNSTEIN ONLINE AGGREGATION (BOA)
After creating all forecasting models as described in Section IV, an online aggregation procedure is used to combine the forecasts. The combination method is based on an extension of the fully adaptive Bernstein Online Aggregation (BOA) procedure, see [18]. The BOA is extended by a smoothing component and is implemented in the R package profoc [19]. It is similarly as used in [20] for CRPS learning.

A. Formal description of the algorithm
To introduce the smoothed BOA formally, we require some further notations. Denote L d,h,k the available load forecasts for forecast issue day d, prediction horizon h and forecasting model k. If current forecast is for day d, then we are looking for optimal combination weights w d,h,k . This is used to combine the predictions linearly so that is the forecast aggregation to report. Moreover, denote AD(x, y) = |y − x| the absolute deviation (also known as of AD with respect to x evaluated at forecast combination L d,h . We require AD ∇ to apply the so called gradient trick to enable optimal convergence rates in the BOA, see [18], [20]. The smoothed fully adaptive BOA with gradient trick and forgetting has the five update steps. In every update step we update the instantaneous regret r d,h,k , the range E d,h,k , the learning rate η d,h,k , the regret R d,h,k , and the combination weights w d,h,k for forecasting horizon h and forecaster k: with inital values w 0,h,k = 1/K, R 0,h,k = 0 and E 0,h,k = 0. As it can be seen in equation (10) the BOA considers an exponential updating schema as the popular exponential weighted averaging (EWA), see [21]. The BOA will lead always to a convex combination of the forecasters, as the EWA. Further, is well known that the EWA in combination with the gradient trick can achieve optimal convergence rates, if the considered updating loss is exp-concave, see [21]. Unfortunately, the required absolute deviation AD is not expconcave. Therefore, the BOA uses a second order refinement in the weight update to achieve better convergence rates under weaker regularity conditions on the considered loss. In fact, the mentioned gradient trick and the second order refinement allow the BOA to achieve almost optimal convergence rates for the selection problem and convex aggregation problem. [18] and [22] prove that the BOA considered for absolute deviation loss has almost linear convergence with respect to the prediction performance of the best individual expert and a almost (standard) square root convergence with respect to the optimal convex combination. Both convergence rates are only almost optimal as there is an additional log(log) term in both convergence rates which is due to the online calibration of the learning rate. Now, we motivate the smoothing extension of the BOA: The described BOA algorithm applies the forecast combination to each target hour h individually. However, it could be a reasonable assumption that the weights w d,h,k are constant across all h ∈ H. This restriction reduces the estimation risk in the algorithm for sacrificing theoretical optimality. Hence, we want to find solution between those two extreme situations which finds the optimal trade-off. Therefore, we are considering smoothing splines, applied to the weights w d,h,k . As suggested by [20] we consider cubic P-splines on an equidistant grid of knots of size 24. The smoothed weights w d,h,k are computed by where λ ≥ 0 is a smoothing parameter, B is the matrix of cubic B-splines and D is the difference matrix where the difference operator is applied to the identity. Note that we difference only once, as this implies smoothing towards a constant function if λ → ∞, see [20]. The tuning parameter λ has to be determined.

B. Application, parameter tuning and forecasting results
As explained in the introduction the competition was conducted in a rolling window framework and maps realistic settings. However, for illustration purpose, we concentrate one forecasting task, this is to forecast the 1st February 2021 from 0:00 to 23:00 where the last available observation is on 31st January 2021 7:00.
We decided to utilize a stepwise forward approach to determine which forecasts to combine using the BOA. Therefore, we consider a burn-in period of 30 days (to allow local convergence of the BOA) and keep the last 60 days of available data for calibration. The final number of models M to combine was determined by evaluating the MAE of the M max = 40 combination procedures on the calibration data set. The results for the validation MAE across all forecasting horizons are shown in Figure (6). Additionally, we label the selected models for the optimal number of models to combine, which is 5 in this situation. We observe that especially the first few models contribute substantially to the MAE reduction which is about 10% compared to the best individual model. It is interesting to see that the selected 5 models are quite diverse. Those are three lasso based models, a GAM model and an STL+ETS model. From the selected lasso models, two use a long history of about 3 years of data and one just a very short history of about 3 months. Also the GAM model considers a relatively short history of 7 months.
After selecting the forecasters to combine we run a BOA algorithm on an exponential λ-grid. We choose always the λvalue which performs best in the past to predict the next day. More precisely, we chose the λ-value so that the exponentially discounted MAE with a forgetting parameter ρ = 0.01 is minimized. Note that this forget corresponds to an effective sample size of 1/ρ which is 100, so about 3 months. Figure  (6) shows the results for the selected values for the smoothing parameter λ on the considered training and validation set. We observe that the selected smoothing parameter clearly varies over time. It is also interesting to see that in the burn-in phase very high λ values where selected. This correspond to a conservative selection with low estimation risk. This selection is plausible, as the amount of information to evaluate is low in the burn-in period.
Figure (7) visualizes the evolution of the combination weights of the BOA algorithm over time for the forecasting horizons h = 17 and h = 40. We observe significant differences, especially the models with short calibration windows (lasso model with D = 76 and GAM with D = 209) have more weight for h = 40.
The same finding can be seen in Figure (8). Here, we illustrate the smoothing across the forecasting horizon for the 24 hours in the forecasting horizon. We added limiting cases with constant weights (λ → ∞) and pointwise optimized weights (λ = 0) to illustrate the effect of smoothing. The forecast of the smoothed BOA approach is illustrated in Figure  (9). There we see that the GAM model tends to underestimate and the STL+ETS model overestimated the load for the considered forecasting horizon. Thus, they can be regarded as bias correcting models.

VI. CONCLUSION
In this manuscript we present one of the winning methods the IEEE DataPort Competition on Day-Ahead Electricity Demand Forecasting: Post-COVID Paradigm. It utilizes a sophisticated holiday adjustment procedure, and a novel forecast combination method based on smoothed Bernstein online aggregation (BOA). The approach is flexible and can quickly adopt to new energy system situations.
Obviously, better results may be achieved by more advanced tuning parameter selection design which suffers clearly some optimality. For instance, some choices on parameter tuning were done ad hoc (e.g. forgetting rate for tuning parameter selection of ρ = 0.01, validation period of 60 days) which could be optimized. Furthermore, other BOA extensions as discussed in [20] like fixed share or regret forgetting could be used as well. Moreover, the pool of individual forecasting models could be enriched as well. This holds particularly for non-linear models that utilize gradient boosting machines or artificial neural networks. However, the analysis showed that the main features for this short-term load forecasting task are linear, especially the autoregressive and seasonal effects. Hence, no huge improvement should be expected by integrating mentioned models.