Neural basis expansion analysis with exogenous variables: Forecasting electricity prices with NBEATSx

We extend the neural basis expansion analysis (NBEATS) to incorporate exogenous factors. The resulting method, called NBEATSx, improves on a well-performing deep learning model, extending its capabilities by including exogenous variables and allowing it to integrate multiple sources of useful information. To showcase the utility of the NBEATSx model, we conduct a comprehensive study of its application to electricity price forecasting (EPF) tasks across a broad range of years and markets. We observe state-of-the-art performance, significantly improving the forecast accuracy by nearly 20% over the original NBEATS model, and by up to 5% over other well-established statistical and machine learning methods specialized for these tasks. Additionally, the proposed neural network has an interpretable configuration that can structurally decompose time series, visualizing the relative impact of trend and seasonal components and revealing the modeled processes' interactions with exogenous factors. To assist related work, we make the code available at https://github.com/cchallu/nbeatsx.


Introduction
In the last decade, significant progress has been made in the application of deep learning to forecasting tasks, with models such as the exponential smoothing recurrent neural network (ESRNN; Smyl 2020) and the neural basis expansion analysis (NBEATS; Oreshkin et al. 2020) outperforming classical statistical approaches in the recent M4 competition (Makridakis et al., 2020). Despite this success, we still identify two possible improvements: the integration of time-dependent exogenous variables as inputs, and the interpretability of the neural network outputs.
Neural networks have proven powerful and flexible, yet there are many situations where our understanding of the model's predictions can be as crucial as their accuracy, and the lack of such understanding constitutes a barrier to wider adoption. The interpretability of an algorithm's outputs is critical because it encourages trust in its predictions, improves our knowledge of the modeled processes, and provides insights that can improve the method itself.
Additionally, the absence of time-dependent covariates makes these powerful models unsuitable for many applications. For instance, Electricity Price Forecasting (EPF) is a task where covariate features are fundamental to obtain accurate predictions. For this reason, we chose this challenging application as a test ground for our proposed forecasting methods.
In this work, we address the two mentioned limitations, first, by extending the neural basis expansion analysis so that it can incorporate temporal and static exogenous variables, and second, by further exploring the interpretable configuration of NBEATS and showing its use as a time-series signal decomposition tool. We refer to the new method as NBEATSx. The main contributions of this paper include: (i) Incorporation of Exogenous Variables: We propose improvements to the NBEATS model to incorporate time-dependent as well as static exogenous variables. For this purpose, we have designed a special substructure built with convolutions to clean and encode useful information from these covariates while respecting the time dependencies present in the data. These enhancements greatly improve the accuracy of the NBEATS method and extend its interpretability capabilities, which remain rare in neural forecasting. (ii) Interpretable Time Series Signal Decomposition: Our method combines the power of non-linear transformations provided by neural networks with the flexibility to model multiple seasonalities and simultaneously account for interaction events such as holidays and other covariates, all while remaining interpretable. The extended NBEATSx architecture can decompose its predictions into the classic set of level, trend, and seasonality components, and identify the effects of exogenous covariates. (iii) Time Series Forecasting Comparison: We showcase the NBEATSx model on five EPF tasks, achieving state-of-the-art performance on all of the considered datasets. We obtain accuracy improvements of almost 20% over the original NBEATS and ESRNN architectures, and up to 5% over other well-established, EPF-tailored machine learning methods (Lago et al., 2021a).
The remainder of the paper is structured as follows. Section 2 reviews relevant literature on the developments and applications of deep learning to sequence modeling and current approaches to EPF. Section 3 introduces mathematical notation and describes the NBEATSx model. Section 4 explores our model's application to time series decomposition and forecasting over a broad range of electricity markets and time periods. Finally, Section 5 discusses possible directions for future research, wraps up the results, and concludes the paper.

Deep Learning and Sequence Modeling
Deep learning (DL) has demonstrated significant utility in solving sequence modeling problems, with applications to natural language processing, audio signal processing, and computer vision. This subsection summarizes the critical DL developments in sequence modeling that are building blocks of the NBEATS and ESRNN architectures.
For a long time, sequence modeling with neural networks and Recurrent Neural Networks (RNNs; Elman 1990) were treated as synonyms. The hidden internal activations of RNNs, propagated through time, provide these models with the ability to encode the observed past of the sequence. This explains their great popularity in building different variants of Sequence-to-Sequence (Seq2Seq) models applied to natural language processing (Graves, 2013) and machine translation (Sutskever et al., 2014). Most progress on RNNs was made possible by architectural innovations and novel training techniques that made their optimization easier.
The adoption of convolutions and skip-connections within the recurrent structures were important precursors for new advancements in sequence modeling, as using deeper representations endowed longer effective memory for the models. Examples of such precursors could be found in WaveNet for audio generation and machine translation (van den Oord et al., 2016), as well as the Dilated Recurrent Neural Network (DilRNN; Chang et al. 2017) and the Temporal Convolutional Network (TCN; Bai et al. 2018).
Nowadays, Seq2Seq models and their derivatives can learn complex nonlinear temporal dependencies efficiently, and their use in the time series analysis domain has been a great success. Seq2Seq models have recently shown better forecasting performance than classical statistical methods, while greatly simplifying forecasting systems into single-box models, such as the Multi Quantile Convolutional Neural Network (MQCNN; Wen et al. 2017), the Exponential Smoothing Recurrent Neural Network (ESRNN; Smyl 2020), and the Neural Basis Expansion Analysis (NBEATS; Oreshkin et al. 2020). For quite a while, academia resisted the broad adoption of these new methods (Makridakis et al., 2018), although their evident success in challenges such as the M4 competition has motivated wider adoption by the forecasting research community (Benidis et al., 2020).

Electricity Price Forecasting
The Electricity Price Forecasting (EPF) task aims at predicting the spot (balancing, intraday, day-ahead) and forward prices in wholesale markets. Since the workhorse of short-term power trading is the day-ahead market with its once-per-day uniform-price auction (Mayer & Trück, 2018), the vast majority of research has focused on predicting electricity prices for the 24 hours of the next day, either in a point (Weron, 2014; Lago et al., 2021a) or a probabilistic setting (Nowotarski & Weron, 2018). There are also studies on EPF for very short-term (Narajewski & Ziel, 2020) as well as mid- and long-term horizons (Ziel & Steinert, 2018). The recent expansion of renewable energy generation and large-scale battery storage has induced complex dynamics in the already volatile electricity spot prices, turning the field into a prolific subject on which to test novel forecasting ideas and trading strategies (Chitsaz et al., 2018; Gianfreda et al., 2020; Uniejewski & Weron, 2021).
Out of the numerous approaches to EPF developed over the last two decades, two classes of models are of particular importance when predicting day-ahead prices: statistical models (also called econometric or technical analysis), in most cases based on linear regression, and computational intelligence models (also referred to as artificial intelligence, non-linear, or machine learning), with neural networks as the fundamental building block. Among the latter, many of the recently proposed methods utilize deep learning (Wang et al. 2017; Lago et al. 2018a; Marcjasz 2020) or are hybrid solutions that typically comprise data decomposition, feature selection, clustering, forecast averaging, and/or heuristic optimization to estimate the model (hyper)parameters (Nazar et al., 2018; Li & Becker, 2021).
Unfortunately, as argued by Lago et al. (2021a), the majority of the neural network EPF research suffers from test periods that are too short and limited to a single market, a lack of well-performing and established benchmark methods, and/or incomplete descriptions of the pipeline and training methodology, resulting in poor reproducibility of the results. To address these shortcomings, our models are compared across two-year out-of-sample periods from five power markets, using two highly competitive benchmarks recommended in previous studies: the Lasso Estimated Auto-Regressive (LEAR) model and a (relatively) parsimonious Deep Neural Network (DNN).

NBEATSx Model
As a general overview, the NBEATSx framework decomposes the objective signal by performing separate local nonlinear projections of the target data onto basis functions across its different blocks. Figure 1 depicts the general architecture of the model. Each block consists of a Fully Connected Neural Network (FCNN; Rosenblatt 1961) that learns expansion coefficients for the backcast and forecast elements. The backcast is used to clean the inputs of subsequent blocks, while the forecasts are summed to compose the final prediction. The blocks are grouped into stacks, and each of the potentially multiple stacks specializes in a different variant of basis functions.
To continue the description of the NBEATSx, we introduce the following notation: the objective signal is represented by the vector $\mathbf{y}$, and the inputs of the model are the backcast window $\mathbf{y}^{back}$ of length $L$ and the forecast window $\mathbf{y}^{for}$ of length $H$, where $L$ denotes the number of lags available as classic autoregressive features and $H$ is the forecast horizon treated as the objective. While the original NBEATS only admits the backcast window of the target variable $\mathbf{y}^{back}$ as a regressor, NBEATSx also incorporates covariates, denoted by the matrix $\mathbf{X}$. Figure 1 shows an example where the target variable is the hourly electricity price, the backcast window has a length $L$ of 96 hours, and the forecast horizon $H$ is 72 hours; in the example, the covariate matrix $\mathbf{X}$ is composed of wind power production and electricity load. For the EPF comparative analysis of Section 4.3.6, the horizon considered is $H = 24$, corresponding to day-ahead predictions, while the backcast length $L = 168$ corresponds to a week of lagged values. For its predictions, the NBEATS model only receives a local vector of inputs corresponding to the backcast period, making the computations exceptionally fast. The model can still represent longer time dependencies through its local inputs from the exogenous variables; for example, it can learn long seasonal effects from calendar variables.
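For concreteness, the backcast/forecast windowing just described can be sketched as follows. The function and variable names are ours, not those of the reference implementation, and the arrays are random stand-ins for prices and covariates:

```python
import numpy as np

def make_windows(y, X, L=168, H=24):
    """Slice a series into (backcast, exogenous, target) training windows.

    y : array of shape (T,)      -- target series (e.g. hourly prices)
    X : array of shape (T, n_x)  -- time-aligned exogenous covariates
    Returns backcasts (n, L), exogenous windows (n, L + H, n_x), targets (n, H).
    """
    T = len(y)
    backs, exogs, tgts = [], [], []
    for t in range(L, T - H + 1):
        backs.append(y[t - L:t])          # L lagged values of the target
        exogs.append(X[t - L:t + H])      # covariates cover backcast AND horizon
        tgts.append(y[t:t + H])           # H future values to predict
    return np.stack(backs), np.stack(exogs), np.stack(tgts)

T = 1000
y = np.random.randn(T)
X = np.random.randn(T, 2)                 # e.g. load and wind forecasts
yb, xw, yf = make_windows(y, X)
print(yb.shape, xw.shape, yf.shape)       # (809, 168) (809, 192, 2) (809, 24)
```

Note that, unlike the lagged target, the exogenous window extends over the forecast horizon as well, since day-ahead load and generation forecasts are available at prediction time.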
Overall, as shown in Figure 1, the NBEATSx is composed of $S$ stacks of $B$ blocks each. The input of the first block consists of $L$ lags of the target time series $\mathbf{y}$ (the backcast window $\mathbf{y}^{back}$) and the exogenous matrix $\mathbf{X}$, while the inputs of each subsequent block include residual connections with the backcast output of the previous block. The next subsections describe the blocks, stacks, and model predictions in detail.

Blocks
For a given s-th stack and b-th block within it, the NBEATSx model performs two transformations, depicted in the blue rectangle of Figure 1. The first transformation, defined in Equation (1), takes the input data $(\mathbf{y}^{back}_{s,b-1}, \mathbf{X}_{s,b-1})$ and applies a Fully Connected Neural Network (FCNN; Rosenblatt 1961) to learn hidden units $\mathbf{h}_{s,b} \in \mathbb{R}^{N_h}$ that are linearly adapted into the forecast $\theta^{for}_{s,b} \in \mathbb{R}^{N_s}$ and backcast $\theta^{back}_{s,b} \in \mathbb{R}^{N_s}$ expansion coefficients, with $N_s$ the dimension of the stack basis:

$$\mathbf{h}_{s,b} = \mathrm{FCNN}_{s,b}\left(\mathbf{y}^{back}_{s,b-1}, \mathbf{X}_{s,b-1}\right), \qquad \theta^{back}_{s,b} = \mathrm{LINEAR}^{back}\left(\mathbf{h}_{s,b}\right), \qquad \theta^{for}_{s,b} = \mathrm{LINEAR}^{for}\left(\mathbf{h}_{s,b}\right). \tag{1}$$
The second transformation, defined in Equation (2), consists of a basis expansion operation between the learnt coefficients and the block's basis vectors $\mathbf{V}^{back}_{s,b} \in \mathbb{R}^{L \times N_s}$ and $\mathbf{V}^{for}_{s,b} \in \mathbb{R}^{H \times N_s}$; this transformation yields the backcast $\hat{\mathbf{y}}^{back}_{s,b}$ and forecast $\hat{\mathbf{y}}^{for}_{s,b}$ components:

$$\hat{\mathbf{y}}^{back}_{s,b} = \mathbf{V}^{back}_{s,b}\,\theta^{back}_{s,b}, \qquad \hat{\mathbf{y}}^{for}_{s,b} = \mathbf{V}^{for}_{s,b}\,\theta^{for}_{s,b}. \tag{2}$$
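A minimal numpy sketch of these two transformations for a single block, with randomly initialized matrices standing in for the learned FCNN and linear-adaptation weights (a one-layer FCNN on the backcast alone is assumed here for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
L, H, N_s, N_h = 168, 24, 8, 32   # lags, horizon, basis size, hidden units

def relu(z):
    return np.maximum(z, 0.0)

# Randomly initialized parameters of one block (stand-ins for learned weights).
W1 = rng.normal(scale=0.1, size=(N_h, L))          # one-layer FCNN
W_back = rng.normal(scale=0.1, size=(N_s, N_h))    # backcast coefficient head
W_for = rng.normal(scale=0.1, size=(N_s, N_h))     # forecast coefficient head
V_back = rng.normal(size=(L, N_s))                 # backcast basis vectors
V_for = rng.normal(size=(H, N_s))                  # forecast basis vectors

def block(y_back):
    h = relu(W1 @ y_back)            # hidden units, as in Equation (1)
    theta_back = W_back @ h          # backcast expansion coefficients
    theta_for = W_for @ h            # forecast expansion coefficients
    # Basis expansion, as in Equation (2)
    return V_back @ theta_back, V_for @ theta_for

y_back = rng.normal(size=L)
yhat_back, yhat_for = block(y_back)
print(yhat_back.shape, yhat_for.shape)   # (168,) (24,)
```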

Stacks and Residual Connections
The blocks are organized into stacks using the doubly residual stacking principle, which is described in Equation (3) and depicted in the brown rectangle of Figure 1:

$$\mathbf{y}^{back}_{s,b+1} = \mathbf{y}^{back}_{s,b} - \hat{\mathbf{y}}^{back}_{s,b}, \qquad \hat{\mathbf{y}}^{for}_{s} = \sum_{b=1}^{B} \hat{\mathbf{y}}^{for}_{s,b}. \tag{3}$$

The residual backcast $\mathbf{y}^{back}_{s,b+1}$ allows the model to subtract the component associated with the basis $\mathbf{V}^{back}_{s,b}$ of the (s, b)-th stack and block from its input, which can also be thought of as a sequential decomposition of the modeled signal. In turn, this methodology helps the optimization procedure, as it prepares the inputs of the subsequent layer, making the downstream forecast easier. The stack forecast $\hat{\mathbf{y}}^{for}_{s}$ aggregates the partial forecasts from each block.
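The doubly residual principle can be illustrated with toy linear blocks (random linear maps standing in for trained blocks; all names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
L, H, S, B = 168, 24, 2, 3   # lags, horizon, stacks, blocks per stack

def make_block():
    # One block reduced to a pair of random linear maps (backcast, forecast).
    A_back = rng.normal(scale=0.05, size=(L, L))
    A_for = rng.normal(scale=0.05, size=(H, L))
    return lambda y: (A_back @ y, A_for @ y)

blocks = [[make_block() for _ in range(B)] for _ in range(S)]

def nbeats_forward(y_back):
    residual = y_back.copy()
    forecast = np.zeros(H)
    for stack in blocks:
        for blk in stack:
            yhat_back, yhat_for = blk(residual)
            residual = residual - yhat_back   # subtract the explained component
            forecast = forecast + yhat_for    # partial forecasts are summed
    return forecast                           # sum over all blocks and stacks

print(nbeats_forward(rng.normal(size=L)).shape)   # (24,)
```

Each block only sees what the previous blocks could not explain, which is exactly the sequential decomposition described above.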

Model predictions
The final predictions $\hat{\mathbf{y}}^{for}$ of the model, shown in the yellow rectangle of Figure 1, are obtained by summing all the stack predictions:

$$\hat{\mathbf{y}}^{for} = \sum_{s=1}^{S} \hat{\mathbf{y}}^{for}_{s}. \tag{4}$$
The additive generation of the forecast implies a very intuitive decomposition of the prediction components when the bases within the blocks are interpretable.

NBEATSx Configurations
The original neural basis expansion analysis method proposed two configurations, which differ in the assumptions encoded in the learning algorithm through the choice of the basis vectors $\mathbf{V}^{back}_{s,b}$ and $\mathbf{V}^{for}_{s,b}$ used in the blocks of Equation (2). A mindful selection of restrictions on the basis allows the model to output an interpretable decomposition of the forecasts, while letting the basis be freely determined can produce more flexible forecasts by effectively removing any constraints on the form of the basis functions.
In this subsection, we present both interpretable and generic configurations, explaining in particular how we propose to include the covariates in each case. We limit ourselves to the analysis of the forecast basis, as the backcast basis analysis is almost identical, only differing by its extension over time. We show an example in Appendix A.1.

Interpretable Configuration
The choice of basis vectors relies on time series decomposition techniques that are often used to understand the structure of a given time series and patterns of its variation. Work in this area ranges from classical smoothing methods and their extensions such as X-11-ARIMA, X-12-ARIMA, and X-13-ARIMA-SEATS, to modern approaches such as TBATS (Livera et al., 2011). To encourage interpretability, the blocks within each stack may use harmonic functions, polynomial trends, and exogenous variables directly to perform their projections. The partial forecasts of the interpretable configuration are described by Equations (5)-(7).
where the time vector $\mathbf{t} = [0, 1, 2, \dots, H-2, H-1]/H$ is defined discretely. When the basis is the polynomial matrix $[\mathbf{1}, \mathbf{t}, \mathbf{t}^{2}, \dots, \mathbf{t}^{N_{pol}}]$, where $N_{pol}$ is the maximum polynomial degree, the coefficients are those of a trend polynomial model. When the basis $\mathbf{V}^{for}_{s,b}$ is composed of harmonic functions $[\mathbf{1}, \cos(2\pi \mathbf{t}), \sin(2\pi \mathbf{t}), \dots]$, the coefficient vector $\theta^{for}_{s,b}$ can be interpreted as Fourier transform coefficients, with the hyperparameter $N_{hr}$ controlling the harmonic oscillations. The exogenous basis expansion can be thought of as a time-varying local regression when the basis is the matrix $\mathbf{X} = [\mathbf{X}_1, \dots, \mathbf{X}_{N_x}] \in \mathbb{R}^{H \times N_x}$, where $N_x$ is the number of exogenous variables. The resulting models can flexibly reflect common structural assumptions through the interpretable bases and their combinations.
In this paper, we propose including one more type of stack to specifically represent the exogenous variable basis, as described in Equation (7) and depicted in Figure 1. In the original NBEATS framework (Oreshkin et al., 2020), the interpretable configuration usually consists of a trend stack followed by a seasonality stack, each containing three blocks. Our NBEATSx extension of this configuration consists of three stacks, one for each type of factor (trend, seasonal, exogenous). We refer to the original interpretable configuration and its enhanced version as the NBEATS-I and NBEATSx-I models, respectively.
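For illustration, the trend and seasonality bases of the interpretable configuration can be constructed as follows; the exact thinning of the harmonics by N_hr in the reference implementation may differ from this sketch:

```python
import numpy as np

H, N_pol, N_hr = 24, 2, 1
t = np.arange(H) / H                       # discrete time vector t = [0, ..., H-1]/H

# Polynomial trend basis: columns 1, t, t^2, ..., t^{N_pol}
V_trend = np.stack([t**i for i in range(N_pol + 1)], axis=1)

# Harmonic (Fourier) basis: a constant plus cosine/sine pairs; in NBEATSx the
# hyperparameter N_hr controls how densely the harmonics are sampled.
freqs = np.arange(1, H // 2)
cos = np.cos(2 * np.pi * np.outer(t, freqs[::N_hr]))
sin = np.sin(2 * np.pi * np.outer(t, freqs[::N_hr]))
V_seas = np.hstack([np.ones((H, 1)), cos, sin])

theta_trend = np.array([10.0, 5.0, -2.0])  # example expansion coefficients
trend_forecast = V_trend @ theta_trend     # smooth, slowly varying trend curve
print(V_trend.shape, V_seas.shape, trend_forecast.shape)
```

Projecting the learned coefficients onto these fixed bases is what makes the stack outputs directly readable as trend and seasonality components.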

Generic Configuration
For the generic configuration, the basis of the non-linear projection in Equation (2) corresponds to the canonical vectors, that is, $\mathbf{V}^{for}_{s,b} = \mathbf{I}_{H \times H}$, an identity matrix of dimensionality equal to the forecast horizon $H$, matching the cardinality of the coefficients $\theta^{for}_{s,b}$. This basis enables NBEATSx to effectively behave like a classic Fully Connected Neural Network (FCNN). The output layer of the FCNN inside each block has $H$ neurons, corresponding to the forecast horizon, each producing the forecast for one particular time point of the forecast period. This can be understood as the basis vectors being learned during optimization, allowing the waveform of the basis of each stack to be freely determined in a data-driven fashion. Compared to the interpretable counterpart described in Section 3.4.1, the constraints on the form of the basis functions are removed. This affords the generic variant more flexibility and power to represent complex data, but it can also lead to less interpretable outcomes and an increased risk of overfitting.
For the NBEATSx model with the generic configuration, we propose a new type of exogenous block that learns a context vector $\mathbf{C}_{s,b}$ from the time-dependent covariates with an encoder convolutional sub-structure. In our implementation, a Temporal Convolutional Network (TCN; Bai et al. 2018) is employed as the encoder, but any neural network with a sequential structure is compatible with the backcast and forecast branches of the model and could be used instead. For example, WaveNet (van den Oord et al., 2016) can be an effective alternative to RNNs, as it is also able to capture long-term dependencies and interactions of covariates by stacking multiple layers, while dilations keep the model computationally tractable. In addition, convolutions have a very convenient interpretation as weighted moving-average signal filters. The final linear projection and the additive composition of the predictions can be interpreted as a decoder.
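A toy illustration of the causal, dilated convolutions underlying such an encoder (pure numpy, single channel, random weights; an actual TCN additionally uses learned multi-channel kernels, residual connections, and normalization):

```python
import numpy as np

def causal_dilated_conv(x, w, dilation):
    """One dilated causal convolution channel: y[t] depends only on x[<= t]."""
    k = len(w)
    T = len(x)
    pad = np.concatenate([np.zeros((k - 1) * dilation), x])  # left padding only
    return np.array([sum(w[j] * pad[t + j * dilation] for j in range(k))
                     for t in range(T)])

rng = np.random.default_rng(2)
T = 192                     # backcast + horizon hours of one covariate
x = rng.normal(size=T)      # e.g. the load forecast covariate
w = rng.normal(size=3)      # kernel of width 3 (stand-in for learned weights)

# Stacking layers with dilations 1, 2, 4, ... grows the receptive field
# exponentially while keeping the computation cheap.
h = x
for d in (1, 2, 4, 8):
    h = np.tanh(causal_dilated_conv(h, w, d))
context = h                 # plays the role of the context vector C_{s,b}
print(context.shape)        # (192,)
```

The causality property matters here: the encoding at hour t never leaks information from later hours of the covariate window.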
The original NBEATS configuration includes only one generic stack with dozens of blocks, while our proposed model includes both generic and exogenous stacks, with their order determined via data-driven hyperparameter tuning. We refer to this configuration as the NBEATSx-G model.

Exogenous Variables
We distinguish the exogenous variables by whether they reflect static or time-dependent aspects of the modeled data. Static exogenous variables carry time-invariant information. When the model is built with common parameters to forecast multiple time series, these variables allow information to be shared within groups of time series with similar static variable levels. Examples of static variables include designators such as identifiers of regions or groups of products.
As for the time-dependent exogenous covariates, we discern two subtypes. First, we consider seasonal covariates derived from the natural frequencies of the data. These variables help NBEATSx identify seasonal patterns and special events inside and outside the lookback window. Examples of these are the trends and harmonic functions from Equation (5) and Equation (6). Second, we identify domain-specific temporal covariates unique to each problem; the EPF setting typically includes day-ahead forecasts of electricity load and production levels from renewable energy sources.
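As an illustration of seasonal covariates, calendar features for an hourly timestamp might be encoded as follows (the feature names are ours, not from the reference implementation):

```python
import datetime
import math

def calendar_features(ts):
    """Seasonal covariates for one hourly timestamp: day-of-week index plus
    sine/cosine encodings of the daily and weekly cycles."""
    hour_angle = 2 * math.pi * ts.hour / 24
    week_angle = 2 * math.pi * ts.weekday() / 7
    return {
        "dow": ts.weekday(),                  # 0 = Monday, ..., 6 = Sunday
        "hour_sin": math.sin(hour_angle),
        "hour_cos": math.cos(hour_angle),
        "week_sin": math.sin(week_angle),
        "week_cos": math.cos(week_angle),
    }

feats = calendar_features(datetime.datetime(2017, 12, 18, 9))  # a Monday, 09:00
print(feats["dow"])   # 0
```

The trigonometric encoding keeps adjacent hours close in feature space (hour 23 next to hour 0), which a raw integer encoding would not.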

Electricity Price Forecasting Datasets
To evaluate our method's forecasting capabilities, we consider short-term electricity price forecasting tasks, where the objective is to predict day-ahead prices. Five major power markets are used in the empirical evaluation, all comprised of hourly observations of the prices and two influential temporal exogenous variables that extend for 2,184 days (312 weeks, six years). From the six years of available data for each market, we hold out two years to test the forecasting performance of the algorithms. The length and diversity of the test sets allow us to obtain accurate and highly comprehensive measurements of the robustness and generalization capabilities of the models. Table 1 summarizes the key characteristics of each market. The Nord Pool electricity market (NP), which corresponds to the Nordic countries' exchange, contains the hourly prices and day-ahead forecasts of load and wind generation. The second dataset is the Pennsylvania-New Jersey-Maryland market in the United States (PJM), which contains hourly zonal prices in the Commonwealth Edison (COMED) zone and two day-ahead forecasts of load, at the system and COMED zonal levels. The remaining three markets are obtained from the integrated European Power Exchange (EPEX). The Belgian (EPEX-BE) and French (EPEX-FR) markets share the day-ahead forecast of generation in France as a covariate, since it is known to be one of the best predictors of Belgian prices (Lago et al., 2018b). Finally, the German market (EPEX-DE) contains the hourly prices, day-ahead load forecasts, and the country-level wind and solar generation day-ahead forecasts. Figure 2 displays the NP electricity price time series and its corresponding covariates to illustrate the datasets. The NP market is the least volatile among the considered markets: most of its power comes from hydroelectric generation, renewable-source volatility is negligible, and zero spikes are rare.
The PJM market is transitioning from coal generation to natural gas and some renewable sources; zero spikes are rare, but the system exhibits higher volatility than NP. In the EPEX-BE and EPEX-FR markets, negative prices and spikes are more frequent, and as time passes, these markets show increasing signs of integration. Finally, the EPEX-DE market shows few price spikes, but the most frequent negative and zero price events, due in large part to the impact of renewable sources. The exogenous covariates are normalized following best practices drawn from the EPF literature (Uniejewski et al., 2018), as preprocessing the inputs of neural networks is essential to accelerate and stabilize the optimization (LeCun et al., 1998).
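For illustration, one popular variance-stabilizing choice from that literature, median/MAD standardization followed by the area-hyperbolic-sine transform, can be sketched as follows (a simplified sketch, not necessarily the exact preprocessing used here):

```python
import numpy as np

def asinh_normalize(p):
    """Median/MAD standardization followed by the asinh transform, which
    dampens the extreme spikes common in electricity prices."""
    med = np.median(p)
    mad = np.median(np.abs(p - med)) / 0.6745   # MAD scaled to mimic a std
    z = (p - med) / mad
    return np.arcsinh(z), med, mad

def asinh_denormalize(z, med, mad):
    return np.sinh(z) * mad + med

prices = np.array([30.0, 32.5, 28.0, 250.0, -5.0, 31.0])  # spiky toy prices
z, med, mad = asinh_normalize(prices)
back = asinh_denormalize(z, med, mad)
print(np.allclose(back, prices))   # True
```

Robust centering (median/MAD instead of mean/std) keeps a single spike from dominating the scale, and the transform is exactly invertible, so forecasts can be mapped back to the price scale.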

Interpretable Time Series Signal Decomposition
In this subsection, we demonstrate the versatility of the proposed method. We show how a careful selection of the inductive bias, constituted by the assumptions used to learn the modeled signal, endows NBEATSx with an outstanding ability to model complex dynamics while enabling human understanding of its outputs, turning it into a unique and exciting tool for time series analysis. Our method combines the power of non-linear transformations provided by neural networks with the flexibility to model multiple seasonalities, which can be fractional, while simultaneously accounting for interaction events such as holidays and other covariates. As described earlier, the interpretable configuration of the NBEATSx architecture computes time-varying coefficients for slowly changing polynomial functions to model the trend, harmonic functions to model the cyclical behavior of the signal, and exogenous covariates. Here, we show how this configuration can decompose a time series into the classic set of level, trend, and seasonality components, while identifying the covariate effects.
In this time series signal decomposition example, we show how the NBEATSx-I model benefits over NBEATS-I from explicitly accounting for the information carried by exogenous covariates. Figure 3 shows the NP electricity market's hourly price (EUR/MWh) for December 18, 2017, a day with high prices due to high load. Other days show a less pronounced difference between the results obtained with the original NBEATS-I and the NBEATSx-I; we selected a day with a higher than normal load for exposition purposes, to demonstrate qualitative differences in the forecasts. We can see a substantial difference in the forecast residual magnitudes in the bottom row of Figure 3: NBEATS-I shows a strong negative bias, while NBEATSx-I is able to capture the evidently substantial explanatory value of the exogenous features, resulting in a much more accurate forecast.

Figure 2: The top panel shows the day-ahead electricity price time series for the Nord Pool (NP) market; the second and third panels show the day-ahead forecasts of the system load and wind generation. The training data is composed of the first three years of each dataset, the validation set is the year that follows (between the first and second dotted lines), and the last two years are held out as the test set (after the second dotted line). During evaluation, we recalibrate the model, updating the training set to incorporate all available data before each daily prediction. The recalibration uses an early stopping set of 42 weeks randomly chosen from the updated training set (a sample selection is marked with blue rectangles in the top panel).

Figure 3: The top row of graphs shows the original signal and the level, the latter defined as the last available observation before the forecast. The second row shows the polynomial trend components, while the third and fourth rows display the complex seasonality modeled by nonlinear Fourier projections and the exogenous effects of the electricity load on the price, respectively. The bottom row shows the unexplained variation of the signal. The use of electricity load and production forecasts turns out to be fundamental for accurate price forecasting.

Comparative Analysis

Evaluation Metrics
To ensure the comparability of our results with the existing literature, we opted to follow the widely accepted practice of evaluating the accuracy of point forecasts with the following metrics: mean absolute error (MAE), relative mean absolute error (rMAE), symmetric mean absolute percentage error (sMAPE), and root mean squared error (RMSE), defined as:

$$\mathrm{MAE} = \frac{1}{24 N_d} \sum_{d=1}^{N_d} \sum_{h=1}^{24} \left| y_{d,h} - \hat{y}_{d,h} \right|, \qquad \mathrm{rMAE} = \frac{\sum_{d=1}^{N_d} \sum_{h=1}^{24} \left| y_{d,h} - \hat{y}_{d,h} \right|}{\sum_{d=1}^{N_d} \sum_{h=1}^{24} \left| y_{d,h} - \hat{y}^{naive}_{d,h} \right|},$$

$$\mathrm{sMAPE} = \frac{1}{24 N_d} \sum_{d=1}^{N_d} \sum_{h=1}^{24} \frac{2 \left| y_{d,h} - \hat{y}_{d,h} \right|}{\left| y_{d,h} \right| + \left| \hat{y}_{d,h} \right|}, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{24 N_d} \sum_{d=1}^{N_d} \sum_{h=1}^{24} \left( y_{d,h} - \hat{y}_{d,h} \right)^2},$$

where $y_{d,h}$ and $\hat{y}_{d,h}$ are the actual value and the forecast of the time series at day $d$ and hour $h$, $\hat{y}^{naive}_{d,h}$ is a naive reference forecast, and, given the two years of each test set, $N_d = 728$ in our experiments.
While regression-based models are estimated by minimizing squared errors, to train the neural networks we minimize absolute errors (see Section 4.3.3 below). Hence, both the MAE and RMSE are highly relevant in our context. Since they are not easily comparable across datasets, and given the popularity of such errors in forecasting practice (Makridakis et al., 2020), we have additionally computed a percentage and a relative measure. The sMAPE is used as an alternative to the MAPE, which may degenerate in the presence of values close to zero (Hyndman & Koehler, 2006). The rMAE is calculated instead of the scaled measure used in the M4 competition for the reasons explained in Sec. 5.4.2 of Lago et al. (2021a).
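The four metrics can be computed as follows (this sketch uses one common sMAPE convention, without the percentage factor, and an arbitrary naive reference for the rMAE):

```python
import numpy as np

def mae(y, yhat):
    return np.mean(np.abs(y - yhat))

def rmse(y, yhat):
    return np.sqrt(np.mean((y - yhat) ** 2))

def smape(y, yhat):
    # Symmetric MAPE; well defined as long as |y| + |yhat| > 0.
    return np.mean(2 * np.abs(y - yhat) / (np.abs(y) + np.abs(yhat)))

def rmae(y, yhat, yhat_naive):
    # Relative MAE: accuracy measured against a naive reference forecast.
    return mae(y, yhat) / mae(y, yhat_naive)

y = np.array([50.0, 55.0, 60.0, 45.0])
yhat = np.array([52.0, 54.0, 58.0, 46.0])
naive = np.array([48.0, 50.0, 52.0, 50.0])
print(mae(y, yhat), rmse(y, yhat))   # 1.5 1.5811388300841898
```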

Statistical Tests
To assess which forecasting model provides better predictions, we rely on the Giacomini-White test (GW; Giacomini & White 2006) of multi-step conditional predictive ability, which can be interpreted as a generalization of the Diebold-Mariano test (DM; Diebold & Mariano 1995) widely used in the forecasting literature. Compared with the DM or other unconditional tests, the GW test is valid under general assumptions such as heterogeneity, rather than stationarity, of the data. The GW test examines the null hypothesis of equal accuracy specified in Equation (10), measured by the L1 norm of the daily errors of a pair of models A and B, conditioned on the information $\mathcal{F}_{d-1}$ available at that moment in time:

$$H_0: \ \mathbb{E}\left[ \lVert \boldsymbol{\varepsilon}^{A}_{d} \rVert_1 - \lVert \boldsymbol{\varepsilon}^{B}_{d} \rVert_1 \,\middle|\, \mathcal{F}_{d-1} \right] = 0, \tag{10}$$

where $\boldsymbol{\varepsilon}^{A}_{d}$ and $\boldsymbol{\varepsilon}^{B}_{d}$ are the vectors of the 24 hourly forecast errors of models A and B on day $d$.
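For intuition, the simpler unconditional DM statistic on absolute-error differentials, which the GW test generalizes, can be computed as follows (a textbook sketch without the autocorrelation-robust variance used in practice):

```python
import math

def dm_statistic(err_a, err_b):
    """Unconditional Diebold-Mariano statistic on L1 loss differentials.
    Negative values favor model A; under the asymptotic normal approximation,
    |DM| > 1.96 indicates a significant difference at the 5% level."""
    d = [abs(a) - abs(b) for a, b in zip(err_a, err_b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)
    return mean / math.sqrt(var / n)

# Model A has uniformly smaller absolute errors than model B here.
err_a = [0.5, -0.4, 0.3, -0.6, 0.2, 0.4, -0.3, 0.5]
err_b = [1.5, -1.2, 1.1, -1.6, 1.3, 1.4, -1.2, 1.5]
print(dm_statistic(err_a, err_b) < 0)   # True
```

The GW test replaces this unconditional mean-zero check with a regression of the loss differential on past information, which is why it remains valid under heterogeneous data.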

Training Methodology
The cornerstone of the training methodology for NBEATSx and the benchmark models included in this work is the definition and use of the training, validation, early stopping, and test datasets depicted in Figure 2. The training set for each of the five markets comprises the first three years of data, and the test set includes the last two years of data. The validation set is defined as the year between the training and test set coverages. The early stopping set, used for regularization, consists of 42 weeks that are either randomly sampled from the training data or taken from the end of its time span. These sets are used in the hyperparameter optimization and recalibration phases that we describe below.
During the hyperparameter optimization phase, model performance measured on the validation set is used to guide the exploration of the hyperparameter space defined in Table 2. During the recalibration phase, the optimally selected model, as defined by its hyperparameters, is re-trained for each day to include newly available information before the test inference. In this phase, an early stopping set provides a regularization signal for the retraining optimization.
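The recalibration phase can be sketched as a rolling loop (function names are placeholders, not the NBEATSx API):

```python
# Schematic daily recalibration: before predicting each test day, the training
# window is extended with all data observed so far and the model is refit
# (with early stopping handled inside `fit`).

def recalibration_forecasts(data, first_test_day, n_test_days, fit, predict):
    forecasts = []
    for day in range(first_test_day, first_test_day + n_test_days):
        train = data[:day]                      # all information available so far
        model = fit(train)                      # refit on the updated training set
        forecasts.append(predict(model, day))   # 24 day-ahead prices
    return forecasts

# Toy stand-ins: "fit" memorizes the last value, "predict" repeats it 24 times.
data = list(range(100))
fc = recalibration_forecasts(data, 90, 3,
                             fit=lambda tr: tr[-1],
                             predict=lambda m, d: [m] * 24)
print(len(fc), fc[0][0])   # 3 89
```

With a real model, the `fit` step is where the early stopping set provides the regularization signal described above.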
To train the neural network, and as is common in the literature (Smyl, 2020;Oreshkin et al., 2020), we minimize the mean absolute error (MAE) using stochastic gradient descent with Adaptive Moments (ADAM; Kingma & Ba 2014). Figure A.2 in the Appendix compares the training and validation trajectories for NBEATS and NBEATSx, as diagnostics to assess the differences of the methods. The early stopping strategy halts the training procedure if a specified number of consecutive iterations occur without improvements of the loss measured on the early stopping set (Yao et al., 2007).
The NBEATSx model is implemented and trained in PyTorch (Paszke et al., 2019) and can be run with both CPU and GPU resources. The code is available publicly in a dedicated repository to promote reproducibility of the presented results and to support related research.

Hyperparameter Optimization
We follow the practice of Lago et al. (2018a) to select the hyperparameters that define the model, the input features, and the optimization settings. During this phase, the validation dataset is used to guide the search for well-performing configurations. To compare the benchmarks and NBEATSx, we rely on the same automated selection process: a Bayesian optimization technique that efficiently explores the hyperparameter space using tree-structured Parzen estimators (HYPEROPT; Bergstra et al. 2011). The architecture, optimization, regularization, and data hyperparameters are summarized in Table 2.

Table 2: Hyperparameter search space. Architecture parameters: N_hr ∈ {1, 2}, and whether the NBEATSx expansion coefficients take the exogenous input X (Equation (1)) ∈ {True, False}. Optimization and regularization parameters: the initialization strategy for the network weights ∈ {orthogonal, he_norm, glorot_norm}; the initial learning rate; the batch size ∈ {256, 512}; the learning rate decay constant {0.5}, which allows a large initial learning rate to escape local minima, together with the number of times the learning rate is halved during training; the maximum number of training iterations {30000}; the number of iterations without validation loss improvement before early stopping {100}; whether batch normalization is applied after each activation ∈ {True, False}; the dropout probabilities of the projection layers and of the exogenous encoder, each in Range(0, 1); the Lasso penalty constant on the coefficients, in Range(0, 0.1); the constant controlling the L2 regularization of the weights; and the MAE training loss. Data parameters: the number of random validation weeks from the full dataset {42}; the window sampling frequency ∈ {1, 24}; the length of the full dataset (4 years); the number of validation weeks used for early stopping; the normalization strategy ∈ {none, median, invariant, std}; the random seed controlling weight initialization; and the number of hyperparameter search iterations {1500}.

To have comparable results, during the hyperparameter optimization stage we used the same number of configurations as in Lago et al. (2018a). Note that some of the methods do not require any hyperparameter optimization (e.g., the AR1 benchmark), and some might have only a single hyperparameter to determine, such as the regularization parameter in the LEARx method, which is typically computed using information criteria or cross-validation.
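The automated selection over a space like that of Table 2 can be illustrated with a plain random search over discrete choices and continuous ranges; the paper instead uses Bayesian optimization with tree-structured Parzen estimators via HYPEROPT. The `SPACE` entries and the toy objective below are hypothetical stand-ins for the real validation loss.

```python
import random

# Hypothetical search space echoing the kinds of entries in Table 2.
SPACE = {
    "initialization": ["orthogonal", "he_norm", "glorot_norm"],
    "batch_normalization": [True, False],
    "dropout": ("uniform", 0.0, 1.0),
    "l1_penalty": ("uniform", 0.0, 0.1),
    "window_sampling_freq": [1, 24],
}

def sample(space, rng):
    """Draw one configuration: lists are discrete choices,
    tuples ("uniform", lo, hi) are continuous ranges."""
    cfg = {}
    for name, spec in space.items():
        if isinstance(spec, tuple):
            _, lo, hi = spec
            cfg[name] = rng.uniform(lo, hi)
        else:
            cfg[name] = rng.choice(spec)
    return cfg

def search(objective, space, n_evals, seed=0):
    """Plain random search: evaluate n_evals sampled configurations on the
    validation loss and keep the best. (TPE additionally models which
    regions of the space produced low losses and samples from them.)"""
    rng = random.Random(seed)
    trials = [(objective(cfg), cfg)
              for cfg in (sample(space, rng) for _ in range(n_evals))]
    return min(trials, key=lambda t: t[0])

# Toy objective: pretend the validation MAE prefers low dropout and low L1.
best_loss, best_cfg = search(lambda c: c["dropout"] + c["l1_penalty"], SPACE, n_evals=50)
```

Swapping `search` for HYPEROPT's `fmin` with `tpe.suggest` yields the guided exploration used in the paper; the interface, a configuration-to-loss objective, stays the same.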

Ensembling
In many recent forecasting competitions, and particularly in the M4 competition, most of the top-performing models were ensembles (Atiya, 2020). It has been shown that in practice, combining a diverse group of models can be a powerful form of regularization to reduce the variance of predictions (Breiman, 1996; Nowotarski et al., 2014; Hubicka et al., 2019).
The techniques used by the forecasting community to induce diversity in the models are plentiful. The original NBEATS model obtained its diversity from three sources: training with different loss functions, varying the size of the input windows, and bagging models with different random initializations (Oreshkin et al., 2020). The authors used the median as the aggregation function over 180 different models. Interestingly, the original model did not rely on regularization such as L2 or dropout, as Oreshkin et al. (2020) found it beneficial for the individual models but detrimental to the ensemble.
In our case, we ensemble the NBEATSx model using two sources of diversity. The first is a data augmentation technique controlled by the sampling frequency of the windows used during training, as defined in the data parameters from Table 2. The second is whether we randomly select the early stopping set or instead use the last 42 weeks preceding the test set. Combining the data augmentation and early stopping options, we obtain four models that we ensemble using the arithmetic mean as the aggregation function. This technique is also used by the DNN benchmark (Lago et al., 2018a, 2021a).
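The aggregation step itself is simple: member forecasts are stacked and combined with the arithmetic mean (as for the four NBEATSx members) or the median (as in the original NBEATS ensemble). A minimal sketch, with illustrative toy members and horizon:

```python
import numpy as np

def ensemble_forecast(member_preds, how="mean"):
    """Combine member forecasts along the first axis.
    'mean' mirrors the NBEATSx ensemble; 'median' mirrors the
    aggregation used by the original 180-model NBEATS ensemble."""
    stacked = np.stack(member_preds)       # (n_members, horizon)
    if how == "mean":
        return stacked.mean(axis=0)
    if how == "median":
        return np.median(stacked, axis=0)
    raise ValueError(f"unknown aggregation: {how}")

# Toy day-ahead horizon of 24 hours and four hypothetical members
# (data augmentation on/off x random vs. last-42-weeks early stopping set).
rng = np.random.default_rng(0)
members = [30.0 + rng.normal(0, 1, size=24) for _ in range(4)]
combined = ensemble_forecast(members, how="mean")
```

Averaging the members reduces the variance contributed by any single configuration, which is the regularization effect the ensembling discussion above relies on.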

Forecasting Results
We conducted an empirical study involving two types of autoregressive models (AR1 and ARx1; Weron 2014), the Lasso Estimated Auto-Regressive model (LEARx; Uniejewski et al. 2016), a parsimonious Deep Neural Network (DNN; Lago et al. 2018a, 2021a), the original Neural Basis Expansion Analysis without exogenous covariates (NBEATS; Oreshkin et al. 2020), and the Exponential Smoothing Recurrent Neural Network (ESRNN; Smyl 2020). This experiment examines the effects of including the covariate inputs and compares NBEATSx with state-of-the-art methods for the day-ahead electricity price forecasting task. Table 3 summarizes the performance of the ensembled models, where the NBEATSx ensemble shows prevailing performance. It improves by 18.77% on average over all metrics and markets when compared with the original NBEATS, and by 20.6% when compared to ESRNN without time-dependent covariates. For the ensembled models, the NBEATSx RMSE improved on average by 4.68%, MAE by 2.53%, rMAE by 1.97%, and sMAPE by 1.25%. When comparing the NBEATSx ensemble against the DNN ensemble on individual markets, NBEATSx improved by 5.38% on the Nord Pool market, by 2.48% on the French market, and by 2.81% on the German market. The differences in NBEATSx performance on the PJM and EPEX-BE markets, of 0.24% and 1.1% respectively, were not significant. Figure 4 provides a graphical representation of the statistical significance from the Giacomini-White test (GW) for the six ensembled models, across the five markets, for the MAE evaluation metric. A similar significance analysis is conducted for the single models. The models included in the significance tests are the same as in Table 3: LEAR, DNN, ESRNN, NBEATS, and our proposed methods, NBEATSx-G and NBEATSx-I.
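For reference, the four accuracy measures reported above can be computed as follows. This is a generic sketch: it uses one common sMAPE convention, and the naive reference forecast required by rMAE is a hypothetical placeholder (in the EPF literature it is typically a seasonal naive forecast).

```python
import numpy as np

def mae(y, y_hat):
    """Mean absolute error."""
    return float(np.mean(np.abs(y - y_hat)))

def rmse(y, y_hat):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def smape(y, y_hat):
    """Symmetric MAPE, here as mean of 2|e| / (|y| + |y_hat|), in percent."""
    return float(100 * np.mean(2 * np.abs(y - y_hat) / (np.abs(y) + np.abs(y_hat))))

def rmae(y, y_hat, y_naive):
    """Relative MAE: the model's MAE divided by a naive reference's MAE."""
    return mae(y, y_hat) / mae(y, y_naive)

# Toy hourly prices (EUR/MWh) and forecasts; the naive column is illustrative.
y       = np.array([40.0, 42.0, 45.0, 50.0])
y_hat   = np.array([41.0, 41.0, 46.0, 48.0])
y_naive = np.array([38.0, 44.0, 43.0, 47.0])
```

An rMAE below 1 means the model beats the naive reference; this scale-free property is why rMAE is favored for comparisons across markets with different price levels.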
For each cell in the grids of Figure 4, the p-value indicates whether the improvement of the predictions of the model on the column over those of the model on the row is statistically significant. The NBEATSx-G model outperformed the DNN model in NP and EPEX-DE, while NBEATSx-I outperformed it in NP, EPEX-FR, and EPEX-DE. Moreover, no benchmark model significantly outperformed NBEATSx-I or NBEATSx-G in any market.
In Appendix A we observe similar results for the single best models, chosen from the four possible configurations of the ensemble components described in Section 4.3.5. Table A2 summarizes the accuracy of the predictions measured with the MAE, and Figure A.3 displays the significance of the GW test. Ensembling improves the accuracy of NBEATSx by 3% on average across all markets, when compared to the single best models. Finally, regarding computational time, NBEATSx maintains good performance. As shown in Table A1 in the Appendix, the time necessary to compute day-ahead predictions is in the order of milliseconds and comparable to that of the LEAR and DNN benchmarks. Additionally, the average time needed to perform a recalibration takes only about 50 percent longer than the relatively parsimonious DNN.

Table 3: Forecast accuracy measures for day-ahead electricity price predictions of ensembled models. The ESRNN and NBEATS do not include time-dependent covariates. The reported metrics are mean absolute error (MAE), relative mean absolute error (rMAE), symmetric mean absolute percentage error (sMAPE), and root mean squared error (RMSE). The smallest errors in each row are highlighted in bold. * The LEARx results for EPEX-DE differ from Lago et al. (2021a); the values presented there are revised (Lago et al., 2021b).

Figure 4: Results of the Giacomini-White test for the day-ahead predictions with mean absolute error (MAE), applied to pairs of the ensembled models on the five electricity market datasets. Each grid represents one market. Each cell in a grid is plotted black unless the predictions of the model corresponding to its column outperform the predictions of the model corresponding to its row. The color scale reflects the significance of the difference in MAE, with solid green representing the lowest p-values.

Conclusions
We have presented NBEATSx, a new method for univariate time series forecasting with exogenous variables that extends the well-performing neural basis expansion analysis. The resulting neural method has several valuable properties that make it suitable for a wide range of forecasting tasks. The network is fast to optimize, as it is mainly composed of fully connected layers. It can produce interpretable results and achieves state-of-the-art performance on forecasting tasks where the consideration of exogenous variables is fundamental.
We demonstrated the utility of the proposed method using a set of benchmark datasets from the electricity price forecasting domain, but it can be straightforwardly applied to forecasting problems in other domains. Qualitative evaluation shows that the interpretable configuration of NBEATSx can provide valuable insights to the analyst, as it explains the variation of the time series by separating it into trend, seasonality, and exogenous components, in a fashion analogous to classic time series decomposition. Regarding quantitative forecasting performance, we observed no significant differences between ESRNN and NBEATS without exogenous variables. At the same time, NBEATSx improves over NBEATS by nearly 20%, and by up to 5% over the LEAR and DNN models specialized for electricity price forecasting tasks. Finally, we found no significant trade-offs between the accuracy and interpretability of the NBEATSx-G and NBEATSx-I predictions.
The neural basis expansion analysis is a very flexible method capable of producing accurate and interpretable forecasts, yet there is still room for improvement. Possible directions include augmenting the harmonic functions with wavelets, or replacing the convolutional encoder that generates the covariate basis with smoothing alternatives such as splines. Additionally, one could extend the current non-interpretable configuration by regularizing its outputs with smoothness constraints.

As discussed in Section 3.4, the interpretable configuration of the NBEATSx method performs basis projections onto polynomial functions for the trends, harmonic functions for the seasonalities, and the exogenous variables. As shown in Figure A.1, both the forecast and backcast components of the model rely on similar basis functions; the only difference lies in the span of their time indexes. In the EPF application of this work, the backcast horizon corresponds to 168 hours while the forecast horizon corresponds to 24 hours.

To study the effects of exogenous variables on the NBEATS model, we performed diagnostics of the model training procedure. Figure A.2 shows the train and validation mean absolute error (MAE) for the NBEATS and NBEATSx models as training progresses. The curves correspond to the hyperparameter optimization phase described in Section 4.3.4. The models trained with and without exogenous variables display a considerable difference in their train and validation errors, as observed in the two separate clusters of trajectories. The exogenous variables, in this case the electricity load and production forecasts, significantly improve the neural basis expansion analysis.

We measured the computational time of the four best algorithms with two metrics: the recalibration of the ensemble models selected from the hyperparameter optimization, and the computation of the predictions.
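The interpretable basis projections described above can be illustrated with a minimal sketch of polynomial (trend) and harmonic (seasonality) bases over the 168-hour backcast and 24-hour forecast horizons. The exact normalization and the coefficient values below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def polynomial_basis(horizon, degree):
    """Trend basis: powers of the normalized time index t in [0, 1)."""
    t = np.arange(horizon) / horizon
    return np.stack([t ** p for p in range(degree + 1)])   # (degree+1, horizon)

def harmonic_basis(horizon, n_harmonics):
    """Seasonality basis: a constant row plus sine/cosine pairs of
    increasing frequency over the horizon."""
    t = np.arange(horizon) / horizon
    rows = [np.ones(horizon)]
    for k in range(1, n_harmonics + 1):
        rows += [np.cos(2 * np.pi * k * t), np.sin(2 * np.pi * k * t)]
    return np.stack(rows)                                  # (2*n_harmonics+1, horizon)

# EPF setting from the paper: backcast over 168 hours, forecast over 24.
trend_fc = polynomial_basis(24, degree=2)
seas_fc = harmonic_basis(24, n_harmonics=2)
trend_bc = polynomial_basis(168, degree=2)

# A block's expansion coefficients (hypothetical values) weight the basis
# rows to produce the interpretable trend component of the forecast.
theta_trend = np.array([30.0, 5.0, -2.0])
trend_component = theta_trend @ trend_fc                   # (24,)
```

Because backcast and forecast share the same basis families and differ only in the span of their time indexes, the learned coefficients decompose both windows into directly comparable trend and seasonality parts.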
For these experiments, we used a GeForce RTX 2080 GPU for the neural network models and an Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz for LEAR.

Appendix A.3. Computational Time
The training time of the recalibration phase of NBEATSx remains efficient: the model trains in 75 and 81 seconds, an increase of roughly 30 seconds over the relatively simple DNN. The computational time of the predictions remains within milliseconds. Finally, the hyperparameter optimization scales linearly with the recalibration time and the number of evaluation steps of the optimization; in the case of NBEATSx-G, a hyperparameter search of 1000 steps takes approximately two days.

Table A2 shows that the best NBEATSx models yield improvements of 14.8% on average across all the evaluation metrics when compared to their NBEATS counterpart without exogenous covariates, and improvements of 23.9% when compared to ESRNN without time-dependent covariates. A perhaps more remarkable result is the statistically significant improvement in forecast accuracy over the LEAR and DNN benchmarks, ranging from 0.75% to 7.2% across all metrics and markets, with the exception of EPEX-BE. Compared to DNN, the RMSE improved on average by 4.9%, the MAE by 3.2%, the rMAE by 3.0%, and the sMAPE by 1.7%. When comparing the best NBEATSx models against the best DNN on individual markets, NBEATSx improved by 3.18% on the Nord Pool market (NP), by 2.03% 2.65% on the French (EPEX-FR), and by 5.24% on the German (EPEX-DE) power markets. The positive difference in performance of 0.53% for the Belgian (EPEX-BE) market was not statistically significant. Figure A.3 provides a graphical representation of the GW test for the six best models, across the five markets, for the MAE evaluation metric. The models included in the significance tests are the same as in Table A2: LEAR, DNN, ESRNN, NBEATS, and our proposed methods, NBEATSx-G and NBEATSx-I. The p-value of each individual comparison shows whether the improvement in performance (measured by MAE or RMSE) of the x-axis model over the y-axis model is statistically significant.
Both the NBEATSx-G and NBEATSx-I models outperformed the LEAR and DNN models in all markets, with the exception of Belgium. Moreover, no benchmark model outperformed NBEATSx-I or NBEATSx-G in any market.

Table A2: Forecast accuracy measures for day-ahead electricity prices for the best single model out of the four models described in Subsection 4.3.5. The ESRNN and NBEATS are the original implementations and do not include time-dependent covariates. The reported metrics are mean absolute error (MAE), relative mean absolute error (rMAE), symmetric mean absolute percentage error (sMAPE), and root mean squared error (RMSE). The smallest errors in each row are highlighted in bold. * The LEARx results for EPEX-DE differ from Lago et al. (2021a); the values presented there are revised (Lago et al., 2021b).

Figure A.3: Results of the Giacomini-White test for the day-ahead predictions with mean absolute error (MAE), applied to pairs of the single models on the five electricity market datasets. Each grid represents one market. Each cell in a grid is plotted black unless the predictions of the model corresponding to its column outperform the predictions of the model corresponding to its row. The color scale reflects the significance of the difference in MAE, with solid green representing the lowest p-values.