A Temporal Neural Network Model for Probabilistic Multi-Period Forecasting of Distributed Energy Resources

Probabilistic forecasts of electrical loads and photovoltaic generation provide a family of methods able to incorporate uncertainty estimations in predictions. This paper aims to extend the literature on these methods by proposing a novel deep-learning model based on a mixture of convolutional neural networks, transformer models and dynamic Bayesian networks. Further, the paper also illustrates how to utilize Stochastic Variational Inference for training output distributions that allow time series sampling, a possibility not given for most state-of-the-art methods which do not use distributions. On top of this, the model also proposes an encoder-decoder topology that uses matrix transposes in order to both train on the sequential and the feature dimension. The performance of the work is illustrated on both load and generation time series obtained from a site representative of distributed energy resources in Norway and compared to state-of-the-art methods such as long-short-term memory. With a single-minute prediction resolution and a single-second computation time for an update with a batch size of 100 and a horizon of 24 hours, the model promises performance capable of real-time application. In summary, this paper provides a novel model that allows generating future scenarios for time series of distributed energy resources in real-time, which can be used to generate profiles for control problems under uncertainty.

NOMENCLATURE
sampled exogeneous variables
v: generic tensor notation
s: generic matrix sequence notation
z_1: sequential encoding
z_2: encoding of period t + 1
g: distribution parameters of period t + 1
Distributions
p: generic distribution notation
N: Gaussian distribution
B: Bernoulli distribution

I. INTRODUCTION
Increasing shares of renewable energy in the global mix of power generation lead to changes in the landscape of methods required to analyze and predict them. Whereas a classical, fossil-fuel-based power system behaves more statically and gives the power producer more control over output levels, more renewable generation means flexibility and thus more uncertainty. Such uncertainty is also amplified by another ongoing transition in power systems: increasing decentralization, which decreases forecasting accuracy. This comes as individual loads or generation profiles of, e.g., specific households or solar panels are harder to forecast than aggregates of several sources over larger areas [1], a result of higher variation on the individual level [2]. The result of these changes is a perceptible shift in the methods dealing with prediction of distributed energy resources, from traditionally deterministic methods to methods incorporating uncertainty [3]. Methods accurately describing uncertainty are therefore at the center of the operational problems, be it optimization of storage under solar generation or utilization of shiftable loads as flexible assets [4], [5]. However, distributed energy resources do not only add to the growing importance of uncertainty, they also affect the time frames under which the models operate. Centralized energy systems offer large ranges of flexibility in production (e.g. in the form of thermal or hydropower plants), whereas these margins are smaller for distributed resources, which also reduces the time horizons of the control problems [6]. An example is that of electrical storage: for large-scale storage, i.e. hydropower, the storage cycles range from days to months or years [7], whereas for distributed storage in the form of batteries the operational cycles typically lie within single days [8].
This means that where large-scale storage allows for 'over-night' calculations of the operational decisions as well as the associated predictions [9], distributed energy resources are more sensitive to computational times, as the models have to be applied in real-time [10].
Apart from a push towards incorporating uncertainty and real-time application in forecasting of distributed energy resources, another trend is the use of highly non-linear rather than linearized prediction models. Amplified by recent achievements in machine learning, deep-learning-based methodologies have prevailed amongst these non-linear methods [11], [12]. This trend can also be attributed to new specialized hardware that allows scalable and parallel 'training' (i.e. finding the optimal parameters) of such models via batches of data. Amongst those deep-learning models, recurrent neural networks, specifically long short-term memory neural networks, have established themselves as the recent standard in load prediction, both for short-term [13]-[15] and long-term [16] applications. In addition to deterministic point forecasts, these models have also been applied probabilistically [17], [18]. Another example is provided by [19], which shows how to train an autoregressive, non-linear quantile regressor based on long short-term memory in order to provide probabilistic load flows. [20] provides a quantile regression example from PV prediction that uses a Lasso regressor. However, the output of such quantile regression methods is, as suggested by the name, only represented via quantiles. Thus, albeit probabilistic, the outputs of quantile regression do not allow individual samples to be drawn, as they provide only ranges and not distributions to sample from [21]. This problem has been approached by [22], which presents a method to train deep learning models for load forecasting in the form of distributions. However, this model is in turn an implementation of an auto-regressive neural network and thus does not incorporate temporality (i.e. the effects of state transitions over the time periods) with similar quality to recurrent neural networks.
With this current state of the literature, readers are thus forced to choose between an accurate representation of the mean and the ability to take samples from the distribution. In recent work, [23] approaches this issue via Bayesian regression, which the work presented here extends by replacing the non-linear decision tree approximator with deep neural networks, similar to the generative model presented in [22] but extended to auto-regression. This is implemented via Stochastic Variational Inference [24], a method that can be utilized to train any parameterized distribution (in the case presented here, a Gaussian and Bernoulli mixture model) via back-propagation. This technique has previously been applied in the domain of power systems to fit distributions in probabilistic optimal power flows [25] and to make predictions in outlier events with data sets of small sample size [26].
To add to this contribution, the paper also analyzes the potential of convolutional neural networks as a replacement for recurrent neural networks in load forecasting. A similar model has been discussed in [27], however again with quantile losses and without considering temporality (the connection between periods in the input sequences), which recurrent neural networks do. In contrast to the traditional convolutional kernels used in that paper, dilated convolutional neural networks have proven their capabilities in capturing temporality, specifically in language prediction [28], and have similarly been applied in deterministic price forecasting within the power system [29].
Thus, the proposed model builds on these dilated convolutional kernels.
In recent literature on probabilistic load and generation forecasting tasks, convolutional neural networks have been demonstrated to outperform recurrent neural networks. As such, [30] shows single-period forecast results for residential sites that match the analysis of convolutional neural networks provided in the multi-period case study below. A similar analysis is provided in [31]. Another example is provided by [32], where the authors propose a combination of recurrent and convolutional neural networks for the task of photovoltaic forecasting. Similar performance has been shown for quantile regression problems utilizing convolutional neural networks in [33] and [34].
In addition to an in-depth analysis on the state of the art of convolutional neural networks, this paper also illustrates how to incorporate attention mechanisms in such deep learning models. These attention mechanisms have been previously applied within the mentioned recurrent neural networks, whereas [35], [36] provide examples of such on the topic of load forecasting.
To incorporate this mechanism into temporal convolutional neural networks, the model presented in this paper will utilize a novel neural network layer, here titled a 'temporal convolutional attention' layer, which uses a similar topology to [37] mixed with the model from [28]. A similar idea has been previously proposed (deterministically and without the specific application on load forecasting) in [35], but the here presented model simplifies this attention mechanism to be more akin to the self-attention mechanism used in recurrent neural networks, thus reducing the amount of linear layers required to a third.
On top of these contributions, the paper also deals with an issue from the practical side of load forecasting: missing data. It does so by proposing an encoder-decoder architecture that encodes exogeneous variables within every time period and uses matrix transposes in order to propagate over both the variable dimension and the sequence dimension at the same time.
In summary, the contributions of this paper are to:
1) give an introduction to Stochastic Variational Inference as a method to train probabilistic load forecasts from which individual samples can be drawn;
2) present a novel model that is a mixture of traditional convolutional neural networks, transformer models as presented in [37] and dynamic Bayesian networks. For the sake of simplicity (and to avoid confusion with electrical engineering terminology caused by phrases such as 'transformers'), the model is from here on referred to as an Attention-based Temporal Convolutional Neural Network;
3) propose an encoder-decoder structure which uses matrix transposes to filter over sequential and feature dimensions in order to incorporate exogeneous variables into this network and to deal more efficiently with missing input data.
In addition, the paper illustrates how to generate probabilistic load forecasts for multiple periods in advance and demonstrates this by predicting three heterogeneous time series obtained from a demonstration site representative of the Norwegian power grid.
With these contributions, the paper not only shows that Temporal Convolutional Neural Networks outperform other state-of-the-art techniques in probabilistic forecasting, it does so by providing a new state-of-the-art topology itself. Applications of the model are various; examples are load scheduling and coordination in distributed energy resources [4], [5], both competitive [18] and cooperative multi-agent problems [38], as well as stochastic energy management systems [39], [40]. Further, the model might also be combined with other models from the literature into ensemble models [41] and/or applied on a more granular level within individual buildings or on individual assets [42].

II. METHOD
Using sequence notation to mark sequential data and assuming inputs consisting of historical values of the data series denoted as vector $Y = [\dots, y_{t-1}, y_t]$ and exogeneous variables denoted as matrix $X = [\dots, x_{t-1}, x_t]$, the ARX (autoregressive model with exogeneous variables) predicting the next future value can be defined similar to [43]:

$$y_{t+1} = f(X, Y) \tag{1}$$

In comparison to this deterministic problem, the probabilistic equivalent aims to fit a parameterized distribution $p$ instead of a function $f$:

$$y_{t+1} \sim p_\theta(y_{t+1} \mid X, Y) \tag{2}$$

As such, a probabilistic model does not only make a single point prediction, but defines a range of outcomes. A comparison of the proposed probabilistic generative model to the deterministic and the probabilistic quantile regression models is provided in Fig. 1.
Therefore, the probabilistic regression problem is to first find an accurate model for approximating this distribution p and then apply a method that finds the best numerical fit for the parameters θ of the given model. These two problems will be approached successively, starting with presenting the proposed deep learning model first (forward pass) and then presenting the algorithm utilized to find the model parameters afterwards (backward pass). Contribution 3 is introduced in the forward pass section, contribution 1 in the backward pass section and contribution 2 is demonstrated later together with the results in the case study section.
As all available input data Y and its corresponding exogeneous variables X are here assumed to be too large for the model to consider as a whole, the model instead utilizes batches sampled from the given data sets. Sampling can be conducted by choosing a batch length B and a sample length τ and randomly sampling a number of time periods $t_1, \dots, t_b$, each of which marks one sampled sequence of length τ. It has to be noted that the dimension of y is increased by one in the sampled batch. The reason for this is to match the input dimension of this batch of vectors with the batch of matrices that is x (with the innermost vector $x_t$ representing the features). This is discussed in the overview of the tensor shapes in the Appendix.
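The batch sampling described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation; the function name `sample_batch`, the toy series and the choice to treat each sampled period as the last step of its window are assumptions made for the example.

```python
import numpy as np

def sample_batch(y, x, batch_size, tau, rng):
    """Draw `batch_size` random windows of length `tau` from the full series.

    y: array of shape (T,)    historical values
    x: array of shape (T, F)  exogeneous variables
    Returns y_batch (batch, tau, 1) and x_batch (batch, tau, F).
    """
    T = y.shape[0]
    # Sampled periods t_1, ..., t_b; assumed here to be the last step of each window.
    ends = rng.integers(low=tau - 1, high=T, size=batch_size)
    idx = ends[:, None] + np.arange(-tau + 1, 1)[None, :]   # (batch, tau)
    y_batch = y[idx][..., None]   # extra trailing axis so y has the same rank as x
    x_batch = x[idx]
    return y_batch, x_batch

rng = np.random.default_rng(0)
y = np.arange(1000, dtype=float)                 # toy series, one value per period
x = np.stack([np.sin(y), np.cos(y)], axis=1)     # two toy exogeneous features
y_b, x_b = sample_batch(y, x, batch_size=100, tau=50, rng=rng)
print(y_b.shape, x_b.shape)  # (100, 50, 1) (100, 50, 2)
```

The trailing axis added to `y_batch` is the dimension increase mentioned in the text: it lets the batch of vectors be concatenated with the batch of matrices along the feature dimension.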

1) FORWARD PASS
The schematic for the proposed deep learning model consists of three main parts and is presented in Fig. 2.
These parts fulfill the following functions:
Temporal Encoder: this network encodes the temporal information provided by the exogeneous variables (i.e. the feature dimension of x). This not only deals with missing data (see footnote 2) but also incorporates long-term periodicity provided by the exogeneous variables into the input to the temporal convolutional network.
Temporal Convolutional Attention Network: this network is used to identify the patterns within the sequences of the input z provided to it.
Decoder: this network generates the parameters utilized in the distribution to be sampled from.
As can be observed from this illustration, the temporal encoding network and the output network are feed-forward neural networks mostly consisting of fully connected linear layers wrapped in 'rectified linear unit' activation functions. The only deviations from this plain feed-forward structure are that, in the temporal encoder, the inputs x and y are concatenated (i.e. 'stacked') along the feature dimension, and that the last layer of the decoder consists of several parallel layers (one for each distribution parameter).
(Footnote 2) In traditional temporal convolutional neural networks, the missing data would have to be replaced with zeros. This, however, can lead to issues in case of too many missing values, which is the case in the provided case study. With this temporal encoder, missing values can instead be omitted, as the position of a given $y_t$ value in the input sequence y is encoded via the corresponding exogeneous variables $x_t$.
The concatenation operation, together with the neural network applied on the feature dimension, makes it possible to encode the information of the exogeneous variables (such as time stamps) into every single sequential step. This was done because traditional sequential networks (such as the temporal convolutional attention network utilized here, or recurrent neural networks) have difficulties coping with missing values in their inputs. In traditional auto-regressive networks utilizing only y as input, a longer sequence of missing values would distort the input. By encoding the historical values y together with the exogeneous variables x, the missing values can instead be omitted.
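To make the concatenation step concrete, the following sketch encodes one input sequence with a single dense layer and simply drops the periods whose measurement is missing. The single-layer encoder, the use of NaN as the missing-value marker and all names are assumptions for illustration; the layer count and sizes of the actual encoder are not reproduced here.

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def temporal_encoder(y, x, W, b):
    """One dense ReLU layer applied on the feature dimension.

    y: (seq, 1) measurements, NaN where the sensor delivered no value
    x: (seq, feature) exogeneous variables (e.g. time stamps)
    W: (1 + feature, channel), b: (channel,)
    """
    keep = ~np.isnan(y[:, 0])                         # omit periods with missing data
    v = np.concatenate([y[keep], x[keep]], axis=-1)   # stack along the feature dimension
    return relu(v @ W + b)                            # (kept_seq, channel)

rng = np.random.default_rng(0)
seq, feat, chan = 8, 3, 5
y = rng.normal(size=(seq, 1))
y[2:4] = np.nan                                       # two consecutive missing periods
x = rng.normal(size=(seq, feat))
W = rng.normal(size=(1 + feat, chan))
b = np.zeros(chan)
z1 = temporal_encoder(y, x, W, b)
print(z1.shape)  # (6, 5)
```

Because every remaining step still carries its own exogeneous variables, its position in time stays recoverable even though the sequence has become shorter.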
Compared to the temporal encoder, the temporal convolutional attention network operates in the sequential dimension. A more intuitive illustration is presented in Fig. 3 which describes the entire forward flow of the neural networks. This also illustrates why the encoder and decoder are formulated as dense layers. The reason is that the encoder resembles a traditional regression problem and thus does not consider sequentiality, whereas the decoder resembles a mapping to the parameters of the output distribution and thus also operates in the feature dimension instead of the sequence dimension.
The utilized temporal convolutional attention network is derived from an extension to the 'wavenet' model presented in [28]. A similar network utilizing attention has previously been implemented, as shown in [35]. However, that formulation utilizes three dense layers (referred to as 'key'/'query'/'value') using the topology proposed in [37]. Here, instead, a single self-attention layer is applied that is equivalent to the attention mechanism traditionally used in recurrent neural networks and originally presented in [44], thus reducing the number of dense layers required to a third of that in the formulation presented in [35].
The temporal convolutional attention network is formulated as a series of temporal convolutional attention (tca) layers wrapped in rectified linear units. Using the notation from [28], i.e. ⊙ to represent element-wise multiplication and * to represent convolution, a single tca layer is described by Eq. (6). This is also graphically displayed in Fig. 4. Here it has to be noted that both the dense layers and the convolutional kernels have multiple channels (thus using the notation of θ), which, as also illustrated previously in Fig. 3, correspond exactly to the number of channels derived by the temporal encoder.
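As a rough illustration of the ingredients named above, the sketch below combines a causal dilated convolution, the gated activation of [28] (tanh filter ⊙ sigmoid gate) and a single dense attention layer that weights the time steps. It is not a reproduction of Eq. (6): how the attention weights enter the layer, and all function and parameter names, are assumptions.

```python
import numpy as np

def causal_dilated_conv(z, kernel, dilation):
    """z: (seq, channel); kernel: (k, channel, channel); returns (seq, channel).
    Output step t only depends on inputs t, t - d, t - 2d, ... (left zero padding)."""
    k = kernel.shape[0]
    pad = (k - 1) * dilation
    zp = np.concatenate([np.zeros((pad, z.shape[1])), z], axis=0)
    out = np.zeros((z.shape[0], kernel.shape[2]))
    for j in range(k):
        start = j * dilation                  # tap j looks back (k - 1 - j) * dilation steps
        out += zp[start:start + z.shape[0]] @ kernel[j]
    return out

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a, axis):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def tca_layer(z, k_filter, k_gate, w_att, dilation):
    # Gated activation as in [28]: tanh(filter * z) ⊙ sigmoid(gate * z)
    gated = np.tanh(causal_dilated_conv(z, k_filter, dilation)) \
        * sigmoid(causal_dilated_conv(z, k_gate, dilation))
    # Single dense attention layer (assumed form): one weight per time step
    weights = softmax(gated @ w_att, axis=0)  # (seq, 1), sums to 1 over the sequence
    return weights * gated

rng = np.random.default_rng(0)
seq, chan = 16, 4
z = rng.normal(size=(seq, chan))
k_f = rng.normal(size=(2, chan, chan))
k_g = rng.normal(size=(2, chan, chan))
w_a = rng.normal(size=(chan, 1))
h = np.maximum(tca_layer(z, k_f, k_g, w_a, dilation=4), 0.0)  # layer wrapped in ReLU
print(h.shape)  # (16, 4)
```

Stacking such layers with growing dilation enlarges the receptive field without increasing the kernel size, which is the property that makes dilated kernels attractive for long input sequences.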
Using T to denote the matrix transpose, the forward flow of the proposed model can thus be formulated as shown in Algorithm 1.
For simplification and similar to Fig. 3, the batch dimensions have been omitted here. For the same reason, the weights θ have not been numbered. Nonetheless, every instance of θ represents an individual set of weights (or biases).
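The role of the transposes can be followed in a shape-only walk-through. The sketch below uses random stand-in weights and a plain weighted sum over the sequence in place of the temporal convolutional attention network, so it only illustrates how the tensor shapes listed in the Appendix can be reached; it is not Algorithm 1, and every layer in it is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
B, S, F, C, P = 4, 10, 3, 5, 3    # batch, sequence, feature, channel, distribution parameters

y = rng.normal(size=(B, S, 1))
x = rng.normal(size=(B, S, F))

# Encoder: dense layer over the feature dimension
v = np.concatenate([y, x], axis=-1)                      # (B, S, 1 + F)
z1 = np.maximum(v @ rng.normal(size=(1 + F, C)), 0.0)    # (B, S, C)

# Transpose so that the next operation acts on the sequence dimension
z1_t = np.swapaxes(z1, 1, 2)                             # (B, C, S)
z2_t = z1_t @ rng.normal(size=(S, 1))                    # (B, C, 1) stand-in for the sequence network
z2 = np.swapaxes(z2_t, 1, 2)                             # (B, 1, C)

# Decoder: dense layer over the channel dimension, one output per distribution parameter
g_t = z2 @ rng.normal(size=(C, P))                       # (B, 1, P)
g = np.swapaxes(g_t, 1, 2)                               # (B, P, 1)

print(z1.shape, z2.shape, g.shape)  # (4, 10, 5) (4, 1, 5) (4, 3, 1)
```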
Further, the distribution p(y_{t+1} | g) has to be chosen depending on the application, as in other probabilistic models. Usually, due to the central limit theorem, the most suitable distribution for an application with no further information on the real distribution can be assumed to be a Gaussian. This is also the case for the application of electrical load and generation forecasting proposed here. However, as generation from solar panels was also analyzed, a Bernoulli distribution was added to accurately model the day and night cycles. This additional information is included by formulating the chosen distribution as given in Algorithm 2 (p(y_{t+1} | g) used in the case study).

Assuming there was no output distribution but instead deterministic outputs as in Eq. (1), these models could be trained by computing the mean squared error (or a similar loss measure), backpropagating (i.e. finding the gradients) and applying an optimizer (i.e. a method such as stochastic gradient descent that updates the weights and biases). However, as the outputs are samples from a distribution, a bounding function, namely the Evidence Lower BOund (short: ELBO), is applied here instead.
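One way to picture the combination of a Gaussian with a Bernoulli distribution for day and night cycles is sketched below. The exact composition used in Algorithm 2 is not reproduced here; the multiplicative form (a Bernoulli switch that sets the output to zero) and the parameter names `mu`, `sigma` and `p_on` are assumptions.

```python
import numpy as np

def sample_output(mu, sigma, p_on, n_samples, rng):
    """Draw samples of y_{t+1} from an assumed Bernoulli-gated Gaussian.

    mu, sigma: Gaussian parameters; p_on: Bernoulli probability that the
    source is active (e.g. daylight for photovoltaic generation).
    """
    on = rng.random(n_samples) < p_on             # Bernoulli draw
    level = rng.normal(mu, sigma, n_samples)      # Gaussian draw
    return np.where(on, level, 0.0)               # inactive periods yield exactly zero

rng = np.random.default_rng(0)
night = sample_output(mu=5.0, sigma=0.1, p_on=0.0, n_samples=1000, rng=rng)
day = sample_output(mu=5.0, sigma=0.1, p_on=1.0, n_samples=1000, rng=rng)
print(night.max(), round(day.mean(), 1))
```

A plain Gaussian would always place some probability mass on non-zero generation at night; the Bernoulli part lets the decoder switch the output off entirely.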

2) BACKWARD PASS
Assuming the measurement is the distance between the parameterized distribution $p_\theta$ from Eq. (2) and the real distribution $p$, the ELBO can be derived from the Kullback-Leibler divergence (a distance measure for two distributions) as shown in [45]. Note that, for simplification, this derivation considers a single input series v = concatenate(y, x). Due to log p(v) being a constant, an ELBO of 0 thus means a perfect fit for the Kullback-Leibler divergence and thus a perfect fit of the output of the neural network, based on the historical values y, x, to the distribution $p_\theta$ of the future data $y_{t+1}$. [46] shows that for a model with a similar encoder-decoder architecture, training can be accelerated by fixing the errors prior to obtaining the gradients of this ELBO function. Doing so, the process of Stochastic Variational Inference [24] for the given deep-learning model becomes a step-wise updating process. The optimization algorithm utilized for the weight update steps in the following case study was a batch optimization algorithm that can be found in [47].
For training this model, the process of updating the parameters is repeated for a given number of episodes (or until a satisfying ELBO is reached). After this, and similar to the model presented in [28], the network can be used over a rolling horizon to predict multiple periods into the future. This, as previously shown in Fig. 3, is realized via applying Algorithm 4, which repeats two steps for every period of the horizon: sample $y_{t+1} \sim p_\theta(y_{t+1} \mid x, y)$; set $t := t + 1$.
The output of this process is a batch of sequences which represent samples of future expectations. In the following section, this model is demonstrated by a case study of three heterogeneous time series and compared to other state-of-the-art models.
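The rolling-horizon sampling can be sketched as a simple loop in which every drawn value is appended to the input window before the next period is sampled. The function `toy_sampler` is a hypothetical stand-in for the trained network, and the exogeneous variables of the future periods are left out to keep the example short.

```python
import numpy as np

def rolling_forecast(y_hist, horizon, sample_next, rng):
    """Generate one scenario by repeatedly sampling y_{t+1} and moving t forward."""
    window = list(y_hist)
    tau = len(window)
    path = []
    for _ in range(horizon):
        y_next = sample_next(np.asarray(window[-tau:]), rng)   # sample: y_{t+1} ~ p(. | window)
        path.append(y_next)
        window.append(y_next)                                  # set: t := t + 1
    return np.asarray(path)

def toy_sampler(window, rng):
    # Hypothetical stand-in for the trained model: Gaussian around the window mean
    return rng.normal(loc=window.mean(), scale=0.1)

rng = np.random.default_rng(0)
history = np.sin(np.arange(100) / 10.0)
scenarios = np.stack([rolling_forecast(history, 24, toy_sampler, rng) for _ in range(5)])
print(scenarios.shape)  # (5, 24)
```

Each row of `scenarios` is one sampled future; since every step depends on the previous draw, the loop over the horizon cannot be parallelized, only the loop over scenarios can.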

III. CASE STUDY
The endogeneous data Y for the given case study was obtained from a test location designed to represent timely issues in the Norwegian power grid. At the core of the application stands the modeling of battery energy storage systems and load shifts in HVAC systems under consideration of three distinctly heterogeneous time series: two consumption time series, a residential load (C1) and an office building connected to a football stadium (C2), as well as a time series of photovoltaic generation with a capacity of 800 kW (PV). Even though commercially run, the test site is designed to accurately represent greater challenges in the Norwegian power grid, such as strong reactions to temperature changes, highly fluctuating weather conditions, seasonal as well as other patterns, and stochastic events of increased consumption.
The available resolution of 1 minute from the sensors was kept. The sensors were, however, not operational continuously during the measurement period, leading to the previously discussed missing values with frequent outage sequences of up to 60 minutes. Fig. 5 gives a visual overview of two months of the data series (in its entirety 305 days long). The electrical demand of series C1 shows day and night consumption patterns with occasional low-amplitude outliers. Series C2 shows, in addition to these cycles from C1, perceptible weekday/weekend patterns with high-amplitude outliers due to operation of the stadium (mainly caused by floodlights). Series PV shows periodicity due to day/night differences and cloud patterns during the days. Consistent with the requirements expressed by the test site owners, the prediction horizon was selected to be 1440 minutes, with the model being updated by the sensors in real-time, i.e. in 1-minute intervals.
The exogeneous variables X for the given case study were derived solely from these presented time series, with no additional information such as weather or temperature supplied, a result of the lower resolution of the available weather data compared to the 1-minute resolution of the electrical loads. Specifically, the additional exogeneous variables were a linear trend line, the day of the week, a binary series separating weekdays from weekends, the quarter of the year, the hour, the minute, the day of the year and the month. In addition, Fourier series with coefficient a were added as exogeneous variables representing periodicity. The coefficients chosen for these series were 5, 10, 15, 30, 60, 120 and 720 minutes (a = 5, 10, 15, 30, 60, 120, 720), a day (a = 1440) and a week (a = 10080).
In addition to this, both X and Y were standardized before being fed to the model, and batch normalization layers were added (behind all the linear layers, except the output layers) in order to support convergence of the model.
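Periodic exogeneous variables of this kind can be generated as sketched below. The sine/cosine pair with period a is a common choice and an assumption here, since the exact Fourier terms used in the case study are not reproduced; only the list of periods is taken from the text.

```python
import numpy as np

# Periods in minutes as listed in the text: 5 minutes up to a day (1440) and a week (10080)
PERIODS = [5, 10, 15, 30, 60, 120, 720, 1440, 10080]

def fourier_features(t, periods=PERIODS):
    """t: 1-D array of minute indices. Returns an array of shape (len(t), 2 * len(periods))."""
    t = np.asarray(t, dtype=float)[:, None]
    a = np.asarray(periods, dtype=float)[None, :]
    angle = 2.0 * np.pi * t / a
    return np.concatenate([np.sin(angle), np.cos(angle)], axis=1)   # assumed sin/cos pair per period

feats = fourier_features(np.arange(2 * 10080))   # two weeks at 1-minute resolution
print(feats.shape)  # (20160, 18)
```

Since every listed period divides 10080, the whole feature vector repeats exactly after one week, which gives the encoder an unambiguous position within the weekly cycle.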
For the case study, each input data set was split into three equally sized chunks indicated by *_1, *_2, *_3, with the last day (=1440 periods) of each data set selected as the prediction target.
For the sake of comparison, the encoder and decoder sections of the original model were kept intact for all of these models except the ARIMA model, and Stochastic Variational Inference was utilized to train all models. The encoder layer size (channels) was kept at 50, the linear decoder layer size (nodes) at 500. As the model is probabilistic, dropout layers were not considered to be required. Further, the given parameters were arbitrarily selected, under consideration of the number of features and available GPU capacity.
Each model was trained on an Nvidia Quadro P2000 for 1500 episodes with a batch size of 100, with a training time of around 1 s per episode for each model. An input time series length of five days was utilized (in order to capture weekend/weekday changes in the time series), but these input sequences were down-sampled to 100 periods each. A channel size of 50 was chosen (and likewise for the hidden size of the long short-term memory models), and all linear layers were set to a size of 500 nodes. The AR and MA components of the ARIMA model were selected to be 60 periods each.
The root mean squared errors averaged over the 1440 test periods and all taken samples are given in Fig. 6. As can be observed, the proposed model performs either the best or amongst the best on all given data sets. The same can be observed when comparing the correlation coefficients between the samples and the real data points in Fig. 7. It has to be noted that these values are not calculated based on the mean of the samples but are instead calculated individually for each sample and then averaged. Further, the coefficient of determination as shown in Fig. 8 also supports the performance of the proposed model. The capabilities of the model are further outlined in the visual results provided by Fig. 9. As discussed before, this technique, as a generative model, allows drawing single samples; an example is presented in Fig. 10. These samples come in the form of time series and represent the distribution if drawn infinitely often. Further note that the model appears to capture variance as well as trends, both non-linear and periodical, even with a relatively small training set of 3.5 months. The main issues the model encounters in the test set are the following:
C2_1: correctly anticipates the 'step', but does so too early.
C2_3: does not anticipate the large outlier (this outlier is also presented in the last day shown in Fig. 5, which shows that it is in fact a rare occurrence).
PV_1: overestimates the amplitude after 8:00 am.
PV_3: underestimates the amplitude of the peak around noon.
This synopsis is also supported by the quantiles shown in Fig. 11, which summarizes the number of real outcomes found within the given quantiles and shows C2_1 to be the best performer of the outliers.
Nonetheless, and as shown in Figs. 6 and 7, the TCN_Attention model in general mostly outperforms the tested alternatives, even in the discussed outlier situations. This is also supported by the training set losses shown in Fig. 12, which indicate generally more robustness to outliers in training for the TCN models.
Thus, in summary, it can be stated that even though the model fails to capture unforeseen, rare events, it still manages to represent the underlying distribution of the time series more accurately than the current state-of-the-art models. In addition, the results also indicate that even though the proposed attention mechanism improves the performance of the model significantly (and its use is thus advised), the temporal convolutional network still manages to compete with current state-of-the-art algorithms without this adjustment.
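The distinction between scoring every sample individually and scoring the mean of the samples can be made explicit with a small example. The function name and the two artificial samples are assumptions chosen so that the difference is easy to see.

```python
import numpy as np

def per_sample_scores(samples, actual):
    """samples: (n_samples, horizon); actual: (horizon,).
    Each sample is scored on its own; the scores are averaged afterwards."""
    rmse = np.sqrt(np.mean((samples - actual) ** 2, axis=1))            # one RMSE per sample
    corr = np.array([np.corrcoef(s, actual)[0, 1] for s in samples])    # one correlation per sample
    return rmse.mean(), corr.mean()

actual = np.sin(np.linspace(0.0, 2.0 * np.pi, 1440))
samples = np.stack([actual + 1.0, actual - 1.0])     # two samples, offset in opposite directions

avg_rmse, avg_corr = per_sample_scores(samples, actual)
rmse_of_mean = np.sqrt(np.mean((samples.mean(axis=0) - actual) ** 2))
print(avg_rmse, rmse_of_mean)
```

In this constructed case the mean of the two samples coincides with the real series, so scoring the mean would report almost no error, whereas the per-sample average reports the offset of each scenario. The per-sample variant therefore penalizes spread in the generated scenarios.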

IV. CONCLUSION
This paper proposes a novel multi-period probabilistic load and generation forecasting model for distributed energy resources based on convolutional neural networks and a transformer-like stacked self-attention mechanism. Further, it introduces Stochastic Variational Inference as a method to train probabilistic forecasting models with any selected output distribution. As a generative method, it allows for taking samples of the output, a possibility not provided by other models based on mechanisms such as quantile regression. In addition, the model also proposes an encoder-decoder structure in order to 'fill' the gaps of missing sensor data in the input.
The proposed model is then trained on chunks of data sets obtained from a site representative of the Norwegian power system: two consumer and one producer load series. The case study not only demonstrates the better performance of the proposed models compared to current state-of-the-art models but also highlights that the temporal convolutional neural network performs on par with the state of the art even without the proposed attention mechanism.
In summary, this paper not only introduces two principles that will aid future probabilistic load prediction (the encoder-decoder structure and Stochastic Variational Inference for back-propagation), it also discusses the application of a novel neural network model to such tasks.
For future work, a focus on more efficient generation of output sequences is proposed, as well as larger case studies including weather and other exogeneous data. As the outputs are generated auto-regressively, generating larger batches of samples can become time-inefficient for long prediction sequences.
In summary, it can be stated that the proposed probabilistic model provides an efficient method to generate samples of sequences for distributed energy resources in real-time, whose performance against several error measures is demonstrated in the context of representative datasets from the Norwegian power system.

APPENDIX TENSOR SHAPES
Here, the dimensions of the tensors used in the model are listed. Please note that although x is referred to as a matrix and y is referred to as a vector, both of these variables are represented via tensors (due to the batch dimension):
y: batch, sequence, 1
x: batch, sequence, feature
z_1: batch, sequence, channel
z_2: batch, 1, channel
g: batch, distribution parameter, 1

ACKNOWLEDGMENT
The author would like to thank Johannes Philippus Maree, Venkatachalam Lakshmanan, and Iver Bakken Sperstad for the inspiring discussions on the presented method. Lede is gratefully acknowledged for the sharing of load data.