Data augmentation for time series regression

A model’s expected generalisation error is inversely proportional to its training set size. This relationship can pose a problem when modelling multivariate time series, because structural breaks, low sampling rates, and high data gathering costs can severely restrict training set sizes, increasing a model’s expected generalisation error by spurring regression model overfitting. Artificially expanding the training set size, using data augmentation methods, can, however, counteract the restrictions imposed by small sample sizes: increasing a model’s robustness to overfitting and boosting out-of-sample prediction accuracies. While existing time series augmentation methods have predominantly utilised feature space transformations to artificially expand training set sizes and boost prediction accuracies, we propose using autoencoders (AEs), variational autoencoders (VAEs) and Wasserstein generative adversarial networks with a gradient penalty (WGAN-GPs) for time series augmentation. To evaluate our proposed augmentors, as a case study we forecast Belgian and Dutch day-ahead electricity market prices using both autoregressive models and artificial neural networks. Overall, our results demonstrate that AEs, VAEs, and WGAN-GPs can significantly boost regression accuracies; on average decreasing benchmark model mean absolute errors by 2.23%, 2.73% and 2.97% respectively. Moreover, our results demonstrate that combining AE, VAE, and WGAN-GP generated time series can further boost regression accuracies; on average decreasing benchmark errors by 3.44%. As our proposed augmentors outperform existing augmentation methods, we strongly believe that both practitioners and researchers aiming to generate time series or reduce time series regression errors will find utility in our study.


Introduction
Sample sizes play a critical part in determining the optimal complexity of hypothesis sets and, by extension, the minimum attainable generalisation error. In multivariate time series analysis, the progression of time can often shift the target distribution. This shift increases the proportion of noise in historic data, making its inclusion in the learning process potentially detrimental. When data is scarce, it is crucial to explore techniques capable of augmenting the sample. While numerous studies have demonstrated that augmentation can improve classification and regression accuracies, to date none have explored augmenting multivariate time series (i.e. time series with more than a single time-dependent variable) with exogenous variables, despite the fact that such series are commonly used by practitioners in regression analysis. Our research aims to fill this gap by identifying methods capable of successfully augmenting multivariate time series for regression analysis.
Vapnik-Chervonenkis (VC) theory shows that the availability of training data determines the minimum attainable expected generalisation error. The VC theory expresses a generalised upper bound, termed the VC inequality, on the generalisation error. Similarly to the bias-variance trade-off, the VC inequality can be used to prove directly that a model's VC-dimension (i.e. a model's complexity, captured by its number of effective parameters) is proportional to the number of training examples needed to attain a certain level of expected modelling performance, i.e. a certain generalisation bound [1]. Together, the bias-variance decomposition and the VC inequality motivate studies which seek to reduce the expected generalisation error by expanding training set sizes.
Grounding theory in practice, studies such as [2] for image classification demonstrate that the relationship between model performance and data quantity is logarithmic. While, to the best of our knowledge, this relationship has yet to be extensively studied for regression models, back-of-the-envelope calculations show that for every new training example (x, y) the expected generalisation error of, for instance, a linear regression modelling a linear target function decreases by approximately σ²(d + 1)∕(N² + N), where σ² is the best approximation error, d is the number of features, and N is the number of samples [1]. The above examples further underscore the importance of dataset sizes in minimising the expected generalisation loss, as well as the potential performance improvements that an expansion of the training set can yield.
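The shrinking-error relationship above can be illustrated with a small simulation. The sketch below is ours, not the paper's: it fits ordinary least squares to a synthetic linear target (the dimension d, noise level σ, and sample sizes are arbitrary assumptions) and averages the out-of-sample MSE over repeated draws, showing the error falling towards σ² as N grows.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma = 5, 0.5                      # number of features, noise level (assumed)
w_true = rng.normal(size=d)            # a linear target function

def expected_test_mse(n_train, n_test=2000, n_trials=200):
    """Average out-of-sample MSE of OLS fits over repeated draws of size n_train."""
    errs = []
    for _ in range(n_trials):
        X = rng.normal(size=(n_train, d))
        y = X @ w_true + sigma * rng.normal(size=n_train)
        w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
        Xt = rng.normal(size=(n_test, d))
        yt = Xt @ w_true + sigma * rng.normal(size=n_test)
        errs.append(np.mean((Xt @ w_hat - yt) ** 2))
    return float(np.mean(errs))

# Expected generalisation error shrinks monotonically with the sample size N.
curve = {n: expected_test_mse(n) for n in (10, 40, 160)}
```

Each additional quadrupling of N buys a smaller absolute error reduction, consistent with the diminishing per-example gain described above.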

Data augmentation
Linking modelling performance to data augmentation, in instances where it is impractical or even impossible to expand real training dataset sizes, due to costs, data scarcity, or time series structural breaks, it is worthwhile to consider whether augmented data could be used in place of additional real training data to boost modelling performance. While, to the best of our knowledge, a generalised proof establishing a direct relationship between the expected generalisation error and number of augmented data points has yet to be theoretically formulated, numerous studies, such as [3][4][5][6][7][8][9][10], empirically demonstrate that using augmented data can boost both classification and regression accuracies comparable to the addition of real data. Below, we introduce the augmentation methods explored in these studies.
Commencing with augmentation methods that have been utilised to boost classification accuracies, feature space augmentors, exploiting simple transformations such as symmetry, position, or style, have been observed to successfully generate data for both image and time series classification problems [3,4]. For instance, [4] observed a 5.1% accuracy increase in Parkinson's disease motor state classification, with a convolutional neural network, after augmenting wearable sensory data using random rotations. Despite this success, because of the temporal relationships in time series, it must be stressed that applying feature space augmentations to time series can be risky. This is because, for augmentation methods to be useful, they have to generate meaningful data originating from the underlying target distribution. Whilst we can visually confirm that images generated using feature space transformations remain meaningful, we cannot readily conclude this for time series.
Beyond feature space transformations, researchers have successfully applied augmentors utilising generative autoencoders to classification problems [5,6]. Autoencoders [11] are artificial neural networks that learn to encode, and subsequently decode, model inputs by performing unsupervised representation learning. Because encoding transforms input features from the feature space to a latent space, autoencoders can generate data by applying perturbations in the latent space, or by performing Bayesian inference and sampling from a latent distribution. These methods, to the best of our knowledge, have never been used to augment multivariate time series; however, both have a potential theoretical advantage over feature space transformations that makes them worth examining.
Generative adversarial networks (GANs) [12] have also been utilised to boost classification accuracies [7,13]. GANs, comprising a generator G and a discriminator D, differ from generative autoencoders in their training approach. While the latter are trained by minimising the divergence between model inputs and outputs directly, GANs use D to train G indirectly. The process is comparable to a game between a forger and a detective, and allows G to learn to generate data from the distribution of the real data. Although, to the best of our knowledge, GANs have never been used to augment multivariate time series, studies such as [7,8] demonstrate their ability to generate meaningful data and boost forecasting performances. Tran et al. [8] observed deep model classification accuracy improvements of 6%, across the MNIST, CIFAR-10 and CIFAR-100 datasets, surpassing compound feature space transformation performances.
Finally describing augmentation methods for time series regression, to the best of our knowledge two studies, [9,10], have thus far attempted to boost regression accuracies using augmentation methods. In particular, [9] proposed generating univariate time series by replacing the error of deseasonalised and detrended time series with bootstrapped errors, while [10] proposed generating the same series by sampling Markov Chain Monte Carlo parameters and forecasting time series paths using those parameters. Across the M3-competition dataset, [9] demonstrated that the bootstrapping augmentation method, combined with bagging, significantly improves the prediction accuracies of exponential smoothing models. Across the same dataset, [10] observed a 5.7% reduction in symmetrical mean absolute percentage errors of long short-term memory recurrent neural networks. Whether these augmentation methods could be adapted to readily and successfully generate multivariate time series with exogenous inputs is uncertain.
Elaborating, hypothetically for example, to generate multivariate time series using [9], one would have to augment multiple univariate time series; replacing the errors of multiple deseasonalised and detrended time series with correlated bootstrapped errors. To achieve this without introducing an excessive amount of noise, cross correlations between the time series would probably have to be modelled. Moreover, the time series would have to be of the same length, with the same sampling periodicity. As the above, in and of itself, constitutes a new data augmentation method that significantly extends the use case presented in [9], we consider any attempts to implement it beyond the scope of our examination.

Electricity price forecasting
As a case study, we choose to forecast both Belgian and Dutch day-ahead electricity market (DAM) prices. We forecast DAM prices because in recent years regulatory changes and an increased emphasis on renewable energy production have increased the generalisation error of DAM forecasts [14]; establishing new price drivers, introducing structural breaks in the time series, and spurring the need for data augmentation.
In [15,18], the forecast accuracies of deep, ensemble, and statistical models were evaluated. Machine learning models were found to outperform statistical models in both studies. In [18], NNs were found to outperform long short-term memory networks (LSTM) and gated recurrent units (GRUs) in forecasting Belgian DAM prices. Other studies, for instance [14,21], have also compared the performances of deep and shallow machine learning models. In [14], deep models were found to outperform shallow models in forecasting Belgian DAM prices.
In [21], GRUs were found to outperform all other neural network structures such as NNs and LSTM and statistical methods such as seasonal autoregressive integrated moving average (SARIMA) and Markov models in forecasting Turkish DAM prices. Contrasting performances of artificial networks, such as NNs, GRUs and LSTM, across [18,21] can be explained by model complexities, data characteristics and data sizes.
In [24][25][26], hybrid models were shown to outperform individual benchmark models. In [24], the hybrid model was harmonised using a wavelet transformation, an autoregressive moving average model, a kernel extreme learning machine (KELM), and self-adaptive particle swarm optimisation (SAPSO). In [25], the hybrid model was constructed with a variational mode decomposition, SAPSO, SARIMA and deep belief networks. Model accuracies were assessed across three different DAMs in [25]. Exogenous variables, however, were not included in this study. In [26], the hybrid model was structured using a local forecasting paradigm, a general regression neural network, coordinate delay and a harmony search algorithm. Unlike in [25], exogenous variables were included in this study. Model accuracies, however, were assessed only on one DAM in [26]. In [23], applications of spike forecasting were further explored. Using the Borderline-SMOTE method to balance the number of samples in different target classes, [23] increased the number of spike samples in training data; yielding DAM forecasting accuracy improvements. For a more detailed review of the DAM forecasting literature and best practices, we recommend interested readers consult [15,16].
To the best of our knowledge, we are the first to explore using data augmentation with generative models as a method for boosting DAM forecasting accuracies. In our study, we include exogenous variables and neighbouring market prices, and evaluate our augmentation method on two different DAMs using deep, ensemble and statistical models.

Motivation and contributions
The literature review above exposes how researchers have thus far concentrated more on developing augmentation methods for classification than regression. Particularly, augmentation methods for multivariate time series with exogenous variables have to date not been researched. Motivated by a desire to fill this void, our paper concentrates on developing universally applicable augmentation methods for both univariate and multivariate time series regression analysis. Because of their prevailing use in classification studies as well as their ease of implementation, we evaluate three feature space augmentors: jittering, scaling, and magnitude-warping. Similarly, because of their significant achievements with classifiers, their capacity to model the underlying distribution of data, and their ability to generate data without any domain knowledge, we choose to develop and evaluate tailored model-based augmentors: autoencoders (AEs), variational autoencoders (VAEs), and Wasserstein GANs with a gradient penalty (WGAN-GP).
To fully explore the effectiveness of the above-mentioned augmentation methods, we examine their performance impact on autoregressive models with exogenous inputs (ARX) and two artificial neural networks (ANN). ARXs are chosen both to demonstrate the methods' impacts on linear model performance, and because they have been shown to achieve reasonable DAM benchmark forecast accuracies. ANNs are chosen because they have been shown to outperform other state-of-the-art statistical and machine learning models in DAM forecasting [18].
Summarising the principal contributions of this paper, to the best of our knowledge, we are the first to: (1) explore the augmentation of multivariate time series with exogenous variables, (2) utilise feature space transformations, AEs, VAEs and WGAN-GPs for regression augmentation, and (3) apply augmentation methods to the forecasting of DAM prices. Outlining the structure of this paper, in Section 2 the DAM, as well as the theories and models underpinning our augmentation methods, are described. In Section 3 our case study's methodology is outlined. In Section 4 the results are analysed. Finally, in Section 5 our study is concluded and avenues of potential future research are discussed.

Day-ahead electricity markets
Energy producers and suppliers must together meet the energy needs of everyday consumers and businesses. In doing so, they must forecast hourly or quarter-hourly energy demand and plan a path for satisfying consumers' energy needs. They do this while jointly trying to maximise profits and minimise price and volume risks. Numerous opportunities for trading electricity and meeting the demand for energy exist: from long-term futures with maturities ranging from weeks to years, to short-term contracts with maturities ranging from one hour to one day.
The day-ahead electricity market (DAM) offers 24 hourly energy contracts for next day delivery. Uniquely, DAM clearing prices (in €/MWh) are computed using a matching mechanism that aggregates all asks and bids submitted before noon. For the DAM's 24 hourly contracts, the matching engine sets prices where demand and supply intersect. Trade settlement occurs once prices are set, with hourly energy delivery commencing at 00:00 of the following day.

Autoregressive models with exogenous inputs
ARXs are linear time series models capable of capturing both the autoregressive and exogenous relationships driving stochastic processes. Formally, a single output ARX(n, m) model is defined according to Eq. (1):

y_t = Σ_{i=1}^{n} a_i y_{t−i} + Σ_{j=1}^{m} b_j x_{j,t} + ε_t,   (1)

where y_{t−i} is the model output at time t − i, x_{j,t} is the jth exogenous input, a_i and b_j are the ith autoregressive and jth exogenous parameters respectively, and ε_t is the residual noise. To train an ARX, the least absolute shrinkage and selection operator (LASSO) is employed; optimising a and b according to Eq. (2):

(â, b̂) = argmin_{a,b} Σ_t (y_t − Σ_{i=1}^{n} a_i y_{t−i} − Σ_{j=1}^{m} b_j x_{j,t})² + λ (Σ_{i=1}^{n} |a_i| + Σ_{j=1}^{m} |b_j|),   (2)

where λ is a regularisation parameter.
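As a minimal sketch of building and fitting an ARX, the snippet below stacks lagged targets and lagged exogenous inputs into a design matrix and fits it to a synthetic ARX(2,1) process. For brevity it uses ordinary least squares via np.linalg.lstsq rather than the LASSO of Eq. (2) (sklearn.linear_model.Lasso would be the usual drop-in); the process coefficients and helper name arx_design are our own illustrative assumptions.

```python
import numpy as np

def arx_design(y, X, n_lags, m_lags):
    """Stack autoregressive lags of y and lags of exogenous inputs X
    into a design matrix; every row is aligned at one time step t."""
    T = len(y)
    start = max(n_lags, m_lags)
    cols = [y[start - i: T - i] for i in range(1, n_lags + 1)]        # y_{t-1..t-n}
    for j in range(X.shape[1]):
        cols += [X[start - i: T - i, j] for i in range(1, m_lags + 1)]  # x_{j,t-1..t-m}
    return np.column_stack(cols), y[start:]

# Synthetic ARX(2,1) process: y_t = 0.6 y_{t-1} - 0.2 y_{t-2} + 0.8 x_{t-1} + eps_t
rng = np.random.default_rng(1)
T = 500
x = rng.normal(size=(T, 1))
y = np.zeros(T)
for t in range(2, T):
    y[t] = 0.6 * y[t - 1] - 0.2 * y[t - 2] + 0.8 * x[t - 1, 0] + 0.1 * rng.normal()

A, b = arx_design(y, x, n_lags=2, m_lags=1)
coef, *_ = np.linalg.lstsq(A, b, rcond=None)   # OLS stand-in for the LASSO fit of Eq. (2)
```

With enough samples the recovered coefficients approach the generating values (0.6, −0.2, 0.8); the L1 penalty in Eq. (2) additionally shrinks irrelevant lags towards zero.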

Artificial neural networks
ANNs are non-linear machine learning models consisting of input, output, and intermediate/hidden layers. Properly configured, ANNs can be used to theoretically approximate any continuous target function. Below, to introduce the fundamental building blocks of ANNs, two commonly used hidden layers, the fully connected layer (FCL) and the locally connected convolutional layer (CONV), are described.
ANN layers comprise groups of interconnected nodes. Each node weights the outputs of the previous layer according to φ(b + w⊺x), where b is a bias term, w is a vector of trainable weights, x is a vector of previous layer outputs, and φ(·) is a differentiable activation function, such as a rectified linear unit (ReLU), capable of adding a non-linearity to a node's output. In an FCL, nodes in layer l are connected to all nodes from layer l − 1; computing the dot product of their outputs according to φ(b + w⊺x_{l−1}). Each FCL yields P_{l−1} × P_l optimisable weight parameters, where P_l and P_{l−1} are the number of nodes in layers l and l − 1 respectively.
In a CONV, multiple cascaded convolutions are applied across the input; sliding small kernels of trainable weights over local regions of the previous layer's outputs. Because kernel weights are shared across positions, a CONV requires far fewer optimisable parameters than an FCL of comparable size.
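The FCL computation above can be sketched in a few lines of numpy. The layer sizes below are arbitrary assumptions; the sketch shows the node computation φ(b + Wx) with a ReLU activation, and the P_{l−1} × P_l weight count of a fully connected layer.

```python
import numpy as np

def fcl_forward(x_prev, W, b):
    """Fully connected layer: phi(b + W x_prev), with phi a ReLU non-linearity."""
    return np.maximum(0.0, b + W @ x_prev)

P_prev, P_l = 8, 4                         # nodes in layers l-1 and l (assumed sizes)
rng = np.random.default_rng(0)
W = rng.normal(size=(P_l, P_prev))         # P_{l-1} x P_l trainable weights
b = rng.normal(size=P_l)                   # plus one bias term per node
out = fcl_forward(rng.normal(size=P_prev), W, b)
n_weights = W.size                         # 8 x 4 = 32 optimisable weight parameters
```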

Kullback-Leibler divergence
The Kullback-Leibler divergence (KL) is an asymmetric distance measure, measuring the divergence between two distributions. The forward KL is calculated as KL(p(x) ∥ p̃(x̃)) = E_{x∼p(x)}[log(p(x)∕p̃(x))], where p(x) and p̃(x̃) are the real and generated data distributions respectively. The reverse KL is calculated as KL(p̃(x̃) ∥ p(x)). As we later demonstrate, the weights, θ, of autoencoders and GANs can be optimised by minimising the forward and reverse KL respectively. The impacts of minimising these metrics on data generation are highlighted below.
By reformulating the forward KL as in Eq. (3), it can be shown that minimising the forward KL is equivalent to performing maximum likelihood estimation [27]:

KL(p(x) ∥ p̃(x̃)) = −E_{x∼p(x)}[log p̃(x)] − H(p(x)),   (3)

where H(p(x)) is the entropy of the real data, which is known and constant. The consequence of minimising the forward KL is a 'mean seeking' approximation, which spurs a model to cover the support of p(x), assigning a high probability mass where p(x) is high, while centring around the mean of p(x). In contrast, minimising the reverse KL encourages 'mode seeking' approximations of p(x) by assigning a high probability mass to the mode of the real distribution [27]. Eq. (4) formally expresses this:

KL(p̃(x̃) ∥ p(x)) = −E_{x̃∼p̃(x̃)}[log p(x̃)] − H(p̃(x̃)),   (4)
where (̃(̃)) is the entropy of generated data. Eq. (4) highlights that when minimising the reverse , a generative model learns to assign low probabilities tõwhere is low.

Autoencoders
Autoencoders are lossy networks that learn to encode and subsequently decode model inputs. Generally, autoencoders are implemented in an n∕m∕n or bottleneck architecture, where n is the dimension of the input and output layers, m is the dimension of the hidden layers/latent space, and n > m. Under such a structure the encoder extracts the most salient features from the input vector x; transforming inputs from the feature space x ∈ Rⁿ to an unstructured and generally more compact latent space z ∈ Rᵐ. To find the optimal encoder and decoder parameters, φ and θ respectively, autoencoders are trained by minimising a reconstruction loss according to Eq. (5) [28]:

(φ̂, θ̂) = argmin_{φ,θ} E_{x∼X}[Δ(x, d_θ(e_φ(x)))],   (5)
where e_φ(·), d_θ(·), and Δ(·,·) are the encoding, decoding, and distance functions respectively, and x is a vector sampled from the training set X = {x₁, …, x_N}. As [29] argues, and as we further explain in Section 3.5.2, it is the dimensionality reduction which occurs during encoding that makes latent space transformations more likely than feature space transformations to generate data from the underlying distribution.
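A toy, dependency-free version of Eq. (5) is sketched below: a linear autoencoder (single encoder matrix We and decoder matrix Wd, a simplification of the multi-layer AEs used later) is trained by plain gradient descent on a mean squared reconstruction error, one possible choice of the distance Δ. The data, sizes, and learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, N = 10, 3, 200                       # feature dim, latent dim (n > m), sample count
# Synthetic data lying near a 3-dimensional subspace of R^10.
codes = rng.normal(size=(N, m))
basis = rng.normal(size=(m, n))
X = codes @ basis + 0.05 * rng.normal(size=(N, n))

We = 0.1 * rng.normal(size=(m, n))         # encoder parameters (phi)
Wd = 0.1 * rng.normal(size=(n, m))         # decoder parameters (theta)

def recon_loss(We, Wd):
    """Mean squared reconstruction error, one choice of Delta in Eq. (5)."""
    X_hat = (X @ We.T) @ Wd.T
    return float(np.mean((X_hat - X) ** 2))

loss_before = recon_loss(We, Wd)
lr = 0.001
for _ in range(400):                       # plain gradient descent on the loss
    Z = X @ We.T                           # latent codes z in R^m
    X_hat = Z @ Wd.T
    G = 2.0 * (X_hat - X) / N              # gradient w.r.t. X_hat (up to a constant)
    Wd = Wd - lr * (G.T @ Z)
    We = We - lr * (G @ Wd).T @ X
loss_after = recon_loss(We, Wd)
```

Because the data concentrate near an m-dimensional subspace, even this linear bottleneck steadily reduces the reconstruction loss; perturbing the learned codes Z and decoding them is the latent-space augmentation idea referred to above.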

Variational autoencoders
VAEs [30] are a special class of autoencoder, modelling observed (x) and latent (z) variable probability distributions while learning structured latent space representations of real observations. To learn these representations, VAEs use variational inference to approximate the true posterior, p(z|x), with a variational posterior q_φ(z|x), where φ is a collection of variational parameters. From a neural net perspective, VAEs perform variational expectation maximisation to optimise the parameters of an inference network q_φ(z|x) that outputs z. VAE encoders are called inference networks because they parametrise the inference of the true posterior. VAE decoders, performing the parametrised probabilistic decoding p_θ(x|z), are called generation networks.
Deriving the VAE objective function [30], we note that directly approximating the true distribution of x as p(x) = ∫ p_θ(x|z)p(z)dz is intractable, because it requires evaluating all configurations of z. However, by reformulating the above using Bayes' rule, a tractable objective function representing the lower bound of E_{x∼p(x)} log p(x) can be derived. Specifically, to optimise the parameters of the inference and generation networks, φ and θ respectively, the value function L is maximised according to Eq. (6):

L(θ, φ; x) = E_{z∼q_φ(z|x)}[log p_θ(x|z)] − KL(q_φ(z|x) ∥ p(z)).   (6)
Maximising L is equivalent to minimising the forward KL between the variational posterior q_φ(z|x) and the true posterior p(z|x). This is because, as this forward KL approaches 0, L approaches E_{x∼p(x)} log p(x). Linking L to autoencoders, by taking the negative of L we can identify similarities and differences between VAE and autoencoder objective functions. Formally, the negative of L is equal to Eq. (7):

−L(θ, φ; x) = −E_{z∼q_φ(z|x)}[log p_θ(x|z)] + KL(q_φ(z|x) ∥ p(z)).   (7)
In Eq. (7), the first term on the right measures the expected negative log-likelihood of x. This term is equivalent to an autoencoder's reconstruction loss. The second term measures the information loss from the variational approximation: it is a regulariser controlling the structure of the latent space z, and it is what distinguishes VAEs from autoencoders.
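For the standard choice of a diagonal-Gaussian variational posterior and a standard-normal prior p(z) (an assumption of this sketch, not stated in the derivation above), the regulariser term of Eq. (7) has a well-known closed form, computed below.

```python
import numpy as np

def gaussian_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), the regulariser of Eq. (7)
    in closed form: 0.5 * sum( mu^2 + sigma^2 - log(sigma^2) - 1 )."""
    return float(0.5 * np.sum(mu**2 + np.exp(log_var) - log_var - 1.0))

# The regulariser vanishes exactly when the variational posterior equals
# the prior (mu = 0, sigma = 1), and grows as the posterior drifts away.
kl_at_prior = gaussian_kl(np.zeros(3), np.zeros(3))
kl_off_prior = gaussian_kl(np.array([1.0, -0.5, 0.2]),
                           np.log(np.array([0.5, 2.0, 1.0])))
```

In a trained VAE this term is added to the reconstruction loss, pulling every posterior towards the prior and thereby producing the structured latent space the section describes.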

Wasserstein generative adversarial networks
GANs are adversarial networks that learn to generate samples from the underlying probability distribution of data without explicitly modelling said distribution. They do this by pitting two neural networks, a generator G and a discriminator D, against each other in a minimax game. To train G, [32] proposed maximising the negative cross-entropy of a binary classifier D, outputting a variable y, and separating real (x ∼ p(x|y = 1)) and generated (x̃ ∼ p(x̃|y = 0)) data. Formally, D and G are optimised according to Eqs. (8) and (9):

max_D E_{x∼p(x)}[log D(x)] + E_{x̃∼p̃(x̃)}[log(1 − D(x̃))],   (8)

max_G E_{z∼p(z)}[log D(G(z))].   (9)
Equating GANs to VAEs, by treating the binary targets y of D as observed variables, and the inputs x́ = {x, x̃} of D as latent variables, [31] established a formal connection between a GAN's objective functions and variational expectation maximisation. Similarly to a VAE's inference network, [31] highlighted that D can be thought of as performing posterior inference; with the variational posterior q(x́|y) approximating the posterior p(x́|y) ∝ p_{θ₀}(x́)q_{φ₀}(1 − y|x́), where θ₀ and φ₀ are the G and D weights from the previous training iteration respectively, and q_{φ₀}(y = 1|x́) is equal to D_{φ₀}(x́). Eq. (10) emphasises this connection:

−E_{x̃∼p̃(x̃)}[log D(x̃)] = KL(p̃(x̃) ∥ p(x)) − 2 JSD(p(x) ∥ p̃(x̃)).   (10)
In Eq. (10), the JSD term, whose impact becomes negligible once the reverse KL is sufficiently minimised, is upper bounded by the reverse KL term [31]. The optimisation of G according to Eq. (9) is thus equivalent to the minimisation of the reverse KL.
There are two critical drawbacks to optimising G according to Eq. (9): unstable gradient updates and mode collapse [33]. For a fixed G, as D improves and approaches optimality (D → D*), the gradient norm of G's objective function, ‖∇ log D*(G(z))‖, rapidly explodes. This instability of gradient updates spurs network saturation/instability and increases the complexity of hyperparameter tuning. Further, as minimising the reverse KL encourages 'mode seeking' approximations of p(x), optimising G according to Eq. (9) can provoke mode collapse, which occurs when G, with varying input vectors, begins generating data centred at a single mode of a complex multimodal dataset.
Wasserstein GANs (WGANs) [34] are a variant of traditional GANs that offer an alternative training approach to [32], centred around the optimisation of a more stable objective function: the Wasserstein distance. The Wasserstein distance, a symmetric measure of distribution similarity, is mathematically defined in its primal form as the infimum (greatest lower bound) energy cost of transforming p̃ into p. Unlike the JSD and reverse KL, the Wasserstein distance is continuous everywhere and yields smooth and meaningful gradients even when the manifolds of p and p̃ have disjoint supports. Moreover, it limits mode collapse by dissuading 'mode seeking' approximations of p(x). Because computing the infimum is intractable, in practice a dual form of the Wasserstein distance, expressed in Eq. (11), is computed:

W(p, p̃) = sup_{‖f‖_L ≤ 1} E_{x∼p(x)}[f(x)] − E_{x̃∼p̃(x̃)}[f(x̃)],   (11)

where sup is the supremum (least upper bound). [34] derived the above objective function using the Kantorovich-Rubinstein duality; turning a minimisation problem with an infimum into a maximisation problem with a supremum. Formalising the WGAN training procedure, using a critic C in place of D, C and G are optimised according to Eq. (12):

max_C E_{x∼p(x)}[C(x)] − E_{x̃∼p̃(x̃)}[C(x̃)],   min_G −E_{x̃∼p̃(x̃)}[C(x̃)].   (12)
Note, C must satisfy the Lipschitz constraint, because only Lipschitz continuous functions produce feasible and optimal solutions for the dual and primal forms of the Wasserstein loss respectively. To satisfy the Lipschitz constraint, [34] proposed clipping C's weights when optimising C. Alternatively, a gradient penalty, underlined in Eq. (13), may be added to C's objective function [35]:

λ E_{x̂∼p(x̂)}[(‖∇_{x̂} C(x̂)‖₂ − 1)²],   (13)
where λ is the gradient penalty coefficient and x̂ = εx + (1 − ε)x̃, with ε ∼ U[0, 1], is an interpolated randomly sampled vector. The gradient penalty ensures that the gradient norm of C's objective function does not exceed 1, without requiring extensive hyperparameter tuning.
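The interpolation and penalty of Eq. (13) can be sketched without an autodiff framework by assuming a linear critic C(x) = w·x, whose input gradient is w everywhere, so the penalty reduces to λ(‖w‖₂ − 1)². The distributions, critic weights, and λ below are illustrative assumptions; a real WGAN-GP obtains ∇_{x̂}C(x̂) by automatic differentiation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 64, 5
x_real = rng.normal(loc=1.0, size=(n, d))      # samples from p
x_fake = rng.normal(loc=-1.0, size=(n, d))     # samples from p_tilde

# Interpolate real and generated samples: x_hat = eps*x + (1 - eps)*x_tilde.
eps = rng.uniform(size=(n, 1))
x_hat = eps * x_real + (1.0 - eps) * x_fake

# For a linear critic C(x) = w.x, grad_x C(x_hat) = w at every x_hat,
# so the penalty has the closed form lam * (||w||_2 - 1)^2.
w = np.full(d, 0.5)
lam = 10.0                                     # common gradient penalty coefficient
grad_norms = np.full(len(x_hat), np.linalg.norm(w))
penalty = float(lam * np.mean((grad_norms - 1.0) ** 2))
```

The penalty is zero only when the critic's gradient norm is exactly 1 at the interpolates, which is how Eq. (13) softly enforces the Lipschitz constraint.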

Data
As a case study, to evaluate the robustness of our augmentation methods, we forecast Belgian and Dutch DAM prices. To forecast these, we use lagged DAM prices [14,18], day-ahead grid load forecasts [19], day-ahead available generation forecasts [18] and meteorological features, namely actual and day-ahead forecasts of temperature, wind speed, solar irradiance, and precipitation. From [36,37] we gather our dataset, using data from 01/01/2016 to 31/12/2016 for model training and data augmentation (training), data from 01/01/2017 to 31/12/2017 for hyperparameter tuning (validation), and data from 01/01/2018 to 31/12/2018 for augmentation evaluation (test). DAM prices for each split are shown in Fig. 1. An example of weekly DAM prices is also given in Fig. 2. Summary statistics of DAM prices are presented in Table 1. Training, validation, and test means (μ) and standard deviations (σ) are presented for every hour (h) denominated DAM contract and the average (avg) of contracts.
Analysing Table 1, we observe that the dataset of DAM prices displays varying statistical characteristics across countries (Belgium/The Netherlands), contracts (hour), and time (training/validation/test). On average, the Mean and SD of DAM prices are higher in Belgium than in The Netherlands. Similarly, the Mean and SD of DAM prices are frequently higher for peak hours (between 08h and 20h) than for off-peak hours. Finally, the Mean and SD of DAM prices are observed to generally increase across the training, validation, and test sets, i.e. across time.
While the above are general observations with exceptions -the Belgian 14h contract for instance displays a decreasing Mean and SD across time -overall we strongly believe that modelling such statistically diverse data will enable us to thoroughly evaluate the robustness of our augmentation methods.

Data processing
To facilitate data augmentation using generator networks, we use Min-Max normalisation, which binds data in the range [0, 1]. Such data can be readily generated by applying a sigmoid activation function across final layer network outputs as in [38].
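A minimal sketch of this normalisation step is given below. The key detail, an assumption consistent with standard practice rather than something the section spells out, is that the Min-Max bounds are fitted on the training split only and then applied to the other splits, so no test information leaks into training.

```python
import numpy as np

def min_max(train, other):
    """Fit Min-Max bounds on the training split only, then apply them to
    both splits, mapping the training data into [0, 1]."""
    lo, hi = train.min(axis=0), train.max(axis=0)
    scale = np.where(hi > lo, hi - lo, 1.0)       # guard against constant columns
    return (train - lo) / scale, (other - lo) / scale

rng = np.random.default_rng(0)
train = rng.normal(size=(100, 4)) * 30 + 45       # price-like magnitudes (assumed)
test = rng.normal(size=(50, 4)) * 30 + 45
train_n, test_n = min_max(train, test)            # test_n may fall slightly outside [0, 1]
```

Because a sigmoid output layer is likewise bounded in (0, 1), generated samples can be mapped straight back to price space by inverting the same affine transform.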

Feature selection
Many researchers have modelled the DAM as a multiple output regression problem. We choose to model the DAM as a single output regression problem, forecasting the prices of every DAM hourly contract independently. This allows us to: (1) tailor our modelling approach to the varying statistical characteristics of every contract, (2) capture any price drivers which may be unique to a specific contract, and (3) avoid using too many features in the forecasting of any single contract. Feature selection is used to shrink an initial feature pool of 2160 features to a reduced feature space of at most 35 features. Our initial feature pool, for any DAM contract, consists of the features mentioned in Section 3.1 spanning a seven-day-lagged period. Specifically, the total of 2160 features comprises 1440 meteorological features, 384 generation and load features, and 336 price features. For example, the Dutch feature pool contains seven-day-lagged 00h-23h: Dutch prices, Belgian prices, Dutch generation and load forecasts, and Dutch meteorological features. Note that we use Belgian and Dutch prices together for both the Dutch and Belgian feature pools to increase the explanatory variability coming from neighbouring countries.
To understand why we constrain our modelling input size to at most 35 features, refer back to our discussion of the generalisation bound in Section 1.1. The VC-bound for a regression can be expressed according to Eq. (14) [39]:

E_out ≤ E_in + √( (h(F)(ln(2N∕h(F)) + 1) − ln(η∕4)) ∕ N ),   (14)
where N is the sample size, E_out and E_in are the generalisation and empirical errors, 1 − η is the probability that the bound holds, and h(F) is the VC-dimension of the family of regression models F = {f(x, h)} : R^d → R indexed by h ∈ H. The VC-bound exposes how changes in h(F) and N impact the expected generalisation error. Because the generalisation bound follows the same monotonicity as the VC-bound, we can use Eq. (14) to determine the N needed to train a family of models F, or the F that a constrained N can reasonably train. Numerous studies, such as [1], indicate that by applying a rule of 10, mathematically expressed as 10 × h(F) ≤ N, a reasonable generalisation error (i.e. a level where the VC-bound is meaningful with a probability close to 1) can be attained. The rule of 10 can be applied either to determine an appropriate N for a given h(F), or the reverse. Given our N of 365, the rule of 10 suggests that an F with h(F) ≤ 36 is best suited to our case study. Linking h(F) to our final modelling feature size, it can be shown that the VC-dimension of a d-dimensional linear classifier or regression is equal to d + 1, i.e. its degrees of freedom [40]. Because the expressiveness, and by extension the VC-dimension, of d-dimensional real-valued neural networks is at least as large as that of a d-dimensional linear regression, we can determine a limit for the maximum number of feature inputs for our case study. Formally, F_lin ⊆ F_NN implies that d + 1 = h(F_lin) ≤ h(F_NN). From the above we discern that strictly no more than 35 feature inputs, d, should be selected by our feature selection method.
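The arithmetic of the rule of 10 applied to our setting is short enough to state explicitly:

```python
# Rule of 10: 10 * h(F) <= N bounds the usable model complexity.
N = 365                      # one year of daily training samples per contract
h_max = N // 10              # largest VC-dimension the sample supports: 36
d_max = h_max - 1            # h of a d-dimensional linear model is d + 1, so d <= 35
```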
Similarly to [41], we use random forests (RF) to identify and select features with the greatest explanatory power. RF, an ensemble machine learning model, fits numerous decision trees to random samples of a training dataset. Because each tree is trained to recursively split data by maximising an information gain, estimates of a feature's importance can be computed. This property, as well as RF's computational speed, and robustness to outliers and noise [42], make RF an effective feature selection method.
The RF feature selection method is trained on the training set and tuned by maximising the R² across the validation set. We set a feature importance cut-off that prevents the RF method from selecting more than 35 features (Fig. 3).
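A minimal sketch of RF-based feature selection is given below, using scikit-learn's RandomForestRegressor and its impurity-based feature_importances_. The synthetic target, with only three informative features out of fifty, and the cut-off of ten features are illustrative assumptions standing in for the 35-feature cut-off described above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n, p, keep = 300, 50, 10
X = rng.normal(size=(n, p))
# Only features 0, 1 and 2 carry signal; the rest are noise.
y = 3 * X[:, 0] + 2 * X[:, 1] - X[:, 2] + 0.1 * rng.normal(size=n)

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
order = np.argsort(rf.feature_importances_)[::-1]    # most important first
selected = np.sort(order[:keep])                     # indices of the kept features
```

Ranking by importance and truncating at a cut-off, as here, is the same mechanism used to cap the pool of 2160 candidate features at 35 per contract.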

Forecasting models
In order to establish reasonable benchmark forecast accuracies, we model DAM prices using ARXs. Further, following [18], we model prices using a two-layer neural network (2NN) consisting of two intermediate FCLs. Finally, to evaluate the performance impacts of augmentation on convolutional neural networks, we model DAM prices using a joint three-layer network (2CNN_NN) consisting of two intermediate CONVs, two optional max-pooling or average-pooling layers, and a single intermediate FCL. Layers are stacked to identify non-linear relationships, and to prevent an explosion in the number of network parameters. Early stoppage, L2 regularisation, dropout, ReLU activation functions, Adam, learning rate scheduling, and learning rate decay are used to combat node saturation, speed up model training, and reduce the likelihood of overfitting. Implementation details can be found in the Appendix. To identify the hyperparameters of the above evaluation models, Bayesian optimisation, which often outperforms random grid search [43], is employed. A sample of selected hyperparameters is displayed in Tables A.1 and A.2.

Data augmentation
When augmenting multivariate time series, both X, real explanatory variables, and Y, real response variables, can be simultaneously used to generate X̃, augmented explanatory variables, and Ỹ, augmented response variables. X and Y represent measurements indexed through time, forming a set of time series S = {s | s = x ∪ y, x ∈ X, y ∈ Y}. An ability to generate a set of augmented time series S̃ = {s̃ | s̃ = x̃ ∪ ỹ, x̃ ∈ X̃, ỹ ∈ Ỹ} simplifies augmentation by alleviating the need for any pseudo-labelling of x̃. Examples of generated time series are presented in Fig. 4. In Sections 3.5.1-3.5.4, we describe how, given a set of normalised training inputs S, normalised augmented outputs S̃ can be generated using a multitude of augmentation methods.

Feature space augmentation
From [4] we choose to evaluate jittering, scaling and magnitude-warping. Briefly describing how these methods generate data: jittering adds varying amounts of noise (ε ∼ N(0, σ²)) to s. Scaling generates data by multiplying s by a random scalar (α ∼ N(1, σ²)). Magnitude-warping, similarly to scaling, multiplies s by a smoothly-varying random curve with k knots and a standard deviation σ. To select the optimal hyperparameters σ and k, Bayesian optimisation is employed. Table A.3 presents examples of selected hyperparameters.
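The three transformations can be sketched in numpy/scipy as follows; the σ values and knot count are illustrative defaults, not the Bayesian-optimised ones:

```python
import numpy as np
from scipy.interpolate import CubicSpline

rng = np.random.default_rng(0)

def jitter(x, sigma=0.03):
    """Add i.i.d. Gaussian noise to a (T, d) multivariate series."""
    return x + rng.normal(0.0, sigma, size=x.shape)

def scale(x, sigma=0.1):
    """Multiply each channel by a single random scalar drawn around 1."""
    return x * rng.normal(1.0, sigma, size=(1, x.shape[1]))

def magnitude_warp(x, sigma=0.2, knots=4):
    """Multiply by a smooth random curve interpolated through the knots."""
    t = np.arange(x.shape[0])
    knot_t = np.linspace(0, x.shape[0] - 1, knots + 2)
    knot_y = rng.normal(1.0, sigma, size=(knots + 2, x.shape[1]))
    warp = np.stack([CubicSpline(knot_t, knot_y[:, j])(t)
                     for j in range(x.shape[1])], axis=1)
    return x * warp

x = rng.normal(size=(24, 3))   # 24 hourly steps, 3 features
x_aug = magnitude_warp(x)
```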

Autoencoder augmentation
We utilise AEs, introduced in Section 2.5, to augment multivariate time series. Similarly to [38], as the distance function d(·, ·) measuring the reconstruction loss in Eq. (5) we choose the binary cross-entropy (BCE). We found that using the BCE, an asymmetric loss function, outperformed using a symmetric loss, such as the mean squared error. 5 The BCE, a special instance of the cross-entropy H(p(x), p̃(x̃)) = − ∑ p(x) log p̃(x̃), is calculated according to Eq. (15).
where m is the dimensionality of x. Optimising the BCE pointwise is equivalent to optimising the forward KL: KL(p(x) ∥ p̃(x̃)) = H(p(x), p̃(x̃)) − H(p(x)), because H(p(x)), the entropy of p, is known and constant. For our AEs' training, we use Leaky ReLU with a negative slope α to combat neuron saturation, early stoppage and L2 regularisation with a regularisation parameter λ to limit model overfitting, and Adam to speed-up convergence. The search spaces of α and λ for Bayesian optimisation can be found in the Appendix.
AEs, once trained, can generate time series s̃ by applying perturbations to encoder outputs in the latent space according to s̃ = D(P(E(s))), where P : Z → Z is a perturbation function. This is advantageous because encoding model inputs from the feature space to Z increases the relative volume occupied by the real distribution [29]. To allow tailored data generation, we identify optimal encoding and decoding functions for every DAM hourly contract using Bayesian optimisation. As shown in Fig. 5, encoder outputs are scaled by a scaling matrix S ∼ N(1, (λσ)²), where σ is the column-wise standard deviation (SD) of the encoder outputs, [z_1, …, z_n], and λ a multiplier. Formally, we apply scaling latent space perturbations according to P(z) = S ⊙ z. Bayesian optimisation is used to identify an optimal λ. See the Appendix for the search space of λ. 5 To demonstrate why an asymmetric loss function, specifically the BCE, is better suited to modelling than a symmetric loss function such as the mean squared error (MSE), we calculated the forecasting losses (L) and gradient magnitudes (|∇L|) for two model predictions. The BCE's |∇L| dominate those of the MSE. Such asymmetric weight updating is desirable when modelling a non-uniform probability distribution as it promotes mass covering, while strictly preventing outlier generation. The results are: fewer generated data points near 0 and 1, and more generated data points near the central tendency.
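The latent space perturbation described above reduces, in numpy terms, to the sketch below; the encoder and decoder are stand-in linear maps rather than the trained networks, and λ is an illustrative value:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in encoder E and decoder D (the paper trains these per contract).
W_enc = rng.normal(scale=0.1, size=(8, 24))
W_dec = rng.normal(scale=0.1, size=(24, 8))
E = lambda x: np.tanh(W_enc @ x)
D = lambda z: W_dec @ z

X = rng.normal(size=(24, 365))           # 365 training windows as columns
Z = E(X)                                 # latent codes, one column per window
sigma = Z.std(axis=1, keepdims=True)     # column-wise latent SD
lam = 0.5                                # Bayesian-optimised in the paper
S = rng.normal(1.0, lam * sigma, size=Z.shape)  # scaling matrix ~ N(1, (lam*sigma)^2)
X_aug = D(S * Z)                         # decoded, perturbed series
```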

Variational autoencoder augmentation
Introducing our VAE architectures, all inference networks use a single latent variable layer with a normal prior (p(z) ∼ N(0, I)) and variational parameters φ: μ and σ². These parameters are approximated as q(z | μ(x), σ²(x)), where μ(x) and σ²(x) are parametrised latent variational estimates of z. To train our VAEs, we use a variant of Eq. (7). The weight of the forward KL regulariser in Eq. (7) is adjusted by a Bayesian optimised parameter β [44]. Additionally, an L2 penalty, weighted by a Bayesian optimised parameter λ, is added. For a minibatch of time series ({s_i}_{i=1}^{M} ∼ S), we train our VAE according to Eq. (16).
where KL(q ∥ p) = −(1/2) ∑_{j=1}^{J} (1 + log σ_j(x)² − μ_j(x)² − σ_j(x)²), and m/J are the dimensionalities of feature/latent variable outputs respectively [30]. In Eq. (16), the first term on the right is equivalent to the BCE, while the second term is a closed-form expression of the forward KL. We use Bayesian optimisation to identify optimal inference and generation networks for each DAM contract. Table A.4 presents examples of optimal inference and generation networks. Similarly to AEs, we apply Leaky ReLU with a negative slope α, early stoppage, and Adam. The search spaces of α, β and λ for Bayesian optimisation can be found in the Appendix. Once trained, by minimising the above VAE loss, the prior (p(z) ∼ N(0, I)) is sampled and passed through the generation network to yield VAE augmented data as shown in Fig. 6.
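Numerically, the objective combines the BCE reconstruction term with the closed-form KL above; a numpy sketch with illustrative values (β and the network outputs are stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.uniform(0.01, 0.99, size=24)        # normalised input window
x_hat = np.clip(x + rng.normal(0, 0.05, size=24), 1e-6, 1 - 1e-6)  # reconstruction
mu = rng.normal(0, 0.1, size=8)             # inference network mean
log_var = rng.normal(0, 0.1, size=8)        # inference network log-variance

# Pointwise binary cross-entropy reconstruction loss.
bce = -np.sum(x * np.log(x_hat) + (1 - x) * np.log(1 - x_hat))
# Closed-form KL of N(mu, sigma^2) from the N(0, I) prior.
kl = -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))
beta = 1.0                                  # Bayesian-optimised weight
loss = bce + beta * kl
```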
S. Demir et al.

Wasserstein generative adversarial network augmentation
Using elements of [46][47][48], we construct and train WGAN-GP augmentors, where f, referred to as the critic in WGANs, approximates the 1-Lipschitz function f : S → R. An L2 penalty, ‖θ‖²₂, is added to the generator loss to further improve training stability. During a non-exhaustive empirical evaluation we found that L2 regularisation facilitated WGAN-GP training and increased the diversity of generated data (see [45]). Algorithm 1 presents our proposed method for training WGAN-GP time series augmentors. The hyperparameters referenced in Algorithm 1, and further described in the Appendix, are selected using Bayesian optimisation for every DAM hourly contract. Similarly to [35], the negative critic loss, −L(f; s, s̃), is used in our convergence criterion. Once trained, the generator G is employed to generate multivariate time series. Example G and f network architectures are presented in Fig. 7 and Table A.4. Across a randomly selected sample of DAM contracts, we empirically compared the training stability of WGAN-GPs with the training stability of vanilla GANs, GANs with soft real and fake labels, WGANs, and context encoders, which generate regions of a target and use a compound loss function comprising a reconstruction loss plus an adversarial loss. Overall, WGAN-GPs were found to be the most stable and easiest to train. They suffered significantly fewer instances of mode collapse and were less sensitive to small hyperparameter changes.
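A sketch of the gradient penalty term with a linear stand-in critic, whose input gradient is simply its weight vector, so the penalty has a closed form (λ = 10 follows the common WGAN-GP default; the trained critic is of course nonlinear):

```python
import numpy as np

rng = np.random.default_rng(0)

w = rng.normal(size=24)                      # stand-in linear critic weights
critic = lambda x: x @ w                     # f(x) = w . x

x_real = rng.normal(size=(32, 24))
x_fake = rng.normal(size=(32, 24))
eps = rng.uniform(size=(32, 1))
x_hat = eps * x_real + (1 - eps) * x_fake    # random interpolates

grad_norm = np.linalg.norm(w)                # grad_x f(x) = w for a linear critic
lam = 10.0
critic_loss = (critic(x_fake).mean() - critic(x_real).mean()
               + lam * (grad_norm - 1.0) ** 2)
```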

Augmented data exploration
To highlight the capabilities and limitations of the augmentation methods, Section 4.1.1 compares the internal linear structure of real training and generated data by visualising their principal component projections. Section 4.1.2 analyses the distributions of both real training and generated data by visualising the statistical differences between them. Note that no link between the distributions of generated data and augmentation performance should be assumed.

Principal components
Principal component analysis [49] linearly transforms a set of variables onto an orthogonal basis of linearly uncorrelated principal components. Variance is maximised along the first principal components, enabling the informative visualisation of multivariate datasets, and of the linear correlations and temporal relationships between time series features. To generate Fig. 8 we performed singular value decomposition, and projected both our real and generated data onto the first two principal components, PC1 and PC2, of the training series.
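The projection step can be reproduced with a plain SVD; the data below is a synthetic stand-in for the real training and generated series:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(365, 30))                  # stand-in training series
Xc = X - X.mean(axis=0)

# SVD of the centred training data; rows of Vt are principal directions.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
pcs = Vt[:2]                                    # PC1 and PC2

proj_real = Xc @ pcs.T
X_gen = X + rng.normal(0, 0.1, size=X.shape)    # stand-in generated data
proj_gen = (X_gen - X.mean(axis=0)) @ pcs.T     # same basis as the real data
```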
Analysing Fig. 8, positive %ΔSDs as well as stochastic and relatively small shifts in both PC1 and PC2 highlight how feature space transformations increase the principal component variability of S without critically distorting the temporal relationships in S. Feature space augmentors effectively generate s̃ by adding noise, ε ∼ N(0, Σ(s)), to s.
The noise ε adds variability to the projection of s, s⊺[v₁, v₂] + ε⊺[v₁, v₂], where v₁ and v₂ are the first and second principal eigenvectors of S. Since ε is zero-centred and tuned to avoid distorting the temporal relationships in s, a majority of generated s̃s generally remain in the manifold of S. Out of the three feature space augmentors examined, jittering is observed to generate the most varied s̃s, with larger %ΔSDs.
Turning to model-based augmentors, AEs, VAEs, and WGAN-GPs generate s̃s more tightly clustered around their central measures of tendency, with lower principal component variability. More tightly clustered s̃s result because of early stoppage and how model-based augmentors optimise their respective objective functions. While empirically evaluating various model-based augmentors, we observed that AEs and VAEs initially learn to generate data that varies across PC1, before proceeding to learn to generate data that varies across other principal components. Early stoppage ceases model training before AEs and VAEs learn to generate data with greater overall principal component variability than S, resulting in less varied but more meaningful s̃s. Similarly, early stoppage ceases WGAN-GP training before the generator begins overfitting the training data. Overall, Fig. 8 demonstrates that AEs, VAEs, and WGAN-GPs can generate s̃s on the manifold of S. While ΔPCs are non-zero, their magnitudes suggest that model-based augmentors do not generate s̃s with critically distorted temporal relationships.

Generated distributions
Distribution differences between S and S̃ for each feature j, namely s_j and s̃_j, are visualised in Fig. 9 to demonstrate our augmentors' ability to successfully approximate the moments of the real distribution, and to emphasise their capacity to approximate both the lagged dependent and the lagged independent variables in S.
Examining the generated distributions of feature space augmentors, because jittering, scaling and magnitude-warping do not model the underlying distribution of S, perturbations with a potential to significantly alter the series are generally avoided. This is reflected in the near zero ΔMeans and positive ΔSDs in Fig. 9. Turning to model-based augmentors, AE and VAE are observed to successfully approximate the means of the real distribution. A majority of their s̃s have a negative %ΔSD and %ΔSkewness. Explaining these results, Sections 2.6 and 3.5.2 underlined that both AEs and VAEs perform parameter optimisation by minimising the forward KL. As is explained in Section 2.4, minimising the forward KL spurs an autoencoder to perform 'mean seeking' approximations, centring s̃ around the mean of S. Because mean centring is prioritised, it generally occurs before early stoppage, and before the second and third standardised moments of S are fully approximated. Finally, examining the generated distributions of WGAN-GP, WGAN-GP generates s̃s with the highest |ΔMean| and |ΔSD|, and both positive and negative %ΔSDs. Minimising the Wasserstein loss encourages GANs to approximate the second standardised moments of the real distribution without inducing 'mean seeking' approximations of S.

Evaluation setup
We aim to determine whether data augmentation can significantly boost the prediction accuracies of multivariate time series regression models. To achieve this aim, across the test set we conduct an empirical and statistical analysis of regression models trained with (Aug: [S, S̃], size = 2N) and without (Bench: [S], size = N) augmentation. The specifics of our empirical and statistical evaluation procedures are detailed below.
Describing our empirical evaluation procedure, to assess the overall impacts of data augmentation, in Section 4.3 we compute and evaluate the mean absolute errors (MAE), the root mean squared errors (RMSE), and the symmetric mean absolute percentage errors (sMAPE) of regression forecasts with and without augmentation. Additionally, similarly to [3], we compute Win / Tie / Loss scores to measure an augmentor's efficiency. Instances where benchmark RMSEs decrease/increase across the test set are classified as Wins/Losses. Instances where an augmentor fails to improve upon benchmark RMSEs across the validation set are classified as Ties. Beyond an overall evaluation, in Section 4.3.1 we analyse cumulative absolute error differences between Bench and Aug model forecasts to determine whether augmentation performance remains consistent after model refitting, and to evaluate how augmentation impacts forecast accuracies across the test set. To evaluate whether augmentation performance is impacted by model complexity, we examine the relationship between an augmentor's performance and forecasting model complexity in Section 4.3.2. To determine whether augmentation performance is invariant to changes in the target distribution, in Section 4.3.3 we evaluate the relationship between an augmentor's performance and the target's variance. Finally, to investigate how augmentation improves Bench model accuracies, in Section 4.3.4 we analyse forecast accuracies of targets inside and outside the interquartile range.
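The Win/Tie/Loss classification described above amounts to the following rule (a sketch; argument names are ours):

```python
def win_tie_loss(bench_rmse_test, aug_rmse_test, bench_rmse_val, aug_rmse_val):
    # A Tie when the augmentor fails to improve the validation RMSE;
    # otherwise a Win or Loss by the sign of the test set RMSE change.
    if aug_rmse_val >= bench_rmse_val:
        return "Tie"
    return "Win" if aug_rmse_test < bench_rmse_test else "Loss"
```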
Describing our statistical evaluation procedure, to determine the statistical significance of any accuracy improvements, we perform one-sided Wilcoxon signed-rank tests [50]. The Wilcoxon test is a non-parametric test used to compare distributions of paired samples. We use this test to compare Bench and Aug RMSEs: {RMSE_Bench,i}_{i=1}^{n} and {RMSE_Aug,i}_{i=1}^{n} respectively. The test assumes symmetry between positive and negative sample differences: d_i = RMSE_Aug,i − RMSE_Bench,i, ∀i ∈ {1, …, n}. To compute the Wilcoxon statistic, the rank sums of positive and negative differences are calculated, and their minimum is taken. Note, following the recommendations of [51], we use the Pratt ranking method to obtain conservative estimates of the Wilcoxon statistic. We perform a one-sided Wilcoxon test with the null hypothesis H0 that Aug RMSEs are not systematically lower than Bench RMSEs. When the Wilcoxon statistic translates to a p-value less than 0.05, we reject H0 and accept H1. A p-value less than 0.05 indicates a statistically significant performance improvement at a 5% level. In addition to the aforementioned analyses, in Section 4.3.5 we assess the prediction accuracies of models trained using a combination of s̃s, generated using multiple model-based augmentors. For example, performances of forecasting models trained using both AE and VAE s̃s ([S, S̃_AE, S̃_VAE], size = 3N) are evaluated. Motivated by [9], we also evaluate the performances of ensemble augmentation methods, which average AE, VAE, and WGAN-GP forecasts. Both simple averaging (AVG) and weighted averaging (Weighted) ensemble methods are assessed. The AVG method assigns the same weights to augmentors' forecasts, while the Weighted method assigns weights based on augmentors' previous day's absolute forecast errors. To calculate an AVG forecast for day t we perform ŷ_AVG = (1/3)(ŷ_AE + ŷ_VAE + ŷ_WGAN-GP). A Weighted forecast for day t is calculated according to Eq. (19).
Under the Weighted method, a greater weight is assigned to the forecasts of historically successful augmentors.
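Both procedures can be sketched with scipy and a few lines of Python; the RMSE samples are synthetic, and since Eq. (19) is not reproduced here, the Weighted rule below (inverse previous-day absolute error) is one plausible reading of the description rather than the paper's exact formula:

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)

# One-sided Wilcoxon signed-rank test with Pratt zero handling.
rmse_bench = rng.uniform(5, 10, size=365)
rmse_aug = rmse_bench - rng.uniform(0.0, 0.5, size=365)   # augmentation helps
stat, p = wilcoxon(rmse_aug, rmse_bench, zero_method="pratt", alternative="less")
significant = p < 0.05

def weighted_forecast(preds, prev_abs_errors):
    """Weight each augmentor's forecast by its inverse previous-day absolute error."""
    inv = {k: 1.0 / e for k, e in prev_abs_errors.items()}
    total = sum(inv.values())
    return sum(preds[k] * inv[k] / total for k in preds)
```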

Numerical results and discussion
To highlight overall performance changes, summary augmentation results are presented in Table 2. The results indicate that not all augmentation methods can significantly boost the regression accuracies of multivariate time series models. While AEs, VAEs, and WGAN-GPs, on average, reduce benchmark summary MAEs, RMSEs, and sMAPEs by more than 2%, yielding p-values < 0.05, jittering and magnitude-warping fail to significantly improve summary benchmark MAEs and sMAPEs, yielding p-values ≥ 0.05. Beyond the summary results, AEs, VAEs, and WGAN-GPs are observed to significantly boost benchmark performances in 6/6, 6/6 and 4/6 evaluation cases respectively. Moreover, they are observed to improve 58.33% to 87.4% of benchmark RMSEs, yielding %ΔRMSEs between −4.36% and −0.64%. Meanwhile, jittering, scaling and magnitude-warping yield p-values < 0.05 in far fewer evaluation cases. To explain the relative underperformance of jittering, scaling, and magnitude-warping, we reiterate that the success of feature space augmentation is highly data-dependent. Broadly, feature space augmentation either increases forecast accuracies by spurring the identification of long-term trends, or decreases forecast accuracies by adding too much noise and scrambling any potential trends in time series. Analysing average ARX %ΔRMSE improvements, scaling (−0.84%) is found to perform better than magnitude-warping (−0.35%) and jittering (0.07%). We postulate that scaling, in contrast to jittering, successfully improves linear model performances because it better regulates the amount of noise added to input time series by transforming both trends and residual errors.
In the case of deep models, it must also be noted that feature space augmentations can increase overfitting by not adding enough noise, extending training times without facilitating the identification of nonlinear long-term trends. Analysing average ANN %ΔRMSE improvements, jittering (−0.25%) is found to perform better than magnitude-warping (−0.09%) and scaling (−0.01%). We postulate that, with ANNs, scaling, in contrast to jittering, generally fails to generate sufficiently distinct time series, i.e. to add enough noise, to meaningfully facilitate the identification of non-linear long-term trends.
Beyond feature space augmentation, in Table 2 we find overwhelming evidence that model-based augmentors consistently generate meaningful multivariate time series capable of significantly boosting the regression accuracies of both ARXs and ANNs. All model-based augmentors decrease Belgian and Dutch ARX, 2NN, and 2CNN_NN MAEs, RMSEs, as well as sMAPEs. Moreover, they yield more than 14 (58.33%) Wins across all evaluation cases. Analysing performance differences between Belgian and Dutch evaluation cases, AEs, VAEs, and WGAN-GPs are observed to reduce Belgian forecasting errors more than Dutch errors. Varying feature counts, used in the forecasting of Belgian and Dutch DAM prices, may explain these differences. Because Dutch prices are, on average, forecasted using fewer features, it is reasonable to postulate that less complex ANNs are used to forecast Dutch prices. Elaborating, per Section 1.1, decreased model complexity reduces the need for large training sets and, by extension, the expected forecast error reduction attainable from data augmentation. The impacts of model complexity on augmentation performances are analysed further in Section 4.3.2.
Finally, analysing model-based augmentation performances across ARXs and ANNs, on average, model-based augmentors are observed to reduce ARX MAEs, RMSEs, and sMAPEs by 2.67%, 2.36%, and 2.54%, and ANN MAEs, RMSEs, and sMAPEs by 2.63%, 2.33%, and 2.55% respectively. While similar ARX and ANN improvements are surprising, note that they mostly result from WGAN-GP's Belgian ARX performance. On average, AEs and VAEs achieve higher forecast error reductions with ANNs than ARXs.

Cumulative absolute error differences
To adjust for potential shifts in the target distribution, model refitting is implemented. Describing the refitting process, after forecasting 183 consecutive targets from the test set, historic test set data is combined with the training and validation data to form a combined training set. Using this set, model-based augmentors are refitted, and a further 183 s̃s are generated. After generating additional time series, the evaluation models are refitted as well. In order to analyse the evolution of forecast accuracies and to evaluate the impacts of model refitting, Fig. 10 visualises the evolution of cumulative absolute error differences (CAED) between benchmark and augmentation model forecasts.
Analysing the evolution of CAEDs, Fig. 10 further exposes the performance gulf between feature space and model-based augmentors. While AE, VAE, and WGAN-GP yield CAED_183s of −27.87, −33.19, and −30.35 respectively, jittering, scaling, and magnitude-warping only yield CAED_183s of −1.99, −3.76, and −5.52. Moreover, while all model-based augmentors yield a negative CAED_365, among the feature space augmentors only scaling, buoyed by ARX improvements, results in a negative CAED_365. As the updating of model weights, on average, negatively impacts ANN scaling CAEDs, as well as both ARX and ANN jittering and magnitude-warping CAEDs, we postulate that feature space augmentors must, in general, add a sub-optimal quantity of noise when generating additional s̃s.
Further analysing model-based CAEDs, in Fig. 10 AE, VAE, and WGAN-GP are observed to gradually decrease CAEDs across the test set. Up until the 183rd evaluation day, VAE outperforms AE and WGAN-GP. After refitting, however, WGAN-GP begins to outperform both AE and VAE augmentors, resulting in the lowest CAED_365. Note that, as a result of higher average target prices, the decrease in CAEDs marginally intensifies after, roughly, the 225th evaluation day.

The impact of model complexity
In Section 1.1 we introduced the bias-variance trade-off, and formalised how an expansion of the sample size decreases a linear regression's expected generalisation error. In support of the bias-variance trade-off, it can be shown that, provided overfitting is avoided, the magnitude of expected generalisation error improvements increases as model complexity increases. Below we set out to explore whether the above relationship holds for our augmentation methods. For this, ANN model complexity and augmentation performance (%ΔRMSE) are visualised in Fig. 11. Because computing an ANN model's true complexity is impractical, we estimate model complexities using the number of forecast model parameters per input feature (Params/Features). Augmentors which display an average drop in RMSE for both a majority of complex and simple evaluation models, regardless of model architecture, are considered to be successful augmentors. Moreover, augmentors which additionally display a greater average decrease in RMSE with complex models are considered comparable to a real expansion of sample sizes. For the purposes of our analysis, complex evaluation models are defined to be models with a Params/Features ratio greater than or equal to our case study's median ANN Params/Features ratio of 503.
Analysing Fig. 11, scaling and magnitude-warping are observed to perform better with simple models than with complex models; yielding Win/Loss ratios of 12∕6 and 16∕12, compared to 7∕13 and 12∕15. Scaling increases both complex 2NN and 2CNN_NN benchmark RMSEs, while magnitude-warping is observed to increase both simple and complex 2CNN_NN RMSEs. Consequently, based on the above results, we consider scaling and magnitude-warping to be unsuccessful augmentors. Evaluating jittering results, jittering improves a majority of both simple and complex benchmark models; yielding Win/Loss ratios of 14∕11 and 14∕4 respectively. Jittering reduces complex model errors more than simple model errors; on average yielding higher %ΔRMSE reductions. As jittering, however, fails to improve simple 2NN RMSEs, we do not consider it to be comparable to the real expansion of sample sizes.
Similarly to jittering, all model-based augmentors reduce a majority of both simple and complex model RMSEs. Specifically, AE, VAE and WGAN-GP yield Win/Loss ratios of 30∕13, 33∕16 and 34∕15 across simple models, and ratios of 34∕8, 35∕8 and 32∕13 across complex models. Unlike jittering, however, model-based augmentors on average improve all simple and complex 2NN and 2CNN_NN benchmark RMSEs. Moreover, all model-based augmentors, apart from the WGAN-GP 2NN, reduce complex model RMSEs more than simple model RMSEs. From the evidence displayed in Fig. 11, model-based augmentation is considered comparable in impact to a real expansion of sample sizes.

The impact of a standard deviation change
In Section 4.1.2 the differences between generated and real time series were analysed. We highlighted that feature space augmentors increase the standard deviations of time series inputs, while model-based augmentors generally decrease their standard deviations. To implicitly evaluate whether the standard deviation of generated time series affects augmentation performance, Fig. 12 visualises the impacts of target standard deviation changes on ARX and ANN %ΔRMSEs. We tranche standard deviation changes into two groups, SD− (%ΔSD(y) < 0) and SD+ (%ΔSD(y) > 0), before analysing augmentation performances separately across SD− and SD+.
In line with expectations, in Fig. 12 feature space augmentors are observed to improve benchmark regression accuracies more often when %ΔSD(y) increases. In total, feature space augmentors yield a Win / Loss ratio of 110∕76 across SD+ compared to a Win / Loss ratio of 31∕25 across SD−. Jittering, scaling, and magnitude-warping respectively yield Win / Loss ratios of 32∕22, 38∕20 and 40∕34 across SD+ and Win / Loss ratios of 11∕6, 8∕7 and 12∕12 across SD−. Despite varying total counts across the two SD regions, a majority of feature space augmentors are observed to individually narrowly improve benchmark model performances across both SD+ and SD−. AE, VAE and WGAN-GP overwhelmingly improve a majority of benchmark model performances across both SD+, with Win / Loss ratios of 76∕25, 81∕29 and 80∕30, and SD−, with Win / Loss ratios of 21∕8, 21∕8 and 20∕10 respectively. Unintuitively, given that model-based augmentors generally generate s̃s with a lower standard deviation, model-based augmentors are observed to improve benchmark regression accuracies marginally more often when %ΔSD(y) increases. Given these results, we find no evidence of an innate relationship between the standard deviations of s̃s and augmentation performances under varying %ΔSD(y) shifts.

Augmentation performance across the target distribution
Below, absolute forecast error changes between the benchmark and augmentation models are analysed for targets: below the lower quartile (lower tail), y < Q1; inside the interquartile range (IQR), Q1 ≤ y ≤ Q3; and above the upper quartile (upper tail), Q3 < y. For each target region, Fig. 13 visualises the mean Win over Loss (%MWOL) as well as the mean absolute error difference (MAED) to highlight how often and by how much, on average, augmentation improves benchmark forecast errors. An augmentor is considered to be robust if on average it improves a majority of errors, yielding a MAED < 0 and a %MWOL > 50%, for all target regions. Augmentors which only reduce forecast errors in the IQR, or in the IQR and one tail, are considered comparable to outlier removal techniques or regularisation. Augmentors which robustly reduce lower tail, IQR, and upper tail forecast errors are considered comparable to the introduction of additional real training data. In order to avoid skewing our examination, the impact of refitting, discussed in Section 4.3.1, is filtered out. Only forecast error changes up to the 183rd day are considered.
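For a given target region, the two diagnostics reduce to the sketch below; the function name is ours, and treating %MWOL as the share of improved errors among those that change is one plausible reading of the description:

```python
import numpy as np

def mwol_maed(abs_err_bench, abs_err_aug):
    """%MWOL: share of absolute errors that improve among those that change;
    MAED: mean absolute error difference (negative means augmentation helps)."""
    d = np.asarray(abs_err_aug) - np.asarray(abs_err_bench)
    wins, losses = np.sum(d < 0), np.sum(d > 0)
    return 100.0 * wins / (wins + losses), d.mean()
```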
Analysing Fig. 13, on average, feature space augmentors are observed to reduce lower tail forecast errors more than IQR errors, and IQR forecast errors more than upper tail errors. While jittering and magnitude-warping both reduce lower tail forecast errors, only scaling reduces IQR and upper tail forecast errors. To understand the potential reasons for this behaviour, note that the probability of feature space transformations generating s̃ in S decreases as the input series moves further from the mode of S. Because the distribution of DAM prices is positively skewed with a long right tail, the probability of feature space augmentors generating meaningful time series is greater in the lower tail and IQR than in the upper tail. (In Fig. 13, a Win is an absolute forecast error that improves, |e_Aug| < |e_Bench|, and a Loss is one that worsens.) Similarly to feature space augmentors, AE improves lower tail forecast errors more often than IQR errors, and IQR errors more often than upper tail errors. Unlike feature space augmentors, however, AE achieves robust forecast error reductions across every target region. This is significant because it distinguishes AE performances from
jittering, scaling, and magnitude-warping, underscoring how and why latent space augmentors generally outperform feature space augmentors. Explaining AE's outperformance, we reiterate that applying transformations in the latent space better ensures that generated time series originate from the real distribution. Encoding increases the relative volume occupied by the real distribution, making transformations in the latent space more likely to generate meaningful s̃s across S than feature space transformations. Improving upon the performance of AE, in Fig. 13 VAE is observed to yield greater %MWOLs and MAED reductions than AE across every target region. Because of the positive skewness of DAM prices and VAE's optimisation of the forward KL, VAE continues to improve lower tail benchmark performances more than upper tail performances, similarly to AE. We postulate that VAE improves upon the overall performance of AE by embracing the modelling and sampling of latent space distributions. Modelling the latent space distribution ensures that less residual noise is added when subsequently generating any time series, resulting in more meaningful s̃s across S.
Similarly to AE and VAE, WGAN-GP achieves robust error reductions across every target region. WGAN-GP, however, reduces IQR forecast errors more than AE and VAE. To understand a potential reason why, recall from Section 2.4 that optimising the reverse KL encourages 'mode seeking' approximations of the real distribution. While WGAN-GPs optimise a more stable variant of the reverse KL, WGAN-GPs continue to spur mode/median seeking approximations of S more than AEs or VAEs. We postulate that the above underscores why WGAN-GP yields the highest IQR %MWOLs and MAED reductions.

Combined and ensemble augmentation methods
Having identified multiple performance differences between model-based augmentors, below we assess the prediction accuracies of models trained using a combination of s̃s. Combined augmentors such as AE + WGAN-GP, which uses both AE and WGAN-GP s̃s to train evaluation models, are evaluated. Additionally, ensemble augmentors, which average the forecasts from AEs, VAEs, and WGAN-GPs using either simple averaging (AVG) or weighted averaging (Weighted) methods, are evaluated. Results of both combined and ensemble augmentation methods are presented in Table 3. Further, a breakdown of forecast error changes is visualised in Fig. 14.
Analysing the summary results in Table 3, we observe that combined augmentors, utilising two varying sets of s̃s, on average outperform their individual augmentation counterparts. VAE + WGAN-GP, for example, yields higher summary MAE and sMAPE reductions than either the VAE or WGAN-GP augmentors. As a general rule, we observe that the greater the sum of individual forecast error changes, the greater the combined augmentation forecast error change. VAE + WGAN-GP yields a higher summary sMAPE reduction than AE + WGAN-GP, while AE + WGAN-GP yields a higher sMAPE reduction than AE + VAE. Given that AEs, VAEs, and WGAN-GPs change benchmark summary sMAPEs by −2.13%, −2.67%, and −2.83%, the above ordering is correctly predicted by the sum of individual augmentor sMAPEs; |−2.67 − 2.83| > |−2.13 − 2.83| > |−2.13 − 2.67|. Explaining the results, we postulate that combined augmentors, such as VAE + WGAN-GP, potentially outperform individual augmentors because of a logarithmic relationship between model performance and augmented data quantity. Note, however, that the failure of AE + VAE + WGAN-GP to, on average, reduce forecast errors more than VAE + WGAN-GP suggests that an upper bound to this relationship must exist.
Examining the performances of ensemble augmentors, in all 6 evaluation cases AVG and Weighted augmentors produce statistically significant RMSE reductions. Evaluated using the number of Wins, AVG and Weighted outperform individual and combined augmentors across all 6 evaluation cases. Additionally, they generally yield higher forecast error improvements than combined augmentors. Most notably, Weighted yields greater summary %ΔMAEs, %ΔRMSEs, and %ΔsMAPEs than any other augmentor.
Analysing breakdowns of forecast error changes in Fig. 14, all combined augmentors are observed to robustly improve a majority of lower tail, IQR, and upper tail benchmark forecast errors. In comparison to individual augmentors, combined augmentors generally yield marginally smaller lower tail forecast error reductions; however, they produce greater upper tail forecast error improvements. As with the results in Table 3, the performances of combined augmentors are shaped by the performances of their individual component augmentors. Elaborating with an example, we find that combined augmentors utilising WGAN-GP-generated time series improve upper tail MAEDs and %MWOLs. Linking these results to the performances of individual augmentors, recall that in Fig. 13 we observed that WGAN-GP yields greater upper tail forecast error improvements than either AE or VAE.
Evaluating ensemble augmentation performances, both AVG and Weighted are observed to also robustly improve a majority of benchmark errors across all three target regions. Compared to individual and combined augmentors, ensemble augmentors yield noticeably higher %MWOLs. While ensemble augmentors generally fail to replicate the IQR and upper tail MAED boosts of combined augmentors, they nevertheless succeed in yielding comparable MAED boosts to AEs, VAEs, and WGAN-GPs. Overall, we conclude that ensemble augmentors offer a less risky path to attaining forecast error reductions than either individual or combined augmentors.
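The AVG and Weighted ensemble augmentors described above combine forecasts rather than training data. A minimal sketch of both schemes follows; since the paper's exact weighting rule is not restated here, the inverse-validation-error weights in the usage example are an illustrative assumption.

```python
import numpy as np

def ensemble_forecast(forecasts, weights=None):
    """Combine day-ahead forecasts from models trained on AE-, VAE-,
    and WGAN-GP-augmented data.

    `forecasts` has shape (n_models, horizon). With `weights=None`
    this is the simple AVG scheme; otherwise a normalised weighted
    average (the Weighted scheme, under an assumed weighting rule).
    """
    forecasts = np.asarray(forecasts, dtype=float)
    if weights is None:
        return forecasts.mean(axis=0)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()          # normalise so the weights sum to one
    return w @ forecasts     # weighted average per horizon step

# Three model forecasts over a two-step horizon (EUR/MWh).
preds = [[50.0, 52.0], [54.0, 56.0], [58.0, 60.0]]
print(ensemble_forecast(preds))  # [54. 56.]

# Illustrative Weighted variant: weight by inverse validation error.
val_errors = np.array([2.0, 1.0, 4.0])
print(ensemble_forecast(preds, weights=1.0 / val_errors))
```

Averaging across augmentors dampens the impact of any single augmentor generating poor synthetic data, which is one plausible mechanism behind the ensembles' lower-risk error reductions.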

Conclusion
Our study demonstrates that multivariate time series augmentation methods can significantly boost the regression accuracies of both autoregressive models with exogenous inputs and artificial neural networks. While jittering, scaling, and magnitude-warping generally struggle to improve a majority of forecast errors, AEs, VAEs, and WGAN-GPs are found to significantly reduce a majority of forecast errors; on average reducing benchmark MAEs by 2.23%, 2.73% and 2.97% respectively. Taking every result into consideration, VAEs and WGAN-GPs are found to be our best and most stable individual multivariate time series augmentors, decreasing 70.83% and 69.44% of benchmark errors respectively.
Beyond individual augmentors, we find evidence that utilising combinations of generated time series from multiple model-based augmentors can reduce forecast errors further. VAE + WGAN-GP, for instance, on average reduces benchmark MAEs by 3.44%: 26.01% more than VAE and 15.82% more than WGAN-GP. Similarly, we find evidence that ensemble augmentors, which average the forecasts from AEs, VAEs, and WGAN-GPs, reduce the inherent risks in multivariate time series augmentation. AVG, for instance, improves 86.81% of benchmark errors: 22.56% more than VAE and 25.01% more than WGAN-GP.
We hope that our work spurs others to research data augmentation methods for multivariate time series regression. We recommend that researchers looking to extend our work consider evaluating variants of our best augmentors, such as adversarial VAEs and Info-GANs, or, given the successes of combined and ensemble augmentors, explore more sophisticated bootstrapping algorithms and novel averaging methods for ensemble augmentation. Our empirical evaluation of context encoders leads us to believe that GANs could also be used for scenario generation. A myriad of GANs exist, including TransGANs, that could be integrated into our adversarial augmentation algorithm to generate correlated time series. We encourage researchers to consider using context encoders, context-conditional GANs, or conditional WGAN-GPs.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Appendix. Model architectures and hyperparameters
Bayesian optimisation is used to identify the optimal architecture and hyperparameters of every model in our study. Models are trained on the training set and evaluated on the validation set; hyperparameters are selected based on the best validation score, i.e. the lowest root mean squared error. Note that augmented data is not used during optimisation. Samples of selected hyperparameters are presented in Tables A.1 and A.2. Note that in Table A.2 CONVs are represented using tuples specifying kernel height, kernel width, and the number of filter outputs, while FCLs are represented using integers specifying neuron counts. An average pooling layer, using a one-dimensional kernel of size 2 with a stride of 2, is represented using the acronym AvgPool. Explaining the above notation with an example: the NL 06h 2CNN_NN consists of two CONVs, with one-dimensional kernels of size 4 and 4 and channel outputs of size 8 and 32 respectively, connected to a single FCL with 64 neurons. This network is specified as [(1, 4, 8), (1, 4, 32), 64] in Table A.2. Early stopping and learning rate scheduling and decay are employed while training ANNs. ANN training continues until either an early stopping criterion, requiring a minimum training loss improvement of 10⁻⁶ over 22 epochs, is satisfied, or the maximum number of training epochs, 200, is reached. The learning rate is reduced by a factor of 10 after approximately every 50 epochs. Additionally, it is reduced by a factor of 4/3 after 7 epochs of the training loss plateauing. Table A.3 shows a sample of selected hyperparameters for the feature space augmentors introduced and described in Section 3.5.1, for which Bayesian optimisation is employed to select one hyperparameter log-uniformly from [log(9 × 10⁻³), log(10⁻¹)] and the other from the integer range 4, …, 8.
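The stopping and scheduling rules above can be sketched as two small helper functions. This is a minimal framework-agnostic sketch of the stated rules (min-delta 10⁻⁶ over 22 epochs; divide the learning rate by 10 every 50 epochs and by 4/3 after 7 plateau epochs); the authors' actual scheduler implementation is not specified.

```python
def update_lr(lr, epoch, plateau_epochs):
    """Apply the learning-rate schedule described in the appendix:
    divide by 10 roughly every 50 epochs, and reduce by a factor of
    4/3 (i.e. multiply by 3/4) after 7 epochs without improvement."""
    if epoch > 0 and epoch % 50 == 0:
        lr /= 10.0
    if plateau_epochs >= 7:
        lr *= 3.0 / 4.0
    return lr

def should_stop(loss_history, patience=22, min_delta=1e-6):
    """Early-stopping rule: stop once the best training loss has not
    improved by at least `min_delta` over the last `patience` epochs."""
    if len(loss_history) <= patience:
        return False
    best_before = min(loss_history[:-patience])
    best_recent = min(loss_history[-patience:])
    return best_before - best_recent < min_delta

# A training loss that stays flat for 23 epochs triggers the criterion.
print(should_stop([1.0] * 23))  # True
```

In a full training loop these helpers would be called once per epoch, with `plateau_epochs` counting consecutive epochs without a training-loss improvement.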
Finally, Table A.4 presents a sample of optimal architectures for our model-based augmentation methods: AE, VAE, and WGAN-GP.