Generating Multivariate Load States Using a Conditional Variational Autoencoder

For planning of power systems and for the calibration of operational tools, it is essential to analyse system performance in a large range of representative scenarios. When the available historical data is limited, generative models are a promising solution, but modelling high-dimensional dependencies is challenging. In this paper, a multivariate load state generating model on the basis of a conditional variational autoencoder (CVAE) neural network is proposed. Going beyond common CVAE implementations, the model includes stochastic variation of output samples under given latent vectors and co-optimizes the parameters for this output variability. It is shown that this improves statistical properties of the generated data. The quality of generated multivariate loads is evaluated using univariate and multivariate performance metrics. A generation adequacy case study on the European network is used to illustrate model's ability to generate realistic tail distributions. The experiments demonstrate that the proposed generator outperforms other data generating mechanisms.


I. INTRODUCTION
In order to plan power systems and calibrate operational tools, it is essential to analyse system performance through a large range of representative scenarios [1], [2].Historical data is a key source of such scenarios, but when the available data set is too small for the desired application or when it cannot be made available for privacy reasons, it becomes valuable to have a model that can generate relevant data in abundant quantities.The challenge is that generated scenarios should embody both univariate distributions and multivariate inter-dependencies of the historical data [3].
A common approach has been to fit parametric probabilistic models to historical scenarios, especially load states.In [4], Gaussian mixture models (GMM) have been proposed to augment load data in distribution networks.Another study has introduced hidden Markov models to generate house-hold electric loads [5].More recently, a load generator has been designed using time-varying queuing models [6].Due to the This work was supported by the Chinese Scholarship Council.
curse of dimensionality, it is especially challenging to use parametric methods for generation of high-dimensional states [7, chapter 3].Copula-based models are one class of generative models that does scale to higher dimensions, either using the Gaussian copula, or by 'stacking' copulas in a vine structure, possibly in combination with dimension-reduction schemes [3].
As vine-based copula models are highly asymmetric and therefore prone to bias, it is appealing to investigate 'native' high-dimensional models, such as neural networks.The variational autoencoder (VAE) [8] is an unsupervised machine learning model based on a deep neural network architecture.It has been successfully used in generating electricity load series, such as theft detection [9] and electric vehicle load profiles [10].However, the validation of generated data often remains limited to visual comparisons, which is not straightforward for snapshots of larger and more complex electricity systems.
Moreover, most VAE implementations do not make full use [11] of the flexibility permitted by the mathematical framework in [8].Output noise tuning and training [12], [13] has only recently been considered, with a focus on image and video data sets.In power system related data generation applications, the output noise parameter is usually treated as a hyperparameter (i.e. a preset value that controls the learning process) [14] and noise is not actually inserted into samples [9], [10], [15]- [17].
This paper bridges those identified gaps by investigating the impact of output noise and its parameterisation, and by analysing generated data using performance metrics.This is done for the VAE and the conditional VAE (CVAE), in the context of large-scale spatial load patterns of European countries.This lays the basis for wider applications of this method to synthetic load generation at lower aggregation levels, where consumption patterns are inherently more variable.
The main contributions of this paper are: • We show how a sample-dependent output noise parameter can be co-optimised in the training process and how this noise is used in the generative process.
• Through comprehensive experiments, we show the performance and practicality of VAE-and CVAE-based load generators in comparison with Gaussian copula and conditional generative adversarial network (cGAN) models.

II. DATA GENERATION MECHANISM
In this section, a representative multivariate load state generation mechanism is proposed, based on the conditional variational autoencoder (CVAE).

A. CVAE-based generative model
The CVAE is a neural network architecture that is trained to learn the salient features of historical data by mapping (encoding) historical system states onto a lower-dimensional latent space where the latent distribution is approximately normal -and transforming latent vectors back (decoding) into a high-dimensional state space [18].The decoder is used in conjunction with contextual information c to generate representative states (which can be omitted to obtain a regular VAE model).Consequently, the model is able to generate samples with a similar distribution to the historical data, by transforming normally distributed samples in the latent space back to the data space.We note that the latent (i.e.hidden) representation of a data point is used solely to facilitate reconstruction and synthesis.It does not need to be imbued with a particular meaning.
The structure of the CVAE algorithm is depicted in Fig. 1.The encoder maps the d-dimensional input data x to the code z in the lower-dimensional latent space through k hidden layers H e l , l = 1, . . ., k. Weight matrices W e l , bias vectors b e l and the context c are utilized in the encoding process as where a represents an element-wise nonlinear activation function.Vectors µ and σ parameterize an input-dependent normal distribution in the latent space.The output z is sampled accordingly, using , a vector that is sampled from a standard normal distribution, and the Hadamard product .
Mirroring the encoder network, the decoder maps the sampled latent space code z to the d-dimensional output data x using where W d l and b d l denote weight matrices and bias vectors for decoding, respectively.µ and σ parameterize a z-dependent normal distribution in the x space.

B. Training and generation process
In the training stage, the whole structure of the CVAE model is utilized as Fig. 1.Weight matrices W and bias vectors b are updated in an iterative way with the goal of minimizing the loss function [18] L = L D KL + L Re . ( The Kullback-Leibler loss is the sum over all training data points x i (assumed i.i.d.) of the Kullback-Leibler divergence between that point's posterior distribution q φ (z|x i ) and the prior distribution p(z) (chosen as the standard normal distribution).The posterior distribution q φ (z|x i ) is determined by the parameters φ of the encoder network and represents the mapping of the point x i into a normal distribution in the latent space using (1a) and (1b).As the Kullback-Leibler divergence between two normal distributions can be evaluated directly [14], the Kullback-Leibler loss is computed as where n denotes total number of observations used for training and (µ i , σ i ) are evaluated for data point x i and condition c using (1a).The reconstruction loss L Re stands for the negative loglikelihood of reconstructing the inputs x i via their latent space codes and the decoder that is parameterized by θ.The reconstruction loss is thus computed as where the final step involves a single-point approximation of the expectation and (µ i , σ i ) are obtained from the randomly generated latent code z(x i ) and the condition c using (2a).
During training, the full-sample sum in loss functions (4) and ( 5) are replaced by batch-sample averages.The constant nd 2 log 2π of L Re is omitted.After the training process, only the decoder part of the trained CVAE network is utilized to generate data.Latent space codes z are sampled from the standard normal distribution N (0, I) (see Fig. 1).Then, data space samples x are sampled from distribution N (µ (z, c), σ (z, c)), whose parameters are determined by z and c using (2a).We note that although the amount of available training data determines the information contained within the model, there is no limit to the amount of data that can be generated.
In this way, a complex data distribution in the x space is constructed as a continuous superposition of normal distributions that are parameterised by the normally distributed coordinate z.Using the procedure above, the encoder and decoder networks are trained to adapt any distribution to this normally distributed latent space.We note that other distributions besides the normal distribution can be used as the prior for the latent space coordinate z [19], [20] -selecting the best latent space representation for a particular class of problems remains an open research problem.

C. Network and output noise co-optimization strategy
It is common for CVAE implementations to generate data x not by sampling from N (µ (z, c), σ (z, c)) via (2b), but by directly using the mean value µ (z, c) (the maximum likelihood sample).Moreover, the standard deviation σ is not co-optimized in the training process of ( 3), but considered a hyperparameter that fixes σ i,j =s identically in all dimensions, so that ( 5) can be replaced by In contrast, we investigate the model in which the parameters σ of the output noise distribution are co-optimized as a function of z during training, as was recently also (independently) proposed in [13].In addition, we explicitly add output noise σ (z, c) to the generated data.To compare the different approaches, the quality of the generated data is evaluated under all four combinations (Table I): whether σ is co-optimized in the training stage (Auto σ ) or set to a fixed value (Fixed σ ); whether the noise σ (z, c) is added to the outputs (Noisy) or not (Noise free).

D. Loss function weight tuning strategy
The two loss terms have opposing effects.The Kullback-Leibler loss L D KL ensures a good fit with the prior distribution that samples are generated from, thus suppressing spurious generated points at the expense of 'smoothing' the output.The reconstruction loss L Re , on the other hand, promotes exact reconstruction of the training data.In this paper, in addition to the output noise, we also study the effect of a heuristic weighting factor β [21] for the Kullback-Leibler loss term L D KL on statistical properties of the generated data.
All aforementioned combined strategies are explicated in Table I.Their impacts to the quality of generations will be investigated in the following sections.Particularly, the settings of standard deviation σ and weight β influence the objective function in the training process and will ultimately affect the generated data.On the other hand, the use of output noise, σ (z, c), will directly impact the data generation stage.In this section, the performance of our proposed CVAEbased generative model is analysed using a European load data set.This is done with three data quality metrics that measure univariate distributions and multivariate dependencies.Impacts to the quality of generated data are investigated under the experimental settings in Table I, using both conditional and regular VAEs.

A. Data source and generation
Historical hourly load data for 32 European countries between 2013 and 2017 was obtained from the Open Power System Data platform [22] (package version 2019-06-05).Columns of AL, CS, CY, GB, TR and UA were dropped for incomplete records.The historical data was randomly split into training and test set in blocks of one week with the proportion of 4:1 (35,148 training and 8,569 test samples).
The training set was min-max normalized before being fed into the CVAE model and the inverse transformation was applied to generate samples.The context information c is the hour of day.Both total and hourly volumes of the generated data are the same with the training data set, in order to balance them for visual and statistical analysis.However, we emphasise that the purpose of constructing such a generative model is to have the ability to generate limitless non-repeating data, e.g. for reducing the risk of overfitting in downstream machine learning tasks.
The parameters of the generative models were tuned for optimal performance, for both the VAE and the CVAE.The network contained 2 hidden layers in the encoder with dimensions of 24 and 16, respectively; the bottleneck layer had 8 nodes (8-dimensional latent vector).The decoder also had 2 hidden layers with the same dimensions in reverse order.Comparisons against 4-neuron and 16-neuron bottleneck revealed that a smaller bottleneck results in excessive loss, whereas a larger bottleneck insufficiently forces the network to learn features.In the CVAE model, the hourly time-of-day was encoded cyclically using sine/cosine representation.
The ReLU activation function was used, except for the generation of µ and σ leading up to the bottleneck and output layers.The adaptive moment estimation (Adam) weight optimizer [23] was utilized with default settings to iteratively optimize the value of weight matrices W and bias vectors b.The batch size and learning rate related parameter α for training was 64 and 10 −4 respectively and 20, 000 training iterations were used.Training and data generation of the model was conducted in Python using tensorflow on the Google Colab environment using the GPU option.The code used for this paper is available for download [24], [25].

B. Data quality metrics
To test a generative model's ability to reproduce the features of historical data, especially in high dimensions, statistical tests are required.Three tests are put forward to examine different aspects of the generated data set, in comparison with the historical data.
1) Kolmogorov-Smirnov test for univariate marginal distributions: We used the two-sample Kolmogorov-Smirnov (K-S) test [26] to see whether the generated data was able to reproduce the marginal load distributions for each of the countries in the data set.For a given output dimension (load in a single country), historical and generated data are compared.Under the null hypothesis that historical and generated data are drawn from the same model, the p-values should follow a uniform distribution.In other words, when the historical data is compared against itself, the cumulative distribution of p-values should lie on the diagonal.Thus, for generative models, the closer the cumulative distribution of p-values lies to the diagonal, the higher the similarity between the two distributions.
Clearly, the models are unlikely to exactly reproduce historical distribution, thus large deviations from the ideal curve will show up for large-sample tests.Nevertheless, to analyse the degree of performance of various models, we use repeated tests on smaller sample sets that result in clear differentiation, as in [3].In this paper, 0.5% of the data set, i.e. 176 data points out of 35,148, were randomly drawn from training and generated data set, and then a p-value was calculated accordingly.This process was repeated 5,000 times for each country and a curve was constructed from all p-values.
2) Autoencoder-based point-wise test for multivariate dependencies: Autoencoder (AE) neural networks have been proven to be highly sensitive anomaly detectors [27].Unlike (C)VAE networks, AEs have no stochastic layers and only minimize the reconstruction loss r = i x i − xi 2 /d.An AE learns to compress and decompress the data based on properties of the training set.As a result, data points with dependencies that deviate substantially from that in the training set tend to have higher reconstruction errors.
A separate AE network was trained for this test, with hyperparameters equal to that of the CVAE model, except for the stochastic layers and objective function.Reconstruction errors of all data points (historical or generated) are plotted as cumulative distributions for easy comparison.As a test for overfitting of the autoencoder on the training data, the autoencoder test was performed on the training and test data.The two distributions visually overlapped, suggesting this is not a concern.
3) Energy test for multivariate dependencies of population: Another two-sample test, the energy test [28], was conducted to examine whether the multivariate dependencies of the  population were well acquired from the training set.The energy test, computed using the PyTorch library torch-twosample [29] uses a user-specified number of permutations (200 was used) to calculate a p-value.The same as for the K-S test, we used random subsets of 0.5% of the generated population and historical population.We repeated this process 1,000 times to draw a distribution of p-values and compare it with the uniform distribution (which would be expected if the data was drawn from the historical distribution).

1) Visual comparison of univariate distributions:
In this experiment, the CVAE with fixed σ and no output noise was used to generate 1,459 load demands, conditioned on the time 2:00.Results for the Netherlands are shown in Fig. 2a, for various values of σ .As the output noise assumed in training increases, the variability of the generated points decreases (because noise is not actually added).When σ = 0.1, the distribution of generations is the closest to that of historical data.This setting will be used for all further experiments with fixed σ .1 Fig. 2b further compares data generated using the 'Fixed σ ' and 'Auto σ , Noisy' schemes and the training data.Conditioning on 2:00, 10:00 and 21:00 was performed, and results are shown for the Netherlands.Both methods are able to qualitatively reproduce the features of the data.
2) Multivariate correlations: The top row of Fig. 3 shows the loads of all countries for 10 different snapshots at 2:00, relative to the mean load in those countries at 2:00.Compared to historical data (a) and the noisy generator (c), samples generated by the noise-free generator clearly show higher correlations between countries.This is confirmed by the correlation analysis between six countries in the bottom row of Fig. 3.By omitting output noise, the noise-free generator generated (too) highly correlated samples.
Sensitive experiments for the multivariate dependencies will be conducted in the following sections using the autoencoderbased point-wise test.The accurate representation of multivariate dependencies will be important for the analysis of supply shortfalls in Section IV.
3) Influence of noise generation: In this experiment, the influence of the four strategies listed in Table I were tested with β = 1.Results for the statistical tests described in Section III-B are shown in the first column of Fig. 4. The K-S test results and autoencoder results show that the inclusion of output noise is essential to improve marginal distributions (Fig. 4a) and increase output variability to the level of the historical data (Fig. 4e).In addition, the autoencoder and energy tests show that automatic tuning of the noise strength (Auto σ ) is essential to improve the multivariate dependencies of the generated samples.Together, this experiment shows that the 'Auto σ , Noisy') generator outperforms the other approaches listed in Table I.This was to be expected given the mathematical theory behind the CVAE (which includes noise), but is at odds with common implementations.

4) Comparison between conditional and regular VAE:
In the second column of Fig. 4, the performance of the CVAE and VAE models (with β = 1) is compared.The CVAE model slightly outperforms the VAE model in all categories.One possible explanation is that the CVAE model has access to the context c (time of day), which effectively increases the dimension of the latent space.Because of its better performance, we continue using the CVAE model in subsequent experiments, but the results suggest that a VAE model delivers comparable performance, and may be preferable when no natural conditioning variable is available.
5) β sensitivity test: The third column of Fig. 4 shows the impact of β (values 1, 3, 10) on the performance of the CVAE (Auto σ , Noisy) model.As β is increased, the performance on the K-S test (Fig. 4c) improves, indicating an improved ability to learn marginal distributions.On the other hand, performance on the autoencoder test (Fig. 4g) worsens, suggesting that points 'outside' of the training point cloud are generated for large β.Finally, the energy test (Fig. 4k) indicates that a moderate value of β can strike a balance between the opposing requirements: the curve for β = 3 is closest to the desired result.Nevertheless, depending on the application, it may be desirable to choose β larger or smaller.
6) Comparison of Generative Models: In the fourth column of Fig. 4, the quality of data sampled from different generative models was investigated.The values of β for CVAE and VAE models (both Auto σ , Noisy) were tuned for optimal performance on the energy test (see previous section).In addition, Gaussian copula [30] and cGAN [31] models were included for comparison.The basic cGAN model was modified to use Wasserstein losses [32].Both its generator and discriminator are deep neural networks; each has two hidden layers of 256 neurons, all activated with LeakyReLU (α = 0.2) except in the output layers, where linear and sigmoid activation functions are used for the generator and discriminator.Weights of the neurons are optimized with root mean square propagation (RMSprop) available from python package Keras.
The K-S test shows the outstanding ability of the Gaussian copula model to reproduce marginal load distributions (a design feature of copula models [3]).This model also shows competitive performance on the autoencoder and energy tests.However, it will become clear in Section IV that its tailperformance is worse than that of the (C)VAE models.The cGAN model shows the best performance on the autoencoder test, indicative of its ability to generate samples with realistic features.However, the model significantly underperforms on the K-S and energy tests, which suggests that the generated samples, though 'realistic', are unevenly distributed through the space of possible states.The optimised CVAE and VAE models are competitive on all three tests, with the CVAE model slightly outperforming the VAE model.

IV. MULTI-AREA ADEQUACY ASSESSMENT
Next, we investigated the performance of the load generation mechanisms by using it for a multi-area adequacy assessment study, based on the ENTSO-E Mid-term Adequacy Forecast 2020 (MAF2020) [33].Multi-area adequacy assessment measures the sufficiency of generating capacity compared with the load on each of the nodes in the power system under transmission constraints.This can be considered a stress test of the generative model, as the outcomes are sensitive to high-load events (tail distributions) and their dependencies between countries.
Monte Carlo simulations were used to estimate Loss Of Load Expectation (LOLE [h/year]) and Expected Energy Not Served (EENS [MWh/year]).LOLE is the expected number of hours per year during which the supply does not meet demand.EENS is the expected amount of energy demand per year that cannot be supplied.Parameters from the MAF2020 study were used to construct a model for generating capacity and net transfer capacities between countries.They were combined with generated and historical load data to define a probabilistic model for the Monte Carlo simulations.We emphasise that the model thus constructed is not meant to be an accurate representation of the European grid, but a stylised problem that serves as a comparative testing ground for the generative models.

A. Multi-area adequacy assessment structure
We consider the network as a directed graph (to allow for asymmetric flow limits) where nodes are zones, edges are connections between zones, and edge capacities are transfer capacities.Each sampled state w is represented by the available generating capacity g w i and demand d w i of each node i.Based on the flow constraints and dispatching policy, the consumed power p w i and load curtailment c w i for each node can be calculated, related by We determine c w i (and implicitly p w i ) by solving a quadratic problem with variables ci (curtailment) and fij (flows), which aims to minimize the total load curtailments and assumes that curtailments are balanced between areas [34], relative to the demand in that area: Here, L and N are the sets of connections (from i to j with i < j) and areas respectively.Constraints on power flow fij from node i to node j are given in ( 9); (10) limits curtailment and (11) enforces flow and generating power constraints.The objective function (8) has positive definite structure and the constraints are linear, so this optimization problem is strictly convex and has a unique solution.This optimization problem was solved using the python package quadprog [35].

B. Power system model
A European adequacy assessment model was developed, based on the target year 2025 data from the ENTSO-E MAF2020 [33].The net transfer capacities between countries are defined as the summation of transfer capacities between their constituent zones, as reported in the MAF2020.Since details of generators and unit capacities are not reported in the released dataset, we model the total generating capacity and the unit capacities in each country as follows.The assumed generating capacity of each country is a summation of conventional generating capacity in its zone(s) plus 5% of nameplate wind power capacity.Unit sizes are set per country as the closest value under 500MW that is a divisor of the generating capacity; a unit availability of 83% is used.Cyprus has no connection to other countries, so a unit capacity of 95MW is used to avoid excessive outages.

C. Multi-area adequacy assessment results
To compare the CVAE, VAE, Gaussian Copula, and cGAN generators, they were trained on historical load data from 2017 and 2018 for 35 countries, retrieved from the Open Power System Data Platform ( [22]; columns for CS, IS and UA were omitted).Each model was used to generate 100,000 random load samples.The 'Auto σ , Noisy' setting was utilized for the CVAE and VAE models, and β was set to 10 for improved reproduction of the marginal distributions.For each model, 1,000,000 Monte Carlo generation samples were drawn and combined with random load samples to estimate the LOLE in each country.Fig. 5 depicts the estimated LOLE values for all generative methods and historical data.The area of each sector of the disc represents the LOLE obtained using a particular load states generating model.Numerical results for three countries with low (Austria, AT), medium (The Netherlands, NL) and high (UK) risk levels are shown in Table IV-A.Standard errors for the least significant digits are shown in parentheses.Moreover, to investigate the beneficial effect of interconnection -and therefore the importance of accurate multivariate modelling -risks for these countries are also reported in the absence of interconnection ('island').
By design, the Gaussian copula reproduces the marginal distributions of the historical data.Therefore, the calculated risk for islanded systems is consistent with those for historical data.However, the results demonstrate that this model tends to overestimate risks for interconnected nodes (countries).The cGAN generative model tends to cause an overestimation of risks with both islanded and interconnected nodes, sometimes very significantly (e.g. the LOLE values for the UK and Ireland).In comparison, both the VAE and CVAE models generate data that results in risk estimates that are closer to those observed using historical data, although deviations exist from country to country.This suggests both models are able to substantially represent the multivariate tail distribution of the historical data.
The capability of generating load conditioned on hours is an additional advantage of CVAE in comparison with VAE in the adequacy assessment context.Load curtailments usually accrue during high load hours.So, time of day could be used as a control variable for an importance sampling Monte Carlo scheme that preferentially samples load states at high load hours and compensates for the resulting bias by sample reweighting.

V. CONCLUSIONS AND FUTURE WORK
In this paper, we have investigated the performance of CVAE-and VAE-based models to generate multivariate load states.Our inclusion of (1) sample noise in the generator and (2) co-optimised output noise parameters results in generated samples that show better marginal distributions and dependencies, when compared with common CVAE implementations (fixed noise parameter, noise omitted from the generator).A loss weighting factor β (hyperparameter) can be used to tune the performance of the model.Performance was tested using three statistical tests and in a Monte Carlo generation adequacy study.The (C)VAE based models significantly outperformed Gaussian copula and cGAN models on at least one of these tests and were competitive on all others.
With access to contextual information, the CVAE model slightly outperformed the VAE model.Moreover, such information can be used for targeted analysis, e.g. as part of a Monte Carlo importance sampling scheme that selects specific hours of the day.
In future work, we will further investigate the universality of our proposed load generation scheme by applying it to lower load aggregation levels, such as household-level data.At this level, the privacy-preserving nature of synthetic data becomes very beneficial and should be carefully tested in addition to the distributional aspects.

/Fig. 3 .
Fig. 3. (a), (b) and (c) display 10 typical ratios of 32 countries' historical and generated data to the historical mean values at 2:00.(d), (e) and (f) demonstrate the Pearson correlation coefficient matrices of 6 (out of 32) countries' historical and generated data at 2:00.The horizontal and vertical dimensions in the matrices are Spain (ES), Croatia (HR), Iceland (IS), Italy (IT), Luxembourg (LU) and the Netherlands (NL).

Fig. 4 .
Fig. 4. Results of statistical tests.Each column denotes a set of experiments (noise generation, training condition, value of β and model family).The three rows depict results for the three tests described in Section III-B.

Fig. 5 .
Fig. 5. Comparison of LOLE estimates using historical load data and generative load.The area of each sector of the disc represents the LOLE of the corresponding model (20h/y shown for scale in legend).
1.71.0TABLE II CALCULATED RISKS OF SELECTED COUNTRIES (WITH AND WITHOUT INTERCONNECTION) USING HISTORICAL DATA AND ALL GENERATIVE MODELS.