Energy data generation with Wasserstein Deep Convolutional Generative Adversarial Networks

Residential energy consumption data and related sociodemographic information are critical for energy demand management, including providing personalized services, ensuring energy supply, and designing demand response programs. However, it is often difficult to collect sufficient data to build machine learning models, primarily due to cost, technical barriers, and privacy. Synthetic data generation is a feasible solution to address data availability issues, yet most existing work generates data without considering the balance between usability and privacy. In this paper, we first propose a data generation model based on the Wasserstein Deep Convolutional Generative Adversarial Network (WDCGAN), which is capable of synthesizing fine-grained energy consumption time series and corresponding sociodemographic information. The WDCGAN model can generate realistic data while balancing data usability and privacy level by setting a hyperparameter during training. Next, we take the classification of sociodemographic information as an application example and train four classical classification models with the generated datasets, including CNN, LSTM, SVM, and LightGBM. We evaluate the proposed data generator using Irish data, and the results show that the proposed WDCGAN model can generate realistic load profiles with satisfactory similarity in terms of data distribution, patterns, and performance. The classification results validate the usability of the generated data for real-world machine learning applications with a privacy guarantee, e.g., most of the differences in classification accuracy and


Introduction
The next-generation smart grid faces challenges for sustainable energy management, which requires bi-directional information flow between customers and energy operators. Smart meters, a type of advanced metering infrastructure (AMI), enable communication between customers and energy operators and typically record energy consumption at 30-min intervals [1]. Smart meter data analysis helps utilities better understand their customers, enabling them to offer personalized services, design demand-response programs, and provision energy supply. Conversely, customers can better understand their energy consumption, improve energy efficiency, and change their consumption behaviors. Among others, smart meter data can be used for energy forecasting [2-4], anomalous consumption detection [5,6], price design [7], demand-side management [8,9] and customer segmentation [10]. Building data-driven models depends on the availability of sufficient training data. This is especially the case for deep learning models, which require a large amount of data for training. However, it is often challenging to obtain sufficient data due to cost, technical barriers and privacy reasons. The release of smart meter data is often strictly controlled and regulated, and data usability is often compromised by over-anonymization. For example, the public datasets [11-15] are anonymized through the aggregation or removal of features that can potentially identify individuals. As a result, the availability of fine-grained and feature-rich data is the main obstacle to building various data-driven models.
In non-intrusive load monitoring (NILM), collecting fine-grained data is cost-prohibitive as it requires installing additional devices to monitor appliance-level consumption for individual households. In addition, real-world energy consumption data can have quality issues, such as missing values and outliers, due to sensor or communication problems during data collection. These missing values and outliers must be corrected before the data can be used to train models. A feasible solution to these problems is to generate realistic datasets. Lately, the Generative Adversarial Network (GAN) [16] has received wide attention in image and natural language processing for tasks such as realistic image generation, text-to-image synthesis, image completion, and resolution enhancement, e.g., Refs. [17-19]. More recent work [20] also applied GANs to generate time-series data, as shown in Fig. 1.
The architecture of a GAN consists of two networks, a generator and a discriminator. The generator, denoted by G, produces synthetic data, while the discriminator, denoted by D, estimates the probability that input data is real rather than synthetic. These two networks compete with each other during training and eventually find an equilibrium. During training, random noise Z is fed into the generator as input, and the generator produces the synthetic data G(Z) as output. In each epoch, the discriminator receives a real sample X or a generated sample G(Z), judges it as real or fake, and outputs D(X) or D(G(Z)), respectively. The overall loss of a GAN consists of two parts: one from the generator and the other from the discriminator. The generator aims to minimize the following function, while the discriminator aims to maximize it:

min_G max_D V(D, G) = E_x[log D(x)] + E_z[log(1 - D(G(z)))]

where E_x stands for the expected value over all real samples, and E_z stands for the expected value over the random noise fed to the generator. This formula derives from the cross-entropy between the real data distribution and the generated data distribution. The objective of the discriminator is to maximize the average of the log probability of real samples and the log of the inverse probability of fake samples, so that it can distinguish real from fake samples:

D: maximize E_x[log D(x)] + E_z[log(1 - D(G(z)))]

Conversely, the objective of the generator is to minimize the log of the inverse probability predicted by the discriminator for fake data, which encourages the generator to produce samples with a low probability of being judged fake:

G: minimize E_z[log(1 - D(G(z)))]

Throughout training, the generator and the discriminator take turns updating their model weights by descending their respective loss functions. When network G and network D reach a Nash equilibrium, the training process ends. Then, the generator is ready to produce synthetic samples.
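To make the objective concrete, the two losses can be sketched numerically. The following is a minimal illustration only; the probabilities are hypothetical discriminator outputs, not values from this paper:

```python
import math

def d_loss(d_real, d_fake):
    # The discriminator maximizes E_x[log D(x)] + E_z[log(1 - D(G(z)))];
    # equivalently, it minimizes the negative of that quantity.
    return -(sum(math.log(p) for p in d_real) / len(d_real)
             + sum(math.log(1.0 - p) for p in d_fake) / len(d_fake))

def g_loss(d_fake):
    # The generator minimizes E_z[log(1 - D(G(z)))].
    return sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)

# At equilibrium, a fully confused discriminator outputs 0.5 everywhere,
# giving a discriminator loss of 2 * log 2 (about 1.386).
print(round(d_loss([0.5, 0.5], [0.5, 0.5]), 3))  # 1.386
```

In practice both losses are computed over mini-batches, and the two networks are updated alternately as described above.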
Although the structure of the original GAN model is relatively simple, training a GAN is complex and often suffers from instability, e.g., failure to converge or mode collapse. Many attempts have been made to improve the original model. Among others, these include the deep convolutional GAN (DCGAN) [21], which introduces convolutional layers to enhance the feature extraction capability; the conditional GAN (CGAN) [22], which adds condition vectors to the inputs of the discriminator and the generator to control the GAN output; and the Wasserstein GAN (WGAN) [23], which replaces the original loss function with the Wasserstein distance [24].
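The practical appeal of the Wasserstein distance can be seen in one dimension, where the optimal transport cost between two equal-size empirical samples is obtained by matching them in sorted order. The sketch below (illustrative values only) shows that the distance stays finite and informative even when the two supports are disjoint, a case where the KL divergence is unbounded:

```python
import numpy as np

def wasserstein_1d(a, b):
    # Empirical Wasserstein-1 distance between two equal-size 1-D samples:
    # in one dimension, the optimal transport plan matches sorted order.
    return np.mean(np.abs(np.sort(a) - np.sort(b)))

# Disjoint supports: the KL divergence blows up, but the Wasserstein
# distance still measures how far the fake samples must move.
real = np.array([0.0, 0.0, 0.0, 0.0])
fake = np.array([3.0, 3.0, 3.0, 3.0])
print(wasserstein_1d(real, fake))  # 3.0
```

This finiteness is what gives the WGAN critic a useful gradient signal even early in training, when the generated and real distributions barely overlap.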
Load profile generation is a complex process that must account for the autocorrelation, periodicity, and temporal dependencies of time series. This is very different from image data in terms of complexity and data characteristics. To generate load profiles, we use an improved GAN model, called WDCGAN, derived from the three variants above: DCGAN [21], CGAN [22] and WGAN [23]. This model employs the Wasserstein distance [24] in its loss function instead of the common Kullback-Leibler (KL) divergence [25]. The reason is that the Wasserstein distance, as an alternative objective, can properly measure the distance between two datasets even when their distributions are disjoint. It improves learning stability, mitigates problems such as mode collapse and vanishing gradients, and provides meaningful learning curves for debugging and hyperparameter searches. To better learn the temporal and periodic features of electricity consumption data, we convert one week's electricity data into an image-like matrix and insert convolutional layers into the GAN network, following the DCGAN model. To control the type of data generated, we refer to CGAN and add conditional information to the input of the native GAN. With the WDCGAN model, users can generate realistic, fine-grained, scalable load profiles with only a small sample as seed (training set). The generated data can then be used for different purposes, e.g., training data-driven models, or distribution and sharing without worrying about data privacy. Users can balance the usability and privacy of the data by tuning a hyperparameter during training. As the generated load profiles will be used to train data models or for analysis, their quality is crucial. To evaluate the WDCGAN model, we first study the statistical characteristics of the synthetic load profiles, including distribution, patterns and autocorrelations, and compare them with the real data. Then, we use the synthetic data in an application that identifies the sociodemographic information of a household based on its load profiles, with four classic classification models: the Convolutional Neural Network (CNN) [26], Long Short-Term Memory (LSTM) [27], Support Vector Machine (SVM) [28] and LightGBM [29]. The experimental results show that the accuracy of these classification models trained with the synthetic data is comparable to that of models trained with the real data, which confirms the effectiveness of the proposed model. In summary, the contributions of this paper are threefold: (1) it proposes the WDCGAN model for load profile generation, which addresses the data privacy, cost, and technical barrier issues of data management in the energy sector; (2) it implements an improved neural network structure in the proposed model, which improves training stability and combats mode collapse; (3) it showcases the machine learning task of identifying household sociodemographic information based on load profiles, evaluates the model comprehensively using real-world Irish CER data, and validates the effectiveness of the model.
The remainder of this paper is organized as follows: Section 2 reviews the related work. Section 3 introduces the Irish CER data and the household sociodemographic identification problem, and presents the details of the model, the identification task, and the evaluation metrics. Section 4 conducts experiments to evaluate the model. Section 5 concludes the paper and presents future work.

Related work
This section first reviews load profile generation methods, then discusses an application of the generated datasets, which in this case is the identification of household sociodemographic information.

Load profile generation
Load profile generation has received a lot of attention in the past. Table 1 lists some methods for generating load profiles, though not exhaustively. These methods fall into two broad categories: mathematical modeling methods and data-driven methods. Most of the earlier works are based on simulation using mathematical modeling. The main benefit is that simulation methods do not require real samples to generate data. However, they require complex mathematical modeling knowledge, and the generated load data are generally less accurate. The most widely used mathematical method is the Markov model. To improve performance, such models often rely on external sources, such as appliances [32], physical characteristics of buildings [31], human activities [30-32], or weather conditions [33]. In recent years, data-driven methods have become increasingly popular, mainly due to better accuracy and the availability of public datasets. Data-driven methods typically require some real samples as seed or training data for data generation. A classic data-driven method is the regression model trained with sample data, e.g., Refs. [34,35]. With the emergence of deep learning, neural network-based methods have appeared, such as LSTM-based [37] and ANN-based [41] methods. In this paper, we favor the GAN-based method optimized with the Wasserstein distance in its loss function because of its excellent performance in simulating realistic data. Data-driven approaches can generate scalable data given a relatively small real-world dataset as the seed for training. In addition, some other methods are used for load profile generation, e.g., the graph signal processing (GSP) based method, e.g., Ref. [43], and the simple arithmetic method, e.g., Ref. [36]. But these methods only derive new load profiles from the given samples, so they are not suitable for generating scalable datasets like the prediction-based methods, such as regression or neural networks.
In terms of how load profiles are generated, the methods can be divided into top-down and bottom-up approaches. Top-down approaches typically generate fine-grained load profiles based on disaggregation, e.g., by dividing the total consumption of a substation into individual households [42] or appliances [43]. Top-down approaches require historical data for training and supplementary data (e.g., weather and indoor activities) for the calibration of load profiles; otherwise, accuracy is poor [34].
Bottom-up approaches, on the other hand, aggregate detailed load profiles at higher levels, e.g., from appliances to an individual household [44,45]. Although bottom-up approaches can generate realistic load profiles with diversity, they have strict requirements for detailed household information. Therefore, both top-down and bottom-up approaches have limitations in accuracy or complexity.

Studies on load profiles and sociodemographic information
Sociodemographic information is a type of customer characteristic related to household factors, such as employment status, income, family size, family composition, and appliances. There are two main research directions relating sociodemographic information and load profiles: one is to estimate load profiles based on sociodemographic information; the other is to infer sociodemographic information from load profiles.
The studies [46-48] found that sociodemographic information has a significant impact on household energy consumption. Huebner et al. [49] confirmed that an increasing number of occupants can lead to an increase in total electricity consumption. McLoughlin et al. [50] analyzed customer age as a factor in electricity consumption and noted that family members under the age of 36 consumed the most in the evening. Beckel et al. [51] investigated how household composition affects consumption patterns and the total amount. The studies [52-55] disaggregated residential electricity consumption based on sociodemographic information such as appliances and indoor activities.
The studies [51,56,57] suggested that sociodemographic characteristics can be identified from energy consumption data. Sociodemographic factors, including home occupancy, number of people, and number of appliances, have been successfully inferred. For example, several recent studies [58-64] revealed sociodemographic information from smart meter data, including the number of residents, income and occupancy status.
As a result, the study of energy consumption and sociodemographic data is crucial for demand-side energy management, which involves data privacy issues. Data generation can be a viable solution to preserve privacy, which this paper aims to achieve.

Methodology
This section first describes data collection and processing, then presents an overview of the privacy-preserving framework for load profile generation, followed by a detailed description of the proposed WDCGAN model, and finally presents the model application and evaluation metrics.

Data collection and processing
This study uses the Irish CER dataset [14], gathered as part of the Smart Metering Project initiated in 2007. The purpose of the CER project was to investigate the performance of smart meters and their impact on customer energy consumption. This dataset consists of individual electricity consumption data and survey data. The consumption data were recorded by smart meters at a 30-min interval from July 2009 to December 2010 (75 weeks). Fig. 2 shows the load profiles of three representative households over a week. The survey data contain the sociodemographic information collected from 4223 households via a multiple-choice questionnaire. This sociodemographic information includes occupancy status (e.g., employment status, social class), consumption behavior (e.g., interest in reducing the bill), household properties (e.g., floor area, number of bedrooms) and home appliances (e.g., number of washing machines). This paper selects ten representative sociodemographic characteristics from the 140 questions in the survey data, as shown in Table 2. We removed the rows with missing data, as well as specific classes of attributes that do not represent a significant proportion of customers. The customers' answers to these questions were used as class labels in our identification task and generation work. The corresponding question numbers in the survey data are provided in the second column of the table. The last two columns provide the classification rules for each question and the number of customers in each category. Since data generation and sociodemographic identification require a large amount of training data, we segment the consumption time series by week to enrich the training dataset.
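The weekly segmentation described above can be sketched as follows; the readings are synthetic stand-ins, since the CER data itself is not reproduced here:

```python
import numpy as np

# One customer's half-hourly readings for one week:
# 7 days x 48 readings per day = 336 values.
week = np.arange(336, dtype=float)    # stand-in for real kWh readings
matrix = week.reshape(7, 48)          # row = day, column = 30-min slot

assert matrix.shape == (7, 48)
print(matrix[1, 0])  # 48.0  (first reading of the second day)
```

Each such 7 x 48 matrix later serves as one image-like training sample for the WDCGAN.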

Framework overview
Fig. 3 shows the WDCGAN-based privacy-preserving framework for sociodemographic information identification. This framework can be divided into two parts: data generation and sociodemographic information identification. The generated data can replace the real-world data in machine learning classification tasks. The similarity between the real data and the generated data is calculated to evaluate the performance of the data generator indirectly. The figure shows the nine critical blocks of the framework and their interactions, which are described in the following.
Block 1 is the input for training or testing, including electricity consumption data and the corresponding sociodemographic data. Block 2 is the generative model to be trained; the generative model used is WDCGAN, and the training data come from block 1. Block 8 compares the predicted labels from block 5 with the actual labels from block 7. Block 9 is the comparison of the prediction accuracy of all classification models on real and synthetic data.
The nine blocks can be divided into two parts. The first part includes four blocks for data generation (blocks 1-3) and classifier training (block 4). The second part includes blocks 5 to 9 for the evaluation. Two types of models must be trained in our framework: the generative model for generating data, and the classification model for evaluating it. The classification model is trained using synthetic data from the generative model. In the evaluation process, the resulting classifier predicts the labels for the given real consumption data as input (block 6), and the predicted labels, i.e., ŷ, are compared against the real labels, i.e., y (block 8), to evaluate the model performance (block 9). In general, the higher the accuracy, the more usable the data, but with less privacy. Therefore, it is essential to balance the usability and privacy of the data. This can be achieved by adjusting a hyperparameter when training the WDCGAN model, and different data generators can be trained to generate data that meet different user requirements.
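The evaluation in blocks 8 and 9 reduces to an accuracy comparison. A toy sketch with hypothetical labels (not results from this paper):

```python
import numpy as np

def accuracy(y_true, y_pred):
    # Fraction of predicted labels that match the real labels (block 8).
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

y_real = [1, 0, 1, 1, 0]                 # real sociodemographic labels
y_hat_real_model = [1, 0, 1, 1, 0]       # classifier trained on real data
y_hat_syn_model = [1, 0, 0, 1, 0]        # classifier trained on synthetic data

# Block 9: the smaller this gap, the more usable the synthetic data.
gap = accuracy(y_real, y_hat_real_model) - accuracy(y_real, y_hat_syn_model)
print(round(gap, 2))  # 0.2
```

A small gap indicates high usability; in this framework, raising the privacy level is expected to widen it.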

Proposed WDCGAN model
As discussed earlier, GANs can learn the distribution of a given dataset and have been used successfully in image and natural language processing. However, applying the original GAN to load profile generation for identifying sociodemographic information still faces some challenges. First, the original GAN cannot control the generated data after training because its generator input is only noise, which means that the generated data lack label information. For the identification of sociodemographic information, if the generated data lack label information, the mapping between the generated consumption data and the consumer's labels cannot be established, so the generated data cannot be used by classification models. In addition, the greater freedom of the generative model may lead to a larger difference between the generated samples and the actual samples, resulting in lower usability of the data and poor stability during learning. Second, it is challenging to train GANs because the evolution of the discriminator and generator networks must be balanced during training. If one of them is much stronger than the other, one of two types of failure occurs: mode collapse or convergence failure. That is, if the discriminator is too strong, the generator can only create one type of output or a small set of outputs, resulting in mode collapse. Conversely, if the discriminator is too weak, the generator will not be able to produce good-quality outputs, resulting in convergence failure. Third, the learning process of generative adversarial networks is not sufficiently constrained. The discriminator only judges the probability of the input data being real or fake; therefore, it pays more attention to feature extraction than to the information contained in the data itself.
Since the original GAN cannot stably generate load profiles with label information, we design a novel network structure based on three GAN variants: DCGAN [21], CGAN [22] and WGAN [23]. They are incorporated into the same network structure to improve the original GAN. First, we incorporate label information into the GAN, which turns it from unsupervised into supervised learning. The label information is added to the input vector as an additional condition to control the output. In detail, the real or generated samples are combined with their label information into a single vector that is fed to the discriminator of the WDCGAN. The discriminator then outputs its judgment on the input vector. Only when the generated sample is similar to the actual sample, and their labels match, can the judgment be true. With the added conditional information, WDCGAN can control the output type by supplying the desired label to the generator. Second, following DCGAN, we incorporate convolutional layers into the network structure. In image processing, the convolutional layer acts as a feature extractor. We found that the autocorrelation and periodicity characteristics of energy consumption time series are similar to the relationships between pixels in image processing. Therefore, we convert the weekly consumption data into a 7 × 48 image-like matrix. Each row of the matrix represents one day's consumption data, and each column represents a reading at a granularity of 30 min. The convolutional layers in the WDCGAN extract the hidden features of the load profiles. Third, following WGAN, we apply the following techniques to stabilize and accelerate training. We use the Wasserstein distance in the model, instead of the Kullback-Leibler (KL) divergence, to overcome convergence difficulties. We construct the Wasserstein distance based on the following rules: removing the sigmoid from the last layer of the discriminator, removing the logarithmic operation from the loss functions of the generator and the discriminator, limiting the updates of the discriminator parameters to a small range, and using RMSProp or SGD instead of momentum-based optimization algorithms (Momentum or Adam). In addition, to speed up convergence, we use the ReLU activation function in the generator and the Leaky ReLU activation function in the discriminator.
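The rule of limiting discriminator parameter updates to a small range can be sketched as simple clipping after each update. The clipping bound c = 0.01 below is an assumption of this illustration (the value used in the original WGAN paper), not a value stated here:

```python
import numpy as np

def clip_weights(weights, c=0.01):
    # After each discriminator update, force every parameter into [-c, c]
    # so the critic stays within a small range (approximately Lipschitz).
    return [np.clip(w, -c, c) for w in weights]

w = [np.array([0.5, -0.02, 0.003])]
print(clip_weights(w)[0].tolist())  # [0.01, -0.01, 0.003]
```

Later WGAN variants (including the gradient penalty used below) replace hard clipping, which can over-constrain the critic, with a soft penalty on the gradient norm.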
The details of the WDCGAN structure are described as follows. Discriminator: The input of the discriminator is a combined vector of real or generated samples with their labels, and its output is the judgment of the authenticity and label matching of the combined vector. The discriminator objective is defined as:

max_D V(D) = E_{x~P_data(x)}[log D(x|C)] + E_{z~P_z(z)}[log(1 - D(G(z|C)))]

where x represents the real data, z represents the random noise, P_data(x) is the distribution of the real sample x, P_z(z) is the distribution of the noise, D(x|C) is the probability that the discriminator judges the real sample x to be true under the given condition C, and D(G(z|C)) is the probability that the discriminator judges the generated sample to be true under the given condition C. Fig. 4 illustrates the structure of the discriminator and its processing flow. First, the label (i.e., the sociodemographic information) and the consumption data of one week at a granularity of 30 min are each converted into a 7 × 48 matrix. The two matrices are concatenated as the input of the discriminator. Then, the four convolutional layers of the discriminator extract features from the input. The stride and kernel sizes of the convolutional layers, determined by experiments, are shown in Table 3. Last, the output of the discriminator is a 1 × 1 value, which is the judgment of the input data. The discriminator loss guides the updates of the discriminator and generator parameters. Only when the generated sample combined with its label is sufficiently similar to the actual sample combined with its label will the discriminator output be true, because it has been fooled.
Generator: The input of the generator in WDCGAN is a vector composed of random noise and the label in embedded format. The generator output is a 7 × 48 matrix, corresponding to a week of consumption data at a granularity of 30 min. The structure of the generator and its data processing flow are shown in Fig. 5. First, the labels (i.e., the sociodemographic information) are transformed into a 100 × 1 vector by one-hot encoding and then concatenated with 100 × 1 random noise as the generator input. Then, four transposed convolution layers upsample the input into the hidden features. The stride and kernel sizes of the transposed convolution layers can be found in Table 3. After the four transposed convolution layers, the 200 × 1 input vector is transformed into a 7 × 48 matrix. Finally, the 7 × 48 matrix is sent to the discriminator as input during model training, or used to generate the required data after training.
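The shape arithmetic of the transposed convolution layers follows the standard formula out = (in - 1) * stride - 2 * padding + kernel. The layer settings below are hypothetical (the paper's actual values are in its Table 3); they merely show one way a small input can be upsampled to the 7 × 48 target:

```python
def tconv_out(n, k, s, p=0):
    # Output length of one dimension of a transposed convolution
    # (no output padding): out = (n - 1) * s - 2 * p + k
    return (n - 1) * s - 2 * p + k

# Hypothetical two-layer upsampling path toward a 7 x 48 matrix.
h = tconv_out(tconv_out(1, 4, 1), 4, 1)   # height: 1 -> 4 -> 7
w = tconv_out(tconv_out(6, 2, 2), 4, 4)   # width:  6 -> 12 -> 48
print(h, w)  # 7 48
```

The same formula, applied per layer, is what fixes the kernel and stride choices needed to land exactly on the 7 × 48 output shape.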
With the above carefully designed WDCGAN, we can generate a large amount of synthetic data by controlling the input labels of the generator. The input labels and the generated load profiles form matching pairs, replacing the real sociodemographic data and their consumption data. This means that WDCGAN can generate data for various supervised learning tasks.

For data preprocessing, we proceed as follows. First, we obtain the labels from the survey data containing the customers' sociodemographic information. Second, we partition each customer's energy consumption time series by week, together with the corresponding label. This results in 75 weeks of data; since the CER data are at 30-min intervals, each week has 336 values. Third, we drop the weeks with continuous null or zero values. Fourth, we normalize the data to [0,1] to better train the WDCGAN. Finally, we convert the 336 values of a week into a 7 × 48 matrix. An example of the resulting time series and its corresponding heatmap is also provided.

For the training of WDCGAN, the process differs from that of a general deep learning network, as there are two networks, a discriminator and a generator, whose parameters must be updated in turn. Since we use the WGAN approach [65], the slow convergence problem of deep learning networks can be overcome. The loss function of WDCGAN can be viewed as the Wasserstein distance with a gradient penalty, defined as:

L = E_{z~P_g}[D(G(z))] - E_{x~P_r}[D(x)] + λ E_{x̂~P_x̂}[(‖∇_x̂ D(x̂)‖_2 - 1)^2]   (5)

Equation (5) is the Wasserstein distance between the probability distributions P_g and P_r of the generator output and the real samples, plus a gradient penalty term. This WDCGAN loss must be minimized during learning. The gradient penalty penalizes the gradient norm at samples x̂ drawn from P_x̂, the distribution obtained by uniform sampling along straight lines between the generated distribution P_g and the real distribution P_r. In this paper, we use the following hyperparameters: the gradient penalty coefficient λ = 10, the number of critic iterations per generator iteration n_critic = 5, and the learning rate α = 0.0001.
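The sampling distribution P_x̂ used by the gradient penalty draws points uniformly along straight lines between real and generated samples. A minimal numpy sketch (shapes and values are illustrative; a real implementation would also compute the gradient of D at these points with automatic differentiation):

```python
import numpy as np

rng = np.random.default_rng(0)

def interpolate(real, fake, rng):
    # x_hat = eps * x + (1 - eps) * x_tilde, with eps ~ U[0, 1] per sample:
    # uniform sampling along straight lines between the two distributions.
    eps = rng.uniform(size=(real.shape[0], 1, 1))
    return eps * real + (1.0 - eps) * fake

real = np.zeros((4, 7, 48))   # stand-ins for real 7 x 48 load matrices
fake = np.ones((4, 7, 48))    # stand-ins for generator outputs
x_hat = interpolate(real, fake, rng)

# Every interpolated point lies between the corresponding real/fake pair.
assert x_hat.shape == (4, 7, 48)
assert ((0.0 <= x_hat) & (x_hat <= 1.0)).all()
```

The penalty term of Equation (5) is then evaluated at these x̂ points, pushing the critic's gradient norm toward 1.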
For data generation, we feed 100 × 1 random noise vectors and n labels to the well-trained generator and obtain n output matrices of 7 × 48. Then, we denormalize the output data to obtain the generated load profiles.
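The normalization used in preprocessing and the denormalization applied here form a min-max pair; a sketch with illustrative readings:

```python
import numpy as np

def normalize(x, lo, hi):
    # Map readings to [0, 1] for WDCGAN training.
    return (x - lo) / (hi - lo)

def denormalize(x, lo, hi):
    # Inverse transform applied to the generator output.
    return x * (hi - lo) + lo

readings = np.array([0.2, 1.0, 4.2])       # illustrative kWh values
lo, hi = readings.min(), readings.max()
norm = normalize(readings, lo, hi)
assert np.allclose(denormalize(norm, lo, hi), readings)
print([round(v, 3) for v in norm.tolist()])  # [0.0, 0.2, 1.0]
```

The (lo, hi) scaling parameters must be kept from the training data so the generated profiles can be mapped back to physical units.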

Application of the proposed model
To identify the sociodemographic information of customers, it is essential to build a classification model. This model finds the mapping between the load profiles (recorded by smart meters) shown in Fig. 2 and the corresponding sociodemographic characteristics shown in Table 2. The consumption time series are denoted by X = [X_1, X_2, …, X_i, …, X_n] as the input vector, and the sociodemographic characteristics are denoted by Y = [Y_1, Y_2, …, Y_i, …, Y_n] as labels, where n is the total number of customers and i indexes the i-th customer. Y_j stands for the j-th characteristic among all sociodemographic characteristics, and y_ij is the j-th characteristic of the i-th customer. Mathematically, a classification model F_j is trained based on the consumption data X and the label Y_j for the j-th characteristic:

Ŷ_j = F_j(X; w_j)

where w_j and Y_j are the parameters to be learned and the given label of the j-th characteristic, respectively. In most previous studies, supervised learning models (e.g., classification and regression models) are trained under the assumption that both X and Y form a relatively large dataset and can be easily obtained. However, as mentioned earlier, access to energy consumption data can be restricted due to privacy concerns. Recently, GANs have shown excellent performance in producing realistic data from given training data. Thus, the generated data can be freely shared, accessed, or even used to augment or enrich similar datasets without privacy concerns.
Identifying sociodemographic information is a classification problem, i.e., finding the mapping from load profiles to sociodemographic information. Such studies help with energy demand-side management. For example, a better understanding of sociodemographic information allows one to provide personalized services and improve energy efficiency, which can benefit both customers and utilities. Therefore, this section takes the identification of sociodemographic information as an example application to evaluate the proposed model. We employ four classical classification methods in this application, CNN, LSTM, SVM, and LightGBM, described below. We train these classification models using both real-world consumption data and generated consumption data, and compare the results to evaluate the proposed WDCGAN.
CNN: A CNN is a feedforward deep neural network consisting of an input layer, an output layer and several hidden layers (its network structure can be found in Figure B.11 in Appendix B). The hidden layers are typically composed of convolutional layers, pooling layers, and fully connected layers, which process the features obtained from the previous layer. The convolutional layer is locally connected and shares weights, making the CNN translation-invariant. Compared with other feedforward neural networks, a CNN has fewer parameters, so it is faster to train using backpropagation. CNNs require relatively little preprocessing, can achieve exceptional performance, and have become the dominant model in computer vision. In this paper, we define a CNN as a classification model for identifying sociodemographic information. The input layer of the model receives a 7 × 24 matrix derived from the data generated by WDCGAN. Following Ref. [64], the output layer of the model uses an SVM to predict the class based on the features of the hidden layers. In our model, the CNN has two convolutional layers, a max-pooling layer, and two fully connected layers between the input layer and the output layer. The kernel sizes of the two convolutional layers are 3 × 2 and 3 × 3, respectively, with a stride of 1. The kernel size of the max-pooling layer is 2 × 2, with a stride of 2. The numbers of cells in the two fully connected layers are 320 and 32, respectively.
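Tracing the feature map sizes through this CNN (assuming no zero padding, which is not stated here) reproduces the 320-unit size of the first fully connected layer if one additionally assumes 32 output channels in the last convolutional layer; both assumptions belong to this illustration:

```python
def conv_out(n, k, s=1, p=0):
    # Output length of one convolution/pooling dimension:
    # out = floor((n - k + 2p) / s) + 1
    return (n - k + 2 * p) // s + 1

h, w = 7, 24                                  # input matrix
h, w = conv_out(h, 3), conv_out(w, 2)         # conv1, kernel 3 x 2 -> 5 x 23
h, w = conv_out(h, 3), conv_out(w, 3)         # conv2, kernel 3 x 3 -> 3 x 21
h, w = conv_out(h, 2, 2), conv_out(w, 2, 2)   # max-pool 2 x 2, stride 2 -> 1 x 10
print(h, w, 32 * h * w)  # 1 10 320
```

Under these assumptions, the flattened 32 × 1 × 10 feature map matches the stated 320-cell fully connected layer.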
LSTM: An LSTM is a particular type of recurrent neural network (RNN) that is well suited to learning and predicting sequential data. By adding gated memory units, the network can determine what to remember, what to forget, and what to output. LSTM overcomes the limitation of vanilla RNNs in maintaining long-term memory over time and performs well in many applications with inherently sequential data. The LSTM network structure consists of an LSTM layer, a fully connected layer, and a softmax layer (see Figure B.12 in Appendix B). The softmax layer is also a fully connected layer, used for classification by mapping the network output to (0, 1). In this model, the LSTM layer has 24-dimensional input features and 100 hidden units. The output of the network is the predicted class of the input data.
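A minimal PyTorch sketch of this LSTM classifier follows. Treating each week as a sequence of 7 daily vectors of 24 hourly values is an assumption consistent with the stated 24-dimensional input features; the softmax is left to the cross-entropy loss at training time, as is idiomatic in PyTorch.

```python
import torch
from torch import nn

class SocioLSTM(nn.Module):
    """Sketch of the paper's LSTM: 24-dimensional input features,
    100 hidden units, followed by a fully connected layer.
    Softmax is applied implicitly by the cross-entropy loss."""

    def __init__(self, num_classes=2):
        super().__init__()
        self.lstm = nn.LSTM(input_size=24, hidden_size=100, batch_first=True)
        self.fc = nn.Linear(100, num_classes)

    def forward(self, x):              # x: (batch, 7 days, 24 hours)
        _, (h_n, _) = self.lstm(x)     # h_n: (1, batch, 100)
        return self.fc(h_n[-1])        # class logits per sample
```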
SVM: SVM is a supervised learning method for classification. It uses a kernel function to map the input samples to a high-dimensional feature space and then finds the optimal separating surface in that space, obtaining a nonlinear relationship between the input and output data. It has a comprehensive theoretical basis and can address machine learning problems such as high dimensionality and overfitting. This paper uses an SVM with a radial basis function (RBF) kernel.
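An RBF-kernel SVM on flattened weekly profiles can be set up in a few lines with scikit-learn. The synthetic 168-dimensional data, the toy label rule, and the hyperparameter values below are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from sklearn.svm import SVC

# Toy stand-in data: 100 flattened 7x24 weekly profiles with a binary label.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 168))
y = (X[:, 0] > 0).astype(int)          # illustrative label rule, not real data

# SVM with a radial basis function (RBF) kernel, as used in the paper.
clf = SVC(kernel="rbf", gamma="scale", C=1.0)
clf.fit(X, y)
preds = clf.predict(X)
```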
LightGBM: LightGBM is a gradient-boosted decision tree algorithm used for classification. It applies a highly optimized histogram-based decision tree algorithm and offers significant advantages in efficiency and memory consumption. LightGBM uses two novel techniques, gradient-based one-side sampling (GOSS) and exclusive feature bundling (EFB), which enable the algorithm to run faster while retaining high accuracy, typically speeding up the training process significantly. The implementation of these classification models consists of two parts, model training and model evaluation. We use the generated load profiles to train the models but the real load profiles to test the trained models. To reduce the number of model parameters, we aggregate the generated data into hourly data, which changes the input format from 7 × 48 to 7 × 24. Accordingly, all real data used for the evaluation are transformed into 7 × 24 matrices.
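The half-hourly-to-hourly aggregation described above reduces to a simple reshape. Summing each pair of consecutive 30-minute readings is assumed here, since consumption is an energy quantity (two half-hour kWh values add up to an hourly kWh value); the helper name is ours.

```python
import numpy as np

def to_hourly(week_profile):
    """Aggregate a 7x48 half-hourly weekly profile into a 7x24 hourly one
    by summing each pair of consecutive 30-minute readings."""
    p = np.asarray(week_profile, dtype=float).reshape(7, 48)
    return p.reshape(7, 24, 2).sum(axis=2)
```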
The input to the CNN is one week of consumption time series (7 × 24), and the output is the predicted sociodemographic information. The loss function of the CNN is the cross-entropy between the predicted label and the target label. Stochastic gradient descent is used for parameter optimization during model training, with the learning rate set to α_CNN = 0.02. The LSTM uses the same input as the CNN and also uses cross-entropy as the loss function; it is trained with Adam for parameter optimization, with the learning rate set to α_LSTM = 0.01. As for the other two classification models, SVM [66] and LightGBM [29], we do not present their structures and parameters here because they are well-known and widely used models.
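The CNN training recipe above (cross-entropy loss, SGD with learning rate 0.02) can be sketched as a self-contained loop. A stand-in linear model and a single random batch are used here for brevity; both are assumptions, not the paper's setup.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Stand-in for the CNN classifier: a linear model over flattened 7x24 profiles.
model = nn.Linear(7 * 24, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.02)   # alpha_CNN = 0.02
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(64, 7 * 24)            # one batch of flattened weekly profiles
y = torch.randint(0, 2, (64,))         # binary sociodemographic labels

initial = loss_fn(model(x), y).item()
for _ in range(200):                   # a few SGD steps on the fixed batch
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
final = loss.item()
```

The same loop applies to the LSTM by swapping in `torch.optim.Adam(model.parameters(), lr=0.01)`.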

Evaluation metrics
To evaluate the performance of the proposed model, we use both statistical metrics and the results of an actual application to evaluate the generated load profiles. The metrics include the autocorrelation of the generated data and the root mean square error (RMSE) of the generated data with respect to the real data. The actual application effect is the classification accuracy obtained using the generated data.
Autocorrelation: Ideally, the generated data should have the same randomness, volatility, and periodicity as the real data. Thus, we evaluate the model by calculating the autocorrelation of the generated load profiles. In the autocorrelation curve, the peaks and valleys reflect the periodicity and randomness of the generated data. Let a generated time series be denoted by {x_t}_{t=1}^N; its autocorrelation at lag h is calculated as

Autocorr(h) = (1 / ((N − h) σ²)) · Σ_{t=1}^{N−h} (x_t − μ)(x_{t+h} − μ),

where μ and σ represent the mean and standard deviation, respectively. In addition, the generated data should be diverse and reflect all possible customer electricity consumption scenarios. The scatterplot of each generated dataset's mean and standard deviation reflects its diversity: the more scattered the plot, the more diverse the generated load profiles.
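The autocorrelation metric can be computed directly from its definition. This NumPy sketch uses the mean and standard deviation of the full series, as in the formula above.

```python
import numpy as np

def autocorr(x, h):
    """Lag-h autocorrelation of a series x, using the global mean and
    standard deviation as in Autocorr(h) above."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    mu, sigma = x.mean(), x.std()
    return float(np.sum((x[: n - h] - mu) * (x[h:] - mu)) / ((n - h) * sigma ** 2))
```

For a perfectly daily-periodic hourly series, the autocorrelation returns to 1 at a lag equal to the period, which is the kind of peak inspected in the evaluation.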

RMSE:
The generated data should have characteristics similar to those of the real data. To check the similarity between the real time series {x^r_t}_{t=1}^N and the generated time series {x^g_t}_{t=1}^N, we compute the root mean square error (RMSE), defined as:

RMSE = sqrt( (1/N) Σ_{t=1}^N (x^r_t − x^g_t)² ).

The purpose of the generated data is to be used by various applications. Therefore, we use the classification accuracy obtained with the generated data in different classification models as an additional evaluation criterion: the generated data serve as the training set for each classification model, and the real data serve as the test set. The difference in classification accuracy between the two datasets reflects the similarity between the real and generated data. The accuracy of a classification model on the two datasets is computed from its confusion matrix. For a binary classification task, the confusion matrix contains TP (true positives) and TN (true negatives), the numbers of positive and negative instances correctly classified by the model, and FN (false negatives) and FP (false positives), the numbers of incorrectly classified positive and negative instances, respectively. Accuracy evaluates the prediction quality of the classification model and is calculated as the proportion of correctly predicted samples, i.e., TP and TN, to the total number of samples:

Accuracy = (TP + TN) / (TP + TN + FP + FN).

An accuracy value close to 1 indicates a good classification result, while a value close to 0 indicates a bad one. However, the accuracy of a classification model cannot by itself reflect the quality of the generated data. We therefore use ΔACC, the distance between the accuracy of the same classification model on the real data and on the generated data, defined as:

ΔACC = |Accuracy(Generated) − Accuracy(Real)|,

where Accuracy(Generated) represents the accuracy of the classification model on the generated data, while Accuracy(Real) represents the accuracy of the same classification model on the real data. If ΔACC is close to 0, using the generated data performs no differently from using the real data. F-measure: Also called the F1-score or balanced F-score, it is the harmonic mean of precision and recall [67,68], as in Eq. 12. It is often used to evaluate classification performance on imbalanced labeled data. The F1-score ranges from 0 to 1, where a higher value reflects better classification performance.
where Pr and Re denote precision and recall, respectively, calculated as:

Pr = TP / (TP + FP),    Re = TP / (TP + FN),    F1 = 2 · Pr · Re / (Pr + Re).

Since a single F1-score on the generated data cannot adequately reflect the quality of the generated data, we define ΔF1 as the metric to evaluate it. It represents the distance between the F1-scores of the same classification model on the real and synthetic datasets, reflecting the performance difference on the two datasets:

ΔF1 = |F1(Generated) − F1(Real)|,

where F1(Generated) represents the F1 value of the classification model on the generated data, while F1(Real) represents the F1 value of the same classification model on the real data. If ΔF1 is close to 0, the performance of using the generated imbalanced data does not differ from that of using the real imbalanced data.
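The evaluation metrics of this section (RMSE, accuracy, F1, and the Δ distances) are straightforward to implement from their definitions; a minimal sketch:

```python
import numpy as np

def rmse(real, generated):
    """Root mean square error between real and generated series."""
    real, generated = np.asarray(real, float), np.asarray(generated, float)
    return float(np.sqrt(np.mean((real - generated) ** 2)))

def accuracy(tp, tn, fp, fn):
    """Proportion of correctly predicted samples (TP and TN)."""
    return (tp + tn) / (tp + tn + fp + fn)

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    pr = tp / (tp + fp)          # precision
    re = tp / (tp + fn)          # recall
    return 2 * pr * re / (pr + re)

def delta(metric_generated, metric_real):
    """Distance between the same metric on generated and real data;
    used for both delta-ACC and delta-F1."""
    return abs(metric_generated - metric_real)
```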

Results and analysis
This section first presents the data preparation procedure and experimental settings, and then evaluates the model based on load profile statistics and on sociodemographic information classification using the generated datasets.

Data preparation and experimental settings
The experiments use the CER dataset described in Section 3.1, which records electricity consumption data and sociodemographic information for more than 5000 Irish households from July 14, 2009, to December 31, 2010. After data cleaning, more than 300,000 CER records can be used for the experiments; S_R denotes these cleaned data. Each sample in S_R contains one week's time series of consumption values and the corresponding sociodemographic information. For the generative model, i.e., WDCGAN, 10,000 or 40,000 samples are randomly selected from S_R as its training set. We use G_Qi{10000} and G_Qi{40000} to denote the generators trained with 10,000 and 40,000 real samples, respectively, where Qi stands for one of the ten questions in Table 2. For example, G_Q310{10000} and G_Q310{40000} represent the generators in WDCGAN trained for question #310 with 10,000 or 40,000 samples from S_R, respectively. For the classification models, 40,000 real samples are randomly selected from S_R as the test set (or training set) for the four types of classification models, denoted by S_R-Qi. At the same time, the trained WDCGAN uses its generator to produce 40,000 synthetic samples, denoted by S_S-Qi{10000} or S_S-Qi{40000}, as the training set (or test set) for the four types of classification models.

Load profile evaluation based on statistics
As mentioned above, the generator in the trained WDCGAN can be used to produce synthetic data. We can control the conditional vector fed into the generator to produce various synthetic data, satisfying data privacy and data usability requirements.
S_S-Qi{10000} and S_S-Qi{40000} represent the 40,000 synthetic samples generated by the data generators G_Qi{10000} and G_Qi{40000}, respectively. For each question, the generated datasets S_S-Qi{10000} and S_S-Qi{40000} have the same proportions of sociodemographic categories as S_R-Qi, achieved by controlling the number and proportions of categories in the conditional vectors. Take question #310 as an example. It has two label categories, class-1 and class-2; there are a total of 12,488 class-1 samples and 27,580 class-2 samples among the 40,000 selected samples. To reduce overfitting of the classification models, the original time series with a fine granularity of 30 min are aggregated into hourly resolution, so that each customer's weekly consumption data become a 168-dimensional vector.
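Matching the real dataset's label proportions amounts to building the one-hot conditional vectors fed to the generator with the desired per-class counts. A minimal sketch (the helper name is hypothetical):

```python
import numpy as np

def make_conditions(class_counts):
    """Build one-hot conditional vectors whose per-class counts match
    the label proportions of the real dataset."""
    num_classes = len(class_counts)
    labels = np.concatenate(
        [np.full(n, c, dtype=int) for c, n in enumerate(class_counts)]
    )
    return np.eye(num_classes)[labels]
```

For question #310, `make_conditions([12488, 27580])` would produce conditional vectors reproducing the class-1/class-2 split of the selected real samples.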
Table 4 presents the RMSE between the real data and the generated data for each question. Columns 2 and 3 give the RMSE of S_S-Qi{10000} and S_S-Qi{40000} with respect to S_R-Qi, respectively, and Column 4 is the difference between Columns 2 and 3.
For an intuitive comparison between the generated and real data, we plot the mean and standard deviation curves for question #310 in Fig. 6. As shown in Table 2, question #310 has two categories with the answers yes and no, denoted by label #1 and label #2, respectively. It can be seen that the curves of the generated data and the actual load profiles for each class are close to each other and share the same periodicity. However, the mean and standard deviation of the generated data are more volatile than those of the real data, because WDCGAN tries to randomize the generated data; this is confirmed by the scatterplots in Figs. 7 and 8. The generated data and the real data have similar scattering areas, suggesting that the generator can produce highly similar consumption data. However, the two figures also show that the scattered area of the real data surrounds that of the generated data, meaning that the generated data are less diverse than the real data. Comparing the two figures, the overlap area in Fig. 7 is larger than that in Fig. 8, which indicates that more training data can help WDCGAN better understand and learn the diversity of the actual data. Fig. 9 compares the autocorrelation of the generated and real data. Both autocorrelation curves have similar valleys and peaks at adjacent points, implying that they have the same randomness over the period of a day (lag = 24) and confirming that our WDCGAN has learned the periodic characteristics of the real data. From ΔACC, we can see that the differences in accuracy between the real and generated datasets for the 10 classification tasks are mostly less than 8.0%; only questions #401, #450, and #453 show a greater difference.

Load profile evaluation based on applications
As the labels for the 10 classification tasks are imbalanced, F1 is used for a better assessment of the imbalanced data. The results are shown in Table 6. The experimental settings of Columns 3 to 6 are the same as those in Table 5. The last two columns are the differences between Columns 3 and 4 and between Columns 5 and 6, respectively. We observe that the differences, except for questions #450 and #453, are less than 8.0%. The small ΔACC and ΔF1 values validate that the classification models perform similarly on S_S-Qi{10000} and S_R-Qi; that is, the distributions of the two datasets are similar. Therefore, we can safely use the generated data for classification tasks where data privacy is a concern. It is worth noting that the accuracy values are generally higher than the F1-scores, but the ΔF1 values are generally smaller than ΔACC. This means that the accuracy metric is biased by the class imbalance; therefore, F1 is preferable for evaluating data with class imbalance.

Remarks
In addition, we conducted experiments using real data as the training set and evaluated the classification performance of the resulting models (see Tables C.11 and C.12 in Appendix C). The results also confirm that the proposed model is capable of generating data with the desired diversity, as shown in Figs. 7 and 8. Furthermore, we evaluated the model with a different training sample size to assess its impact (see Table C.13 in Appendix C). Compared with Table 6 (results from 10,000 training samples), we can conclude that a larger training sample size improves classification performance, presumably because the model can learn the distribution and diversity of the larger training set.
Comparing the accuracy in Tables 5 and C.11, and the F1-scores in Tables 6, C.12, and C.13, we can see that the accuracy and F1-scores obtained with the generated data decrease, but at slightly different rates. This can be explained as follows. First, WDCGAN limits the update of the generator parameters to a small range to achieve stable model convergence, which makes the generated data less diverse than the real data. Less diversity means that classifiers trained with the generated data generalize less well; as a result, when such a model is tested with real data, the accuracy and F1-scores can be much lower. However, if the generated data are used as the test set, the accuracy and F1-scores of classifiers trained on the real data decrease only slightly. Second, the number of categories in the training data varies, resulting in significant differences in the data generated by the model. The more labeled categories in the training set, the more information the model can obtain from the external conditions, which allows it to better learn the unique characteristics of the different data categories and helps it generate more categories. Last, each classification model has different feature extraction abilities and sensitivities, which leads to significant differences in accuracy and F1-score for different models on the same data.
In summary, the above studies show that the proposed model, WDCGAN, can successfully generate realistic electricity consumption data, and that the generated data can be successfully applied to classification tasks for identifying sociodemographic information. The results validate the effectiveness of the proposed model.

Conclusions and future work
Data privacy and availability are a growing concern in energy management. In this paper, we proposed an improved GAN model, WDCGAN, to generate realistic electric load profiles. The proposed model is derived from existing GAN variants, including DCGAN, CGAN, and WGAN. WDCGAN improves the structure of the native GAN neural network to enhance its feature extraction capability, accelerate training convergence, and stabilize the training process. The performance of a generative model is indirectly represented by the data it generates. To evaluate the proposed WDCGAN, we compared the generated data with the real data using metrics including autocorrelation and RMSE. In addition, we took the sociodemographic information identification task as an example to evaluate the performance of the generated data in real-world machine learning applications. The identification task was implemented based on four classic classification models, including CNN, LSTM, SVM, and LightGBM. We conducted comprehensive experiments, and the results showed that the proposed data generator can produce realistic data similar to the real data in terms of data distribution, patterns, and classification performance. The results confirm that real data can be replaced by generated data in various applications to address data privacy or availability problems.
For future work, we would like to further improve the WDCGAN model, e.g., by enhancing its ability to reproduce the diversity of the real data. Furthermore, it would be interesting to apply a distributed approach to speed up the training process and generate scalable datasets in a distributed manner.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Table C.13 shows the results of using 40,000 real or generated samples to train the model, G_Qi{40000}, tested with the real data.
Figure B.10 in Appendix B.

Fig. 6. Comparison between real and generated data with respect to mean and standard deviation for question #310.

Fig. 7. Mean and standard deviation of the real and generated load profiles for question #310 with G_Qi{10000}.

Fig. 8. Mean and standard deviation of the real and generated load profiles for question #310 with G_Qi{40000}.

Fig. 9. Comparison of the autocorrelation between the generated data and the real data.

Table 1
List of state-of-the-art techniques for load profile generation.

Table 2
Sociodemographic information to be identified.

Table 3
The architecture and hyper-parameters of the discriminator and generator.

Fig. 5. The generator architecture in the proposed WDCGAN.

Table 5 presents the accuracy performance of CNN and SVM in

Table 4
RMSE with different training sample sizes.

Table 5
Accuracy on the real data and the generated data (generated data as the training set).

Table 6
F1-score on the real data and the generated data (generated data as the training set).