Stochastic generation of residential load profiles with realistic variability based on wavelet-decomposed smart meter data

Residential smart meter data with high time resolution are integral to many data-driven applications, ranging from hosting capacity studies to R&D activities of private enterprises. However, privacy legislation restricts public availability of large-scale datasets. Furthermore, existing datasets may suffer from imbalances in terms of underrepresented classes. To address these concerns, this study presents a novel decomposition–recombination approach for generating synthetic load profiles that exhibit realistic variability and demand peaks. High-frequency load profiles are decomposed into a low-frequency base load and high-frequency variability at the daily level through a discrete wavelet transformation. Components from different households are subsequently rescaled, shifted and recombined in a stochastic load profile generator to obtain new daily load profiles with high-fidelity behavior. The performance of this generator is evaluated through benchmarking, resulting in a mean absolute error of 0.09 kW on an average value of less than 3 kW for the daily peaks, whilst preserving their seasonality. The introduced load profile generator is validated as an alternative to privacy-sensitive residential smart meter data in a hosting capacity case study. The analysis focuses on the voltage drop caused by residential electric vehicle charging, considering both real and synthetic data. The synthetic data demonstrated voltage drops with a mean absolute error of less than 0.2 V for the 10th and 90th percentiles when benchmarked with respect to the real voltage level distribution. The introduced decomposition–recombination method is shown to accurately capture the high-frequency variability and peak behavior, and is suitable for practical applications at the daily level.


Introduction
Low-voltage (LV) distribution systems were historically designed to distribute power from centralized power plants to end-users connected to the LV grid, often dimensioned to withstand a worst-case concurrent peak demand. However, multiple societal trends are challenging this traditional perspective on the LV grid. First, the increasing penetration of distributed energy resources (DERs) such as photovoltaic (PV) installations is already transforming many residential consumers into prosumers, increasingly injecting power into the grid. Second, an exponential growth of both electric vehicles (EVs) and heat pumps (HPs) is expected, as they play a key role in the energy transition: they provide low-carbon alternatives to traditional fossil-fuel-based appliances to fulfill mobility and heating demand. The widespread adoption of these types of appliances will impact the intensity of residential electricity usage.
However, grid operators are aware that the large-scale integration of PV installations, EVs and HPs can lead to LV grids operating outside their design limits, which is typically assessed through hosting capacity (HC) studies. One of the prerequisites of many HC studies is the data availability of the pre-existing demands on the network [11]. Hosting capacity studies that use time series data were found to be accurate and realistic, although their main drawback is that few scenarios can be tested. However, time series analysis makes it possible to take the intermittent nature of distributed generation into account, while also including all correlations as they occur in reality, such as between solar power production and electric heating or cooling [12]. Analogously, HC studies that use a probabilistic approach for the maximum load or generation need to define distributions based on available data, not only for the generation, but also for the pre-existing demand. In [8], the introduction of EVs and HPs was considered in UK-based networks, and the HC was found to be strongly dependent on the initial demand, highlighting the need for accurate modeling of the a priori loads.
Consequently, the availability of residential smart meter data is of paramount importance for accurate HC studies on the LV grid. However, its availability is severely hindered due to privacy concerns [13]. In 2019, the EU amended its Electricity Directive to require smart meters to comply with the Union data protection and privacy rules [14], as residential smart meter data can be used to identify individuals. Therefore, metering data at higher resolution is classified as "personal data" in the EU, and needs to be in line with the EU's General Data Protection Regulation (GDPR).
This identification can take various forms. In [15], the introduced algorithm managed to identify over 90% of the presence and sleep cycles of household inhabitants, which can be used to infer holiday periods where the household is vacant. Similarly, [16] concludes that information about occupant behavior can be accurately estimated even with simple data-extraction algorithms. These techniques can lead to inferring religious beliefs of household occupants, for example if they get up much earlier during the month of Ramadan, which implies they are Muslim [17]. Furthermore, 30-minute resolution smart meter data already allowed researchers in [18,19] to accurately predict household characteristics such as the employment status of the occupants as well as their social class, the size of the dwelling, in addition to whether it is a single-person household or a family.
In addition to alleviating privacy-related concerns with respect to smart meter data, synthetic load models have been used in various applications, such as (i) balancing unbalanced datasets with respect to the minority class [20,21], (ii) evaluating the robustness of reinforcement learning algorithms [22], or (iii) augmenting their training datasets [23]. The research field of residential load modeling thus aims to provide input for other studies and bridge the gap caused by privacy issues or limited availability of residential smart meter data. This can range from creating synthetic profiles that can be shared with researchers and industrial partners, to automating the anonymization preprocessing of the measured data [24].
Residential load modeling tends to follow multiple steps. First, data is gathered and preprocessed. Consumers with similar properties are subsequently grouped together through a clustering algorithm, and finally synthetic profiles are generated. Our contribution to this research field is located in this final step, and is threefold:
1. An existing framework for extracting and superimposing high-frequency variability, previously only used for non-residential consumers and aggregated residential loads, is adapted so that it can be applied to residential consumers.
2. Using residential smart meter data provided by the DSO, this adapted framework is used to construct a data-driven residential load profile generator that is able to take the seasonal nature of residential consumption into account.
3. Numerical validation of the proposed methodology and benchmarking against the measurement data, as well as through a realistic EV hosting capacity case study on a LV network.
The rest of this work is structured as follows. Section 2 discusses previous work on load modeling and the introduction of realistic variability. Before discussing the methodology introduced in this work, the dataset used for its validation is introduced in Section 3. The considered method for extracting the variability of non-residential load profiles is subsequently discussed, and adapted for residential consumers in Section 4. The wavelet decomposition is first examined in-depth in Section 4.1, before it is applied to residential consumers in Section 4.2, and finally in Section 4.3 to introduce a load profile generator. The proposed methodology is validated in Section 5, before being applied to a hosting capacity case study and benchmarked compared to the original dataset in Section 6.

Related research
Residential load modeling approaches can broadly be divided into three categories: (i) bottom-up and (ii) top-down models, as well as (iii) hybrid models [24,25]. Bottom-up approaches aim to model the electricity demand of individual dwellings by modeling the usage of individual appliances, bearing in mind occupant behavior. Two such bottom-up load models are the CREST model for describing individual households [26] and the IDEAS and StROBe models for district-level simulations [27,28]. These models utilize time-use surveys as a basis for household occupancy and appliance usage.
In contrast, top-down models use the total energy or electricity consumption estimates to assign them to the characteristics of the building stock [25]. Herein, historical consumption datasets supplemented with information on the residents of different dwellings are often used as data source if the corresponding metadata is available and can be used. Such metadata can include (i) the number of residents in the dwelling, (ii) their age, (iii) which major appliances are present, (iv) the size of the dwelling and its isolation level, (v) income, etc. It is therefore unsurprising that such metadata is not often available with the raw consumption data, which can directly be attributed to privacy concerns [24].
Load profiling for top-down models that use historical datasets in the absence of such metadata generally includes three stages [29]. Initially, a clustering method is used to group consumers with similar consumption behavior. This can be done on the chronological load profiles themselves, or new application-specific features can be constructed from the measured data. In the second step, typical or synthetic load profiles are generated, often through statistical measures such as the mean or median. Finally, consumers' characteristics are inferred from the typical load profiles using available metadata.
However, residential households exhibit significant variation in when their daily peak load occurs [30,31]. Hence, taking the mean or median value of the clustered profiles during the second stage of the aforementioned load profiling methodology necessarily implies an averaging and smoothing process that leads to an underestimation of the real peak demand values in the typical load profiles [29]. Consequently, traditional typical load profiles are not suitable to model the impact of capacity-related measures that depend on the demand peaks, such as capacity-based tariffs. Therefore, research attention has shifted toward (i) taking the non-simultaneity of residential peak demands into account during the clustering process, or (ii) including realistic variability into the smoothed typical load profiles.
The clustering process traditionally uses a Euclidean distance between two time series to define dissimilarity. When grouping consumers based on chronological data, profiles with non-coincident peaks are thus not grouped together.
Several studies use Dynamic Time Warping (DTW) as an alternative dissimilarity measure. DTW can recognize profiles with similar shapes by finding the optimal alignment between two sequences, and captures flexible similarities by aligning the coordinates inside them [32].
Therefore, residential load profiles with similar shapes such as profiles with evening peaks have successfully been clustered together, even though these peaks are not concurrent [33].
An alternative to clustering based on chronological measurements is grouping consumers based on similar properties, called features. These features can be constructed in the time or frequency domain, or be metadata such as the number of household inhabitants, or be constructed based on additional datasets, such as weather data over the same time period and the correlation with outdoor temperature.
Features in the time domain can be constructed by performing simple operations on the chronological data. Four key time periods throughout the day were defined in [34]. Clustering involved features such as the relative average power, the mean relative standard deviation, as well as features describing the seasonal behavior and the difference between weekdays and weekends for each period. Similarly, [35] clustered households based on the peak times and the percentage of overlap for peaks in each daily load profile, in order to determine household suitability for demand response programs.
Features based on time domain information can be supplemented by properties related to the frequency domain. Frequency-based features give additional information on the cyclic behavior of the consumption profile. The Fourier transform was used in [36] to extract the harmonic behavior on a daily and weekly basis, upon which the consumers were grouped based on a set of newly constructed frequency domain features. Similarly, [37] introduced features that included information on the phase of the harmonics that significantly contributed to the reconstruction of the original profile.
However, [38] argues that while the Fourier transform is suitable to describe average or aggregated load behavior, it suffers major limitations on load characterization of volatile data. The wavelet transform offers a suitable alternative for data with stochastic variability as it is able to simultaneously represent signals in both time and frequency domain.
In [39], the coefficients obtained through the discrete wavelet transformation (DWT) were used as features. Furthermore, the DWT can be used to decompose signals into high-frequency (HF) and low-frequency (LF) components, where the HF component contains the stochastic load behavior.
This transformation has been used in [40] to extract the variability from load profiles of non-residential consumers, which were subsequently superimposed on modeled loads to obtain more realistic variability. Consequently, this method aims to compensate for the averaging that occurs during the generation of typical load profiles to obtain new profiles with representative variability. This framework was subsequently used in [41] to quantify the variability in aggregated building profiles based on 1-minute consumption data.
It is this framework for (i) extracting the HF variability through the discrete wavelet transformation and (ii) superimposing it on LF (modeled) load profiles that is adapted and expanded upon in this work.
The difference with the aforementioned works using the DWT to superimpose extracted variability is twofold. First, the researched target group of this work is individual residential consumers. In [40], the variability of schools was studied in-depth as a case study, while in [41] aggregated building profiles in the range of 100 kW were considered. Aggregated residential consumers at feeder level were studied through the DWT in [42], but the aggregation level limited the stochastic behavior in the HF component. Therefore, the framework introduced in [40] has not yet been applied and validated in-depth for individual residential consumers.
Second, instead of superimposing the extracted variability profiles on a modeled load obtained through a clustering process, this work superimposes HF variability of a single household on the LF approximation of a different residential consumer, allowing for the construction of a load profile generator based on a historical dataset. The resulting dataset has identical properties for the daily consumption and peak demand by design, while also exhibiting realistic variability.

Measurement data
The methodology discussed in this work is validated on a dataset obtained under a non-disclosure agreement (NDA) from the distribution grid operator in Flanders, Fluvius cvba. This historical dataset comprises consumption data of 1 422 consumers on the LV distribution grid in a suburban area for the year 2013, measured at 15-min resolution during one year, leading to 35 040 time points per consumer.
The metering infrastructure was installed during a proof-of-concept study on digital meters in Flanders during the period 2010-2014. As households spanning different generations and compositions participated in this study, the dataset can be considered sufficiently diverse for consumers on the LV distribution grid in Flanders. The following preprocessing steps were undertaken to obtain the final dataset: (i) households with PV installations were excluded from the dataset, as PV installations tend to be allocated post-hoc to individual households during HC studies, and (ii) only households with data for the full year 2013 were retained. The grid operator performed the preprocessing step of interpolating missing data where necessary.

Multi-resolution analysis through wavelets
The best-known frequency-based signal decomposition is the Fourier transform. The orthonormal basis of this Fourier decomposition comprises periodic sine and cosine functions with infinite support. However, as residential daily load profiles are highly non-stationary due to the occurrence of HF stochastic peaks, the Fourier transform is inefficient for this purpose: it would require a large number of harmonics to accurately describe the volatile peaks [38].
In contrast, wavelet-based decompositions are suitable to describe the considered load profiles. Wavelets are oscillatory functions with zero mean that have good time localization properties, as they decay in a limited time window. Wavelets can be considered as a family of functions constructed by both translating and dilating a single wavelet $\psi(t) \in L^2(\mathbb{R})$, called the mother wavelet. This is illustrated in Eq. (1), with $a$ the scaling parameter and $b$ the translation parameter:

$$\psi_{a,b}(t) = \frac{1}{\sqrt{|a|}}\,\psi\!\left(\frac{t-b}{a}\right) \tag{1}$$
Varying the scaling parameter $a$ allows wavelets to describe both high and low frequencies localized around the center $t = b$. Small values of $a$ result in very narrow windows corresponding to high frequencies, while large scales result in low-frequency signals. It is therefore possible to perform a multi-resolution analysis (MRA) on signals in $L^2(\mathbb{R})$ by using a basis of wavelets. The LF profile captures the base load, while the HF component captures the peaks.
Operationally, this decomposition process is illustrated in Fig. 1. The considered daily load profile is passed through a high-pass and low-pass filter.
Hence, the original signal is decomposed into a LF approximation $a_1$ and a HF detail $d_1$. Each decomposition step downsamples the signal by a factor of 2. A $k$-level decomposition of a signal $P(t)$ therefore decomposes the signal as follows:

$$P(t) = a_k(t) + \sum_{j=1}^{k} d_j(t) \tag{2}$$

The presence of a low-pass and high-pass filter in the MRA process can be explained by the mathematical framework as first introduced in [43]. Therein, a two-pronged approach is followed for the decomposition. Firstly, let a sequence of nested subspaces $\{V_{2^j}\}_{j \in \mathbb{Z}}$, with $V_{2^j} \subset V_{2^{j+1}}$, be a multi-resolution analysis of $L^2(\mathbb{R})$. It can be proven that a function $\phi(t) \in L^2(\mathbb{R})$ exists such that the set obtained through binary scaling and dyadic translation of $\phi(t)$,

$$\left\{\phi_{j,n}(t) = 2^{j/2}\,\phi(2^j t - n)\right\}_{n \in \mathbb{Z}} \tag{3}$$

forms an orthonormal basis of $V_{2^j}$. This function $\phi(t)$ is called the scaling function, also known as the father wavelet in the context of MRA. The approximation of the function $f(t)$ at the resolution $2^j$, also denoted as $A_{2^j}f$, can now be computed by decomposing the signal on this orthonormal basis, which can be shown to be equivalent to the presence of a low-pass filter:

$$A_{2^j}f = \left\{\left\langle f(t), \phi_{j,n}(t)\right\rangle\right\}_{n \in \mathbb{Z}} \tag{4}$$

However, if a signal is approximated at the resolutions $2^{j+1}$ and $2^j$, there is a difference of information between these two scales, which restricts the ability to reconstruct the original signal. This difference is called the detail function. Consequently, it is necessary to also determine the detail function to unambiguously reconstruct the original signal. This gives rise to a description equivalent to the high-pass filter included in Fig. 1.
The approximations at resolutions $2^j$ and $2^{j+1}$ are equal to the orthogonal projections of the original signal on $V_{2^j}$ and $V_{2^{j+1}}$. The detail function at resolution $2^j$ is then given by the orthogonal projection of the original signal on the orthogonal complement of $V_{2^j}$ in $V_{2^{j+1}}$. This orthogonal complement is denoted as $W_{2^j}$. It is now possible to construct an orthonormal basis for $W_{2^j}$ consisting of dilated and translated wavelets as introduced in Eq. (1). Similar to the set defined in Eq. (3), a binary scaling and dyadic translation is chosen to obtain such an orthonormal basis:

$$\left\{\psi_{j,n}(t) = 2^{j/2}\,\psi(2^j t - n)\right\}_{n \in \mathbb{Z}} \tag{5}$$

The detail $D_{2^j}f$ of the function $f(t)$ at resolution $2^j$ is similarly determined by the following set of inner products with the wavelet basis:

$$D_{2^j}f = \left\{\left\langle f(t), \psi_{j,n}(t)\right\rangle\right\}_{n \in \mathbb{Z}} \tag{6}$$

In this work, the Haar wavelet as given in Eq. (7) is used as the mother wavelet $\psi(t)$:

$$\psi(t) = \begin{cases} 1 & 0 \le t < \tfrac{1}{2} \\ -1 & \tfrac{1}{2} \le t < 1 \\ 0 & \text{otherwise} \end{cases} \tag{7}$$

Following the reasoning introduced in [38], the discontinuous nature of Haar's wavelet is suitable to describe the load profile of individual consumers, capturing the behavior of turning appliances on and off. Furthermore, the considered metering infrastructure logs averaged power demand with a resolution of 15 min, likewise leading to a discontinuous profile.
Additionally, the scaling function for the LF approximation is given in Eq. (8):

$$\phi(t) = \begin{cases} 1 & 0 \le t < 1 \\ 0 & \text{otherwise} \end{cases} \tag{8}$$

Using this framework, Fig. 2 illustrates a three-level DWT MRA on a residential load profile. The original load profile is measured at a 15-minute resolution. The MRA introduces subspaces with a 30-minute, 1-hour and 2-hour resolution to approximate the signal, yielding $a_1$, $a_2$ and $a_3$ respectively. The HF details are subsequently captured in the signals $d_1$, $d_2$ and $d_3$. The nested subspaces of the MRA, which translate to the downsampling by a factor of 2 for each decomposition step, can be observed. Fig. 2 additionally illustrates the main difficulty of constructing synthetic load profiles for residential consumers. The LF approximations fail to adequately capture the peak demands at increasing levels of decomposition. The LF approximation $a_3$ only captures 65% of the original daily peak demand.
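For the Haar wavelet, each DWT level reduces to pairwise averaging and differencing, so the three-level MRA described above can be sketched in plain NumPy. The profile below is a hypothetical 96-point daily load (15-minute resolution); all values are illustrative and not taken from the considered dataset.

```python
import numpy as np

def haar_step(x):
    """One Haar DWT level: pairwise sums (LF) and differences (HF), both scaled by 1/sqrt(2)."""
    return (x[0::2] + x[1::2]) / np.sqrt(2), (x[0::2] - x[1::2]) / np.sqrt(2)

# Hypothetical daily profile: smooth base load plus a sharp stochastic evening peak.
rng = np.random.default_rng(0)
profile = 0.3 + 0.2 * np.sin(np.linspace(0, 2 * np.pi, 96)) ** 2
profile[70:74] += rng.uniform(1.0, 2.0, 4)

# Three-level decomposition: each step halves the number of approximation samples.
a, details = profile, []
for _ in range(3):
    a, d = haar_step(a)
    details.append(d)  # d1 (48 pts), d2 (24 pts), d3 (12 pts)

# Reconstructing from a3 alone (details zeroed) yields the 2-hour block means.
lf = np.repeat(a / 2 ** 1.5, 8)  # 12 coefficients -> 96 samples
hf = profile - lf                # combined HF detail d1 + d2 + d3
print(f"original peak: {profile.max():.2f} kW, LF-only peak: {lf.max():.2f} kW")
```

Because the LF approximation is a block average, it necessarily underestimates the sharp peak, while the zero-mean HF component carries the remainder; this is the effect visible in Fig. 2.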
This observation for a single day for a single consumer can be extended to the considered dataset of residential load profiles. Fig. 3 visualizes the distributions of the annual peak demand as well as the mean monthly peak demand, compared to the original load profiles. Similar results can be observed: a three-level approximation tends to underestimate the annual and mean monthly peak demand by a factor of 2.
This significant contribution of the HF detail to the peak demand of residential consumers limits the possibility of using previously reported methodologies for superimposing extracted HF variability onto a LF approximation or modeled load without any modification. Previous works using the DWT MRA for load modeling either studied non-residential consumers [40,41] or aggregated load profiles at feeder level [42].

Superimposing residential variability
In [40,41], the extracted HF detail of the daily profile was normalized each time with respect to the daily peak value of the LF approximation. This normalized profile was subsequently rescaled before superimposing it onto a modeled load profile. While this was valid for non-residential consumers, this normalization scheme cannot be used for residential consumers, where the HF component significantly contributes to the peak demand. Consequently, a different approach has to be considered for the rescaling of the HF component. As opposed to the aforementioned normalization scheme, this framework starts from a daily peak demand $P_{\max}$, either measured or modeled. The research question then reduces to determining whether a combination of a LF approximation of consumer $i$ and a rescaled detail of consumer $j$ can consistently lead to a load profile with the predetermined peak value.
By construction, the daily consumption of the LF approximation of the original daily profile is equal to the daily consumption of the original daily load profile, as the sum of the oscillating detail functions is equal to 0. Hence, it is possible to scale and translate a detail function before superimposing it on an approximated load while, if desired, keeping the inherent correlation between the daily consumption and the daily peak demand intact. This will prove advantageous in Section 4.3.
Let $a_{i,t}$ now denote the LF approximated load of consumer $i$ on day $d$ at the $t$-th 15-minute interval of the day, while $\sum_k d^{(k)}_{j,t}$ is the sum of the HF detail values over the decomposition levels $k$ of consumer $j$ on the same day at that 15-minute interval. Keeping the LF approximation of consumer $i$ fixed allows the unambiguous determination of the scaling factor with which to rescale the HF detail of consumer $j$.
The daily HF detail of consumer $j$ can now correctly be rescaled to yield the desired peak demand in the combined load profile. However, there is no guarantee that the maximum values of the approximation signal and the rescaled detail function take place at the same time of the day. Therefore, a circular shift is performed on the HF profile in order to align its maximum value with the peak values of the approximation function. Analytically, this shift can be represented by the operator $S^n$.
Given an obtained HF detail profile as a function of time measured at $T$ timesteps, let $d_t$ denote the load at timestep $t$. For the considered dataset, the detail function is an array of 96 values: $\mathbf{d} = [d_1, d_2, \ldots, d_{96}]$. A circular shift is now defined as follows, where the operator $S^n$ denotes a shift of $n$ steps:

$$S^n \mathbf{d} = [d_{96-n+1}, \ldots, d_{96}, d_1, \ldots, d_{96-n}]$$

The circularly shifted HF signal can now be superimposed on the LF approximation of a different consumer, once the maxima of both coincide in time. The choice for a shifted HF detail is logical from the point of view of load modeling: it does not matter whether the variability from stochastic peaks occurs at, e.g., 6 PM or at 9 PM, as long as it is realistic. This framework is illustrated for the construction of a single load profile in Fig. 4, where the stochastic peaks of household $j$ during the evening are rescaled and shifted to the morning to construct a profile with a daily maximum demand equal to that of household $i$.
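The rescale-and-shift recombination can be sketched as follows. Here `lf_i`, `hf_j` and `p_max` are hypothetical stand-ins for the LF approximation of consumer $i$, the summed HF detail of consumer $j$, and the target daily peak demand; the alignment and scaling rule is a minimal interpretation of the scheme described above.

```python
import numpy as np

def recombine(lf_i, hf_j, p_max):
    """Align the HF maximum with the LF maximum, then rescale the HF detail
    so the combined 96-point daily profile reaches the target peak p_max."""
    shift = int(np.argmax(lf_i) - np.argmax(hf_j))  # circular shift operator S^n
    hf_shifted = np.roll(hf_j, shift)
    beta = (p_max - lf_i.max()) / hf_shifted.max()  # scaling factor for the detail
    return lf_i + beta * hf_shifted

# Hypothetical inputs: a 2-hour-resolution LF approximation and a zero-sum HF detail.
lf_i = np.repeat([0.2, 0.3, 0.4, 0.5, 0.9, 0.7, 0.5, 0.3], 12)
hf_j = np.zeros(96)
hf_j[18:22] = [0.5, 1.2, -0.9, -0.8]

synthetic = recombine(lf_i, hf_j, p_max=2.0)
print(round(synthetic.max(), 2))  # -> 2.0
```

Because the HF detail sums to zero, the daily consumption of the result equals that of the LF approximation, while the peak is set by the rescaled detail. A full implementation would additionally reject combinations that yield negative loads.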
The choice of only considering HF profiles of the same day, combined with a circular shift, is somewhat arbitrary. An alternative would be to create a taxonomy of daily variability profiles that abstracts away the day each profile corresponds to, after which it is no longer required to superimpose a detail function of the same day on a LF approximation. Similar considerations were made in [40] when choosing between a day-by-day and a year-by-year modeling approach. Given the interest in generating daily load profiles in this work, the circular shift approach was introduced to generate viable synthetic profiles for each day while only considering inputs from the same day.
Two additional aspects of this framework need to be discussed, as they have a computational impact. First, the HF detail, being an oscillating function, has negative values. Consequently, it is possible that combinations of rescaled HF and LF functions will have negative load values. In the absence of local generation, this is a non-viable solution. Accordingly, only synthetic profiles that have non-negative loads will be considered; should negative values occur, a different HF detail and/or LF approximation needs to be selected. The impact of this constraint is discussed in-depth in Section 5.1. Second, the LF approximations of the considered load profiles have a resolution of 2 h, and as such they exhibit 8 identical maximum values. Consequently, there are 8 possible solutions for the operator $S^n$ to align the maximum values of the LF and HF arrays. Therefore, the obtained synthetic profile will not be a unique combination of the two arrays. If a single profile needs to be obtained, this selection can happen randomly out of the viable solutions.
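A minimal sketch of these two checks, with hypothetical arrays: the 8 candidate alignments within the 2-hour maximum block of the LF approximation are enumerated, recombinations with negative loads are discarded, and one viable profile is picked at random.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical inputs: a 2-hour-resolution LF approximation (12 blocks of 8
# identical 15-min values) and a zero-sum HF detail with one pronounced peak.
lf = np.repeat([0.2, 0.2, 0.3, 0.4, 0.4, 0.5, 0.9, 0.6, 0.5, 0.4, 0.3, 0.2], 8)
hf = np.zeros(96)
hf[30:34] = [1.0, 0.8, -0.9, -0.9]

candidates = np.flatnonzero(lf == lf.max())     # the 8 tied LF maxima
viable = []
for pos in candidates:
    shifted = np.roll(hf, pos - np.argmax(hf))  # align the HF peak with position pos
    profile = lf + shifted
    if profile.min() >= 0:                      # discard negative-load solutions
        viable.append(profile)

print(f"{len(viable)} viable profiles out of {len(candidates)} candidate shifts")
synthetic = viable[rng.integers(len(viable))]   # random pick among viable solutions
```

In this example, shifts that push the negative HF excursions past the LF maximum block drive the load below zero and are rejected, illustrating why not all 8 alignments are viable.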

Stochastic load profile generator
The framework introduced in Section 4.2 can now be used to construct a larger dataset of synthetic load profiles. As suggested in Section 2, one of the main shortcomings of traditional load modeling of residential consumers is related to the second step: the generation of typical load curves for different types of consumers through statistical measures.
As a result of the proposed methodology, usage of the mean or median value and the subsequent smoothing process to obtain a single typical curve can be avoided. Instead, we aim to approximate the original dataset in a distributional sense: the intended output is a dataset that approaches the real distribution of peak demands in the original dataset.
The block diagram to construct a single stochastic daily load profile of day $d$ is given in Fig. 5. This method can be repeated as many times as desired to obtain a dataset of daily load profiles of a predetermined size. Based on the measured load profiles, a 2-dimensional probability density function (PDF) is estimated between the daily load factor and the daily consumption. The sampled consumption is used to rescale a LF profile of consumer $i$, while the load factor determines how the HF profile of consumer $j$ has to be rescaled and shifted.
In the first step of the block diagram, the distribution is defined that will be approximated and sampled. This distribution is estimated based on a dataset of load profiles. A bivariate distribution is selected, as this offers a twofold advantage. First, the inherent correlation between the consumption and the peak demand can be included. Second, sampling a 2D-distribution yields two parameters, allowing for the rescaling of both the HF and LF profile.
The first parameter is the daily consumption, as the LF approximation can be rescaled to this value. While it is not strictly necessary to rescale the approximation function, it does allow for the introduction of a continuous spectrum of values for the daily consumption, instead of limiting the constructed profiles to the discrete values in the historical dataset. Furthermore, by taking the approximation profile closest to the sampled value, the realism of the obtained profile is assured. For example, a profile with 4 kWh daily consumption will not be upscaled to an unrealistic 30 kWh profile.
Second, a metric needs to be chosen to rescale the HF component. Intuitively, it seems logical to use the daily peak demand of the original data as the metric to rescale the HF detail. However, computational concerns have led to the choice of the daily load factor as the second metric. The daily load factor of consumer $i$ on day $d$ is denoted as $L_{i,d}$, and is given by the ratio of its mean daily load to its maximum load on day $d$. Let $P_{i,d,t}$ denote the measured load of consumer $i$ on day $d$ at the $t$-th 15-minute interval of the day. The daily load factor is then given by:

$$L_{i,d} = \frac{\frac{1}{96}\sum_{t=1}^{96} P_{i,d,t}}{\max_t P_{i,d,t}} \tag{11}$$

The load factor for residential consumers was introduced in [44] as a way to describe the "peakyness" of their consumption behavior. A high $L_{i,d}$ corresponds to demand that is distributed evenly throughout the day, while a low $L_{i,d}$ is indicative of intervals of high demand compared to the base load. From Eq. (11), the daily peak demand can be calculated if both the daily load factor and the daily consumption are known. Fig. 6 displays the probability distributions of the daily load factor and the peak demands for the full dataset for a single winter day. Both parameters are non-normally distributed and display a long tail. A continuous PDF can be estimated through kernel density estimation (KDE) using a Gaussian kernel. However, the main drawback of this approach occurs for densities exhibiting long tails, as the KDE technique tends to oversmooth the longer tails [45].
Consequently, the load factor is preferable over the peak demand for this approach, as stochastic peak demands correspond to low $L_{i,d}$ values located in the bulk of the load factor distribution, whereas they are located in the long tail of the peak demand distribution, so their density would be underestimated if that parameter were used. Furthermore, the load factor was used in [30] as a metric for the comparison between a bottom-up load model and a historical dataset. They concluded that the synthetic data underestimated the seasonal effect in both the load factor and peak demand behavior, and that "limited treatment of seasonal variation in load modeling can lead to inaccurate predictions of its effects".
Therefore, by limiting the construction of the daily load profiles to each separate day based on its 2D KDE, the seasonality of the historical dataset is preserved. As displayed in Fig. 7, the KDE shows significant differences depending on the season, despite showing similar values of maximum probability. The left-hand side of Fig. 7 displays a summer day, while the right-hand side shows a winter day. The KDE of the summer day exhibits a long tail toward high load factors at lower consumption that is absent during winter days, which indicates the presence of vacant households where families are on holiday.
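The sampling step can be sketched with SciPy's Gaussian KDE. The observations below are synthetic stand-ins for the per-household daily consumption and load factor of a single day, not the actual dataset; the back-calculation of the peak follows the load factor definition in Eq. (11).

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(2)

# Hypothetical per-household observations for one calendar day d:
# daily consumption E (kWh) and daily load factor L (mean load / peak load).
energy = rng.lognormal(mean=2.1, sigma=0.5, size=1000)
load_factor = np.clip(rng.beta(2, 6, size=1000), 0.02, 1.0)

# Estimate the bivariate PDF with a Gaussian kernel and draw new (E, L) pairs.
kde = gaussian_kde(np.vstack([energy, load_factor]))
e_s, l_s = kde.resample(5, seed=3)
l_s = np.clip(l_s, 0.02, 1.0)  # guard against out-of-range KDE samples

# Back-calculate the daily peak demand: mean power is E / 24 h, peak = mean / L.
peak = (e_s / 24.0) / l_s
print(np.round(peak, 2))
```

Each sampled pair thus fixes both the consumption used to rescale the LF approximation and the peak demand used to rescale the HF detail. Fitting one KDE per calendar day, as done in this work, is what carries the seasonal differences of Fig. 7 into the generator.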
For the remainder of this work, the introduced methodology is applied to the full dataset of consumers unless otherwise stated. However, an identical approach can be used if only a subset of the complete dataset is considered. One possible application is using a preprocessed dataset as input, with labels corresponding to clusters obtained through unsupervised learning on features or metadata. By limiting the construction and sampling of the 2D KDE of the load factor and daily consumption to elements contained in the cluster, and restricting the sampling of both the LF approximation and HF details to elements within that cluster, more targeted output can be obtained.
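The per-day 2D KDE sampling described above can be sketched with SciPy; the load factor and consumption samples below are synthetic stand-ins for one calendar day of a historical dataset:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# Hypothetical per-household samples for one calendar day:
# daily load factor L in (0, 1] and daily consumption E in kWh.
L = rng.beta(2, 5, size=500)            # placeholder load factors
E = rng.gamma(4.0, 2.5, size=500)       # placeholder consumptions (kWh)

# Fit a 2D KDE with a Gaussian kernel on this single day only, so the
# seasonality of the historical data is preserved by construction.
kde = gaussian_kde(np.vstack([L, E]))

# Draw (L, E) pairs for new synthetic profiles; the implied daily peak
# demand follows as P_max = E / (24 * L).
L_new, E_new = kde.resample(200, seed=1)
mask = (L_new > 0) & (L_new <= 1) & (E_new > 0)   # reject invalid samples
P_max = E_new[mask] / (24 * L_new[mask])
```

Rejection of out-of-range samples is needed because the Gaussian kernel smooths probability mass slightly beyond the physical bounds of the load factor.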

Results
The output of the stochastic load profile generator is as many individual daily load profiles as desired for a given day $d$. Eight randomly generated profiles are shown in Fig. 8. The LF approximations are once again shaded in gray, while the superposition of the LF and HF profiles is displayed in blue.
The majority of the generated profiles behave as expected, featuring the stochastic behavior that is often associated with residential consumers, with morning, noon or evening peaks clearly visible in the HF component of the consumption profiles. By construction, the HF component is able to capture a large fraction of the daily peak demand for the highly stochastic profiles.
However, the introduced methodology exhibits drawbacks that can lead to unrealistic profiles in two specific cases, namely cases (g) and (h) in Fig. 8. First, subplot (g) displays a vacant household with a low base load. For such households, the inclusion of the HF details can lead to unrealistic cyclic behavior on top of the small base load.
Second, profile (h) shows a case where the LF approximation already captures more than 95% of the daily peak demand. The addition of an HF detail therefore has very little impact on the overall shape of the resulting profile, and the synthetic profile displays a stable consumption for every block of two hours. This is a result of the choice of wavelet and decomposition level: a lower decomposition level or a more variable wavelet for the DWT-MRA would lead to a less constant consumption profile in this case.
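The underlying DWT-MRA split can be sketched with PyWavelets; the daily profile below is a hypothetical stand-in at 15-minute resolution (96 samples, which is exactly divisible for a level-3 Haar transform):

```python
import numpy as np
import pywt

rng = np.random.default_rng(42)

# Hypothetical 96-sample (15-min resolution) daily load profile:
# a smooth base load plus stochastic spikes.
base = 0.3 + 0.2 * np.sin(np.linspace(0, 2 * np.pi, 96))
load = base + rng.exponential(0.15, 96)

level = 3   # as in the text; level 1 leaves more variability in the HF part

# Multi-resolution analysis: keep only the level-3 approximation
# coefficients for the LF component, and only the detail coefficients
# for the HF component.
coeffs = pywt.wavedec(load, "haar", level=level)
approx = [coeffs[0]] + [np.zeros_like(c) for c in coeffs[1:]]
detail = [np.zeros_like(coeffs[0])] + list(coeffs[1:])

lf = pywt.waverec(approx, "haar")   # coarse base load (blocks of 2^3 samples)
hf = pywt.waverec(detail, "haar")   # zero-mean high-frequency variability

recon_err = np.abs(lf + hf - load).max()   # the MRA is an exact additive split
```

With the Haar wavelet at level 3, the LF component is piecewise constant over blocks of eight 15-minute intervals, which is exactly the two-hour block behavior visible in case (h).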
In order to investigate whether the implemented methodology behaves as expected at the population level, 200 profiles were generated using a subset of households with an annual consumption between 2,750 and 3,750 kWh. This consumption range is centered around the mode of the distribution of the yearly consumption and is representative of typical Flemish households without electric heating [46].
The distributions of the daily peak demands throughout the year are visualized in Fig. 9 for the original dataset as well as for the synthetic dataset. Both the mean value of each day and the 25th-75th percentile band are displayed for each dataset.
In order to evaluate the performance of the synthetic dataset, the time series of the mean daily peak demand is compared with that of the original dataset. Table 1 displays three metrics. The mean absolute error (MAE) is 0.09 kW, while the mean percentage error (MPE) is −2.9%. The proposed methodology yields the expected results, with only a slight deviation from the original dataset.
However, as visible in Fig. 9 and reflected in the negative MPE, the synthetic model leads to a slight underestimation of the mean daily peak demand. This underestimation can be attributed to original profiles with a high daily peak demand. These profiles are harder to reconstruct, as discussed in the following subsection.

Table 1: Performance evaluation of the synthetic data in Fig. 9.
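For reference, the error metrics used in this benchmark can be written out explicitly; the two arrays below are illustrative toy values, not the reported results:

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    return np.abs(np.asarray(y_pred, float) - np.asarray(y_true, float)).mean()

def mpe(y_true, y_pred):
    """Mean percentage error; negative values signal underestimation."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100 * ((y_pred - y_true) / y_true).mean()

def mape(y_true, y_pred):
    """Mean absolute percentage error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100 * np.abs((y_pred - y_true) / y_true).mean()

# Toy example: synthetic peaks slightly underestimating the originals.
orig_peaks = np.array([3.0, 2.8, 3.2, 2.9])     # kW, hypothetical
synth_peaks = np.array([2.9, 2.75, 3.1, 2.85])  # kW, hypothetical
```

A systematically low synthetic peak series yields a negative MPE, as observed in Table 1, while the MAE and MAPE remain sign-agnostic.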

Viable combinations of households
In this subsection, a major concern regarding the representativity of the proposed load profile generator is investigated. The generator hinges on combining LF and HF profiles of different households. However, as discussed in Section 4.3, not all combinations under the proposed methodology yield valid synthetic profiles. As the HF profile takes negative values, the rescaling of this component can lead to negative values in the synthetic profile, which is a non-viable solution.
In order to quantify the number of viable combinations, the original daily load profiles in the historical dataset are considered. For each household $i$, the daily load profile on day $d$ is decomposed into an LF approximation and an HF detail function for a given decomposition level. The detail functions of every other household are rescaled and shifted according to Section 4.2, and subsequently superimposed on the approximation of household $i$ to obtain synthetic profiles with the same daily consumption and peak demand. The percentage of other households that lead to synthetic profiles without negative values was determined for each household $i$ and day $d$.
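The viability count can be sketched as follows; note that the actual rescaling and shifting of Section 4.2 is not reproduced here, so the single `scale` factor is a simplified stand-in, and all profiles are synthetic placeholders:

```python
import numpy as np

def viable_fraction(lf_own, hf_others, scale=1.0):
    """Fraction of other households' (rescaled) HF details that can be
    superimposed on this household's LF approximation without producing
    negative demand anywhere in the day.

    lf_own    : (96,) LF approximation of the target household
    hf_others : (n, 96) zero-mean HF details of the other households
    scale     : simplified stand-in for the rescaling of Section 4.2
    """
    synthetic = lf_own[None, :] + scale * hf_others
    return float((synthetic.min(axis=1) >= 0).mean())

rng = np.random.default_rng(3)
lf_own = np.full(96, 0.5)                        # flat 0.5 kW base load
amps = rng.uniform(0.1, 1.0, size=50)            # per-household HF amplitude
hf_others = amps[:, None] * rng.standard_normal((50, 96))
hf_others -= hf_others.mean(axis=1, keepdims=True)  # details are zero-mean

frac = viable_fraction(lf_own, hf_others)
```

Larger HF amplitudes relative to the base load push more combinations below zero, which is exactly why highly stochastic profiles have fewer viable partners.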
The results of this analysis are visualized in Fig. 10. The mean of the distribution for each day is displayed together with the 25th-75th percentile band. On average, 25%-30% of the households in the dataset can be used to reconstruct a synthetic load profile according to the aforementioned boundary conditions if the DWT decomposition level is 3, while this is nearly 45% for decomposition level 1. A seasonal effect is visible, which can be attributed to more households being vacant during the holiday period and their profiles therefore being less stochastic.
While Fig. 10 is informative on the general behavior of the number of valid combinations, the variance of the displayed distributions is significant, as seen from the large interquartile range. Consequently, it is necessary to investigate which profiles have a lower number of valid combinations and whether this affects the proposed methodology. This is done in Fig. 11, where the percentage of profiles that lead to a valid synthetic profile is displayed as a function of the percentile of the daily peak demand of each individual consumer: the maximum daily peak demand of each consumer corresponds to percentile 100, while the minimum corresponds to percentile 0.
As the daily peak demand percentile increases, fewer profiles tend to lead to valid combinations. This is consistent with the assumption that higher peak demands are the result of more stochastic behavior, and thus need to be captured by a larger HF detail. The daily load profile with the maximum peak demand of each household can only be reconstructed by 15% of the other profiles at decomposition level 1, while this drops to 4% at decomposition level 3.
Figs. 10 and 11 highlight the trade-off between anonymization and retaining enough profiles to form sufficient valid combinations. The higher the decomposition level of the DWT-MRA, the more anonymized the coarse LF approximations will be. However, the price for this increased anonymization is a diminishing number of HF details of other households that can be combined with a given approximation to yield valid synthetic profiles.
However, the decomposition level is not the only factor impacting this anonymization process. Note that the results displayed thus far were obtained with the Haar wavelet. In [40], the choice of wavelet in the DWT-MRA decomposition was investigated for 34 different wavelets. It was found that the Haar wavelet, there denoted as the Daubechies 1 (db1) wavelet, tends to lead to a decomposition with the highest variability in the HF profile.
To illustrate this trade-off, Fig. 12 recapitulates the findings from Fig. 3, but this time for three different types of wavelets: the Haar, sym5 and db5 wavelets. According to [40], the sym5 wavelet contains less variability in the HF detail than the Haar wavelet, while the db5 contains even less than the sym5 wavelet. This means that the LF approximation contains more variability for the db5 than for the sym5 wavelet. The consequence is shown in Fig. 12: wavelets with less variability contained in the HF detail tend to capture the peak demands better in their LF approximation.
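This comparison can be sketched by computing, for a synthetic peaky profile, the share of the daily peak already captured by the LF approximation under each wavelet; the input profile is a hypothetical placeholder:

```python
import numpy as np
import pywt

rng = np.random.default_rng(7)
load = 0.4 + rng.exponential(0.2, 96)   # hypothetical peaky daily profile
load[70:72] += 3.0                      # pronounced evening peak

def lf_peak_fraction(profile, wavelet, level=3):
    """Share of the daily peak captured by the LF approximation alone."""
    coeffs = pywt.wavedec(profile, wavelet, level=level)
    # Zero all detail coefficients and reconstruct the approximation.
    approx = [coeffs[0]] + [np.zeros_like(c) for c in coeffs[1:]]
    lf = pywt.waverec(approx, wavelet)[:len(profile)]
    return lf.max() / profile.max()

fractions = {w: lf_peak_fraction(load, w) for w in ("haar", "sym5", "db5")}
```

For the Haar wavelet, the level-3 LF component is a block average over eight intervals, so it necessarily flattens a sharp peak; smoother wavelets tend to retain more of it in the approximation, which is the effect shown in Fig. 12.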
The limiting factor for the representativity of the output of the stochastic profile generator is the number of valid combinations for the largest percentiles of the daily peak demands. In the absence of a larger dataset, the methodology initially points toward reducing the decomposition level in the DWT-MRA. However, using a less variable wavelet than the traditional Haar wavelet can also lead to more valid combinations.

EV hosting capacity case study
In order to investigate the usability of the generated stochastic load profiles in hosting capacity studies, their performance is benchmarked in an illustrative real-life case study. Here, the impact of residential EV charging on the LV distribution network is considered. The network is representative for Flanders of a semi-urban area with a balanced mix of detached and semi-detached housing units. The considered grid is shown in Fig. 13; a summary of its specifications is given in Table 2.
Each household is assigned a load profile and an EV charging profile. Similar to Section 5, only households with an annual consumption between 2,750 kWh and 3,750 kWh are considered for the original dataset. These profiles are subsequently used to construct a dataset of similar size with synthetic load profiles.
The EV charging is assumed to occur at 3.7 kW. The EV charging profiles are generated from the density distributions of real-world arrival and departure times, obtained from ElaadNL [47]. The dynamic power charging curves are derived from [48]. These EV charging profiles are subsequently superimposed on the load profile assigned to that household. Furthermore, it is assumed that the housing units are connected symmetrically to the distribution cable (i.e., housing unit 1 is connected to L1-N, unit 2 to L2-N, unit 3 to L3-N, unit 4 again to L1-N, etc.).
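The symmetric phase assignment and the superposition of an EV session on a household profile can be sketched as follows; the session timing and the household profile are illustrative assumptions, not the ElaadNL distributions:

```python
import numpy as np

def assign_phase(unit_number):
    """Symmetric phase assignment: unit 1 -> L1, unit 2 -> L2,
    unit 3 -> L3, unit 4 -> L1, ... (units numbered from 1)."""
    return "L{}".format((unit_number - 1) % 3 + 1)

rng = np.random.default_rng(11)

# Hypothetical household load (kW, 96 x 15-min intervals).
household = 0.3 + rng.exponential(0.2, 96)

# One 3.7 kW charging session from 18:00 (interval 72) to 22:00
# (interval 88); a stand-in for a sampled arrival/departure pair.
ev = np.zeros(96)
ev[72:88] = 3.7

total = household + ev      # superimposed profile fed to the power flow

phases = [assign_phase(u) for u in range(1, 7)]
```

In the full study, the arrival and departure intervals would be drawn from the ElaadNL density distributions instead of being fixed.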
Simulations are performed within an OpenDSS-Python environment: the distribution network is modeled in OpenDSS [49], while the actual power flow analysis is performed in Python through the OpenDSS COM interface, as presented in [50,51]. The method adopted for the modeling of the cables is described in [52]. Results are obtained through a steady-state power flow analysis performed at every timestep. To compare the performance of the measured and synthetic datasets, the voltage level is calculated every 15 min for each individual household over the course of one year. The resulting voltage distributions are compared in Table 3 by means of the MAPE and MAE.
As seen by visual inspection of Fig. 14, as well as in Table 3, the synthetic dataset yields nearly identical results to the original data, proving its suitability for EV-related hosting capacity studies. However, the proposed methodology has several limitations that currently restrict its application area. The next section elaborates on these limitations and on future research.

Limitations
The introduced methodology stochastically generates daily load profiles. While these profiles lead to valid results for EV-related HC studies such as the one presented in Section 6, they cannot yet be used as input for other time-sensitive analyses.
This constraint stems from the output of the methodology being daily load profiles, whereas annual consumption profiles are often needed, e.g. for PV HC studies. Annual consumption profiles exhibit a certain autocorrelation.
Naively combining the stochastic daily load profiles at random to obtain annual load profiles would misrepresent this autocorrelation. This is illustrated in Fig. 15, where the autocorrelation function (ACF) of the time series of the daily consumptions is plotted for a household in the original dataset as well as for a synthetic profile. The synthetic profile does not contain any significant lags in the ACF, in contrast with the measured profile. The combination of stochastic daily profiles with realistic autocorrelation will be the subject of a future research paper.
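The effect can be illustrated with a toy experiment: a series of daily consumptions with a weekly rhythm retains significant ACF lags, while a random recombination of the same days does not. The series below is a synthetic stand-in, not data from the study:

```python
import numpy as np

def acf(x, max_lag):
    """Sample autocorrelation function of a 1D series, lags 0..max_lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    var = np.dot(x, x)
    return np.array([np.dot(x[:len(x) - k], x[k:]) / var
                     for k in range(max_lag + 1)])

rng = np.random.default_rng(5)
days = np.arange(365)

# "Measured" household: weekly rhythm plus noise -> significant ACF lags.
measured = 8 + 2 * np.sin(2 * np.pi * days / 7) + rng.normal(0, 0.5, 365)

# Naive random recombination of the same daily consumptions:
# the between-day structure is destroyed.
shuffled = rng.permutation(measured)

acf_measured = acf(measured, 14)
acf_shuffled = acf(shuffled, 14)
```

The measured series shows a strong peak at the weekly lag, while the shuffled series hovers near zero at all nonzero lags, mirroring the contrast in Fig. 15.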

Conclusion
This article presents a data-driven stochastic load profile generator for residential consumers. Using the discrete wavelet transformation and a multi-resolution analysis, privacy-sensitive load profiles can be decomposed into high-frequency details and a low-frequency approximation. The high-frequency component of the load profile corresponding with one household can be rescaled and shifted, and subsequently superimposed on the approximated profile of a different household.
This yields a synthetic load profile with a given daily peak demand and daily consumption. By sampling the two-dimensional distribution of (i) the daily consumption and (ii) the daily load factor of the original dataset, their relation is preserved in the resulting stochastically generated profiles.
The validation of the proposed stochastic load profile generator was twofold. First, the generated profiles were benchmarked against the original dataset with respect to the daily peak demand behavior. The seasonal behavior in the original data is preserved by limiting the generator to the daily level. Furthermore, the distribution of the synthetic daily peak demands showed an MAE of 0.09 kW, corresponding to a MAPE of 4.1%. Second, the generated stochastic load profiles were applied to a real-life hosting capacity study for residential charging of electric vehicles on an LV feeder. The synthetic data demonstrated voltage drops with a mean absolute error of less than 0.2 V for the 10th and 90th percentiles when benchmarked against the real voltage level distribution. These results indicate the suitability of the proposed methodology for practical applications at the daily level.
Two restrictions of the introduced methodology were discussed. The first concerns the reconstruction of daily load profiles with a high peak demand. Such profiles tend to have a significant contribution of the high-frequency component to their peaks, which leaves fewer other households that can be used to construct a synthetic profile with a similar peak demand. This highlights the trade-off between the anonymization of the privacy-sensitive data and the recombination process involved in the load profile generation. Second, daily load profiles as output cannot be used for all hosting capacity studies. The autocorrelation of a random sequence of synthetic daily profiles is inaccurate, as measured profiles contain a high degree of autocorrelation between their daily consumptions. This second limitation will be addressed in a follow-up study.

Declaration of competing interest
The authors declare no conflict of interest.