Data-driven modelling of energy demand response behaviour based on a large-scale residential trial

• An in-depth analysis of the response behaviour of a large number of households. • Analysis of the temporal structure of response behaviour. • Data-driven modelling of DR response based on household and appliance data. • Study of the effects of important households’ characteristics on response be-

a b s t r a c t Recent years have seen an increasing interest in Demand Response (DR), as a means to satisfy the growing flexibility needs of modern power grids. This increased flexibility is required due to the growing proportion of intermittent renewable energy generation into the energy mix, and increasing complexity in demand profiles from the electrification of transport networks. Currently, less than 2% of the global potential for demand-side flexibility is currently utilised, but a more widespread adoption of residential consumers as flexibility resources can lead to substantially higher utilisation of the demand-side flexibility potential. In order to achieve this target, acquiring a better understanding of how residential DR participants respond in DR events is essential -and recent advances in novel machine learning and statistical AI provide promising tools to address this challenge. This study provides an in-depth analysis of how residential customers have responded in incentive-based DR, utilising household-related data from a large-scale, real-world trial: the Smart Grid, Smart City (SGSC) project. Using a number of different machine learning approaches, we model the relationship between a household's response and household-related features. Moreover, we examine the potential effects of households' features on the residential response behaviour, and highlight a number of key insights which raise questions about the reported level of consumers' engagement in DR schemes, and the motivation for different customers' response level. Finally, we explore the temporal structure of the response -and although we found no supporting evidence of DR responders learning over time for the available data from this trial, the proposed methodologies could be used for longer-term longitudinal DR studies. Our study concludes with a broader discussion of our findings and potential paths for future research in this emerging area.

Introduction
The increasing proportion of renewable energy resources and the growing adoption of new variable load types (e.g. Electric Vehicles) in the energy mix poses new challenges to electricity grids [1] . The effective operation of power systems demands them to function within a range of specific acceptable values for frequency and voltage. This requires the electricity demand and supply to be balanced at all times in power systems. Traditionally system operators have relied solely on solutions such as expensive network reinforcements or keeping conventional power plants open, and spinning reserve to provide reserve capacity.
An alternative approach for addressing these challenges is electricity demand response (DR) [2] . DR is regarded as a potential tool that can provide the necessary flexibility for mitigating the intermittency of renewable energy generation and the lower predictability of future loads [1] . DR solutions allow grid operators to maintain the power grid generation and load balance at a low cost, while avoiding or delaying the need for costly reinforcements of the power networks, or investing in additional back-up generation.
According to a 2020 IEA report [3] , less than 2% of the global potential for demand-side flexibility is currently being utilised. The predominant sources of demand-side flexibility are industrial thermal loads and processes, thermal comfort in buildings (both residential and nonresidential), charging of electric vehicles and on-site generation and energy storage [3] . This demand-side flexibility can be offered by Virtual Power Plants (VPPs), demand response providers and prosumers [2] . Demand response has been identified as a key component for providing this needed flexibility across the research literature. Many studies have explored the effectiveness and performance of demand response schemes, and they have provided evidence to support that demand response schemes (e.g. dynamic pricing schemes) can be used to provide the necessary flexibility [4][5][6][7] .
The demand response base is relatively low and the worldwide inventory of flexible assets in the residential, commercial and industrial sectors needs to grow multiple times higher than it is today [3] . For the widespread adoption of demand response programmes there is the need for participation in DR from distributed residential consumers, which have a huge demand response potential. However, the diversified, heterogeneous and distributed nature of the residential DR assets makes it more challenging for service providers (i.e. demand-side aggregators) to participate in the electricity markets with demand-side aggregated resources [8] .
Accurately estimating the response behaviour of DR participants and understanding the drivers behind their response is essential for addressing these challenges. An accurate forecast of response behaviour in DR schemes can decrease the uncertainty related to the available flexibility, and is directly related to the reliability of the services provided to the electricity grids. An enhanced understanding of customers' response behaviour can aid researchers, retailers and system operators to design more accurate and precise pricing or incentive mechanisms for DR programmes. The design of a DR mechanism not only affects the profitability of the participating entities, but also the success of the DR scheme. Energy big data analysis and other data-driven approaches can be a powerful tool for studying the consumers' response behaviour [9] .
In this work, we base our analysis on datasets from the Smart Grid Smart City (SGSC) trial project data [10] . The SGSC project is one of the world's largest residential DR trials, running for almost 2 years. During the duration of the trial project there were numerous product bundles offered to the participating consumers [11] . The product bundles were a combination of pricing/incentive schemes (e.g. dynamic peak pricing, dynamic peak rebate, seasonal time-of-use, etc.) and usage feedback technologies. For a more in-depth description of the SGSC trial project the reader can refer to the project's reports and the work of Motlagh et al. [12] .
The focus of this work, within the larger SGSC trial, is on the incentive-based dynamic peak rebate (DPR) scheme provided by the network trial partner. In this programme, participating customers were encouraged to reduce their electricity consumption by receiving rebate incentives. These customers were notified of the DPR events via SMS notifications 24 hours in advance of the event to request their participation [11] . All the participants in this DR scheme were residential/domestic customers.

Contributions and study outline
Our work provides new and complimentary insights, which can be used to augment prior research and brings a more complete picture of demand response behaviour, and the drivers behind it. Most studies reported in existing literature use household load data (energy consumption data at different points in time) to predict response behaviour (reduction of consumption as a response to a signal) of households, and do not consider households' characteristics. In this study, we use household-related features such as the type of heating system, internet connection, and air-conditioning units to assess their response behaviour. The main contributions of this work can be summarised as: • In-depth analysis of the response behaviour of a large number of households based on household and appliance data, from a large real-world trial. • Prediction of the households' response behaviour based only on household-related data, with accuracy in line with expectationsgiven results from previous studies on prediction of response behaviour focused on load data. • Analysis of the temporal structure of the response behaviour and the responders' ability to learn over time. • A detailed and interpretable study of the effects of the various households characteristics and appliances used, and their influence on energy reductions achieved.
Moreover, our research tries to promote more transparent and reproducible research by using public demand-response data and making available online the code needed to reproduce the exact results as a public GitHub repository 1 . Through this study we hope to improve the modelling and the understanding of DR response behaviour, for better planning and decision-making in the demand response domain.
The remainder of this paper is structured as follows. We start by presenting an overview of the data-driven methods utilised in this paper and a literature review of the data-driven approaches used for the study of demand response behaviour in Section 2 . That way we provide the context for our approach and we position our work to the existing literature. We then perform modelling and analysis of the response behaviour, by first exploring the available response-related datasets. The first step is focused on "cleaning " the datasets and identifying potential characteristics -as well as issues with the data -, utilising existing stateof-the-art exploratory data analysis and statistical tests in Section 3 . We also explored the possibility of DPR responders learning over time, and searched for temporal patterns of the response. Subsequently we applied machine/statistical learning techniques to model the relationships between the households' response and household-related features in Section 4 . Having modelled the relationship between the households' features and how their response to DR events, we then try to pinpoint the important features of this relationship and we also study and discuss the effect of these features on the households' response under the DR scheme. Section 5 concludes this work with a discussion and potential extensions of this study.

Data-driven approaches for demand response behaviour
In principle, the problem of modelling the response of consumers under DR schemes is a complex and dynamic problem based on both external and internal (to each consumer) factors. Data-driven methods offer a means for identifying patterns, modelling, and acquiring insights from the available datasets [9] .

Overview of key data-driven techniques
The multitude of data-driven methods can be grouped according to their application. There are exploratory data analysis tools where, through the use of descriptive statistics and data visualisation, we try to study the structure of the data, variables' distribution, and the interdependencies within the data sets [13] . Data exploration is used as an initial step acting as a guide for the subsequent treatments to the data.
Data preparation is the step of pre-processing the raw data for denoising, identifying discrepancies in the data and potential systematic errors. In the context of demand response, among others, that would mean pinpointing "faulty " smart meter readings, baseline estimations, and potential presence of extreme compensation values. There is a need to pre-process data to identify missing entries from smart meters readings, customers' surveys and in general from collected data, to avoid low quality data becoming inputs to models resulting to erroneous estimates and insights. Resolving the challenge of missing data can be achieved by employing complete-case analysis or imputation techniques (ranging from simple mean imputation to multiple imputation with Markov Chain Monte Carlo [14] ). In missing data analysis it is paramount to understand the underlying missing data mechanism [15] , e.g. whether the data are missing, or not, at random. De-noising the data entails pinpointing and filtering out errors in the data and/or outliers. Data preparation is an important process of the data analysis pipeline; real-world data is impure and noisy, and errors and biases of raw data can propagate to subsequent steps of analysis and modelling, with a great effect on the results of data-driven approaches [16] .
Furthermore, data mining can be used for clustering consumers behaviour to provide suggestions of consumers for DR schemes, selection of participants in DR events, and design of DR programmes. Frequently used categorisation techniques include both supervised classification and unsupervised clustering algorithms. Methods such as k-means and self-organising maps have been widely used in the context of demand response and for the study of response behaviour. In demand response, where a significant proportion of data is time-series, it is essential to use tools from time series analysis to discover potential underlying temporal structure. These tools include autocorrelation, identifying trends and seasonality patterns. For example, autocorrelation (or serial correlation) is a measure of linear association between lagged values of a sequence (e.g. time series). i.e. a lag autocorrelation is the correlation between values that are k time periods apart.
Another important application is also the modelling of data and estimation of variables of interest. In DR, that would be prediction of energy demand to provide information to support energy producers in the accurate planning for energy production, estimation of load curtailment, reliability of DR participants, etc.
Widely used techniques for practical data-driven modelling of data are ensemble methods and artificial neural networks [17] . The most popular ensemble methods are gradient boosting and random trees. Gradient boosting [18] is an ensemble of weak predictive models (typically decision trees) and it is an additive, forward stage-wise boosting model. This means that the algorithm sequentially adds new decision trees to the model without altering the parameters and coefficients of the decision trees already added [18] . Moreover, this type of models allow for the optimization of arbitrary differentiable loss functions [19] .
Random forest is also an ensemble method based on decision trees, and it operates by constructing a multiple of independent decision trees that are trained independently on a random subset of data [20] . To grow each tree, the inputs (or combinations of them) at each node are selected in a random fashion [20] . Contrary to the gradient boosting algorithm (where the weak models are built sequentially), random forests are an averaging ensemble method; meaning that they build weak decision trees independently and then they take the predictions' average of these weak estimators.
Artificial neural networks are computational models inspired by, albeit not identical to, biological nervous systems. The two basic architectures of artificial neural networks are the feed-forward and the recurrent architecture [21] . A dense neural network is a feed-forward network where all its layers are fully connected. More specifically, each unit in a dense (fully-connected) layer is connected to all the units in the subsequent layer [22] . Dense neural networks (and feed-forward neural networks in general) can be thought as universal function approximators which are constructed to achieve statistical generalisation [22] .
For domains like demand response, which have not been extensively studied and evaluated, there is the need to understand how the models attained their results and not focus only on their predictive performance [23] . For that goal, there is a need for interpreting data-driven models. Two popular model-agnostic interpretation methods are the (SHapley Additive exPlanations) SHAP framework [24] and the permutation feature importance technique [20,25] . We have employed these two approaches because prior research has shown that, in cases where the input variables are categorical (like in this study), the measures of variable importance and Gini importance can be biased towards categorical features with more categories [26] . These two frameworks approach the problem of feature importance from different angles.
In principle, the permutation feature importance algorithm assigns importance to the various input variables by taking into account the deterioration in model performance, whereas SHAP is based on the features' contribution to the model predictions. In more detail, the modelagnostic permutation feature importance algorithm calculates the importance of a feature by measuring the increase/decrease in the model's error metrics when the values of this feature are randomly shuffled [25] . This permutation breaks the relationship between the feature and the target variable, thus a model with higher error metrics (positive importance) is indicative of the model's dependence on the feature. On the other hand, negative importance (lower error metrics after permutation) means that the feature is not important for the model [20] .
The SHAP approach is a framework for interpreting the models' predictions, by assigning to each feature an importance value for a particular prediction [24] . The SHAP framework explains the prediction of a data instance by computing the contribution of each feature [24] . The contribution of a feature is calculated by computing Shapley values from coalitional game theory [27] . Under this paradigm, the players in a coalition are the feature values of data instance and the Shapley values denote how the pay-out (in this case the prediction) is distributed among the features in a fair manner.

Related work on data-driven methods for DR response behaviour
The subject of the analysis and modelling of customers' response behaviour has been broadly studied across the DR literature (see Antonopoulos et al. [28] for a full review). There are numerous papers where data-driven techniques have been used to model and predict customers' response to DR signals. Zhou et al. [29] estimate the reductions in electricity consumption during demand response time-windows, by incorporating latent variables (i.e. not observable variables) in statistical forecasting models -i.e. Ordinary least squares, k Nearest Neighbours, Support vector, and decision tree regression models. The latent features are constructed from the consumer's consumption data by applying sequentially a Conditional Gaussian Mixture model and a Hidden Markov model. Liu et al. [8] train a Long Short-term Memory (LSTM) neural network on simulated consumers' response data. The response data were generated using a response function which is quadratic to the customers' received incentives.
Paterakis et al. [30] approach the problem of the prediction of customers' response under dynamic DR price signals, by using a hybrid load forecasting model. The time-series load data are decomposed to various frequency signals, by applying wavelet transform, which are then fed as inputs to dense neural networks. The outputs from the neural networks are aggregated to create the final response forecast. Holtschneider and Erlich [31] estimate the response behaviour to DR price signals by applying a dense neural network. To tackle the issue of sparse training data for predicting response behaviour -there are only a few DR events per year -, there is work where they have employed non-parametric approaches i.e. an ensemble of k-Nearest Neighbours regression models [32] , or transfer learning-based approaches [33] . In the work of Cai et al. [33] they attempt to predict the customers' response behaviour in incentive-based DR by using load data from DR participants with similar load consumption behaviour.
The input data to the aforementioned data-driven algorithms have been found to be numerous.They are features engineered from the consumers' load data i.e. maximum and minimum daily load, mean consumption [8,32] , as well as the actual data of consumers' electricity consumption [29,30,33] . Quite often time-related features are included in the model i.e. week, month, hour of the DR event, seasonal scores [8,32] , as well as environmental factors i.e. ambient air temperature [29,33] and humidity [33] . The target variable for DR response prediction has mainly been the electricity consumption of consumers [8,29,30] . Cai et al. [33] forecast the curtailed load (difference of the actual electricity consumption from the estimated baseline consumption), and Kang and Lee [32] the proportion of realised load reduction to the requested load reduction. The training data of the learning algorithms has primarily been simulated or is semi-synthetic data [8,29,30] , but there are also cases where real-world DR data have been used [32,33] .
Moreover, clustering the main DR participants attributes is a typical approach for analysing the various response behaviours. The literature has mainly based their analysis on the customers' load data for the analysis of the response behaviour. Motlagh et al. [12] have focused on knowledge extraction from load profile data by comparing customers' consumption behaviours under DR with the ones of the control group. They use Principal Component analysis to extract the principal components from half-hourly consumption data and then they cluster them, based on these principals components, by using self-organising maps. Other clustering algorithms like k-means [34] , expectation maximisation [34] , and finite mixture-based clustering [35] , have been employed for understanding the DR participants behavioural use.
Other widely applied AI approaches used to study the behaviour of demand response are mechanism design and cooperative game theory -which are often modelling, not data-driven. These methods have been employed for the design of demand response contracts, such as in [36][37][38] , demand response incentive mechanisms [39,40] , demand-side and consumer coalitions [41][42][43] , or coordinating specific devices, such as electric vehicles [44,45] .
It is noted that, while the previous papers have provided valuable insights and advancements they tend to focus on the analysis of response using primarily load data. By contrast, the purpose of this work is to study the modelling and analysis of response based on household and device-related features. With specific emphasis on studying the effects of those features on the DR response behaviour.

Individual datasets exploration and analysis
In this study we use data from the Smart Grid, Smart City trial project. The SGSC project is one of the largest commercial scale smart grid technology trial projects globally [12] . It includes a wide spectrum of trial data, ranging from household and device-related data to EV utilisation. In our research the SGSC trial datasets utilised are the following: The household dataset includes data related to the participating households including the type of product they were involved with, energy and gas usage tier, type of house, type of devices in the household among others. The Home area plug readings dataset has the energy usage readings of various household devices, and can offer information, in a more granular level, of what type of electricity devices the customers utilise. The peak events dataset contains temporal details regarding the DPR and DPP events, whereas the peak event response set has the actual consumption, baseline estimation and rebate amount data. Before applying machine learning techniques for modelling DR response behaviour, we use exploratory data analysis and data exploration tools to pre-process the various datasets.

Household & HANPR data analysis
Initially in the household dataset we drop some features irrelevant to our analysis. That includes information on non-tariff products(e.g. lifestyle audit, feedback technology), redundant features (e.g. internal references), and in general non-informative features for how participants have responded in DR events. Although in the SGSC trial customers were offered a few tariff products, in this work the analysis is focused on the incentive-based DPR product. Therefore, we filtered out all the households which did not participate in the DPR scheme. After the filtering stage the resulting dataset did not have any missing entries.
The household dataset includes both numerical and categorical features (qualitative variables) with the majority of the features being categorical variables. When there is a mixture of both types it can be tricky to visualise the level of association between all these features. In this work, the approach we have selected is splitting the dataset in two subsets (based on the type of features) and calculating the Spearman correlation for the numerical variables and the Cramér's V 2 metric [46] . Figs. 1 and 2 visualises the level of association for numerical and categorical features respectively.
Especially in our study, where one of the main purposes of the regression models is to investigate the relationship between the input variables and the households' response for DR, multicollinearity among the input variables can lead to misleading analysis of the effects of the input variables on the DR response, as well as erroneous interpretations of the fitted models [24,47,48] .
From Fig. 1 we can see that in general the numeric features are not highly correlated with each other. The only case where there is a somewhat significant correlation is among the features related to household's power generation meters (number of gross and net solar generation meters, number of general supply meters, number of miscellaneous generation meters); which is to be expected. In Fig. 2 we can identify that there are not really strong associations among the categorical features, with the exception to the following cases below: • Between the assumed dwelling type and the actual dwelling of participating household. • Between the air-condition type and the response to whether the participating household has air-condition. • Among the gas-related features of the household.
In relation to the HANPR data, where there is more granular information about each household's devices, we explored whether we could engineer features related to the participants devices. That was done for the potential use of these features in the prediction of each household's response to the DPR events. Unfortunately this dataset only had relevant device data for a small proportion of the total participating households set, therefore, we have not utilised these features further in our analysis.

Peak events & response data analysis
Among the products that were offered in the trial project there were two peak event products: dynamic peak pricing (DPP) and dynamic peak rebate (DPR). In this study, the focus of modelling and analysis is the incentive-based DPR programme. In the DPR scheme the participating households were offered a rebate amount (constant at 4 . 5∕ ℎ ) to decrease their electricity consumption compared to their baseline estimation.
Across the running time frame of the SGSC trial there were 18 DPR events, as it is illustrated in Fig. 9 of the Appendix A. The duration of the DPR events ranges from two to four hours, and they have only happened during the weekdays (mainly Thursday and Friday), primarily during the summer and winter months, as well as mainly in the late afternoon hours, as shown in Fig. 3 . Since this is a dataset of domestic consumers from Australia, it is natural to see the peak events occur in the summer in early afternoon, when there is maximal use of air conditioning units.
Moving on the DPR response data, for the rebate programme we have complete data for only 13 out of the total of 18 DPR events (found in the peak event dataset). This means that we have the actual consumption and the baseline estimation for all participants, as well as the rebate amount, for 13 peak events. At this point, we should note to the reader that when we use the term response to a DR event we are referring to the percentage difference between the actual electricity consumption of the household and its estimated baseline consumption, for the duration of the specific DR event -as described in Eq. (1) .
Where , is the response of household for the DR event . Respectively, , is the actual consumption, and , is the baseline consumption (estimated by the service provider) of household for the duration of the DR event . This metric was selected with the reasoning that it has the advantage of being comparable across different customers, contracts, and DR schemes. But, at the same time there are limitations; i.e. errors of it constituent parts (errors in baseline estimation) can propagate to the response variable. Due to the fact that baseline estimation is one of the basic issues for demand response schemes [49,50] , we investigate any potentially faulty estimations of baseline and we filter out these data points.
First, given that a typical household consumes in average more than 0.45 kWh per 2 h, we exclude from our analysis all the cases where the baseline estimation is lower than 0.01 kWh. We can infer that households consuming less than this electricity consumption indicate probable faulty values in the baseline estimation, or that the households are not really used by their owners and therefore there is no reason for including them in our response analysis.
In Table 1 we provide the descriptive statistics of the various customers' actual and baseline consumption across the DPR events, as well as of their received rebates and their response for the DPR events. The descriptive statistics include the mean value, the standard deviation (std), minimum and maximum values, and the quartiles (25%, 50%, 75%) We can see that the average response for the DPR scheme is positive (34.25%), meaning that on average a participating household will not decrease their consumption during a peak (DPR) event but actually increase it by around a third. We can also see that the standard deviation (std) of how customers have responded in the DPR events is exceptionally high ( ≈ 800%). This could be partly attributed to outliers and potential issues with baseline estimation. Looking at the 3rd quartile (75%) value of response we can see that it is positive. That means that quite often ( > 25% of the cases) households have not actually decreased their consumption during a DPR event. The extremely high maximum values and standard deviation of responses indicate an issue with outliers in the data.
We filtered out the DPR events which show irregular distribution patterns, and to address the issue found with the outliers in the remaining DPR events we employed unsupervised techniques for outlier/anomaly detection. The outlier detection is implemented on each DPR event, and not on each household. Mainly because the baseline estimation can change through time, and also because the number of DPR events per households is really small to do any meaningful outlier detection.
The algorithms that we examined for this task are the Local Outlier Factor (LOF) algorithm [51] , and the Isolation Forest [52] . The applied LOF algorithm is a density-based approach which employs the k-nearest neighbours to estimate the local density of each sample, and tries to identify samples which have significantly lower density to its surrounding datapoints. The isolation forest algorithm takes a different approach to outlier detection, with explicitly isolating the anomalies instead of profiling normal samples [52] . It uses an ensemble of random splitting trees, and considers outliers those datapoints which have short average path lengths on these trees.
In this study we applied an approach similar to the algorithm proposed in Cheng et al. [53] where the two algorithms are combined to try and overcome their respective limitations. First we applied the Isolation forest and extracted an initial set of outliers, and on that initial set the LOF algorithm was applied to perform local outlier detection and get the final set of outliers. The distribution of the various households' response , per DPR event after removing the outliers/anomalies is illustrated on Fig. 4 . From Fig. 4 it is apparent that even after removing the outliers there is a high variance in how households have responded to DR signals. This result is in accordance with previous work in the literature [4] , where it is presented that the average reduction DR trials is ranging between 10% and 50%. The issue  of variance in DR has been discussed also in the work of Aïd et al. [36] . A subsequent step, following the removal of the outliers from the dataset is to study the distribution of the households' responses to the various DPR events. We filter out the customers who have responded to less than five (out of total 10) DPR events. The reasoning behind this filtering of consumers is that there should be a minimum number of events where they have responded, in order for this analysis to provide any meaningful insights. Fig. 5 illustrates that the distribution of the consumers' responses seems to be similar to a log-normal distribution. The distribution of the responses in this case is not symmetric and appears to be positively skewed. This tendency is the same across all seasons and it is apparent that it persists in the aggregated case too. When the data follow a skewed distribution, the median is often the best measure of central tendency. That is the rationale behind selecting the median response per household (across the DPR events in which they have participated), rather than the mean as a measure of central tendency.

Temporal structure analysis
A key hypothesis to explore would be whether participants are learning to respond better over time. This hypothesis has implications on the selection of consumers for response in DR events. In the case that consumers learn over time to respond better over a sequence of DR events, then a more dynamic view on the customers' potential for DR should be considered. For example, an aggregator could select these consumers more often with the intention of a better response in the longer-term. One way to check this hypothesis is with the use of autocorrelation.
We calculated the autocorrelation function (ACF) for every household's DPR responses and the distributions of 1-lag and 2-lag autocorrelation are presented in Fig. 6 . At first glance there appeared to be quite a few households that have relatively high autocorrelation values. But we need to confirm that the autocorrelations in the population from which the sample is taken are statistically different from zero, rather than these observed autocorrelations being the result of randomness from the sampling process. This can be achieved by employing the Ljung Box Q test [54] , which is a portmanteau 3 statistical test. In order to reject the null hypothesis of the Ljung-Box Q test (that the data are independently distributed) the p-values must have lower value than the alpha significance level.
By checking the p-values associated with the Ljung Box Q-statistics, at the 5% significance level, we can see that only 29 households had a statistically significant 1-lag autocorrelation and 21 households had a statistically significant 2-lag autocorrelation -out of a total of 953 households. With these results, there is little evidence that there is a temporal structure in the participants' response to the peak events and that there is a learning process over time, at least for this dataset where the majority of the customers participated only in a limited set of DR events. This points to an important insight for future DR trials, that could  Furthermore, by examining the response of households in Figs. 5 and 7 it can be inferred that the response does not differentiate much among the seasons. The same can be said for distinct days -DPR events only happened in three days and all of them were during the work week -, and for distinct hours and months. Their distributions can be seen in the Figs. 14-16 of the Appendix A.

Modelling of demand response behaviour
In this section we explore the potential relationship between the household-related features described in Section 3.1 and the median response per household for DPR DR events, and extract the important features for that purpose. As stated above (see Fig. 5 ) the median response per household have been selected as a target variable due to the skewed distribution of the responses, and the nature of the inputs variables (which is at the household level). The distribution of the target variable is presented in Fig. 7 . For that purpose we train different types of predictive models on the household data set, and the models used in this work are the following: Initially, we split the features' data in training, validation and test set and then we pre-process the data. Our dataset is predominantly composed of categorical features (see Section 3.1 ). As a result, the preprocessing of the dataset mainly involves converting the categorical features to numerical, using an embedding system (e.g. one-hot encoding or ordered target encoding). In general, the numerical features were not pre-processed because they are of similar scale and distribution. Moreover, in our case there were not any missing data, so there was no imputation step involved. In the subsequent paragraphs we describe more in detail the models used, and the actual encoding system used for each model. In the linear model the estimation of its parameters is done using the Ordinary Least Squares (OLS) method, which is a widely used and common approach. This kind of linear model offers good interpretability of the potential relationship between the predictor variables and the target variable [56] , and serves as a benchmark for the comparison of the other learning algorithms. In this work we utilised the statsmodels implementation [57] of an OLS linear regression, with one-hot encoding of the categorical variables.
In this study, the gradient boosting on decision trees algorithm is based on the CatBoost implementation [58] . This implementation includes a permutation-driven alternative to the classic gradient boosting algorithm, and has a built-in embedding approach for categorical features based on target statistics [59] . The number of trees used in the final model are selected, based on the iteration that outputs the optimal evaluation metric in the validation set. For the training of the Random Forest model we utilised the scikit-learn [60] implementation with its default parameters. The pre-processing of the dataset was done in a similar fashion to the gradient boosting case.
Here, we constructed a two-hidden layer dense neural network using the Keras API which runs on top of the TensorFlow platform [61] . The network's hyperparameters -in our case the number of units for each layer and the learning rate of the Adam optimiser [62] -were tuned based on the Keras Tuner library [63] , with the use of the random search method [64] . We also experimented with "deeper " architectures (higher number of layers) but there was no gain in the predictive performance of the model, and therefore were not selected as the final model. According to the survey of Hancock and Khoshgoftaar [65] the most common encoding for categorical variables in neural networks is One-Hot encoding. In our case we have used the CatBoost encoder (as in the earlier models) for consistency reasons and to make the results more comparable.
The various machine learning models were trained and validated on 10 different splits of the dataset (in training, validation and test sets). That was done in order to evaluate their average performance and give more informed values of their test error metrics. The selected metrics are root mean squared error (RMSE), mean average error (MAE), mean average percentage error (MAPE) and 2 , and were evaluated on the test set of the data. The results are presented in Table 2 . For a more detailed description of the models' implementation and the exact values of the optimal hyperparameters, the reader can always refer to the Jupyter notebooks/codebase in the GitHub repository as found in Section 5 .
Based on the combination of the models performance on the different error metrics, the Gradient boosting algorithm seems to produce the best results in the test set, followed by the linear model. In general, the results across all these four models are quite similar (both for linear and non-linear estimators), and comparable to the existing literature. For example, the MAPE metric of models for individual customers' response is in the range of 41-45% [32,33] . In all cases the 2 is quite low, and therefore we can infer that this household features are not very informative predictors of the median response. But, even though the input variables explain one a small percentage of the variance in the model, that does not necessarily mean that they need to be discarded as useless. In this work, the purpose is to explore the relationship among the features and their importance, and not high prediction accuracy. When the independent variables are statistically significant, we can still draw important conclusions about the relationships between the variables.
Examining the results' summary of the linear model, we can see that the in-sample 2 ≈10%, and primarily that the p -value of the F-statistic is quite low. In the multiple regression setting this fact means that there is at least one statistically significant feature, and therefore there is a relationship between the household features (at least of a subset of them) and the household's median response. Moreover, examining the p-values of the t-test for individuals features we can see that, at the 95% significance level, there are the following statistically significant features: • The level of a household's dryer usage • If a household has internet access • If the household is using gas oven • If the household is using gas for hot water • If the household has Air-condition • The number of a household's controlled loads • The number of refrigerators in a household • Whether the household stayed on the trial for the duration The above are supporting evidence that there is a statistically significant relationship between the input variables and a household's median response, albeit not a strong one. Therefore we extend further our analysis by exploring important features and discussing the contributions of these important features on the customers' response, in an incentivebased DR scheme.

Identification of important features and discussion of their effects on response behaviour
Given that there is a relationship between the customer households' features and the response of a household, an interesting subsequent step is to examine which features are important for the response behaviour of the DR consumers. Identifying which features are important cannot only potentially lead to more accurate prediction of response behaviour, but also provide insights to the data and the models utilised.
In our study we have applied two popular model-agnostic interpretation methods to examine features' importance . The SHAP framework and the permutation feature importance method. These two methods have been applied to different models in order to get more decorrelated results and a less biased set of important features. The permutation importance algorithm has been applied for the linear and random forest regression models. The results for the test set appear in the Fig. 12 of the Appendix A. In the x-axis of the importance graph we have the importance of the feature (difference between the model's score metric with and without each feature's permutation), and in the y-axis we have the features ranked by their importance magnitude. The SHAP method has been applied for the gradient boosting and dense neural network models. The importance of each feature in this case is the mean of the absolute Shapley values across the test data. The sorted features by decreasing magnitude of importance are illustrated in Fig. 11 of the Appendix A.
Based on these four different models and two feature selection frameworks we have four distinct rankings of the features' importance. To overcome this diversity among them and create a global set of ranked important features we have utilised an ensemble feature ranking based on a Markov Chain rank aggregation method [66] .
Understanding the drivers behind response behaviour in DR schemes and their effect on response can have wide implications on the use of demand response for the provision of demand-side flexibility services. Among others, a deeper comprehension of residential response behaviour can assist service providers with targeting households for DR, better selection strategies for responding in individual DR events, as well as with the design of more successful, fair, with high participant engagement DR schemes.
In this work, we approach the study of the effect of households' characteristics on the residential response behaviour in two ways. We examine the distribution of the residential responses across the various important households' characteristics, and we investigate the impact of these features on the Gradient boosting model's output, employing the versatile SHAP interpretable framework. The local explanations of the SHAP method for the samples of the test set result to global model insights.
The Gradient Boosting regression model has been selected because it is the model with the best prediction accuracy (across the models used in this study). The intuition behind using two different methods is that this way we cross-check the results between them, and therefore can acquire insights, on the residential response behaviour in DR events, which are potentially less biased.
In more detail, based on the results of the ensemble ranking algorithm we select the features with the highest rank, and for each feature we explore how the various levels of the categorical input variables influence the households' response. How the various numerical features affect consumers response in DPR events can be seen in Fig. 8 . E.g. households with high number of refrigerators tend to not follow the DR request, whereas households with a low number of refrigerators implies that the household will decrease more their consumption during DR events. Fig. 13 presents the empirical cumulative distribution function for the different levels of the most important categorical variables. The rationale for using this type of plots is that by comparing the CDFs for different categories we can estimate whether the categories influence differently the response behaviour of the DR participants. Categories of the same variable with "steeper " CDFs indicate that households belonging in this category tend to have higher response, and decrease more their actual electricity consumption (compared to their baseline) than households belonging to the other categories -and vice versa.
The graphs that illustrate the contributions of each feature to the Gradient boosting regression's model are Fig. 8 and Fig. 10 . Each point in these graphs is the Shapley value for a specific feature and an instance of the model's output. Fig. 8 combines feature importance with feature effects. The features are sorted by their importance magnitude and the colour represents the value of the feature from low to high. The grey colour denotes a categorical variable, where there is no notion of high and low values. Although this plot illustrates information about the relationship between the values of households' characteristics and the impact on the model's prediction, for numerical variables, it is not really informative for categorical variables. In this case to see the exact form of the relationship we have to look at Fig. 10 , where we plot the Shapley value of that feature with its categories for all the examples in the test dataset. On the y-axis are the SHAP value for that feature, which represents how much knowing that feature's value changes the output of the model for that sample's prediction. Also, because of the way we have defined the households' response (percentage difference between actual and baseline consumption), negative SHAP values indicate decrease in the electricity consumption and thus better response. Moreover, the points in this plot are coloured based on the level of the household's electricity usage level. If an interaction effect is present between the plotted feature and the electricity usage level it will appear as a distinct colouring pattern.
After careful examination of the aforementioned graphs we can provide the following insights on the effects of households' characteristics on the DR response behaviour. As we can see from both approaches, households that have access to the internet tend to respond better in DR events. We can infer that this feature does not have a direct effect on the DR response, but most likely it is associated with characteristics of the residents. One interesting insight that we can infer from Figs. 10 and 13 (and seems counter-intuitive) is the fact that the reported effort level taken by the households, to respond to a DR signal, does not seem to have an effect on the households' actual response. Households that own air-conditioning systems respond better than the ones which do not. The households that do not have a gas-fired oven tend to respond better than the ones that do.
Another interesting insight is that the households with medium electricity usage level tend to respond better than the ones with low or high usage levels. Existing studies on the literature have indicated that highuse households tend to respond significantly more in load reduction than low-use households [7] . This could potentially be attributed to the design of this specific DR peak scheme, where the rebate amount per energy unit decreased is constant. Moreover, the customers in warmer climate zones tend to have a better response behaviour than the ones living in areas with mild climate conditions. The residential customers with medium and high level of clothes dryer usage respond better than residential properties which do not use the dryer frequently (or at all).
We can see that consumers with medium levels of gas usage tend to be more responsive in DR peak events than the ones with low levels. It seems that households with high levels of both gas and electricity usage tend to perform worse in response behaviour. The dwelling type does not seem to have a big effect on the response behaviour of customers. We could cautiously infer that the consumers residing in units (apartments) with medium electricity usage level tend to respond better.

Conclusion and further work
For the wider realisation of the demand-side flexibility potential, the widespread utilisation of residential DR resources is essential. However, residential response behaviour poses certain challenges. Accurate modelling of response behaviour of DR participants and a deeper comprehension of the factors driving response behaviour can assist in alleviating these challenges with better targeting and selection strategies of consumers for DR and the design of better-tailored schemes.
In this work, the authors approach the study of modelling and analysis of response behaviour under DR by utilising not primary load data (which is mainly used in such studies), but datasets related to household characteristics, when responding in an incentive-based DR programme. To this end, the study makes use of data from one of the largest opensource DR trial projects involving residential (household) consumers. We explored the datasets and pre-processed them accordingly (i.e. check for multi-collinearity, handling of missing values, outlier/anomaly detection and filtering). We explored the possibility of participating households learning to respond better over time, but we did not find supporting evidence for this hypothesis (at least for this specific DR scheme and available dataset).
We have showed that there is a statistically significant relationship between a household's response and a number of households' characteristics, and we have applied various machine learning techniques in search of better prediction accuracy. The prediction accuracy of these models found by our study is in line with expectations, given previous work on prediction of response behaviour which only focused on load data. Future work could combine both types of data this (householdrelated features and load data), potentially yielding better prediction accuracy for residential response behaviour.
We further explored and found the important features for modelling the households' response behaviour using two model-agnostic methods, across the different machine learning models. Next, we examined the effects of these important features on the households' response behaviour by using statistical distribution plots and interpretable AI methods based on coalitional game theory. We found that households with internet access, air-conditioning systems, power-intensive appliances (e.g. clothes dryer), and with lower gas usage tend to respond better than average. Moreover, there are also some insights which raise questions about the reported level of consumers' engagement in DR schemes, and the individual rationale of customers' response to DR signals.
Finally, taking a longer-term view, we argue that data-driven studies such as ours can provide key insights into better design of future residential DR trials, such that the data they collect can fill gaps in the knowledge of researchers and practitioners. For example, in this study, we could not detect a temporal trend showing an improvement in the response of households, as they participate in more DR events over time. This is likely due to the limited number of DR events the dataset from the SGSC trial (after filtering, we found only 10 events had reliable, high-quality, usable data). Another question that cannot be answered based on this data is what is the elasticity of the expected percentage demand reduction (response), w.r.t. the size of the financial reward of-fered by the DR aggregator or network operator. In practice we expect this question to be important in designing, budgeting and optimising and budgeting residential DR schemes in the future.
Overall, with the roll-out of ubiquitous smart meters and largescale availability of consumer-level response data, we argue data-driven methods will have an increasingly key role to play in the design of successful and fair DR schemes, with high participant engagement.

Data availability
The Smart-Grid Smart-City Customer Trial Data used in this study are available at the Australian Government data website: https://data.gov.au/dataset/ds-dga-4e21dea3-9b87-4610-94c7-15a8a77907ef/details .

Code availability
The reproducible code for this analysis can be found at: https:// github.com/antongiannis/dr-behaviour-modelling-residential.git .

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.