Improving energy self-sufficiency of a renovated residential neighborhood with heat pumps by analyzing smart meter data

In the energy renovation process, buildings are usually upgraded to become energy-neutral on an annual basis by installing photovoltaic systems and heat pumps. However, the energy self-sufficiency of these buildings is surprisingly low. As a result, the rapid deployment of heat pump based heating systems shifts natural-gas consumption from the building side (boilers), where it was previously consumed, to the electricity production side (power plants). Fortunately, the development of information and communication technology enables access to consumption/generation data of building-related energy systems. Thus, there is an opportunity to strategically use this data to improve energy self-sufficiency and accommodate heat pump based heating systems. In this study, the improvement of self-sufficiency is discussed using a renovated neighborhood. The presented method incorporates a smart-grid application with data-driven clustering, prediction, and an energy management strategy. First, dwellings with similar demand profiles were clustered with the k-means algorithm, and demand prediction was performed using the random forest technique. Afterwards, electric energy storage was introduced and a multi-objective optimization reducing annualized costs and carbon emissions was performed. For the carbon-dioxide optimal case, when aimed at the entire neighborhood, an annual self-sufficiency increment of more than 25% can be achieved, with four months out of the twelve being 100% energy self-sufficient. © 2021 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).


Introduction
Ever since the Paris agreement to pursue efforts to limit the temperature increase to 1.5 °C above pre-industrial levels was accepted [1], governments have started stimulating energy efficiency and sustainability measures by means of subsidies and regulations. One of the most important focus fields of these regulations is buildings, since they account for 40% of the global energy demand [2] and 30% of global CO2 emissions [2]. This has led to a growth in utilizing private renewable energy sources [3], primarily photovoltaic (PV) panels, and has changed the centralized energy system towards a more decentralized infrastructure [4].
Along with this transformation, most European countries are seeking measures to eliminate natural-gas use in the heating of buildings. All-electric heating systems [5], for instance heat pumps (HPs) [6], are gaining a great deal of attention as a substitute. Nevertheless, with the current composition of electricity production (central coal and natural-gas power plants) [7], a shift of fossil energy consumption from the demand side (boilers) to the central production side (power plants) [5] can be observed when operating all-electric heating systems [8]. Some may argue that this problem can be eliminated with local renewable energy sources. Yet, most renewable energy sources, especially wind and solar energy, do not have a high energy density and are irregularly available [9]. For PV systems of residential buildings in particular, the solar intensity is at its peak during the daytime, when electricity demand is generally low. Therefore, methods are needed to improve energy self-sufficiency, especially for neighborhoods with residential buildings.

Motivation
Even though employing new energy systems or smart control strategies can improve energy self-sufficiency in new buildings, applying the same to existing buildings remains a huge challenge. In the energy renovation process of existing buildings, the first step is to insulate them efficiently so that the heating demand is reduced by a considerable amount. Then, the rest of the demand can be provided by a suitable energy system. In this process, buildings are most often upgraded to become energy neutral on an annual basis, yet energy self-sufficiency has been identified to be surprisingly low [10]. Consequently, most of the energy produced by the PV panels is fed back to the grid without being used locally. At the moment, in some countries, energy taxes are only applied to the net electricity consumption of prosumers. However, this net metering system will gradually disappear. This means that electricity drawn directly from the national grid will be costly, and feed-in electricity may not bring any advantage to the building owners. This study is motivated by the fact that the level of energy self-sufficiency is not adequate as elements associated with the heating demand are electrified, especially with HPs.
On a positive note, in parallel with the installation of PV and HP systems, the replacement of traditional analog meters with smart meters is also being done which has the possibility to generate and accumulate enormous amounts of data [11]. Smart meter data has enabled a closer look into the energy demand patterns of individuals and neighborhoods [12]. However, this data is often considered privacy sensitive. Therefore, not many studies have been able to use such detailed energy consumption data of large samples of households. Making use of smart-meter data with available control strategies to improve energy self-sufficiency of existing renovated buildings, and accommodate HP based heating systems is the main objective of this work.

Literature survey
As in many other sectors, the built environment has gained momentum over the past decade in taking advantage of machine learning algorithms for smart-grid applications. Among these applications, most attention is given to multi-step (e.g., day-ahead) [13] or single-step (e.g., next-hour) [14] energy demand prediction. These studies pointed out that it is challenging to determine one specific best machine learning model for demand prediction. It is stated that the choice of the model is determined by the nature of the data [15]. Therefore, it is essential to analyze the available data and application to determine which model suits best in the given situation [16]. Nonetheless, most of these studies agree that when the resolution of the data is high (hourly, 30-min, 15-min, etc.), the prediction accuracy of commonly used machine learning algorithms decreases; advanced deep-learning methods or unsupervised clustering methods (depending on the application) are therefore useful in these situations.
To fully grasp the information available from smart meters, evaluating data at the collected high resolution is important. When the granularity of the data set is high, researchers have focused on identifying representative typical demand patterns rather than on demand prediction. Clustering algorithms have served this purpose. Shin et al. [17] used a data set of 22 houses with a sampling interval of 1/15 s for 29–122 days with the target of finding typical daily patterns. In their study, the k-means method was used as the clustering algorithm. In another study, Yilmaz et al. [18] compared two different approaches using the k-means method to identify the best possible representative electricity demand profiles. In that study, 15-min data of 690 households for a complete year was exploited. When demand prediction is a necessity, Shchetinin [19] has proposed to cluster the demand patterns first and then perform the prediction in order to improve the prediction accuracy.

Contribution
This research shows one of the many possible applications that analyzing smart meter data makes possible. Using demand prediction and smart control strategies, this study evaluates how to efficiently micro-manage the smart grid, expose energy-saving possibilities, and improve energy self-sufficiency of renovated neighborhoods.

Organization of the paper
First, in section 2, an approach is suggested to perceive information from 15-min granularity smart meter data with a clustering method. Then, this knowledge is used innovatively for demand prediction. In section 3, a currently used control algorithm is utilized to improve the energy self-sufficiency of a renovated neighborhood along with a battery storage system. Section 4 provides the concluding remarks of the study.

Methodology
The case study dwellings are located in the center of the Netherlands, have no natural gas connection, and are provided with a PV system, an HP, and high levels of insulation. These renovated dwellings are meant to be 'net-zero-energy' when considering a full year (i.e., yearly consumption = yearly production). The living space of the dwellings is about 85–100 m² with an installed HP capacity of 1.2 kW and a PV installation capacity of 5–8 kWp. Two years (2016 and 2017) of 15-min interval meter data from these 70 renovated dwellings were used in this study.
These households use a constant tariff structure for their electricity consumption. The collected meter data from the dwellings are namely; smart meter (SM) consumed, smart meter produced, and heat pump consumed. Smart meter produced indicates electricity produced by the PV panels. The total grid consumption is defined as in equation (1). The total grid consumption can carry a value of plus or minus depending on the PV production as illustrated in Fig. 1.
The cleaned smart meter data of the 70 houses (see Annex.A) was used to follow the steps described in the next sections, namely dynamic clustering, demand prediction, and self-sufficiency improvement of the neighborhood, as demonstrated in Fig. 2. The underlying theoretical principles of the methods and models used are clarified in the following sections.

Step 1: Dynamic clustering
The goal of this clustering step is to exploit and identify similar electricity demand patterned dwellings as illustrated in Fig. 3. By doing a dynamic clustering process (change every day) it is possible to recognize how similar the dwellings behave each day. As the clustering technique, the k-means method [20] is employed since it is one of the most widely used clustering algorithms with proven performances (see Annex.B).
In order to identify a suitable number of clusters 'k' to segment the data, two validity indices were checked, namely the Davies-Bouldin validity Index (DBI) [21] and the Silhouette Coefficient (SC) [21]. The Davies-Bouldin index evaluates inter-cluster differences. As shown by equation (2), the minimum score that can be obtained for DBI is zero. The smaller the DBI value, the better the clustering [11]. In the equation, the average intra-cluster distance is given by diam(C_i) and the distance between two cluster centroids is given by d(C_i, C_j).
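For reference, a commonly used form of the Davies-Bouldin index that matches the symbols described above is sketched below. It is the standard textbook definition and is given here as an assumption; it is not necessarily identical to the expression printed as equation (2). Each ratio is non-negative, which is why the index cannot fall below zero.

```latex
\mathrm{DBI} = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i}
  \frac{\operatorname{diam}(C_i) + \operatorname{diam}(C_j)}{d(C_i, C_j)}
```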
The Silhouette Coefficient is a normalized value that can be used to determine the degree of separation between the clusters [18]. The coefficient is bounded by the values −1 and 1. A higher value indicates better clustering, while negative values represent misclustering [11]. In equation (3), C_x denotes the average intra-cluster distance and C'_x denotes the average minimum distance to another cluster. In this study, these two indices were calculated for each day of the analyzed period. Then, the average value per month was compared and a suitable number for k was selected.
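As an illustration of how such a coefficient can be computed, the following minimal sketch implements the standard silhouette definition, (C'_x − C_x) / max(C_x, C'_x), averaged over all points. The function name, the use of Euclidean distance, and the zero score for single-member clusters are assumptions made for this example and are not taken from the study. At least two clusters are required.

```python
import math

def silhouette_coefficient(points, labels):
    """Mean silhouette value over all points (standard definition, assumed here)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # common convention for single-member clusters
            continue
        # C_x: average distance to the other members of the same cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # C'_x: smallest average distance to the members of any other cluster
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return sum(scores) / len(scores)
```

With two well-separated groups the value approaches 1, while swapping labels between the groups drives it below zero, in line with the interpretation given above.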
Electricity consumption data of the fourteen days prior to the evaluation day (= Day: x+1), that is, Day: x-14 to Day: x, is used for the clustering process (see Fig. 3). So, the 70 houses are grouped into k = M clusters according to their demand behavior over the past 14 days. Then, on Day: x+1 (the evaluation day), the energy demand of the dwellings is aggregated virtually according to the clustering obtained (illustrated in Fig. 4). At the end of each day (24h00), the clustering process is repeated. So, every new day, each cluster contains a different group and a different number of houses, according to their demand behavior over the past 14 days. The 14-day mark was selected with a holistic approach, as it provided a balanced behavior for the desired process. This step shows, for each day, how many and which dwellings group together, so the dwelling numbers in each cluster are known.
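The daily re-clustering described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: the data layout (`profiles` as a dictionary of per-day demand vectors), the function names, and the deterministic choice of starting centroids are all hypothetical, and a plain Lloyd-type k-means is used.

```python
import math

def kmeans(vectors, k, iterations=50):
    """Plain Lloyd-type k-means with evenly spaced vectors as starting centroids."""
    centroids = [list(vectors[i * len(vectors) // k]) for i in range(k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance
        labels = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        # Update step: centroid becomes the mean of its members
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster runs empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def cluster_for_day(profiles, eval_day, k, window=14):
    """Group dwellings by their demand over the `window` days before `eval_day`.

    profiles: {dwelling_id: [one demand vector per day]} (hypothetical layout,
    0-based day index).
    """
    ids = sorted(profiles)
    vectors = []
    for dwelling in ids:
        days = profiles[dwelling][eval_day - window:eval_day]
        vectors.append([x for day in days for x in day])  # flatten the window
    return dict(zip(ids, kmeans(vectors, k)))
```

Calling `cluster_for_day` once per evaluation day reproduces the rolling behavior: the window moves forward by one day, so both the membership and the size of each cluster can change from day to day.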

Step 2: Demand prediction
Usually, the demand prediction step is needed to reduce real-time monitoring expenses, initial costs of hardware components, and long-term maintenance costs. For the demand prediction, the Random Forest regression algorithm (see Annex.C) is used because of its superior performance. With predictions at the aggregated level, it is not possible to identify information about individual buildings, but the prediction accuracy is high. With individual building prediction, no information is lost, but the prediction accuracy is low. Clustering is used so that the information about individual buildings is not completely lost and the prediction accuracy can be improved. Fig. 5 illustrates the demand prediction procedure.
In this study, after the clustering is performed at the end of each day, demand prediction is performed for each cluster for the coming day (Day: x+1). Since the number of dwellings inside the clusters changes each day, the model for each cluster is also retrained at the end of each day. For model training and prediction, data of the 14 prior days is utilized (1344 data points per dwelling). The 14-day mark for demand prediction was chosen after several trial-and-error exercises with smaller and larger data sample sizes. For this case, it was found that larger data samples do not significantly influence the prediction accuracy. Since the prediction variable Total Grid includes plus and minus values, the data set was normalized before prediction. Normalization brings all the values into the range [0,1]. Outdoor temperature, irradiance, and previous-day demand of the dwellings are chosen as the input features for the random forest model (see Fig. 5). To understand the added value of clustered predictions, a similar process was conducted to predict the day-ahead individual demand of the houses. Then, the results of the individual predictions were compared with the clustered predictions.
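The size of the training window and the normalization step can be illustrated with a short sketch. Min-max scaling is assumed here because it maps signed values into [0,1]; the text does not name the exact normalization technique, and the function names are hypothetical.

```python
SAMPLES_PER_DAY = 24 * 4                # 15-min resolution
WINDOW_POINTS = 14 * SAMPLES_PER_DAY    # 14 days of data per dwelling

def min_max_normalize(values):
    """Scale values linearly into [0, 1]; also return (min, max) for the inverse."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values], (lo, hi)
    return [(v - lo) / (hi - lo) for v in values], (lo, hi)

def denormalize(scaled, bounds):
    """Map normalized predictions back to the original unit."""
    lo, hi = bounds
    return [s * (hi - lo) + lo for s in scaled]
```

Keeping the bounds of the training window makes it possible to convert normalized predictions back to energy values before they are aggregated or compared with measurements.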
The performance quality of the trained models was evaluated using the coefficient of determination (R²) [22]. R² is a dimensionless indicator that shows how much variability in the dependent variable is accounted for by the independent variables. This value varies between 0 and 1. The closer the value is to 1, the better the predictions. For this study, a pseudo-R² value above 0.8 is considered sufficient for the trained models. R² is better suited to evaluating the trained models than to evaluating the predictions [23]. The coefficient of variation of the root mean square error (CVRMSE) [22] was therefore used to determine the prediction accuracy. The 'ASHRAE guideline for measurement of energy demand' [24] states that, for hourly demand predictions, a CVRMSE value below 0.3 (= 30%) is sufficiently close to physical reality and is a good indicator of prediction quality. The same standard is applied in this study to 15-min demand predictions. In equations (4) and (5), x̂_i and x_i symbolize the predicted and observed values of data point i, while x̄ denotes the mean of all the observed data points. N represents the total number of data points.
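Both indicators can be computed in a few lines. The sketch below uses the standard definitions of R² and CVRMSE in terms of observed values x_i, predicted values x̂_i, their mean x̄, and N; these are assumed to correspond to equations (4) and (5), and the function names are introduced for illustration only.

```python
import math

def r_squared(observed, predicted):
    """1 - (residual sum of squares) / (total sum of squares)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

def cvrmse(observed, predicted):
    """Root mean square error divided by the mean of the observations."""
    n = len(observed)
    mean_obs = sum(observed) / n
    rmse = math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)
    return rmse / mean_obs
```

For example, a prediction that is off by a constant 0.5 on observations 1, 2, 3, 4 gives R² = 0.8 and CVRMSE = 0.2, which would meet both thresholds mentioned above.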

Step 3: Self-sufficiency improvement of the neighborhood
After knowing the demand predictions of the neighborhood, the next step is to improve the self-sufficiency. This is done by using appropriate energy sharing measures between dwellings. If the dwellings operate under a separate micro-grid, at a certain time interval 'T', it is possible to calculate the direct-shared energy potential, which is the allocation of excess electricity produced by PV panels directly between the dwellings. However, by comparing the demand patterns of individual houses, the exploitation and identification of direct energy sharing potential is cumbersome. In this case, clustered demand patterns significantly help in identifying such parameters. The notion of direct sharing is illustrated in Fig. 6.
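One plausible way to quantify the direct-shared energy in an interval T is sketched below. It assumes that the shareable amount equals the smaller of the total surplus and the total deficit across the clusters; this rule, the sign convention (negative = PV surplus), and the function names are assumptions made for illustration and are not stated explicitly in the text.

```python
def direct_sharing(cluster_net_demand):
    """Energy (kWh) that can be exchanged directly within one interval T.

    cluster_net_demand: net demand per cluster for the interval;
    positive = drawing energy, negative = PV surplus.
    """
    surplus = sum(-v for v in cluster_net_demand if v < 0)
    deficit = sum(v for v in cluster_net_demand if v > 0)
    return min(surplus, deficit)

def direct_sharing_total(intervals):
    """Sum the directly shared energy over a sequence of intervals."""
    return sum(direct_sharing(step) for step in intervals)
```

When every cluster is a net consumer (or every cluster is a net producer) in an interval, the function returns zero, which reflects that sharing requires simultaneous surplus and demand.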
To improve the self-sufficiency (equation (6)), next, battery storage is introduced, so that, the excess energy produced by PV panels can be stored and used at a later time period as illustrated in Fig. 7. The battery storage systems can be installed at a neighborhood scale or local level. Due to the development of technology and the gradual price reduction [25], it is foreseeable that the vast utilization of battery systems at the neighborhood [26] level will soon become a preference.
$$\text{Self-sufficiency} = \frac{\text{Usable energy from renewable sources}}{\text{Actual energy demand}} \tag{6}$$

2.3.1. Model formulation for the operation of neighborhood battery storage system
To calculate the energy sharing potential through a battery storage system, a Matlab-based optimization problem is implemented using linear programming [27]. The modeling toolbox YALMIP is used as the interface to the solver Mosek. The model formulation is based on the 'energy hub' concept [28] and is defined in 2.3.1, sections A-C. In the formulations, a scalar variable is represented by an italic letter while a bold letter stands for a vector. Scalar and vector annotation mainly differentiate the design value of devices (one value) and optimal control of devices (per sample time). An independent decision variable is presented by a superscript d.
2.3.1.1. Battery storage. The storage system modeled in this work is battery storage, and a simplified linear model is used for the optimization problem as described in Ref. [28]. Equations (7)–(10) describe the equality and inequality constraints. In equation (7), E t is the storage level at each time interval t in kWh, Bat Charge/Discharge are the charging and discharging energies in kWh, and η Charge/Discharge are the charging and discharging efficiencies.
Decay is the rate at which stored energy declines.
Both charging and discharging efficiencies are chosen to be 90% and the decay is set to 0.1% per hour. These values were chosen based on the work of [28]. C Bat is an integer and X Bat is a binary variable that acquires a value of '1' when the device is included in the optimization. δ Charge/Discharge set the limits on the charging and discharging rates based on a C-rate [29] of 0.5C and a depth of discharge (DOD) of 80%.
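To make the role of these parameters concrete, the following sketch steps a simple linear storage model through time. The update rule (decay applied to the stored level, charging multiplied by the efficiency, discharging divided by it), the per-interval C-rate limit, and the interpretation of the DOD as a minimum storage level are assumptions made for this example; they are not a reproduction of equations (7)–(10).

```python
def simulate_storage(e0, charge, discharge, capacity,
                     eta_charge=0.9, eta_discharge=0.9,
                     decay_per_hour=0.001, dt_hours=0.25,
                     c_rate=0.5, dod=0.8):
    """Step a linear storage model and check its limits.

    charge/discharge: energy per interval in kWh; capacity in kWh.
    Returns the storage level (kWh) after every interval.
    """
    max_step = c_rate * capacity * dt_hours   # energy limit per interval (C-rate)
    min_level = (1.0 - dod) * capacity        # reserve implied by the DOD limit
    levels = []
    level = e0
    for ch, dis in zip(charge, discharge):
        if ch > max_step or dis > max_step:
            raise ValueError("charge/discharge exceeds the C-rate limit")
        level = (level * (1.0 - decay_per_hour * dt_hours)
                 + eta_charge * ch - dis / eta_discharge)
        if not (min_level - 1e-9 <= level <= capacity + 1e-9):
            raise ValueError("storage level outside the allowed range")
        levels.append(level)
    return levels
```

In an optimization setting these checks would appear as constraints rather than exceptions; the simulation form is used here only because it can be run and inspected directly.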
2.3.1.2. Electricity grid and energy balance. The model includes an external grid for electricity. The equality constraint for the energy balance is presented by equation (11). Electricity consumed from the external electricity grid is represented by Grid Energy, while electricity fed to the grid is represented by Grid In. Note: Total Grid is described in equation (1).
2.3.1.3. Objective function. The objective of this problem is to schedule the battery storage optimally so that the electricity fed back to the grid (Grid In) can be minimized. Multi-objective optimization is performed with the weighted sum method [30]. Equation (12) shows the objective function (U), which minimizes both system costs and carbon emissions associated with grid electricity consumption. No additional carbon emissions are considered for utilizing PV-produced electricity. Equations (14)–(18) define the terms used in the objective function (12).
The parameter a in equation (12) is used as the weighting factor for the multi-objective optimization. When a is set to zero, the objective represents the minimization of CO2 emissions only. In this case, the battery storage is utilized at its maximum. On the other hand, when a equals one, the objective represents cost optimization only.
$$C_{\text{Investment}} = Cost_{\text{Battery}} \cdot Cap_{\text{Battery}} \cdot ACF_{\text{Battery}} \tag{14}$$
The total costs (C Total) comprise the investment cost of the battery storage (C Investment), fixed operation and maintenance costs (C OMF), and variable costs (C V). ACF in equations (14) and (15) is the annualized cost factor calculated based on the discount rate (r) and an estimated lifetime of the battery storage system. Other relevant parameters associated with the objective function are presented in Table 1.
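The expression for the ACF itself is not reproduced in this section. A widely used form is the capital recovery factor, sketched below under the assumption that n denotes the estimated lifetime in years (n is a symbol introduced here for illustration). Whether the study uses exactly this form cannot be confirmed from the text.

```latex
ACF = \frac{r\,(1 + r)^{n}}{(1 + r)^{n} - 1}
```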
In order to obtain the Pareto optimal solutions, a in equation (12) is increased in intervals of 0.05, starting from 0 (CO2 optimal) up to 1 (cost-optimal). For the investment cost of the battery storage and the carbon factor for grid electricity, two scenarios were explored: case-1 with 584 €/kWh [28] and 0.5 kgCO2-eq/kWh, and case-2 with 260 €/kWh [28] and 0.3 kgCO2-eq/kWh. Case-1 represents a conventional circumstance and case-2 represents a futuristic sustainable situation with reduced prices for battery storage and reduced CO2 emissions for grid electricity (Netherlands mix).
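The sweep over the weighting factor can be illustrated with a small sketch. Instead of solving the scheduling problem, it picks the best option from a list of candidate designs with made-up cost and emission values; the candidate list, the rescaling of both objectives to [0,1], and the function names are assumptions for this example only.

```python
def alpha_grid(step=0.05):
    """Weights from 0 (CO2-optimal) to 1 (cost-optimal) in equal steps."""
    n = round(1 / step)
    return [round(i * step, 10) for i in range(n + 1)]

def weighted_sum_choice(designs, a):
    """Pick the design minimizing a*cost + (1-a)*emissions.

    designs: list of (label, annual_cost, annual_co2) with hypothetical values.
    Both objectives are rescaled to [0, 1] so that the weights are comparable.
    """
    costs = [d[1] for d in designs]
    co2s = [d[2] for d in designs]

    def scale(v, vals):
        span = max(vals) - min(vals)
        return 0.0 if span == 0 else (v - min(vals)) / span

    return min(
        designs,
        key=lambda d: a * scale(d[1], costs) + (1 - a) * scale(d[2], co2s),
    )

# Hypothetical candidates: (label, annual cost, annual CO2 emissions)
designs = [("no battery", 100.0, 50.0), ("small", 120.0, 40.0), ("large", 200.0, 30.0)]
front = [weighted_sum_choice(designs, a)[0] for a in alpha_grid()]
```

Collecting the selected design for each weight, as in `front`, traces out the trade-off between the two objectives: the cost-optimal end favors little or no storage, while the emission-optimal end favors the largest storage.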

Results and discussion
Following the steps formulated in Section 2, the simulations were conducted, and the results are discussed accordingly.

Step 1: Clustering
In this step, importance is given to the number of clusters that should be selected. The two validity indices, DBI and SC, were calculated for each clustering day and averaged per month, as shown in Fig. 8. For DBI, it can be observed that when k = 2 or 3, the distribution of the index over the months is dispersed, while the remaining 'k' values show closer behavior over the months. Among the remaining 'k' values, the lowest point occurs in July for k = 4 clusters. The lower the DBI, the better the cluster performance. Even though higher 'k' values show acceptable performance, the distribution of the number of houses among the clusters was found to be unsatisfactory for them.
Similarly, for SC, the clusters show scattered behavior over the months when k = 2 or 3 and closer performance for the other k values. Among the other k values, many months show the best performance, with the highest SC value, when k = 4. Considering all these aspects and the distribution of the number of houses among the clusters, k = 4 is chosen as the most suitable value for this study. Fig. 9 shows the distribution of five houses among the clusters for the month of January in the years 2016 and 2017. The illustration gives a clear indication of the diversity of the demand profiles of the dwellings over different days.

Step 2: Demand prediction
This section discusses the day-ahead demand prediction for individual dwellings and clustered dwellings. For the trained models, the pseudo-R² value of all houses remained within the desired margin (>0.8). Fig. 10 demonstrates the R² variation of the daily trained models of a randomly selected house. Fig. 11 demonstrates the CVRMSE variation for the day-ahead normalized individual demand predictions of all 70 houses. The worst prediction day, the best prediction day, and the mean of the predictions are shown in the figure. Out of the 358 evaluated days in each of the years 2016 and 2017, 54 and 56 houses, respectively, were able to maintain the CVRMSE margin (≤ 0.3) for over 300 days. The worst performing house fulfilled the CVRMSE requirement on 161 days in 2016 and on 148 days in 2017. Fig. 12 illustrates the daily R² variation of the clustered trained models. When the number of houses in a cluster increases, the trained models perform better than on days where the cluster contains fewer houses. Fig. 13 shows the CVRMSE calculation for the clustered normalized predictions. For all clusters, in both 2016 and 2017, the required CVRMSE margin was maintained on over 345 days. This shows a prediction improvement for the dwellings in the neighborhood while adequately preserving the insights.
When the actual peak energy consumption (max: Total Grid) and the maximum energy fed back to the grid (min: Total Grid) of the whole neighborhood were compared with the accumulated individual predictions and the accumulated clustered predictions, the clustered predictions were closer to the actual values than the individual predictions (see Table 2).
In addition, when the energy demand of each dwelling has to be predicted, the energy management system faces a complex, time-consuming problem. In comparison, clustering the demand patterns before prediction significantly reduced the required computational time.

Self-sufficiency improvement
After the predictions are known, a battery storage system is introduced at the neighborhood level. If the battery storage is introduced at a larger level, one can question the purpose of cluster-level demand predictions within the neighborhood. If the demand prediction is performed for the whole neighborhood, the characteristics of individual buildings cannot be identified. However, with clustering, the characteristics can still be identified for similarly performing buildings. Therefore, clustered sub-level predictions are more useful than a single aggregated prediction for the whole neighborhood. Before introducing battery storage, as mentioned in Section 2.3, if the neighborhood operates under a separate micro-grid, the direct energy sharing potential between the dwellings can be calculated as 20 MWh for 2016 and 28 MWh for 2017. Table 3 shows a summary of the results for direct sharing and the self-sufficiency increment of the entire neighborhood. This type of information cannot be obtained if the demand is analyzed only at the level of the entire neighborhood. Fig. 14 illustrates the demand profiles of the clusters for a random day in 2016, from which the direct energy sharing potential can be understood. A minus value in Fig. 14 indicates that the dwellings are overproducing and electricity is fed back to the grid. During the overproducing period, the dwellings in Cluster-2 appear to be consuming energy. So, this period demonstrates how energy is directly shared between the houses.
The remainder of this section discusses the overall sharing potential after introducing the battery storage system. Fig. 15 illustrates the Pareto frontiers obtained for the conventional situation and the futuristic sustainable case for the years 2016 and 2017. In the figure, the accumulated clustered predictions are compared with the accumulated actual individual consumption of the buildings.
In each case, a = 0.5 is marked, where the CO2 emissions and costs are given similar priority. The value of a in the sub-graphs in Fig. 15 varies from one to zero from left to right. In case-1, where the battery storage prices are high, it is visible that when a = 1, which is the cost-optimal case, no battery storage is included. However, in case-2, battery storage is included even in the cost-optimal case.
After obtaining the Pareto solutions, the aim is to select a suitable battery capacity to be installed in the analyzed neighborhood. Instead of choosing one capacity value, the battery capacities for a = 0.25, 0.5, and 0.75 have been compared and the corresponding self-sufficiency increment has been recorded. Table 4 demonstrates the overall self-sufficiency increment of the entire neighborhood and the surplus energy fed back to the grid after introducing the battery storage system for the different a values.
The behavior of the storage system for two random days is illustrated in Fig. 16 and Fig. 17. Demand in the figures (Total Grid) illustrates the accumulated profile of the entire neighborhood. All symbols presented in the figures correspond to the symbols defined in section 2. For the day with low PV electricity production (Fig. 16), all the excess energy could be stored in the battery system. In contrast, on the day with high PV production (Fig. 17), the battery could handle only a portion of the excess electricity produced. Therefore, most of the produced electricity is still fed back to the grid (Symbol: Grid In). Both figures correspond to the year 2016 sustainable scenario with an 'a' value of 0.5 in the objective function. In Fig. 17 it can also be observed that the battery system is discharging at the beginning of the day. This is because of the energy stored in the battery storage during the previous day. Fig. 18 illustrates the energy consumed from the grid and the energy fed back to the grid for the CO2 optimal case (a = 0) for the entire neighborhood (all 70 houses). In the CO2 optimal case, the maximum battery storage capacity is utilized. Nevertheless, the figure shows that a significant amount of produced energy is still fed back to the grid without having the opportunity to be consumed inside the neighborhood. The analysis showed that about four months out of the twelve are 100% energy self-sufficient in both 2016 and 2017.

Conclusion
This paper discussed the improvement of energy self-sufficiency of a renovated residential neighborhood that utilizes HP and PV systems. The presented method incorporates a futuristic smart-grid application with an efficient unsupervised clustering method and a neighborhood-level energy management system (NEMS).
It was observed that clustering similar demand patterns significantly reduced the data burden for prediction and the computational time required, while increasing the prediction accuracy. Although the k-means and random forest algorithms performed well in this paper, they may not be suitable for other scenarios. A sensitivity analysis is needed on some important variables such as the electricity tariff. In addition, this study assumes perfect foresight of PV generation, which does not resemble actual operation. It is therefore recommended as future work to compare different clustering and prediction algorithms, including variable tariff structures and PV prediction, as a further advancement of the proposed method.
To accommodate all-electric heating systems, it is important to increase the self-sufficiency of buildings. This study showed that, with the involvement of data-driven prediction and control algorithms, this target can be achieved during the spring/summer months. However, a point highlighted in the study is the amount of excess energy that cannot be consumed even after introducing the battery storage systems. To further improve self-sufficiency, the proposed method can also be used for other types of applications, such as introducing electric vehicle charging points at the neighborhood level. As many elements across the grid electrify, improvement of self-sufficiency is important to meet the total electricity demand through clean energy sources. For the practical realization of such a proposed NEMS, the traditional business processes, regulations, and energy markets need to be redesigned. Such concepts will be valuable in a smart grid in either the low-voltage AC or DC context and will help service providers attain considerable efficiency gains. With the knowledge presented in this paper, one can develop new ideas based on this verified method, for example, to show the quality of prediction and the ultimate cost of the system. In conclusion, this paper showed one of the many possibilities for the clever utilization of renewable electricity for a better alignment of supply and demand, towards a future-proof smart energy system at the neighborhood level.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment
The research work is funded by the Netherlands Organisation for Scientific Research (NWO) Perspective program TTW Project B (14180), "Interactive energy management systems and lifecycle performance design for energy infrastructures of local communities" (https://ses-be.tue.nl/). We wish to express our gratitude to all the organizations which have been a part of this program. Special thanks are given to BAM Bouw en Techniek for helping with the data collection process.

A. Cleaning of data
Prior to the development of the clustering and prediction methods, the collected data needed to be preprocessed [22]. The preprocessing is performed to modify the erroneous outliers and fill the missing data points. In this case, offline meters and the unavailability of machine to machine connections are found to be the reasons for these outliers and missing points. The provided data contained some unrealistic outliers, valued a hundred times larger than the mean. The data is inspected visually to see where the threshold for a so-called outlier should be. According to this visual inspection, a threshold for outliers is determined for each data set. Every value above this is replaced by a realistic replacement which is its preceding value.
For the sake of quality and reliability, missing values of the data set should also be filled. These houses have missing values in the 'consumption' variable, ranging from 5.6% to 19.7% (total data points: 68,832 per house). These missing values are first filled with the values of the previous time stamp. This first step does not fill all the missing data points. Therefore, the next step is to fill the missing points with the consumption values of the identical time stamp on the previous day. These two steps reduced the number of missing data points drastically. The final step, the application of a moving average, was able to fill most of the remaining missing values while smoothing the data variation. The whole process is illustrated in Fig. 19.
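The first cleaning steps can be sketched as follows. This is an illustrative outline under assumptions: missing points are represented by `None`, the forward fill is limited to a hypothetical `max_ffill` consecutive points (the text only indicates that this step leaves some gaps), and the final moving-average step is omitted for brevity.

```python
def clean_series(values, threshold, samples_per_day=96, max_ffill=1):
    """Clean a 15-min series given as a list of floats, with None for gaps."""
    out = list(values)
    # 1) Replace outliers above the threshold by the preceding value.
    for i, v in enumerate(out):
        if v is not None and v > threshold:
            out[i] = out[i - 1] if i > 0 else None
    # 2) Forward-fill short gaps from the previous time stamp.
    run = 0  # length of the current gap
    for i in range(len(out)):
        if out[i] is None:
            run += 1
            if i > 0 and out[i - 1] is not None and run <= max_ffill:
                out[i] = out[i - 1]
        else:
            run = 0
    # 3) Fill remaining gaps with the same time stamp of the previous day.
    for i in range(samples_per_day, len(out)):
        if out[i] is None:
            out[i] = out[i - samples_per_day]
    return out
```

Points that are still missing after these steps (for example, gaps on the first day) remain `None` and would be handled by the subsequent moving-average step.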

B. K-means clustering
K-means clustering is a convenient approach to divide a dataset into k distinct groups [21]. The main principle of the k-means clustering approach is to minimize the 'within-cluster variation' [21], taking into account two important properties: (i) each observation (in this case: Total Grid) belongs to one of the k clusters, and (ii) no observation belongs to more than one cluster, i.e., the clusters are non-overlapping. The problem that has to be solved to create good clusters according to the k-means method is represented by equation (19) [20], where k is the number of clusters and (C K ) is a measure of the amount by which the observations within a cluster differ from each other, named the within-cluster variation.
The within-cluster variation can be expressed in a number of ways, however, by far the most used method involves the squared Euclidean distance. The enthusiastic reader is referred to Ref. [20] to understand more about Euclidean distance. In this method, the 'k' value should be chosen manually. This is a limitation [34] of this method.
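For orientation, the standard textbook statement of this problem with the squared Euclidean distance is sketched below. The symbol W for the within-cluster variation and the index j are notational assumptions made here, and the expression is not guaranteed to match equation (19) term by term.

```latex
\min_{C_1, \ldots, C_k} \sum_{j=1}^{k} W(C_j),
\qquad
W(C_j) = \frac{1}{|C_j|} \sum_{x,\, x' \in C_j} \lVert x - x' \rVert^{2}
```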

C. Random-forest algorithm
A recent study by the authors, published elsewhere [22], discussed the accuracy of different machine learning algorithms and concluded that random forest performs better than the other machine learning techniques considered. Random forest is an ensemble learning method [20] consisting of a collection of regression trees. Since it consists of multiple trees, the predicted value is the average over all the constructed trees. The goal of this is to de-correlate the individual trees and make the prediction more reliable. However, random forest does not predict well beyond the range of the training data [35]. It may also overfit data sets that are particularly noisy, which prevents accurate predictions.