A Win-Win Algorithm for Learning the Flexibility of Aggregated Residential Appliances

In the Demand Side Management (DSM) context, residential customers can reduce costs and relieve the grid by shifting non-thermostatic appliances. These appliances can be optimally scheduled by a central entity that takes user preferences into account. However, users might not be able to communicate their preferences a priori, which leaves the central entity the task of learning them without causing discomfort. With this premise, this study explores a DSM program that learns how willing realistic simulated users are to accept time shifts of home appliances, such as washing machines and dishwashers, and analyses the benefits that arise from their inclusion. To this end, the proposed Acceptance Learning Algorithm 2.0 (ALA 2.0) minimises costs in scenarios with different energy sources and different levels of acceptance of time shifts, optimally scheduling the appliances within the boundaries found by the algorithm. ALA 2.0 interacts with the user in a simple way and is able to learn preferences even when the user's behaviour is influenced by external factors that are not directly observable and when users make very few requests. Experimental results show that it is possible to learn the simulated users' acceptance of time shifts without any prior knowledge and without causing too much discomfort, achieving a win-win situation. As an example, more than 90% of requests were accepted in December, which is chosen as a representative month.

The associate editor coordinating the review of this manuscript and approving it for publication was Alon Kuperman.

request_ij: 1 if there is a request for appliance i from user j, 0 otherwise.

DECISION VARIABLES
P_to_t: amount of power given to the grid at time t.
P_from_t: amount of power from the grid at time t.
E_bat_t: amount of energy in the battery at time t.
PC_bat_t: charging power of the battery at time t.
PD_bat_t: discharging power of the battery at time t.
PD_on_t / PC_on_t: binary variables that indicate whether the battery is discharging / charging at time t.
x^f_ij ∈ {0, 1}: binary variable that selects the day load profile of appliance i of customer j.

I. INTRODUCTION
Residential users are part of the energy transition the world is currently facing. Whilst a single consumer has a negligible impact on the energy system, the residential demand as a whole could help significantly to exploit more renewable generation, relieve the grid and reduce costs. To this end, different changes in household practices might occur, from energy conservation to temporal shift and automated appliance scheduling.
Under time-varying rates, these changes should result from the user's actions in response to the price signal. However, the user might not optimally schedule appliances or it might respond poorly: higher responsiveness has been noticed in customers who claim to know how to act to modify the electricity consumption [1].
Consequently, the task of finding the best time-slot for turning on appliances might be left to a central entity in automated demand response programs. However, when dealing with appliances such as washing machines and dishwashers, particular attention should be put into the design of the program in order not to affect the comfort of the user, leaving a certain degree of control and the possibility to override exogenous decisions to the user [2], [3].
In the attempt of understanding the effective tools for motivating user participation in automated demand response programs, [4] surveyed the intention to adopt a demand response program where the energy provider could control its customers' dishwasher to use the excess of solar energy. In the proposed program, it was guaranteed that the appliance would have automatically been started within 6 hours and that the automatic control could have been prevented twice per month.
With this background in mind, this paper proposes a DSM program similar to the one surveyed in [4], but where a realistic user can decide whether to accept or refuse the proposed shift of appliances such as washing machines and dishwashers (hereinafter referred to as ''shiftable appliances'') on the basis of its preferences, which are centrally learnt by the proposed algorithm. Consequently, the proposed work tries to address three major challenges: i) finding the optimal schedule of users' appliances when preferences are not known a-priori but centrally learnt in an interactive way, by proposing shifts in the use of an appliance that match the user's preferences, which minimises the users' discomfort and maximises the users' acceptance; ii) modelling a realistic user, i.e. one that is not completely rational, including random factors that influence the acceptance of the shift proposed by the aggregator, which makes finding the solution more difficult; iii) testing a user-friendly DSM program to understand the possible consequences of this DSM program in the real world.
To simulate and test the DSM program, the following actors were modelled: i) the Aggregator, responsible for the DSM program; ii) the Users, representing the residential customers who enrolled their appliances in the DSM program; and iii) the Market, which informs the Aggregator of the day-ahead prices.
In order to centrally learn the acceptance of users, we used a Mixed Integer Linear Programming (MILP) formulation, which aims at minimising costs from different energy sources, coupled with an improved version of the algorithm presented in [5], i.e. the Acceptance Learning Algorithm 2.0 (ALA 2.0). Fig. 1 shows an oversimplified and intuitive example of a user making a request to use the washing machine. In a nutshell, the shift proposed by the MILP formulation is evaluated by the user, and its acceptance or refusal is used to increase the knowledge of its preferences. Depending on the answer, the appliance is turned on at the proposed time-slot or at the one originally requested.
From a psychological point of view, this mechanism might be perceived differently by the user since i) the user is in control of its appliances and ii) from the user's perspective, it is not communicating personal preferences, it is just evaluating the proposal of a central entity.
Then, we tested different scenarios increasing progressively the flexibility of the users, with and without the presence of Photovoltaic (PV) systems and Energy Storage Systems (ESS). Results obtained from the different scenarios are compared, demonstrating that the algorithm is always able to learn the acceptance to the shift in time of the simulated users.
A. RELATED WORK
The literature proposes various solutions and approaches for taking into account the user's preference, i.e. the maximum temporal shift for appliances. The criterion used to select the literature solutions described in the following was the way in which such studies considered and modelled the user.
In [6] and [7], the boundaries allowed for shifting the appliances are known, immutable and equal for all users. Melhem et al. [6] compare a MILP model with the proposed math-heuristic optimization algorithm under different scenarios. Mathematical models for the grid, renewable energy sources, ESS and electric vehicles are considered. In [7], costs are minimised following the known constraints on the customer's preference. Both PV systems and ESS are considered in the MILP formulation.
A mixed integer programming formulation where preferences are communicated by the customer once and updated only if limits change is proposed in [8]. The authors considered 250 users that manifest the level of preference in different periods where the appliance may be turned on.
Liu et al. [9] analyse a residential demand response program where the most representative appliances are discussed. The control strategy used for the shiftable non-thermostatic loads, e.g. the dryer, is ''price naming'', where the appliance is turned on when the locational marginal price drops below the desired price threshold. Annual costs of different scenarios (with and without load control, in the presence or absence of a solar farm with ESS) are compared. Instead, Manganelli et al. [10] propose a case study of an existing residential and commercial building with a microgrid and control systems, preserving users' habits and comfort. The shiftable loads considered are washing machines and dishwashers. The building energy management system indicates possible slots and their costs, then the user selects a slot.
To schedule home appliances while minimising both user dissatisfaction and costs, [11] proposes a multi-objective DR optimization model solved through the Constrained Many-Objective Non-Dominated Sorted Genetic Algorithm. Maximum start-end times are defined by the user. RES and ESS are also considered. A multi-objective DR optimization model is also used in [12], where it is solved with a genetic algorithm. Its objective function uses a weighting factor representing the proportion between power consumption cost, based on the day-ahead electricity price, and discomfort cost. The user can strike a balance between discomfort and cost through the weighting factor. Thermal loads, flexible deferrable loads (e.g. electric vehicles) and non-flexible deferrable loads are considered. Two scenarios are compared: i) the first scenario weighs power consumption and discomfort costs equally; ii) the second considers consumption cost only.
For a single household, [13] uses Q-Learning without the assumption of knowing the dis-utility function of user's dissatisfaction. The authors propose a fully automated energy management system that receives a request from the user for using an appliance. The request time does not necessarily coincide with the target time, which represents when the user prefers the request to be satisfied. Then, the energy management system schedules when to satisfy the request. The user can decide to cancel requests not completed yet. Then, the user evaluates some completed/cancelled requests. The evaluation corresponds to the user's dissatisfaction.
In [14], a quality of experience-driven approach is used. Starting from a survey of 427 subjects, it found no correlation between appliance usage habits and users' data. Therefore, through the k-means algorithm, it obtains different profiles on the basis of the preferences collected through a questionnaire. The users had to indicate the degree of annoyance from 1 to 5 (the minimum and the maximum level, respectively) for certain delays of up to 3 hours. New customers, instead, answer annoyance rating questions about task shifting for a short amount of time. Then, one of the obtained profiles is assigned to the new customer. Two algorithms are illustrated to find the optimal time-slot for the appliances. Hakimi et al. [15] propose a new method for certain types of controllable loads (i.e. dishwasher, washing machine and heating/cooling system) where the maximum shift was chosen considering the consumers' welfare extracted from a survey. The use of appliances is shifted to the time at which the difference between load and RES power is maximum, always taking into account the consumers' welfare. Table 1 summarises the proposed literature review, highlighting the main features of each solution and comparing them with ours.

B. CONTRIBUTION
The column ''user preference'' summarises how user preferences are considered. Two main categories emerge: known, i.e. decided a-priori or set on the basis of the preferences of the majority, and learnt. According to [14], ''most of the literature solutions consider the customer comfort as a set of hard constraints on appliance usage, a-priori set without profiling among different kinds of customers, which are likely to have different subjective needs. Moreover, emphasis is often put on the cost or energy optimisation, but no metrics for a-posteriori evaluation of the perceived quality is given''. As in [14], we wanted to focus on the end-user. However, we opted for a real-time evaluation of the proposed shift. The authors of [12] offer the user the possibility to find a balance between discomfort and cost. However, the consequences of choosing a certain value are not immediately evident to the user. Indeed, ''different people are at different stages of awareness of effective actions to take during a demand response activity. Learning to be energy adaptable and energy efficient is a journey where steps to improve results, increase ease and adjust comfort levels are taken one step at a time'' [16]. Therefore, we chose to ask an individual ''yes'' or ''no'' question, which should be easily understood by users and which can guide the user in making optimal choices.
Depending on how the user is modelled, the method to solve the problem might change (refer to column ''Method'' in Table 1). Similarly to [13], we aim at learning the acceptance of the user to the shift of appliances. Nevertheless, we opt for a centralised optimisation with a view to a future collective energy community. Therefore, the main contribution of this paper w.r.t. literature solutions consists of trying to learn preferences in a centralised way using dynamic constraints set by the proposed ALA 2.0 in the MILP formulation for the considered 3000 users. Instead, methods such as Q-Learning suffer from the ''curse-of-dimensionality'' which limits the problem to few users [13].
With respect to our previous work [5], which presents Acceptance Learning Algorithm 1.0 (ALA 1.0), the main contribution lies in learning the user preferences without causing too much discomfort to all the users, i.e. we diminished the number of times in which the central entity proposes a time-slot for turning on the appliance which does not respect user preferences. Moreover, we modelled the users' behaviour more realistically. Sometimes, even if the proposed shift meets the user preferences, the request from the aggregator is not accepted by the user to model external factors that might influence the routine of the user. These refusals will be hereinafter referred to as ''random refusals''. Thus, the users are no longer modelled based on how they ''should'' behave, rather their behaviour is more descriptive, including also random factors not directly observable. Furthermore, we increased the number of users to 3000 to obtain more realistic simulations, comparing new scenarios with different sets of energy sources. Consequently, the performance of ALA 2.0 with the new more realistic users has been compared with ALA 1.0 presented in [5], demonstrating that ALA 2.0 learns the time-slots preferred by users without causing too much discomfort to them, even in the case of users that do not use the appliances very often or randomly reject the shifts.

II. THE FRAMEWORK
In this section, we present the proposed framework by introducing briefly the tools used, listing the models and describing the agents. For our framework, we chose Mosaik [17], a flexible and modular smart-grid co-simulation framework, to coordinate simulators and Aiomas [18] to create the agent environment, i.e. the actors.
The simulators implement the models of the electricity consumption of each family, the PV system generation and the ESS. The electricity consumption, generated using [19], is related to the usage of the appliances of each household. Thus, each family is always characterised by its own load profile. Instead, the PV systems from [20] and a generic model of an ESS (see the formulation in Section II-B) may or may not be included.
The agents represent the intelligence that controls the models. Three types of actors interact in the proposed scenarios: i) the User which decides whether to answer positively or negatively to the requests of the Aggregator according to its perceived discomfort, ii) the Aggregator which manages the DSM contracts, optimally shifting the appliances and learning the user acceptance with ALA 2.0, iii) the Market which communicates the day-ahead market price. The main interactions are shown in Fig. 2.
The simulation evolves in time steps of 15 minutes; however, two time-slots are more important than the others, as described below.
i) Every (simulated) day at 21:00, the Market informs the Aggregator on the day-ahead market prices (see block ''1'' in Fig. 2). In the same manner, each User communicates the aggregated daily load schedule as well as each shiftable appliance schedule to the Aggregator (see block ''2'' in Fig. 2). The Aggregator knows also the 24-hour-ahead forecast of the PV production.
Using the obtained data, the Aggregator performs two optimisations, OPT 1 and OPT 2. OPT 1 minimises cost without shifting any appliance: it is used as a baseline to understand potential savings. Instead, OPT 2 might anticipate or postpone the shiftable appliances by a certain amount of time according to the knowledge of the Aggregator on user acceptance. OPT 2 is used to propose the optimal action to the user (see block ''3'' in Fig. 2).
ii) Every (simulated) day at 23:00, the user communicates its answer to the proposed shift (see block ''4'' in Fig. 2). This step gives more information on user preferences. Then, the optimisation, that decides the strategy for the day after, is performed using the obtained information. Therefore, costs are minimised by shifting only the appliances that have been allowed to be shifted.
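The daily timeline described above can be sketched as follows. This is a minimal illustration only: the 15-minute resolution and the two clock times come from the text, whereas the event labels and the helper names (`step_of`, `events_for_day`) are hypothetical and do not reflect the Mosaik or Aiomas APIs.

```python
from datetime import time

STEP_MINUTES = 15                         # simulation resolution stated in the text
STEPS_PER_DAY = 24 * 60 // STEP_MINUTES   # number of steps in one simulated day


def step_of(clock: time) -> int:
    """Return the index of the 15-minute step that starts at `clock`."""
    return (clock.hour * 60 + clock.minute) // STEP_MINUTES


# Hypothetical event labels; only the two clock times come from the text.
DAILY_EVENTS = {
    step_of(time(21, 0)): "prices_and_schedules_received_then_OPT1_OPT2",
    step_of(time(23, 0)): "user_answers_received_then_final_optimisation",
}


def events_for_day() -> list[tuple[int, str]]:
    """List the (step, event) pairs triggered during one simulated day."""
    return [(s, DAILY_EVENTS[s]) for s in range(STEPS_PER_DAY) if s in DAILY_EVENTS]
```

In a co-simulation, each agent would react only at its own step index and remain idle during the other steps of the day.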
The rest of this section will describe in-depth the engine for the User and the Aggregator. The Market simply provides the day-ahead prices. In future works, we plan to enhance it by implementing advanced market dynamics and policies.

A. THE USER AGENT
The User represents a realistic simplification of the real household. To model its decisions w.r.t. the delay of the shiftable appliances, first we distinguished between i) aggregated electricity consumption, containing the aggregated load profile of all not-shiftable appliances, and ii) disaggregated load profiles of the shiftable appliances, i.e. the washing machine and the dishwasher, using the simulator proposed in [19].
Then, we modelled the discomfort created to the user when shifting certain appliances from the desired starting time. If this delay causes dissatisfaction greater than the one tolerated, the switching-on of the appliance is not anticipated or postponed. In this case, there is no economic punishment for the user. On the contrary, if the User accepts the request, it receives an economic reward (direct or computed as the savings derived from the shift). Thus, to model the behaviour and the level of acceptance of the User, two assumptions have been made.
Assumption 1: At the beginning of the simulation, each User has an opinion on the DSM program described by a coefficient in the range [0,1]. It is 0 when the user does not like the DSM program at all and 1 when the user strongly appreciates it. Accordingly, 0.5 stands for a User with a neutral opinion. The opinion of the User changes based on the shift proposed by the Aggregator. Indeed, the opinion highlights how well the DSM program is performing: if the proposed shifts are in line with the preferences of the user, the opinion increases; if the User refuses the request of the Aggregator, the opinion decreases. This is formulated by Equation 1, where the opinion is increased or decreased by q, an arbitrary quantity equal in both cases, i.e. 0.01. Therefore, we removed the assumption, modelled in [5], through which the user acceptance is partially influenced by its experience, in order to obtain a clear lower bound. In the previous model of the user in [5], the answer to requests was deterministic, except for shifts close to the maximum acceptance, i.e. those that could be influenced. Instead, with the proposed model of the user, the behaviour is more descriptive, including also random factors not directly observable: even a request for a small shift from the aggregator might be refused due to external factors that influence the routine of the user, i.e. the ''random refusal''.
In case the proposed time-slot is exactly the one chosen by the User, the answer is implicitly affirmative and the opinion does not undergo variations. Therefore, the opinion of the User is a way to visualise the performances in time of the algorithm from the point of view of users' preferences. Thus at each time step, it will be visible if the proposal of the aggregator matches the user's preference since an increase in the opinion of the amount q means that the user accepts the request, while a decrease stands for a refusal.
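The opinion dynamics of Assumption 1 can be illustrated with a short sketch. The step q = 0.01 and the rule that an unshifted proposal leaves the opinion unchanged come from the text; the function name `update_opinion` and the clamping of the result to [0, 1] are assumptions made here for illustration.

```python
Q_STEP = 0.01  # the quantity q of Assumption 1


def update_opinion(opinion: float, accepted: bool,
                   shifted: bool = True, q: float = Q_STEP) -> float:
    """Return the User's opinion after one proposal of the Aggregator.

    If the proposed time-slot equals the requested one (shifted=False),
    the answer is implicitly affirmative and the opinion is unchanged.
    """
    if not shifted:
        return opinion
    new_opinion = opinion + q if accepted else opinion - q
    # Assumption: the opinion is kept inside its stated range [0, 1].
    return min(1.0, max(0.0, new_opinion))
```

Plotting the returned value over time reproduces the kind of curve discussed for Fig. 4: each upward step of q is an accepted shift and each downward step is a refusal.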
Assumption 2: It has been supposed that a User acts according to its level of comfort and routine. Unlike in our previous work [5], the User is not utterly rational, i.e. the answer may be negative even if the discomfort is below a certain threshold. Indeed, a degree of randomness in the answer, depending on external factors such as deviations from the usual behaviour due to commitments, has been included. Thus, the User may decline shifts that it usually accepts. The rational part of the User's behaviour has been modelled using a dis-utility function as in [21], [22]. It indicates the dissatisfaction created by the delay from the desired time-slot: larger shifts correspond to greater dissatisfaction. We formulated it as the squared difference between the desired and the proposed starting time. The threshold on the dissatisfaction function indicates the maximum value that usually corresponds to an affirmative answer (i.e. excluding random answers). Over the threshold, the User does not accept the proposal made by the Aggregator. An example is proposed in Fig. 3.
The different simulations will consider three degrees of acceptance. The minimum level of this threshold corresponds to a 1 hour delay, according to the survey in [2]. The maximum flexibility tested reaches arbitrarily 5 hours. Additional acceptance to the shift would lead to even further savings.
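A minimal sketch of the User's decision under Assumption 2 is given below. The squared-difference dis-utility and the 5% random-refusal probability used for the 3000 Users are taken from the text; expressing the threshold as the square of the maximum tolerated shift, as well as the names `disutility` and `user_accepts`, are illustrative assumptions.

```python
import random
from typing import Optional


def disutility(desired_start_h: float, proposed_start_h: float) -> float:
    """Squared difference between desired and proposed starting time (hours)."""
    return (proposed_start_h - desired_start_h) ** 2


def user_accepts(desired_start_h: float, proposed_start_h: float,
                 max_shift_h: float, p_random_refusal: float = 0.05,
                 rng: Optional[random.Random] = None) -> bool:
    """Return True if the User accepts the proposed starting time."""
    rng = rng or random.Random()
    # Assumption: the threshold is the dis-utility of the largest tolerated shift.
    threshold = max_shift_h ** 2
    if disutility(desired_start_h, proposed_start_h) > threshold:
        return False                      # over the threshold: always refused
    if proposed_start_h == desired_start_h:
        return True                       # no shift: implicitly affirmative
    # Below the threshold, a "random refusal" may still occur.
    return rng.random() >= p_random_refusal
```

For example, with `max_shift_h=1` a proposal three hours later than desired is always refused, while a half-hour shift is accepted unless a random refusal occurs.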

B. THE AGGREGATOR AGENT
The Aggregator manages the DSM contracts, decreasing costs and learning the acceptance of the Users. The process followed by the Aggregator to discover users' preferences while minimising costs is described in the following and in Algorithm 1.

1) PRE-ACTION
This step aims at discovering the shift in time allowed by the user, which is given as input to the optimisation. For the selection of the pre-action, i.e. the shift allowed, a decreasing ε-greedy algorithm is used. Thus, for a fraction (1 − ε) of the requests, ALA 2.0 chooses as input the vector of time-slots within the shift (see the next step, Action) that gives the highest reward (Exploit, see lines 11-13 in Algorithm 1), similarly to ALA 1.0 [5]. Otherwise, a ''guided'' exploration (Explore, see lines 14-16 in Algorithm 1) is performed in order to avoid too much discomfort to the user: it starts by exploring smaller shifts, increasing the time-slots available each time the user answers positively and decreasing the time-slots allowed if the user refuses the proposal of the Aggregator. In this version of the algorithm, the value of ε arbitrarily decreases each month down to the value of 0.1. It does not reach zero, in order to keep collecting information and capture possible variations in the behaviour of the user. With real users, ε should be appropriately calibrated.
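The pre-action selection can be sketched as below. The exploit/explore split follows lines 10-17 of Algorithm 1, and the upper bound of 24 hours is the bookmark value that appears in Algorithm 1; the one-hour step of the bookmark, the lower bound of 1 hour and the function names are assumptions, since the exact update rule is not fully spelled out in the text.

```python
import random

MAX_SHIFT_H = 24  # largest bookmark value appearing in Algorithm 1


def choose_preaction(best_action: int, bookmark: int, epsilon: float,
                     rng: random.Random) -> tuple[int, bool]:
    """Return (preaction, explore) following a decreasing epsilon-greedy rule.

    Actions are identified by the allowed shift in hours (+/- h).
    """
    if rng.random() < 1.0 - epsilon:
        return best_action, False         # Exploit: action with the highest value
    return bookmark, True                 # Explore: action indicated by the bookmark


def move_bookmark(bookmark: int, accepted: bool) -> int:
    """Guided exploration: widen the window on 'yes', narrow it on 'no'.

    Assumption: the bookmark moves by one hour per answer.
    """
    if accepted:
        return min(MAX_SHIFT_H, bookmark + 1)
    return max(1, bookmark - 1)
```

Starting the bookmark at small shifts is what keeps early proposals close to the requested time-slot, in contrast to drawing the explored shift uniformly at random.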

2) ACTION
The optimisation, which may include PV systems and ESS in accordance with the designed scenario, is performed. This optimisation schedules the start of the shiftable appliances in the best allowed time-slot. This time-slot is the ''action'' that will be proposed to and evaluated by the user (see line 18 in Algorithm 1). The objective of the optimisation (Equation 3) is to minimise costs considering day-ahead prices (c_from_t) and, if installed, the costs related to PV systems (c_pv_t) and ESS (c_dis_bat_t). If PV systems are present, the surplus may be sold to the grid at a price c_to_t.
In this simplified scenario, we consider an individual virtual battery with a capacity equal to the sum of all capacities replacing the ESS. The following constraints are not valid if ESS is not present (Equations 4-11).
♦ ESS Constraints:

$$PD\_bat_t \le PD\_on_t \cdot D_{max}, \quad \forall t \qquad (5)$$
$$E\_bat_t \le capacity, \quad \forall t \qquad (6)$$
$$E\_bat_t \ge minCharge, \quad \forall t \qquad (7)$$
$$E\_bat_{t=1} = E\_bat\_init \qquad (8)$$
$$E\_bat_t = E\_bat_{t-1} + \delta \cdot eff \cdot PC\_bat_t - PD\_bat_t \cdot \delta / eff, \quad \forall t > 0 \qquad (9)$$
$$PC\_on_t + PD\_on_t \le 1, \quad \forall t \qquad (10)$$

Specific charge and discharge rates are associated with the battery since these bounds cannot be exceeded (Equations 4-5, respectively). The maximum capacity and a minimum charge characterise the battery (Equations 6-7), while the energy stored at the beginning of a new day is equal to the energy stored at the end of the previous day (Equation 8). Furthermore, the battery follows the simplified model (Equation 9). Charge and discharge cannot happen simultaneously (Equation 10). In addition, we decided to use the battery for self-consumption (Equation 11).
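The simplified battery model can be checked numerically with the sketch below. `battery_step` implements the energy update of Equation 9, and `is_feasible` tests the discharge bound, the capacity and minimum-charge limits and the no-simultaneous-charge/discharge rule. The default efficiency of 0.95 and the function names are hypothetical; δ = 0.25 h corresponds to the 15-minute time step.

```python
def battery_step(e_prev: float, pc: float, pd: float,
                 delta_h: float = 0.25, eff: float = 0.95) -> float:
    """Energy in the battery after one step (Equation 9).

    e_prev: energy at the previous step; pc / pd: charging / discharging power.
    delta_h: step length in hours; eff: efficiency (0.95 is a hypothetical value).
    """
    return e_prev + delta_h * eff * pc - pd * delta_h / eff


def is_feasible(e_now: float, pc: float, pd: float,
                capacity: float, min_charge: float, d_max: float) -> bool:
    """Check one time step against the ESS limits described in the text."""
    simultaneous = pc > 0 and pd > 0      # forbidden by the on/off binaries
    return (
        pd <= d_max                       # discharge rate bound
        and min_charge <= e_now <= capacity
        and not simultaneous
    )
```

In the MILP these conditions are linear constraints on the decision variables; the functions above only verify a candidate schedule after the fact.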
♦ Balance Constraint: Power balance is preserved by Equation 12. If PV systems are not present, the surplus of energy cannot be sold to the grid. L_shift_ij is a cycle matrix containing the feasible allocations of the consumption vector of the shiftable appliances, while x^f_ij is a binary variable that selects the day load profile of appliance i of customer j. All the parameters presented in bold in Equation 3 and Equation 12, together with the ESS constraints, are optional: they are removed if the scenario does not include the corresponding model.
♦ User request: Low and Up limits are associated with each appliance of each user and depend on the pre-action. When user j decides to use appliance i, request_ij is set to 1 and the appliance must be turned on that day. Otherwise, the sum of the binary variables is zero and the appliance is not turned on.
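One possible way to build the matrix of feasible allocations and to apply the binary selection is sketched below. It assumes that each row of L_shift_ij is the requested day profile cyclically shifted by a number of time steps between the Low and Up limits, and that the binary vector x sums to request_ij; the function names and this exact construction are illustrative assumptions rather than the formulation used in the paper.

```python
def feasible_profiles(profile: list[float], low: int, up: int) -> list[list[float]]:
    """Rows: the day profile shifted by k steps, for k from -low to +up.

    Negative k anticipates the appliance, positive k postpones it.
    Assumption: shifts wrap around the day ("cycle matrix").
    """
    n = len(profile)
    return [[profile[(t - k) % n] for t in range(n)] for k in range(-low, up + 1)]


def selected_profile(rows: list[list[float]], x: list[int]) -> list[float]:
    """Load profile picked by the binary vector x (sum(x) equals request_ij)."""
    if sum(x) not in (0, 1):
        raise ValueError("at most one feasible profile can be selected")
    n = len(rows[0])
    return [sum(x[f] * rows[f][t] for f in range(len(rows))) for t in range(n)]
```

With `sum(x) == 0` the returned profile is all zeros, which corresponds to a day without a request for that appliance.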

3) USER EVALUATION
After the OPT is solved, the optimal shift is communicated to the User, which accepts or declines the proposal of the Aggregator congruently to Assumption 2, i.e. the threshold plus a random behaviour.

4) UPDATE
When the User refuses the request, the action is penalised with a small negative reward. Otherwise, the action is rewarded proportionally to the amount of delay from the desired start, i.e. greater time shifts are rewarded more. If the Explore pre-action had been chosen, the bookmark is updated based on the user's answer (see lines 22-25 and 29-33 in Algorithm 1). Recent answers weigh more than previous ones, in case the User changes its opinion. Therefore, information on the selected action is updated following Equation 14.
$$Q_{n+1} = Q_n + \alpha \, [R_n - Q_n] \qquad (14)$$
where α is the constant step-size parameter, Q_n is the estimate for the n-th reward and R_n represents the n-th reward [23]. Thus, the difference w.r.t. ALA 1.0 (presented in our previous work [5]) is in the Explore pre-action. For the Explore pre-action of ALA 1.0, a random number r between 1 and 24 is generated. It corresponds to the number of hours considered for the shift. Consequently, a vector is created that includes all the time-slots between r hours before and r hours after the time-slot requested by the user. Depending on the random number, the time shift proposals might not be in line with user preferences for a long period, causing discomfort to the user. Instead, with ALA 2.0, only a few actions, corresponding to smaller shifts, are allowed at the beginning, to increase the probability of asking for shifts liked by the user. This also makes it possible to include users that do not use their appliances very often. Then, if the user answers positively to any amount of shift during an Explore pre-action, the number of possible actions increases. At the same time, the amount of exploration decreases with time, so the number of actions is increased less and less often. Therefore, by guiding the exploration, ALA 2.0 decreases the discomfort of the user.

Algorithm 1
...
7: if new month and ε > 0.1 then
8:   decrease ε
9: end if
10: p ← random(0, 1)
11: if p < (1 − ε) then
12:   explore ← 0
13:   preaction ← best_action
14: else
15:   explore ← 1
16:   preaction ← action indicated by bookmark
17: end if
18: action ← shift decided by the MILP problem
19: prosumer_evaluation ← yes/no
20: if prosumer_evaluation = yes then
21:   R ← proportional to the delay
22:   if explore then
23:     if bookmark = 24 then
...
    ... of the chosen action
36: best_action ← action with highest value
37: end for
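The value update and the selection of the best action can be illustrated with a short sketch. The update is the standard constant step-size rule to which the definitions of α, Q_n and R_n refer; the default α = 0.1 and the function names are hypothetical choices made for illustration.

```python
def update_estimate(q_n: float, reward: float, alpha: float = 0.1) -> float:
    """Constant step-size update: Q_{n+1} = Q_n + alpha * (R_n - Q_n).

    With a constant alpha, recent rewards weigh more than older ones.
    alpha = 0.1 is a hypothetical default, not a value from the paper.
    """
    return q_n + alpha * (reward - q_n)


def best_action(values: dict[int, float]) -> int:
    """Return the action (allowed shift in hours) with the highest estimate."""
    return max(values, key=values.get)
```

Because each update moves the estimate only a fraction α towards the latest reward, a change in the user's behaviour is tracked gradually rather than instantly.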

III. RESULTS
In this section, we first present the experimental results for a single User to demonstrate how ALA 2.0 works; then, we discuss its performance when 3000 Users are simulated to represent a realistic district in a city. In both cases, we compare the performance of ALA 2.0 with that of the previous ALA 1.0 [5].

A. SINGLE USER ANALYSIS
In the following, we present the results of simulating a single User, which corresponds to a single family with only one shiftable appliance, i.e. the washing machine, used for 91 washes in the simulated year. The random refusals, arbitrarily set in this case, correspond to the 5th, 25th, 45th, 65th and 85th requests.
We propose an example where the family has only 1 hour of flexibility, with ALA 1.0 and ALA 2.0, see Fig. 4. The amount of exploration is the same in both algorithms. However, since ALA 2.0 has few possible pre-actions among which OPT 2 can choose, in this case it finds proposals liked by the user faster (orange line in Fig. 4). Instead, with ALA 1.0, the user is more annoyed (blue line in Fig. 4). The number of refusals, excluding the random ones, is 8 with ALA 2.0 and 20 with ALA 1.0. This graph makes it possible to verify how the opinions evolve in time. For simulations with more than one user, it visually shows whether all the users are accepting the proposed shifts, since the overall result in percentage might hide very good results for some users and very poor results for others. The performance of ALA 1.0 strongly depends on the random selection of the pre-action and, consequently, on the action chosen. That is why, depending on the random selection, ALA 1.0 might perform very well for some users and less well for others. Instead, ALA 2.0 should ensure the satisfaction of almost all the customers (from the point of view of requests in line with what is liked), thanks to the guided exploration. The 3000 users, with different numbers of appliances, requests and random refusals, make it possible to observe the general trend.
Since ALA 2.0 is a trial-and-error algorithm, its performance has also been compared with the optimal solution, i.e. the solution obtained knowing the preferences a-priori. Thus, we ran three optimisations, with i) the same family, ii) the same ''random'' refusals and iii) 1, 3 and 5 hours of flexibility, where the allowed shifts were known, i.e. the User communicates the boundaries allowed for the shift representing its acceptance. Then, we compared the optimal shifts suggested by the Aggregator in this scenario with the shifts suggested with ALA 2.0, i.e. where the acceptance must be learnt. The numbers of identical requests are 79, 60 and 73 for 1 hour, 3 hours and 5 hours of flexibility, respectively.
Last but not least, we want to understand what would happen if the user changes its behaviour, e.g. the recent COVID-19 pandemic and the consequent lockdown caused a change of users' habits. Two cases are illustrated: from 1-hour acceptance to 3-hour acceptance (Scenario 1) and vice versa (Scenario 2). Starting from 0.8 in January, ε decreases to 0.6 in February, then to 0.4 in March and to 0.2 in April, and from May it is set to 0.1. The reward for each action is arbitrarily set to the hours of shift multiplied by a factor of 0.1, and the penalty to -0.00001. The family changes its acceptance level in May.
In the first case, the non-zero estimated values of the actions at the beginning of May are shown in Table 2. The best action, i.e. the one with the highest Q value, corresponds to the user flexibility (±1 hour). At the end of the year, the best action is ±2 hours. The algorithm needs to continue exploring to increase the value of the ±3 hours action, but it does not bother the user. In the second case (see Table 3), in May the best action is correctly 3 hours, while at the end of December it is 1 hour (see 1st row, column ''±3h'', and 2nd row, column ''±1h'', respectively, in Table 3).
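The exploration schedule and the reward settings used in these experiments can be written compactly as follows. The numerical values are those stated above; the function and constant names are hypothetical.

```python
# Monthly exploration rate stated in the text (January to April), then a floor.
EPSILON_BY_MONTH = {1: 0.8, 2: 0.6, 3: 0.4, 4: 0.2}
EPSILON_FLOOR = 0.1

REWARD_PER_HOUR = 0.1   # reward = hours of shift multiplied by 0.1
PENALTY = -0.00001      # small negative reward for a refusal


def epsilon_for(month: int) -> float:
    """Exploration rate for a calendar month (1 = January, ..., 12 = December)."""
    return EPSILON_BY_MONTH.get(month, EPSILON_FLOOR)


def reward_for(shift_hours: float, accepted: bool) -> float:
    """Reward of an evaluated action: proportional to the shift, or a penalty."""
    return REWARD_PER_HOUR * abs(shift_hours) if accepted else PENALTY
```

Since the family changes its acceptance level in May, the change happens when ε has already reached its floor of 0.1, so the new preference has to be discovered with limited exploration.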
B. THE 3000 USERS ANALYSIS
ALA 2.0 has been tested for 3000 Users in the presence of two different sets of energy sources: one composed of the market only (Scenario 1) and the other made up of the market, the PV systems and the ESS (Scenario 2). The simulations, which last for one year with a 15-minute time step, have been repeated with the Users offering a flexibility of 1 hour, 3 hours and 5 hours, generating six case studies.
The costs (see Equation 3), or the levelised cost of energy, which includes the costs of installation and maintenance, are i) c_from t: the cost from [24] for 2013 plus taxes, system and network charges, ii) c_pv t: 0.13 €/kWh from [25] (1 kW PV system per family), iii) c_to t: 0.1 €/kWh and iv) c_dis_bat t: 0.12 €/kWh (total capacity 3720 kWh).
The exploration rate and the values of rewards are configured according to Section III-A. Each User owns a washing machine and/or a dishwasher. At the beginning of each simulation, the opinion of the User is arbitrarily drawn from a normal distribution with mean µ=0.5 and standard deviation σ=1/3, truncated to the range [0.2, 0.8]. In these simulations, the uptake of the program is not taken into account; thus, it is assumed that Users who signed up do not have an extreme initial opinion, either positive or negative. They will first try the DSM program and then evaluate it. Each time the proposed shift is below the threshold (see Section II-A), each user has a 5% probability of answering negatively.
Fig. 5 reports the savings obtained with ALA 2.0. The baseline, i.e. the optimisation without load shifting, is depicted with the pink line. If the level of flexibility is within ±1 hour, results are poor: only around 0.22% of savings w.r.t. the baseline case are obtained. It should also be pointed out that, in this first approximation, capacity constraints were not included; these would most likely further reduce the possibility of shifting the loads. With a level of acceptance of ±3 hours, around 0.49% can be saved. When the flexibility is ±5 hours, results improve further: savings reach around 0.68%. The green line in Fig. 5 represents the best case, where all the users answer positively to all the requests of the Aggregator, which is hard to achieve in the real world. In this case, savings reach 1.46%.
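As a minimal sketch of the simulated-user setup described at the start of this subsection, the following Python fragment draws the initial opinion from the truncated normal distribution and applies the 5% random refusal. The distribution parameters and the refusal probability come from the text. The function names, the use of rejection sampling, and the reading of "below the threshold" as "the absolute shift does not exceed the user flexibility" (with larger shifts always refused) are assumptions made for illustration, since the threshold is defined in Section II-A.

```python
import random

MU, SIGMA = 0.5, 1.0 / 3.0   # mean and standard deviation (from the text)
LOW, HIGH = 0.2, 0.8         # truncation range (from the text)
RANDOM_REFUSAL_PROB = 0.05   # 5% probability of a random refusal (from the text)


def sample_initial_opinion(rng: random.Random) -> float:
    """Draw from N(MU, SIGMA) and reject values outside [LOW, HIGH]."""
    while True:
        value = rng.gauss(MU, SIGMA)
        if LOW <= value <= HIGH:
            return value


def user_answer(shift_hours: float, flexibility_hours: float,
                rng: random.Random) -> bool:
    """Return True if the simulated user accepts the proposed shift.

    ASSUMPTION: shifts larger than the flexibility are always refused;
    shifts within it are refused at random with probability 5%.
    """
    if abs(shift_hours) > flexibility_hours:
        return False
    return rng.random() >= RANDOM_REFUSAL_PROB


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
opinions = [sample_initial_opinion(rng) for _ in range(3000)]
answers = [user_answer(1, 1, rng) for _ in range(10000)]
print(min(opinions), max(opinions), sum(answers) / len(answers))
```

Rejection sampling is used here only because it needs nothing beyond the standard library; any truncated-normal sampler with the same parameters would serve the same purpose.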
We compared the results with those obtained with ALA 1.0. From the Aggregator's perspective the differences are almost negligible, i.e. savings with ALA 1.0 and ALA 2.0 are almost the same and the trends almost overlap. Thus, to increase the readability of the plot, we did not report ALA 1.0 trends in Fig. 5. Instead, from the Users' point of view, the difference is remarkable, as shown in Fig. 7, which reports the variations of the opinions of the 3000 Users in one year for both ALA 1.0 and ALA 2.0. The opinion of those who make more requests increases very fast. In the end, the opinion of the majority becomes completely favourable. No one ends with an opinion less favourable than the starting one, with only one exception in Fig. 7b. This represents a customer who makes very few requests. If there is a combination of few requests, random refusals and wrong guesses from ALA, this might happen. Nevertheless, considering all the other results, this represents a single case. As shown in Fig. 7, at the beginning of the simulation, ALA 1.0 causes much more discomfort to all users. Then, it discovers the flexibility for the majority of users, while for some others, especially in the case with 1 hour of flexibility (see Fig. 7d), it bothers the users too much.
To understand the advantages of ALA 2.0 w.r.t. ALA 1.0, the results for December are shown in Fig. 8. The total acceptance rate is computed as the total number of ''yes'' answers from all users divided by the total number of requests, expressed as a percentage. We chose December as a significant month, since it is a month where the exploration does not change further, i.e. it is set to 0.1. In December, for users with one hour of flexibility, the total acceptance rate is 90.36% with ALA 2.0 and 86.20% with ALA 1.0. However, if we consider the acceptance rate per user, with ALA 1.0, 73 users have an acceptance rate between 0% and 10%, while this decreases to only 2 users with ALA 2.0 (see Fig. 8a, 8d). These users turn on their appliances very few times and might also ''randomly refuse'' the requests from the aggregator. As already pointed out, from the aggregator's viewpoint, there are no remarkable differences between the two ALAs, since users who make few requests do not have a great impact on costs. Instead, from the users' viewpoint, differences are significant, as all users are treated equally and each individual user is offered the opportunity to participate. Thus, ALA 2.0 is fairer to users. As the acceptance rate of the user increases, the difference in performance between ALA 1.0 and ALA 2.0 decreases (Fig. 8b, 8e and Fig. 8c, 8f). For the sake of completeness, results for the whole year are also shown in Fig. 9. In this case, the acceptance rate is lower because it also includes the first months of our simulated year, where ALA 2.0 is performing a strong exploration phase, i.e. it is still learning.
2) SCENARIO 2
Fig. 6 reports the costs of ALA 2.0 in the scenario with PV systems and ESS. The cost of the baseline (pink line) is definitely lower than the one in Scenario 1. Savings w.r.t. the pink line of this scenario for ±1 hour, ±3 hours and ±5 hours are 0.88%, 1.93% and 2.53%, respectively.
Thus, the 3 hours shift scenario has higher savings than the ones of the unlikely [...] advantage of the increased flexibility. Also in this scenario, from the Aggregator Agent viewpoint, differences in terms of savings in using ALA 1.0 or ALA 2.0 are negligible.
Thus, Fig. 6 reports only the trends of costs when ALA 2.0 is applied. From the User Agents' perspective, with both versions of ALA results improve w.r.t. the previous scenario, i.e. cases in Figures 10a to 10c

IV. CONCLUSION
In this paper, we presented a novel framework to simulate and evaluate the impact of DSM on users. The main purpose consists in learning the acceptance of such simulated users to shifting in time the turning-on of their household appliances, without any prior knowledge and without causing too much discomfort. To achieve this goal, we proposed ALA 2.0, an improved version of ALA 1.0 presented in [5], which is able to learn the acceptance of the users from the positive or negative evaluations of the households, leaving a certain degree of control to the users. Consequently, costs were minimised and appliances were optimally scheduled considering the boundaries found thanks to ALA 2.0.
Since it was not possible to test both versions of ALA with real users, users were modelled starting from what is described in surveys in the literature, e.g. [14], modelling their acceptance level and the associated discomfort. Moreover, we also included the possibility for the user to make ''random refusals'', because external factors may influence user habits.
Under these conditions, we demonstrated the potential of ALA 2.0 for a single simulated user and for 3000 users in different scenarios, comparing results with ALA 1.0 and analysing both the user discomfort and the savings for the aggregator, considering one or more energy sources and different levels of acceptance. As an example, in the ''market only'' scenario with 1 hour of acceptance, the ''Acceptance rate'' in December is 90.36% with ALA 2.0 and 86.20% with ALA 1.0. More importantly, the performance of ALA 2.0 on the user side is definitely improved w.r.t. ALA 1.0, and almost all the simulated users have a positive opinion of the DSM program at the end, i.e. several temporal shift proposals better match user preferences. Learning every single preference has been considered the major advantage for users. Indeed, they are treated equally and each individual user is offered the opportunity to participate, even if they rarely use these appliances.
The amount of savings is strictly related to the chosen parameters and should not be interpreted in terms of absolute values, but as a way to understand when there are major benefits in using this type of appliance. As future work, we plan to extend our solution to evaluate in depth the economic impact of DSM programs. As demonstrated, a win-win situation was always obtained, but the presence of PV systems and ESS made it possible to increase savings considerably by exploiting users' flexibility.
In future work, thermal loads will be included in the analysis as well.
[24] GME. Gestore Mercati Energetici. Accessed: Jun. 5
software architectures with particular emphasis on infrastructure for ambient intelligence, software solutions for simulating and optimizing energy systems, and software solutions for energy data visualization to increase user awareness. In the fields above, he has authored over 70 scientific publications from 2011 to 2018.
ENRICO MACII (Fellow, IEEE) received the Laurea degree in electrical engineering from the Politecnico di Torino, the Laurea degree in computer science from the Università di Torino, Torino, Italy, and the Ph.D. degree in computer engineering from the Politecnico di Torino, Torino. He is currently a Full Professor of computer engineering with the Politecnico di Torino. His research interests include the design of electronic digital circuits and systems, with a particular emphasis on low-power consumption aspects. In the last decade, he had extended his research activities to areas such as bioinformatics, energy efficiency in buildings, districts, and cities, sustainable urban mobility, and clean and intelligent manufacturing.
LORENZO BOTTACCIOLI (Member, IEEE) received the Ph.D. degree (cum laude) in computer engineering from the Politecnico di Torino, Italy, in 2018. He is currently an Assistant Professor of computer science with the Interuniversity Department of Regional and Urban Studies and Planning, Politecnico di Torino. His main research interests include smart energy, smart city, and smart communities, with a focus on software solutions for planning, analyzing, and optimizing smart energy systems, and for spatial representation of energy information. VOLUME 9, 2021