Reinforcement Learning-Based Energy Management of Smart Home with Rooftop Solar Photovoltaic System, Energy Storage System, and Home Appliances

This paper presents a data-driven approach that leverages reinforcement learning to manage the optimal energy consumption of a smart home with a rooftop solar photovoltaic system, energy storage system, and smart home appliances. Compared to existing model-based optimization methods for home energy management systems, the novelty of the proposed approach is as follows: (1) a model-free Q-learning method is applied to energy consumption scheduling for an individual controllable home appliance (air conditioner or washing machine), as well as the energy storage system charging and discharging, and (2) the prediction of the indoor temperature using an artificial neural network assists the proposed Q-learning algorithm in learning the relationship between the indoor temperature and energy consumption of the air conditioner accurately. The proposed Q-learning home energy management algorithm, integrated with the artificial neural network model, reduces the consumer electricity bill within the preferred comfort level (such as the indoor temperature) and the appliance operation characteristics. The simulations illustrate a single home with a solar photovoltaic system, an air conditioner, a washing machine, and an energy storage system with the time-of-use pricing. The results show that the relative electricity bill reduction of the proposed algorithm over the existing optimization approach is 14%.


Introduction
With the advent of Internet of Things (IoT) technology, smart sensors, and advanced communication and control methods in electric energy systems, increasing amounts of electric energy-related data are being produced and used for the reliable and efficient operation of these systems. Machine learning (ML) is a core technology for handling such big data effectively. Various ML-based applications are currently under development for solar photovoltaic (PV) generation prediction, load forecasting, energy control and cost optimization, peak load management, and the design of dynamic energy pricing, using ML models such as the artificial neural network (ANN), support vector machine, and deep learning [1]. This study provides a novel ML-based framework for the optimal energy management of residential homes.
Owing to the increasing home energy consumption [2] along with emerging smart grid technologies in the residential sector, such as distributed energy resources (DERs) (for example, rooftop PV systems and residential energy storage systems (ESSs)), advanced metering infrastructure with smart meters, and demand response programs, home energy management is becoming increasingly

•
We present an RL-based HEMS model that manages the optimal energy consumption of a smart home with a rooftop PV system, ESS, and smart home appliances. In the HEMS model, the Q-learning method is applied to the energy consumption scheduling of different home appliances (air conditioner, washing machine, and ESS), whereby the agent of each appliance determines the optimal policy independently to reduce its own electric cost within the consumer comfort level and the appliance operation characteristics. Furthermore, we propose an ANN model to learn the relationship between the indoor temperature and energy consumption of the air conditioner more accurately, which is integrated into the Q-learning module to achieve improved performance of the air conditioner agent.

•
The simulation results confirm that the proposed RL method with the ANN can successfully reduce both the consumer electricity bill and dissatisfaction cost (for example, the indoor temperature and operating time interval of the washing machine within the consumer comfort settings). Moreover, we compare the performance of the proposed RL-based HEMS algorithm to that of the conventional mixed-integer linear programming (MILP)-based HEMS algorithm, and verify that the proposed approach can achieve greater energy savings than the conventional approach under various penalty parameter settings in the reward function of the appliance agent.
The remainder of this paper is organized as follows. Section 2 provides a literature review related to our proposed method. Section 3 defines the various types of smart home appliances and introduces the conventional optimization formulation for home energy management. Section 4 presents the formulation of the proposed RL-based HEMS algorithm using the Q-learning and ANN methods. Section 5 provides the simulation results for the proposed HEMS algorithm, Section 6 discusses the future applicability of the proposed algorithm, and Section 7 concludes the paper.

Related Research
Over the past decade, various studies have addressed the HEMS optimization formulation using different types of optimization models and performance assessments [4–16]. These approaches include the scheduling of different types of home appliances along with electric vehicles using linear programming (LP) [4,5], load scheduling considering the consumer comfort level using mixed-integer nonlinear programming (MINLP) [6], convex programming based on relaxed MINLP using an L1 regularization technique [7], load scheduling for a single consumer or multiple consumers using MILP [8–10], LP-based joint optimization of energy supplies and electric loads through three-stage scheduling (prediction, supply control, and demand control) [11], the natural aggregation algorithm (NAA)-based HEMS method consisting of forecasting, day-ahead scheduling, and actual operation [12], robust optimization for the scheduling of home appliances to resolve the uncertainty of consumer behavior [13] and of the outdoor temperature and consumer comfort levels [14], and distributed HEMS architectures consisting of a local and a global HEMS [15,16]. More recently, a HEMS optimization method under real-time pricing that considers the operational dependency of various types of home appliances and consumer lifestyle requirements was proposed in [17]. Previous work on HEMS algorithms, including different types of optimization models, is summarized in [18]. In addition, a broader literature review on the energy and comfort management of residential, commercial, and industrial buildings was conducted in [19].
In recent years, data-driven approaches using ML methods have gained popularity over the aforementioned model-based HEMS optimization approaches because they manage residential energy more efficiently. The existing model-based approach is limited to deterministic decision-making under an uncertain environment and relies on approximated energy system models, which can lead to undesired energy consumption scheduling. In [20], an operation method for smart thermostats was presented, in which the consumer preference could be learned using a Bayesian inference method. Moreover, based on the learned consumer preference, the optimal temperature setting schedule for smart thermostats could be determined in a stochastic expected value model. A novel pooling-based deep recurrent neural network (RNN) method for household load forecasting was proposed in [21] to improve the forecasting accuracy significantly. Compared to the traditional deep RNN technique, the key idea in [21] was to learn spatial information among consumers and to allow for additional learning layers prior to the occurrence of overfitting. The numerical examples demonstrated that this approach outperformed existing ML methods, such as the auto-regressive integrated moving average, support vector regression, and traditional deep RNN. In [22], a hybrid HEMS algorithm that integrated ML methods into a traditional HEMS optimization problem was developed, in which the energy consumption of the heating, ventilation, and air conditioning (HVAC) system was scheduled based on neural network- and regression-based learning methods. Furthermore, for reliable wind energy management, a hybrid wind speed multi-step forecasting model was developed in [23] using an ANN method combined with the wavelet packet and complete ensemble empirical mode decomposition techniques.
In [24], an Elman neural network optimized by the multi-objective salp swarm algorithm was used to enhance both the forecasting accuracy and stability of an air quality early-warning system that aims to improve air quality and human health. In [25], a hybrid multi-step-ahead electricity price forecasting method was presented, which consists of fast ensemble empirical mode decomposition, variational mode decomposition, and a back propagation neural network. In [26], an ANN model was used to develop a tool that investigates the relationship among heating energy use, indoor temperatures, and heating energy demand in residential buildings with different occupant behaviors.
More recently, reinforcement learning (RL), also known as the model-free control approach, has received attention as a promising ML method for electric energy management. A pioneering work in RL-based energy management is that of Google DeepMind, which applied an RL method and was reported to reduce the electricity bill for cooling a data center by approximately 40%. Deep RL (DRL) (that is, the combination of RL and ANN) was applied to the control of HVAC in a building to reduce the energy cost while maintaining consumer comfort in terms of the indoor temperature [27], as well as both the indoor temperature and air quality [28]. Several papers have reported on building energy management with DERs using Q-learning, in which the ESS was controlled to achieve energy savings in a single building [29] and in a community with multiple buildings [30]. In [31], multi-agent RL was presented to manage the home energy consumption. The agents corresponded to home appliance types with non-shiftable, shiftable, and controllable loads, and the energy consumption of each appliance was optimized through the Q-learning process, along with real-time price prediction using the ANN. Recently, a novel Q-learning method using action dependent dual heuristic programming was proposed in [32] to solve the infinite-time domain linear quadratic tracker without requiring information on the system matrices. In [33], a Q-learning-based multi-agent framework was developed in which all agents communicate with each other and synchronize with the leader agent, consequently achieving the optimal consensus solution for all agents in real time.
Although extensive research has been conducted on residential energy management using the RL method, to the best of the authors' knowledge, no study has yet proposed an energy consumption scheduling algorithm that simultaneously considers the operation of various home appliances, including the ESS, and the consumer comfort level. Previous studies have been limited to the energy consumption scheduling problem for controlling only the HVAC [27,28] or only the ESS [29,30]. Similar to our work, the study in [31] developed a Q-learning-based HEMS algorithm that scheduled the energy consumption of different home appliances with shiftable and controllable loads. However, it did not consider control of the ESS charging and discharging.

Preliminary
In this study, we consider the situation in which automatic energy management for a single household is carried out by the HEMS, which schedules and controls the following types of household appliances under the time-of-use (TOU) tariff:

•
Controllable appliance ($A^{c}$): A controllable appliance is an appliance whose operation is scheduled and controlled by the HEMS. The operation characteristics categorize controllable appliances into reducible appliances ($A^{c}_{r}$) and shiftable appliances ($A^{c}_{s}$). An example of a reducible appliance is an air conditioner, known as a thermostatically controllable load, whose energy consumption can be curtailed to reduce the electricity bill. In contrast, under the TOU pricing scheme, the energy consumption of a shiftable appliance can be shifted from one time slot to another to minimize the total electricity cost. A shiftable appliance has two load types: (1) a non-interruptible load ($A^{c,NI}_{s}$), and (2) an interruptible load ($A^{c,I}_{s}$). The operation of shiftable appliances with non-interruptible loads must not be stopped by the HEMS control during the appliance task period. For example, a washing machine must perform a washing cycle prior to drying. A shiftable appliance with an interruptible load may be interrupted at any time. For example, the HEMS must terminate the discharging process and initiate the charging process of the ESS instantly when the PV power generation is greater than the load demand.

•
Uncontrollable appliance ($A^{uc}$): An uncontrollable appliance, such as a TV, PC, or lighting, cannot be scheduled or operated by the HEMS. Therefore, $A^{uc}$ follows a fixed energy consumption schedule.

Conventional HEMS Optimization Formulation
A general HEMS algorithm that determines the optimal operating schedule of household appliances and DERs is formulated as an MILP optimization problem, consisting of the objective function and constraints as follows:

Objective Function
The objective function (1) for the HEMS optimization problem consists of two parts, each of which includes a different decision variable ($E^{net}_{t}$, $T^{in}_{t}$). In Equation (1), $J_1(E^{net}_{t})$ is the total electricity cost, calculated from the TOU price $\pi_t$ and the net energy consumption $E^{net}_{t}$ at time $t$. Furthermore, $E^{net}_{t}$ is written in terms of the energy consumption of the controllable and uncontrollable appliances and the predicted PV generation output. $J_2(T^{in}_{t})$ is the total penalty associated with the consumer discomfort cost, where discomfort means a deviation of the indoor temperature $T^{in}_{t}$ from the consumer preferred temperature $T^{set}$. The weight applied to $J_2$ is a penalty term for the consumer discomfort cost. A larger penalty weight leads to a smaller $J_2(T^{in}_{t})$, thereby reducing the consumer discomfort at the cost of smaller energy savings. The HEMS operator can choose this weight to satisfy the consumer preferred comfort level at the expense of the consumer electricity bill. The following subsections present the equality and inequality constraints of the HEMS optimization problem.
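As a minimal sketch of how the two parts of the objective could be evaluated for a candidate schedule, the following Python function computes the electricity cost, the discomfort term, and their weighted sum. The function name `hems_objective`, the parameter `weight`, the use of an absolute temperature deviation for the discomfort term, and all numeric values are assumptions made for illustration; they are not taken from Equation (1).

```python
def hems_objective(prices, e_net, t_in, t_set, weight):
    """Return (J1, J2, J1 + weight * J2) over one scheduling horizon."""
    # J1: electricity cost, TOU price times net energy in each time slot.
    j1 = sum(p * e for p, e in zip(prices, e_net))
    # J2: discomfort, ASSUMED here to be the absolute deviation from t_set.
    j2 = sum(abs(t - t_set) for t in t_in)
    return j1, j2, j1 + weight * j2


# Hypothetical three-slot example (all values are illustrative only).
j1, j2, total = hems_objective(
    prices=[0.10, 0.20, 0.30],
    e_net=[1.0, 2.0, 1.0],
    t_in=[24.0, 25.0, 23.5],
    t_set=24.0,
    weight=2.0,
)
```

Raising `weight` in this sketch makes any temperature deviation more expensive, which mirrors the trade-off between comfort and energy savings described above.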

Net Power Consumption
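Equation (2) is the constraint on the net energy consumption; that is, the difference between the total consumption of all appliances $\sum_{a \in A} E_{a,t}$ and the predicted PV generation output $E^{PV}_{t}$. In Equation (3), the total consumption of all appliances in Equation (2) is decomposed into four appliance types: reducible appliances ($a \in A^{c}_{r}$), shiftable appliances with a non-interruptible load ($a \in A^{c,NI}_{s}$), shiftable appliances with an interruptible load ($a \in A^{c,I}_{s}$), and uncontrollable appliances ($a \in A^{uc}$):
$$E^{net}_{t} = \sum_{a \in A} E_{a,t} - E^{PV}_{t} \quad (2)$$
$$\sum_{a \in A} E_{a,t} = \sum_{a \in A^{c}_{r}} E_{a,t} + \sum_{a \in A^{c,NI}_{s}} E_{a,t} + \sum_{a \in A^{c,I}_{s}} E_{a,t} + \sum_{a \in A^{uc}} E_{a,t} \quad (3)$$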
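The net energy balance for a single time slot can be illustrated with a short Python sketch. The function name `net_energy` and the example values are hypothetical, and the ESS is represented with a signed energy (negative when discharging), which is an assumption made here for compactness.

```python
def net_energy(reducible, non_interruptible, interruptible, uncontrollable, pv):
    """Net energy for one time slot: total appliance consumption minus PV."""
    total = (sum(reducible) + sum(non_interruptible)
             + sum(interruptible) + sum(uncontrollable))
    return total - pv


# Hypothetical values in Wh for one time slot.
e_net = net_energy(
    reducible=[1200],         # e.g. air conditioner
    non_interruptible=[500],  # e.g. washing machine
    interruptible=[-300],     # e.g. ESS discharging (negative consumption)
    uncontrollable=[1700],    # aggregated uncontrollable appliances
    pv=1000,                  # predicted PV generation
)
```

A negative result would indicate that the predicted PV generation exceeds the household consumption in that slot.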

Operating Characteristics for Controllable Appliances
For a reducible appliance $a \in A^{c}_{r}$ (for example, an air conditioner), Equation (4) is the constraint on the indoor temperature dynamics at time $t$ ($T^{in}_{t}$), which is expressed in terms of $T^{in}_{t-1}$ at time $t-1$, the predicted outdoor temperature at time $t-1$ ($T^{out}_{t-1}$), the energy consumption of the reducible appliance ($E_{a,t}$), and the environmental parameters ($\alpha$, $\beta$) specifying the indoor thermal condition. Equation (5) presents the range of consumer preferred indoor temperatures with $T^{min}$ and $T^{max}$. Equation (6) limits the energy consumption capacity of the reducible appliance with $E^{min}_{a}$ and $E^{max}_{a}$.
Equations (7)-(9) ensure the desired operation of shiftable appliances with a non-interruptible load $a \in A^{c,NI}_{s}$ (for example, a washing machine) using the binary decision variable $b^{c,NI}_{a,t}$: (i) Equation (7) defines the stopping period, where $\omega^{pref}_{s}$ and $\omega^{pref}_{f}$ are the consumer preferred starting and finishing times; (ii) Equation (8) defines the operation period of $L_a$ hours during a day; and (iii) Equation (9) enforces a consecutive operation period of $L_a$ hours. Equation (10) describes the energy consumption capacity of the shiftable appliances with a non-interruptible load with $E^{max}_{a}$.
Equation (11) describes the operational dynamics of the state of energy (SOE) of the ESS ($a \in A^{c,I}_{s}$) at the current time $t$ in terms of the SOE at the previous time $t-1$, the charging and discharging efficiencies, $\eta^{ch}_{a}$ and $\eta^{dch}_{a}$, and the charging and discharging energies, $E^{ch}_{a,t}$ and $E^{dch}_{a,t}$, respectively. Equation (12) provides the SOE capacity constraint with $SOE^{min}_{a}$ and $SOE^{max}_{a}$ for the ESS. Equations (13) and (14) present the constraints on the charging ($E^{ch}_{a,t}$) and discharging ($E^{dch}_{a,t}$) energies of the ESS, respectively, where $b^{c,I}_{a,t}$ represents the binary decision variable that determines the ESS on/off status.
Finally, the MINLP-based HEMS optimization problem above can be converted into an MILP optimization problem by linearizing the nonlinear objective function $J_2(T^{in}_{t})$.
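To make the two dynamic constraints more concrete, the following Python sketch steps the indoor temperature and the SOE forward by one time slot. The functional forms are assumptions: a first-order thermal update driven by the indoor-outdoor temperature difference, and an SOE update in which the charging energy is multiplied by the charging efficiency and the discharging energy is divided by the discharging efficiency. They are commonly used forms, not a transcription of Equations (4) and (11), and the function names and efficiency values are hypothetical.

```python
def step_temperature(t_in_prev, t_out_prev, energy, alpha, beta):
    """One-slot indoor temperature update (ASSUMED first-order form)."""
    return t_in_prev + alpha * (t_out_prev - t_in_prev) + beta * energy


def step_soe(soe_prev, e_ch, e_dch, eta_ch, eta_dch):
    """One-slot SOE update (ASSUMED efficiency convention)."""
    return soe_prev + eta_ch * e_ch - e_dch / eta_dch


def within(value, low, high):
    """Check a box constraint such as the temperature or SOE range."""
    return low <= value <= high


# alpha and beta follow the simulation setup; the other inputs are illustrative.
t_next = step_temperature(26.0, 30.0, 100.0, alpha=0.8, beta=-0.02)
# Efficiencies of 0.9 are hypothetical; the SOE values are in Wh.
soe_next = step_soe(2400.0, e_ch=1000.0, e_dch=0.0, eta_ch=0.9, eta_dch=0.9)
```

In an MILP formulation these updates would appear as equality constraints linking consecutive time slots, with `within` corresponding to the range constraints.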

Home Energy Management via Q-Learning
RL is one of the main ML techniques for optimal decision-making in a non-deterministic environment. As illustrated in Figure 2, while an agent interacts with an environment, the agent learns which action to take depending on the state of the environment, and sends the learned action to the environment. The environment then returns a reward along with the new state of the environment to the agent. This learning process continues until the agent maximizes the total cumulative rewards received from the environment. A policy is defined as the manner in which the agent acts from a specific state, and the primary goal of the agent is to determine the optimal policy that maximizes the reward. In this study, we assume that the environment is described by a Markov decision process, in which the agent state transition relies only on the present state, along with the action selected in the present state, without considering all past states and actions. Q-learning is one of the representative RL techniques for determining the optimal policy $\nu^{*}$ of a decision-making problem. The general process of Q-learning calculates a Q-value $Q(s_t, a_t)$ for a pair of state $s_t$ and action $a_t$ at a discrete time $t$ and updates the Q-value towards the maximum total rewards using the following Bellman equation:
$$Q^{*}_{\nu^{*}}(s_t, a_t) = r(s_t, a_t) + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) \quad (18)$$
In Equation (18), based on the optimal policy $\nu^{*}$, the optimal Q-value $Q^{*}_{\nu^{*}}(s_t, a_t)$ is obtained by the summation of the present reward $r(s_t, a_t)$ and the maximum discounted future reward $\gamma \max Q(s_{t+1}, a_{t+1})$, where $\gamma \in [0, 1]$ represents a discounting factor that explains the relative importance of the present and future rewards. As the discounting factor $\gamma$ decreases, the agent becomes short-sighted because it focuses increasingly on the present reward. In contrast, a larger $\gamma$ enables the agent to focus increasingly on the future reward and thus become far-sighted. The value of $\gamma$ can be tuned by the system operator to balance the present and future rewards.
Whenever the Q-value $Q(s_t, a_t)$ is updated for a specific pair of state and action at time $t$, $Q(s_t, a_t)$ is saved in the state-action table, namely the Q-value table. The agent selects its action using the Q-value table at every time $t$, and the element (Q-value) in the Q-value table associated with the selected pair of state and action is updated using the following Bellman equation:
$$Q(s_t, a_t) \leftarrow (1 - \theta)\, Q(s_t, a_t) + \theta \left[ r(s_t, a_t) + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) \right] \quad (19)$$
In Equation (19), $\theta \in [0, 1]$ represents the learning rate that determines the extent to which the new Q-value overrides the old one. With $\theta = 0$, the agent learns nothing and keeps only the past Q-value. With $\theta = 1$, the agent discards the past Q-value and updates its Q-value using only the present reward and the maximum discounted future reward. Similar to the selection of $\gamma$, the system operator can set $\theta$ in [0, 1] to balance retaining past estimates against incorporating new information. Finally, by updating $Q(s_t, a_t)$ iteratively using Equation (19), the Q-value estimates improve, and the agent obtains the optimal policy $\nu^{*}$ that selects the action with the largest Q-value, as follows:
$$\nu^{*} = \arg\max_{a_t} Q^{*}(s_t, a_t) \quad (20)$$
In this study, the aforementioned Q-learning method is applied to an individual appliance (for example, air conditioner, washing machine, or ESS) to calculate the optimal operation schedule of appliances in a smart home with a PV system and an ESS, which consequently reduces the consumer electricity bill within the consumer preferred appliance scheduling and comfort level. A detailed illustration of the state, action, and reward for the proposed Q-learning approach is provided in the following three subsections.
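The tabular update and the greedy action choice described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function names `q_update` and `epsilon_greedy` and the list-of-lists Q-value table are assumptions, while the default values of the learning rate (0.1), the discounting factor (0.9), and the exploration parameter (0.1) follow the simulation setup reported later in the paper.

```python
import random


def q_update(q, s, a, reward, s_next, theta=0.1, gamma=0.9):
    """Blend the old Q-value with the new target using learning rate theta."""
    # A terminal transition (s_next is None) has no future reward.
    future = max(q[s_next]) if s_next is not None else 0.0
    target = reward + gamma * future
    q[s][a] = (1 - theta) * q[s][a] + theta * target
    return q[s][a]


def epsilon_greedy(q, s, epsilon=0.1, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q[s]))
    row = q[s]
    return max(range(len(row)), key=row.__getitem__)


# Hypothetical 2-state, 2-action table used only to exercise the functions.
q_table = [[0.0, 0.0], [1.0, 2.0]]
new_value = q_update(q_table, s=0, a=1, reward=-1.0, s_next=1)
best_action = epsilon_greedy(q_table, s=1, epsilon=0.0)
```

Setting `theta=0` leaves the table unchanged, and `theta=1` replaces the stored value with the new target, which matches the two limiting cases discussed for the learning rate.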

State Space
We consider the situation in which the proposed Q-learning algorithm is executed for 24 h with a 1 h scheduling resolution. For all $t = 1, \ldots, 24$, the state spaces of the washing machine (WM), air conditioner (AC), and ESS are defined in terms of the states $E^{WM}_{t}$, $E^{AC}_{t}$, and $SOE^{ESS}_{t}$, which are the energy consumption of the WM and AC and the SOE of the ESS, respectively, at time $t$.

Action Space
The optimal action for each appliance depends on the environment of the agent, including the present state, as defined in Section 4.1.1. The action spaces of the WM, AC, and ESS are defined in Equations (22)-(24). In Equation (22), the WM agent performs the binary action {On, Off}. With the 'On' action, the WM agent turns on the WM, which consumes a constant energy ($E^{WM,max}$), whereas the WM agent turns off the WM with the 'Off' action. The action for the AC agent is discretized into 10 levels of AC energy consumption in Equation (23), where $\Delta E^{AC}$ represents an energy consumption unit of the AC. Similar to the action for the AC agent, the discrete set of actions for the ESS agent is defined with an ESS energy unit $\Delta E^{ESS}$ in Equation (24). These discretized actions are categorized into discharging and charging actions, corresponding to $\{-4\Delta E^{ESS}, -3\Delta E^{ESS}, -2\Delta E^{ESS}, -1\Delta E^{ESS}\}$ and $\{1\Delta E^{ESS}, 2\Delta E^{ESS}, 3\Delta E^{ESS}, 4\Delta E^{ESS}\}$, respectively. The proposed algorithm calculates an hourly energy consumption schedule for the appliances for the next 24 h. Given the state and action sets above, the Q-value tables for the WM, AC, and ESS agents are represented by $|T| \times |A^{WM}|$, $|T| \times |A^{AC}|$, and $|T| \times |A^{ESS}|$ matrices, with $|T| = 24$, $|A^{WM}| = 2$, $|A^{AC}| = 10$, and $|A^{ESS}| = 9$, respectively. In this case, $|A|$ is the cardinality of the set $A$ (that is, the number of elements in $A$).
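The action sets and Q-value table shapes can be written out explicitly, as in the Python sketch below. The energy units (40 Wh and 150 Wh) and the WM energy (500 Wh) follow the simulation setup reported later. Two details are assumptions: the ten AC levels are taken to be the multiples 1 to 10 of the AC energy unit, and the ninth ESS action is taken to be an idle action with zero energy, which is consistent with the stated cardinality of 9 but is not spelled out in the text. The helper name `make_q_table` is hypothetical.

```python
HOURS = 24          # |T|, one state per hourly time slot
E_WM_MAX = 500      # Wh, constant WM consumption when 'On'
DELTA_E_AC = 40     # Wh, AC energy unit
DELTA_E_ESS = 150   # Wh, ESS energy unit

# WM: binary action, expressed as the energy consumed for Off and On.
wm_actions = [0, E_WM_MAX]
# AC: ten discrete energy levels (ASSUMED to be 1..10 units).
ac_actions = [k * DELTA_E_AC for k in range(1, 11)]
# ESS: discharging (negative), ASSUMED idle (zero), and charging (positive).
ess_actions = [k * DELTA_E_ESS for k in range(-4, 5)]


def make_q_table(n_states, n_actions):
    """Create an n_states x n_actions table of zero-initialized Q-values."""
    return [[0.0] * n_actions for _ in range(n_states)]


q_wm = make_q_table(HOURS, len(wm_actions))
q_ac = make_q_table(HOURS, len(ac_actions))
q_ess = make_q_table(HOURS, len(ess_actions))
```

Each row of a table corresponds to one hour of the day, and each column to one discrete action of the corresponding agent.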

Reward
The reward function for each appliance agent is formulated as the sum of the negative electric cost and the negative dissatisfaction cost associated with the consumer preferred comfort and appliance operation characteristics. The comprehensive reward $r^{Total}$ for the HEMS is defined in Equation (25), in which the three reward functions $r^{WM}_{t}$, $r^{AC}_{t}$, and $r^{ESS}_{t}$ evaluate the HEMS performance in terms of: (i) the electric cost and consumer undesired operation of the WM, (ii) the electric cost and consumer thermal discomfort of the AC, and (iii) the electric cost and energy underutilization owing to overcharging and undercharging of the ESS.
Firstly, the reward function for the WM agent is defined piecewise in terms of $\omega^{pref}_{s}$ and $\omega^{pref}_{f}$, the consumer preferred starting and finishing times of the WM, and two $\delta$ penalties, one for early and one for late operation relative to the consumer preferred operation interval. A dissatisfaction cost is added to the reward function with a negative value if the WM agent schedules the WM energy consumption before $\omega^{pref}_{s}$ or after $\omega^{pref}_{f}$; otherwise, the reward function includes only a negative electric cost.
The reward function for the AC agent is defined in terms of $\kappa$, the penalty for the consumer thermal discomfort. The dissatisfaction cost is defined as the deviation of the indoor temperature $T^{in}_{t}$ from the range bounded by $T^{min}$ and $T^{max}$, and it enters the reward with a negative sign only if $T^{in}_{t}$ falls outside the range $[T^{min}, T^{max}]$. Finally, the reward function for the ESS agent consists of a negative electric cost and a negative energy underutilization cost, with two $\tau$ penalties, one for overcharging and one for undercharging of the ESS. In this case, energy underutilization of the ESS occurs if the SOE becomes lower than $SOE^{min}$ (undercharging) or greater than $SOE^{max}$ (overcharging), and it is considered as a reward term, along with the electric cost, during the ESS underutilization stage.
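The piecewise structure of the three reward functions can be sketched in Python as follows. This is an illustration under stated assumptions rather than a transcription of the reward equations: the dissatisfaction terms are assumed to scale linearly with the size of the violation (hours outside the preferred interval, degrees outside the comfort range, and Wh outside the SOE range), and the function and parameter names are hypothetical.

```python
def wm_reward(price, energy, hour, start, finish, delta_early, delta_late):
    """Negative cost, plus a penalty if the WM runs outside [start, finish]."""
    cost = price * energy
    if energy > 0 and hour < start:
        return -cost - delta_early * (start - hour)
    if energy > 0 and hour > finish:
        return -cost - delta_late * (hour - finish)
    return -cost


def ac_reward(price, energy, t_in, t_min, t_max, kappa):
    """Negative cost, plus a penalty if t_in leaves [t_min, t_max]."""
    cost = price * energy
    if t_in < t_min:
        return -cost - kappa * (t_min - t_in)
    if t_in > t_max:
        return -cost - kappa * (t_in - t_max)
    return -cost


def ess_reward(price, energy, soe, soe_min, soe_max, tau_over, tau_under):
    """Negative cost, plus a penalty for over- or undercharging."""
    cost = price * energy  # energy is negative when discharging
    if soe > soe_max:
        return -cost - tau_over * (soe - soe_max)
    if soe < soe_min:
        return -cost - tau_under * (soe_min - soe)
    return -cost


# Hypothetical price and states; kappa = 100 follows the simulation setup.
r_ac_hot = ac_reward(0.2, 100.0, t_in=26.0, t_min=23.0, t_max=25.0, kappa=100.0)
r_ac_ok = ac_reward(0.2, 100.0, t_in=24.0, t_min=23.0, t_max=25.0, kappa=100.0)
```

In this sketch, a larger `kappa` makes comfort violations dominate the reward, so an agent trained on it would be expected to spend more energy to stay within the comfort range.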

Prediction of Indoor Temperature via ANN
In this study, we consider the situation in which the HEMS schedules the AC energy consumption based on the indoor and outdoor temperatures and the consumer preferred thermal conditions. Traditionally, the HEMS calculates the current indoor temperature using an approximated equation, namely the equivalent thermal parameters (ETP) model of Equation (4) in Section 3.2.3, in terms of the previous indoor and current outdoor temperature, AC energy consumption, and indoor thermal characteristics. In this subsection, in contrast to this model-based approach for the indoor temperature prediction, we propose an ANN-based method for predicting the indoor temperature associated with the AC energy consumption.
In the proposed ANN model, the AC agent learns the extent to which the AC energy consumption affects the current indoor temperature, which amounts to estimating the function $f$ that describes the relationship between the indoor temperature and the AC energy consumption, as follows:
$$T^{in}_{t} = f\left(T^{in}_{t-1},\, T^{min},\, T^{max},\, T^{out}_{t},\, E^{AC}_{t}\right)$$
where $f$ is the approximated function that maps the input data (from the ETP model in Section 3.2.3), namely the previous indoor temperature ($T^{in}_{t-1}$), the consumer's preferred indoor thermal conditions ($T^{min}$, $T^{max}$), the weather forecast ($T^{out}_{t}$), and the AC energy consumption ($E^{AC}_{t}$), to the output, which is the predicted current indoor temperature.
As illustrated in Figure 3, the proposed ANN model consists of one input layer with five neurons, three hidden layers with seventeen neurons, and one output layer with one neuron. Each layer calculates the weighted sum of the input vector with a weight $W_i$ and a constant bias $b_i$, and the weighted sum is transferred to the following layer by means of the transfer function. In this study, the rectified linear unit (ReLU) function is used as the transfer function [34]. Moreover, the Adam optimization algorithm [35] is employed to train the proposed ANN model, and the learning rate of the optimization algorithm is set to 0.005.
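The layer structure described above can be illustrated with a dependency-free Python sketch of the forward pass. Several points are assumptions made for illustration: each of the three hidden layers is taken to have seventeen neurons, the output layer is taken to be linear, and the weights are drawn from a small uniform range. Training with the Adam optimizer is not shown, and the names `init_mlp`, `forward`, and `count_parameters` are hypothetical.

```python
import random

# Five inputs, three hidden layers (ASSUMED 17 neurons each), one output.
LAYER_SIZES = [5, 17, 17, 17, 1]


def init_mlp(sizes, seed=0):
    """Create (weights, biases) per layer with small random weights."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
                   for _ in range(n_out)]
        biases = [0.0] * n_out
        layers.append((weights, biases))
    return layers


def forward(layers, inputs):
    """Weighted sum plus bias per layer; ReLU on hidden layers only."""
    activations = list(inputs)
    last = len(layers) - 1
    for i, (weights, biases) in enumerate(layers):
        z = [sum(w * a for w, a in zip(row, activations)) + b
             for row, b in zip(weights, biases)]
        # The output layer is ASSUMED linear so it can return any temperature.
        activations = z if i == last else [max(v, 0.0) for v in z]
    return activations[0]


def count_parameters(layers):
    """Total number of weights and biases in the network."""
    return sum(len(w) * len(w[0]) + len(b) for w, b in layers)


mlp = init_mlp(LAYER_SIZES)
# Inputs: previous indoor temp, T_min, T_max, outdoor temp, AC energy (Wh).
prediction = forward(mlp, [24.0, 23.0, 25.0, 30.0, 200.0])
n_params = count_parameters(mlp)
```

Because the weights here are untrained, `prediction` is not a meaningful temperature; the sketch only shows how the five inputs flow through the layers to a single output.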

[Figure 3: structure of the proposed ANN model, showing the input layer, hidden layers, and output layer]
The temperature prediction function approximated by the proposed ANN is fed into the Q-learning module for the AC agent, as illustrated in Section 4.1. This approximated model enables the AC agent to calculate the dissatisfaction cost more precisely and determine the optimal energy consumption schedule more efficiently during the Q-learning process.
Finally, the HEMS with the PV system, ESS, and home appliances learns the energy management policies that optimize the electricity bill and consumer comfort level using Algorithm 1. The HEMS receives the hour-ahead indoor temperature, consumer preferred indoor temperature range, predicted outdoor temperature, and AC energy consumption ($E^{AC}_{t}$), and uses the ANN to predict the current indoor temperature. Afterwards, the proposed Q-learning is initiated to schedule the optimal energy consumption of the appliances and the ESS charging/discharging. Figure 4 illustrates the proposed Q-learning- and ANN-based framework for optimal control of the home appliances and ESS.
Algorithm 1: Q-learning-based energy management of smart home with PV system, ESS, and home appliances.
1 Initialize each appliance's energy demand, dissatisfaction parameters, and Q-learning parameters
2 %% Learning with ANN for temperature prediction of AC agent
3 Indoor temperature at time period t − 1 → T in [...]
[...] Select $a_t$ from present $s_t$ using ε-greedy policy
15 Take action $a_t$; observe $r(s_t, a_t)$ and $s_{t+1}$
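Since only part of Algorithm 1 is reproduced above, the following Python sketch shows one way the visible steps (temperature prediction, ε-greedy action selection, and the Q-value update) could fit together for a single AC agent. It is not the authors' algorithm. The function `train_ac_agent`, the stand-in predictor `toy_predictor` (used here in place of the trained ANN), the TOU prices, the outdoor temperatures, and the linear discomfort penalty are all assumptions made for illustration; the learning rate, discounting factor, exploration parameter, comfort range, and κ follow the simulation setup.

```python
import random


def train_ac_agent(prices, t_out, actions, predict_temp, *, t_init=24.0,
                   t_min=23.0, t_max=25.0, kappa=100.0, theta=0.1,
                   gamma=0.9, epsilon=0.1, episodes=500, seed=0):
    """Tabular Q-learning with one state per hour and one row per state."""
    rng = random.Random(seed)
    horizon = len(prices)
    q = [[0.0] * len(actions) for _ in range(horizon)]
    for _ in range(episodes):
        t_in = t_init
        for t in range(horizon):
            # Epsilon-greedy: explore with probability epsilon.
            if rng.random() < epsilon:
                a = rng.randrange(len(actions))
            else:
                a = max(range(len(actions)), key=q[t].__getitem__)
            energy = actions[a]
            # Predict the indoor temperature (the ANN's role in the paper).
            t_in = predict_temp(t_in, t_out[t], energy)
            # Reward: negative cost minus an ASSUMED linear discomfort penalty.
            discomfort = max(t_min - t_in, 0.0) + max(t_in - t_max, 0.0)
            reward = -prices[t] * energy - kappa * discomfort
            # Q-value update; the last hour has no future reward.
            future = max(q[t + 1]) if t + 1 < horizon else 0.0
            q[t][a] = (1 - theta) * q[t][a] + theta * (reward + gamma * future)
    schedule = [actions[max(range(len(actions)), key=row.__getitem__)]
                for row in q]
    return q, schedule


def toy_predictor(t_prev, t_out_now, energy):
    """Hypothetical stand-in for the trained ANN temperature model."""
    return t_prev + 0.8 * (t_out_now - t_prev) - 0.02 * energy


prices = [0.1] * 8 + [0.2] * 8 + [0.3] * 8  # hypothetical TOU prices
t_out = [28.0] * 24                          # hypothetical outdoor temperature
actions = [40 * k for k in range(1, 11)]     # AC energy levels in Wh
q_table, schedule = train_ac_agent(prices, t_out, actions, toy_predictor)
```

After training, `schedule` holds the greedy energy level for each hour. With so few episodes and a toy predictor, it should be read as a demonstration of the loop structure and not as an optimal schedule.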

Simulation Setup
We considered the situation of a household with two major home appliances (AC and WM), and an ESS that can be controlled by the HEMS under the TOU tariff, as illustrated in Figure 5a.
The simulations were carried out for 24 h with a 1 h scheduling resolution. It was assumed that the predicted PV generation energy $E^{PV}_{t}$ in Figure 5b and outdoor temperature $T^{out}_{t-1}$ in Figure 5c could be obtained accurately. The maximum energy consumptions of the AC, WM, and aggregated uncontrollable appliances were 3000, 500, and 1700 Wh, respectively. The consumer comfortable temperature range was assumed to be (23 °C, 25 °C). The consumer preferred temperature $T^{set}$ was 24 °C, and the penalties for the consumer thermal discomfort, namely the discomfort penalty weight of the MILP-based HEMS and $\kappa$ of the RL-based HEMS, were both 100. The parameters $\alpha$ and $\beta$, which represent the AC thermal characteristics, were set to 0.8 and −0.02, respectively. The allowable operating period for the WM was (6:00 a.m., 10:00 p.m.), and the consecutive operation time was 2 h. The maximum charging and discharging capacities, $E^{ch,max}$ and $E^{dch,max}$, for the ESS were both 4000 Wh, while the initial, minimum, and maximum SOE values were 2400, 800, and 4000 Wh, respectively. In the action space, the energy consumption units, $\Delta E^{AC}$ and $\Delta E^{ESS}$, for the air conditioner and ESS were 40 Wh and 150 Wh, respectively. For the reward function, the penalties for the dissatisfaction cost of the WM (both $\delta$ penalties) and of the ESS (both $\tau$ penalties) were all set to 50. The parameter of the ε-greedy policy for exploration and exploitation was set to 0.1. The learning rate $\theta$ and discounting factor $\gamma$ in the Bellman equation were set to 0.1 and 0.9, respectively. The proposed algorithm was implemented on a computer (AMD Ryzen 7 2700X 8-core CPU (China) clocked at 3.70 GHz and 32 GB of RAM), using the optimization toolbox in MATLAB R2018a (MILP optimization) (MathWorks, Natick, MA, USA) and Python (Q-learning and ANN).

Performance of the Proposed RL-Based HEMS Algorithm
In this subsection, we present the simulation of the proposed RL-based HEMS algorithm, and verify the energy consumption schedule of the controllable appliances and the charging/discharging schedule of the ESS. Figure 6a illustrates the energy consumption schedule calculated by the WM agent. It can be observed from Figure 6a that, given the consumer preferred operation period (6:00 a.m., 10:00 p.m.) with two consecutive operation hours (L = 2), the optimal operation schedule for the washing machine was selected as (7:00 a.m., 8:00 a.m.). This scheduling policy is considered optimal because the washing machine operates at the lowest TOU price, which in turn reduces the electricity bill, while satisfying the consumer preference. Figure 6c,d illustrate the charging/discharging and SOE schedules for the ESS, respectively. Similarly to the result in Figure 6a, it can be observed from Figure 6c that, in general, the charging (positive energy consumption) of the ESS occurred at low TOU prices, whereas the discharging (negative energy consumption) of the ESS occurred at high TOU prices, thereby leading to consumer energy savings. Furthermore, it can be observed from Figure 6d that, as the price increased, the SOE decreased, and vice versa. This is because a higher price results in the ESS discharging, and hence the SOE decreases. For example, the ESS discharged at 4:00 p.m. with the highest price in Figure 6c, which led to a decreasing SOE from 4:00 to 5:00 p.m. in Figure 6d. However, the ESS charged at 5:00 p.m. owing to the decreasing price, and, consequently, the SOE increased from 5:00 to 6:00 p.m. Figure 6b illustrates the AC energy consumption schedule. Compared to the results of the WM and ESS agents, we observe from this figure that a high (or low) price did not always cause the AC agent to decrease (or increase) the AC energy consumption.
This is because the AC agent considers the consumer thermal comfort as well as the electricity bill saving in the reward function. In the Q-learning process for the AC agent, a higher penalty κ for the consumer thermal discomfort led to a lower energy saving, and vice versa. Given this trade-off between the energy saving and consumer comfort in terms of κ, HEMS operators may adaptively adjust the penalty κ according to whether the consumer aims to save more on electricity costs or to maintain a more comfortable environment. A detailed assessment of the impact of the penalty for the AC agent on the proposed algorithm is presented in the following subsection.
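As a sanity check on the WM result discussed above, the cheapest block of L consecutive hours inside the preferred operation period can be found by direct enumeration. The sketch below is not the Q-learning procedure itself; it only shows the target that the WM agent is expected to converge to. The tariff values are hypothetical (they are not the TOU prices used in the simulation), and we assume that the operation must finish by the end of the preferred period.

```python
def cheapest_window(prices, start_hour, end_hour, duration, energy_kwh=0.5):
    """Return (first_hour, cost) of the cheapest run of `duration` hours.

    The run must lie entirely inside [start_hour, end_hour). Ties are
    resolved in favour of the earliest window.
    """
    best = None
    for hour in range(start_hour, end_hour - duration + 1):
        cost = energy_kwh * sum(prices[hour:hour + duration])
        if best is None or cost < best[1]:
            best = (hour, cost)
    return best


# Hypothetical hourly tariff (price units per kWh), index 0 = midnight.
HYPOTHETICAL_TOU = [5] * 6 + [8] * 4 + [12] * 7 + [8] * 5 + [5] * 2

# Preferred period 6:00 to 22:00, two consecutive hours, 0.5 kWh per hour.
first_hour, cost = cheapest_window(HYPOTHETICAL_TOU, 6, 22, 2)
```

The 0.5 kWh per hour default corresponds to the 500 Wh maximum WM consumption stated in the simulation setup.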

Impact of Different Parameters in Reward Function on the Proposed Algorithm
In this subsection, we investigate the effects of the different penalties κ and preferred operating time intervals $[\omega^{pref}_s, \omega^{pref}_f]$ in the reward function for the AC and WM on the performance of the AC and WM agents. Figure 7a-c illustrate the impact of varying κ values (κ = 10, 50, 100) on the indoor temperature $T^{in}_t$ at any time period t given an outdoor temperature $T^{out}_t$. As indicated in Figure 7a, when κ was set to 10, the indoor temperature deviated significantly from the consumer preferred temperature range (23 °C, 25 °C) in most time periods. It can be observed from Figure 7b,c that an increasing κ caused the indoor temperature to deviate less from the consumer preferred temperature range. This is because the AC agent with an increasing κ updates the Q-value toward reducing the consumer thermal discomfort at the expense of the consumer electricity bill. The trade-off between the energy saving and consumer comfort in terms of κ is verified by the comparison of Figure 7a-c and Figure 7d-f. As expected, it is observed from Figure 7d-f that, as the value of κ increased, the AC energy consumption also increased to maintain the consumer comfort.

It can be observed from Figure 8a-c that the optimal operating schedule of the WM was selected at the time periods with the lowest TOU price within the preferred operating time interval. This observation confirms that the WM agent could always determine the optimal policy to minimize the electricity bill successfully, even when the consumer preferred operating time interval changed.
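The effect of κ described above can be illustrated with a small sketch. The reward form used here (negative of the electricity cost plus κ times the temperature deviation from the comfort range) is an assumption made for illustration and may differ from the reward function defined for the AC agent. The price and the candidate actions, each paired with a resulting indoor temperature, are hypothetical.

```python
T_MIN, T_MAX = 23.0, 25.0  # comfort range in degrees Celsius (from the text)


def discomfort(temp_c):
    """Degrees Celsius by which temp_c lies outside the comfort range."""
    if temp_c < T_MIN:
        return T_MIN - temp_c
    if temp_c > T_MAX:
        return temp_c - T_MAX
    return 0.0


def ac_reward(price_per_kwh, energy_wh, temp_c, kappa):
    """Assumed reward: negative of (energy cost + kappa * discomfort)."""
    cost = price_per_kwh * energy_wh / 1000.0
    return -(cost + kappa * discomfort(temp_c))


# Hypothetical candidates: (AC energy in Wh, resulting indoor temperature).
CANDIDATES = [(0, 28.0), (1000, 26.5), (2000, 24.5)]


def best_action(price_per_kwh, kappa):
    """Pick the candidate with the highest assumed reward."""
    return max(CANDIDATES,
               key=lambda c: ac_reward(price_per_kwh, c[0], c[1], kappa))


low_kappa_choice = best_action(50.0, kappa=10)    # favours the bill
high_kappa_choice = best_action(50.0, kappa=100)  # favours comfort
```

Under these assumed numbers, the small penalty selects the zero-energy action and tolerates a temperature outside the comfort range, whereas the large penalty selects the highest-energy action that keeps the temperature inside it.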

Impact of ANN on AC Agent Performance
In this subsection, we study the effect of the indoor temperature prediction using the ANN model proposed in Section 4.2 on the performance of the proposed RL-based algorithm. Figure 9a,b compare the AC energy consumption and indoor temperature obtained by the Q-learning process with the ETP and ANN models. This comparison verifies that the ANN model (Figure 9b) required less energy consumption than the ETP model (Figure 9a). This is because the ANN assisted the AC agent in learning the relationship between the indoor temperature and AC energy consumption more accurately, and, consequently, the AC agent determined the optimal policy to achieve greater energy savings. Furthermore, the proposed ANN approach is beneficial for HEMS operators in the following manner. During the RL-based HEMS execution, the HEMS operators use only data such as the AC energy consumption, consumer preferences, and weather forecasts, without explicitly relying on the model-based ETP equation with fixed environmental parameters (α and β in Equation (4)) for the indoor thermal conditions. Therefore, the HEMS operators do not need to tune these parameters, even when the household environment varies.

Performance Comparison between MILP-and RL-Based HEMS
In this subsection, we compare the performance of the proposed RL-based HEMS algorithm with that of the MILP-based HEMS algorithm. Figure 10a,b illustrate the WM energy consumption using the MILP model and RL model, respectively. It can be observed from these figures that the operation periods for the WM were scheduled for ( with a saving of $80 with the RL approach, where the electric costs for the MILP and RL models were $140 and $60, respectively. The comparison of the ESS charging and discharging schedules between the MILP and RL models is presented in Figure 10c,d. Compared to the charging and discharging schedule for the MILP model illustrated in Figure 10c, it is observed from Figure 10d that the ESS charged and discharged a significant amount of energy with the RL model, consequently achieving a saving of $94.09, where the electric costs for the MILP and RL models were −$193.91 and −$288, respectively. Moreover, it can be observed from Figure 10e,f that the RL approach reduced the AC energy consumption more than the MILP approach, consequently leading to a reduction in the electric cost of $228.28, where the electric costs for the MILP and RL models were $462.62 and $234.34, respectively.
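The per-appliance savings quoted above follow directly from the reported costs, as the short check below shows. The cost figures are those stated in the text; the dictionary layout and variable names are illustrative only.

```python
# (MILP cost, RL cost) in dollars, as reported in the text.
REPORTED_COSTS = {
    "WM": (140.00, 60.00),
    "ESS": (-193.91, -288.00),
    "AC": (462.62, 234.34),
}

# Saving of the RL model relative to the MILP model, rounded to cents.
savings = {
    name: round(milp_cost - rl_cost, 2)
    for name, (milp_cost, rl_cost) in REPORTED_COSTS.items()
}
```

The negative ESS costs are used exactly as reported; a more negative cost for the RL model yields a positive saving.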
where $X^{bill,MILP}$ is the total electricity bill using the MILP model and $X^{bill,RL}_p$ is the total electricity bill using the RL model, where p represents a parameter including the preferred operating time interval of the WM, ESS capacity, and penalty for the consumer preferred indoor thermal conditions associated with the AC operation. It can be observed from Figure 11a-c that the RL method could achieve greater energy savings than the MILP model under varying parameters. The results from Figure 11a indicate that a longer preferred operating time interval of the WM enabled the WM agent to select the operating intervals efficiently, and the consumer to obtain greater energy savings. As expected, as the ESS capacity increased, the ESS agent conducted additional energy charging and discharging to reduce the electricity bill, which is verified by Figure 11b. It is also observed from Figure 11c that a smaller penalty for the consumer preferred indoor thermal condition led to greater energy savings, where the AC agent minimized the AC electricity cost at the expense of the consumer indoor thermal comfort.
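A relative bill reduction of this kind can be computed as in the sketch below. The formula, 100 × (MILP bill − RL bill) / MILP bill, is the usual definition of a relative reduction and is an assumption here; it should be checked against Equation (30). All bill values in the example are hypothetical and are not simulation results.

```python
def relative_bill_reduction(milp_bill, rl_bill):
    """Assumed relative reduction (%) of the RL bill against the MILP bill."""
    if milp_bill <= 0:
        raise ValueError("the MILP bill must be positive for a relative measure")
    return 100.0 * (milp_bill - rl_bill) / milp_bill


# Hypothetical (MILP bill, RL bill) pairs for two settings of a parameter p.
HYPOTHETICAL_BILLS = {
    "small penalty": (200.0, 150.0),
    "large penalty": (200.0, 180.0),
}

reductions = {
    setting: relative_bill_reduction(milp, rl)
    for setting, (milp, rl) in HYPOTHETICAL_BILLS.items()
}
```

The guard on a non-positive MILP bill is a design choice of this sketch: with a zero or negative reference bill, a percentage reduction is undefined or changes sign.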
Finally, Figure 12 compares the total energy consumption every hour between the MILP method and the proposed RL method integrated with the ANN. It can be verified from this figure that the energy consumption using the proposed approach was significantly reduced in the following three time periods: (3:00 a.m., 5:00 a.m.), (10:00 a.m., 12:00 p.m.), and (10:00 p.m., 12:00 a.m.). In this simulation study, the relative electricity bill reduction of the proposed RL method compared to the MILP method using Equation (30) was calculated as 14%.

In this study, the proposed HEMS algorithm is executed under the TOU pricing tariff. However, in electric power system operations, there are other pricing tariffs such as real-time pricing (RTP). RTP, namely locational marginal pricing, is the core variable for congestion management in the wholesale and retail electricity markets [36,37]. Recently, a two-stage home energy management algorithm has been developed under distribution locational marginal pricing [38]. In real-time electricity markets, the results in [31] show that the value of RTP can be accurately forecast by an ANN with various types of input data. Therefore, the ANN-based RTP forecasting module can be integrated into the proposed Q-learning framework illustrated in Section 4 to manage the optimal energy consumption of a smart home under the real-time electricity market environment.

Electric Vehicle (EV) Integration
Recent studies have investigated the joint optimization of electric vehicle (EV) and home energy consumption scheduling [4]. However, these studies are limited to model-based SOE constraints and do not analyze the travel pattern of the EV. To resolve this limitation, a key part of our proposed approach would be to analyze the travel pattern of the EV using an ANN with historical travel data such as arrival and departure times, the number of trips per day, and the travel distance. Then, the SOE dynamics of the EV could be modeled by the EV agent, which learns the charging or discharging action depending on the SOE state of the EV, similar to the ESS agent process illustrated in Section 4.

Constraint of the Lifetime for ESS
The lifetime of the residential ESS is an important constraint for the HEMS problem, and it is expressed as the SOE range in terms of the limited number of charging and discharging cycles of the ESS [39]. A key part of this task would be to identify the proper limit on the charging and discharging cycles of the ESS. To this end, one possible direction in the proposed framework is to add the limit on charging and discharging cycles to the dissatisfaction cost for the ESS agent. This enables the ESS agent to determine a policy that maintains the number of charging and discharging cycles within its acceptable range.
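One way such a cycle limit could enter the dissatisfaction cost is sketched below. This is only an illustration of the direction described above: the definition of a cycle (a charging phase followed by a discharging phase), the function names, and the penalty weight are all assumptions and are not part of the proposed algorithm.

```python
def count_cycles(actions_wh):
    """Count charge-then-discharge transitions in an ESS action sequence.

    Positive values denote charging, negative values discharging, and
    zero denotes idle steps, which are ignored.
    """
    cycles = 0
    last_sign = 0
    for energy in actions_wh:
        sign = (energy > 0) - (energy < 0)
        if sign == 0:
            continue
        if last_sign > 0 and sign < 0:
            cycles += 1
        last_sign = sign
    return cycles


def cycle_penalty(actions_wh, max_cycles, weight):
    """Hypothetical penalty term: weight times the cycles above the limit."""
    excess = max(0, count_cycles(actions_wh) - max_cycles)
    return weight * excess


# Hypothetical daily schedule in Wh, using multiples of the 150 Wh unit.
schedule = [150, 300, -150, 0, 150, -300]
penalty = cycle_penalty(schedule, max_cycles=1, weight=50)
```

Adding such a term to the ESS reward would discourage schedules whose cycle count exceeds the chosen limit, while leaving schedules within the limit unaffected.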

Conclusions
We have proposed a machine learning-based smart home energy management algorithm using reinforcement learning and an artificial neural network. The proposed algorithm can minimize the electricity bill through the energy consumption scheduling of two controllable home appliances (an air conditioner and a washing machine) and the charging and discharging of the energy storage system, while maintaining the consumer comfort level and appliance operation characteristics. In the proposed Q-learning framework, the agents for a washing machine, an air conditioner, and an energy storage system independently learn their actions through interaction with the environment until they maximize the total cumulative rewards received from the environment. The washing machine agent schedules the energy consumption of the washing machine within the consumer preferred operation period. The energy storage system agent calculates the charging and discharging energy while preventing the overcharging and undercharging of the energy storage system. Using the indoor temperature prediction model constructed by an artificial neural network, the air conditioner agent schedules the energy consumption of the air conditioner while satisfying the consumer preferred indoor temperature. The performance of the proposed algorithm was validated in the simulation study, and the results confirm the economic advantages of the proposed approach compared to the existing optimization approach using mixed-integer linear programming.
In future work, we plan to develop a multi-agent reinforcement learning algorithm that schedules the energy consumption of multiple smart homes with distributed energy resources and smart home appliances. A key challenge lies in how to design an efficient communication scheme between multiple smart homes to achieve energy savings and maintain the consumer comfort level. In addition, the practical implementation of the developed algorithm should be tested in large-scale realistic electric power networks. Last but not least, we plan to integrate advanced neural network models such as recurrent neural networks and long short-term memory in the proposed framework to improve the prediction accuracy of the indoor temperature.

Conflicts of Interest:
The authors declare no conflict of interest.

Nomenclature
The main notations are summarized below. Other undefined symbols are explained in the text:
Binary charging and discharging state of ESS a at time slot t: "1" for charging, "0" for discharging
$b^{c,NI}_{a,t}$  Binary consumption state of non-interruptible shiftable appliance a at time slot t: "1" for consumption, "0" otherwise
$s_t$  State at time slot t
$a_t$  Action at time slot t
r WM