Model-Free Reinforcement Learning Economic Dispatch Algorithms for Price-Based Residential Demand Response Management System

It is projected that plug-in electric vehicles (PEVs), as household appliances, will steadily increase in number. However, PEVs' high power consumption, stochastic usage patterns, and storage capacity will surely increase the elasticity of demand response and pose significant difficulties for price-based residential demand response management (PRDRM). This article aims to optimize a two-tier globally shared nonconvex PRDRM problem with local constraints and PEVs, known as social welfare: maximizing retailer profits while minimizing the combined residential costs. This is done by balancing residential electricity use with retail electricity prices in an unknown market environment. The proposed online/offline model-free reinforcement learning-based economic dispatch (MFRL-ED) methods can adaptively decide on the ideal retail price sequence by integrating the daily residential-retailer behavior model with the agent-environment interaction method, providing a basic MFRL-ED solution for PRDRM without a system identification step or an accurate load-retail model. Experiments show that the MFRL-ED methods provide an effective class of PRDRM solutions.

Residential demand response management (RDRM) is among the key technologies in the deployment of energy demand response [1]. The main goal of RDRM, which is a key component of home energy management systems (HEMS), is to leverage changes in the energy use of loads in response to time-varying tariffs or "reward and penalty" incentives to achieve cost savings or other advantages [2]. However, developing effective RDRM strategies for households is extremely challenging due to the stochasticity and elasticity of residential power consumption. Specifically, the timing and frequency of appliance turn-on and turn-off are uncertain and hard to predict because of residents' lifestyle routines. The complexity of RDRM increases further when appliances are classified as dispatchable or non-dispatchable according to the transferability of their energy consumption. These factors make it difficult for RDRM to efficiently plan the timing of power demand in response to dynamic tariffs. In addition, for efficient load operation, accurate appliance models and parameters need to be determined in time to capture the power characteristics and operating dynamics of these appliances; however, such expertise is not always available to the average household.
To solve the above-mentioned difficulties of RDRM, scholars have proposed a series of economic dispatch (ED) approaches. Earlier RDRM works mainly focus on minimizing the household's electricity cost. For example, [3] and [4] combine mixed-integer linear programming models with the demand response of appliances to reduce daily household energy consumption, but the elasticity of appliance usage and dynamic electricity prices are not considered. Then, [5] proposes a robust optimization method to minimize the worst-case daily bill payment by considering the uncertainty of consumer behavior. To ensure the probabilistic satisfaction of appliance operating constraints, a chance-constrained optimization model is developed in [6]. A Lyapunov optimization algorithm is applied to loads with heating, ventilation and air conditioning (HVAC) in [7]. Currently, there are two branches of RDRM: price-based RDRM (PRDRM) encourages loads to adjust their energy consumption according to time-based pricing mechanisms, with common strategies such as real-time pricing [8] and time-of-use (TOU) pricing [9], while incentive-based RDRM (IB-RDRM) [10] provides incentives/penalties for loads that contribute/fail to reduce demand during peak periods [11]. PRDRM is more in line with residential electricity consumption habits and has been widely adopted in many countries [12]; thus this article concentrates on PRDRM. From the perspective of benefits, current ED research on PRDRM falls into three parts. For individual benefits, the preference is to reduce electricity costs or provide other benefits to customers by choosing an appropriate pricing mechanism; for example, [13] investigates the distributed generation scheduling problem considering the uncertainty of renewable energy sources and the different personality types of consumers. For the interest of the energy company, the pursuit is to maximize the company's profit or minimize the generation cost [14]. To meet the reasonable demands of social development, it has become a trend to integrate the interests of both sides, so maximizing social welfare has become a new research hotspot; a distributed dual decomposition-based (DDB) approach [15] and its fast version [16] are proposed to maximize social welfare. From the perspective of more diverse energy options and management, the optimization and control of HVAC systems are considered in [17], [18]. However, unsolved problems remain in the aforementioned efforts. 1) They require system identification steps, i.e., explicitly optimizing the model, predictor, and solver; developing a model-based demand response strategy requires constructing a model and identifying parameters, and performance may degrade due to model inaccuracy. 2) The existing PRDRM works rely heavily on deterministic pricing models (e.g., TOU, real-time pricing) that do not reflect the uncertainty and flexibility of dynamic electricity markets. 3) The short-sightedness of the grid leads to a focus on the immediate response of loads to the current pricing strategy and an inability to predict the impact of all subsequent responses. Therefore, it is significant to develop an approach that solves the PRDRM problem in smart grids under an unknown residential environment model.
Deep reinforcement learning (DRL) has been widely used in industry in recent years. It can overcome the above problems by exploiting the end-to-end learning capability of neural networks (NNs) and has achieved remarkable success in many complex decision-making applications, such as distributed economic dispatch (DED) in smart grids [19]. Such model-free reinforcement learning (MFRL) algorithms have inspired researchers to investigate DRL-based PRDRM as one of the energy scheduling problems [20], [21]. In [22], a group smart home energy management scheme is developed to minimize the energy cost and thermal discomfort of users. Ref. [23] proposes a deep Q-network (DQN)-based demand response scheduling method for indoor air temperature control and thermal comfort management. The authors in [24] and [25] develop DQN-based approaches to optimize the charging scheduling of electric vehicles (EVs) in smart homes to minimize charging costs. In [26], an online building energy optimization method for scheduling timescales and time-shifted loads using the DQN method is proposed. As can be observed, the Q-learning framework is used by the majority of the aforementioned DRL-ED efforts to address demand response issues. However, due to the overestimation and the curse of dimensionality caused by the structural defects of the Q-learning framework, they lack further modeling extensions and cross-sectional comparisons of the convergence and performance of DRL.
Drawing inspiration from the use of RL in energy scheduling, we utilize MFRL-ED to plan the best retail prices (RPs) in an uncertain electricity market setting. The contributions are summarized as follows.
1) Despite the highly stochastic usage patterns of PEVs, we effectively solve the globally shared nonconvex PRDRM optimization problem with local constraints and nonsmooth terms. Additionally, in contrast to previous studies [24], [25] that only take charging into account, a load model with both charging and discharging characteristics and complex constraints is integrated into our PRDRM problem, and a detailed workflow of the DRL-ED-based charging and discharging mechanism is proposed.
2) The Q-learning framework is used in the majority of existing MFRL-ED studies [19], [20], [23], [24], [25]; its overestimation is likely to lead to local solutions, and it cannot resolve the curse of dimensionality in continuous RP intervals. To address these issues and provide the underlying MFRL-ED framework for the PRDRM problem, we propose the offline double DQN-based ED (DDQN-ED) algorithm and the online Actor-Critic-based ED (AC-ED) algorithm, respectively, and compare their relative strengths and weaknesses. Simulation experiments verify their effectiveness and scenario applicability.
3) This study unifies both interests, in contrast to the RDRM problems in [13], [14], which exclusively focus on business or personal interests. Considering a large number of residences sharing a single electricity retailer, we build a two-level optimization model and focus on a single social welfare goal through relative social value weighting, i.e., seeking a balance between the business interests of energy retailers and the individual benefits of users, which is consistent with the ideal of a harmonious economic community.
The rest of this article is organized as follows. Section II formulates the PRDRM problem. Section III presents a series of MFRL algorithms and their coupling with PRDRM. Section IV demonstrates the effectiveness and advancement of the designed algorithms through simulation experiments. Section V concludes this article.

II. FORMULATION OF PRDRM
The fundamental framework of RDRM is depicted in Fig. 1, which features residences, electricity retailers (ERs), energy markets (EMs), and the independent system operator (ISO). While EMs supply ERs with wholesale power, ERs are in charge of supplying retail electricity to homes in specific regions, and the ISO oversees the commercial functioning of the market. Supposing that there is only one-way energy transfer between ERs and EMs, we explore the PRDRM problem between retailers and residents by coordinating RPs with residential daily electricity consumption using a variety of MFRL techniques.

A. Modeling of Residential Appliances
According to whether the energy they consume can be shifted or curtailed, residential appliances can generally be divided into dispatchable ones (e.g., washing machines, dishwashers) and non-dispatchable ones (e.g., electric lights, refrigerators) [27]. As a class of dispatchable appliances with charging and discharging characteristics, PEVs are specifically considered in the PRDRM problem. Fig. 2 depicts a model of power transmission and communication between the residences and the electricity retailer. The red solid lines indicate the electrical wiring infrastructure: the retailer delivers electricity to a region of residences, and PEV charging is managed by a smart but simple device installed in the user's home. The blue dotted lines indicate the underlying communication and information system: a two-way information flow exists between the retailer and the residences. The ER receives the actual energy consumption of the residents in the previous time slot and the expected energy demand in the current time slot, and then dynamically adjusts its RP strategy with business interests in mind, while customers actively change their energy consumption in line with the RP change in the current time slot. Therefore, the set of dispatchable appliances is denoted by N_d = {1, ..., D} (excluding PEVs), the set of non-dispatchable appliances by N_n = {1, ..., N}, the set of PEVs by N_p = {1, ..., P}, and the set of all appliances can be expressed as N = N_d ∪ N_n ∪ N_p.
Remark 1: Considering a region with the same electricity retailer, the star topology is naturally applied. However, it is more realistic that residences in different areas are free to choose multiple retailers at the same time, so PRDRM based on a network structure with multiple retailers and multiple residential areas needs to be developed using the distributed MFRL-ED technique, which is our later effort. And the set consisting of retailers should satisfy the minimum point coverage.

1) Dispatchable Appliances: Motivated by [29], the actual energy consumption of the dispatchable appliance d ∈ N_d is formulated as (1), where E_{d,t} (kWh) is the actual electricity consumption, ρ_{d,t} ($/kWh) is the RP of electricity decided by the retailer, and θ_t ($/kWh) is the wholesale price (WP) bought by the ER from the EMs. Considering the profit of the ER, ρ_{d,t} ≥ θ_t holds. δ_t < 0 is the price elasticity coefficient, which captures the interrelationship between the energy demand and the RP.
In this day-based electricity trading model, the user sends the signal (R_{d,t}, E_{d,t}) to the ER at time slot t, and the retailer obtains the corresponding profit estimate from the feedback signal and makes an adjustment decision about the RP ρ_{d,t+1}. The essence of (1) is that R_{d,t} is the maximum consumption of the residence: customers are willing to consume more electricity when the WP θ_t is close to the RP ρ_{d,t}, but they cannot accept a high retail price and thus consume less, which is in accordance with intuitive human behavior. Therefore, the demand error can represent the happiness index of residents' electricity consumption. We denote this happiness characteristic by the quadratic electrical happiness error function C_{d,t} in (2), where h_{d1} ($/kWh²) and h_{d2} ($/kWh) are happiness coefficients related to the appliance. C_{d,t} describes a quadratic error between the actual energy consumption E_{d,t} and the expected energy demand R_{d,t}. Therefore, the closer E_{d,t} is to R_{d,t} (i.e., the smaller the error C_{d,t}) under a reasonable RP at time slot t, the higher the happiness of the residents; however, when residents' incentive to consume energy is limited by a high RP, their happiness becomes lower. Then, the available range of the electrical happiness error function is constrained by (3), where DE^min_d and DE^max_d are the minimum and maximum demand error bounds, respectively.
2) Non-Dispatchable Appliances: Non-dispatchable appliances satisfy the identity between the energy demand R_{n,t} and the actual consumption E_{n,t} in all time slots, i.e., E_{n,t} = R_{n,t} in (4).

3) PEVs: PEVs have all the characteristics of dispatchable appliances, and the actual energy consumption E_{p,t}, p ∈ N_p, is formulated as (5). Note that E_{p,t} < 0 means discharging, while E_{p,t} > 0 denotes charging. Then the electrical happiness error function C_{p,t} of the PEVs can be expressed as (6), where the parameters are interpreted similarly to (1). C_{p,t} indicates that the ERs cannot fully satisfy the PEV owners' willingness to charge. In addition, vehicles with on-board batteries must respect the rated power limit |E_{p,t}| ≤ E^r_p at every time slot for safety, where E^r_p denotes the rated power of PEV p. Considering the charging/discharging characteristics of the PEVs and the battery capacity, PEV p is further subject to the capacity limitation (8), where E^min_p and E^max_p represent the minimum and maximum electric quantity, respectively, E^0_p denotes the initial energy level, and e_p indicates the charging or discharging efficiency. We also consider the effect of frequent charging/discharging on the battery life and therefore quantify the cost of battery degradation in (9) [30], where υ ($/kWh) is the degradation coefficient. Due to ISO regulation, the RPs of electricity are subject to ρ_min ≤ ρ_{n,t}, ρ_{d,t}, ρ_{p,t} ≤ ρ_max with n ∈ N_n, p ∈ N_p, d ∈ N_d, where ρ_min and ρ_max are the bounds of the RP.

Remark 2: Significant progress has been made in research on the monitoring and measurement of household load classification [31]. In non-invasive methods, classification metering is based on meters installed at the entrance of the power line, so that the total characteristics of the appliances in the circuit can be collected and analyzed; the classes of appliances are then identified through processing methods such as signal processing, pattern recognition, and artificial neural networks, after which the classified metering of each type of load can be implemented [32], [33]. With the increasing maturity of smart grid technology and users' awareness of energy conservation, classified monitoring and metering of household appliances with different load types is expected to become widespread in the future.
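To make the PEV limits concrete, the following Python sketch checks a candidate 24-slot PEV dispatch against a rated-power limit and a cumulative capacity limit of the kind described by (7)-(8). The specific battery-level bookkeeping, the constant efficiency, and all numerical values are illustrative assumptions rather than the paper's exact formulations.

```python
import numpy as np

# Illustrative parameters (assumed values, not taken from the paper).
E_RATED = 6.0              # rated power E^r_p (kWh per slot)
E_MIN, E_MAX = 2.0, 24.0   # battery capacity bounds E^min_p, E^max_p (kWh)
E_0 = 12.0                 # initial energy level E^0_p (kWh)
EFF = 0.95                 # charging/discharging efficiency e_p

def feasible_pev_dispatch(E_p):
    """Check a 24-slot dispatch (positive = charging, negative = discharging)
    against the rated-power limit and an assumed cumulative capacity limit."""
    E_p = np.asarray(E_p, dtype=float)
    if np.any(np.abs(E_p) > E_RATED):          # rated-power constraint, cf. (7)
        return False
    level = E_0 + EFF * np.cumsum(E_p)         # assumed battery-level bookkeeping, cf. (8)
    return bool(np.all((level >= E_MIN) & (level <= E_MAX)))

if __name__ == "__main__":
    dispatch = np.zeros(24)
    dispatch[2:6] = -1.5    # discharge during assumed high-price slots
    dispatch[20:24] = 2.0   # recharge later in the day
    print("dispatch feasible:", feasible_pev_dispatch(dispatch))
```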

B. Social Welfare
In this article, PRDRM is a two-level optimization problem: from the user's point of view, the residential integrated cost is minimized to obtain the optimal actual energy consumption, while from the perspective of the ER, maximizing the company's profit is the core business objective. These two wishes are considered together, and the relative social value of the commercial profit and the integrated cost of customers is expressed by maximizing social welfare [34].
The optimization problem of minimizing the comprehensive electricity cost of the residents is expressed as (11), where EC_t ($) indicates the integrated cost at time slot t, and the energy consumption vector E_t contains all dispatchable appliances, non-dispatchable appliances and PEVs. For the benefit of the retailer, its profit maximization problem is formulated as (12), where EP_t ($) represents the retailer profit at time slot t, and P is the vector of electricity RPs for the three types of appliances. Based on (1), (4) and (5), the actual energy consumption E_t can be determined by P, so the two-tier optimization problems (11) and (12) can represent social welfare through a parameter trade-off, i.e., the PRDRM nonconvex optimization problem can be written as (13), where ω is the relative social value weight that balances commercial profits against residential energy consumption. When the weight ω is set to 0, the residents' interests are taken into account completely, and when ω is set to 1, only the interests of the ER are considered.
Remark 3: The peak rebound problem (the peak may shift to another time of the day [35]) may occur when the penetration of dwellings benefiting from PRDRM is high for a given social value weight ω. The reason behind the load accumulation is the intuitive human behavior of using more appliances at the lowest RP [36], [37]. This forces the ER to buy more power from the EM to meet the load demand, which may not only result in financial loss to the ER but also increase the power imbalance, the power loss in the network, and grid instability, leading to voltage violations and overloading of transformers and distribution lines. Therefore, considering the properties of the weight ω, we can bias ω towards the ER in time slots with a low RP to ensure that the profit obtained by the ER is stable, while the intuitive behavior of residents will autonomously reduce appliance use in those time slots.
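As a minimal illustration of this trade-off, the sketch below combines a retailer profit EP_t and a residential cost EC_t with the weight ω. The specific convex combination used here is an assumption chosen only to match the limiting cases ω = 0 and ω = 1 described above; the paper's exact objective is (13).

```python
def social_welfare(retailer_profit, resident_cost, omega):
    """Assumed weighted social welfare: omega * EP_t - (1 - omega) * EC_t,
    consistent with omega = 0 (residents only) and omega = 1 (retailer only)."""
    return omega * retailer_profit - (1.0 - omega) * resident_cost

# Example: trading off a retailer profit of 40 $ against a resident cost of 25 $.
for omega in (0.0, 0.5, 1.0):
    print(omega, social_welfare(40.0, 25.0, omega))
```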
Problem (13) is a nonconvex optimization problem with a globally shared objective function F_t(P), local constraints and nonsmooth terms. However, the associated optimization algorithms, such as the forward-backward splitting method [38], require not only regularity assumptions on the objective function but also an accurate load-retail model. Considering the complex grid environment, this article proposes two types of MFRL-ED algorithms to optimize the objective (13).

III. MFRL-ED ALGORITHMS FOR PRDRM
In this section, we introduce a class of MFRL-ED algorithms to address problem (13). Instead of handling the complexity of the objective functions and constraints through the construction of appliance models and accurate parameter prediction, the proposed online/offline RL algorithms exploit the transmission data from the grid to make RP decisions based on policy exploration. The basic RL element consists of the five-tuple (S, A, R, T_t, γ), which corresponds to the retailer-resident electricity trading model as follows: the state s_{i,t} ∈ S collects the expected energy demand and the previous actual consumption (R_{i,t}, E_{i,t−1}) of appliance i, the action a_{i,t} ∈ A is the RP ρ_{i,t} chosen from the discretized RP interval, the reward is the social welfare F_t(P), and each observed tuple is called a transition. Additionally, there ought to be a greatest lower bound for the difference Δρ_{i,t} between actual transactions, which is another reason for discretizing the RP interval. We refer the readers to [28] for the remaining concepts.
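The following sketch illustrates how one agent-environment interaction of the trading model can be packaged as a transition: the state carries the expected demand and the previous consumption, the action is a posted RP, and the reward is the observed social welfare. The `demand_response` and `welfare` callables are hypothetical placeholders, since the load-retail model is assumed to be unknown to the agent.

```python
from dataclasses import dataclass

@dataclass
class Transition:
    """One agent-environment interaction: (state, RP action, reward, next state)."""
    state: tuple       # (R_it, E_prev): expected demand and previous actual consumption
    action: float      # retail price rho_it chosen from the discretized RP set
    reward: float      # social welfare F_t(P) fed back by the environment
    next_state: tuple

def step(state, rho, demand_response, welfare):
    """Post an RP, observe the residents' response, and package the feedback.
    `demand_response` and `welfare` stand in for the unknown load-retail model."""
    R_it, _E_prev = state
    E_it = demand_response(R_it, rho)      # residents react to the posted RP
    reward = welfare(rho, E_it)            # social welfare observed by the retailer
    return Transition(state, rho, reward, (R_it, E_it))

if __name__ == "__main__":
    # Toy stand-ins for the unknown environment, for illustration only.
    demand_response = lambda R, rho: max(R - 0.5 * rho, 0.0)
    welfare = lambda rho, E: rho * E - 0.1 * E ** 2
    print(step((8.0, 7.5), 3.2, demand_response, welfare))
```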

A. Q-Table-Based ED
The Q-Table-based ED algorithm for PRDRM is given in [29]. In general, the Q-table is updated iteratively according to
$$Q^{k+1}_{s_{i,t},a_{i,t}} = Q^{k}_{s_{i,t},a_{i,t}} + lr\big(F_t(\mathbf{P}) + \gamma \max_{a\in\mathcal{A}} Q^{k}_{s_{i,t+1},a} - Q^{k}_{s_{i,t},a_{i,t}}\big) \qquad (14)$$
with a_{i,t} ∈ A, s_{i,t} ∈ S, i ∈ N, where k denotes the episode index and lr indicates the learning rate. When the Q-table converges, the optimal RP sequence can be obtained using the target policy
$$\pi^{*}_{s_{i,t}} = \arg\max_{a\in\mathcal{A}} Q_{s_{i,t},a} \qquad (15)$$
where π*_{s_{i,t}} is also called the greedy strategy. This Q-learning ED algorithm requires creating a corresponding Q-table for each appliance in advance, and the growth in the number of appliances, discrete RPs and time slots all impose a significant burden on storage and computation. Hence, the Q-Table-based ED algorithm has to take into account an appropriate discrete action space and residential area scale when solving the RDRM problem.
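A minimal tabular sketch of the update (14) and the greedy target policy (15) is given below, with the social welfare feedback replaced by a random placeholder and with illustrative state/action sizes (44 discrete RPs as in Section IV).

```python
import numpy as np

n_states, n_actions = 10, 44      # e.g., 44 discretized RPs (illustrative sizes)
Q = np.zeros((n_states, n_actions))
lr, gamma, eps = 0.1, 0.9, 0.2

def q_update(Q, s, a, reward, s_next):
    """Tabular Q-learning step of the form used in (14):
    Q(s,a) <- Q(s,a) + lr * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = reward + gamma * np.max(Q[s_next])
    Q[s, a] += lr * (td_target - Q[s, a])

def epsilon_greedy(Q, s, rng):
    """Behavioral policy: explore with probability eps, otherwise act greedily, cf. (15)."""
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[s]))

rng = np.random.default_rng(0)
s = 0
for _ in range(100):                       # toy interaction loop
    a = epsilon_greedy(Q, s, rng)
    reward = rng.normal()                  # placeholder for social welfare feedback
    s_next = int(rng.integers(n_states))   # placeholder for the next state
    q_update(Q, s, a, reward, s_next)
    s = s_next
print("greedy RP index in state 0:", int(np.argmax(Q[0])))
```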

B. DQN-ED
We substitute a Q-value function approximation for the Q-table in order to address the dimensionality issue brought on by too many appliances and continuous RP intervals. The approximation is usually constructed using deep neural networks (DNNs), i.e.,
$$Q_{s_{i,t},a_{i,t}} \approx \hat{Q}(s_{i,t}, a_{i,t}; \alpha_{i,t}) \qquad (16)$$
where α_{i,t} denotes the weight of the i-th Q-network. We then simply need to concentrate on the convergence of the weights α_{i,t}, much like the convergence of the q-values in the Q-table. Ref. [39] uses a simplified three-layer DNN to represent the linear approximation of the q-function. The q-function is an evaluation function for the pair (R_{i,t}, E_{i,t−1}) and ρ_{i,t} based on the prediction of the future social welfare F_t(P). The target q-value can then be indicated as
$$Q^{k}_{s_{i,t},a_{i,t},\pi_{s_{i,t}}} = F_t(\mathbf{P}) + \gamma \max_{a\in\mathcal{A}} \hat{Q}^{k}(s_{i,t+1}, a; \alpha_{i,t}) \qquad (17)$$
and the quadratic loss function of the Q-network can be expressed as
$$L(\alpha_{i,t}) = \big(Q^{k}_{s_{i,t},a_{i,t},\pi_{s_{i,t}}} - \hat{Q}^{k}(s_{i,t}, a_{i,t}; \alpha_{i,t})\big)^{2} \qquad (18)$$
Thus the weight α_{i,t} can be updated iteratively using the gradient descent method. Based on (17)-(18), we present the online DQN-ED algorithm for PRDRM in Algorithm 1. The historical observations {ρ_{i,t}, F_t(P), R_{i,t}, E_{i,t−1}} gathered during exploration are directly exploited by the Q-network, in which data correlation easily leads to local solutions. In addition, the max operation, although it quickly pushes the q-values toward the possible optimization objectives, can easily overshoot and lead to the overestimation problem.
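The sketch below spells out the target (17) and the quadratic loss (18) with a tiny linear stand-in for the Q-network; the paper uses a three-layer DNN, and all shapes and values here are illustrative.

```python
import numpy as np

def q_values(state, W):
    """Tiny stand-in for the Q-network: a linear map from state features to
    q-values over the discretized RP actions (the paper uses a three-layer DNN)."""
    return state @ W

def dqn_target(reward, next_state, W, gamma=0.9):
    """Target q-value in the spirit of (17): r + gamma * max_a Q(s', a; W)."""
    return reward + gamma * np.max(q_values(next_state, W))

def dqn_loss(state, action, reward, next_state, W):
    """Quadratic loss in the spirit of (18): (target - Q(s, a; W))^2."""
    y = dqn_target(reward, next_state, W)
    return (y - q_values(state, W)[action]) ** 2

rng = np.random.default_rng(1)
W = rng.normal(size=(4, 44))              # 4 state features, 44 discrete RPs
s, s_next = rng.normal(size=4), rng.normal(size=4)
print("loss:", dqn_loss(s, action=7, reward=1.3, next_state=s_next, W=W))
```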
Algorithm 1: Online DQN-ED for PRDRM.
Input: Learning rate lr, discount factor γ, maximum episode index K, total time slots T, convergence threshold ξ.
 ...
 3: for all t = 1 : T do
 4:   Choose the RPs {ρ_{i,1}, ..., ρ_{i,T}} by using the ε-greedy strategy as the behavioral policy;
 5:   Calculate the actual electricity consumption E_{i,t} by T_{i,t};
 6:   Identify the estimated q-value Q̂^k;
 7:   Calculate the social welfare F_t(P) by (13);
 8:   if t < T then
 9:     Calculate the current q-value Q^k_{s_{i,t},a_{i,t},π_{s_{i,t}}} by (17);
10:   else
11:     Identify the current q-value Q^k_{s_{i,t},a_{i,t},π_{s_{i,t}}} = F_t(P);
12:   end if
13: end for
14: Update the weight α^k_{i,t} by gradient descent.

Here, DDQN-ED is employed to eliminate the overestimation problem by decoupling the two steps of RP selection and the calculation of the target q-value. The social welfare expectation of DDQN is demonstrated to be an unbiased estimation [40]. In the following, we briefly construct the coupling of DDQN-ED with PRDRM.
DDQN-ED adopts two identical Q-network structures for each appliance: the current Q-network with weight α^k_{i,t} and the target Q-network with weight ᾱ^k_{i,t}, which are responsible for the RP decision and the RP estimation based on the q-function, respectively. Based on (17), the target q-value can be expressed as
$$Q^{k}_{s_{i,t},a_{i,t},\pi_{s_{i,t}}} = F_t(\mathbf{P}) + \gamma\, \hat{Q}^{t}_{s_{i,t+1},\,a^{*}_{t+1},\,\bar{\alpha}^{k}_{i,t}}, \qquad a^{*}_{t+1} = \arg\max_{a\in\mathcal{A}} \hat{Q}^{c}_{s_{i,t+1},\,a,\,\alpha^{k}_{i,t}} \qquad (19)$$
where Q̂^t_{s_{i,t+1},a_{t+1},ᾱ^k_{i,t}} denotes the q-value of the target Q-network, and Q̂^c_{s_{i,t+1},a_{t+1},α^k_{i,t}} represents the estimated q-value of the current Q-network. The loss function of DDQN can then be represented as the quadratic error (20) between this target and the current network's estimate. Using (19) and (20), the offline DDQN-ED algorithm with experience replay for PRDRM is presented in Algorithm 2. The ER interacts with the residences and stores the transitions in the experience replay buffer D with maximum capacity F based on the current Q-network α^k_{i,t}. Then M transitions are sampled and removed from D to train the current Q-network, while the target Q-network weight ᾱ^k_{i,t} is updated with a fixed update period C. This delayed update reduces the parameter dependency between the target Q-network and the current Q-network. Therefore, DDQN-ED eliminates the overestimation problem by decoupling the selection of the current action from the calculation of the target q-value. The detailed structures in Algorithm 2, such as the replay buffer and the network models, are shown in Fig. 3. The experience replay buffer is D = {T^f_i}, f = 1, ..., F, where the f-th transition T^f_i stores the observation {ρ_{i,t}, F_t(P), R_{i,t}, E_{i,t−1}}. Note that when the capacity of the replay buffer D is set to 1, Algorithm 2 reduces to an online DDQN-ED. Moreover, each appliance should have the same maximum buffer capacity F and update period C in order to keep the algorithm iterating synchronously.
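A compact sketch of the DDQN target (19) and of the sample-and-remove replay buffer described above follows; it also contrasts the DDQN target with the plain DQN target (17), which can never be smaller for the same target network (cf. Remark 5). The buffer capacity and the random values are illustrative.

```python
import random
from collections import deque
import numpy as np

def ddqn_target(reward, q_next_current, q_next_target, gamma=0.9):
    """DDQN target in the spirit of (19): the current network selects the next RP,
    the target network evaluates it."""
    a_star = int(np.argmax(q_next_current))        # selection: current Q-network
    return reward + gamma * q_next_target[a_star]  # evaluation: target Q-network

class ReplayBuffer:
    """Experience replay D with maximum capacity F; sampled transitions are removed."""
    def __init__(self, capacity=20):
        self.buffer = deque(maxlen=capacity)
    def push(self, transition):
        self.buffer.append(transition)
    def sample_and_remove(self, m):
        m = min(m, len(self.buffer))
        picked = set(random.sample(range(len(self.buffer)), k=m))
        batch = [tr for i, tr in enumerate(self.buffer) if i in picked]
        self.buffer = deque((tr for i, tr in enumerate(self.buffer) if i not in picked),
                            maxlen=self.buffer.maxlen)
        return batch

rng = np.random.default_rng(2)
q_cur, q_tgt = rng.normal(size=44), rng.normal(size=44)
print("DDQN target:", ddqn_target(1.0, q_cur, q_tgt))
print("DQN  target:", 1.0 + 0.9 * np.max(q_tgt))   # always >= the DDQN target above
```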
Remark 5: In DQN, since the target q-value is calculated through the target Q-network, the RP that maximizes the target q-value is selected according to the parameters of the target Q-network itself. In DDQN, the optimal RP is instead selected with the current Q-network Q̂^c_{s_{i,t+1},a_{t+1},α^k_{i,t}}, and the q-value subsequently evaluated by the target Q-network at this RP must be less than or equal to the original q-value. This approach reduces the overestimation to a certain extent and makes the q-value closer to the real value.
Algorithm 2: Offline DDQN-ED for PRDRM.
 1: Initialize: the replay buffer D, maximum buffer capacity F, update period C, time slot t = 0, episode index k = 0.
 2: for all k = 1 : K or |Q^k − Q^{k−1}| ≥ ξ do
 ...
    for all t = 1 : T do
 ...
 5:   Identify the estimated q-value Q̂c^k;
 ...
      Update the current Q-network's weight by (20);
 ...

Remark 6: The iterative formulation (14) is derived from the Bellman equation and the greedy strategy. For the sake of clarity, we assume that the q-value is optimal; then we have the optimal q-function Q*_{s_t,a_t} = max_π E(r_t | s_t, a_t, π), where E(·) denotes the expectation. However, the optimal q-function should satisfy the Bellman equation Q*_{s_t,a_t} = E(r_t + γ max_{a_{t+1}} Q*_{s_{t+1},a_{t+1}} | s_t, a_t). Hence, the overestimation problem can be attributed simply to the inequality E(max(Q_{s_1,a_1}, Q_{s_2,a_2}, ..., Q_{s_T,a_T})) ≥ max(E(Q_{s_1,a_1}), E(Q_{s_2,a_2}), ..., E(Q_{s_T,a_T})). The result shows that fitting the q-function using gradient descent yields a larger estimated expectation. In summary, the overestimation problem is unavoidable for RL algorithms based on the Q-learning framework, because the greedy strategy is adopted to maximize cumulative rewards in unknown environments while maintaining efficient exploration.
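The inequality in Remark 6 can be checked numerically: with unbiased but noisy q-estimates for several actions of identical true value, the expectation of the maximum exceeds the maximum of the expectations, which is exactly the overestimation introduced by the greedy max.

```python
import numpy as np

rng = np.random.default_rng(0)
true_q = np.array([1.0, 1.0, 1.0])                         # identical true q-values
noisy = true_q + rng.normal(scale=0.5, size=(100000, 3))   # unbiased noisy estimates

print("E[max Q_hat] =", noisy.max(axis=1).mean())   # > 1: greedy max overestimates
print("max E[Q_hat] =", noisy.mean(axis=0).max())   # ~ 1: per-action means stay unbiased
```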

C. AC-ED
DQN-ED algorithms, which are value-based learning algorithms, maximize the social welfare expectation by selecting the best RPs based on a deterministic strategy. For the PRDRM optimization problem, it is crucial to use stochastic policies to handle the continuous RP space and obtain a more precise energy ED. Additionally, compared with the offline DDQN-ED algorithm, an ER using an online learning algorithm can maximize social welfare by adjusting prices adaptively and in real time, which matters for real-time dispatch. Therefore, this article proposes a policy-value-based learning algorithm for PRDRM, namely the Actor-Critic method, which approximates the policy distribution Pr(a_{i,t} | s_{i,t}; β_{i,t}) by adding a policy network, described by the parameter β_{i,t}, to the Q-network, where Pr(·) indicates the probability distribution. We formulate the policy network as the Actor network, and the Critic network corresponds to the Q-network. In general, the input of the Actor network is a state vector and the output is an estimated action; the structure of the Critic NN remains the same as that of the Q-network.
Here, the outputs of the Actor-Critic networks are represented as η_{i,t} = φ_a(s_{i,t}; β_{i,t}) in (22) and J_{i,t} = φ_c(s_{i,t}, a_{i,t}; α_{i,t}) in (23), with i ∈ N, t ∈ T, where β_{i,t} and α_{i,t} are the Actor NN's weight and the Critic NN's weight, respectively, φ_a and φ_c are the activation functions, the estimated action η_{i,t} is the output of the Actor network, and the action-value function J_{i,t} is the output of the Critic network at time slot t. The temporal difference (TD) error for each appliance is given by
$$E_{i,t} = F_t(\mathbf{P}) + \gamma J_{i,t+1} - J_{i,t} \qquad (24)$$
A discount factor γ = 0 implies a short-sighted learning algorithm that focuses on immediate rewards and ignores future state values; at the other extreme, γ = 1 indicates that the learning algorithm gives fair weight to the rewards of all time slots. The quadratic loss function for the Critic NN is then defined as the square of the TD error in (25). For the Actor network, we utilize the TD error as the evaluation function. Then, based on back propagation, the update formulas for the Actor and the Critic are given in (26) and (27), where la and lc denote the learning rates of the Actor and the Critic, respectively. The details are illustrated in Fig. 4. Because we consider a discrete set of RPs, the Actor network adopts the softmax function to approximate the optimal policy. For details of the AC-ED algorithm, see Algorithm 3.
Remark 7: Algorithm 3 can be divided into three steps: 1) calculate the pairs (s_{i,t}, a_{i,t}) of the current time slot and the next time slot, respectively; 2) identify the evaluation function E_{i,t} from the calculated action-value functions J_{i,t} and J_{i,t+1}; 3) update the AC-ED parameters α_{i,t} and β_{i,t} using E_{i,t}.
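The sketch below performs one Actor-Critic update driven by a TD error of the form (24): the Critic weights move along the TD error, and the Actor increases the log-probability of the chosen RP in proportion to it. For brevity a linear state-value Critic and a linear-softmax Actor are assumed here, whereas the paper's Critic outputs the action value J_{i,t}; the learning rates and dimensions are illustrative rather than taken from (26)-(27).

```python
import numpy as np

n_features, n_actions = 4, 44
rng = np.random.default_rng(3)
beta = np.zeros((n_features, n_actions))   # Actor (policy) weights
alpha = np.zeros(n_features)               # Critic (value) weights
la, lc, gamma = 0.01, 0.05, 0.9

def policy(s):
    """Softmax policy over the discretized RPs."""
    logits = s @ beta
    p = np.exp(logits - logits.max())
    return p / p.sum()

def value(s):
    """Linear state-value estimate used by this simplified Critic."""
    return s @ alpha

def ac_update(s, a, reward, s_next, done):
    global alpha, beta
    td_target = reward + (0.0 if done else gamma * value(s_next))
    td_error = td_target - value(s)                # TD error, cf. (24)
    alpha += lc * td_error * s                     # Critic step along the TD error
    grad_log_pi = -np.outer(s, policy(s))          # d log pi(a|s) / d beta
    grad_log_pi[:, a] += s
    beta += la * td_error * grad_log_pi            # Actor step weighted by the TD error
    return td_error

s, s_next = rng.normal(size=4), rng.normal(size=4)
a = int(rng.choice(n_actions, p=policy(s)))
print("TD error:", ac_update(s, a, reward=1.2, s_next=s_next, done=False))
```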

D. Charging-Discharging Mechanism in DRL-ED
The procedure of the battery charging and discharging mechanism is introduced here. The charging and discharging characteristics of the battery are captured by the increase or decrease of the cumulative sum of E_{p,t}, which represents the current battery level. The actual power consumption in each iteration, positive or negative, represents charging or discharging, while the change in power consumption must satisfy the corresponding rated power and capacity constraints (7)-(8). The highly random usage pattern of PEVs means that the state of charge (SOC) of the battery cannot be known in advance of each iteration, which greatly increases the challenge of studying PRDRM with PEVs; this is why the literature [24], [25] considers only charging.
In the designed DRL-ED programs, the PEVs' RP ρ^{k−1}_{p,t−1} at the ER end, the current power demand R_{p,t} and a stochastic strategy (e.g., the ε-greedy strategy) jointly decide and drive the PEVs' charging and discharging. The process is shown in Algorithm 4. It should be noted that we ignore the effect of battery degradation on the battery capacity during charging and discharging.
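Since Algorithm 4 itself is not reproduced here, the following sketch only illustrates the kind of joint decision described above: the posted RP, the current demand and an ε-greedy perturbation determine a signed PEV energy, which is then clipped to the rated-power and capacity limits. The price threshold `rho_ref` and all other values are hypothetical.

```python
import numpy as np

def pev_action(rho, demand, level, rng,
               rho_ref=4.5, E_rated=6.0, E_min=2.0, E_max=24.0, eps=0.1):
    """Decide the signed PEV energy for one slot (positive = charge, negative = discharge).
    A cheap RP relative to rho_ref favors charging up to the demand; an expensive RP
    favors discharging; with probability eps a random action is explored."""
    if rng.random() < eps:                      # epsilon-greedy exploration
        e = rng.uniform(-E_rated, E_rated)
    elif rho <= rho_ref:                        # cheap electricity: charge
        e = min(demand, E_rated)
    else:                                       # expensive electricity: discharge
        e = -E_rated
    # Clip to the rated power and to what the battery can actually absorb/supply.
    e = float(np.clip(e, -E_rated, E_rated))
    e = float(np.clip(e, E_min - level, E_max - level))
    return e

rng = np.random.default_rng(4)
print("cheap slot    :", pev_action(rho=3.0, demand=4.0, level=10.0, rng=rng))
print("expensive slot:", pev_action(rho=6.5, demand=4.0, level=10.0, rng=rng))
```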
Remark 8: It should be noted that the PEVs' batteries are only used to supply electricity to households to ensure maximum social welfare (13). Residences can access not only the electricity provided by the ER but also the PEV batteries to supplement the household electricity. When the households consume too much electricity, they may not necessarily obtain the maximum social welfare (which is affected by the user's electrical happiness error function (6)); instead, supplementing electricity through the PEV batteries may yield higher social welfare.

Algorithm 3: Online AC-ED for PRDRM.
Input: Learning rates lc and la, discount factor γ, maximum episode index K, total time slots T, convergence threshold ξ.
Output: Convergent RP sequence {ρ*_{i,1}, ρ*_{i,2}, ..., ρ*_{i,T}}, i ∈ N.
 1: Initialize: the Actor network's weight β^0_{i,t} = rand(·), the Critic network's weight α^0_{i,t} = rand(·), time slot t = 0, episode index k = 0.
 2: for all k = 1 : K ... do
 ...
    Calculate the actual energy consumption E_{i,t} by T_{i,t};
 5: Calculate the RPs {ρ_{i,1}, ..., ρ_{i,T}} by (22);
 6: Obtain the action-value function J_{i,t} by (23);
 7: Identify the social welfare F_t(P) by (13);
 8: if t < T then
 9:   Calculate the action-value function J_{i,t+1} by (23);
10:   Calculate the TD error E_{i,t} by (24);
 ...

IV. SIMULATIONS AND NUMERICAL ANALYSES
In this section, the effectiveness of the proposed MFRL algorithms is verified by several experiments. The algorithms are implemented in MATLAB R2014a on a desktop PC with an i5-12400F CPU @ 2.50 GHz, 16 GB of RAM, and a 64-bit Windows 11 operating system.

A. Experimental Setup
We consider the energy demand response management problem for 6 dispatchable appliances {d1, d2, d3, d4, d5, d6}, 4 PEVs {p1, p2, p3, p4} and 5 non-dispatchable appliances {n1, n2, n3, n4, n5} over a whole day (24 time slots). The energy demand distributions of the PEVs, non-dispatchable and dispatchable appliances (see Fig. 5) are based on data from the Commonwealth Edison Company [42]. It can be seen that the energy demand of the non-dispatchable appliances is significantly higher than that of the dispatchable appliances and PEVs, and that the peak demand occurs during 12:00−16:00 and 18:00−24:00. In addition, the PEVs have no demand for electricity at certain times of the day due to their stochastic usage patterns, which increases the elasticity of energy demand. In order to ensure the proper functioning of the electricity market economy and to protect the reasonable demands of residents for normal electricity consumption, we must coordinate the retail pricing strategies of the electricity retailer with the electricity consumption strategies of the customers in an effort to maximize social welfare. The time-varying parameters and the appliance-related parameters are listed in Tables I and II, respectively. Note that the discrete RPs have a gap of 0.1 and, according to the price parameters in Table II, the RP interval is [2.4, 6.7]; therefore the number of discrete actions is 44 (i.e., M = 44), as verified by the short sketch at the end of this subsection.
Table III shows the specific daily RP planning obtained by Algorithm 1. It can be seen that the RPs strictly satisfy the price constraints. The overall trend of the RP planning fluctuates over the 24 time slots, affected by the social welfare and the WPs. When the RP is too high, it is detrimental to the social welfare relative to the users. For example, d1 keeps reducing its energy consumption (from 8.9 kWh to 5.3 kWh) as the retail price rises (from 3 $ to 6.6 $) in time slots 10−13; the retail price then decreases rapidly (from 6.6 $ to 4.2 $) under the influence of factors that favor the customer's interest, such as the electrical happiness error function and the PEV cost, and the residence subsequently tends to consume more electricity in the next time slot. The commercial interest of the retailer also affects the relative social welfare: for example, when the RP is too small at time slot 9, the retailer increases the RP significantly at time slot 10 for {d2, p4, n1}, while the PEVs actively discharge in time slots 2−6 and 20−24 in response to excessive electricity prices. Combining Table III and Fig. 5, it can be observed that the RP peak does not occur during the demand peak hours (12:00−16:00 and 18:00−24:00), and the average retail price (77.24 $) at the demand peak is smaller than the average price (79.62 $) at the other time slots. This is because the goal of maximizing social welfare ensures that prices are set to benefit both retailers and customers, creating a virtuous circle between retail prices and actual electricity consumption to maintain a relatively balanced social welfare. The convergence of the algorithm can also be observed: in order to maximize social welfare, the retailer constantly changes its electricity pricing strategies so that the q-values gradually converge to their maximum, considering the stochastic electricity consumption patterns and storage capacity limitations of the PEVs.
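The discretized RP action space of the experimental setup can be reproduced directly: with a gap of 0.1 over the interval [2.4, 6.7], the grid contains exactly 44 actions.

```python
import numpy as np

rp_grid = np.round(np.arange(2.4, 6.7 + 0.05, 0.1), 1)    # discrete RPs with a 0.1 gap
print(rp_grid[:5], "...", rp_grid[-3:])
print("number of discrete actions M =", len(rp_grid))      # 44
```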
2) DQN-ED Results: We combine deep learning with Q-learning to construct Q-networks that replace the evaluation role of the Q-tables, and propose two algorithms: online DQN-ED and offline DDQN-ED, where the capacity of the experience buffer D is set to F = 20 and M = 15 transitions are sampled and removed from D at each iteration. Fig. 9 shows the weights of the hidden/output layers and the convergence of Q^k over the 24 time slots for p3, which demonstrates the effectiveness of the DQN-ED algorithms. Fig. 10 shows the evolution of the q-values of DQN and DDQN over the 24 time slots, respectively. The DQN-ED algorithms coordinate the energy consumption of each appliance and, based on the RPs with instant feedback, make decisions on the expected actual energy consumption through the greedy strategy, eventually converging to the optimum.
Due to the complexity of the RDRM problem (e.g., the nonconvexity of the optimization problem and the setting of algorithm-specific parameters), the scheduling policies obtained by different learning algorithms are often distinct. Fig. 11 shows the actual energy consumption and retail price planning during the 24 time slots for the four PEVs under the DDQN-ED and DQN-ED algorithms. As can be seen, although the two algorithms obtain different RPs and actual consumptions, the relative actual consumptions of the two algorithms determine the relative retail prices. In particular, when the two algorithms perform discharging (E_{p,t} < 0) and charging (E_{p,t} > 0) at a certain time slot, respectively, the RP of discharging tends to be smaller than the RP of charging. In addition, although the RPs are not identical under the different algorithms, the variation trend of the RP is similar across the 24 time slots; we believe this is because both use the same behavioral policy (ε-greedy strategy) and target policy (greedy strategy).
3) AC-ED Results: For the discrete action space, the Actor policy network utilizes the softmax function to select the optimal action by the maximum probability principle for each appliance, i.e., the Actor outputs are mapped to a probability distribution Pr(a_j | s_{i,t}), j ∈ M, i ∈ N, over the 44 discrete actions. All action probabilities are randomly initialized before training; during training, the evaluation function E_{i,t} is used to maximize the objective function and select the optimal action, and the proportion of that action's probability in the action set is continuously increased while the probabilities of the other actions are reduced, eventually converging to 1. Fig. 13 shows the evolution of the action values (23) for d3 and p2, respectively, where the different colors represent different time slots. The Critic networks finally converge due to the stability of the action values, demonstrating that the algorithm's efficacy is assured.
4) Comparisons: Fig. 14 shows the average level of social welfare for the four proposed algorithms after 10 trials. The following features can be observed. Since the Q-learning method is a class of value-based (q-value) algorithms, which tend to fall into overestimation by maximizing the q-value, its curves are characterized by large amplitude, high frequency, and fast convergence. The disadvantage is that it may not converge to optimality for nonconvex optimization objectives, so more potential solutions have to be explored and learned through ε-greedy strategies. In contrast, the online AC-ED method is capable of both learning the policy function (22) and evaluating the current value function (23), as well as continuously optimizing the policy network by the TD error (24), so it has better stability and practicality. However, the high data correlation caused by the online learning method directly affects the learning ability of the Actor-Critic network (some of the action values are updated slowly, as can be seen in Fig. 13), resulting in the difficulty of quickly obtaining the optimal social welfare in Fig. 14. To decouple the data correlation and fully explore the policy distribution, we expect to study an offline-learning-based Actor-Critic framework and its variants applied to the PRDRM problem. Table IV compares the total social welfare and the average daily RP, (Σ_{t=1}^{T} Σ_{i=1}^{N} ρ_{i,t})/(N × T), of the algorithms over all time slots. The AC-ED algorithm obtains the highest social welfare, 13.04% higher than that obtained by DQN-ED, while its average daily RP is 34.47% lower than that of DQN-ED. This shows that high social welfare combines corporate profits and user benefits, which characterizes the social well-being of electricity usage. In addition, as shown in the experiments, the disadvantage of the Q-table-based ED method for the discrete action space lies only in the calculation and storage of the q-values, while its experimental social welfare does not show a significant drawback compared with the DQN-ED algorithms.

V. CONCLUSION
In this article, we studied the PRDRM problem based on MFRL, which fully takes into account the charging/discharging PEV model that is gradually becoming popular among users. First, three types of appliances (dispatchable appliances, non-dispatchable appliances and PEVs) were mathematically modeled; then, PRDRM was coupled with MFRL to reconcile the actual power consumption of the appliances on the environment side with the retail price of electricity on the agent side through a series of MFRL algorithms (including the Q-table-based framework, the DQN framework and the Actor-Critic framework). A systematic solution with far-sighted decision capability was provided for real-time demand response in smart grids. Finally, simulation experiments validated the effectiveness of the proposed approach. Future work will apply the MFRL-based energy dispatching scheme to multi-carrier energy supply (i.e., gas and electricity) to achieve higher efficiency and lower operating costs.