Cost-Optimized Microgrid Coalitions Using Bayesian Reinforcement Learning

Microgrids are empowered by advances in renewable energy generation, which enable them to generate the energy required to supply their loads and to trade surplus energy with other microgrids or the macrogrid. Microgrids need to optimize the scheduling of their demands and energy levels while trading their surplus with others to minimize the overall cost. This scheduling is affected by factors such as variations in demand, variations in energy generation, and competition among microgrids. Reaching an optimal schedule is therefore challenging, both because of the uncertainty in renewable energy generation and consumption and because of the complexity of interconnected microgrids and their interplay. Previous works mainly rely on modeling-based approaches and on the availability of precise information about microgrid dynamics. This paper addresses the energy trading problem among microgrids by minimizing the cost while uncertainty exists in microgrid generation and demand. To this end, a Bayesian coalitional reinforcement learning-based model is introduced to minimize the energy trading cost among microgrids by forming stable coalitions. The results show that the proposed model reduces the cost by up to 23% relative to the coalitional game theory model.


Introduction
The overall demand for energy has increased drastically in recent years and is expected to reach up to 1000 exajoules by the end of 2050 [1]. Governments are trying to enhance their energy generation capabilities by considering green and smart models to satisfy this massive energy demand. Therefore, energy generation and consumption models require a fundamental transformation in order to employ these capabilities in traditional power systems. Smart grids and advances in information and communications technologies (ICTs) provide a strong foundation for transforming unidirectional power and information flow into a distributed bidirectional power and information system known as a transactive energy framework [2,3]. Transactive energy can be categorized into (i) transactive network management for organizing the energy supply chain, (ii) transactive control for controlling and managing the energy generation/consumption rate, and (iii) a peer-to-peer (p2p) energy market that allows customers to trade energy among themselves [4].
One of the promising characteristics of microgrids is the possibility of p2p energy trading with each other or with the utility grid. Energy trading can be carried out by transferring surplus energy from a microgrid to a close-by microgrid, which has been a well-known research topic in the field of smart grids since the 2010s [5]. Generally, we can describe the energy trading problem as a group of interconnected microgrids exchanging their surplus energy to serve the loads in other microgrids. The microgrids are also connected to the macrogrid, and energy trading can be conducted between microgrids and the macrogrid or among the microgrids themselves. Some microgrids may have surplus energy at different time intervals and prefer to sell it, while others suffer from a lack of supply and wish to buy energy. This system can be modeled as a game and tackled with game-theoretical or learning approaches [6,7,8].
Although energy trading has been explored to some degree, energy trading under uncertainty has been less explored. This paper investigates the energy trading problem among microgrids where each microgrid has different levels of energy surplus or demand in each epoch. Additionally, the dynamic nature of the energy levels causes uncertainty in our system. We employ a Bayesian reinforcement-based coalition formation scheme for energy trading among microgrids to deal with this uncertainty. This algorithm was first introduced in [9], and an application of this model was also developed for device-to-device communications in wireless networks [10]. In this work, we develop the Bayesian reinforcement learning model, which enhances the conventional Bayesian coalition formation by learning from past observations and experiences. We then employ this approach in the energy trading problem among microgrids under uncertainty. We compared the proposed method with two Bayesian reinforcement learning-based models [10], Q-learning [11], Bayesian coalition formation [11], and conventional coalition formation game theory [6]. The results show up to 23% improvements in cost minimization compared to the coalitional game theory-based method.
The rest of this work is organized as follows. Recent works are summarized in Section 2. In Section 3, the system model is demonstrated. In Section 4, the Bayesian coalition formation game (BCG) scheme is illustrated. In Section 5, the Bayesian coalitional reinforcement learning (BCRL) based scheme is proposed. In Section 6, the numerical results are evaluated, and finally, the conclusions are presented at the end.

Related Work
Game-theoretic methods have been widely employed for energy trading in microgrids. In [12], a game-theoretic approach is proposed for distributed energy trading between microgrids. In this study, a set of interconnected microgrids aim to exchange energy with each other and also with the macrogrid. In [13], a priority-based energy trading game is proposed in which buyers are prioritized according to the past contributions of the buyers and their current demands. In [14], a Stackelberg game is designed with a central power station as the single leader and multiple followers who want to sell their extra energy to this central station. In [15], the authors develop a model based on the repeated game, which lets microgrids choose a strategy with a probability for trading energy in the market in a way that their average revenue is maximized.
Coalitional game theory is a subset of game theory in which players cooperate to maximize a shared payoff and then distribute the received payoff among the players.
In several studies, the energy trading problem among microgrids has been modeled as a coalitional game theory problem. In this approach, microgrids can cooperate by forming coalitions for a specific period in which some microgrids with surplus energy supply the others that require energy. Table 1 summarizes the research attempts that investigate the energy trading problem using coalitional game theory.
In [6], the energy trading problem in the microgrid community is investigated for the first time using coalitional game theory to minimize the power loss. The idea of coordinated operation of cooperative microgrids is studied in [16]. Whereas [6] focuses only on energy loss, this study expands the objective functions to maximize the microgrids' expected profits and usage while minimizing power loss and consumers' discomfort. In [17], the authors propose a nucleolus-based approach to fairly distribute the payoff among microgrids for local transactive energy management in microgrid communities. In [18], the authors propose coalition-based energy trading where, in each coalition, an auction-based matching is employed to calculate the utility of the coalition, and a coalition formation technique is then used to partition the microgrids into coalitions. In [19], the authors design the energy trading scheme in two stages. First, microgrids form coalitions, and then a matching game is used to schedule the energy exchange in each coalition.
Machine learning algorithms have proven useful in a wide range of applications such as computer vision, sentiment analysis, self-organized systems, and robotics. However, it is not straightforward to use the same algorithms in AI-enabled smart grids. Existing machine learning techniques need to be tailored to meet the smart grid and microgrids' needs. In [20], the authors propose two learning automata-based methods for optimal power management in smart grids. In [21], the authors propose a dynamic demand response and distributed generation management method for a residential microgrid community. In [22], a fully distributed learning approach is proposed for optimal reactive power dispatch. In this method, a multi-agent Q-learning algorithm is employed that minimizes the active power loss and satisfies the bus voltage range and reactive power generation constraints. In [23], the temporal difference reinforcement learning approach is used to achieve the optimal control policy for residential energy storage. The problem of dynamic pricing in smart grids with reinforcement learning methods is addressed in [24]. The authors propose reinforcement-based dynamic pricing and energy consumption scheduling to help energy providers and consumers learn their best strategies.
Energy trading is also among the problems that can be tackled with machine learning approaches, specifically using reinforcement learning models such as Q-learning, Bayesian reinforcement learning, and deep reinforcement learning. In [8], a hot-booting Q-learning-based approach is implemented to achieve the Nash equilibrium of the dynamic repeated energy trading game. In [25], the authors improve on [8] by designing a deep Q-network-based approach. In our prior work [11], we proposed a Bayesian coalitional algorithm that helps agents build a system of beliefs about the types of other agents. In contrast, in [26], agents can learn from their experience by using a Bayesian reinforcement learning technique; however, the proposed model lacks a belief system. In this study, we propose a comprehensive Bayesian reinforcement learning framework for the problem of coalition formation in microgrid communities, which helps agents build a system of beliefs about the types of other agents and learn from their past experiences simultaneously.

System Model
In this work, we consider a network of M interconnected microgrids in which each microgrid is also connected to the main utility grid, known as the macrogrid, as shown in Figure 1. The energy generated by microgrid m ∈ M and its demand are denoted by g_m and d_m, respectively. Therefore, the total surplus or shortage of each microgrid is q_m = g_m − d_m, which represents the energy that the microgrid needs to export to or import from the network. As a result, microgrids initiate an energy trading process among each other and with the macrogrid to satisfy their export/import requirements. Due to the dynamic nature of the system, each microgrid can be either a seller or a buyer of energy during each epoch. The process of energy trading, either among microgrids or with the macrogrid, imposes a variety of costs. In the proposed system, we assume that each energy transaction is associated with two sets of costs. The first set corresponds to the costs resulting from the power loss in the line, the transformer loss from high to medium or low voltages, the maintenance cost, etc. We call this set the operational costs. Additionally, there are hidden costs associated with energy trading among microgrids, for example, when two distant microgrids want to trade energy with each other. Such trading is not always feasible through a direct power line, since in practice it is not possible to have direct lines between all microgrids. Therefore, we assume that direct lines are only deployed between close-by microgrids. When there is no direct link, microgrids have to make their energy transactions through a series of intermediate microgrid links or transfer their energy through the macrogrid (the seller microgrid sends its surplus energy to the macrogrid, and the buyer microgrid receives the transferred energy from the macrogrid).
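As a minimal sketch of the surplus definition q_m = g_m − d_m and the resulting seller/buyer roles, the following Python fragment may help. The function name, the microgrid labels, and the numeric values are illustrative assumptions and do not come from the evaluated system.

```python
def classify_microgrids(generation, demand):
    """Return {name: (q, role)} with q = g - d for every microgrid.

    A positive q marks a seller (surplus to export), a negative q marks
    a buyer (shortage to import). Names and values are hypothetical.
    """
    result = {}
    for name in generation:
        q = generation[name] - demand[name]
        if q > 0:
            role = "seller"
        elif q < 0:
            role = "buyer"
        else:
            role = "balanced"
        result[name] = (q, role)
    return result


# Illustrative epoch: generation and demand in arbitrary energy units.
example = classify_microgrids(
    {"mg1": 120.0, "mg2": 40.0, "mg3": 75.0},
    {"mg1": 90.0, "mg2": 65.0, "mg3": 75.0},
)
```

Because generation and demand vary from epoch to epoch, the same microgrid may be classified differently each time the function is called.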
One of the interesting capabilities of microgrids is trading energy only among themselves, without relying on a central macrogrid, while in islanding mode. Any reliance on energy trading with the macrogrid puts islanding at risk and results in unexpected costs. We treat the unpredicted costs that arise from such third-party involvement in energy trading (virtual costs) as the second set of costs. Therefore, the total cost of power transaction E_mn from the m-th microgrid to the n-th microgrid is given by: where d_mn represents the length of the power lines used for transferring energy between the m-th and n-th microgrids and δ is the scaling factor. E_mn is the traded power plus the loss that occurs during trading. PL(E_mn) denotes the power loss in trading energy between the m-th and n-th microgrids. w represents a weighting coefficient associated with the virtual cost and can be calculated as: The virtual cost is a function of distance and energy that is weighted by the parameter w. This parameter is fixed to a lower value w_s for energy trading among microgrids that are closer than the threshold d_tr, and to a higher value w_l for the rest [27]. We assume that microgrids farther apart than the threshold have no direct link between them. Consequently, their virtual cost is higher than that of close-by microgrids. w_0 is the weight factor for energy transactions with the macrogrid.
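The distance-dependent choice of the virtual-cost weight can be sketched as follows. This is only an illustration under stated assumptions: the function and argument names are hypothetical, index 0 is taken to denote the macrogrid, and the behavior exactly at the threshold (strict inequality) is an assumption.

```python
MACROGRID = 0  # assumption: index 0 denotes the macrogrid


def virtual_cost_weight(m, n, distance, d_tr, w_s, w_l, w_0):
    """Pick the virtual-cost weight w for a transaction between m and n.

    w_0 applies to any transaction with the macrogrid, w_s to microgrids
    closer than the threshold d_tr, and w_l to all other pairs.
    """
    if m == MACROGRID or n == MACROGRID:
        return w_0
    return w_s if distance < d_tr else w_l
```

For instance, with d_tr = 5, w_s = 1, and w_l = 4 (illustrative numbers), two microgrids 2 km apart are weighted by 1, whereas a pair 9 km apart is weighted by 4.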
The power loss is defined as below, in which R_mn represents the resistance of the line per km in energy trading between microgrids m and n [6].
where U_m and ρ denote the voltage and the fraction of power lost in the transformer at the interconnection point between the microgrids and the macrogrid (macro station), respectively. E_mn is the trading power required to deliver the total surplus/demand q_n of microgrid n to microgrid m and can be obtained as follows: The main objective of this system is to minimize the total cost. Therefore: Energy trading among nearby microgrids decreases the total cost relative to trading energy with distant microgrids or the macrogrid. Therefore, forming groups of close-by microgrids (also known as coalition formation) that trade energy within their groups is a promising approach to reduce the overall cost. In addition, there is no transformer loss in the operating range of energy trading among microgrids (low or medium voltage) [28].
A coalition can be described by a pair (C, v_C), where C is the set of coalition members, who cooperate to gain a higher coalitional value v_C [29]. In this paper, we consider the total cost of energy trading among coalition members plus the cost of trading extra energy with the macrogrid as the coalition value. The objective is to minimize the cost. Therefore, the coalition value, as the negative of the cost, can be formulated as follows: where |C| is the number of members of coalition C and index 0 denotes any transaction with the main grid. When the coalition is formed, energy transfer among the coalition members needs to be scheduled to minimize the total cost in the coalition. Therefore, the coalition payoff in our system is defined as the maximum achievable coalition value v_max(C), which is given by:

Bayesian Coalition Formation Game
In this section, we propose a Bayesian coalition formation game (BCFG) that tackles the uncertainty in the power level of the microgrids.

Game Formulation
The Bayesian coalition formation game can be characterized by a set of agents (M), a set of agent types (T_m ∈ T), a set of agent beliefs (B_m), a set of coalition actions (A_{C_k}), a set of outcomes known as states (S), and the reward functions (u_m).
To employ BCFG in our problem, we describe it as a cost minimization model in a microgrid community where a set of M rational microgrid agents is involved in the coalition formation game. A coalition C_k is a set of microgrids that trade energy among themselves. The type T_m of the m-th microgrid stands for its power level. Each microgrid is aware only of its own type (T_m), not of the types of other microgrids. The m-th microgrid's beliefs about the types of other players are denoted by B_m(T_{−m}), a joint distribution over T_{−m} that assigns probabilities to the types of the other agents. We assume that any coalition of microgrids has a restricted set of coalition actions A_{C_k}. A collective coalition action α_{C_k} is an action approved by all members of C_k regarding a new member joining their coalition. The coalition action is observable only to the coalition members and is hidden from agents in other coalitions.
We consider the coalition tag of each microgrid as its state in each iteration of the game. Therefore, the agents' state vector can be defined as s = (s_1, ..., s_M). Any state vector s corresponds to a joint reward R_{C_k}(s, T_{C_k}), which is calculated as: We use the proportional fair division method to distribute the coalition reward among coalition members, allocating each member a share of the coalition reward proportionate to its cost. Therefore, r_m(s, T_{C_k}) is defined as [29]: where ζ_m is equal to
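To illustrate proportional fair division, the sketch below splits a coalition reward in proportion to a per-member cost. The exact definition of the share ζ_m is not reproduced here; the code assumes ζ_m is the member's cost divided by the sum of all members' costs, which is one plausible reading and should be treated as an assumption. All names and numbers are illustrative.

```python
def proportional_shares(coalition_reward, member_costs):
    """Split a coalition reward proportionally to each member's cost.

    Assumption: share_m = cost_m / sum(costs). If all costs are zero,
    the reward is split equally to avoid a division by zero.
    """
    total = sum(member_costs.values())
    if total == 0:
        equal = coalition_reward / len(member_costs)
        return {m: equal for m in member_costs}
    return {m: coalition_reward * c / total for m, c in member_costs.items()}


# Rewards are negative costs in this setting, hence the negative value.
shares = proportional_shares(-12.0, {"mg1": 1.0, "mg2": 3.0})
```

Whatever the precise form of ζ_m, the shares should sum to the coalition reward, which the sketch preserves.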

Stability Notion
As in all cooperative games, in coalitional game theory, players with a common interest, i.e., the members of a specific coalition, maximize their joint reward, known as the coalition value. We compute the value of coalition C_k with members of type T_{C_k} as follows: where Pr{s | C_k, α_{C_k}, T_{C_k}} represents the probability of transitioning to state s in coalition C_k with members of type T_{C_k} when taking action α_{C_k}, and Q(C_k, α_{C_k} | T_{C_k}) is the long-term action value. V(C_k | T_{C_k}) is a function of the actual types of the coalition members, whereas the type of a microgrid is not known to the other microgrids in the coalition. Therefore, to estimate the value of coalition C_k, coalition members need to rely on their beliefs about the types of other players. We call this estimate the expected value of coalition C_k; coalition member m computes it according to its beliefs B_m as follows: where Q(C_k, α_{C_k}, B_m) is the expected value of coalition C_k when action α_{C_k} is taken under belief B_m. Since every microgrid has its own system of beliefs, microgrids commonly end up with different estimates of Q(C_k, α_{C_k}, B_m) and consequently of V(C_k, B_m). Therefore, none of the microgrids can obtain accurate estimates of the coalitional reward R_{C_k} and of its share r_m.
To this end, players need a mechanism to estimate the rewards they can achieve by cooperating in a coalition. We define the demand D_m as the share of the coalitional value that microgrid m believes it will receive in the coalition. Given the coalition structure C_k with the demand vector D_{C_k} = (D_1, D_2, ..., D_M), microgrid m's belief about the expected reward of microgrid j when taking action α_{C_k} can be estimated by: The long-term reward that microgrid m itself expects when taking action α_{C_k} under the demand vector D_{C_k} is denoted by Q_m(C_k, α_{C_k}, D_{C_k}).
Considering the above definitions, the concept of a strong Bayesian core (SBC) can be defined as follows [9]. Definition 1. A tuple (C_k, D_{C_k}) of a specific coalitional structure and a specific demand vector is in the SBC of a Bayesian coalition formation game if:

• No player believes there exists a better tuple than (C_k, D_{C_k}). This definition can be formulated as follows: and where m ∈ M. Equation (13) expresses the preference of microgrid m for itself, and (14) expresses the preference of microgrid j as believed by microgrid m.

Coalition Formation
In this section, we define the Bayesian coalition formation process used in this paper. We assume that negotiations among the microgrids to merge into and split from coalitions happen over an infinite number of iterations. At every iteration, there is a pairing of the coalitional structure and the demand vector, named the coalitional agreement (CS, D), on which all players agree. Every microgrid has the chance to modify this coalitional agreement according to its utility (a rational player changes the agreement to improve its utility). We call a microgrid m that attempts to change the coalitional agreement a proposer, since it proposes to change the agreement in one of the following ways:
• A proposer can stay in its current coalition C_k and propose a new demand D_m from the coalition.
• A proposer can decide to split from its current coalition and propose merging into another coalition C_k with a new demand D_m.
The microgrids have the following finite set of actions (or negotiation options): (1) If a microgrid is a proposer, its action is to make a proposal π_m^k = (C_k, {D_i}_{i∈C_k} • D_m), which means joining (or staying in) coalition C_k with the new demand D_m. (2) If a microgrid is a responder to a proposal, it can either (i) accept (κ_m^k = 1) or (ii) reject (κ_m^k = 0) the presented proposal. We can summarize the proposal procedure as follows. At the beginning of every iteration, a proposer m is chosen randomly from all the microgrids with an equal probability of 1/M. Then, the proposer presents the proposal π_m^k to join or stay in coalition C_k with demand D_m. After that, the members of coalition C_k independently accept or reject the proposal without any information about the actions of the other members. All the individual responder actions need to be unified into a single coalitional action that responds to the proposal. To this end, we introduce a function f, which maps all the responding actions into the coalitional action α_{C_k}. We define this coalitional action as follows: This means that the coalition accepts a proposal if all coalition members approve it; otherwise, the proposal is rejected, and the existing coalitional agreement remains in effect.
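The proposal round described above (uniform random choice of the proposer, independent responses, and a unanimity rule for the coalitional action) can be sketched as follows. The function names are hypothetical, and responses are encoded as 1 for accept and 0 for reject.

```python
import random


def choose_proposer(microgrids, rng):
    """Pick one proposer uniformly at random, i.e., with probability 1/M."""
    return rng.choice(list(microgrids))


def coalition_action(responses):
    """Unanimity rule: 1 (accept) only if every responder returned 1."""
    return 1 if all(r == 1 for r in responses) else 0


rng = random.Random(0)  # seeded only to make the sketch reproducible
proposer = choose_proposer(["mg1", "mg2", "mg3"], rng)
```

A single rejection is enough to keep the existing agreement in effect, so coalition_action([1, 1, 0]) evaluates to 0. Note that an empty response list is treated as acceptance in this sketch, which corresponds to a proposer forming a coalition on its own.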
We assume that all the players are rational, which means that the proposer submits a proposal that maximizes its expected reward. Meanwhile, because the other players are rational as well, they only accept a proposal that does not degrade their expected reward. Therefore, a rational proposal offers the maximum possible demand D_m^max that, according to the proposer's beliefs about the other players, does not degrade their expected reward. This proposal is feasible for the proposer microgrid if: where α_{C_k} = f(π_m^k) and Q_m^j is the expected reward of microgrid j as believed by microgrid m. If proposer microgrid m finds π_m^k to be feasible, it expects all the responders to accept the proposal according to its system of beliefs about them. It should be noted that this feasibility is only an expectation: the proposer cannot be sure whether the proposal will be accepted or rejected, since it does not know what is best for the responders. Considering (16), the proposer can estimate D_m^max as follows: The demand requested by the proposer is restricted to the interval [0, D_m^max(C_k)]. To simplify the search for a proper demand, we define a unit ∆ and let the proposer propose demands that are integer multiples of ∆. Therefore, we can define the vector of possible demands as [0, ∆, 2∆, ..., ⌊D_m^max(C_k)/∆⌋∆, D_m^max(C_k)].
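As a small illustration of the discretized demand search, the following sketch enumerates the candidate demands as integer multiples of ∆ and appends the upper bound when it is not itself a multiple. The function name and the example values are illustrative assumptions.

```python
import math


def demand_grid(d_max, delta):
    """List candidate demands 0, delta, 2*delta, ..., capped by d_max.

    The final entry is d_max itself, so the proposer can always request
    the largest demand it believes to be acceptable.
    """
    if delta <= 0 or d_max < 0:
        raise ValueError("delta must be positive and d_max non-negative")
    steps = math.floor(d_max / delta)
    grid = [k * delta for k in range(steps + 1)]
    if grid[-1] < d_max:
        grid.append(d_max)
    return grid
```

For example, demand_grid(10, 4) yields [0, 4, 8, 10], so the proposer evaluates four candidate demands instead of searching a continuous interval.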

Bayesian Reinforcement Learning Coalition Formation
The types of players in a coalition change dynamically, since the microgrids' generation and demand vary over time. As a consequence, the coalition values change, which imposes uncertainty on the system. Combining Bayesian reinforcement learning (RL) with the coalition formation game gives the players the chance to learn about other players and to reduce the uncertainty about them through interactions in the Bayesian coalition formation process. In this section, we first explain the conventional Bayesian RL framework for a single agent. We then present the cooperative multi-agent Bayesian learning framework suitable for the coalition formation process in microgrids, which we call Bayesian learning-based coalition formation.

Conventional Bayesian RL
In the following, we briefly explain single-agent Bayesian reinforcement learning. We first need to define the Markov decision process (MDP), an essential part of reinforcement learning [30]. An MDP consists of four elements (S, A, Pr, r), where S is the set of all possible states s and A is the set of all possible actions. Pr collects all transition probabilities, and Pr{s′ | s, a} is the probability of a transition from state s to state s′ when taking action a. r(s, a) is the reward that the agent receives by taking action a in state s. The RL problem can be defined as the problem of finding the optimal mapping from states to actions, σ : S → A, for an MDP with known or unknown transition probabilities. In the Bayesian RL algorithm, a prior distribution is first assigned to the agent's initial beliefs about the values of the unknowns in the system. This belief is updated continuously as the agent gathers observations about the unknown parameters. Considering the partially observable nature of the MDP in Bayesian reinforcement learning, we employ the partially observable MDP (POMDP) technique in our model [30].
A POMDP consists of the elements (S_p, A_p, O_p, Pr_p, z_p, r_p), in which S_p = S × {T_{s,s′}^a} denotes the set of states, where T_{s,s′}^a represents the unknown transition dynamics, A_p = A represents the set of actions, and O_p = S is the observation space, identical to the state space of the general MDP. Pr_p(s′, T′ | s, T, a), z_p(s′, T′, a, o), and r_p(s, T, a) represent the state transition probabilities, the observation function, and the reward function, respectively.
In a POMDP, the strategy is a mapping from beliefs to actions, σ : B → A. We can calculate the value of a specific policy σ as the expected sum of discounted rewards over an infinite horizon, given by: where γ, s_t, and B_t denote the discount factor, the state, and the belief at time t, respectively. We are interested in finding the optimal policy σ*. The optimal policy has the highest value for all belief states, i.e., V_σ*(B) ≥ V_σ(B), and the corresponding value function of the optimal policy satisfies the Bellman equation as follows: where ψ is a normalizing constant.
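The notion of a discounted sum of rewards that underlies the policy value can be illustrated with a short sketch. It evaluates a single finite reward trajectory rather than the expectation over an infinite horizon, so it is only a sample-based illustration; the function name and numbers are assumptions.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over one finite reward trajectory."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total


# Three unit rewards with gamma = 0.5 give 1 + 0.5 + 0.25.
value = discounted_return([1.0, 1.0, 1.0], 0.5)
```

Averaging such returns over many trajectories generated by a policy would approximate that policy's value, under the usual assumption that the discount factor is smaller than one.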

Bayesian Reinforcement Learning Coalition Formation
In the following, we extend the previously discussed conventional single-agent BRL to the case of multiple agents in a coalition formation game. Our goal is to find the optimal coalition formation in the Bayesian coalition formation game that can be modeled as a POMDP.
Let us assume that the initial belief of microgrid m is denoted by B_m = B_m(T_{C_k}), where T_{C_k} denotes the types of the players in coalition C_k, which play the role of the unknowns in conventional BRL. Each microgrid m in coalition C_k with coalition action α_{C_k} can compute a long-term expected action value according to its beliefs B_m at each time slot t as follows: and where u_m(t) = u_m(s, T_{C_k}) is the reward that microgrid m receives at time t in coalition C_k with members of type T_{C_k} in the current state s. The probability of a transition from the current state s to the next state s_m when members of type T_{C_k} take the coalitional action α_{C_k} is denoted by Pr{s_m | C_k, α_{C_k}, T_{C_k}} = Pr{s_m | s, C_k, α_{C_k}, T_{C_k}}. B_m^{s_m}(T_{C_k}) is the updated belief about the types of the other coalition members, T_{C_k}, after the transition to the next state s_m, which can be estimated using Bayes' theorem as follows (same as the single-agent belief update): Consequently, we can find the optimal value function V_m with a modified Bellman equation as follows: Unlike in the original form of the Bellman equation, in our problem, microgrid m cannot find the optimal V_m by maximizing Q_m^t, since a single microgrid does not have full control of the coalition formation process. Instead, microgrid m should estimate the probability Pr{C_k, α_{C_k}, D_{C_k} | B_m} to find a specific coalition agreement (C_k, D_{C_k}) that all coalition members will accept. Therefore, by considering (21) and (22) and the belief update (23), each microgrid can learn the long-term value of any agreement (CS, D) and find the optimal decision with respect to its beliefs about the types of other microgrids.
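The belief update by Bayes' theorem can be sketched generically as follows: the posterior probability of each candidate type is proportional to its prior probability times the likelihood of the observed transition under that type. The dictionary-based representation, the type labels, and the likelihood values are illustrative assumptions, and the joint type of all coalition members is collapsed into a single label for brevity.

```python
def update_belief(prior, likelihood):
    """Bayes' rule over a finite set of candidate types.

    prior:      {type: probability before the observation}
    likelihood: {type: probability of the observed transition given type}
    Returns the normalized posterior; if the observation has zero
    probability under every type, the prior is returned unchanged.
    """
    unnormalized = {t: prior[t] * likelihood.get(t, 0.0) for t in prior}
    z = sum(unnormalized.values())
    if z == 0:
        return dict(prior)
    return {t: p / z for t, p in unnormalized.items()}


posterior = update_belief({"low": 0.5, "high": 0.5}, {"low": 0.2, "high": 0.6})
```

In this example the observation is three times as likely under the "high" type, so the belief shifts from 0.5 to 0.75 in its favor.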

Computational Approximations
As mentioned in the previous part, it is not straightforward to estimate (24). On the one hand, we need to approximate the transition probability Pr{s_m | C_k, α_{C_k}, T_{C_k}} and the acceptance probability Pr{C_k, α_{C_k}, D_{C_k} | B_m}. On the other hand, given the size of the type and state spaces, it is not possible to compute (21), (22), and (24) directly. Therefore, a realistic simplification is needed to approximate (24). To this end, we employ the Bayesian exploration bonus (BEB) to estimate the transition probability Pr{s_m | C_k, α_{C_k}, T_{C_k}} [31]. In this method, we use counters to record how many times each transition occurs up to iteration t. The exploration bonus puts more weight on the paths that have not been visited often enough. Let us define the total number of transitions as: here, µ_m(s_m, C_k, α_{C_k}, T_{C_k}) is a counter that records how many times the transition to s_m has occurred. Then, we can calculate Pr{s_m | C_k, α_{C_k}, T_{C_k}} as: Therefore, we can estimate the action value in (21) as follows: where BEB is given by: ξ is a tuning parameter that adjusts the chance of exploring less-visited transitions. ξ = 0 means that the effect of BEB is omitted from the calculations. To estimate the acceptance probability Pr{C_k, α_{C_k}, D_{C_k} | B_m}, we need λ_m^0(C_k, D_{C_k}), the number of times that agreement (C_k, D_{C_k}) has been proposed, and λ_m(C_k, D_{C_k}), the number of times this agreement has been accepted. Therefore, we can estimate Pr{C_k, α_{C_k}, D_{C_k} | B_m} as follows: It should be noted that 0.5 is the initial value of the acceptance probability. Additionally, we assume that 0 < ζ < 1.
The BCRL algorithm as applied to our microgrid coalition formation problem is given in Algorithm 1. The algorithm is divided into an initialization step and the main loop. In the initialization step, each microgrid's initial power level, location, and coalition are assigned randomly. Then the initial demand of each microgrid is derived from its direct power loss in the case that it only performs energy transactions with the macrogrid. After all initial steps, the (C_k, D_m, T_m) tuple is transferred to all microgrids.
The main loop consists of two phases: the learning phase (lines 5-7) and the coalition formation phase. In the learning phase, the action values of all coalitions, the current reward of each microgrid, the transition probabilities, and the action-value function of each microgrid are updated. Then, in the coalition formation phase (lines 7-15), we assume that, at each iteration, the power level of one randomly chosen microgrid changes and that this microgrid is given a chance to propose. The proposer microgrid makes a proposal π_i^k = (C_k, D_{C_k}), in which it decides which coalition to join (or to stay in its current coalition) and proposes a new demand that maximizes its own belief about Q_i^t. The proposal is then transmitted to the members of the target coalition. If all the members of the coalition find that their action-value function will be higher under the new proposal, the proposal is accepted, and the proposer microgrid joins (or stays in) the target coalition with the new demand. Otherwise, the proposal is rejected, and the proposer microgrid stays in its previous coalition with its previous demand. After the new coalition structure is formed, each coalition uses a greedy algorithm, introduced in [11], to exchange energy among its members. If the coalition has a surplus or shortage of energy, it transfers the surplus to, or imports the shortage from, the macrogrid.
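A minimal sketch of the counter-based estimates is given below. It estimates the transition probability as the fraction of recorded transitions that led to a given next state. The exploration bonus is assumed to take the form ξ / (1 + n), with n the total number of recorded transitions, which follows the common BEB formulation and may differ from the exact expression used here. The acceptance probability is approximated by the plain acceptance frequency with an initial value of 0.5; the role of ζ is not modeled. All class and function names are hypothetical.

```python
from collections import defaultdict


class TransitionCounter:
    """Counts observed transitions for each (coalition, action, type) context."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def observe(self, context, next_state):
        self.counts[context][next_state] += 1

    def total(self, context):
        return sum(self.counts[context].values())

    def probability(self, context, next_state):
        """Empirical transition probability; 0.0 before any observation."""
        n = self.total(context)
        if n == 0:
            return 0.0
        return self.counts[context][next_state] / n

    def bonus(self, context, xi):
        """Assumed bonus xi / (1 + n): large for rarely visited contexts."""
        return xi / (1.0 + self.total(context))


def acceptance_probability(accepted, proposed):
    """Acceptance frequency, starting from 0.5 when nothing was proposed yet."""
    if proposed == 0:
        return 0.5
    return accepted / proposed


counter = TransitionCounter()
ctx = ("C1", "accept", "high")  # illustrative context label
for nxt in ["C1", "C1", "C1", "C2"]:
    counter.observe(ctx, nxt)
```

Setting xi to zero removes the bonus entirely, which matches the remark that ξ = 0 skips the effect of BEB.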
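The greedy exchange inside a coalition is only referenced here, so the following sketch should not be read as the algorithm of [11]. It assumes one simple greedy rule: seller-buyer pairs are served in order of increasing distance, and whatever remains unmatched is settled with the macrogrid. All names, the distance table, and the numbers are illustrative assumptions.

```python
def greedy_exchange(surplus, distance):
    """Match sellers (q > 0) with buyers (q < 0), closest pairs first.

    surplus:  {name: q} for the members of one coalition
    distance: {(seller, buyer): length} for every seller-buyer pair
    Returns (trades, remaining, macrogrid_net), where macrogrid_net > 0
    is exported to the macrogrid and macrogrid_net < 0 is imported.
    """
    remaining = dict(surplus)
    pairs = sorted(
        (distance[(s, b)], s, b)
        for s in surplus if surplus[s] > 0
        for b in surplus if surplus[b] < 0
    )
    trades = []
    for _, s, b in pairs:
        amount = min(remaining[s], -remaining[b])
        if amount > 0:
            trades.append((s, b, amount))
            remaining[s] -= amount
            remaining[b] += amount
    macrogrid_net = sum(remaining.values())
    return trades, remaining, macrogrid_net


trades, remaining, macrogrid_net = greedy_exchange(
    {"a": 5, "b": -3, "c": -4},
    {("a", "b"): 1.0, ("a", "c"): 2.0},
)
```

In this example the seller first covers the nearer buyer completely and then sends its remaining two units to the farther one, which leaves a shortage of two units to be imported from the macrogrid.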

Performance Evaluation
In this section, at first, we briefly introduce our benchmark models and then examine the performance of the proposed model with respect to our benchmarks.

Maximum a Posteriori Estimation (MAPE)
In this model, the estimation of the action-value function is simplified to the most probable belief type, as believed by agent m based on its current belief vector $B_m$, as follows: The main advantage of MAPE with respect to BCRL is its lower complexity, since it ignores the expected coalition value of the microgrids. To this end, in MAPE, microgrids reduce their action-value function as follows: Since this method is a relaxed estimation of the proposed BCRL, we call it BCRLMAPE in the rest of the paper.

Algorithm 1, lines 28-31:
28 Update agreement $(C_k, D_{C_k}) \to (C_k, D_{C_k} \bullet D_i)$
29 Set the state $s_i \to C_k$
30 Set the type $T_i$
31 Broadcast $T_i$ to all microgrids $j$, $j \in C_k$

Fully Myopic Estimation (FME)
Similar to BCRLMAPE, the FME model has a lower complexity, since in this model only the instantaneous action-value function is considered and the experience history is discarded. The action-value function is given by: Like MAPE, the FME model is a reduced version of BCRL; therefore, in this paper, we call this model BCRLFME.
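To make the difference between a full Bayesian estimate and the MAP simplification concrete, the following Python sketch contrasts the two on a toy belief vector. It is a hedged illustration only: the exact expressions used in the paper are not reproduced here, and the function names `q_expected` and `q_map`, as well as the representation of the belief as a list of probabilities over candidate types, are assumptions introduced for this example.

```python
from typing import Sequence

def q_expected(belief: Sequence[float], q_by_type: Sequence[float]) -> float:
    """Belief-weighted average of the action value over all candidate types
    (assumed form of a full Bayesian estimate)."""
    return sum(b * q for b, q in zip(belief, q_by_type))

def q_map(belief: Sequence[float], q_by_type: Sequence[float]) -> float:
    """Action value of the single most probable type (MAP simplification)."""
    most_probable = max(range(len(belief)), key=lambda i: belief[i])
    return q_by_type[most_probable]

# Toy example: three candidate types with hypothetical action values.
belief = [0.2, 0.5, 0.3]
q_by_type = [10.0, 4.0, 6.0]
full_estimate = q_expected(belief, q_by_type)  # weighs all three types
map_estimate = q_map(belief, q_by_type)        # looks only at the second type
```

The MAP variant evaluates a single type instead of summing over all of them, which is where the complexity saving comes from, at the price of ignoring less probable types.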

Q-Learning Based Method
We compare our work with the Q-learning-based algorithm developed in [11]. Q-learning aims to reach a sub-optimal policy by choosing actions that maximize the expected current and future rewards. We assume that microgrids are agents and that an agent's action is to refuse or accept the proposition of another microgrid to join its coalition. The state is the vector of coalition memberships, and the reward function is the same as (9). The ε-greedy method is employed for action exploration.
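For reference, a standard tabular Q-learning update for this accept/refuse setting could look like the sketch below. It is not taken from [11]: the learning rate `alpha`, the discount factor `gamma`, and the encoding of the state as a tuple of coalition memberships are assumptions made for illustration.

```python
from collections import defaultdict
from typing import DefaultDict, Hashable, Tuple

ACTIONS = ("accept", "refuse")
QTable = DefaultDict[Tuple[Hashable, str], float]

def q_update(q: QTable, state: Hashable, action: str, reward: float,
             next_state: Hashable, alpha: float = 0.1,
             gamma: float = 0.9) -> float:
    """One tabular Q-learning step:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(q[(next_state, a)] for a in ACTIONS)
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
    return q[(state, action)]

# State: coalition index of each microgrid (hypothetical encoding).
q_table: QTable = defaultdict(float)
s, s_next = (0, 0, 1), (0, 1, 1)
q_table[(s_next, "accept")] = 1.0
updated = q_update(q_table, s, "accept", reward=-2.0, next_state=s_next)
```

With these numbers the updated entry is 0.1 * (-2.0 + 0.9 * 1.0 - 0.0) = -0.11.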

Bayesian Coalitional Game Theory (BCG)
We implemented a Bayesian coalitional game theory-based approach for coalition formation in [11]. In this scheme, each microgrid forms a belief system about the types of other agents; however, agents do not learn from past experiences.

Coalitional Game Theory (CG)
A game theory-based coalition formation approach has been proposed in [6]. In this scheme, by employing a random merge and split technique, the system reaches a stable coalition formation which is not necessarily optimal or sub-optimal.
We refer the readers to [6,11] for more information on the BCG and CG benchmarks. Note that the proposed method and the BCRLMAPE, BCRLFME, and Q-learning benchmarks use the ε-greedy policy to increase the chance of exploration. The ε-greedy policy helps with the trade-off between exploration and exploitation. Agents attempt to improve their long-term benefits through exploration, while exploitation is achieved by performing greedy actions. Algorithm 1 is also used for the BCRLMAPE, BCRLFME, and Q-learning benchmarks.
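The ε-greedy rule itself is short enough to state in code. The sketch below is a generic version, with the function name `epsilon_greedy` and the dictionary of per-action values chosen here for illustration; the value of ε used in the simulations is not assumed.

```python
import random
from typing import Dict

def epsilon_greedy(q_values: Dict[str, float], epsilon: float,
                   rng: random.Random) -> str:
    """With probability epsilon pick a uniformly random action (exploration);
    otherwise pick the action with the highest estimated value (exploitation)."""
    if rng.random() < epsilon:
        return rng.choice(sorted(q_values))
    return max(q_values, key=q_values.get)

rng = random.Random(0)
values = {"accept": 1.0, "refuse": 0.5}
greedy_choice = epsilon_greedy(values, epsilon=0.0, rng=rng)  # always greedy
random_choice = epsilon_greedy(values, epsilon=1.0, rng=rng)  # always random
```

Setting ε to 0 recovers a purely greedy agent, while ε equal to 1 ignores the learned values entirely; intermediate values trade one off against the other.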

Numerical Results and Discussions
In this section, for numerical evaluation, we consider a network of 4 to 10 microgrids within an area of 20 km by 20 km where microgrids and macrogrid interconnections are located randomly. We divided the full day into 240 time slots, where the load and generation patterns are randomly generated, and this procedure was periodically repeated every day with slight variations as in [6].
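A layout of this kind can be generated with a few lines of Python. The sketch below is illustrative only: uniform random placement, the helper names `place_nodes` and `pairwise_distances`, and the seed are assumptions, and the load and generation profiles of [6] are not reproduced.

```python
import math
import random
from typing import List, Tuple

SIDE_KM = 20.0          # side length of the square area
SLOTS_PER_DAY = 240     # number of time slots per day
SLOT_MINUTES = 24 * 60 / SLOTS_PER_DAY  # duration of one slot in minutes

def place_nodes(n: int, side_km: float = SIDE_KM,
                seed: int = 0) -> List[Tuple[float, float]]:
    """Draw n node positions uniformly at random in a square area (assumed)."""
    rng = random.Random(seed)
    return [(rng.uniform(0.0, side_km), rng.uniform(0.0, side_km))
            for _ in range(n)]

def pairwise_distances(points: List[Tuple[float, float]]) -> List[List[float]]:
    """Euclidean distance in km between every pair of nodes."""
    return [[math.dist(p, q) for q in points] for p in points]

positions = place_nodes(10)
distances = pairwise_distances(positions)
```

With 240 slots, each slot covers six minutes of the day.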
We compare the proposed BCRL with the BCRLMAPE, BCRLFME, Q-learning, BCG, and CG benchmarks. The results are averaged over ten runs. The simulation parameters are presented in Table 2.
In Figure 2, we present the average cost per user versus the number of microgrids, ranging from 4 to 10. As expected, increasing the number of microgrids reduces the cost, since microgrids have more chances to form local coalitions in a dense network, which results in less power transmission with the macrogrid and between microgrids. Moreover, since BCRL is designed to overcome the uncertainty, it achieves lower cost than the other algorithms. The proposed algorithm shows 4% to 16% improvement compared to BCG and the sub-optimal BCRLMAPE, respectively.
In Figure 3, to evaluate the effect of increasing power levels, we show the average cost per user versus the number of power levels. It should be noted that, in this figure, the BCG and CG models cannot be compared with the other models, since power levels are not considered in these models. As shown, when the number of power levels increases, the average cost decreases, as expected: the quantization error is reduced, and as a result all the approaches perform better. As we can see in Figure 3, at different power levels, BCRL reduces the average cost per microgrid by 5% and 15% compared to the BCRLMAPE and BCG methods, respectively.
In Figure 4, we present the average power loss per user versus the number of microgrids. As the number of microgrids increases, the distance between microgrids is reduced, which reduces the power loss in the system. Moreover, since BCRL is designed to overcome the uncertainty, it achieves lower power loss than the benchmark approaches, with up to 50% improvement with respect to conventional CG. While Q-learning benefits from past experience to make the best decision, BCG relies on beliefs about the types of other players. The CG method relies only on random join and split iterations to reach a stable coalition formation, which is not necessarily optimal.
Figure 5 shows the average power loss per user versus the number of power levels. As expected, increasing the number of power levels reduces the power loss due to the lower quantization error. As can be seen, the BCRL method is less prone to quantization errors due to its comprehensive estimation model for the expected action value. The BCRL method achieves an improvement of up to 20% on average compared to the BCG method.
In Figure 6, the average amount of energy transferred to the macrogrid versus the number of microgrids is presented. As we can see, BCRL requires less energy to be exported to or imported from the macrogrid than the benchmark techniques. Additionally, due to the lower power loss between nearby microgrids, the probability that nearby microgrids join the same coalition increases with the number of microgrids, which also reduces the power exported to the macrogrid.
Figure 7 shows the impact of increasing the cost of transferring energy with the macrogrid on the average energy transferred with the macrogrid. Here, the weighting parameter $w_0$ is varied from 0.02 to 0.22. As we can see, when $w_0$ increases, the average energy transfer with the macrogrid decreases, giving the coalitions of microgrids a chance to operate in islanding mode. The proposed BCRL model always performs better at forming independent coalitions that rely less on the macrogrid. BCRL decreases the power exported to the macrogrid by up to 10% in comparison with the CG technique.
Figure 8 shows the convergence of the BCRL technique in terms of the average power loss per user. As shown, the proposed model converges after 12,000 iterations. In Figure 9, we show the average number of iterations needed for the convergence of the cumulative average power loss as the number of power levels increases in the BCRL scheme.
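The quantization argument used in the discussion of Figures 3 and 5 can be illustrated with a simple uniform quantizer. This is a sketch under an explicit assumption: the paper does not state how power levels are spaced, so uniform spacing between 0 and a maximum power `p_max`, and the helper names below, are hypothetical.

```python
def quantize(value: float, p_max: float, levels: int) -> float:
    """Map a power value to the nearest of `levels` uniformly spaced levels
    on [0, p_max] (uniform spacing is an assumption)."""
    step = p_max / (levels - 1)
    return round(value / step) * step

def max_quantization_error(p_max: float, levels: int) -> float:
    """Worst-case error of the uniform quantizer: half of one step."""
    return p_max / (2 * (levels - 1))

# Worst-case error for a hypothetical 10-unit range and a growing number of levels.
errors = {n: max_quantization_error(10.0, n) for n in (3, 5, 11)}
```

Under this assumption the worst-case error shrinks in proportion to 1 / (levels - 1), which is consistent with the trend of lower cost and power loss at higher numbers of power levels.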

Conclusions
In this paper, we propose a Bayesian coalitional reinforcement learning-based approach for learning the optimal policy to minimize the cost of distributed energy trading among microgrids. In this work, each microgrid is modeled as an agent that can compete and cooperate with other agents. We model this problem as a Markov game, which aims to maximize the reward for each agent in order to overcome the uncertainties caused by the generation and demand of the microgrids. With the proposed scheme, microgrids reach stable coalitions in which the energy export to the macrogrid or to distant microgrids is reduced. We introduced an algorithm that helps each agent systematically propose joining a new coalition and gives the coalition members the chance to accept or reject the proposal according to their expected long-term rewards. To evaluate the performance of the proposed model, we compared our results with five benchmark schemes and showed that our scheme reduces the cost and power loss more than the others, reaching a 23% reduction in cost and a 28% reduction in power loss.

Funding: This research was funded by the NSERC Discovery program.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Abbreviations
The following abbreviations are used in this manuscript: Belief at time t