Policy Evaluation in Decentralized POMDPs with Belief Sharing

Most works on multi-agent reinforcement learning focus on scenarios where the state of the environment is fully observable. In this work, we consider a cooperative policy evaluation task in which agents are not assumed to observe the environment state directly. Instead, agents can only have access to noisy observations and to belief vectors. It is well-known that finding global posterior distributions under multi-agent settings is generally NP-hard. As a remedy, we propose a fully decentralized belief forming strategy that relies on individual updates and on localized interactions over a communication network. In addition to the exchange of the beliefs, agents exploit the communication network by exchanging value function parameter estimates as well. We analytically show that the proposed strategy allows information to diffuse over the network, which in turn allows the agents' parameters to have a bounded difference with a centralized baseline. A multi-sensor target tracking application is considered in the simulations.


I. INTRODUCTION
Multi-agent reinforcement learning (MARL) [1], [2] is a useful paradigm for determining optimal policies in sequential decision making tasks involving a group of agents. MARL has been applied successfully in several contexts, including sensor networks [3], [4], team robotics [5], and video games [6], [7]. MARL owes this success in part to recent developments in better function approximators such as deep neural networks [8].
Many works on MARL focus on the case where agents can directly observe the global state of the environment. However, in many scenarios, agents can only receive partial information about the state. The decentralized partially observable Markov decision process (Dec-POMDP) framework [9] is applicable to these types of situations. Nevertheless, a large body of MARL work assumes that agents in Dec-POMDPs observe data that are deterministic and known functions of the underlying state, which is not the case in general. Consider, for example, robots that receive noisy observations from their sensors. The underlying observation model is stochastic in this case.
Under stochastic observation models, one common strategy is to keep track of the posterior distribution (belief) over the set of states, which is known to be a sufficient statistic of the history of the system [10], [11]. For single agents, this posterior distribution can be obtained at each iteration with the optimal Bayesian filtering recursion [12]. Unfortunately, for multi-agent systems, forming this global posterior belief requires aggregation of all data from across all agents in general. The agents can form it in a distributed manner only when they have access to the private information from other agents in the network. And even when agents have access to this level of global knowledge, the computational complexity of forming the global posterior distribution is known to be NP-hard [13] in addition to its large memory requirements. Moreover, obtaining beliefs necessitates significant knowledge about the underlying model of the environment, which is generally not available in practice.
Therefore, instead of forming beliefs, most MARL algorithms [14], [15], [16] resort to a model-free and end-to-end approach where agents try to simultaneously learn a policy and an embedding of the history that can replace the beliefs (e.g., recurrent neural networks (RNNs)). Nevertheless, recent empirical works suggest that this model-free approach can be sub-optimal when the underlying signals of the environment are too weak to train a model such as an RNN [17], [18]. Moreover, RNNs (or alternative machine learning models) are usually treated as black boxes. In other words, these algorithms lack model interpretability, which is critical for trustworthy systems (see [19]). Furthermore, even though end-to-end approaches have shown remarkable empirical performance, they are still based on heuristics and lack theoretical guarantees on their performance. Compared to modular approaches, they are inefficient in terms of adaptability and generalization to similar tasks.
As an alternative, there is recent interest in improving belief-based MARL approaches [20], [21], [22]. These works have focused on emulating conventional beliefs with generative models, or with models learned from action/observation trajectories (in a supervised fashion). In this paper, we also examine belief-based strategies for MARL. In particular, we are interested in the multi-agent policy evaluation problem. Our work complements [20], [21], [22] in the sense that we assume that agents are already capable of forming local beliefs, either with sufficient knowledge (i.e., with learned local likelihood and transition models) or with generative models. Our focus is on the challenge of approximating the global Bayesian posterior in a distributed manner. Contributions:
- We consider a setting where agents only get partial observations of the underlying state of nature, as opposed to prior work on MARL over networks [23], [24], [25], [26], [27], [28], [29], [30], [31], which assumes agents have full state information. Moreover, as opposed to the literature on decentralized stochastic control [32], [33], [34], [35], in our setting, agents need to learn their value functions from data. More specifically, in our Dec-POMDP framework, agents only know their local observations, actions, and rewards, but they are allowed to communicate with their immediate neighbors over a graph. In the proposed strategy (Algorithm 2), agents exchange both their belief and value function estimates.
- We show in Theorem 1 that by exchanging beliefs, agents keep a bounded disagreement with the global posterior distribution, which requires fusing all observations and actions. Also, exchanging value function parameters enables agents to cluster around the network centroid for sufficiently small learning rates (Theorem 2). Furthermore, we prove that the network centroid attains a bounded difference with a strategy that requires centralized training (Theorem 3).
- By means of simulations, we illustrate that agents attain a small mean-square distance from the network centroid. Moreover, the squared Bellman error (SBE) averaged over the network is shown to be comparable to the SBE of the centralized strategy.
Paper Organization: In Section II, we present additional related work. In Section III, for ease of exposition and to introduce notation, we describe the problem in the single-agent setting. In Section IV, we propose algorithms for multi-agent policy evaluation. Section V presents the theoretical results, and Section VI presents numerical simulations.

II. OTHER RELATED WORK
Our proposed strategy is based on temporal-difference (TD) learning [36], [37], and makes use of function approximation. TD-learning for POMDPs is considered in [38], [39], and function approximation is incorporated in [40], [41], albeit in the single-agent setting. The main contribution of the present work lies in the networked multi-agent setting.
A plethora of works studies decentralized policy evaluation over networks [23], [24], [25], [26], [27], [28], [29], [30], [31]. Distributed versions of TD-learning with linear function approximation are considered in [29], [30], [31]. However, these works assume that either the global state, or a deterministic function of it, is available to all agents. They overlook the stochastic nature of observations that arises in many real-world applications. Also, in the deterministic setting, the works [42], [43] examine the distributed linear quadratic control task when agents can observe local states only. In particular, [43] proposes a cooperative strategy for tracking the global state that exploits networked communication between agents. However, in this strategy, the global state estimate at each iteration is independent of the previous estimates; it ignores the correlation between consecutive states. Furthermore, communication between the agents is utilized only for global state estimation, and not for sharing local Q-function estimates. In contrast, in the present work, (i) observations are stochastic, (ii) agents take advantage of the transition model of the state, and (iii) they also exchange value function parameters with their neighbors.
Our work is also related to the field of decentralized stochastic control [32] and dynamic team theory [33]. This field studies problems in which different decision-makers have access to different sets of information while working towards a common team goal. Typically, these problems are defined by an information structure that specifies which agents have access to which pieces of information (e.g., observations or actions) [44], [45]. Some approaches to solving these problems rely on the common information that arises from sharing partial histories with all other agents [35], [46], [47]. In our networked setting, agents exchange value function parameters or beliefs at each iteration with their immediate neighbors only, without explicitly exchanging raw data. Nonetheless, repeated application of this procedure causes information to mix and diffuse throughout the whole network. Moreover, most existing works in the decentralized stochastic control literature assume full model knowledge of the system, whereas we consider the case of learning from data, since the reward model is not known a priori. Also, sharing value function parameters and beliefs instead of raw data makes our algorithm advantageous in terms of privacy and scalability. A similar approach is considered in [34], where the author proposes a belief-sharing pattern for decentralized control, rather than the explicit information sharing of prior work. However, [34] uses a belief propagation algorithm over acyclic graphs, while we use a diffusion-based belief-sharing algorithm over cyclic networks. In addition, [34] considers the planning problem only, whereas in this work we consider the policy evaluation problem, which requires learning from data.
For constructing local beliefs that approximate the global Bayesian posterior, we extend the diffusion HMM strategy (DHS) [48], [49]. This algorithm requires only one round of communication per state change, as opposed to other strategies [50], [51] that require multiple rounds of communication until network consensus at each iteration. Also, in contrast to other distributed Bayesian filtering algorithms [52], it does not combine likelihoods of data from different time instants. Instead, likelihoods are combined with time-adjusted beliefs. These properties make DHS communication-efficient and successful in tracking highly dynamic state transitions. Note that [48], [49] deal with the state estimation task only; there are no rewards or actions in their setting. Therefore, we make appropriate modifications to the algorithm in the sequel.
In addition to these, the analysis in the current work is related to literature on the distributed optimization over networks [53], [54], [55], [56]. In particular, we adopt the two-step approach from [57], [58], [59]. In the first step, these works establish that agents cluster around the network centroid, and then, they show that this centroid converges to a neighborhood of the optimal solution, under constant learning rates. However, their focus is on optimization and supervised learning rather than reinforcement learning, which creates non-trivial distinctions in the analysis.
Notation: Random variables are denoted in bold. For K vectors w_1, w_2, . . . , w_K ∈ R^M of dimension M × 1 each, the notation col{w_k}_{k=1}^K stands for the KM × 1 vector that stacks them on top of each other and, for arbitrary matrices {A, B}, diag{A, B} stands for the block-diagonal matrix formed from A and B. The ℓ_p-norm of a vector w is represented by ||w||_p, while the ℓ_p-induced norm of a matrix A is represented by ||A||_p. To simplify the notation, we use ||w|| and ||A|| to denote the ℓ_2-norms. The Kullback-Leibler (KL) divergence between two distributions μ_1 and μ_2 is denoted by D_KL(μ_1||μ_2). We use the notation "proportional to", i.e., ∝, whenever the LHS of an expression is the normalized version of the RHS. For example, for s ∈ S and a function f : S → R_+,

μ(s) ∝ f(s)  ⟺  μ(s) = f(s) / Σ_{s'∈S} f(s').  (2)

III. PRELIMINARIES
In this work, we are interested in multi-agent policy evaluation under partially observable stochastic environments. For clarity of the exposition and to motivate the notation, we briefly review the procedure of single-agent policy evaluation under both fully and partially observable states.

A. FULLY-OBSERVABLE CASE
For modeling a learning agent under fully observable and dynamic environments, the traditional setting is a finite Markov decision process (MDP). An MDP is defined by the quintuple (S, A, T, r, γ), where S is a set of states with cardinality |S| = S, A is a set of actions, T is a transition model where T(s'|a, s) denotes the probability of transitioning from s ∈ S to s' ∈ S when the agent executes action a ∈ A, r(s, a, s') denotes the reward the agent receives when it executes action a and the environment transitions from state s to s', and γ ∈ [0, 1) is a discount factor that determines the importance given to immediate rewards (γ → 0) or the total reward (γ → 1). The goal of policy evaluation is to learn the value function V^π(s) of a target policy π(a|s), where the value function is the expected return if the agent starts from state s and follows policy π, namely,

V^π(s) ≜ E[ Σ_{i=0}^∞ γ^i r(s_i, a_i, s_{i+1}) | s_0 = s ],  (3)

where s_i is the state at time i and a_i is the action chosen by the agent according to the policy, a_i ∼ π(a|s_i). In many applications, the state space is too large (or infinite), which makes it impractical to keep track of the value function for all states. Therefore, function approximations are used to reduce the dimension of the problem. For instance, linear approximations, which are the focus of the theoretical analysis of this work, correspond to using a parameter w ∈ R^M and a feature mapping φ : S → R^M for representing state s. A standard stochastic approximation algorithm to learn the optimal parameter w° from data is TD-learning [19], [36], such as the TD(0) strategy [61] and variations thereof. If we denote the value function estimate at w ∈ R^M by V(s, w) ≜ φ(s)^T w, then, under this strategy, the agent first computes the TD error δ_i at time i by using the observed transition tuple (s_i, r_i, s_{i+1}):

δ_i ≜ r_i + γ V(s_{i+1}, w_i) − V(s_i, w_i),  (4)

where r_i ≜ r(s_i, a_i, s_{i+1}) is the instantaneous reward at time i.
Subsequently, the agent uses this error to update the current parameter estimate w_i to

w_{i+1} = w_i + α δ_i ∇_w V(s_i, w_i),  (5)

where α > 0 is the learning rate, and ∇_w V(s, w) = φ(s) for the linear function approximation case. This algorithm can be viewed as a "stochastic gradient algorithm" where the effective stochastic gradient is

g_i ≜ −δ_i φ(s_i).  (6)

In this work, we consider an ℓ_2-regularized version of the algorithm, which changes the update step (5) to

w_{i+1} = (1 − αρ) w_i + α δ_i φ(s_i),  (7)

where ρ > 0 is a constant hyper-parameter. As opposed to supervised learning, regularization is rather under-explored in reinforcement learning, with notable exceptions in [62], [63]. However, recent work [64], [65] suggests that regularization can increase generalization and sample efficiency in function approximation with over-parameterized models.
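For concreteness, the ℓ_2-regularized TD(0) update (4) and (7) can be sketched in a few lines of NumPy. This is an illustrative sketch only; the function name, default hyper-parameters, and toy feature vectors are our own and not part of the algorithm statement above.

```python
import numpy as np

def td0_step(w, phi_s, phi_next, r, alpha=0.05, gamma=0.9, rho=0.01):
    """One l2-regularized TD(0) update with linear value approximation."""
    # TD error: delta = r + gamma * V(s', w) - V(s, w), with V(s, w) = phi(s)^T w
    delta = r + gamma * phi_next @ w - phi_s @ w
    # Regularized update: w <- (1 - alpha*rho) * w + alpha * delta * phi(s)
    return (1.0 - alpha * rho) * w + alpha * delta * phi_s
```

Starting from w = 0, the regularization term vanishes and the first step reduces to the plain TD(0) correction α δ_i φ(s_i).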

B. PARTIALLY-OBSERVABLE CASE
In many applications, the agent does not directly observe the state s_i. For instance, a robot may receive noisy and partially informative observations from its sensors about the environment. The observation ξ_i that the agent receives at time i is generally assumed to be distributed according to some likelihood function linking it to the unobservable state, say, ξ_i ∼ L(ξ|s_i), which is conditioned on s_i. In these scenarios, the agent needs to estimate the latent state first from the observations. To do so, the agent learns a probability vector μ_i ∈ M(S) over the set of states S, which is called the belief vector [10], [19]. Here, M(S) denotes the S-dimensional probability simplex, and the entry μ_i(s) ∈ [0, 1] of the belief vector quantifies the confidence the agent has about state s being the true state at time i. The value of μ_i(s) corresponds to the posterior probability of s conditioned on the action-observation history (a.k.a. trajectory)

F_i ≜ {ξ_0, a_0, ξ_1, a_1, . . . , a_{i−1}, ξ_i},  (8)

which means:

μ_i(s) ≜ P(s_i = s | F_i).  (9)

This posterior satisfies the following temporal recursion [10], [12], [19]:

μ_i(s) ∝ L(ξ_i|s) η_i(s),  (10)

where η_i(s) is the time-adjusted prior defined by

η_i(s) ≜ Σ_{s'∈S} T(s|s', a_{i−1}) μ_{i−1}(s').  (11)

Here, F^a_{i−1} is the collection of past observations and actions, i.e.,

F^a_{i−1} ≜ {ξ_0, a_0, ξ_1, a_1, . . . , ξ_{i−1}, a_{i−1}},  (12)

where it is important to notice that F_i = {ξ_i} ∪ F^a_{i−1}. If beliefs are used as substitutes for hidden states, then partially observable MDPs (POMDPs) can be treated as continuous MDPs, since beliefs are continuous even if the number of states is finite. In this way, the policy evaluation problem corresponds to evaluating V^π(μ), where the value function is now defined as the expected return when the agent starts from the belief state μ and follows the policy π(a|μ), namely [10], [19]:

V^π(μ) ≜ E[ Σ_{i=0}^∞ γ^i r_i | μ_0 = μ ].  (13)

Observe that, in contrast to the fully observable case, the agent now chooses action a_i according to the policy a_i ∼ π(a|μ_i), which is conditioned on the belief vector. Algorithm (4)-(7) can be adjusted for POMDPs by using the belief vectors (μ_i, η_{i+1}) instead of the states (s_i, s_{i+1}).
Thus, we let

δ_i ≜ r_i + γ V(η_{i+1}, w_i) − V(μ_i, w_i),  (14)

and

w_{i+1} = (1 − αρ) w_i + α δ_i φ(μ_i),  (15)

where the approximations V(μ, w) are computed by using the feature vectors φ(μ), now dependent on μ, to evaluate V(μ, w) ≜ φ(μ)^T w. Note that from now on φ : M(S) → R^M is a different feature mapping that represents μ instead of s, and the agents' goal is to learn the optimal parameter w°. Observe from (10)-(11) that in order for the agent to update the belief vectors (μ_i, η_{i+1}), it needs to know the transition model T and the likelihood functions L(ξ_i|s) for each state. However, the agent does not need to know the underlying reward model r. It can use instantaneous reward samples r_i to run the algorithm. In this sense, the algorithm is a mixture of model-based and model-free reinforcement learning. The motivation for this approach is at least two-fold. First, in some applications, learning the transition and observation models from data is inherently easier than learning the reward function. This is because the reward function can depend on some latent characteristics of the environment or of some human expert, which may be challenging to estimate. One example where this scenario arises is autonomous cars [66]. In this case, the observations from environmental sensors and cameras are processed with a learned likelihood model such as a convolutional neural network. The transition dynamics of the car depend on various parameters such as speed, acceleration, position, and incline, and can be modeled based on the laws of physics and a mapping of the surroundings. However, learning a reward function for this application is notoriously difficult, as it is challenging to cover all possible situations [67]. Second, the agent can still run (14)-(15) even if the beliefs are not formed through (10)-(11), but are estimated by some other approach, as in [20], [21], [22].
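The belief recursion (10)-(11) is a standard Bayesian filter over a finite state space and can be sketched directly from those two equations. The following is an illustrative NumPy sketch; the function name and array conventions are our own assumptions.

```python
import numpy as np

def belief_update(mu_prev, T_a, likelihood):
    """One step of the Bayesian filtering recursion (10)-(11) for a POMDP.

    mu_prev    : belief over the S states at time i-1
    T_a        : S x S matrix with T_a[s_prev, s] = T(s | s_prev, a_{i-1})
    likelihood : vector of L(xi_i | s) for the received observation
    """
    # Time adjustment (11): eta_i(s) = sum_{s'} T(s | s', a_{i-1}) mu_{i-1}(s')
    eta = mu_prev @ T_a
    # Measurement update (10): mu_i(s) proportional to L(xi_i | s) * eta_i(s)
    mu = likelihood * eta
    return eta, mu / mu.sum()
```

Both η_i and μ_i are returned because the TD error (14) evaluates the value function at the prior η_{i+1} and the posterior μ_i.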

IV. MULTI-AGENT POLICY EVALUATION
We now consider a set K of K cooperative agents that aim to evaluate the average value function under a joint policy π = {π_k}_{k=1}^K that consists of individual policies π_k. The framework we consider is a decentralized POMDP (Dec-POMDP) [9], which is defined by the sextuple (S, A_k, O_k, T, r_k, γ). Here, the set of states S and the transition model T are common to all agents, where the notation T(s|s', a) now specifies the probability that the environment transitions from s' to s when the agents execute the joint action a = {a_k}_{k=1}^K. The individual action a_k of each agent k takes values from the set A_k, and r_k(s, a, s') is the local reward agent k receives when the agents execute the collection of actions a and the environment transitions from s to s'. Note that this setting covers general teamwork scenarios where the local reward of an individual agent can depend on all actions, and not only on its own actions. In particular, it covers scenarios in which all agents observe the same reward, i.e., r_k(s, a, s') = r(s, a, s'), ∀k ∈ K. Remember that agents receive instantaneous rewards as they progress through the POMDP, and they are not required to know the joint action a from all agents. Moreover, O_k is a set of private observations. At each time instant i, agent k receives an observation ξ_{k,i} ∈ O_k emitted by state s_i, assumed to be distributed according to the local marginal likelihood L_k(ξ_k|s_i).
Similar to the single-agent case, Dec-POMDPs can be treated as multi-agent belief MDPs by replacing the hidden states with joint centralized beliefs defined by [9, Chap. 2]:

μ_i(s) ≜ P(s_i = s | F_i) ∝ L(ξ_i|s) η_i(s).  (16)

Here, F_i denotes the history of all observations and past actions from across all agents until time i, where ξ_i ≜ {ξ_{k,i}}_{k=1}^K in the definition is now the aggregate of the observations from across the network, and a_{i−1} ≜ {a_{k,i−1}}_{k=1}^K is a tuple aggregating the actions of all agents at time i − 1. Moreover, under spatial independence, the joint likelihood L(ξ_i|s) appearing in (16) is given by

L(ξ_i|s) = Π_{k=1}^K L_k(ξ_{k,i}|s).  (17)

In a manner similar to the single-agent case (12), the belief η_i(s) is the time-adjusted prior conditioned on F^a_{i−1}:

η_i(s) ≜ Σ_{s'∈S} T(s|s', a_{i−1}) μ_{i−1}(s').  (18)

The goal of policy evaluation is to learn the team value function, which is the expected average return of all agents starting from some belief state μ, i.e.,

V^π(μ) ≜ E[ Σ_{i=0}^∞ γ^i (1/K) Σ_{k=1}^K r_{k,i} | μ_0 = μ ],  (19)

where r_{k,i} denotes the instantaneous local reward agent k receives at time i. There is one major inconvenience with this approach. In order to compute the joint belief (16), it is necessary to fuse all observations and actions from across the agents in a central location. This is possible in settings where there exists a fusion center. However, many applications rely solely on localized processing. In the following, we discuss and compare two strategies for multi-agent reinforcement learning under partial observations: (i) a centralized strategy, and (ii) a fully decentralized strategy.

A. CENTRALIZED STRATEGY
In the fully centralized strategy, the state estimation and policy evaluation phases are centralized and, hence, the setting is equivalent to a single-agent POMDP, already discussed in Section III-B, using the joint likelihood L(ξ_i|s) and the average reward r_i ≜ K^{−1} Σ_{k=1}^K r_{k,i}. The fusion center computes the joint belief (16), and agents take actions based on this joint belief, i.e., a_{k,i} ∼ π_k(a_k|μ_i). The fusion center then computes the centralized TD error:

δ_i ≜ r_i + γ V(η_{i+1}, w_i) − V(μ_i, w_i),  (20)

and updates the estimate to

w_{i+1} = (1 − αρ) w_i + α δ_i φ(μ_i).  (21)

This construction is listed under Algorithm 1.
Algorithm 1 (Centralized strategy, excerpt):
 6: for each agent k ∈ K do
 7:     take action a_{k,i} ∼ π_k(a_k|μ_i)
 8:     get reward r_{k,i} = r_k(s_i, a_i, s_{i+1})
 9: end for
10: then, evolve the prior η_{i+1} via the time adjustment
11: average the rewards r_i = (1/K) Σ_{k=1}^K r_{k,i}
12: update the model:
13:     compute the TD error (20) and the parameter update (21)
14: end while

B. DECENTRALIZED STRATEGY
The centralized strategy is disadvantageous in the sense that (i) failure of the fusion center results in failure of the entire system; (ii) there can be communication bottlenecks at the fusion center; and (iii) agents can be spatially distributed to begin with. Therefore, in this section, we propose a fully decentralized strategy for policy evaluation in which agents communicate with their immediate neighbors only.

1) DECENTRALIZED NETWORK MODEL
We refer to Fig. 1 and assume that the graph is strongly connected [53], which means that paths exist connecting any pair of agents (ℓ, k) in both directions and, in addition, there exists at least one agent in the graph that does not discard its own information (i.e., c_kk > 0 for at least one agent k). Under this assumption, the combination matrix C = [c_ℓk], where the entry c_ℓk ≥ 0 scales the information agent k receives from agent ℓ, becomes primitive. If two agents are not connected by an edge, then c_ℓk = 0. We assume C is symmetric and doubly stochastic, meaning that

Σ_{ℓ∈N_k} c_ℓk = 1,  c_ℓk = c_kℓ,  ∀k, ℓ ∈ K,

or, in matrix notation:

C^T = C,  C1 = 1,

where 1 denotes the all-ones vector.
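A symmetric doubly-stochastic combination matrix of this kind can be constructed, for example, with the Metropolis-Hastings rule. The sketch below is illustrative and is not prescribed by the paper; the rule choice and function name are our own.

```python
import numpy as np

def metropolis_weights(adj):
    """Symmetric doubly-stochastic combination matrix from an adjacency matrix.

    adj : K x K symmetric 0/1 matrix with zero diagonal (no self-edges).
    """
    K = adj.shape[0]
    deg = adj.sum(axis=1)  # number of neighbors of each agent (excluding self)
    C = np.zeros((K, K))
    for k in range(K):
        for l in range(K):
            if k != l and adj[k, l]:
                # Metropolis rule: weight limited by the busier endpoint
                C[k, l] = 1.0 / (1.0 + max(deg[k], deg[l]))
        # Self-weight absorbs the remaining mass, so each row sums to one
        C[k, k] = 1.0 - C[k].sum()
    return C
```

Because the rule is symmetric in (k, ℓ) and each row sums to one, the resulting C is automatically symmetric and doubly stochastic, and c_kk > 0 holds whenever some agent has fewer neighbors than its busiest neighbor.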

2) LOCAL BELIEF FORMATION
In the fully decentralized strategy, the agents cannot form the joint belief (16) since they do not have access to the observations and actions of all other agents. They, however, can construct local beliefs. To do so, we will extend the diffusion HMM strategy (DHS) from [48] and [49], which is originally designed for hidden Markov models, to the current POMDP setting.
In DHS, the global belief vectors {μ i , η i } are replaced by local belief vectors {μ k,i , η k,i }, and the latter are updated by using local observations and by relying solely on interactions with the immediate neighbors. The original DHS algorithm is designed for actionless partially observable Markov chains, and each agent can use the same global transition model. However, in POMDPs, transition of the global state depends on the joint action, and the agents cannot perform a centralized time-adjustment step as in (23) since they do not know the actions of all agents in the network.
Therefore, one strategy is to use a transition model that is obtained by marginalizing over the actions that are unknown to agent k. More specifically, let a_{N_k} ∈ A_{N_k} denote a tuple of actions taken by the set of neighbors of agent k (which we denote by N_k). These actions can be assumed to be known by agent k if, for instance, agents share their actions with their neighbors. Let a^c_{N_k} ∈ A^c_{N_k} denote the remaining actions by all other agents in the network, so that a = a_{N_k} ∪ a^c_{N_k}. Then, each agent can use the following local transition model approximation:

T^π_k(s|s', a_{N_k}) ≜ Σ_{a^c_{N_k} ∈ A^c_{N_k}} π(a_{N_k}, a^c_{N_k}|s') T(s|s', a),  (28)

to time-adjust its local belief:

η_{k,i}(s) = Σ_{s'∈S} T^π_k(s|s', a_{N_k,i−1}) μ_{k,i−1}(s').  (29)

Here, a_{N_k,i−1} is the tuple of actions taken by the neighbors of agent k at time instant i − 1. Moreover, in (28), the notation π(a_{N_k}, a^c_{N_k}|s') represents the joint action probability, where π(a|s) is now a shorthand for π(a|μ) when μ(s) = 1, i.e., when the belief attains the value 1 for state s and is 0 otherwise. Note that this construction leads to a richer scenario compared to [48], [49], with transition models that differ across the agents. The rest of the algorithm is the same as the DHS strategy. Following (29), and based on the personal observation ξ_{k,i}, each agent k forms an intermediate belief using a β-scaled Bayesian update of the form:

ψ_{k,i}(s) ∝ [L_k(ξ_{k,i}|s)]^β η_{k,i}(s),

where β > 0. Finally, agents in the neighborhood of k share their intermediate beliefs, which allows agent k to update its belief using the weighted geometric average expression:

μ_{k,i}(s) ∝ Π_{ℓ∈N_k} [ψ_{ℓ,i}(s)]^{c_ℓk}.  (33)

This procedure of repeated updating and exchanging of beliefs allows information to diffuse over the network.
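The adapt step (β-scaled Bayesian update) and the combine step (weighted geometric averaging) of the belief-sharing procedure can be sketched for all agents at once. This is an illustrative NumPy sketch; the stacked array layout (one row per agent) and function names are our own assumptions.

```python
import numpy as np

def dhs_adapt(eta, lik, beta=1.0):
    """beta-scaled Bayesian update: psi_k(s) proportional to L_k(xi_k|s)^beta * eta_k(s).

    eta, lik : K x S arrays (one row per agent).
    """
    psi = (lik ** beta) * eta
    return psi / psi.sum(axis=1, keepdims=True)

def dhs_combine(psi, C):
    """Geometric averaging: mu_k(s) proportional to prod_l psi_l(s)^{c_lk}.

    C : K x K combination matrix; column k holds the weights agent k applies.
    """
    log_mu = C.T @ np.log(psi)  # weighted sums of log-beliefs, row k for agent k
    # Subtract the row maximum for numerical stability before exponentiating
    mu = np.exp(log_mu - log_mu.max(axis=1, keepdims=True))
    return mu / mu.sum(axis=1, keepdims=True)
```

Working in the logarithmic domain turns the weighted geometric average into a weighted arithmetic average of log-beliefs, which is both cheap and numerically stable.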

3) DIFFUSION POLICY EVALUATION
In the fully decentralized strategy, the local belief formation strategy is used during both the training and execution phases. Namely, the target value function in (19) represents the average return agents obtain when they execute the policy π with their local beliefs formed via the DHS strategy. Moreover, since the policy evaluation is also decentralized, during the training phase, they again need to use DHS to approximate the global belief state μ on top of the function approximation. More specifically, using its local belief vectors, each agent k computes a local TD error:

δ_{k,i} ≜ r_{k,i} + γ V(η_{k,i+1}, w_{k,i}) − V(μ_{k,i}, w_{k,i}),  (34)

where r_{k,i} = r_k(s_i, a_i, s_{i+1}) is also a function of the local beliefs, since each agent k now executes the action a_{k,i} ∼ π_k(a_k|μ_{k,i}). Subsequently, each agent k forms an intermediate parameter estimate denoted by

ψ_{k,i} = (1 − αρ) w_{k,i} + α δ_{k,i} φ(μ_{k,i}).  (35)

After receiving the intermediate estimates from its neighbors, agent k updates w_{k,i} to

w_{k,i+1} = Σ_{ℓ∈N_k} c_ℓk ψ_{ℓ,i}.  (36)

The local adaptation step (35) followed by the combination step (36) is reminiscent of diffusion strategies for distributed learning [19], [53]. Observe that there are actually two combination steps involved in diffusion policy evaluation: the belief combination (33) with geometric averaging (GA), and the parameter combination (36) with arithmetic averaging (AA). These choices of fusion rules are supported by recent results in the literature [68], [69] that promote the use of GA for probability density functions and AA for point estimates. The listing of the proposed diffusion policy evaluation strategy for POMDPs appears in Algorithm 2. Algorithm 2 has the following advantages:
- Decentralized information structure: The algorithm is designed to be fully decentralized, with each agent only having access to its own private data, such as observations and rewards, without the need to share this information with other agents. Importantly, agents do not require knowledge of the joint distribution of observations or of the network topology.
They only know their own marginal likelihood function, and their actions are only known by (or transmitted to) their immediate neighbors.
If agents happen to know their own marginal transition models, they do not need to know the policies of other agents or the global transition model. However, if the application requires them to approximate these marginal models themselves via (28), they need knowledge of the other agents' policies and of the global transition model.
- Privacy: The algorithm is also advantageous in terms of privacy since (i) communicating beliefs allows information to diffuse without explicitly sharing raw observational data, and (ii) exchanging value function parameters allows agents to learn the cumulative reward across the network without explicitly sharing local rewards.

Algorithm 2 (Diffusion policy evaluation, excerpt):
 1: set initial priors η_{k,0}(s) > 0, ∀s ∈ S and ∀k ∈ K
 2: choose β > 0
 3: initialize w_{k,0} for all k ∈ K
 4: while i ≥ 0 do
 5:     each agent k observes ξ_{k,i}
 6:     for each agent k ∈ K and s ∈ S: adapt the belief with the β-scaled Bayesian update
 7:     end for
 8:     for each agent k ∈ K do
 9:         take action a_{k,i} ∼ π_k(a_k|μ_{k,i})
10:         get reward r_{k,i} = r_k(s_i, a_i, s_{i+1})
11:     end for
12:     for each agent k ∈ K evolve
13:         compute T^π_k(s|s', a_{N_k,i}) using (28), and time-adjust the belief via (29)
14:     end for
15:     for each agent k ∈ K update the parameter estimate via (34)-(35)
16:     end for
17:     for each agent k ∈ K combine the estimates via (36)

- Complexity: The main computational burden on each agent stems from forming the local transition model approximation. This is due to the need to average over non-neighbors' actions in (28), whose number grows with the network size in general. Compared to alternative approaches, such as relaying raw data, incremental approaches [70], or Bayesian belief forming [71], our algorithm is much lighter in terms of complexity. Relaying raw data, for example, would result in an exponential increase of memory and communication overhead at each hop, making it highly impractical. The incremental approach of relaying over a cyclic path that visits each agent once (which is NP-hard to find [72]) would reduce the overhead. However, it is not robust against failures and not scalable, making it impractical for a decentralized setting. The Bayesian belief-forming strategy requires knowledge of the network topology and of the other agents' likelihood functions, and is known to be NP-hard, even in the much simpler case of a fixed state and no actions [13].
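The local TD adaptation step (35) followed by the arithmetic-averaging combination step (36) can be sketched for the whole network at once. This is an illustrative NumPy sketch; the stacked-matrix layout and default hyper-parameters are our own assumptions, not the paper's notation.

```python
import numpy as np

def diffusion_td_step(W, Phi_mu, Phi_eta, r, C, alpha=0.05, gamma=0.9, rho=0.01):
    """One adapt-then-combine step of diffusion TD(0) over local beliefs.

    W       : K x M matrix of per-agent parameter estimates w_k
    Phi_mu  : K x M rows phi(mu_{k,i}); Phi_eta: rows phi(eta_{k,i+1})
    r       : length-K vector of local rewards r_{k,i}
    C       : K x K doubly-stochastic combination matrix
    """
    # Local TD errors (35)-style adapt uses:
    # delta_k = r_k + gamma * phi(eta_k)^T w_k - phi(mu_k)^T w_k
    delta = r + gamma * np.sum(Phi_eta * W, axis=1) - np.sum(Phi_mu * W, axis=1)
    # Adapt: psi_k = (1 - alpha*rho) * w_k + alpha * delta_k * phi(mu_k)
    Psi = (1.0 - alpha * rho) * W + alpha * delta[:, None] * Phi_mu
    # Combine (arithmetic averaging): w_k = sum over neighbors l of c_lk * psi_l
    return C.T @ Psi
```

When C is doubly stochastic, the combination step leaves the network-average parameter unchanged, which is the mechanism behind the clustering behavior analyzed in Section V.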

V. MAIN RESULTS
In this section, we analyze the performance of the decentralized strategy in Algorithm 2. In particular, we first show in Section V-B that the value function parameters {w k,i } of the agents cluster around the network centroid. Then, in Section V-C, we show that this network centroid has a bounded difference from the parameter of a baseline strategy (which will be presented in Algorithm 3). Our analysis relies on bounding the disagreement between the joint centralized belief μ i and the local estimate μ k,i , which is presented next.

A. BELIEF DISAGREEMENT
In a manner similar to [49], we introduce the following risk functions in order to assess the disagreement between the local beliefs formed via (37)-(39) and the joint centralized beliefs formed via (22)-(23):

J^μ_{k,i} ≜ E D_KL(μ_i || μ_{k,i}),  (43)

and

J^η_{k,i} ≜ E D_KL(η_i || η_{k,i}).  (44)

The risks in (43) and (44) measure the disagreement after and before the joint observation ξ_i, respectively. Remember that [49] considers a state estimation setting rather than a POMDP. Specifically, in their setting, the transition model does not depend on actions, and it is assumed that every agent knows the global transition model accurately. In comparison, in the current work, each agent uses a local approximation of the global transition model based on (28). Therefore, we need to make some non-trivial adjustments to the belief disagreement analysis. We begin with adjusting the assumptions from [49] to our model.
- Likelihood model: For each agent k ∈ K, the likelihoods satisfy L_k(ξ|s) = 0 ⟺ L_k(ξ|s') = 0, which ensures that the likelihoods for each state pair (s, s') share the same support; in addition to this, the log-likelihood ratio |log(L_k(ξ|s)/L_k(ξ|s'))| is uniformly bounded over its support for each state s ∈ S and agent k ∈ K.
- Transition model: The Markov chain induced by any joint action a ∈ A is irreducible and aperiodic. Since the number of states is finite, this assumption implies that the transition model T(s|s', a) is ergodic [73, Chap. 2]. Like [49], we focus on the important class of geometrically ergodic models, which additionally satisfy the relation κ(T_a) ≤ κ(T) for every a ∈ A and some constant κ(T) < 1.
Here, κ(T_a) is the Dobrushin coefficient [12, Chap. 2], where T_a[s, s′] ≜ T(s|s′, a) is a generic entry of the S × S transition matrix T_a. Due to space limitations, we refer the reader to [12, Chap. 2] for a comprehensive discussion of the Dobrushin coefficient κ(T_a). In short, κ(T_a) quantifies how fast the transition model forgets its initial conditions: as κ(T_a) → 0, past conditions are forgotten faster. Instances of geometrically ergodic transition models include transition matrices with all-positive elements, or matrices satisfying the minorization condition in [12, Theorem 2.7.4]. In addition to this condition from [48], [49], we impose a further assumption on the transition model to regulate the disagreement stemming from the local transition model estimates:

Assumption 1 (Transition model disagreement): For each agent k, consider the n-hop neighbor set N_{k,n} and its complement N^c_{k,n}. In other words, N_{k,n} is the set of agents at a geodesic distance of at most n hops from agent k.
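The Dobrushin coefficient has a direct computational form: it is the maximum total-variation distance between any two rows of the (row-stochastic) transition matrix. A small sketch, with the matrix taken row-stochastic for convenience:

```python
import numpy as np
from itertools import combinations

def dobrushin(P):
    """Dobrushin ergodicity coefficient of a row-stochastic matrix P.

    kappa(P) is the maximum total-variation distance between any two
    rows of P. It satisfies kappa(P) < 1 for geometrically ergodic
    chains, and kappa(P) = 0 iff all rows are identical (the chain
    forgets its initial condition in a single step).
    """
    kappa = 0.0
    for i, j in combinations(range(P.shape[0]), 2):
        kappa = max(kappa, 0.5 * np.abs(P[i] - P[j]).sum())
    return kappa
```

For instance, a matrix with all-positive entries has κ < 1, while the identity matrix (which never forgets its initial state) has κ = 1.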
We define the transition model approximation that uses the n-hop neighbors' actions accordingly. We then assume that the transition model approximations induced by n-hop and (n + 1)-hop neighbors' actions share the same support and, moreover, that their disagreement is bounded over this shared support for n ≥ 1. This assumption essentially ensures that the increase in an agent's transition model approximation error due to missing action information is bounded at each unit increase of geodesic distance from that agent.
132 VOLUME 2, 2023
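The n-hop neighbor sets N_{k,n} used in Assumption 1 can be computed by repeated frontier expansion on the adjacency matrix. This is a small sketch; the function name and the adjacency-matrix representation are illustrative choices, not notation from the paper:

```python
import numpy as np

def n_hop_neighbors(A, k, n):
    """Set of agents within geodesic distance n of agent k.

    A is a symmetric 0/1 adjacency matrix. Agents outside this set are
    the ones whose actions agent k must average over in its local
    transition-model approximation (cf. (28)).
    """
    K = A.shape[0]
    reach = np.eye(K, dtype=bool)[k]        # 0-hop set: agent k itself
    step = (A + np.eye(K)) > 0              # allow staying in place
    for _ in range(n):
        reach = step[reach].any(axis=0)     # expand the frontier by one hop
    return set(np.flatnonzero(reach))
```

On a path graph 0-1-2-3, for example, the 1-hop set of agent 0 is {0, 1} and the 2-hop set is {0, 1, 2}; the complement of each set shrinks as n grows, which is what Assumption 1 exploits.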

2) DIFFERENCE WITH CENTRALIZED STRATEGY
The following result provides upper bounds on the disagreement measures in (43)–(44).

Theorem 1 (Bounds on belief disagreement):
For each agent k, the belief disagreement risks (43) and (44) are bounded at a linear rate of κ(T). Namely, both bounds hold as i → ∞, where d_min is the minimum degree over the graph, i.e., the minimum number of neighbors that any agent in the network possesses, and λ ≜ max{|1 − β/K|, λ₂}, where λ₂ < 1 is the mixing rate (second largest eigenvalue modulus) of C.
Proof: See Appendix A.
In Theorem 1, the first terms in both bounds are equivalent to the bounds obtained in [49]. However, the terms proportional to (K − d_min)τ are new; they arise from the fact that agents do not observe the joint actions and hence only have a local estimate of the transition model. Nevertheless, the bounds shrink with increasing network connectivity, i.e., as λ₂ → 0 and d_min → K, which shows the benefit of cooperation. In particular, if β = K and the network is fully connected (λ₂ = 0, d_min = K), the bounds equal 0; in other words, the local beliefs match the centralized belief in this situation. It is important to note that the linear term (K − d_min) is a worst-case bound that holds for any strongly connected network topology. For instance, in a scenario where each agent has N > 1 neighbors, it is straightforward to modify the proof and show that these linear terms become logarithmic, i.e., proportional to log K / log N.
We use Theorem 1 in the performance analysis of diffusion policy evaluation. To that end, we first present the following consequence of Theorem 1, which provides bounds in terms of disagreement norms.
Corollary 1 (Bounds on disagreement norms): Theorem 1 implies that the expected disagreement norms between the local and centralized beliefs are bounded as i → ∞, where we introduce the corresponding constants (55) and (56).

B. NETWORK DISAGREEMENT
In this section, we study the deviation of the agents' parameters from the network centroid. To that end, we incorporate the linear approximation V(μ, w) = φ(μ)ᵀw into the TD-error expression (40). Since ∇_w V(μ, w) = φ(μ) in the linear case, the update can be written in terms of the belief feature matrix H_{k,i} defined in (59). To proceed, we introduce the following regularity assumption on the feature vector.
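A single regularized TD step with linear features can be sketched as follows. This is an assumed generic form, not the paper's exact recursion (40)-(42): the TD error uses the belief features before and after the transition, and the ρ-term adds the ℓ2 regularization that the later theorems require.

```python
import numpy as np

def td_update(w, phi, phi_next, r, alpha=0.1, gamma=0.9, rho=1e-4):
    """One regularized TD(0) step with linear value V(mu, w) = phi(mu)^T w.

    phi, phi_next: belief feature vectors before/after the transition.
    rho:           l2 regularization strength (illustrative placement).
    """
    delta = r + gamma * phi_next @ w - phi @ w    # instantaneous TD error
    return w + alpha * (delta * phi - rho * w)    # regularized gradient step
```

With a single constant feature, reward 1, and γ = 0.5, the fixed point of the unregularized update is w = 1/(1 − γ) = 2, which the iteration approaches geometrically.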

Assumption 2 (Feature vector):
The feature mapping φ(μ) is bounded and Lipschitz continuous over the S-dimensional probability simplex; namely, the stated bounds hold for any vectors in the simplex.

Lemma 1 (Belief feature difference):
For each agent k ∈ K, the belief feature matrix H_{k,i} in (59) has a bounded expected difference from the centralized belief feature matrix H_i defined in (63).
Proof: See Appendix C.
We also assume that all rewards are non-negative and uniformly bounded, i.e., 0 ≤ r_{k,i} ≤ R_max for each agent k ∈ K and all time instants i. We now proceed to study the network disagreement. To that end, we define the network centroid w̄_i ≜ (1/K) Σ_{k∈K} w_{k,i}, which is the average of the parameters of all agents. The following result shows that the agents cluster around this network centroid after sufficiently many iterations.
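The centroid and the mean-square disagreement around it can be computed directly from the stacked agent parameters. A minimal sketch, with the stacking convention (one row per agent) as an assumption:

```python
import numpy as np

def network_disagreement(W):
    """Centroid and mean squared distance of agent parameters to it.

    W is a K x M array stacking each agent's parameter vector as a row;
    the centroid is their plain average, matching the definition above.
    """
    centroid = W.mean(axis=0)
    msd = np.mean(np.sum((W - centroid) ** 2, axis=1))
    return centroid, msd
```

This is the quantity that Theorem 2 bounds and that is plotted against time in the simulations.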

Theorem 2 (Network agreement):
The average distance to the network centroid is bounded for ρ > γB_φL_φ/√2 after a sufficient number of iterations. In particular, if ρ ≥ 0.75γB_φL_φ, then the stated bound holds, where ε > 0 is a constant.
Proof: See Appendix D.
Theorem 2 states that the agents' parameter estimates cluster around the network centroid within a mean ℓ₂-distance on the order of O(αλ₂) in the limit as i → ∞. This result confirms that agents can get arbitrarily close to each other by setting the learning rate α sufficiently small. Moreover, dense networks generally have small λ₂, which results in small disagreement within the network.

C. PERFORMANCE OF DIFFUSION POLICY EVALUATION
We can therefore use the network centroid as a proxy for all agents in order to show that the disagreement between the fully decentralized strategy of Algorithm 2 and a baseline strategy that requires a central processor during training is bounded. We start by describing this baseline strategy and explaining why it is a more suitable baseline than the fully centralized strategy of Algorithm 1.
In some applications, even though agents are supposed to operate in a decentralized fashion once deployed in the field, they can nevertheless rely on central processing during the training phase in order to learn the best policy. In the literature, this paradigm is referred to as centralized training for decentralized execution [16], [74]. For our problem, the crucial point is that during training the centralized processor can form beliefs based on all observations, but it must account for the fact that agents will execute their actions based on local beliefs once deployed. Therefore, in the baseline strategy, actions and rewards are based on local beliefs as in (37)–(39), whereas parameter updates are based on the centralized posterior as in (22)–(23). Algorithm 3 lists this baseline procedure. Notice that the algorithm consists of both local belief construction (see (67), (68), and (70)) and centralized belief construction (see (69) and (71)). The former is used for action execution a_{k,i} ∼ π_k(a_k | μ_{k,i}), while the latter is used for the value function parameter updates in (72)–(73).
In the fully centralized strategy of Algorithm 1, the agents' actions and the subsequent rewards are based on the centralized belief. Therefore, the target value function that Algorithm 1 aims to learn corresponds to the average cumulative reward obtained under centralized execution. In comparison, the target value functions that Algorithms 2 and 3 aim to learn are identical and correspond to the average cumulative reward under decentralized execution. While both aim to learn the same parameter w•, the baseline strategy can utilize centralized processing, whereas the diffusion strategy is fully decentralized. Nonetheless, the following result shows that the expected disagreement between the baseline strategy and the fully decentralized strategy remains bounded.

Theorem 3 (Disagreement with the baseline solution):
The expected distance between the baseline strategy and the network centroid is bounded after a sufficient number of iterations for sufficiently large regularization ρ.

Algorithm 3: Centralized Evaluation for Decentralized Execution.
1: set initial priors η_{k,0}(s) > 0, η_0(s) > 0, for ∀s ∈ S and ∀k ∈ K
2: choose β > 0
3: initialize w_0
4: while i ≥ 0 do
5:   each agent k observes ξ_{k,i}
6:   for each agent k ∈ K and s ∈ S adapt and combine
7:   end for
8:   form the centralized belief with the joint observation
9:   for each agent k ∈ K do
10:    take action a_{k,i} ∼ π_k(a_k | μ_{k,i})
11:    get reward r_{k,i} = r_k(s_i, a_i, s_{i+1})
12:  end for
13:  average the rewards r_i = (1/K) Σ_{k=1}^{K} r_{k,i}
14:  for each agent k ∈ K evolve
15:    compute T^π_k(s | s′, a_{N_k,i}) using (28)
16:  end for
17:  evolve the centralized belief
18:  update the value function parameter via (73)
19: end while

Proof: See Appendix E.
Theorem 3 implies that the disagreement between the network centroid, around which the agents cluster, and the baseline strategy is on the order of B_TV. This means that if the local beliefs are similar to the centralized belief, the agents get closer to the baseline parameter. In this regard, from the definition (55) of B_TV, it can be observed that B_TV shrinks with increasing network connectivity (i.e., decreasing λ₂) as β → K. In fact, it equals zero for fully connected networks with the choice β = K and c_k = 1/K. Therefore, by tuning β and c_k, the fully decentralized strategy can match the value function estimates of a centralized training strategy that gathers all observations and actions at a fusion center. In the next section, by means of numerical simulations, we further compare the value function estimation accuracy of Algorithms 1, 2, and 3 using the squared Bellman error (SBE).

VI. SIMULATION RESULTS
For the numerical simulations, we consider a multi-agent target localization application; the implementation is available online.¹ We use a set of K = 8 agents and a moving target in a 10 × 10 two-dimensional grid-world environment. The agents' locations are fixed, with coordinates randomly assigned at the beginning of the simulation. The agents' goal is to cooperatively evaluate a given policy for hitting the target. Agents cannot observe the target's location (i.e., the state) accurately; instead, they receive noisy observations whose quality depends on their distance from the target's true location. The target moves according to a pre-defined transition model that takes the agents' actions (i.e., hits) into account; specifically, the target tries to evade the agents' hits.
A possible scenario for this setting is a network of sensors and an intruder (e.g., a spy drone); see Fig. 2. The sensors try to localize the intruder based on noisy measurements and belief exchanges. Moreover, in order to disrupt the communication between the intruder and its owner, each sensor sends a narrow-sector jamming beam toward its estimate of the target location. However, the intruder is capable of detecting energy abnormalities and chooses its next location by favoring cells distant from the jamming signals. We now describe the setting in more detail.
Combination matrix: The entries of the combination matrix are set inversely proportional to the ℓ₁-distance between agents; that is, the farther two agents are from each other, the smaller the weight assigned to the edge connecting them. Weights smaller than a threshold are set to 0, which means that agents that are too far apart do not need to communicate. The resulting communication topology is illustrated in Fig. 3(a).
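The inverse-distance rule above can be sketched as follows. The paper fixes only the inverse-ℓ₁-distance weighting with a cutoff; the final normalization to a symmetric doubly stochastic matrix (scaling off-diagonal weights and letting the diagonal absorb the remainder) is one common choice and is an assumption of this sketch.

```python
import numpy as np

def combination_matrix(pos, threshold=0.15):
    """Symmetric doubly stochastic combination matrix from agent positions.

    Off-diagonal weights are inversely proportional to the l1-distance
    between agents and are cut to zero below `threshold`; the diagonal
    then absorbs the remaining mass so every row and column sums to one.
    """
    D = np.abs(pos[:, None, :] - pos[None, :, :]).sum(axis=2)  # l1 distances
    with np.errstate(divide="ignore"):
        A = np.where(D > 0, 1.0 / np.maximum(D, 1e-12), 0.0)
    A[A < threshold] = 0.0                       # prune agents that are too far
    A /= A.sum(axis=1).max() + 1e-12             # scale rows below stochastic
    np.fill_diagonal(A, 1.0 - A.sum(axis=1))     # self-weights close the rows
    return A
```

Because the off-diagonal part is symmetric and every row sums to one, the columns also sum to one, which is the property the analysis (primitive, symmetric, doubly stochastic C) relies on.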
Transition model: The target moves randomly between cells (i.e., states) of the grid. The probability of a cell being the target's next location depends on the target's current location and on the locations of the agents' hits. Specifically, each cell in the grid is assigned a score based on its ℓ₁-distance to the target's current location and to the average location of the agents' hits; see Table 1. For example, observe from Table 1 that cells in the proximity of the target's current location that are also far away from the agents' strikes receive the highest score. These scores are normalized to yield a probabilistic transition kernel.
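The score-then-normalize construction can be sketched as follows. The score values here are illustrative stand-ins for the paper's Table 1; only the qualitative rule (high score near the target, higher still when far from the average hit) comes from the text.

```python
import numpy as np

def transition_kernel(grid=10, target=(4, 4), mean_hit=(8, 8)):
    """Next-location distribution of the evading target (illustrative).

    Each cell is scored by its l1-distance to the target's current cell
    and to the agents' average hit location, then the scores are
    normalized into a distribution. Score values are stand-ins for Table 1.
    """
    ys, xs = np.mgrid[0:grid, 0:grid]
    d_target = np.abs(ys - target[0]) + np.abs(xs - target[1])
    d_hit = np.abs(ys - mean_hit[0]) + np.abs(xs - mean_hit[1])
    # Cells near the target score high; cells far from the hits score higher.
    score = np.where(d_target <= 2, 10.0, 1.0) * np.where(d_hit >= 3, 4.0, 1.0)
    return score / score.sum()
```

Under this rule, a cell next to the target but far from the strikes is much more likely to be the next location than the struck cell itself.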
Likelihood function: Agents cannot observe where the target is; they only receive noisy observations. Each agent obtains a more accurate observation of the target's position when the target is in close proximity; otherwise, the larger the distance between the agent and the target, the higher the noise level. To construct the likelihood function, we first assign each cell in the grid a score reflecting how probable it is to find the target in that cell, depending on how close the target is to the agent; see Table 2.
The scores are then normalized to yield a distribution. For instance, if the target lies at an ℓ₁-distance of less than 3 grid squares from the agent, the target's actual position receives a likelihood score of 400, cells within an ℓ₁-distance of 2 grid squares receive a score of 200, and cells within an ℓ₁-distance of 4 grid squares receive a score of 30.
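A sketch of this construction follows. The 400/200/30 values come from the text, but their exact placement (around the target's true cell, sharpened when the agent is close to the target) is an assumption here, since the full score table is the paper's Table 2.

```python
import numpy as np

def observation_likelihood(agent, target, grid=10):
    """Observation distribution over grid cells for one agent (sketch).

    Scores concentrate around the target's true cell, and the true cell
    gets an extra boost (400) when the agent is close to the target.
    Score placement is an assumption; cf. Table 2 in the paper.
    """
    ys, xs = np.mgrid[0:grid, 0:grid]
    d_true = np.abs(ys - target[0]) + np.abs(xs - target[1])
    d_agent = abs(agent[0] - target[0]) + abs(agent[1] - target[1])
    score = np.ones((grid, grid))
    score[d_true <= 4] = 30.0
    score[d_true <= 2] = 200.0
    if d_agent < 3:                      # nearby agents pinpoint the true cell
        score[target] = 400.0
    return score / score.sum()
```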
Reward function: An agent receives a reward of 1 if it hits the position of the target, and a reward of 0.2 if the ℓ₁-distance between the predicted and actual target locations is less than 3 grid units; otherwise, it receives 0 reward. Agents do not know the reward model and use the instantaneous rewards instead.
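This reward rule translates directly into code; the function name is illustrative:

```python
def reward(hit, target):
    """Reward model from the text: 1 for an exact hit, 0.2 for a near
    miss (l1-distance below 3 grid units), 0 otherwise."""
    d = abs(hit[0] - target[0]) + abs(hit[1] - target[1])
    return 1.0 if d == 0 else (0.2 if d < 3 else 0.0)
```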
Policy: We fix the policy that the agents evaluate to be the maximum a posteriori policy; namely, each agent hits the location corresponding to the maximum entry of its belief vector.
We use the belief vectors directly as features, i.e., φ is the identity transformation. We set α = 0.1, ρ = 0.0001, and β = K = 8, and average over 3 different realizations in all cases. In Fig. 3(b), the average mean-square distance to the network centroid, i.e., (1/K) Σ_{k∈K} E‖w_{k,i} − w̄_i‖², is plotted over time for the fully decentralized strategy. Confirming Theorem 2, the agreement error rapidly decreases and converges to a small value. In Fig. 3(c), we plot the evolution of the average squared Bellman error (SBE) in the log domain, where the SBE is the network average of the squared instantaneous TD errors (and similarly for the centralized cases). It can be seen that all approaches converge and, in particular, the diffusion strategy (Algorithm 2) yields performance comparable to CD (Algorithm 3). This observation is in line with Theorem 3, which states that the disagreement between the fully decentralized strategy and the baseline centralized-training-for-decentralized-execution strategy is bounded. Notice also that CC (Algorithm 1) results in a higher SBE than both the diffusion strategy and CD, despite being fully centralized. This is because CC evaluates a different policy, namely, the centralized execution policy. Therefore, as argued in Section V-C, the SBE of CC is not a suitable baseline for the diffusion strategy.
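The performance metric used in these plots, the network average of squared instantaneous TD errors, can be sketched as follows. The stacking convention (one row per agent) and the function name are illustrative:

```python
import numpy as np

def squared_bellman_error(W, Phi, Phi_next, r, gamma=0.9):
    """Network-average squared instantaneous TD error (SBE sketch).

    W:        K x M agent parameters.
    Phi:      K x M belief features before the transition.
    Phi_next: K x M belief features after the transition.
    r:        length-K vector of instantaneous rewards.
    """
    deltas = r + gamma * (Phi_next * W).sum(axis=1) - (Phi * W).sum(axis=1)
    return np.mean(deltas ** 2)
```

An SBE of zero means every agent's linear value estimate is self-consistent on the observed transition; the simulations plot this quantity in the log domain.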

VII. CONCLUDING REMARKS
In this paper, we proposed a policy evaluation algorithm for Dec-POMDPs over networks. We carried out a rigorous analysis establishing that: (i) the beliefs formed with local information and interactions have bounded disagreement with the global posterior distribution, (ii) the agents' value function parameters cluster around the network centroid, and (iii) decentralized training can match the performance of centralized training with appropriate parameters and increasing network connectivity. Two limitations of the current work can be addressed in future research. First, we assume that agents know the local likelihood and transition models accurately; a natural question is how approximation errors in these models would affect the analytical results. Second, an implication of Theorem 3 is that regularization is necessary (ρ > 0). We leave the question of whether one can obtain bounds that do not require this, possibly under additional model assumptions, to future work.

A. PROOF OF THEOREM 1
We can rewrite the risk function as follows, where (a) follows from definition (9), and (b) follows from the definition of the conditional expectation with respect to s_i given F_i. Merging the diffusion adaptation step (37) and the combination step (38) yields a combined form which, together with the update (22) for the centralized solution, results in (80). Here, we have introduced the marginal distribution of the new observation given the past observations and actions. First, observe that the expectation of the log-likelihood ratio terms in (80) satisfies a bound where in (a) we used the spatial independence of the observations. Second, the expectation of the time-adjusted terms in (80) can be rewritten by defining, in (a), an agent-specific distribution, where (b) follows from the fact that the arguments are deterministic given the current state and the history of actions and observations. The first term of (83) can be written as a KL divergence, and this expected KL divergence can be bounded using the strong data-processing inequality [75]. The second term of (83) arises from the transition model disagreement with the centralized belief. To bound it, we first introduce the LogSumExp function f with vector arguments ν ∈ ℝ^S, whose gradient is given by (88). Observe that if we define two suitable vectors, we can rewrite the second expression of (83) as a difference of LogSumExp terms. Applying the mean value theorem to this difference yields (92) for some intermediate vector lying between the two.
The term in (92) is bounded as follows, where (a) follows from Jensen's inequality, (b) follows from Hölder's inequality, and (c) follows from the bound on the gradient of the LogSumExp function. Furthermore, due to Assumption 1 and the fact that the maximum number of hops outside N_k is (K − |N_k|), we obtain (95). Combining (86), (91), and (95), the expectation of the time-adjusted terms in (80) can be bounded as in (96). Next, we bound the expectation of the remaining normalization terms in (80), following steps similar to [49], where (a) follows from the arithmetic-geometric mean inequality and (b) follows from introducing a quantity that is a valid density (or mass function, if the observations are discrete) since it integrates to one. Notice that the expression in (97) can be rewritten using the LogSumExp function f from (87) together with the definitions (102) and (103). Following the steps in (92) and (93), this difference can be bounded as in (104). Moreover, by the assumptions on the graph topology (27) and on the likelihood functions (46), this expression can be further bounded as in [49], yielding (105). Subsequently, inserting the bounds (82), (96), and (105) into (80), we arrive at a recursive bound on the risk function. Expanding this recursion over time shows that, if κ(T) < 1, the risk function remains bounded as i → ∞. By (96), this also implies the bound on the second risk.

B. PROOF OF COROLLARY 1
In view of the Bretagnolle-Huber inequality [76], the bound (110) holds. Taking the expectation of both sides, we obtain (111), where (a) and (b) follow from Jensen's inequality. Together with Theorem 1, this implies (112), where we use the definition (55). Furthermore, since the ℓ₂-norm is no greater than the ℓ₁-norm in ℝ^S, the bound (113) also holds. By similar arguments, one can show (114), where we use the definition (56).
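The Bretagnolle-Huber step can be checked numerically. In the form TV(p, q) ≤ sqrt(1 − exp(−KL(p‖q))), which is the variant commonly used to pass from KL-type risks to total-variation bounds, a small sketch (function names are illustrative):

```python
import numpy as np

def tv(p, q):
    """Total-variation distance between two discrete distributions."""
    return 0.5 * np.abs(p - q).sum()

def kl(p, q):
    """KL divergence D(p || q) for strictly positive distributions."""
    return float(np.sum(p * np.log(p / q)))

def bretagnolle_huber_gap(p, q):
    """Slack of the bound TV <= sqrt(1 - exp(-KL)); a nonnegative
    slack confirms the inequality on the given pair."""
    return np.sqrt(1.0 - np.exp(-kl(p, q))) - tv(p, q)
```

Unlike Pinsker's inequality, this bound stays below 1 even for large KL divergence, which is why it is convenient for converting the risk bounds of Theorem 1 into norm bounds.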

C. PROOF OF LEMMA 1
Inserting the definitions (59) and (63), the expected difference can be expanded as in (115), where the last step follows from the triangle inequality. The first term can be bounded using Assumption 2 in step (a); taking expectations and using (53) and (116), the bound (117) follows. Similarly, the second term in (115) can be bounded, where (a) again follows from Assumption 2; using (53) and (54), we obtain (119). Combining (117) and (119), together with the relation between the constants (55) and (56) (which holds since κ(T) < 1), yields the claim.

D. PROOF OF THEOREM 2
For compactness of notation, we introduce quantities that collect variables from across all agents, so that equations (40)–(42) can be written as a single network recursion. Moreover, we define the K-times extended centroid vector in (127). Decomposing the network feature matrix into its centralized component and the corresponding disagreement matrix, we obtain (128), where the last step follows from the stated identity. Furthermore, taking norms on both sides of (128) leads to (130). Since the combination matrix C is a primitive stochastic matrix, it follows from the Perron-Frobenius theorem [53], [77] that its largest eigenvalue is 1 and all other eigenvalues are strictly smaller than 1 in absolute value. Moreover, C is assumed to be symmetric, so it admits the eigenvalue decomposition C = UΛUᵀ, where U is the orthogonal matrix of eigenvectors {u_k} and Λ is the diagonal matrix of eigenvalues. Additionally, the powers of C converge (because C is primitive) to the scaled all-ones matrix (1/K)𝟙𝟙ᵀ (because C is doubly stochastic). Therefore, the difference of these matrices satisfies ‖C − (1/K)𝟙𝟙ᵀ‖ = λ₂, where λ₂ is the second largest eigenvalue modulus of C. Moreover, the Kronecker product with the identity matrix does not change the spectral norm, so the same bound applies to the extended matrices. We also know from Lemma 1 that the feature disagreement is bounded in expectation. Additionally, in Appendix F we establish the bounds (135)–(138), which hold for any realization (with probability one). From (161), note the corresponding norm bound; in addition, Lemma 2 and expression (162) provide bounds on the iterates. Inserting these results into (130) yields the norm recursion (139). Let us define the constant λ̃₂ ≜ λ₂(1 − 0.08αγB_φL_φ). Iterating (139) over time, we arrive at (140), where (a) holds whenever the step-size condition (141) is satisfied, with c an arbitrary constant.
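The spectral step used above, namely that ‖C − (1/K)𝟙𝟙ᵀ‖ equals the second largest eigenvalue modulus λ₂ for a symmetric doubly stochastic C, can be checked numerically; this sketch assumes such a C is given:

```python
import numpy as np

def mixing_rate(C):
    """Second largest eigenvalue modulus of a symmetric doubly
    stochastic matrix C, computed as the spectral norm of the
    difference C - (1/K) 11^T, matching the step in the proof."""
    K = C.shape[0]
    return np.linalg.norm(C - np.full((K, K), 1.0 / K), 2)
```

Subtracting the rank-one averaging matrix removes the Perron eigenvalue 1 (whose eigenvector is the all-ones vector), so the spectral norm of the remainder is exactly λ₂.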

E. PROOF OF THEOREM 3
We begin by rewriting the baseline strategy recursion (72)–(73) in a compact form, where the centralized feature matrix is defined in (63). We then introduce the K-times extended versions of the relevant vectors, so that the baseline recursion (142) transforms into (145). From the extended network centroid definition (127) and (145), the relation (146) follows, where we used the stated identities. Next, we define the average agent disagreement relative to the baseline term. Subsequently, taking the norm of both sides of (146) and applying the triangle inequality, we obtain a recursive bound. First, observe that the leading term is bounded; moreover, from Assumption 2 and Corollary 1, the remaining cross terms are bounded accordingly. Using the same bounds (135)–(138) from Appendix D for the other terms (which are established in Lemma 1, Lemma 2, (161), and (162)), we arrive at the final recursion. Iterating over time yields the claim, where step (a) holds whenever the step-size condition is satisfied.

F. AUXILIARY RESULTS
In the following lemma, we show that the value function parameters are bounded in norm.

Lemma 2 (Bounded parameters): For each agent k ∈ K, the iterate w_{k,i} is bounded in norm with probability one if ρ > γB_φL_φ/√2. In particular, a uniform bound holds if ρ ≥ 0.75γB_φL_φ.

Proof: Taking norms on both sides of (126) yields a recursion, where (a) follows from the fact that the singular values of doubly stochastic matrices are at most one. Note that (161) holds, where (a) follows from the equality of the spectral norm and the maximum eigenvalue for symmetric matrices, (b) follows from Assumption 2, and (c) follows from the fact that the mean-square distance cannot exceed 2 over the probability simplex. The upper bound in (161) is smaller than 1 whenever ρ > γB_φL_φ/√2. Moreover, if ρ ≥ 0.75γB_φL_φ, we obtain a contraction. Iterating this recursion from i = 0 gives the stated bound, where the last step holds whenever the step-size condition is satisfied.