State Aggregation for Distributed Value Iteration in Dynamic Programming

We propose a distributed algorithm to solve a dynamic programming problem with multiple agents, where each agent has only partial knowledge of the state transition probabilities and costs. We provide consensus proofs for the presented algorithm and derive error bounds on the obtained value function with respect to the “true solution” obtained from conventional value iteration. To minimize communication overhead between agents, state costs are aggregated and shared between agents only when the updated costs are expected to influence the solution of other agents significantly. We demonstrate the efficacy of the proposed distributed aggregation method on a large-scale urban traffic routing problem. Individual agents compute the fastest route to a common access point and share local congestion information with other agents, allowing for fully distributed routing with minimal communication between agents.


I. INTRODUCTION
Value iteration is a well-established method for solving dynamic programming problems, yet it exhibits scalability issues for applications with a large state space. To this end, state aggregation can be used to reduce the number of states that need to be considered. The aggregation approach has a long history in scientific computing, with applications ranging from the improvement of Galerkin methods [1] to solving large-scale optimization models [2] and dynamic programming [3]. In [4] it is shown how the aggregation approach can be used in conjunction with value iteration. However, for problems with large decision spaces and with costs and transition probabilities that may evolve over time, conventional aggregation is still hampered by scalability issues. We therefore propose a multi-agent value iteration algorithm that utilizes aggregation to minimize communication overhead and allows for solving dynamic programming problems where each agent has only partial knowledge of the state transition costs and probabilities.
Value iteration has been extended to a multi-agent framework for the class of problems that contain a joint decision space [5], yet this approach imposes no restrictions on the agents' knowledge of the underlying transition probabilities and costs. Such restrictions have been considered for small-scale problems as in [6]; however, that approach requires a centralized critic for estimating the expected team benefit in a non-cooperative setting.
The paper is accepted in IEEE Control Systems Letters as N. Vertovec and K. Margellos, "State Aggregation for Distributed Value Iteration in Dynamic Programming". The copyright of the published version is transferred to IEEE, and the published manuscript can be found at https://ieeexplore.ieee.org/document/10149480. The authors are with the Department of Engineering Science, University of Oxford. Email: {nikolaus.vertovec, kostas.margellos}@eng.ox.ac.uk

The proposed framework differs from many multi-agent reinforcement learning approaches that share a weighted average of the agents' locally estimated values. Rather than considering a set of global states shared by all the agents in conjunction with a local reward observed by individual agents as in [7], [8], we consider a Markov Decision Process (MDP) partitioned among agents, with the objective of determining the global value function in a distributed manner without disclosing information on the transition probabilities and costs among agents. This paper makes the following main contributions: (i) We propose a distributed value iteration algorithm for multi-agent dynamic programming that prevents transition probabilities and state costs from becoming global information among agents; (ii) We provide a rigorous mathematical analysis of the convergence and optimality properties of the proposed algorithm, merging tools from multi-agent consensus and dynamic programming principles; and (iii) We demonstrate our scheme on a non-trivial traffic routing problem.
The rest of the paper is organized as follows. Section II provides some mathematical preliminaries and introduces the distributed multi-agent value iteration algorithm. Section III provides the main statements and associated mathematical analysis supporting the proposed algorithm. In Section IV we introduce a traffic routing problem and show how our proposed approach is able to solve the problem in a distributed manner. Finally, Section V provides concluding remarks and directions for future work.

II. PROBLEM SETUP

A. Multi-agent Markov Decision Processes
We first introduce a standard MDP, which is then extended to a multi-agent setting for the distributed algorithm presented in the sequel. We consider a finite set of $n$ states denoted by $X$ and let $g(i, u, j)$ be the cost-to-go from state $i \in X$ to state $j \in X$ using the input $u \in U(i)$, where $U(i)$ is a finite set of actions/decisions available at state $i$. Furthermore, let $p(i, u, j)$ be the probability of transitioning from state $i$ to state $j$ using the input $u$, with $\sum_{j=1}^{n} p(i, u, j) = 1$. The objective is to solve a global discounted infinite horizon dynamic programming problem. The associated Bellman equation is given by
$$J^*(i) = \min_{u \in U(i)} \sum_{j=1}^{n} p(i, u, j)\,[g(i, u, j) + \alpha J^*(j)],$$
where $\alpha \in [0, 1)$ is a discount factor, and $J^*$ denotes the optimal value function that satisfies the Bellman identity. Solving the Bellman equation using conventional value iteration or policy iteration requires knowledge of all transition probabilities as well as the associated costs.
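As a point of reference for the centralized baseline, conventional value iteration on the Bellman equation above can be sketched as follows. This is a minimal Python sketch, not the authors' implementation: the function name `value_iteration`, the dictionary layout of `p` and `g`, the stopping tolerance, and the two-state toy problem are all assumptions made for illustration.

```python
def value_iteration(n, actions, p, g, alpha, tol=1e-10, max_iter=10_000):
    """Conventional value iteration for the discounted Bellman equation.

    actions[i] lists the inputs available at state i; p[(i, u)] and g[(i, u)]
    are length-n lists holding p(i, u, j) and g(i, u, j) for j = 0, ..., n-1.
    (Hypothetical data layout, chosen only for this sketch.)
    """
    J = [0.0] * n
    for _ in range(max_iter):
        # Apply the Bellman operator to every state using the previous iterate.
        J_new = [
            min(
                sum(p[i, u][j] * (g[i, u][j] + alpha * J[j]) for j in range(n))
                for u in actions[i]
            )
            for i in range(n)
        ]
        if max(abs(a - b) for a, b in zip(J_new, J)) < tol:
            return J_new
        J = J_new
    return J


# Toy problem (assumed): state 1 is absorbing and cost-free; from state 0 one
# can either "go" to state 1 at cost 1 or "stay" at cost 0.5 per step.
actions = {0: ["go", "stay"], 1: ["stay"]}
p = {(0, "go"): [0.0, 1.0], (0, "stay"): [1.0, 0.0], (1, "stay"): [0.0, 1.0]}
g = {(0, "go"): [0.0, 1.0], (0, "stay"): [0.5, 0.0], (1, "stay"): [0.0, 0.0]}
J = value_iteration(2, actions, p, g, alpha=0.9)
```

Note that the sketch needs every entry of `p` and `g`, which is exactly the global knowledge the distributed setting below avoids.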
We consider the setting where a set of $q$ agents collaborate to solve the aforementioned infinite horizon dynamic programming problem in a distributed manner, with only partial knowledge of the state transition probabilities and costs, while minimizing communication between agents. As such, we partition the state space $X$ into $q$ subsets $I_\ell \subset X$, such that for all $m, \ell \in \{1, \dots, q\}$ with $m \neq \ell$, $I_\ell \cap I_m = \emptyset$. Each agent knows only the state transition probabilities and costs for transitions originating within its state subset $I_\ell$, i.e., for each agent $\ell$, $g_\ell : I_\ell \times U \times X \to \mathbb{R}$ and $p_\ell : I_\ell \times U \times X \to [0, 1]$. Using conventional value iteration would require each agent to share its knowledge of the transition probabilities and costs, resulting in significant communication overhead. In the subsequent section, we instead introduce an alternative method relying on state aggregation that allows solving the discounted infinite horizon dynamic programming problem under consideration in a distributed manner, where some tentative information is exchanged only with neighboring agents.

B. Distributed Value Iteration
We start by aggregating the value function (expected optimal cost-to-go) to construct, for each agent $\ell = 1, \dots, q$, the aggregate value, defined as $r_{\ell,\ell} = \sum_{i \in I_\ell} d_{\ell i} J(i)$, where $J$ constitutes an approximation of the optimal value function of the Bellman equation, and $d_{\ell i}$ is the so-called disaggregation probability (encoding the contribution of each agent's value function to the aggregate one), satisfying for all $\ell = 1, \dots, q$, $\sum_{i \in I_\ell} d_{\ell i} = 1$ and $d_{\ell i} = 0$ for all $i \notin I_\ell$. The aggregate values for each agent, $r_{\ell,\ell}$, can be combined into a common vector denoted by $r_\ell$; tentative values for this vector will be communicated across all agents. Thus we define $r_\ell = [r_{\ell,1}, \dots, r_{\ell,\ell}, \dots, r_{\ell,q}]^T$. In the sequel, we will define an iterative scheme according to which each agent $\ell = 1, \dots, q$ updates its estimate of the vector $r_\ell$; we denote this estimate at iteration $k$ by $r_\ell^k$, where its $m$-th element is indicated by $r_{\ell,m}^k$, $m = 1, \dots, q$. Next, we introduce the aggregation probability $\phi_{j\ell}$, satisfying for all $\ell = 1, \dots, q$, $\phi_{j\ell} = 1$ if $j \in I_\ell$ and $\phi_{j\ell} = 0$ otherwise. Such a formulation of the aggregation probability is known as hard aggregation in the literature [9, p. 311].
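The aggregate value and the hard-aggregation probabilities can be illustrated with a few lines of Python. This is a sketch under stated assumptions: the helper names `aggregate_value` and `hard_aggregation`, the dictionary representation of the disaggregation probabilities, and the three-state example are hypothetical, and the hard-aggregation rule used here is the usual convention that a state has aggregation probability 1 for the partition containing it and 0 otherwise.

```python
def aggregate_value(J, d_l):
    """Aggregate value r_{l,l} = sum over i in I_l of d_{l,i} * J(i).

    d_l maps each state i in I_l to its disaggregation probability; states
    outside I_l are simply absent (probability 0).
    """
    if abs(sum(d_l.values()) - 1.0) > 1e-12:
        raise ValueError("disaggregation probabilities must sum to 1")
    return sum(w * J[i] for i, w in d_l.items())


def hard_aggregation(partition, n):
    """phi[j][l] = 1 if state j lies in partition l, and 0 otherwise."""
    phi = [[0.0] * len(partition) for _ in range(n)]
    for l, states in enumerate(partition):
        for j in states:
            phi[j][l] = 1.0
    return phi


# Assumed example: three states, agent 0 owns {0, 1}, agent 1 owns {2}.
J = [2.0, 4.0, 10.0]
partition = [[0, 1], [2]]
d_0 = {0: 0.5, 1: 0.5}             # uniform weights over agent 0's states
r_00 = aggregate_value(J, d_0)     # 0.5 * 2.0 + 0.5 * 4.0
phi = hard_aggregation(partition, n=3)
```

With these weights, agent 0 summarizes its two states by a single number, which is the only quantity it would need to communicate.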
To solve the discounted infinite horizon dynamic programming problem, we propose a distributed algorithm whose pseudocode is given in Algorithms 1 and 2. The proposed algorithm involves an agent-to-agent communication protocol. At each algorithm iteration $k \geq 0$ we consider the directed communication graph $(V, E(k))$, where the node set $V = \{1, \dots, q\}$ includes the agents and the set $E(k)$ contains the directed edges $(m, \ell)$, indicating that at iteration $k$ agent $\ell$ can receive information from agent $m$.
We make the fairly standard assumption in the existing literature on the communication structure between agents [8], [10]:

Assumption 2.1: [Connectivity and Communication]
There exists a positive integer $B$ such that $(V, E_B(k))$ is fully connected for each iteration $k$.
Assumption 2.1 implies that for any agent pair $(\ell, m)$ there is a direct link at least once every $B$ iterations. This removes the need for agents to share information with all other agents at all iterations, as well as the need for a central authority.
Initially in Algorithm 1, each agent $\ell$, $\ell = 1, \dots, q$, starts with some tentative value of the aggregate vector $r_\ell^0$ and of the local value function $V_\ell^0$. The initialization choice is arbitrary. Utilizing the aggregate vector $r_\ell^k$ at iteration $k \geq 0$, each agent $\ell$ locally and in parallel executes Algorithm 2 so as to construct the updated local value function $V_\ell^{k+1} : I_\ell \to \mathbb{R}$, which will, in turn, yield an updated aggregated cost $r_{\ell,\ell}^{k+1} \in \mathbb{R}$ (Algorithm 1, Step 8).
To minimize communication overhead, an updated aggregated cost $r_{\ell,\ell}^{k+1}$ is only shared with other agents in the set $N_\ell(k)$ (the neighbors of $\ell$ at iteration $k$) when the aggregated cost has changed significantly, i.e., $\|r_{\ell,\ell}^{k+1} - r_{\ell,\ell}^{\mathrm{prev}}\| > C_{\mathrm{threshold}}$, where $r_{\ell,\ell}^{\mathrm{prev}}$ is the aggregated cost previously broadcast to other agents and $C_{\mathrm{threshold}}$ is a communication threshold (Algorithm 1, Steps 11-13), or when agent $\ell$ and a neighboring agent $m$ have not communicated in the last $B$ iterations (see Step 11; this ensures a bounded intercommunication time as required by Assumption 2.1). For further discussion of communication thresholds that limit the transmission of insignificant data, we refer to [11].
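The transmission rule described above can be sketched as a small predicate. The function name `should_send` and the bookkeeping variable `last_contact` (the most recent iteration at which the two agents communicated) are assumptions for illustration and do not come from the paper's pseudocode.

```python
def should_send(r_new, r_prev, threshold, k, last_contact, B):
    """Decide whether an agent sends its aggregated cost to a neighbor at iteration k.

    Sends if the aggregated cost moved by more than the communication threshold
    since the last broadcast, or if the pair has had no contact during
    iterations k-B+1, ..., k (i.e. last_contact <= k - B).
    """
    changed = abs(r_new - r_prev) > threshold
    stale = (k - last_contact) >= B
    return changed or stale


# Small change and recent contact: nothing is sent.
quiet = should_send(1.05, 1.0, threshold=0.1, k=3, last_contact=2, B=5)
# Significant change: the new aggregated cost is sent.
significant = should_send(1.2, 1.0, threshold=0.1, k=3, last_contact=2, B=5)
# No change, but B iterations without contact: sent to keep the link alive.
overdue = should_send(1.0, 1.0, threshold=0.1, k=7, last_contact=2, B=5)
```

The second condition is what keeps the intercommunication time bounded even when the aggregated cost has stopped changing.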
Until convergence is reached, each agent will repeat the execution of Algorithm 2 and the subsequent communication step. Note that at iteration $k$, the vector $r_\ell^k$ may contain aggregate values from other agents received at prior time steps (Algorithm 1, Step 17). It will be shown in the proof of Theorem 3.1 that convergence will nevertheless be reached.
We now turn to the update of $V_\ell^k$ and $r_{\ell,\ell}^k$, performed when calling Algorithm 2. Each agent uses the local value function, $V_\ell^k$, to represent the cost-to-go function at each state $i \in I_\ell$, and the aggregated cost, $r_\ell^k$, to approximate the cost for states $j \in X \setminus I_\ell$. The local value function is iteratively updated for each state $i \in I_\ell$ (Algorithm 2, Steps 2-6). This update follows standard value iteration: it chooses the control action at state $i$ that minimizes the expected cost-to-go (Algorithm 2, Step 4) and uses a Gauss-Seidel update on the local value function to improve convergence and minimize memory usage (Algorithm 2, Step 5). We perform an update only to the local value function since we restrict ourselves to states for which sufficient knowledge of the transition probabilities and costs is available. After the local value function has been updated, the updated local aggregated cost, $r_{\ell,\ell}^{k+1}$, is computed (Algorithm 2, Step 7). Notice that Algorithms 1 and 2 prevent disclosing the transition probabilities to all agents; in contrast, each agent $\ell$ has access only to the probabilities and costs associated with transitions originating from its partition $I_\ell$, $\ell = 1, \dots, q$.

Algorithm 1 (fragment)
    ...
    for each agent $\ell \in \{1, \dots, q\}$ do
8:      ...
        if $\|r_{\ell,\ell}^{k+1} - r_{\ell,\ell}^{\mathrm{prev}}\| > C_{\mathrm{threshold}}$,
11:     or if $(\ell, m) \notin \bigcup_{i=k-B+1}^{k} E(i)$ then
12:         Send $r_{\ell,\ell}^{k+1}$ to all agents $m \in N_\ell(k)$
    ...

Algorithm 2 (fragment)
    ...
    return $V_\ell^{k+1}$, $r_{\ell,\ell}^{k+1}$
9:  end function
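The local update described in the text (minimize over actions, overwrite the local value in place, then re-aggregate) might look roughly as follows in Python. This is a sketch of one possible reading of that description, not a reproduction of Algorithm 2: the function name `local_update`, the sparse dictionary layout of `p` and `g`, the `owner` map from states to agents, and the toy chain are all assumptions. It also assumes hard aggregation, so that the cost of a state outside the agent's partition is approximated by the aggregated cost of the agent owning it.

```python
def local_update(states_l, actions, p, g, alpha, V, r, owner, d_l):
    """One Gauss-Seidel sweep over an agent's own states, then re-aggregation.

    V (dict: local state -> value) is updated in place. r[m] is this agent's
    current estimate of agent m's aggregated cost, owner[j] is the agent whose
    partition contains state j, and p[(i, u)] / g[(i, u)] map successor states
    j to p(i, u, j) / g(i, u, j).
    """
    local = set(states_l)
    for i in states_l:
        best = float("inf")
        for u in actions[i]:
            total = 0.0
            for j, prob in p[i, u].items():
                # Local successors use V; foreign successors use aggregated costs.
                future = V[j] if j in local else r[owner[j]]
                total += prob * (g[i, u][j] + alpha * future)
            best = min(best, total)
        V[i] = best  # in-place write: states later in the sweep see the new value
    r_ll = sum(w * V[i] for i, w in d_l.items())
    return V, r_ll


# Assumed toy chain 0 -> 1 -> 2: agent 0 owns states 0 and 1, agent 1 owns state 2.
actions = {0: ["a"], 1: ["a"]}
p = {(0, "a"): {1: 1.0}, (1, "a"): {2: 1.0}}
g = {(0, "a"): {1: 1.0}, (1, "a"): {2: 2.0}}
V, r_00 = local_update(
    states_l=[1, 0], actions=actions, p=p, g=g, alpha=0.5,
    V={0: 0.0, 1: 0.0}, r=[0.0, 4.0], owner={0: 0, 1: 0, 2: 1},
    d_l={0: 0.5, 1: 0.5},
)
```

In the toy chain, state 1 is swept first, so state 0 already benefits from the refreshed value of state 1 within the same sweep, which is the Gauss-Seidel effect mentioned above.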

III. ALGORITHM ANALYSIS
The following theorem is the main result of this section.
Theorem 3.1: Consider Assumption 2.1. Algorithm 1 converges asymptotically to a common value for $r$ among agents, i.e., there exists $\bar{r}$ such that for all $\ell = 1, \dots, q$, $\lim_{k \to \infty} \|r_\ell^k - \bar{r}\| = 0$. Moreover, for all $\ell = 1, \dots, q$ and for all $i \in X$, $V_\ell^k(i)$ converges to some $V_\ell^*(i)$.
Proof: We show convergence when agents transmit $r_{\ell,\ell}^k$, $\ell = 1, \dots, q$, irrespective of whether this deviates by more than $C_{\mathrm{threshold}}$ from their previously transmitted cost. This is without loss of generality, as in this case the resulting sequence of transmitted costs would form a subsequence of $\{r_{\ell,\ell}^k\}_{k=0}^{\infty}$; hence it will be convergent, as we will show that $\{r_{\ell,\ell}^k\}_{k=0}^{\infty}$ converges. Fix any $k \geq B$ and any $\ell \in \{1, \dots, q\}$. Let $\epsilon_{\ell,m}^k = r_{\ell,m}^{k+1} - r_{\ell,m}^k$, $\epsilon_{\ell,\infty}^k = \max_{m=1,\dots,q} |r_{\ell,m}^{k+1} - r_{\ell,m}^k|$, and $\epsilon_\infty^k = \max_{\ell=1,\dots,q} \sum_{n=k-B}^{k} \epsilon_{\ell,\infty}^n$, i.e., $\epsilon_\infty^k$ is the maximum among agents of the cumulative incremental update over the most recent $B + 1$ iterations; recall that due to Assumption 2.1 this is the window within which agent $\ell$ communicates with other agents at least once.
Consider the update to the value $V_\ell^k(i)$, for an arbitrary iteration $k$ and agent $\ell = 1, \dots, q$, which we denote by $\delta_\ell^k(i)$ and bound in (1). There, the first equality follows from the definition of $V_\ell^k$, and the second one from a rearrangement after substituting $V_\ell^k(j) = V_\ell^{k-1}(j) + \delta_\ell^{k-1}(j)$ and $r_{\ell,m}^k = r_{\ell,m}^{k-1} + \epsilon_{\ell,m}^{k-1}$. The inequality follows from the definition of $V_\ell^k$ and by considering the maximum over $u \in U(i)$.
For each $\ell = 1, \dots, q$, we now define the maximum update of $|\delta_\ell^k|$ over all states $i \in I_\ell$ as $\delta_\ell^k = \max_{i \in I_\ell} |\delta_\ell^k(i)|$. It then follows that for all $i \in I_\ell$, where the first inequality follows from the definition of $\delta_\ell^k$, and the equality follows from the fact that where the inequality holds since (1) holds for all $i \in I_\ell$, and the exchange of the summation and the maximization operator is valid since all quantities are non-negative. At the same time, where the last inequality follows from (2), and the fact that the bound in (2) is independent of $\ell$.
Theorem 3.1 implies consensus among agents to a common $\bar{r}$, and also establishes convergence of $\{V_\ell^k(i)\}_{k \geq 0}$ to some $V_\ell^*(i)$. Moreover, the convergence rate is linear since $\max\{\delta_\infty^k, \epsilon_\infty^k\}$ is shown to be contractive in the proof of Theorem 3.1. The exact convergence rate for $\{|r_{\ell,m}^{k+1} - r_{\ell,m}^k|\}_{k \geq 0}$ will depend on $B$.
Next, we consider error bounds between the limiting V * ℓ (i) and the optimal J * (i), satisfying the Bellman equation. For the subsequent result we assume that B = 0, i.e., agents communicate with all other agents at all iterations.
Proof: The proof is inspired by [4]. Fix $\ell = 1, \dots, q$ and $i \in X$. We only show that $V_\ell^*(i) \leq J^*(i) + \frac{\alpha\delta}{1-\alpha}$, as the other side in (5) follows from symmetric arguments. To this end, for each $i \in I_\ell$, define $V_\ell(i) := J^*(i) + \frac{\alpha\delta}{1-\alpha}$. It follows then that the aggregate of $V_\ell(i)$ can be constructed as where the inequality holds since $J^*(j) \leq \min_{i \in I_\ell} J^*(i) + \delta$, while the second last equality is due to $\sum_{j \in I_\ell} d_{\ell j} = 1$.
Denote the Bellman operator induced by the Bellman equation as $T$, such that we can compactly write the Bellman equation as $J^*(i) = (T J^*)(i)$, where by $(T J^*)(i)$ we mean the right-hand side of the Bellman equation, which depends on $i$ and on $J^*(j)$ for all $j = 1, \dots, n$. For all $i = 1, \dots, n$, we now consider the application of the Bellman operator to $V_\ell$, i.e., where the first inequality is due to the definition of $V_\ell$ and utilizes $r_{\ell,m} = r_{m,m}$, which holds since $B = 0$. The second inequality follows since $\min_{j \in I_m} J^*(j) \leq J^*(j)$ for all $j \in X$, and $\frac{\alpha\delta}{1-\alpha} \leq \frac{\delta}{1-\alpha}$, which in turn allows us to combine the terms multiplied with $\phi_{j\ell}$ into the last summation. The last inequality follows from $\sum_{m=1}^{q} \phi_{jm} = 1$, while the second last equality is due to the Bellman equation.
By (7) we have that $(T V_\ell)(i) \leq V_\ell(i)$ for all $i = 1, \dots, n$, which implies that $\{V_\ell^k(i)\}_{k \geq 0}$ is a non-increasing sequence. Moreover, $T$ is contractive, and as such the sequence will converge to its (unique) fixed point; a direct consequence of Theorem 3.1 is that this fixed point is $V_\ell^*(i)$ (as the latter was constructed by successive applications of the Bellman operator). Therefore, for all $i = 1, \dots, n$, $V_\ell^*(i) \leq J^*(i) + \frac{\alpha\delta}{1-\alpha}$, concluding the proof.

It follows from Theorem 3.2 that if the aggregation sets $I_1, \dots, I_q$ are chosen such that the cost function $J^*$ is expected to vary moderately between states within an aggregation set, then the maximum error compared to $V_1, \dots, V_q$ will be moderate as well. One method to aggregate states with similar costs is to use feature-based aggregation, whereby the aggregation is performed on a set of representative features instead of representative states. This form of aggregation has been thoroughly investigated; we refer to [9, p. 322], [12, p. 68] and references therein. For applications with heavy discounting, i.e., $\alpha \ll 1$, the maximum error will also be moderate.
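To get a feel for the magnitude of the bound, the following worked arithmetic evaluates the term $\alpha\delta/(1-\alpha)$ appearing in the proof above for two discount factors. The value $\alpha = 0.9$ is the one used in the simulations; $\alpha = 0.1$ is an illustrative assumption, and $\delta$ is left symbolic.

```latex
\begin{align*}
\alpha = 0.9: &\quad \frac{\alpha\,\delta}{1-\alpha} = \frac{0.9\,\delta}{0.1} = 9\,\delta, \\
\alpha = 0.1: &\quad \frac{\alpha\,\delta}{1-\alpha} = \frac{0.1\,\delta}{0.9} \approx 0.11\,\delta.
\end{align*}
```

With heavy discounting the term is a small fraction of $\delta$, whereas for $\alpha$ close to 1 it grows like $1/(1-\alpha)$, so choosing aggregation sets with small $\delta$ matters more in that regime.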

A. Simulation Set-up
We demonstrate the proposed algorithm in a traffic routing case study. To this end, we begin by modeling the traffic network as a graph, with vertices representing junctions and edges representing roads connecting junctions. An illustration of such a graph representation for the Oxford road network is shown in Figure 1.
We consider a low-energy antenna with a limited range and computation power to be situated at the center of each network partition, gathering the current speed and location of nearby vehicles. The transition cost of an edge is the expected travel time along that edge and is computed based on the data where u is used to determine the selected edge between node i and j. Due to the nature of the low-energy application, we utilize the proposed algorithm to limit the amount of data that needs to be sent over longer distances. As is common in transit node routing, vehicles are trying to reach a common access node, such as a freeway used for long-distance routing. The goal is to find the fastest path to the access point, thus the cost at each vertex is the discounted expected travel time to the access point. In our example related to the Oxford traffic network, the access point will be London Road, leading long-distance travelers toward London.

B. Simulation results
The Oxford road network is partitioned into five subgraphs using K-means clustering of the vertices based on their Euclidean distance, as shown in Figure 2.
The cost-to-go is calculated as in (8), where the average car speed is randomly generated from a uniform distribution to lie between 25% and 100% of the speed limit of the relevant road. For each vertex, the outgoing edges are enumerated and the transition probability between two vertices is set to either 1 or 0, depending on whether an edge directly connects the vertices and the input selects that edge.
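The construction described above, where each outgoing edge becomes an input with a deterministic transition, can be sketched in Python as follows. The function name `build_routing_mdp`, the edge tuple layout, and the example edges are hypothetical. Since the cost formula (8) is not reproduced here, the sketch assumes the travel time of an edge is its length divided by the sampled average speed, which may differ from the authors' exact expression.

```python
import random


def build_routing_mdp(edges, rng):
    """Turn a directed road graph into a deterministic MDP.

    edges: list of (i, j, length_m, speed_limit_mps) tuples (assumed layout).
    Returns (actions, p, g), where actions[i] enumerates the outgoing edges of
    vertex i, p[(i, u)] = {j: 1.0}, and g[(i, u)] = {j: travel time in seconds}.
    """
    actions, p, g = {}, {}, {}
    for i, j, length, limit in edges:
        u = len(actions.setdefault(i, []))   # index of this outgoing edge at i
        actions[i].append(u)
        # Average speed drawn uniformly between 25% and 100% of the speed limit.
        speed = rng.uniform(0.25 * limit, limit)
        p[i, u] = {j: 1.0}                   # input u deterministically selects the edge
        g[i, u] = {j: length / speed}        # assumed cost: length / average speed
    return actions, p, g


# Assumed toy graph with three junctions.
edges = [(0, 1, 100.0, 10.0), (0, 2, 200.0, 20.0), (1, 2, 50.0, 10.0)]
actions, p, g = build_routing_mdp(edges, random.Random(0))
```

Vertices without outgoing edges receive no entry in `actions`; a complete model would need to treat them (for example the access point) separately.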
The disaggregation probabilities are zero for all nodes that have no edge connected to a vertex in another aggregation set $I_\ell$. The remaining vertices are given a normalized nonzero disaggregation probability. The discount factor, $\alpha = 0.9$, is chosen close enough to 1 so as to reflect the desire to reach the final node and avoid loops, yet smaller than 1, so as to weight costs further off as less important due to the constantly evolving traffic situation.
Solving the aggregate problem and comparing it to what is considered to be the "true solution" $J^*$ obtained via conventional value iteration, we notice that the normalized average error of the expected cost-to-go, i.e., $\frac{1}{n} \sum_{\ell=1}^{q} \sum_{i \in I_\ell} \frac{|\phi_{i\ell} V_\ell(i) - J^*(i)|}{|J^*(i)|}$, is 0.94%, and the normalized maximum error, i.e., $\max_{\ell=1,\dots,q} \max_{i \in I_\ell} \frac{|\phi_{i\ell} V_\ell(i) - J^*(i)|}{|J^*(i)|}$, is 190.83%. The value function is shown in Figure 3. The evolution over time of the aggregated costs is shown in Figure 4, where the colored dots represent when the change to an agent's local value function compared to the last broadcast is greater than the chosen communication threshold (0.1 in this example). At this point, the agent will broadcast its updated aggregated cost. If, since the last broadcast, no agent's value function has changed significantly, the agents signal each other that convergence is reached and the algorithm terminates. For comparison, we show how, on average, the normalized error increases as we increase the number of agents (see Table below). This is to be expected, as agents will rely more heavily on the aggregated values the smaller their respective partitions become. To reproduce the numerical results, the associated code has been made available in [13], with the ability to upload any OpenStreetMap file, which is then converted to a graph and subsequently to a discounted Markov decision process.
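The two error metrics quoted above can be computed along the following lines. This is an illustrative sketch only: the function name `normalized_errors` and the sample values are assumptions, the per-agent value functions are taken to be already merged into a single list `V` over all states, and states with a zero reference cost are skipped to avoid division by zero, a choice the text does not specify.

```python
def normalized_errors(V, J_star):
    """Return (average, maximum) of |V(i) - J*(i)| / |J*(i)| over the states.

    States with J*(i) == 0 are excluded (assumption of this sketch), so the
    average is taken over the remaining states only.
    """
    rel = [
        abs(v - j) / abs(j)
        for v, j in zip(V, J_star)
        if j != 0
    ]
    return sum(rel) / len(rel), max(rel)


# Assumed values: the third state plays the role of a zero-cost destination.
V = [1.1, 2.0, 0.0]
J_star = [1.0, 4.0, 0.0]
avg_err, max_err = normalized_errors(V, J_star)
```

A small average combined with a large maximum, as reported in the text, indicates that the error is concentrated in a few states rather than spread across the network.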

V. CONCLUSION
We presented a multi-agent extension of aggregated value iteration which was shown to be able to solve large-scale dynamic programming problems in a fully distributed manner. The presented methodology finds application in problems where each agent has only partial knowledge of the transition probabilities and costs. To this end, we demonstrated its efficacy in a distributed traffic routing problem, for which the code has been made available in [13].
Future work aims at extending Theorem 3.2 that is based on the additional assumption of full network connectivity at all iterations to the more general case of Assumption 2.1, as well as to the case of hard aggregation (where each state is clearly assigned to no more than one agent).