Abstract
We consider mean-field control problems in discrete time with discounted reward, infinite time horizon and compact state and action spaces. The existence of optimal policies is shown and the limiting mean-field problem is derived when the number of individuals tends to infinity. Moreover, we consider the average reward problem and show that the optimal policy of this mean-field limit is \(\varepsilon \)-optimal for the discounted problem if the number of individuals is large and the discount factor is close to one. This result is very helpful because, in the special case where the reward depends only on the distribution of the individuals, we obtain an interesting subclass of problems for which an average reward optimal policy can be obtained by first computing an optimal measure from a static optimization problem and then achieving it with Markov Chain Monte Carlo methods. We give two applications which we solve explicitly: avoiding congestion on a graph and optimal positioning on a market place.
1 Introduction
Mean-field control problems have been developed from McKean-Vlasov processes (see [26]), where the dynamics depend on the distribution of the current state itself. In the corresponding control problem the relevant data like reward and transition function depend not only on the current state and action but also on the distribution of the state. Whereas the original motivation comes from physics, these kinds of problems can model the interaction of a large population. Thus, other popular applications include finance, queueing, energy and security problems, among others. In this paper we consider mean-field control problems in discrete time, in contrast to the majority of the literature, which concentrates on continuous-time models. Moreover, our optimization criterion is to maximize the social benefit of the system, i.e. the overall expected reward. In particular, in our paper individuals cooperate, in contrast to the game situation where one usually tries to find a Nash equilibrium of the system. Here we rather aim at obtaining the Pareto optimal solution. A comprehensive overview of continuous-time mean-field games can be found in [7]. These games have been introduced in economics and have been studied in mathematics for at least 15 years (see e.g. [24] for one of the first mathematical papers on this topic).
We briefly review the latest results on discrete-time mean-field problems. First note that there have been some early studies of interactive games in [23] under the name anonymous sequential games and in [35] of so-called oblivious games, which are very similar in nature to mean-field games. For a recent paper on discrete-time mean-field games and a literature survey, see for example [32]. In that paper Markov Nash equilibria are considered in a model without common noise. For an early game paper with finite state space see [16]. Since our paper is not about a game and is more in the spirit of Markov Decision Processes (MDPs), we concentrate our literature survey on control papers. Among the first papers in this area were [13, 14]. In both papers the authors' goal is to investigate the convergence of a large interacting population process to the simpler mean-field model. More precisely, the authors show convergence of value functions and convergence of optimal policies, which implies the construction of asymptotically optimal policies. In both papers the state space is finite and the action space compact. Whereas in [13] the convergence rate is studied, in [14] the authors also scale the time steps to obtain a continuous-time deterministic limit. Finite as well as infinite-horizon discounted reward problems are considered. In [20] the authors also investigate convergence in a discounted reward problem, but consider the situation where the density of the random disturbance is unknown. A consumption-investment example is discussed there. In [21] the same authors treat the unknown disturbance as a game against nature. The paper [29] already starts from a discrete-time mean-field control problem. The authors derive the value iteration and solve an LQ McKean-Vlasov control problem. In contrast to our paper there is no common noise, the authors restrict themselves to a finite time horizon and they do not use MDP theory to solve their problem.
However, their model data like cost and transition function may also depend on the distribution of actions. LQ-problems are popular as applications of mean-field control since it is often possible to obtain optimal policies in these cases. For example, [11] is entirely devoted to this kind of problem.
The two papers closest to ours, at least as far as the model is concerned, are [8, 27]. In both papers the model data may also depend on the distribution of actions, but there is no restriction on admissible actions. Both consider a discounted problem with infinite time horizon. In [8] the authors work with lower semicontinuous value functions, whereas we show continuity under the same assumptions. The main issues in [8] are an extensive discussion of different types of policies and the development of Q-learning algorithms. We, however, start directly with Markovian deterministic policies, since it is well known in MDP theory that history-dependent or randomized policies do not increase the value. Moreover, we consider the convergence of the N-individuals problem as well as average reward optimization. In [27] the authors deal with so-called open-loop controls and restrict themselves to individualized or decentralized information. They investigate the rate of convergence from the N-population model to the mean-field problem. They also derive a fixed point characterization of the value function and discuss the role of randomized controls. Since in [27] decisions may only depend on the history of the single agent, an additional source of randomness is required so that individuals with the same history may take different actions.
Other recent papers discuss reinforcement learning for mean-field control problems, see e.g. [8, 9, 17, 18]. In the second part of the paper we consider average reward mean-field control problems, which is a new aspect. There are papers on average reward games, like [5], where the transition probability does not depend on the empirical distribution of individuals, and [36], where under some strong ergodicity assumptions the existence of a stationary mean-field equilibrium is shown. Neither paper considers the vanishing discount approach, which we do here. The recent paper [6] considers the vanishing discount approach, but in a continuous-time setting and for a game.
The main contributions of our paper are as follows: We first want to stress that mean-field control problems fit naturally into the established MDP theory. We start with a problem where N interacting individuals try to maximize their expected discounted reward over an infinite time horizon. Reward and transition functions may depend on the empirical measure of the individuals. Moreover, the transitions of the individuals depend on an idiosyncratic noise and a common noise. For symmetry reasons, instead of taking the state of each individual as the common state of the system, it is enough to know the empirical measure over the states. This equivalence implies an MDP formulation where the underlying state process consists of empirical measures. A similar observation can be found in [27]; however, there the authors take the mean-field limit first. Letting the number N of individuals tend to infinity yields a mean-field limit by applying the Glivenko-Cantelli theorem. The idiosyncratic noise vanishes in the limit. In our setting, state and action spaces are compact Borel spaces. We also discuss the existence of optimal policies, which is rarely done in other papers. For example, we give explicit conditions under which an optimal deterministic policy exists for the limit problem as well as for the initial N-individuals problem. Moreover, we investigate average optimality in mean-field control problems, an aspect which is neglected in the literature. Applying results from MDP theory leads to an average reward optimality inequality. In some cases we obtain optimal policies in this setting rather easily. Since we use the vanishing discount approach, we can show that these policies are \(\varepsilon \)-optimal for the initial problem when the number of individuals is large and the discount factor is close to one. Thus, we get a kind of double approximation which is helpful in some applications.
Indeed, it turns out that the situation where the reward does not depend on the action yields an interesting special case. The average reward problem can then be solved by first finding an optimal measure for a static optimization problem and then using Markov Chain Monte Carlo to find an optimal randomized decision rule which achieves this measure in the limit. We show how this works in a network example where the aim is to avoid congestion. Another interesting feature of the solution is that it is a decentralized control: individuals can decide optimally based on their own state without knowing the distribution of all individuals, i.e. individuals do not have to communicate. A second example is the optimal placement on a market square.
The paper is organized as follows: In the next section we introduce the model with a finite number N of individuals. We give conditions under which the optimality equation holds and optimal policies exist. In Sect. 3 we show how to formulate an equivalent MDP whose state space consists of the empirical measures of the individuals. Based on this formulation, we let the number N of individuals tend to infinity in the subsequent section. We prove the convergence of value functions and show how an asymptotically optimal policy can be constructed. In Sect. 5 we consider the average reward problem via the vanishing discount approach. Under some ergodicity assumptions we prove the existence of average reward optimal policies and verify that the value function satisfies an average reward optimality inequality. Next we show how to use this optimal policy to construct \(\varepsilon \)-optimal policies for the original problem.
We discuss how to solve average reward problems when the reward depends only on the distribution of individuals and not on the action. Finally, in Sect. 6 we consider two applications (network congestion and positioning on a market place) which we solve explicitly. The appendix contains additional material: a useful convergence result and the definitions of the Wasserstein distance and Wasserstein ergodicity. Moreover, longer proofs are also deferred to the appendix.
2 The Mean-Field Model
We consider the following Markov Decision Process with a finite number of individuals: Suppose we have a compact Borel set S of states and N statistically identical individuals. Initially each individual is in one of the states, i.e. the state of the system is described by a vector \({\textbf{x}}=(x_1,\ldots ,x_N)\in S^N\) which represents the states of the individuals. In case we need the time index n, we write \(x_n^i\), \(i=1,\ldots ,N\). Each individual can choose actions from the same Borel set A. Let \(D(x)\subset A\) be the actions available for an individual in state \(x\in S\), i.e. \({\textbf{a}}=(a_1,\ldots ,a_N)\in {\textbf{D}}({\textbf{x}}):=D(x_1)\times \ldots \times D(x_N)\) is the vector of admissible actions for all individuals. We denote \(D:= \{ (x,a) \in S\times A \;|\; a\in D(x)\}\) and assume that it contains the graph of a measurable mapping \(f:S\rightarrow A\). Moreover, \({\textbf{D}}:= \{ ({\textbf{x}},{\textbf{a}}) \;|\; {\textbf{a}}\in {\textbf{D}}({\textbf{x}})\} \). After choosing an action each individual faces a random transition. In order to define this, suppose that \((Z_n^i)_{n\in {\mathbb {N}}}, i=1,\ldots ,N\), and \((Z_n^0)_{n\in {\mathbb {N}}}\) are sequences of i.i.d. random variables with values in a Borel set \({\mathcal {Z}}\). The sequence \((Z_n^0)_{n\in {\mathbb {N}}}\) will play the role of a common noise. In what follows we need the empirical measure of \({\textbf{x}}\), i.e. we denote
$$\begin{aligned} \mu [{\textbf{x}}] := \frac{1}{N}\sum _{i=1}^N \delta _{x_i}, \end{aligned}$$
where \(\delta _y\) is the Dirac measure in point y. \(\mu [{\textbf{x}}]\) can be interpreted as a distribution on S. We denote by \({\mathbb {P}}(S)\) the set of all distributions on S and by
$$\begin{aligned} {\mathbb {P}}_N(S):= \{ \mu \in {\mathbb {P}}(S)\;|\; \mu = \mu [{\textbf{x}}] \text{ for } \text{ some } {\textbf{x}}\in S^N\} \end{aligned}$$
the set of all distributions which are empirical measures of N points. On these sets we consider the topology of weak convergence. The transition function of the system is now a combination of the individual transition functions which are given by a measurable mapping \(T: S\times A\times {\mathbb {P}}(S)\times {\mathcal {Z}}^2\rightarrow S\) such that
$$\begin{aligned} x_{n+1}^i = T\big (x_n^i, a_n^i, \mu [{\textbf{x}}_n], Z_{n+1}^i, Z_{n+1}^0\big ) \end{aligned}$$
for \(i=1,\ldots ,N\). Note that the individual transition may also depend on the empirical distribution \(\mu [{\textbf{x}}_n]\) of all individuals. In total, the transition function for the entire system is a measurable mapping \({\textbf{T}}: {\textbf{D}} \times {\mathbb {P}}_N(S)\times {\mathcal {Z}}^{N+1}\rightarrow S^N\) of the state \({\textbf{x}}\), the chosen actions \({\textbf{a}}\in {\textbf{D}}({\textbf{x}})\), the empirical measure \(\mu [{\textbf{x}}]\) and the disturbances \({\textbf{Z}}_{n+1}:=(Z_{n+1}^1,\ldots , Z_{n+1}^N), Z_{n+1}^0\) such that
$$\begin{aligned} {\textbf{T}}({\textbf{x}}_n,{\textbf{a}}_n,\mu [{\textbf{x}}_n], {\textbf{Z}}_{n+1}, Z_{n+1}^0)= \Big ( T(x_n^i, a_n^i, \mu [{\textbf{x}}_n], Z_{n+1}^i, Z_{n+1}^0)\Big )_{i=1,\ldots ,N}. \end{aligned}$$
Last but not least, each individual generates a bounded one-stage reward \(r: S\times A\times {\mathbb {P}}(S)\rightarrow {\mathbb {R}}\) which is given by \(r(x_i,a_i,\mu [{\textbf{x}}])\), i.e. it may also depend on the empirical distribution of all individuals. The total one-stage reward of the system is the average
$$\begin{aligned} {\textbf{r}}({\textbf{x}},{\textbf{a}}) :=\frac{1}{N} \sum _{i=1}^N r(x_i,a_i, \mu [{\textbf{x}}]) \end{aligned}$$
over all individuals. The first aim will be to maximize the joint expected discounted reward of the system over an infinite time horizon, i.e. we consider the social optimum of the system or Pareto optimality. In particular, the agents have to work together in order to optimize the system. This is in contrast to mean-field games, where each individual tries to maximize her own expected discounted reward and the aim is to find Nash equilibria. We make the following assumptions:
-
(A0)
D is compact.
-
(A1)
\(x\mapsto D(x)\) is upper semicontinuous, i.e. for all \(x\in S\): If \(x_n\rightarrow x\) for \(n\rightarrow \infty \) and \(a_n\in D(x_n)\), then \((a_n)\) has an accumulation point in D(x).
-
(A2)
\((x,a,\mu ) \mapsto r(x,a,\mu )\) is upper semicontinuous.
-
(A3)
\( (x,a,\mu ) \mapsto T(x,a,\mu ,z,z_0)\) is continuous for all \(z,z_0 \in {\mathcal {Z}}.\)
A policy in this model is given by \(\pi =(f_0,f_1,\ldots )\) with \(f_n \in F\) being a decision rule where
$$\begin{aligned} F:= \{ f: S^N \rightarrow A^N \;|\; f \text{ is } \text{ measurable } \text{ and } f({\textbf{x}})\in {\textbf{D}}({\textbf{x}}) \text{ for } \text{ all } {\textbf{x}}\in S^N\} \end{aligned}$$
is the set of all decision rules. In case we do not need the time index n we write \(f({\textbf{x}}):=(f^1({\textbf{x}}),\ldots ,f^N({\textbf{x}}))\). It is not necessary to introduce randomized or history-dependent policies here, since we obtain a classical MDP below and it is well known that an optimal policy can be found among deterministic Markovian ones. We assume that each individual has information about the positions of all other individuals. This point of view can be interpreted as a centralized control problem where all information is collected and shared by a central controller.
Together with the distributions of \((Z_n^i), (Z_n^0)\) and the transition function \({\textbf{T}}\), a policy \(\pi \) induces a probability measure \({\mathbb {P}}_{\textbf{x}}^\pi \) on the measurable space
$$\begin{aligned} \big ( (S^N)^{\infty }, ({\mathcal {B}}(S^N))^{\otimes \infty }\big ), \end{aligned}$$
where \( {\mathcal {B}}(S^N) \) is the Borel \(\sigma \)-algebra on \(S^N\). The corresponding state process is denoted by \(({\textbf{X}}_n)\) where \({\textbf{X}}_n(\omega _1,\omega _2,\ldots )=\omega _n\in S^N\) and the action process is denoted by \(({\textbf{A}}_n)\) where \({\textbf{A}}_n(\omega _1,\omega _2,\ldots )=f_n(\omega _n).\) Our aim is to maximize the expected discounted reward of the system over an infinite time horizon. Hence we define for a policy \(\pi =(f_0,f_1,\ldots )\)
$$\begin{aligned} J^N_\pi ({\textbf{x}}) := {\mathbb {E}}_{\textbf{x}}^\pi \Big [ \sum _{n=0}^\infty \beta ^n {\textbf{r}}({\textbf{X}}_n,{\textbf{A}}_n)\Big ], \qquad V^N({\textbf{x}}) := \sup _\pi J^N_\pi ({\textbf{x}}), \quad {\textbf{x}}\in S^N, \end{aligned}$$
where \(\beta \in (0,1)\) is a discount factor and \({\mathbb {E}}_{\textbf{x}}^\pi \) is the expectation w.r.t. \({\mathbb {P}}_{\textbf{x}}^\pi \). \(V^N({\textbf{x}})\) is the maximal expected discounted reward over an infinite time horizon, given the initial configuration \({\textbf{x}}\) of the individuals' states.
Remark 2.1
It is not difficult to see that \(V^N\) is symmetric, i.e. \(V^N({\textbf{x}})=V^N(\sigma ({\textbf{x}}))\) for any permutation \(\sigma ({\textbf{x}})\) of \({\textbf{x}}\) because the reward \({\textbf{r}}({\textbf{x}},{\textbf{a}})={\textbf{r}}(\sigma ({\textbf{x}}),\sigma ({\textbf{a}}))\) and the transition function \({\textbf{T}}({\textbf{x}},{\textbf{a}},\mu [{\textbf{x}}], {\textbf{Z}}, Z^0)={\textbf{T}}(\sigma ({\textbf{x}}),\sigma ({\textbf{a}}),\mu [\sigma ({\textbf{x}})], {\textbf{Z}}, Z^0)\) are symmetric. This is a simple observation but in the end leads to the conclusion that it is only necessary to know how many individuals are in the different states.
In what follows we introduce some notations.
Definition 2.2
Let us define:
-
a)
The set \({\mathbb {M}}:= \{v:S^N \rightarrow {\mathbb {R}}\;| \; v \text{ is } \text{ bounded } \text{ and } \text{ upper } \text{ semicontinuous }\}\).
-
b)
The operator U on \({\mathbb {M}}\) by
$$\begin{aligned} Uv({\textbf{x}}) = (Uv)({\textbf{x}}):= & {} \sup _{{\textbf{a}}\in {\textbf{D}}({\textbf{x}})} \Big \{ {\textbf{r}}({\textbf{x}},{\textbf{a}})+ \beta {\mathbb {E}}\Big [v\big ( {\textbf{T}}({\textbf{x}}, {\textbf{a}}, \mu [{\textbf{x}}], {\textbf{Z}}, Z^0)\big )\Big ]\Big \}. \end{aligned}$$
-
c)
A decision rule \(f\in F\) is called maximizer of \(v\in {\mathbb {M}}\) if
$$\begin{aligned} Uv({\textbf{x}})= \textbf{ r}({\textbf{x}},f({\textbf{x}}))+ \beta {\mathbb {E}}\Big [v\big ( {\textbf{T}}({\textbf{x}}, f({\textbf{x}}), \mu [{\textbf{x}}], {\textbf{Z}}, Z^0)\big )\Big ]. \end{aligned}$$
From classical MDP theory we obtain:
Theorem 2.3
Assume (A0)–(A3). Then:
-
(a)
The value function \(V^N\) is the unique fixed point of the U-operator in \({\mathbb {M}}\), i.e. it satisfies the optimality equation \(V^N=U V^N\).
-
(b)
\(V^N = \lim _{n\rightarrow \infty } U^n 0\).
-
(c)
There exists a maximizer of \(V^N\) and every maximizer \(f^*\in F\) of \(V^N\) defines an optimal stationary (deterministic) policy \((f^*,f^*,\ldots )\).
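Parts (a) and (b) of the theorem can be illustrated numerically: the following sketch runs the value iteration \(V = \lim _n U^n 0\) on a small generic finite MDP. All data (three states, two actions, random reward and transition kernel) are invented for the illustration and are not part of the model above.

```python
import numpy as np

# Toy illustration of Theorem 2.3(a)-(b): value iteration V = lim U^n 0.
# All data (3 states, 2 actions, random r and P) are invented example data.
np.random.seed(0)
nS, nA, beta = 3, 2, 0.9
r = np.random.rand(nS, nA)                 # one-stage reward r(x, a)
P = np.random.rand(nS, nA, nS)
P /= P.sum(axis=2, keepdims=True)          # transition kernel P(x' | x, a)

def U(v):
    # Bellman operator: (Uv)(x) = max_a { r(x,a) + beta * sum_x' P(x'|x,a) v(x') }
    return (r + beta * P @ v).max(axis=1)

v = np.zeros(nS)                           # start the iteration at v = 0
for _ in range(1000):
    v_new = U(v)
    if np.max(np.abs(v_new - v)) < 1e-12:
        break
    v = v_new

# v now solves the optimality equation v = U v up to numerical precision
assert np.allclose(U(v), v, atol=1e-8)
```

Taking the maximizing action in each state then yields a stationary optimal policy as in part (c).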
The proof of this statement and all other longer proofs can be found in the appendix. We summarize the model data below:
Model MDP | |
---|---|
State space | \(S^N \ni {\textbf{x}}=(x_1,\ldots ,x_N)\) |
Admissible actions | \({\textbf{D}}({\textbf{x}}):=D(x_1)\times \ldots \times D(x_N)\ni {\textbf{a}}=(a_1,\ldots ,a_N)\) |
Transition function | \({\textbf{T}}({\textbf{x}}_n,{\textbf{a}}_n,\mu [{\textbf{x}}_n], {\textbf{Z}}_{n+1}, Z_{n+1}^0)= \Big ( T(x_n^i, a_n^i, \mu [{\textbf{x}}_n], Z_{n+1}^i, Z_{n+1}^0)\Big )_{i=1,\ldots ,N}\) |
Reward | \( {\textbf{r}}({\textbf{x}},{\textbf{a}}) :=\frac{1}{N} \sum _{i=1}^N r(x_i,a_i, \mu [{\textbf{x}}])\) |
Policy | \(\pi =(f_0,f_1,\ldots ),\) |
\(f_n\in F:= \{ f: S^N \rightarrow A^N \;| \; f \text{ is } \text{ measurable, } f({\textbf{x}})\in {\textbf{D}}({\textbf{x}}),\; \forall {\textbf{x}}\in S^N\}\) |
Example 2.4
Suppose individuals move on a triangle. The state space is given by the nodes \(S=\{1,2,3\}\). Admissible actions are adjacent nodes, i.e. \(D(1)=\{2,3\}, D(2)=\{1,3\}, D(3)=\{1,2\}\). The individual one-stage reward may be given by \(r(x_i,a_i,\mu )= 1_{\{1\}}(x_i)- 1_{\{ |1-\bar{\mu }|\le 0.5\}}\).
Here \(\bar{\mu }= \int x\,\mu (dx)\). This means an individual receives a reward of 1 when it is in state 1, but only when the average position \(\bar{\mu }\) of the population is more than 0.5 away from 1. A transition function may be
For \(N=5\) individuals, a state may be \({\textbf{x}}=(1,2,3,1,3)\) and an action \({\textbf{a}}=(2,1,2,3,1)\in {\textbf{D}}({\textbf{x}})\). In this case \(\mu [{\textbf{x}}]=(2/5,1/5,2/5)\) and \({\textbf{r}}({\textbf{x}},{\textbf{a}}) = 2/5\).
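The quantities of this example can be checked numerically; the following plain-Python sketch recomputes the empirical measure and the averaged system reward for the given state-action pair.

```python
from collections import Counter

# Recompute Example 2.4 for N = 5 individuals on the triangle S = {1,2,3}.
x = [1, 2, 3, 1, 3]          # states of the individuals
a = [2, 1, 2, 3, 1]          # chosen actions, a_i in D(x_i)
N = len(x)

counts = Counter(x)
mu = {s: counts[s] / N for s in (1, 2, 3)}       # empirical measure mu[x]
mu_bar = sum(s * p for s, p in mu.items())       # mean position, here 2.0

def r(x_i, mu_bar):
    # r(x, a, mu) = 1_{x = 1} - 1_{|1 - mu_bar| <= 0.5} (independent of a)
    return int(x_i == 1) - int(abs(1 - mu_bar) <= 0.5)

R = sum(r(x_i, mu_bar) for x_i in x) / N         # system reward r(x, a)

print(mu)   # {1: 0.4, 2: 0.2, 3: 0.4}
print(R)    # 0.4
```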
3 The Mean-Field MDP
Suppose that N is large. Even if the state space S is small, the solution of the problem may no longer be computationally tractable because \(S^N\) is large. We seek simplifications. In particular, we want to exploit the symmetry of the problem. In the last section we have seen that the empirical measure of the individuals' states is the essential information. Thus, we define \({\mathbb {P}}_N(S)\) as the new state space. Further we define the following sets:
$$\begin{aligned} {\hat{D}}(\mu ) :=\{ \mu [({\textbf{x}},{\textbf{a}})] \;|\; {\textbf{x}}\in S^N \text{ s.t. } \mu [{\textbf{x}}] =\mu \text{ and } {\textbf{a}}\in {\textbf{D}}({\textbf{x}})\}, \qquad {\hat{D}} := \{ (\mu ,Q) \;|\; Q\in {\hat{D}}(\mu )\}, \end{aligned}$$
where
$$\begin{aligned} {\mathbb {P}}_N(D):= \{ Q \in {\mathbb {P}}(D)\;|\; Q = \mu [({\textbf{x}},{\textbf{a}})] \text{ for } \text{ some } ({\textbf{x}},{\textbf{a}})\in {\textbf{D}}\} \end{aligned}$$
is the set of all probability measures on D which are empirical measures on N points. The set \({\hat{D}}(\mu ) \) consists of probability measures on D which are empirical measures on N points and whose first marginal distribution equals \(\mu \). We obtain the following result.
Lemma 3.1
Suppose \({\textbf{a}}\in {\textbf{D}}({\textbf{x}}) \) is an arbitrary action in state \({\textbf{x}}\in S^N\). Then there exists an admissible \(Q\in {\hat{D}} (\mu [{\textbf{x}}]),\) s.t.
$$\begin{aligned} {\textbf{r}}({\textbf{x}},{\textbf{a}}) = \int _{D} r(x,a,\mu [{\textbf{x}}])\, Q(d(x,a)) \end{aligned}$$(3.1)for all \({\textbf{x}}\in S^N\). The converse is also true, i.e. if \(Q\in {\hat{D}} (\mu [{\textbf{x}}])\) then there exists an \({\textbf{a}}\in {\textbf{D}}({\textbf{x}}) \) s.t. (3.1) holds.
Proof
Let \({\textbf{x}}\) and \({\textbf{a}}\in {\textbf{D}}({\textbf{x}})\) be given and let \(\mu := \mu [{\textbf{x}}]\in {\mathbb {P}}_N(S)\). Define the discrete point measure Q on D by
$$\begin{aligned} Q := \mu [({\textbf{x}},{\textbf{a}})] = \frac{1}{N}\sum _{i=1}^N \delta _{(x_i,a_i)}. \end{aligned}$$
Then \(Q\in {\hat{D}} (\mu )\) by construction and
$$\begin{aligned} \int _D r(x,a,\mu )\, Q(d(x,a)) = \frac{1}{N}\sum _{i=1}^N r(x_i,a_i,\mu ) = {\textbf{r}}({\textbf{x}},{\textbf{a}}), \end{aligned}$$
which proves the first statement. For the converse, suppose \(Q\in {\hat{D}} (\mu [{\textbf{x}}])\). By definition this implies that there exists \({\textbf{a}}\in {\textbf{D}}({\textbf{x}})\) s.t. \(Q=\mu [({\textbf{x}},{\textbf{a}})]\). Using this relation, (3.1) follows. \(\square \)
This lemma shows that instead of choosing actions \({\textbf{a}}\in {\textbf{D}}({\textbf{x}})\) we can choose measures \(Q\in {\hat{D}} (\mu [{\textbf{x}}])\), and \(\mu =\mu [{\textbf{x}}]\) is sufficient information which can replace the high-dimensional state \({\textbf{x}}\in S^N\). Intuitively this is clear from the fact that \({\textbf{r}}({\textbf{x}},{\textbf{a}})\) is symmetric (see Remark 2.1).
We consider now a second MDP with the following data which we will call mean-field MDP (for short \(\widehat{\textrm{MDP}}\)). The state space is \( {\mathbb {P}}_N(S)\) and the action space is \({\mathbb {P}}_N(D)\). The one-stage reward \(\hat{r}: {\hat{D}}\rightarrow {\mathbb {R}}\) is given by the expression in Lemma 3.1, i.e.
$$\begin{aligned} \hat{r}(\mu ,Q) := \int _{D} r(x,a,\mu )\, Q(d(x,a)), \end{aligned}$$(3.2)and the transition law \(\hat{T}: {\hat{D}} \times {\mathcal {Z}}^{N+1} \rightarrow {\mathbb {P}}_N(S)\) for \(Q=\mu [({\textbf{x}},{\textbf{a}})], \mu =\mu [{\textbf{x}}]\) by
$$\begin{aligned} \hat{T}(\mu ,Q,{\textbf{Z}},Z^0):= \mu \big [ {\textbf{T}}({\textbf{x}},{\textbf{a}},\mu ,{\textbf{Z}},Z^0)\big ]. \end{aligned}$$
The value of \({\hat{T}}\) simply is the empirical measure of the new states after a random transition. A policy is here denoted by \(\psi =(\varphi _0,\varphi _1,\ldots )\) with \(\varphi _n\in {\hat{F}}\) and we denote by \((\mu _n)\) the corresponding (random) sequence of empirical measures, i.e. \(\mu _0=\mu \), and for \(n\in {\mathbb {N}}_0\)
$$\begin{aligned} \mu _{n+1} = \hat{T}\big (\mu _n, \varphi _n(\mu _n), {\textbf{Z}}_{n+1}, Z_{n+1}^0\big ). \end{aligned}$$
Remark 3.2
We define an action as a joint probability distribution Q on state and action combinations instead of the conditional distribution on actions given the state. Both descriptions are equivalent, since for \(Q\in {\hat{D}}(\mu )\) we can disintegrate
$$\begin{aligned} Q\big (d(x,a)\big ) = \bar{Q}(da\,|\,x)\, \mu (dx), \end{aligned}$$
where \({\bar{Q}}\) is the regular conditional probability. For short: \(Q=\mu \otimes {\bar{Q}}\). The advantage of using the joint distribution is that we have one object defining the actions in all states. The disadvantage is that we need to formulate the restriction that the marginal distribution on the states coincides with \(\mu \).
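For discrete distributions the disintegration \(Q=\mu \otimes {\bar{Q}}\) is elementary; the following sketch (with invented toy data on state-action pairs) computes the first marginal and the conditional kernel and reassembles Q.

```python
from collections import defaultdict

# Disintegrate a discrete joint measure Q on state-action pairs into its
# first marginal mu and the conditional kernel Qbar(a|x); toy data only.
Q = {(1, 2): 0.2, (1, 3): 0.2, (2, 1): 0.2, (3, 1): 0.2, (3, 2): 0.2}

mu = defaultdict(float)                    # mu(x) = sum_a Q(x, a)
for (x, _a), q in Q.items():
    mu[x] += q

Qbar = {(x, a): q / mu[x] for (x, a), q in Q.items()}   # Qbar(a|x)

# reassembling gives back the joint measure: Q(x,a) = mu(x) * Qbar(a|x)
for (x, a), q in Q.items():
    assert abs(mu[x] * Qbar[(x, a)] - q) < 1e-12
```

Here, for instance, the marginal mass of state 1 is 0.4 and the conditional weight of action 2 given state 1 is 0.5.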
We define the value function of \(\widehat{\textrm{MDP}}\) in the usual way for state \(\mu \in {\mathbb {P}}_N(S)\) and policy \(\psi =(\varphi _0,\varphi _1,\ldots )\) by
$$\begin{aligned} J^N_\psi (\mu ) := {\mathbb {E}}\Big [ \sum _{n=0}^\infty \beta ^n \hat{r}\big (\mu _n, \varphi _n(\mu _n)\big )\Big ], \qquad J^N(\mu ) := \sup _\psi J^N_\psi (\mu ). \end{aligned}$$
Finally, we show that the MDP and the mean-field MDP are equivalent.
Theorem 3.3
Assume (A0)-(A3). For \({\textbf{x}}\in S^N\) and \(\mu =\mu [{\textbf{x}}]\) we have:
$$\begin{aligned} V^N({\textbf{x}}) = J^N(\mu ). \end{aligned}$$
Proof
Note that \(\mu _0=\mu =\mu [{\textbf{x}}]\) by definition. Let \({\textbf{a}}_0={\textbf{a}}\in {\textbf{D}}({\textbf{x}})\) be the first action taken by MDP under an arbitrary policy. Then by Lemma 3.1 there exists \(Q\in {\hat{D}}(\mu )\), s.t. \({\textbf{r}}({\textbf{x}},{\textbf{a}})= \hat{r}(\mu ,Q)\) and
$$\begin{aligned} \mu \big [ {\textbf{T}}({\textbf{x}},{\textbf{a}},\mu ,{\textbf{Z}}_1,Z_1^0)\big ] = \hat{T}(\mu ,Q,{\textbf{Z}}_1,Z_1^0). \end{aligned}$$
By induction over time n it follows that a sequence of states and feasible actions \(({\textbf{X}}_0,{\textbf{A}}_0,{\textbf{X}}_1,\ldots )\) in MDP can be coupled with a sequence of states and feasible actions \((\mu _0,Q_0,\mu _1,\ldots )\) for \(\widehat{\textrm{MDP}}\) and vice versa s.t. the same sequence of disturbances \(({\textbf{Z}}_n),(Z^0_n)\) is used and \( {\textbf{r}}({\textbf{X}}_n,{\textbf{A}}_n) = \hat{r}(\mu _n,Q_n)\) pathwise. The corresponding policies may be history-dependent, but \(V^N=J^N\) follows since it is well known for MDPs that the maximal value is attained within the class of Markovian policies. \(\square \)
As in Sect. 2 we define here a set and an operator for the mean-field MDP.
Definition 3.4
Let us define
-
(a)
The set \(\mathbb {{\hat{M}}}:= \{v: {\mathbb {P}}_N(S) \rightarrow {\mathbb {R}}\; |\; v \text{ is } \text{ bounded } \text{ and } \text{ upper } \text{ semicontinuous }\}\).
-
(b)
The operator \({\hat{U}}\) on \(\mathbb {{\hat{M}}}\) by
$$\begin{aligned} {\hat{U}}v(\mu ) = ({\hat{U}}v)(\mu ):= & {} \sup _{Q\in {\hat{D}}(\mu )} \Big \{ \hat{r}(\mu ,Q)+ \beta {\mathbb {E}}v( \hat{ T}(\mu ,Q,{\textbf{Z}},Z^0))\Big \}. \end{aligned}$$
Due to Theorem 3.3 and Theorem 2.3 we obtain:
Theorem 3.5
Assume (A0)–(A3). Then:
-
(a)
The value function \(J^N\) is the unique fixed point of the \({\hat{U}}\)-operator in \(\mathbb {{\hat{M}}}\), i.e. it satisfies the optimality equation \(J^N= {\hat{U}} J^N\).
-
(b)
\(J^N = \lim _{n\rightarrow \infty } {\hat{U}}^n0\).
-
(c)
There exists a maximizer of \(J^N\) and every maximizer \(\varphi ^*\in {\hat{F}}\) of \(J^N\) defines an optimal stationary policy \((\varphi ^*,\varphi ^*,\ldots )\).
We summarize the model data below:
Model \(\widehat{\textrm{MDP}}\) | |
---|---|
State space | \({\mathbb {P}}_N(S):= \{ \mu \in {\mathbb {P}}(S)\;| \; \mu = \mu [{\textbf{x}}], \text{ for } {\textbf{x}} \in S^N \} \ni \mu \) |
Admissible actions | \({\hat{D}} (\mu ) :=\{ \mu [({\textbf{x}},{\textbf{a}})] \;| \; {\textbf{x}}\in S^N \text{ s.t. } \mu [{\textbf{x}}] =\mu \text{ and } {\textbf{a}}\in {\textbf{D}}({\textbf{x}})\} \ni Q\) |
Transition function | \( \hat{T}(\mu ,Q,{\textbf{Z}},Z^0)= \mu [ {\textbf{T}}({\textbf{x}},{\textbf{a}},\mu [{\textbf{x}}],{\textbf{Z}},Z^0)]\) |
Reward | \( \hat{r}(\mu ,Q) := \int _{D} r(x,a,\mu ) Q(d(x,a))\) |
Policy | \(\psi =(\varphi _0,\varphi _1,\ldots ),\) |
\(\varphi _n\!\in \! {\hat{F}}\! :=\! \{ \varphi : {\mathbb {P}}_N(S) \rightarrow {\mathbb {P}}_N(D) \;| \; \varphi \text{ meas., } \varphi (\mu )\in {\hat{D}}(\mu ),\; \forall \mu \in {\mathbb {P}}_N(S) \} \) |
Example 3.6
We reconsider Example 2.4. The given state and action translate in \(\widehat{\textrm{MDP}}\) to \(\mu =\mu [{\textbf{x}}]=(2/5,1/5,2/5)\) as a distribution on \(S=\{1,2,3\}\). The action is a distribution on \(D=\{(1,2),(1,3),(2,1),(2,3),(3,1),(3,2)\}\) and translates into \(Q=(1/5,1/5,1/5,0,1/5,1/5)\). The transition kernel mentioned in Remark 3.2 is in this example given by \(\bar{Q}(2|1)=\frac{1}{2},{\bar{Q}}(3|1)=\frac{1}{2},{\bar{Q}}(1|2)=1, {\bar{Q}}(3|2)=0, {\bar{Q}}(1|3)=\frac{1}{2}, {\bar{Q}}(2|3)=\frac{1}{2}.\) Obviously \( \hat{r}(\mu ,Q) =2/5\).
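The lifted reward of this example can again be verified numerically; the sketch below evaluates \(\hat{r}(\mu ,Q)=\int _D r\, dQ\) for the measures above.

```python
# Evaluate r_hat(mu, Q) = integral of r dQ for Example 3.6 on the triangle.
Q = {(1, 2): 0.2, (1, 3): 0.2, (2, 1): 0.2, (3, 1): 0.2, (3, 2): 0.2}
mu = {1: 0.4, 2: 0.2, 3: 0.4}
mu_bar = sum(s * p for s, p in mu.items())       # mean position, here 2.0

def r(x, a, mu_bar):
    # individual reward from Example 2.4
    return int(x == 1) - int(abs(1 - mu_bar) <= 0.5)

r_hat = sum(r(x, a, mu_bar) * q for (x, a), q in Q.items())
print(r_hat)   # 0.4, i.e. 2/5 as claimed
```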
4 The Mean-Field Limit MDP
In this section we let \(N\rightarrow \infty \) in order to obtain some simplifications. This yields the so-called mean-field limit.
We thus consider a third MDP, the so-called limit MDP (denoted by \(\widetilde{\textrm{MDP}}\)). We will later show that it will indeed appear to be the limit of the problems studied in the previous section. The limit MDP is defined by the following data: The state space is \( {\mathbb {P}}(S)\) and the action space is \({\mathbb {P}}(D)\). We define
$$\begin{aligned} \tilde{D}(\mu ) := \{ Q\in {\mathbb {P}}(D) \;|\; Q \text{ has } \text{ first } \text{ marginal } \mu \}, \qquad \tilde{D} := \{ (\mu ,Q) \;|\; \mu \in {\mathbb {P}}(S),\; Q\in \tilde{D}(\mu )\}. \end{aligned}$$
The one-stage reward \(\tilde{r}: {\tilde{D}} \rightarrow {\mathbb {R}}\) is given as in (3.2):
$$\begin{aligned} \tilde{r}(\mu ,Q) := \int _{D} r(x,a,\mu )\, Q(d(x,a)). \end{aligned}$$
The transition function is defined by \(\tilde{T}: {\tilde{D}} \times {\mathcal {Z}}\rightarrow {\mathbb {P}}(S)\)
$$\begin{aligned} \tilde{T}(\mu ,Q,Z^0)(B):= \int _D p^{x,a,\mu ,Z^0}(B)\, Q(d(x,a)) \end{aligned}$$(4.3)where \(p^{x,a,\mu ,Z^0} (B):={\mathbb {P}}(T(x,a,\mu ,Z^i,Z^0)\in B |Z^0)\) with \(B\in {\mathcal {B}}(S)\), is the conditional probability that the next state is in B, given \(x,a,\mu \) and the common noise random variable \(Z^0\).
Remark 4.1
Recalling that \(Q\in {\tilde{D}}(\mu )\) means \(Q=\mu \otimes \bar{Q}\), we can (with the help of the Fubini theorem) instead of (4.3) equivalently write
$$\begin{aligned} \tilde{T}(\mu ,Q,Z^0)(dx') = \int _S P^{{\bar{Q}},\mu ,Z^0} (dx'|x)\, \mu (dx), \end{aligned}$$
where \(P^{{\bar{Q}},\mu ,Z^0} (dx'|x)= \int _{D(x)} p^{x,a,\mu ,Z^0}(dx') \bar{Q}(da|x)\). Hence \(P^{{\bar{Q}},\mu ,Z^0}\) is the transition kernel which determines the distribution at the next stage. In general it depends on \({\bar{Q}},\mu \) and the common noise \(Z^0\).
A decision rule is here a measurable mapping \(\varphi \) from \({\mathbb {P}}(S)\) to \({\mathbb {P}}(D)\) such that \(\varphi (\mu )\in \tilde{D}(\mu )\) for all \(\mu \). We denote by \({\tilde{F}}\) the set of all decision rules. Suppose that \(\psi =(\varphi _0,\varphi _1,\ldots )\) is a policy for the \(\widetilde{\textrm{MDP}}\). As in the previous section we set for \(n\in {\mathbb {N}}_0\)
$$\begin{aligned} \mu _{n+1} = \tilde{T}\big (\mu _n, \varphi _n(\mu _n), Z_{n+1}^0\big ), \end{aligned}$$
which yields the sequence of distributions of the individuals. Note that it is deterministic if \({\tilde{T}}\) does not depend on the common noise \(Z^0\).
Then we define for \(\widetilde{\textrm{MDP}}\) the following value functions for policy \(\psi =(\varphi _0,\varphi _1,\ldots )\) and state \(\mu \in {\mathbb {P}}(S)\)
$$\begin{aligned} J_\psi (\mu ) := {\mathbb {E}}\Big [ \sum _{n=0}^\infty \beta ^n \tilde{r}\big (\mu _n, \varphi _n(\mu _n)\big )\Big ], \qquad J(\mu ) := \sup _\psi J_\psi (\mu ). \end{aligned}$$
Instead of (A2) we will now assume that
-
(A2’)
\( (x,a,\mu ) \mapsto r(x,a,\mu )\) is continuous.
Definition 4.2
We define
-
(a)
The set \(\tilde{{\mathbb {M}}}:= \{ v: {\mathbb {P}}(S) \rightarrow {\mathbb {R}}\; |\; v \text{ is } \text{ continuous } \text{ and } \text{ bounded }\}\).
-
(b)
The maximal reward operator \({\tilde{U}}\) on \(\tilde{{\mathbb {M}}} \) in this model is
$$\begin{aligned} {\tilde{U}} v(\mu )= ({\tilde{U}} v)(\mu ):= & {} \sup _{Q\in {\tilde{D}} (\mu ) } \Big \{ \tilde{r}(\mu ,Q)+ \beta {\mathbb {E}}v( \tilde{T}(\mu ,Q,Z^0))\Big \}. \end{aligned}$$
For the mean-field limit MDP we obtain:
Theorem 4.3
Assume (A0), (A1), (A2’), (A3). Then:
-
(a)
The value function J is the unique fixed point of the \({\tilde{U}}\)-operator in \(\tilde{{\mathbb {M}}}\), i.e. it satisfies the optimality equation \(J= {\tilde{U}} J\).
-
(b)
\(J = \lim _{n\rightarrow \infty } {\tilde{U}}^n 0\).
-
(c)
There exists a maximizer of J and every maximizer \(\varphi ^*\in {\tilde{F}}\) of J defines an optimal stationary deterministic policy \((\varphi ^*,\varphi ^*,\ldots )\).
Remark 4.4
We can use established solution methods like value iteration, policy iteration, linear programming or reinforcement learning to numerically solve the limit MDP (see [4, 10, 30]).
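As a small numerical illustration of such methods, the following sketch performs policy evaluation in the limit model without common noise: for a fixed decision rule the measure flow \(\mu _{n+1}=\tilde{T}(\mu _n,Q_n)\) is deterministic, so the discounted value is a plain series. All model data (three states, two actions, random reward and kernel, a fixed conditional kernel \({\bar{Q}}\)) are invented for the example.

```python
import numpy as np

# Toy policy evaluation in the limit model without common noise:
# the measure flow mu_{n+1} = T~(mu_n, Q_n) is deterministic, so the
# discounted value of a fixed rule is a plain series. Data are invented.
nS, nA, beta = 3, 2, 0.9
rng = np.random.default_rng(1)
p = rng.random((nS, nA, nS))
p /= p.sum(axis=2, keepdims=True)                       # p(x' | x, a)
r = rng.random((nS, nA))                                # r(x, a)
Qbar = np.array([[0.5, 0.5], [1.0, 0.0], [0.2, 0.8]])   # fixed Qbar(a|x)

def step(mu):
    # r~(mu, Q) = sum_x mu(x) sum_a Qbar(a|x) r(x, a)
    reward = np.einsum('x,xa,xa->', mu, Qbar, r)
    # mu'(x') = sum_{x,a} mu(x) Qbar(a|x) p(x'|x,a)
    mu_next = np.einsum('x,xa,xay->y', mu, Qbar, p)
    return reward, mu_next

mu = np.array([1.0, 0.0, 0.0])    # start with everyone in state 0
J = 0.0
for n in range(400):
    reward, mu = step(mu)
    J += beta**n * reward

print(round(J, 4))   # discounted value of this fixed rule from mu_0
```

Wrapping a supremum over the kernels \({\bar{Q}}\) around this evaluation step turns the sketch into the value iteration of Theorem 4.3(b).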
The limit problem can be seen as a problem which approximates the original model when N is large. In order to proceed, we need a more restrictive assumption than (A3):
-
(A3’)
\({\mathcal {Z}}\) is compact and \( (x,a,\mu ,z,z_0) \mapsto T(x,a,\mu ,z,z_0)\) is continuous.
Remark 4.5
The assumption that \({\mathcal {Z}}\) is compact is not a strong one. Indeed, w.l.o.g. we may choose the disturbances to be uniformly distributed over [0, 1]. This is because if, for example, \({\mathcal {Z}}={\mathbb {R}}\) and F is the distribution function of Z, then \(Z{\mathop {=}\limits ^{d}} F^{-1}(U)\) with \(U\sim U([0,1])\), and \(F^{-1}\) then becomes part of the transition function.
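A minimal sketch of this inverse-transform construction, for an exponentially distributed disturbance (the exponential distribution here is only an example):

```python
import numpy as np

# Remark 4.5: a disturbance Z with distribution function F can be realized
# as F^{-1}(U) with U uniform on [0, 1]; sketch for Z ~ Exp(lam).
rng = np.random.default_rng(7)
lam = 2.0

def F_inv(u):
    # quantile function of Exp(lam): F^{-1}(u) = -log(1 - u) / lam
    return -np.log1p(-u) / lam

U = rng.random(200_000)           # noise living on the compact space [0, 1]
Z = F_inv(U)                      # has the Exp(lam) distribution

print(Z.mean())   # close to 1/lam = 0.5
```

The transition function then absorbs \(F^{-1}\), so the model only ever sees the uniform noise U.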
Then it is possible to prove the following limit result.
Theorem 4.6
Assume (A0), (A1), (A2’) and (A3’). Let \(\mu ^N_0\Rightarrow \mu _0\) for \(N\rightarrow \infty \) where \(\mu _0^N \in {\mathbb {P}}_N(S)\). Then
-
(a)
\(\limsup _{N\rightarrow \infty } J^N(\mu ^N_0)= J(\mu _0)\).
-
(b)
Suppose \(\varphi ^*\) is a maximizer of J. Then it is possible to construct (possibly history-dependent) policies \(\psi ^N= (\varphi ^N_0,\varphi ^N_1,\ldots )\) for \(\widehat{\textrm{MDP}}\) s.t. \(\lim _{N\rightarrow \infty } J^N_{\psi ^N}(\mu _0^N)=J(\mu _0)\).
In particular the proof of part (b) shows how to obtain an \(\varepsilon \)-optimal policy for the model with N individuals (N large) when we know the optimal policy for the limit MDP.
Remark 4.7
(a)
In case there is no common noise, \(\widetilde{\textrm{MDP}}\) is completely deterministic. The optimality equation then reads
$$\begin{aligned} J(\mu ) = \sup _{Q\in {\tilde{D}}(\mu )} \Big \{ {\tilde{r}}(\mu ,Q)+\beta J({\tilde{T}}(\mu ,Q))\Big \} \end{aligned}$$(4.7)where \({\tilde{T}}(\mu ,Q)(B) = \int p^{x,a,\mu }(B) Q(d(x,a))\) with \(p^{x,a,\mu }(B) = {\mathbb {P}}(T(x,a,\mu ,Z)\in B)\).
(b)
If there is no common noise and r and T do not depend on \(\mu \), we obtain as a special case a standard MDP. The usual optimality equation for this MDP (for one individual) would be
$$\begin{aligned} V(x) = \sup _{a\in D(x)} \left\{ r(x,a)+ \beta {\mathbb {E}}V(T(x,a,Z))\right\} ,\; x\in S \end{aligned}$$(4.8)where \(V(x) = \sup _\pi \sum _{k=0}^\infty \beta ^k {\mathbb {E}}_x^\pi [r(X_k^i,A_k^i)]\). The results in this paper show that we can equivalently consider \(\widehat{\textrm{MDP}}\) which implies the optimality equation (4.7). It is possible to show by induction that the relation between both value functions is given by \(J(\mu ) = \int V(x)\mu (dx)\). Moreover, a maximizer of J is given by \(\varphi ^*(\mu )=\mu \otimes {\bar{Q}}^*\) with \({\bar{Q}}^*(\cdot |x)= \delta _{f^*(x)}\) for some \(f^*: S\rightarrow A\) with \(f^*(x)\in D(x)\) and \(f^*\) is a maximizer of V. Here the choice of the conditional distribution \({\bar{Q}}^*\) does not depend on \(\mu \) and is concentrated on a single action.
(c)
The policy \(\psi ^N\) constructed in Theorem 4.6 is deterministic but has the disadvantage that individuals have to communicate. Another possibility is to choose \(Q_0^N\) as an empirical measure of \(Q_0^*\) given \(\mu _0^N\). This means: if \(Q_0^* = \mu _0 \otimes {\bar{Q}}^*\) and \(\mu _0^N = \mu [{\textbf{x}}^N]\), then for every \(x_i^N\) an action \(a_i^N\) is simulated according to the kernel \({\bar{Q}}^*\). This is a randomized policy, but it has the advantage that every individual can implement it on its own, without information about the other states and actions. This is a decentralized control, i.e. \(f^i({\textbf{x}})=f^i(x_i)\). Note that the speed of convergence in Theorem 4.6 depends on the chosen approximation method.
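A minimal sketch of this decentralized randomized policy (toy state space and kernel, assumed for illustration): every individual draws its action from \({\bar{Q}}^*(\cdot |x_i)\) independently, and by the law of large numbers the empirical state-action measure approximates \(\mu \otimes {\bar{Q}}^*\).

```python
import numpy as np

# Decentralized randomized policy of Remark 4.7(c): each of N individuals
# draws its action from Qbar(.|x_i) using only its own state.  The state
# space S = {0,1}, action space A = {0,1} and the kernel Qbar are toy data.
rng = np.random.default_rng(2)
Qbar = np.array([[0.7, 0.3],             # Qbar(.|x=0)
                 [0.2, 0.8]])            # Qbar(.|x=1)
N = 50_000
x = rng.integers(0, 2, size=N)           # individual states
a = np.array([rng.choice(2, p=Qbar[xi]) for xi in x])   # independent draws

# the empirical state-action measure approximates mu (x) Qbar
emp = np.zeros((2, 2))
for xi, ai in zip(x, a):
    emp[xi, ai] += 1.0 / N
mu = np.bincount(x, minlength=2) / N     # empirical state distribution
assert np.allclose(emp, mu[:, None] * Qbar, atol=0.01)
```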
We summarize the model data below:
| Model \(\widetilde{\textrm{MDP}}\) | |
| --- | --- |
| State space | \({\mathbb {P}}(S) \ni \mu \) |
| Admissible actions | \({\tilde{D}} (\mu ) :=\{ Q \in {\mathbb {P}}(D) \mid \text{the first margin of } Q \text{ is } \mu \}\ni Q\) |
| Transition function | \( \tilde{T}(\mu ,Q,Z^0)(B) = \int _{ D} p^{x,a,\mu ,Z^0}(B)\, Q(d(x,a))\) where \(p^{x,a,\mu ,Z^0} (B):={\mathbb {P}}(T(x,a,\mu ,Z^i,Z^0)\in B \mid Z^0)\) |
| Reward | \( \tilde{r}(\mu ,Q) := \int _{D} r(x,a,\mu )\, Q(d(x,a))\) |
| Policy | \(\psi =(\varphi _0,\varphi _1,\ldots )\) with \(\varphi _n\in {\tilde{F}} := \{ \varphi : {\mathbb {P}}(S) \rightarrow {\mathbb {P}}(D) \mid \varphi \text{ measurable},\ \varphi (\mu )\in {\tilde{D}}(\mu )\ \forall \mu \in {\mathbb {P}}(S) \} \) |
Example 4.8
We reconsider Example 2.4. In \(\widetilde{\textrm{MDP}}\) a state can be any distribution on S, e.g. \(\mu =(\pi ^{-1},0,1-\pi ^{-1})\). An action is a distribution on \(D=\{(1,2),(1,3),(2,1),(2,3),(3,1),(3,2)\}\) s.t. the first margin is \(\mu \). For example \(Q=(\pi ^{-1},0,0,0,3/4(1-\pi ^{-1}),1/4(1-\pi ^{-1}))\). Here \( \tilde{r}(\mu ,Q) =\pi ^{-1}\).
5 Average Reward Optimality
In this section we consider the problem of finding the maximal average reward of the mean-field limit problem \(\widetilde{\textrm{MDP}}\). Suppose an \(\widetilde{\textrm{MDP}}\) as in the previous section (Eq. (4.6)) is given. For a fixed policy \(\psi =(\varphi _0,\varphi _1,\ldots )\) define
$$\begin{aligned} G_\psi (\mu ) := \liminf _{n\rightarrow \infty } \frac{1}{n}\, {\mathbb {E}}_\mu ^\psi \Big [ \sum _{k=0}^{n-1} {\tilde{r}}(\mu _k,Q_k) \Big ]. \end{aligned}$$(5.1)
The problem is to find \(G(\mu ):= \sup _\psi G_\psi (\mu )\) for all \(\mu \in {\mathbb {P}}(S)\). We will construct the solution via the vanishing discount approach, see e.g. [3, 19, 33, 34]. This has the advantage that we get a statement about the approximation of the \(\beta \)-discounted problem by the average reward problem immediately. For this purpose we denote by \(J^\beta , J^\beta _ \psi \) the value functions of the discounted reward problem \(\widetilde{\textrm{MDP}}\) of the previous section in order to stress that they depend on the discount factor \(\beta \).
We first note that the following Tauberian theorem holds (see e.g. [34], Th. A.4.2):
Lemma 5.1
For arbitrary \(\mu \in {\mathbb {P}}(S)\) and policy \(\psi =(\varphi _0,\varphi _1,\ldots )\) we have
$$\begin{aligned} \liminf _{n\rightarrow \infty } \frac{1}{n}\, {\mathbb {E}}_\mu ^\psi \Big [ \sum _{k=0}^{n-1} {\tilde{r}}(\mu _k,Q_k) \Big ] \le \liminf _{\beta \uparrow 1}\, (1-\beta ) J^\beta _\psi (\mu ) \le \limsup _{\beta \uparrow 1}\, (1-\beta ) J^\beta _\psi (\mu ) \le \limsup _{n\rightarrow \infty } \frac{1}{n}\, {\mathbb {E}}_\mu ^\psi \Big [ \sum _{k=0}^{n-1} {\tilde{r}}(\mu _k,Q_k) \Big ]. \end{aligned}$$
In order to proceed we make the following assumption (compare with condition (B) in [33] or condition (SEN) in [34], Section 7.2).
(A4)
There exist \(L>0, \bar{\beta }\in (0,1)\) and a function \(M: {\mathbb {P}}(S)\rightarrow {\mathbb {R}}\) such that
$$\begin{aligned} M(\mu ) \le h^\beta (\mu ):= J^\beta (\mu )-J^\beta (\nu )\le L \end{aligned}$$for fixed \(\nu \in {\mathbb {P}}(S)\), all \(\mu \in {\mathbb {P}}(S)\) and all \(\beta \ge \bar{\beta }\).
We define \(\rho (\beta ):= (1-\beta ) J^\beta (\nu )\). Note that since r is bounded by a constant \(C>0\) say, we obtain \( |\rho (\beta )| \le (1-\beta ) |J^\beta (\nu )| \le C\). I.e. \( \rho (\beta )\) is bounded and \(\limsup _{\beta \uparrow 1} \rho (\beta )=:\rho \) exists. Now we obtain:
Lemma 5.2
Under (A4) there exists a sequence \((\beta _n)\) with \(\lim _{n\rightarrow \infty } \beta _n = 1\) s.t.
$$\begin{aligned} \lim _{n\rightarrow \infty } (1-\beta _n)\, J^{\beta _n}(\mu ) = \rho \end{aligned}$$
for all \(\mu \in {\mathbb {P}}(S)\). In particular we have \(G_\psi (\mu ) \le \rho \) for all \(\mu \) and \(\psi \).
Proof
Using (A4) we obtain:
$$\begin{aligned} \big |(1-\beta _n)J^{\beta _n}(\mu ) - \rho \big | \le (1-\beta _n)\,\big |J^{\beta _n}(\mu )-J^{\beta _n}(\nu )\big | + \big |\rho (\beta _n) - \rho \big |. \end{aligned}$$
The last term converges to zero when we choose \((\beta _n)\) s.t. \(\lim _{n\rightarrow \infty }\beta _n=1\) and \(\lim _{n\rightarrow \infty }\rho (\beta _n)=\rho \) which is possible due to the considerations preceding this lemma. The first term also tends to zero. \(\square \)
We obtain:
Theorem 5.3
Assume (A0), (A1), (A2’), (A3’), (A4). Then:
(a)
There exists a constant \(\rho \in {\mathbb {R}}\) and an upper semicontinuous function \(h:{\mathbb {P}}(S)\rightarrow {\mathbb {R}}\) such that the average reward optimality inequality holds, i.e. for all \(\mu \in {\mathbb {P}}(S)\)
$$\begin{aligned} \rho + h(\mu ) \le \sup _{Q\in {\tilde{D}}(\mu )} \left\{ {\tilde{r}}(\mu , Q) + {\mathbb {E}}[ h({\tilde{T}}(\mu ,Q,Z^0))] \right\} . \end{aligned}$$(5.2)Moreover, there exists a maximizer \(\varphi ^*\) of (5.2).
(b)
The stationary policy \((\varphi ^*,\varphi ^*,\ldots )\) is optimal for the average reward problem and \(\rho = \limsup _{\beta \uparrow 1} \rho (\beta )\) is the maximal average reward, independent of \(\mu \). Moreover, there exists a decision rule \(\varphi ^0\) and sequences \(\beta _m(\mu )\uparrow 1\), \(\mu _m(\mu ) \rightarrow \mu \) s.t.
$$\begin{aligned} \varphi ^0(\mu ):= \lim _{m\rightarrow \infty } \varphi ^{\beta _m(\mu )}(\mu _m(\mu )) \end{aligned}$$where \(\varphi ^\beta \) is an optimal decision rule in the \(\beta \)-discounted model and the stationary policy \((\varphi ^0,\varphi ^0,\ldots )\) is optimal for the average reward problem.
Note that part (b) of the previous theorem states that it is possible to obtain an average reward optimal policy from optimal policies in the discounted model. Perhaps more interesting, however, is the converse: from the average optimal policy we can construct \(\varepsilon \)-optimal policies for \(\widetilde{\textrm{MDP}}\), and thus also for \(\widehat{\textrm{MDP}}\), if \(\beta \) is close to one. The idea is to use the double approximation (large number of agents, discount factor close to one) to approximate the discounted finite-agent model by the average reward mean-field problem. We do not tackle the question of convergence speed or of how \(\beta \) depends on N here. A policy \(\psi \) is \(\varepsilon \)-optimal in state \(\mu \in {\mathbb {P}}(S)\) for \(\widetilde{\textrm{MDP}}\) if
Thus, we obtain:
Corollary 5.4
Under the assumptions of Theorem 5.3 suppose \(\psi ^*=(\varphi ^*,\varphi ^*,\ldots )\) is an optimal stationary policy for the average reward problem and \(\psi ^N\) is constructed as in Theorem 4.6. Then for all \(\varepsilon >0\) and for all \(\mu \in {\mathbb {P}}(S)\) there exists a \(\beta (\mu ) <1\)
(a)
s.t. \(\psi ^*\) is \(\varepsilon \)-optimal for \(\widetilde{\textrm{MDP}}\) in state \(\mu \) for all \(\beta \ge \beta (\mu )\).
(b)
and there exists \(N(\mu ,\beta (\mu ))\in {\mathbb {N}}\) s.t. for all \(N\ge N(\mu ,\beta (\mu ))\) and \(\beta \ge \beta (\mu )\), \(\psi ^N\) is \(\varepsilon \)-optimal for \(\widehat{\textrm{MDP}}\), i.e. \((1-\beta )|J^N_{\psi ^N}(\mu ^N)-J^N(\mu ^N) |\le \varepsilon \) where \(\mu ^N \Rightarrow \mu \).
Proof
(a)
By Theorem 5.3 we know that \(\rho =G_{\psi ^*}(\mu )\) is the maximal average reward. Lemma 5.1 and Theorem 5.3 together imply
$$\begin{aligned} \rho = G_{\psi ^*}(\mu )\le \liminf _{\beta \uparrow 1}(1-\beta ) J^\beta _{\psi ^*}(\mu ) \le \limsup _{\beta \uparrow 1}(1-\beta ) J^\beta _{\psi ^*}(\mu ) \le \limsup _{\beta \uparrow 1}(1-\beta ) J^\beta (\mu ) =\rho \end{aligned}$$which means that we have equality everywhere. Since r is bounded, w.l.o.g. we may assume that r is bounded from below by \(\underline{C}>0\), otherwise we have to shift the function by a constant. Now for all \(\varepsilon >0\) we can choose, due to the preceding equation, \(\beta (\mu )\) s.t. for all \(\beta \ge \beta (\mu )\)
$$\begin{aligned} |J^\beta (\mu )-J^\beta _{\psi ^*}(\mu )|\le \frac{\varepsilon }{1-\beta } \text{ and } \text{ hence } 1- \Big | \frac{J_{\psi ^*}^\beta (\mu )}{J^\beta (\mu )}\Big | \le \frac{\varepsilon }{(1-\beta ) J^\beta (\mu )}\le \frac{\varepsilon }{\underline{C}} \end{aligned}$$which implies the result.
(b)
Let \(\varepsilon >0\). From part a) choose \(\beta (\mu )<1\) s.t. for all \(\beta \ge \beta (\mu )\) we have \((1-\beta ) |J^\beta (\mu )-J^\beta _{\psi ^*}(\mu )|\le \varepsilon /3.\) Fix such a \(\beta \ge \beta (\mu )\). From Theorem 4.6 choose \(N\ge N(\mu ,\beta )\) s.t.
$$\begin{aligned} |J^N_{\psi ^N}(\mu ^N)-J^\beta _{\psi ^*}(\mu ) |\le \varepsilon /3 \text{ and } |J^N(\mu ^N)-J^\beta (\mu ) |\le \varepsilon /3. \end{aligned}$$Then, in total
$$\begin{aligned} (1-\beta )|J^N_{\psi ^N}(\mu ^N)-J^N(\mu ^N)| &\le (1-\beta )|J^N_{\psi ^N}(\mu ^N)-J^\beta _{\psi ^*}(\mu )| +(1-\beta )|J^\beta _{\psi ^*}(\mu )-J^\beta (\mu )| \\ &\quad + (1-\beta )|J^\beta (\mu )-J^N(\mu ^N)| \le \varepsilon \end{aligned}$$(5.3)which implies the statement.
\(\square \)
5.1 Special Case I
We consider the following special case: the reward depends only on \(\mu \), i.e. we have \({\tilde{r}}(\mu ,Q)={\tilde{r}} (\mu )\); the transition function is independent of \(\mu \); and there is no common noise, i.e. all individuals move independently of each other. Suppose \(\mu ^*\in {\mathbb {P}}(S)\) is a solution of the static optimization problem
$$\begin{aligned} {\tilde{r}}(\mu ^*) = \max _{\mu \in {\mathbb {P}}(S)} {\tilde{r}}(\mu ) \end{aligned}$$(5.4)
which exists since r is continuous on the compact space \({\mathbb {P}}(S)\). In the described situation \(\widetilde{\textrm{MDP}}\) is deterministic and the evolution of the state process for a given policy is
$$\begin{aligned} \mu _{k+1} = \mu _k P^{{\bar{Q}}_k}, \qquad P^{{\bar{Q}}}(B|x) := \int p^{x,a}(B)\, {\bar{Q}}(da|x), \end{aligned}$$(5.5)
for \(k\in {\mathbb {N}}\) where we start with the initial distribution \(\mu _0\).
Now suppose further that there exists a transition kernel (policy) \( {\bar{Q}}^*\) such that \(\mu ^*\) is a stationary distribution of \( P^{ {\bar{Q}}^*}\) and \(P^{ {\bar{Q}}^*}\) is Wasserstein ergodic (see Appendix). Suppose further that \((\mu _k^*)\) is the state sequence obtained in (5.5) where we replace \(P^{{\bar{Q}}}\) by \(P^{ {\bar{Q}}^*}\). Then \(\mu _k^*\Rightarrow \mu ^*\) weakly for \(k\rightarrow \infty \), since convergence in the Wasserstein metric implies weak convergence on compact sets. Problem (5.4) and the solution approach here are similar to the concept of steady state policies in [12].
Lemma 5.5
Under the assumptions of this subsection \(\varphi ^*(\mu ) = \mu \otimes {\bar{Q}}^*\) defines an average reward optimal stationary policy \(\psi ^*=(\varphi ^*,\varphi ^*,\ldots )\).
Proof
Since \(\mu \mapsto {\tilde{r}}(\mu )\) is continuous (see proof of Theorem 4.3) we obtain \(\lim _{k\rightarrow \infty }\tilde{r}(\mu _k^*) = {\tilde{r}}(\mu ^*)\). Thus we have for all \(\mu \in {\mathbb {P}}(S)\)
$$\begin{aligned} G_{\psi ^*}(\mu ) = \lim _{n\rightarrow \infty } \frac{1}{n} \sum _{k=0}^{n-1} {\tilde{r}}(\mu _k^*) = {\tilde{r}}(\mu ^*) = \max _{\nu \in {\mathbb {P}}(S)} {\tilde{r}}(\nu ). \end{aligned}$$
The last equation follows from the definition of \(\mu ^*\). Hence \(\psi ^*\) is average reward optimal. \(\square \)
The problem has thus been transformed into a Markov chain Monte Carlo problem: sample from \(\mu ^*\). In order to obtain an \(\varepsilon \)-optimal policy in the N-individual problem with large discount factor, an individual in state x can sample its action from \({\bar{Q}}^*(\cdot |x)\) (see proof of Theorem 4.6 and Remark 4.7 c)). This yields a decentralized decision which does not depend on the complete state of the system, i.e. the individuals do not have to communicate with each other in order to push the system to the social optimum; knowledge of their own state is sufficient. Problems may occur when the solution of (5.4) is not unique: then the individuals have to communicate which solution is preferred. In particular, the individual's optimal decision coincides with the socially optimal decision. This is because we can interpret \(\mu _k\) as the distribution of a typical individual at time k. Also note that in this case Assumption (A4) is satisfied, since \(|{\tilde{r}}(\mu _k^*)-{\tilde{r}}(\mu ^*)| \le C W(\mu _k^*,\mu ^*) \le {\tilde{C}} \rho ^k\) with \(\rho \in (0,1)\), where W is the Wasserstein distance of two measures (see Appendix). We will give a more specific application in Sect. 6.
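The mechanism of Lemma 5.5 can be illustrated with toy data (a two-point state space, an assumed ergodic kernel with known stationary distribution, and an assumed continuous reward maximized at \(\mu ^*\)): the Cesàro averages of \({\tilde{r}}(\mu _k^*)\) converge to \({\tilde{r}}(\mu ^*)\).

```python
import numpy as np

# Sketch of Lemma 5.5 with toy data: P plays the role of P^{Q*}, an ergodic
# kernel with stationary distribution mu*; the reward depends only on mu and
# is maximal at mu*.  The Cesaro averages of r(mu_k) then converge to r(mu*).
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])                    # toy ergodic kernel on S = {0,1}
mu_star = np.array([0.75, 0.25])
assert np.allclose(mu_star @ P, mu_star)      # mu* is stationary

r = lambda mu: -np.sum((mu - mu_star) ** 2)   # toy continuous reward, max at mu*
mu = np.array([1.0, 0.0])                     # start far away from mu*
rewards = []
for _ in range(5000):
    rewards.append(r(mu))
    mu = mu @ P
assert abs(np.mean(rewards) - r(mu_star)) < 1e-3   # average reward ~ r(mu*)
```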
5.2 Special Case II
We relax the previous case and allow the transition function to depend on \(\mu \). Again we determine the solution \(\mu ^*\) of (5.4) first. Next we check whether there exists a transition kernel (policy) \({\bar{Q}}^*\) such that \(\mu ^*\) is a stationary distribution of \( P^{{\bar{Q}}^*}\) with \(P^{\bar{Q}^*}(B|x)= \int p^{x,a,\mu ^*}(B) {\bar{Q}}^* (da|x)\) for \(x\in S, B\in {\mathcal {B}}(S)\) and \(P^{{\bar{Q}}^*}\) satisfies the Wasserstein ergodicity. Here, we need some further properties of the model to obtain the same result as in Case I, because we have to make sure that the system still converges to \(\mu ^*\), even if we choose the ’wrong’ transition kernel
$$\begin{aligned} P^{{\bar{Q}}^*\!,\,\mu _k}(B|x) := \int p^{x,a,\mu _k}(B)\, {\bar{Q}}^*(da|x) \end{aligned}$$
at stage k. Note that the evolution of the state in this model is given by
$$\begin{aligned} \mu _{k+1}^* = \mu _k^*\, P^{{\bar{Q}}^*\!,\,\mu _k^*}. \end{aligned}$$
In particular we want to find an optimal decentralized control. The following assumptions will be useful:
(T1)
There exists \(\gamma _W>0\) s.t. \(\sup _{x,a,z}|T(x,a,\mu ,z)-T(x,a,\mu ^*,z)|\le \gamma _W W(\mu ,\mu ^*)\) for all \(\mu \in {\mathbb {P}}(S)\).
(T2)
D(x) does not depend on x and \(W({\bar{Q}}^*(\cdot |x), {\bar{Q}}^*(\cdot |x'))\le \gamma _Q |x-x'|\) for all \(x,x'\in S\).
(T3)
There exists \(\gamma _A>0\) s.t. \(\sup _{x,z}|T(x,a,\mu ^*,z)-T(x,a',\mu ^*,z)|\le \gamma _A |a-a'|\) for all \(a,a'\in A\).
(T4)
There exists \(\gamma _S>0\) s.t. \(\sup _{a,z}|T(x,a,\mu ^*,z)-T(x',a,\mu ^*,z)|\le \gamma _S |x-x'|\) for all \(x,x'\in S\).
(T5)
\(\gamma :=\gamma _W+\gamma _Q\gamma _A+\gamma _S<1.\)
The next lemma states that under these assumptions the sequence \((\mu _k^*)\) still converges to the optimal distribution \(\mu ^*\).
Lemma 5.6
Under (T1)-(T5) we obtain: \(W(\mu _{k+1}^*,\mu ^*)\le \gamma W(\mu _k^*,\mu ^*)\) and thus \(\mu _k^*\Rightarrow \mu ^*\) weakly.
Lemma 5.6 then implies that even in this case the maximal average reward \({\tilde{r}}(\mu ^*)\) is achieved by applying \({\bar{Q}}^*\) throughout the process, which corresponds to a decentralized control. An example where (T1), (T3), (T4) are fulfilled is \(T(x,a,\mu ,z) = \gamma _S x+ \gamma _A a +\gamma _W \int x\,\mu (dx) + z\).
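For this linear example the contraction in Lemma 5.6 can be observed directly. The sketch below couples two N-particle systems with identical actions and noise (so \(\gamma _Q=0\), i.e. the action kernel does not depend on the state); the paired L1 distance, an upper bound on the Wasserstein distance of the empirical measures, then contracts by \(\gamma _S+\gamma _W\) in every step. All numerical values are illustrative.

```python
import numpy as np

# Contraction check for the linear example T(x,a,mu,z) = gS*x + gA*a
# + gW*mean(mu) + z.  Two N-particle systems are coupled synchronously
# (same actions, same noise); the paired L1 distance, an upper bound on the
# Wasserstein distance of the empirical measures, contracts by gS + gW.
rng = np.random.default_rng(3)
gS, gA, gW = 0.5, 0.2, 0.3            # gS + gW = 0.8 < 1, cf. (T5) with gQ = 0
N, steps = 1000, 30
x = rng.normal(0.0, 1.0, N)           # system 1
y = rng.normal(5.0, 2.0, N)           # system 2, different initial law

d = [np.mean(np.abs(x - y))]
for _ in range(steps):
    a = rng.normal(size=N)            # same actions for both systems
    z = rng.normal(size=N)            # same noise for both systems
    x = gS * x + gA * a + gW * x.mean() + z
    y = gS * y + gA * a + gW * y.mean() + z
    d.append(np.mean(np.abs(x - y)))

for k in range(steps):                # geometric contraction, factor gS + gW
    assert d[k + 1] <= (gS + gW) * d[k] + 1e-12
```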
6 Applications
6.1 Avoiding Congestion
We consider here the following special case: N individuals move on a graph with nodes \(S=\{1,\ldots ,d\}\) and edges \(E\subset \{(x,x'): x,x'\in S\}\). Individuals can move along one edge in one time step. We assume that the graph is connected. The aim is to avoid congestion: the individuals should spread out so that they keep a maximal distance from each other. More precisely, suppose that the current empirical distribution of the individuals on the nodes is \(\mu \) and that the distance between nodes x and \(x'\), \(x,x'\in S\), is given by \(\Delta (x,x')>0\), where \(\Delta (x,x)=0\) and \(\Delta (x,x')=\Delta (x',x)\). Then the average distance between an individual at position x and all other individuals is
$$\begin{aligned} r(x,a,\mu ) = \sum _{x'\in S} \Delta (x,x')\,\mu (x'). \end{aligned}$$
Here \(r(x,a,\mu )\) does not depend on a. Hence
$$\begin{aligned} {\tilde{r}}(\mu ,Q) = \sum _{x\in S}\sum _{x'\in S} \Delta (x,x')\,\mu (x')\,\mu (x) = \mu \Delta \mu ^\top , \end{aligned}$$
where \(\Delta =\big ( \Delta (x,x')\big )_{x,x'\in S}\) is the matrix of distances. Note that \(\Delta \) is symmetric. We assume that \(A=S\) and \(D(x)=\{x'\in S: (x,x')\in E\}\cup \{x\}\), i.e. admissible actions in the original model are neighbours on the graph. We interpret actions as intended directions the individual wants to move to, which may however be disturbed by some random external noise. In the mean-field limit the state of the system at time n is just given by a distribution on S. Recall that the general transition equation of the mean-field limit is
$$\begin{aligned} \mu _{n+1}(x') = \sum _{(x,a)\in D} p^{x,a,\mu _n,z^0}(x')\, Q_n(x,a) \end{aligned}$$(6.1)
if S, A are finite where \( p^{x,a,\mu ,z^0}(x') = {\mathbb {P}}(T(x,a,\mu ,Z,z^0)=x')\) and \(Q_n\) has first margin \(\mu _n\). Problems where the reward decreases when more individuals share the same state are typical for mean-field problems, see e.g. [25] where a Wardrop equilibrium is computed. In [28] the authors consider spreading contamination on graphs.
6.1.1 No Common Noise
We consider the mean-field limit now. At the beginning let us assume that \(p^{x,a,\mu ,z^0} = p^{x,a}\) does not depend on \(\mu \) and \(z^0\), i.e. the individuals move on their own, not affected by others and there is no common noise. Moreover, it is reasonable to set \(p^{x,a}(x')=0\) if \((x,x')\notin E\) except for \(x=x'\). Let us denote \( P^{{\bar{Q}}}=\big ( p_{xx'}^{{\bar{Q}}}\big )\) where
$$\begin{aligned} p_{xx'}^{{\bar{Q}}} = \sum _{a\in D(x)} p^{x,a}(x')\, {\bar{Q}}(a|x) \end{aligned}$$
with \({\bar{Q}}(a|x)\mu (x)=Q(x,a)\). Hence (6.1) can be written as \(\mu _{n+1}=\mu _n P^{{\bar{Q}}_n}\). Here it is more intuitive to work with the conditional probabilities \({\bar{Q}}(a|x)\) instead of the joint distribution Q(x, a).
Obviously the optimization problem
$$\begin{aligned} \max _{\mu \in {\mathbb {P}}(S)}\ \mu \Delta \mu ^\top \end{aligned}$$
has an optimal solution \(\mu ^*\) since \({\mathbb {P}}(S)\) is compact and \(\mu \Delta \mu ^\top \) continuous.
We consider the following special case: For \(a,x'\in D(x)\) set \(p^{x,a}(x') =\alpha \) for \(a=x'\) and \(p^{x,a}(x') =\frac{1-\alpha }{|D(x)|-1}\) else. All other probabilities are zero. I.e. if we choose a vertex a we will move there with probability \(\alpha \) and move to any other admissible vertex with equal probability. Formally for \(x\in S\), action \(a\in D(x)=\{x_1,\ldots ,x_m\}\) (where \(x_i=x\) for one of the \(x_i\)’s) and disturbance \(Z\sim U[0,1]\) the transition function in this example is given by
$$\begin{aligned} T(x,a,z) = a\,1_{[0,\alpha ]}(z) + \sum _{j:\, x_j \ne a} x_j\, 1_{I_j}(z), \end{aligned}$$where the \(I_j\) are disjoint intervals of length \(\frac{1-\alpha }{m-1}\) partitioning \((\alpha ,1]\).
Lemma 6.1
If \(\mu ^*(x)>0\) for all \(x\in S\) and \(\alpha \) is large enough, then there exists a \(Q^*\in {\mathbb {P}}(D)\) s.t. \(\mu ^* = \mu ^* P^{\bar{Q}^*}\), i.e. \(\mu ^*\) is a stationary distribution for the transition kernel \(P^{{\bar{Q}}^*}\) given in (6.2).
Proof
We use a construction similar to the Metropolis algorithm. For \(x,x'\in S\) let
and
The parameter \(\kappa >0\) should be such that \(P^{{\bar{Q}}^*}\) is a transition matrix. Then the detailed balance equations
$$\begin{aligned} \mu ^*(x)\, p_{xx'}^{{\bar{Q}}^*} = \mu ^*(x')\, p_{x'x}^{{\bar{Q}}^*}, \qquad x,x'\in S, \end{aligned}$$
are satisfied and hence \(\mu ^*\) is a stationary distribution of \(P^{{\bar{Q}}^*}\). We now have to determine \({\bar{Q}}^*\) s.t. \(P^{\bar{Q}^*}\) has the specified form. Let us fix \(x\in S\). We have to solve (6.2) for \({\bar{Q}}^*\). We claim that (6.2) is solved for
This can be seen since
In order to have \({\bar{Q}}^*(a|x)\in [0,1]\) we have to make sure that \( \alpha \ge p_{xx'}^{{\bar{Q}}^*} \vee (1-p_{xx'}^{{\bar{Q}}^*})\) for all \(x,x'\in S\) and \(\alpha \ge \frac{1}{2}\). \(\square \)
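A sketch of this construction (our reading of the unnumbered displays in the proof: \(p^{{\bar{Q}}^*}_{xx'} = \kappa \min (1, \mu ^*(x')/\mu ^*(x))\) for edges with \(x\ne x'\), remaining mass on the diagonal), with an illustrative target on a small path graph:

```python
import numpy as np

# Metropolis-type construction as in the proof of Lemma 6.1 (our reading of
# the unnumbered displays): for an edge (x, x') with x != x' set
#   p_{xx'} = kappa * min(1, mu*(x') / mu*(x)),
# and put the remaining mass on the diagonal.  Detailed balance
# mu*(x) p_{xx'} = mu*(x') p_{x'x} then holds, so mu* is stationary.
def metropolis_kernel(mu_star, edges, kappa):
    d = len(mu_star)
    P = np.zeros((d, d))
    for x, xp in edges:                     # edges as unordered pairs
        P[x, xp] = kappa * min(1.0, mu_star[xp] / mu_star[x])
        P[xp, x] = kappa * min(1.0, mu_star[x] / mu_star[xp])
    np.fill_diagonal(P, 1.0 - P.sum(axis=1))
    assert (P >= -1e-12).all()              # kappa must be small enough
    return P

mu_star = np.array([0.5, 0.3, 0.2])         # toy target on path graph 0-1-2
P = metropolis_kernel(mu_star, edges=[(0, 1), (1, 2)], kappa=0.2)
# detailed balance (mu*(x) p_{xx'} symmetric) and stationarity:
M = mu_star[:, None] * P
assert np.allclose(M, M.T)
assert np.allclose(mu_star @ P, mu_star)
```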
Theorem 6.2
The optimal average reward policy for the limit model considered here is the stationary policy \(\psi ^*=(\varphi ^*,\varphi ^*,\ldots )\) with \(\varphi ^*(\mu )= \mu \otimes {\bar{Q}}^*\) with \({\bar{Q}}^*\) from (6.4). Thus, for N large and \(\beta \) close to one, sampling actions from \({\bar{Q}}^*\) is \(\varepsilon \)-optimal for the \(\beta \)-discounted problem with N individuals.
Proof
The statement follows from our previous discussions. Note that when we start with an arbitrary \(\mu _0^*\), the sequence of distributions generated by \(\mu _{k+1}^* = \mu _k^* P^{{\bar{Q}}^*}\) converges to \(\mu ^*\), since the matrix \(P^{{\bar{Q}}^*}\) is irreducible by construction and the state space is finite. Thus, \(G_\psi (\mu _0^*)\) in (5.1) yields the same limit \(\mu ^* \Delta (\mu ^*)^\top \), which is maximal since it solves (5.4). \(\square \)
Remark 6.3
It is tempting to say that for the discounted problem, once we have reached the stationary distribution after a transient phase we know that the optimal policy is to choose \({\bar{Q}}^*\) forever. However, there are only rare cases where the stationary distribution is reached after a finite number of steps (see e.g. [15]), so the transient phase will in most cases last forever.
Example 6.4
We consider a regular \(3\times 3\) grid, i.e. \(d=9\) (see Fig. 1, left). We set the distance between nodes equal to 1 when there is only one edge between them. Nodes which are connected via 2 edges get the distance 1.4, when there are 3 edges in between 1.7 and finally we set the distance equal to 2.2 when there are 4 edges in between. The distance matrix \(\Delta \) is thus given by
$$\begin{aligned} \Delta = \begin{pmatrix} 0 &amp; 1 &amp; 1.4 &amp; 1 &amp; 1.4 &amp; 1.7 &amp; 1.4 &amp; 1.7 &amp; 2.2\\ 1 &amp; 0 &amp; 1 &amp; 1.4 &amp; 1 &amp; 1.4 &amp; 1.7 &amp; 1.4 &amp; 1.7\\ 1.4 &amp; 1 &amp; 0 &amp; 1.7 &amp; 1.4 &amp; 1 &amp; 2.2 &amp; 1.7 &amp; 1.4\\ 1 &amp; 1.4 &amp; 1.7 &amp; 0 &amp; 1 &amp; 1.4 &amp; 1 &amp; 1.4 &amp; 1.7\\ 1.4 &amp; 1 &amp; 1.4 &amp; 1 &amp; 0 &amp; 1 &amp; 1.4 &amp; 1 &amp; 1.4\\ 1.7 &amp; 1.4 &amp; 1 &amp; 1.4 &amp; 1 &amp; 0 &amp; 1.7 &amp; 1.4 &amp; 1\\ 1.4 &amp; 1.7 &amp; 2.2 &amp; 1 &amp; 1.4 &amp; 1.7 &amp; 0 &amp; 1 &amp; 1.4\\ 1.7 &amp; 1.4 &amp; 1.7 &amp; 1.4 &amp; 1 &amp; 1.4 &amp; 1 &amp; 0 &amp; 1\\ 2.2 &amp; 1.7 &amp; 1.4 &amp; 1.7 &amp; 1.4 &amp; 1 &amp; 1.4 &amp; 1 &amp; 0 \end{pmatrix}. \end{aligned}$$
The optimal distribution for problem (5.4) is here given by \(\mu ^*= \frac{1}{37} (7,2,7,2,1,2,7,2,7)\). The masses are illustrated in Fig. 1 (right picture): the area of each circle is proportional to the corresponding value of \(\mu ^*\), i.e. to the proportion of individuals occupying that node.
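The stated optimizer can be checked numerically. The sketch below rebuilds \(\Delta \) from the distance rule above (graph distance 1, 2, 3, 4 corresponds to 1, 1.4, 1.7, 2.2) and compares the objective \(\mu \Delta \mu ^\top \) of \(\mu ^*\) with the uniform distribution:

```python
import numpy as np

# Rebuild the distance matrix of Example 6.4: nodes of the 3x3 grid in
# row-major order; graph distances 1, 2, 3, 4 edges are mapped to the
# values 1, 1.4, 1.7, 2.2.  Then compare the objective mu Delta mu^T of the
# stated optimizer mu* with the uniform distribution.
coords = [(i, j) for i in range(3) for j in range(3)]
dist_val = {0: 0.0, 1: 1.0, 2: 1.4, 3: 1.7, 4: 2.2}
Delta = np.array([[dist_val[abs(r1 - r2) + abs(c1 - c2)]
                   for (r2, c2) in coords] for (r1, c1) in coords])
assert np.allclose(Delta, Delta.T)            # Delta is symmetric

obj = lambda mu: mu @ Delta @ mu
mu_star = np.array([7, 2, 7, 2, 1, 2, 7, 2, 7]) / 37
uniform = np.full(9, 1 / 9)
assert obj(mu_star) > obj(uniform)   # corner-heavy beats spreading uniformly
```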
We set \(\alpha =1\) and \(\kappa =0.25\). Then we obtain from (6.4) that the optimal decision in every node is given by the following transition kernel \({\bar{Q}}^*(a|x)\)
where \(b=\frac{1}{8}\) and \(c=\frac{1}{14}\). So using this decentralized decision throughout the process yields the maximal average reward. In Fig. 2 we see the evolution of the system when all mass starts initially in node 1. The pictures show the distribution of the mass after 2, 4, 8, 16, 32 and 64 time steps. Note that sampling actions from \({\bar{Q}}^*\) is also \(\varepsilon \)-optimal for the system when we have a finite but large number of individuals and \(\beta \) is close to one for the discounted reward criterion.
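The stationarity \(\mu ^* = \mu ^* P^{{\bar{Q}}^*}\) can be verified directly. The kernel below is our reconstruction of the (omitted) table from the values \(b=\frac{1}{8}\) and \(c=\frac{1}{14}\): a corner moves to each neighbouring edge-midpoint with probability c; an edge-midpoint moves to each neighbouring corner with probability \(\frac{1}{4}\) and to the center with probability b; the center moves to each edge-midpoint with probability \(\frac{1}{4}\); remaining mass stays put.

```python
import numpy as np

# Stationarity check for Example 6.4 with alpha = 1, kappa = 0.25: our
# reconstruction of the kernel table (the text gives b = 1/8 and c = 1/14).
# Nodes 1..9 in row-major order; with alpha = 1 the chain moves exactly as
# prescribed, so P = (Qbar*(a|x)) and we can verify mu* P = mu* directly.
mu_star = np.array([7, 2, 7, 2, 1, 2, 7, 2, 7]) / 37
corners, mids, center = [0, 2, 6, 8], [1, 3, 5, 7], 4
adj = {0: [1, 3], 2: [1, 5], 6: [3, 7], 8: [5, 7]}   # corner -> edge-midpoints

P = np.zeros((9, 9))
for x in corners:                 # corner: each neighbouring midpoint w.p. 1/14
    for m in adj[x]:
        P[x, m] = 1 / 14
    P[x, x] = 1 - 2 / 14
for m in mids:                    # midpoint: corners w.p. 1/4, center w.p. 1/8
    for x in corners:
        if m in adj[x]:
            P[m, x] = 1 / 4
    P[m, center] = 1 / 8
    P[m, m] = 1 - 2 * (1 / 4) - 1 / 8
for m in mids:                    # center: each edge-midpoint w.p. 1/4
    P[center, m] = 1 / 4

assert np.allclose(P.sum(axis=1), 1.0)      # P is a transition matrix
assert np.allclose(mu_star @ P, mu_star)    # mu* is stationary
```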
6.1.2 With Common Noise
Next we suppose that \(\alpha \) depends on the common noise \(Z^0\). In this case the maximal average reward which can be achieved is less than or equal to that of the case without common noise, since the sequence of distributions is stochastic and may deviate from the optimal one. We simplify matters by assuming that \(|D(x)| = \gamma \) is independent of x. From the previous section, equation (6.5), we know that we can write
In matrix notation
where U is a \(d\times d\) matrix containing ones only and \({\bar{Q}}=({\bar{Q}}(x'|x)).\) Here the situation is more complicated, in particular the next empirical distribution of individuals is stochastic and given by
with \(e=(1,\ldots ,1)\in {\mathbb {R}}^d\). Plugging this into the reward function yields
Now consider the problem
Obviously this problem has an optimal solution \(\nu ^*\) since we maximize a continuous function over a compact set. Here \(\nu \) corresponds to \(\mu _n {\bar{Q}}_n\) in (6.6). If it is possible to choose for every \(\mu \in {\mathbb {P}}(S)\) a matrix \({\bar{Q}}\) s.t. \(\mu {\bar{Q}}= \nu ^*\), then this is the optimal strategy, since we obtain the maximal expected reward in each step. This is for example possible if the graph is complete: then we can simply choose \({\bar{Q}}\) as the matrix whose identical rows consist of \(\nu ^*\).
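On a complete graph the argument is immediate: with identical rows \(\nu ^*\) the product \(\mu {\bar{Q}}\) equals \(\nu ^*\) for every \(\mu \). A short check (with an illustrative \(\nu ^*\)):

```python
import numpy as np

# Complete-graph case: choosing Qbar with identical rows nu* gives
# mu @ Qbar = nu* for every mu, so the target pre-noise distribution is
# reached in one step.  The vector nu* below is illustrative.
nu_star = np.array([0.5, 0.3, 0.2])
Qbar = np.tile(nu_star, (3, 1))            # every row equals nu*
for mu in (np.array([1.0, 0.0, 0.0]), np.array([0.2, 0.2, 0.6])):
    assert np.allclose(mu @ Qbar, nu_star)
```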
6.2 Positioning on a Market Place
Suppose we have a rectangular market place as in Fig. 3. The state \(\mu \) represents the distribution of individuals over the market place. Point A is an ice cream vendor. The aim of the individuals is to keep their distance from others while being as close as possible to the ice cream vendor. Thus, \(S\subset {\mathbb {R}}^2\) is the rectangle BCED and the one-stage reward is
$$\begin{aligned} r(x,a,\mu ) = \int _S d(x,y)\,\mu (dy) - d(x,A). \end{aligned}$$
In what follows in order to simplify the computation we choose \(d(x,y)=\Vert x-y\Vert ^2\) for \(x,y\in S\). We want to solve (5.4) in this case. Let us formulate the problem with the help of random variables. Let \(X=(X_1,X_2), Y=(Y_1,Y_2)\) be independent r.v. having distribution \(\mu \). Then \({\tilde{r}}(\mu )\) is the same as
$$\begin{aligned} {\mathbb {E}}\Vert X-Y\Vert ^2 - {\mathbb {E}}\Vert X-A\Vert ^2. \end{aligned}$$
Thus, we can treat the margins separately and the dependence between them is not interesting for the reward. Now obviously since X and Y both have the same distribution we can write
$$\begin{aligned} {\tilde{r}}(\mu ) = \sum _{i=1}^2 \Big ( {\mathbb {E}}(X_i-Y_i)^2 - {\mathbb {E}}(X_i-A_i)^2 \Big ) = \sum _{i=1}^2 \Big ( \textrm{Var}(X_i) - ({\mathbb {E}}X_i - A_i)^2 \Big ), \end{aligned}$$using \({\mathbb {E}}(X_i-Y_i)^2 = 2\,\textrm{Var}(X_i)\) and \({\mathbb {E}}(X_i-A_i)^2 = \textrm{Var}(X_i) + ({\mathbb {E}}X_i - A_i)^2\).
Suppose we fix \({\mathbb {E}}X_i\) for a moment. Since \(x\mapsto x^2\) is convex, the distribution which maximizes the expression is maximal in convex order, given the fixed expectation. But this distribution is due to the convexity property concentrated on the endpoints of the interval. Thus we can restrict to random variables \(X_1\) which have mass \(p\in [0,1]\) on \(B_1\) and \(1-p\) on \(C_1\), i.e. we maximize
$$\begin{aligned} f(p) = p(1-p)(C_1-B_1)^2 - \big ( pB_1 + (1-p)C_1 - A_1 \big )^2 \end{aligned}$$
over \(p\in [0,1]\).
The solution is given by \(p= \frac{1}{4} + \frac{C_1-A_1}{2(C_1-B_1)}\). Since the joint distribution does not matter we can choose independent margins and obtain
$$\begin{aligned} \mu ^* = \big ( p\,\delta _{B_1} + (1-p)\,\delta _{C_1}\big ) \otimes \big ( q\,\delta _{B_2} + (1-q)\,\delta _{D_2}\big ), \qquad q = \frac{1}{4} + \frac{D_2-A_2}{2(D_2-B_2)}. \end{aligned}$$
This is the target distribution which should be attained. For a numerical example we choose B(0, 0), C(4, 0), D(0, 3), E(4, 3) and A(2.5, 2). In this case we obtain
$$\begin{aligned} p = \tfrac{7}{16} \ \text{ for } \text{ the } \text{ first } \text{ coordinate } \text{ and, } \text{ analogously, } \ \tfrac{5}{12} \ \text{ for } \text{ the } \text{ second, } \end{aligned}$$so that \(\mu ^*\) puts mass \(\tfrac{35}{192}\), \(\tfrac{45}{192}\), \(\tfrac{49}{192}\) and \(\tfrac{63}{192}\) on the corners B, C, D and E, respectively.
The distribution is illustrated in Fig. 3, (right).
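The maximizer can be confirmed numerically. We assume the separated one-margin objective \(f(p) = p(1-p)(C_1-B_1)^2 - (pB_1+(1-p)C_1-A_1)^2\) (variance minus squared distance of the mean to the vendor coordinate; our reading of the omitted display) and check the stated formula for p with the data of the numerical example:

```python
import numpy as np

# Grid-search check of the maximizer p = 1/4 + (C1 - A1)/(2 (C1 - B1)) for
# the assumed one-margin objective
#   f(p) = p (1-p) (C1 - B1)^2 - (p B1 + (1-p) C1 - A1)^2,
# i.e. variance minus squared distance of the mean to the vendor coordinate.
B1, C1, A1 = 0.0, 4.0, 2.5               # data of the numerical example
f = lambda p: p * (1 - p) * (C1 - B1) ** 2 - (p * B1 + (1 - p) * C1 - A1) ** 2

p_star = 0.25 + (C1 - A1) / (2 * (C1 - B1))    # = 7/16 here
grid = np.linspace(0.0, 1.0, 100_001)
assert abs(grid[np.argmax(f(grid))] - p_star) < 1e-4
```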
Depending on the precise form of the transition law, if one is able to choose \({\bar{Q}}^*\) such that \(\mu ^*\) is the stationary distribution of \(P^{{\bar{Q}}^*}\), the problem is solved. Of course, the optimal distribution \(\mu ^*\) depends on the kind of distance d we choose. Varying the metric for the distance leads to interesting optimization problems.
7 Conclusion
We have seen that the average reward mean-field problem can in some cases be solved rather easily by computing an optimal measure from a static optimization problem. The policy which is obtained in this way is \(\varepsilon \)-optimal for the \(\beta \)-discounted N-individuals problem where N is large and \(\beta \) close to one. The static optimization problem for measures gives rise to some interesting mathematical questions.
References
Bäuerle, N., Lange, D.: Optimal control of partially observable piecewise deterministic Markov processes. SIAM J. Control Optim. 56(2), 1441–1462 (2018)
Bäuerle, N., Rieder, U.: Markov Decision Processes with Applications to Finance. Springer-Verlag, Berlin Heidelberg (2011)
Bäuerle, N.: Convex stochastic fluid programs with average cost. J. Math. Anal. Appl. 259(1), 137–156 (2001)
Bertsekas, D.P., Tsitsiklis, J.N.: Neuro-dynamic Programming. Athena Scientific, Belmont, Mass (1996)
Biswas, A.: Mean field games with ergodic cost for discrete time Markov processes. arXiv preprint arXiv:1510.08968 (2015)
Cao, H., Dianetti, J., Ferrari, G.: Stationary discounted and ergodic mean-field games of singular control. arXiv preprint arXiv:2105.07213 (2021)
Carmona, R., Delarue, F.: Probabilistic Theory of Mean Field Games with Applications I-II. Springer Nature, Berlin (2018)
Carmona, R., Laurière, M., Tan, Z.: Model-free mean-field reinforcement learning: Mean-field MDP and mean-field Q-learning. arXiv preprint arXiv:1910.12802 (2019)
Carmona, R., Laurière, M., Tan, Z.: Linear-quadratic mean-field reinforcement learning: Convergence of policy gradient methods. arXiv preprint arXiv:1910.04295 (2019)
Chang, H.S., Hu, J., Fu, M.C., Marcus, S.I.: Simulation-Based Algorithms for Markov Decision Processes. Springer, London (2007)
Elliott, R., Li, X., Ni, Y.H.: Discrete time mean-field stochastic linear-quadratic optimal control problems. Automatica 49(11), 3222–3233 (2013)
Flynn, J.: Steady state policies for deterministic dynamic programs. SIAM J. Appl. Math. 37(1), 128–147 (1979)
Gast, N., Gaujal, B.: A mean field approach for optimization in discrete time. Discret. Event Dyn. Syst. 21(1), 63–101 (2011)
Gast, N., Gaujal, B., Le Boudec, J.Y.: Mean field for Markov decision processes: From discrete to continuous optimization. IEEE Trans. Autom. Control 57(9), 2266–2280 (2012)
Glynn, P.W., Iglehart, D.L.: Conditions under which a Markov chain converges to its steady state in finite time. Probab. Eng. Inform. Sci. 2(3), 377–382 (1988)
Gomes, D.A., Mohr, J., Souza, R.R.: Discrete time, finite state space mean field games. J. Math. Pures Appl. 93(3), 308–328 (2010)
Gu, H., Guo, X., Wei, X., Xu, R.: Dynamic programming principles for learning MFCs. arXiv preprint arXiv:1911.07314 (2019)
Gu, H., Guo, X., Wei, X., Xu, R.: Q-Learning for Mean-Field Controls. arXiv preprint arXiv:2002.04131 (2020)
Hernández-Lerma, O., Lasserre, J.B.: Average optimality in Markov control processes via discounted-cost problems and linear programming. SIAM J. Control Optim. 34(1), 295–310 (1996)
Higuera-Chan, C.G., Jasso-Fuentes, H., Minjárez-Sosa, J.A.: Discrete-time control for systems of interacting objects with unknown random disturbance distributions: a mean field approach. Appl. Math. Optim. 74(1), 197–227 (2016)
Higuera-Chan, C.G., Jasso-Fuentes, H., Minjárez-Sosa, J.A.: Control systems of interacting objects modeled as a game against nature under a mean field approach. J. Dyn. Games 4(1), 59 (2017)
Hordijk, A., Yushkevich, A.A.: Blackwell optimality in the class of stationary policies in Markov decision chains with a Borel state space and unbounded rewards. Math. Methods Oper. Res. 49(1), 1–39 (1999)
Jovanovic, B., Rosenthal, R.W.: Anonymous sequential games. J. Math. Econ 17, 77–87 (1988)
Lasry, J.M., Lions, P.L.: Jeux à champ moyen. I—Le cas stationnaire. Comptes Rendus Math. 343(9), 619–625 (2006)
Li, S. H., Yu, Y., Calderone, D., Ratliff, L., Açikmeşe, B.: Tolling for constraint satisfaction in Markov decision process congestion games. In: 2019 American Control Conference (ACC) (pp. 1238–1243). IEEE (2019)
McKean, H.P.: A class of Markov processes associated with nonlinear parabolic equations. Proc. Natl. Acad. Sci. USA 56(6), 1907–1911 (1966)
Motte, M., Pham, H.: Mean-field Markov decision processes with common noise and open-loop controls. arXiv preprint arXiv:1912.07883. To appear in Annals of Applied Probability (2021+)
Peyrard, N., Sabbadin, R.: Mean field approximation of the policy iteration algorithm for graph-based Markov decision processes. Front. Artif. Intell. Appl. 141, 595 (2016)
Pham, H., Wei, X.: Discrete time McKean–Vlasov control problem: a dynamic programming approach. Appl. Math. Optim. 74(3), 487–506 (2016)
Powell, W.B.: Approximate Dynamic Programming: Solving the Curses of Dimensionality, vol. 703. Wiley, New York (2007)
Rudolf, D., Schweizer, N.: Perturbation theory for Markov chains via Wasserstein distance. Bernoulli 24(4A), 2610–2639 (2018)
Saldi, N., Basar, T., Raginsky, M.: Markov-Nash equilibria in mean-field games with discounted cost. SIAM J. Control Optim. 56(6), 4256–4287 (2018)
Schäl, M.: Average optimality in dynamic programming with general state space. Math. Oper. Res. 18(1), 163–172 (1993)
Sennott, L.I.: Stochastic Dynamic Programming and the Control of Queueing Systems, vol. 504. Wiley, New York (2009)
Weintraub, G.Y., Benkard, L., Van Roy, B.: Oblivious equilibrium: a mean field approximation for large-scale dynamic games. Adv. Neural Inf. Process. Syst. 18, 1489–1496 (2005)
Wiȩcek, P.: Discrete-time ergodic mean-field games with average reward on compact spaces. Dyn. Games Appl. 10(1), 222–256 (2020)
Acknowledgements
The author would like to thank two anonymous referees for their comments which helped to improve the paper.
Funding
Open Access funding enabled and organized by Projekt DEAL. The authors have not disclosed any funding.
8 Appendix
8.1 Auxiliary Results
The following result can be found in [1], Lemma 7.2:
Lemma 8.1
Let X be a separable metric space, Y a compact metric space and \(f:X \times Y\rightarrow {\mathbb {R}}\) continuous. Then \(x_n\rightarrow x\) for \(n\rightarrow \infty \) implies
\(\lim _{n\rightarrow \infty } \sup _{y\in Y} f(x_n,y) = \sup _{y\in Y} f(x,y).\)
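As a quick numerical sanity check of this continuity of \(x\mapsto \sup _{y\in Y}f(x,y)\) (with an invented test function, not from the paper; assuming NumPy is available), take \(f(x,y)=xy-y^2\) on \(Y=[0,1]\) and a sequence \(x_n\rightarrow 1.5\):

```python
import numpy as np

# Toy illustration of Lemma 8.1: for f(x, y) = x*y - y^2 on the compact set
# Y = [0, 1], the value sup_y f(x_n, y) converges to sup_y f(x, y) as x_n -> x.
ys = np.linspace(0.0, 1.0, 2001)  # fine grid over the compact set Y

def sup_f(x):
    # sup over Y, approximated on the grid (maximizer y = x/2 when x/2 <= 1)
    return float(np.max(x * ys - ys**2))

xs = 1.5 + 1.0 / np.arange(1, 500)  # a sequence x_n -> 1.5
vals = [sup_f(x) for x in xs]
limit = sup_f(1.5)  # analytically x^2/4 = 0.5625
```

The computed values approach the limit value, as the lemma predicts.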
8.2 Wasserstein Ergodicity
For the following definitions and results see [31].
Definition 8.2
For two probability measures \(\mu ,\nu \) on S, the dual representation of the Wasserstein distance is given by
\(W(\mu ,\nu ) = \sup \big \{ \int f \, d\mu - \int f \, d\nu \,:\, f\in \textrm{Lip}_1 \big \},\)
where
\(\textrm{Lip}_1 := \big \{ f:S\rightarrow {\mathbb {R}} \,:\, |f(x)-f(y)| \le d_S(x,y) \text{ for all } x,y\in S \big \}\)
and \(d_S\) denotes the metric on S.
Note that on a compact space, convergence in the Wasserstein metric is equivalent to weak convergence.
Definition 8.3
A transition kernel \(P(\cdot | x)\) from S to S is called Wasserstein ergodic if there exist constants \(\rho \in (0,1)\) and \(C>0\) s.t. for all \(n\in {\mathbb {N}}\) and all \(\mu ,\nu \in {\mathbb {P}}(S)\)
\(W(\mu P^n, \nu P^n) \le C \rho ^n\, W(\mu ,\nu ).\)
Suppose P is Wasserstein ergodic and has stationary distribution \(\mu ^*\) which means that \(\mu ^* = \int P(\cdot |x)\mu ^*(dx)=:\mu ^* P\). Then for any \(\mu _0\in {\mathbb {P}}\) and \(\mu _n = \mu _0 P^n\) we obtain \(W(\mu _n,\mu ^*) \le C \rho ^n\).
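The geometric decay \(W(\mu _n,\mu ^*)\le C\rho ^n\) can be observed numerically. The following sketch (invented three-state kernel, not from the paper; assuming NumPy) computes the exact 1-D Wasserstein distance via the CDF formula:

```python
import numpy as np

# Toy illustration: for an ergodic finite-state kernel P, the distance
# W(mu_0 P^n, mu*) decays geometrically; W1 on ordered points is the
# integral of the absolute CDF difference.
P = np.array([[0.5, 0.4, 0.1],
              [0.2, 0.6, 0.2],
              [0.1, 0.4, 0.5]])
states = np.array([0.0, 1.0, 2.0])  # points of a compact 1-D state space

# stationary distribution mu*: normalized left eigenvector for eigenvalue 1
w, v = np.linalg.eig(P.T)
mu_star = np.real(v[:, np.argmax(np.real(w))])
mu_star /= mu_star.sum()

def w1(p, q, x):
    # 1-D formula: W1(p, q) = integral of |F_p - F_q| over the real line
    cdf_diff = np.cumsum(p - q)[:-1]
    return float(np.sum(np.abs(cdf_diff) * np.diff(x)))

mu = np.array([1.0, 0.0, 0.0])  # start in state 0
dists = []
for _ in range(10):
    dists.append(w1(mu, mu_star, states))
    mu = mu @ P
```

The distances decay at the rate of the second-largest eigenvalue modulus of P, matching the \(C\rho ^n\) bound.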
8.3 Additional Proofs
8.3.1 Proof of Theorem 2.3
We first show that \(U:{\mathbb {M}}\rightarrow {\mathbb {M}}\). Hence, let \(v\in {\mathbb {M}}\). Since r and v are bounded, Uv is bounded. (A2) implies that \(({\textbf{x}},{\textbf{a}}) \mapsto {\textbf{r}}({\textbf{x}},{\textbf{a}})\) is upper semicontinuous. This follows since \(({\textbf{x}}_n,{\textbf{a}}_n) \rightarrow ({\textbf{x}},{\textbf{a}})\) for \(n\rightarrow \infty \) implies \(x_n^i\rightarrow x^i, a_n^i \rightarrow a^i\), \(i=1,\ldots ,N\) and \(\mu [{\textbf{x}}_n] \rightarrow \mu [{\textbf{x}}]\) (in the weak topology) for \(n\rightarrow \infty \), and since the sum of upper semicontinuous functions is upper semicontinuous. Finally, due to (A3) and the fact that v is upper semicontinuous,
is upper semicontinuous. This together implies that
is upper semicontinuous and \(U:{\mathbb {M}}\rightarrow {\mathbb {M}}\) follows from Proposition 2.4.3 in [2].
Next note that \({\mathbb {M}}\) together with the sup-norm \(\Vert v\Vert = \sup _{{\textbf{x}}\in S^N} |v({\textbf{x}})|\) is a Banach space. Also \(0\in {\mathbb {M}}\), the function identically equal to zero. Moreover, for \(v,w\in {\mathbb {M}}\):
\(\Vert Uv - Uw \Vert \le \beta \Vert v-w\Vert ,\)
thus U is contracting since \(\beta \in (0,1)\). Next, the properties in (A0), (A1) imply that \({\textbf{D}}({\textbf{x}})\) is compact and \({\textbf{x}}\mapsto {\textbf{D}}({\textbf{x}})\) is upper semicontinuous. From the first part of the proof we know that the mapping in (8.1) is upper semicontinuous. Thus, the existence result for maximizers from Proposition 2.4.3 in [2] implies that for all \(v\in {\mathbb {M}}\) there exists a maximizer \(f\in F\).
Altogether, we have shown all assumptions from Theorem 7.3.5 in [2] which directly implies the statement.\(\Box \)
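The contraction property of U is what drives the fixed-point argument. A minimal numerical sketch (toy two-state model with invented data, not the paper's mean-field operator; assuming NumPy) shows that value iteration contracts the sup-norm gap by the factor \(\beta \) in every step:

```python
import numpy as np

# Toy sketch: U is a beta-contraction in sup-norm, so the iterates
# v_{k+1} = U v_k converge geometrically to the unique fixed point.
beta = 0.9
r = np.array([[1.0, 0.0],            # r[s, a]
              [0.0, 2.0]])
P = np.array([[[0.8, 0.2], [0.3, 0.7]],   # P[a, s, s']
              [[0.5, 0.5], [0.9, 0.1]]])

def U(v):
    # (Uv)(s) = max_a { r(s, a) + beta * sum_{s'} P(s'|s, a) v(s') }
    q = r + beta * np.einsum('asx,x->sa', P, v)
    return q.max(axis=1)

v = np.zeros(2)
gaps = []                             # sup-norm distances of successive iterates
for _ in range(60):
    v_new = U(v)
    gaps.append(float(np.max(np.abs(v_new - v))))
    v = v_new
```

Each gap is at most \(\beta \) times the previous one, exactly as the contraction estimate predicts.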
8.3.2 Proof of Theorem 3.5
We only have to show that \({\hat{U}}:\mathbb {{\hat{M}}}\rightarrow \mathbb {\hat{M}}\). The statement then follows from Theorem 2.3 and Theorem 3.3 since we can identify policies, rewards, transition laws and operators. To show \({\hat{U}}:\mathbb {{\hat{M}}}\rightarrow \mathbb {{\hat{M}}}\) we use Proposition 2.4.3 in [2]. Thus, we have to check the following continuity and compactness assumptions.
(i) \({\hat{D}}(\mu )\) is compact and \(\mu \mapsto {\hat{D}}(\mu )\) is upper semicontinuous on \({\mathbb {P}}_N(S)\).
(ii) \((\mu ,Q)\mapsto {\hat{r}}(\mu ,Q)\) is upper semicontinuous and bounded on \({\hat{D}}\).
(iii) for \(v\in \mathbb {{\hat{M}}}\) the mapping \((\mu ,Q)\mapsto {\mathbb {E}}v( \hat{ T}(\mu ,Q,{\textbf{Z}},Z^0))\) is upper semicontinuous and bounded on \({\hat{D}}\).
For (i) first note that \({\hat{D}}(\mu )\) is compact for all \(\mu \) since D is compact. Upper semicontinuity of \(\mu \mapsto {\hat{D}}(\mu )\) can be seen as follows: let \((\mu _n) \subset {\mathbb {P}}_N(S)\) with \(\mu _n \Rightarrow \mu \) for \(n\rightarrow \infty \) and \(Q_n\in {\hat{D}}(\mu _n)\). Since \({\hat{D}}\) is compact, there exists an accumulation point \(Q\in {\mathbb {P}}_N(D)\) s.t. \(Q_{n_k}\Rightarrow Q\) for a subsequence \((n_k)\). The corresponding sequence of first margins converges to \(\mu \), hence \(Q\in {\hat{D}}(\mu )\).
Part (ii) follows from the fact that
for \(Q=\mu [({\textbf{x}},{\textbf{a}})]\), (A2) and the observation that \((Q_n)\subset {\mathbb {P}}_N(D), Q_n \Rightarrow Q\in {\mathbb {P}}_N(D)\) implies pointwise convergence \(x_i^{(n)} \rightarrow x_i, a_i^{(n)}\rightarrow a_i\) for \(n\rightarrow \infty \), \(i=1,\ldots ,N\).
Finally for (iii) note that
is continuous on \({\hat{D}}\) which follows from (A3). This implies (iii) and the statement follows from Proposition 2.4.3 in [2].
8.3.3 Proof of Theorem 4.3
In order to show the statement we use Theorem 7.3.5 in [2]. Thus, we first prove that \(\tilde{U}:\tilde{{\mathbb {M}}}\rightarrow \tilde{{\mathbb {M}}}\). We do this by showing that
(i) \({\tilde{D}}(\mu )\) is compact and \(\mu \mapsto {\tilde{D}}(\mu )\) is continuous.
(ii) \((\mu ,Q)\mapsto {\tilde{r}}(\mu ,Q)\) is continuous and bounded.
(iii) for \(v\in \tilde{{\mathbb {M}}}\) the mapping \((\mu ,Q)\mapsto {\mathbb {E}}v({\tilde{T}}(\mu ,Q,Z^0))\) is continuous and bounded.
Consider (i): \({\tilde{D}}(\mu )\) is compact for all \(\mu \) since D is compact. Next, the mapping \(\mu \mapsto {\tilde{D}}(\mu )\) is continuous if and only if it is upper and lower semicontinuous. Upper semicontinuity follows as in the proof of Theorem 3.5. Lower semicontinuity means that when \(\mu _n \Rightarrow \mu \in {\mathbb {P}}(S)\) for \(n\rightarrow \infty \), then for each \(Q\in {\tilde{D}}(\mu )\) we find a sequence \((Q_n)\) with \(Q_n \Rightarrow Q\) and \(Q_n \in {\tilde{D}}(\mu _n)\). This can be achieved as follows: we decompose Q as \(Q=\mu \otimes \bar{Q}\) with a kernel \(\bar{Q}\) and define \(Q_n:= \mu _n \otimes \bar{Q}\); the constructed sequence then has the desired properties.
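The decomposition argument can be made concrete on finite spaces (toy two-point state and action spaces, invented numbers; assuming NumPy): composing a marginal with a fixed kernel recovers that marginal exactly, and converging marginals give converging joints.

```python
import numpy as np

# Finite-space sketch of Q = mu (x) Qbar: joint[i, j] = mu[i] * Qbar[i, j].
# The first margin of mu_n (x) Qbar is mu_n by construction, and mu_n -> mu
# entails mu_n (x) Qbar -> mu (x) Qbar entrywise.
def compose(mu, Qbar):
    return mu[:, None] * Qbar  # joint measure on S x A

Qbar = np.array([[0.7, 0.3],   # Qbar[i, j]: action distribution given state i
                 [0.4, 0.6]])
mu = np.array([0.5, 0.5])
Q = compose(mu, Qbar)

# approximating sequence mu_n -> mu
mus = [np.array([0.5 + 1.0 / n, 0.5 - 1.0 / n]) for n in range(4, 100)]
Qs = [compose(m, Qbar) for m in mus]
```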
For (ii) suppose that \((\mu _n, Q_n) \Rightarrow (\mu ,Q)\) for \(n\rightarrow \infty \). We have to show that
We obtain:
The first term converges to zero due to Assumption (A2’) and Lemma 8.1. The second term converges to zero since \(Q_n\Rightarrow Q\) for \(n\rightarrow \infty \) and (A2’). Boundedness follows from the boundedness of r.
Next we show (iii). Boundedness is clear. In order to show continuity we first consider the mapping
for fixed \(z^0\). We claim that this mapping is continuous. Let \(h: S\rightarrow {\mathbb {R}}\) be continuous and bounded. By \({\mathbb {P}}^Z\) we denote the distribution of the r.v. \(Z_n^i\). We have to show that
is a.s. continuous. Let \((\mu _n,Q_n) \rightarrow (\mu ,Q)\). We obtain:
In the first term we can interchange the limit \(\lim _{n\rightarrow \infty }\) and the integral due to dominated convergence and obtain
due to (A2’) and Lemma 8.1. The second term converges to zero for \(n\rightarrow \infty \) since \((x,a) \mapsto h( T(x,a,\mu ,z,z^0))\) is continuous due to (A3). In total we have shown that the mapping in (8.2) is continuous.
Finally take \(v\in \tilde{{\mathbb {M}}}\) and pick a sequence with \((\mu _n,Q_n)\rightarrow (\mu ,Q)\) for \(n\rightarrow \infty \). We obtain with dominated convergence, the continuity of v and the continuity of (8.2)
which shows the stated continuity of \((\mu ,Q)\mapsto {\mathbb {E}}v( \tilde{T}(\mu ,Q,Z^0))\). Now Proposition 2.4.8 in [2] implies that \({\tilde{U}}: \tilde{{\mathbb {M}}} \rightarrow \tilde{{\mathbb {M}}}\).
The next condition in Theorem 7.3.5 [2] is that \({\tilde{U}}\) is contracting on \( \tilde{{\mathbb {M}}}\). But this follows along the same lines as in the proof of Theorem 2.3. Finally, the existence of maximizers which is another assumption in Theorem 7.3.5 [2] follows again from Proposition 2.4.8 in [2].
In total the statement is a consequence of Theorem 7.3.5 in [2] with the set \(\tilde{{\mathbb {M}}}\). \(\Box \)
8.3.4 Proof of Theorem 4.6
We partition the proof into three steps.
Step 1: Let \(Q^N\Rightarrow Q\) for \(N\rightarrow \infty \) where \( Q^N \in {\mathbb {P}}_N(D).\) Hence there exist \({\textbf{x}}^N=(x_1^N,\ldots ,x_N^N)\) and \({\textbf{a}}^N=(a_1^N,\ldots ,a_N^N)\in {\textbf{D}}({\textbf{x}}^N)\) s.t. \(\mu [({\textbf{x}}^N,{\textbf{a}}^N)]=Q^N\) and \(\mu [{\textbf{x}}^N]=\mu ^N\) and \(Q^N\in {\hat{D}}(\mu ^N)\).
Further, suppose we fix \(\omega \in \Omega \) and consider a realization \({\textbf{z}}^N=(z_1^N,\ldots ,z_N^N)\) of \((Z_1^N,\ldots ,Z_N^N)\) and \(z^0\) of \(Z_1^0.\) We show that \(\hat{T}(\mu ^N, Q^N,{\textbf{z}}^N,z^0) \Rightarrow {\tilde{T}}(\mu ,Q,z^0)\) where \(\mu \) is the first margin of Q. In order to show this let \(h:S\rightarrow {\mathbb {R}}\) be bounded and continuous. We obtain:
Since h, T are continuous, \(D, {\mathcal {Z}}\) are compact and \(\mu ^N\Rightarrow \mu \) we can for all \(\varepsilon >0\) choose N large enough s.t.
Hence the first term in (8.3) converges to zero for \(N\rightarrow \infty \). Let \(\mu ^N_z\) be the empirical measures of \({\textbf{z}}^N\). We obtain:
Since \(Q^N \otimes \mu _z^N \Rightarrow Q\otimes {\mathbb {P}}^Z\) for \(N\rightarrow \infty \) by the Glivenko-Cantelli Theorem, the r.h.s. of (8.4) converges to
Thus, we get \({\hat{T}}(\mu ^N, Q^N,{\textbf{Z}}^N,Z^0) \Rightarrow \tilde{T}(\mu ,Q,Z^0)\) \({\mathbb {P}}\)-a.s. In the proof of Theorem 4.3 we have shown that this implies \(\lim _{N\rightarrow \infty } \tilde{r}(\mu ^N,Q^N) = {\tilde{r}}(\mu ,Q)\).
Step 2: Suppose \(\psi ^N=(\varphi _0^N,\varphi _1^N,\ldots )\) is an arbitrary policy for \(\widehat{\textrm{MDP}}\). Let \(Q_0^N = \varphi _0^N(\mu _0^N)\). Now \((Q_0^N)\) is a sequence of measures on the compact space D. Hence there is a subsequence \((m_N)\) s.t. \(Q_0^{m_N}\Rightarrow Q_0\in {\mathbb {P}}(D)\) for \(N\rightarrow \infty \). From Step 1 we know that \(\lim _{N\rightarrow \infty } {\tilde{r}}(\mu _0^{m_N},Q_0^{m_N}) = {\tilde{r}}(\mu _0,Q_0)\) where \(\mu _0\) is the first margin of \(Q_0\) and that
Let \(Q_1^{m_N}=\varphi _1^{m_N}(\mu _1^{m_N})\) and choose again a subsequence \((m_N')\) s.t. \(Q_1^{m_N'} \Rightarrow Q_1\) where the first margin of \(Q_1\) is \({\tilde{T}}(\mu _0,Q_0,Z_1^0)\). When we consider the first \(L\in {\mathbb {N}}\) transitions in this way, we find a joint subsequence (for convenience still denoted by \((m_N)\)) s.t. for \(N\rightarrow \infty \) \({\mathbb {P}}\)-a.s.
and where the limit is by construction an admissible state-action sequence for \(\widetilde{\textrm{MDP}}\). This is because the subsequences are taken such that the limits satisfy \(Q_n\in {\mathbb {P}}(D)\), that the first margin of \(Q_n\) is \(\mu _n\), and finally because of (8.5), which by induction is satisfied not only at time point one but for all \(n=1,\ldots ,L\). Hence
Since \(|r|\le C\) we can choose L large enough s.t.
This implies \(\limsup _{N\rightarrow \infty } J^N(\mu ^N_0)\le J(\mu _0)\).
Step 3: We finally have to show that we can construct from \(\varphi ^*\) a policy \(\psi ^N=(\varphi _0^N,\varphi _1^N,\ldots )\) s.t. \(\limsup _{N\rightarrow \infty } J^N(\mu ^N_0)= J(\mu _0)\). This proves a) and b). Suppose \(\varphi ^*(\mu _0)=Q_0^*\). It is possible to construct a sequence \(Q_0^N\in {\mathbb {P}}_N(D)\) s.t. \(Q_0^N \Rightarrow Q_0^*\) and \(\mu _0^N\) is the first margin of \(Q_0^N\). This can be done as follows: suppose \(Q^*_0 = \mu _0\otimes {\bar{Q}}_0\). Then \(\mu _0^N\Rightarrow \mu _0\) by assumption, and we set \(Q_0^N = \mu _0^N \otimes {\bar{Q}}_0^N\), where the kernel \({\bar{Q}}_0^N\) is an appropriate discretization of \({\bar{Q}}_0\) (e.g. by quantization or quasi-Monte Carlo methods). Applying the results in Step 1 we obtain \(\lim _{N\rightarrow \infty } {\tilde{r}}(\mu _0^N,Q_0^N) = {\tilde{r}}(\mu ^*,Q^*)\) and \(\mu _1^N={\hat{T}}(\mu _0^N, Q_0^N,{\textbf{Z}}_1,Z^0_1) \Rightarrow {\tilde{T}}(\mu ^*,Q^*,Z^0_1)=\mu _1^*\) \({\mathbb {P}}\)-a.s. Continuing as in Step 1 we can attain the upper bound \(J(\mu _0)\) in the limit. In order to implement this strategy the central controller has to know \(Q_n^*\) or \(\mu _n^*\) at time n. If there is no common noise, the sequence \((\mu _0^*,Q_0^*,\mu _1^*,Q_1^*,\ldots )\) is deterministic and we only have to know the time step n, so the policy is non-stationary. If common noise is present, then in order to know \(Q_n^*\) the central controller has to keep track of the history \((Z_1^0,Z_2^0,\ldots )\), so the policy \(\psi ^N\) is history-dependent. However, we know from MDP theory that such a policy can always be dominated by a Markovian policy, so
which yields the statements of the theorem. \(\Box \)
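The discretization step used in Step 3 can be illustrated numerically. In this hedged sketch (the measure being discretized is Uniform[0,1], a stand-in for the kernel, not the paper's object; assuming NumPy), an N-point midpoint quantization converges to the target in Wasserstein distance at rate 1/(4N):

```python
import numpy as np

# Sketch of quantization: approximate a continuous distribution by an N-point
# empirical measure so that the approximation converges weakly as N grows.
def quantize(N):
    return (np.arange(N) + 0.5) / N  # N equally weighted midpoint atoms

def w1_to_uniform(atoms):
    # W1 distance to Uniform[0,1] = integral of |F_empirical - F_uniform|
    grid = np.linspace(0.0, 1.0, 20001)
    F_emp = np.searchsorted(np.sort(atoms), grid, side='right') / len(atoms)
    return float(np.sum(np.abs(F_emp - grid)) * (grid[1] - grid[0]))

errs = [w1_to_uniform(quantize(N)) for N in (5, 10, 20, 40)]
```

Doubling N halves the error, consistent with the exact value 1/(4N) for midpoint atoms.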
8.3.5 Proof of Theorem 5.3
Let \(\rho =\limsup _{\beta \uparrow 1} \rho (\beta )\) and let \((\beta _n)\) with \(\beta _n \uparrow 1\) be a sequence s.t. \(\rho =\lim _{n\rightarrow \infty } \rho (\beta _n)\). Define
where d is a metric on \({\mathbb {P}}(S)\). Note that h is the limit of a sequence of bounded continuous functions which is decreasing in n, and is thus at least upper semicontinuous.
Let us now consider the \(\beta \)-discounted optimality equation
where \(\varphi ^\beta \) is an optimal decision rule in the \(\beta \)-discounted model. Subtracting \(\beta J^\beta (\nu )\) on both sides yields:
From Lemma 3.4 in [33] we know that there exist sequences \((k_n)\) of integer-valued measurable mappings and \((\mu _n)\) of \({\mathbb {P}}(S)\)-valued measurable mappings on \({\mathbb {P}}(S)\) such that \(k_n(\mu )\rightarrow \infty \), \(\mu _n(\mu )\Rightarrow \mu \) for \(n\rightarrow \infty \) and \(h^{\beta _{k_n(\mu )}}(\mu _n(\mu ))\rightarrow h(\mu )\). Define \(Q_n(\mu )= f^{\beta _{k_n(\mu )}}(\mu _n(\mu ))\). In what follows we fix \(\mu \in {\mathbb {P}}(S)\) and suppress the dependence on \(\mu \) in our notation. Then by (8.6)
Moreover, it follows from [33], Proposition 3.5 that there exists a measurable function \(g^0: {\mathbb {P}}(S)\rightarrow {\mathbb {P}}(D)\) s.t. \(g^0(\mu )\) is an accumulation point of \((Q_n(\mu ))\) and \(g^0(\mu )\in {\tilde{D}}(\mu )\). For the fixed \(\mu \) choose a subsequence \((n_m)\) of natural numbers (for simplicity denoted by m) such that \(Q_m(\mu ) \Rightarrow g^0(\mu )\). Next note that, since \({\tilde{r}}\) is continuous (see the proof of Theorem 4.3), we obtain
Since for \(k_n\) large enough
we obtain \(\limsup _{n\rightarrow \infty } h^{\beta _{k_n}}({\tilde{T}} (\mu _{k_n}, Q_{k_n}, Z^0)) \le h({\tilde{T}}(\mu ,g^0(\mu ),Z^0))\). Hence taking \(\limsup _{m\rightarrow \infty }\) in (8.7) we obtain altogether with monotone convergence for the integral
where \(\varphi ^*\) is a maximizer of h which exists since the r.h.s. is upper semicontinuous and since \({\tilde{D}}(\mu )\) is compact. This proves part a).
Iterating this inequality n times yields by (A4)
Dividing by n and taking \(\liminf _{n\rightarrow \infty }\) on both sides we obtain \( \rho \le G_{g^0}(\mu )\). From Lemma 5.2 we deduce that \(g^0\) and hence also \(\varphi ^*\) yield an average optimal policy. The remaining statements follow from [33], Proposition 3.5. \(\Box \)
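The vanishing-discount mechanism behind this proof can be checked on a toy example (fixed policy, invented two-state chain, not the paper's model; assuming NumPy): the normalized discounted values \((1-\beta )J^\beta \) converge to the average reward \(\rho \) as \(\beta \uparrow 1\).

```python
import numpy as np

# Toy check of the vanishing-discount idea for a fixed ergodic chain:
# (1 - beta) * J^beta -> rho (long-run average reward) as beta -> 1.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
r = np.array([1.0, 3.0])

w, v = np.linalg.eig(P.T)                       # stationary distribution
mu = np.real(v[:, np.argmax(np.real(w))])
mu /= mu.sum()
rho = float(mu @ r)                             # long-run average reward

def J(beta):
    # discounted value: J = r + beta * P J  <=>  (I - beta P) J = r
    return np.linalg.solve(np.eye(2) - beta * P, r)

approx = {b: (1 - b) * J(b) for b in (0.9, 0.99, 0.999)}
```

As \(\beta \) approaches 1, the error shrinks linearly in \(1-\beta \), reflecting the bias term \(h\) in the expansion \((1-\beta )J^\beta = \rho + (1-\beta )h + o(1-\beta )\).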
8.3.6 Proof of Lemma 5.6
We obtain
Let us first consider the term in (8.8). Since all f are Lipschitz with constant at most 1, we obtain with (T1) that (8.8) can be bounded by \(\gamma _W W(\mu _k^*,\mu ^*)\). Next we consider (8.9). We show that
is Lipschitz with constant bounded by \(\gamma _Q\gamma _A+\gamma _S\). From this property it follows then that (8.9) can be bounded by \((\gamma _Q\gamma _A+\gamma _S) W(\mu _k^*,\mu ^*)\). Hence consider
By (T4) we can bound (8.11) by \(\gamma _S |x-x'|\) since f is Lipschitz with constant at most 1. Finally we have to treat (8.10). Here we show that \(g(a):= \int f(T(x,a,\mu ^*,z)) {\mathbb {P}}^Z (dz) \) is Lipschitz with constant at most \(\gamma _A\):
This altogether shows that (8.9) can be bounded by \((\gamma _Q\gamma _A+\gamma _S) W(\mu _{k}^*,\mu ^*)\). Finally we obtain
for \(k\rightarrow \infty \) and weak convergence follows from convergence in the Wasserstein metric. \(\Box \)
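The Lipschitz bookkeeping in this proof can be probed numerically. In the following hedged sketch (a linear stand-in transition \(T(x,a,z)=\gamma _S x + \gamma _A a + z\) and the 1-Lipschitz test function \(f=|\cdot |\), both invented, not the paper's T; assuming NumPy), the map \(a \mapsto \int f(T(x,a,z))\,{\mathbb {P}}^Z(dz)\) is indeed Lipschitz with constant at most \(\gamma _A\):

```python
import numpy as np

# Toy check: with T(x, a, z) = gS*x + gA*a + z and f = |.| (1-Lipschitz),
# the averaged map g(a) = E[f(T(x, a, Z))] has Lipschitz constant <= gA.
rng = np.random.default_rng(0)
gS, gA = 0.5, 0.3
Z = rng.normal(scale=0.1, size=10_000)

def g(x, a):
    return float(np.mean(np.abs(gS * x + gA * a + Z)))

# empirical Lipschitz constant in a at fixed x, over random action pairs
a_pairs = rng.uniform(-1.0, 1.0, size=(200, 2))
lip = max(abs(g(0.7, a1) - g(0.7, a2)) / abs(a1 - a2)
          for a1, a2 in a_pairs if abs(a1 - a2) > 1e-6)
```

The bound holds per sample by the triangle inequality, so it survives the averaging over Z, which is exactly the structure of the estimate above.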
Bäuerle, N. Mean Field Markov Decision Processes. Appl Math Optim 88, 12 (2023). https://doi.org/10.1007/s00245-023-09985-1