1 Introduction

Mean-field control problems have been developed from McKean-Vlasov processes (see [26]), where the dynamics depend on the distribution of the current state itself. In the corresponding control problem the relevant data like the reward and the transition function depend not only on the current state and action but also on the distribution of the state. Whereas the original motivation comes from physics, these kinds of problems are able to model the interaction of a large population. Thus, other popular applications include finance, queueing, energy and security problems, among others. In this paper we consider mean-field control problems in discrete time, in contrast to the majority of the literature, which concentrates on continuous-time models. Moreover, our optimization criterion is to maximize the social benefit of the system, i.e. the overall expected reward. In particular, in our paper individuals cooperate, in contrast to the game situation where one usually tries to find the Nash equilibrium of the system. Here we rather aim at obtaining the Pareto optimal solution. A comprehensive overview of continuous-time mean-field games can be found in [7]. These games have been introduced in economics and have been studied in mathematics for at least 15 years (see e.g. [24] for one of the first mathematical papers on this topic).

We briefly review the latest results on discrete-time mean-field problems. First note that there have been some early studies of interactive games in [23] under the name of anonymous sequential games and in [35] of so-called oblivious games, which are very similar in nature to mean-field games. For a recent paper on discrete-time mean-field games and a literature survey, see for example [32]. In that paper Markov Nash equilibria are considered in a model without common noise. For an early game paper with finite state space see [16]. Since our paper is not about a game and is more in the spirit of Markov Decision Processes (MDPs), we concentrate our literature survey on control papers. Among the first papers in this area are [13, 14]. In both papers the authors’ goal is to investigate the convergence of a large interacting population process to the simpler mean-field model. More precisely, the authors show convergence of value functions and convergence of optimal policies, which implies the construction of asymptotically optimal policies. In both papers the state space is finite and the action space compact. Whereas in [13] the convergence rate is studied, in [14] the authors also scale the time steps to obtain a continuous-time deterministic limit. Finite as well as infinite-horizon discounted reward problems are considered. In [20] the authors also investigate convergence in a discounted reward problem, however they consider the situation where the density of the random disturbance is unknown. A consumption-investment example is discussed there. In [21] the same authors treat the unknown disturbance as a game against nature. The paper [29] already starts from a discrete-time mean-field control problem. The authors derive the value iteration and solve an LQ McKean-Vlasov control problem. In contrast to our paper there is no common noise, the authors restrict themselves to a finite time horizon and do not use MDP theory to solve their problem. However, their model data like cost and transition function may also depend on the distribution of actions. LQ problems are popular as applications of mean-field control since it is often possible to obtain optimal policies in these cases; e.g. [11] is entirely devoted to this kind of problem.

The two papers which are closest to ours, at least as far as the model is concerned, are [8, 27]. In both papers the model data may also depend on the distribution of actions, but there is no restriction on admissible actions. Both consider a discounted problem with an infinite time horizon. In [8] the authors work with lower semicontinuous value functions, whereas we show continuity under the same assumptions. The main issues in [8] are an extensive discussion of different types of policies and the development of Q-learning algorithms. We, however, start directly with Markovian deterministic policies, since it is well known in MDP theory that history-dependent or randomized policies do not increase the value. Moreover, we consider the convergence of the N-individuals problem as well as average reward optimization. In [27] the authors deal with so-called open-loop controls and restrict themselves to individualized or decentralized information. They investigate the rate of convergence from the N-population model to the mean-field problem. They also derive a fixed point characterization of the value function and discuss the role of randomized controls. Since in [27] decisions may only depend on the history of the single agent, an additional source of randomness is required so that individuals with the same history may take different actions.

Other recent papers discuss reinforcement learning for mean-field control problems, see e.g. [8, 9, 17, 18]. In the second part of the paper we consider average reward mean-field control problems, which is a new aspect. There are papers on average reward games, like [5], where the transition probability does not depend on the empirical distribution of individuals, and [36], where under some strong ergodicity assumptions the existence of a stationary mean-field equilibrium is shown. Neither paper considers the vanishing discount approach which we use here. The recent paper [6] considers the vanishing discount approach, but in a continuous-time setting and for a game.

The main contributions of our paper are as follows: We first want to stress the point that mean-field control problems fit naturally into the established MDP theory. We start with a problem where N interacting individuals try to maximize their expected discounted reward over an infinite time horizon. Reward and transition functions may depend on the empirical measure of the individuals. Moreover, the transition functions of individuals depend on an idiosyncratic noise and a common noise. For symmetry reasons, instead of taking the states of all individuals as the state of the system, it is enough to know the empirical measure over the states. This equivalence implies an MDP formulation where the underlying state process consists of empirical measures. A similar observation can be found in [27], however there the authors take the mean-field limit first. Letting the number N of individuals tend to infinity yields the mean-field limit by an application of the Glivenko-Cantelli theorem. The idiosyncratic noise vanishes in the limit. In our setting state and action spaces are compact Borel spaces. We also discuss the existence of optimal policies, which is rarely done in other papers. For example, we give explicit conditions under which an optimal deterministic policy exists for the limit problem as well as for the initial N-individuals problem. Moreover, we investigate average optimality in mean-field control problems, an aspect which is neglected in the literature. Applying results from MDP theory leads to an average reward optimality inequality. In some cases we obtain optimal policies in this setting rather easily. Since we use the vanishing discount approach, we can show that these policies are \(\varepsilon \)-optimal for the initial problem when the number of individuals is large and the discount factor is close to one. Thus, we get some kind of double approximation which is helpful in some applications. Indeed, it turns out that the case where the reward does not depend on the action yields an interesting special case. The average reward problem can then be solved by first finding an optimal measure for a static optimization problem and then using Markov Chain Monte Carlo to find an optimal randomized decision rule which achieves the optimal measure in the limit. We show how this works in a network example where the aim is to avoid congestion. Another interesting feature of the solution is that it is a decentralized control, i.e. individuals can decide optimally based on their own state without knowing the distribution of all individuals; in particular, individuals do not have to communicate. A second example is the optimal placement on a market square.

The paper is organized as follows: In Sect. 2 we introduce the model with a finite number N of individuals. We give conditions under which the optimality equation holds and optimal policies exist. In Sect. 3 we show how to formulate an equivalent MDP whose state space consists of the empirical measures of the individuals. Based on this formulation, in Sect. 4 we let the number N of individuals tend to infinity. We prove the convergence of the value functions and show how an asymptotically optimal policy can be constructed. In Sect. 5 we consider the average reward problem via the vanishing discount approach. Under some ergodicity assumptions we prove the existence of average reward optimal policies and verify that the value function satisfies an average reward optimality inequality. Next we show how to use this optimal policy to construct \(\varepsilon \)-optimal policies for the original problem.

We also discuss how to solve average reward problems when the reward depends only on the distribution of individuals and not on the action. Finally, in Sect. 6 we consider two applications (network congestion and positioning on a market square) which we solve explicitly. The appendix contains additional material, namely a useful convergence result and the definitions of the Wasserstein distance and of Wasserstein ergodicity. Moreover, the longer proofs are also deferred to the appendix.

2 The Mean-Field Model

We consider the following Markov Decision Process with a finite number of individuals: Suppose we have a compact Borel set S of states and N statistically equal individuals. At the beginning each individual is in one of the states, i.e. the state of the system is described by a vector \({\textbf{x}}=(x_1,\ldots ,x_N)\in S^N\) which represents the states of the individuals. In case we need the time index n, we write \(x_n^i\), \(i=1,\ldots ,N\). Each individual can choose actions from the same Borel set A. Let \(D(x)\subset A\) be the actions available for an individual who is in state \(x\in S\), i.e. \({\textbf{a}}=(a_1,\ldots ,a_N)\in {\textbf{D}}({\textbf{x}}):=D(x_1)\times \ldots \times D(x_N)\) is the vector of admissible actions for all individuals. We denote \(D:= \{ (x,a) \in S\times A \,:\, a\in D(x)\}\) and assume that it contains the graph of a measurable mapping \(f:S\rightarrow A\). Moreover, \({\textbf{D}}:= \{ ({\textbf{x}},{\textbf{a}}) \,|\, {\textbf{a}}\in {\textbf{D}}({\textbf{x}})\} \). After choosing an action each individual faces a random transition. In order to define this, suppose that \((Z_n^i)_{n\in {\mathbb {N}}}, i=1,\ldots ,N\) and \((Z_n^0)_{n\in {\mathbb {N}}}\) are sequences of i.i.d. random variables with values in a Borel set \({\mathcal {Z}}\). The sequence \((Z_n^0)_{n\in {\mathbb {N}}}\) will play the role of a common noise. In what follows we need the empirical measure of \({\textbf{x}}\), i.e. we denote

$$\begin{aligned} \mu [{\textbf{x}}]:= \frac{1}{N}\sum _{i=1}^N \delta _{x_i} \end{aligned}$$

where \(\delta _y\) is the Dirac measure in point y. \(\mu [{\textbf{x}}]\) can be interpreted as a distribution on S. We denote by \({\mathbb {P}}(S)\) the set of all distributions on S and by

$$\begin{aligned} {\mathbb {P}}_N(S):= \{ \mu \in {\mathbb {P}}(S)\;| \; \mu = \mu [{\textbf{x}}], \text{ for } {\textbf{x}} \in S^N \}, \end{aligned}$$

the set of all distributions which are empirical measures of N points. On these sets we consider the topology of weak convergence. The transition function of the system is now a combination of the individual transition functions which are given by a measurable mapping \(T: S\times A\times {\mathbb {P}}(S)\times {\mathcal {Z}}^2\rightarrow S\) such that

$$\begin{aligned} x_{n+1}^i = T(x_n^i, a_n^i, \mu [{\textbf{x}}_n], Z_{n+1}^i, Z_{n+1}^0) \end{aligned}$$

for \(i=1,\ldots ,N\). Note that the individual transition may also depend on the empirical distribution \(\mu [{\textbf{x}}_n]\) of all individuals. In total the transition function for the entire system is a measurable mapping \({\textbf{T}}: {\textbf{D}} \times {\mathbb {P}}_N(S)\times {\mathcal {Z}}^{N+1}\rightarrow S^N\) of the state \({\textbf{x}}\), the chosen actions \({\textbf{a}}\in {\textbf{D}}({\textbf{x}})\), the empirical measure \(\mu [{\textbf{x}}]\) and the disturbances \({\textbf{Z}}_{n+1}:=(Z_{n+1}^1,\ldots , Z_{n+1}^N), Z_{n+1}^0\) such that

$$\begin{aligned} {\textbf{x}}_{n+1}= {\textbf{T}}({\textbf{x}}_n,{\textbf{a}}_n,\mu [{\textbf{x}}_n], {\textbf{Z}}_{n+1}, Z_{n+1}^0)= \Big ( T(x_n^i, a_n^i, \mu [{\textbf{x}}_n], Z_{n+1}^i, Z_{n+1}^0)\Big )_{i=1,\ldots ,N}. \end{aligned}$$

Last but not least each individual generates a bounded one-stage reward \(r: S\times A\times {\mathbb {P}}(S)\rightarrow {\mathbb {R}}\) which is given by \(r(x_i,a_i,\mu [{\textbf{x}}])\), i.e. it may also depend on the empirical distribution of all individuals. The total one-stage reward of the system is the average

$$\begin{aligned} {\textbf{r}}({\textbf{x}},{\textbf{a}}):=\frac{1}{N} \sum _{i=1}^N r(x_i,a_i, \mu [{\textbf{x}}]) \end{aligned}$$

of all individuals. The first aim will be to maximize the joint expected discounted reward of the system over an infinite time horizon, i.e. we consider here the social optimum of the system or Pareto optimality. In particular the agents have to work together in order to optimize the system. This is in contrast to mean-field games where each individual tries to maximize her own expected discounted reward and where the aim is to find Nash equilibria. We make the following assumptions:

  1. (A0)

    D is compact.

  2. (A1)

    \(x\mapsto D(x)\) is upper semicontinuous, i.e. for all \(x\in S\): If \(x_n\rightarrow x\) for \(n\rightarrow \infty \) and \(a_n\in D(x_n)\), then \((a_n)\) has an accumulation point in D(x).

  3. (A2)

    \((x,a,\mu ) \mapsto r(x,a,\mu )\) is upper semicontinuous.

  4. (A3)

    \( (x,a,\mu ) \mapsto T(x,a,\mu ,z,z_0)\) is continuous for all \(z,z_0 \in {\mathcal {Z}}.\)

A policy in this model is given by \(\pi =(f_0,f_1,\ldots )\) with \(f_n \in F\) being a decision rule where

$$\begin{aligned} F:= \{ f: S^N \rightarrow A^N \;| \; f \text{ is measurable and } f({\textbf{x}})\in {\textbf{D}}({\textbf{x}}) \text{ for all } {\textbf{x}}\in S^N\} \end{aligned}$$

is the set of all decision rules. In case we do not need the time index n we write \(f({\textbf{x}}):=(f^1({\textbf{x}}),\ldots ,f^N({\textbf{x}}))\). It is not necessary to introduce randomized or history-dependent policies here, since we obtain a classical MDP below and it is well-known that an optimal policy will be among deterministic Markov ones. We assume that each individual has information about the position of all other individuals. This point of view can be interpreted as a centralized control problem where all information is collected and shared by a central controller.

Together with the distributions of \((Z_n^i), (Z_n^0)\) and the transition function \({\textbf{T}}\), a policy \(\pi \) induces a probability measure \({\mathbb {P}}_{\textbf{x}}^\pi \) on the measurable space

$$\begin{aligned} (\Omega =S^N\times S^N\times \ldots , {\mathcal {F}}={\mathcal {B}}(S^N) \otimes {\mathcal {B}}(S^N) \otimes \ldots ) \end{aligned}$$

where \( {\mathcal {B}}(S^N) \) is the Borel \(\sigma \)-algebra on \(S^N\). The corresponding state process is denoted by \(({\textbf{X}}_n)\) where \({\textbf{X}}_n(\omega _1,\omega _2,\ldots )=\omega _n\in S^N\) and the action process is denoted by \(({\textbf{A}}_n)\) where \({\textbf{A}}_n(\omega _1,\omega _2,\ldots )=f_n(\omega _n).\) Our aim is to maximize the expected discounted reward of the system over an infinite time horizon. Hence we define for a policy \(\pi =(f_0,f_1,\ldots )\)

$$\begin{aligned} V_\pi ^N({\textbf{x}}):= & {} \frac{1}{N} \sum _{i=1}^N \sum _{k=0}^\infty \beta ^k {\mathbb {E}}_{\textbf{x}}^\pi \big [r(X_k^i,A_k^i, \mu [{\textbf{X}}_k])\big ] \end{aligned}$$
(2.1)
$$\begin{aligned} V^N({\textbf{x}}):= & {} \sup _\pi V_\pi ^N({\textbf{x}}) \end{aligned}$$
(2.2)

where \(\beta \in (0,1)\) is a discount factor and \({\mathbb {E}}_{\textbf{x}}^\pi \) is the expectation w.r.t. \({\mathbb {P}}_{\textbf{x}}^\pi \). \(V^N({\textbf{x}})\) is the maximal expected discounted reward over an infinite time horizon, given the initial configuration \({\textbf{x}}\) of the individuals' states.

Remark 2.1

It is not difficult to see that \(V^N\) is symmetric, i.e. \(V^N({\textbf{x}})=V^N(\sigma ({\textbf{x}}))\) for any permutation \(\sigma ({\textbf{x}})\) of \({\textbf{x}}\) because the reward \({\textbf{r}}({\textbf{x}},{\textbf{a}})={\textbf{r}}(\sigma ({\textbf{x}}),\sigma ({\textbf{a}}))\) and the transition function \({\textbf{T}}({\textbf{x}},{\textbf{a}},\mu [{\textbf{x}}], {\textbf{Z}}, Z^0)={\textbf{T}}(\sigma ({\textbf{x}}),\sigma ({\textbf{a}}),\mu [\sigma ({\textbf{x}})], {\textbf{Z}}, Z^0)\) are symmetric. This is a simple observation but in the end leads to the conclusion that it is only necessary to know how many individuals are in the different states.

In what follows we introduce some notation.

Definition 2.2

Let us define:

  1. a)

    The set \({\mathbb {M}}:= \{v:S^N \rightarrow {\mathbb {R}}\;| \; v \text{ is } \text{ bounded } \text{ and } \text{ upper } \text{ semicontinuous }\}\).

  2. b)

    The operator U on \({\mathbb {M}}\) by

    $$\begin{aligned} Uv({\textbf{x}}) = (Uv)({\textbf{x}}):= & {} \sup _{{\textbf{a}}\in {\textbf{D}}({\textbf{x}})} \Big \{ {\textbf{r}}({\textbf{x}},{\textbf{a}})+ \beta {\mathbb {E}}\Big [v\big ( {\textbf{T}}({\textbf{x}}, {\textbf{a}}, \mu [{\textbf{x}}], {\textbf{Z}}, Z^0)\big )\Big ]\Big \}. \end{aligned}$$
  3. c)

    A decision rule \(f\in F\) is called maximizer of \(v\in {\mathbb {M}}\) if

    $$\begin{aligned} Uv({\textbf{x}})= \textbf{ r}({\textbf{x}},f({\textbf{x}}))+ \beta {\mathbb {E}}\Big [v\big ( {\textbf{T}}({\textbf{x}}, f({\textbf{x}}), \mu [{\textbf{x}}], {\textbf{Z}}, Z^0)\big )\Big ]. \end{aligned}$$

From classical MDP theory we obtain:

Theorem 2.3

Assume (A0)–(A3). Then:

  1. (a)

    The value function \(V^N\) is the unique fixed point of the U-operator in \({\mathbb {M}}\), i.e. it satisfies the optimality equation \(V^N=U V^N\).

  2. (b)

    \(V^N = \lim _{n\rightarrow \infty } U^n 0\).

  3. (c)

    There exists a maximizer of \(V^N\) and every maximizer \(f^*\in F\) of \(V^N\) defines an optimal stationary (deterministic) policy \((f^*,f^*,\ldots )\).

The proof of this statement and all other longer proofs can be found in the appendix. We summarize the model data below:

Model MDP

State space: \(S^N \ni {\textbf{x}}=(x_1,\ldots ,x_N)\)

Admissible actions: \({\textbf{D}}({\textbf{x}}):=D(x_1)\times \ldots \times D(x_N)\ni {\textbf{a}}=(a_1,\ldots ,a_N)\)

Transition function: \({\textbf{T}}({\textbf{x}}_n,{\textbf{a}}_n,\mu [{\textbf{x}}_n], {\textbf{Z}}_{n+1}, Z_{n+1}^0)= \Big ( T(x_n^i, a_n^i, \mu [{\textbf{x}}_n], Z_{n+1}^i, Z_{n+1}^0)\Big )_{i=1,\ldots ,N}\)

Reward: \( {\textbf{r}}({\textbf{x}},{\textbf{a}}) :=\frac{1}{N} \sum _{i=1}^N r(x_i,a_i, \mu [{\textbf{x}}])\)

Policy: \(\pi =(f_0,f_1,\ldots )\) with \(f_n\in F:= \{ f: S^N \rightarrow A^N \;| \; f \text{ is measurable and } f({\textbf{x}})\in {\textbf{D}}({\textbf{x}}),\; \forall {\textbf{x}}\in S^N\}\)

Example 2.4

Suppose individuals move on a triangle. The state space is given by the nodes \(S=\{1,2,3\}\). Admissible actions are adjacent nodes, i.e. \(D(1)=\{2,3\}, D(2)=\{1,3\}, D(3)=\{1,2\}\). The individual one-stage reward may be given by \(r(x_i,a_i,\mu )= 1_{\{1\}}(x_i)- 1_{\{ |1-\bar{\mu }|\le 0.5\}}\).

Here \(\bar{\mu }= \int x\mu (dx)\). This means an individual gets a reward of 1 when it is in state 1, but only when the average position of the others is away from 1. A transition function may be

$$\begin{aligned} T(x, a, \mu , z, z^0)= \left\{ \begin{array}{cl} a, &{} \text{ if } z\in [0,\frac{1}{2}),\\ x, &{} \text{ if } z\in [\frac{1}{2},1] \end{array}\right. \end{aligned}$$

For \(N=5\) individuals, a state may be \({\textbf{x}}=(1,2,3,1,3)\) and an action \({\textbf{a}}=(2,1,2,3,1)\in {\textbf{D}}({\textbf{x}})\). In this case \(\mu [{\textbf{x}}]=(2/5,1/5,2/5)\) and \({\textbf{r}}({\textbf{x}},{\textbf{a}}) = 2/5\).
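To fix ideas, the following minimal Python sketch simulates one transition and the one-stage reward of the \(N=5\) configuration above. The uniform disturbances on [0, 1] are suggested by the form of the transition function; all function and variable names are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

S = [1, 2, 3]                                  # nodes of the triangle
D = {1: [2, 3], 2: [1, 3], 3: [1, 2]}          # admissible actions D(x)

def T(x_i, a_i, mu, z, z0):
    """Individual transition: move to the chosen node if z < 1/2, else stay."""
    return a_i if z < 0.5 else x_i

def r(x_i, a_i, mu):
    """r(x_i, a_i, mu) = 1_{x_i = 1} - 1_{|1 - mu_bar| <= 0.5}."""
    mu_bar = sum(s * mu[s] for s in S)
    return float(x_i == 1) - float(abs(1 - mu_bar) <= 0.5)

x = [1, 2, 3, 1, 3]                            # states of the N = 5 individuals
a = [2, 1, 2, 3, 1]                            # actions with a_i in D(x_i)
mu = {s: sum(x_i == s for x_i in x) / len(x) for s in S}      # mu[x] = (2/5, 1/5, 2/5)

R = sum(r(x_i, a_i, mu) for x_i, a_i in zip(x, a)) / len(x)   # bold r(x, a) = 2/5
Z = rng.uniform(size=len(x))                   # idiosyncratic noise Z^i
Z0 = rng.uniform()                             # common noise Z^0 (unused by this T)
x_next = [T(x_i, a_i, mu, z, Z0) for x_i, a_i, z in zip(x, a, Z)]
print(mu, R, x_next)
```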

3 The Mean-Field MDP

Suppose that N is large. Even if the state space S is small, the solution of the problem may no longer be computationally tractable because \(S^N\) is large. We therefore seek some simplifications. In particular we want to exploit the symmetry of the problem. In the last section we have seen that the empirical measure of the individuals' states is the essential information. Thus, we take \({\mathbb {P}}_N(S)\) as the new state space. Further we define the following sets:

$$\begin{aligned} {\hat{D}} (\mu ):= & {} \{ \mu [({\textbf{x}},{\textbf{a}})] \;| \; {\textbf{x}}\in S^N \text{ s.t. } \mu [{\textbf{x}}] =\mu \text{ and } {\textbf{a}}\in {\textbf{D}}({\textbf{x}})\},\; \mu \in {\mathbb {P}}_N(S)\\ {\hat{D}}:= & {} \{ (\mu ,Q) \;| \; \mu \in {\mathbb {P}}_N(S), Q \in {\hat{D}}(\mu )\}\\ {\hat{F}}:= & {} \{ \varphi : {\mathbb {P}}_N(S) \rightarrow {\mathbb {P}}_N(D) \;| \; \varphi \text{ measurable, } \varphi (\mu )\in {\hat{D}}(\mu ) \text{ for } \text{ all } \mu \in {\mathbb {P}}_N(S) \}, \end{aligned}$$

where

$$\begin{aligned} {\mathbb {P}}_N(D):= \{ Q\in {\mathbb {P}}(D) \; | \; Q= \mu [({\textbf{x}},{\textbf{a}})] \text{ for } ({\textbf{x}},{\textbf{a}}) \in {\textbf{D}} \} \end{aligned}$$

is the set of all probability measures on D which are empirical measures on N points. The set \({\hat{D}}(\mu ) \) consists of probability measures on D which are empirical measures on N points and whose first marginal distribution equals \(\mu \). We obtain the following result.

Lemma 3.1

Suppose \({\textbf{a}}\in {\textbf{D}}({\textbf{x}}) \) is an arbitrary action in state \({\textbf{x}}\in S^N\). Then there exists an admissible \(Q\in {\hat{D}} (\mu [{\textbf{x}}]),\) s.t.

$$\begin{aligned} {\textbf{r}}({\textbf{x}},{\textbf{a}})= \int _D r(x,a,\mu )Q(d(x,a)) =: \hat{r}(\mu ,Q), \end{aligned}$$
(3.1)

for all \({\textbf{x}}\in S^N\). The converse is also true, i.e. if \(Q\in {\hat{D}} (\mu [{\textbf{x}}])\) then there exists an \({\textbf{a}}\in {\textbf{D}}({\textbf{x}}) \) s.t. (3.1) holds.

Proof

Let \({\textbf{x}}\) and \({\textbf{a}}\in {\textbf{D}}({\textbf{x}})\) be given and let \(\mu := \mu [{\textbf{x}}]\in {\mathbb {P}}_N(S)\). Define the discrete point measure Q on D by

$$\begin{aligned} Q:= \mu [({\textbf{x}},{\textbf{a}})]. \end{aligned}$$

Then \(Q\in {\hat{D}} (\mu )\) by construction and

$$\begin{aligned} {\textbf{r}}({\textbf{x}},{\textbf{a}})= & {} \frac{1}{N}\sum _{i=1}^N r(x_i,a_i, \mu ) =\int _{D} r(x,a,\mu )Q(d(x,a)) \end{aligned}$$

which proves the first statement. For the converse, suppose \(Q\in {\hat{D}} (\mu [{\textbf{x}}])\). By definition this implies that there exists \({\textbf{a}}\in {\textbf{D}}({\textbf{x}})\) s.t. \(Q=\mu [({\textbf{x}},{\textbf{a}})]\). Using this relation, (3.1) follows. \(\square \)

This lemma shows that instead of choosing actions \({\textbf{a}}\in {\textbf{D}}({\textbf{x}})\) we can choose measures \(Q\in {\hat{D}} (\mu [{\textbf{x}}])\), and that \(\mu =\mu [{\textbf{x}}]\) is sufficient information which can replace the high-dimensional state \({\textbf{x}}\in S^N\). Intuitively this is clear from the fact that \({\textbf{r}}({\textbf{x}},{\textbf{a}})\) is symmetric (see Remark 2.1).
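For finite S and A the construction in the proof of Lemma 3.1 is elementary bookkeeping. A minimal, self-contained Python sketch using the data of Example 2.4 (function names and layout are our own):

```python
from collections import Counter

# data of Example 2.4: S = {1, 2, 3}, N = 5
x = [1, 2, 3, 1, 3]                                      # states
a = [2, 1, 2, 3, 1]                                      # actions with a_i in D(x_i)
N = len(x)

mu = {s: sum(x_i == s for x_i in x) / N for s in {1, 2, 3}}     # mu[x]
Q = {sa: c / N for sa, c in Counter(zip(x, a)).items()}         # Q = mu[(x, a)]

def r(x_i, a_i, mu):
    """One-stage reward of Example 2.4."""
    mu_bar = sum(s * p for s, p in mu.items())
    return float(x_i == 1) - float(abs(1 - mu_bar) <= 0.5)

r_bold = sum(r(x_i, a_i, mu) for x_i, a_i in zip(x, a)) / N     # bold r(x, a)
r_hat = sum(r(x_i, a_i, mu) * q for (x_i, a_i), q in Q.items()) # r_hat(mu, Q)
assert abs(r_bold - r_hat) < 1e-12                              # identity (3.1)
```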

We consider now a second MDP with the following data which we will call mean-field MDP (for short \(\widehat{\textrm{MDP}}\)). The state space is \( {\mathbb {P}}_N(S)\) and the action space is \({\mathbb {P}}_N(D)\). The one-stage reward \(\hat{r}: {\hat{D}}\rightarrow {\mathbb {R}}\) is given by the expression in Lemma 3.1, i.e.

$$\begin{aligned} \hat{r}(\mu ,Q):= & {} \int _{D} r(x,a,\mu ) Q(d(x,a)) \end{aligned}$$
(3.2)

and the transition law \(\hat{T}: {\hat{D}} \times {\mathcal {Z}}^{N+1} \rightarrow {\mathbb {P}}_N(S)\) for \(Q=\mu [({\textbf{x}},{\textbf{a}})], \mu =\mu [{\textbf{x}}]\) by (the empty sum is zero)

$$\begin{aligned} \hat{T}(\mu ,Q,{\textbf{Z}},Z^0)= & {} \mu [ {\textbf{T}}({\textbf{x}},{\textbf{a}},\mu [{\textbf{x}}],{\textbf{Z}},Z^0)] \end{aligned}$$

The value of \({\hat{T}}\) simply is the empirical measure of the new states after a random transition. A policy is here denoted by \(\psi =(\varphi _0,\varphi _1,\ldots )\) with \(\varphi _n\in {\hat{F}}\) and we denote by \((\mu _n)\) the corresponding (random) sequence of empirical measures, i.e. \(\mu _0=\mu \), and for \(n\in {\mathbb {N}}_0\)

$$\begin{aligned} \mu _{n+1} = \hat{T}(\mu _n,\varphi _n(\mu _n),{\textbf{Z}}_{n+1},Z_{n+1}^0). \end{aligned}$$

Remark 3.2

We define an action as a joint probability distribution Q on state and action combinations instead of the conditional distribution on actions given the state. Both descriptions are equivalent, since for \(Q\in {\hat{D}}(\mu )\) we can disintegrate

$$\begin{aligned} Q(B)=\int _B \bar{Q}(da|x)\mu (dx),\; B \in {\mathcal {B}}(D) \end{aligned}$$

where \({\bar{Q}}\) is the regular conditional probability. For short: \(Q=\mu \otimes {\bar{Q}}\). The advantage of using the joint distribution is that we have one object to define actions in all states. The disadvantage is that we need to formulate the restriction that the marginal distribution on the states coincides with \(\mu \).
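For an empirical measure on a finite set D the disintegration is elementary: \({\bar{Q}}(a|x) = Q(x,a)/\mu (x)\) whenever \(\mu (x)>0\). A short sketch, continuing the toy data of Example 2.4 (the dictionary layout is our own):

```python
from collections import Counter

# joint empirical measure Q = mu[(x, a)] from Example 2.4
x = [1, 2, 3, 1, 3]
a = [2, 1, 2, 3, 1]
Q = {sa: c / len(x) for sa, c in Counter(zip(x, a)).items()}

def disintegrate(Q):
    """Split a measure Q on D into its first marginal mu and the kernel Q_bar."""
    mu = {}
    for (x_i, a_i), q in Q.items():
        mu[x_i] = mu.get(x_i, 0.0) + q
    Q_bar = {(x_i, a_i): q / mu[x_i] for (x_i, a_i), q in Q.items()}
    return mu, Q_bar                     # Q_bar[(x, a)] plays the role of Q̄(a | x)

mu, Q_bar = disintegrate(Q)
# here: Q_bar[(1, 2)] = Q_bar[(1, 3)] = 1/2 and Q_bar[(2, 1)] = 1, cf. Example 3.6 below
```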

We define the value function of \(\widehat{\textrm{MDP}}\) in the usual way for state \(\mu \in {\mathbb {P}}_N(S)\) and policy \(\psi =(\varphi _0,\varphi _1,\ldots )\) by

$$\begin{aligned} J_\psi ^N(\mu ):= & {} \sum _{k=0}^\infty \beta ^k {\mathbb {E}}_\mu ^\psi \big [\hat{r}(\mu _k,\varphi _k)\big ]. \end{aligned}$$
(3.3)
$$\begin{aligned} J^N(\mu ):= & {} \sup _\psi J^N_\psi (\mu ). \end{aligned}$$
(3.4)

Finally, we show that the MDP and the mean-field MDP are equivalent.

Theorem 3.3

Assume (A0)-(A3). For \({\textbf{x}}\in S^N\) and \(\mu =\mu [{\textbf{x}}]\) we have:

$$\begin{aligned} V^N({\textbf{x}})=J^N(\mu ). \end{aligned}$$

Proof

Note that \(\mu _0=\mu =\mu [{\textbf{x}}]\) by definition. Let \({\textbf{a}}_0={\textbf{a}}\in {\textbf{D}}({\textbf{x}})\) be the first action taken in the MDP under an arbitrary policy. Then by Lemma 3.1 there exists \(Q\in {\hat{D}}(\mu )\) s.t. \({\textbf{r}}({\textbf{x}},{\textbf{a}})= \hat{r}(\mu ,Q)\) and

$$\begin{aligned} \mu [{\textbf{X}}_{1}]=\mu [{\textbf{T}}({\textbf{x}},{\textbf{a}},\mu [{\textbf{x}}], {\textbf{Z}}_{1}, Z^0_{1})]=\hat{T}(\mu ,Q, {\textbf{Z}}_{1},Z^0_{1})=\mu _{1}. \end{aligned}$$

By induction over time n it follows that a sequence of states and feasible actions in MDP \(({\textbf{X}}_0,{\textbf{A}}_0,{\textbf{X}}_1,\ldots )\) can be coupled with a sequence of states and feasible actions \((\mu _0,Q_0,\mu _1,\ldots )\) for \(\widehat{\textrm{MDP}}\) and vice versa s.t. the same sequence of disturbances \(({\textbf{Z}}_n),(Z^0_n)\) is used and \( {\textbf{r}}({\textbf{X}}_n,{\textbf{A}}_n) = \hat{r}(\mu _n,Q_n)\) pathwise. The corresponding policies may be history-dependent, but \(V^N=J^N\) follows since it is well-known for MDPs that the maximal value is obtained when we restrict our optimization to Markovian policies. \(\square \)

As in Sect. 2 we define here a set and an operator for the mean-field MDP.

Definition 3.4

Let us define

  1. (a)

    The set \(\mathbb {{\hat{M}}}:= \{v: {\mathbb {P}}_N(S) \rightarrow {\mathbb {R}}\; |\; v \text{ is } \text{ bounded } \text{ and } \text{ upper } \text{ semicontinuous }\}\).

  2. (b)

    The operator \({\hat{U}}\) on \(\mathbb {{\hat{M}}}\) by

    $$\begin{aligned} {\hat{U}}v(\mu ) = ({\hat{U}}v)(\mu ):= & {} \sup _{Q\in {\hat{D}}(\mu )} \Big \{ \hat{r}(\mu ,Q)+ \beta {\mathbb {E}}v( \hat{ T}(\mu ,Q,{\textbf{Z}},Z^0))\Big \}. \end{aligned}$$

Due to Theorem 3.3 and Theorem 2.3 we obtain:

Theorem 3.5

Assume (A0)–(A3). Then:

  1. (a)

    The value function \(J^N\) is the unique fixed point of the \({\hat{U}}\)-operator in \(\mathbb {{\hat{M}}}\) i.e. it satisfies the optimality equation \(J^N= {\hat{U}} J^N\).

  2. (b)

    \(J^N = \lim _{n\rightarrow \infty } {\hat{U}}^n0\).

  3. (c)

    There exists a maximizer of \(J^N\) and every maximizer \(\varphi ^*\in {\hat{F}}\) of \(J^N\) defines an optimal stationary policy \((\varphi ^*,\varphi ^*,\ldots )\).

We summarize the model data below:

Model \(\widehat{\textrm{MDP}}\)

State space: \({\mathbb {P}}_N(S):= \{ \mu \in {\mathbb {P}}(S)\;| \; \mu = \mu [{\textbf{x}}] \text{ for some } {\textbf{x}} \in S^N \} \ni \mu \)

Admissible actions: \({\hat{D}} (\mu ) :=\{ \mu [({\textbf{x}},{\textbf{a}})] \;| \; {\textbf{x}}\in S^N \text{ s.t. } \mu [{\textbf{x}}] =\mu \text{ and } {\textbf{a}}\in {\textbf{D}}({\textbf{x}})\} \ni Q\)

Transition function: \( \hat{T}(\mu ,Q,{\textbf{Z}},Z^0)= \mu [ {\textbf{T}}({\textbf{x}},{\textbf{a}},\mu [{\textbf{x}}],{\textbf{Z}},Z^0)]\)

Reward: \( \hat{r}(\mu ,Q) := \int _{D} r(x,a,\mu ) Q(d(x,a))\)

Policy: \(\psi =(\varphi _0,\varphi _1,\ldots )\) with \(\varphi _n\in {\hat{F}} := \{ \varphi : {\mathbb {P}}_N(S) \rightarrow {\mathbb {P}}_N(D) \;| \; \varphi \text{ measurable, } \varphi (\mu )\in {\hat{D}}(\mu ),\; \forall \mu \in {\mathbb {P}}_N(S) \} \)

Example 3.6

We reconsider Example 2.4. The given state and action translate in \(\widehat{\textrm{MDP}}\) to \(\mu =\mu [{\textbf{x}}]=(2/5,1/5,2/5)\) as a distribution on \(S=\{1,2,3\}\). The action is a distribution on \(D=\{(1,2),(1,3),(2,1),(2,3),(3,1),(3,2)\}\) and translates into \(Q=(1/5,1/5,1/5,0,1/5,1/5)\). The transition kernel mentioned in Remark 3.2 is in this example given by \(\bar{Q}(2|1)=\frac{1}{2},{\bar{Q}}(3|1)=\frac{1}{2},{\bar{Q}}(1|2)=1, {\bar{Q}}(3|2)=0, {\bar{Q}}(1|3)=\frac{1}{2}, {\bar{Q}}(2|3)=\frac{1}{2}.\) Obviously \( \hat{r}(\mu ,Q) =2/5\).

4 The Mean-Field Limit MDP

In this section we let \(N\rightarrow \infty \) in order to obtain some simplifications. This yields the so-called mean-field limit.

We thus consider a third MDP, the so-called limit MDP (denoted by \(\widetilde{\textrm{MDP}}\)). We will show later that it is indeed the limit of the problems studied in the previous section. The limit MDP is defined by the following data: The state space is \( {\mathbb {P}}(S)\) and the action space is \({\mathbb {P}}(D)\). We define

$$\begin{aligned} {\tilde{D}}(\mu ):= & {} \{ Q \in {\mathbb {P}}(D) \,| \, \text{ the first marginal of } Q \text{ is } \mu \},\, \mu \in {\mathbb {P}}(S) \end{aligned}$$
(4.1)
$$\begin{aligned} {\tilde{D}}:= & {} \{(\mu ,Q)\, |\, \mu \in {\mathbb {P}}(S), Q\in \tilde{D}(\mu )\}. \end{aligned}$$
(4.2)

The one-stage reward \(\tilde{r}: {\tilde{D}} \rightarrow {\mathbb {R}}\) is given as in (3.2):

$$\begin{aligned} \tilde{r}(\mu ,Q):= \int _{D} r(x,a,\mu ) Q(d(x,a)). \end{aligned}$$

The transition function \(\tilde{T}: {\tilde{D}} \times {\mathcal {Z}}\rightarrow {\mathbb {P}}(S)\) is defined by

$$\begin{aligned} \tilde{T}(\mu ,Q,Z^0)(B) = \int _{ D} p^{x,a,\mu ,Z^0}(B) Q(d(x,a)) \end{aligned}$$
(4.3)

where \(p^{x,a,\mu ,Z^0} (B):={\mathbb {P}}(T(x,a,\mu ,Z^i,Z^0)\in B |Z^0)\) with \(B\in {\mathcal {B}}(S)\), is the conditional probability that the next state is in B, given \(x,a,\mu \) and the common noise random variable \(Z^0\).

Remark 4.1

Recalling that \(Q\in {\tilde{D}}(\mu )\) means \(Q=\mu \otimes \bar{Q}\), we can, with the help of Fubini's theorem, equivalently write (4.3) as

$$\begin{aligned} \tilde{T}(\mu ,Q,Z^0)(B)= & {} \int _D p^{x,a,\mu ,Z^0}(B) \bar{Q}(da|x) \mu (dx) \end{aligned}$$
(4.4)
$$\begin{aligned}= & {} \int _S P^{{\bar{Q}},\mu ,Z^0}(B|x) \mu (dx) \end{aligned}$$
(4.5)

where \(P^{{\bar{Q}},\mu ,Z^0} (dx'|x)= \int _{D(x)} p^{x,a,\mu ,Z^0}(dx') \bar{Q}(da|x)\). Hence \(P^{{\bar{Q}},\mu ,Z^0}\) is the transition kernel which determines the distribution at the next stage. In general it depends on \({\bar{Q}},\mu \) and the common noise \(Z^0\).
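For finite S and A the kernel \(P^{{\bar{Q}},\mu ,Z^0}\) is simply a stochastic matrix and (4.4) is a matrix-vector product. A minimal sketch (the array layout and the toy numbers are our own; any dependence of the transition probabilities on \(\mu \) and on the realisation of the common noise is assumed to be already encoded in the array p):

```python
import numpy as np

def limit_transition(mu, Q_bar, p):
    """One step of the limit MDP for finite S and A:
         P[x, y]   = sum_a p[x, a, y] * Q_bar[x, a]   (the kernel of Remark 4.1)
         mu_new[y] = sum_x mu[x] * P[x, y]."""
    P = np.einsum('xay,xa->xy', p, Q_bar)
    return mu @ P

# toy numbers: two states, two actions
p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.4, 0.6]]])        # p[x, a, y] = p^{x,a,mu,z0}({y})
Q_bar = np.array([[0.5, 0.5],                   # Q_bar[x, a] = Q̄(a | x)
                  [1.0, 0.0]])
mu = np.array([0.6, 0.4])
print(limit_transition(mu, Q_bar, p))           # distribution at the next stage
```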

A decision rule is here a measurable mapping \(\varphi \) from \({\mathbb {P}}(S)\) to \({\mathbb {P}}(D)\) such that \(\varphi (\mu )\in \tilde{D}(\mu )\) for all \(\mu \). We denote by \({\tilde{F}}\) the set of all decision rules. Suppose that \(\psi =(\varphi _0,\varphi _1,\ldots )\) is a policy for the \(\widetilde{\textrm{MDP}}\). As in the previous section we set for \(n\in {\mathbb {N}}_0\)

$$\begin{aligned} \mu _0:= & {} \mu ,\\ \mu _{n+1}:= & {} {\tilde{T}}(\mu _n,\varphi _n(\mu _n), Z^0_{n+1}) \end{aligned}$$

which yields the sequence of distributions of individuals. Note that it is deterministic if \({\tilde{T}}\) does not depend on the common noise \(Z^0\).

Then we define for \(\widetilde{\textrm{MDP}}\) the following value functions for policy \(\psi =(\varphi _0,\varphi _1,\ldots )\) and state \(\mu \in {\mathbb {P}}(S)\)

$$\begin{aligned} J_{\psi }(\mu )= & {} \sum _{k=0}^\infty \beta ^k {\mathbb {E}}_\mu ^\psi [{\tilde{r}}(\mu _k, \varphi _k)], \nonumber \\ J(\mu )= & {} \sup _\psi J_\psi (\mu ). \end{aligned}$$
(4.6)

Instead of (A2) we will now assume that

  1. (A2’)

    \( (x,a,\mu ) \mapsto r(x,a,\mu )\) is continuous.

Definition 4.2

We define

  1. (a)

    The set \(\tilde{{\mathbb {M}}}:= \{ v: {\mathbb {P}}(S) \rightarrow {\mathbb {R}}\; |\; v \text{ is } \text{ continuous } \text{ and } \text{ bounded }\}\).

  2. (b)

    The maximal reward operator \({\tilde{U}}\) on \(\tilde{{\mathbb {M}}} \) in this model is

    $$\begin{aligned} {\tilde{U}} v(\mu )= ({\tilde{U}} v)(\mu ):= & {} \sup _{Q\in {\tilde{D}} (\mu ) } \Big \{ \tilde{r}(\mu ,Q)+ \beta {\mathbb {E}}v( \tilde{T}(\mu ,Q,Z^0))\Big \}. \end{aligned}$$

For the mean-field limit MDP we obtain:

Theorem 4.3

Assume (A0), (A1), (A2’), (A3). Then:

  1. (a)

    The value function J is the unique fixed point of the \({\tilde{U}}\)-operator in \(\tilde{{\mathbb {M}}}\), i.e. it satisfies the optimality equation \(J= {\tilde{U}} J\).

  2. (b)

    \(J = \lim _{n\rightarrow \infty } {\tilde{U}}^n 0\).

  3. (c)

    There exists a maximizer of J and every maximizer \(\varphi ^*\in {\tilde{F}}\) of J defines an optimal stationary deterministic policy \((\varphi ^*,\varphi ^*,\ldots )\).

Remark 4.4

We can use established solution methods like value iteration, policy iteration, linear programming or reinforcement learning to numerically solve the limit MDP (see [4, 10, 30]).

The limit problem can be seen as a problem which approximates the original model when N is large. In order to proceed, we need a more restrictive assumption than (A3)

  1. (A3’)

    \({\mathcal {Z}}\) is compact and \( (x,a,\mu ,z,z_0) \mapsto T(x,a,\mu ,z,z_0)\) is continuous.

Remark 4.5

The assumption that \({\mathcal {Z}}\) is compact is not a strong assumption. Indeed, w.l.o.g. we may choose the disturbances to be uniformly distributed over [0, 1]. This is because if, for example, \({\mathcal {Z}}={\mathbb {R}}\) and F is the distribution function of Z, we get \(Z{\mathop {=}\limits ^{d}} F^{-1}(U)\) with \(U\sim U([0,1])\), and \(F^{-1}\) then becomes part of the transition function.
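This reduction is the standard inverse-transform construction; a small sketch, with an exponential disturbance and an affine transition chosen purely as hypothetical examples:

```python
import numpy as np

rng = np.random.default_rng(1)

def F_inv(u, lam=1.0):
    """Quantile function F^{-1} of the Exp(lam) distribution."""
    return -np.log1p(-u) / lam

def T_exp(x, a, mu_bar, z):
    """Hypothetical transition driven by Z ~ Exp(1)."""
    return 0.5 * x + 0.3 * a + 0.1 * mu_bar + z

def T_unif(x, a, mu_bar, u):
    """The same transition driven by U ~ U([0, 1]) via Z = F^{-1}(U)."""
    return T_exp(x, a, mu_bar, F_inv(u))

print(T_unif(1.0, 0.5, 2.0, rng.uniform()))
```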

Then it is possible to prove the following limit result.

Theorem 4.6

Assume (A0), (A1), (A2’) and (A3’). Let \(\mu ^N_0\Rightarrow \mu _0\) for \(N\rightarrow \infty \) where \(\mu _0^N \in {\mathbb {P}}_N(S)\). Then

  1. (a)

    \(\limsup _{N\rightarrow \infty } J^N(\mu ^N_0)= J(\mu _0)\).

  2. (b)

    Suppose \(\varphi ^*\) is a maximizer of J. Then it is possible to construct (possibly history-dependent) policies \(\psi ^N= (\varphi ^N_0,\varphi ^N_1,\ldots )\) for \(\widehat{\textrm{MDP}}\) s.t. \(\lim _{N\rightarrow \infty } J^N_{\psi ^N}(\mu _0^N)=J(\mu _0)\).

In particular the proof of part (b) shows how to obtain an \(\varepsilon \)-optimal policy for the model with N individuals (N large) when we know the optimal policy for the limit MDP.

Remark 4.7

  1. (a)

    In case there is no common noise, \(\widetilde{\textrm{MDP}}\) is completely deterministic. The optimality equation then reads

    $$\begin{aligned} J(\mu )= & {} \sup _{Q\in {\tilde{D}}(\mu )} \Big \{ {\tilde{r}}(\mu ,Q)+\beta J({\tilde{T}}(\mu ,Q))\Big \} \end{aligned}$$
    (4.7)

    where \({\tilde{T}}(\mu ,Q)(B) = \int p^{x,a,\mu }(B) Q(d(x,a))\) with \(p^{x,a,\mu }(B) = {\mathbb {P}}(T(x,a,\mu ,Z)\in B)\).

  2. (b)

    If there is no common noise and r and T do not depend on \(\mu \), we obtain as a special case a standard MDP. The usual optimality equation for this MDP (for one individual) would be

    $$\begin{aligned} V(x) = \sup _{a\in D(x)} \left\{ r(x,a)+ \beta {\mathbb {E}}V(T(x,a,Z))\right\} ,\; x\in S \end{aligned}$$
    (4.8)

    where \(V(x) = \sup _\pi \sum _{k=0}^\infty \beta ^k {\mathbb {E}}_x^\pi [r(X_k^i,A_k^i)]\). The results in this paper show that we can equivalently consider \(\widehat{\textrm{MDP}}\) which implies the optimality equation (4.7). It is possible to show by induction that the relation between both value functions is given by \(J(\mu ) = \int V(x)\mu (dx)\). Moreover, a maximizer of J is given by \(\varphi ^*(\mu )=\mu \otimes {\bar{Q}}^*\) with \({\bar{Q}}^*(\cdot |x)= \delta _{f^*(x)}\) for some \(f^*: S\rightarrow A\) with \(f^*(x)\in D(x)\) and \(f^*\) is a maximizer of V. Here the choice of the conditional distribution \({\bar{Q}}^*\) does not depend on \(\mu \) and is concentrated on a single action.

  3. (c)

The policy \(\psi ^N\) which is constructed in Theorem 4.6 is deterministic but has the disadvantage that individuals have to communicate. Another possibility is to choose \(Q_0^N\) as an empirical measure of \(Q_0^*\) given \(\mu _0^N\). This means if \(Q_0^* = \mu _0 \otimes {\bar{Q}}^*\) and \(\mu _0^N = \mu [{\textbf{x}}^N]\), then for all \(x_i^N\) simulate actions \(a_i^N\) according to the kernel \({\bar{Q}}^*\). This is a randomized policy, but it has the advantage that every individual can act on its own without information about the states and actions of the others. It is thus a decentralized control, i.e. \(f^i({\textbf{x}})=f^i(x_i)\); see the sketch after this remark. Also the speed of the convergence in Theorem 4.6 depends on the chosen approximation method.
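A minimal sketch of the randomized, decentralized policy from part (c): each individual draws its action from \({\bar{Q}}^*(\cdot |x_i)\) using only its own state. As a placeholder for \({\bar{Q}}^*\) we reuse the kernel \({\bar{Q}}\) of Example 3.6; everything else is an illustrative choice of our own.

```python
import numpy as np

rng = np.random.default_rng(2)

actions = [1, 2, 3]                           # A = S for the triangle example
Q_bar_star = {1: [0.0, 0.5, 0.5],             # Q̄*(a | x): rows indexed by x,
              2: [1.0, 0.0, 0.0],             # columns by a (kernel of Example 3.6)
              3: [0.5, 0.5, 0.0]}

def decentralized_actions(x_vec):
    """Each individual i draws a_i ~ Q̄*(. | x_i); no communication is needed."""
    return [int(rng.choice(actions, p=Q_bar_star[x_i])) for x_i in x_vec]

x_vec = [1, 2, 3, 1, 3]                       # states of the N = 5 individuals
print(decentralized_actions(x_vec))
```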

We summarize the model data below:

Model \(\widetilde{\textrm{MDP}}\)

State space: \({\mathbb {P}}(S) \ni \mu \)

Admissible actions: \({\tilde{D}} (\mu ) :=\{ Q \in {\mathbb {P}}(D) \,| \, \text{ the first marginal of } Q \text{ is } \mu \}\ni Q\)

Transition function: \( \tilde{T}(\mu ,Q,Z^0)(B) = \int _{ D} p^{x,a,\mu ,Z^0}(B) Q(d(x,a))\) where \(p^{x,a,\mu ,Z^0} (B):={\mathbb {P}}(T(x,a,\mu ,Z^i,Z^0)\in B |Z^0)\)

Reward: \( \tilde{r}(\mu ,Q) := \int _{D} r(x,a,\mu ) Q(d(x,a))\)

Policy: \(\psi =(\varphi _0,\varphi _1,\ldots )\) with \(\varphi _n\in {\tilde{F}} := \{ \varphi : {\mathbb {P}}(S) \rightarrow {\mathbb {P}}(D) \;| \; \varphi \text{ measurable, } \varphi (\mu )\in {\tilde{D}}(\mu ),\; \forall \mu \in {\mathbb {P}}(S) \} \)

Example 4.8

We reconsider Example 2.4. In \(\widetilde{\textrm{MDP}}\) a state can be any distribution on S, e.g. \(\mu =(\pi ^{-1},0,1-\pi ^{-1})\). An action is a distribution on \(D=\{(1,2),(1,3),(2,1),(2,3),(3,1),(3,2)\}\) s.t. the first marginal is \(\mu \), for example \(Q=(\pi ^{-1},0,0,0,3/4(1-\pi ^{-1}),1/4(1-\pi ^{-1}))\). Here \( \tilde{r}(\mu ,Q) =\pi ^{-1}\).

5 Average Reward Optimality

In this section we consider the problem of finding the maximal average reward of the mean-field limit problem \(\widetilde{\textrm{MDP}}\). So suppose an \(\widetilde{\textrm{MDP}}\) as in the previous section (Eq. (4.6)) is given. For a fixed policy \(\psi =(\varphi _0,\varphi _1,\ldots )\) define

$$\begin{aligned} \liminf _{n\rightarrow \infty } \frac{1}{n} \sum _{k=0}^{n-1} {\mathbb {E}}_\mu ^\psi [\tilde{r} (\mu _k,\varphi _k)] =: G_\psi (\mu ). \end{aligned}$$
(5.1)

The problem is to find \(G(\mu ):= \sup _\psi G_\psi (\mu )\) for all \(\mu \in {\mathbb {P}}(S)\). We will construct the solution via the vanishing discount approach, see e.g. [3, 19, 33, 34]. This has the advantage that we get a statement about the approximation of the \(\beta \)-discounted problem by the average reward problem immediately. For this purpose we denote by \(J^\beta , J^\beta _ \psi \) the value functions of the discounted reward problem \(\widetilde{\textrm{MDP}}\) of the previous section in order to stress that they depend on the discount factor \(\beta \).

We first note that the following Tauberian theorem holds (see e.g. [34], Th. A.4.2):

Lemma 5.1

For arbitrary \(\mu \in {\mathbb {P}}(S)\) and policy \(\psi =(\varphi _0,\varphi _1,\ldots )\) we have

$$\begin{aligned}{} & {} \liminf _{n\rightarrow \infty } \frac{1}{n} \sum _{k=0}^{n-1} {\mathbb {E}}_\mu ^\psi [{\tilde{r}} (\mu _k,\varphi _k)] = G_\psi (\mu ) \le \liminf _{\beta \uparrow 1} (1-\beta ) J_\psi ^\beta (\mu )\\{} & {} \le \limsup _{\beta \uparrow 1} (1-\beta ) J_\psi ^\beta (\mu )\le \limsup _{n\rightarrow \infty } \frac{1}{n} \sum _{k=0}^{n-1} {\mathbb {E}}_\mu ^\psi [\tilde{r} (\mu _k,\varphi _k)] <\infty \end{aligned}$$

In order to proceed we make the following assumption (compare with condition (B) in [33] or condition (SEN) in [34], Section 7.2).

  1. (A4)

    There exist \(L>0, \bar{\beta }\in (0,1)\) and a function \(M: {\mathbb {P}}(S)\rightarrow {\mathbb {R}}\) such that

    $$\begin{aligned} M(\mu ) \le h^\beta (\mu ):= J^\beta (\mu )-J^\beta (\nu )\le L \end{aligned}$$

    for fixed \(\nu \in {\mathbb {P}}(S)\), all \(\mu \in {\mathbb {P}}(S)\) and all \(\beta \ge \bar{\beta }\).

We define \(\rho (\beta ):= (1-\beta ) J^\beta (\nu )\). Note that since r is bounded by a constant \(C>0\), say, we obtain \( |\rho (\beta )| \le (1-\beta ) |J^\beta (\nu )| \le C\). Hence \( \rho (\beta )\) is bounded and \(\rho := \limsup _{\beta \uparrow 1} \rho (\beta )\) exists. Now we obtain:

Lemma 5.2

Under (A4) there exists a sequence \((\beta _n)\) with \(\lim _{n\rightarrow \infty } \beta _n = 1\) s.t.

$$\begin{aligned} \lim _{n\rightarrow \infty } (1-\beta _n) J^{\beta _n}(\mu )=\rho \end{aligned}$$

for all \(\mu \in {\mathbb {P}}(S)\). In particular we have \(G_\psi (\mu ) \le \rho \) for all \(\mu \) and \(\psi \).

Proof

Using (A4) we obtain:

$$\begin{aligned} |(1-\beta ) J^\beta (\mu )-\rho |= & {} |(1-\beta ) h^\beta (\mu ) + \rho (\beta )-\rho | \le (1-\beta ) |h^\beta (\mu )| + |\rho (\beta )-\rho | \\\le & {} (1-\beta ) \max \{L,M(\mu )\} + |\rho (\beta )-\rho |. \end{aligned}$$

The last term converges to zero when we choose \((\beta _n)\) s.t. \(\lim _{n\rightarrow \infty }\beta _n=1\) and \(\lim _{n\rightarrow \infty }\rho (\beta _n)=\rho \) which is possible due to the considerations preceding this lemma. The first term also tends to zero. \(\square \)

We obtain:

Theorem 5.3

Assume (A0), (A1), (A2’), (A3’), (A4). Then:

  1. (a)

    There exists a constant \(\rho \in {\mathbb {R}}\) and an upper semicontinuous function \(h:{\mathbb {P}}(S)\rightarrow {\mathbb {R}}\) such that the average reward optimality inequality holds, i.e. for all \(\mu \in {\mathbb {P}}(S)\)

    $$\begin{aligned} \rho + h(\mu ) \le \sup _{Q\in {\tilde{D}}(\mu )} \left\{ {\tilde{r}}(\mu , Q) + {\mathbb {E}}[ h({\tilde{T}}(\mu ,Q,Z^0))] \right\} . \end{aligned}$$
    (5.2)

    Moreover, there exists a maximizer \(\varphi ^*\) of (5.2).

  2. (b)

    The stationary policy \((\varphi ^*,\varphi ^*,\ldots )\) is optimal for the average reward problem and \(\rho = \limsup _{\beta \uparrow 1} \rho (\beta )\) is the maximal average reward, independent of \(\mu \). Moreover, there exists a decision rule \(\varphi ^0\) and sequences \(\beta _m(\mu )\uparrow 1\), \(\mu _m(\mu ) \rightarrow \mu \) s.t.

    $$\begin{aligned} \varphi ^0(\mu ):= \lim _{m\rightarrow \infty } \varphi ^{\beta _m(\mu )}(\mu _m(\mu )) \end{aligned}$$

    where \(\varphi ^\beta \) is an optimal decision rule in the \(\beta \)-discounted model and the stationary policy \((\varphi ^0,\varphi ^0,\ldots )\) is optimal for the average reward problem.

Note that part (b) of the previous theorem states that it is possible to obtain an average reward optimal policy from optimal policies in the discounted model. Indeed what is maybe more interesting is the converse. From the average optimal policy we can construct \(\varepsilon \)-optimal policies for \(\widetilde{\textrm{MDP}}\) and thus also for \(\widehat{\textrm{MDP}}\) if \(\beta \) is close to one. The idea is to use the double approximation (number of agents large, discount factor large) to approximate the discounted finite agent model by the average mean-field problem. We do not tackle the question of convergence speed or how \(\beta \) depends on N here. A policy \(\psi \) is \(\varepsilon \)-optimal in state \(\mu \in {\mathbb {P}}(S)\) for \(\widetilde{\textrm{MDP}}\) if

$$\begin{aligned} 1- \Big | \frac{J_\psi ^\beta (\mu )}{J^\beta (\mu )}\Big | \le \varepsilon . \end{aligned}$$

Thus, we obtain:

Corollary 5.4

Under the assumptions of Theorem 5.3 suppose \(\psi ^*=(\varphi ^*,\varphi ^*,\ldots )\) is an optimal stationary policy for the average reward problem and \(\psi ^N\) is constructed as in Theorem 4.6. Then for all \(\varepsilon >0\) and for all \(\mu \in {\mathbb {P}}(S)\) there exists a \(\beta (\mu ) <1\)

  1. (a)

    s.t. \(\psi ^*\) is \(\varepsilon \)-optimal for \(\widetilde{\textrm{MDP}}\) in state \(\mu \) for all \(\beta \ge \beta (\mu )\).

  2. (b)

and there exists an \(N(\mu ,\beta (\mu ))\in {\mathbb {N}}\) s.t. for all \(N\ge N(\mu ,\beta (\mu ))\) and \(\beta \ge \beta (\mu )\) the policy \(\psi ^N\) is \(\varepsilon \)-optimal for \(\widehat{\textrm{MDP}}\), i.e. \((1-\beta )|J^N_{\psi ^N}(\mu ^N)-J^N(\mu ^N) |\le \varepsilon \) where \(\mu ^N \Rightarrow \mu \).

Proof

  1. (a)

    By Theorem 5.3 we know that \(\rho =G_{\psi ^*}(\mu )\) is the maximal average reward. Lemma 5.1 and Theorem 5.3 together imply

    $$\begin{aligned} \rho= & {} G_{\psi ^*}(\mu )\le \liminf _{\beta \uparrow 1}(1-\beta ) J^\beta _{\psi ^*}(\mu ) \le \limsup _{\beta \uparrow 1}(1-\beta ) J^\beta _{\psi ^*}(\mu ) \\\le & {} \limsup _{\beta \uparrow 1}(1-\beta ) J^\beta (\mu ) =\rho \end{aligned}$$

    which means that we have equality everywhere. Since r is bounded, w.l.o.g. we may assume that r is bounded from below by \(\underline{C}>0\), otherwise we have to shift the function by a constant. Now for all \(\varepsilon >0\) we can choose, due to the preceding equation, \(\beta (\mu )\) s.t. for all \(\beta \ge \beta (\mu )\)

    $$\begin{aligned} |J^\beta (\mu )-J^\beta _{\psi ^*}(\mu )|\le \frac{\varepsilon }{1-\beta } \text{ and } \text{ hence } 1- \Big | \frac{J_{\psi ^*}^\beta (\mu )}{J^\beta (\mu )}\Big | \le \frac{\varepsilon }{(1-\beta ) J^\beta (\mu )}\le \frac{\varepsilon }{\underline{C}} \end{aligned}$$

    which implies the result.

  2. (b)

    Let \(\varepsilon >0\). From part a) choose \(\beta (\mu )<1\) s.t. for all \(\beta \ge \beta (\mu )\) we have \((1-\beta ) |J^\beta (\mu )-J^\beta _{\psi ^*}(\mu )|\le \varepsilon /3.\) Fix such a \(\beta \ge \beta (\mu )\). From Theorem 4.6 choose \(N\ge N(\mu ,\beta )\) s.t.

    $$\begin{aligned} |J^N_{\psi ^N}(\mu ^N)-J^\beta _{\psi ^*}(\mu ) |\le \varepsilon /3 \text{ and } |J^N(\mu ^N)-J^\beta (\mu ) |\le \varepsilon /3. \end{aligned}$$

    Then, in total

    $$\begin{aligned}{} & {} (1-\beta )|J^N_{\psi ^N}(\mu ^N)-J^N(\mu ^N)| \le (1-\beta )|J^N_{\psi ^N}(\mu ^N)-J^\beta _{\psi ^*}(\mu )|\nonumber \\{} & {} \quad +(1-\beta )|J^\beta _{\psi ^*}(\mu )-J^\beta (\mu )| + (1-\beta )|J^\beta (\mu )-J^N(\mu ^N)| \le \varepsilon \end{aligned}$$
    (5.3)

    which implies the statement.

\(\square \)

5.1 Special Case I

We consider the following special case: The reward depends only on \(\mu \), i.e. we have \({\tilde{r}}(\mu ,Q)={\tilde{r}} (\mu )\). The transition function is independent of \(\mu \) and there is no common noise, i.e. all individuals move independently of each other. Suppose \(\mu ^*\in {\mathbb {P}}(S)\) is the solution of the static optimization problem

$$\begin{aligned} \left\{ \begin{array}{ll} \max {\tilde{r}}(\mu ) \\ s.t. \; \mu \in {\mathbb {P}}(S) \end{array}\right. \end{aligned}$$
(5.4)

which exists since \({\tilde{r}}\) is continuous on the compact space \({\mathbb {P}}(S)\). In the described situation \(\widetilde{\textrm{MDP}}\) is deterministic and the evolution of the state process for a given policy is

$$\begin{aligned} \mu _{k+1}(B) = \int _D p^{x,a}(B) {\bar{Q}} (da|x)\mu _k(dx) = \int _S P^{{\bar{Q}}}(B|x)\mu _k(dx),\; B\in {\mathcal {B}}(S)\qquad \end{aligned}$$
(5.5)

for \(k\in {\mathbb {N}}\) where we start with the initial distribution \(\mu _0\).

Now suppose further that there exists a transition kernel (policy) \( {\bar{Q}}^*\) such that \(\mu ^*\) is a stationary distribution of \( P^{ {\bar{Q}}^*}\) and \(P^{ {\bar{Q}}^*}\) satisfies the Wasserstein ergodicity (see Appendix). Moreover, let \((\mu _k^*)\) be the state sequence obtained in (5.5) where we replace \(P^{{\bar{Q}}}\) by \(P^{ {\bar{Q}}^*}\). Then \(\mu _k^*\Rightarrow \mu ^*\) weakly for \(k\rightarrow \infty \) since convergence in the Wasserstein metric implies weak convergence on compact sets. Problem (5.4) and the solution approach here are similar to the concept of steady state policies in [12].

Lemma 5.5

Under the assumptions of this subsection \(\varphi ^*(\mu ) = \mu \otimes {\bar{Q}}^*\) defines an average reward optimal stationary policy \(\psi ^*=(\varphi ^*,\varphi ^*,\ldots )\).

Proof

Since \(\mu \mapsto {\tilde{r}}(\mu )\) is continuous (see proof of Theorem 4.3) we obtain \(\lim _{k\rightarrow \infty }\tilde{r}(\mu _k^*) = {\tilde{r}}(\mu ^*)\). Thus we have for all \(\mu \in {\mathbb {P}}(S)\)

$$\begin{aligned} G_{\psi ^*}(\mu ) =\liminf _{n\rightarrow \infty } \frac{1}{n} \sum _{k=0}^{n-1} {\tilde{r}} (\mu _k^*) = {\tilde{r}}(\mu ^*)= G(\mu ). \end{aligned}$$

The last equation follows from the definition of \(\mu ^*\). Hence \(\psi ^*\) is average reward optimal. \(\square \)

We can thus think of the problem as being transformed into a Markov Chain Monte Carlo problem: sample from \(\mu ^*\). In order to obtain an \(\varepsilon \)-optimal policy in the N-individuals problem with a large discount factor, an individual in state x can sample its action from \({\bar{Q}}^*(\cdot |x)\) (see the proof of Theorem 4.6 and Remark 4.7 c)). This yields a decentralized decision which does not depend on the complete state of the system, i.e. the individuals do not have to communicate with each other in order to push the system to the social optimum; knowledge of their own state is sufficient. Problems may occur when the solution of (5.4) is not unique. Then the individuals have to communicate which solution is preferred. In particular the individual's optimal decision coincides with the socially optimal decision. This is because we can interpret \(\mu _k\) as the distribution of a typical individual at time k. Also note that in this case it can be shown that Assumption (A4) is satisfied since \(|{\tilde{r}}(\mu _k^*)-{\tilde{r}}(\mu ^*)| \le C W(\mu _k^*,\mu ^*) \le {\tilde{C}} \rho ^k\) with \(\rho \in (0,1)\), where W is the Wasserstein distance of two measures (see Appendix). We will give a more specific application in Sect. 6.
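In the finite-state case the argument of Lemma 5.5 is easy to visualise numerically: iterate \(\mu _{k+1}=\mu _k P^{{\bar{Q}}^*}\) and watch \({\tilde{r}}(\mu _k)\) approach \({\tilde{r}}(\mu ^*)\). The kernel, the reward and the target distribution below are placeholders of our own; Sect. 6.1 shows how a kernel with a prescribed stationary distribution can actually be constructed.

```python
import numpy as np

def average_reward_trajectory(mu0, P, r_tilde, n_steps=100):
    """Iterate mu_{k+1} = mu_k P and record r_tilde(mu_k) along the way."""
    mu, rewards = np.asarray(mu0, dtype=float), []
    for _ in range(n_steps):
        rewards.append(r_tilde(mu))
        mu = mu @ P
    return mu, rewards

# placeholder kernel P^{Q_bar*}; its stationary distribution (0.5, 0.25, 0.25)
# stands in for the maximizer mu* of (5.4) in this toy run
P = np.array([[0.50, 0.25, 0.25],
              [0.50, 0.30, 0.20],
              [0.50, 0.20, 0.30]])
mu_star = np.array([0.5, 0.25, 0.25])
r_tilde = lambda mu: 1.0 - float(np.sum(mu ** 2))   # an illustrative reward r_tilde(mu)

mu_k, rewards = average_reward_trajectory([1.0, 0.0, 0.0], P, r_tilde)
print(np.allclose(mu_k, mu_star), abs(rewards[-1] - r_tilde(mu_star)))
```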

5.2 Special Case II

We relax the previous case and allow the transition function to depend on \(\mu \). Again we determine the solution \(\mu ^*\) of (5.4) first. Next we check whether there exists a transition kernel (policy) \({\bar{Q}}^*\) such that \(\mu ^*\) is a stationary distribution of \( P^{{\bar{Q}}^*}\) with \(P^{\bar{Q}^*}(B|x)= \int p^{x,a,\mu ^*}(B) {\bar{Q}}^* (da|x)\) for \(x\in S, B\in {\mathcal {B}}(S)\) and \(P^{{\bar{Q}}^*}\) satisfies the Wasserstein ergodicity. Here, we need some further properties of the model to obtain the same result as in Case I, because we have to make sure that the system still converges to \(\mu ^*\), even if we choose the ’wrong’ transition kernel

$$\begin{aligned} \int p^{x,a,\mu _k}(B) {\bar{Q}}^* (da|x) \end{aligned}$$

at stage k. Note that the evolution of the state in this model is given by

$$\begin{aligned} \mu _{k+1}^*(B) = \int \int p^{x,a,\mu _k}(B) {\bar{Q}}^* (da|x)\mu _k^*(dx). \end{aligned}$$

In particular we want to find an optimal decentralized control. The following assumptions will be useful:

  1. (T1)

    There exists \(\gamma _W>0\) s.t. \(\sup _{x,a,z}|T(x,a,\mu ,z)-T(x,a,\mu ^*,z)|\le \gamma _W W(\mu ,\mu ^*)\) for all \(\mu \in {\mathbb {P}}(S)\).

  2. (T2)

    D(x) does not depend on x and \(W({\bar{Q}}^*(\cdot |x), {\bar{Q}}^*(\cdot |x'))\le \gamma _Q |x-x'|\) for all \(x,x'\in S\).

  3. (T3)

    There exists \(\gamma _A>0\) s.t. \(\sup _{x,z}|T(x,a,\mu ^*,z)-T(x,a',\mu ^*,z)|\le \gamma _A |a-a'|\) for all \(a,a'\in A\).

  4. (T4)

    There exists \(\gamma _S>0\) s.t. \(\sup _{a,z}|T(x,a,\mu ^*,z)-T(x',a,\mu ^*,z)|\le \gamma _S |x-x'|\) for all \(x,x'\in S\).

  5. (T5)

    \(\gamma :=\gamma _W+\gamma _Q\gamma _A+\gamma _S<1.\)

The next lemma states that under these assumptions the sequence \((\mu _k^*)\) still converges to the optimal distribution \(\mu ^*\).

Lemma 5.6

Under (T1)-(T5) we obtain: \(W(\mu _{k+1}^*,\mu ^*)\le \gamma W(\mu _k^*,\mu ^*)\) and thus \(\mu _k^*\Rightarrow \mu ^*\) weakly.

Lemma 5.6 then implies that even in this case the maximal average reward \({\tilde{r}}(\mu ^*)\) is achieved by applying \({\bar{Q}}^*\) throughout the process which corresponds to a decentralized control. An example where (T1), (T3), (T4) are fulfilled is \(T(x,a,\mu ,z) = \gamma _S x+ \gamma _A a +\gamma _W \int x\mu (dx) + z\).

6 Applications

6.1 Avoiding Congestion

We consider here the following special case: N individuals move on a graph with nodes \(S=\{1,\ldots ,d\}\) and edges \(E\subset \{(x,x'): x,x'\in S\}\). Individuals can move along one edge in one time step. We assume that the graph is connected. The aim is to avoid congestion and to spread the individuals such that they keep a maximal distance from each other. More precisely, suppose that the current empirical distribution of the individuals on the nodes is \(\mu \) and that the distance between nodes x and \(x'\), \(x,x'\in S\), is given by \(\Delta (x,x')>0\) for \(x\ne x'\), where \(\Delta (x,x)=0\) and \(\Delta (x,x')=\Delta (x',x)\). Then the average distance between an individual at position x and all other individuals is

$$\begin{aligned} r(x,a,\mu ) = r(x,\mu ) = \sum _{x'} \Delta (x,x') \mu (x')= \int \Delta (x,x')\mu (dx'). \end{aligned}$$

Here \(r(x,a,\mu )\) does not depend on a. Hence

$$\begin{aligned} {\tilde{r}}(\mu ,Q) = {\tilde{r}}(\mu ) = \int r(x,\mu ) \mu (dx) = \int \int \Delta (x,x') \mu (dx)\mu (dx')= \mu \Delta \mu ^\top \end{aligned}$$

where \(\Delta =\big ( \Delta (x,x')\big )_{x,x'\in S}\) is the matrix of distances. Note that \(\Delta \) is symmetric. We assume that \(A=S\) and \(D(x)=\{x'\in S: (x,x')\in E\}\cup \{x\}\), i.e. admissible actions in the original model are the neighbours of the current node together with the node itself. We interpret actions as intended directions the individual wants to move to, but this may be disturbed by some random external noise. In the mean-field limit the state of the system at time n is given by an arbitrary distribution on S. Recall that the general transition equation of the mean-field limit is

$$\begin{aligned} \mu _{n+1}(x')= & {} \sum _x \sum _{a\in D(x)} p^{x,a,\mu _n,z^0}(x') Q_n(x,a) \nonumber \\= & {} \sum _x \sum _{a\in D(x)} p^{x,a,\mu _n,z^0}(x') \bar{Q}_n(a|x)\mu _n(x) \end{aligned}$$
(6.1)

if S and A are finite, where \( p^{x,a,\mu ,z^0}(x') = {\mathbb {P}}(T(x,a,\mu ,Z,z^0)=x')\) and \(Q_n\) has first marginal \(\mu _n\). Problems where the reward decreases when more individuals share the same state are typical mean-field problems, see e.g. [25], where a Wardrop equilibrium is computed. In [28] the authors consider the spreading of contamination on graphs.

6.1.1 No Common Noise

We consider the mean-field limit now. To begin with, let us assume that \(p^{x,a,\mu ,z^0} = p^{x,a}\) does not depend on \(\mu \) and \(z^0\), i.e. the individuals move on their own, unaffected by the others, and there is no common noise. Moreover, it is reasonable to set \(p^{x,a}(x')=0\) if \((x,x')\notin E\) except for \(x=x'\). Let us denote \( P^{{\bar{Q}}}=\big ( p_{xx'}^{{\bar{Q}}}\big )\) where

$$\begin{aligned} p_{xx'}^{{\bar{Q}}} = \sum _{a\in D(x)} p^{x,a}(x') {\bar{Q}}(a|x) \end{aligned}$$
(6.2)

with \({\bar{Q}}(a|x)\mu (x)=Q(x,a)\). Hence (6.1) can be written as \(\mu _{n+1}=\mu _n P^{{\bar{Q}}_n}\). Here it is more intuitive to work with the conditional probabilities \({\bar{Q}}(a|x)\) instead of the joint distribution Q(xa).

Obviously the optimization problem

$$\begin{aligned} \left\{ \begin{array}{ll} \max \mu \Delta \mu ^\top \\ s.t. \; \mu \in {\mathbb {P}}(S) \end{array}\right. \end{aligned}$$
(6.3)

has an optimal solution \(\mu ^*\) since \({\mathbb {P}}(S)\) is compact and \(\mu \mapsto \mu \Delta \mu ^\top \) is continuous.

We consider the following special case: For \(a,x'\in D(x)\) set \(p^{x,a}(x') =\alpha \) for \(a=x'\) and \(p^{x,a}(x') =\frac{1-\alpha }{|D(x)|-1}\) else. All other probabilities are zero. I.e. if we choose a vertex a we will move there with probability \(\alpha \) and move to any other admissible vertex with equal probability. Formally for \(x\in S\), action \(a\in D(x)=\{x_1,\ldots ,x_m\}\) (where \(x_i=x\) for one of the \(x_i\)’s) and disturbance \(Z\sim U[0,1]\) the transition function in this example is given by

$$\begin{aligned} T(x,x_i,\mu ,z,z^0)= \left\{ \begin{array}{cl} x_i, &{} \text{ if } z\in [0,\alpha ],\\ x_j, &{} \text{ if } z\in (\alpha +(j-1) \frac{1-\alpha }{m-1}, \alpha + j\frac{1-\alpha }{m-1} ],\; j=1,\ldots ,i-1,\\ x_j, &{} \text{ if } z\in (\alpha +(j-2) \frac{1-\alpha }{m-1}, \alpha + (j-1)\frac{1-\alpha }{m-1} ],\; j=i+1,\ldots ,m. \end{array}\right. \end{aligned}$$
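For the mean-field recursion it suffices to tabulate this kernel rather than to simulate the disturbance Z. A short sketch under the same array conventions as above; the list `neighbors` encoding D(x) is an assumption about the input format:

```python
import numpy as np

def special_kernel(neighbors, alpha):
    """Tabulate p[x, a, x'] for the special case above.

    neighbors[x] lists D(x), i.e. the neighbours of node x on the graph plus x itself;
    actions are identified with nodes 0, ..., d-1.
    """
    d = len(neighbors)
    p = np.zeros((d, d, d))
    for x, Dx in enumerate(neighbors):
        m = len(Dx)
        for a in Dx:
            for y in Dx:
                # move to the intended node a with probability alpha,
                # otherwise uniformly to one of the remaining admissible nodes
                p[x, a, y] = alpha if y == a else (1.0 - alpha) / (m - 1)
    return p
```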

Lemma 6.1

If \(\mu ^*(x)>0\) for all \(x\in S\) and \(\alpha \) is large enough, then there exists a \(Q^*\in {\mathbb {P}}(D)\) s.t. \(\mu ^* = \mu ^* P^{\bar{Q}^*}\), i.e. \(\mu ^*\) is a stationary distribution for the transition kernel \(P^{{\bar{Q}}^*}\) given in (6.2).

Proof

We use a construction similar to the Metropolis algorithm. For \(x,x'\in S\) let

$$\begin{aligned} \Psi _{xx'}:= \left\{ \begin{array}{ll} \kappa , &{} \text{ if } (x,x')\in E\\ 0 &{} \text{ else. } \end{array}\right. \end{aligned}$$

and

$$\begin{aligned} p_{xx'}^{{\bar{Q}}^*}:= \left\{ \begin{array}{ll} \Psi _{xx'}\Big ( \frac{\mu ^*(x')}{\mu ^*(x)}\wedge 1\Big ), &{} \text{ if } x\ne x'\\ 1- \sum _{y\ne x} \Psi _{xy} \Big ( \frac{\mu ^*(y)}{\mu ^*(x)}\wedge 1\Big )&{} \text{ if } x=x'. \end{array}\right. \end{aligned}$$

The parameter \(\kappa >0\) should be such that \(P^{{\bar{Q}}^*}\) is a transition matrix. Then the detailed balance equations

$$\begin{aligned} \mu ^*(x) p_{xx'}^{{\bar{Q}}^*} = \mu ^*(x') p_{x'x}^{{\bar{Q}}^*}, \quad x,x'\in S \end{aligned}$$

are satisfied and hence \(\mu ^*\) is a stationary distribution of \(P^{{\bar{Q}}^*}\). We now have to determine \({\bar{Q}}^*\) s.t. \(P^{\bar{Q}^*}\) has the specified form. Let us fix \(x\in S\). We have to solve (6.2) for \({\bar{Q}}^*\). We claim that (6.2) is solved for

$$\begin{aligned} {\bar{Q}}^*(a|x)= \frac{(|D(x)|-1) p_{xa}^{{\bar{Q}}^*}-(1-\alpha )}{\alpha |D(x)| -1}. \end{aligned}$$
(6.4)

This can be seen since

$$\begin{aligned} \sum _{a\in D(x)} p^{x,a}(x') {\bar{Q}}^*(a|x)= & {} {\bar{Q}}^*(x'|x) \alpha +\frac{1-\alpha }{|D(x)|-1} (1-{\bar{Q}}^*(x'|x))\nonumber \\= & {}  {\bar{Q}}^*(x'|x) \Big ( \frac{\alpha |D(x)|-1}{|D(x)|-1}\Big ) + \frac{1-\alpha }{|D(x)|-1}= p_{xx'}^{{\bar{Q}}^*}. \end{aligned}$$
(6.5)

In order to have \({\bar{Q}}^*(a|x)\in [0,1]\) we have to make sure that \( \alpha \ge p_{xx'}^{{\bar{Q}}^*} \vee (1-p_{xx'}^{{\bar{Q}}^*})\) for all \(x,x'\in S\) and \(\alpha \ge \frac{1}{2}\). \(\square \)
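The construction of the proof is straightforward to carry out numerically. The following sketch builds \(P^{{\bar{Q}}^*}\) from \(\mu ^*\) and \(\kappa \) and recovers \({\bar{Q}}^*\) via (6.4); it assumes \(\mu ^*(x)>0\) for all x and that \(\kappa \) and \(\alpha \) satisfy the conditions stated above.

```python
import numpy as np

def metropolis_kernel(mu_star, neighbors, kappa):
    """P^{Qbar*} from the proof: p_{xx'} = kappa * min(mu*(x')/mu*(x), 1) for edges (x, x'),
    with the diagonal chosen so that every row sums to one."""
    d = len(mu_star)
    P = np.zeros((d, d))
    for x in range(d):
        for y in neighbors[x]:
            if y != x:
                P[x, y] = kappa * min(mu_star[y] / mu_star[x], 1.0)
        P[x, x] = 1.0 - P[x].sum()
    return P

def decision_rule(P, neighbors, alpha):
    """Recover Qbar*(a|x) from P^{Qbar*} via (6.4); entries with a outside D(x) stay zero."""
    d = P.shape[0]
    Q = np.zeros((d, d))
    for x in range(d):
        m = len(neighbors[x])
        for a in neighbors[x]:
            Q[x, a] = ((m - 1) * P[x, a] - (1.0 - alpha)) / (alpha * m - 1.0)
    return Q

# Sanity checks one may run with concrete data:
#   np.allclose(mu_star @ metropolis_kernel(mu_star, neighbors, kappa), mu_star)
#   rows of decision_rule(...) are nonnegative and sum to one.
```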

Theorem 6.2

The optimal average reward policy for the limit model considered here is the stationary policy \(\psi ^*=(\varphi ^*,\varphi ^*,\ldots )\) with \(\varphi ^*(\mu )= \mu \otimes {\bar{Q}}^*\), where \({\bar{Q}}^*\) is given by (6.4). Thus, for N large and \(\beta \) close to one, sampling actions from \({\bar{Q}}^*\) is \(\varepsilon \)-optimal for the \(\beta \)-discounted problem with N individuals.

Proof

The statement follows from our previous discussions. Note that when we start with an arbitrary \(\mu _0^*\), the sequence of distributions generated by \(\mu _{k+1}^* = \mu _k^* P^{{\bar{Q}}^*}\) converges to \(\mu ^*\) since the matrix \(P^{{\bar{Q}}^*}\) is irreducible by construction and the state space is finite. Thus, \(G_\psi (\mu _0^*)\) in (5.1) yields the same limit \(\mu ^* \Delta (\mu ^*)^\top \), which is maximal since it solves (5.4). \(\square \)

Remark 6.3

It is tempting to say that for the discounted problem, once we have reached the stationary distribution after a transient phase we know that the optimal policy is to choose \({\bar{Q}}^*\) forever. However, there are only rare cases where the stationary distribution is reached after a finite number of steps (see e.g. [15]), so the transient phase will in most cases last forever.

Example 6.4

We consider a regular \(3\times 3\) grid, i.e. \(d=9\) (see Fig. 1, left). We set the distance between two nodes equal to 1 when they are joined by a single edge. Nodes which are connected via a shortest path of 2 edges get distance 1.4, nodes which are 3 edges apart get distance 1.7, and finally we set the distance equal to 2.2 when there are 4 edges in between. The distance matrix \(\Delta \) is thus given by

$$\begin{aligned} \Delta := \left( \begin{array}{ccccccccc} 0 &{} 1 &{} 1.4 &{} 1 &{} 1.4 &{} 1.7 &{} 1.4 &{} 1.7 &{} 2.2\\ 1 &{} 0 &{} 1 &{} 1.4 &{} 1 &{} 1.4 &{} 1.7 &{} 1.4 &{} 1.7 \\ 1.4 &{} 1 &{} 0 &{} 1.7 &{} 1.4 &{} 1 &{} 2.2 &{} 1.7 &{} 1.4\\ 1 &{} 1.4 &{} 1.7 &{} 0 &{} 1 &{} 1.4 &{} 1 &{} 1.4 &{} 1.7\\ 1.4 &{} 1 &{} 1.4 &{} 1 &{} 0 &{} 1 &{} 1.4 &{} 1 &{} 1.4\\ 1.7 &{} 1.4 &{} 1 &{} 1.4 &{} 1 &{} 0 &{} 1.7 &{} 1.4 &{} 1\\ 1.4 &{} 1.7 &{} 2.2 &{} 1 &{} 1.4 &{} 1.7 &{} 0 &{} 1 &{} 1.4\\ 1.7 &{} 1.4 &{} 1.7 &{} 1.4 &{} 1 &{} 1.4 &{} 1 &{} 0 &{} 1\\ 2.2 &{} 1.7 &{} 1.4 &{} 1.7 &{} 1.4 &{} 1 &{} 1.4 &{} 1 &{} 0 \end{array}\right) \end{aligned}$$

The optimal distribution of problem (5.4) is here given by \(\mu ^*= \frac{1}{37} (7,2,7,2,1,2,7,2,7)\). The masses are illustrated in Fig. 1, right picture. The area of each circle is proportional to the corresponding mass of \(\mu ^*\), which we interpret as the proportion of individuals occupying that node.

Fig. 1  Network with labelled nodes (left); Optimal stationary distribution (right)
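Problem (6.3) is the maximization of an indefinite quadratic form over the simplex and hence not a concave program. For a numerical illustration one can use a multi-start local search; the number of restarts and the random starting points below are arbitrary choices, and such a routine may of course return only a local maximum.

```python
import numpy as np
from scipy.optimize import minimize

Delta = np.array([
    [0, 1, 1.4, 1, 1.4, 1.7, 1.4, 1.7, 2.2],
    [1, 0, 1, 1.4, 1, 1.4, 1.7, 1.4, 1.7],
    [1.4, 1, 0, 1.7, 1.4, 1, 2.2, 1.7, 1.4],
    [1, 1.4, 1.7, 0, 1, 1.4, 1, 1.4, 1.7],
    [1.4, 1, 1.4, 1, 0, 1, 1.4, 1, 1.4],
    [1.7, 1.4, 1, 1.4, 1, 0, 1.7, 1.4, 1],
    [1.4, 1.7, 2.2, 1, 1.4, 1.7, 0, 1, 1.4],
    [1.7, 1.4, 1.7, 1.4, 1, 1.4, 1, 0, 1],
    [2.2, 1.7, 1.4, 1.7, 1.4, 1, 1.4, 1, 0],
])

rng = np.random.default_rng(0)
best = None
for _ in range(50):                                   # multi-start over random points of the simplex
    mu0 = rng.dirichlet(np.ones(9))
    res = minimize(lambda m: -(m @ Delta @ m), mu0,
                   bounds=[(0.0, 1.0)] * 9,
                   constraints=[{'type': 'eq', 'fun': lambda m: m.sum() - 1.0}])
    if best is None or res.fun < best.fun:
        best = res
print(np.round(best.x * 37, 2))                       # compare with mu* = (7,2,7,2,1,2,7,2,7)/37
```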

We set \(\alpha =1\) and \(\kappa =0.25\). Then we obtain from (6.4) that the optimal decision rule in every node is given by the following transition kernel \({\bar{Q}}^*(a|x)\)

$$\begin{aligned} {\bar{Q}}^*:= \left( \begin{array}{ccccccccc} 12c &{} c &{} 0 &{} c &{} 0 &{} 0 &{} 0 &{} 0 &{} 0\\ 2b &{} 3b &{} 2b &{} 0 &{} b &{} 0 &{} 0 &{} 0 &{} 0 \\ 0 &{} c &{} 12c &{} 0 &{} 0 &{} c &{} 0 &{} 0 &{} 0\\ 2b &{} 0 &{} 0 &{} 3b &{} b &{} 0 &{} 2b &{} 0 &{} 0\\ 0 &{} 2b &{} 0 &{} 2b &{} 0 &{} 2b &{} 0 &{} 2b &{} 0\\ 0 &{} 0 &{} 2b &{} 0 &{} b &{} 3b &{} 0 &{} 0 &{} 2b\\ 0 &{} 0 &{} 0 &{} c &{} 0 &{} 0 &{} 12c &{} c &{} 0\\ 0 &{} 0 &{} 0 &{} 0 &{} b &{} 0 &{} 2b &{} 3b &{} 2b\\ 0 &{} 0 &{} 0 &{} 0 &{} 0 &{} c &{} 0 &{} c &{} 12c \end{array}\right) \end{aligned}$$

where \(b=\frac{1}{8}\) and \(c=\frac{1}{14}\). So using this decentralized decision throughout the process yields the maximal average reward. In Fig. 2 we see the evolution of the system when all mass starts initially in node 1. The pictures show the distribution of the mass after 2, 4, 8, 16, 32 and 64 time steps. Note that sampling actions from \({\bar{Q}}^*\) is also \(\varepsilon \)-optimal for the system when we have a finite but large number of individuals and \(\beta \) is close to one for the discounted reward criterion.

Fig. 2  Evolution of the individuals using the optimal randomized decision when all start in node 1, after \(n=2,4,8,16,32\) and 64 time steps (left to right, top to bottom)
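The evolution shown in Fig. 2 can be reproduced by iterating the mean-field dynamics; since \(\alpha =1\) here, \(P^{{\bar{Q}}^*}\) coincides with \({\bar{Q}}^*\). A short sketch which prints the intermediate distributions instead of plotting them:

```python
import numpy as np

b, c = 1 / 8, 1 / 14
P = np.array([                       # Qbar* from above; for alpha = 1 it equals P^{Qbar*}
    [12*c, c, 0, c, 0, 0, 0, 0, 0],
    [2*b, 3*b, 2*b, 0, b, 0, 0, 0, 0],
    [0, c, 12*c, 0, 0, c, 0, 0, 0],
    [2*b, 0, 0, 3*b, b, 0, 2*b, 0, 0],
    [0, 2*b, 0, 2*b, 0, 2*b, 0, 2*b, 0],
    [0, 0, 2*b, 0, b, 3*b, 0, 0, 2*b],
    [0, 0, 0, c, 0, 0, 12*c, c, 0],
    [0, 0, 0, 0, b, 0, 2*b, 3*b, 2*b],
    [0, 0, 0, 0, 0, c, 0, c, 12*c],
])

mu = np.eye(9)[0]                    # all mass initially in node 1 (index 0)
for n in range(1, 65):
    mu = mu @ P                      # mean-field dynamics mu_{n+1} = mu_n P^{Qbar*}
    if n in (2, 4, 8, 16, 32, 64):
        print(n, np.round(mu, 3))
print(np.round(37 * mu, 2))          # should be close to (7, 2, 7, 2, 1, 2, 7, 2, 7)
```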

6.1.2 With Common Noise

Next we suppose that \(\alpha \) depends on the common noise \(Z^0\). In this case the maximal average reward which can be achieved is less than or equal to that of the case without common noise, since the sequence of distributions is now stochastic and may deviate from the optimal one. We simplify matters by assuming that \(|D(x)| = \gamma \) is independent of x. From equation (6.5) of the previous subsection we know that we can write

$$\begin{aligned} p^{{\bar{Q}}}_{xx'} = {\bar{Q}}(x'|x) \frac{\alpha (Z^0)\gamma -1}{\gamma -1}+\frac{1-\alpha (Z^0)}{\gamma -1}. \end{aligned}$$

In matrix notation

$$\begin{aligned} P^{{\bar{Q}}} = \frac{1}{\gamma -1} (1-\alpha (Z^0)) U + \frac{1}{\gamma -1} (\alpha (Z^0)\gamma -1){\bar{Q}} \end{aligned}$$

where U is the \(d\times d\) matrix containing only ones and \({\bar{Q}}=({\bar{Q}}(x'|x)).\) Here the situation is more complicated; in particular, the next empirical distribution of individuals is stochastic and given by

$$\begin{aligned} \mu _{n+1}= \frac{1}{\gamma -1} (1-\alpha (Z^0)) e +\frac{1}{\gamma -1} (\alpha (Z^0)\gamma -1)\mu _n {\bar{Q}}_n \end{aligned}$$

with \(e=(1,\ldots ,1)\in {\mathbb {R}}^d\). Plugging this into the reward function yields

$$\begin{aligned}{} & {} {(\gamma -1)^2}{\mathbb {E}}\Big [ \mu _{n+1} \Delta \mu _{n+1}^\top \Big ] = {\mathbb {E}}[(1-\alpha (Z^0))^2] e \Delta e^\top \nonumber \\{} & {} \quad + 2 {\mathbb {E}}[(1-\alpha (Z^0)) ( \alpha (Z^0)\gamma -1)] (e \Delta {\bar{Q}}_n^\top \mu _n^\top ) \nonumber \\{} & {} \quad + {\mathbb {E}}[(\alpha (Z^0)\gamma -1)^2] \mu _n {\bar{Q}}_n \Delta {\bar{Q}}_n^\top \mu _n^\top . \end{aligned}$$
(6.6)

Now consider the problem

$$\begin{aligned} \left\{ \begin{array}{l} 2 {\mathbb {E}}[(1-\alpha (Z^0)) ( \alpha (Z^0)\gamma -1)] (e \Delta \nu ^\top ) + {\mathbb {E}}[(\alpha (Z^0)\gamma -1)^2] \nu \Delta \nu ^\top \rightarrow \max \\ \nu \in {\mathbb {P}}(S) \end{array} \right. \qquad \end{aligned}$$
(6.7)

Obviously this problem has an optimal solution \(\nu ^*\) since we maximize a continuous function over a compact set. Note that \(\nu \) corresponds to \(\mu _n {\bar{Q}}_n\) in (6.6). If it is possible to choose, for every \(\mu \in {\mathbb {P}}(S)\), a matrix \({\bar{Q}}\) such that \(\mu {\bar{Q}}= \nu ^*\), then this yields the optimal strategy, since we obtain the maximal expected reward in each step. This is for example possible if the graph is complete: we can then simply choose \({\bar{Q}}\) as the matrix whose identical rows all equal \(\nu ^*\).
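The identical-rows construction for a complete graph can be checked in a few lines; \(\nu ^*\) below is only a placeholder for the actual maximizer of (6.7):

```python
import numpy as np

d = 9
rng = np.random.default_rng(1)
nu_star = rng.dirichlet(np.ones(d))       # placeholder for the maximizer of (6.7)
Q_bar = np.tile(nu_star, (d, 1))          # every row of Qbar equals nu*
mu = rng.dirichlet(np.ones(d))            # arbitrary current distribution
assert np.allclose(mu @ Q_bar, nu_star)   # mu Qbar = nu*, independently of mu
```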

6.2 Positioning on a Market Place

Suppose we have a rectangular market place as in Fig. 3. The state \(\mu \) represents the distribution of individuals over the market place. Point A is an ice cream vendor. The aim of the individuals is to keep their distance from the others while being as close as possible to the ice cream vendor. Thus, \(S\subset {\mathbb {R}}^2\) is the rectangle BCED and the one-stage reward is

$$\begin{aligned} {\tilde{r}}(\mu )= \int \int d(x,y)\mu (dx)\mu (dy) -\int d(x,A) \mu (dx). \end{aligned}$$

In what follows, in order to simplify the computation, we choose \(d(x,y)=\Vert x-y\Vert ^2\) for \(x,y\in S\). We want to solve (5.4) in this case. Let us formulate the problem with the help of random variables. Let \(X=(X_1,X_2), Y=(Y_1,Y_2)\) be independent random variables with distribution \(\mu \). Then \({\tilde{r}}(\mu )\) equals

$$\begin{aligned} \sum _{i=1}^2 {\mathbb {E}}(X_i-Y_i)^2 - {\mathbb {E}}(X_i-A_i)^2. \end{aligned}$$

Thus, we can treat the margins separately; the dependence between them does not affect the reward. Now, since X and Y have the same distribution, we can write

$$\begin{aligned} {\mathbb {E}}(X_i-Y_i)^2 - {\mathbb {E}}(X_i-A_i)^2= & {} {\mathbb {E}}X_i^2 + 2 {\mathbb {E}}X_i (A_i-{\mathbb {E}}X_i) - A_i^2. \end{aligned}$$

Suppose we fix \({\mathbb {E}}X_i\) for a moment. Since \(x\mapsto x^2\) is convex, the expression is maximized by the distribution which is maximal in convex order among all distributions with this fixed expectation. Due to convexity, this distribution is concentrated on the endpoints of the interval. Thus we can restrict attention to random variables \(X_1\) with mass \(p\in [0,1]\) on \(B_1\) and \(1-p\) on \(C_1\), i.e. we maximize

$$\begin{aligned} B_1^2 p+C_1^2 (1-p)+2(B_1 p+C_1 (1-p)) (A_1-B_1p-C_1 (1-p)) \end{aligned}$$

over \(p\in [0,1]\).
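As a quick numerical sanity check one may maximize this expression on a grid; for illustration we use the coordinates \(B_1=0\), \(C_1=4\), \(A_1=2.5\) of the numerical example further below, and the grid resolution is an arbitrary choice.

```python
import numpy as np

B1, C1, A1 = 0.0, 4.0, 2.5                      # first coordinates from the example below
p = np.linspace(0.0, 1.0, 100001)
m = B1 * p + C1 * (1 - p)                       # E X_1 as a function of p
value = B1**2 * p + C1**2 * (1 - p) + 2 * m * (A1 - m)
print(p[value.argmax()])                        # approx. 0.4375 = 1/4 + (C1 - A1) / (2 * (C1 - B1))
```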

Fig. 3  Market place with ice cream vendor (left). Optimal distribution in example (right)

The solution is given by \(p= \frac{1}{4} + \frac{C_1-A_1}{2(C_1-B_1)}\). Since the joint distribution does not matter we can choose independent margins and obtain

$$\begin{aligned}{} & {} \mu ^* = \delta _B \Big ( \frac{1}{4} + \frac{C_1-A_1}{2(C_1-B_1)}\Big )\Big ( \frac{1}{4} + \frac{D_2-A_2}{2(D_2-B_2)}\Big ) \\{} & {} \qquad + \delta _C \Big ( \frac{3}{4} - \frac{C_1-A_1}{2(C_1-B_1)}\Big )\Big ( \frac{1}{4} + \frac{D_2-A_2}{2(D_2-B_2)}\Big )\\{} & {} \qquad + \delta _D \Big ( \frac{1}{4} + \frac{C_1-A_1}{2(C_1-B_1)}\Big )\Big ( \frac{3}{4} - \frac{D_2-A_2}{2(D_2-B_2)}\Big )\\{} & {} \qquad +\delta _E \Big ( \frac{3}{4} - \frac{C_1-A_1}{2(C_1-B_1)}\Big )\Big ( \frac{3}{4} - \frac{D_2-A_2}{2(D_2-B_2)}\Big ). \end{aligned}$$

This is the target distribution which should be attained. For a numerical example we choose B(0, 0), C(4, 0), D(0, 3), E(4, 3) and A(2.5, 2). In this case we obtain

$$\begin{aligned} \mu ^*= \delta _B \frac{35}{192}+\delta _C \frac{45}{192}+\delta _D \frac{49}{192}+\delta _E \frac{63}{192}. \end{aligned}$$

The distribution is illustrated in Fig. 3 (right).
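The weights of \(\mu ^*\) in this numerical example can be reproduced with exact rational arithmetic:

```python
from fractions import Fraction as F

# Coordinates of the example: B(0,0), C(4,0), D(0,3), E(4,3), A(2.5, 2).
B1, C1, A1 = F(0), F(4), F(5, 2)
B2, D2, A2 = F(0), F(3), F(2)

p1 = F(1, 4) + (C1 - A1) / (2 * (C1 - B1))      # mass of the first margin on B1
p2 = F(1, 4) + (D2 - A2) / (2 * (D2 - B2))      # mass of the second margin on B2

weights = {'B': p1 * p2, 'C': (1 - p1) * p2, 'D': p1 * (1 - p2), 'E': (1 - p1) * (1 - p2)}
print(weights)                                   # 35/192, 45/192, 49/192, 63/192
```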

Depending on the precise form of the transition law, if one is able to choose \({\bar{Q}}^*\) such that \(\mu ^*\) is the stationary distribution of \(P^{{\bar{Q}}^*}\), the problem is solved. Of course the optimal distribution \(\mu ^*\) depends on the choice of the distance d. Varying the metric for the distance leads to interesting optimization problems.

7 Conclusion

We have seen that the average reward mean-field problem can in some cases be solved rather easily by computing an optimal measure from a static optimization problem. The policy which is obtained in this way is \(\varepsilon \)-optimal for the \(\beta \)-discounted N-individuals problem where N is large and \(\beta \) close to one. The static optimization problem for measures gives rise to some interesting mathematical questions.