General limit value in Dynamic Programming

We consider a dynamic programming problem with arbitrary state space and bounded rewards. Is it possible to define in a unique way a limit value for the problem, when the "patience" of the decision-maker tends to infinity? We consider, for each evaluation $\theta$ (a probability distribution over positive integers), the value function $v_{\theta}$ of the problem where the weight of any stage $t$ is given by $\theta_t$, and we investigate the uniform convergence of a sequence $(v_{\theta^k})_k$ when the "impatience" of the evaluations vanishes, in the sense that $\sum_{t} |\theta^k_{t}-\theta^k_{t+1}| \rightarrow_{k \to \infty} 0$. We prove that this uniform convergence happens if and only if the metric space $\{v_{\theta^k}, k\geq 1\}$ is totally bounded. Moreover, there exists a particular function $v^*$, independent of the chosen sequence $(\theta^k)_k$, such that any limit point of such a sequence of value functions is precisely $v^*$. Consequently, as far as uniform convergence of the value functions is concerned, $v^*$ may be considered as the unique possible limit when the patience of the decision-maker tends to infinity. The result applies in particular to discounted payoffs when the discount factor vanishes, as well as to average payoffs when the number of stages goes to infinity, and also to models with stochastic transitions. We present tractable corollaries, and we discuss counterexamples and a conjecture.


Introduction
We consider a dynamic programming problem with arbitrary state space $Z$ and bounded rewards. Is it possible to define in a unique way a limit value for the problem, when the "patience" of the decision-maker tends to infinity?
For each evaluation (probability distribution over positive integers) $\theta = (\theta_t)_{t\geq 1}$, we consider the value function $v_\theta$ of the problem where the initial state is arbitrary in $Z$ and the weight of any stage $t$ is given by $\theta_t$. The total variation of $\theta$, which we also call the impatience of $\theta$, is defined by $TV(\theta) = \sum_{t=1}^{\infty} |\theta_{t+1} - \theta_t|$. For instance, for each positive integer $n$ the evaluation $\theta = (1/n, ..., 1/n, 0, ..., 0, ...)$ induces the value function $\bar{v}_n$ corresponding to the maximization of the mean payoff over the first $n$ stages; and for any $\lambda$ in $(0,1]$ the evaluation $\theta = (\lambda(1-\lambda)^{t-1})_{t\geq 1}$ induces the discounted value function $v_\lambda$.
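As a quick numerical illustration (ours, not part of the original paper), the following minimal Python sketch computes the total variation of finitely truncated evaluations and recovers the two values just mentioned; the helper names `tv`, `cesaro` and `discounted` are our own.

```python
# Minimal sketch (ours): total variation of an evaluation, represented
# as a finite list of weights (all remaining weights are zero).

def tv(theta):
    # TV(theta) = sum_t |theta_{t+1} - theta_t|, with theta padded by a final 0
    padded = list(theta) + [0.0]
    return sum(abs(padded[t + 1] - padded[t]) for t in range(len(theta)))

def cesaro(n):
    return [1.0 / n] * n                  # (1/n, ..., 1/n, 0, 0, ...)

def discounted(lam, horizon=10_000):      # truncation of (lam*(1-lam)^(t-1))_t
    return [lam * (1 - lam) ** t for t in range(horizon)]

print(tv(cesaro(100)))                    # 1/n = 0.01
print(tv(discounted(0.05)))               # lambda = 0.05
```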
A well-known theorem of Hardy and Littlewood (see e.g. Lippman, 1969) implies that for an uncontrolled problem, the pointwise convergence of $(\bar{v}_n)_n$, when $n$ goes to infinity, and of $(v_\lambda)_\lambda$, when $\lambda$ goes to 0, are equivalent, and that in case of convergence both limits are the same. However, Lehrer and Sorin (1992) provided an example of a dynamic programming problem where $(\bar{v}_n)_n$ and $(v_\lambda)_\lambda$ have different pointwise limits. But they also proved that the uniform convergence of $(\bar{v}_n)_n$ and of $(v_\lambda)_\lambda$ are equivalent, with equality of the limits in case of convergence. Monderer and Sorin (1993) extended this result to families of evaluations satisfying some conditions. Mertens and Neyman (1982) proved that when the family $(v_\lambda)_\lambda$ not only converges uniformly but has bounded variation, then the dynamic programming problem has a uniform value, in the sense that for every initial state $z$ and every $\varepsilon > 0$, there exists a play whose mean payoff from stage 1 to stage $T$ is at least $\lim_\lambda v_\lambda(z) - \varepsilon$ provided $T$ is large enough (see also Lehrer and Monderer (1994) and Monderer and Sorin (1993) for proofs that the uniform convergence of $(v_\lambda)_\lambda$ or $(\bar{v}_n)_n$ does not imply the existence of the uniform value of the problem). In this case of existence of a uniform value, one can show that all value functions $v_\theta$ are close to $v^*$ whenever $\theta$ is a non-increasing evaluation with small $\theta_1$. The reason is that whenever $\theta$ is non-increasing, the $\theta$-payoff of a play can be expressed as a convex combination of its Cesàro (average) payoffs.
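To make this last step explicit (our expansion; it is the standard Abel summation argument, not spelled out in the original text):

```latex
% Abel summation: for a non-increasing evaluation \theta and a play s = (z_t),
% with Cesàro payoffs \bar\gamma_n(s) = (1/n) \sum_{t=1}^{n} r(z_t),
\gamma_\theta(s) = \sum_{t \geq 1} \theta_t \, r(z_t)
                 = \sum_{n \geq 1} n \,(\theta_n - \theta_{n+1})\, \bar\gamma_n(s),
\qquad \text{where } \sum_{n \geq 1} n\,(\theta_n - \theta_{n+1})
                 = \sum_{t \geq 1} \theta_t = 1 .
% All coefficients n(\theta_n - \theta_{n+1}) are non-negative, so \gamma_\theta(s)
% is a convex combination of the \bar\gamma_n(s), and in particular
% v_\theta \leq \sum_{n \geq 1} n (\theta_n - \theta_{n+1}) \, \bar v_n .
```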
In the present paper, we investigate the uniform convergence of sequences $(v_{\theta^k})_k$ when the "impatience" of the evaluations vanishes, in the sense that $\sum_t |\theta^k_t - \theta^k_{t+1}| \rightarrow_{k\to\infty} 0$. We will prove in theorem 2.5 that this uniform convergence happens if and only if the metric space $\{v_{\theta^k}, k \geq 1\}$ (with the distance between two functions given by the supremum of their difference) is totally bounded. Moreover, the uniform limit, whenever it exists, can only be the following function, which is independent of the chosen sequence $(\theta^k)_k$: $$v^*(z) = \inf_{\theta \in \Theta}\, \sup_{m \geq 0}\, v_{m,\theta}(z),$$ where $\Theta$ is the set of evaluations and, for each evaluation $\theta = (\theta_t)_{t\geq1}$ and each $m \geq 0$, $v_{m,\theta}$ denotes the value function for the evaluation with weight 0 for the first $m$ stages and weight $\theta_{t-m}$ for stages $t > m$. Consequently, as far as uniform convergence of the value functions when the patience of the decision-maker tends to infinity is concerned, $v^*$ can be considered as the unique possible limit value. We also give simple conditions on the state space, the payoffs and the transitions (mainly compactness, continuity and non-expansiveness) implying the uniform convergence of such value functions.
The paper is organized as follows: section 2 contains the model and the main results, which are shown to extend to the case of stochastic transitions. Section 3 contains a few examples and counterexamples, and section 4 contains the proof of theorem 2.5. In the last section we formulate the following conjecture, which is shown to be true for uncontrolled problems: does the uniform convergence of $(\bar{v}_n)_n$, or equivalently of $(v_\lambda)_\lambda$, imply the general convergence of the value functions, in the sense that $(v_{\theta^k})_k$ uniformly converges for every sequence of evaluations $(\theta^k)_{k\geq1}$ such that $TV(\theta^k) \rightarrow_{k\to\infty} 0$?

Model and results

General values in dynamic programming problems
We consider a dynamic programming problem given by a non-empty set of states $Z$, a correspondence $F$ with non-empty values from $Z$ to $Z$, and a mapping $r$ from $Z$ to $[0,1]$. $Z$ is called the set of states, $F$ is the transition correspondence and $r$ is the reward (or payoff) function. An initial state $z_0$ in $Z$ defines the following dynamic programming problem: a decision-maker, also called player, first has to select a new state $z_1$ in $F(z_0)$, and is rewarded by $r(z_1)$. Then he has to choose $z_2$ in $F(z_1)$, receives the payoff $r(z_2)$, etc. The decision-maker is interested in maximizing his "long-term" payoffs, in a sense to be made precise. From now on we fix $\Gamma = (Z, F, r)$, and for every state $z_0$ we denote by $\Gamma(z_0) = (Z, F, r, z_0)$ the corresponding problem with initial state $z_0$. For $z_0$ in $Z$, a play at $z_0$ is a sequence $s = (z_1, ..., z_t, ...) \in Z^{\infty}$ such that $z_t \in F(z_{t-1})$ for all $t \geq 1$. We denote by $S(z_0)$ the set of plays at $z_0$, and by $S = \cup_{z_0 \in Z}\, S(z_0)$ the set of all plays. The set of bounded functions from $Z$ to $\mathbb{R}$ is denoted by $V$, and for $v$ and $v'$ in $V$ we use the distance $d_\infty(v, v') = \sup_{z \in Z} |v(z) - v'(z)|$.

Cesàro values. For $n \geq 1$ and $s = (z_t)_{t\geq1} \in S$, the average payoff of the play $s$ up to stage $n$ is defined by $\bar\gamma_n(s) = \frac{1}{n} \sum_{t=1}^{n} r(z_t)$, and the $n$-stage average value of $\Gamma(z_0)$ is $\bar{v}_n(z_0) = \sup_{s \in S(z_0)} \bar\gamma_n(s)$. By the Bellman-Shapley recursive formula, for all $n$ and $z$ we have $$n\, \bar{v}_n(z) = \sup_{z' \in F(z)} \left( r(z') + (n-1)\, \bar{v}_{n-1}(z') \right).$$ We also have $|\bar{v}_n(z) - \sup_{z' \in F(z)} \bar{v}_n(z')| \leq \frac{2}{n}$, and a pointwise limit $v$ of $(\bar{v}_n)_n$ should satisfy $v(z) = \sup_{z' \in F(z)} v(z')$ for all $z$.
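As an illustration (our sketch, not from the paper), the recursion can be run directly on a finite problem; the dictionary representation of $F$ and $r$ below is an assumption of the sketch.

```python
# Minimal sketch (ours): n-stage Cesàro values via the Bellman-Shapley
# recursion  n*v_n(z) = sup_{z' in F(z)} ( r(z') + (n-1)*v_{n-1}(z') ),
# on a finite problem with F given as a dict of successor lists.

def cesaro_values(F, r, n):
    v = {z: 0.0 for z in F}                  # v_0 = 0 by convention
    for k in range(1, n + 1):
        v = {z: max((r[s] + (k - 1) * v[s]) / k for s in F[z]) for z in F}
    return v                                  # this is v_n

# Two-state problem of example 3.1 below: play alternates forever.
F = {"z0": ["z1"], "z1": ["z0"]}
r = {"z0": 0.0, "z1": 1.0}
print(cesaro_values(F, r, 10))   # {'z0': 0.5, 'z1': 0.5} for even n
```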
Discounted values. Given $\lambda \in (0,1]$, the $\lambda$-discounted payoff of a play $s = (z_t)_{t\geq1}$ is $\gamma_\lambda(s) = \sum_{t\geq1} \lambda(1-\lambda)^{t-1}\, r(z_t)$, and the $\lambda$-discounted value at the initial state $z_0$ is $v_\lambda(z_0) = \sup_{s\in S(z_0)} \gamma_\lambda(s)$. It is easily proved that $v_\lambda$ is the unique mapping in $V$ satisfying the fixed-point equation $$v_\lambda(z) = \sup_{z'\in F(z)} \left( \lambda\, r(z') + (1-\lambda)\, v_\lambda(z') \right).$$

General values. We denote by $\Theta$ the set of probability distributions over positive integers. An element $\theta = (\theta_t)_{t\geq1}$ in $\Theta$ is called an evaluation.
For each stage $t$ we denote by $\delta_t$ the Dirac mass on stage $t$, and by $\bar{n}$ the Cesàro evaluation $(1/n, ..., 1/n, 0, ..., 0, ...) = \frac{1}{n}\sum_{t=1}^{n} \delta_t$, so that the notation $v_\theta$ for $\theta = \bar{n}$ coincides with the Cesàro value $v_{\bar{n}}$, also written $\bar{v}_n$. It is easy to see that for each evaluation $\theta$ with $\theta_1 < 1$, the Bellman recursive formula can be written as follows: $$v_\theta(z) = \sup_{z'\in F(z)} \left( \theta_1\, r(z') + (1-\theta_1)\, v_{\hat\theta}(z') \right), \quad \text{where } \hat\theta = \left( \frac{\theta_{t+1}}{1-\theta_1} \right)_{t\geq1}.$$

Definition 2.3. The total variation of an evaluation $\theta = (\theta_t)_{t\geq1}$ is $$TV(\theta) = \sum_{t=1}^{\infty} |\theta_{t+1} - \theta_t|.$$

We have $\sup_t \theta_t \leq TV(\theta) \leq 2$. In the case of a Cesàro evaluation $\theta = (1/n, ..., 1/n, 0, 0, ...)$, we have $TV(\theta) = 1/n$. For a discounted evaluation $\theta = (\lambda(1-\lambda)^{t-1})_{t\geq1}$, we have $TV(\theta) = \lambda$. A small $TV(\theta)$ corresponds to a patient evaluation, and sometimes we will refer to $TV(\theta)$ as the impatience of $\theta$. We will consider here limits when $TV(\theta)$ goes to zero, generalizing the cases where $n \to \infty$ or $\lambda \to 0$. Notice that if an evaluation $\theta$ is non-increasing, i.e. satisfies $\theta_{t+1} \leq \theta_t$ for all $t$, we have $TV(\theta) = \theta_1$; in particular, for a sequence of non-increasing evaluations $(\theta^k)_k$, $TV(\theta^k)$ vanishes if and only if $\sup_t \theta^k_t$ does. Writing $\theta^+ = (\theta_{t+1})_{t\geq1}$ for the shifted evaluation, we always have $\sum_{t\geq1} |\theta^+_t - \theta_t| = TV(\theta)$: if $TV(\theta)$ is small, the $L^1$-distance between $\theta$ and the shifted evaluation $\theta^+$ is also small. Notice also the following inequalities: for any given $T$, denote by $\bar\theta(T)$ the arithmetic mean of $\theta_1, ..., \theta_T$. We have for all $t = 1, ..., T$: $$|\theta_t - \bar\theta(T)| \leq \sum_{s=1}^{T-1} |\theta_{s+1} - \theta_s| \leq TV(\theta).$$ So if $TV(\theta)$ is small, then for all $T$ and $t \leq T$, the weight $\theta_t$ is close to the average $\bar\theta(T)$.
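For intuition (again our own sketch, under the same finite representation as above): when $\theta$ has finite support, $v_\theta$ can be computed by a finite backward induction that weights stage $t$ by $\theta_t$; on a Cesàro evaluation this reproduces $\bar{v}_n$.

```python
# Minimal sketch (ours): value v_theta for a finitely supported evaluation,
# by backward induction  V_t(z) = max_{z' in F(z)} ( theta_t * r(z') + V_{t+1}(z') ),
# with v_theta = V_1.

def general_value(F, r, theta):
    V = {z: 0.0 for z in F}                  # V_{T+1} = 0 beyond the support
    for w in reversed(theta):                # stages T, T-1, ..., 1
        V = {z: max(w * r[s] + V[s] for s in F[z]) for z in F}
    return V

F = {"z0": ["z1"], "z1": ["z0"]}
r = {"z0": 0.0, "z1": 1.0}
n = 10
print(general_value(F, r, [1.0 / n] * n))    # matches the Cesàro value v_n
```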
Given an evaluation $\theta$ and $m \geq 0$, we write $v_{m,\theta}$ for the value function associated to the evaluation $\theta' = \sum_{t=1}^{\infty} \theta_t\, \delta_{m+t}$. The following function will play a very important role in the sequel: $$v^*(z) = \inf_{\theta\in\Theta}\, \sup_{m\geq0}\, v_{m,\theta}(z), \quad \text{for all } z \in Z.$$
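As a sanity check (our worked computation, anticipating example 3.1 below), consider two states $z_0, z_1$ with $F(z_0) = \{z_1\}$, $F(z_1) = \{z_0\}$, $r(z_0) = 0$ and $r(z_1) = 1$:

```latex
% Worked computation (ours). From z_0 the forced play visits z_1 at odd stages,
% from z_1 at even stages, so v_\theta(z_0) + v_\theta(z_1) = \sum_t \theta_t = 1.
% Since each state reaches the other in one stage,
\sup_{m \geq 0} v_{m,\theta}(z) = \max\left( v_\theta(z_0),\, v_\theta(z_1) \right)
   \geq \tfrac12 \quad \text{for } z \in \{z_0, z_1\},
% with equality for the Cesàro evaluations \bar{n} with n even. Hence
v^*(z_0) = v^*(z_1) = \inf_{\theta \in \Theta}\, \sup_{m \geq 0} v_{m,\theta}(z) = \tfrac12 .
```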

Main results
We now state the main result of this paper. Recall that a metric space is totally bounded, or precompact, if for all ε > 0 it can be covered by finitely many balls with radius ε.
Theorem 2.5. Let $(\theta^k)_{k\geq1}$ be a sequence of evaluations with vanishing total variation, i.e. such that $TV(\theta^k) \rightarrow_{k\to\infty} 0$. We have for all $z$ in $Z$: $$v^*(z) = \inf_{\theta\in\Theta}\, \sup_{m\geq0}\, v_{m,\theta}(z) = \inf_{k\geq1}\, \sup_{m\geq0}\, v_{m,\theta^k}(z) \geq \limsup_{k\to\infty} v_{\theta^k}(z).$$ Moreover, the sequence $(v_{\theta^k})_k$ uniformly converges if and only if the metric space $(\{v_{\theta^k}, k \geq 1\}, d_\infty)$ is totally bounded. And in case of convergence, the limit value is $v^*$.

This theorem generalizes theorem 3.10 in Renault (2011), which only dealt with Cesàro evaluations. In particular, there is a unique possible limit point for all sequences $(v_{\theta^k})_k$ such that $TV(\theta^k) \rightarrow_{k\to\infty} 0$, and consequently any (uniform) limit of such a sequence is $v^*$. Notice that this is not true if we replace uniform convergence by pointwise convergence: even for uncontrolled problems, it may happen that several limit points are possible. As an immediate corollary of theorem 2.5, when $Z$ is finite the sequence $(v_{\theta^k})_k$ is bounded and has a unique possible limit point, so it converges to $v^*$.
Corollary 2.6. Assume that $Z$ is endowed with a distance $d$ such that: a) $(Z, d)$ is a precompact metric space, and b) the family $(v_\theta)_{\theta\in\Theta}$ is uniformly equicontinuous. Then there is general uniform convergence of the value functions to $v^*$, i.e. $$\forall \varepsilon > 0, \ \exists \alpha > 0, \ \forall \theta \in \Theta: \quad TV(\theta) \leq \alpha \implies d_\infty(v_\theta, v^*) \leq \varepsilon.$$
The proof of corollary 2.6 from theorem 2.5 follows from 1) Ascoli's theorem, and 2) the fact that the convergence of $(v_{\theta^k})_k$ to $v^*$ for each sequence of evaluations such that $TV(\theta^k) \rightarrow_{k\to\infty} 0$ is enough to obtain the general uniform convergence of the value functions to $v^*$: if the latter failed, one could select a sequence of evaluations $(\theta^k)_k$ with $TV(\theta^k) \leq 1/k$ and $d_\infty(v_{\theta^k}, v^*) \geq \varepsilon$ for some fixed $\varepsilon > 0$, a contradiction.
Corollary 2.7. Assume that $Z$ is endowed with a distance $d$ such that: a) $(Z, d)$ is a precompact metric space, b) the reward function $r$ is uniformly continuous, and c) the transition $F$ is non-expansive, i.e. for all $z$, $z'$ in $Z$ and every $z_1 \in F(z)$, there exists $z'_1 \in F(z')$ with $d(z_1, z'_1) \leq d(z, z')$. Then we have the same conclusions as in corollary 2.6: there is general uniform convergence of the value functions to $v^*$.
Proof of corollary 2.7. One can proceed as in the proof of corollary 3.9 in Renault (2011). By non-expansiveness, given two states $z$ and $z'$, one can construct inductively from each play $s = (z_t)_{t\geq1}$ at $z$ a play $s' = (z'_t)_{t\geq1}$ at $z'$ such that $d(z_t, z'_t) \leq d(z, z')$ for all $t$. Regarding payoffs, we introduce the modulus of continuity $\hat\varepsilon$ of $r$, defined by $\hat\varepsilon(\alpha) = \sup\{|r(z) - r(z')| : d(z, z') \leq \alpha\}$ for each $\alpha \geq 0$; by uniform continuity of $r$, $\hat\varepsilon$ is continuous at 0. Using the previous construction, we obtain that for $z$ and $z'$ in $Z$ and all $k \geq 1$, $|v_{\theta^k}(z) - v_{\theta^k}(z')| \leq \hat\varepsilon(d(z, z'))$. In particular, the family $(v_{\theta^k})_{k\geq1}$ is uniformly equicontinuous, and corollary 2.6 gives the result.
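Spelling out the payoff comparison used in this proof (our expansion): for the coupled plays $s$ and $s'$ and any evaluation $\theta$,

```latex
% Our expansion of the step above: \hat\varepsilon is non-decreasing, and
% d(z_t, z'_t) \le d(z, z') for all t, so
\gamma_\theta(s) - \gamma_\theta(s')
  = \sum_{t\geq1} \theta_t \left( r(z_t) - r(z'_t) \right)
  \leq \sum_{t\geq1} \theta_t \, \hat\varepsilon\big( d(z_t, z'_t) \big)
  \leq \hat\varepsilon\big( d(z, z') \big).
% Taking the supremum over plays s at z, and exchanging the roles of z and z',
% yields | v_\theta(z) - v_\theta(z') | \leq \hat\varepsilon( d(z, z') ).
```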
A completely different proof of corollary 2.7, with another expression for the limit value $v^*$, can be found in theorem 3.9 of Renault and Venel (2012).

Extension to stochastic transitions
We generalize here theorem 2.5 to the case of stochastic transitions. We will only consider transitions with finite support: given a set $X$, we denote by $\Delta_f(X)$ the set of probabilities with finite support over $X$. We consider now stochastic dynamic programming problems of the following form. There is an arbitrary non-empty set of states $X$, a transition given by a multi-valued mapping $F : X \rightrightarrows \Delta_f(X)$ with non-empty values, and a payoff (or reward) function $r : X \to [0,1]$. The interpretation is that, given an initial state $x_0$ in $X$, a decision-maker has to choose a probability with finite support $u_1$ in $F(x_0)$; then $x_1$ is selected according to $u_1$ and there is a payoff $r(x_1)$. Then the player has to select $u_2$ in $F(x_1)$, $x_2$ is selected according to $u_2$, the player receives the payoff $r(x_2)$, etc. Following Maitra and Sudderth (1996), we say that $\Gamma = (X, F, r)$ is a Gambling House. We identify an element $x$ of $X$ with its Dirac measure $\delta_x$ in $\Delta(X)$, we write $Z = \Delta_f(X)$, and an element of $Z$ is written $u = \sum_{x\in X} u(x)\, \delta_x$. In case the values of $F$ only consist of Dirac measures on $X$, we are back to the previous case of a dynamic programming problem.
We linearly extend $r$ and $F$ to $\Delta_f(X)$ by defining, for each $u$ in $Z$, the payoff $r(u) = \sum_{x\in X} r(x)\, u(x)$ and the transition $F(u) = \{ \sum_{x\in X} u(x)\, v_x \; : \; v_x \in F(x) \text{ for each } x \text{ in the support of } u \}$. A play at $x_0$ is a sequence $\sigma = (u_1, ..., u_t, ...) \in Z^{\infty}$ such that $u_1 \in F(x_0)$ and $u_{t+1} \in F(u_t)$ for each $t \geq 1$, and we denote by $\Sigma(x_0)$ the set of plays at $x_0$. Given an evaluation $\theta$, the $\theta$-payoff of a play $\sigma = (u_1, ..., u_t, ...)$ is defined as $\gamma_\theta(\sigma) = \sum_{t\geq1} \theta_t\, r(u_t)$, and the $\theta$-value at $x_0$ is $v_\theta(x_0) = \sup_{\sigma\in\Sigma(x_0)} \gamma_\theta(\sigma)$. $v_\theta$ is by definition a mapping from $X$ to $[0,1]$, and we define as before, for all $x$ in $X$: $v^*(x) = \inf_{\theta\in\Theta} \sup_{m\geq0} v_{m,\theta}(x)$.
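A small sketch (ours) of this linear extension, representing a finitely supported probability as a dict from states to weights; the helper names `extend_r` and `extend_F` are our own, and the dict encoding of $F$ is an assumption of the sketch.

```python
# Minimal sketch (ours): linear extension of r and F to finitely supported
# probabilities, each represented as a dict mapping states to weights.
from itertools import product

def extend_r(r, u):
    # r(u) = sum_x u(x) * r(x)
    return sum(p * r[x] for x, p in u.items())

def extend_F(F, u):
    # F(u) = { sum_x u(x) * v_x : v_x in F(x) }: pick one distribution v_x
    # per state x in the support of u, and mix them with weights u(x).
    states = list(u)
    mixtures = []
    for choice in product(*(F[x] for x in states)):
        mix = {}
        for x, v in zip(states, choice):
            for y, q in v.items():
                mix[y] = mix.get(y, 0.0) + u[x] * q
        mixtures.append(mix)
    return mixtures

F = {"x0": [{"x0": 1.0}, {"x1": 1.0}], "x1": [{"x1": 1.0}]}  # Dirac-valued case
r = {"x0": 0.0, "x1": 1.0}
u = {"x0": 0.5, "x1": 0.5}
print(extend_r(r, u))    # 0.5
print(extend_F(F, u))    # [{'x0': 0.5, 'x1': 0.5}, {'x1': 1.0}]
```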
Theorem 2.5 easily extends to this context.
Theorem 2.8. Let $(\theta^k)_{k\geq1}$ be a sequence of evaluations with vanishing total variation, i.e. such that $TV(\theta^k) \rightarrow_{k\to\infty} 0$. We have for all $x$ in $X$: $$v^*(x) = \inf_{\theta\in\Theta}\, \sup_{m\geq0}\, v_{m,\theta}(x) = \inf_{k\geq1}\, \sup_{m\geq0}\, v_{m,\theta^k}(x) \geq \limsup_{k\to\infty} v_{\theta^k}(x).$$ Moreover, the sequence $(v_{\theta^k})_k$ uniformly converges if and only if the metric space $(\{v_{\theta^k}, k\geq1\}, d_\infty)$ is totally bounded. And in case of convergence, the limit value is $v^*$.
Proof. Consider the deterministic dynamic programming problem $\tilde\Gamma = (Z, F, r)$ given by the linear extensions, and denote by $\tilde v_\theta$ and $\tilde v^*$ the corresponding value functions and candidate limit. Each $\tilde v_\theta$ is affine on $Z$; notice that, as an "inf sup" of affine functions, there is no reason a priori for $\tilde v^*$ to be affine. However, the restriction of $\tilde v^*$ to $X$ is $v^*$. Consider now a sequence $(\theta^k)_{k\geq1}$ of evaluations with vanishing total variation. Applying theorem 2.5 to $\tilde\Gamma$, we first obtain the conclusions of theorem 2.8 at every $x$ in $X$. Moreover, given two evaluations $\theta$ and $\theta'$, we have by affinity (using the same notation $d_\infty$ for the distances on $[0,1]^X$ and on $[0,1]^Z$): $d_\infty(\tilde v_\theta, \tilde v_{\theta'}) = d_\infty(v_\theta, v_{\theta'})$. Hence $(\{\tilde v_{\theta^k}, k\geq1\}, d_\infty)$ is totally bounded if and only if $(\{v_{\theta^k}, k\geq1\}, d_\infty)$ is, and this completes the proof.

Examples
The first very simple example shows that, even when the set of states is finite, it is not possible to obtain the conclusions of theorem 2.5 or corollaries 2.6 and 2.7 for sequences of evaluations satisfying only the weaker convergence condition $\sup_{t\geq1} \theta^k_t \rightarrow_{k\to\infty} 0$.
Example 3.1. Consider the following dynamic programming problem with 2 states: $Z = \{z_0, z_1\}$, $F(z_0) = \{z_1\}$, $F(z_1) = \{z_0\}$, with payoffs $r(z_0) = 0$ and $r(z_1) = 1$. We have a deterministic Markov chain, and any play alternates forever between $z_0$ and $z_1$. Define for each $k$ the evaluations $\theta^k = \frac{1}{k}\sum_{t=1}^{k} \delta_{2t-1}$ and $\theta'^k = \frac{1}{k}\sum_{t=1}^{k} \delta_{2t}$. We have $v_{\theta^k}(z_0) = v_{\theta'^k}(z_1) = 1$ and $v_{\theta^k}(z_1) = v_{\theta'^k}(z_0) = 0$ for all $k$. Define now $\nu^k$ as $\theta^k$ when $k$ is even, and as $\theta'^k$ when $k$ is odd. The evaluation $\nu^k$ satisfies $\sup_t \nu^k_t = \frac{1}{k} \rightarrow_{k\to\infty} 0$, however $(v_{\nu^k}(z_0))_k$ and $(v_{\nu^k}(z_1))_k$ do not converge. Note that this does not contradict theorem 2.5, since $TV(\nu^k) \geq 2 - \frac{1}{k}$ does not vanish.

Lehrer and Sorin (1992) proved that the uniform convergence of the Cesàro values $(\bar{v}_n)_{n\geq1}$ is equivalent to the uniform convergence of the discounted values $(v_\lambda)_{\lambda\in(0,1]}$. The following example shows that this property does not extend to general evaluations: given two sequences of TV-vanishing evaluations $(\theta^k)_{k\geq1}$ and $(\theta'^k)_{k\geq1}$, the uniform convergence of $(v_{\theta^k})_k$ and of $(v_{\theta'^k})_k$ are not equivalent.
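Before turning to the next example, here is a quick numerical check of example 3.1 (our sketch): the play is forced, so each value is simply the total $\theta$-weight of the stages spent in $z_1$.

```python
# Minimal sketch (ours): the evaluations of example 3.1 on the two-state cycle.
# From z0 the state at stage t is z1 iff t is odd; from z1 iff t is even.

def v(theta, start):                 # theta: list of weights theta_1, theta_2, ...
    odd = (start == "z0")
    return sum(w for t, w in enumerate(theta, 1) if (t % 2 == 1) == odd)

def nu(k):                           # nu^k weights odd stages if k is even, else even stages
    theta = [0.0] * (2 * k)
    for t in range(k):
        theta[2 * t + (k % 2)] = 1.0 / k
    return theta

for k in (9, 10, 11, 12):
    print(k, v(nu(k), "z0"))         # oscillates: 0.0, 1.0, 0.0, 1.0
```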
Example 3.2. In this example, $(\bar{v}_n)_n$ converges pointwise to the constant $1/2$, whereas for a particular sequence of evaluations $(\theta^k)_k$ with total variation going to zero, we will have $v_{\theta^k}(z) = 1$ for all $k$ and $z$.
We construct a dynamic programming problem defined via a rooted tree $T$ without terminal nodes (as in Monderer and Sorin (1993) or Lehrer and Monderer (1994)). $T$ has countably many nodes, and the payoff attached to each node is either 0 or 1.
We first construct a tree $T_1$, with countably many nodes and root $z_0$. Each node has outdegree one, except the root, which has countably many potential successors $z_1, z_2, ..., z_n, ...$. On the $n$-th branch starting from $z_n$, each node has a unique successor, and the payoffs starting from $z_n$ are successively 0 for $n$ stages, then 1 for $n$ stages, then 0 until the end of the play. We now define $T$ inductively from $T_1$. $T_2$ is obtained from $T_1$ by attaching the tree $T_1$ to each node of $T_1 \setminus \{z_0\}$: this means that for each node $z$ of $T_1 \setminus \{z_0\}$ we add a copy of the tree $T_1$ where $z$ plays the role of the root of $T_1$. And for each $l$, the tree $T_l$ is obtained by attaching the tree $T_1$ to each node of $T_{l-1} \setminus T_{l-2}$. Finally, $T$ is defined as the union $\cup_{l\geq1} T_l$.
Starting from $z_0$, any sequence of $n$ consecutive payoffs of 1 has to be preceded by $n$ consecutive payoffs of 0, so $\bar{v}_n(z_0) \leq 1/2$ for each $n \geq 1$; and from each node $z$, for each even integer $n$, it is possible to get exactly $n/2$ payoffs of 0 followed by $n/2$ payoffs of 1. Consequently, one can deduce that $(\bar{v}_n(z))_n$ converges to $1/2$ for each state $z$. But $\sup_{z\in Z} \bar{v}_n(z) = 1$ for each $n$, and the convergence is not uniform. Finally, for each $k$ consider the evaluation $\theta^k = \frac{1}{k}\sum_{t=k+1}^{2k} \delta_t$, which satisfies $TV(\theta^k) = 2/k \rightarrow_{k\to\infty} 0$. From any node $z$, entering the copy of $T_1$ attached at $z$ and choosing the branch $z_k$ yields payoffs 0 at stages $1, ..., k$ and 1 at stages $k+1, ..., 2k$, so that $v_{\theta^k}(z) = 1$ for all $k$ and $z$, as announced.
Example 3.3. The condition that $(\{v_\theta, \theta\in\Theta\}, d_\infty)$ be totally bounded is satisfied under the hypotheses of corollary 2.6 or corollary 2.7, and is sufficient to obtain the general uniform convergence of the value functions. This condition turns out to be strictly stronger than having $(\{v_{\theta^k}, k\geq1\}, d_\infty)$ totally bounded for every sequence of evaluations with vanishing total variation, as the following example shows.
In the following example, there is no control, and the state space $Z$ is the set of all integers, with transition given by the shift $F(z) = \{z+1\}$. The payoffs are given by $r(0) = 1$ and $r(z) = 0$ for all $z \neq 0$.
For all evaluations $\theta = (\theta_t)_{t\geq1}$ we have $\sup_{z\in Z} v_\theta(z) = \sup_t \theta_t$: starting from $z = -t$ with $t \geq 1$, the play visits state 0 exactly at stage $t$, so $v_\theta(-t) = \theta_t$, and $v_\theta(z) = 0$ for $z \geq 0$. Since $\sup_t \theta_t \leq TV(\theta)$, we have general uniform convergence of the value functions to $v^* = 0$.
For each positive $t$, we can also consider the evaluation given by the Dirac mass $\delta_t$ on stage $t$. We have $v_{\delta_t}(-t) = 1$ and $v_{\delta_t}(z) = 0$ if $z \neq -t$, so that $d_\infty(v_{\delta_t}, v_{\delta_{t'}}) = 1$ whenever $t \neq t'$: the set $\{v_{\delta_t}, t \geq 1\}$ is not totally bounded. Of course $TV(\delta_t)$ does not vanish as $t \to \infty$, so this does not contradict theorem 2.5.
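A tiny check of this example (our sketch), with evaluations truncated to finitely many stages:

```python
# Minimal sketch (ours): the uncontrolled shift on the integers (example 3.3).
# From z, the state at stage t is z + t, so v_theta(z) = theta_{-z} if z < 0, else 0.

def v(theta, z):                     # theta: finite list (theta_1, theta_2, ...)
    return sum(w for t, w in enumerate(theta, 1) if z + t == 0)

theta = [0.2, 0.5, 0.3]
print([v(theta, z) for z in range(-4, 1)])                    # [0, 0.3, 0.5, 0.2, 0]
print(max(v(theta, z) for z in range(-10, 1)) == max(theta))  # True: sup_z = sup_t theta_t
```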

Proof of theorem 2.5
We start with a few notations and definitions. We define inductively a sequence of correspondences $(F_n)_n$ from $Z$ to $Z$ by $F_0(z) = \{z\}$ for every state $z$, and $F_{n+1} = F_n \circ F$ for all $n \geq 0$ (the composition being defined by $G \circ H(z) = \{z'' \in Z : \exists z' \in H(z),\ z'' \in G(z')\}$). $F_n(z)$ represents the set of states that the decision-maker can reach in $n$ stages from the initial state $z$. We also define, for every state $z$, $G_m(z) = \cup_{n=0}^{m} F_n(z)$ and $G_\infty(z) = \cup_{n=0}^{\infty} F_n(z)$. The set $G_\infty(z)$ is the set of states that the decision-maker, starting from $z$, can reach in a finite number of stages.
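These correspondences are straightforward to compute in the finite case (our sketch, with the same dict representation of $F$ as in the earlier sketches):

```python
# Minimal sketch (ours): reachable-state correspondences for a finite problem,
# with F given as a dict of successor lists.

def F_n(F, z, n):
    # states reachable in exactly n stages from z
    layer = {z}
    for _ in range(n):
        layer = {s for x in layer for s in F[x]}
    return layer

def G(F, z, m):
    # G_m(z) = union of F_n(z) for n = 0..m
    out = set()
    for n in range(m + 1):
        out |= F_n(F, z, n)
    return out

def G_inf(F, z):
    # G_infinity(z): all states reachable in finitely many stages
    out, frontier = {z}, {z}
    while frontier:
        frontier = {s for x in frontier for s in F[x]} - out
        out |= frontier
    return out

F = {"z0": ["z1"], "z1": ["z0"]}
print(G_inf(F, "z0"))   # {'z0', 'z1'}
```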
For all $\theta$ in $\Theta$, $m \geq 0$ and initial state $z$, we clearly have $v_{m,\theta}(z) = \sup_{z'\in F_m(z)} v_\theta(z')$: the first $m$ stages have weight zero, so they are best used to reach a good state of $F_m(z)$. In particular, $\sup_{m\geq0} v_{m,\theta}(z) = \sup_{z'\in G_\infty(z)} v_\theta(z')$. In the sequel, we fix a sequence of evaluations $(\theta^k)_{k\geq1}$ such that $TV(\theta^k) \rightarrow_{k\to\infty} 0$.
Lemma 4.1. For all $m_0 \geq 0$, all $z$ in $Z$ and every evaluation $\theta$: $$v_\theta(z) \geq v_{m_0,\theta}(z) - m_0\, TV(\theta).$$ (Indeed, reaching a state of $F_{m_0}(z)$ and playing on from there costs at most the $L^1$-distance between $\theta$ and its $m_0$-shift, which is at most $m_0\, TV(\theta)$.)

A key result is the following proposition, which is true for all evaluations $\theta$.
Proposition 4.2. For all evaluations $\theta$ in $\Theta$ and every initial state $z$ in $Z$: $$\limsup_{k\to\infty} v_{\theta^k}(z) \leq \sup_{z'\in G_\infty(z)} v_\theta(z').$$

Proof of proposition 4.2. $z$ and $\theta$ being fixed, put $\beta = \sup_{z'\in G_\infty(z)} v_\theta(z')$. Fix $\varepsilon \in (0,1]$; there exists $T_0$ such that $\sum_{t=T_0+1}^{\infty} \theta_t \leq \varepsilon$, and fix $T_1 \geq T_0/\varepsilon$. For any play $s = (z_1, ..., z_t, ...)$ in $S(z)$, we have by definition of $\beta$ that for all $T$, $\sum_{t=T+1}^{\infty} \theta_{t-T}\, r(z_t) \leq \beta$. We now consider $\gamma_{\theta^k}(s)$ for $k$ large, and compute $\sum_{t=1}^{\infty} \theta^k_t\, r(z_t)$ by dividing the stages into consecutive blocks of length $T_1$: for each $m \geq 0$, let $\bar\theta^k(m)$ be the Cesàro average of $\theta^k$ over the $(m+1)$-th block, and recall from the inequalities of section 2 that each weight of $\theta^k$ within this block is within $TV(\theta^k)$ of $\bar\theta^k(m)$. Summing up over $m$, we obtain $\limsup_k v_{\theta^k}(z) \leq \frac{\beta}{1-\varepsilon} + \varepsilon$, and this is true for all $\varepsilon$.

Corollary 4.3. For all $z$ in $Z$: $$\inf_{k\geq1}\, \sup_{m\geq0}\, v_{m,\theta^k}(z) = \inf_{\theta\in\Theta}\, \sup_{m\geq0}\, v_{m,\theta}(z) = v^*(z).$$

Proof. Consider an initial state $z$, and write $\alpha = \inf_k \sup_m v_{m,\theta^k}(z)$. It is clear that $\alpha \geq \inf_{\theta\in\Theta} \sup_{m\geq0} v_{m,\theta}(z)$. Now for each $k \geq 1$ there exists $m(k)$ such that $v_{m(k),\theta^k}(z) \geq \alpha - 1/k$, and we define the evaluation $\theta'^k = \sum_{t=m(k)+1}^{\infty} \theta^k_{t-m(k)}\, \delta_t$, so that $v_{\theta'^k}(z) = v_{m(k),\theta^k}(z)$. We have $TV(\theta'^k) = TV(\theta^k) \rightarrow_{k\to\infty} 0$, so by proposition 4.2 applied to the sequence $(\theta'^k)_k$ we obtain that for all $\theta$ in $\Theta$: $\alpha \leq \limsup_k v_{\theta'^k}(z) \leq \sup_{z'\in G_\infty(z)} v_\theta(z') = \sup_{m\geq0} v_{m,\theta}(z)$. Taking the infimum over $\theta$ gives $\alpha \leq v^*(z)$.

From lemma 4.1 and proposition 4.2, one can easily deduce the following corollary.
Corollary 4.4. For all $m_0 \geq 0$ and $z$ in $Z$,

And we can now conclude the proof of theorem 2.5, proceeding as in the proof of theorem 3.10 in Renault (2011).

End of the proof of theorem 2.5. Define $d(z, z') = \sup_{k\geq1} |v_{\theta^k}(z) - v_{\theta^k}(z')|$ for all states $z$ and $z'$, so that $(Z, d)$ is a pseudometric space (it may not be Hausdorff). Fix $\varepsilon > 0$. By assumption, there exists a finite set of indices $I$ such that for all $k \geq 1$ there exists $i$ in $I$ satisfying $d_\infty(v_{\theta^k}, v_{\theta^i}) \leq \varepsilon$. Consider now the set $\{(v_{\theta^i}(z))_{i\in I},\ z \in Z\}$: it is a subset of the compact metric space $[0,1]^I$ with the uniform distance, so it is itself precompact, and we obtain the existence of a finite subset $C$ of states in $Z$ such that: $$\forall z \in Z,\ \exists c \in C: \quad \sup_{i\in I} |v_{\theta^i}(z) - v_{\theta^i}(c)| \leq \varepsilon.$$ Since every $v_{\theta^k}$ is within $d_\infty$-distance $\varepsilon$ of some $v_{\theta^i}$, we have obtained that for each $\varepsilon > 0$ there exists a finite subset $C$ of $Z$ such that for every $z$ in $Z$ there is $c \in C$ with $d(z, c) \leq 3\varepsilon$. The pseudometric space $(Z, d)$ is therefore precompact; equivalently, any sequence in $Z$ admits a Cauchy subsequence for $d$. Notice that all value functions $v_{\theta^k}$ are clearly 1-Lipschitz for $d$.
Fix $z$ in $Z$, and consider now the sequence of sets $(G_m(z))_{m\geq0}$. For all $m$, $G_m(z) \subset G_{m+1}(z)$, so using the precompactness of $(Z, d)$ it is not difficult to show (see, e.g., step 2 in the proof of theorem 3.7 in Renault, 2011) that $(G_m(z))_{m\geq0}$ converges to $G_\infty(z)$, in the sense that: $$\forall \varepsilon > 0,\ \exists m \geq 0,\ \forall z' \in G_\infty(z),\ \exists c \in G_m(z): \quad d(z', c) \leq \varepsilon. \qquad (2)$$ We now use corollary 4.4 to conclude. Fix finally $\varepsilon > 0$, and consider $k \geq 1$ and $m \geq 0$ given by equation (2). We obtain $\liminf_k v_{\theta^k}(z) \geq \limsup_k v_{\theta^k}(z) - 2\varepsilon$, and so $(v_{\theta^k}(z))_k$ converges. Since $(Z, d)$ is precompact and all $v_{\theta^k}$ are 1-Lipschitz, the convergence is uniform.

An open question
We know since Lehrer and Sorin (1992) that the uniform convergence of the Cesàro values $(\bar{v}_n)_{n\geq1}$ is equivalent to the uniform convergence of the discounted values $(v_\lambda)_{\lambda\in(0,1]}$. Example 3.2 shows that it is possible to have no uniform convergence of the Cesàro values (or, equivalently, of the discounted values) but uniform convergence for a particular sequence of evaluations with vanishing total variation. Could it be the case that the Cesàro and discounted values have the following "universal" property? Assuming uniform convergence of the Cesàro values, do we have general uniform convergence of the value functions, i.e. is it true that $(v_{\theta^k})_k$ uniformly converges for every sequence of evaluations $(\theta^k)_{k\geq1}$ such that $TV(\theta^k) \rightarrow_{k\to\infty} 0$?
The above property is true in the case of an uncontrolled problem (zero player), i.e. when the transition $F$ is single-valued.