Best-response Dynamics in Zero-sum Stochastic Games

We define and analyse three learning dynamics for two-player zero-sum discounted-payoff stochastic games. A continuous-time best-response dynamic in mixed strategies is proved to converge to the set of Nash equilibrium stationary strategies. Extending this, we introduce a fictitious-play-like process in a continuous-time embedding of a stochastic zero-sum game, which is again shown to converge to the set of Nash equilibrium strategies. Finally, we present a modified δ-converging best-response dynamic, in which the discount rate converges to 1, and the learned value converges to the asymptotic value of the zero-sum stochastic game. The critical feature of all the dynamic processes is a separation of adaptation rates: beliefs about the value of states adapt more slowly than the strategies adapt, and in the case of the δ-converging dynamic the discount rate adapts more slowly than everything else.


Introduction
Evolutionary and learning approaches to game theory justify equilibrium play as the end point of a dynamic process resulting from adaptations made by boundedly rational players. However, to date there has been only limited success in applying the evolutionary or adaptive learning approach to stochastic games. The continuous-time best-response dynamic, a staple of evolutionary game theory, has thus far only been studied in normal-form and extensive-form games. We therefore define and investigate best-response dynamics for two-player zero-sum stochastic games.
The standard best-response dynamic in a game is specified as a differential inclusion with a constant revision rate; see Matsui (1989), Gilboa and Matsui (1991), Hofbauer (1995), and Balkenborg et al. (2013). A state in the dynamic specifies the strategy profile of all players, and the frequency of a strategy increases only if it is a best response to the current state. It is worth noting that the continuous-time best-response dynamic is equivalent to a continuous-time fictitious play (Brown, 1949) after a time rescaling. The best-response dynamic has been analyzed in various classes of normal-form games (also called strategic-form games or one-shot games); see Hofbauer and Sigmund (1998) and Sandholm (2010). In particular, the convergence of a continuous-time best-response dynamic to the set of Nash equilibria has been shown in Harris (1998), Hofbauer (1995), and Hofbauer and Sorin (2006) for two-player zero-sum games, in Harris (1998) for weighted-potential games, and in Berger (2005) for 2 × n games. For convergence in extensive-form games of perfect information, see Xu (2016).
In a stochastic game (Shapley, 1953) players are in some state each time a decision is to be made; the actions of players in the current state determine not only the instantaneous payoffs but also the transition probabilities to the state in which the next decision is made. Thus, each player has to balance two sometimes contradictory goals: a better instantaneous payoff today and a better state distribution tomorrow. Meanwhile, the other players are also maximizing their own objectives, which makes the decision problem of each player even more complicated. The existence of Nash equilibrium in a stochastic game has been proved for several classes of stochastic games; see Solan (2009) for a survey.
The question addressed in this paper is whether boundedly rational players can reach an equilibrium in a stochastic game. In particular, if players are unable or unprepared to carry out equilibrium calculations or solve Bellman equations for future reward, could they learn the Nash equilibrium strategy in the end? In the present paper, we focus on zero-sum stochastic games with discounted payoff, as is introduced by Shapley (1953), and consider best-response dynamics.
We first point out that it is non-trivial to define a best-response dynamic in a stochastic game, and indeed no established notion is available in the literature yet. Some discrete-time algorithmic approaches that achieve convergence have been presented (e.g., Borkar, 2002; Szepesvári and Littman, 1999; Vrieze and Tijs, 1982), but their convergence has been proved using ad hoc methods instead of considering an underlying dynamic. Perkins (2013) studies a continuous-time best-response dynamic in a stochastic game in which an agent does not anticipate changes to future payoffs as a result of strategy evolution. In his model, a player can calculate the expected future discounted payoff starting at each state for any given stationary strategy profile. When a player is calculating the best response at a state, she assumes that her total payoff will consist of the instantaneous payoff for taking that action against the opponent's action in that state, followed by a future payoff that is determined by the current strategies of both players. Convergence is shown only when the players are sufficiently impatient.
In the present paper, we construct best-response dynamics in which the future payoffs are learned separately from the strategies, to circumvent the problems encountered by Perkins (2013). We suppose that players are myopic learners who cannot calculate the future expected discounted payoff in a zero-sum stochastic game. Instead, they assume an (initially) arbitrary set of continuation payoffs, one for each state. These continuation payoffs allow the definition of an auxiliary game for each state, in which the payoff to an action is given by the instantaneous payoff plus the expected continuation payoff at the subsequent state.
In all our learning dynamics, the continuation payoffs are updated more slowly as time goes on, at rate 1/t. In this way, a continuation payoff is simply the time average of payoffs in the corresponding auxiliary game. As players do not have the ability to calculate the true continuation payoff for the current mixed strategies in the stochastic game, they view this time average as the current best estimate.
We first consider a best-response dynamic in which each player plays a mixed strategy in each auxiliary game, and continuously adjusts this auxiliary game strategy in the direction of the best response to the current mixed strategy of the opponent in that auxiliary game. Here, the speed of strategy adjustment in the best-response dynamic is independent of calendar time t. The key to the convergence of this best-response dynamic is simply the different adjustment speed between the best-response dynamic on players' strategies and the slow adaptation of the continuation payoffs. The slowly evolving auxiliary games allow the players to learn to play close to an equilibrium of the auxiliary game; this in turn allows the continuation payoffs to converge, so that the strategy profile being played approaches an equilibrium strategy profile in the stochastic game. We show in Section 3 that this dynamic converges at rate 1/t in payoff terms.
In the best-response dynamic so far proposed, both players update and play mixed strategies in all states at all times. To introduce a more natural learning model of play between two players, we also introduce a continuous-time state-dependent fictitious play process, in which actual play of the game takes place in real time. In this process, the game transitions through the states according to a controlled continuous-time Markov chain, where the controlling parameter is the action profile currently being played in the state. While the game is in a state, each player plays a best response to her belief about the opponent's action in that state as well as the current continuation payoffs. Specifically, each player observes the action taken by the opponent and updates her belief about the opponent's behavior in the current state at a constant rate in the direction of the currently-observed action. The continuation payoffs of all states are updated as in the best-response dynamic, tracking the empirical time average of auxiliary game payoffs. There is no need for these to be updated only in the current state, since these are unobserved hypothetical quantities anyway. Again, the separation of adjustment speeds ensures convergence of this state-dependent fictitious play process.
We finish by progressing further and propose a variant of the best-response dynamic such that the payoff in each auxiliary game converges to the corresponding asymptotic value of the zero-sum stochastic game when the discount factor increases to 1. This is achieved by once again evolving a parameter slowly in comparison to the others; in this case the discount factor adjusts towards 1 even more slowly than the continuation payoffs. So far as we can ascertain, this is the first adaptive dynamical procedure which converges to the asymptotic value of a zero-sum stochastic game.
We postpone the literature review of stochastic games, and the positioning of our work within that literature, to Section 6.

The Game Models
We begin by reviewing relevant results in two-player zero-sum normal-form games. These results will be used for the convergence within auxiliary games in the best-response dynamics for stochastic games. We then define zero-sum stochastic games and introduce the concepts that are central to the development of our learning dynamics in the rest of the paper.

Zero-sum Normal-form Games
In a two-player zero-sum game G, where Player 1's and Player 2's finite pure strategy sets are A^1 and A^2, respectively, the (a^1, a^2) element r(a^1, a^2) of the payoff matrix denotes the payoff to Player 1 when Player 1 plays a^1 and Player 2 plays a^2. We can then linearly extend the payoff function to mixed strategies, i.e. r(x^1, x^2) is defined for any x^1 ∈ ∆(A^1) and x^2 ∈ ∆(A^2). For convenience, we may write
$$v(G) := \max_{x^1 \in \Delta(A^1)} \min_{x^2 \in \Delta(A^2)} r(x^1, x^2) = \min_{x^2 \in \Delta(A^2)} \max_{x^1 \in \Delta(A^1)} r(x^1, x^2). \tag{2.1}$$
An optimal strategy of Player 1 guarantees a payoff no less than v(G), regardless of the strategy of Player 2; similarly, an optimal strategy of Player 2 guarantees a payoff to Player 1 no more than v(G). An optimal strategy profile is also a Nash equilibrium in G. (We use "optimal" here to mean a minimax strategy in a zero-sum game.) The best-response dynamics have been well studied by authors including Brown (1949), Matsui (1989), Gilboa and Matsui (1991), Hofbauer (1995), Hofbauer and Sigmund (1998), Fudenberg and Levine (1998), Harris (1998), Hopkins (1999), Benaïm et al. (2005), Berger (2005), Hofbauer and Sorin (2006), Leslie and Collins (2006), Sandholm (2010), and Viossat and Zapechelnyuk (2013). They are motivated as a model of learning by individuals constantly updating their mixed strategies towards a best response to opponent mixed strategies (e.g. Leslie and Collins, 2006), as a continuous-time fictitious play process in which beliefs are continuously adjusted towards observed opponent best responses (e.g. Harris, 1998), as a version of a Bayesian updating process with a Dirichlet prior (e.g. Fudenberg and Levine, 1998), or as a limiting process that can be used to study discrete-time fictitious play (e.g. Benaïm et al., 2005). Others consider the best-response dynamics simply as a method for calculating equilibrium (Brown, 1949).
Under these dynamics, strategies evolve at a constant rate in the direction of the current best response, defined for Players 1 and 2 respectively as
$$br^1(x^2) := \operatorname*{argmax}_{\rho^1 \in \Delta(A^1)} r(\rho^1, x^2) \quad \text{and} \quad br^2(x^1) := \operatorname*{argmin}_{\rho^2 \in \Delta(A^2)} r(x^1, \rho^2).$$
The best-response dynamic in a normal-form game is therefore defined by
$$\dot{x}^1 \in br^1(x^2) - x^1, \qquad \dot{x}^2 \in br^2(x^1) - x^2, \tag{2.2}$$
where the dot represents the derivative with respect to time, and we have suppressed the time argument t. Since best-response strategies are in general not unique, this is actually a differential inclusion. In normal-form games the set br^i(x^{-i}) is upper semi-continuous in x^{-i}, so a solution trajectory of (2.2) exists, though it is not necessarily unique; see Aubin and Cellina (1984) and Benaïm et al. (2005). Given a strategy profile x = (x^1, x^2), we define the energy to be
$$w(x) := \max_{\rho^1 \in \Delta(A^1)} r(\rho^1, x^2) - \min_{\rho^2 \in \Delta(A^2)} r(x^1, \rho^2). \tag{2.3}$$
It is straightforward to see that
$$w(x) \geq 0 \quad \text{for all } x, \tag{2.4}$$
and that w(x) = 0 if and only if x^1 and x^2 are optimal strategies of Players 1 and 2, respectively. Harris (1998) and Hofbauer and Sorin (2006) show the following result:

Theorem 2.1. Given a zero-sum normal-form game G, along every solution trajectory (x(t))_{t≥0} of (2.2), w(x(t)) is a Lyapunov function with
$$\frac{d}{dt} w(x(t)) = -w(x(t)) \quad \text{for almost all } t. \tag{2.5}$$
Hence
$$w(x(t)) = e^{-t} w(x(0)), \tag{2.6}$$
and every solution trajectory of (2.2) converges to the set of optimal strategy profiles. That is,
$$\lim_{t \to \infty} \mathrm{dist}(x(t), Z) = 0,$$
where Z denotes the set of optimal strategy profiles in G.
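As a concrete illustration of the dynamic (2.2) and the energy decay in Theorem 2.1, the following sketch (our own, not code from any cited source) applies a simple Euler discretisation to the differential inclusion in matching pennies; the payoff matrix, initial strategies, and step size are arbitrary illustrative choices.

```python
import numpy as np

# Payoff matrix for Player 1 in matching pennies (value 0).
R = np.array([[1.0, -1.0], [-1.0, 1.0]])

def best_response_p1(y):
    """Pure best response of Player 1 (maximiser) to mixed strategy y."""
    e = np.zeros(R.shape[0]); e[np.argmax(R @ y)] = 1.0
    return e

def best_response_p2(x):
    """Pure best response of Player 2 (minimiser) to mixed strategy x."""
    e = np.zeros(R.shape[1]); e[np.argmin(x @ R)] = 1.0
    return e

def energy(x, y):
    """Duality gap w(x) = max_rho r(rho, y) - min_sigma r(x, sigma)."""
    return np.max(R @ y) - np.min(x @ R)

x = np.array([0.9, 0.1])   # initial mixed strategy of Player 1
y = np.array([0.8, 0.2])   # initial mixed strategy of Player 2
h = 0.01                   # Euler step for the differential inclusion
for _ in range(2000):      # integrate up to roughly t = 20
    x = x + h * (best_response_p1(y) - x)
    y = y + h * (best_response_p2(x) - y)

print(energy(x, y))        # small: play has approached the equilibrium
```

In continuous time the energy decays as e^{-t}; the constant-step discretisation instead settles into an O(h) neighbourhood of the unique equilibrium (1/2, 1/2).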

Zero-sum Stochastic Games
Our objective in this article is to develop similar results for a stochastic game, defined in this section. A two-player zero-sum discounted-payoff stochastic game is a tuple Γ = I, S, A, P, r, δ constructed as follows.
• Let S be a set of finitely many states.
• For each player i in state s, A i s denotes a set of finitely many actions. For each state s, we put the set of action pairs A s := A 1 s × A 2 s .
• For each state pair (s, s′) and each action pair a ∈ A_s, we define P_{s,s′}(a) to be the transition probability from state s to state s′ given the action pair a.
• We define r s (·) to be the stage payoff function for Player 1. That is, when the process is in a state s, r s (a) is the instantaneous payoff to Player 1 for the action pair a ∈ A s . Note that, in a zero-sum game, Player 2 always receives stage payoff −r s (a).
• δ is a discount factor that affects the importance of future stage payoffs relative to the current stage payoff.
In any state s, Player i plays an action x^i_s ∈ ∆(A^i_s) =: ∆^i_s. That is, x^i_s(a^i) denotes the probability that, when in state s, player i selects action a^i ∈ A^i_s. In this paper, we only consider stationary strategies for both players. A stationary strategy x^i ∈ ∆^i := ×_{s∈S} ∆^i_s of player i specifies for each state s a mixed strategy x^i_s to be played whenever the state is s. We denote a strategy profile by x = (x^1, x^2) = ((x^1_s)_{s∈S}, (x^2_s)_{s∈S}), and the set of strategy profiles by ∆ := ∆^1 × ∆^2. Given a strategy profile x in ∆, for any state s, we may write
$$r_s(x_s) := \sum_{a^1 \in A^1_s} \sum_{a^2 \in A^2_s} x^1_s(a^1)\, x^2_s(a^2)\, r_s(a^1, a^2), \tag{2.7}$$
and a similar treatment applies to a transition probability P_{s,s′}(x_s). To ease the exposition, we denote a stochastic game Γ starting from a state s by Γ_s. We can then define the expected discounted payoff for Player 1 under the strategy profile x in Γ_s as
$$U_s(x) := (1-\delta)\, \mathbb{E}_x \Big[ \sum_{n=0}^{\infty} \delta^n\, r_{s_n}(x_{s_n}) \,\Big|\, s_0 = s \Big], \tag{2.8}$$
where {s_n}_{n∈{0,1,2,...}} is a stochastic process representing the state of the process at each iteration, and the factor (1 − δ) normalizes the discounted payoff. Of course, Player 2 has an expected discounted payoff −U_s(x). Define
$$B := \Big[ \min_{s \in S} \min_{a \in A_s} r_s(a),\; \max_{s \in S} \max_{a \in A_s} r_s(a) \Big]. \tag{2.9}$$
Then U_s(x) is in B for any strategy profile x starting in any state s. Shapley (1953) proves that for every two-player zero-sum discounted-payoff stochastic game Γ_s, there exists a unique value Val_s, called the value of state s, equal to the expected discounted payoff of Player 1 that she can guarantee by an optimal strategy. Shapley (1953) further shows the existence of a stationary optimal strategy profile, also called a Nash equilibrium; for any stationary optimal strategy profile x̄, Val_s satisfies the equations
$$\mathrm{Val}_s = (1-\delta)\, r_s(\bar{x}_s) + \delta \sum_{s' \in S} P_{s,s'}(\bar{x}_s)\, \mathrm{Val}_{s'} \quad \text{for all } s \in S. \tag{2.10}$$
We can also study the asymptotic behavior in a stochastic game Γ_s(δ) where δ increases to 1. Given a finite stochastic game, for each state s ∈ S, the asymptotic value lim_{δ→1} Val_s(δ) exists; see Bewley and Kohlberg (1976) and Mertens and Neyman (1981).
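Shapley's characterization underlies the classical value-iteration scheme: repeatedly replace each continuation payoff u_s by the value of the auxiliary matrix game built from u, a δ-contraction whose fixed point is (Val_s)_{s∈S}. The sketch below is our own illustration, with a hypothetical 2-state, 2-action game and the standard LP formulation of a matrix-game value; it is not code from the paper.

```python
import numpy as np
from scipy.optimize import linprog

def matrix_game_value(R):
    """Value of the zero-sum matrix game R (Player 1 maximises) via LP:
    maximise v subject to x'R[:, j] >= v for all j, with x a distribution."""
    m, n = R.shape
    c = np.zeros(m + 1); c[-1] = -1.0          # variables (x_1..x_m, v)
    A_ub = np.hstack([-R.T, np.ones((n, 1))])  # v - x'R[:, j] <= 0
    A_eq = np.append(np.ones(m), 0.0).reshape(1, -1)
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(n), A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, 1)] * m + [(None, None)])
    return res.x[-1]

# Hypothetical 2-state, 2-action game (illustrative numbers only).
r = [np.array([[1.0, 0.0], [0.0, 1.0]]),       # stage payoffs in state 0
     np.array([[2.0, -1.0], [-1.0, 2.0]])]     # stage payoffs in state 1
# P[s, a1, a2] = distribution over successor states.
P = np.array([[[[0.5, 0.5], [0.2, 0.8]],
               [[0.8, 0.2], [0.5, 0.5]]],
              [[[0.3, 0.7], [0.6, 0.4]],
               [[0.6, 0.4], [0.3, 0.7]]]])
delta = 0.9

u = np.zeros(2)                      # arbitrary initial continuation payoffs
for _ in range(200):                 # error contracts by factor delta per step
    f = [(1 - delta) * r[s] + delta * (P[s] @ u) for s in range(2)]
    u = np.array([matrix_game_value(f[s]) for s in range(2)])
print(u)                             # approximately (Val_0, Val_1)
```

After 200 iterations the residual of the Shapley operator is below 0.9^200 times the initial error, i.e. numerically negligible.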

An Auxiliary Game
A central concept in stochastic games is that of the auxiliary game formed by composing the stage game payoffs with the expected future discounted payoffs (Shapley, 1953). If Player 1 knows (or assumes) that the future discounted payoff achievable from every state s′ is given by u_{s′}, then the expected future discounted payoff achievable by playing mixed strategy x_s in state s is given by
$$f_{s,u}(x_s) := (1-\delta)\, r_s(x_s) + \delta \sum_{s' \in S} P_{s,s'}(x_s)\, u_{s'}, \tag{2.11}$$
where u is the vector of continuation payoffs u_{s′}. The auxiliary game with payoff f_{s,u}(·) is denoted by G_{s,u}. Since the stage games are zero-sum, G_{s,u} is also zero-sum, and Player 2 receives payoff −f_{s,u}(·).
To define a best-response dynamic in a stochastic game and to show its convergence, we will apply the continuous-time best-response dynamic in auxiliary games. It will therefore be convenient to consider best responses and energy in these auxiliary games. We denote the best responses in the auxiliary game in state s with the continuation payoff vector u by
$$br^1_{s,u}(x^2_s) := \operatorname*{argmax}_{\rho^1 \in \Delta^1_s} f_{s,u}(\rho^1, x^2_s) \quad \text{and} \quad br^2_{s,u}(x^1_s) := \operatorname*{argmin}_{\rho^2 \in \Delta^2_s} f_{s,u}(x^1_s, \rho^2).$$
Similarly, we denote the energy in the same auxiliary game by
$$w_{s,u}(x_s) := \max_{\rho^1 \in \Delta^1_s} f_{s,u}(\rho^1, x^2_s) - \min_{\rho^2 \in \Delta^2_s} f_{s,u}(x^1_s, \rho^2). \tag{2.12}$$

The Best-response Dynamic in a Stochastic Game
Our first process is a continuous-time dynamical system in which continuation payoffs evolve slowly, while strategies follow a best-response dynamic defined in the auxiliary games. All strategies and continuation payoffs evolve at all times; we will consider a more plausible model of actual play in Section 4. Nevertheless, we can motivate the dynamic in this section as follows. At each time instant, each player knows x(t), i.e., both her own and her opponent's mixed strategies, and estimates each continuation payoff u s (t) as the average auxiliary game payoff in state s up to time t. Each player i thus learns the current auxiliary games G s, u(t) in all states s, and then calculates the auxiliary game payoffs f s, u(t) (x(t)) for the current mixed strategy as well as the best responses br i s, u(t) (x −i (t)). Meanwhile, the strategies are adapted, at constant rate, towards the best responses.
Formally, we pick an arbitrary initial vector u(1) = (u_s(1))_{s∈S} with u_s(1) ∈ B for every s ∈ S, where B is the bounding interval defined in (2.9). Suppose that the initial stationary strategy profile (x_s(1))_{s∈S} is given. We define the following dynamical system for every state s ∈ S at every time t ≥ 1:
$$\dot{u}_s(t) = \frac{1}{t} \big( f_{s,u(t)}(x_s(t)) - u_s(t) \big), \tag{3.1}$$
$$\dot{x}^i_s(t) \in br^i_{s,u(t)}(x^{-i}_s(t)) - x^i_s(t), \quad i = 1, 2, \tag{3.2}$$
and call such a dynamical system the best-response dynamic in stochastic game Γ. Note that (3.1) is equivalent, up to the initial condition, to u_s(t) = (1/t) ∫_1^t f_{s,u(τ)}(x_s(τ)) dτ, which is the average auxiliary game payoff up to time t, while (3.2) indicates that x_s(t) follows a best-response dynamic in the auxiliary game G_{s,u(t)}. We start the dynamic at t = 1 simply for notational convenience in (3.1).
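The two-timescale structure of (3.1) and (3.2) can be simulated by a simple Euler scheme: continuation payoffs average at rate 1/t while strategies adjust at a constant rate in the auxiliary games. The discretisation below is our own sketch with hypothetical game data, not the paper's procedure.

```python
import numpy as np

# Hypothetical 2-state, 2-action zero-sum stochastic game.
r = [np.array([[1.0, 0.0], [0.0, 1.0]]),
     np.array([[2.0, -1.0], [-1.0, 2.0]])]
P = np.array([[[[0.5, 0.5], [0.2, 0.8]],
               [[0.8, 0.2], [0.5, 0.5]]],
              [[[0.3, 0.7], [0.6, 0.4]],
               [[0.6, 0.4], [0.3, 0.7]]]])   # P[s, a1, a2] = successor dist.
delta, h = 0.8, 0.01

def pure(i):
    e = np.zeros(2); e[i] = 1.0
    return e

u = np.zeros(2)                                   # continuation payoffs
x = [np.array([0.7, 0.3]), np.array([0.4, 0.6])]  # Player 1's strategies
y = [np.array([0.2, 0.8]), np.array([0.6, 0.4])]  # Player 2's strategies
t = 1.0
for _ in range(50000):
    for s in range(2):
        f = (1 - delta) * r[s] + delta * (P[s] @ u)  # auxiliary game G_{s,u}
        b1 = pure(int(np.argmax(f @ y[s])))          # best responses
        b2 = pure(int(np.argmin(x[s] @ f)))
        u[s] += (h / t) * (x[s] @ f @ y[s] - u[s])   # slow 1/t averaging
        x[s] += h * (b1 - x[s])                      # fast adjustment
        y[s] += h * (b2 - y[s])
    t += h

print(u)   # approaches the state values (Val_0, Val_1)
```

The auxiliary-game energies shrink towards zero while the slowly averaged continuation payoffs drift towards the state values, mirroring the separation of timescales in the convergence argument.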
Theorem 3.1. Let Γ be a two-player zero-sum stochastic game, and let x(t) and u(t) be any solution trajectory of the best-response dynamic (3.1) and (3.2).
(i) For each state s, as t → ∞, both f s, u(t) (x s (t)) and u s (t) converge to Val s , and x 1 (t) and x 2 (t) converge to the set of stationary optimal strategies of Player 1 and 2, respectively.
(ii) There exists a constant K such that, for all s ∈ S, |Val_s − u_s(t)| ≤ K t^{−1}, i.e. the continuation payoffs converge to Val_s at rate t^{−1}.
This means that the continuation payoffs u(t) move very slowly, and the same energy-based arguments as used in Theorem 2.1 can be used to show that w_{s,u(t)}(x_s(t)) → 0. This in turn tells us that f_{s,u(t)}(x_s(t)) approaches the value of the auxiliary game G_{s,u(t)}; if f_{s,u(t)}(x_s(t)) were replaced by this auxiliary game value, then (3.1) would become, essentially, a time rescaling of the scheme of Vigeral (2010); the remainder of our proof of part (i) of the theorem is simply a generalisation of that of Vigeral (2010).
Part (ii) of the theorem simply considers more carefully the bounds we place on the rates of convergence of each part of the dynamical system, and notes that the slowest rate is 1/t.
The full proof is given in Appendices B and C.
The dynamical system (3.1)-(3.2) can also be viewed as a feedback system in which (f_{s,u(t)}(a))_{a∈A_s, s∈S} transforms strategies to payoffs and the best-response dynamic (3.2) transforms the payoffs back to strategies. Several recent works (e.g. Hofbauer and Sandholm, 2009; Sandholm, 2010; Fox and Shamma, 2013) consider evolutionary dynamics under this separation framework, with Zusai (2019) providing a helpful summary of the concept and using it to show the dynamic stability of general "economically reasonable" myopic dynamics in single-population games in which the equilibria are statically stable.

Continuous-time state-dependent fictitious play
In this section, we present a continuous-time embedding of actual play in a stochastic game, in which players transition through the state space, and always play an auxiliary-game best response to the current beliefs about opponent strategies.
Each player plays an action in the current state at every time instant; the holding time in the current state and the distribution over successor states depends on the players' current actions. In this learning process, players update their actions, beliefs about opponent strategies, and continuation payoffs in continuous time, while playing the continuous-time embedding of the game.
We start by introducing our model of a continuous-time embedding of a stochastic game, which is closely related to the models in Guo and Hernández-Lerma (2005), Levy (2013), and Neyman (2017). In order to ensure that all states are visited at a comparable rate, we restrict to an irreducible stochastic game Γ, which requires min_{s,s′∈S} min_{a∈A_s} P_{s,s′}(a) > 0.
In the continuous-time embedding of an irreducible game Γ, given any states s, s′ and a pure action profile a_s, let q_{s,s′}(a_s) be the transition rate from state s to s′ when action a_s is being played. Thus if action a_s is played at time t then the probability of a transition from state s to s′ ≠ s in time [t, t+h] is simply q_{s,s′}(a_s)h + o(h), and the probability of staying in state s during [t, t+h] is 1 + q_{s,s}(a_s)h + o(h) (and thus q_{s,s}(a_s) = −Σ_{s′≠s} q_{s,s′}(a_s)). We define a regular embedding of a game Γ to be one satisfying q_{s,s′}(a_s) ∈ (λ_min, λ_max) for each tuple (s, s′, a_s) with s′ ≠ s, for some 0 < λ_min < λ_max. This condition ensures that the holding times are non-pathological. A consistent regular embedding of Γ further requires that the transitions in the continuous-time embedding follow the same distribution over successor states as the transitions in the original game: given any s, a_s, and any pair of states s′ and s″ both different from s, P_{s,s′}(a_s)/P_{s,s″}(a_s) = q_{s,s′}(a_s)/q_{s,s″}(a_s). The definition of the transition rate extends linearly to a mixed strategy profile.
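One concrete way to satisfy these conditions (our own construction, one of many possible) is to scale the off-diagonal transition probabilities by a fixed rate λ; the consistency ratios are then preserved automatically, and regularity holds because the off-diagonal probabilities of an irreducible game are bounded away from zero.

```python
import numpy as np

def embedding_rates(P, lam=2.0):
    """P[s, s2]: transition probabilities for one fixed action profile.
    Returns Q with off-diagonal rates q_{s,s2} = lam * P[s, s2] and
    diagonal entries q_{s,s} = -sum_{s2 != s} q_{s,s2} (rows sum to 0)."""
    Q = lam * np.array(P, dtype=float)
    np.fill_diagonal(Q, 0.0)
    np.fill_diagonal(Q, -Q.sum(axis=1))
    return Q

# Hypothetical irreducible transition matrix for one action profile.
P = np.array([[0.2, 0.5, 0.3],
              [0.4, 0.1, 0.5],
              [0.3, 0.3, 0.4]])
Q = embedding_rates(P)
print(Q)   # generator matrix: rows sum to zero
```

Under this choice the probability of a jump from s to s′ in [t, t+h] is Q[s, s′]h + o(h), and for any s′, s″ ≠ s the ratio Q[s, s′]/Q[s, s″] equals P[s, s′]/P[s, s″], which is exactly the consistency requirement.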
Consider now two boundedly-rational players playing this continuous-time embedding of the game. At time t they find themselves in state s, with beliefs x −i s (t) about opponent play in this state. In the spirit of fictitious play, players will play an auxiliary-game best response to these beliefs, which requires the use of some continuation payoffs u. We make the following modelling assumptions, along the lines of other models of boundedly-rational learning (see, e.g., Harris, 1998): (i) Each player believes that in each state s the exponentially weighted average play of player −i in state s up to time t is the best estimate of the stage-game mixed strategy in s.
(ii) Each player ignores strategic consideration in the dynamic adaptive process and believes that the realization of her plays in this process will not affect her opponent's predetermined strategy. The players will therefore play best responses to current beliefs; to do so requires some estimate of continuation payoffs.
(iii) Although a player has beliefs about opponent mixed strategy in all states, and could in theory calculate a solution to either (2.8) or Bellman's equation to find self-consistent continuation payoffs, she is unable or unwilling to do so. Hence the player takes the continuation payoffs for each state s to be the historical time average of the believed auxiliary game payoffs f s , u(·) (x s (·)).
The consequences of these assumptions are that, when the players are in a state s at a time t in the learning process, each player i plays a best-response action in the auxiliary game G_{s,u(t)}, denoted by b^i_s(t) ∈ br^i_{s,u(t)}(x^{−i}_s(t)), as if the payoff in this auxiliary game were the final payoff she will receive in Γ. At the same time, Player i updates her belief x^{−i}_s(t) at a constant rate towards the observed best response b^{−i}_s(t) of the opponent −i, and updates the continuation payoffs u(t) in all states, to ensure they are always the average of f_{s,u(t)}(x_s(t)).
We formalise the dynamics as follows. Define an indicator function 1_s(t) such that 1_s(t) = 1 if the players are in state s at time t, and otherwise 1_s(t) = 0. We pick an arbitrary initial vector u(1) = (u_s(1))_{s∈S} with u_s(1) ∈ B for every s ∈ S, where B is the bounding interval defined in (2.9). Suppose that the initial stationary strategy profile (x_s(1))_{s∈S} is given. The continuation payoffs and beliefs evolve according to
$$\dot{u}_s(t) = \frac{1}{t} \big( f_{s,u(t)}(x_s(t)) - u_s(t) \big) \quad \text{for all } s \in S, \tag{4.1}$$
$$\dot{x}^{-i}_s(t) = 1_s(t) \big( b^{-i}_s(t) - x^{-i}_s(t) \big). \tag{4.2}$$
Equation (4.2) is simply the best-response dynamic (3.2) activated whenever players are in state s; (4.1) ensures that the continuation payoffs are the time average of f_{s,u(t)}(x_s(t)). Once again, the continuation payoffs u_s(t) are updated more slowly than the beliefs x^{−i}_s(t). The continuation vector u(t) may be viewed as a preference parameter in the auxiliary game G_{s,u(t)}. In the evolutionary game theory literature, preferences are typically taken to update more slowly than behaviour; see, e.g., Ely and Yilankaya (2001) and Sandholm (2001). The model is therefore consistent with this theory.
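The process (4.1)-(4.2) can be simulated by discretising the continuous-time embedding with small steps: in the current state players play best responses to their beliefs, beliefs move towards the observed actions, continuation payoffs average in every state, and the state jumps at rates proportional to the transition probabilities. The following stochastic sketch is our own, with hypothetical game data and an embedding rate λ of our choosing.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 2-state, 2-action zero-sum stochastic game.
r = [np.array([[1.0, 0.0], [0.0, 1.0]]),
     np.array([[2.0, -1.0], [-1.0, 2.0]])]
P = np.array([[[[0.5, 0.5], [0.2, 0.8]],
               [[0.8, 0.2], [0.5, 0.5]]],
              [[[0.3, 0.7], [0.6, 0.4]],
               [[0.6, 0.4], [0.3, 0.7]]]])
delta, lam, h = 0.8, 1.0, 0.01

def pure(i):
    e = np.zeros(2); e[i] = 1.0
    return e

u = np.zeros(2)                                   # continuation payoffs
x = [np.array([0.7, 0.3]), np.array([0.4, 0.6])]  # beliefs about Player 1
y = [np.array([0.2, 0.8]), np.array([0.6, 0.4])]  # beliefs about Player 2
s_cur, t = 0, 1.0
for _ in range(50000):
    f = [(1 - delta) * r[s] + delta * (P[s] @ u) for s in range(2)]
    # best-response actions to current beliefs, current state only
    a1 = int(np.argmax(f[s_cur] @ y[s_cur]))
    a2 = int(np.argmin(x[s_cur] @ f[s_cur]))
    # (4.2): beliefs move towards observed actions in the current state
    x[s_cur] += h * (pure(a1) - x[s_cur])
    y[s_cur] += h * (pure(a2) - y[s_cur])
    # (4.1): slow averaging of believed auxiliary payoffs, in all states
    for s in range(2):
        u[s] += (h / t) * (x[s] @ f[s] @ y[s] - u[s])
    # jump to another state with probability ~ lam * P * h
    jump = lam * h * P[s_cur, a1, a2].copy()
    jump[s_cur] = 0.0
    if rng.random() < jump.sum():
        s_cur = int(rng.choice(2, p=jump / jump.sum()))
    t += h
print(u)
```

Because the game is irreducible, both states are visited a positive proportion of the time, which is the property the convergence proof relies on.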
A natural question is why we do not assume that players use u_s(t) = f_{s,u(t)}(x_s(t)) to calculate the best responses in the dynamic, instead of having u_s(t) evolve towards f_{s,u(t)}(x_s(t)). Firstly, note that we make a bounded-rationality assumption that players would like to maximise the discounted payoff in the stochastic game but do not know how to calculate (2.8) or solve Bellman's equation. Hence a belief about the continuation payoff is needed in order to calculate a best-response action b^i_s(t) based on this belief and the behaviour of the other player. However, a player knows that neither the belief u_s(t) nor the payoff f_{s,u(t)}(x_s(t)) is likely to be the true discounted payoff U_s(x(t)). In this case, she understands that if she forces u_s(t) = f_{s,u(t)}(x_s(t)), then the resultant new payoff in (2.11) is not the old f_{s,u(t)}(x_s(t)); the guess u_s(t) is not internally consistent until the full Bellman equations are solved, so evolving towards reasonable values is a sensible boundedly-rational approach. A secondary consideration is that removing this boundedly-rational assumption, and allowing players to use correct continuation payoffs given current strategy beliefs, is only known to converge when the players are sufficiently impatient (Perkins, 2013).
Under the assumption that players adjust their continuation payoff estimates towards f_{s,u(t)}(x_s(t)), a second obvious question is why in (4.1) the target payoff is f_{s,u(t)}(x_s(t)) rather than f_{s,u(t)}(b_s(t)). After all, the action profile b_s(t) is played, and the perceived instantaneous payoff should be the latter one. However, we would like to emphasise that the belief about players' actions is x_s(t), and so the current best estimate for the continuation payoff in state s is f_{s,u(t)}(x_s(t)); the action b^{−i}_s(t) is simply the new information at time t that will be used by player i to update her belief x^{−i}_s(t). Denote the set of stationary optimal strategy profiles by Z, and define the distance in the space of stationary strategy profiles by the infinity norm.
Theorem 4.1. Let Γ be a two-player irreducible zero-sum stochastic game, and let u(t) and x(t) evolve according to the learning dynamic (4.1) and (4.2) in a regular embedding of Γ. Then, given any µ > 0, there exists a time t̄ such that for each t > t̄,
$$\mathbb{P}\big( \mathrm{dist}(x(t), Z) > \mu \big) < \mu.$$
The proof is given in Appendix D. We first show in Lemma D.1 and Corollary D.2 that with high probability, for a sufficiently long period, players stay in each state for at least a fixed proportion of that period of time, irrespective of what actions they play in the embedding process. We then build on the proof of Theorem 3.1 to give convergence first of the x_s(t) to a neighbourhood of the auxiliary game equilibria, and then of the continuation payoffs, conditional on the event that all states are updated in a sufficient proportion of the time.

The δ-converging Best-response Dynamic
In addition to its own interest, the study of the value of zero-sum stochastic games is essential to the study of related non-zero-sum stochastic games; see, e.g., Dutta (1995) and Hörner et al. (2011). In particular, folk theorems for non-zero-sum stochastic games often assume that players are patient; the asymptotic value, as the discount factor converges to 1, of the corresponding zero-sum stochastic game gives the limit of the individually rational payoffs in the non-zero-sum stochastic game.
The asymptotic value exists in a finite zero-sum stochastic game: Bewley and Kohlberg (1976) prove the existence by a semi-algebraic approach, and Oliu-Barton (2014) proves it by an approach based on asymptotically optimal strategies. Based on this existence result, we present below a δ-converging best-response dynamic as an adaptive approach to computing the asymptotic value. Note that neither of the previous approaches (Bewley and Kohlberg, 1976; Oliu-Barton, 2014) is readily amenable to computation: the former uses the Tarski-Seidenberg elimination theorem from real algebraic geometry, while the latter needs stationary optimal strategies in an infinite sequence of zero-sum stochastic games.
It is also worth noting that the asymptotic value in discounted payoff is equal to the value in limit average payoff for any finite zero-sum stochastic game. For the formulation of value in limit average payoff, let us first observe that the value exists in a stochastic game where the interaction lasts only for a natural number T stages and the final payoff is the average of these T stage payoffs. If T increases to ∞, then the payoff at each given stage is insignificant as compared to the payoffs in all other stages. Mertens and Neyman (1981) prove that a value exists under the condition that the limsup average stage payoff is applied as T increases to ∞. Moreover, this value is the same as the asymptotic value in discounted payoff when δ increases to 1. So far no direct computational method to reach the value in limit average payoff is available in the literature.
As δ is not a constant in the following model, let us rewrite (2.11) as
$$f_{s,u,\delta}(x_s) := (1-\delta)\, r_s(x_s) + \delta \sum_{s' \in S} P_{s,s'}(x_s)\, u_{s'}. \tag{5.1}$$
Similarly to the best-response dynamic in Section 3, pick an arbitrary δ(2) ∈ (0, 1), and u(2) = (u_s(2))_{s∈S} with u_s(2) ∈ B for each s ∈ S, where B is the bounding interval defined in (2.9) (starting the process at t = 2 is once again solely for notational convenience). We show here that, given any state s in a zero-sum stochastic game, u_s(t) of any solution trajectory to the following system with initial time t = 2 converges to the asymptotic value of Γ_s:
$$\dot{\delta}(t) = \frac{1 - \delta(t)}{t \log t}, \tag{5.2}$$
$$\dot{u}_s(t) = \frac{1}{t} \big( f_{s,u(t),\delta(t)}(x_s(t)) - u_s(t) \big), \tag{5.3}$$
$$\dot{x}^i_s(t) \in br^i_{s,u(t),\delta(t)}(x^{-i}_s(t)) - x^i_s(t), \quad i = 1, 2. \tag{5.4}$$
We call such a dynamic a δ-converging best-response dynamic. Again, one can show the existence of a solution trajectory to the dynamical system from any initial condition ((x_s(2))_{s∈S}, u(2), δ(2)) by the results in Aubin and Cellina (1984).
Theorem 5.1. Let Γ be a two-player zero-sum stochastic game, and let x(t), u(t) and δ(t) evolve according to the δ-converging best response dynamic (5.2)-(5.4). Then for each state s, as t → ∞, both f s, u(t) (x(t)) and u s (t) converge to the asymptotic value of Γ s .
The only difference between this δ-converging best-response dynamic and the best-response dynamic in Section 3 is the evolution of the discount factor δ(t) given by (5.2). Note that this discount factor adapts even more slowly than both the continuation payoffs and the players' actions, and is independent of players' actions and continuation payoffs, taking values δ(t) = 1 − c(log t)^{−1} for a constant c determined by the initial condition δ(2). The specific formulation (5.2) is just one example of a sufficiently slow δ-increasing process, satisfying the important conditions that δ(t) → 1 and δ̇(t) = o(1/t). To see why we need (5.2), first note that the speed difference between (5.3) and (5.4) allows each player to learn an approximately optimal action in each auxiliary game equipped with the current continuation payoff vector, as we have discussed before. The slowness of the discount factor adaptation allows the continuation payoff vector defined in (5.3) to eventually converge to a small set of vectors, each of which is approximately valid as a continuation payoff vector whenever δ(t) is sufficiently close to 1. The proof is given in Appendix E.
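A one-line check (our own verification) that the closed form δ(t) = 1 − c(log t)^{−1} increases to 1 on a slower timescale than the 1/t averaging of the continuation payoffs:
$$\dot{\delta}(t) = \frac{c}{t(\log t)^2} = \frac{1-\delta(t)}{t \log t}, \qquad t\,\dot{\delta}(t) = \frac{c}{(\log t)^2} \longrightarrow 0,$$
so indeed δ̇(t) = o(1/t) while δ(t) → 1 as t → ∞.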

Discussion
We note that several alternative approaches to learning in stochastic games might also be considered appropriate. We could translate the stochastic game into a normal-form game with actions equal to the stationary pure strategies of the stochastic game, and payoffs given by the corresponding discounted payoffs U i s (·) in the stochastic game, perhaps aggregated over s. Standard learning dynamics can be deployed in the normal-form representation, and will converge since the game is zero-sum. However, a mixed strategy in the normal form does not correspond to a stationary mixed strategy in the stochastic game. To illustrate this point, consider a one-player stochastic game with two states, α and β, where β is an absorbing state with stage payoff −4. There are two actions, a and b, in state α. If the player selects a then she receives payoff r α (a) = 0 and the state in the next stage is still α with probability 1; if the player selects b then she receives r α (b) = 1 and P α,α (b) = P α,β (b) = 1/2. A mixed strategy in the normal-form representation corresponds to using pure strategy a for all time with probability 1 − ρ, and pure strategy b for all time with probability ρ, for some ρ ∈ [0, 1]. A stationary mixed strategy in the stochastic game corresponds to selecting a with probability 1 − ρ (and b with probability ρ) independently each time state α is encountered. Thus convergence of the dynamics in the normal-form representation does not necessarily imply convergence to a stationary Nash equilibrium in the stochastic game, as the normal-form representation and the original stochastic game are related but different games.
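The distinction can be seen numerically. The sketch below (function names are ours, and we assume the payoff is normalized by the factor 1 − δ) evaluates both notions of mixing in the two-state example above.

```python
def stationary_payoff(rho, delta):
    """Normalized discounted payoff U_alpha(rho) when b is played with
    probability rho independently at every visit to state alpha.
    Solves U = (1 - delta)*rho + delta*((1 - rho/2)*U + (rho/2)*U_beta)
    with absorbing payoff U_beta = -4."""
    return rho * (1 - 3 * delta) / (1 - delta + delta * rho / 2)

def normal_form_payoff(rho, delta):
    """Payoff of the mixed strategy in the normal-form representation:
    play 'a forever' with probability 1 - rho, 'b forever' with probability rho."""
    u_a_forever = 0.0                                 # stays in alpha, reward 0
    u_b_forever = (1 - 3 * delta) / (1 - delta / 2)   # = stationary_payoff(1, delta)
    return (1 - rho) * u_a_forever + rho * u_b_forever

delta = 0.9
for rho in [0.0, 0.5, 1.0]:
    print(rho, stationary_payoff(rho, delta), normal_form_payoff(rho, delta))
# The two payoffs agree at rho = 0 and rho = 1 but differ for interior rho,
# confirming that the normal-form representation is a different game.
```

The disagreement at interior ρ is exactly why convergence of learning in the normal-form representation says nothing about stationary strategies of the stochastic game.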
Another natural approach is to note that the stationary strategy space ∆ is a compact and convex set. Results of Hofbauer and Sorin (2006) on dynamics in compact and convex strategy spaces might then be applied. Note however that the state-transition structure makes the payoffs more complex than those studied by Hofbauer and Sorin (2006). In particular, they consider only games with payoff concave in Player 1's strategy and convex in Player 2's strategy. Consider again the game introduced in the previous paragraph. We abuse notation and denote by ρ the stationary strategy that assigns probability ρ to playing b in state α. The expected discounted payoff in state α satisfies

U α (ρ) = ρ(1 − 3δ) / (1 − δ + δρ/2).

It follows that

U α ′′ (ρ) = δ(1 − δ)(3δ − 1) / (1 − δ + δρ/2) 3 .
If δ > 1/3, then the second derivative is positive, and hence U α (ρ) is convex in ρ, taking us outside the framework of Hofbauer and Sorin (2006). One may also be tempted to apply the convergence result for the best-response dynamic defined on convex/concave envelopes of the payoff function in a continuous quasiconcave-quasiconvex zero-sum game, proved by Barron et al. (2009). However, they also show, via a counterexample in which the dynamic fails to converge, that the envelopes are necessary. The construction of convex/concave envelopes makes the learning procedure much more complicated than the implementation of best-response strategies alone. Relying on these earlier results on normal-form games is thus not appropriate.
There exist other learning methods explicitly designed for stochastic games, such as those of Szepesvári and Littman (1999), Vrieze and Tijs (1982) and Borkar (2002). Note however that Szepesvári and Littman (1999) requires the solution of a linear program on every iteration of learning, Vrieze and Tijs (1982) presents a somewhat unnatural dynamic relying on very specific starting beliefs, and the results of Borkar (2002) are weaker than ours, albeit obtained with players that require less information about the game. These results can be viewed as computational techniques to find the value.
The best-known algorithm for computing the value of a zero-sum stochastic game with discounted payoff is still the value iteration process of Shapley (1953). However, this algorithm needs to compute the values of all zero-sum auxiliary games in each round. A continuous-time extension of this value iteration process is presented in Vigeral (2010) as follows. In a zero-sum stochastic game with discounted payoff, the so-called Shapley operator v (·,·) is a δ-contraction. That is, for each pair of continuation payoff vectors (u, u′), max s∈S |v s, u − v s, u′ | ≤ δ max s∈S |u s − u′ s |.
By this property, Vigeral (2010) proves that the dynamical system u̇ s (t) = v s, u(t) − u s (t), ∀s ∈ S (6.1) converges to the value of the zero-sum stochastic game. The basic idea of the proof derives from the property that in the state with the maximal distance |v s, u(t) − u s (t)|, this distance is always decreasing, which also follows from an intermediate result (B.13) in our proof of Theorem 3.1.i. Vigeral (2010) also shows the convergence of a variant of the dynamic (6.1) with discount factor increasing to 1, analogous to our δ-converging result in Section 5. Our results can therefore be considered as a boundedly rational extension of Vigeral (2010) in which players do not calculate values of games, and simply play best responses to current beliefs; the end product of this myopic adjustment process is an optimal stationary strategy profile, and associated values, in the zero-sum stochastic game.
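For concreteness, here is Shapley's discrete-time value iteration on the one-player example from the Discussion (a one-player game, so each auxiliary "game" reduces to a maximization; the tolerance, the starting vector and the normalization by 1 − δ are our own choices).

```python
def shapley_operator(u, delta):
    """One application of the Shapley operator for the two-state example:
    state alpha with actions a and b; absorbing state beta with stage payoff -4.
    Normalized payoffs: (1 - delta) * stage reward + delta * continuation."""
    u_alpha, u_beta = u
    val_a = (1 - delta) * 0 + delta * u_alpha                        # a: stay in alpha
    val_b = (1 - delta) * 1 + delta * (0.5 * u_alpha + 0.5 * u_beta)  # b: coin-flip to beta
    return (max(val_a, val_b), (1 - delta) * (-4) + delta * u_beta)

def value_iteration(delta, tol=1e-12):
    """Iterate the Shapley operator to its fixed point, the discounted value."""
    u = (0.0, 0.0)
    while True:
        u_next = shapley_operator(u, delta)
        if max(abs(a - b) for a, b in zip(u, u_next)) < tol:
            return u_next
        u = u_next

for delta in [0.2, 0.9]:
    print(delta, value_iteration(delta))
# Since the operator is a delta-contraction, the iteration converges to the
# discounted value: approximately (0.444, -4) for delta = 0.2 (b is optimal)
# and (0, -4) for delta = 0.9 (a is optimal).
```

The dynamic (6.1) is the continuous-time analogue of this iteration; our best-response dynamics replace the exact value v s, u(t) on the right-hand side by the currently played payoff f s, u(t) (x s (t)).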
We would like to emphasize again that our work focuses on stochastic games with discounted payoff. In addition to the expected discounted payoffs defined in (2.8), one can also consider limit average payoffs, in which the players care only about the long-run average payoff, so that the payoff at any given stage is insignificant compared with all the other stage payoffs. Schoenmakers et al. (2007) provide a counterexample demonstrating that a natural fictitious play dynamic need not converge in the case of limit average payoffs, and we leave it as an open question whether a dynamic such as those presented in this article may converge.
Finally, we note that our results, like the vast majority of those on learning in games, consider the setting where all players use the same algorithm. Stronger results would provide consistency guarantees for a learner that deploys the algorithm without knowing what algorithms the other players use, along the lines of Fudenberg and Levine (2014). However, we are aware of no results along these lines that apply to stochastic games.

Appendix A Properties of Zero-sum Normal-form Games
We present two standard preliminary results for a zero-sum normal-form game G with payoff function u.
Lemma A.1. Given a positive finite number c, if we modify the payoff function u to u′ with the property |u′(a 1 , a 2 ) − u(a 1 , a 2 )| ≤ c for all (a 1 , a 2 ) ∈ A 1 × A 2 , then for any (mixed) strategy profile (x 1 , x 2 ), |u′(x 1 , x 2 ) − u(x 1 , x 2 )| ≤ c.

Proof. This follows from the linearity of u in each player's mixed strategy.

Lemma A.2. Given a positive finite number c, if we modify the payoff function u to ū with the property |ū(a 1 , a 2 ) − u(a 1 , a 2 )| ≤ c for all (a 1 , a 2 ) ∈ A 1 × A 2 , then |v(Ḡ) − v(G)| ≤ c, where Ḡ is the game with the modified payoff function ū.
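Lemma A.2 can be checked numerically on random 2 × 2 games. In the sketch below, the closed-form 2 × 2 value formula is standard, but the perturbation size, random seed and tolerance are our own illustrative choices.

```python
import itertools
import random

def value_2x2(A):
    """Value of a 2x2 zero-sum game; the row player maximizes A[i][j]."""
    # A pure saddle point: a row minimum that is also a column maximum.
    for i, j in itertools.product(range(2), range(2)):
        if A[i][j] == min(A[i]) and A[i][j] == max(A[0][j], A[1][j]):
            return A[i][j]
    # Otherwise both players mix; standard closed form for 2x2 games.
    denom = A[0][0] + A[1][1] - A[0][1] - A[1][0]
    return (A[0][0] * A[1][1] - A[0][1] * A[1][0]) / denom

random.seed(0)
c = 0.1
for _ in range(1000):
    A = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(2)]
    # Perturb every entry by at most c, as in Lemma A.2.
    B = [[a + random.uniform(-c, c) for a in row] for row in A]
    assert abs(value_2x2(B) - value_2x2(A)) <= c + 1e-9
print("Lemma A.2 verified on 1000 random 2x2 games")
```

This Lipschitz property of the value in the payoffs is what allows slowly moving continuation payoffs to perturb the auxiliary games only slightly.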

Appendix B Proof of Theorem 3.1.(i)
By an argument similar to that for the standard best-response differential inclusion (2.2), from any initial condition (x s (1), u s (1)) s∈S there exists a solution trajectory (x s (t), u s (t)) s∈S,t≥1 of the best-response dynamic (3.1)-(3.2), in which x s (t) and f s, u(t) (x s (t)) are differentiable for almost all t ≥ 1 in all states s; see Aubin and Cellina (1984). It then follows that the derivative of v s, u(t) exists for almost all t ≥ 1 at all s. Fix a solution trajectory (u s (t), x s (t)) s∈S,t≥1 throughout the proof. For each state s ∈ S, at each time t ≥ 1, we denote the value of the auxiliary game G s, u(t) by v s, u(t) := v(G s, u(t) ), which is defined in (2.1), and recall from (2.12) that the energy in G s, u(t) under x s (t) is denoted by w s, u(t) (x s (t)). We study this energy before proving the convergence of the auxiliary game play x s (t).
Lemma B.1. For any state s, w s, u(t) (x s (t)) is Lipschitz continuous with respect to t.

Proof. It is clear from the definition that f s, u (x s ) is Lipschitz with respect to both
u and x s . Both u(t) and x s (t) are Lipschitz with respect to t, by the definition of a trajectory. Hence f s, u(t) (x s (t)) is Lipschitz with respect to t. From Theorem A.4 in Hofbauer and Sandholm (2009), it follows that both max ρ 1 ∈∆ 1 s f s, u(t) (ρ 1 , x 2 s (t)) and min ρ 2 ∈∆ 2 s f s, u(t) (x 1 s (t), ρ 2 ) are Lipschitz continuous with respect to t. Therefore, w s, u(t) (x s (t)) is Lipschitz continuous with respect to t.
From the definition (3.1) of the dynamical system, u̇ s (t) exists everywhere for all states s. From the definitions (2.11) and (2.12) of the energy for the auxiliary game G s, u(t) , we observe that D u w s, u(t) (x s (t)) always exists. Finally, from (2.7) in the proof of Theorem 2.1, we may infer that ẋ s (t) · D x s w s, u(t) (x s (t)) exists for almost all t. We can then conclude by the chain rule that

ẇ s, u(t) (x s (t)) = u̇(t) · D u w s, u(t) (x s (t)) + ẋ s (t) · D x s w s, u(t) (x s (t)) (B.1)

holds for almost all t. Throughout the proofs in the present paper, all statements about derivatives are to be taken to hold where the derivatives exist, which is everywhere except on a set of times of measure 0.
Lemma B.2. For each state s in S, |f s, u(t) (x s (t)) − v s, u(t) | → 0 as t increases to infinity.
Proof. First note that, by (2.4), |f s, u(t) (x s (t)) − v s, u(t) | → 0 is an immediate consequence of w s, u(t) (x s (t)) → 0, which we prove below by extending Theorem 2.1. Suppose that an arbitrarily small ε > 0 is given. The definitions of the bounding constants b 1 and b 2 in (2.9) imply that, in any state s, |f s, u(t) (x s (t)) − u s (t)| ≤ b 2 − b 1 , so that |u̇ s (t)| ≤ (b 2 − b 1 )/t. Therefore, it follows from the definition of the dynamic (3.1) that there exists t ε > 1 such that 2δ max s′ ∈S |u̇ s′ (t)| ≤ ε for all t ≥ t ε . Note, from (2.11) and (2.12), that a change in continuation payoffs u with maximal change ε′ corresponds to a change in w s, u (x) of at most 2δε′. Hence |u̇ · D u w s, u(t) (x s (t))| ≤ 2δ max s′ ∈S |u̇ s′ (t)|. Furthermore, Harris (1998) and Hofbauer and Sorin (2006) show that the best-response adjustment of x s (t) alone decreases the energy at rate ẇ ≤ −w; combined with the bound above, ẇ s, u(t) (x s (t)) ≤ −w s, u(t) (x s (t)) + ε for all time t ≥ t ε and for all s ∈ S. This in turn implies that, for sufficiently large t, w s, u(t) (x s (t)) < 2ε for all states s ∈ S. Since ε > 0 is arbitrarily small, w s, u(t) (x s (t)) converges to 0, and the result follows.
Lemma B.2 shows that for large t the auxiliary game play will be close to the equilibrium determined by current continuation payoffs. Note that (B.4) is the only line in the proof of Theorem 3.1 where we use a property of the best-response dynamic (3.2), and other revision protocols that give rise to the conclusion of Lemma B.2 would also result in an equivalent of Theorem 3.1.(i). For the rest of the proof, we only need the formulation of continuation payoff adjustment (3.1) and the auxiliary game structure (2.11).
Let ε > 0 be arbitrary, and let t 1 (ε) be such that for all t ≥ t 1 (ε) and all states s in S,

|f s, u(t) (x s (t)) − v s, u(t) | ≤ (1 − δ)ε/16. (B.6)

Such a t 1 (ε) exists by Lemma B.2. For the rest of the proof we will assume that t ≥ t 1 (ε) and hence that (B.6) holds. It remains to show that the continuation payoffs converge to the correct values, i.e. those of a Nash equilibrium. This part of the proof extends the approach of Vigeral (2010), who proves that continuation payoffs converge to equilibrium values if the payoff adjustment dynamics (3.1) are modified to (6.1), so that u s (t) moves in the direction of the value of the auxiliary game instead of in the direction of the current payoff in the auxiliary game. We start with some preliminary definitions:
• For any time t ≥ 1, we mark a state

s f (t) ∈ argmax s∈S |f s, u(t) (x s (t)) − u s (t)|, (B.7)

which, by (3.1), implies that

|u̇ s f (t) (t)| = max s∈S |u̇ s (t)|. (B.8)

• We also, for any time t ≥ 1, mark a state

s v (t) ∈ argmax s∈S |u s (t) − v s, u(t) |. (B.9)

Recall that Lemma B.2 shows that f s, u(t) (x s (t)) becomes close to v s, u(t) for all s.
We will show that, in the limit, for each s, all of f s, u(t) (x s (t)), u s (t) and v s, u(t) are equal. This is sufficient to prove the theorem. We begin with a technical lemma.
Lemma B.3. Suppose that (B.10) holds at some time t ≥ t 1 (ε). Then for any state s with the property (B.11), the bound (B.12) holds.

This lemma says that if the maximal distance between u s (t) and f s, u(t) (x s (t)) is big enough, then the absolute difference between some u s (t) and v s, u(t) is decreasing at a rate at least linear in 1/t. (Since this rate would result in the absolute difference becoming negative, condition (B.10) cannot hold at all times, as we will see in Lemma B.4.)

Proof. From Lemma A.2 and the definition (B.8) of s f (t) as the maximiser of |u̇ s (t)|, it follows that (B.13) holds. Now fix a state s with the property (B.11) at time t ≥ t 1 (ε). We may infer from (B.11), and from the fact that |v s′, u(t) − f s′, u(t) (x s′ (t))| ≤ (1 − δ)ε/16 for all s′ by (B.6), that (B.14) holds. Thus, from the dynamic (3.1) for u s (t), it follows for this s that (B.15) and (B.16) hold, where the last inequality holds since |u̇ s f (t) (t)| ≥ ε/t by (B.10) and (3.1). Combining our inequalities (B.13) and (B.16), we consider first the case u s (t) > v s, u(t) . The closeness of v s, u(t) and f s, u(t) (x s (t)) given by (B.6), along with the conditions (B.10) and (B.11) of the lemma, give (B.17). Invoking (B.6) once more, we see that f s, u(t) (x s (t)) < u s (t), and so, by the definition of the dynamic (3.1), u̇ s (t) = (f s, u(t) (x s (t)) − u s (t))/t < 0. Combined with (B.17), this implies (B.18). Recalling the lower bound on |u̇ s (t)| in (B.16) and the upper bound on v s, u(t) in (B.13), we may then infer (B.19). A near-identical calculation shows the same conclusion if u s (t) < v s, u(t) . The result then follows on noting, once again, that |u̇ s f (t) (t)| = |f s f (t), u(t) (x s f (t) (t)) − u s f (t) (t)|/t ≥ ε/t, using (B.10) to bound |f s f (t), u(t) (x s f (t) (t)) − u s f (t) (t)| below by ε.
This now puts us in a position to prove the important final lemma.
So far, we have shown that there exists a time t 2 (ε) at which the desired result holds. We now show that the desired result holds for arbitrary t > t 2 (ε), by checking two cases.
Case 1: (B.10) does not hold at t, so that the maximal distance in (B.10) is smaller than ε, and (B.20) follows immediately. Furthermore, the remaining conclusion follows by (B.6) and (B.21).

Case 2: (B.10) holds at t. Then the existence of the time t 2 (ε) implies that there exists a time t 3 (ε) with t 2 (ε) < t 3 (ε) ≤ t such that (B.10) fails at the left limit t 3 (ε) − and holds at t 3 (ε). Without loss of generality, we assume that (B.10) holds throughout the time period [t 3 (ε), t]. By the continuity of u and v, we may infer from Case 1 that the desired bounds hold at t 3 (ε). To show (B.20), we may infer from (B.21) and (B.6) that (B.23) holds for all t′ ∈ [t 3 (ε), t]. From (B.23) and (B.24), (B.20) follows.

Proof of Theorem 3.1.(i). From Lemma B.4, we see that, for each state s, u s (t), f s, u(t) (x s (t)) and v s, u(t) all converge to a common limit. Let Z denote the set of optimal strategy profiles for the zero-sum auxiliary games G s,(Val s ) s∈S , where the vector (Val s ) s∈S is a solution to equation (2.10). It follows that x(t) converges to the set Z as t → ∞. We now only need to know that Val s is unique for each s, and that each z ∈ Z is an optimal strategy profile in the stochastic game Γ s regardless of the initial state s. This is proved in Theorem 2 of Shapley (1953).
Appendix C Proof of Theorem 3.1.(ii)
Similarly to (B.15), we may further infer a corresponding lower bound. Following the argument in the proof of Lemma B.3, by (B.13) we can further deduce a bound on v s v (t), u(t) . From the assumption that (C.1) does not hold, this bound can be strengthened, and hence, combined with (C.2), we observe that u s v (t) (t) converges to v s v (t), u(t) at rate 1/t. Recall that f s, u(t) (x s (t)) converges to v s, u(t) at rate 1/t in all states s. Therefore, for all s ∈ S, u s (t), f s, u(t) (x s (t)) and v s, u(t) all converge at rate 1/t. Together with the argument from Theorem 2 of Shapley (1953), as used in the proof of Theorem 3.1.i, they all converge to Val s at rate 1/t.

Appendix D Proof of Theorem 4.1
We start by proving a seemingly obvious result about the occupation times of states in a controlled Markov chain, which is needed to ensure that the action is updated sufficiently often in every state, despite only the action at the current state being updated at any particular time. The reason we cannot apply standard ergodicity results directly to the controlled Markov chain is that the transition rates of the chain may be continually evolving under the control parameter x(t); the result is likely to exist already elsewhere, but we have not managed to find it and hence include the proof here for completeness.
Lemma D.1. Consider a continuous-time controlled Markov chain on a finite state space S. Let the transition rates between states s and s′ be given by q s,s′ (x(t)), where x(t) is an arbitrary control parameter, and define q s (x(t)) := −q s,s (x(t)) = Σ s′ ≠s q s,s′ (x(t)).
Assume that:
• there exists η > 0 such that q s,s′ (x(t))/q s (x(t)) ≥ η for all s, all s′ ≠ s, and all t, so that when a jump occurs the probability of jumping to any given state is bounded below by η, and
• there exist λ min and λ max such that 0 < λ min < q s (x(t)) < λ max for all s and t, so that the holding times in states are well behaved.
Let Q > 0 and ε > 0. Then there exists a ∆T > 0 such that for all T ≥ 0, all s ∈ S, and irrespective of (x(t)) t≥0 , the bound (D.1) holds.

Proof. We construct a proof using a coupling argument, linking our original process to one in which simple renewal-reward arguments (e.g., Grimmett and Stirzaker, 2001) show that the probability of the event we care about is sufficiently high. Throughout, we assume nothing about the control parameter x(t), and we show that our result holds irrespective of x(t).
First note that our Markov model can be implemented using a sequence of independent uniform random variables as follows. If the kth state is s k and the process arrives there at time t k , a uniform random variable U k ∼ Unif(0, 1) is sampled; the state remains at s k until the time t k+1 which satisfies

∫ t k t k+1 q s k (x(t)) dt = − log(1 − U k ); (D.2)

a further uniform random variable V k+1 ∼ Unif(0, 1) is then sampled to determine the state transition, with the next state s k+1 being selected using the inverse cumulative distribution function method on the probability mass function (q s k ,s (x(t k+1 ))/q s k (x(t k+1 ))) s≠s k . If the transition rates did not depend on x(t), (D.2) would result in the standard exponential holding times with jump-chain transition probabilities q s k ,s k+1 /q s k , as in Grimmett and Stirzaker (2001, Section 6.9). When we have non-constant transition rates, it is an easy calculation to see that the instantaneous transition rates in the above construction, if the state is s at time t, are given by q s,s′ (x(t)); thus the construction is a valid implementation of the state sequence. Without loss of generality, we will show (D.1) for a state s * ∈ S, for T = 0. We start modifying our process by introducing a new state s † . Suppose that at time t k a state transition occurs from a state s k−1 ≠ s * , and we have sampled a V k to determine the state s k . If in the original process we would have transitioned to s * (i.e. V k < q s k−1 ,s * (x(t k ))/q s k−1 (x(t k ))) then in our modified process we transition to s k = s * only if V k < η ≤ q s k−1 ,s * (x(t k ))/q s k−1 (x(t k )); otherwise we transition to s † . We stay at either of s * or s † until the t k+1 satisfying (D.2) for s k = s * , then transition to a successor state s k+1 determined by using V k+1 in the inverse cdf method on (q s * ,s (x(t k+1 ))/q s * (x(t k+1 ))) s∉{s * ,s † } (i.e. we use the transition rates for state s * irrespective of whether we are in s * or s † ).
When the original process is in a state other than s * , the modified process is in the same state; when the original process is in s * , the modified process is in either s * or s † . The modified process therefore spends no more time in s * than the original process, in any interval [T, T + ∆T ].
Our next modification homogenises the holding times, and amalgamates all states other than s * . We introduce a new state sequence s̄ k such that if V k < η and s k−1 ≠ s * then s̄ k = s * ; otherwise s̄ k = s − , where s − is a new state amalgamating all states other than s * . This means that if s k = s * in the first modification then s̄ k = s * , whereas if s k ≠ s * in the first modification then s̄ k = s − . We also define new holding times, such that the holding time in state s̄ k is given by − log(1 − U k )/λ s̄ k with λ s * = λ max and λ s − = λ min . This means that the kth holding time when s̄ k = s * is bounded above by the kth holding time in the original process, whereas the kth holding time when s̄ k = s − is bounded below by the kth holding time in the original process. Once again, the s̄ k process spends no more time in s * than the original process, in any interval [T, T + ∆T ].
Finally, note that the s̄ k process has a very simple transition structure: when in state s * , wait for an Exp(λ max ) holding time, then transition to s − ; when in state s − , wait for an Exp(λ min ) holding time, then transition to s * with probability η, and otherwise return to s − and restart the clock. Simple renewal-reward theory (e.g. Grimmett and Stirzaker, 2001) easily gives that there exists a ∆T such that (D.1) holds for the s̄ k process. However, note that any U k , V k sequence for which the s̄ k process occupies state s * for time at least Q in [T, T + ∆T ] also ensures that the same holds for the original state transition process. It follows immediately that (D.1) holds for the original process.
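A Monte Carlo sketch of the statement of Lemma D.1 is given below. The three-state chain, the rate function standing in for the adversarially controlled q s (x(t)), and all constants (η, λ min , λ max , Q, ∆T) are our own illustrative choices; the simulation only checks that, with rates and jump probabilities bounded as in the lemma, the occupation time of a fixed state exceeds Q with high probability.

```python
import math
import random

random.seed(1)
S = [0, 1, 2]                       # finite state space; the marked state s* is 0
LAM_MIN, LAM_MAX, ETA = 0.5, 2.0, 0.2

def q_rate(s, t):
    # An arbitrary time-varying total jump rate in (LAM_MIN, LAM_MAX),
    # standing in for q_s(x(t)) under an unknown control x(t).
    return LAM_MIN + (LAM_MAX - LAM_MIN) * (0.5 + 0.5 * math.sin(s + t)) * 0.99 + 0.001

def occupation_of_0(dT):
    """Simulate the chain on [0, dT] and return the time spent in state 0.
    On each jump, every other state is reached with probability 1/2 >= ETA."""
    t, s, occ = 0.0, 2, 0.0
    while t < dT:
        hold = random.expovariate(q_rate(s, t))
        if s == 0:
            occ += min(hold, dT - t)
        t += hold
        others = [x for x in S if x != s]
        s = others[0] if random.random() < 0.5 else others[1]
    return occ

Q, dT, n = 1.0, 30.0, 2000
hits = sum(occupation_of_0(dT) >= Q for _ in range(n))
print(hits / n)   # fraction of runs spending at least Q time units in state 0
```

With ∆T = 30 the chain visits state 0 roughly a third of the time, so the estimated probability is close to 1, in the spirit of (D.1).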
We make use of this result in the context of a regular embedding of an irreducible stochastic game as follows.
Corollary D.2. Consider a regular embedding of an irreducible stochastic game Γ. For any ε > 0 and k > 0, there exists a time ∆T > Q, depending on Q, ε and k, such that for any time T ≥ 1, any initial state in S, and any measurable strategy process x(t), the stated occupation-time bound holds.

Proof. The definition of a regular embedding of an irreducible game, given in Section 4, ensures that the rates q s,s′ (x(t)) meet the conditions of Lemma D.1. Hence there exists ∆T > 0 such that the bound of Lemma D.1 holds, and the claim follows. Hence, in any time interval of length ∆T , the probability that each x s (t) is updated for at least Q time units is high.

Now fix ε > 0 for the remainder of the proof, define Q in terms of ε and the bounds b 1 and b 2 on rewards defined in (2.9), and let ∆T be appropriate for this choice of ε, Q and an as yet unspecified k. Define A(T ) to be the event that, in [T, T + ∆T ], each x s (t) is updated for at least Q time units. Each of the subsequent lemmas will be conditioned on some A(T ), and hence (by Corollary D.2) will be true with a controlled probability.

Lemma D.3. There exists an integer k 0 > 1 depending only on b 1 , b 2 , δ and ε such that, for any time T̄ ≥ (k 0 − 1)∆T , if A(T̄ ) holds, then w s, u(T̄ +∆T ) (x s (T̄ + ∆T )) ≤ (1 − δ)ε/32 for all s ∈ S.

Proof. Firstly, Lemma B.1 implies that w s, u(t) (x s (t)) is differentiable for almost all time t. In any state s and at any time t, the chain rule (B.1) bounds ẇ s, u(t) (x s (t)). Recall from (2.9) and (4.1) that we can assume |u̇ s (t)| ≤ (b 2 − b 1 )/t for all s and all t. Hence, as in the proof of Lemma B.2, we can choose k 0 sufficiently large (depending only on b 1 , b 2 , δ and ε) such that |u̇ · D u w s, u(t) (x s (t))| ≤ (1 − δ)ε/(64∆T ) for all t ≥ (k 0 − 1)∆T . Thus, for any t ≥ (k 0 − 1)∆T , it follows from (D.6) that the energy is decreasing whenever it exceeds (1 − δ)ε/64. Now suppose that, contrary to the conclusion of the lemma, there exists s̄ ∈ S such that

w s̄, u(T̄ +∆T ) (x s̄ (T̄ + ∆T )) > (1 − δ)ε/32. (D.8)

By the previous calculation, it follows that w s̄, u(t) (x s̄ (t)) > (1 − δ)ε/64 for all t ∈ [T̄ , T̄ + ∆T ]. Since w ·,· (·) ≤ b 2 − b 1 by (2.9), (D.6) and the definition of Q then force the energy below this level by time T̄ + ∆T , contradicting (D.8).
Comment: As in the proof for the best-response dynamic in Appendix B, (D.5) is the only line in the proof of Theorem 4.1 where we use a property of the best-response dynamic (4.2). For the rest of the proof, we only need the formulation of the payoff adjustment (4.1) and the stochastic game structure.
The proof concludes in an identical manner to the proof of Theorem 3.1.(i).

Appendix E Proof of Theorem 5.1
To emphasize that δ(t) is a variable, we denote the auxiliary game by G s, u(t),δ(t) , its value by v s, u(t),δ(t) , and its energy by w s, u(t),δ(t) , for each state s ∈ S at each time t ≥ 2. Begin by noting that it is immediate from (5.2) that, for all t ≥ 2,

δ(t) = 1 − e c /log t, (E.1)

where c is such that 1 − e c /log 2 = δ(2).

(E.3)
Comment: Similarly to the other dynamical systems we consider, the partial-derivative bound (E.3) is the only place in the proof of Theorem 5.1 where we use a property of the best-response dynamic (5.4), i.e., an implication of the revision protocol. For the rest of the proof, we only need the formulations (5.2) and (5.3) as well as the auxiliary game structure (2.11).
The notations s f (t) and s v (t) are defined in (B.7) and (B.9), respectively.
for all t > t̄.
Proof of Theorem 5.1: Recall the convergence of Val s (δ) to the asymptotic value as δ increases to 1, shown in Bewley and Kohlberg (1976). The desired conclusion follows from Lemmata E.1 and E.2.