Learning in Monotone Bayesian Games

This paper studies learning in monotone Bayesian games with one-dimensional types and finitely many actions. Players switch between actions at a set of thresholds. A learning algorithm is examined under which players adjust their strategies in the direction of better ones, using payoffs received at signals similar to their current thresholds. Convergence to equilibrium is shown in the case of supermodular games and potential games.


Introduction
Bayesian games are widely used in Economics, for example in the study of auctions and public goods provision. Rather little is known about whether players are likely to learn to play the equilibria involved in such games. Beggs (2009) proposed a simple model of learning but limited attention to games with binary actions. This paper extends the analysis to games with many actions. It uses recent results of Benaïm and Faure (2012) to show that in supermodular games play converges to equilibrium under weak assumptions. Convergence is also shown in potential games.
In a Bayesian game each player receives a private signal before choosing between actions. This signal may convey information about his and other players' payoffs and therefore indirectly about their likely choice of action. Learning in such a game is in principle very complex. This paper shows that convergence to equilibrium can be achieved by a simple adaptive rule.
The paper studies learning where the actions players take are ordered and are increasing functions of their signal, which is assumed one-dimensional. It follows that a player's strategy can be represented by a collection of thresholds at which he switches between actions. It is assumed that each player adjusts their thresholds in the direction suggested by past play. It is shown that this can be achieved by a simple scheme.
The scheme proposed is a generalization to many actions of that suggested in the case of two actions by Beggs (2009). As noted in that paper the rule can be seen as encapsulating two ideas. Firstly, it can be seen as an example of 'directional learning', as suggested by Selten and Buchta (1998). Players do not pick best responses at each instant but rather move in a direction that will improve their payoffs. Secondly, it incorporates ideas of 'similarity', as studied for example by Gilboa and Schmeidler (2003). Players only observe one signal at a time. In order to estimate payoffs at signals they do not observe, they must use observations of 'similar' signals, weighting them by their closeness to the signal at which payoffs are to be estimated.
The main technical tool of the paper is stochastic approximation. The learning rule studied is an example of passive stochastic approximation, introduced by Härdle and Nixdorf (1987). Recent results of Benaïm and Faure (2012) on non-convergence to unstable sets in perturbed cooperative systems, adapted to apply to the current environment, play a key role in the analysis.
Convergence to equilibrium is shown in supermodular games satisfying a concavity property. Convergence is also shown in potential games.
The literature on learning in Bayesian games is fairly small. The literature on stochastic fictitious play considers learning in normal form games where players' payoffs are subject to random perturbations in the spirit of Harsanyi; see for example Fudenberg and Kreps (1993) or Benaïm and Hirsch (1999). These perturbed games can be regarded as Bayesian games, but it is assumed that the perturbations are independent, which makes learning much more straightforward. Benaïm and Hirsch (1999) allow for correlation in perturbations but assume that players ignore it when choosing actions, so the system does not converge to a true Bayesian equilibrium. Steiner and Stewart (2007) consider a model similar in spirit to that of Beggs (2009). They also employ a rule under which payoffs received at nearby signals are used to estimate payoffs at the current one, but they do not employ a threshold rule, and they restrict attention to global games. Convergence is, however, to a perturbed version of equilibrium rather than a true Bayesian equilibrium as in the current paper. The published version, Steiner and Stewart (2008), differs slightly in that the game is drawn randomly each period but both players observe the game perfectly.
Section 2 lays out the general framework. Section 3 states preliminary results characterizing equilibrium conditions in the static game. In particular it shows that the equilibria of the game can be represented as the solution to a variational inequality. Section 4 gives examples of the environment studied.
Section 5 presents a model of normal-form learning to motivate the principal model of extensive form learning outlined in Section 6. Section 7 contains the main results on convergence in supermodular and potential games. Section 8 concludes.

The Static Game
This section describes the underlying static game. Attention is restricted to pure strategy Bayesian equilibria.
There are p players and each observes a signal t_i, i = 1, …, p, drawn from a set T_i. Payoffs may also depend on the state of nature, which belongs to a given set. After observing his signal each player then takes an action a_i drawn from a set A_i. Let A = A_1 × … × A_p. Nature takes no action.
Assumption 2 The signals have a joint density f on T with respect to Lebesgue measure which is continuous in t and strictly positive.
The assumption that all players share the same type space is without loss of generality and simplifies the notation. F denotes the cumulative distribution function of f .
Each player i has a bounded measurable payoff function U i (a, t) where a ∈ A and t ∈ T .
Assumption 4 U i (a, t) is jointly continuous in t for each a ∈ A, i = 1, . . . , p.
For the convergence analysis stronger assumptions will be needed.

Assumption 5 U_i is continuously differentiable in t for each i and a. f is continuously differentiable.
Assumption 6 U i is twice continuously differentiable in t for each i and a and f is twice continuously differentiable.
A strategy for player i is a measurable function σ_i : T_i → A_i. Let f(t_{−i} | t_i) denote the conditional density of the other players' signals and f_i(t_i) the marginal density of t_i. If each opponent j ≠ i uses a measurable strategy σ_j : T_j → A_j, then i's interim expected payoff conditional on receiving signal t_i and taking action a_i is, denoting by σ_{−i} the vector of opponents' strategies,

V_i(a_i, t_i; σ_{−i}) = ∫ U_i(a_i, σ_{−i}(t_{−i}), t) f(t_{−i} | t_i) dt_{−i};

σ_{−i} will often be suppressed from the notation.
Player i's ex ante expected payoff from using strategy σ_i is E[V_i(σ_i(t_i), t_i)]; σ_i is a best reply if this is at least as large as the corresponding payoff for all strategies σ′_i of player i. Recall that a real-valued function g(x, t) of two variables on two ordered sets satisfies single crossing in (x, t) if for all x′ > x and t′ > t, g(x′, t) − g(x, t) ≥ 0 implies g(x′, t′) − g(x, t′) ≥ 0 and g(x′, t) − g(x, t) > 0 implies g(x′, t′) − g(x, t′) > 0.

Assumption 7 If all players apart from i, i = 1, …, p, use increasing strategies σ_j : T_j → A_j then V_i satisfies single crossing in (a_i, t_i).
If Assumption 7 holds then if all players apart from i use an increasing strategy, i has a best reply which is increasing (see Athey (2001) and Shannon (1994)). Existence of equilibrium under the assumptions above follows from the general results of Athey (2001).
Recall that a function φ(x, y) is (strictly) supermodular in the variables x and y if it has (strictly) increasing differences, that is,

φ(x′, y′) − φ(x, y′) ≥ φ(x′, y) − φ(x, y) for all x′ ≥ x and y′ ≥ y

(with strict inequality whenever x′ > x and y′ > y in the strict case). In other words, the gain to increasing x is increasing in y. A function is log-supermodular if its log is supermodular. A set of random variables will be said to be affiliated if they have a joint density which is log-supermodular.
As noted in Athey (2001) two important sufficient conditions for Assumption 7 are:

Definition 1 A game is (strictly) supermodular with increasing beliefs if for all i, (i) U_i(a, t) is (strictly) supermodular in a and supermodular in (a_i, t_j), j = 1, …, p, and (ii) ∫_S dF(t_{−i} | t_i) is increasing in t_i for all sets S whose indicator functions are increasing.
Definition 2 A game is log-supermodular with affiliated beliefs if (i) for all i, U i (a, t) is non-negative log-supermodular in (a, t) and (ii) types are affiliated.
If payoffs are supermodular then the gain to increasing action for player i increases as the actions of the other players increase and signals increase.

If all other players use increasing strategies then under Definition 1 or Definition 2 Assumption 7 is satisfied.

Equilibrium Conditions
This section shows that the conditions for equilibrium of the static game can be written as the solution to a variational inequality describing the first-order conditions for each agent. This representation is all that is required for what follows but some interpretation is given for completeness. The latter is not essential for the following sections and may be skipped. These characterizations first appeared in Beggs (2011), where a more detailed discussion can be found.
As described by Athey (2001) a monotone strategy can be specified by a vector of cutoffs. For player i, k^i_j denotes the signal at which he first plays action j, j = 1, …, m(i). The dummy cutoffs k^i_0 ≡ 0 and k^i_{m(i)+1} ≡ 1 are sometimes convenient for notational purposes. The strategy set for player i is the set Σ_i = {k^i : 0 ≤ k^i_1 ≤ k^i_2 ≤ … ≤ k^i_{m(i)} ≤ 1}. Note that player 1's payoff conditional on receiving signal t_1 and taking action a_1, V_1 as defined in Section 2, can be written in terms of the cutoffs of the other players, and similarly for other actions and players.
For any pair of actions j and l let ∆U^i(j, l, a_{−i}, t) = U_i(l, a_{−i}, t) − U_i(j, a_{−i}, t) be the gain in payoff from switching from j to l and let ∆V^i(j, l, t_i) denote the expected payoff to switching from action j to l conditional upon receiving signal t_i:

∆V^i(j, l, t_i) = E[∆U^i(j, l, σ_{−i}(t_{−i}), t) | t_i].

A player's ex ante payoff function as a function of his cutoffs can be written, given the strategies of the other players k^{−i}, as

W_i(k^i, k^{−i}) = ∑_{j=0}^{m(i)} ∫_{k^i_j}^{k^i_{j+1}} V_i(j, t_i) f_i(t_i) dt_i. (4)

Denote the gradient vector of W_i with respect to k^i by ∇W_i; its components are ∂W_i/∂k^i_j = −f_i(k^i_j)∆V^i(j − 1, j, k^i_j).
Recall that for a closed convex set X, the normal cone at x ∈ X is the set N_X(x) = {y : yᵀ(z − x) ≤ 0 for all z ∈ X}. The first-order necessary conditions for k^i to maximize W_i can be written

∇W_i(k) ∈ N_{Σ_i}(k^i). (5)

That is, any direction of increase of W_i must point outside Σ_i (see Borwein and Lewis (2006) Proposition 2.1). Equivalently k^i solves the variational inequality VI(−∇W_i, Σ_i). Since the normal cone to a product set is the product of the normal cones, the first-order necessary conditions for equilibrium are, letting ∇W = (∇W_1, …, ∇W_p) and Σ = Σ_1 × … × Σ_p,

∇W(k) ∈ N_Σ(k). (6)

In other words k* solves the variational inequality VI(−∇W, Σ). Note that ∇W is simply the product of the vectors ∇W_i and not a true derivative vector.
Lemma 1 Under Assumptions 1-4 and 7, k^i maximizes W_i given k^{−i} if and only if (5) holds. k* = (k*_1, …, k*_p) is an equilibrium vector of cutoffs if and only if (6) holds.
In other words, the first-order necessary conditions for optimality are also sufficient under Assumption 7. This will follow from the equivalent characterization given below. In the case when payoffs are supermodular, sufficiency is clear if the marginal density of signals is uniform, since then by (4) W_i is concave in k^i. By a suitable transformation it can always be assumed that the marginal distribution of signals is uniform (see for example Nelsen (2006)). In general, Assumption 7 implies that the derivative of W_i with respect to each cutoff changes sign only once and this turns out to be enough for sufficiency.
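The transformation to uniform marginals is the probability integral transform: replacing each signal t_i by F_i(t_i) yields a uniformly distributed signal and preserves the order of cutoffs. A minimal numerical sketch (the density f(t) = 2t is an illustrative choice, not taken from the paper):

```python
import math
import random

random.seed(0)

# Signals with density f(t) = 2t on [0, 1], so F(t) = t^2.
# Inverse-CDF sampling: t = sqrt(u) for u ~ U[0, 1].
n = 100_000
ts = [math.sqrt(random.random()) for _ in range(n)]

# Probability integral transform: s = F(t) = t^2 has a uniform marginal.
ss = [t * t for t in ts]
mean = sum(ss) / n
var = sum((s - mean) ** 2 for s in ss) / n

# A U[0, 1] variable has mean 1/2 and variance 1/12.
print(round(mean, 3), round(var, 4))
```

Applying the transform marginal by marginal leaves the joint dependence structure (the copula) unchanged, which is the sense of the Nelsen (2006) reference.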
An equivalent but more intuitive characterization can be obtained by noting that in order to maximise W i , player i's action must maximise his interim payoff V i (a, t) for almost every signal t i .
If an action is played over an interval of positive measure then it will be said to be active, otherwise it is inactive. That is, action j is active if and only if k^i_{j+1} − k^i_j > 0. Call two active actions adjacent if there is no active action lying between them.
Under Assumptions 1-4, ∆V^i is continuous in k^i, and Assumption 7 yields the following simple characterization of best replies:

Lemma 2 If other players use increasing strategies then under Assumptions 1-4 and 7 a set of cutoffs k^i for player i corresponds to a strategy which is an optimal response if and only if: (a) if j and l are adjacent active actions with j < l then (i) ∆V^i(j, l, k^i_l) = 0 and (ii) ∆V^i(j, j′, k^i_l) ≤ 0 for any inactive j′ with j < j′ < l; (b) if j is the least active action then ∆V^i(0, j, 0) ≥ 0; (c) if j is the greatest active action then ∆V^i(j, m(i), 1) ≤ 0. These conditions are equivalent to (5).
Condition (a)(i) simply expresses the fact that the player must be indifferent at the switch point between two active actions. Condition (a)(ii), (b) and (c) rule out payoffs being raised by using a currently inactive action. Note that if an action is not used then an active action must come into use at exactly the same cutoff.
The sufficiency of these conditions follows directly from single crossing (Assumption 7). If adjacent active actions j < l are indifferent at k^i_l then by single crossing j must be (weakly) better for all lower signals and l weakly better for all higher signals.
Iterating this argument yields that j is (weakly) better than all active actions in the range in which it is active. A similar argument implies that it is better than any inactive action in this range as well. The equivalence to (5) is shown in the Appendix.

Examples
This section presents three examples of the environment studied.
Example 1

Consider the coordination game in Figure 1. t_1 and t_2 will be referred to as the types or signals of players 1 and 2 respectively. Players privately observe their own type before choosing actions. Note that action 0 is dominant for low values of t_i and action 1 for high values.
t_1 and t_2 take values between 0 and 1 and are drawn from a distribution with joint distribution function F(t_1, t_2) and density function f.
If types are positively dependent, so that the higher a player's own type the higher the other player's type is likely to be, then there is at least one interior equilibrium in which player i plays action 0 below a cutoff k_i and action 1 above it; the pair (k_1, k_2) is determined by the pair of equations requiring each player's expected gain from action 1 over action 0 to be zero at his own cutoff. It is natural to ask whether players can learn to play such an equilibrium.
After the game has been played once, if each player can observe the action of the other he can compute the ex post difference in payoffs to actions 0 and 1.
In the case of player i this difference depends on his own type and on the action taken by the other player. At the cutoff, k_i, the difference in realized payoffs should have expectation zero ex ante. This suggests raising the cutoff if action 0 has a higher realized payoff than action 1. Since the observed signal, t_i, may not equal the equilibrium cutoff, this suggests revising the cutoff less if t_i is far from k_i, as behavior at t_i may be a poor guide to behavior at k_i. Beggs (2009) analyzed a formal model on these lines in the case of binary actions. A general model is presented in Section 7 of this paper.
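The heuristic can be sketched in a few lines of simulation. The payoff specification below (gain to action 1 of t_i + 0.5·a_j − 0.7, with independent uniform types) and the triangular weighting of nearby signals are illustrative assumptions, not the game of Figure 1; under them the symmetric equilibrium cutoff solves k + 0.5(1 − k) − 0.7 = 0, i.e. k* = 0.4.

```python
import random

random.seed(1)
k = [0.5, 0.5]            # current cutoffs of players 1 and 2
h = 0.3                   # width of the 'similarity' window

for n in range(1, 200_001):
    t = [random.random(), random.random()]            # private signals
    a = [1 if t[i] >= k[i] else 0 for i in range(2)]  # threshold play
    alpha = 2.0 / (n + 100)                           # decreasing step size
    for i in range(2):
        j = 1 - i
        diff = t[i] + 0.5 * a[j] - 0.7                # realized payoff(1) - payoff(0)
        w = max(0.0, 1.0 - abs(t[i] - k[i]) / h)      # weight by closeness to cutoff
        # raise the cutoff when action 0 did better (diff < 0), lower it otherwise
        k[i] = min(1.0, max(0.0, k[i] - alpha * w * diff))

print([round(x, 2) for x in k])   # both cutoffs should be near k* = 0.4
```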
Two features of this example are worthy of note: firstly that F_j(t_j | t_i) is decreasing in t_i, secondly that the gain to playing action 1 rather than action 0 is increasing in own type and in the action played by the other player, that is, payoffs are supermodular. These assumptions guarantee that there is an equilibrium in which players' actions are increasing in their type and so their strategies can be characterized by a cutoff. They will also be of use in the convergence analysis.

Example 2

Figure 2
Each player receives a signal t_i, which has a positive continuous density on [0, 1] and is independent of the other player's signal. θ_i = λ(t_i − 1/2), where λ is a large positive parameter. Note that as in the previous example the gain from switching from a lower action to a higher action is increasing in the other player's action and in the player's own signal. Signals are independent, so beliefs about the other player's type are unaffected by the signal. Again there will be an equilibrium in which actions are increasing functions of the players' signals.
It is straightforward to see that, for large enough λ, each player will choose action 0 if their signal is low, action 1 for medium values and action 2 if it is high. As in the previous example, equilibrium can be specified by two thresholds for each player: at k^i_1 player i switches from action 0 to action 1 and at k^i_2 from action 1 to action 2.
In a general game, not all actions need be played. For example a player might play action 0 for low signals and then switch directly to action 2 rather than playing action 1 for intermediate values. The structure of the game here rules this out. Note that action 0 is dominant for low signals and action 2 for high signals. In addition the gain to raising the action is decreasing in the level of the action: the gain from switching from 1 to 2 is less than the gain from switching from 0 to 1, whatever the signal received or the action taken by the other player. These assumptions will also play a role in some of the convergence analysis.

Example 3 (Global Games)
Consider the following modification of Example 1, in which payoffs depend on a state of nature θ. As in Example 1, there will exist an equilibrium in cutoff strategies. If θ and the action of the other player are observed after the game is played, cutoffs can be revised as outlined in Example 1. This is an example of the kind of game studied in the literature on global games (see for example Morris and Shin (2003)).

Normal Form Learning
The rest of the paper considers learning. It is assumed that the game outlined in Section 2 is played repeatedly over periods n = 1, 2, 3, . . . with signals drawn independently across periods.
The equilibrium conditions in the previous section suggest a simple model of normal-form learning. Consider the following algorithm with k(n) the value of k in period n:

k(n + 1) = Π_Σ(k(n) + α_n G(k(n))), (12)

where G^i_j(k) = −f_i(k^i_j)∆V^i(j − 1, j, k^i_j) and Π_Σ denotes orthogonal projection onto Σ.
In this algorithm each player adjusts the cutoff between neighboring actions up or down according as the payoff difference between them is negative or positive (a higher cutoff meaning the lower action is played until a higher signal is reached). α_n > 0 measures the degree of adjustment. The term f_i(k^i_j) implies that players adjust cutoffs less at signals which occur less often, perhaps because this is less urgent. The raw procedure may lead to non-monotone cutoffs, so it is assumed that players instead pick the nearest point which is monotone, applying the projection Π_Σ, which is the product of the projections onto the individual Σ_i.
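The projection onto Σ_i, the nearest monotone cutoff vector in [0, 1], can be computed exactly: take the nondecreasing isotonic regression of the cutoffs by the standard pool-adjacent-violators algorithm and clip the result to [0, 1]. The sketch below is a textbook construction, not code from the paper:

```python
def project_monotone(x):
    """Euclidean projection of x onto {k : 0 <= k_1 <= ... <= k_m <= 1}."""
    # Pool adjacent violators: merge adjacent blocks whose means are out of order.
    vals, wts = [], []
    for v in x:
        vals.append(float(v))
        wts.append(1)
        while len(vals) > 1 and vals[-2] > vals[-1]:
            w = wts[-2] + wts[-1]
            vals[-2] = (wts[-2] * vals[-2] + wts[-1] * vals[-1]) / w
            wts[-2] = w
            vals.pop()
            wts.pop()
    out = []
    for v, w in zip(vals, wts):
        out.extend([v] * w)
    # Clipping the isotonic solution to [0, 1] preserves optimality.
    return [min(1.0, max(0.0, v)) for v in out]

print(project_monotone([0.5, 0.2, 0.9]))   # -> [0.35, 0.35, 0.9] (up to rounding)
```

The first two entries violate monotonicity and are pooled to their average; entries already in order are left unchanged.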
In this form it is clear that each player is essentially using a projected gradient algorithm since G^i(k) = ∇W_i(k). In the case where the players are playing a potential game this algorithm can be shown to converge to equilibrium. The game is a potential game if there is a function Υ and functions φ_i, i = 1, …, p, such that for each i

U_i(a, t) = Υ(a, t) + φ_i(a_{−i}, t).

In other words all players are maximising a common objective function plus a function independent of player i's choices. Note that φ_i does not affect ∂W_i/∂k^i_j. It is straightforward to check that Example 1 is a potential game.
If one lets Ῡ(k) be the expected value of Υ as a function of cutoffs (cf. the definition of W_i) then the algorithm can be written as

k(n + 1) = Π_Σ(k(n) + α_n ∇Ῡ(k(n))).

In other words, the overall learning process can be regarded as a projected gradient algorithm. It is well known that such processes converge to stationary points of Ῡ under mild assumptions, and under single crossing these are equilibria.
Theorem 1 Under Assumptions 1-5 and 7, if the game is a potential game and α_n = M for all n, then if M is sufficiently small, any limit point of (12) is an equilibrium. In particular, if equilibrium is unique then (12) converges to it.

This is a standard result for projected gradient algorithms. The proof can be found in the Appendix. One can also allow the step size (α_n) to be variable as in the next section. This is of course a model of normal-form learning. In particular it assumes that the cutoffs, or equivalently the strategies, of other players are observed, so that expected payoffs can be calculated. The next section considers an extension of this model to the case where cutoffs are unobserved.

Unobserved Cutoffs
If cutoffs are unobserved the rule of the previous section is not feasible. The adjustment rule requires the cutoffs, and so the strategies, of the other agents to be known. A natural idea is to estimate expected payoffs by the realized payoffs in any period.
The actual signal received may, however, be far from the cutoff whose payoffs it is desired to adjust. Beggs (2009) suggested, in the case of binary actions, circumventing these difficulties by using the rule

k^i_j(n + 1) = k^i_j(n) − (α_n/h_n) K((k^i_j(n) − t_i(n))/h_n) ∆U^i_j(n), (14)

where ∆U^i_j(n) is the realized difference in payoffs between actions j − 1 and j at time n and t_i(n) is the signal received by player i in period n. Denote this rule in vector form by k^i(n + 1) = h(k^i(n)). K is a weighting or kernel function, which determines how much weight is given to the current observation.
If the signal received is far from the cutoff being adjusted, then little weight will be attached to it. As noted this rule can be interpreted as reflecting an idea of 'similarity': cutoffs are adjusted more if the signal received is close or 'similar' to them. Such a system is an example of passive stochastic approximation, suggested first by Härdle and Nixdorf (1987).
In terms of the examples in Section 4, the proposed learning rule requires the player to observe the realized payoffs for all his actions at stage n. That is, the row player, say, observes the payoffs in the column corresponding to the action chosen by the other player. In Examples 1 and 2, he can calculate these if he knows the form of the payoff matrix and observes the action played by the other player, since he knows his own type each period. In Example 3 this will be the case if he in addition observes the state of nature, θ.
K will be assumed to satisfy the standard assumptions for a kernel:

Assumption 8 K is a bounded, continuous, non-negative function which is symmetric about zero, has compact support and satisfies ∫ K(s) ds = 1.

K is assumed the same for all players to simplify the notation but this is not essential. α_n and h_n are adjustment parameters. α_n represents the extent to which the current cutoff is adjusted and plays the usual role in stochastic approximation: it must tend to zero to eliminate the influence of noise, but not so quickly that the system can become stuck far from the optimum. h_n controls the extent to which weight is put on observations far from the cutoff. As n becomes large, h_n becomes small, putting more and more weight near to the cutoff.
They will be assumed to satisfy

Assumption 9 α_n > 0, h_n > 0, h_n → 0, ∑_n α_n = ∞ and ∑_n α_n²/h_n < ∞.

If cutoffs are adjusted independently then random shocks may mean that they need not be monotonic. To circumvent this it is assumed the agent chooses the nearest monotonic cutoffs:

k^i(n + 1) = Π_{Σ_i}(h(k^i(n))). (15)

The projection may be thought a little unnatural and assumptions will be made later which imply that it is not used asymptotically.
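A minimal sketch of the resulting passive stochastic-approximation scheme, for a single player with three actions. The payoff differences (∆U_1 = t − 0.3 plus noise, ∆U_2 = t − 0.7 plus noise) and the Epanechnikov kernel are illustrative assumptions; under them the optimal cutoffs are 0.3 and 0.7. The bandwidth exponent 1/4 lies in the (1/5, 1/2) range later imposed in Assumption 12.

```python
import random

random.seed(2)

def kernel(s):
    # Epanechnikov kernel on [-1, 1]
    return 0.75 * (1.0 - s * s) if abs(s) < 1.0 else 0.0

k = [0.5, 0.5]                      # cutoffs for switching 0 -> 1 and 1 -> 2
c = [0.3, 0.7]                      # illustrative indifference points
for n in range(1, 100_001):
    t = random.random()             # the player's signal this period
    alpha = 1.0 / (n + 50)          # step size
    h = (n + 50) ** -0.25           # shrinking bandwidth
    for j in range(2):
        # realized (noisy) payoff gain from action j+1 over action j
        dU = t - c[j] + random.choice([-0.1, 0.1])
        k[j] -= (alpha / h) * kernel((k[j] - t) / h) * dU
    # nearest monotone cutoffs in [0, 1]
    if k[0] > k[1]:
        k[0] = k[1] = 0.5 * (k[0] + k[1])
    k = [min(1.0, max(0.0, v)) for v in k]

print([round(v, 2) for v in k])     # should approach [0.3, 0.7]
```

Only periods with signals inside the shrinking kernel window move each cutoff, which is the 'similarity' weighting in action.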
Ignoring the projection, (15) can be rewritten as a modified version of (14):

k^i_j(n + 1) = k^i_j(n) + α_n [G^i_j(k(n)) + u^i_j(n) + b^i_j(n)], (16)

where G^i_j(k) = −f_i(k^i_j)∆V^i(j − 1, j, k^i_j). The term f_i(k^i_j) reflects the fact that cutoffs are adjusted only when the signals are close to k^i_j. Writing Y^i_j(n) = −h_n^{-1} K((k^i_j(n) − t_i(n))/h_n) ∆U^i_j(n), u^i_j(n) is a noise term given by

u^i_j(n) = Y^i_j(n) − E[Y^i_j(n) | F_{n−1}], (17)

which satisfies

E[u^i_j(n)² | F_{n−1}] ≤ M/h_n (18)

for some constant M, where F_{n−1} denotes the history of the process up to time n. b^i_j(n) is a bias term given by

b^i_j(n) = E[Y^i_j(n) | F_{n−1}] − G^i_j(k(n)), (19)

reflecting the fact that observations at signals other than the current cutoff are used for updating.
In vector form (16), with the projection reinstated, can be written as

k(n + 1) = Π_Σ(k(n) + α_n [G(k(n)) + u(n) + b(n)]), (20)

where G(k), u(n) and b(n) collect the components G^i_j(k), u^i_j(n) and b^i_j(n). Standard results, see Kushner and Yin (1997), show the asymptotic properties of (20) are closely related to those of the differential equation

k̇ = G(k) + L_Σ, (23)

where L_Σ is an impulsive term which is non-zero only when the process is on the boundary of Σ and serves to keep the process within Σ.
Equivalently they are related to those of the projected differential equation

k̇ = π(k, G(k)), (24)

where π(k, v) projects v onto the tangent cone T_Σ(k) to Σ at k. The latter is the closure of the set of feasible directions at k and is the polar of the normal cone to Σ at k: T_Σ(k) = {y : yᵀx ≤ 0 for all x ∈ N_Σ(k)} (see Rockafellar and Wets (1998) Chapter 6).

A solution to (24) is an absolutely continuous function such that (24) holds
for almost all t. Dupuis and Nagurney (1993) show the equivalence of the formulations (23) and (24) and prove the uniqueness of solutions (existence follows from earlier results, for example Henry (1973)).
Lemma 3 Under Assumptions 1-5 and 7-9 the set of limit points of k(n) under (20) is almost surely a compact, connected set invariant under the flow of (24), or equivalently of (23). Moreover the limit set is contained in the internally chain recurrent set of (24).
The proof is in the Appendix and is immediate from standard results. A set is internally chain recurrent (or transitive) if it contains no proper attractor (see Benaïm (1999)).
The rest points of (24) are precisely those of the underlying game: Lemma 4 k * is a stationary point of (24), or equivalently (23), if and only if it is an equilibrium of the game.
This follows from Lemma 1 of Dupuis and Nagurney (1993), for example, who note that the zeros of (24) correspond to solutions of the variational inequality V I(Σ, −G), that is by Lemma 1 of this paper are equilibria of the game.
The next section provides conditions under which the limit set of the learning model is contained in the set of equilibria of the game.

Convergence Results
This section provides the principal convergence results. In Section 7.1 convergence in supermodular games is studied. In Section 7.2 potential games are examined.

Supermodular Games
As noted in the previous section the convergence of the stochastic algorithm is closely related to the asymptotic behavior of the deterministic system

k̇ = G(k) + L_Σ, (26)

where L_Σ is a reflection term. In the absence of the latter term, this system defines a cooperative dynamical system if the game is supermodular, that is,

∂G_j/∂k_l ≥ 0 for all j ≠ l,

using single indices for simplicity. Equivalently the Jacobian matrix DG has non-negative off-diagonal entries.
This follows since

G^i_j(k) = −f_i(k^i_j)∆V^i(j − 1, j, k^i_j),

which is increasing in others' cutoffs (and independent of a player's own other cutoffs). If other players raise their cutoffs the benefit to raising own cutoffs is increased if payoffs are supermodular. If payoffs are strictly supermodular then the Jacobian matrix DG is in addition irreducible since then the benefit to increasing cutoffs is strictly increasing in others' cutoffs.
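Cooperativity can be checked numerically in a two-player, binary-action case. With independent uniform types and the illustrative supermodular gain ∆U = t_i + λa_j − c (so that ∆V(0, 1, k_i) = k_i + λ(1 − k_j) − c), the mean field is G_i(k) = −(k_i + λ(1 − k_j) − c) and the off-diagonal Jacobian entries equal λ > 0. This example is an assumption for illustration, not a game from the paper:

```python
# Mean field for a 2-player, binary-action example with uniform types:
# raising the opponent's cutoff raises the benefit of raising one's own
# cutoff, so the off-diagonal Jacobian entries are positive (cooperative).
lam, c = 0.5, 0.7

def G(k):
    return [-(k[0] + lam * (1.0 - k[1]) - c),
            -(k[1] + lam * (1.0 - k[0]) - c)]

# Finite-difference check of dG_1/dk_2 and dG_2/dk_1.
eps = 1e-6
k = [0.4, 0.4]
d12 = (G([k[0], k[1] + eps])[0] - G(k)[0]) / eps
d21 = (G([k[0] + eps, k[1]])[1] - G(k)[1]) / eps
print(round(d12, 3), round(d21, 3))   # both equal lam = 0.5
```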
Projection may not preserve the cooperative nature of the system. If however the vector field G points inward on the boundary then the projection term can be dropped so the limiting differential equation is cooperative. In addition the projection is almost surely not used for large n, removing a less attractive feature of the algorithm.

It will be assumed that each player's lowest action (0) is dominant if his signal is low enough and his highest action (m(i) for player i) is dominant if his signal is high enough:
Assumption 10 There exist t̲ > 0 and t̄ < 1 such that for all i, (a) if t_i < t̲ then action 0 is strictly dominant for player i and (b) if t_i > t̄ then action m(i) is strictly dominant for player i.

It will also be assumed that actions display a diminishing-returns (or concavity) property, that is, the gain to raising actions is decreasing in the level of the action.
Assumption 11 For all i, ∆U^i(j − 1, j, a_{−i}, t) is strictly decreasing in j for all a_{−i} and t.
These two assumptions guarantee that the vector field defined by G points inward on the boundary and in particular equilibria cannot occur there. Assumption 10 guarantees that action 0 must be played if signals are low enough and the highest action must be played if they are high enough. Cutoffs are therefore bounded away from 0 and 1. Assumption 11 guarantees that it can never be desirable to leap from (say) action 1 to 3 and omit action 2. If switching from 1 to 2 would lower payoffs then Assumption 11 implies that switching from 2 to 3 certainly would. It follows that the monotonicity constraints are never binding.
These assumptions are satisfied in Example 2.
Lemma 5 Under Assumptions 10-11 L Σ is identically zero in (26) and the limit set of any trajectory of (24) lies strictly in the interior of Σ. In particular, any equilibrium lies strictly in the interior of Σ.
The other ingredient in the convergence result is the recent paper of Benaïm and Faure (2012). They consider, inter alia, the Robbins-Monro algorithm

x(n + 1) = x(n) + α_n [H(x(n)) + u(n + 1)],

where H is globally Lipschitz and u is a martingale difference sequence, with ∑_n α_n = ∞ and ∑_n α_n² < ∞. They show that x(n) cannot converge to a linearly unstable set (more precisely a normally hyperbolic repulsive set) Γ of the differential equation

ẋ = H(x),

provided that the noise is non-degenerate in a neighborhood U of Γ. More precisely they assume that there is a continuous mapping Q(x) from U to the set of positive-definite matrices such that for x(n − 1) ∈ U,

E[u(n)u(n)ᵀ | F_{n−1}] = Q(x(n − 1)).

They also assume that E[‖u(n)‖^{2p} | F_{n−1}] is almost surely bounded for some p > 1.
They prove their result by using a diffusion approximation. They apply the result to the convergence of stochastic fictitious play in supermodular games.
This result is significantly more general than earlier results, for example of Pemantle (1990) who considered only isolated equilibria. It is also easier to adapt to the current model.
In the current context

k(n + 1) = k(n) + α_n G(k(n)) + α_n (u(n) + b(n)), (34)

where b(n) has non-zero mean. Moreover u(n) has unbounded variance (see (18)). The results of Benaïm and Faure (2012) cannot therefore be directly applied. It is, however, straightforward to modify them to apply to the current case by using a diffusion approximation for passive stochastic approximation (cf. Härdle and Nixdorf (1987)). In order to do so, one further assumption will be made:

Assumption 12 α_n = 1/n and h_n = 1/n^η, where 1/5 < η < 1/2.
This assumption guarantees that one can make a diffusion approximation using a suitable scaling and the bias term can be neglected. The details are somewhat technical and are relegated to the Appendix.
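As a quick check, these step sizes satisfy the usual stochastic-approximation summability requirements:

```latex
% \alpha_n = 1/n, \quad h_n = n^{-\eta}, \quad 1/5 < \eta < 1/2:
\sum_n \alpha_n = \sum_n \frac{1}{n} = \infty, \qquad
\sum_n \frac{\alpha_n^2}{h_n} = \sum_n n^{\eta - 2} < \infty \quad (\text{since } \eta < 1), \qquad
h_n \to 0 .
```

The tighter bounds 1/5 < η < 1/2 are what the diffusion approximation requires: the bandwidth must shrink fast enough that the bias can be neglected, but slowly enough that the kernel averages sufficiently many observations near each cutoff.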
Theorem 2 If the game is strictly supermodular with increasing beliefs, then under Assumptions 1-6, 8 and 10-12 the limit set of {k(n)} is almost surely an ordered arc of equilibria that is not linearly unstable. In particular if equilibrium is unique, play converges to it almost surely.
The assumption of strict supermodularity is used to guarantee that the vector field defined by G is irreducible as well as co-operative. In addition it guarantees that the noise in the process is non-degenerate in the neighborhood of any limit set of the process using variants on standard results for the variance matrix of kernel regression (cf. Schuster (1972)).
Corollary 1 Under the assumptions of Theorem 2, if in addition f and U i are real analytic for all i, then play converges almost surely to a Nash equilibrium.
This follows from a result of Jiang (1991) which states that a real-analytic cooperative vector field cannot have an ordered arc of equilibria.
If there is a unique equilibrium, then Benaïm and Faure (2012)'s results are not needed as it follows from Theorem 3.3 of Benaïm (2000) that the internally chain recurrent set consists solely of this equilibrium. One therefore does not need the precise step size assumptions in Assumption 12.
Theorem 3 If the game is strictly supermodular with increasing beliefs and has a unique equilibrium, then under Assumptions 1-5 and Assumptions 8-11, the limit set of {k(n)} is almost surely the equilibrium set of cutoffs.

Potential Games
In the case of potential games one obtains an immediate generalization of Theorem 1. As in the previous section let Ῡ(k) be the expected value of the potential. Let E denote the equilibrium set.
Theorem 4 Under Assumptions 1-5 and 7-9, if the game is a potential game with potential function Υ then the set of limit points of (20) is contained in the set of equilibria if Ῡ(E) has empty interior. In particular, if equilibrium is unique then play converges to it.
The condition on Ῡ(E) holds in particular if the set of equilibria is finite. It also holds if Ῡ is smooth enough, by Sard's theorem.

Conclusion
This paper has studied a simple model of learning in Bayesian games with monotone equilibria. Convergence has been shown in the case of supermodular and potential games. Learning in other Bayesian games is a promising topic for future exploration.
Proof of Lemma 1 Σ_i is the monotone non-negative cone bounded above by 1 (see for example Boyd and Vandenberghe (2004) exercise 2.33), so the result can be deduced from standard ones, but a proof is sketched. Writing y ∈ N_{Σ_i}(k^i) as a non-negative combination of the normals to the binding constraints implies that y ∈ N_{Σ_i}(k^i) if and only if (a) if j and j′ are adjacent active actions with j < j′ then (i) ∑_{l=j+1}^{j′} y_l = 0 and (ii) for any j″ with j < j″ < j′, ∑_{l=j+1}^{j″} y_l ≥ 0; (b) if j is the least active action, then ∑_{l=1}^{j} y_l ≤ 0; similarly (c) if j is the greatest active action, then ∑_{l=j+1}^{m(i)} y_l ≥ 0. This implies that the vector ∇W_i, with components (∇W_i)_j = −f_i(k^i_j)∆V^i(j − 1, j, k^i_j), lies in N_{Σ_i}(k^i) if and only if the conditions in the Lemma hold, since the one-step differences telescope: ∑_{l=j+1}^{j′} ∆V^i(l − 1, l, k) = ∆V^i(j, j′, k).

Proof of Theorem 1
According to Ruszczyński (2006) Proposition 6.1, if X is a compact convex set and h a continuously differentiable function whose gradient ∇h satisfies a Lipschitz condition with constant L on X (‖∇h(x) − ∇h(y)‖ ≤ L‖x − y‖ for all x, y ∈ X), then any limit point of the projected gradient algorithm with α_n = s for all n, where 0 < s < 1/L, is a stationary point. Under Assumptions 1-6, Ῡ is twice continuously differentiable in k (using the fundamental theorem of calculus and the dominated convergence theorem) and, since Σ is compact, its gradient satisfies a Lipschitz condition, so the result is immediate.

Proof of Lemma 3
Under Assumptions 1-5, G is Lipschitz (see (1)). Moreover, from (18), $E(\|u(n)\|^2 \mid \mathcal{F}_n) \le M/h_n$ for some constant M. It follows from Assumption 9 that $\sum_n \alpha_n u(n)$ is an $L^2$-bounded martingale and therefore convergent. Moreover, from (19), $b^i_j(n) \le M' h_n$ for some constant $M'$, using the change of variables $s = (k^i_j - t^i(n))/h_n$ and the properties of K in Assumption 8. Moreover, k(n) is bounded, so in particular its second moment is uniformly bounded, and $\sum_n \alpha_n^2 < \infty$ by Assumption 9. It follows from Theorems 5.2.3 and 5.2.5 of Kushner and Yin (1997) (see also Theorems 4.1, 5.3 and 6.11 of Benaïm (1999)) that the set of limit points of k(n) is almost surely a compact, connected set, invariant under the flow of (23) and internally chain recurrent.
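The structure of the recursion handled by these theorems can be sketched in one dimension. The mean field G, the noise scale and the step sizes below are illustrative, not the paper's (20):

```python
import numpy as np

# A projected Robbins-Monro recursion k(n+1) = clip(k(n) + a_n (G(k(n)) + u(n)))
# with a_n = 1/n, so sum a_n = infinity and sum a_n^2 < infinity. The toy
# Lipschitz mean field below has a unique zero at 0.5, playing the role of the
# invariant set of the limiting ODE.

rng = np.random.default_rng(0)
N = 200000
noise = rng.normal(scale=0.5, size=N)   # martingale-difference noise

def G(k):
    return 0.5 - k

k = 0.9
for n in range(1, N + 1):
    a_n = 1.0 / n
    k = float(np.clip(k + a_n * (G(k) + noise[n - 1]), 0.0, 1.0))
# k is now close to the zero of G, as the limit-set theorems predict
```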

Proof of Lemma 5
If $X = \{x : a_j \cdot x \le b_j \ \forall j\}$ is a polyhedral set, then by Theorem 6.46 of Rockafellar and Wets (1998) the tangent cone at $x \in X$ is $T_X(x) = \{y : a_j \cdot y \le 0,\ j \in J\}$, where J is the set of indices corresponding to the constraints which bind, that is, those for which $a_j \cdot x = b_j$.
Applying this to $\Sigma^i = \{k : -k_1 \le 0,\ k_1 - k_2 \le 0,\ \ldots,\ k_{m(i)-1} - k_{m(i)} \le 0,\ k_{m(i)} \le 1\}$ shows that the tangent cone of $\Sigma^i$ is the set of vectors y satisfying the corresponding inequalities for the binding constraints. Under Assumptions 10 and 11 the vector $y = -\Delta V_i(j-1, j, k^i_j)$ satisfies all these constraints with strict inequality, and hence belongs to the interior of the tangent cone to $\Sigma^i$: Assumption 10 implies (39) and (41), and Assumption 11 implies (40). Since $T_\Sigma(k) = T_{\Sigma^1}(k^1) \times \ldots \times T_{\Sigma^p}(k^p)$, G belongs to the interior of the tangent cone of Σ for all k, and so is inward-pointing on the boundary.
It follows that the projection π(k, G) always equals G, so L Σ vanishes. As G points inwards on the boundary any trajectory starting on the boundary of Σ immediately enters the interior, so any equilibria and limit sets lie in the interior.
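The polyhedral tangent-cone computation above can be sketched numerically. This is an illustrative three-threshold example; the point, matrices and test vectors are not taken from the paper:

```python
import numpy as np

# Tangent cone of Sigma_i = {k : 0 <= k_1 <= ... <= k_m <= 1}, written as
# {k : A k <= b}, following Rockafellar and Wets, Theorem 6.46: at k the
# tangent cone is {y : a_j . y <= 0 for every binding row j}.

def binding_rows(A, b, k, tol=1e-9):
    """Indices j with a_j . k = b_j (the binding constraints at k)."""
    return [j for j in range(len(b)) if abs(A[j] @ k - b[j]) <= tol]

def in_tangent_cone_interior(A, J, y):
    """True iff a_j . y < 0 for all binding j, i.e. y is interior to the cone."""
    return all(A[j] @ y < 0 for j in J)

# rows encode -k_1 <= 0, k_1 - k_2 <= 0, k_2 - k_3 <= 0, k_3 <= 1
A = np.array([[-1.0, 0.0, 0.0],
              [1.0, -1.0, 0.0],
              [0.0, 1.0, -1.0],
              [0.0, 0.0, 1.0]])
b = np.array([0.0, 0.0, 0.0, 1.0])

k = np.array([0.0, 0.4, 0.4])   # boundary point: rows 0 and 2 bind
J = binding_rows(A, b, k)
y = np.array([0.2, 0.1, 0.3])   # strictly inward-pointing direction
```

For such an interior y the Euclidean projection onto the tangent cone leaves y unchanged, which is exactly why π(k, G) = G and the reflection term vanishes.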

Proof of Theorem 2
From Lemma 3, the limit set of k(n) is almost surely an invariant set of (23). From Lemma 5, the vector field G points inward on the boundary, so any internally chain recurrent set of (23) lies strictly within Σ. It follows that the reflection term can be neglected asymptotically and that any limit set of k(n) is an internally chain recurrent set of the flow of G. From (20), G is cooperative, irreducible and $C^2$. Theorem 4.2 of Benaïm and Faure (2012) implies that any internally chain recurrent set of G is either an ordered arc of equilibria or linearly unstable (more precisely, a normally-repulsive hyperbolic set; see Definition 3.1 of Benaïm and Faure (2012)). The result will follow provided the result of Benaïm and Faure (2012) on nonconvergence to linearly unstable sets can be extended to this context. The rest of the proof checks this.
It is straightforward to check using (18) and (19) that under Assumptions 1-5 the required moment bounds hold. Benaïm and Faure (2012) show that if X(t) satisfies the following two hypotheses (Hypotheses 2.2 and 3.6 of Benaïm and Faure (2012), referred to below as Hypothesis 1 and Hypothesis 2), then almost surely X(t) does not converge to a linearly unstable set.
The discrete-time process (42) falls into this framework with x(n) = k(n), H = G and v(n) = u(n) + b(n), under Assumption 12. $x_n$ and $v_n$ will be used interchangeably with x(n) and v(n) for notational convenience.
Let $\alpha_n = 1/n$ and set $\tau_n = \sum_{l=1}^{n} \alpha_l$, and consider the associated interpolated process X(t). It will be shown that X(t) satisfies Hypothesis 1 and Hypothesis 2 when X(t) is derived from (42) and the limit set of X(t) lies in a neighborhood U of a linearly unstable set Γ.
It is straightforward to check that the arguments in Benaïm (1999) and Benaïm (2000) can be modified in the current context to show that Hypothesis 1 holds. This follows since $\sum_n \alpha_n u(n)$ is a martingale with conditional variance of order $\sum_n \alpha_n^2 / h_n$ and b(n) converges to zero.

Verification of Hypothesis 2
In the case of the Robbins-Monro process, Benaïm and Faure (2012) prove Hypothesis 2 by using a diffusion approximation. Let $m_n = \sup\{s \in \mathbb{N} : \tau_s \le n\}$, $s_n = m_{n+1} - m_n$, $t^n_j = \tau_{m_n+j} - \tau_{m_n}$ and $t^n = t^n_{s_n}$. Let $\{\mathcal{G}_n\}_n$ be the σ-algebra $\{\mathcal{F}_{m_n}\}_n$. Benaïm and Faure (2012) note that to prove Hypothesis 2 in the case of the discrete-time interpolated process $x_n$ it is enough to find a positive vanishing sequence $\{\gamma(n)\}$ and a $\mathcal{G}_n$-adapted process $Y_n$ satisfying conditions (i)-(iii), with $\gamma(n) = \sum_{s=1}^{s_n} \alpha^2_{m_n+s}$. Here the same sequence is considered (with H corresponding to G in the current paper) but with $\gamma(n) = \sum_{s=1}^{s_n} \alpha^2_{m_n+s} / h_{m_n+s}$. It is straightforward to check that Benaïm and Faure (2012)'s arguments for (i) can be easily modified to the current context. (iii) follows immediately from the definition of γ(n) in (55) together with (51) and (52).
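The timescale bookkeeping above can be illustrated with a small numerical sketch using $\alpha_l = 1/l$; the horizon N is an arbitrary choice:

```python
import numpy as np

# Bookkeeping for the diffusion approximation: with alpha_l = 1/l,
# tau_n = sum_{l<=n} alpha_l grows like log n, m_n = sup{s : tau_s <= n}
# grows geometrically, and s_n = m_{n+1} - m_n counts the steps in the n-th
# block of the interpolated timescale.

N = 200000
alpha = 1.0 / np.arange(1, N + 1)
tau = np.cumsum(alpha)                 # tau[s-1] = tau_s in 1-based notation

def m(n):
    """m_n = largest s with tau_s <= n."""
    return int(np.searchsorted(tau, n, side='right'))

m4, m5 = m(4), m(5)
s4 = m5 - m4                           # block length between times 4 and 5
```

Because $\tau_s \approx \log s$, successive blocks contain geometrically more steps, which is what makes the within-block noise approximately diffusive.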
To prove point (ii), Benaïm and Faure (2012) appeal to a central limit theorem for triangular arrays due to Hall and Heyde (1980), in a form given by Chow and Teicher (1998), Theorem 2 in Section 9.5. Theorem A1 (Hall-Heyde) For any $n \ge 1$, let $s_n$ be a positive integer and $(\Omega_n, \mathcal{F}_n, P_n)$ a probability space. Consider an increasing sequence of σ-algebras $\mathcal{F}^n_1 \subset \ldots \subset \mathcal{F}^n_{s_n} \subset \mathcal{F}^n$ and a family $\{y^n_j\}_{j=1,\ldots,s_n}$ of $\mathcal{F}^n_j$-adapted random variables satisfying conditions (a)-(d) below. Then the sequence $(y_n)_n$, with $y_n = \sum_{j=1}^{s_n} y^n_j$, converges in distribution to a random variable y defined on $(\Omega, \mathcal{F}, P)$ whose characteristic function is given by $E\left(e^{-(1/2)\, t' w t}\right)$.
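As a toy illustration of the shape of the theorem's conditions (not the array actually used in the proof), consider rows of scaled iid signs, for which the predictable quadratic variation is exactly 1 and the summands vanish uniformly:

```python
import numpy as np

# Triangular array y^n_j = xi_j / sqrt(s_n) with iid signs xi_j. The sum of
# conditional variances equals 1 exactly, and each summand is of size
# 1/sqrt(s_n), which is the pattern conditions (a)-(d) formalize.

rng = np.random.default_rng(1)

def row(s_n):
    xi = rng.choice([-1.0, 1.0], size=s_n)       # martingale differences
    y = xi / np.sqrt(s_n)
    w_n = np.sum(np.full(s_n, 1.0 / s_n))        # sum of conditional variances
    return y.sum(), w_n, np.abs(y).max()

total, w_n, max_term = row(10000)
# total is approximately N(0, 1) in distribution
```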
These hypotheses are verified below:

Verification of (a)
In the case of Benaïm and Faure (2012), the $y^n_j$ are a martingale, so (a) is automatic. Here Assumption 12 guarantees that the bias term can be asymptotically neglected, so (a) holds.
In more detail, under Assumption 6, f and U are $C^2$, so a second-order Taylor expansion yields that $E(\|b(n)\|) \le L h_n^2$ for some constant L, using Assumption 8. By Assumption 12, $\sum_{n=1}^{\infty} h_n^5$ is convergent, so an application of the Cauchy-Schwarz inequality yields that the right-hand side tends to zero, since $\gamma(n) = \sum_{j=1}^{s_n} \tilde{\alpha}^2_{m_n+j}$. The remainder of (a) is verified similarly.
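The interplay of step sizes and bandwidths can be checked numerically for one illustrative pair of sequences. The exponents below are assumptions chosen for illustration, not the paper's prescriptions:

```python
import numpy as np

# With the illustrative choices a_n = 1/n and h_n = n^{-1/4}, the terms of
# sum a_n^2 / h_n and sum h_n^5 decay like n^{-7/4} and n^{-5/4} (both
# summable), and a_n^2 / h_n^2 ~ n^{-3/2} tends to zero, matching the
# Assumption-9/12-type conditions used in the proof.

n = np.arange(1, 1000001, dtype=float)
a = 1.0 / n
h = n ** (-0.25)

t1 = a**2 / h        # ~ n^{-7/4}, summable
t2 = h**5            # ~ n^{-5/4}, summable
t3 = (a / h)**2      # ~ n^{-3/2}, tends to zero
```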

Verification of (b)
(b) is verified in the same way as in Benaïm and Faure (2012). Using the change of variables $s^i = (k^i_j(n) - t^i(n))/h_n$ and the boundedness of K and $\Delta^i_j U(n)$, and applying Hölder's inequality as in Benaïm and Faure (2012), yields a bound which tends to zero as $n \to \infty$, since $\gamma(n) \to \infty$ and $\alpha_n^2 / h_n^2 \to 0$ by Assumption 12. The remaining details are as in Benaïm and Faure (2012).
The same argument can be applied here provided it is shown that there is a continuous mapping Q(x) from U, where U is a neighborhood of the global attractor of X, to the set of positive-definite matrices such that, for $k_{n-1} \in U$, $Q_n(k(n)) := E(\tilde{v}(n)\tilde{v}(n)' \mid \mathcal{F}_{n-1}) \to Q(k(n))$ uniformly in n. All except the positive-definiteness of Q is clear.
From (a) the bias terms can be neglected and will be omitted in the expressions below.
Using the change of variables $s^i = (k^i_j(n) - t^i(n))/h_n$ yields that if $j = j'$ the second line is equal to the expression in (63) plus terms of order $h_n$ and higher. The third line is of order $h_n$ (using the same change of variables in both expectations). It follows that the diagonal elements converge uniformly to those of the matrix Q whose diagonal entries are given by (63). The diagonal elements of Q are therefore strictly positive and bounded away from zero in the interior of Σ, as the assumption of strict supermodularity implies that incremental payoffs are non-constant (and in particular not zero everywhere). The diagonal elements are also bounded above, by continuity.
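The change-of-variables limit underlying the diagonal entries can be checked numerically. The kernel (Epanechnikov), the density f and the evaluation point k below are illustrative choices:

```python
import numpy as np

# Numerical check of the change of variables s = (k - t)/h: for a compactly
# supported kernel K and a smooth density f,
#   (1/h) * int K((k - t)/h)^2 f(t) dt  ->  f(k) * int K(s)^2 ds   as h -> 0.

def K(s):
    return np.where(np.abs(s) <= 1.0, 0.75 * (1.0 - s**2), 0.0)

def f(t):
    return 6.0 * t * (1.0 - t)        # smooth density on [0, 1]

k = 0.4
t = np.linspace(0.0, 1.0, 2000001)
dt = t[1] - t[0]

def lhs(h):
    """Riemann-sum approximation of (1/h) * int K((k-t)/h)^2 f(t) dt."""
    return np.sum(K((k - t) / h) ** 2 * f(t)) * dt / h

s = np.linspace(-1.0, 1.0, 200001)
limit = f(k) * np.sum(K(s) ** 2) * (s[1] - s[0])   # f(k) * int K^2 = 1.44 * 0.6
```

The approximation error is of order $h^2$, which mirrors the "plus terms of order $h_n$ and higher" bookkeeping in the proof.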
It is straightforward, if tedious, to check (cf. Schuster (1972)) that for large enough n the off-diagonal elements of $Q_n$ converge to zero. If $j \ne j'$, the compact support assumption in Assumption 8 and the fact that any internally chain recurrent set is strictly in the interior of Σ imply that (62) tends to zero uniformly.
For covariances involving different players, using the change of variables $s^i = (k^i_j(n) - t^i(n))/h_n$ and $s^{i'} = (k^{i'}_{j'}(n) - t^{i'}(n))/h_n$ yields that both terms are of order $h_n$, so converge to zero uniformly. Now
$$w_n = \sum_{j=1}^{s_n} E\left(y^n_j (y^n_j)' \mid \mathcal{F}^n_{j-1}\right) = \sum_{j=1}^{s_n} \frac{\tilde{\gamma}^2_{m_n+j}}{\gamma(n)}\, \Pi_{n,j}\, E\left(\tilde{v}_n \tilde{v}_n' \mid \mathcal{F}_{n-1}\right) \Pi_{n,j}'$$
where $\Pi_{n,j} = \prod_{l=j+1}^{s_n} \left(I + \gamma_{m_n+l}\, DH(\Phi^n_l(x_{m_n}))\right)$, and if one defines $w^*_n$ by
$$w^*_n = \sum_{j=1}^{s_n} \frac{\tilde{\gamma}^2_{m_n+j}}{\gamma(n)}\, \Pi_{n,j}\, Q(x_{m_n+j})\, \Pi_{n,j}'$$
then $w_n - w^*_n$ converges uniformly to zero and one can apply the argument of Benaïm and Faure (2012) to $w^*_n$ to verify (d).

Proof of Theorem 3
The proof is exactly the same as that of Theorem 2, except that, as Theorem 3.3 of Benaïm (2000) implies that the only internally chain recurrent set consists solely of the equilibrium cutoffs, one does not need to appeal to Benaïm and Faure (2012) and those steps can be omitted.

Proof of Theorem 4
From Lemma 3 the set of limit points of (20) is a compact invariant subset of the flow of (24). Proposition 6.4 of Benaïm (1999) states that if Λ is an invariant set for a semi-flow Φ, V is a strict Lyapounov function for Φ and V (Λ) has empty interior then the set of internally chain-recurrent points is contained in Λ.
The result here follows provided it is shown that Ῡ is a Lyapounov function for the flow of (24) and that the equilibrium set $\Lambda = \{k : \pi(k, G) = 0\} = \{k : \pi(k, \nabla\bar{\Upsilon}) = 0\} = E$, where ∇ is the gradient operator. The argument follows that in Bianchi and Jakubowicz (2013). Let $\langle \cdot, \cdot \rangle$ denote the inner product.
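A minimal sketch of the Lyapunov computation, assuming the trajectory lies where the projection identity for closed convex cones applies (in the interior of Σ, Lemma 5 gives π(k, ∇Ῡ) = ∇Ῡ, so the expression simplifies further):

```latex
% Heuristic sketch: along a solution k(t) of (24),
\frac{d}{dt}\,\bar\Upsilon(k(t))
  = \langle \nabla\bar\Upsilon(k(t)),\, \dot k(t) \rangle
  = \langle \nabla\bar\Upsilon(k(t)),\, \pi(k(t), \nabla\bar\Upsilon(k(t))) \rangle
  = \lVert \pi(k(t), \nabla\bar\Upsilon(k(t))) \rVert^{2} \;\ge\; 0,
% using <v, pi(v)> = ||pi(v)||^2 for projection onto a closed convex cone,
% with equality exactly on E = {k : pi(k, grad) = 0}, so \bar\Upsilon
% increases strictly along trajectories outside E.
```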