Asynchronous Stochastic Approximation with Differential Inclusions

The asymptotic pseudo-trajectory approach to stochastic approximation of Benaïm, Hofbauer and Sorin is extended to asynchronous stochastic approximations with a set-valued mean field. The asynchronicity of the process is incorporated into the mean field to produce convergence results which remain similar to those of an equivalent synchronous process. In addition, this allows many of the restrictive assumptions previously associated with asynchronous stochastic approximation to be removed. The framework is then extended to a coupled asynchronous stochastic approximation process with set-valued mean fields. Two-timescale arguments are used here in a similar manner to the original work in this area by Borkar. The applicability of this approach is demonstrated through learning in a Markov decision process.

where $\{\alpha(n)\}_{n \in \mathbb{N}}$ is a positive, decreasing sequence, $\{V_n\}_{n \in \mathbb{N}}$ is a zero-mean martingale noise sequence, $\{d_n\}_{n \in \mathbb{N}}$ is a bounded sequence which converges to zero and $F(\cdot) : \mathbb{R}^K \to \mathbb{R}^K$ is a Lipschitz continuous mean field. Standard arguments (e.g. [3]) are then used to show that the limiting behaviour of the iterative process in (1.1) can be studied through the ordinary differential equation (ODE)

$$\frac{dx}{dt} = F(x). \tag{1.2}$$

Commonly known as the ODE method of stochastic approximation, originally proposed by Ljung [19], this technique has been extended by numerous authors, for example Benaïm [2], Benaïm, Hofbauer and Sorin [4], Borkar [10], Kushner and Clark [15] and Kushner and Yin [16], [17]. In particular Benaïm, Hofbauer and Sorin [4] have developed the approach so that under some weak criteria $\{x_n\}_{n \in \mathbb{N}}$ can be updated via a set-valued mean field, $F(\cdot)$. This allows for the limiting behaviour to be studied using the associated differential inclusion.

Standard stochastic approximations are not always applicable; an example which we examine in this paper is learning action values in a Markov decision process (MDP), which is also discussed by Konda and Tsitsiklis [14], Tsitsiklis [23] and Singh et al. [21]. In an MDP updates are made to a single random component at each iteration. Therefore we have a stochastic, asynchronous updating pattern, where a subset of an iterative process similar to (1.1) can be updated many times before the remaining components are selected for a single update. Based on this idea extensions to the standard theory have been examined, such as those by Kushner and Yin [16], [17]. Here however we follow the extension to asynchronous stochastic approximation provided by Borkar [8] and Konda and Borkar [13]. They show that when the iterative updates have a Lipschitz continuous mean field then, similarly to a standard stochastic approximation, the limiting behaviour can be studied via the associated differential equation,

$$\frac{dx}{dt} = M(t) F(x), \tag{1.3}$$

where $M(t)$ is a $K \times K$ diagonal matrix and the diagonal elements of $M(t)$ lie in the set $[0, 1]$ for all $t > 0$. This early work on asynchronous stochastic approximations has certain restrictions which limit its usability. In particular, many of the assumptions made in the work of Borkar [8] and Konda and Borkar [13] are given in implicit form and are difficult to verify in specific situations.
As with the initial results for a standard stochastic approximation, the results of Borkar [8] are limited to the case when the mean field, $F(\cdot)$, is a Lipschitz continuous function. The subsequent work by Benaïm, Hofbauer and Sorin [4] on set-valued mean fields leaves the natural question of whether similar results are possible for asynchronous stochastic approximations when a set-valued mean field is used. In addition, the ODE in (1.3) is non-autonomous and the scaling matrix $M(\cdot)$ is not explicitly defined. This makes the limiting behaviour more difficult to analyse, although some methods for verifying global convergence are outlined by Borkar [10].
Borkar [7] originally extended the stochastic approximation framework to two timescales. Since then Leslie and Collins [18] extended this idea to multiple timescales and Konda and Borkar [13] provide a first venture into the two-timescale asynchronous stochastic approximation. However, all of these only consider stochastic approximations when the mean field is Lipschitz continuous.
The aim of this work is to combine and generalise the results by Borkar [7], [8], Konda and Borkar [13] and Benaïm, Hofbauer and Sorin [4] to create a framework for single and two-timescale asynchronous stochastic approximations which is straightforward to use in practical applications. In this paper we show that, under a set of verifiable assumptions, the diagonal elements of $M(t)$ lie in the closed set $[\varepsilon, 1]$, for some $\varepsilon > 0$. The set $[\varepsilon, 1]$ can be combined with the mean field $F(\cdot)$ to form a set-valued mean field, $\bar{F}(\cdot)$, whose limiting behaviour can be studied via the associated differential inclusion using the results of Benaïm, Hofbauer and Sorin [4]. A natural benefit of using the differential inclusion framework is that $F(\cdot)$ can itself be set-valued, as this does not alter the analysis.
This paper is organised in the following manner: Section 2 reviews some previous results on stochastic approximation with differential inclusions and asynchronous stochastic approximations. In Section 3 we focus on the single-timescale asynchronous stochastic approximation. We state the main theorem before presenting the weak convergence results required for the proof. Section 4 examines the extension to a two-timescale asynchronous stochastic approximation process. Large parts of this section follow directly from the results in Section 3. In Section 5 we present an example of a learning algorithm for discounted reward Markov decision processes and obtain convergence results by applying the method shown in Section 4. This illustrates the ease with which this framework can be used. Finally, the paper concludes with a summary of the work. Throughout this paper many of the proofs are omitted from the main flow of text and are instead presented in an appendix.
2. Background. Throughout this paper we use two main ideas from the stochastic approximation literature. The first relates to the work by Benaïm, Hofbauer and Sorin [4] on stochastic approximation with differential inclusions, and the second concerns the asynchronous stochastic approximation framework introduced by Borkar [8]. We take the opportunity to review the pertinent features of their work in this section.
In what is to follow we use the standard concept of set multiplication: if $A \subset \mathbb{R}^{K \times K}$ is a set of $K \times K$ matrices and $B \subset \mathbb{R}^K$ is a set of vectors, both closed and convex, then let the multiplication of these sets be defined as

$$A \cdot B := \{ab;\ a \in A,\ b \in B\}.$$

Note that $A \cdot B$ is also closed and convex. This definition is still used if either or both of the sets $A$ and $B$ are single-valued. We also use the same concept when multiplying a constant by a set. That is, if $\alpha$ is a constant then define

$$\alpha \cdot B := \{\alpha b;\ b \in B\}.$$

However, in this latter case we often drop the '$\cdot$' notation for convenience.
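As a small worked illustration of set multiplication, consider the elementwise product $A \cdot B = \{ab;\ a \in A,\ b \in B\}$ of a set of diagonal matrices with a single vector. The particular sets below, and the symbols $\varepsilon \in (0, 1]$, $\omega_1$, $\omega_2$ and $b$, are assumptions chosen only for illustration.

```latex
\[
A = \bigl\{ \operatorname{diag}(\omega_1, \omega_2) \;;\; \omega_1, \omega_2 \in [\varepsilon, 1] \bigr\},
\qquad
B = \{ b \}, \quad b = (b_1, b_2)^{\top},
\]
\[
A \cdot B
= \bigl\{ (\omega_1 b_1,\; \omega_2 b_2)^{\top} \;;\; \omega_1, \omega_2 \in [\varepsilon, 1] \bigr\}.
\]
```

In this example $A \cdot B$ is the closed rectangle whose $i$th coordinate ranges between $\varepsilon b_i$ and $b_i$. It is convex and, because $\varepsilon > 0$, it contains the origin only when $b = 0$.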
2.1. Stochastic Approximation with Differential Inclusions. We begin by outlining the current convergence results for stochastic approximations with set-valued maps proved by Benaïm, Hofbauer and Sorin [4]. These results are heavily used in Section 3, most notably to prove our main result. Initially we provide a definition which outlines the class of set-valued mean fields we are able to use for stochastic approximation. These criteria are taken directly from the original work on stochastic approximations with differential inclusions by Benaïm et al. [4].
and a solution to (2.1) is an absolutely continuous mapping x : R → R K such that x(0) = x and for almost every t > 0, The flow induced by (2.1) is defined by, for any T > 0 and where d(•, •) is a distance measure on R K .
Many important properties of a dynamical system $\{\Phi_t(\cdot)\}_{t \geq 0}$ and the asymptotic pseudo-trajectories of the system are discussed by Benaïm, Hofbauer and Sorin [4]. Most important for the work here is that an asymptotic pseudo-trajectory to (2.1) will behave in a similar manner to the solutions of the differential inclusion and hence the limiting behaviour will be closely related.
We conclude this section by considering a standard iterative process in the form of (1.1) where the mean field F (•) : R K → R K is a stochastic approximation map.The following theorem states that under four assumptions a linear interpolation of {x n } n∈N (a function defined precisely in Section 3.1) is an asymptotic pseudo-trajectory to the differential inclusion (2.1).Hence the limiting behaviour of {x n } n∈N can be studied via this differential inclusion. where Then a linear interpolation of the iterative process {x n } n∈N given by (1.1) is an asymptotic pseudo-trajectory of the differential inclusion (2.1).This is a slight modification to a result stated by Benaïm, Hofbauer and Sorin [4,Proposition 1.3] to include the {d n } n∈N terms.It is trivial to verify that this will not alter any of the asymptotic results of the original work.

2.2. Asynchronous Stochastic Approximations. Now we fully introduce the asynchronous stochastic approximation notation used. A typical asynchronous stochastic approximation such as those studied by Borkar [8] fits the following framework. If $2^I$ is the power set of all possible updating combinations in $I$ then let $\bar{I}_n \in 2^I$ be the components of the iterative process $\{x_n\}_{n \in \mathbb{N}}$ updated at iteration $n$. Using a counter for state $i \in I$, we consider processes in which no component, $i$, in the asynchronous process needs to know the global counter, $n$, merely its own counter, $\nu_n(i)$. Let $F_i(\cdot)$, $x_n(i)$, $V_n(i)$ and $d_n(i)$ be the $i$th component of $F(\cdot)$, $x_n$, $V_n$ and $d_n$ respectively, for $i = 1, \dots, K$. We directly extend the notation used in (1.1) for an asynchronous stochastic approximation; let $F(\cdot)$ be a stochastic approximation map, then for $i = 1, \dots, K$ let Define the asynchronous step sizes, $\bar{\alpha}_n$, and the relative step sizes, $\mu_n(i)$, to be The asynchronous step sizes, $\bar{\alpha}_n$, are random step sizes (in contrast with the deterministic $\alpha(n)$ terms) whilst the relative step sizes, $\mu_n(i)$, are zero whenever the $i$th component of the iterative process is not updated. Clearly $\mu_n(i) \in [0, 1]$.
By letting $M_n$ be the $K \times K$ diagonal matrix of the $\mu_n(i)$ terms we can express the previous asynchronous stochastic approximation (2.3) in the more concise form This is a more familiar form for a stochastic approximation with a set-valued mean field. If $F(\cdot)$ is a stochastic approximation map in (1.1) then (2.4) differs from (1.1) only in that the step sizes in (2.4) are random and in the addition of the $M_{n+1}$ coefficient. Instead of thinking of $M_{n+1}$ as a coefficient of the step sizes we combine it with the mean field. Convergence of the error term, $d_{n+1}$, will be unaffected and, under a set of assumptions in Section 3.2, the noise term $M_{n+1} V_{n+1}$ will still satisfy the Kushner and Clark noise condition (2.2). Combining the $M_{n+1}$ term and the mean field term $F(\cdot)$ into a single set provides an intuitive method of rephrasing the stochastic approximation and leads to a set-valued mean field.
Proceeding with this intuition is not immediately straightforward since $M_n$ is time varying and can be zero infinitely often, in which case the mean field could be zero even when the original update term, $F(\cdot)$, is not. This would mean the limiting behaviour of the differential inclusion in the asynchronous stochastic approximation could be different to the synchronous case, where ultimately we wish to say that the two behave in the same manner in the limit. To avoid this scenario we follow Borkar [8] and consider the weak limit of the interpolations of $\{M_n\}_{n \in \mathbb{N}}$, which will always be bounded away from zero under some verifiable assumptions, given in Section 3.2.
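The asynchronous updating pattern can be sketched in a few lines of Python. This is an illustration only, not the authors' implementation: it assumes a hypothetical single-valued mean field $F(x) = x^* - x$, step sizes $\alpha(k) = 1/k$, Gaussian noise, $d_n = 0$, and an update set that is a single component drawn uniformly at random at each iteration. All names are illustrative.

```python
import random

TARGET = [1.0, -2.0, 0.5]  # hypothetical rest point x* of the mean field


def alpha(k):
    """Deterministic step size alpha(k) = 1/k (an illustrative choice)."""
    return 1.0 / k


def mean_field(x):
    """Hypothetical Lipschitz mean field F(x) = x* - x."""
    return [t - xi for t, xi in zip(TARGET, x)]


def async_sa(n_iter=20000, seed=0):
    """Asynchronous stochastic approximation with per-component counters."""
    rng = random.Random(seed)
    K = len(TARGET)
    x = [0.0] * K
    nu = [0] * K  # nu[i] counts how often component i has been updated
    for _ in range(n_iter):
        i = rng.randrange(K)  # assumed: singleton update set, uniform choice
        nu[i] += 1
        noise = rng.gauss(0.0, 1.0)  # zero-mean noise term
        # Component i uses its own counter nu[i], never the global counter.
        x[i] += alpha(nu[i]) * (mean_field(x)[i] + noise)
    return x, nu


if __name__ == "__main__":
    x, nu = async_sa()
    print("estimate:", x)
    print("local counters:", nu)
```

Each component steps with $\alpha(\nu_n(i))$, so components that are selected less often take larger steps when they are eventually updated. In this toy setting every component is selected a positive fraction of the time and the iterates settle near $x^*$.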
3. Asynchronous SA with Differential Inclusions. We begin by presenting the main result of this paper, which concerns the limiting behaviour of the asynchronous stochastic approximation in (2.4), before outlining the results required to prove this in the remainder of the section.
3.1. Main Result. Assume that $F(\cdot)$ is a stochastic approximation map and for all $n$ define $f_n \in F(x_n)$ by its component parts, $f_n(i)$, $i = 1, \dots, K$, such that Notice that if $\mu_{n+1}(i) = 0$ then we can select any $f_n(i) \in F_i(x_n)$. Then we can write the iterative process in (2.4) as (3.2) For some fixed $\varepsilon > 0$, let $\bar{M}_n$ be a sequence of $K \times K$ diagonal matrices with entries in the set $[\varepsilon, 1]$, for all $n$, to be defined in Section 3.3. We can again rewrite the iterative process in (3.2) as For general $k$, $\delta > 0$ let

$$\Omega^{\delta}_{k} := \bigl\{ \operatorname{diag}(\omega_1, \dots, \omega_k);\ \omega_i \in [\delta, 1],\ i = 1, \dots, k \bigr\} \tag{3.4}$$

and define

$$\bar{F}(x) := \Omega^{\varepsilon}_{K} \cdot F(x). \tag{3.5}$$

If $F(\cdot)$ is Lipschitz continuous, direct comparisons can be made between the mean field, $\bar{F}(x)$, and the analogous mean field $M(t)F(x)$ from (1.3) which is used by Borkar [8] and Konda and Borkar [13]. This provides the key insight into the new approach we take. Under the assumptions used in Section 3.2 the equivalent $M(t)$ values almost surely lie in $\Omega^{\varepsilon}_{K}$. By combining this with $F(x)$ we produce a differential inclusion which is more straightforward to study than a non-autonomous differential equation and naturally fits the stochastic approximation framework of Benaïm, Hofbauer and Sorin [4]. In addition, this idea naturally lends itself to examining a similar process for a set-valued mean field, as we proceed to do in this paper.

Equation (3.3) can be expressed in the form of a stochastic approximation with a set-valued mean field as in [4]: Let $\bar{\tau}_0 := 0$, $\bar{\tau}_n := \sum_{k=1}^{n} \bar{\alpha}_k$ be the timescale for the asynchronous updates. To allow this process to be analysed in continuous time, consider an interpolated version of the stochastic approximation (3.6), (3.7) x( Directly from Theorem 3.1 and [4, Proposition 3.27] we get the key result concerning the convergence of an asynchronous stochastic approximation process.

Corollary 3.2. If there is a globally attracting set, $A$, for the differential inclusion (3.8), and assumptions (A1)-(A5) are satisfied, then the iterative process (2.4) will converge to $A$.

3.2. Assumptions. Throughout this section we study the convergence properties of the iterative process (2.4). We make reference to the following assumptions, (A1)-(A5), all of which are either standard requirements for a stochastic approximation or can be verified prior to running the asynchronous stochastic approximation process. This is in contrast with the previous work on asynchronous stochastic approximations by Borkar [8] and Konda and Borkar [13].
(A1)(a) is a slight strengthening of the standard stochastic approximation boundedness assumption; however this is still a relatively mild condition. Methods to ensure that it is satisfied are discussed elsewhere, for example [12], [13] or [23]. A basic restriction is placed on $\{d_n\}_{n \in \mathbb{N}}$ in (A1)(b); in this form the sequence does not affect the asymptotic behaviour of the process. (A2)(a) is a standard assumption required for stochastic approximation, and (A2)(b) is a mild technical condition required to deal with the asynchronicity, which is also used by Borkar [8]. We have dropped the additional restriction on the step sizes used by Borkar, which severely restricts the possible choices of $\{\alpha(n)\}_{n \in \mathbb{N}}$. (A3) ensures that we can use the convergence results presented in Theorem 2.3 and is a standard assumption for stochastic approximations with a set-valued mean field.
Define Ī ⊂ 2 I as the set of all the possible combinations which have positive probability of occurring.As an example, if every element of I gets updated and it is known that Īn is a singleton for each n, then Ī = I.
Let F n be a sigma algebra containing all the information up to and including the n th iteration.That is (b) For all x ∈ C the transition probabilities P (In,I n+1 ) (x) form an aperiodic, irreducible, positive recurrent Markov chain over Ī and for all i ∈ I there exists an I ∈ Ī such that i ∈ I.
(A4)(a) assumes that the transitions between the updated elements in $\bar{I}$ are part of a controlled Markov chain. (A4)(b) is a straightforward assumption on this controlled Markov chain which can be verified prior to implementation, and which removes the need for some of the original technical assumptions made by Konda and Borkar [13]. In this previous work Konda and Borkar assume that every state is updated at a comparable rate in the limit, which cannot directly be verified prior to running the process. (A4)(c) is a condition which is required later to use a result from Ma et al. [20] on the convergence of stochastic approximation with Markovian noise.
for each c > 0.
An assumption similar to (A5) is used by Benaïm, Hofbauer and Sorin [4] to verify a condition for noise convergence and is similar to that used by Kushner and Clark [15]. We use this assumption only to show the noise term still satisfies the Kushner-Clark condition with the convergence given in Lemma 3.3; the proof is presented via two lemmas in Appendix A.2.

Lemma 3.3. Assume that (A2)(b), (A4) and (A5) hold. Then with probability 1, for all $T > 0$, where $\bar{\tau}_0 := 0$, $\bar{\tau}_n := \sum_{k=1}^{n} \bar{\alpha}_k$ and $m(t) := \sup\{k \geq 0;\ t \geq \bar{\tau}_k\}$.
Note that if Lemma 3.3 can be verified directly without (A5) then this assumption is redundant, and hence we only require (A1)-(A4) and Lemma 3.3 to hold.This approach is used in Section 4.

3.3. Weak Convergence of Asynchronous Updates. As discussed in the introduction, the key issue with asynchronous stochastic approximations is how to handle the interaction of the relative step sizes and the mean field. It is important to be able to bound the limit of the relative step sizes, $\mu_n(i)$, away from zero for all $i \in I$ in order to produce an asynchronous stochastic approximation mean field, $\bar{F}(\cdot)$, which will behave similarly to the synchronous mean field, $F(\cdot)$. However the relative step size of a state is zero whenever that state is not updated, hence it is not immediately clear that this is even possible. Despite this, it is sufficient that for any $T > 0$ an 'average' of $\mu_n(i)$ over length $T$ in the continuous time interpolation converges to a value which is bounded away from zero. In this section we prove that under (A2)(b) and (A4) this is indeed the case.
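The averaging idea can be checked numerically in a toy setting. The sketch below is an illustration under stated assumptions, not the construction used in the proofs: a single component is updated at each iteration, the updated component follows a hypothetical lazy cyclic Markov chain (aperiodic and irreducible), $\alpha(k) = 1/k$, and the asynchronous step size at iteration $n$ is taken to be the step size of the one updated component, so its relative step size is 1 and all others are 0. The function name and parameters are illustrative.

```python
import random


def relative_step_share(n_iter=20000, burn_in=10000, seed=1):
    """Share of interpolated time, after burn_in, in which each component
    has a non-zero relative step size (toy singleton-update setting)."""
    rng = random.Random(seed)
    K = 3
    nu = [0] * K          # local update counters
    weighted = [0.0] * K  # sum of (asynchronous step) * (relative step)
    total = 0.0           # interpolated time elapsed after burn_in
    state = 0
    for n in range(1, n_iter + 1):
        # Lazy cyclic walk: stay with probability 1/2, else move to the next state.
        if rng.random() < 0.5:
            state = (state + 1) % K
        nu[state] += 1
        step = 1.0 / nu[state]  # assumed asynchronous step size alpha(nu)
        if n > burn_in:
            weighted[state] += step  # relative step is 1 for the updated state
            total += step
    return [w / total for w in weighted]


if __name__ == "__main__":
    print(relative_step_share())
```

Although each relative step size is zero at most iterations, its weighted time average over a long window stays well away from zero for every component in this example. This is the behaviour that the weak convergence argument makes precise.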
For $T > 0$ the space Following the method used in [10] and [13], define $U$ to be the space of maps $u(\cdot) : \mathbb{R} \to [0, 1]$ with the coarsest topology which for all $T > 0$ leaves continuous the map, Hence $U$ is a space of $[0, 1]$-valued trajectories. This means that for any map defined on $U$ convergence to a limit point will be in the weak sense, along a subsequence. That is, a sequence of maps $\{\varphi_n(\cdot)\}_{n \in \mathbb{N}}$ such that $\varphi_n(\cdot) \in U$ for all $n$ is said to possess a limit point $\phi(\cdot) \in U$ if for fixed $T > 0$ there exists a subsequence $k(n)$, such that for any $h$ Many authors provide a more detailed discussion on weak convergence; for example [6], [10, Appendix A] or [11].

Now we extend the relative step sizes, $\mu_n$, to continuous time; for all $i$ We now expand upon the discussions in Sections 2.2 and 3.1 on producing a sequence of matrices $\{\bar{M}_n\}_{n \in \mathbb{N}}$. In order to use the differential inclusions framework described in Section 2.1 we need to define a sequence of diagonal matrices, $\{\bar{M}_n\}_{n \in \mathbb{N}}$, with diagonal entries which are always in the set $[\varepsilon, 1]$, for some $\varepsilon > 0$, and such that the terms converge to the same limit as the terms of $\{M_n\}_{n \in \mathbb{N}}$. Recall that $M_n$ is a diagonal matrix containing the $\mu_n(i)$ terms.
Fix $\varepsilon > 0$ taken from Lemma 3.4 and define a new function $v(\cdot) : \mathbb{R} \to \mathbb{R}^K$ such that For all $t > 0$ let $\bar{v}^n_i(t) := v_i(t + \bar{\tau}_n)$. Corollary 3.5 shows that, with respect to the topology of $U$, in the limit $u_i(t) \in [\varepsilon, 1]$ for almost every $t$ and similarly $v_i(t) \in [\varepsilon, 1]$ for all $t$. From this it is clear that $u_i(t)$ and $v_i(t)$ have the same limit point in $U$. That is, if $\tilde{u}(t)$ is a limit point of $\{\bar{u}^n(\cdot)\}_{n \in \mathbb{N}}$ then it is also a limit point of $\{\bar{v}^n(\cdot)\}_{n \in \mathbb{N}}$. Hence for any $T > 0$ there exists a subsequence However, the key interest here is in the convergence of $u_i(\cdot)$ and $v_i(\cdot)$. Following the reasoning of Borkar [8] and Konda and Borkar [13], it does not matter whether $\bar{u}^n_i(\cdot)$ and $\bar{v}^n_i(\cdot)$ converge directly or via a subsequence, as this does not affect the convergence of the continuous processes $u_i(\cdot)$ and $v_i(\cdot)$. Hence we can say that $u_i(\bar{\tau}_n + \cdot)$, $v_i(\bar{\tau}_n + \cdot)$ converge weakly to a limit point $\tilde{u}_i(\cdot)$, or equivalently, if $h(\cdot)$ is any bounded, continuous function then for all $i = 1, \dots, K$, Define $\bar{M}(t)$ to be the $K \times K$ diagonal matrix of the $v_i(t)$ terms and let $\bar{M}_n := \bar{M}(\bar{\tau}_n)$.
3.4. Proof of Theorem 3.1. We must verify that the four conditions of Theorem 2.3 hold for the stochastic approximation process in (3.6) to ascertain that $\bar{x}(t)$ is an asymptotic pseudo-trajectory of (3.8).

4. Two-timescale Asynchronous Stochastic Approximation. A useful extension of standard stochastic approximations is to two timescales. This concept was originally introduced by Borkar [7] and was later used by Leslie and Collins [18] for multiple timescales and by Konda and Borkar [13] for two-timescale asynchronous stochastic approximation. If we have a coupled pair of stochastic approximations where one system can be seen to update more aggressively than the other, then the aggressive process is always fully adjusted to the value of the other process. This is all controlled through the user's choice of step sizes in the stochastic approximation. The main result in this section is Corollary 4.8, which comes from combining Theorem 3.1 with the previous work of Konda and Borkar [13].
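The effect of the two step-size sequences can be seen in a deliberately simple synchronous, scalar sketch. Everything specific here is an assumption made for illustration and does not come from the paper: the mean fields $F(x, y) = 3 - x - y$ and $G(x, y) = 2x - y$, the step sizes $\alpha(n) = 1/n$ (slow) and $\gamma(n) = n^{-0.6}$ (fast), and the Gaussian noise level.

```python
import random


def two_timescale(n_iter=20000, seed=2, noise_sd=0.1):
    """Coupled stochastic approximation with a slow x and a fast y process."""
    rng = random.Random(seed)
    x, y = 0.0, 0.0
    for n in range(1, n_iter + 1):
        alpha = 1.0 / n    # slow step size
        gamma = n ** -0.6  # fast step size, so alpha / gamma -> 0
        f = 3.0 - x - y    # hypothetical slow mean field F(x, y)
        g = 2.0 * x - y    # hypothetical fast mean field G(x, y)
        # Both updates use the previous iterate (x, y).
        x_next = x + alpha * (f + rng.gauss(0.0, noise_sd))
        y_next = y + gamma * (g + rng.gauss(0.0, noise_sd))
        x, y = x_next, y_next
    return x, y


if __name__ == "__main__":
    print(two_timescale())
```

For fixed $x$ the fast dynamics are attracted to $y = 2x$, so the slow process effectively follows $dx/dt = 3 - 3x$. The iterates should therefore settle near $(x, y) = (1, 2)$, with $y$ tracking $2x$ along the way.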

4.1. Notation. In what is to follow we consider the extension of Theorem 3.1 to the two-timescale setting, with updates $\{x_n\}_{n \in \mathbb{N}}$ and $\{y_n\}_{n \in \mathbb{N}}$ on different timescales. Let $I$ be the set of individual elements of the $x$ process as in Section 3, and define $J$ similarly for the $y$ process. Let $K = |I|$ and $L = |J|$ so that for all $n$, $x_n \in \mathbb{R}^K$ and $y_n \in \mathbb{R}^L$. As in Section 3 let $\bar{I} \subset 2^I$ be the set containing all combinations of elements in $I$ which have a positive probability of being part of the asynchronous update, and define $\bar{J} \subset 2^J$ in the same manner for the $y$ process. At iteration $n$ let $\bar{I}_n \in \bar{I}$ and $\bar{J}_n \in \bar{J}$ be the updated components of each timescale respectively. Let each component of the two processes have a counter for the number of times it has been selected to be updated defined by, Here $\nu_n(i)$ is as in Section 3 and $\phi_n(j)$ has an analogous definition for the $\{y_n\}_{n \in \mathbb{N}}$ process. Let $\{V_n\}_{n \in \mathbb{N}}$, $\{U_n\}_{n \in \mathbb{N}}$ be martingale noise processes defined on $\mathbb{R}^K$ and $\mathbb{R}^L$ respectively, and $\{d_n\}_{n \in \mathbb{N}}$, $\{e_n\}_{n \in \mathbb{N}} \to 0$ as $n \to \infty$ similarly defined on $\mathbb{R}^K$ and $\mathbb{R}^L$ respectively. Let $V_n(i), d_n(i) \in \mathbb{R}$ be component $i$ of $V_n$ and $d_n$, and similarly let $U_n(j), e_n(j) \in \mathbb{R}$ be component $j$ of $U_n$ and $e_n$. As in the previous sections $\{\alpha(n)\}_{n \in \mathbb{N}}$, and now $\{\gamma(n)\}_{n \in \mathbb{N}}$, are positive, decreasing sequences of step sizes. Similar restrictions to those in (A2) will be placed on $\{\alpha(n)\}_{n \in \mathbb{N}}$, $\{\gamma(n)\}_{n \in \mathbb{N}}$ with an additional requirement for the two-timescale arguments to be valid; this will be made precise in Section 4.2. Finally, are set-valued maps, where $F_i(x, y)$ is the $i$th component of $F(x, y)$ and similarly for $G_j(x, y)$. For all $i = 1, \dots, K$ and $j = 1, \dots, L$ consider the following coupled process, (4.1) Notice that the only change to the first process from Sections 2 and 3 is that the mean field now depends on $y_n$ as well as $x_n$. It follows that the asynchronous and relative step sizes retain the same form. Recall these definitions and extend them for the $\{y_n\}_{n \in \mathbb{N}}$ process: As in Section 2.2 let $M_n$ be the $K \times K$ diagonal matrix of the $\mu_n(i)$ terms and similarly let $N_n$ be the $L \times L$ diagonal matrix of the $\sigma_n(j)$ terms. The coupled stochastic process (4.1) can be written more concisely as, (4.2) Finally, define the two timescales; let $\bar{\tau}_0 := 0$, $\bar{\tau}_k := \sum_{i=1}^{k} \bar{\alpha}_i$, $\bar{\rho}_0 := 0$, $\bar{\rho}_k := \sum_{i=1}^{k} \bar{\gamma}_i$. The division of time on the 'slow' timescale is given by the increments $\{\bar{\tau}_n\}_{n \in \mathbb{N}}$ and similarly for $\{\bar{\rho}_n\}_{n \in \mathbb{N}}$ on the 'fast' timescale. In a similar manner to the previous sections let $m_\alpha(t) := \sup\{k \geq 0;\ t \geq \bar{\tau}_k\}$, and $m_\gamma(t) := \sup\{k \geq 0;\ t \geq \bar{\rho}_k\}$.

4.2. Assumptions. We state the assumptions (B1)-(B6) used for the convergence results of the two-timescale algorithm (4.2). These are exactly analogous to (A1)-(A5) and are simply extended to accommodate the two-timescale framework. The exceptions to this are (B2)(c) and (B6) and the slight adaptations to (B3), which are in line with those used by Borkar [7] and Konda and Borkar [13]. In (B4) we have produced a single combined Markov chain instead of one for each of the $\{x_n\}_{n \in \mathbb{N}}$ and $\{y_n\}_{n \in \mathbb{N}}$ processes to present a clearer assumption.
(b) {d n } n∈N and {e n } n∈N are bounded sequences such that d n , e n → 0 as n → ∞. (B2) The following must be true for a(n) = α(n) and a(n and for all y ∈ D, F (•, y) : R K → R K is a stochastic approximation map.
(b) For all $z \in \{(x, y);\ x \in C,\ y \in D\}$, $G(z)$ is a stochastic approximation map.

The first and second assumptions are direct extensions of (A1) and (A2) to two timescales with the addition of (B2)(c), which is a standard two-timescale assumption used by Borkar [7]. Condition (B3)(a) is similar to (A3) for the 'slow' timescale; however, this must hold for all values of the 'fast' timescale. (B3)(b) is a similar condition for the 'fast' timescale.
Define H ⊂ Ī × J such that if I ∈ Ī and J ∈ J then (I, J ) ∈ H if and only if I and J have a positive probability of occurring simultaneously (at the same iteration).This means that H is the combination of elements across Ī × J which have positive probability of being updated at any particular iteration.At iteration n Hn ∈ H is taken to be the updated components across I and J.In addition, let z n = (x n , y n ) ∈ C × D and F n be a sigma algebra containing all the information up to and including the n th iteration.That is (b) For all z ∈ C × D the transition probabilities Q (Hn,H n+1 ) (z) form aperiodic, irreducible, positive recurrent Markov chains over H and for all i ∈ I, j ∈ J there exists a H, one of the following assumptions is satisfied: (a) For some q ≥ 2 n a(n) 1+q/2 < ∞, and sup for each c > 0.
(B4) and (B5) are straightforward extensions to (A4) and (A5) where in (B4) we have chosen to create a combined Markov chain over both $\bar{I}$ and $\bar{J}$ to present a clearer assumption. As a result of (B4), Lemma 3.4 gives that every element of $H$ (and hence every element of $I$ and $J$) is updated some minimum proportion of the time in the limit (see Section 3.3). Let $\varepsilon > 0$ be this minimum proportion. Define $\bar{F}(x, y) := \Omega^{\varepsilon}_{K} \cdot F(x, y)$ and $\bar{G}(x, y) := \Omega^{\varepsilon}_{L} \cdot G(x, y)$ analogously to the definition of $\bar{F}(\cdot)$ in (3.5), where $\Omega^{\varepsilon}_{K}$, $\Omega^{\varepsilon}_{L}$ are as defined in (3.4).

(B6) For all $x \in C$ the differential inclusion, has a unique globally asymptotically stable equilibrium, $\Lambda(x)$, where $\Lambda(\cdot) : \mathbb{R}^K \to \mathbb{R}^L$ is bounded, continuous and single-valued for $x \in C$.
The final assumption, (B6), is the asynchronous equivalent to the 'fast' timescale convergence criterion used by Borkar [7] and is used throughout this section. However, at the end of Section 4.4 we provide an alternative assumption, (B6′), which allows the 'fast' timescale to converge instead to a globally attracting set, as opposed to a continuous single-valued function.

4.3. Convergence of the 'Fast' Timescale. Many of the proofs in this section use arguments which are identical to the corresponding results of Section 3; where this is the case we do not go into detail and instead direct the reader to the appropriate result(s) to identify the method used.
Firstly, we require an additional lemma used by Konda and Borkar [13] which shows that the key two-timescale arguments made by Borkar [7] will continue to hold in the asynchronous case.

Proof. The proof of this result is identical to [13, Lemma 4.6]. The requisite assumptions are encapsulated in (B2) and the result of Lemma A.1, in Appendix A.1.
Again, we follow the method of Borkar [10] when examining the set-valued updates. Define $f_n$ and $g_n$ in the following manner; for $i = 1, \dots, K$, $j = 1, \dots, L$ let As in Section 3.1, if $\mu_{n+1}(i) = 0$ then we can select any $f_n(i) \in F_i(x_n, y_n)$ and similarly if $\sigma_{n+1}(j) = 0$ then we can select any $g_n(j) \in G_j(x_n, y_n)$. Here $f_n$ and $g_n$ represent the realised values of $F(x_n, y_n)$ and $G(x_n, y_n)$ respectively. We express (4.2) as, (4.3) Let $0_K$ be a zero vector of length $K$, and define Then we can express the coupled process in (4.3) as a single iterative process, This is in the same form as equation (2.4), which is examined throughout Section 3. The limiting behaviour of (4.4) can therefore be studied using the same method as in Section 3. Although assumptions (A1)-(A4) are embedded in (B1)-(B6), we remove the need for (A5) under (B1)-(B6) by immediately proving the corresponding result in Lemma 3.3 for (4.4).
Now we note that assumption (B4) allows us to use identical arguments to those in Lemma 3.4. Recall the definition of $\Omega^{\delta}_{k}$ in (3.4). Then there exists an $\varepsilon > 0$ such that $\Gamma_n \in \Omega^{\varepsilon}_{K+L}$. Let $\bar{z}(t)$ be the linear interpolation of (4.4) in the same manner as the single timescale case in (3.7). Fix $\varepsilon > 0$ and let

Lemma 4.3. Under assumptions (B1)-(B6), with probability 1, $\bar{z}(t) \in \mathbb{R}^{K+L}$ is an asymptotic pseudo-trajectory of the differential inclusion,

Proof. We have shown (A1)-(A4) hold for (4.4) and this is in the form of (2.4). Combining this and Lemma 4.2 we can use Theorem 3.1 to give the result immediately.
Let the linear interpolations of the two timescales in (4.3) be denoted by $\bar{x}(t)$ and $\bar{y}(t)$ respectively, analogously to the single timescale case in (3.7).

4.4. Convergence of the 'Slow' Timescale. Now since we have a function $F(\cdot, \cdot)$ which depends on two variables, but we are treating one of these as fully calibrated to the other, we have a slightly different framework to that of Benaïm, Hofbauer and Sorin [4]. Therefore we present a slight variation on their perturbed solution [4, Definition (II)]. Despite this we are still able to show that this is an asymptotic pseudo-trajectory to the desired differential inclusion. Hence the same convergence results still apply. To reduce notation, define $F_\Lambda(\cdot) : \mathbb{R}^K \to \mathbb{R}^K$ as $F_\Lambda(x) := F(x, \Lambda(x))$. Note that under (B3)(a) and (B6) $F_\Lambda(\cdot)$ is a stochastic approximation map. Following our previous notation, let

Definition 4.5. A continuous function $z : \mathbb{R}_+ \to \mathbb{R}^K$ is a jointly perturbed solution to the differential inclusion, (4.7) , for almost every $t > 0$, and for some bounded $d(t)$, $\delta(t) \to 0$, where

The key difference between a jointly perturbed solution and a perturbed solution is that in the original work of Benaïm, Hofbauer and Sorin [4] (and as in Section 3) the mean field depends on a single variable. In contrast to this, here the mean field depends on two variables. Hence in part (ii) we must allow for perturbations in both variables simultaneously instead of perturbing just the one.

Lemma 4.6. Under assumptions (B1)-(B6) a jointly perturbed solution of (4.7) is also an asymptotic pseudo-trajectory to the flow induced by (4.7).
Proof. The proof is identical to the proof of [4, Theorem 4.2], which establishes that a perturbed solution is an asymptotic pseudo-trajectory.
Define $\bar{M}(t)$ and $\bar{M}_n$ in an identical manner to the same terms in Section 3.4. Corollary 4.4 allows us to consider the updates on the 'slow' timescale of the coupled process in (4.3) given by, (4.8) where, as in Section 3.4, As with the single timescale framework let $\bar{x}(t)$ be a linear interpolation of (4.8) in the same manner as (3.7). That is,

Theorem 4.7. Under assumptions (B1)-(B6), with probability 1, $\bar{x}(\cdot)$ is an asymptotic pseudo-trajectory to the differential inclusion (4.7).
Proof. We show that $\bar{x}(\cdot)$ is a jointly perturbed solution of (4.7) and hence by Lemma 4.6 it is an asymptotic pseudo-trajectory of (4.7).
The proof is almost identical to the proof of [4, Proposition 1.3]. The differences come with our choice of $\delta(t) = \|\bar{x}(t) - x_{m(t)}\| + \|y_{m(t)} - \Lambda(x_{m(t)})\|$, the first term of which will converge to zero as in the original proof and the second of which converges to zero since $\|y_n - \Lambda(x_n)\| \to 0$ almost surely as a result of the convergence of the 'fast' timescale. Clearly from Definition 4.5 part (ii),

Remark 1. It should be clear that the methods in this section can be applied to a standard, synchronous, stochastic approximation with set-valued mean fields. In this case (B4) is trivially satisfied, (B2)(b) can be removed and the sets $\Omega^{\varepsilon}_{K}$, $\Omega^{\varepsilon}_{L}$ can be replaced by the single $K \times K$ and $L \times L$ identity matrices respectively. This means that $\bar{F}(x, y) = F(x, y)$ and similarly $\bar{G}(x, y) = G(x, y)$.
Remark 2. This framework allows for the 'fast' timescale to converge to a set of limit points. Assumption (B6) can be replaced with the following.

(B6′) For all x ∈ C the differential inclusion, has a globally attracting set, Λ(x), where Λ(•) : R^K → R^L is an upper semi-continuous set-valued map, such that for all x ∈ C, Λ(x) is compact, convex and non-empty. In addition, for all x ∈ C and for all, F(x, Λ(x)) is a convex set-valued map, i.e., for all λ, λ′ ∈ Λ(x),

Under (B6′) F_Λ(x) will still be a stochastic approximation map and, although the details of the arguments in Section 4.4 change slightly, the method remains almost identical.
5. An Application: Learning in a Markov Decision Process. In this section we provide an example where our two-timescale asynchronous stochastic approximation approach is needed. The algorithm is an actor-critic style algorithm based upon estimating rewards and playing an ε-greedy best response to these estimates. Similarly to Konda and Borkar [13] we make assumptions on the Markov decision process (MDP) and the coupled learning algorithm separately. These correspond to (B1)-(B6).
Firstly, we begin by outlining a suitable infinite horizon, discounted reward MDP described by the tuple ⟨S, A, P, r, β⟩. S is the state space of the MDP, A is the set of actions of the decision-maker (agent), and A(s) is the set of actions available to the agent in state s ∈ S. P represents the form of the stochastic transitions and in what is to follow we take P_{ss′}(a) to denote the probability of transitioning from state s ∈ S to s′ ∈ S when the agent has selected action a ∈ A(s). The reward to the decision-maker for selecting action a ∈ A(s) in state s is denoted by r(s, a), and β ∈ (0, 1) is the discount factor.
Let s_n ∈ S and a_n ∈ A(s_n) be the state and the action, respectively, selected by the decision-maker at iteration n. Assume that at every iteration the agent observes the state of the process and a noisy version of the reward received from the action they have chosen, denoted by R_n. If and we assume that R_n has a finite variance. Let K := |S| and ∆_{A(s)} represent the set of probability distributions over A(s). Then we denote the combination of K probability distributions as ∆_K := ∆_{A(1)} × . . . × ∆_{A(K)}. A strategy for state s ∈ S is denoted by π(s) ∈ ∆_{A(s)} and let π := (π(1), . . ., π(K)) ∈ ∆_K be a strategy over all states. π(s, a) is defined as the probability that action a is taken in state s. Players start with a strategy π_0; the MDP begins in a random state s_1 ∈ S and the decision-maker selects an action a_1 from π_0. The agent wishes to find a strategy, π, to maximise their expected discounted reward,

Define V^π(s) for all s ∈ S in the following manner, (5.1)

For all s ∈ S there exists a maximum value of V^π(s) for all π ∈ ∆_K [1]. Let π ∈ ∆_K be a strategy such that V^π(s) is maximal in (5.1) for all s ∈ S. π is known as an optimal strategy for the MDP. Again we use a subscript n on π(s, a), π(s) and π to represent the strategy of the player at iteration n.
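As an illustration of the quantity V^π(s), the following minimal sketch evaluates a fixed strategy by iterating the textbook discounted recursion V(s) ← Σ_a π(s, a) [r(s, a) + β Σ_{s′} P_{ss′}(a) V(s′)]. It is assumed here, not confirmed by the text, that this recursion agrees with (5.1). The function name `evaluate_policy`, the nested-list layout `P[s][a][s2]` and the stopping rule are hypothetical choices made only for this example.

```python
def evaluate_policy(P, r, beta, pi, tol=1e-12, max_iter=100_000):
    """Approximate V^pi for a finite discounted MDP by fixed-point iteration.

    P[s][a][s2] : transition probability from s to s2 under action a (assumed layout)
    r[s][a]     : expected reward for action a in state s
    beta        : discount factor in (0, 1)
    pi[s][a]    : probability of playing a in state s
    """
    K = len(r)
    V = [0.0] * K
    for _ in range(max_iter):
        V_new = []
        for s in range(K):
            total = 0.0
            for a, prob_a in enumerate(pi[s]):
                expected_next = sum(P[s][a][s2] * V[s2] for s2 in range(K))
                total += prob_a * (r[s][a] + beta * expected_next)
            V_new.append(total)
        # Stop once successive iterates agree in the sup norm.
        if max(abs(x - y) for x, y in zip(V, V_new)) < tol:
            return V_new
        V = V_new
    return V
```

Because β < 1 the iteration is a contraction in the sup norm, so the starting point does not affect the limit.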
To reduce the complexity of the notation we assume that every state has the same number of actions which is denoted by A := |A(s)| for any s ∈ S and we let m := KA.Note that having a different number of actions in each state does not affect the validity of this approach but does make the notation more cumbersome.
We begin by placing restrictions on the learning rates and the MDP.In this algorithm we select learning rates {α(n)} n∈N and {γ(n)} n∈N which satisfy (B2) and (B5)(a) and use two asynchronous counters, In addition we assume that the set of transition probabilities, {P ss ′ (a)} s,s ′ ,a , form an aperiodic, irreducible, positive recurrent Markov chain.Moreover, at every iteration when the MDP is in the state s we enforce that every action a ∈ A(s) is played with a non-zero minimum probability and that this holds for every s ∈ S. Therefore for all s ∈ S, a ∈ A(s) and n ≥ 0, π n (s, a) ≥ ε for some ε > 0.
Finally, before directly analysing the algorithm we present a method for verifying the global convergence of a standard differential inclusion in the form of (2.1).
Finding a Lyapunov function proves the global convergence of set-valued dynamical systems in the form of (2.1), a concept which is fully described by Benaïm, Hofbauer and Sorin [4].
Because we include the asynchronicity with the mean field, the differential inclusions associated with our algorithm will be in the form, We state slight modifications of a result of Konda and Borkar [13] and a result of Benaïm, Hofbauer and Sorin [5] to allow us to easily prove the convergence of differential inclusions in the form of (5.2). We do not state proofs for either of these as they are straightforward extensions of [13, Lemma 5.4] and [5, Theorem 3.10]. Let and K_s(π) be the A-vector of these terms for all a ∈ A(s). For all s ∈ S, a ∈ A(s). In addition, let ∇_π V(s) denote taking the partial derivative of V with respect to π.

Lemma 5.2. Let G be a vector field on

Theorem 5.3. Assuming that h′(x) is a stochastic approximation map and there exists a positive definite function Then W(•) is a Lyapunov function for (5.2) with attracting set Λ.
Using Lemma 5.2 and Theorem 5.3 we show the convergence of the following algorithm.
5.1. The Algorithm. This algorithm cannot be studied in the framework of Konda and Borkar [13] due to the Lipschitz continuity restriction they place on the mean fields of the coupled stochastic approximations. In this work we have relaxed this condition, allowing the study of processes which are based on the best response. Firstly, if {Q(s, a)}_{s,a} is the set of action values for an MDP, let b_s(Q) := arg max_{a∈A(s)} {Q(s, a)} be the best response set to {Q(s, a)}_{s,a} for state s ∈ S. Use the following coupled algorithm to estimate action values and the optimal strategy for all s ∈ S and a ∈ A(s), where V_n(s) = Σ_{a∈A(s)} π_n(s, a) Q_n(s, a). The action a_n is selected using an ε-greedy version of the strategy π_n. For all s ∈ S and a ∈ A(s) let . We must verify that (B1)-(B6) hold for this algorithm. We do not directly verify that (B1) holds for this algorithm, but as pointed out in Section 3.2 methods to do so are discussed elsewhere. Furthermore the choice of learning parameters verifies (B2) and the choice of mean fields in (5.3) and (5.4) immediately gives that (B3) and (B5) hold.
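The coupled scheme described above can be sketched in code. The updates (5.3) and (5.4) are not reproduced here, so the following is only an assumed form: the action values are moved towards R_n + β V_n(s_{n+1}) with a larger step size, and the strategy in the visited state is moved towards an element of the best response set b_s(Q_n) with a smaller step size. The step-size exponents, the per-pair and per-state visit counters, the mixing of π_n with the uniform distribution to obtain the ε-greedy strategy, and all names (`run_coupled_learning`, `fast_step`, `slow_step`) are hypothetical choices for illustration.

```python
import random


def run_coupled_learning(P, r, beta, n_steps, eps=0.1, noise_sd=0.1, seed=0):
    """Simulate an assumed asynchronous two-timescale best-response scheme."""
    rng = random.Random(seed)
    K, A = len(r), len(r[0])
    Q = [[0.0] * A for _ in range(K)]          # action-value estimates
    pi = [[1.0 / A] * A for _ in range(K)]     # strategy estimates
    visits_sa = [[0] * A for _ in range(K)]    # counter per state-action pair
    visits_s = [0] * K                         # counter per state
    s = rng.randrange(K)
    for _ in range(n_steps):
        # Assumed epsilon-greedy version of pi_n: every action has mass >= eps / A.
        probs = [(1.0 - eps) * p + eps / A for p in pi[s]]
        a = rng.choices(range(A), weights=probs)[0]
        reward = r[s][a] + rng.gauss(0.0, noise_sd)   # noisy reward R_n
        s_next = rng.choices(range(K), weights=P[s][a])[0]

        # Fast timescale: only the visited (s, a) component is updated.
        visits_sa[s][a] += 1
        fast_step = visits_sa[s][a] ** -0.6
        V_next = sum(p * q for p, q in zip(pi[s_next], Q[s_next]))
        Q[s][a] += fast_step * (reward + beta * V_next - Q[s][a])

        # Slow timescale: move pi(s) towards a point of the best response set.
        visits_s[s] += 1
        slow_step = 1.0 / (visits_s[s] + 1)
        best = max(Q[s])
        argmax = [b for b in range(A) if Q[s][b] == best]
        for b in range(A):
            target = 1.0 / len(argmax) if b in argmax else 0.0
            pi[s][b] += slow_step * (target - pi[s][b])

        s = s_next
    return Q, pi
```

Each strategy update is a convex combination of the current strategy and a point of the simplex, so π_n(s) remains a probability distribution throughout. Ties in the arg max are split uniformly, which selects one point of the convex hull of the best response set.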
For this algorithm we have J = {(s, a); s ∈ S, a ∈ A(s)} and Ī = {s; s ∈ S}. This gives that H = {((s, a), s); s ∈ S, a ∈ A(s)} and for simplicity we write H = {(s, a); s ∈ S, a ∈ A(s)}.
Following the notation of Section 4 we have that . This shows that (B4)(a) is satisfied. Again using the notation of Section 4 the set of transition probabilities are denoted, A consequence of (B4) from Appendix A.1 is that in the limit every state of the MDP is visited a minimum proportion of the time, η, for some η > 0. Similarly, by placing the restriction that every action is selected with probability at least ε for some ε > 0, every state-action pair is taken a minimum proportion of the time, η′, for some η′ > 0. Using the approach of Section 4 we do not explicitly need to know the values η and η′ as we verify convergence for every η, η′ > 0.
Finally, we need to verify (B6). Define and let Q^π(s) be the A-vector containing Q^π(s, a) for all a ∈ A(s). Let where V(s) = Σ_{a∈A(s)} π(s, a) Q(s, a). Let h_s(π, Q) be the A-vector of the h_{s,a}(π, Q) terms which means that h(π, Q^π) = Q^π. For fixed π ∈ ∆_K consider the differential inclusion is the unique asymptotically stable equilibrium to (5.5).
From this it follows that the values in {Q n (s, a)} s,a converge to the true action values for the strategy π.These values are Lipschitz continuous in π, which ensures (B6) holds.Hence, Theorem 4.7 holds and the linear interpolation of the iterative process in (5.4) is an asymptotic pseudo-trajectory to the differential inclusion for some η > 0. For a particular action a ∈ A(s), let π s,a (t) and πs,a (t) represent the individual components of π s (t) and πs (t) respectively, whilst π(t) and π(t) are the K × A matrices containing all of the π s,a (t) and πs,a (t) elements.
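The fixed-point property h(π, Q^π) = Q^π can be illustrated numerically. The sketch below assumes that h_{s,a}(π, Q) = r(s, a) + β Σ_{s′} P_{ss′}(a) V(s′) with V(s) = Σ_a π(s, a) Q(s, a), which is the standard form but is not confirmed by the text, and it integrates the plain ODE dQ/dt = h(π, Q) − Q by Euler steps. The asynchronous weighting present in (5.5) is omitted, so this is a simplification rather than the inclusion studied above. The names `h` and `integrate_fast_ode` are hypothetical.

```python
def h(P, r, beta, pi, Q):
    """Assumed mean field: one-step discounted backup of Q under strategy pi."""
    K, A = len(r), len(r[0])
    V = [sum(pi[s][a] * Q[s][a] for a in range(A)) for s in range(K)]
    return [
        [r[s][a] + beta * sum(P[s][a][s2] * V[s2] for s2 in range(K))
         for a in range(A)]
        for s in range(K)
    ]


def integrate_fast_ode(P, r, beta, pi, step=0.1, n_steps=2000):
    """Euler integration of dQ/dt = h(pi, Q) - Q for a fixed strategy pi."""
    K, A = len(r), len(r[0])
    Q = [[0.0] * A for _ in range(K)]
    for _ in range(n_steps):
        H = h(P, r, beta, pi, Q)
        Q = [[Q[s][a] + step * (H[s][a] - Q[s][a]) for a in range(A)]
             for s in range(K)]
    return Q
```

Under these assumptions each Euler step is a sup-norm contraction with factor at most 1 − step (1 − β), so the iterates approach the unique fixed point of h(π, •) from any starting point.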
With assumptions (B1)-(B6) satisfied all that remains is to show that the differential inclusion (5.6) has a globally attracting set. Corollary 4.8 will then provide the convergence result of the coupled algorithm in (5.3) and (5.4). We note that (5.6) is in the form of (5.2) and hence we use Theorem 5.3 to prove the global convergence. The second term here is zero since Σ_{a∈A(s)} ρ(s, a) V^π(s) = V^π(s) for any ρ(s) ∈ ∆(A(s)). The first term is clearly non-negative by the definition of the best response. Hence ⟨πs , K s (π)⟩ ≥ 0.
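The two-term argument above can be checked on a small example. The sketch assumes, hypothetically, that the relevant inner product pairs the direction b − π(s), for b in the best response set, with a vector whose components are Q(s, a) − V(s), where V(s) = Σ_a π(s, a) Q(s, a). The definition of K_s(π) is not reproduced in the text, so this form is an assumption, as is the function name `split_inner_product`.

```python
def split_inner_product(pi_s, q_s):
    """Return the two terms of <b - pi_s, q_s - v * 1> for a best response b.

    pi_s : strategy in one state (a probability vector)
    q_s  : action values in that state
    """
    A = len(q_s)
    v = sum(p * q for p, q in zip(pi_s, q_s))       # V(s) under pi_s
    best = max(q_s)
    argmax = [a for a in range(A) if q_s[a] == best]
    b = [1.0 / len(argmax) if a in argmax else 0.0 for a in range(A)]
    direction = [b[a] - pi_s[a] for a in range(A)]
    first = sum(d * q for d, q in zip(direction, q_s))   # max_a q_s[a] - v >= 0
    second = sum(d * v for d in direction)               # v * (1 - 1) = 0
    return first, second
```

The first term equals max_a Q(s, a) − V(s), which cannot be negative because V(s) is an average of the action values, and the second term vanishes because both b and π(s) sum to one.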
Then using Lemma 5.2 gives the desired result. Now we use Theorem 5.3 and Lemma 5.5 to produce a Lyapunov function for (5.6) and hence prove the global convergence of the second algorithm given by (5.3) and (5.4). Let Λ := {π; π an optimal strategy}.

Lemma 5.6. Fix π as an optimal strategy. Then is a Lyapunov function for the differential inclusion (5.6) and Λ is a globally attracting set for (5.6).
To prove the condition of Theorem 5.3 we note that for any fixed t > 0, for some strategy π ∈ ∆ K , and for each s ∈ S, a fixed ω s ∈ Ω η and πs ∈ b s (Q π ) such that πs = ω s πs − π s , for all s ∈ S.
Corollary 5.7. The coupled process (Q_n, π_n) from (5.3) and (5.4) converges to the limit (Q π, π), where π is an optimal strategy and {Q π(s, a)}_{s,a} is the set of associated action values.
Proof. Lemma 5.6 shows that a Lyapunov function exists for the differential inclusion (5.6); this with Corollary 4.8 proves the claim.

6. Summary. We have combined the work of Benaïm, Hofbauer and Sorin [4] on differential inclusions with the work of Konda and Borkar [13] and Borkar [8] on asynchronous stochastic approximation in order to provide a framework for asynchronous stochastic approximation with a set-valued mean field. This enables us to modify the previous work on asynchronous stochastic approximation to use a set of assumptions which are straightforward to verify a priori.
Furthermore we extended the work of Konda and Borkar on asynchronous two-timescale stochastic approximation using this new framework.By allowing the mean fields to be updated using set-valued functions we provide a new result in two-timescale stochastic approximations which clearly applies to asynchronous and synchronous stochastic approximations.
This approach provides a clear framework for single or multiple timescale asynchronous stochastic approximations with clear assumptions which differ little from the synchronous case.Where previously the additional and difficult to verify assumptions required for asynchronous stochastic approximations could be perceived as a reason to avoid their use, this framework removes many of these issues.
In Section 5 we provided an example of a coupled learning algorithm for a discounted reward Markov decision process.We analysed the limiting behaviour using the results of Section 4. The algorithm uses a set-valued mean field based upon a best response style of actor-critic learning.We used the results of Section 4 to show convergence to an optimal strategy under a straightforward set of assumptions.This algorithm demonstrates the main value of the approach to asynchronous stochastic approximation in this paper.

APPENDIX A: OMITTED PROOFS
A.1. Minimum Update Proportion. For the asynchronous stochastic approximations in (2.4) we are interested in understanding how different components of x_n get selected to be updated. The previous work on this makes the direct assumption that in the limit all the elements of Ī are updated in an equally spaced manner and in some minimum proportion of the iterations [13]; this assumption is difficult to verify prior to running the process. In this work we use results on Markov chains via assumption (A4) as an alternative which can be checked a priori if we know the transition probabilities of the state process.
Lemma A.1. Under (A4), there exists η > 0 such that ∀i ∈ I,

Proof. The values of Ī_n ∈ Ī form a controlled Markov chain where Ī_{n+1} depends on the current updated component, Ī_n, and the value of the iterative process, x_n. For x ∈ C let π_x(•) be a stationary distribution for this Markov chain given by the transition probabilities P_x(•, •); standard theory gives that, under (A4), π_x(•) exists, is unique, and for some δ_x > 0, π_x(I) > δ_x for all I ∈ Ī. Let η = min_{x∈C} δ_x, which exists and is positive since C is compact. Then for all I ∈ Ī, x ∈ C we have that This is in the form of a stochastic approximation with controlled Markovian noise as in [9].
Let w(t) be the linear interpolation of the {w_n}_{n∈N} process. Using [9, Corollary 3.1] the limiting behaviour of the interpolated process, w(t), will be an asymptotic pseudo-trajectory to a differential equation, for a suitable process x(t) ∈ C based upon {x_n}_{n∈N}. We know that π_x(I) ≥ η > 0 for all n and hence the dynamics of w(t) can be expressed as, where Ω^η_{|Ī|} is as defined in (3.4). This implies that any limit point of w(t) will be in Ω^η_{|Ī|} and hence lim inf_{n→∞} w_n(I)/n ≥ η a.s. for all I ∈ Ī.
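The conclusion of Lemma A.1 can be illustrated in the simplest case. The sketch below drops the dependence of the transition probabilities on x_n and simulates a fixed irreducible, aperiodic chain, which is a simplifying assumption and not the controlled chain treated in the proof. It compares the empirical proportion of visits to each component with the stationary distribution obtained by power iteration. The function names and the example transition matrix are hypothetical.

```python
import random


def stationary_distribution(P, n_iter=10_000):
    """Power iteration for the stationary distribution of a finite chain."""
    K = len(P)
    dist = [1.0 / K] * K
    for _ in range(n_iter):
        dist = [sum(dist[i] * P[i][j] for i in range(K)) for j in range(K)]
    return dist


def empirical_update_proportions(P, n_steps, seed=0):
    """Fraction of the first n_steps iterations spent in each state."""
    rng = random.Random(seed)
    K = len(P)
    counts = [0] * K
    state = 0
    for _ in range(n_steps):
        counts[state] += 1
        state = rng.choices(range(K), weights=P[state])[0]
    return [c / n_steps for c in counts]
```

For a fixed chain the ergodic theorem gives that the proportions converge almost surely to the stationary probabilities, so any η below the smallest stationary probability serves as a lower bound in the limit.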
Lemma A.2. If {V n } n∈N is a martingale difference noise process and assuming (A2)(b) and (A4) hold; (i) If sup n E V n q < ∞ for some q > 0 then almost surely, then almost surely there exists a Γ > 0 such that for all θ ∈ R K , (ii) Let Vn (i) := I {i∈ Īn} V n (i). Clearly Vn ≤ V n . In addition let e i be a K dimensional basis vector with a 1 in the i th term and 0 everywhere else. From the assumption in Lemma A.2, Using this gives the following, Now using the independence of the V i terms and letting θn (i) := Letting Γ := KΓA 2 η , which is constant, completes the proof of (ii). Where the first step must hold for some i ∈ I, the second step follows from Lemma A.1 and the last step is directly from (A2)(b).
Proof (of Lemma 3.4). Since L^2([0, T]) is a Hilbert space it is relatively compact and relatively sequentially compact using the Banach-Alaoglu Theorem [10, Appendix A], which guarantees that the sequence {ū^n_i(•)}_{n∈N} has a weakly convergent subsequence with a limit point in L^2([0, T]). Hence, there exists a limit point ũ_i(•) of (3.10). For a fixed T > 0, ũ_i(•) must satisfy (3.9) for all h(•) ∈ L^2([0, T]). Hence showing that, for an arbitrary fixed T and a single h(•), any limit point is bounded below is enough to prove the claim. Fix T > 0, select 0 < v ≤ T, and take h(t) = 1 for all t ∈ [0, T]. Let {k(n)}_{n∈N} be a subsequence of the natural numbers under which {ū Now, U is a compact metric space [10, Appendix A] and hence is weakly countably (limit point) compact. This means the infinite subset {u^n_i(•)}_{n∈N} has a limit in U, and this limit point must satisfy Lemma 3.4. This gives, and since this statement is true for all t, v > 0, the statement follows.
A.4. Proof of Lemma 3.6. Extend our notation to continuous time by defining M (t) as the K × K diagonal matrix of the (u 1 (t), . . ., u K (t)) terms and recall that M (t) is the K × K diagonal matrix of the (v 1 (t), . . ., v K (t)) terms.
Let h(•) be a bounded continuous function on [τ n , τk ] such that h(τ j ) = f j for all j = n, . . ., k.This will mean that h(•) ∈ L 2 ([0, T ]).Throughout this paper we consider the continuous interval [0, ∞) divided into segments of length ᾱn , and hence we can approximate the sum from Lemma 3.6 as an integral, Now we note that u i (t) and v i (t) are extensions of {µ n (i)} n∈N to continuous time which are constant on intervals [τ n , τn+1 ), and that M (t) and M (t) are just matrices containing u i (t) and v i (t) respectively.This will mean that M (t) = M m(t)+1 and M (t) = M m(t)+1 and hence we have,
is a locally integrable function such that for all T >

The rest of the proof completes as in [4, Proposition 1.3].

Corollary 4.8. If there is a globally attracting set, A, for the differential inclusion (4.7) and assumptions (B1)-(B6) are satisfied, then the two-timescale iterative process (4.2) will almost surely converge to A.

Proof. Immediate by combining Corollary 4.4 and Theorem 4.7 with [4, Proposition 3.27].