Empirical Q-Value Iteration

We propose a new simple and natural algorithm for learning the optimal $Q$-value function of a discounted-cost Markov Decision Process (MDP) when the transition kernels are unknown. Unlike the classical learning algorithms for MDPs, such as $Q$-learning and `actor-critic' algorithms, this algorithm doesn't depend on a stochastic approximation-based method. We show that our algorithm, which we call the empirical $Q$-value iteration (EQVI) algorithm, converges almost surely to the optimal $Q$-value function. To the best of our knowledge, this is the first algorithm for learning in MDPs that guarantees an almost sure convergence without using stochastic approximations. We also give a rate of convergence or a non-aymptotic sample complexity bound, and also show that an asynchronous (or online) version of the algorithm will also work. Preliminary experimental results suggest a faster rate of convergence to a ball park estimate for our algorithm compared to stochastic approximation-based algorithms. In fact, the asynchronous setting EQVI vastly outperforms the popular and widely-used Q-learning algorithm.


Introduction
Q-Learning algorithm of Watkins [1] has been an early and among the most popular and widelyused algorithms for approximate dynamic programming for Markov decision processes.An important feature of this and other algorithms of this ilk (actor-critic, TD(λ), LSTD, LSPE, natural gradient, etc.) has been that they are stochastic approximations, i.e., recursive schemes that update a vector incrementally based on observed payoffs.This is achieved by using stepsizes that are either decreasing slowly in a precise sense or equal a small positive constant.In either case, this induces a slower time scale for the iteration compared to the 'natural' time scale on which the underlying stochastic phenomena evolve.Thus the two time scale effects such as averaging kick in, ensuring that the algorithm effectively follows an averaged dynamics, i.e., its original dynamics averaged out over the random processes affecting it on the natural time scale.The iterations are designed such that this averaged dynamics has the desired convergence properties.
What we propose here is an alternative scheme for Q-Learning that is not incremental and therefore evolves on the natural time scale.It does the usual Q-value iteration with the proviso that the conditional averaging with respect to the actual transition kernel of the underlying Define the optimal value function V * : S → R + as The Bellman operator is a contraction mapping, i.e., T (V ) − T (V ) ∞ ≤ γ V − V ∞ , and the optimal value function V * is the unique fixed point of T (•).Given the optimal value function, an optimal policy π * can be calculated as [6] π * (s) ∈ arg min

Value Iteration, Q-Value Iteration
A standard scheme for finding the optimal value function (and hence an optimal policy) is value iteration.One starts with an arbitrary function V 0 .At the kth iteration, given the current iterate V k , we calculate V k+1 = T V k .Since T (•) is a contraction mapping, by Banach fixed point theorem, V k → V * .Another way to find the optimal value function is via Q-value iteration.Though this requires more computation than the value iteration, Q-value iteration is extremely useful in developing learning algorithms for MDPs.
where d = |S||A|.Similar to the Bellman operator T , Q-value operator G is also a contraction mapping, i.e., G(Q This Q * is called the optimal Q-value.By the uniqueness of V * , it is clear that V * = min a∈A Q * (s, a).Thus, given Q * , one can compute V * and hence an optimal policy π * .The standard method to compute Q * is Q-value iteration.We start with an arbitrary Q 0 and then update Q k+1 = G(Q k ).Due to the contraction property of G, Q k → Q * a.s.

Empirical Q-Value Iteration for MDPs
The Bellman operator T and the Q-value operator G require the knowledge of the exact transition kernel p(•|•, •).In practical applications, these transition probabilities may not be readily available, but it may be possible to simulate a transition according to any of these probabilities.Without loss of generality, we assume that the MDP is driven by uniform random noise according to the simulation function ψ : S × S × [0, 1] → S such that Pr(ψ(s, a, ξ) = s ) = p(s |s, a) (6) where ξ is a random variable distributed uniformly in [0,1].Using this convention, the Q-value operator can be written as In empirical Q-value iteration (EQVI) algorithm, we replace the expectation in the above equation by an empirical estimate.Given a sample of n i.i.d.random variables distributed uniformly in [0, 1], denoted ).We summarize our empirical Q-value iteration algorithm below.
, and compute Our main result is the following.
Theorem 1.The empirical Q-value iteration converges to the optimal Q-value function, i.e., Q k → Q * a.s. as k → ∞ for any fixed n.
Proof is given in Section 3.

Comparison with Classical Q-learning
Synchronous variant of the classical Q-learning algorithm for discounted MDPs works as follows (see [4,Section 5.6]).For every state-action pair (s, a) ∈ S × A, we maintain a Q-value function and use the update rule where ξ k (s, a) is a random noise sampled uniformly from [0, 1] and [4].The rate of convergence depends on the sequence {α k , k ≥ 0} [8].In general, the convergence is very slow.
Empirical Q-value iteration algorithm does not use stochastic approximation, and is a nonincremental scheme.The rate of convergence will depend on the number of noise samples n.

Analysis
In this section, we now provide an anlysis of EQVI and a proof of Theorem 1.The main idea that we exploit is the fact that (exact) Q-value iterations in finite time is equal to finite horizon Q-values obtained by "looking backward" with the initial guess as the terminal cost.More precisely, when the transition kernels p( 5)) with an initial guess Q 0 .One can show that this Q k is equal to Q k which is the Q-value obtained by "looking backward" where So, rather than showing that the forward iteration Q k converges to the optimal Q-value function Q * , one can also establish the convergence of the (exact) Q-value iteration by showing that the backward iterate Q k converges to Q * almost surely.When the transition kernels are known, this is obviously a convoluted route because the convergence of the forward iteration However, when the transition kernels are unknown, it is not clear if we can directly prove the convergence of the (simulation-based) forward iteration sequence Q k (ω) (given in Algorithm 1 and formalized in equation ( 14)) to the optimal Q-value function Q * .To overcome this difficulty, we take the approach mentioned above and define the (simulation-based) backward iteration sequence Q k (ω) (c.f.equation ( 26)) similar to the Q k above and we rigorously show that Q k (ω) = Q k (ω), ∀ω (c.f.Proposition 2).Then, using an approach similar to the well known Propp-Wilson backward simulation algorithm [2], we show that Q k (and hence Q k ) converges to a random variable Q * (ω) almost surely (c.f.Proposition 3).Finally, we show that this random variable Q * (ω) is a deterministic quantity and that is indeed equal to the optimal Q-value function Q * .
In the following, we first formally set the notations for the underlying probability space and define EQVI iterate Q k using those notations (c.f. ( 14)).Then we define the forward simulation model for controlled Markov chains (c.f.equation ( 19)), and show the finite time coupling property of this simulated chain (c.f.Proposition 1).Then we define the backward simulation model for controlled Markov chain (c.f.24).Equipped with these notions, we proceed to prove Proposition 2 and Proposition 3 and then use these results to prove our main result, Theorem 1.
Let (Ω 1 , F 1 , P 1 ) be the probability space of one-sided infinite sequences ω such that ω = {ω k : k ∈ Z * }, where Z * is the set of non-negative integers.Each element ω k of the sequence is a vector ω k = (ξ k i (s, a), 1 ≤ i ≤ n, s ∈ S, a ∈ A), where ξ k i (s, a) is a random noise distributed uniformly in [0, 1].We assume that ξ k i (s, a) are i.i.d.∀i, ∀(s, a) ∈ S × A and ∀k ∈ Z * .E 1 denotes expectation with respect to measure P 1 .
For each k ∈ Z * , θ k denotes the left shift operator, i.e., Also, let Γ be the projection operator such that Γ( Recall that ψ is the simulation function defined in equation ( 6) such that Using ψ, for each ω ∈ Ω 1 we define a sequence of empirical transition kernels p(ω) = ( p k (ω k )) k≥0 as I{ψ(s, a, ξ k i (s, a)) = s }. (11) We dropped ω k from the above definition for ease of notation.For any π ∈ Π, we also define the transition probability matrix P π k as, Note that the rows of P π k are independent due to the independence assumption on the elements of the vector ω k .Also, P π k are independent ∀k.We define the empirical Q-value operator G : Then, the empirical Q-value iteration given in Algorithm 1 can be succinctly represented as We drop ω from the notation of Q k whenever it is not necessary.Note that from equation ( 10) and ( 13), for any fixed Q, where G is the Q-value operator defined in equation (5).We define another probability space ( Each element ν k of the sequence ν is a |S||A|-dimensional vector, ν k = (ν k (s, a), s ∈ S, a ∈ A) where ν k (s, a) is a random variable distributed uniformly in [0, 1].We assume that ν k (s, a) are i.i.d.∀(s, a) ∈ S × A and ∀k ∈ Z * .Likewise, let ν = {ν k : k ∈ Z * } ∈ Ω 2 .Each element νk of the sequence ν is a |S|-dimensional vector, νk = (ν k (s), s ∈ S) , where νk (s) is a random variable distributed uniformly in [0, 1].We assume that νk (s) are i.i.d., independent of ν, ∀s ∈ S and ∀k ∈ Z * .E 2 denotes the expectation with respect to P 2 .Let P be the product measure, P = P 1 ⊗ P 2 and let E denote the expectation with respect to P.
For each ω ∈ Ω 1 , i.e., for each sequence of transition kernels p(ω) = ( p k (ω k )) k≥0 , we define a sequence of simulation functions and φ 2 k is the (randomized) control strategy that maps the output of the function φ 1 k to an action space-valued random variable φ 2 k (φ 1 k (s, a, ν k (s, a)), νk (φ 1 k (s, a, ν k (s, a)))).We note that the control strategy can be identified with an element π, resp.σ, of the set Π or Σ, when, resp., In such a case, we write φ 2 k ≈ π or φ 2 k ≈ σ k as the case may be.For Given an ω ∈ Ω 1 , ν ∈ Ω 2 and an initial condition (s 0 , a 0 ), we can simulate an MDP with state-action sequence (X k (ω, ν), Z k (ω, ν)) k≥0 as follows: We call this simulation method as forward simulation.The dependence on the control strategy φ 2 k is implicit and is not used in the notation.Whenever not necessary, we also drop ω and ν from the notation and denote the simulated chain by (X k , Z k ).Since We prove that the expected value of the coupling time is finite.
k ) k≥k 0 be two sequences of state-action pairs for an MDP simulated according to (20) using an arbitrary control strategy φ 2 k ≈ σ k .Let τ ω * ,ν * be the coupling time as defined in equation (21).Then, Proof is given in the Appendix.We now consider the backward simulation of an MDP.This is similar to the coupling from the past idea introduced in [2].Note that for us this is a proof technique, a 'thought experiment', and not the actual algorithm. Given m=−k 0 of length k 0 + 1 using the backward simulation.As a first step, we do an offline computation of all possible simulation trajectories as follows: Then we simulate ( X m (ω, ν)) 0 m=−k 0 +1 as, starting from the initial condition 24) where φ m = ( φ 1 m , φ 2 m ).Recall that (c.f.(20)) in the forward simulation starting from k = 0, we go from a path of length k 0 to a path of length k 0 + 1 by taking the composition φ k 0 • Φ k 0 0 (s 0 , a 0 ).In backward simulation, we do this by taking the composition Φ 0 −k 0 +1 • φ −k 0 (s 0 , a 0 ).So, forward simulation is done by forward composition of simulation functions whereas the backward simulation is done by backward composition of the simulation functions.Furthermore, in forward simulation, we can successively generate consecutive states of a single controlled Markov chain trajectory one transition at a time, whereas in backward simulation one is obliged to generate one transition per state and any trajectory from −k 0 to 0 has to be traced out of this collection by choosing contiguous state transitions at each successive time.This feature is familiar from the Propp-Wilson backward simulation algorithm mentioned above.
In the following, we fix the control strategy φ 2 k as, where Q k is defined as, Note that the expectation in the above equation is with respect to the measure P 2 for a given ω (i.e., for a given sequence of transition kernels ( p k (ω k )) k≥0 .We now show an important connection between the Q k iterate defined above and the em- Proof.We prove this by induction.First note that by the definition of Q k given in equation (14), for all (s, a) ∈ S × A, we get

Now, by the definition in equation (26),
where we used the fact that Then, We shall need the following lemma of Blackwell and Dubins [9], [10, Chapter 3, Theorem 3.3.8].
Lemma 1. [9] Let Y k , k = 1, 2, . . ., ∞ be a real random variables on a probability space We now show that Q k (ω) converges to a random variable Q * (ω) almost surely.By the above proposition, this will imply the almost sure convergence of Q k to Q * (ω).Proposition 3.For a given ω ∈ Ω 1 , let p(ω) = ( p k (ω)) k≥0 be the corresponding sequence of transition kernels and Q k (ω), k ≥ 0, be the corresponding Q-value iterates as defined in equation (14).Then, there exists a random variable Proof.Consider the backward simulation described above.For ω ∈ Ω 1 , ν ∈ Ω 2 , we trace out two MDPs with state-action sequences Note that by construction, two Markov chain paths initiated at time −k traced from the above backward simulation but in forward time beginning at −k, will merge once they hit a common state, i.e., get 'coupled' (cf.[2]).Decrease −k until all paths initiated at −k couple.Once they couple, they follow the same sample path.We first show the following: Claim 1.These chains couple with probability 1 as k → ∞.
Proof of Claim: Note that by construction, two Markov chain paths initiated at time −k traced from the above backward simulation in forward time beginning at −k will merge once they hit a common state, i.e., get 'coupled' (cf.[2]).Let τ k ω,ν be the time after which these chains couple, i.e., X −k+ τ k ω,ν = X −k+ τ k ω,ν and X −k+l = X −k+l for all 0 ≤ l < τ k ω,ν .Since these chains are of finite length (from −k to 0), we may need to define the value of τ k ω,ν arbitrarily if they don't couple during this time.
To overcome this, we let these chains run to infinity.This can be done without loss of generality as follows.For −k ≤ m < 0, simulate the chains according to the backward simulation method specified by ( 22)-(23).Suppose the i.i.d.random vectors ν m are generated for all −∞ < m < ∞.For m ≥ 0, continue the simulation to generate chains ( It is easy to see that the τ k ω,ν has the same statistical properties as the coupling time defined in equation ( 21).So, by Proposition 1, Also, it is easy to see that τ k ω,ν s are identically distributed ∀k.So, Then, by Borel-Cantelli lemma, τ n ω,ν − n → −∞, (ω, ν)-a.s.Thus, the chains will couple with probability 1.
End of Proof of Claim.Now, by construction, Since the chains will couple with probability 1, the RHS of the above equation will converge to a random variable R(ω)(s, a, s , a ), ω-a.s. as k → ∞, i.e, We revert to the 'forward time' picture henceforth.Now, Since p k depends only on ω, we can define another random variable R k (ω)(s, a) such that , ω − a.s., it follows from the preceding lemma that there exists another random variable R * (ω) such that Clearly, Now we give the proof of our main theorem (Theorem 1).
Proof.Let (Ω 1 , F 1 , P 1 ) be the probability space as defined before.By Taking conditional expectation and using Lemma 1 we get, where G is the Q-value operator defined in equation ( 5).This gives Then, by the continuity of G, G( Q * (ω)) = Q * (ω) which implies that Q * (ω) is indeed equal to the optimal Q-function Q * , by the uniqueness of the fixed point of G.

Rate of Convergence and Asynchronous EQVI
In this section, we now provide a rate of convergence, or a non-asymptotic sample complexity bound.This follows from methods that had been developed in [11] for empirical value and policy iteration, which however only provide a convergence in probability guarantee.In the second subsection, we provide an argument of why asynchronous EQVI will also work.This also uses methods developed earlier in [11].

Rate of Convergence
One notable observation about Theorem 1 is that the almost sure convergence of the EQVI iterate holds for any n.However, the rate of convergence will and does depend on n and this is confirmed by the simulation results (c.f.Section 5).While almost sure convergence guarantee, i.e., Q k → Q * almost surely as k → ∞ is a strong result, rate of convergence is an important consideration in practical applications.Unfortunately, the coupling argument used in the proof of Theorem 1 does not yield a rate of convergence.However, we note that exact Q-value operator G(•) is a contraction, and its empirical variant G(•) is a 'random contraction operator'.
In [11], a technique for analyzing the rate of convergence of a random sequence resulting from iteration of a random contraction operator was developed.This was used to show the probabilistic convergence of empirical value iteration and explicit bounds were given on the number of simulations samples n and the number of iterations k that are needed to get an -optimal value function with a probability greater than (1 − δ).We now argue that the exact Q-value operator G(•) is a contraction, and its empirical variants G(•) satisfy Assumptions (4.1)-(4.4) in [11], and thus a very similar methodology can be used in establishing convergence in probability of the iterates of EQVI (weaker than Theorem 1 in this paper).But more importantly, it yields a rate of convergence and a non-asymptotic sample complexity result, i.e., for any given > 0, δ > 0, we give an explicit bound on the number of simulation samples n and the number of iterations k that are needed to get an -optimal Q-value with probability greater that (1 − δ).Assumptions.The classical operator G and a sequence of random operators { G n } satisfy the following: 4.1 P lim n→∞ G n q − G q ≥ = 0 ∀ > 0 and ∀q ∈ R S .Also G has a (possibly nonunique) fixed point q * such that Gq * = q * .4.2 There exists a κ * < ∞ such that qk n ≤ κ * almost surely for all k ≥ 0, n ≥ 1.Also, q * ≤ κ * .4.3 G q − q * ≤ γ q − q * for all q ∈ R |S| .4.4 There is a sequence {p n } n≥1 such that It can be shown that the exact Q-value operator G and its empirical variants G n (where the index n is for number of samples) satisfies the above assumptions.It can be argued easily by using strong law of large numbers that Assumption 4.1 is satisfied.Assumption 4.2 is satisfied easily when rewards are bounded.Assumption 4.3 is satisfied since G is a contraction operator.It can easily be checked that Assumption 4.4 is satisfied with (30) This implies convergence in probability of the Q-value iterates (weaker than in the previous section) to the optimal Q-value.Now, following arguments and construction similar to Section 5.1 in [11], we can derive a non-asymptotic sample complexity bound which we summarize below.We will denote k iterate when using n samples by Q n k .
Theorem 2. Given ∈ (0, 1) and δ ∈ (0, 1), fix g = /η * and select δ 1 , δ 2 > 0 such that δ 1 + 2δ 2 ≤ δ where η * = 2/(1 − γ) .Select an n such that where κ * = max (s,a)∈K c(s, a)/(1 − γ) and select a k such that where µ n,min = min i µ n (i) and µ n (i) is given by Then, Since the details of the proofs are the same as in [11], we only give a short outline here.Readers are referred to [11] for details.The proof is based on the idea of constructing a sequence of Markov chains that stochastically dominate a discrete error process.More precisely, we are interested in the rate of convergence of the sequence { Q k −Q * , k ≥ 0} to 0. But since the error process { Q k − Q * , k ≥ 0} is continuous-valued,we first discretize it and get a discrete error process now defined on non-negative integers.Unfortunately, this process is not Markovian.Hence, we construct a Markov chain {Y n k , k ≥ 0} that has the following structure: Note that p n is close to 1 for sufficiently large n.The Markov chain {Y n k , k ≥ 0} will either move one unit closer to zero until it reaches η * , or it will move (as far away from zero as possible) to N * (and hence bounds are very conservative).We can show that this Markov chain stochastically dominates the discrete error process.Further, as n goes to infinity, the invariant distribution of the Markov chain will concentrate at zero, which establishes convergence of the error process { Q k − Q * , k ≥ 0} to zero in probability.Now the mixing rate of the Markov chain gives the rate of convergence and the sample complexity bound for EQVI in the above theorem.

Asynchronous EQVI
We now show that just as for EVI, the asynchronous version of EQVI works as well.That is, the Q-value function estimates converge in probability even when the updates are asynchronous, including in the "online" case when updates are done for one state at a time.We consider each state to be visited at least once to complete a full cycle, and the time for a full cycle could be random.
Let (σ k , α k ) k≥0 be any infinite sequence of states and actions.This sequence (σ k , α k ) k≥0 may be deterministic or stochastic, and it may even depend online on the Q-value function updates.For shorthand, denote z = (σ, α).For any z ∈ S × A, we define the asynchronous Q-value operator G z as Also define its empirical variant with n samples as: otherwise.
The operators G z and G z,n only update the Q-value function for state s and action a, and leaves the other estimates unchanged.This will then produce a sequence of updates {Q k } and { Q n k } respectively starting from some intial seed Q 0 .
Suppose that in some finite number of steps K 1 , each state-action pair is visited at least once.Define, which is a contraction with constant γ.It is well-known [8] that if each state-action pair is visited infinitely often, the sequence produced by asynchrononus Q-value iteration, {Q k } will converge to Q * , the optimal Q-value.Now define the time of (m + 1)th full update includes every state-action pair in S × A , with K 0 = 0. We can now give a slightly modified stochastic dominance argument to show that asynchronous EVI will converge in a probabilistic sense by checking the progress of the algorithm at these hitting times, i.e., we look at the sequence { Q n Km } m≥0 .In the simplest update scheme, each state-action pair is updated in turn and the length of a full update cycle is |S||A|.Now, analogous to G, we can define an operator G n , Note that each random operator in this iteration introduces an error /|S||A| as compared to the corresponding non-random operator.This can be ensured by picking n large enough such that where κ * is a constant that can be computed.This can be used now to guarantee that Now, the stochastic dominance argument developed in [11] can be applied to obtain the following result.Theorem 3. If each state-action pair is visited in turn infinitely often, the iterates of asynchronous EQVI, Remark 2. We note that the "online" version of asynchronous EQVI is like the popular Qlearning algorithm used for reinforcement learning.As we see in the numerical results in the next section, online EQVI has a much faster convergence than Q-learning, though the theoretical guarantees are weaker, i.e., convergence in probability for EQVI and almost sure for Q-learning.

Numerical Results
In this section, we show some numerical results comparing the classical Q-Learning (QL) algorithm (given in equation ( 8)) with our empirical Q-value iteration (EQVI).We generate a 'random' MDP, with |S| = 500 and |A| = 10, where the transition matrix P and the cost c(s, a) are generated randomly.We plot relative error e k := || Q k − Q * ||/||Q * vs the number of iterations.Note that the synchronous verion of QL was used in which we used more than one simulation samples and updated all state-action pairs at the same time.
We can represent the update equations of both QL and EQVI using the operator G (c.f. ( 13), ( 8)) EQVI : So, both EQVI and QL can be run using the same matlab code.For EQVI, set α k = 1, ∀k.Note that this does not make EQVI a stochastic approximation scheme since it does not satisfy the step-size requirement.
As you can see from Figure 1, the rate of decay of relative error is way faster in EQVI (and pretty close to exact QVI) as compared to Synchronous QL.In fact, to reach 5% relative error, QVI takes about 30 iterations, EQVI takes just a bit more (about 35), while Synchronous QL takes more than 300.Thus, EQVI promises at least a 10x speedup over Synchronous QL.In fact, in about 35 iterations, Synchronous QL has a 50% relative error.The relative error has been estimated from 50 simulation runs and the confidence intervals are very tight.Note that as we take per samples per iteration, we start to approach performance of exact QVI.
Figure 2 shows asynchronous EQVI and QL for a random MDP with 500 states and 10 actions wherein state-action pairs were chosen randomly in each iteration.As can be seen, exact QVI and EQVI get to within 5% relative error in about 500 iterations (quite remarkable since there are 5000 state-action pairs), while QL in 500 iterations has a 90% relative error.In fact, (asynchronous) QL is so slow that even after 10,000 iterations, the relative error is still about 50%.As before, the relative error has been estimated from 50 simulation runs and the confidence intervals are very tight.
From these simulations.it is clear that EQVI promises significantly faster performance than Q-learning, in both a synchronous, as well as asynchronous setting.
Remark 3. As mentioned above, we get a very fast convergence with EQVI to a ball park estimate but then an extremely slow (in fact, imperceptible in the given time frame) movement to the exact value as guaranteed by theory.To get some intution about why so, consider the uncontrolled case.Then, the iterations are of the form  where G is a random affine contraction.This may further be written as where Ǧ(x) = Ax+b for suitably defined A, b is a deterministic affine contraction and {M k }, M k := GQ k − ǦQ k , a martingale difference sequence.Note that A in our case is γ times a stochastic matrix, hence a stable matrix.Then The first term on the right decays to zero, the second converges to the desired limit, and the third represents noise.If {M k } were i.i.d., this would converge to a stationary process, not to zero.In our case, it does converge to zero as implicit in the proof of Theorem 1.In case of stochastic approximation, M k would be weighted by a square-summable step-size which accelerates this convergence to zero.But in our case, in the absence of such additional damping, the fluctuations can be expected to diminish only very slowly.On the other hand, the decay of dependence on initial condition and convergence of the middle term to the desired limit are no longer incremental as in the stochastic approximation counterpart and therefore very rapid.This is in tune with the well known bias-variance trade-off and not surprising.This does, however, suggest that a hybrid scheme that runs empirical Q-Value Iteration initially and then switches to conventional Q-Learning will have the best of both the worlds if a faster almost sure convergence is needed.Note also that the performance of our scheme improves rapidly with increasing n.For practical problems, using EQVI until the relative error is below some threshold (e.g., 1-5 %) may be enough.

Conclusions
We have presented a new (offline and online) Q-value iteration algorithm for discounted-cost MDPs.We have rigourously established the convergence of this algorithm to the desired limit with probability one.Unlike the classical learning schemes for MDPs such as Q-learning and actor-critic algorithms, our algorithm or analysis does not use a stochastic approximation method and is a non-incremental scheme.Preliminary experimental results suggest a faster rate of convergence for our algorithm than currently popularly used algorithms.
A particularly interesting and useful aspect is whether distributed and asynchronous implementation of EQVI will work.We have been able to show that for the special case where each state-action pair is updated in turn.Moreover, the convergence guarantee is only probabilistic.It would be useful to show that even with randomly picked state-action pairs, as long as each one of them is picked infinitely often, we will get convergence and in the stronger almost sure sense (as for our main result for the synchronous case.) Another useful direction will be to show that this would work with infinite (even continuous) state and action spaces.This would then make such an algorithm useful even for partially observed MDP problems.This will require combining current methods with function approximation in an appropriate space (e.g., a Reproducing Kernel Hilbert Space (RKHS)).
Another useful direction would be the average reward case.Average reward MDPs are typically hard to analyze because the dynamic programming operator for average reward MDP is not a contraction mapping.There are, however, provably convergent Q-learning and actorcritic algorithms for average reward MDPs due to the powerful ODE approach to stochastic approximation [12] [13].It would be interesting to see if our algorithm works for learning in MDPs with average reward criterion.
These are directions for future research.

Proofs Proof of Proposition 1
We present this as a series of lemmas.Given an initial time k 0 and states s 0 , s ∈ S, we define the hitting time τ ω,ν of the controlled Markov chain (X k (ω, ν)) k≥k 0 as We first show that the expected value of the hitting time is finite when the chain is controlled by a stationary strategy, i.e., φ 2 k ≈ π ∈ Π, ∀k.
Lemma 2. Let (X k (ω, ν), Z k ) k≥k 0 be the sequence of state-action pairs for the MDP simulated according to (20) using a stationary control strategy φ 2 k ≈ π ∈ Π, ∀k.Let τ ω,ν be the hitting time as defined in equation (34).Then, Proof.Consider a sequence of states, (s k 0 +j ) r j=0 , with s k 0 = s 0 and s k 0 +r = s such that P π (s k 0 , s k 0 +1 ) • • • P π (s k 0 +r−1 , s k 0 +r ) > 0. By Remark 1, such a sequence of states exists.Furthermore, r can be picked independent of the choice of s 0 , s and we assume that it is so.Let Using ( 10)-( 12), E 1 [ P π k ] = P π ∀k.Since P π k are i.i.d., So there exist > 0, δ > 0 such that P 1 (W π > ) > δ.Then, Due to the i.i.d.nature of ω and the Markov property of X k (ω, ν), it is clear that the above probability does not depend on k 0 and hence, for any k > 0, We next show that the expected value of the coupling time is finite when the chain is controlled by a stationary strategy, i.e., φ 2 k ) k≥k 0 be two sequences of state-action pairs for an MDP simulated according to (20) using a stationary control strategy φ 2 k ≈ π ∈ Π, ∀k.Let τ ω * ,ν * be the coupling time as defined in equation (21).Then, Proof.Consider two sequences of states, (s 1 k 0 +j ) r j=0 and (s 2 k 0 +j ) r j=0 with s 1 k 0 = s 1 0 , s 2 k 0 = s 2 0 , s 1 k 0 +r = s 2 k 0 +r = s, for some s ∈ S such that By Remark 1, such (s 1 k 0 +j ) r j=0 and (s 2 k 0 +j ) r j=0 exist.Using, by abuse of notation, some common notation for entities defined on the two copies of (Ω, F, P), let As in the proof of Lemma 2, So there exist > 0, δ > 0 such that P 1 (W π 1 > ) > δ and P 1 (W π 2 > ) > δ.Moreover, due to the independence of P π k 0 +j (s 1 k 0 +j , s 1 k 0 +j+1 ) and P π k 0 +j (s 2 k 0 +j , s 2 k 0 +j+1 ), Then, by an argument analogous to that of Lemma 2, we have P τ ω * ,ν * (s 1 0 , s 2 0 ) ≤ r ≥ P 2 τ ω,ν (s 1 0 , s 2 0 where the , δ may be chosen independent of the choice of s 1 0 , s 2 0 .Hence P τ ω * ,ν * (s 1 0 , s 2 0 ) > r ≤ (1 − 2 δ 2 ).Now the same arguments as in the proof of Lemma 2 can be applied to get the desired conclusion.
We now extend the result of Lemma 2 and Lemma 3 to non-stationary control strategies.For that, we use the following result from [14] for a homogeneous MDP defined by the original transition kernel p(•|•, •).We include the proof for completeness.Since the state and action spaces are finite, the laws of {(X α k , Z α k ), k ≥ k 0 }, α ≥ 1, are tight.By dropping to a subsequence if necessary and invoking Skorohod's theorem, we may assume that these chains are defined on a common probability space, and there exists a controlled Markov chain {X ∞ k , k ≥ k 0 } governed by controls Z ∞ k , k ≥ k 0 , corresponding to a control strategy σ ∞ with X ∞ k 0 = s, such that (X α k , Z α k ) k≥0 → (X ∞ k , Z ∞ k ) k≥0 a.s.Since P τ α (s, s ) > j = E for τ ∞ (s, s ) := min{k ≥ 0|X ∞ k 0 +k = s , X ∞ k 0 = s}.Then, τ ∞ = ∞ a.s.This is possible only if there exists a non-empty subset H of S\{s } such that for each i ∈ H, max k ∈H min a∈A p(k|i, a) = 0. Let a i be the action at which the above minimum is achieved.Then the chain starting at H and governed by a stationary control strategy π such that π(i) = a i never leaves H.This contradicts Assumption 1 that under any stationary control strategy, S is irreducible.Thus, the given statement must hold.Now we extend the result of Lemma 2 to non-stationary control strategies.Lemma 5. Let (X k (ω, ν), Z k ) k≥k 0 be the sequence of state-action pairs for the MDP simulated according to (20) using an arbitrary control strategy φ 2 k ≈ σ k , ∀k.Let τ ω,ν be the hitting time as defined in equation (34).Then, E τ ω,ν (s 0 , s ) < ∞, ∀s 0 , s ∈ S Proof.Proof is similar to that of Lemma 2. By Lemma 4, there exists a j * , 0 < j * ≤ r * and a sequence of states, (s k 0 +j ) j * j=0 , with s k 0 = s 0 and s k 0 +j * = s such that P σ k 0 +1 (s k 0 , s k 0 +1 ) • • • P σ k 0 +j * (s k 0 +r−1 , s k 0 +j * ) > 0 where P σ k is defined as in (1) by replacing π with σ k .Let Then, there exists an > 0, δ > 0 such that P 1 (W σ > ) > δ.Then, as in the proof of Lemma 2, P τ ω,ν (s 0 , s ) > r ≤ (1 − δ), and E τ ω,ν (s 0 , s ) < ∞.
Now, the proof of Proposition 1 is straightforward by combining the proofs of Lemma 3 and Lemma 5.
Define the Q-value operator G : R d + → R d + as G(Q)(s, a) := c(s, a) + γ s ∈S p(s |s, a) min b Q(s , b) the unique fixed point of G(•), i.e., Q * (s, a) = c(s, a) + γ s ∈S p(s |s, a) min b Q * (s , b).

Figure 2 :
Figure 2: Comparison of Asynchronous exact QVI, EQVI and QL for a 500 state and 10 action random MDP with multiple samples in each step.For QL, the step size α k = 1/k θ , θ = 0.6.Averaging over 50 runs.