Asymptotics of Reinforcement Learning with Neural Networks

We prove that a single-layer neural network trained with the Q-learning algorithm converges in distribution to a random ordinary differential equation as the size of the model and the number of training steps become large. Analysis of the limit differential equation shows that it has a unique stationary solution which is the solution of the Bellman equation, thus giving the optimal control for the problem. In addition, we study the convergence of the limit differential equation to the stationary solution. As a by-product of our analysis, we obtain the limiting behavior of single-layer neural networks when trained on i.i.d. data with stochastic gradient descent under the widely-used Xavier initialization.


Introduction
Reinforcement learning with neural networks (frequently called "deep reinforcement learning") has had a number of recent successes, including learning to play video games [22,23], mastering the game Go [31], and robotics [16]. In deep reinforcement learning, a neural network is trained to learn the optimal action given the current state.
Despite many advances in applications, a number of mathematical questions remain open regarding reinforcement learning with neural networks. Our paper studies the Q-learning algorithm with neural networks (typically called "deep Q-learning"), which is a popular reinforcement learning method for training a neural network to learn the optimal control for a stochastic optimal control problem. The deep Q-learning algorithm uses a neural network to approximate the value of an action a in a state x [22]. This neural network approximator is called the "Q-network". The Q-learning algorithm estimates the Q-network by taking stochastic steps which attempt to train the Q-network to satisfy the Bellman equation.
The literature on (deep) reinforcement learning and Q-learning is substantial. Instead of providing a complete literature review here we refer interested readers to classical texts [2,19,35], to the more recent book [12], and to the extensive survey on recent developments in [1]. The majority of reinforcement learning algorithms are based on some variation of the Q-learning or policy gradient methods [36]. Q-learning originated in [40] and proofs of convergence can be found in [41,37]. The neural network approach to reinforcement learning (i.e., using Q-networks) was proposed in [22]. More recent developments include deep recurrent Q-networks [13], dueling architectures for deep reinforcement learning [39], double Q-learning [28], bootstrapped deep Q-networks [27], and asynchronous methods for deep reinforcement learning [24]. Although the performance of Q-networks has been extensively studied in numerical experiments, there has been relatively little theoretical investigation.
We study the behavior of a single-layer Q-network in the asymptotic regime of large numbers of hidden units and large numbers of training steps. We prove that the Q-network (which models the value function for the related optimal control problem) converges to the solution of a random ordinary differential equation (ODE). We characterize the limiting random ODE in both the infinite and finite time horizon discounted reward cases. Then, we study the behavior of the solution to the limiting random ODE as time t → ∞.
In the infinite time horizon case, we show that the limit ODE has a unique stationary solution which equals the solution of the associated Bellman equation. Thus, the unique stationary solution of the limit Q-network gives the optimal control for the problem. In the infinite time horizon case, we also show that the limit ODE converges to the unique stationary solution for small values of the discount factor. Convergence of ODEs to stationary solutions has been studied in related problems in the classical papers [3,37]. The difference in our work is that, in contrast to [3,37], we study the effect of a neural network as a function approximator in the Q-learning algorithm.
The presence of a neural network in the Q-learning algorithm introduces additional technical challenges; as a consequence, in the infinite time horizon case we can prove convergence of the limiting ODE to the stationary solution only for small values of the discount factor. We elaborate more on this issue in Remark 3.6. The situation is somewhat different in the finite time horizon case, where we can prove that the limit ODE converges to a global minimum, which is the solution of the associated Bellman equation, for all values of the discount factor in (0, 1]. As a by-product of our analysis, we also prove that a single-layer neural network trained on i.i.d. data with stochastic gradient descent under the Xavier initialization [10] converges to a limit ODE. In addition to characterizing the limiting behavior of the neural network as the number of hidden units and stochastic gradient descent steps grow to infinity, we also obtain that the neural network in the limit converges to a global minimum with zero training loss (see Section 4). Convergence to a global minimum for a neural network in regression or classification on i.i.d. data (not in the reinforcement learning setting) has been recently proven in [6], [7], and [38]. Our result shows that convergence to a global minimum can also be viewed as a simple consequence of the limit ODE for neural networks.
The rest of the paper is organized as follows. The Q-learning algorithm is introduced in Section 2. Section 3 presents our main theorems. Section 4 discusses the limiting behavior of single-layer neural networks when trained on i.i.d. data with stochastic gradient descent under the Xavier initialization. Section 5 contains the proof for the infinite time horizon reinforcement learning case. The proof for the finite time horizon reinforcement learning case is in Section 6. Section 7 contains a proof that a certain matrix in the limit ODE is positive definite, which is useful for establishing convergence properties of the limiting ODEs. Appendix A collects the proofs of intermediate results.

Q-learning Algorithm
We consider a Markov decision problem defined on the finite state space X ⊂ R^{d_x}. For every state x ∈ X there is a finite set A ⊂ R^{d_a} of actions that can be taken. The homogeneous Markov chain x_k ∈ X has a probability transition function P[x_{j+1} = z | x_j = x, a_j = a] = p(z|x, a), which governs the probability that x_{j+1} = z given that x_j = x and a_j = a. For every state x and action a there is a reward function r(x, a). Let λ denote an admissible control policy (i.e., it is chosen based on a probability law such that it depends only on the history up to the present).
For a given initial state x ∈ X and admissible control policy λ, the infinite time horizon reward is defined to be

V^λ(x) = E^λ[ Σ_{j=0}^{∞} γ^j r(x_j, a_j) | x_0 = x ],

where the actions a_j for j ≥ 0 are chosen according to the policy λ and γ ∈ (0, 1] is the discount factor. Let V(x, a) be the reward given that we start at state x ∈ X, action a ∈ A is taken, and the optimal policy is subsequently used. As is well known (see for example [19]), V satisfies the Bellman equation

V(x, a) = r(x, a) + γ Σ_{z∈X} p(z|x, a) max_{a′∈A} V(z, a′),   (2.1)

where a*(x) = arg max_{a∈A} V(x, a) is an optimal policy. The Bellman equation (2.1) can be derived using the principle of optimality for dynamic programming.
In the finite time horizon case, for a given initial state x ∈ X and admissible control policy λ, the finite time horizon reward is defined to be

V^λ(x) = E^λ[ Σ_{j=0}^{J−1} γ^j r_j + γ^J r_J | x_0 = x ],

where r_j = r(j, x_j, a_j) for j = 0, 1, . . ., J − 1 and r_J = r(J, x_J). Similar to the infinite time horizon discount case, the optimal control a*(j, x) is given by the solution to the Bellman equation

V(j, x, a) = r(j, x, a) + γ Σ_{z∈X} p(z|x, a) max_{a′∈A} V(j + 1, z, a′),   j = 0, . . ., J − 1,
V(J, x, a) = r(J, x),   (2.3)

with the optimal control given by a*(j, x) = arg max_{a∈A} V(j, x, a).
In principle, the Bellman equations (2.1) and (2.3) can be solved to find the optimal control. However, there are two obstacles. First, the transition probability function p(z|x, a) (i.e., the state dynamics) may not be known. Secondly, even if it is known, the state space may be too high-dimensional for standard numerical methods to solve (2.1) and (2.3) due to the curse of dimensionality. For these reasons, reinforcement learning methods can be used to learn the solution to the Bellman equations (2.1) and (2.3).
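When p is known and the state space is small, the Bellman equation (2.1) can be solved directly by fixed-point iteration; the following sketch does exactly that on a toy MDP, and is the baseline that Q-learning replaces when p is unknown or the state space is large. The transition kernel, rewards, and sizes below are illustrative, not taken from the paper.

```python
import numpy as np

# A toy MDP: 3 states, 2 actions, discount factor gamma.
# p[a, x, z] = p(z | x, a); r[x, a] is the reward. All data is illustrative.
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 3, 2, 0.9
p = rng.random((n_actions, n_states, n_states))
p /= p.sum(axis=2, keepdims=True)          # rows are probability vectors
r = rng.random((n_states, n_actions))

# Fixed-point iteration on V(x,a) = r(x,a) + gamma * sum_z p(z|x,a) max_a' V(z,a')
V = np.zeros((n_states, n_actions))
for _ in range(1000):
    V_next = r + gamma * np.einsum('axz,z->xa', p, V.max(axis=1))
    if np.max(np.abs(V_next - V)) < 1e-12:
        break
    V = V_next

# The Bellman operator is a gamma-contraction, so the residual vanishes.
residual = np.max(np.abs(V - (r + gamma * np.einsum('axz,z->xa', p, V.max(axis=1)))))
print(residual < 1e-10)  # True
a_star = V.argmax(axis=1)  # optimal action per state, a*(x) = arg max_a V(x, a)
```

Because the Bellman operator is a γ-contraction in the sup-norm, the iteration converges geometrically to the unique solution of (2.1).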
Reinforcement learning approximates the solution to the Bellman equation with a function approximator, which typically is a neural network model.The parameters θ (i.e., the weights) of the neural network are estimated using the Q-learning algorithm.The neural network Q(x, a; θ) in Q-learning is referred to as a "Q-network".
The Q-learning algorithm attempts to minimize the objective function

L(θ) = E_{(x,a)∼π} [ (Y − Q(x, a; θ))² ],   (2.4)

where π(x, a) is a probability mass function (to be specified later on) which is strictly positive for every (x, a) ∈ X × A and the "target" Y is

Y = r(x, a) + γ max_{a′∈A} Q(x′, a′; θ),

with x′ the state visited after taking action a in state x. In the case of the infinite time horizon problem (and analogously for the finite time horizon problem), if L(θ) = 0, then Q(x, a; θ) is a solution to the Bellman equation (2.1). In practice, the hope is that the Q-learning algorithm will learn a model Q such that L(θ) is small and therefore Q(x, a; θ) is a good approximation for the Bellman solution V(x, a).
The Q-learning updates for the parameters θ are

θ_{k+1} = θ_k + α_N G_k,   (2.5)

where (x_k, a_k) is an ergodic Markov chain with π(x, a) as its limiting distribution and G_k is the stochastic update direction defined below. The Q-network, which models the value of a state x and action a, is the neural network

Q^N(x, a; θ) = (1/√N) Σ_{i=1}^{N} C^i σ(W^i · ζ),   (2.6)

where ζ = (x, a) ∈ R^d, d = d_x + d_a, and σ : R → R is the activation function. The parametric model (2.6) receives an input vector containing both the state and action in the enlarged Euclidean space R^d. This formulation is a common choice in practice; see for example [8]. Other variations of the parametric model (2.6), such as an input vector of the state and an output vector whose length is the number of possible actions, are of course possible and can also be studied using this paper's techniques. The number of hidden units is N and the output is scaled by a factor 1/√N, which is commonly used in practice and is called the "Xavier initialization" [10]. The set of parameters that must be estimated is θ = (C¹, . . ., C^N, W¹, . . ., W^N). In the infinite-time horizon case, the Q-learning algorithm for training the parameters θ is, for k = 0, 1, . . .,

C^i_{k+1} = C^i_k + α_N ( r(x_k, a_k) + γ max_{a′∈A} Q^N(x_{k+1}, a′; θ_k) − Q^N(x_k, a_k; θ_k) ) (1/√N) σ(W^i_k · ζ_k),
W^i_{k+1} = W^i_k + α_N ( r(x_k, a_k) + γ max_{a′∈A} Q^N(x_{k+1}, a′; θ_k) − Q^N(x_k, a_k; θ_k) ) (1/√N) C^i_k σ′(W^i_k · ζ_k) ζ_k,   (2.7)

where ζ_k = (x_k, a_k). We assume that the action a_k is sampled uniformly at random from all possible actions A (i.e., "pure exploration").
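For concreteness, here is a minimal numpy sketch of one iteration of the Q-learning update: the 1/√N-scaled network, the frozen-target error, and the resulting parameter updates. The dimensions, step size, tanh activation, and the set of candidate next inputs are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 200, 4                  # hidden units; d = dim of the input zeta = (x, a)
alpha, gamma = 1.0, 0.9
alpha_N = alpha / N            # learning rate alpha_N = alpha / N
C = rng.standard_normal(N)     # C^i_0: i.i.d., mean zero
W = rng.standard_normal((N, d))

def Q(zeta, C, W):
    """Q^N(zeta; theta) = (1/sqrt(N)) sum_i C^i sigma(W^i . zeta), sigma = tanh."""
    return np.sum(C * np.tanh(W @ zeta)) / np.sqrt(N)

def q_learning_step(zeta, next_zetas, reward, C, W):
    """One update: the target Y is treated as a constant in theta ("frozen")."""
    Y = reward + gamma * max(Q(z, C, W) for z in next_zetas)
    err = Y - Q(zeta, C, W)                      # frozen-target TD error
    s = np.tanh(W @ zeta)
    # dQ/dC^i = sigma(W^i.zeta)/sqrt(N); dQ/dW^i = C^i sigma'(W^i.zeta) zeta/sqrt(N)
    C_new = C + alpha_N * err * s / np.sqrt(N)
    W_new = W + alpha_N * err * (C * (1.0 - s ** 2))[:, None] * zeta[None, :] / np.sqrt(N)
    return C_new, W_new

zeta_k = rng.standard_normal(d)              # current (state, action) input
next_zetas = rng.standard_normal((3, d))     # (x_{k+1}, a') for each candidate a'
C, W = q_learning_step(zeta_k, next_zetas, reward=1.0, C=C, W=W)
```

Note that the gradient is taken only through Q(ζ_k; θ), not through the target Y, which is the source of the bias discussed below.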
In this paper, we study the asymptotic behavior of the Q-network Q^N(x, a; θ_k) as the number of hidden units N and the number of stochastic gradient descent iterates k go to infinity. As we will see, after appropriate scalings, the Q-network converges to the solution of a limiting ODE.
It is worthwhile noting that the Q-learning algorithm is similar to the stochastic gradient descent algorithm in that they both use stochastic samples to take training steps to minimize the objective function. However, unlike in the stochastic gradient descent algorithm (which we also discuss in Section 4), the Q-learning update directions G_k are not necessarily unbiased estimates of a descent direction for the objective function L(θ). The Q-learning algorithm calculates its update by taking the derivative of L(θ) while treating the target Y as a constant. Since Y actually depends upon θ, the update direction G_k is in general a biased estimate of the negative gradient of L(θ). This fact together with the presence of the neural network function approximator leads to certain difficulties in the proofs. We will return to this issue in Remark 3.6.
Let us next present the main results of the paper in Section 3.

Main results
In this section we present our main results. We start with the infinite time horizon setting. We consider the Q-network (2.6), which models the value of a state and action. The parameters θ for the Q-network are trained using the Q-learning algorithm (2.7). We prove that, as the number of hidden units and training steps become large, the Q-network converges in distribution to a random ordinary differential equation.

Assumption 3.1. Our results are proven under the following assumptions:
• The activation function σ ∈ C²_b(R), i.e. σ is twice continuously differentiable and bounded.
• The randomly initialized parameters (C^i_0, W^i_0) are i.i.d., mean-zero random variables with a distribution µ₀(dc, dw). We assume that µ₀ is absolutely continuous with respect to Lebesgue measure.
• The random variable C^i_0 is bounded and ⟨‖w‖, µ₀⟩ < ∞.
• The reward function r is uniformly bounded in its arguments.
• The Markov chain x_k has a limiting distribution π, namely the limit

lim_{n→∞} (1/n) Σ_{k=1}^{n} 1_{{x_k = x}},  given x_0 = z,

exists almost surely for all initial states z, is independent of the initial state z and equals π(x), with Σ_{x∈X} π(x) = 1 and π(x) > 0 for all x ∈ X.
• X and A are finite, discrete spaces.

We shall also assume that the action a ∈ A is sampled uniformly at random from all possible actions (referred to as "pure exploration"). The uniform distribution of the actions combined with the fact that x_k is assumed to have a limiting distribution π(x) imply that the Markov chain ζ_k = (x_k, a_k) will have limiting distribution π(x, a) = (1/K) π(x), where K = |A|. In addition, the Markov chain (x_{k+1}, x_k, a_k) will have π(x′, x, a) = p(x′|x, a)π(x, a) as its limiting distribution.

Assumption 3.2. Certain properties of the limit ODE also require the following assumptions:
• The activation function σ is non-polynomial (e.g., a tanh or sigmoid function).
• The support of µ₀ contains a set with positive Lebesgue measure.

Define the empirical measure

ν^N_k = (1/N) Σ_{i=1}^{N} δ_{(C^i_k, W^i_k)}.   (3.1)

In addition, let us set Q^N_k(x, a) = Q^N(x, a; θ_k) and define the scaled processes

µ^N_t = ν^N_{⌊Nt⌋},   h^N_t(x, a) = Q^N_{⌊Nt⌋}(x, a),   t ∈ [0, T].

By the central limit theorem, h^N_0 converges in distribution, as N → ∞, to G, where G is a mean-zero Gaussian random variable.
The variable h^N_t is the output of the neural network after (t/T) × 100% of the training has been completed. We will study convergence in distribution of the random process (µ^N_t, h^N_t)_{0≤t≤T} in D_E([0, T]), where D_S([0, T]) is the Skorokhod space of S-valued functions on [0, T] and M(S) is the space of probability measures on S.
Before presenting the first main convergence result, Theorem 3.4, we present a lemma stating that a certain matrix A which appears in the limit ODE is positive definite.

Lemma 3.3. Let Assumption 3.2 hold. Consider the matrix A with elements

A_{ζ,ζ′} = ⟨ σ(w · ζ)σ(w · ζ′) + c² σ′(w · ζ)σ′(w · ζ′) ζ · ζ′, µ₀(dc, dw) ⟩,   ζ = (x, a), ζ′ = (x′, a′) ∈ X × A.

Then, the matrix A is positive definite.

Proof. The proof is deferred to Section 7.
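Lemma 3.3 can be sanity-checked numerically: with σ = tanh (non-polynomial) and µ₀ a standard Gaussian, a Monte Carlo estimate of the matrix A is positive definite for distinct inputs. All distributions and sizes here are illustrative choices, not requirements of the lemma.

```python
import numpy as np

rng = np.random.default_rng(2)
d, M, n_mc = 3, 5, 200_000
zetas = rng.standard_normal((M, d))   # M distinct inputs zeta = (x, a)
c = rng.standard_normal(n_mc)         # samples of c under mu_0
w = rng.standard_normal((n_mc, d))    # samples of w under mu_0

s = np.tanh(w @ zetas.T)              # sigma(w . zeta), shape (n_mc, M)
ds = 1.0 - s ** 2                     # sigma'(w . zeta) for sigma = tanh
gram = zetas @ zetas.T                # zeta . zeta'

# A_{z,z'} ~ E[sigma(w.z)sigma(w.z')] + E[c^2 sigma'(w.z)sigma'(w.z')] (z.z')
A = s.T @ s / n_mc + (((c ** 2)[:, None] * ds).T @ ds / n_mc) * gram

min_eig = np.linalg.eigvalsh((A + A.T) / 2).min()
print(min_eig > 0)  # True: A is (numerically) positive definite
```

The first term is a Gram matrix (positive semi-definite) and the second is a Hadamard product of positive semi-definite matrices, so positivity is consistent with the Schur product theorem; strict positivity uses the non-polynomial activation and distinct inputs.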
We then have the following theorem.

Theorem 3.4. Let Assumption 3.1 hold and let the learning rate be α_N = α/N. Then, as N → ∞, the process (µ^N_t, h^N_t)_{0≤t≤T} converges in distribution in D_E([0, T]) to (µ_t, h_t)_{0≤t≤T}, where µ_t = µ₀ for t ∈ [0, T], h_0 = G, and h_t satisfies the random ODE

dh_t(ζ)/dt = α Σ_{ζ′=(x′,a′)∈X×A} π(ζ′) ( r(ζ′) + γ Σ_{x″∈X} p(x″|x′, a′) max_{a″∈A} h_t(x″, a″) − h_t(ζ′) ) A_{ζ,ζ′},   (3.2)

where the tensor A is as in Lemma 3.3. Furthermore, if Assumption 3.2 holds, (3.2) has a unique stationary point which equals the solution V of the Bellman equation (2.1).
Proof. The proof of this result is in Section 5.

Theorem 3.4 shows that there is a single fixed point for the limit dynamics of the Q-network. Moreover, this unique fixed point is the solution to the Bellman equation (2.1) and therefore gives the optimal control. This is interesting since the pre-limit neural network is a non-convex function and therefore there are potentially many fixed points.
Theorem 3.4 does not prove that h_t converges to V. It only shows that, if h_t converges to a limit point, that limit point must be V, the solution of the Bellman equation. We are able to prove convergence in the following lemma for small γ (i.e., when the nonlinearity is not too strong).

Lemma 3.5. Let Assumptions 3.1 and 3.2 hold, and suppose that the discount factor γ is sufficiently small, in a sense that depends on the matrix A and on K, where K is the number of possible actions in the set A. Then, we have lim_{t→∞} h_t = V.

Remark 3.6. Convergence of ODEs of the type (3.2) to solutions of the corresponding stationary equation has been studied in the literature in [3,37]. The difference between our case and these earlier works is the nature of the matrix A which appears in the ODE (e.g., see equation (3.2) with the matrix A). In previous papers such as [3,37], the matrix A is either an identity matrix or a diagonal matrix with diagonal elements uniformly bounded away from zero and with an upper bound of one. In our case, the Q-learning algorithm with a neural network produces an ODE with a matrix A that is not a diagonal matrix. The arguments of [3,37] do not establish convergence in the case where A is non-diagonal. Lemma 3.5 proves convergence for a non-diagonal matrix A for small γ.
Despite our best efforts we did not succeed in proving Lemma 3.5 for all 0 < γ < 1 in our general case with the non-diagonal matrix A, which is produced by the neural network approximator in the Q-learning algorithm. As we discussed in Section 2, the difficulties that arise here are also related to the fact that the Q-learning algorithm calculates its update by taking the derivative of L(θ) while treating the target Y as a constant. Hence, the asymptotic dynamics of the Q-network as N and k grow to infinity, which is the solution to the ODE (3.2), may not necessarily move in the descent direction of the limiting objective function (this is in contrast to the standard regression problem with i.i.d. data that we study in Section 4).

However, as shown in Theorem 3.8 below, one can prove convergence for all values of the discount factor γ ∈ (0, 1] in the finite time horizon case. We are able to prove convergence for all γ ∈ (0, 1] because in the finite time horizon case one can study the large time limit of the limiting ODE recursively.
We now consider the finite time horizon problem. The Q-network, which models the value of state x and action a at time j, is

Q^N(j, x, a; θ) = (1/√N) Σ_{i=1}^{N} C^{i,j} σ(W^i · ζ),   j = 0, . . ., J − 1,

where ζ = (x, a) and σ : R → R. Note that the parameter W^i is shared across all times j. The model parameters θ are trained using the Q-learning algorithm. The parameter updates are given, for training iterations k = 0, 1, . . . and times j = 0, . . ., J − 1, by equations analogous to (2.7), with the target at time j built from r_{k,j} + γ max_{a′∈A} Q^N(j + 1, x_{k,j+1}, a′; θ_k), where the sampled trajectories (x_{k,j}, a_{k,j})_{j=0}^{J} are independent across k and r_{k,j} = r(j, x_{k,j}, a_{k,j}). For notational convenience, define ζ_{k,j} = (x_{k,j}, a_{k,j}).

Assumption 3.7. Certain properties of the limit ODE also require the following assumptions:
• The activation function σ is non-polynomial (e.g., tanh or sigmoid functions).
Similar to the infinite time horizon case, we define the scaled processes µ^N_t = ν^N_{⌊Nt⌋} and h^N_t(j, x, a) = Q^N(j, x, a; θ_{⌊Nt⌋}), and we study convergence in distribution of the random process (µ^N_t, h^N_t)_{0≤t≤T}. Let π_j(x, a) denote the limiting probability distribution of (x_{k,j}, a_{k,j}). We then have the following theorem.

Theorem 3.8. Let Assumption 3.1 hold and let the learning rate be α_N = α/N. Then, as N → ∞, (µ^N_t, h^N_t)_{0≤t≤T} converges in distribution to (µ_t, h_t)_{0≤t≤T}, where µ_t = µ₀ and, for each j = 0, 1, . . ., J − 1, h_t(j, ·, ·) satisfies a random ODE of the same form as (3.2), with terminal condition h_t(J, x, a) = r(J, x). The tensor A is as in Lemma 3.3. Furthermore, if Assumption 3.7 holds, the neural network converges to the solution of the Bellman equation (2.3):

lim_{t→∞} h_t(j, x, a) = V(j, x, a),  for all j and all (x, a) ∈ X × A.

Proof. The proof of this result is in Section 6.

A special case: neural networks and regression
The asymptotic approach developed in this paper can be used to study other popular cases in machine learning. For example, consider the case of the objective function (2.4) but with y_k now being independent samples from a fixed distribution. Then, (2.4) is simply the mean-squared error objective function for regression. Using the same techniques as we employ on (the more difficult) Q-learning problem discussed in the previous section, we can establish the asymptotic behavior of neural network models used in regression. Let the neural network be

g^N(x; θ) = (1/√N) Σ_{i=1}^{N} C^i σ(W^i · x),

where the data (X, Y) ∼ π(dx, dy), Y ∈ R, and the parameters are θ = (C¹, . . ., C^N, W¹, . . ., W^N). The model parameters θ are trained using stochastic gradient descent. The parameter updates are given by, for k = 0, 1, . . .,

C^i_{k+1} = C^i_k + α_N ( y_k − g^N(x_k; θ_k) ) (1/√N) σ(W^i_k · x_k),
W^i_{k+1} = W^i_k + α_N ( y_k − g^N(x_k; θ_k) ) (1/√N) C^i_k σ′(W^i_k · x_k) x_k,

where α_N is the learning rate (which may depend upon N). The data samples (x_k, y_k) are i.i.d. samples from the distribution π(dx, dy).
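A minimal sketch of this stochastic gradient descent iteration, with the 1/√N Xavier scaling and α_N = α/N; the dataset, target function, and sizes are illustrative assumptions. Running for k ≈ tN steps plays the role of time t in the limiting dynamics.

```python
import numpy as np

rng = np.random.default_rng(3)
N, d, M = 200, 2, 5
alpha = 1.0
X = rng.standard_normal((M, d))          # fixed dataset of M samples
Yv = np.sin(X[:, 0])                     # illustrative regression targets
C = rng.standard_normal(N)
W = rng.standard_normal((N, d))

def g(x, C, W):
    """g^N(x; theta) = (1/sqrt(N)) sum_i C^i tanh(W^i . x)."""
    return np.sum(C * np.tanh(W @ x)) / np.sqrt(N)

mse0 = np.mean([(Yv[i] - g(X[i], C, W)) ** 2 for i in range(M)])
for k in range(300 * N):                 # k/N plays the role of time t
    i = rng.integers(M)                  # (x_k, y_k) i.i.d. uniform on the dataset
    x, err = X[i], Yv[i] - g(X[i], C, W)
    s = np.tanh(W @ x)
    # SGD with alpha_N = alpha/N; each gradient carries an extra 1/sqrt(N)
    C += (alpha / N) * err * s / np.sqrt(N)
    W += (alpha / N) * err * (C * (1.0 - s ** 2))[:, None] * x[None, :] / np.sqrt(N)
mse = np.mean([(Yv[i] - g(X[i], C, W)) ** 2 for i in range(M)])
print(mse < mse0)  # training error decreases along the (stochastic) gradient flow
```

The training error decreases as predicted by the limiting dynamics described below.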
In Theorem 4.2 we prove that a neural network with the Xavier initialization (i.e., with the 1/√N normalization in the formula for g^N(x; θ)) and trained with stochastic gradient descent converges in distribution to a random ODE as the number of units and training steps become large. Although the pre-limit problem of optimizing a neural network is non-convex (and therefore the neural network may converge to a local minimum), the limit equation minimizes a quadratic objective function. In Theorem 4.3, we also show that the neural network (in the limit) will converge to a global minimum with zero loss on the training set. Convergence to a global minimum for a neural network has been recently proven in [6], [7], and [38]. Our result shows that convergence to a global minimum can also be viewed as a simple consequence of the mean-field limit for neural networks.
For completeness, we also mention here that other scaling regimes have also been studied in the literature. In particular, [4,25,29,32,33] study the asymptotics of single-layer neural networks with a 1/N normalization, while [34] studies the asymptotics of deep (i.e., multi-layer) neural networks with a 1/N normalization in each hidden layer. The 1/N normalization is convenient since the single-layer neural network is then in a traditional mean-field framework where it can be described via an empirical measure of the parameters. In the single layer case, the limit for the neural network satisfies a partial differential equation. As discussed in [25], it is not necessarily true that the limiting equation (a PDE in this case) will converge to the global minimum of an objective function with zero training error. However, the 1/√N normalization that we study in this paper is more widely used in practice (see [10]) and, importantly, as we demonstrate in Theorem 4.3, the limit equation converges to a global minimum with zero training error. Lastly, we mention here that [15] proved, using different methods, a limit for neural networks with a 1/√N Xavier initialization when they are trained with continuous-time gradient descent. Our result in Theorem 4.2 proves a limit for neural networks trained with the (standard) discrete-time stochastic gradient descent algorithm which is used in practice, and rigorously passes from discrete time (where the stochastic gradient descent updates evolve) to continuous time.
Assumption 4.1. We impose the following assumptions:
• The activation function σ is twice continuously differentiable and bounded.
• The randomly initialized parameters (C^i_0, W^i_0) are i.i.d., mean-zero random variables with a distribution µ₀(dc, dw).
• The random variable C^i_0 has compact support and ⟨‖w‖, µ₀⟩ < ∞.
• The sequence of data samples (x_k, y_k) is i.i.d. from the probability distribution π(dx, dy). In particular, there is a fixed dataset of M data samples (x^{(i)}, y^{(i)})_{i=1}^{M} and therefore

π(dx, dy) = (1/M) Σ_{i=1}^{M} δ_{(x^{(i)}, y^{(i)})}(dx, dy).

Note that the last assumption also implies that π(dx, dy) has compact support. Following the asymptotic procedure developed in this paper, we can study the limiting behavior of the network output g^N_k(x) = g^N(x; θ_k) for x ∈ D = {x^{(1)}, . . ., x^{(M)}} as the number of hidden units N and stochastic gradient descent steps k simultaneously become large, after appropriately relating k and N. The network output converges in distribution to the solution of a random ODE as N → ∞.
For this purpose, let us recall the empirical measure defined in (3.1). Note that the neural network output can be written as the inner-product

g^N(x; θ_k) = √N ⟨ c σ(w · x), ν^N_k ⟩.

Due to Assumption 4.1, as N → ∞ and for x ∈ D,

g^N_0(x) = (1/√N) Σ_{i=1}^{N} C^i_0 σ(W^i_0 · x)  ⇒  G(x),   (4.1)

where G ∈ R^M is a Gaussian random variable. We also of course have that ν^N_0 → µ₀ weakly as N → ∞. Define the scaled processes µ^N_t = ν^N_{⌊Nt⌋} and h^N_t(x) = g^N_{⌊Nt⌋}(x) for x ∈ D. Now, we are ready to state the main result of this section, Theorem 4.2.

Theorem 4.2. Let Assumption 4.1 hold and set α_N = α/N. Then, as N → ∞, (µ^N_t, h^N_t)_{0≤t≤T} converges in distribution to (µ_t, h_t)_{0≤t≤T}, where µ_t = µ₀, h_0 = G, and h_t satisfies the random ODE

dh_t(x)/dt = α E_{(X′,Y′)∼π} [ ( Y′ − h_t(X′) ) ⟨ σ(w · x)σ(w · X′) + c² σ′(w · x)σ′(w · X′) x · X′, µ_t(dc, dw) ⟩ ].   (4.2)

Proof. The proof of this theorem is omitted because it is exactly parallel to the proof of Theorem 3.4.

Recall that G ∈ R^M is a Gaussian random variable; see equation (4.1). In addition, note that µ_t in the limit equation (4.2) is a constant, i.e. µ_t = µ₀ for t ∈ [0, T]. Therefore, (4.2) reduces to

dh_t(x)/dt = α E_{(X′,Y′)∼π} [ ( Y′ − h_t(X′) ) ⟨ σ(w · x)σ(w · X′) + c² σ′(w · x)σ′(w · X′) x · X′, µ₀(dc, dw) ⟩ ].   (4.3)

Since (4.3) is a linear ODE for the finite-dimensional vector (h_t(x))_{x∈D}, the solution h_t is unique.

To better understand (4.3), define the matrix A ∈ R^{M×M} with elements

A_{x,x′} = ⟨ σ(w · x)σ(w · x′) + c² σ′(w · x)σ′(w · x′) x · x′, µ₀(dc, dw) ⟩,

where x, x′ ∈ D. A is finite-dimensional since we fixed a training set of size M in the beginning. Then, (4.3) becomes

dh_t/dt = (α/M) A ( Ŷ − h_t ),

where Ŷ = (y^{(1)}, . . ., y^{(M)}) and h_t = (h_t(x^{(1)}), . . ., h_t(x^{(M)})). Therefore, h_t is the solution to a continuous-time gradient descent algorithm which minimizes a quadratic objective function.
Therefore, even though the pre-limit optimization problem is non-convex, the neural network's limit will minimize a quadratic objective function.
An interesting question is whether h_t → Ŷ as t → ∞. That is, in the limit of large numbers of hidden units and many training steps, does the neural network model converge to a global minimum with zero training error? Theorem 4.3 shows that h_t → Ŷ as t → ∞ if A is positive definite. Lemma 3.3 proves that, under reasonable hyperparameter choices and if the data samples are distinct, A will be positive definite.

Theorem 4.3. Let Assumption 4.1 hold and suppose that the matrix A is positive definite. Then, h_t → Ŷ as t → ∞.
Proof. Consider the transformation h̃_t = h_t − Ŷ. Then,

dh̃_t/dt = −(α/M) A h̃_t,   so that   h̃_t = e^{−(α/M) A t} h̃_0.

Then, h̃_t → 0 (and consequently h_t → Ŷ) as t → ∞ if A is positive definite. Lemma 3.3 proves that A is positive definite under Assumption 3.2 and if the data samples are distinct.
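The argument in this proof can be checked numerically: since the transformed dynamics are linear, the solution decays to zero for any positive definite matrix. Below, a random symmetric positive definite matrix stands in for the limit matrix A, and the step size constant is absorbed into α; both are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
M, alpha = 4, 1.0
B = rng.standard_normal((M, M))
A = B @ B.T + np.eye(M)              # symmetric positive definite stand-in
h0 = rng.standard_normal(M)          # h~_0 = h_0 - Y_hat

lam, U = np.linalg.eigh(A)           # A = U diag(lam) U^T with lam > 0
def h_tilde(t):
    """Closed-form solution h~_t = exp(-alpha A t) h~_0 of dh~/dt = -alpha A h~."""
    return U @ (np.exp(-alpha * lam * t) * (U.T @ h0))

print(np.allclose(h_tilde(0.0), h0))          # True: matches the initial condition
print(np.linalg.norm(h_tilde(50.0)) < 1e-10)  # True: h_t -> Y_hat as t grows
```

Each eigenmode of h̃_t decays at rate α λ_i, so the slowest decay is governed by the smallest eigenvalue of A, which is positive exactly when A is positive definite.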
In connection to Theorem 4.3 we mention that the data samples in the dataset will be distinct with probability 1 if the random variable X has a probability density function.

Proof of Convergence in Infinite time Horizon Case
In this section we prove Theorem 3.4. The proof is divided into several parts. Let ρ^N be the probability measure of (µ^N, h^N)_{0≤t≤T}. In Section 5.1 we write the pre-limit in a form that is convenient in order to establish the desired limiting behavior. In Section 5.2, we prove that any limit point of ρ^N is a probability measure of the random ODE (3.2). In Section 5.3, we prove that the sequence ρ^N is relatively compact (which implies that there is a subsequence ρ^{N_k} which weakly converges). In Section 5.4, we prove that the limit point is unique. These results are collected together in Section 5.5 to prove that (µ^N, h^N) converges in distribution to (µ, h).

Evolution of the Pre-limit Process
For notational convenience, we Taylor expand the parameter updates, for points W^{i,*}_k and W^{i,*,*}_k in the line segment connecting the points W^i_k and W^i_{k+1}. Let α_N = α/N. Substituting (2.7) into (5.1) yields an evolution equation with remainder O_p(N^{−3/2}); see (5.2). We can re-write the evolution of Q^N_k and, using (5.3), we can write the evolution of h^N_t(ζ) for t ∈ [0, T]. This can then be rewritten as a drift term plus the fluctuation terms M^{1,N}_t(ζ), M^{2,N}_t(ζ), and M^{3,N}_t(ζ). Later on, in Lemma 5.3, we prove that the fluctuation terms M^{i,N}_t(ζ) go to zero in L¹ as N → ∞. The evolution of the empirical measure ν^N_k can be characterized in terms of its projections onto test functions; see (5.7). Similarly, we can also obtain (5.8).

Identification of the Limit
We must first establish that M^{1,N}_t, M^{2,N}_t, M^{3,N}_t → 0 in probability as N → ∞. For this purpose, we first prove two lemmas.

Lemma 5.1. Consider a Markov chain z_k on a finite, discrete space S with a unique limiting distribution q(z) and a random function f^N : S → R. Suppose f^N is uniformly bounded in L² with respect to N. Then, the corresponding ergodic averages converge.

The proof of Lemma 5.1 should be known. However, given that we could not locate an exact reference, we provide its short proof in Appendix A.

Lemma 5.2. Consider the notation and assumptions of Lemma 5.1, and define the quantity of interest, where the function involved satisfies (5.9). Then the stated limit holds.

Proof. For any K ∈ N and ∆ = t/K, we have the decomposition (5.10), where the term o(1) goes to zero, at least in L¹, as N → ∞. We will need to show (5.11) for each j = 0, 1, . . ., K − 1. This can be proven in the following way. Combining (5.11) and Lemma 5.1, we can show the desired convergence for each j = 0, 1, . . ., K − 1. We next consider the second term in (5.10). To bound this term, we will use the assumption (5.9), obtaining a bound of the form C∆.

Collecting our results, we have shown the lim sup bound above. Note that K was arbitrary. Consequently, we obtain the stated limit, concluding the proof of the lemma.
This now allows us to prove the following lemma.
Proof. The process Q^N_k(x, a) satisfies the uniform L² bound in equation (5.9) due to Lemma 5.6. It also satisfies the regularity condition in equation (5.9). Indeed, recalling the notation ζ = (x, a) and ζ_k = (x_k, a_k), we obtain the required estimate, where we have used the bounds from Lemmas 5.5 and 5.6, the boundedness of σ(·) and σ′(·), and the Cauchy–Schwarz inequality.

In addition, a similar estimate holds, where we have used the bound (5.12). The term B^N_{x,a,x′,a′,k} that appears in the formula for M^{1,N}_t can be treated analogously using (5.7) and Lemma 5.6. The result for M^{1,N}_t then immediately follows from Lemmas 5.1 and 5.2 and the triangle inequality. Using the same approach, one can obtain the claim for M^{2,N}_t and M^{3,N}_t; the proof is omitted due to the similarity of the argument.

Let ρ^N be the probability measure of (µ^N, h^N)_{0≤t≤T}. Each ρ^N takes values in the set of probability measures M(D_E([0, T])). Relative compactness, proven in Section 5.3, implies that there is a subsequence ρ^{N_k} which weakly converges. We must prove that any limit point ρ of a convergent subsequence ρ^{N_k} will satisfy the evolution equation (3.2).

Lemma 5.4. Let ρ^{N_k} be a convergent subsequence with a limit point ρ. Then, ρ is a Dirac measure concentrated on (µ, h) ∈ D_E([0, T]) and (µ, h) satisfies equation (3.2).

Relative Compactness
In this section we prove that the family of processes {µ^N, h^N}_N is relatively compact. Section 5.3.1 proves compact containment. Section 5.3.2 proves regularity. Section 5.3.3 combines these results to prove relative compactness.

Compact Containment
We first establish a priori bounds for the parameters (C^i_k, W^i_k).

Lemma 5.5. For all i ∈ N and all k such that k/N ≤ T, the parameters (C^i_k, W^i_k) are bounded in expectation, uniformly in N.

Proof. The unimportant finite constant C < ∞ may change from line to line. We first observe an increment bound, where the last inequality follows from the definition of Q^N_k(x, a) and the uniform boundedness assumption on σ(·).

We subsequently obtain a recursive estimate. By the discrete Gronwall lemma and using k/N ≤ T, the bound (5.15) follows. Note that the constants may depend on T. We can now combine the bounds (5.15) and (5.14) to yield, for any 0 ≤ k ≤ ⌊NT⌋, the estimate (5.16), where the last inequality follows from the random variables C^i_0 taking values in a compact set. Now, we turn to the bound for W^i_k. We start with a bound obtained using Young's inequality, for a constant C < ∞ that may change from line to line. Taking an expectation, using Assumption 3.1, the bound (5.16), and the fact that k/N ≤ T, we obtain the claimed bound for all i ∈ N and all k such that k/N ≤ T, concluding the proof of the lemma.
Using the bounds from Lemma 5.5, we can now establish a bound for Q^N_k(x, a) for (x, a) ∈ X × A.

Lemma 5.6. For all i ∈ N and all k such that k/N ≤ T, Q^N_k(x, a) is uniformly bounded in L².

Proof. Recall equation (5.2), which describes the evolution of Q^N_k(x, a), and the notation ζ = (x, a). Here C(ω) is a random variable (independent of N) that is bounded in the mean square sense. This leads to an increment bound, whose two sides we now square; the last line uses Young's inequality. Therefore, using a telescoping series and then taking expectations, we obtain the estimate (5.17). Recall that (C^i_0, W^i_0) are i.i.d., mean-zero random variables. Substituting the resulting bound into equation (5.17) produces the desired bound.

We now prove compact containment for the process (µ^N_t, h^N_t).

Proof. For each L > 0, define K_L = [−L, L]^{1+d}. Then K_L is a compact subset of R^{1+d} and, for each t ≥ 0 and N ∈ N, a tail estimate holds by Markov's inequality and the bounds from Lemma 5.5. We define compact subsets indexed by j ∈ N and observe that, since Σ_j (L + j)^{−3/2} → 0 as L → ∞, for each η > 0 there exists a compact set K̂_L such that the containment sup_{N∈N, 0≤t≤T} of µ^N_t holds with probability at least 1 − η. Due to Lemma 5.6 and Markov's inequality, we also know that, for each η > 0, there exists a compact set containing h^N_t with probability at least 1 − η. Therefore, for each η > 0, there exists a compact set K̂_L with the desired containment property.

Regularity
We now establish regularity of the process µ^N_t.

Proof. We start by noticing that a Taylor expansion gives, for 0 ≤ s < t ≤ T, an expression involving points C̃^i and W̃^i in the segments connecting C^i_{⌊Ns⌋} with C^i_{⌊Nt⌋} and W^i_{⌊Ns⌋} with W^i_{⌊Nt⌋}, respectively. Let us now establish a bound on the first term, where Assumption 3.1 was used as well as the bounds from Lemmas 5.5 and 5.6. Let us now establish a bound on the second term, for s < t ≤ T with 0 < t − s ≤ δ < 1. We obtain an analogous estimate, where we have again used the bounds from Lemmas 5.5 and 5.6. Now, we return to equation (5.18). Due to Lemma 5.5, the quantities (C̃^i, W̃^i) are bounded in expectation for 0 < s < t ≤ T. Therefore, for 0 < s < t ≤ T with 0 < t − s ≤ δ < 1, the desired bound holds, where C < ∞ is some unimportant constant. Then, the statement of the Lemma follows.
We next establish regularity of the process h^N_t in D_{R^M}([0, T]). For the purposes of the following lemma, let q(z₁, z₂) = min{‖z₁ − z₂‖, 1}.

Lemma 5.9. For any δ ∈ (0, 1), there is a constant C < ∞ such that the regularity estimate of the lemma holds for 0 ≤ s < t ≤ T with t − s ≤ δ.

Proof. Recall the evolution equation for h^N_t. This yields a bound in which we have used the boundedness of σ′(·) (from Assumption 3.1) and the bounds from Lemma 5.5. Taking expectations and using the bounds (5.19) and (5.20), we obtain (5.21). Therefore, the claimed regularity estimate holds, and the statement of the Lemma then follows.

Combining our results to prove relative compactness
Lemma 5.10. The family of processes $\{\mu^N, h^N\}_{N \in \mathbb{N}}$ is relatively compact in $D_E([0, T])$.
Proof. Combining Lemmas 5.7 and 5.8 with Theorem 8.6 of Chapter 3 of [9] proves that $\{\mu^N\}_{N \in \mathbb{N}}$ is relatively compact in $D_{\mathcal{M}(\mathbb{R}^{1+d})}([0, T])$. (See also Remark 8.7 B of Chapter 3 of [9] regarding replacing $\sup_N$ with $\limsup_N$ in the regularity condition (b) of Theorem 8.6.) Similarly, combining Lemmas 5.7 and 5.9 proves that $\{h^N\}_{N \in \mathbb{N}}$ is relatively compact in $D_{\mathbb{R}^M}([0, T])$.
From these, we finally obtain that $\{\mu^N, h^N\}_{N \in \mathbb{N}}$ is relatively compact as a $D_E([0, T])$-valued random variable, where $E = \mathcal{M}(\mathbb{R}^{1+d}) \times \mathbb{R}^M$.

Uniqueness
We prove uniqueness of the limit equation (3.2) for $h_t$. Suppose there are two solutions $h^1_t$ and $h^2_t$, and define their difference $\phi_t = h^1_t - h^2_t$. Recall the tensor $A$. For notational convenience, write $\zeta = (x, a)$ and $\zeta' = (x', a')$. The matrix $A$ is positive definite; see Section 7 for the proof. Then $\phi_t$, evaluated at the point $\zeta$, i.e. $\phi_t(\zeta)$, satisfies the stated equation. The latter, using (5.22) and the boundedness of the elements $A_{\zeta,\zeta'}$, implies an integral inequality, and an application of Gronwall's inequality proves that $\phi_t(\zeta) = 0$ for all $0 \le t \le T$ and all $\zeta \in \mathcal{X} \times \mathcal{A}$. Therefore, the solution $h_t$ is indeed unique.
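The Gronwall step can be made explicit. A sketch, assuming the implied integral inequality takes the generic form below (with $C$ a constant depending on the bounded entries $A_{\zeta,\zeta'}$) and using $\phi_0 = h^1_0 - h^2_0 = 0$:

```latex
% Sketch of the Gronwall argument for uniqueness (constants are generic).
\sup_{\zeta} |\phi_t(\zeta)|
  \;\le\; C \int_0^t \sup_{\zeta'} |\phi_s(\zeta')| \, ds
\quad\Longrightarrow\quad
\sup_{\zeta} |\phi_t(\zeta)|
  \;\le\; \Big( \sup_{\zeta} |\phi_0(\zeta)| \Big) e^{C t} = 0 ,
\qquad 0 \le t \le T .
```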

Proof of Convergence
We now combine the results of Sections 5.2 and 5.3 to prove Theorem 3.8. Let $\rho^N$ be the probability measure corresponding to $(\mu^N, h^N)$. Each $\rho^N$ takes values in the set of probability measures $\mathcal{M}\big(D_E([0, T])\big)$. Relative compactness, proven in Section 5.3, implies that every subsequence $\rho^{N_k}$ has a further subsequence $\rho^{N_{k_m}}$ which converges weakly. Section 5.2 proves that any limit point $\rho$ of $\rho^{N_{k_m}}$ satisfies the evolution equation (3.2). Equation (3.2) has a unique solution (proven in Section 5.4). Therefore, by Prokhorov's theorem, $\rho^N$ converges weakly to $\rho$, where $\rho$ is the distribution of $(\mu, h)$, the unique solution of (3.2). That is, $(\mu^N, h^N)$ converges in distribution to $(\mu, h)$.
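The subsequence logic here is the standard one: relative compactness plus a unique characterization of every limit point forces convergence of the full sequence. Schematically:

```latex
% Every subsequence has a further subsequence converging to the SAME limit,
% hence the full sequence converges weakly (standard consequence of
% Prokhorov's theorem and uniqueness of the limit point).
\forall\, (N_k) \ \exists\, (N_{k_m}) : \ \rho^{N_{k_m}} \Rightarrow \rho ,
\qquad \rho \ \text{unique}
\quad\Longrightarrow\quad
\rho^{N} \Rightarrow \rho = \mathrm{Law}(\mu, h) .
```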

Analysis of the Limit Equation
It is easy to show that the limit equation (3.2) has a unique stationary point $h = V$, where $V$ is the solution of the Bellman equation (2.1). We define $\zeta$, $\zeta'$, and $A_{\zeta,\zeta'}$ as in Section 5.4. Any stationary point $h$ of (3.2) must satisfy equation (5.24). Equation (5.24) is exactly the Bellman equation (2.1), which has the unique solution $V$. Therefore, $h_t$ has a unique stationary point, which equals the solution $V$ of the Bellman equation.

We now prove convergence of $h_t$ to $V$ for small $\gamma$. Define $\phi_t = h_t - V$, where $V$ is the unique solution to the Bellman equation (2.1). We also define the matrix $G$, where $\odot$ denotes the element-wise product. The matrix $A$ is positive definite; thus $A^{-1}$ exists and is also positive definite. Define the process $Y_t$. Then, in the resulting expression, $\phi_t^2$ denotes the element-wise square $\phi_t^2 = \phi_t \odot \phi_t$. Let us now study the second term in equation (5.25). Let $\Gamma_t := \gamma \phi_t^\top (\pi \odot G_t)$. We can bound this second term, and consequently, supposing $\gamma < \frac{2}{1+K}$, there exists an $\epsilon > 0$ such that $Y_t$ is decreasing in time $t$ and, since $A$ is positive definite, is bounded below by zero. We also have an upper bound, using Young's inequality and the finite number of states in $\mathcal{X} \times \mathcal{A}$, with a constant $C > 0$. This leads to the lower bound $\phi_t^\top \phi_t \ge Y_t / C$ and a differential bound with constant $C_0 > 0$. By Gronwall's inequality, $\lim_{t\to\infty} Y_t = 0$, concluding the proof of Lemma 3.5 due to the positive-definiteness of the matrix $A$.
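The decay step can be summarized as follows. A sketch, assuming the differential bound with constant $C_0 > 0$ takes the linear form shown:

```latex
% Lyapunov decay: linear differential inequality + Gronwall => exponential decay.
\frac{dY_t}{dt} \le -C_0 \, Y_t
\quad\Longrightarrow\quad
Y_t \le Y_0 \, e^{-C_0 t} \xrightarrow[t \to \infty]{} 0 ,
\qquad
Y_t = \tfrac{1}{2}\, \phi_t^\top A^{-1} \phi_t ,
```

and since $A^{-1}$ is positive definite, $Y_t \to 0$ forces $\phi_t = h_t - V \to 0$ componentwise.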
Proof of Convergence in Finite Time Horizon Case

In this section we address the proof of Theorem 3.8. The proof for the finite time horizon case is essentially the same as the proof for the infinite time horizon case. The main difference is that, in the finite horizon case, we can prove for any $0 < \gamma < 1$ that the solution $h_t$ of the limit equation converges to the Bellman equation solution $V$ as $t \to \infty$.
Let us begin by calculating the pre-limit evolution of the neural network output $Q^N_k(j, x, a)$. For convenience, let $\zeta = (x, a)$.
We can then re-write the evolution of $Q^N_k(j, x, a)$ in terms of the empirical measure $\nu^N_k$.
Equation (6.3) leads to an evolution equation for the re-scaled process $h^N_t$, where the coefficients $A^0$ and $A^{j,m}$ are as displayed. The fluctuation term $M^N_t(j, x, a)$ takes the analogous form. Using the same analysis as in the infinite time horizon case (see Lemma 5.3), we can show that $M^N_t \xrightarrow{p} 0$ as $N \to \infty$.
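The mechanism, as in Lemma 5.3, is a second-moment bound followed by Chebyshev's inequality. A sketch, with a constant $C$ collecting the bounds of Lemmas 5.5 and 5.6:

```latex
% Chebyshev: an O(1/N) second moment forces convergence in probability.
\mathbb{E}\big[ \| M^N_t \|^2 \big] \;\le\; \frac{C}{N}
\quad\Longrightarrow\quad
\mathbb{P}\big( \| M^N_t \| > \epsilon \big)
\;\le\; \frac{C}{N \epsilon^2} \xrightarrow[N \to \infty]{} 0 ,
\qquad \epsilon > 0 .
```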

Identification of the Limit, Relative Compactness, and Uniqueness
Let $\rho^N$ be the probability measure of a convergent subsequence of $(\mu^N, h^N)_{0 \le t \le T}$. Each $\rho^N$ takes values in the set of probability measures $\mathcal{M}\big(D_E([0, T])\big)$. We can prove the following results.

Lemma 6.1. The sequence $\rho^N$ is relatively compact in $\mathcal{M}\big(D_E([0, T])\big)$.
Proof. The result is obtained by following the same steps as in the proofs of Lemmas 5.7, 5.8, 5.9, and 5.10; therefore, the proof is not repeated here.

Lemma 6.2. Let $\rho^{N_k}$ be a convergent subsequence with a limit point $\rho$. Then, $\rho$ is a Dirac measure concentrated on $(\mu, h) \in D_E([0, T])$, and $(\mu, h)$ satisfies equation (3.4).
Proof. The proof is exactly the same as that of Lemma 5.4, and we do not repeat it here. We only note for completeness that, because the random variables $C^{i,j}_0, W^i_0$ are assumed to be mean-zero, independent random variables (see Assumption 3.1), the terms $A^{j,m}_{\zeta,\zeta'}(s)$ with $m \neq j$ vanish in the limit as $N \to \infty$ in the expression (6.4).

Lemma 6.3. The solution $(\mu, h)$ to equation (3.4) is unique.
Proof.The proof follows the same steps as in Section 5.4, and we do not repeat it here.
In fact, using induction, we can prove that $\lim_{t\to\infty} \phi_t(j, x, a) = 0$ for $j = 0, 1, \ldots, J$. Indeed, let us assume that $\lim_{t\to\infty} \phi_t(j+1, \zeta) = 0$ for each $\zeta \in \mathcal{X} \times \mathcal{A}$. Let $Y_t = \frac{1}{2} \phi_{t,j}^\top A^{-1} \phi_{t,j}$, where $\phi_{t,j} = \phi_t(j, \cdot)$. The process $Y_t$ satisfies the differential equation
$$
dY_t = \phi_{t,j}^\top A^{-1} \, d\phi_{t,j}
= - \phi_{t,j}^\top A^{-1} A \big( \pi_j \odot (\phi_{t,j} - \gamma G_{t,j+1}) \big) \, dt
= - \phi_{t,j}^\top \big( \pi_j \odot (\phi_{t,j} - \gamma G_{t,j+1}) \big) \, dt
= - \pi_j \cdot \phi_{t,j}^2 \, dt + \gamma \phi_{t,j}^\top (\pi_j \odot G_{t,j+1}) \, dt ,
$$
where the vector $G_{t,j+1}$ is given in terms of $\phi_t(j+1, \cdot)$. The resulting upper bound implies that $Y_t < 0$ for some $t > T_1$. However, $Y_t \ge 0$ for all $t \ge 0$, and thus this is a contradiction. Consequently, there exists a $T_2 > T_0$ such that $Y_{T_2} = \epsilon$. A similar contradiction argument, via (6.12), shows that, for any $\epsilon > 0$, there exists a $T_2 > 0$ such that $Y_t \le \epsilon$ for all $t \ge T_2$. Since $\epsilon$ is arbitrary, we have proven the claimed limit.

Recall that $(W, C) \sim \mu_0$. Since $C$ is a mean-zero random variable which is independent of $W$, the corresponding expectation simplifies. Note that if $\sigma(\cdot)$ is an odd function (e.g., the tanh function) and the distribution of $W$ is symmetric, then $A$ is a covariance matrix.
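Returning to the induction step above: the cross term $\gamma \phi_{t,j}^\top (\pi_j \odot G_{t,j+1})$ is asymptotically negligible because $G_{t,j+1}$ is built from $\phi_t(j+1, \cdot)$, which vanishes by the induction hypothesis. A sketch via Young's inequality, for any $\eta > 0$:

```latex
% Young's inequality splits the cross term; the second piece vanishes as
% t -> infinity by the induction hypothesis phi_t(j+1, .) -> 0, so the
% negative-definite term  -pi_j . phi_{t,j}^2  eventually dominates.
\gamma \, \phi_{t,j}^\top \big( \pi_j \odot G_{t,j+1} \big)
\;\le\;
\frac{\eta}{2} \, \| \phi_{t,j} \|^2
\;+\;
\frac{\gamma^2}{2 \eta} \, \| \pi_j \odot G_{t,j+1} \|^2 .
```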
To prove that $A$ is positive definite, we need to show that $z^\top A z > 0$ for every non-zero $z \in \mathbb{R}^M$.
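A sketch of the standard argument. Suppose, consistently with the covariance remark above, that $A_{\zeta,\zeta'} = \mathbb{E}_{\mu_0}\big[\sigma(W \cdot \zeta)\,\sigma(W \cdot \zeta')\big]$ (the precise form of $A$ is as defined in Section 5.4). Then positive semi-definiteness is immediate:

```latex
% The quadratic form is the second moment of a single random variable.
z^\top A z
= \sum_{\zeta, \zeta'} z_{\zeta} \, z_{\zeta'} \,
  \mathbb{E}\big[ \sigma(W \cdot \zeta) \, \sigma(W \cdot \zeta') \big]
= \mathbb{E}\Big[ \Big( \textstyle\sum_{\zeta} z_{\zeta} \,
  \sigma(W \cdot \zeta) \Big)^{2} \Big]
\;\ge\; 0 ,
```

with equality only if $\sum_\zeta z_\zeta \, \sigma(W \cdot \zeta) = 0$ almost surely; strict positivity then follows from the linear independence of the functions $w \mapsto \sigma(w \cdot \zeta)$, $\zeta \in \mathcal{X} \times \mathcal{A}$, in $L^2(\mu_0)$.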
Therefore, we obtain the claimed limit, concluding the proof of the lemma.

$\sup_{a \in \mathcal{A}} V(x, a) = \sup_{\lambda} W^{\lambda}(x)$, and the maximum expected reward $V$ satisfies the Bellman equation.

Theorem 4.3.
If Assumption 3.2 holds and the data samples are distinct, then the stated conclusion holds. Define $\zeta = (x, a)$ and $\zeta_k = (x_k, a_k)$. We study the evolution of $Q^N_k(x, a)$ during training.

for points $W^{i,*}_k$ and $W^{i,**}_k$ in the line segment connecting the points $W^i_k$ and $W^i_{k+1}$. Let $\alpha^N = \alpha_N$. Substituting (3.3) into (6.1) yields