Large Deviations for Small Noise Diffusions in a Fast Markovian Environment

A large deviation principle is established for a two-scale stochastic system in which the slow component is a continuous process given by a small noise finite dimensional Itô stochastic differential equation, and the fast component is a finite state pure jump process. Previous works have considered settings where the coupling between the components is weak in a certain sense. In the current work we study a fully coupled system in which the drift and diffusion coefficient of the slow component and the jump intensity function and jump distribution of the fast process depend on the states of both components. In addition, the diffusion can be degenerate. Our proofs use certain stochastic control representations for expectations of exponential functionals of finite dimensional Brownian motions and Poisson random measures together with weak convergence arguments. A key challenge is in the proof of the large deviation lower bound where, due to the interplay between the degeneracy of the diffusion and the full dependence of the coefficients on the two components, the associated local rate function has poor regularity properties.


Introduction
We study a stochastic system with two time scales where the slow scale evolution is described through a continuous stochastic process, given by a small noise finite dimensional Itô stochastic differential equation, and the fast component is given as a rapidly oscillating pure jump process. The two processes are fully coupled in that the drift and diffusion coefficient of the slow process and the jump intensity function and jump distribution of the fast process depend on the states of both components. Multiscale systems of the form considered in this work arise in many problems from systems biology, financial engineering, queuing systems, etc. For example, most cellular processes are inherently multiscale in nature, with reactions occurring at varying speeds. This is especially true in many genetic networks, where protein concentration, usually modeled by a small-noise diffusion process, is controlled by different genes rapidly switching between their respective active and inactive states [9]. The key characterizing feature of such slow-fast systems is that the fast component reaches its equilibrium state at much shorter time scales, over which the slow system effectively remains unchanged. This local equilibration phenomenon allows the approximation of the properties of the slow system by averaging out the coefficients over the local stationary distributions of the fast component. Such approximations yield a significant model simplification and are mathematically justified by establishing an appropriate averaging principle.
The averaging principle, which has its roots in the works of Laplace and Lagrange, has a long history of applications in celestial mechanics, oscillation theory, radiophysics, etc. For deterministic systems, the first rigorous results were obtained by Bogoliubov and Mitropolsky [3], and further developments and generalizations were subsequently carried out by Volosov, Anosov, Neishtadt, Arnold and others (for example, see [1,22]). The stochastic version of the theory originated with the seminal paper of Khasminskii [14] and was later advanced in the works of Freidlin, Liptser, Skorohod, Veretennikov, Wentzell and others (for example, see [13,24,25]). Stochastic averaging principles for various models arising from systems biology have been studied in [2,18,19]. As noted above, an averaging principle provides a model simplification in an appropriate scaling regime. In order to capture the approximation errors due to the use of such simplified models one needs a more precise asymptotic analysis. The goal of the current work is to study one such asymptotic result, which gives a large deviation principle (LDP) for the slow process as the parameter governing both the magnitude of the small noise in the diffusion component and the speed of the fast component approaches its limit. Such a result, in addition to providing estimates on the rate of convergence of the trajectories of the slow component to those of the averaged system, is a starting point for developing accelerated Monte Carlo schemes for the estimation of probabilities of rare events (cf. [11]).
For a two-scale system where both components are continuous processes given through finite dimensional Itô stochastic differential equations, the problem has been studied in [21,13,26,27]. In all these works the coupling between the two components is weak in a certain sense. By this we mean that either the slow component has no diffusion term [13,26], or the dynamics of the fast component does not depend on the slow one [21], or at least the diffusion coefficient of the fast component does not depend on the slow component [27]. A recent paper by Puhalskii [23] studies a large deviation principle for a fully coupled two-scale diffusion system. Under various conditions on the coefficients of the two diffusions, including in particular certain nondegeneracy conditions on the diffusion coefficients, that paper uses the exponential tightness and limit characterization approach of [12] to establish an LDP for the slow component.
For settings where the fast component is a jump process, there are only a few results. In [15,17] the authors study a large deviation principle for a two-scale system in which the slow diffusion component is modulated by a fast moving Markov chain (whose evolution does not depend on the slow component). An earlier paper, [16], considered a simpler case with no diffusion term in the equation for the slow component. This simpler case, under a somewhat more restrictive condition, was also studied by Freidlin and Wentzell in [13]. However, in all of these works the dynamics of the Markov chain do not depend on those of the slow diffusion component. Large deviation problems for general two-scale jump diffusions have recently been considered in [20]. The authors prove a large deviation principle for each fixed time t > 0 using the nonlinear semigroup and viscosity solution based approach developed in [12]; however, a process level large deviation result is not considered. One of the critical assumptions in this work is the validity of a comparison principle for a certain nonlinear Cauchy problem (see Theorem 3 therein). Verification of the comparison principle is in general a challenging task, which needs to be done on a case by case basis for different systems. In particular, the two examples in [20] (see Section 4 therein) where the comparison principle is shown to hold are of specific forms: for the first example, the evolution of the fast component does not depend on the slow component, whereas in the second example the fast component is a two state Markov chain whose stationary distribution does not depend on the state of the slow component. We also note that [20] makes the assumption that the jump coefficients are Lipschitz continuous in an appropriate sense. Such a property fails to hold for systems considered in the current paper; specifically, the integrand in the second equation in (2.3) is not Lipschitz continuous (in fact not even continuous).
As noted previously, the current paper studies a setting where the two components are fully coupled. Specifically, for fixed ε > 0, we consider a two component Markov process (X ε , Y ε ), where X ε is a d-dimensional continuous stochastic process given as the solution of a stochastic equation driven by an m-dimensional Brownian motion W, and Y ε is a process with a finite state space described in terms of a jump intensity function c(·, ·) and a probability transition kernel r(·, ·, dy), both of which depend on the states of X ε and Y ε . We make standard Lipschitz assumptions on the coefficients of the diffusion; however, we do not impose any non-degeneracy restrictions on the diffusion. In the setting we consider, methods based on approximations, exponential tightness estimates and Girsanov change of measure appear to be quite hard to implement. One of the main challenges in the analysis is due to the interplay between the possible degeneracy of the diffusion coefficient and the dependence of the various coefficients (b, a, c and r) on both components. In our approach we bypass discretizations and approximations by using certain variational representations of expectations of positive functionals of Brownian motions and Poisson random measures together with weak convergence techniques. The variational representations for these noise processes that we use were developed in [4,6] and have been previously used in proving large deviation principles for a variety of complex systems (see [5,7,8] and references therein). Using these representations, the proof of the upper bound reduces to proving the tightness and characterization of weak limit points of certain controlled versions of the state process X ε . We note that in the description of these controlled systems there are two types of controls: one that controls the drift of the Brownian noise and the other that controls the intensity of the underlying Poisson random measure through a random 'thinning' function. The presence of these two controls, coupled with the strong dependence of the coefficients on both components, makes the required asymptotic analysis challenging.
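The fully coupled slow-fast pair just described can be written schematically as follows; the scalings shown (√ε noise for the slow component, ε⁻¹ jump rates for the fast one) are the standard ones for this setting, and the display is an illustrative sketch rather than a verbatim copy of the system (2.3) given later.

```latex
% Schematic form of the fully coupled two-scale system:
% slow component: small-noise diffusion in R^d,
dX^\varepsilon(t) \;=\; b\big(X^\varepsilon(t),Y^\varepsilon(t)\big)\,dt
   \;+\; \sqrt{\varepsilon}\,a\big(X^\varepsilon(t),Y^\varepsilon(t)\big)\,dW(t),
   \qquad X^\varepsilon(0)=x_0;
% fast component: pure jump process on the finite set L,
% jumping from y to y' at the state-dependent rate
\varepsilon^{-1}\,c\big(X^\varepsilon(t),y\big)\,r\big(X^\varepsilon(t),y,\{y'\}\big).
```

Full coupling is visible here: the drift b and diffusion a depend on the current fast state, while the intensity c and kernel r depend on the current slow state.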
The main challenge in this work arises in the proof of the lower bound. When using the variational representations, the proof of the lower bound requires the construction of controls which lead to a prescribed limit trajectory with a prescribed cost. In particular, when multiple time scales are present, one generally needs to establish the convergence of the empirical measure of the fast variables to an a priori identified measure (which could depend on the state of the slow variables). A natural technique is to first show that the velocities of the trajectory can be made piecewise smooth (e.g., piecewise constant), so that transition probabilities associated with the fast variables can be treated as essentially constant over each interval where the velocity is continuous. Unfortunately, this smoothing in time of the state requires establishing regularity properties of the local rate function, which is the function L(x, β) when the rate function is written in the somewhat standard form I(ξ) = ∫_0^T L(ξ(s), ξ̇(s)) ds. It is the need for these regularity properties which leads to undesirable assumptions that may not in fact be necessary (e.g., nondegeneracy of a diffusion coefficient).
We will use a different method to establish convergence that does not rely on any smoothing in the time variable, and which in particular will allow for degenerate diffusion coefficients. This alternative approach instead slightly perturbs the controls used on the noise space (both the control of the Brownian term that directly impacts the slow variables and the control of the Poisson term determining the evolution of the fast variables), in such a way that the resulting mapping from controls into the state trajectory is unique. This uniqueness result is the key to the construction of near optimal controls for the prelimit process for which the appropriate convergence properties can be proved and from which the lower bound follows readily. The perturbation argument and resulting uniqueness, which is given in Proposition 4.1, is described in detail at the beginning of Section 5. The strategy for the proof of Proposition 4.1 is explained in Remark 4.2.
The rest of the paper is organized as follows. In Section 2 we give a precise mathematical formulation of the model and the statement of our main result. The large deviation upper bound is proved in Section 3. Section 4 constructs suitable near optimal controls and controlled trajectories with appropriate uniqueness properties. The large deviation lower bound is proved in Section 5.
Notation: The following mathematical notation and conventions will be used in the paper. For a Polish space S, we denote by P(S) (resp. M F (S)) the space of probability measures (resp. finite measures) on S equipped with the topology of weak convergence. We denote by C b (S) the space of real continuous and bounded functions on S. The space of continuous functions from [0, T ] to S, equipped with the uniform topology, will be denoted as C([0, T ] : S). For a bounded R d valued function f on S, we define ‖f‖ ∞ = sup x∈S ‖f (x)‖. For a finite set L, we denote by M(L) the space of real functions on L. The cardinality of such a set will be denoted as |L|. Given a probability function r : L → [0, 1] (i.e. ∑ x∈L r(x) = 1), we denote, abusing notation, ∑ x∈A r(x) by r(A) for all A ⊂ L and ∑ x∈L f (x)r(x) by ∫ L f (x)r(dx) for all f ∈ M(L). The space of Borel measurable maps from [0, T ] to a metric space S will be denoted as M([0, T ] : S). The infimum over an empty set is, by convention, taken to be ∞.

Mathematical Preliminaries and Main Result
For fixed ε > 0, we consider a two component Markov process {(X ε (t), Y ε (t))} 0≤t≤T with values in G = R d × L, where L = {1, . . . , |L|} is equipped with the usual operation of addition modulo |L|. A precise stochastic evolution equation for the pair (X ε , Y ε ) will be given below in terms of an m-dimensional Brownian motion and suitable Poisson random measures. However, roughly speaking, the pair (X ε , Y ε ) describes a jump-diffusion, where the diffusion component (namely X ε ) has "small noise" while the jump component (Y ε ) has jumps at rate O(ε −1 ). The drift and diffusion coefficients of the continuous component are given by suitable functions b : G → R d and a : G → R d×m . The evolution of the pure-jump fast component is described through a jump intensity function c : G → [0, ∞) and a transition probability function r. Our main assumptions on these functions are as follows.
Assumption 2.1.
1. There exists d lip ∈ (0, ∞) such that for all y, y ′ ∈ L and x, x ′ ∈ R d ,
2. c is a bounded function.

Let
ς .= sup From Assumption 2.1, for some κ 2 ∈ (0, ∞), for all x, x ′ ∈ R d , where ∆ denotes the symmetric difference. For each fixed x ∈ R d , the operator Π x acting on M(L) and defined by describes the generator of an L-valued Markov process. Let be the n-step transition probability kernel of the corresponding embedded chain. Define Recall that, from Assumption 2.1(2), ς .
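For an intensity function c and transition kernel r as above, a generator of this type acts on functions f ∈ M(L), for each fixed x, in the following standard form (shown here as an illustrative sketch consistent with the surrounding text, not a verbatim copy of the paper's display):

```latex
(\Pi_x f)(y) \;=\; c(x,y)\sum_{y'\in L} r\big(x,y,\{y'\}\big)\,
\big[f(y') - f(y)\big], \qquad y \in L,\ f \in \mathcal{M}(L).
```

Thus c(x, y) sets the overall jump rate out of state y, while r(x, y, ·) distributes the jump among the target states.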
We will make the following assumption.
Assumptions 2.1 and 2.3 will be taken to hold throughout this work and will not always be mentioned in the statements of various results. Let A be an |L| × |L| matrix with A ij = 1 if (i, j) ∈ T and 0 otherwise. Then Assumption 2.3 in particular says that the adjacency matrix A is irreducible.
The evolution of Y ε can be described through a stochastic differential equation driven by a finite collection of Poisson random measures, which is constructed as follows. For (i, j) ∈ T let N ij be a Poisson random measure (PRM) on [0, ζ] × [0, T ] × R + with intensity measure λ ζ ⊗ λ T ⊗ λ ∞ , where λ T (resp. λ ∞ ) denotes the Lebesgue measure on [0, T ] (resp. R + ), on some complete filtered probability space (Ω, F, P, {F t } 0≤t≤T ) such that for t ∈ [0, T ], and can be regarded as a random variable with values in M F ([0, ζ] × [0, T ]), the space of finite measures on [0, ζ] × [0, T ] equipped with the weak topology. The processes ( N ij ) (i,j)∈T are taken to be mutually independent. We also suppose that on this filtered probability space there is an m-dimensional F t -Brownian motion W = {W (t)} 0≤t≤T (which is then independent of N ). In terms of W and N ε −1 , the Markov process (X ε , Y ε ) ≡ {(X ε (t), Y ε (t))} 0≤t≤T with initial condition (x 0 , y 0 ) ∈ G is defined as the unique pathwise solution of the following system of equations: From unique pathwise solvability it follows that for every ε > 0, there exists a measurable map G ε . The following is an immediate consequence of our assumptions.
Theorem 2.4. For each x ∈ R d , there is a unique invariant probability measure ν(x) for the L-valued Markov process with generator Π x .
Proof. Since α > 0 and the transition probabilities of the L-valued Markov chain are Lipschitz continuous, it follows that p x yy ′ is Lipschitz continuous in x and inf y,y ′ ∈L inf x∈R d p x yy ′ > 0. Denote the unique invariant measure of this chain by π(x). From Lemma 3.1 in [13], π(x) is given as a ratio of polynomials in {p x yz } y,z∈L . Thus x → π y (x) is Lipschitz continuous for every y ∈ L (with Lipschitz constant depending on κ 2 , κ 3 ). The result now follows on observing that ν y (x) ∝ π y (x)/c(x, y), together with the Lipschitz property of x → c(x, y) and the properties ς > 0 and ς < ∞.
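The relation ν_y(x) ∝ π_y(x)/c(x, y) used in the proof can be checked numerically. The sketch below uses a hypothetical 3-state chain at a fixed x (the rates c and kernel r are illustrative choices, not taken from the paper): it computes the stationary distribution π of the embedded chain, forms ν ∝ π/c, and verifies that ν is invariant for the continuous-time generator with entries Q_{yy′} = c_y (r_{yy′} − δ_{yy′}).

```python
import numpy as np

# Hypothetical 3-state example (L = {0, 1, 2}) at a fixed x.
c = np.array([1.0, 2.0, 0.5])          # jump intensities c(x, y)
r = np.array([[0.0, 0.7, 0.3],         # embedded-chain kernel r(x, y, {y'}):
              [0.5, 0.0, 0.5],         # zero diagonal, rows sum to one
              [0.2, 0.8, 0.0]])

# Stationary distribution pi of the embedded chain: left eigenvector of r
# for eigenvalue 1, normalized to sum to one.
w, v = np.linalg.eig(r.T)
pi = np.real(v[:, np.argmin(np.abs(w - 1.0))])
pi = pi / pi.sum()

# Invariant measure of the continuous-time chain: nu_y proportional to pi_y / c_y.
nu = pi / c
nu = nu / nu.sum()

# Check invariance for the generator Q_{yy'} = c_y (r_{yy'} - delta_{yy'}).
Q = c[:, None] * (r - np.eye(3))
assert np.allclose(nu @ Q, 0.0)
```

The check nu @ Q ≈ 0 is exactly the statement that states visited often by the embedded chain but left quickly (large c) carry proportionally less mass under the continuous-time invariant measure.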
Then f is a locally Lipschitz function on R d with linear growth.
Proof. The linear growth property is clear from the Lipschitz property of f . The local Lipschitz property follows by noting the corresponding bound on any compact set. Let b̄(x) = ∑ y∈L b(x, y)ν y (x), and note that by Lemma 2.6, b̄ is a locally Lipschitz function with linear growth. The proof of the following theorem follows along the lines of [24, Chapter 2, Theorem 8]. We omit the details, since a similar result in a controlled setting will be shown in Proposition 3.7.
Theorem 2.7. Fix (x 0 , y 0 ) ∈ G. Let (X ε , Y ε ) be the solution of (2.3). Then as ε → 0, X ε converges uniformly on compacts, in probability, to the unique solution of (2.4). The unique solvability of (2.4) is a consequence of the properties of b̄ stated before the theorem.
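In light of the definition of b̄, the averaged limit dynamics in a result of this type take the following standard form (a sketch consistent with the surrounding text, with symbols as defined above):

```latex
\dot{\xi}(t) \;=\; \bar b\big(\xi(t)\big)
  \;=\; \sum_{y\in L} b\big(\xi(t), y\big)\,\nu_y\big(\xi(t)\big),
\qquad \xi(0) = x_0, \quad t \in [0,T].
```

That is, the fast component is averaged out against its local stationary distribution ν(·) evaluated along the slow trajectory.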
The solution X ε of the system (2.3) can be regarded as a C([0, T ] : R d )-valued random variable. The main result of this work establishes a large deviation principle (LDP) for X ε in C([0, T ] : R d ) as ε → 0. In the rest of this section we formulate the rate function for {X ε } and present our main result.

Rate function
Recall that M([0, T ] : P(L)) and M([0, T ] : R d ) denote the spaces of measurable maps from [0, T ] to P(L) and from [0, T ] to R d , respectively. For ψ = (ψ j ) j∈L , with ψ j : [0, ζ] → R + a measurable map for every j, let where for i, j. Note that any ψ as above, such that ψ j is integrable for each j, can be identified with η = (η j ) j∈L such that for each j, η j ∈ M F ([0, ζ]), on setting η j (dz) .= ψ j (z)dz. More generally, for x ∈ R d and any η = (η j ) j∈L such that each η j ∈ M F ([0, ζ]), we define A η (x) as in (2.7). We will make use of such A η in the next section. Although the introduction of a second notation for the controlled intensities is regrettable, the measure formulation is more natural when discussing topologies.
For ξ ∈ C([0, T ] : R d ), define where A(ξ) is the collection of all and where ϕ i,• = (ϕ i,j ) j∈L (with the convention ϕ i,j = 1 if (i, j) ∉ T). Equation (2.10) characterizes the invariant distributions that would be associated with controlled PRMs with controls ϕ ij , which influence the rate of transition from i to j through (2.6). This form of the rate function is very much analogous to the control formulation of a small noise diffusion as in [4]. In (2.10) we follow the usual convention regarding the value 0. The following is the main result of this work. Recall that Assumptions 2.1 and 2.3 are taken to hold throughout the paper. The proof of the Laplace upper bound, which corresponds to a variational lower bound, is given in Section 3. The corresponding Laplace lower bound, which is a variational upper bound, is proved in Section 5. The fact that I is a rate function is shown in Section 2.3 (Proposition 2.14).
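For orientation, the control formulation of the rate function for a small noise diffusion dX^ε = b(X^ε)dt + √ε a(X^ε)dW, in the spirit of [4], reads as follows (a standard form, included for comparison only):

```latex
I(\xi) \;=\; \inf\Big\{ \tfrac12 \int_0^T \|u(t)\|^2\,dt \;:\;
  u \in L^2([0,T]:\mathbb{R}^m),\
  \xi(t) = x_0 + \int_0^t b(\xi(s))\,ds + \int_0^t a(\xi(s))\,u(s)\,ds
  \ \text{for all } t\in[0,T] \Big\},
```

with I(ξ) = ∞ if no such u exists. In the present two-scale setting the infimum is additionally taken over the thinning controls ϕ and the measures π, subject to the constraints (2.9)-(2.10).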
(a) Note that the rate function I depends on the initial value of (X ε , Y ε ), namely (x 0 , y 0 ).
To emphasize this dependence denote I as I x 0 ,y 0 . Using a straightforward argument by contradiction, one can show that the Laplace limit (2.11), with I replaced by I x 0 ,y 0 on the right side, holds uniformly for y 0 ∈ L and for x 0 in any compact subset of R d . Then Theorem 2.8 can be strengthened as follows: the pair (X ε , ̺ ε ) satisfies a large deviation principle on C([0, T ] : R d × M F (L)) with rate function Ī, where for (ξ, ϑ) ∈ C([0, T ] : R d × M F (L)), Ī(ξ, ϑ) is defined by the right side of (2.8) on replacing A(ξ) with Ā(ξ, ϑ), which is the collection of all (u, ϕ, π) that satisfy, in addition to (2.9) and (2.10), the equality

An Equivalent Representation for the Rate Function
In this section we present a different representation for the rate function that will be more convenient to work with in some instances. Recall that, with an abuse of notation, we define l(η) = ∑ i∈L l(η i ).
Let P 1 (H T ) denote the space of finite measures Q on H T whose first marginal is Lebesgue measure; in other words, denoting the marginal on the i-th coordinate of Q by [Q] i , we require [Q] 1 = λ T . For notational simplicity, we will denote a typical (s, y, η, z) ∈ H T as v. Q encodes time (s), the state of the controlled fast process (y), the measures controlling the jump rates (η), and the control (z) applied to perturb the mean of the Brownian motion. Recall that b(x, y), a(x, y) are the same as b y (x), a y (x). For ξ ∈ C([0, T ] : R d ), let Â(ξ) be the family of all Q ∈ P 1 (H T ) such that (2.16) and (2.17) hold, where (2.16) gives the controlled dynamics, and (2.17) guarantees that the conditional distribution of Q in the y-variable (i.e. the second coordinate), given the time instant s ∈ [0, T ], the state ξ(s) of the dynamics, and the rate control measure η being used, is the stationary distribution associated with the generator A η (ξ(s)). Define the function Î as in (2.18). In the expression for Î, all statistical relations between the controls (z, η) and the empirical measure for the fast variables are determined by the joint distribution. It is a natural object for purposes of weak convergence analysis, and these relations can be determined by the use of suitable test functions. The following result shows that Î and I are the same.
Proof. Fix ξ ∈ C([0, T ] : R d ). We first show that Î(ξ) ≤ I(ξ). Without loss of generality we assume that I(ξ) < ∞. Then it is easy to verify that Q ∈ Â(ξ) and equals the left side of (2.19). This proves that Î(ξ) ≤ I(ξ) + ε. Since ε > 0 is arbitrary, we have Î(ξ) ≤ I(ξ). We now consider the reverse inequality, namely I(ξ) ≤ Î(ξ). We assume without loss of generality that Î(ξ) < ∞. Let Q ∈ Â(ξ) be such that Let [Q] 34|12 (dη × dz|y, s) denote the conditional distribution on the third and fourth coordinates given the first and second. Disintegrate the measure Q as and write η y = (η yy ′ ) y ′ ∈L . By convexity and therefore Define Then note that A similar convexity argument shows that To complete the proof it suffices to show that First, it is easily checked that ξ satisfies (2.9). Thus it remains to verify (2.10). Since Q ∈ Â(ξ), from (2.17) we have that for all j ∈ L and a.e. s ∈ [0, T ], This equality can be rewritten as Using the definition of A η in (2.5), the last display becomes which, owing to the definition of η y in (2.20), is the same as From the definition of (ϕ ij ) in (2.21) it is now immediate that this is the same as (2.10).

Compact Level Sets
We first prove the following lemmas, which will be used in the proof of the main result of this section.
The following inequality will be used in the proof: for u, v ∈ (0, ∞), where the inequality on the second line follows from (2.22) and the last inequality follows from (2.2). The result follows.
Remark 2.12. Using the inequality in (2.22) one can similarly show that there exists a
The following lemma will be used at several places in weak convergence arguments.
Lemma 2.13. Let (η n , Z n , Y n ) be a sequence of (M F [0, ζ]) |L| × R d × L valued random variables given on a probability space ( Ω, F, P) which converges in probability to (η, Z, Ȳ ). Further suppose that, for some
Proof. Fix j ∈ L. Using Lemma 2.11 we see that From Fatou's lemma and the lower semicontinuity of l it follows that Ē l(η) ≤ C. Next, assume without loss of generality, by using a subsequential argument, that the convergence of (η n , Z n , Y n ) holds a.s. Let Ω 0 ∈ F be such that P(Ω 0 ) = 1 and, ∀ω ∈ Ω 0 , l(η(ω)) < ∞. Fix ω ∈ Ω 0 . We will suppress ω from the notation at some places below. Since L is a finite set, there exists an . Since η n → η and η j is absolutely continuous with respect to λ ζ for every j, we conclude that

Using this in (2.7) we now have that
Combining this with (2.23) we have that Finally, to show the L 1 -convergence, it suffices to argue that {η n j [0, ζ]} is uniformly integrable, or equivalently where the first inequality follows from the convexity of ℓ. The desired uniform integrability is now an immediate consequence of the superlinearity of ℓ.
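The superlinearity invoked here is the standard property of the cost function appearing in variational representations for Poisson random measures (cf. [4,6]); in that literature the per-jump cost is typically

```latex
\ell(u) \;=\; u\log u - u + 1, \qquad u \in [0,\infty),
```

which is convex, nonnegative, vanishes only at u = 1 (no thinning), and satisfies ℓ(u)/u → ∞ as u → ∞. This last growth property is what yields the uniform integrability above. (The explicit formula is quoted from the cited representation literature as a sketch, not reproduced from the present paper's definitions.)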
We now show that the function I, which is the same as the function Î defined in (2.18), is a rate function on C([0, T ] : R d ).
Proposition 2.14. For every M ∈ (0, ∞), the set U M .= {ξ ∈ C([0, T ] : R d ) : I(ξ) ≤ M } is compact.
Proof. Let {ξ n } n∈N be a sequence in U M . Since I(ξ n ) ≤ M , we have from Proposition 2.10 that for each n ∈ N there exists some

(2.24)
Recall that P 1 (H T ) is the space of finite measures on H T defined in (2.14) whose first marginal is Lebesgue measure. It suffices to show that {ξ n } is pre-compact and that every limit point belongs to U M . For this, we prove that: (i) {Q n , ξ n } n∈N is pre-compact in P 1 (H T ) × C([0, T ] : R d ); (ii) any limit point (Q, ξ) satisfies the properties: (a) the cost bound in (2.24) holds in the limit, (b) (2.16) holds, (c) (2.17) holds.
We now prove (i). Since L is a finite (and hence compact) set and [Q n ] 1 = λ T for all n, in order to prove the pre-compactness of {Q n } it suffices to show that for every δ > 0 there exists a From (2.24) and using (2.22), the inequality in (2.25) is now immediate from the last two displays. Thus {Q n } is pre-compact in P 1 (H T ). Next we argue the pre-compactness of {ξ n }. We first show that Using the linear growth property of a, b (Remark 2.2), we have Thus from (2.26), the inequality in (2.27) now follows by Gronwall's inequality. Next, consider the fluctuations of ξ n . For 0 ≤ t 0 ≤ t 1 ≤ T , where the last inequality uses (2.27) and (2.26), and C 3 depends only on C 2 , M , κ 1 and T . This estimate together with (2.27) shows that {ξ n } is pre-compact in C([0, T ] : R d ). We now prove (ii). Let (Q, ξ) be a limit point of the sequence {(Q n , ξ n )} n∈N . Part (a) is immediate from (2.24) using Fatou's lemma and the lower semicontinuity of l. Consider now part (b).
We assume without loss of generality that the full sequence converges to (Q, ξ). From the Lipschitz property of a (Assumption 2.1), we have as n → ∞. A similar calculation shows that, as n → ∞, Since (s, y, η, z) → (b(ξ(s), y), a(ξ(s), y)) is a continuous and bounded map and (2.28) holds, combining the last three convergence statements we have (b). Next we consider part (c). By Lemma 2.11 we have that for (2.29) Sending n → ∞ and then M 0 → ∞, we see from (2.26) that the left side of (2.29) converges to 0 as n → ∞. Finally, by the Skorohod representation theorem, (2.26) and Lemma 2.13, Combining this with (2.29) and recalling that Q n ∈ Â(ξ n ), we have (c). This completes the proof.

Large Deviation Upper Bound
The main result of this section is Theorem 3.8, which establishes the Laplace upper bound. To prove it we establish a lower bound on the corresponding variational representations.
Let P denote the predictable σ-field on [0, T ] × Ω associated with the filtration {F t } 0≤t≤T . Consider the following spaces. Let With this notation, the variational representation of [6] yields (3.3). In fact, a closer inspection of the proof of Theorem 2.8 of [6] (see [7, Theorem 2.4]) shows that (3.3) can be strengthened as follows. For n ∈ N, define Since P M 2 is a closed ball in L 2 ([0, T ]), it is compact under the weak topology. A g ∈ S M can be identified with θ g T = (θ With the usual weak convergence topology on the space of finite measures on [0, ζ] × [0, T ], this identification induces a topology on S M under which it is a compact space. Throughout, we use these topologies on P M 2 and S M . Controlled versions of processes will be denoted by an overbar, with the particular controls used clear from context. Thus for (ψ, ϕ) ∈ U b we consider the coupled equations (3.4) with ( Xε (0), Ȳ ε (0)) = (x 0 , y 0 ). Recall the map G ε introduced below (2.3). From unique pathwise solvability of (3.4) and a standard argument based on Girsanov's theorem (see for example ), ( Xε , Ȳ ε ) is the unique solution of (3.4) with ψ and ϕ replaced by ψ ε and ϕ ε . Thus the representation in (3.3) yields The following lemma will be used in proving a tightness property. In many places below we will consider controls u subject to an a.s. constraint of the form L T (u) ≤ M . To simplify the notation, the almost sure qualification is omitted.
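Schematically, representations of the type developed in [4,6] take the following form for a bounded Borel functional F (a generic sketch for orientation; the exact cost L_T(u) and admissible control classes are as defined in this section):

```latex
-\varepsilon \log E\Big[\exp\Big\{-\tfrac{1}{\varepsilon}
  F\big(X^\varepsilon\big)\Big\}\Big]
\;=\; \inf_{u=(\psi,\varphi)}
  E\Big[\, L_T(u) \,+\, F\big(\bar X^\varepsilon\big)\,\Big],
```

where the infimum is over admissible controls: ψ shifts the drift of the Brownian motion, ϕ thins or amplifies the jump rates of the Poisson random measures, and X̄ ε denotes the correspondingly controlled slow process. Laplace asymptotics for X ε are then obtained by analyzing the right side as ε → 0.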
Lemma 3.1. For every M ∈ (0, ∞), where Also, by Doob's maximal inequality, we have with A similar calculation shows that Hence from (3.6) it follows that with The lemma now follows from Gronwall's inequality.
Proof. Write Xε as in (3.6). From (3.7) and Lemma 3.1, as ε → 0. It thus suffices to prove the tightness of { Āε } and { Bε }. For this, note that for 0 ≤ h ≤ T and ε ∈ (0, 1), where the last inequality used the fact that L T (u ε ) ≤ M . Lemma 3.1 now gives that { Āε } is tight in C([0, T ] : R d ). Similar calculations show that { Bε } is tight as well. The result follows.
Remark 3.3. The estimate in Proposition 3.2 in particular shows that for every M ∈ (0, ∞) there exists some c 2 (M ) ∈ (0, ∞) such that sup ε∈(0,1)
Given ε > 0, let u ε = (ψ ε , ϕ ε ) ∈ U b and let ( Xε , Ȳ ε ) be as in Proposition 3.2. Note that by (3.4) only the controlled rates ϕ ε Ȳ ε (t−),• affect the evolution of ( Xε , Ȳ ε ). In proving the Laplace upper bound, we can (and will) assume without loss of generality that
The proof of the Laplace upper bound relies on the asymptotic analysis of the following occupation measure. We fix a collection {∆ ε } ε>0 of positive reals such that (3.9) holds. By convention we will take
The main step in the proof will be to characterize the limit points of Q ε . We begin with some preliminary estimates.
), be as in Proposition 3.2. Let ∆ ε be as in (3.9). Then and
Proof. We only prove (3.13); the proof of (3.12) is similar but easier and therefore omitted. Changing the order of integration, for t ≥ ∆ ε , Thus for such t, (3.14) holds. For T (1) we have Similarly, Thus in view of Lemma 3.1, sup 0≤t≤T E(T (i) t ) 2 → 0 as ε → 0, for i = 1, 3. For the second term, using the Lipschitz property of a we have where d lip is as in Assumption 2.1. Using Remark 3.3 we have which converges to 0 as ε → 0. Thus we have shown that (3.13) holds with the sup 0≤t≤T on the left replaced by sup ∆ε≤t≤T . A similar calculation can be used to prove the statement for t ≤ ∆ ε . The result follows.
We now give a convenient lower bound for L T (u ε ) for u ε ∈ U b .
Lemma 3.6. For ε ∈ (0, 1), let u ε = (ψ ε , ϕ ε ) ∈ U b and define Q ε as in (3.11). Then
Proof. Recall from (3.2) that and using the convention for the definition of ψ ε (u) and η ε (u) when u > T , where η ε is as in (3.10). As in the proof of (3.13), changing the order of integration, we can rewrite the first term in the last display as where the third term is 0 using our convention that ψ ε (u) = 0 for u > T . We can write and thus For the second term, note from (3.8) and (3.10) that and we have in a similar manner that The result follows.
We now prove the tightness of {( Xε , Q ε )} and characterize the limit points.
From Lemma 3.6 we have To prove the tightness of {Q ε }, it suffices to show that for any δ ∈ (0, ∞) there exists C 1 ∈ (0, ∞) such that: However, this is proved exactly as (2.25), using (3.18) instead of (2.24). The inequality in part 1 follows immediately from Lemma 3.6 using Fatou's lemma and lower semicontinuity of η → l(η). We now prove 2. For this we assume without loss of generality (using the Skorohod representation) that ( Xε , Q ε ) converges a.s. to (ξ, Q). Following similar steps as in the proof of Proposition 2.14 (see the proof of (2.28)), we conclude that It now follows from (3.17) and Lemma 3.5 that (2.16) holds. To prove that (2.17) holds, we estimate the difference between Ht A η y,j ( Xε (s))Q ε (dv) and Ȳ ε (u),j ( Xε (u))du for j ∈ L and t ∈ [0, T ]. By a change of the order of integration, we have for t Using this along with Lemma 2.11 and Remark 2.12, we have that there exists C 2 ∈ (0, ∞) such that, for any A similar calculation shows that the above inequality is also true for all where the second inequality follows on noting that, since Xε → ξ, the collection { Xε } is equicontinuous. Recall the sets E ij (x) defined in (2.1). Then from (3.8) and (3.10), for any φ where M ε φ is the martingale given by where the last inequality uses (2.22). It follows that sup 0≤s≤T |ε M ε φ (s)| converges to 0 in probability as ε → 0. Next, from (3.20) we see that Since φ is bounded, we conclude that as ε → 0, sup For fixed j ∈ L, taking φ .= 1 {j} , we now see from (2.7) that sup Ȳ ε (s),j ( Xε (s))ds converges to 0 as ε → 0. Hence from (3.19), Ht A η y,j ( Xε (s))Q ε (dv) → 0, uniformly in t ∈ [0, T ]. Now, as in the proof of Proposition 2.14 (see (2.29) and (2.30)), Ht A η y,j ( Xε (s))Q ε (dv) → Ht A η y,j (ξ(s))Q(dv). Thus (2.17) is satisfied and the result follows.

Near Optimal Paths with a Unique Characterization
In order to prove the large deviation lower bound (2.13), a natural approach is to consider a ξ that is a near infimum for the right side of (2.13) and to construct a sequence of controls (ψ ε , ϕ ε ) such that Xε ⇒ ξ, where Xε is as in (3.4) with (ψ, ϕ) replaced by (ψ ε , ϕ ε ).
Along with an appropriate convergence of costs, the variational representation in (3.5) can then be used to argue that (2.13) holds. For a near optimal ξ, let (u, ϕ, π) ∈ A(ξ) be a near infimum for the expression on the right side of (2.8). The control pair (u, ϕ) suggests a natural sequence of controls (ψ ε , ϕ ε ) (see (5.4)) for the construction of controlled processes Xε and an occupation measure Q ε of the form in (3.11). Our strategy in the proof of the lower bound given in Section 5 will be to show that any limit point ξ of Xε and a suitable marginal π of the limit point Q of Q ε solve the system in (2.9)-(2.10) for the given (u, ϕ). The key result then needed to complete the proof is that the system admits a unique solution for the given choice of (u, ϕ), thereby proving ( ξ, π) = (ξ, π) a.s. Although proving such a result for an arbitrary ξ and an arbitrary (u, ϕ, π) ∈ A(ξ) appears to be challenging, in this section we show that one can perturb ξ slightly to ξ * , without affecting the cost too much, and find a near optimal (u * , ϕ * , π * ) ∈ A(ξ * ) such that the desired uniqueness property does in fact hold for (u * , ϕ * ). See in particular parts 4 and 5 of the following proposition.
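The variational representation invoked here has, schematically, the following Boue-Dupuis type form. This is a generic sketch of the structure only; the paper's precise statement (3.5), including the exact Poisson control terms and the class U b , is not reproduced here:

```latex
-\varepsilon \log \mathbb{E}\left[ e^{-F(X^{\varepsilon})/\varepsilon} \right]
  = \inf_{(\psi^{\varepsilon},\varphi^{\varepsilon}) \in \mathcal{U}_b}
    \mathbb{E}\left[ \bar{L}_T(\psi^{\varepsilon}) + L_T(\varphi^{\varepsilon})
                     + F(\bar{X}^{\varepsilon}) \right],
```

where F is a bounded continuous functional, \bar{X}^{\varepsilon} is the controlled process of (3.4), and \bar{L}_T, L_T are the running costs for the Brownian and Poisson controls, respectively.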
and there is (u * , ϕ * , π * ) ∈ A(ξ * ) with the following properties.

There is a measurable map
and for some c 1 ∈ (0, ∞) 4. If for the given u * and ϕ * , (2.9) and (2.10) are satisfied for any other ( ξ, π) ∈ C([0, T ] : 5. The cost associated with (u * , ϕ * ) satisfies: Remark 4.2. We now give an outline of the proof strategy. Let ξ ∈ C([0, T ] : R d ) be such that I(ξ) < ∞ and fix γ ∈ (0, 1). Let (u, ϕ, π) ∈ A(ξ) be such that Note that there are four time dependent objects appearing in the limiting deterministic controlled dynamics: the trajectory ξ, the empirical measure π on the fast variables, the controls u that correspond to shifting the mean of the Brownian noises, and the thinning function ϕ that controls the rates of the fast variables. In addition, there is complete coupling of the fast and slow variables; in particular, the dynamics of the fast variables at time s depend on both the controlled rates and ξ at s. The key issue regarding uniqueness is to ensure that the thinning functions ϕ can be bounded away from zero, which implies ergodicity of the associated Markov processes and hence uniqueness of the corresponding π. If for a given collection (u, ϕ, π, ξ) the rates are not bounded away from zero, then we must show that they can be perturbed so that this is true, while at the same time making only a small change in ξ and the cost.
The steps are as follows. (a) We first perturb π to π δ (see (4.6)), so that every state has strictly positive mass under π δ . This positivity is used crucially in the remaining steps. (b) Replacing π with π δ in (2.9) leads to a perturbation of the target trajectory ξ. To ensure that the trajectory perturbation is not too large, we modify the control u to u δ in a way that compensates for the change in π (see (4.7)). (c) The perturbed measure π δ need not be stationary (i.e., satisfy (2.10)) for the original thinning control ϕ and the new trajectory ξ δ . To remedy this we next perturb ϕ to ϕ δ (see (4.9)). (d) With the perturbed ϕ δ , (2.10) is satisfied, so π δ is stationary, but with ξ rather than ξ δ . In particular, the constructed (u δ , ϕ δ , π δ ) is not in general in A(ξ δ ). This leads to our last modification, where we change ϕ δ to φδ . It is at this point that the formulation of the original dynamics as the solution of an SDE driven by a collection of PRMs, and the corresponding formulation of the control problem in terms of thinning functions, is very convenient (see (4.13)). With this change we now have (u δ , φδ , π δ ) ∈ A(ξ δ ). Furthermore, with δ = δ * sufficiently small, the perturbed quantities (u δ * , φδ * , π δ * , ξ δ * ) = (u * , ϕ * , π * , ξ * ) satisfy all the desired properties.
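Steps (a) and (b) admit an explicit sketch. The first formula below is the one visible in the proof; the displayed form of u δ is an inference from the stated identity u δ_j π δ_j = u_j π_j (well defined since ν_j(ξ(s)) > 0 forces π δ_j(s) > 0):

```latex
\pi^{\delta}_j(s) = (1-\delta)\,\pi_j(s) + \delta\,\nu_j(\xi(s)), \qquad
u^{\delta}_j(s) = \frac{u_j(s)\,\pi_j(s)}{\pi^{\delta}_j(s)},
\quad \text{so that} \quad
u^{\delta}_j(s)\,\pi^{\delta}_j(s) = u_j(s)\,\pi_j(s).
```

The second identity is what keeps the drift contribution of the Brownian control unchanged to first order, so the trajectory perturbation stays of order δ.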
Proof of Proposition 4.1. Let ξ and (u, ϕ, π) ∈ A(ξ) be as in Remark 4.2. In particular, We claim that without loss of generality it can be assumed that for some m 0 ∈ (0, ∞), We first prove the proposition assuming the claim; the proof of the claim is given at the end. For x ∈ R d let ν(x) be, as in Theorem 2.4, the stationary distribution of the fast system when the slow variable equals x. Fix δ > 0 and define π δ j (s) .= (1 − δ)π j (s) + δν j (ξ(s)). Define Then u δ j (s)π δ j (s) = u j (s)π j (s) for all s and j. Define From the Lipschitz properties of b j , σ j , (4.8) has a unique solution for the given u δ and π δ . Note that with M = I(ξ) + 1, Then by Gronwall's lemma ξ − ξ δ T ≤ Kδ, where K = 2M 0 a(M )e d lip a(M ) and M 0 = sup 0≤s≤T,j∈L ( b j (ξ(s)) + a j (ξ(s)) ). Now define, for (i, j) ∈ T and (s, z), and and β δ ii (s) .= − j:j =i β δ ij (s). Then, since the π δ i cancel and (u, ϕ, π) ∈ A(ξ), with A δ (s) Thus π δ (s) is stationary for A δ (s). However, from (2.6), and hence (u δ , ϕ δ , π δ ) is not in general in A(ξ δ ). We now construct a further modification, φδ , of ϕ δ such that (u δ , φδ , π δ ) ∈ A(ξ δ ), and such that uniqueness for (2.9)-(2.10) (with (u, ϕ) replaced by (u δ , φδ )) holds. Let where we recall that c i (x) is the overall rate of transitions out of i for the fast process when the slow process is in state x, and r ij (x) gives the probability of a transition to state j. For (i, j) ∈ T define (4.13) Then for such (i, j) Then, from (4.8) and (4.11), (u δ , φδ , π δ ) ∈ A(ξ δ ), where ρ ψ ij (x) was defined in (2.6), and φδ i denotes the collection of controls ( φδ ij , j ∈ L). Next note that, by construction, for (i, j) ∈ T, and, from (4.5), for all δ > 0 Let r .= ςκ 3 , where ς and κ 3 are defined above Assumption 2.3. Then for (i, j) ∈ T the definition (4.10) implies Also, from (4.12), for each s, and using convexity It is easy to check that for a ≥ 0 and b, c > 0, The Lipschitz properties of the underlying transition rates ρ ij (•) (see (2.2)) yield the following inequalities, each of which is explained after the display: The first inequality uses the previous three displays, the second uses the definition of β δ ij in (4.10), the third uses (4.9), and the final one uses the bound ξ − ξ δ T ≤ Kδ and the fact that x ≤ e(1 + ℓ(x)) for all x ≥ 0.
Hence we obtain where the last line is a consequence of the fact that However, if π i (s) ≥ π δ i (s), then π i (s) ≥ ν i (ξ(s)) follows, and thus π δ i (s) ≥ ν. Therefore Thus in this case Combining the two cases, we have from (4.4) Taking δ * .= min{γ/K, γ/4K 1 , γν/8M }, we now see that with namely item 1 of the proposition holds. From (4.16) and the definition (2.6), for (i, j) ∈ T, s ∈ [0, T ] and x ∈ R d , ρ Using (4.17) where the last line is from (4.4). This proves item 5. We now prove the claim made in (4.5). Note that we do not change the dynamics at all if we redefine ϕ ij in the following way. With ϕ on the right equal to the old version and ϕ on the left the new one, for (i, j) ∈ T set This amounts to assuming that outside E ij (ξ(s)) the controlled jump rates are the same as for the original system, and that within E ij (ξ(s)) they are constant in z, in such a way that the overall jump rates do not change. Owing to the convexity of ℓ and ℓ(1) = 0, this can only lower the cost while preserving the dynamics, and could have been assumed for any candidate control for the jumps from the outset. Let and for α > 0 and (i, j) ∈ T Thus for α > 0 the controlled jump rates are uniformly scaled by α/v(s), and therefore We now compute the infimum on the left side. For notational simplicity, write ρϕ i (s,•) ij (ξ(s)) as ρij , and ρ ij (ξ(s)) as ρ ij . Also let θ Differentiating with respect to α and setting the derivative to 0, we get log(α) .
It is easily checked that there is m 6 ∈ (0, ∞) such that for all s ∈ [0, T ] the minimizing α satisfies α ≤ m 6 . Thus, from the definition of ϕ α and since, from (4.19), π i (s)ρ ij (ξ(s)) ≤ 1, we see that with this α, for all (i, j) ∈ T and (s, z). This proves the claim in (4.5) and completes the proof of the proposition.
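The elementary properties of the cost function ℓ used above (ℓ(1) = 0, convexity, and the bound x ≤ e(1 + ℓ(x))) can be verified numerically. The sketch below assumes ℓ(x) = x log x − x + 1, the standard cost for controlled jump rates; this is consistent with ℓ(1) = 0 in the text but is an assumption about the paper's ℓ:

```python
import math

def ell(x):
    # Assumed cost function for controlled jump rates: x*log(x) - x + 1,
    # extended by continuity to ell(0) = 1. Convex, with ell(1) = 0.
    if x == 0.0:
        return 1.0
    return x * math.log(x) - x + 1.0

# ell(1) = 0, as used when arguing the redefinition can only lower cost.
assert abs(ell(1.0)) < 1e-12

# The bound x <= e*(1 + ell(x)) for x >= 0, checked on a grid.
for k in range(0, 2001):
    x = k * 0.01
    assert x <= math.e * (1.0 + ell(x)) + 1e-12

# Midpoint convexity on a grid: ell((x+y)/2) <= (ell(x) + ell(y))/2.
for i in range(1, 100):
    for j in range(1, 100):
        x, y = i * 0.1, j * 0.1
        assert ell((x + y) / 2) <= (ell(x) + ell(y)) / 2 + 1e-12

print("ok")
```

Analytically, e(1 + ℓ(x)) − x is minimized at x = e^{1/e}, where it equals e(2 − e^{1/e}) > 0, so the bound is strict.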

Large Deviation Lower Bound
The goal of this section is to prove the following theorem.
where X ε solves (2.3), and I is as defined in (2.18).
To do this we will show an upper bound on the corresponding variational representations. The situation is in some sense simpler than for the corresponding lower bound, since a fixed control is being used. Thus the analysis is essentially just the law of large numbers for a two time scale system, with the main effort being to justify the replacement of the empirical measure of the fast component by the corresponding stationary measure in the limit. We begin with an elementary lemma.

Lemma 5.2. Let m n , m be finite measures on [0, T ] × L such that the first marginals of m n and m are Lebesgue measure: Suppose that m n converges weakly to m. Let v : [0, T ] → R be an integrable map, i.e., (5.4) From (3.5) we get that where LT and LT were defined at the beginning of Section 3. With ∆ ε as in (3.9), define Q ε ∈ P 1 (H T ) by (3.11), where η ε is as introduced in (3.10). From (4.3) and (4.2) (note that (4.2) gives a lower bound on π * i (s)) it follows that for all ε > 0 LT (ψ ε ) + LT (ϕ ε ) ≤ c 1 (I(ξ) + 1) .= M < ∞. (5.5) It then follows from Proposition 3.7 that ( Xε , Q ε ) is a tight family of C([0, T ] : R d ) × M F (H T )-valued random variables, and if ( ξ, Q) is a weak limit point of {( Xε , Q ε )} then equations (2.16) and (2.17) hold with (ξ, Q) replaced by ( ξ, Q). Disintegrate Q as Q(ds × {y} × dη × dz) = ds π y (s)[ Q] 34|12 (dη × dz). (5.6) We will now show that, a.s., ( ξ(s), π(s)) = (ξ * (s), π * (s)) for a.e. s ∈ [0, T ]. (5.7) We can assume without loss of generality that the convergence of ( Xε , Q ε ) to ( ξ, Q) holds a.s. along the full sequence. We begin by showing that for every y ∈ L, t ∈ [0, T ], and any continuous map h : [0, T ] → R m , with h(s) ′ denoting the transpose, This proves (5.11). Using this fact in (5.10), we now have that T t → 0 as ε → 0. Thus R ε t → 0 and hence (5.8) follows.
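The "law of large numbers for a two time scale system" invoked above can be illustrated with a minimal simulation: a slow ODE whose drift switches with a fast two-state chain is compared with the averaged dynamics. All coefficients and rates below are hypothetical, not the paper's:

```python
import random

random.seed(2)

def b(y, x):
    # Hypothetical drift of the slow component given fast state y.
    return -x + (1.0 if y == 1 else -1.0)

def run_slow(eps, horizon=1.0, dt=1e-4):
    """Euler scheme for the slow component; the fast two-state chain jumps
    at rates (2 or 3)/eps, approximated by a Bernoulli flip per step."""
    x, y, t = 0.0, 0, 0.0
    while t < horizon:
        rate = (2.0 if y == 0 else 3.0) / eps
        if random.random() < rate * dt:
            y = 1 - y
        x += b(y, x) * dt
        t += dt
    return x

def run_averaged(horizon=1.0, dt=1e-4):
    """Averaged dynamics: drift integrated against the stationary law
    nu = (3/5, 2/5) of the fast chain with rates (2, 3)."""
    nu1 = 2.0 / 5.0
    x, t = 0.0, 0.0
    while t < horizon:
        x += ((1 - nu1) * b(0, x) + nu1 * b(1, x)) * dt
        t += dt
    return x

print(abs(run_slow(eps=1e-3) - run_averaged()) < 0.1)
```

As eps decreases the fast chain equilibrates between slow moves, and the switching trajectory tracks the averaged one; this is the mechanism behind replacing the empirical measure by the stationary measure in the limit.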
Applying (5.8) to the rows of a j ( ξ(s)) and summing over j, we have that b j ( ξ(s))π j (s)ds.
Finally, we consider the convergence of costs. Note that LT (ψ ε ) = 1/2

Theorem 2.8.
The map I in (2.8) is a rate function on C([0, T ] : R d ) and {X ε } ε>0 satisfies the Laplace principle on C([0, T ] : R d ), as ε → 0, with rate function I. Namely, for all and consequently I is a rate function on C([0, T ] : R d ).
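For reference, the Laplace principle asserted in Theorem 2.8 has the standard form below, stated for bounded continuous h. This is the textbook formulation, reproduced here as a sketch rather than the paper's exact display:

```latex
\lim_{\varepsilon \to 0} \; -\varepsilon \log
  \mathbb{E}\left[ \exp\!\left( -\frac{h(X^{\varepsilon})}{\varepsilon} \right) \right]
  = \inf_{\xi \in C([0,T]:\mathbb{R}^d)} \left\{ I(\xi) + h(\xi) \right\}.
```

Since I has compact sublevel sets, this is equivalent to the large deviation principle for {X ε } with rate function I.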
and let Āb = ∪ ∞ n=1 Āb,n . Also let U b = P 2 × Āb . Then in the equality in (3.3), U on the right side can be replaced by U b .