Markov chains in random environment with applications in queueing theory and machine learning

We prove the existence of limiting distributions for a large class of Markov chains on a general state space in a random environment. We assume suitable versions of the standard drift and minorization conditions. In particular, the system dynamics should be contractive on the average with respect to the Lyapunov function, and sufficiently large small sets should exist with sufficiently large minorization constants. We also establish that a law of large numbers holds for bounded functionals of the process. Applications to queueing systems and to machine learning algorithms are presented.


Introduction
Markov chains in stationary random environments (MCREs) with a general (not necessarily countable) state space feature in several branches of applied probability. Rough volatility models of mathematical finance (see [5,6]), queueing models with non-i.i.d. service and interarrival times (see [3] and Section 3 below) and sequential Monte Carlo methods (see Section 4 below) are prominent examples. It seems that existing studies on the ergodic theory of MCREs (such as [9,10,16,17]) impose conditions that exclude the treatment of relevant models from the above list of applications.
The article [7], introducing new tools, managed to establish the existence of limiting laws and ergodic theorems for certain classes of MCREs which satisfy suitable versions of the standard drift and minorization conditions of Markov chain theory (as presented e.g. in [13]).
Assumption 2.2 of [7], however, severely restricted the scope of applications by requiring that the system dynamics be contractive whatever the state of the random environment. The present study aims to remove this restriction: we require only that the process dynamics be contractive on the average, in the sense of Assumption 2.3 below.
In the sequel, we employ the convention that ∑_{i=k}^{l} = 0 and ∏_{i=k}^{l} = 1 whenever k, l ∈ ℕ, k > l.

Main results
Let (𝒴, 𝓐) be a measurable space and Y : ℤ × Ω → 𝒴 a strongly stationary 𝒴-valued stochastic process, which we interpret as the environment influencing the evolution of our main process of interest (X below). We consider a parametric family of stochastic kernels, that is, a map Q : 𝒴 × 𝒳 × ℬ → [0, 1], where for all B ∈ ℬ the function (y, x) → Q(y, x, B) is 𝓐 ⊗ ℬ-measurable and, for all (y, x) ∈ 𝒴 × 𝒳, B → Q(y, x, B) is a Borel probability measure on 𝒳. We assume that we are given an 𝒳-valued process X_t, t ∈ ℕ, such that X_0 = x_0 ∈ 𝒳 is fixed, with filtration 𝓕_t = σ(X_s, 0 ≤ s ≤ t; Y_s, s ≤ t), t ∈ ℕ. Let µ_t denote the law of X_t for t ∈ ℕ.
We aim to study the ergodic properties of X t and the convergence of µ t to a limiting law as t → ∞ under various assumptions. This definition makes sense for any non-negative measurable φ, too.
Consistently with Definition 2.1, for y ∈ , Q( y)φ will refer to the action of the kernel Q( y, ·, ·) on φ.

Assumption 2.2. (Drift condition) Let V : 𝒳 → ℝ₊ be a measurable function. We assume that there are measurable functions K, γ : 𝒴 → (0, ∞) such that, for all x ∈ 𝒳 and y ∈ 𝒴, [Q(y)V](x) ≤ γ(y)V(x) + K(y). Furthermore, we may and will assume that K(·) ≥ 1.
In contrast with the drift condition used in [7] (cf. Assumption 2.2 on page 2 there), γ here is a function of the environment state. Moreover, it is possible that γ(y) ≥ 1 holds for certain y ∈ 𝒴. This relaxation allows the inclusion of several models that were intractable using the results of [7]. Although γ(y) ≥ 1 may hold, in the next assumption we require that the system dynamics is, on average, contracting in the long run.
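To illustrate the distinction, here is a minimal numerical sketch (a toy example of our own, not taken from the paper): a linear chain X_{t+1} = a(Y_t)X_t + ξ_t whose factor a(y) plays the role of γ(y). It exceeds 1 in one environment state, yet two coupled copies still approach each other because the dynamics contract on the average.

```python
import numpy as np

# Toy illustration (hypothetical example): gamma(y) = a(y) exceeds 1 for
# one environment state, yet the dynamics contract on the average since
# the two-step factor 0.5 * 1.2 = 0.6 < 1.
A = {0: 0.5, 1: 1.2}

def step(y, x, noise):
    return A[y] * x + noise

def coupled_distance(x0, x1, env, noises):
    """Distance between two chains driven by the same environment and noise."""
    for y, xi in zip(env, noises):
        x0, x1 = step(y, x0, xi), step(y, x1, xi)
    return abs(x0 - x1)

env = [0, 1] * 10                 # alternating environment, 20 steps
noises = np.zeros(20)             # the noise cancels in the difference anyway
d = coupled_distance(0.0, 10.0, env, noises)
print(d)  # 10 * 0.6**10 ≈ 0.0605: the gap shrinks despite a(1) > 1
```

A uniform contraction requirement would exclude this example outright, since a(1) = 1.2 > 1.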

Assumption 2.3. (Long-time contractivity condition) We assume that
The next assumption stipulates the existence of suitable "small sets". It corresponds to Assumption 2.5 in [7] but we need a different formulation here.

Assumption 2.4. (Minorization condition) Let γ(·), K(·) be as in Assumption 2.2.
We assume that, for some 0 < ǫ < 1/γ^{1/2} − 1, there is a measurable function α : 𝒴 → (0, 1) and a probability kernel κ : 𝒴 × ℬ → [0, 1] such that, for all y ∈ 𝒴 and A ∈ ℬ,

The following easily verifiable condition controls the tail distribution of α(Y_0), which will play a very important role in our convergence estimates.

Assumption 2.6. (Thin tail condition) We assume that there exists 0 < θ < 1 such that

Now come the main results of the present paper: under the assumptions presented above, the law of X_t converges to a limiting law as t → ∞; moreover, bounded functionals of X_t exhibit ergodic behavior provided that Y is ergodic.
Remark 2.10. Since Φ is bounded, convergence in (8) takes place in probability iff it happens in L p for all 1 ≤ p < ∞. We preferred the current formulation of Theorem 2.9 since we obtain L p rates during the proofs, see also Remark 5.13. These rates, however, have too complicated expressions to be stated here.

A queueing model
We consider a single-server queueing model where customers are numbered by n ∈ ℕ. The time between the arrivals of customers n + 1 and n is described by the random variable ǫ_{n+1}, for each n ∈ ℕ. The service time for customer n is given by the random variable Y_n, for n ∈ ℕ.
The waiting time W_n of customer n satisfies the Lindley recursion W_{n+1} = max(W_n + Y_n − ǫ_{n+1}, 0), with W_0 := 0 (since customer 0 does not need to wait at all). When (Y_n)_{n∈ℤ} and (ǫ_n)_{n≥1} are i.i.d. sequences independent of each other, W_n is a (general state space) Markov chain whose ergodic properties have been extensively studied. Here we are interested in a more general setting where the process (Y_n)_{n∈ℤ} is assumed to be merely stationary.
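The recursion is straightforward to simulate. The following sketch uses illustrative distributions of our own choosing (not taken from the paper): a bounded, dependent service sequence of the kind this section studies, obtained as a bounded transformation of an AR(1) process.

```python
import numpy as np

# Simulate the Lindley recursion W_{n+1} = max(W_n + Y_n - eps_{n+1}, 0).
# The service-time model below is an illustrative assumption: a bounded
# transformation of an AR(1) process, dependent but stationary in the limit.
def waiting_times(service, interarrival):
    w = [0.0]                      # W_0 = 0: customer 0 does not wait
    for y, eps in zip(service, interarrival):
        w.append(max(w[-1] + y - eps, 0.0))
    return np.array(w)

rng = np.random.default_rng(2)
n = 20000
z = np.empty(n)
z[0] = 0.0
for i in range(1, n):
    z[i] = 0.7 * z[i - 1] + rng.standard_normal()
service = 1.0 / (1.0 + np.exp(-z))        # Y_n in [0, M] with M = 1
interarrival = rng.exponential(1.0, n)    # mean 1 > mean service time: stable
w = waiting_times(service, interarrival)
```

Since the mean service time (about 0.5) is below the mean interarrival time (1.0), the simulated workload remains stable, in line with the stability condition discussed next.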
The following condition is standard: in a stable system service times should be shorter on the average than inter-arrival times.

Assumption 3.2.
For some M > 0, the sequence of service times is part of a strict-sense stationary [0, M]-valued process Y_n, n ∈ ℤ, which is independent of (ǫ_n)_{n≥1}. There is η > 0 such that the limit exists for all α ∈ (−η, η) and Γ is differentiable on (−η, η).
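The displayed limit in Assumption 3.2 did not survive extraction; judging from the quantities λ_n appearing in the subsequent proof, it is presumably of the following Gärtner–Ellis type (a reconstruction, to be checked against the original):

```latex
\Gamma(\alpha) \;=\; \lim_{n\to\infty} \frac{1}{n}
\ln \mathbb{E}\,\exp\Big(\alpha \sum_{j=1}^{n} (Y_{j-1} - \epsilon_j)\Big),
\qquad \alpha \in (-\eta, \eta).
```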

Remark 3.3.
The above assumption is clearly inspired by the Gärtner–Ellis theorem, hence sufficient conditions for its fulfillment can be deduced from the literature on large deviation principles. For instance, if Y_n = φ(Z_n) for some bounded measurable φ : ℝ^m → ℝ₊ and an ℝ^m-valued geometrically ergodic Markov chain Z_n, n ∈ ℤ, started from its invariant distribution, then (4) holds true for some η > 0; see Theorem 4.1 of [11] for a precise formulation. Thus Theorem 3.7 is applicable to a large class of models. We also mention a non-Markovian example:

holds for all y ∈ [0, M], w ∈ ℝ₊, where Q is defined as

The functions λ_n(α) := (1/n) ln E exp(α ∑_{j=1}^{n} (Y_{j−1} − ǫ_j)), α ∈ (−η, η), n ∈ ℕ₊, are finite and differentiable. They are also clearly convex. Define

By the Lagrange mean value theorem and measurable selection, there exists a random variable ξ_n(α) ∈ [0, α] such that

which is uniformly bounded in α ∈ (0, η) (for n fixed). Hence the reverse Fatou lemma shows that lim sup

This implies that, for all n ≥ 1, λ′_n

Since λ_n(α) → λ(α) for α ∈ (−η, η) by Assumption 3.2, it follows from Theorem 25.7 of [14] that also λ′_n(0) → λ′(0), hence λ′(0) < 0 by Assumption 3.1. By Corollary 25.5.1 of [14], differentiability of λ implies its continuous differentiability, too. Hence from λ(0) = 0 and λ′(0) < 0 we obtain that there exists ᾱ > 0 satisfying

Now we define the Lyapunov function V(w) := e^{ᾱw}, w ≥ 0, and choose γ(·) so that (5) holds with K as defined above. By (6), the long-time contractivity condition also holds: lim sup

which completes the proof.
Now we present another assumption on the inter-arrival times, which will be needed to show the minorization condition.

Assumption 3.5. One has ℙ(ǫ_1 ≥ τ) > 0 for . Notice that, for unbounded ǫ_1, Assumption 3.5 automatically holds.

Now let us turn to the verification of the minorization condition under the assumption above.

Lemma 3.6. Let Assumption 3.5 be in force. Choose ǫ := (1/γ^{1/2} − 1)/2. Then there is α ∈ (0, 1) such that, for all y ∈ [0, M] and A ∈ ℬ(ℝ₊),

where δ_0 is the one-point mass concentrated at 0.
Theorem 2.7 allows us to deduce that the queueing system under consideration converges to a stationary state and that an ergodic theorem is valid. Theorem 3.7 below opens the door to the statistical analysis of such systems.

Theorem 3.7. ℬ(ℝ₊) such that, for all 0 < ̺ < 1/3,

for some c_1(̺), c_2(̺) > 0. Furthermore, if (Y_n)_{n∈ℤ} is ergodic, then for an arbitrary measurable and bounded Φ :

in L^p for all 1 ≤ p < ∞.

Remark 3.8.
It is known that Law(W n ) converges to a limiting distribution under rather mild conditions, see Example 14.1 on page 189 of [3]. Details of this approach seem to be available only in Russian, see [2]. However, as far as we know, Theorem 3.7 above is the first result providing a rate of convergence and a law of large numbers in this setting.

Stochastic gradient Langevin algorithm
We consider, for some λ > 0, the recursion

where ξ_n, n ≥ 1, is an independent sequence of standard d-dimensional Gaussian random variables, Y_n, n ∈ ℤ, is an ℝ^m-valued strict-sense stationary process and H :

This algorithm is called "stochastic gradient Langevin dynamics" (SGLD). Suggested by [18], it has recently become widely used for sampling from high-dimensional probability distributions. More precisely, let U :

For λ small and n large, Law(θ_n) is expected to be close to the probability defined by

see e.g. [18,1]. The literature on SGLD is abundant, but practically all studies assume that Y_n, n ∈ ℤ, are i.i.d. For the case where the step size λ_n is decreasing, it has been shown in [19] that, under suitable assumptions, the averages

In the case of fixed λ, [15] estimated the L^2 distance between D_n and D.
In the present article we also keep λ fixed, but we establish a novel result: the SGLD recursion converges to a limiting law µ(λ) (in total variation) and D_n tends to a limit. As far as we know, this ergodic property has not yet been pointed out, even in the case of i.i.d. Y_n, n ∈ ℤ. We can now prove it for a broad class of stationary processes Y_n, n ∈ ℤ. We think of Y_n as an observed data sequence. As such data are rarely i.i.d. in practice, Theorem 4.6 below provides strong theoretical support for the use of SGLD with possibly dependent data.
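To make the scheme concrete, here is a minimal SGLD sketch on a one-dimensional toy model. The choices of H, U and the data stream are our own illustrative assumptions, and the recursion form θ_{n+1} = θ_n − λH(θ_n, Y_{n+1}) + √(2λ) ξ_{n+1} is the standard one (assumed here, since the paper's display was lost in extraction).

```python
import numpy as np

# Minimal SGLD sketch on a toy model (illustrative assumptions throughout):
# with H(theta, y) = theta - y and a data stream averaging to 0, the mean
# field is grad U(theta) = theta for U(theta) = theta^2 / 2, so the
# invariant law should be close to N(0, 1) for small step size lam.
def sgld(H, theta0, data, lam, rng):
    theta, traj = theta0, []
    for y in data:
        theta = theta - lam * H(theta, y) + np.sqrt(2.0 * lam) * rng.standard_normal()
        traj.append(theta)
    return np.array(traj)

rng = np.random.default_rng(1)
# a dependent (non-i.i.d.) bounded "data" stream, in the spirit of the paper
data = 0.8 * np.sin(np.arange(50_000))
traj = sgld(lambda th, y: th - y, 0.0, data, lam=0.01, rng=rng)
burn = traj[10_000:]
# burn.mean() is near 0 and burn.var() near 1, up to O(lam) bias
```

The data here are deterministic and strongly dependent, yet the empirical law of θ_n stabilizes; this is exactly the kind of behavior the theorems of this section formalize.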
The following standard dissipativity condition is required, see e.g. [12].

Assumption 4.1.
There is a measurable ∆ : ℝ^m → ℝ₊ and b ≥ 0 such that, for all θ ∈ ℝ^d and y ∈ ℝ^m,

We may and will assume that ∆ is a bounded function.
Note that Assumption 4.3 holds, in particular, if H is Lipschitz-continuous.

Remark 4.5.
Boundedness of Y 0 could be relaxed to assuming only E[e β|Y 0 | 2 ] < ∞ for some β > 0. This relaxation leads to a weaker rate estimate through rather tedious technicalities hence we prefer not to treat it here.
It turns out that the law of θ n tends to a limit as n → ∞ and ergodic averages converge to the expectation under the limit law.
as n → ∞ in L p for all p ≥ 1.
Remark 4.7. The convergence rates given by the above theorem are not sharp enough for practical purposes. However, Theorem 4.6 provides a universal ergodic property for the stochastic gradient Langevin dynamics, irrespective of dependencies in the data stream (as long as they satisfy Assumption 4.2). No result of this calibre has heretofore been available in the related literature.
Note that, for |y| ≤ M and θ, z ∈ C(y), we have

Clearly, we can choose λ small enough such that

According to our previous estimate for Q(y, θ, A), for λ small enough, we have

for suitable c, ĉ > 0 depending on b, d, M and sup_{y∈ℝ^m} |∆(y)|.

Proofs
In this section, we gather the proofs of Theorems 2.7 and 2.9. For R > 0, denote by c(R) the set of mappings from 𝒳 into 𝒳 whose restriction to V^{−1}([0, R]) is constant. Throughout this section, ǫ > 0 and R(y) will be as in Assumption 2.4.

Preliminary lemmas and notations
The following random mapping representation of Q will play a crucial role in the proofs. It generalizes the idea of Lemma 6.1 in [7].

Lemma 5.1. There exists a sequence of measurable functions T_t such that

for all t ∈ ℕ, y ∈ 𝒴, x ∈ 𝒳, A ∈ ℬ, and there are events J_t(y), t ∈ ℕ, y ∈ 𝒴, such that

Furthermore, the sigma-algebras σ(T_t(y, x, ·), x ∈ 𝒳, y ∈ 𝒴), t ∈ ℕ, are independent.
Proof. We follow the proof of Lemma 6.1 in [7]. So, let U_n and ǫ_n, n ∈ ℕ, be sequences of i.i.d. uniform random variables on [0, 1], independent of each other. Without loss of generality, we may assume that

The case of countable 𝒳 is easy, hence omitted. In the case of uncountable 𝒳 we can also assume (by the Borel isomorphism theorem) that 𝒳 = ℝ and ℬ(𝒳) is the standard Borel σ-algebra of ℝ.
For y ∈ 𝒴, x ∈ ℝ and A ∈ ℬ(ℝ), let

and define

Obviously,

Furthermore, for all r ∈ ℝ, t ∈ ℕ and for any fixed y ∈ 𝒴 and x ∈ ℝ,

By the definition of the pseudoinverse, we can write

and similarly

as we desired. It remains only to show that T_t is measurable with respect to the sigma-algebras 𝓐 ⊗ ℬ(ℝ) ⊗ σ({U_t, ǫ_t | t ∈ ℕ}) and ℬ(ℝ). Indeed, T_t is a composition of measurable functions. The claimed independence of the sigma-algebras clearly holds, too.
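The mechanism behind this representation can be sketched as follows, on a hypothetical toy kernel of our own (not the paper's T_t): each step consumes two uniforms (U_t, ǫ_t); when U_t falls below α(y), the new state is drawn from the minorizing measure κ(y, ·) by inverse transform and does not depend on the current state, so two chains sharing the same uniforms coalesce at such a step.

```python
# Sketch of the random-mapping representation (toy kernel, illustrative
# assumptions): with probability alpha(y) the move is drawn from the
# minorizing measure kappa(y, .) via its quantile function, independently
# of x; otherwise a residual quantile map depending on x is applied.
def T(y, x, u, e, alpha, kappa_inv, resid_inv):
    if u < alpha(y):
        return kappa_inv(y, e)            # regeneration: x is forgotten
    return resid_inv(y, x, e)             # residual move: still depends on x

alpha = lambda y: 0.3                     # minorization constant (toy value)
kappa_inv = lambda y, e: e                # kappa(y, .) = Uniform[0, 1]
resid_inv = lambda y, x, e: 0.5 * x + e   # some residual quantile map

# A step with u < alpha(y): both chains jump to the same point.
a = T(0, 0.0, 0.1, 0.7, alpha, kappa_inv, resid_inv)
b = T(0, 5.0, 0.1, 0.7, alpha, kappa_inv, resid_inv)
print(a == b)  # True: regeneration couples the two chains
```

Once the two chains have met, they are driven by the same mappings forever after, which is the coupling exploited throughout this section.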
We drop the dependence of the mappings T_t on ω in the notation and will simply write T_t(y)x := T_t(y, x, ·). For s ∈ ℕ and x ∈ 𝒳, define the family of auxiliary processes

where y = (…, y_{−1}, y_0, y_1, …) ∈ 𝒴^ℤ is a fixed trajectory.
Lemma 5.4. For x ∈ 𝒳, y ∈ 𝒴^ℤ and k, l ∈ ℕ, l < k, we have

Proof. We proceed by induction. Let x ∈ 𝒳 and l ∈ ℕ be arbitrary and fixed. For the base case,

which holds by Assumption 2.2.

Let N ∈ ℕ₊, λ ∈ (1/2, 1) be fixed and P_1, P_2 : Ω → 𝒳 arbitrary 𝓕_0-measurable random variables, which may depend on y. Furthermore, in the remaining part of this subsection, we assume that y ∈ B^λ_{N^3,3}. Our purpose will be to prove that, with large probability, Z^{P_1,y}_{0,N^3} = Z^{P_2,y}_{0,N^3} for N large enough. In other words, a coupling between the processes Z^{P_1,y}_{0,N^3} and Z^{P_2,y}_{0,N^3} is realized. First, we are going to prove that the process Z_t := (Z^{P_1,y}_{0,t}, Z^{P_2,y}_{0,t}), t ∈ ℕ, visits the sets D(y_t) frequently enough, where

Let us define the successive visiting times

which are obviously (𝓕_t)_{t∈ℕ}-stopping times.
Lemma 5.5. For the tail distribution of σ_N, we have

Proof. If σ_N > N^3, then there exists k ∈ {0, …, N − 1} for which Z_{kN^2+l} ∉ D(y_{kN^2+l}), l = 1, …, N^2. Thus we can write

We estimate a general term of the latter sum. For typographical reasons, we will write a := kN^2 and b := N^2. By the tower rule, we have
Let us introduce the abbreviation M_N := max_{0≤k<N^3} α(y_k) for a moment. We can write

Iteration of this argument leads to the following estimate,
which completes the proof.
Lemma 5.7. Let Assumption 2.6 be in force. Then, for every N ∈ ℕ₊ and 1 ≤ p < ∞,

More precisely, there exist c, ν, β > 0 depending only on M, p and θ such that

hence by Jensen's inequality and the strong stationarity of
By Assumption 2.6, there is some 0 < θ′ < 1 for which the quantity in question tends to 0 as N → ∞, and trivially the same holds for any 0 < θ < θ′. Let us fix some θ ∈ (0, θ′). Then, by the previous point, we have the estimate

where c, ν, β > 0 depend only on M, p and θ, thus for sufficiently large n ∈ ℕ₊,
The next lemma is a crucial ingredient of the proofs of both Theorem 2.7 and Theorem 2.9.

Lemma 5.9. Under Assumption 2.6, there exists
Proof. According to Lemma 5.8, there exist c, ν > 0 such that ℙ(Y ∈ A^λ_n) ≥ 1 − c n^{2/3} e^{−ν n^{1/3}}. So, we obtain the following upper bound for the general term

which, by Lemma 5.7, has a finite sum.
We notice that the sequence of expected total variation distances has a finite sum, that is which implies the following.

Ergodicity of Φ(Z^y_t)
Let N ≥ 1 be an arbitrary natural number and y ∈ 𝒴^ℤ. Let us define the truncated process

We will use the results of Section 6. For p ≥ 1, introduce the quantities M_p(W) = sup_{t∈ℕ} ‖W_t‖_p and

If τ > ⌊N^{1/6}⌋^6, then γ_p(W, τ) = 0, thus Γ_p(W) is finite, which means that W_t, t ∈ ℕ, is L-mixing of order p with respect to (𝓕_t, 𝓕^+_t), t ∈ ℕ. According to Lemma 6.2, for p ≥ 2, we have the estimate

where C_p is a constant that depends neither on N nor on W. Let us consider the estimate

and for s, t ∈ ℕ, t ≥ s, introduce the auxiliary process

Note that W_{s,t} is measurable with respect to 𝓕^+_s; moreover,

which will be important later.
Finally, we arrive at the following important result which will play a central role in the proof of Theorem 2.9.
Lemma 5.12. There exists c(p, γ, λ) > 0 depending only on p, γ and λ such that

Proof. Without loss of generality, we may assume that p ≥ 2. Clearly, on C^λ,

holds, hence by (22) we can write

The square root function is subadditive, hence by Lemma 5.11 there exists c(p, γ, λ) depending only on p, γ and λ such that

Finally, we obtain the desired upper bound

which completes the proof.
It only remains to prove that µ* and µ** coincide. It is clear that, for every A ∈ ℬ,

hence µ* = µ**.

Proof of Theorem 2.9
Let N ≥ 1 be an arbitrary integer, 1 ≤ p < ∞, and consider the following estimate.
The stochastic process Y is strongly stationary and ergodic, hence the left shift S : 𝒴^ℤ → 𝒴^ℤ is an ergodic endomorphism of the probability space (𝒴^ℤ, 𝓐^{⊗ℤ}, Law(Y)); moreover, y ↦ ∫ Φ(z) µ*(y, dz) is obviously in L^1, hence Birkhoff's ergodic theorem implies that

almost surely, and also in L^p due to Lebesgue's dominated convergence theorem. By the strong stationarity of Y again, for the second term we have

which is a Cesàro sum; due to Lemma 5.9 the general term tends to zero, thus we obtain

Finally, due to the definition of µ_t(·, ·), for any fixed y ∈ 𝒴^ℤ, the law of Z^{x_0,y}_{0,t} equals µ_t(S^{t−1}y, ·), hence for the last term we have

According to Lemma 5.12, there exists c(p, γ, λ) > 0 such that

To sum up,

because the laws of X_t and Z^{x_0,Y}_{0,t} coincide. This completes the proof of Theorem 2.9.
Remark 5.13. Birkhoff's ergodic theorem does not provide an upper bound for the difference between time and space averages, hence we have a convergence rate for every term in (23) except the first one. However, in the ideal case this term is of order 1/√N, and this can be shown for Y with suitably favourable ergodic properties.

Appendix
For the reader's convenience, we recall a concept of mixing defined in [8], which was used in some of the estimations above. Let 𝓕_t, t ∈ ℕ, be an increasing sequence of sigma-algebras and 𝓕^+_t, t ∈ ℕ, a decreasing sequence of sigma-algebras such that, for each t ∈ ℕ, 𝓕_t is independent of 𝓕^+_t. Let W_t, t ∈ ℕ, be a real-valued stochastic process. For each p ≥ 1, introduce

For each process W such that M_1(W) < ∞, define, for each p ≥ 1,

For some p ≥ 1, the process W is called L-mixing of order p with respect to (𝓕_t, 𝓕^+_t), t ∈ ℕ, if it is adapted to (𝓕_t)_{t∈ℕ} and M_p(W) < ∞, Γ_p(W) < ∞. We say that W is L-mixing if it is L-mixing of order p for all p ≥ 1.
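The displayed formulas above were lost in extraction; the usual definitions from [8], reconstructed here (to be checked against the original), read:

```latex
M_p(W) := \sup_{t\in\mathbb{N}} \mathbb{E}^{1/p}\,|W_t|^p, \qquad
\gamma_p(\tau, W) := \sup_{t\ge\tau}
  \mathbb{E}^{1/p}\,\big|\,W_t - \mathbb{E}\big[W_t \mid \mathcal{F}^{+}_{t-\tau}\big]\big|^p,
\qquad
\Gamma_p(W) := \sum_{\tau=0}^{\infty} \gamma_p(\tau, W).
```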