Concentration inequalities for Markov processes via coupling

We obtain moment and Gaussian bounds for general Lipschitz functions evaluated along the sample path of a Markov chain. We treat Markov chains on general (possibly unbounded) state spaces via a coupling method. If the first moment of the coupling time exists, then we obtain a variance inequality. If a moment of order 1+epsilon of the coupling time exists, then depending on the behavior of the stationary distribution, we obtain higher moment bounds. This immediately implies polynomial concentration inequalities. In the case that a moment of order 1+epsilon is finite uniformly in the starting point of the coupling, we obtain a Gaussian bound. We illustrate the general results with house of cards processes, in which both uniform and non-uniform behavior of moments of the coupling time can occur.


Introduction
In this paper we consider a stationary Markov chain X n , n ∈ Z, and want to obtain inequalities for the probability that a function f (X 1 , . . . , X n ) deviates from its expectation. In the spirit of concentration inequalities, one can try to bound the exponential moment of f − E(f ) in terms of the sum of squares of the Lipschitz constants of f , as can be done in the case of independent random variables by several methods [17].
In the present paper, we want to continue the line of thought developed in [7,8] where concentration inequalities are obtained via a combination of martingale difference approach (telescoping f − E(f )) and coupling of conditional distributions. In the case of an unbounded state space, we cannot expect to find a coupling of which the tail of the distribution of the coupling time can be controlled uniformly in the starting points. This non-uniform dependence is thus rather the rule than the exception and has to be dealt with if one wants to go beyond the finite (or compact) state space situation. Moreover, if the state space is continuous, then in general two copies of the process cannot be coupled such that they eventually coincide: we expect rather that in a coupling the distance between the two copies can be controlled and becomes small when we go further in time. We show that a control of the distance suffices to obtain concentration inequalities. This leads to a "generalized coupling time" which in discrete settings coincides with the ordinary coupling time (in the case of a successful coupling).
In order to situate our results in the existing literature, we want to stress that the main message of this paper is the connection between the behavior of the generalized coupling time and concentration inequalities. In order to illustrate the possibly non-uniform behavior of the coupling time, we concentrate on the simplest possible example of "house of cards" processes (Markov chains on the natural numbers). In this paper we restrict to the Gaussian concentration inequality and moment inequalities. In principle, moment inequalities with control on the constants can be "summarized" in the form of Orlicz-norm inequalities, but we do not want to deal with this here.
The case of Markov chains was first considered by Marton [20,21]: for uniformly contracting Markov chains, in particular for ergodic Markov chains with finite state space, Gaussian concentration inequalities are obtained. The method developed in those papers is based on transportation cost-information inequalities. With the same technique, more general processes were considered by her in [22]. Later, Samson [25] obtained Gaussian concentration inequalities for some classes of Markov chains and Φ-mixing processes, by following Marton's approach. Let us also mention the work by Djellout et al. [9] for further results in that direction. Chatterjee [6] introduced a version of Stein's method of exchangeable pairs to prove Gaussian as well as moment concentration inequalities. Notice that moment inequalities were obtained for Lipschitz functions of independent random variables in [3]. Using martingale differences, Gaussian concentration inequalities were obtained in [15,24] for some classes of mixing processes. Markov contraction was used in [16] for "Markov-type" processes (e.g., hidden Markov chains).
Work related to ours is found in [10,11,12], where deviation or concentration inequalities [10] and speed of convergence to the stationary measure [11,12] are obtained for subgeometric Markov chains, using a technique of regeneration times and Lyapounov functions. Concentration properties of suprema of additive functionals of Markov chains are studied in [1], using a technique of regeneration times. The example of the house of cards process, and in particular its speed of relaxation to the stationary measure, is studied in [11], Section 3.1. The speed of relaxation to the stationary measure is of course related to the coupling time, see e.g., [23] for a nice recent account. In fact, using an explicit coupling, we obtain concentration inequalities in the different regimes of relaxation studied in [11].
Our paper is organized as follows. We start by defining the context and introducing the telescoping procedure, combined with coupling. Here the notion of coupling matrix is introduced. In terms of this matrix we can (pointwise) bound the individual terms in the telescopic sum for f − E(f ). We then turn to the Markov case, where there is a further simplification in the coupling matrix due to the Markov property of the coupling. In Section 5 we prove a variance bound under the assumption that the first moment of the (generalized) coupling time exists. In Section 6 we turn to moment inequalities. In this case we require that a moment of order 1 + ε of the (generalized) coupling time exists. This moment $M^{x,y}_{1+\epsilon}$ depends on the starting point of the coupling. The moment inequality for moments of order 2p will then be valid if (roughly speaking) the 2p-th moment of $M^{x,y}_{1+\epsilon}$ exists. In Section 7 we prove that if a moment of order 1 + ε of the coupling time is finite, uniformly in the starting point, then we have a Gaussian concentration bound.
Finally, Section 8 contains examples. In particular, we illustrate our approach in the context of so-called house of cards processes, in which both the uniform case (Gaussian bound) and the non-uniform case (all moments, or moments up to a certain order) occur. We end with an application of our moment bounds to measure concentration of Hamming neighborhoods and obtain non-Gaussian measure concentration bounds.

The process
The state space of our process is denoted by E. It is supposed to be a metric space with distance d. Elements of E are denoted by x, y, z. E is going to serve as state space of a double sided stationary process. Realizations of this process are thus elements of E Z and are denoted by x, y, z.
We denote by (X n ) n∈Z a (two-sided) stationary process with values in E. The joint distribution of (X n ) n∈Z is denoted by P, and E denotes corresponding expectation.
$\mathcal{F}^i_{-\infty}$ denotes the sigma-field generated by $\{X_k : k \le i\}$, and $\mathcal{F}_{-\infty} := \bigcap_i \mathcal{F}^i_{-\infty}$ denotes the tail sigma-field. We assume throughout this paper that P is tail trivial, i.e., for all sets $A \in \mathcal{F}_{-\infty}$, P(A) ∈ {0, 1}. For i < j, i, j ∈ Z, we denote by $X_i^j$ the vector $(X_i, X_{i+1}, \ldots, X_j)$, and similarly we have the notation $X^i_{-\infty}$, $X_i^{\infty}$. Elements of $E^{\{i,i+1,\ldots,j\}}$ (i.e., realizations of $X_i^j$) are denoted by $x_i^j$, and similarly we have $x^i_{-\infty}$, $x_i^{\infty}$.

Conditional distributions, Lipschitz functions
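We denote by $P_{x^i_{-\infty}}$ the joint distribution of $\{X_j : j \ge i+1\}$ given $X^i_{-\infty} = x^i_{-\infty}$. We assume that this object is defined for all $x^i_{-\infty}$, i.e., that there exists a specification with which P is consistent. This is automatically satisfied in our setting, see [14]. We further denote by $\hat{P}_{x^i_{-\infty}, y^i_{-\infty}}$ a coupling of $P_{x^i_{-\infty}}$ and $P_{y^i_{-\infty}}$.
For a function f on $E^Z$, we denote by $\delta_i(f)$ its Lipschitz constant in the i-th coordinate, i.e., the supremum of $|f(x) - f(y)|/d(x_i, y_i)$ over realizations x, y that differ only in the i-th coordinate. The function f is said to be Lipschitz in the i-th coordinate if $\delta_i(f) < \infty$, and Lipschitz in all coordinates if $\delta_i(f) < \infty$ for all i. We use the notation $\delta(f) = (\delta_i(f))_{i\in Z}$. We denote by $\mathrm{Lip}(E^Z, R)$ the set of all real-valued functions on $E^Z$ which are Lipschitz in all coordinates.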

Telescoping and the coupling matrix
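We start with $f \in \mathrm{Lip}(E^Z, R) \cap L^1(P)$, and begin with the classical telescoping (martingale-difference) identity $f - E(f) = \sum_{i=-\infty}^{\infty} \Delta_i$, where $\Delta_i := E(f \mid \mathcal{F}^i_{-\infty}) - E(f \mid \mathcal{F}^{i-1}_{-\infty})$. Using the notation of Section 2.1, each $\Delta_i$ can be written as an integral, with respect to the coupling $\hat{P}$ of the conditional distributions, of a difference of two values of f (1). For $f \in \mathrm{Lip}(E^Z, R)$, we have the following obvious telescopic inequality: $|f(x) - f(y)| \le \sum_{i \in Z} \delta_i(f)\, d(x_i, y_i)$ (2). Combining (1) and (2) one obtains $|\Delta_i| \le \sum_{j \ge 0} D_{i,i+j}\, \delta_{i+j}(f)$ (3), where $D_{i,i+j}$ is the expected distance, under the coupling, between the two coupled copies at coordinate i + j (4). This is an upper-triangular random matrix which we call the coupling matrix associated with the process $(X_n)$ and $\hat{P}$, the coupling of the conditional distributions.
As we obtained before in [7], in the context of E a finite set, the decay properties of the matrix elements $D_{i,i+j}$ as j grows determine the concentration properties of f. The non-uniformity (as a function of the realization of $X^i_{-\infty}$) of the decay of the matrix elements as a function of j (which we encountered e.g., in the low-temperature Ising model [7]) will be typical as soon as the state space E is unbounded. Indeed, if starting points in the coupling are further away, then it takes more time to get the copies close in the coupling.
REMARK 3.1. The same telescoping procedure can be obtained for "coordinatewise Hölder" functions, i.e., functions such that for some 0 < α < 1 the Hölder constant in the i-th coordinate (defined as $\delta_i(f)$, with d replaced by $d^{\alpha}$) is finite for all i. In (4), we then have to replace d by $d^{\alpha}$.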
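The telescoping identity can be checked by exact enumeration in a toy setting. The following is only a minimal sketch under simplifying assumptions that are not part of the paper: a finite one-sided path $X_1, \ldots, X_n$ instead of a two-sided process, a hypothetical two-state transition matrix `P` with stationary law `MU`, and an arbitrary test function `f`.

```python
from itertools import product

# Hypothetical two-state chain (illustration only, not taken from the paper).
P = [[0.7, 0.3], [0.4, 0.6]]
MU = [4 / 7, 3 / 7]  # stationary distribution of P


def path_prob(path):
    """Probability of a finite path started from the stationary law MU."""
    p = MU[path[0]]
    for a, b in zip(path, path[1:]):
        p *= P[a][b]
    return p


def cond_exp(f, prefix, n):
    """E[f(X_1..X_n) | X_1..X_i = prefix], computed by summing over all tails."""
    num = den = 0.0
    for tail in product((0, 1), repeat=n - len(prefix)):
        path = tuple(prefix) + tail
        w = path_prob(path)
        num += w * f(path)
        den += w
    return num / den


def martingale_differences(f, path):
    """Delta_i = E[f | F_i] - E[f | F_{i-1}] along the given path, i = 1..n."""
    n = len(path)
    return [cond_exp(f, path[:i], n) - cond_exp(f, path[:i - 1], n)
            for i in range(1, n + 1)]
```

For every path, the differences returned by `martingale_differences` sum to `f(path)` minus the unconditional mean `cond_exp(f, (), n)`, which is the finite analogue of the telescoping of f − E(f ).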

The Markov case
We now consider $(X_n)_{n\in Z}$ to be a stationary and ergodic Markov chain. We denote by $p(x, dy) := P(X_1 \in dy \mid X_0 = x)$ the transition kernel. We let ν be the unique stationary measure of the Markov chain. We denote by $P_{\nu}$ the path space measure of the stationary process $(X_n)_{n\in Z}$. By $P_x$ we denote the distribution of $(X_1^{\infty})$, for the Markov process conditioned on $X_0 = x$. We further suppose that the coupling $\hat{P}$ of Section 2.2 is Markovian, and denote by $\hat{P}_{x,y}$ the coupling started from x, y, and the corresponding expectation by $\hat{E}_{x,y}$. More precisely, by the Markov property of the coupling we then have that $\hat{P}_{X^i_{-\infty}, Y^i_{-\infty}} = \hat{P}_{X_i, Y_i}$. In this case the expression (4) of the coupling matrix simplifies to a function $\Psi_{X_{i-1},X_i}(j)$ of the pair $(X_{i-1}, X_i)$ and of j. With this notation, (3) reads $|\Delta_i| \le \sum_{j \ge 0} \Psi_{X_{i-1},X_i}(j)\, \delta_{i+j}(f)$.
We define the "generalized coupling time" $\tau(u, v) := \sum_{j \ge 0} d(u_j, v_j)$. In the case E is a discrete (finite or countable) alphabet, the "classical" coupling time is defined as usual, $T(u, v) := \inf\{k \ge 0 : u_j = v_j \text{ for all } j \ge k\}$, and hence, for the discrete metric, $\tau(u, v) \le T(u, v)$. Of course, the same inequality remains true if E is a bounded metric space with d(x, y) ≤ 1 for x, y ∈ E. However, a "successful coupling" (i.e., a coupling with T < ∞) is not expected to exist in general in the case of a non-discrete state space. It can however exist, see e.g., [13] for a successful coupling in the context of Zhang's model of self-organized criticality. Let us also mention that the "generalized coupling time" unavoidably appears in the context of dynamical systems [8].
In the discrete case, using (5) and (7), we obtain the following inequality: whereas in the general (not necessarily discrete) case we have, by (6), and monotone convergence,
REMARK 4.1. So far, we made a telescoping of f − E(f ) using an increasing family of sigma-fields. One can as well consider a decreasing family of sigma-fields, such as $\mathcal{F}_i^{\infty}$, defined to be the sigma-fields generated by $\{X_k : k \ge i\}$. We then have, mutatis mutandis, the same inequalities using "backward telescoping" and estimating $\Delta^*_i$ in a completely parallel way, by introducing a lower-triangular analogue of the coupling matrix.
Backward telescoping is natural in the context of dynamical systems where the forward process is deterministic, hence cannot be coupled (as defined above) with two different initial conditions such that the copies become closer and closer. However, backwards in time, such processes are non-trivial Markov chains for which a coupling can be possible with good decay properties of the coupling matrix. See [8] for a concrete example with piecewise expanding maps of the interval.

Variance inequality
For a real-valued sequence $(a_i)_{i\in Z}$, we denote the usual $\ell^p$-norm by $\|a\|_p = \left(\sum_{i\in Z} |a_i|^p\right)^{1/p}$. Our first result concerns the variance of an $f \in \mathrm{Lip}(E^Z, R)$. Then As a consequence, we have the concentration inequality
Proof. We estimate, using (5) and stationarity where * denotes convolution, and where we extended Ψ to Z by putting it equal to zero for negative integers. Since Using Young's inequality, we then obtain, Now, using the equality in (9) $E\,\Psi_{X_0,X_1}$ which is (10). Inequality (12) follows from Chebyshev's inequality.
The expectation in (11) can be interpreted as follows. We start from a point x drawn from the stationary distribution and generate three independent copies y, u, z from the Markov chain at time t = 1 started from x. With these initial points we start the coupling in couples (y, z) and (u, z), and compute the expected coupling time.
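The proof above relies on Young's inequality for convolutions, in the form $\|a * b\|_2 \le \|a\|_1 \|b\|_2$. A small numerical sketch of this step follows; the helper names `convolve` and `lp_norm` and the finite sequences used to exercise them are illustrative assumptions, standing in for the sequence Ψ and the vector δ(f ) of Lipschitz constants.

```python
def convolve(a, b):
    """Discrete convolution (a * b)_k = sum_i a_i b_{k-i} of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def lp_norm(a, p):
    """The l^p norm (sum_i |a_i|^p)^(1/p) of a finite sequence."""
    return sum(abs(x) ** p for x in a) ** (1.0 / p)


def young_gap(a, b):
    """Return ||a||_1 ||b||_2 - ||a * b||_2, which Young's inequality makes >= 0."""
    return lp_norm(a, 1) * lp_norm(b, 2) - lp_norm(convolve(a, b), 2)
```

For instance, with a hypothetical summable tail sequence `psi = [1.0, 0.5, 0.25, 0.125]` and a vector `delta = [0.3, -1.0, 2.0, 0.0, 0.7]`, `young_gap(psi, delta)` is non-negative, which is the mechanism by which a finite first moment of the coupling time controls the variance in terms of $\|\delta(f)\|_2^2$.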

Moment inequalities
In order to control higher moments of (f − E(f )), we have to tackle higher moments of the sum $\sum_i \Delta_i^2$, and for these we cannot use the simple stationarity argument used in the estimation of the variance.
We then obtain, using the Cauchy-Schwarz inequality: where $\delta(f)^2$ denotes the sequence with components $(\delta_i(f))^2$, and where Moment inequalities will now be expressed in terms of moments of $\Psi^2_{\epsilon}$.

Moment inequalities in the discrete case
We first deal with a discrete state space E. Recall (7).
LEMMA 6.1. In the discrete case, i.e., if E is a countable set with the discrete metric, then, for all ε > 0, we have the estimate
Proof. Start with Proceed now with where we denoted by $T_1$ and $T_2$ two independent coupling times corresponding to two independent copies of the coupling started from $(X_i, z)$, resp. $(X_i, u)$. Now use that for two independent non-negative real-valued random variables we have The lemma is proved.
In order to arrive at moment estimates, we want an estimate for This is the content of the next lemma. We denote, as usual, $\zeta(s) = \sum_{n=1}^{\infty} (1/n)^s$.
LEMMA 6.2. For all ε > 0 and integers p > 0 we have
Proof. We start from Then use Hölder's inequality and stationarity, to obtain where in the second inequality we used Young's inequality. The lemma now follows from (15).
We can now formulate our moment estimates in the discrete case.
As a consequence we have the concentration inequalities
Proof. By Burkholder's inequality [5, Theorem 3.1, p. 87], one gets
Typically, the coupling time satisfies a tail estimate of the form $\hat{P}_{x,y}(T \ge j) \le C(x, y)\,\varphi(j)$. Here C(x, y) is a constant that depends, in general in an unbounded way, on the starting points (x, y) in the coupling, and φ(j), which determines the tail of the coupling time, does not depend on the starting points. Therefore, for the finiteness of the constant $C_p$ in (18) we need first that the tail estimate φ(j) decays fast enough so that $\sum_j j^{\epsilon} \varphi(j) < \infty$ (a condition that does not depend on p), and next that the 2p-th power of the constant C(x, y) is integrable (this depends on p).

The general state space case
In order to formulate the general state space version of these results, we introduce the expectation $\tilde{E}_{x,y}(F(u, v)) = \int p(x, dz) \int \hat{P}_{y,z}(du, dv)\, F(u, v)$.
We can then rewrite We introduce This quantity is the analogue of $\hat{P}(T = j)$ of the discrete case. We then define $M^{x,y}_r = \sum_{j \ge 0} (j + 1)^r \alpha^{x,y}_j$ (20), which is the analogue of the r-th moment of the coupling time. The analogue of Theorem 6.1 then becomes the following.
THEOREM 6.2. Let p ≥ 1 be an integer and $f \in \mathrm{Lip}(E^Z, R) \cap L^{2p}(P)$. Then for all ε > 0 we have the estimate

Gaussian concentration bound
If one has a uniform estimate of the quantity (14), we obtain a corresponding uniform estimate for $\Delta_i^2$, and via Hoeffding's inequality, a Gaussian bound for (f − E(f )). This is formulated in the following theorem.
In particular, we get the concentration inequality The general state space analogue of these bounds is obtained by replacing $\hat{E}_{x,y}(T^{1+\epsilon})$ by $M^{x,y}_{1+\epsilon}$ (see (20)).
The assumption that a moment of order 1 + ε of the coupling time exists, uniformly bounded in the starting point, can be weakened to the same property for the first moment, if we have some form of monotonicity. More precisely, we say that a coupling has the monotonicity property if there exist "worst case starting points" $x_u$, $x_l$, which have the property that $\sup_{x,y} \hat{P}_{x,y}(T \ge j) \le \hat{P}_{x_u,x_l}(T \ge j)$ for all j ≥ 0. In that case, using (8), we can start from (5) and obtain, in the discrete case, the uniform bound $|\Delta_i| \le \sum_{j \ge 0} \hat{P}_{x_u,x_l}(T \ge j)\, \delta_{i+j} f$, and via the Azuma-Hoeffding inequality, combined with Young's inequality, we then obtain the Gaussian bound (21) with a constant determined by $\hat{E}_{x_u,x_l}(T)$.
Finally, it can happen (especially if the state space is unbounded) that the coupling has no worst case starting points, but there is a sequence $x^n_u$, $x^n_l$ of elements of the state space such that $\hat{P}_{x^n_u,x^n_l}(T \ge j)$ is a non-decreasing sequence in n for every fixed j and $\sup_{x,y} \hat{P}_{x,y}(T \ge j) \le \lim_{n\to\infty} \hat{P}_{x^n_u,x^n_l}(T \ge j)$. (E.g., in the case of the state space Z, we can think of the sequence $x^n_u \to \infty$ and $x^n_l \to -\infty$.) In that case, from monotone convergence we have the Gaussian concentration bound with $C = \lim_{n\to\infty} \frac{1}{2} \hat{E}_{x^n_u,x^n_l}(T)$.

Finite-state Markov chains
As we mentioned in the introduction, this case was already considered by K. Marton (and others), but it illustrates our method in the simplest setting, and also gives an alternative proof in this setting. Indeed, if the chain is aperiodic and irreducible, then it is well-known [26] that $\sup_{u,v \in E} \hat{P}_{u,v}(T \ge j) \le e^{-cj}$ for all j ≥ 1 and some c > 0. Hence the Gaussian bound (21) holds.
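The exponential tail of the coupling time can be computed exactly for small chains. The sketch below makes an assumption that is not taken from the paper: it uses the independent coupling (two copies move independently until they first meet and move together afterwards) rather than any specific coupling from the references. The function name and the example matrix are illustrative.

```python
def survival_probabilities(P, x, y, horizon):
    """Return [P(T > j) for j = 0..horizon] for the independent coupling
    of two copies of the chain with transition matrix P, started at (x, y).

    T is the first time the two copies occupy the same state.
    """
    n = len(P)
    # mass[(a, b)]: probability that the copies sit at (a, b) and have not met yet
    mass = {(x, y): 1.0} if x != y else {}
    out = [sum(mass.values())]
    for _ in range(horizon):
        new_mass = {}
        for (a, b), w in mass.items():
            for a2 in range(n):
                for b2 in range(n):
                    if a2 != b2:  # keep only pairs that are still uncoupled
                        new_mass[(a2, b2)] = (new_mass.get((a2, b2), 0.0)
                                              + w * P[a][a2] * P[b][b2])
        mass = new_mass
        out.append(sum(mass.values()))
    return out
```

For the two-state matrix `[[0.7, 0.3], [0.4, 0.6]]` started from (0, 1), the two copies stay apart in one step with probability 0.7·0.6 + 0.3·0.4 = 0.54, so the survival probabilities are exactly $0.54^j$, a geometric decay that is uniform in the starting points.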

House of cards processes
These are Markov chains on the set of natural numbers which are useful in the construction of couplings for processes with long-range memory, and dynamical systems, see e.g., [4]. More precisely, a house of cards process is a Markov chain on the natural numbers with transition probabilities $P(X_{k+1} = n + 1 \mid X_k = n) = 1 - q_n = 1 - P(X_{k+1} = 0 \mid X_k = n)$, for n = 0, 1, 2, . . ., i.e., the chain can go "up" by one unit or go "down" to zero. Here, $0 < q_n < 1$.
In the present paper, house of cards chains serve as a nice class of examples where we can have moment inequalities up to a certain order, depending on the decay of $q_n$, and even Gaussian inequalities. Given a sequence of independent uniformly distributed random variables $(U_k)$ on [0, 1], we can view the process $X_k$ as generated via the recursion $X_{k+1} = (X_k + 1)\, 1\{U_{k+1} \ge q_{X_k}\}$ (24). This representation also yields a coupling of the process for different initial conditions. The coupling has the property that when the coupled chains meet, they stay together forever. In particular, they will stay together forever after they hit zero together. For this coupling, we have the following estimate.
LEMMA 8.1. Consider the coupling defined via (24), started from initial condition (k, m) with k ≥ m. Then we have $\hat{P}_{k,m}(T > t) \le \prod_{j=0}^{t-1} (1 - q^*_{k+j})$ (25), where $q^*_n := \inf_{s \le n} q_s$.
Proof. Call $Y^k_t$ the process defined by (24) started from k, and define $Z^k_t$, a process started from k defined via the recursion $Z^k_{t+1} = (Z^k_t + 1)\, 1\{U_{t+1} \ge q^*_{Z^k_t}\}$, where $U_t$ is the same sequence of independent uniformly distributed random variables as in (24). We claim that, for all t ≥ 0, $Y^m_t \le Y^k_t \le Z^k_t$. Indeed, the inequalities hold at time zero. Suppose they hold at time t; then, since $q^*_n$ is non-increasing as a function of n and $q^*_n \le q_n$, whenever $Z^k$ jumps to zero so do $Y^k$ and $Y^m$, and otherwise the order is preserved, so the inequalities hold at time t + 1. Therefore, in this coupling, if $Z^k_t = 0$, then $Y^m_t = Y^k_t = 0$, and hence the coupling time is dominated by the first visit of $Z^k_t$ to zero, which gives (25).
The behavior (25) of the coupling time shows the typical non-uniformity as a function of the initial condition. More precisely, the estimate in the rhs of (25) becomes bad for large k. We now look at more concrete cases.
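The coupling through a shared sequence of uniforms can be sketched in a few lines. This is only an illustration under stated assumptions: the recursion is taken in the form "move up by one if $U \ge q_x$, otherwise jump to zero", and the sequence `q(n) = max(n, 2) ** (-alpha)` is a hypothetical non-increasing choice (so that 0 < q < 1 also holds for n = 0, 1) modelled on Case 1 below; the function names are not from the paper.

```python
import random


def q(n, alpha=0.5):
    """Hypothetical non-increasing reset probability with 0 < q(n) < 1."""
    return max(n, 2) ** (-alpha)


def step(x, u, alpha=0.5):
    """One step of the chain driven by the uniform u: go up, or reset to zero."""
    return x + 1 if u >= q(x, alpha) else 0


def coupled_paths(k, m, steps, seed=0, alpha=0.5):
    """Run two copies, started at k >= m, driven by the same uniforms."""
    rng = random.Random(seed)
    upper, lower = [k], [m]
    for _ in range(steps):
        u = rng.random()  # shared randomness is what couples the two copies
        upper.append(step(upper[-1], u, alpha))
        lower.append(step(lower[-1], u, alpha))
    return upper, lower


def coupling_time(upper, lower):
    """First time the two paths agree (None if they never meet in the window)."""
    for t, (a, b) in enumerate(zip(upper, lower)):
        if a == b:
            return t
    return None
```

Because `q` is non-increasing, the copy started higher resets only when the lower one does, so the order of the two copies is preserved and they can only meet at zero. A call such as `coupled_paths(10, 3, 2000, seed=1)` can be used to observe that the coupling time is the first visit of the upper copy to zero, which is the mechanism behind the dependence of the tail estimate on the starting point k.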

Case 1: $q_n = 1/n^{\alpha}$, n ≥ 2, 0 < α < 1. Then it is easy to deduce from (25) that The stationary (probability) measure is given by: which is bounded from above by From (26), combined with (27), (29), it is then easy to see that the constant $C_p$ of (18) is finite for all p ∈ N. Therefore, in that case the moment inequalities (17) hold, for all p ≥ 1.
Case 2: $q_n = \gamma/n$ (γ > 0) for n ≥ γ + 1, and other values $q_i$ are arbitrary. In this case we obtain from (25) the estimate and for the stationary measure we have (28) with The constant $C_p$ of (18) is therefore bounded by where $C_1 = (2p - 1)^{2p} (\zeta(1 + \epsilon)/2)^p$ is finite independent of γ, and where where δ := ε/2. To see when $C_2 < \infty$, we first look at the behavior of The sum in the rhs is convergent for b − a > 1, in which case it behaves as $k^{1+a-b}$ for k large, which gives for our case a = δ, b = γ, γ > 1 + δ.

Ergodic interacting particle systems
As a final example, we consider spin-flip dynamics in the so-called M < ε regime. These are Markov processes on the space $E = \{0, 1\}^S$, with S a countable set. This is a metric space with distance The space E is interpreted as the set of configurations of "spins" $\eta_i$ which can be up (1) or down (0) and are defined on the set S (usually taken to be a lattice such as $Z^d$). The spin at site i ∈ S flips at a configuration dependent rate c(i, η). The process is then defined via its generator on local functions defined by where $\eta^i$ is the configuration obtained from η by flipping at site i. See [18] for more details about existence and ergodicity of such processes. We assume here that we are in the so-called "M < ε regime", where we have the existence of a coupling (the so-called "basic coupling") for which we have the estimate $\hat{P}_{\eta^j,\eta}(\eta_i(t) \ne \zeta_i(t)) < e^{-\epsilon t} e^{\Gamma(i,j) t}$ with Γ(i, j) a matrix indexed by S with finite $\ell^1$-norm M < ε. As a consequence, from any initial configuration, the system evolves exponentially fast to its unique equilibrium measure, which we denote µ. The stationary Markov chain is then defined as $X_n = \eta_{n\delta}$ where δ > 0, and $\eta_0 = X_0$ is distributed according to µ.

Measure concentration of Hamming neighborhoods
We apply Theorem 6.1 to measure concentration of Hamming neighborhoods. The case of contracting Markov chains was already (and first) obtained in [20] as a consequence of an information divergence inequality. We can easily obtain such Gaussian measure concentration from (21). But, by a well-known result of Bobkov and Götze [2], (21) and that information divergence inequality are in fact equivalent. The interesting situation is when (21) does not hold and only moment inequalities are available.
Proof. We apply Theorem 6.1 to f = d(·, A), which is a function defined on $E^n$. It is easy to check that $\delta_i(f) \le 1/n$, i = 1, . . . , n, so that $\|\delta(f)\|_2^2 \le 1/n$. We first estimate E(f ) by using (17), which gives (using the fact that $f|_A = 0$) $E(f) \le \frac{C_p^{1/2p}}{\sqrt{n}\, (P(A))^{1/2p}}$. Now we apply (19) with $t = \epsilon - \frac{C_p^{1/2p}}{\sqrt{n}\, (P(A))^{1/2p}}$. The result then easily follows.
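The claim $\delta_i(f) \le 1/n$ for the distance to a set can be checked by brute force on a small example. The sketch below assumes that d(·, A) is the normalized Hamming distance (number of differing coordinates divided by n) and that single coordinates carry the discrete metric; the binary alphabet, the set `A` used in the usage note and the function names are hypothetical.

```python
from itertools import product


def hamming_to_set(x, A):
    """Normalized Hamming distance from the word x to the set of words A."""
    n = len(x)
    return min(sum(a != b for a, b in zip(x, y)) for y in A) / n


def coordinate_lipschitz(f, n, i, alphabet=(0, 1)):
    """Largest change of f when only coordinate i is modified (discrete metric)."""
    best = 0.0
    for x in product(alphabet, repeat=n):
        for s in alphabet:
            if s != x[i]:
                y = x[:i] + (s,) + x[i + 1:]
                best = max(best, abs(f(x) - f(y)))
    return best
```

For example, with `A = {(0, 0, 0, 0), (1, 1, 0, 0)}` and `f = lambda x: hamming_to_set(x, A)`, the value `coordinate_lipschitz(f, 4, i)` does not exceed 1/4 for any coordinate i, and f vanishes on A, which are the two properties of f used in the proof.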
As we saw in Section 8.2, we cannot have Gaussian bounds for certain house of cards processes, but only moment estimates up to a critical order. In particular, this means that we cannot have a Gaussian measure concentration of Hamming neighborhoods. But in that case we can apply the previous theorem and get polynomial measure concentration.