A Hoeffding inequality for Markov chains

We prove deviation bounds for the random variable $\sum_{i=1}^{n} f_i(Y_i)$, in which $\{Y_i\}_{i=1}^{\infty}$ is a Markov chain with a stationary distribution and state space $[N]$, and $f_i: [N] \rightarrow [-a_i, a_i]$. Our bound improves upon previously known bounds in that the dependence is on $\sqrt{a_1^2+\cdots+a_n^2}$ rather than $\max_{i}\{a_i\}\sqrt{n}$. We also prove deviation bounds for certain types of sums of vector-valued random variables obtained from a Markov chain in a similar manner. One application is bounding the expected value of the Schatten $\infty$-norm of a random matrix whose entries are obtained from a Markov chain.


Introduction
Consider a Markov chain $\{Y_i\}_{i=1}^{\infty}$ with state space $[N]$, transition matrix $A$, and stationary distribution $\pi$ such that $Y_1$ is distributed as $\pi$. Let $E_\pi$ be the associated averaging operator defined by $(E_\pi)_{ij} = \pi_j$, so that for $v \in \mathbb{R}^N$, $E_\pi v = \mathbb{E}_\pi[v]\mathbf{1}$, where $\mathbf{1}$ is the vector whose entries are all $1$.
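As a quick illustration (ours, not from the paper), the following Python sketch builds $E_\pi$ for a hypothetical 3-state chain and checks the identity $E_\pi v = \mathbb{E}_\pi[v]\mathbf{1}$; the distribution $\pi$ below is an arbitrary choice.

    # Illustrative sketch: the averaging operator E_pi has every row equal to pi.
    import numpy as np

    pi = np.array([0.5, 0.3, 0.2])            # an assumed stationary distribution
    E_pi = np.tile(pi, (3, 1))                # (E_pi)_{ij} = pi_j
    v = np.array([1.0, -2.0, 4.0])
    # E_pi v equals (sum_j pi_j v_j) times the all-ones vector
    assert np.allclose(E_pi @ v, (pi @ v) * np.ones(3))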
In the case that the $Y_i$ are independent, that is, $A = E_\pi$, it is well known (see [10]) that for functions $f_1, \ldots, f_n : [N] \to [-1, 1]$ with $\mathbb{E}[f_i(Y_i)] = 0$ for all $i$,
$$\Pr\left[\left|\sum_{i=1}^{n} f_i(Y_i)\right| \ge u\sqrt{n}\right] \le 2e^{-u^2/2}. \tag{1.1}$$
Gillman generalized Eq. (1.1) to all Markov chains with a stationary distribution, in terms of the quantity $\lambda = \|A - E_\pi\|_{L_2(\pi) \to L_2(\pi)}$, in the case $f_1 = \cdots = f_n$ [7]. These bounds were refined in a long series of works, including [4, 11, 13, 22, 12, 9, 3, 8, 16, 15, 17]. We state the following version, due to Healy [9], which handles the case in which the $f_i$ are not necessarily equal.
Returning to the case of independent random variables, Hoeffding generalized Eq. (1.1) to the case in which each function $f_i$ has range $[-a_i, a_i]$, obtaining the following bound [10]:
$$\Pr\left[\left|\sum_{i=1}^{n} f_i(Y_i)\right| \ge u\sqrt{a_1^2 + \cdots + a_n^2}\right] \le 2e^{-u^2/2}. \tag{1.3}$$
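As a sanity check (ours), the following Monte Carlo sketch verifies Eq. (1.3) empirically in the independent case $A = E_\pi$ with $\pi$ uniform; the functions $f_i$ below take values $\pm a_i$, balanced so that $\mathbb{E}[f_i(Y_i)] = 0$, and all parameters are arbitrary choices.

    # Empirical check of Hoeffding's bound (1.3) for independent uniform states.
    import numpy as np

    rng = np.random.default_rng(0)
    n, N, trials, u = 50, 10, 20000, 2.0
    a = rng.uniform(0.5, 2.0, size=n)          # ranges [-a_i, a_i]
    signs = np.array([1.0] * (N // 2) + [-1.0] * (N // 2))
    f = a[:, None] * np.array([rng.permutation(signs) for _ in range(n)])
    Y = rng.integers(0, N, size=(trials, n))   # independent uniform states
    S = f[np.arange(n), Y].sum(axis=1)         # S[t] = sum_i f_i(Y_i) in trial t
    emp = np.mean(np.abs(S) >= u * np.linalg.norm(a))
    print(emp, "<=", 2 * np.exp(-u**2 / 2))    # empirical tail vs. the bound (1.3)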
In this work, we generalize Eq. (1.3) to Markov chains with a stationary distribution. In particular, we prove the following.
Theorem 1.1. Let $\{Y_i\}_{i=1}^{\infty}$ be a stationary Markov chain with state space $[N]$, transition matrix $A$, stationary probability measure $\pi$, and averaging operator $E_\pi$, so that $Y_1$ is distributed according to $\pi$. Let $\lambda = \|A - E_\pi\|_{L_2(\pi)\to L_2(\pi)}$, and let $f_1, \ldots, f_n : [N] \to \mathbb{R}$ be such that $\mathbb{E}[f_i(Y_i)] = 0$ for all $i$ and $|f_i(v)| \le a_i$ for all $v \in [N]$ and all $i$. Then for $u \ge 0$,
$$\Pr\left[\left|\sum_{i=1}^{n} f_i(Y_i)\right| \ge u\sqrt{a_1^2 + \cdots + a_n^2}\right] \le 2e^{-u^2(1-\lambda)/(64e)}.$$
We remark that the dependence on $\lambda$ in both Eq. (1.2) and Theorem 1.1 is optimal, as shown in [12], which considered the case that the $f_i$ are equal. In particular, one can consider the Markov chain on two states with transition matrix $\lambda I + (1-\lambda)E_\pi$, where $\pi$ is uniform: taking $f_i(1) = 1$ and $f_i(2) = -1$, the sum $\sum_i f_i(Y_i)$ is similar to the sum of $n(1-\lambda)$ random variables that are close to $1/(1-\lambda)$ or close to $-1/(1-\lambda)$, both with equal probability.
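The two-state example can be simulated directly; the sketch below (ours) uses the chain $A = \lambda I + (1-\lambda)E_\pi$ with $\pi$ uniform and $f(1) = 1$, $f(2) = -1$, so the walk stays in its current state for about $1/(1-\lambda)$ steps at a time.

    # The lower-bound example: long runs make the sum fluctuate at scale
    # sqrt(n/(1-lambda)) rather than sqrt(n).
    import numpy as np

    rng = np.random.default_rng(1)
    lam, n = 0.9, 100000
    Y = np.empty(n, dtype=int)
    Y[0] = rng.integers(2)
    for i in range(1, n):
        # stay with probability (1+lam)/2, switch with probability (1-lam)/2
        Y[i] = Y[i - 1] if rng.random() < (1 + lam) / 2 else 1 - Y[i - 1]
    S = np.sum(1 - 2 * Y)                      # f maps state 0 -> +1, state 1 -> -1
    print(S / np.sqrt(n))                      # typically of order 1/sqrt(1-lam)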
We also remark that Theorem 1.1 holds even for non-reversible Markov chains, continuing the work of [3], who were the first to consider this setting. If the Markov chain is not reversible, it is possible for $\|A - E_\pi\|_{L_2(\pi)\to L_2(\pi)}$ to be greater than $1$, in which case the bound in Theorem 1.1 is trivial.
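The quantity $\lambda$ can be computed numerically: since $\|x\|_{L_2(\pi)} = \|D^{1/2}x\|_2$ with $D = \mathrm{diag}(\pi)$, the operator norm equals the largest singular value of $D^{1/2}(A - E_\pi)D^{-1/2}$. The sketch below (ours; the example chain is an arbitrary non-reversible choice) implements this.

    # Compute lambda = ||A - E_pi||_{L2(pi)->L2(pi)} for a given chain.
    import numpy as np

    def markov_lambda(A):
        evals, evecs = np.linalg.eig(A.T)
        pi = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
        pi = pi / pi.sum()                     # stationary distribution of A
        E = np.tile(pi, (len(pi), 1))
        D, Dinv = np.diag(np.sqrt(pi)), np.diag(1 / np.sqrt(pi))
        return np.linalg.norm(D @ (A - E) @ Dinv, 2)

    A = np.array([[0.0, 1.0, 0.0],             # a non-reversible example
                  [0.0, 0.0, 1.0],
                  [0.6, 0.4, 0.0]])
    print(markov_lambda(A))                    # may exceed 1 for non-reversible chains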

Extension to vector-valued random variables
Recently, much attention has been paid to tail bounds for sums of vector-valued random variables. Naor [14] obtained tail bounds for sums of random variables from a Banach space satisfying certain properties. Before stating the corresponding theorem, we define a quantity called the modulus of uniform smoothness of a Banach space $(X, \|\cdot\|)$:
$$\rho_X(\tau) = \sup\left\{\frac{\|x + \tau y\| + \|x - \tau y\|}{2} - 1 : x, y \in X,\ \|x\| = \|y\| = 1\right\}.$$
Let $(X, \|\cdot\|)$ be a Banach space such that $\rho_X(\tau) \le s\tau^2$ for some $s$ and all $\tau > 0$. When the elements of the Markov chain are independent, Eq. (1.4) of [14] bounds the tail of $\|f_1(Y_1) + \cdots + f_n(Y_n)\|$ for functions $f_i : [N] \to X$ with $\mathbb{E}[f_i(Y_i)] = 0$ and $\|f_i(v)\| \le a_i$, at scale $u\sqrt{s(a_1^2 + \cdots + a_n^2)}$, by a bound of the form $e^{-u^2/c}$ for some universal constant $c$. We extend Theorem 1.1 to random variables from a fixed Banach space as follows. We stress that the setting in the following theorem is more limited than that of Eq. (1.4).
In particular, we only allow random variables of the form $f_i(Y_i)X_i$, in which $f_i(Y_i)$ is a random scalar and $X_i$ is a fixed element of the Banach space.

Theorem 1.3. Let $(X, \|\cdot\|)$ be a Banach space, and let $X_1, \ldots, X_n \in X$. Let $\{Y_i\}_{i=1}^{\infty}$ be a stationary Markov chain with state space $[N]$, transition matrix $A$, stationary probability measure $\pi$, and averaging operator $E_\pi$, so that $Y_1$ is distributed according to $\pi$. Let $\lambda = \|A - E_\pi\|_{L_2(\pi)\to L_2(\pi)}$, and let $f_1, \ldots, f_n : [N] \to [-1, 1]$ be such that $\mathbb{E}[f_i(Y_i)] = 0$ for all $i$. Then there exist universal constants $C$ and $L$ such that for any $u \ge 0$,
$$\Pr\left[\left\|\sum_{i=1}^{n} f_i(Y_i)X_i\right\| \ge \frac{C}{\sqrt{1-\lambda}}\left(\mathbb{E}\left[\left\|\sum_{i=1}^{n} g_iX_i\right\|\right] + u \sup_{x^* \in B^*}\Big(\sum_{i=1}^{n} x^*(X_i)^2\Big)^{1/2}\right)\right] \le Le^{-u^2},$$
where $g_1, \ldots, g_n \sim N(0,1)$ are independent standard Gaussian random variables and $B^*$ is the closed unit ball of the dual space $X^*$.
Note that Eq. (1.4) implies that $\mathbb{E}[\|g_1X_1 + \cdots + g_nX_n\|] \le C\sqrt{s(\|X_1\|^2 + \cdots + \|X_n\|^2)}$ for some constant $C$. This follows from the fact that the distribution of the normalized sum of independent Rademacher random variables approaches that of a Gaussian in the limit. Thus, for Banach spaces that satisfy $\rho_X(\tau) \le s\tau^2$, the bound of Theorem 1.3 also holds with $\mathbb{E}[\|g_1X_1 + \cdots + g_nX_n\|]$ replaced by $C\sqrt{s(\|X_1\|^2 + \cdots + \|X_n\|^2)}$.

Bounds on the Schatten ∞-norm of a random matrix
As an application, we are able to generalize bounds on the Schatten ∞-norm of a matrix with independent entries to matrices whose entries are obtained from a Markov chain with stationary distribution.
Let $V$ be the set of pairs $(i, j)$ with $1 \le i \le j \le d$, and let $B = (b_{i,j}) \in \mathbb{R}^{d\times d}$ be a symmetric matrix with positive entries. Let $X \in \mathbb{R}^{d\times d}$ be the random symmetric matrix whose entries are $X_{i,j} = X_{j,i} = \varepsilon_{i,j} b_{i,j}$ for $(i,j) \in V$, where the $\varepsilon_{i,j}$ are independent Rademacher random variables. Then it was shown in [2] that
$$\mathbb{E}[\|X\|_{S_\infty}] \le C\left(\sigma + \sigma_*\sqrt{\log d}\right), \tag{1.5}$$
where
$$\sigma = \max_i \sqrt{\sum_j b_{i,j}^2}, \qquad \sigma_* = \max_{i,j} |b_{i,j}|. \tag{1.6}$$
We generalize Eq. (1.5) to Markov chains with a stationary distribution. In particular, we obtain a similar bound, in terms of $\lambda = \|A - E_\pi\|_{L_2(\pi)\to L_2(\pi)}$, on the Schatten $\infty$-norm of a matrix whose entries are chosen in the following manner. We start by choosing an arbitrary permutation of the entries of the diagonal and upper triangular part of the matrix. We then fill in these entries according to the order given by the permutation, using the values given by the Markov chain. Finally, we fill in the entries of the lower triangular part so that the matrix is symmetric. The case $A = E_\pi$ corresponds to choosing the entries of the diagonal and upper triangular part independently, as in [2].
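The filling procedure just described admits a direct implementation; in the sketch below (ours), the order of the pairs and the scale matrix $B$ are arbitrary, and the i.i.d. signs stand in for the case $A = E_\pi$.

    # Fill the diagonal and upper triangle of a symmetric d x d matrix in an
    # arbitrary order with values f(Y_k) * b_{ij}, then mirror.
    import numpy as np

    def fill_matrix(B, f_vals, order):
        d = B.shape[0]
        X = np.zeros((d, d))
        for k, (i, j) in enumerate(order):
            X[i, j] = f_vals[k] * B[i, j]
            X[j, i] = X[i, j]                  # keep X symmetric
        return X

    rng = np.random.default_rng(2)
    d = 4
    B = np.ones((d, d))
    pairs = [(i, j) for i in range(d) for j in range(i, d)]
    order = [pairs[p] for p in rng.permutation(len(pairs))]
    f_vals = rng.choice([-1.0, 1.0], size=len(pairs))  # i.i.d. case, A = E_pi
    X = fill_matrix(B, f_vals, order)
    print(np.linalg.norm(X, 2))                # the Schatten infinity-norm of X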
Corollary 1.4. Let $\{Y_i\}_{i=1}^{\infty}$ be a stationary Markov chain with state space $[N]$, transition matrix $A$, stationary probability measure $\pi$, and averaging operator $E_\pi$, so that $Y_1$ is distributed according to $\pi$. Let $\lambda = \|A - E_\pi\|_{L_2(\pi)\to L_2(\pi)}$, let $f : [N] \to [-1, 1]$ be such that $\mathbb{E}[f(Y_i)] = 0$, and let $X$ be the random symmetric matrix filled in from the values $f(Y_1), f(Y_2), \ldots$, scaled by the corresponding entries of $B$, as described above. Then, for some absolute constant $C$,
$$\mathbb{E}[\|X\|_{S_\infty}] \le \frac{C}{\sqrt{1-\lambda}}\left(\sigma + \sigma_*\sqrt{\log d}\right),$$
where $\sigma$ and $\sigma_*$ are defined as in Eq. (1.6).

Related Work
In recent independent work by Fan, Jiang, and Sun [5], a Hoeffding bound for general Markov chains was also given. Their bound is sharper; in particular, the constant $64e$ can be replaced by $2$ after replacing $1-\lambda$ by $(1-\lambda)/(1+\lambda)$. However, our proof is arguably somewhat simpler.
In work by Garg

Preliminaries
Given vectors $v, \pi \in \mathbb{R}^N$ such that $\pi$ has positive entries (typically $\pi$ will be a distribution over $[N]$), we define
$$\|v\|_{L_p(\pi)} = \left(\sum_{i=1}^{N} \pi_i |v_i|^p\right)^{1/p}.$$
We define the inner product of two vectors $u, v \in \mathbb{R}^N$ with respect to a vector $\pi \in \mathbb{R}^N$ with positive entries to be
$$\langle u, v \rangle_\pi = \sum_{i=1}^{N} \pi_i u_i v_i.$$
Additionally, we let the operator norm of a matrix $A \in \mathbb{R}^{N\times N}$ be
$$\|A\|_{L_p(\pi)\to L_q(\pi)} = \max_{v \ne 0} \frac{\|Av\|_{L_q(\pi)}}{\|v\|_{L_p(\pi)}}.$$
We will write $\|\cdot\|_p$ in place of $\|\cdot\|_{L_p(\mathbf{1})}$, where $\mathbf{1}$ is the vector whose entries are all $1$. The Schatten $p$-norm of a matrix $A \in \mathbb{R}^{N\times N}$ is defined to be
$$\|A\|_{S_p} = \left(\sum_{i=1}^{N} \sigma_i(A)^p\right)^{1/p},$$
where $\sigma_1(A), \ldots, \sigma_N(A)$ are the singular values of $A$; in particular, $\|A\|_{S_\infty}$ is the largest singular value.
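For concreteness, here is a small sketch (ours) of these definitions in Python; the vectors and the matrix are arbitrary.

    # Weighted L_p(pi) norms and Schatten p-norms.
    import numpy as np

    def lp_pi_norm(v, pi, p):
        # ||v||_{L_p(pi)} = (sum_i pi_i |v_i|^p)^(1/p)
        return float(pi @ np.abs(v) ** p) ** (1 / p)

    def schatten_norm(A, p):
        # ||A||_{S_p} = l_p norm of the vector of singular values
        return np.linalg.norm(np.linalg.svd(A, compute_uv=False), p)

    pi = np.array([0.5, 0.3, 0.2])
    v = np.array([1.0, -2.0, 3.0])
    print(lp_pi_norm(v, pi, 2))
    print(schatten_norm(np.array([[1.0, 2.0], [3.0, 4.0]]), np.inf))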
Let $A$ be a stochastic matrix, and let $\pi$ be a stationary distribution for $A$. We let $(E_\pi)_{ij} = \pi_j$ be the averaging operator, viewed as a map from $L_\infty(\pi)$ to $L_\infty(\pi)$. Note that $E_\pi$ is also stochastic, and that $E_\pi A = AE_\pi = E_\pi^2 = E_\pi$. The following simple claim bounds $\|T\|_{L_2(\pi)\to L_2(\pi)}$ for a matrix $T$ in terms of $\|T\|_{L_1(\pi)\to L_1(\pi)}$ and $\|T\|_{L_\infty(\pi)\to L_\infty(\pi)}$. This can be viewed as a special case of interpolation of matrix norms.

Claim 2.1. For any matrix $T$,
$$\|T\|_{L_2(\pi)\to L_2(\pi)} \le \sqrt{\|T\|_{L_1(\pi)\to L_1(\pi)}\,\|T\|_{L_\infty(\pi)\to L_\infty(\pi)}}.$$

Proof. For all $x, \pi \in \mathbb{R}^n$ such that $\pi$ has positive entries,
$$\|Tx\|_{L_2(\pi)}^2 = \langle |Tx| \bullet |Tx|,\, \pi\rangle \le \left\langle \big(|T|(x \bullet x)\big) \bullet \big(|T|\mathbf{1}\big),\, \pi\right\rangle \le \|T\|_{L_\infty(\pi)\to L_\infty(\pi)}\,\|T\|_{L_1(\pi)\to L_1(\pi)}\,\|x\|_{L_2(\pi)}^2,$$
where the first inequality follows by Cauchy--Schwarz, $\bullet$ denotes the entrywise product, and $|T|$ and $|Tx|$ denote entrywise absolute values.
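Claim 2.1 is easy to test numerically; the sketch below (ours) uses the facts that, for these weighted norms, $\|T\|_{L_\infty(\pi)\to L_\infty(\pi)}$ is the largest absolute row sum of $T$, $\|T\|_{L_1(\pi)\to L_1(\pi)} = \max_j \sum_i \pi_i |T_{ij}| / \pi_j$, and $\|T\|_{L_2(\pi)\to L_2(\pi)}$ is the largest singular value of $D^{1/2}TD^{-1/2}$ with $D = \mathrm{diag}(\pi)$.

    # Numerical check of Claim 2.1 on random matrices and weights.
    import numpy as np

    rng = np.random.default_rng(3)
    for _ in range(100):
        N = 5
        T = rng.standard_normal((N, N))
        pi = rng.random(N) + 0.1
        pi /= pi.sum()
        D, Dinv = np.diag(np.sqrt(pi)), np.diag(1 / np.sqrt(pi))
        n2 = np.linalg.norm(D @ T @ Dinv, 2)       # L2(pi) -> L2(pi)
        ninf = np.abs(T).sum(axis=1).max()         # Linf(pi) -> Linf(pi)
        n1 = (np.abs(T).T @ pi / pi).max()         # L1(pi) -> L1(pi)
        assert n2 <= np.sqrt(n1 * ninf) + 1e-9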

Proof of Theorem 1.1
To prove Theorem 1.1, we follow the strategy of bounding the $q$th moment for some even integer $q$, and then using Markov's inequality to obtain a tail bound. We start by expanding $(f_1(Y_1) + \cdots + f_n(Y_n))^q$ into a sum of monomials.
The following lemma bounds the expectation of monomials in the $f_i(Y_i)$. The statement is similar to Lemma 3.3 in [17]; most of the proof is the same and is deferred to the appendix. Let $S_{q-1} \subset \{0,1\}^{q-1}$ be the set of strings $s$ with no two consecutive $0$'s and such that $s_1 = s_{q-1} = 1$.
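The set $S_{q-1}$ is easy to enumerate; the sketch below (ours) lists its elements and confirms the bound $|S_{q-1}| \le 2^q$ used in the proof of Lemma 3.2.

    # Enumerate binary strings with no two consecutive zeros whose first
    # and last bits are 1.
    from itertools import product

    def S(m):
        return [s for s in product((0, 1), repeat=m)
                if s[0] == 1 and s[-1] == 1
                and all(s[i] or s[i + 1] for i in range(m - 1))]

    for q in (4, 6, 8):
        print(q, len(S(q - 1)), 2 ** q)   # counts grow like Fibonacci numbers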
Lemma 3.2. Let $\{Y_i\}_{i=1}^{\infty}$ be a stationary Markov chain with state space $[N]$, transition matrix $A$, stationary probability measure $\pi$, and averaging operator $E_\pi$, so that $Y_1$ is distributed according to $\pi$. Let $\lambda = \|A - E_\pi\|_{L_2(\pi)\to L_2(\pi)}$ be less than $1$, and let $f_1, \ldots, f_n : [N] \to \mathbb{R}$ be such that $\mathbb{E}[f_i(Y_i)] = 0$ for all $i$ and $|f_i(v)| \le a_i$ for all $v \in [N]$ and all $i$. Then for even $q$,
$$\mathbb{E}\left[(f_1(Y_1) + \cdots + f_n(Y_n))^q\right] \le 4^q\,(q/2)!\left(\frac{1}{1-\lambda}\right)^{q/2}\left(a_1^2 + \cdots + a_n^2\right)^{q/2}.$$
Proof. Let $\sigma : [n]^q \to [n]^q$ be the function where $\sigma(w)$ is the list of coordinates of $w$ sorted in non-decreasing order. Then Lemma 3.1 yields an upper bound, Eq. (3.1), on $\mathbb{E}[(f_1(Y_1) + \cdots + f_n(Y_n))^q]$ as a sum over $w \in [n]^q$ and $s \in S_{q-1}$. Let $\binom{[q]}{q/2}$ denote the collection of subsets of $[q]$ of size exactly $q/2$. For each subset $I \in \binom{[q]}{q/2}$, let $W_I \subset [n]^q$ be the set of all vectors $w$ such that for each $j \in [n]$,
$$|\{i : i \in I \text{ and } w_i = j\}| = |\{i : i \in \{1, 3, 5, \ldots, q-1\} \text{ and } \sigma(w)_i = j\}|,$$
i.e., the multiset $\{w_i : i \in I\}$ is equal to the multiset $\{\sigma(w)_1, \sigma(w)_3, \sigma(w)_5, \ldots, \sigma(w)_{q-1}\}$. Let $w_I, w_{[q]\setminus I} \in [n]^{q/2}$ be the restrictions of $w$ to the coordinates in $I$ and $[q]\setminus I$, respectively. Additionally, for each $I \in \binom{[q]}{q/2}$ and $s \in S_{q-1}$, let $T_{I,s}$ be the $n^{q/2} \times n^{q/2}$ matrix defined as follows: for each $w \in [n]^q$, the entry in the $w_I$-th row and $w_{[q]\setminus I}$-th column of $T_{I,s}$ is the corresponding term of Eq. (3.1). Eq. (3.1) can then be bounded above by
$$\sum_{I \in \binom{[q]}{q/2}} \sum_{s \in S_{q-1}} \left(a^{\otimes q/2}\right)^{\top} T_{I,s}\, a^{\otimes q/2},$$
where $a^{\otimes q/2} \in \mathbb{R}^{n^{q/2}}$ is the vector such that $a^{\otimes q/2}_{i_1, \ldots, i_{q/2}} = a_{i_1} a_{i_2} \cdots a_{i_{q/2}}$ for $i \in [n]^{q/2}$, and thus $\|a^{\otimes q/2}\|_2 = \|a\|_2^{q/2}$. Both $|S_{q-1}|$ and $\binom{q}{q/2}$ are bounded above by $2^q$. Thus, by Claim 2.1, it is enough to show that
$$\|T_{I,s}\|_{1\to 1},\ \|T_{I,s}\|_{\infty\to\infty} \le (q/2)!\left(\frac{1}{1-\lambda}\right)^{q/2}.$$
We show this for $\|T_{I,s}\|_{\infty\to\infty}$; the proof for $\|T_{I,s}\|_{1\to 1}$ is similar.
Because the entries of $T_{I,s}$ are non-negative, $\|T_{I,s}\|_{\infty\to\infty}$ is just the largest row sum of $T_{I,s}$. Without loss of generality, assume that $I = \{1, 3, 5, \ldots, q-1\}$. Then the sum of the entries of the row corresponding to $w_I = (w_1, w_3, w_5, \ldots, w_{q-1})$, taken over all choices of $w_2, w_4, \ldots, w_q$ with $w \in W_I$, is at most $(q/2)!\,(1/(1-\lambda))^{q/2}$, as desired. The first inequality follows from the fact that $w \in W_I$ together with $w_1, w_3, w_5, \ldots, w_{q-1}$ determines $\sigma(w)_1, \sigma(w)_3, \sigma(w)_5, \ldots, \sigma(w)_{q-1}$ exactly, and that there are at most $(q/2)!$ possible orderings of $w_2, w_4, \ldots, w_q$. The second inequality follows from the definition of $S_{q-1}$, which implies that for every positive even integer $k \le q$, either $s_{k-1} = 1$ or $s_k = 1$, along with the formula for the sum of an infinite geometric series.
Finally, Theorem 1.1 follows by considering the moment generating function and applying Markov's inequality.
Proof of Theorem 1.1. If $\lambda \ge 1$ or if $u \le 8/\sqrt{1-\lambda}$, the theorem holds trivially, as the right-hand side is greater than $1$.
We note that it is possible to obtain stronger tail bounds that improve on the constant factor by optimizing some of the calculations above, but we will not do so here.

Extension to vector-valued random variables
To prove Theorem 1.3, we use the techniques of Talagrand's generic chaining. These techniques apply to random variables that satisfy the "increment condition," which we define below: a process $(Z_t)_{t\in T}$ on a metric space $(T, d)$ satisfies the increment condition if for all $s, t \in T$ and all $u \ge 0$,
$$\Pr[|Z_s - Z_t| \ge u\,d(s,t)] \le 2e^{-u^2/2}.$$
When $(Z_t)_{t\in T}$ is a Gaussian process, that is, $Z_t$ is Gaussian for all $t \in T$, we can equip $T$ with the canonical distance $d(s,t) = \mathbb{E}[(Z_s - Z_t)^2]^{1/2}$. Theorem 1.1 essentially states that for a Markov chain $\{Y_i\}_{i=1}^{\infty}$ and functions $f_1, \ldots, f_n$ as above, the process $Z_t = f_1(Y_1)t_1 + \cdots + f_n(Y_n)t_n$ for $t \in T \subseteq \mathbb{R}^n$ satisfies the increment condition when the associated distance is $\sqrt{32e/(1-\lambda)}$ times the Euclidean distance. We also define the $\gamma_2$ functional.
The $\gamma_2$ functional of a metric space $(T, d)$ is defined as
$$\gamma_2(T, d) = \inf \sup_{t \in T} \sum_{i=0}^{\infty} 2^{i/2}\, d(t, T_i),$$
where the infimum is taken over all sequences of subsets $T_0 \subseteq T_1 \subseteq \cdots \subseteq T$ such that $|T_0| = 1$ and $|T_i| \le 2^{2^i}$ for $i \ge 1$.
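Any single admissible sequence certifies an upper bound on $\gamma_2(T, d)$; the sketch below (ours) evaluates $\sup_t \sum_i 2^{i/2} d(t, T_i)$ for one ad hoc sequence of nested index sets, using the Euclidean distance and implicitly taking $T_i = T$ for all larger $i$ (which contributes nothing to the sum).

    # Evaluate the gamma_2 sum for one fixed admissible sequence.
    import numpy as np

    def gamma2_upper(T, seq):
        # T: (m, n) array of points; seq: index arrays with |T_0| = 1 and
        # |T_i| <= 2^(2^i); returns sup_t sum_i 2^{i/2} d(t, T_i).
        total = np.zeros(len(T))
        for i, idx in enumerate(seq):
            dists = np.linalg.norm(T[:, None, :] - T[idx][None, :, :], axis=2)
            total += 2 ** (i / 2) * dists.min(axis=1)
        return total.max()

    rng = np.random.default_rng(4)
    T = rng.standard_normal((32, 3))
    seq = [np.arange(1), np.arange(4), np.arange(16)]  # sizes 1, 4, 16
    print(gamma2_upper(T, seq))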
The majorizing measures theorem, due to Talagrand [19] (see also Theorem 2.4.1 in [20]), bounds the expected value of $\sup_{t\in T} Z_t$, where $(Z_t)_{t\in T}$ is a Gaussian process, in terms of $\gamma_2(T, d)$ for the canonical distance $d$. We state the theorem below.

Theorem 4.3. There is a universal constant $L$ such that for any Gaussian process $(Z_t)_{t\in T}$ with canonical distance $d$,
$$\frac{1}{L}\,\gamma_2(T, d) \le \mathbb{E}\left[\sup_{t\in T} Z_t\right] \le L\,\gamma_2(T, d).$$
We also use the following tail bound, which holds for any process that satisfies the increment condition and is given as Theorem 2.2.27 in [20].

Theorem 4.4. There is a universal constant $L$ such that if $(Z_t)_{t\in T}$ satisfies the increment condition with respect to a distance $d$, then for all $u \ge 0$,
$$\Pr\left[\sup_{s,t\in T} |Z_s - Z_t| \ge L\big(\gamma_2(T, d) + u\,\mathrm{diam}(T)\big)\right] \le Le^{-u^2}.$$

We now describe how to select $T$ in order to apply the above tools in the setting of Theorem 1.3. Let $(X, \|\cdot\|)$ be a Banach space, and let $(X^*, \|\cdot\|_*)$ be the dual space of $X$, with closed unit ball $B^*$. Recall that for $x \in X$,
$$\|x\| = \sup_{x^* \in B^*} x^*(x)$$
(see, for instance, Theorem 4.3 in [18]). For fixed $X_1, \ldots, X_n \in X$, let $T \subset \mathbb{R}^n$ be the set of points
$$T = \{(x^*(X_1), \ldots, x^*(X_n)) : x^* \in B^*\}.$$
Note that $T$ is symmetric, as for every $x^* \in B^*$ we also have $-x^* \in B^*$. It follows that $\|f_1(Y_1)X_1 + \cdots + f_n(Y_n)X_n\| = \sup_{t\in T} \langle f, t\rangle$, where $f = (f_1(Y_1), \ldots, f_n(Y_n))$. Additionally, consider the Gaussian process $(Z_t)_{t\in T}$ on the metric space $(T, d')$, where $Z_t = g_1t_1 + \cdots + g_nt_n$ for independent standard Gaussian variables $g_1, \ldots, g_n$ and $d'(s,t) = \mathbb{E}[(Z_s - Z_t)^2]^{1/2}$. Then by Theorem 4.3,
$$\gamma_2(T, d') \le L\,\mathbb{E}\left[\sup_{t\in T} Z_t\right] = L\,\mathbb{E}[\|g_1X_1 + \cdots + g_nX_n\|]. \tag{4.2}$$
The theorem then follows from Theorem 4.4, the observation that $\sup_{s,t} |Z_s - Z_t| = 2\sup_t Z_t$ as $T$ is symmetric, and Eq. (4.2).

Comparison to matrices with independent entries
We prove Corollary 1.4, which follows from a straightforward application of Theorem 1.3.
In order to apply Theorem 1.3, we need a bound on $\mathbb{E}[\|X\|_{S_\infty}]$ when $X$ is the random symmetric matrix whose entries are $X_{i,j} = X_{j,i} = g_{i,j} b_{i,j}$, where the $g_{i,j} \sim N(0,1)$ are independent standard Gaussian random variables (rather than Rademacher random variables, as in Eq. (1.5)). This setting was also discussed in [2], in which it was shown that
$$\mathbb{E}[\|X\|_{S_\infty}] \le C\left(\sigma + \sigma_*\sqrt{\log d}\right),$$
where $\sigma$ and $\sigma_*$ are defined as in Eq. (1.6).
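As a quick numerical companion (ours; dimensions and scales arbitrary), the sketch below draws the Gaussian model once and compares $\|X\|_{S_\infty}$ against $\sigma + \sigma_*\sqrt{\log d}$, the quantity appearing on the right-hand side up to the constant $C$.

    # One sample of X_ij = g_ij * b_ij versus sigma + sigma_* sqrt(log d).
    import numpy as np

    rng = np.random.default_rng(5)
    d = 200
    B = rng.random((d, d))
    B = (B + B.T) / 2                            # symmetric positive scales
    sigma = np.sqrt((B ** 2).sum(axis=1)).max()  # max_i sqrt(sum_j b_ij^2)
    sigma_star = np.abs(B).max()                 # max_ij |b_ij|
    iu = np.triu_indices(d)
    G = np.zeros((d, d))
    G[iu] = rng.standard_normal(len(iu[0]))      # independent upper triangle
    G = G + np.triu(G, 1).T                      # mirror below the diagonal
    X = G * B
    print(np.linalg.norm(X, 2), sigma + sigma_star * np.sqrt(np.log(d)))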
Finally, because $|f(v)| \le 1$ for all $v \in [N]$ and $B$ has positive entries, the entries of $X$ are dominated in absolute value by those of $B$, and it follows that $\|X\|_{S_\infty} \le \|B\|_{S_\infty}$ always holds.

A Appendix
In this section, we give the tools needed to prove Lemma 3.1. They are either taken directly from [17] (which is based on techniques used in [15]), or are straightforward adaptations.
The claim now follows by applying Claim A.2.