A martingale proof of Dobrushin's theorem for non-homogeneous Markov chains

In 1956, Dobrushin proved a definitive central limit theorem for non-homogeneous Markov chains. In this note, we give a different, shorter proof via martingale approximation which better elucidates the assumptions.


Introduction and Results
Nearly fifty years ago, R. Dobrushin proved in his thesis [2] a definitive central limit theorem (CLT) for Markov chains in discrete time that are not necessarily homogeneous in time. Previously, Markov, Bernstein, Sapagov, and Linnik, among others, had considered the central limit question under various sufficient conditions. Roughly, the progression of results relaxed the state space structure from 2 states to an arbitrary set of states, and also the level of asymptotic degeneracy allowed for the transition probabilities of the chain.
After Dobrushin's work, some refinements and extensions of his CLT, some under more stringent assumptions, were proved by Statulevicius [16] and Sarymsakov [13]. See also Hanen [6] in this regard. A corresponding invariance principle was also proved by Gudynas [4]. More general references on non-homogeneous Markov processes can be found in Isaacson and Madsen [7], Iosifescu [8], Iosifescu and Theodorescu [9], and Winkler [18].
We now define what is meant by "degeneracy." Although there are many measures of degeneracy, the one which turns out to be most useful to work with is the contraction coefficient. This coefficient appeared in early results concerning Markov chains; however, it was Dobrushin who, in his thesis, popularized its use and developed many of its important properties. [See Seneta [14] for some history.] Let π = π(x, dy) be a Markov transition probability on (X, B(X)). Define the contraction coefficient δ(π) of π as

δ(π) = sup_{x_1, x_2 ∈ X, A ∈ B(X)} |π(x_1, A) − π(x_2, A)|.

Also, define the related coefficient α(π) = 1 − δ(π). Clearly, 0 ≤ δ(π) ≤ 1, and δ(π) = 0 if and only if π(x, dy) is independent of x. It therefore makes sense to call π "non-degenerate" when δ(π) < 1. We use the standard convention and denote by µπ and πu the transformations induced by π on countably additive signed measures and bounded measurable functions respectively,

(µπ)(A) = ∫_X π(x, A) µ(dx)   and   (πu)(x) = ∫_X u(y) π(x, dy).

It is easy to see that δ(π) has the following properties: for transition probabilities π_1, π_2, signed measures µ_1, µ_2 with µ_1(X) = µ_2(X), and bounded measurable u,

(1.1) δ(π_1 π_2) ≤ δ(π_1) δ(π_2),
(1.2) ||µ_1 π − µ_2 π||_Var ≤ δ(π) ||µ_1 − µ_2||_Var,
(1.3) osc(πu) ≤ δ(π) osc(u), where osc(u) = sup u − inf u.
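As a quick numerical illustration (ours, not part of the original argument): for a finite stochastic matrix, δ is half the largest L1 distance between rows, and the sub-multiplicativity δ(π_1 π_2) ≤ δ(π_1) δ(π_2) can be checked directly. The matrices below are arbitrary choices.

```python
import numpy as np

def contraction_coefficient(P):
    """delta(P): largest total-variation distance between two rows,
    i.e. half the largest L1 distance between rows of P."""
    m = P.shape[0]
    return max(0.5 * np.abs(P[i] - P[j]).sum()
               for i in range(m) for j in range(m))

P1 = np.array([[0.9, 0.1],
               [0.2, 0.8]])
P2 = np.array([[0.5, 0.5],
               [0.3, 0.7]])

d1 = contraction_coefficient(P1)        # 0.7
d2 = contraction_coefficient(P2)        # 0.2
d12 = contraction_coefficient(P1 @ P2)  # 0.14, equals d1 * d2 here
assert d12 <= d1 * d2 + 1e-12           # sub-multiplicativity
print(d1, d2, d12)
```

Note that for these 2 × 2 matrices the sub-multiplicative bound happens to hold with equality.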
Consider now, for each n ≥ 1, a Markov chain {X_i^{(n)} : 1 ≤ i ≤ n} on X with transition probabilities π_i^{(n)} = π_i^{(n)}(x, dy), 1 ≤ i ≤ n − 1, and define

α_n = min_{1 ≤ i ≤ n−1} α(π_i^{(n)}).

In addition, let {f_i^{(n)} : 1 ≤ i ≤ n} be real-valued functions on X. Define, for n ≥ 1, the sum

S_n = Σ_{i=1}^n f_i^{(n)}(X_i^{(n)}).

Theorem 1.1 (Dobrushin) Suppose C_n = max_{1 ≤ i ≤ n} sup_x |f_i^{(n)}(x)| < ∞ and

(1.4) lim_{n→∞} C_n^2 / (α_n^2 V(S_n)) = 0.

Then we have, regardless of the initial distribution, that

(1.5) (S_n − E[S_n]) / V(S_n)^{1/2} ⇒ N(0, 1).

In general, the result is not true if condition (1.4) is not met.
In [2], Dobrushin also states the direct corollary which simplifies some of the assumptions.

Corollary 1.1 Suppose sup_n C_n ≤ C < ∞ and inf_n min_{1 ≤ i ≤ n} V(f_i^{(n)}(X_i^{(n)})) ≥ c > 0. Then the convergence (1.5) holds whenever

(1.6) lim_{n→∞} n^{1/3} α_n = ∞.

We remark that in [2] (e.g. Theorems 3, 8) there are also results where the boundedness condition on f_i^{(n)} is replaced by integrability conditions. As these results follow from truncation methods and from Theorem 1.1 for bounded variables, we consider Dobrushin's theorem only in the bounded case.
Also, for the ease of the reader, and to be complete, we discuss in the next section an example, given in [2] and attributed to Bernstein, of how the weak convergence (1.5) may fail when condition (1.4) is not satisfied.
We now consider Dobrushin's methods. The techniques used in [2] to prove the above results fall under the general heading of the "blocking method." The condition (1.4) ensures that well-separated blocks of observations may be approximated by independent versions with small error. Indeed, in many remarkable steps, Dobrushin exploits the Markov property and several contraction coefficient properties, which he himself derives, to deduce error bounds sufficient to apply CLT's for independent variables. However, in [2], it is difficult to see, even at the technical level, why condition (1.4) is natural.
The aim of this note is to provide a different, shorter proof of Theorem 1.1 which better explains why condition (1.4) appears in the result. The method is martingale approximation combined with martingale CLT's, tools which perhaps were not as codified in the early 1950's as they are today. These methods go back at least to Gordin [3] in the context of homogeneous processes, and have been used by others in related situations (e.g. Kifer [10]; see also Pinsky [12]). There are three main ingredients in this approximation with respect to the non-homogeneous setting of Theorem 1.1: (1) negligibility estimates for the individual components, (2) a law of large numbers for the conditional variances, and (3) lower bounds for the variance V(S_n). Negligibility bounds and a LLN are well-known requirements for martingale CLT's (cf. Hall-Heyde [5, ch. 3]), and in fact, as will be seen, the sufficiency of condition (1.4) is transparent in the proofs of these two components (Lemma 4.2, and Lemmas 4.3 and 4.4). The variance lower bound which we use was also derived by Dobrushin in his proof; however, using some martingale properties, we give a more direct argument for a better estimate.
We note also that, with this martingale approximation, an invariance principle for the partial sums follows from standard martingale results (Hall-Heyde [5], among others). In fact, from the martingale invariance principle, it should be possible to derive Gudynas's theorems [4], although this is not done here.
We now explain the structure of the article. In section 2, we give the Bernstein-Dobrushin example of a Markov chain with anomalous behavior. In section 3, we discuss needed properties of the contraction coefficient. In section 4, we state the martingale CLT that will be utilized, and, as a preview of the non-homogeneous chain proof, we quickly reprise the argument with respect to homogeneous chains. In section 5, we prove Theorem 1.1 with martingale approximation assuming a lower bound on the variance V (S n ). And last, in section 6, we prove this variance estimate.

Anomalous Example
Here, we summarize the example in Dobrushin's thesis, attributed to Bernstein, which shows that condition (1.4) is sharp.
Let X = {1, 2}, and consider the 2 × 2 transition matrices on X,

Q(p) = ( 1 − p   p ; p   1 − p ),   0 ≤ p ≤ 1.

The invariant measure is the same for all the Q(p), namely p(1) = p(2) = 1/2. We will be looking at Q(p) for p close to 0 or 1, and at the special case p = 1/2. When p is small, the homogeneous chains behave very differently under Q(p) and Q(1 − p): under Q(p) there are very few switches between the two states, whereas under Q(1 − p) the chain switches at almost every step. In fact, this behavior can be made more precise (see Dobrushin [1], Hanen [6], or direct computation). Let T_n = Σ_{i=1}^n 1_{{1}}(X_i) count the number of visits to state 1 in n steps.

Case A. Consider the homogeneous chain under Q(p) with p = 1/n and initial distribution p(1) = p(2) = 1/2. Then T_n/n converges in distribution to a non-degenerate law on [0, 1], and in particular

(2.1) V(T_n) is of order n^2.

Case B. Consider the homogeneous chain run under Q(p) with p = 1 − 1/n and initial distribution p(1) = p(2) = 1/2. Then T_n − n/2 converges in distribution to F, where F is a proper distribution function, and

(2.2) V(T_n) is of order 1.
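The two regimes can be verified exactly for moderate n by dynamic programming over (state, count) pairs; the following sketch (our illustration, not part of [2]) computes the exact law of T_n and its mean and variance in Cases A and B.

```python
import numpy as np

def occupation_moments(p, n):
    """Exact mean and variance of T_n = #{ i <= n : X_i = 1 } for the
    homogeneous chain Q(p) (switching probability p), uniform start."""
    Q = np.array([[1 - p, p],
                  [p, 1 - p]])
    # dist[s, k] = P(current state index s, visits to state 1 so far = k);
    # index 0 stands for state 1, index 1 for state 2.
    dist = np.zeros((2, n + 1))
    dist[0, 1] = 0.5
    dist[1, 0] = 0.5
    for _ in range(n - 1):
        new = np.zeros_like(dist)
        # moving into state 1 increments the count
        new[0, 1:] = Q[0, 0] * dist[0, :-1] + Q[1, 0] * dist[1, :-1]
        new[1, :] = Q[0, 1] * dist[0, :] + Q[1, 1] * dist[1, :]
        dist = new
    pk = dist.sum(axis=0)                  # exact law of T_n
    k = np.arange(n + 1)
    mean = (pk * k).sum()
    return mean, (pk * (k - mean) ** 2).sum()

n = 400
mA, vA = occupation_moments(1.0 / n, n)        # Case A: rare switches
mB, vB = occupation_moments(1.0 - 1.0 / n, n)  # Case B: near-alternation
print(mA, vA)   # mean n/2; variance of order n^2
print(mB, vB)   # mean n/2; variance of order 1
```

In Case A the variance grows quadratically in n, reflecting the non-degenerate limit of T_n/n, while in Case B it remains bounded.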
Let a sequence α_n → 0 with α_n ≥ n^{−1/3} be given. To construct the anomalous Markov chain, it will be helpful to split the time horizon {1, 2, …, n} into roughly m_n ≈ n α_n blocks of size α_n^{−1}. We interpose a Q(1/2) transition between any two consecutive blocks; as δ(Q(1/2)) = 0, this has the effect of making the blocks independent of each other. Within the first block the transitions are Q(α_n), and within every later block they are Q(1 − α_n); note that α(Q(α_n)) = α(Q(1 − α_n)) = 2α_n. More precisely, let k_1^{(n)} < k_2^{(n)} < ⋯ denote the block endpoints, and consider the non-homogeneous chain with respect to these transitions {π_i^{(n)}}, started from the invariant initial distribution p(1) = p(2) = 1/2. From the definition of the chain, one observes, as Q(1/2) does not distinguish between the states, that the processes in the different blocks are independent. Let T^{(n)} = T^{(n)}(1, n) and T^{(n)}(k, l) denote the counts of visits to state 1 in the first n steps and in steps k to l respectively. It follows from the discussion of independence above that T^{(n)}(k_1^{(n)} + 1, n) is the sum of independent sub-counts where, additionally, the sub-counts for 1 ≤ i ≤ m_n − 1 are identically distributed, the last sub-count perhaps being shorter. Also, as the initial distribution is invariant, we have V(1_{{1}}(X_i^{(n)})) = 1/4 for all i and n. Then, in the notation of Corollary 1.1, C = 1 and c = 1/4. From (2.1), scaled to the block length α_n^{−1}, we have that V(T^{(n)}(1, k_1^{(n)})) is of order α_n^{−2}. Also, from (2.2) and the independence of the m_n sub-counts, we have that V(T^{(n)}(k_1^{(n)} + 1, n)) is of order n α_n. From these calculations, we see that if n^{1/3} α_n → ∞, then α_n^{−2} << n α_n, and so the major contribution to T^{(n)} is from T^{(n)}(k_1^{(n)} + 1, n). However, since this last count is (virtually) the sum of m_n i.i.d. sub-counts, we have that T^{(n)}, properly normalized, converges to N(0, 1), as predicted by Dobrushin's Theorem 1.1.
On the other hand, if α_n = n^{−1/3}, then α_n^{−2} = n α_n, and the count T^{(n)}(1, k_1^{(n)}), independent of T^{(n)}(k_1^{(n)} + 1, n), also contributes to the sum T^{(n)}. After appropriate scaling, then, T^{(n)} approaches the convolution of a non-trivial non-normal distribution and a normal distribution, and therefore is certainly not Gaussian.

Martingale CLT
The central limit theorem for martingale differences is by now a standard tool. We quote the following (strong) form of the result implied by Corollary 3.1 in Hall and Heyde [5].

Proposition 3.1 For each n ≥ 1, let {W_{n,j} : 1 ≤ j ≤ n} be a square-integrable martingale difference array with respect to σ-fields {F_{n,j}}, nested in the sense F_{n,j} ⊆ F_{n+1,j}. Suppose that, as n → ∞,

max_{1 ≤ j ≤ n} |W_{n,j}| → 0 in probability,   Σ_{j=1}^n W_{n,j}^2 → 1 in probability,

and that E[max_{1 ≤ j ≤ n} W_{n,j}^2] is bounded uniformly in n. Then Σ_{j=1}^n W_{n,j} ⇒ N(0, 1).
Note that the first and second limit conditions are, respectively, the negligibility assumption on the array and the law of large numbers for the conditional variances mentioned in the introduction. We now sketch a proof of Corollary 1.1 in the case of a homogeneous Markov chain on a finite state space. Assume that we have a Markov chain with transition probability P on a finite state space X with δ(P) < 1. If f : X → R is a function with mean 0 with respect to the invariant distribution π on X, then f is in the range of I − P, and the equation (I − P)u = f has a solution. The following argument is implicit in Gordin [3], and is also explicitly used in Kipnis and Varadhan [11].
Using the relation E[u(X_{j+1}) | F_j] = (P u)(X_j), where F_j = σ(X_1, …, X_j), it is easy to check that

W_{j+1} = u(X_{j+1}) − (P u)(X_j)

is a martingale difference. Then, writing f = u − P u and telescoping,

S_n = Σ_{j=1}^n f(X_j) = M_n + u(X_1) − (P u)(X_n),   where M_n = Σ_{j=2}^n W_j.

We will apply the martingale CLT (Proposition 3.1) to the array formed from the W_j. The conditional variances sum to

Σ_{j=2}^n E[W_j^2 | F_{j−1}] = Σ_{j=1}^{n−1} q(X_j),   with q(x) = (P u^2)(x) − ((P u)(x))^2.

So, by the ergodic theorem, 1/n times the last expression converges almost surely to V_0 = E_π[q(X_0)] < ∞. It is not difficult to see that V_0 > 0. Therefore, V(M_n) ∼ n V_0 and (n V_0)^{−1/2} M_n ⇒ N(0, 1) by Proposition 3.1. Since the difference S_n − M_n = u(X_1) − (P u)(X_n) is uniformly bounded, we have V(S_n) ∼ n V_0 and S_n / V(S_n)^{1/2} ⇒ N(0, 1) also.
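The homogeneous sketch above is easy to carry out numerically. In the following illustration (the 3-state matrix P and function f are our own arbitrary choices), we solve the Poisson equation (I − P)u = f and compare V_0 = E_π[P u^2 − (P u)^2] with the limiting variance rate lim V(S_n)/n, computed directly from the stationary autocovariance series.

```python
import numpy as np

P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

# invariant distribution pi: left eigenvector of P for eigenvalue 1
w, V = np.linalg.eig(P.T)
pi = np.real(V[:, np.argmin(np.abs(w - 1.0))])
pi = pi / pi.sum()

f = np.array([1.0, -2.0, 0.5])
f = f - pi @ f                          # center: mean 0 under pi

# Poisson equation (I - P) u = f; solvable since pi @ f = 0
u = np.linalg.lstsq(np.eye(3) - P, f, rcond=None)[0]
assert np.allclose((np.eye(3) - P) @ u, f)

# limiting variance rate V_0 = E_pi[q], with q = P(u^2) - (P u)^2
q = P @ u**2 - (P @ u) ** 2
V0 = pi @ q

# cross-check: V_0 = Var_pi(f) + 2 * sum_{k>=1} Cov_pi(f(X_0), f(X_k));
# the series converges geometrically since delta(P) < 1
sigma2 = pi @ f**2 + 2 * sum(pi @ (f * (np.linalg.matrix_power(P, k) @ f))
                             for k in range(1, 200))
print(V0, sigma2)
assert np.isclose(V0, sigma2)
```

The agreement of the two computations reflects the identity V_0 = E_π[f u] + E_π[f P u] obtained by expanding u = Σ_{k≥0} P^k f.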

Proof of Theorem 1.1
We give here a short proof of Theorem 1.1 through the martingale approximation illustrated for homogeneous chains in the previous section. Consider the non-homogeneous setting of Theorem 1.1. To follow the homogeneous argument, we will need to find the non-homogeneous analogue of the resolvent function "u = (I − P)^{−1} f." To simplify notation, we will assume throughout that the functions {f_i^{(n)}} are centered, E[f_i^{(n)}(X_i^{(n)})] = 0 for 1 ≤ i ≤ n; this costs no generality, as S_n − E[S_n] is unchanged by centering. Define, for 1 ≤ k ≤ n,

u_k^{(n)} = Σ_{l=k}^{n} π_k^{(n)} π_{k+1}^{(n)} ⋯ π_{l−1}^{(n)} f_l^{(n)},

where the empty product of transition operators is understood as the identity for l = k; in particular, u_k^{(n)} = f_k^{(n)} for k = n.
Note that u_k^{(n)} = f_k^{(n)} + π_k^{(n)} u_{k+1}^{(n)} for k ≤ n − 1. Then, with

ξ_k^{(n)} = u_k^{(n)}(X_k^{(n)}) − (π_{k−1}^{(n)} u_k^{(n)})(X_{k−1}^{(n)}),   2 ≤ k ≤ n,

we have the decomposition

S_n = u_1^{(n)}(X_1^{(n)}) + M_n^{(n)},

and the martingale M_k^{(n)} = Σ_{l=2}^{k} ξ_l^{(n)} with respect to F_k^{(n)} = σ{X_l^{(n)} : 1 ≤ l ≤ k} for n ≥ 1. The plan to obtain Theorem 1.1 will now be to approximate S_n by M_n^{(n)} and to apply Proposition 3.1.

Lemma 4.1 For 1 ≤ k ≤ n, we have osc(u_k^{(n)}) ≤ 2 C_n α_n^{−1}, and consequently sup_{2 ≤ k ≤ n} |ξ_k^{(n)}| ≤ 4 C_n α_n^{−1}.

Proof. Since osc(f_l^{(n)}) ≤ 2 C_n, the oscillation property of δ gives osc(π_k^{(n)} ⋯ π_{l−1}^{(n)} f_l^{(n)}) ≤ 2 C_n (1 − α_n)^{l−k}, and summing the geometric series yields the first estimate. The second estimate now follows from this estimate. Indeed, both u_k^{(n)}(X_k^{(n)}) and (π_{k−1}^{(n)} u_k^{(n)})(X_{k−1}^{(n)}) lie within osc(u_k^{(n)}) of 0, since E[u_k^{(n)}(X_k^{(n)})] = 0.

We now state a lower bound for the variance, which will be proved in the next section using martingale ideas.

Proposition 4.1 There is a universal constant A > 0 such that V(S_n) ≥ A α_n Σ_{i=1}^{n} V(f_i^{(n)}(X_i^{(n)})).

We remark that in [2] a bound of the same form in terms of α_n and the individual variances V(f_i^{(n)}(X_i^{(n)})) is found by different methods (see also section 1.2.2 of [9]).
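The construction of u_k^{(n)} and the oscillation bound of Lemma 4.1 can be tested on a random instance (all sizes, transitions, and functions below are our own toy choices): u_k is built by the backward recursion u_n = f_n, u_k = f_k + π_k u_{k+1}, and one checks osc(u_k) ≤ 2 C_n / α_n.

```python
import numpy as np

rng = np.random.default_rng(1)
n, S = 40, 4                                # toy horizon and state count

def delta(P):
    m = P.shape[0]
    return max(0.5 * np.abs(P[i] - P[j]).sum()
               for i in range(m) for j in range(m))

# random transitions with entries bounded away from 0, so alpha_n > 0
pis = []
for _ in range(n - 1):
    M = rng.random((S, S)) + 0.05
    pis.append(M / M.sum(axis=1, keepdims=True))
fs = [rng.uniform(-1.0, 1.0, S) for _ in range(n)]   # so C_n = 1
alpha_n = min(1.0 - delta(P) for P in pis)

# backward recursion: u_n = f_n, u_k = f_k + pi_k u_{k+1}
u = [None] * n
u[n - 1] = fs[n - 1].copy()
for k in range(n - 2, -1, -1):
    u[k] = fs[k] + pis[k] @ u[k + 1]

max_osc = max(v.max() - v.min() for v in u)
print(max_osc, 2.0 / alpha_n)   # Lemma 4.1: osc(u_k) <= 2 C_n / alpha_n
assert max_osc <= 2.0 / alpha_n + 1e-9
```

The bound holds for every realization, since centering the f's does not change any oscillation.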
The next estimate shows that the asymptotics of S_n / V(S_n)^{1/2} depend only on the martingale approximant M_n^{(n)}, and that the differences ξ_k^{(n)} are uniformly negligible.

Lemma 4.2 As n → ∞,

(S_n − M_n^{(n)}) / V(S_n)^{1/2} → 0   and   max_{2 ≤ k ≤ n} |ξ_k^{(n)}| / V(S_n)^{1/2} → 0.

Proof. By the decomposition above, S_n − M_n^{(n)} = u_1^{(n)}(X_1^{(n)}), and by Lemma 4.1 both quantities are bounded by 4 C_n α_n^{−1} V(S_n)^{−1/2}. The lemma follows now from (1.4).
We now turn to showing the LLN part of Proposition 3.1 for the array {M_k^{(n)}}, namely that V(S_n)^{−1} Σ_{k=2}^{n} E[(ξ_k^{(n)})^2 | F_{k−1}^{(n)}] → 1 in probability. The sum of conditional variances is split into two sums. The first sum on the right-hand side is bounded using non-negativity; for the second sum, the oscillation assumption supplies the required estimate. Putting together these statements, we obtain the lemma.
To apply this result to our situation, we need the following oscillation estimate.
Here we rewrite ξ_j^{(n)} with (4.4) in the third line, and use Lemma 4.1 in the last line. It therefore remains to consider the oscillations of the conditional variances. From Lemma 4.1, we have, for j ≤ m, that the oscillations of (4.6) are bounded by 16 C_n^2 α_n^{−2}, uniformly in l. Hence, using Proposition 4.1, we obtain the law of large numbers by (1.4).

Proof of Variance Lower Bound
In this section, we prove Proposition 4.1.
Lemma 5.1 Let f and g be measurable functions on (X, B(X)), square integrable and of mean 0 with respect to α and β respectively, where λ is a probability measure on X × X with marginals α and β. Let π(x_1, dx_2) and π̄(x_2, dx_1) be the transition probabilities in the two directions, so that

λ(dx_1, dx_2) = α(dx_1) π(x_1, dx_2) = β(dx_2) π̄(x_2, dx_1).

Then

E_λ[f(x_1) g(x_2)] ≤ δ(π)^{1/2} ||f||_{L^2(α)} ||g||_{L^2(β)}.

Proof. Let us construct a measure on X × X × X by starting with λ on X × X and using the reversed kernel π̄(x_2, dx_3) to go from x_2 to x_3. The transition probability from x_1 to x_3 defined by

Q(x_1, dx_3) = ∫_X π(x_1, dx_2) π̄(x_2, dx_3)

satisfies δ(Q) ≤ δ(π). Moreover, αQ = α, and the operator Q is self-adjoint and bounded with norm 1 on L^2(α). Then, if f is a bounded function with ∫ f(x) α(dx) = 0 (and so E_α[Q^n f] = 0), we have for n ≥ 1,

(5.1) ||Q^n f||_{L^2(α)} ≤ osc(Q^n f) ≤ δ(Q)^n osc(f).

Hence, as bounded functions are dense in the subspace M = {f ∈ L^2(α) : ∫ f(x) α(dx) = 0}, the top of the spectrum of Q on M is at most δ(Q), and so ||Q||_{L^2(α), M} ≤ δ(Q). Indeed, suppose the spectral radius of Q on M were larger than δ(Q) + ε for some ε > 0, and let f ∈ M be a non-trivial bounded function whose spectral decomposition is with respect to spectral values larger than δ(Q) + ε. Then ||Q^n f||_{L^2(α)} ≥ ||f||_{L^2(α)} (δ(Q) + ε)^n, which contradicts the bound (5.1) as n ↑ ∞. [Cf. Thm. 2.10 [15] for a proof in discrete space settings.] Then, as π̄ is the adjoint of π between L^2(β) and L^2(α),

||π̄ f||^2_{L^2(β)} = ⟨Q f, f⟩_{L^2(α)} ≤ δ(Q) ||f||^2_{L^2(α)} ≤ δ(π) ||f||^2_{L^2(α)},

and therefore E_λ[f(x_1) g(x_2)] = ⟨π̄ f, g⟩_{L^2(β)} ≤ ||π̄ f||_{L^2(β)} ||g||_{L^2(β)} ≤ δ(π)^{1/2} ||f||_{L^2(α)} ||g||_{L^2(β)}.
Lemma 5.2 Let f(x_1) and g(x_2) be square integrable with respect to α and β respectively. Then

E_λ[(f(x_1) + g(x_2) − E_λ[f + g])^2] ≥ α(π) V_α(f),

as well as the same bound with α(π) V_β(g) on the right-hand side.

Proof. We can assume without loss of generality that f and g have mean 0 with respect to α and β respectively. Then, by Lemma 5.1 and the inequality 2ab ≤ a^2 + b^2,

E_λ[(f + g)^2] = ||f||^2_{L^2(α)} + ||g||^2_{L^2(β)} + 2 E_λ[f g]
 ≥ ||f||^2_{L^2(α)} + ||g||^2_{L^2(β)} − 2 δ(π)^{1/2} ||f||_{L^2(α)} ||g||_{L^2(β)}
 ≥ (1 − δ(π)) ||f||^2_{L^2(α)} = α(π) ||f||^2_{L^2(α)}.

The proof of the second half is identical.
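Both lemmas of this section can be checked numerically in a discrete setting (the kernel, marginal, and functions below are arbitrary choices of ours): with λ(dx_1, dx_2) = α(dx_1) π(x_1, dx_2), we verify a correlation bound of the form |E_λ[f g]| ≤ δ(π)^{1/2} ||f|| ||g|| for centered f, g, and the resulting variance lower bound E_λ[(f + g)^2] ≥ α(π) ||f||^2.

```python
import numpy as np

rng = np.random.default_rng(7)
S = 5

P = rng.random((S, S)) + 0.1
P = P / P.sum(axis=1, keepdims=True)          # the kernel pi(x1, dx2)
a = rng.random(S) + 0.1
a = a / a.sum()                               # first marginal alpha
b = a @ P                                     # second marginal beta

d = max(0.5 * np.abs(P[i] - P[j]).sum()
        for i in range(S) for j in range(S))  # delta(pi)

f = rng.uniform(-1.0, 1.0, S)
f = f - a @ f                                 # mean 0 under alpha
g = rng.uniform(-1.0, 1.0, S)
g = g - b @ g                                 # mean 0 under beta

lam = a[:, None] * P                          # joint law lambda(x1, x2)
cross = float((lam * np.outer(f, g)).sum())   # E_lambda[f(x1) g(x2)]
nf = float(np.sqrt(a @ f**2))
ng = float(np.sqrt(b @ g**2))

# correlation bound: |E[f g]| <= sqrt(delta(pi)) ||f|| ||g||
assert abs(cross) <= np.sqrt(d) * nf * ng + 1e-12

# variance lower bound: E[(f + g)^2] >= (1 - delta(pi)) ||f||^2
second = float((lam * (f[:, None] + g[None, :]) ** 2).sum())
assert second >= (1.0 - d) * nf**2 - 1e-12
print(d, cross, second)
```

By the symmetry of Lemma 5.2, the same lower bound also holds with ||g||^2 in place of ||f||^2.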
On the other hand, from (4.2), for 1 ≤ k ≤ n − 1, Lemma 5.2 applies to the pair (X_k^{(n)}, X_{k+1}^{(n)}) with joint law λ and transition π_k^{(n)}, yielding a lower bound of order α(π_k^{(n)}) ≥ α_n on the corresponding variance term. Summing over k, and noting the variance decomposition near (4.3), Proposition 4.1 follows.