Orthogonality and probability: mixing times

We produce the first example of bounding total variation distance to stationarity and estimating mixing times via orthogonal polynomials diagonalization of discrete reversible Markov chains, the Karlin-McGregor approach.


Introduction
If P is a reversible Markov chain over a sample space Ω, and π is a reversibility function (not necessarily a probability distribution), then P is a self-adjoint operator in ℓ 2 (π), the space generated by the inner product < f, g > π = x∈S f (x)g(x)π(x) induced by π. If P is tridiagonal operator (i.e. a nearest-neighbor random walk) on Ω = {0, 1, 2, . . . }, then it must have a simple spectrum, and is diagonalizable via orthogonal polynomials as it was studied in the 50's and 60's by Karlin and McGregor, see [2], [8]. There the extended eigenfuctions Q j (λ) (Q 0 ≡ 1) are orthogonal polynomials with respect to a probability measure ψ and where π j (π 0 = 1) is the reversibility measure of P .
In this paper we are testing a possibility of calculating mixing rates using Karlin-McGregor diagonalization with orthogonal polynomials. In order to measure the rate of convergence to a stationary distribution, the following distance is used. Definition 1. If µ and ν are two probability distributions over a sample space Ω, then the total variation distance is Observe that the total variation distance measures the coincidence between the distributions on a scale from zero to one.
If ρ = ∞ k=0 π k < ∞, then ν = 1 ρ π is the stationary probability distribution. If in addition, the aperiodic nearest neighbor Markov chain originates at site i, then the total variation distance between the distribution µ t = µ 0 P t and ν is given by as measure ψ contains a point mass of weight 1 ρ at 1, see [3]. The rates of convergence are quantified via mixing times. In the case of a Markov chain over an infinite state space with a unique stationary distribution, the notion of a mixing time depends on the state of origination of the chain.
Definition 2. Suppose P is a Markov chain with a stationary probability distribution ν that commences at X 0 = i. Given an ǫ > 0, the mixing time t mix (ǫ) is defined as In the case of a nearest-neighbor process on Ω = {0, 1, 2, . . . } commencing at i, the corresponding mixing time has the following simple expression in orthogonal polynomials Observe that the above expression is simplified when i = 0. Here we concentrate on calculating mixing times for simple positive recurrent nearest-neighbor Markov chains over Ω, originating from i = 0. Our main result concerns the distance to stationarity for a simple random walk with a drift. In the main theorem and its corollary we will explore the following Markov chain Theorem 1. Suppose the above Markov chain begins at the origin, i = 0. Consider the orthogonal polynomials Q n for the chain. Then the integral (−1,1) λ t Q n (λ)dψ(λ) can be expressed as " dz and the total variation distance ν − µ t T V is bounded above by " . Therefore, taking ε ↓ 0, the mixing time where m(p, q) = max (r + 2 √ pq), q q+r .
Observe that in the above complex integral all three finite poles are located inside the unit circle. Thus we only need to consider a pole at infinity.
The proof is provided in section 3. The result in Theorem 1 is the first instance the Karlin-McGregor orthogonal polynomials approach is used to estimate mixing rates. As it was suggested in [3] we would like the approach to work for a larger class of reversible Markov chains over an infinite state space with a unique stationary distribution. There is an immediate corollary (see section 3): for t large enough, i.e. we have a lower bound of matching order.
Observe that one can easily adjust these results for any origination site X 0 = i. In the next section we will compare the above Karlin-McGregor approach to some of the classical techniques for estimating the "distance to stationarity" ν − µ t T V .

Comparison to the other techniques
For the case of geometrically ergodic Markov chains, there are several techniques that produce an upper bound on the distance to stationarity that were developed specifically for the cases when the sample space is large, but finite. These methods are not directly applicable to chains on general state spaces. The coupling method stands out as the most universal. Here we compare the geometric rate in Theorem 1 to the one obtained via a classical coupling argument. Then we explain why other geometric ergodicity methods based on renewal theory will not do better than coupling. See [6] and [4] for detailed overview of geometric convergence and coupling.

Geometric convergence via coupling
Consider a coupling process (X t , Y t ), where X 0 = 0 as in Theorem 1, while Y 0 is distributed according to the stationary distribution ν = 1 ρ π. A classical Markovian coupling construction allows X t and Y t evolve independently until the coupling time τ coupling = min{t : X t = Y t }. It is natural to compare P (τ coupling > t) to P (τ > t), where τ = min{t : Y t = 0} is a hitting time, as the chain is positive recurrent. Now, simple combinatorics implies, for k ≥ n, where '≤' appears because π 0 = 1 < 1 p , but it does not change the asymptotic rate of convergence, i.e. we could write '≈' instead of '≤'. The right hand side can be rewritten as 1 Recall that in the Corollary, if q is sufficiently larger than r and p, then q q+r t dominates (r + 2 √ pq) t , and the total variation distance where A and B are given in Theorem 1 of this paper. Thus we need to explain why, when q is sufficiently large, in the equation (1), we fail to notice the dominating term of q q+r t . In order to understand why, observe that the second largest eigenvalue − q q+r originates from the difference between τ coupling and τ . In fact, Y t can reach state zero without ever sharing a site with X t (they will cross each other, of course). Consider the case when p is either zero, or close to zero. There, the problem reduces to essentially coupling a two state Markov chain with transition probabilities p(0, 1) = 1 and p(1, 0) = q q+r . Thus the coupling time will be expressed via a geometric random variable with the failure probability of q q+r . Of course, one could make the Markov chain P "lazier" by increasing r at the expense of p and q, while keeping proportion q p fixed, i.e. we can consider P ε = 1 1+ε (P + εI). This will minimize the chance of X t and Y t missing each other, but this also means increasing (r + 2 √ pq), and slowing down the rate of convergence in (1).
In order to obtain the correct exponents of convergence, we need to redo the coupling rules as follows. We now let the movements of X t and Y t be synchronized whenever both are not at zero (i.e. {X t , Y t } ∩ {0} = ∅), while letting X t and Y t move independently when one of them is at zero, and the other is not. Then at the hitting time τ , either X t = Y t = 0 and the processes are successfully coupled, or X t = 1 and Y t = 0. In the latter case we are back to the geometric variable with the failure probability of q q+r . That is, the only way for X t and Y t to couple would be if one of the two is at state 0 and the other is at state 1. Using the set theory notations, if with probability r q+r , {0, 1} with probability q q+r , When q q+r > r+2 √ pq, the above modified coupling captures the order q q+r t . The coefficient A however is much harder to estimate using the coupling approach, while it is immediately provided in Theorem 1 and its corollary. Take for example p = 1 11 , r = 1 11 and q = 9 11 . There q q+r > r + 2 √ pq, and according to the Corollary, the lower

Drift, minorization and geometric ergodicity
The optimal "energy function" V (x) = q p x/2 converts the geometric drift inequality in Meyn and Tweedie [6] Chapter 15 into equality thus confirming the geometric convergence rate of (r + 2 √ pq) t for the tail probability P (τ C > t), where C = {0} is the obvious choice for the "small set", and τ C is the hitting time. Once again all the challenge is at the origin. In fact there is only a trivial "minorization condition" when C = {0}. The minorization condition reads where, if C = {0}, the only choice for the probability measure Q is Q = δ 1 , and ǫ = 1. With ǫ = 1 the split of the Markov chain is trivial, and as far as the corresponding coupling goes, the only issue would be (as we mentioned before) to compute the tail of the hitting time min{t : (X t , Y t ) ∈ C ×C} when q is large. If C = {0, 1, . . . , k} for some k > 0, there is no minorization condition. In the latter case, estimating the hitting time min{t : (X t , Y t ) ∈ C × C} is straightforward, but without minorization, this will not be enough to estimate the tail for the coupling time. The "splitting technique" will not work, rather a coupling approach of the preceding subsection to be pursued. The case of recurrent reflecting random walk (the M/M/1 queue) had been considered as one of the four benchmark examples in the geometric ergodicity theory (see [1] and references therein). There, in the absence of the second largest eigenvalue of − q q+r , with r = 0, the rate of (2 √ pq) t was proven to be the optimal (see [5]). The methods in the theory of geometric convergence are in the most part based on the renewal theory (see [6], [1], [7] and references therein), and concentrate more on the tail probability for the hitting time τ C in the splitting method. As for the Markov chain P considered in this paper, in the absence of a useful splitting of it, the approach that works is the coupling. While the coupling provides the right exponents, it does not necessarily produce the tight coefficients.

The proof of Theorem 1
Proof. Since we require r > 0 for aperiodicity, we will need to obtain the spectral measure ψ via an argument similar to that of Karlin and McGregor in [2], where the case of r = 0 was solved. The orthogonal polynomials are obtained via solving a simple linear recursion: are the roots of the characteristic equation for the recursion, and c 1 = ρ 2 −λ ρ 2 −ρ 1 and c 2 = λ−ρ 1 ρ 2 −ρ 1 . Now π 0 = 1, π n = p n−1 q n (n ≥ 1) and ρ = q−p+1 q−p . Also, we observe that and ρ 1 ρ 2 = q p . The above will help us to identify the point mass locations in the measure ψ since each point mass in ψ occurs when k π k Q 2 k (λ) < ∞. Thus we need to find all λ ∈ (r + 2 √ pq, 1] such that c 1 (λ) = 0 and all λ ∈ [−1, r − 2 √ pq) such that c 2 (λ) = 0. There are two roots, λ = 1 and λ = − q q+r .
The only other point mass is at λ = − q q+r . One can verify that ρ 1 − q q+r = − q q+r and Q k − q q+r = − q q+r k , and therefore is the reciprocal of the point mass at λ = − q q+r .