Occupation laws for some time-nonhomogeneous Markov chains

We consider finite-state time-nonhomogeneous Markov chains where the probability of moving from state $i$ to state $j\neq i$ at time $n$ is $G(i,j)/n^\zeta$ for a ``generator'' matrix $G$ and strength parameter $\zeta>0$. In these chains, as time grows, the positions are less and less likely to change, and so they form simple models of age-dependent time-reinforcing behaviors. These chains, however, exhibit some different, perhaps unexpected, asymptotic occupation laws depending on the parameters. On the one hand, the asymptotic position is shown to converge to a point-mixture for all $\zeta>0$; on the other hand, the average position, when variously $0<\zeta<1$, $\zeta>1$ or $\zeta=1$, is shown to converge to a constant, a point-mixture, or a distribution $\mu_G$ with no atoms and full support on a certain simplex, respectively. The last type of limit can be seen as a sort of ``spreading'' between the cases $0<\zeta<1$ and $\zeta>1$. In particular, when $G$ is appropriately chosen, $\mu_G$ is a Dirichlet distribution with certain parameters, reminiscent of results in Pólya urns.


Introduction and Results
In this article, we study laws of large numbers (LLN) for a class of finite-space time-nonhomogeneous Markov chains where, as time increases, positions are less and less likely to change. In a two-state example of such a chain, the average position converges to a fair mixture of point-masses at 0 and 1, the limit when $\zeta > 1$ and starting at random (cf. Fig. 1).
In the literature, there are only a few results on LLN's for time-nonhomogeneous Markov chains, often related to simulated annealing and Metropolis algorithms, which can be viewed in terms of a generalized model where $\zeta = \zeta(i,j)$ is a non-negative function. These results relate to the case "$\max \zeta(i,j) < 1$" when the LLN limit is a constant [8], Ch. 7 [28], [9]. See also Ch. 1 [16], [19], [20]; and texts [6], [14], [15] for more on nonhomogeneous Markov chains. In this light, the non-degenerate limits $\mu_G$ found here seem to be novel objects. In terms of simulated annealing, these limits suggest a more complicated LLN picture at the "critical" cooling schedule when $\zeta(i,j) = 1$ for some pairs $i, j$ in the state space.
The appearance of Dirichlet limits, when $G$ is chosen appropriately, seems of particular interest, given similar results for limit color-frequencies in Pólya urns [4], [10], as it hints at an even larger role for Dirichlet measures in related but different "reinforcement"-type models (see [17], [23], [22], and references therein, for more on urn and reinforcement schemes). In this context, the set of "spreading" limits $\mu_G$ in Theorem 1.3, of which Dirichlet measures are but a subset, appears intriguing as well (cf. Remarks 1.4, 1.5 and Fig. 2).
In another vein, although different, Ex. 1.1 seems not so far from the case of independent Bernoulli trials with success probability 1/n at the nth trial. For such trials much is known about the spacings between successes, and connections to GEM random allocation models and Poisson-Dirichlet measures [27], [1], [2], [3], [24], [25].
We also mention, in a different, neighboring setting, that some interesting but distinct LLN's have been shown for arrays of time-homogeneous Markov sequences where the transition matrix $P_n$ for the $n$th row converges to a limit matrix $P$ [7], [11], Section 5.3 [15]; see also [21] which comments on some "metastability" concerns.

We now develop some notation to state results. Let $\Sigma = \{1, 2, \ldots, m\}$ be a finite set of $m \geq 2$ points. We say a matrix $M = \{M(i,j) : 1 \leq i, j \leq m\}$ on $\Sigma$ is a generator matrix if $M(i,j) \geq 0$ for all distinct $1 \leq i, j \leq m$, and $M(i,i) = -\sum_{j \neq i} M(i,j)$ for $1 \leq i \leq m$. In particular, $M$ is a generator with nonzero entries if $M(i,j) > 0$ for $1 \leq i, j \leq m$ distinct, and $M(i,i) < 0$ for $1 \leq i \leq m$.
To avoid technicalities, e.g. with reducibility, we work with the following matrices,
$$\mathcal{G} = \big\{ G \in \mathbb{R}^{m \times m} : G \text{ is a generator matrix with nonzero entries} \big\},$$
although extensions should be possible for a larger class. For $G \in \mathcal{G}$, let $n(G,\zeta) = \lceil \max_{1 \leq i \leq m} |G(i,i)|^{1/\zeta} \rceil$, and define for $\zeta > 0$
$$P^{G,\zeta}_n = \begin{cases} I & \text{for } 1 \leq n \leq n(G,\zeta) \\ I + G/n^{\zeta} & \text{for } n \geq n(G,\zeta) + 1 \end{cases}$$
where $I$ is the $m \times m$ identity matrix. Then, for all $n \geq 1$, $P^{G,\zeta}_n$ is ensured to be a stochastic matrix.
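To make the construction concrete, the following minimal sketch (ours, with an arbitrary illustrative generator $G$) builds $P^{G,\zeta}_n$ and verifies it is stochastic:

```python
import numpy as np

def transition_matrix(G, zeta, n):
    """Return P^{G,zeta}_n: the identity up to time n(G,zeta), then I + G/n^zeta."""
    m = G.shape[0]
    n_G = int(np.ceil(np.max(np.abs(np.diag(G))) ** (1.0 / zeta)))  # n(G,zeta)
    if n <= n_G:
        return np.eye(m)
    return np.eye(m) + G / n ** zeta

# an arbitrary 3-state generator matrix with nonzero entries (rows sum to 0)
G = np.array([[-2.0,  1.0,  1.0],
              [ 1.0, -3.0,  2.0],
              [ 1.5,  0.5, -2.0]])

P = transition_matrix(G, zeta=1.0, n=10)
assert np.all(P >= 0) and np.allclose(P.sum(axis=1), 1.0)  # stochastic for every n
```

The cutoff $n(G,\zeta)$ is exactly what makes the diagonal entries $1 + G(i,i)/n^\zeta$ nonnegative past it.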
Let $\pi$ be a distribution on $\Sigma$, and let $P^{G,\zeta}_\pi$ be the (nonhomogeneous) Markov measure on the sequence space $\Sigma^{\mathbb{N}}$ with Borel sets $\mathcal{B}(\Sigma^{\mathbb{N}})$ corresponding to initial distribution $\pi$ and transition kernels $\{P^{G,\zeta}_n\}$. That is, with respect to the coordinate process $X = \langle X_0, X_1, \ldots \rangle$, we have $P^{G,\zeta}_\pi(X_0 = i) = \pi(i)$ and the Markov property
$$P^{G,\zeta}_\pi\big(X_{n+1} = j \,\big|\, X_0, \ldots, X_{n-1}, X_n = i\big) = P^{G,\zeta}_{n+1}(i,j)$$
for all $i, j \in \Sigma$ and $n \geq 0$. Our convention then is that $P^{G,\zeta}_{n+1}$ controls "transitions" between times $n$ and $n+1$. Let also $E^{G,\zeta}_\pi$ be expectation with respect to $P^{G,\zeta}_\pi$. More generally, $E_\mu$ denotes expectation with respect to a measure $\mu$.
Define the occupation statistic $Z_n = \langle Z_{1,n}, \ldots, Z_{m,n} \rangle$ for $n \geq 1$, where $Z_{k,n} = \frac{1}{n}\sum_{i=1}^{n} 1_k(X_i)$ is the average time spent at state $k$ up to time $n$. The first result is on convergence of the position of the process. For $G \in \mathcal{G}$, let $\nu_G$ be the stationary distribution corresponding to $G$ (of the associated continuous-time homogeneous Markov chain), that is, the unique left eigenvector of the eigenvalue 0, with positive entries, normalized to unit sum.

Theorem 1.1 For $G \in \mathcal{G}$, $\zeta > 0$, and initial distribution $\pi$, under $P^{G,\zeta}_\pi$, the position $X_n$ converges in distribution to a probability vector $\nu_{G,\pi,\zeta}$ on $\Sigma$ depending in general on $\zeta$, $G$, and $\pi$. When $0 < \zeta \leq 1$, $\nu_{G,\pi,\zeta}$ does not depend on $\pi$ and $\zeta$ and reduces to $\nu_{G,\pi,\zeta} = \nu_G$.
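As a quick two-state illustration (ours, not from the paper), $\nu_G$ can be computed directly:
$$G = \begin{pmatrix} -a & a \\ b & -b \end{pmatrix}, \qquad \nu_G^t G = 0, \ \ \nu_G(1) + \nu_G(2) = 1 \ \Longrightarrow\ \nu_G = \Big( \frac{b}{a+b},\, \frac{a}{a+b} \Big),$$
so, by Theorem 1.1, for $0 < \zeta \leq 1$ the position forgets its initial distribution and equilibrates to $\nu_G$.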
Theorem 1.5 For $G \in \mathcal{G}$: (1) $\mu_G$ has full support on $\Delta_m$. Also, (2) $\mu_G$ has no atoms.

Remark 1.5 We suspect better estimates in the proof of Theorem 1.5 will show $\mu_G$ is in fact mutually absolutely continuous with respect to Lebesgue measure on $\Delta_m$. Of course, in this case, it would be of interest to find the density of $\mu_G$. Meanwhile, we give two histograms in Figure 2 of the empirical density when $m = 3$ and $G$ takes two particular forms, found by calculating 1000 averages, each on a run of time-length 10000 starting at random on $\Sigma$ at time $n(G,1)$ ($= 3, 1$ respectively). To help visualize the plots, $\Delta_3$ is mapped to the plane by a linear transformation $f$ that maintains a distance $\sqrt{2}$ between the transformed vertices.
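The simulation behind such histograms is easy to sketch. The following (our reconstruction, with an arbitrary symmetric generator standing in for the two matrices of Figure 2) collects 1000 occupation vectors $Z_{10000}$ for the $\zeta = 1$ chain:

```python
import numpy as np

rng = np.random.default_rng(0)

def occupation_vector(G, T=10000):
    """Run the zeta = 1 chain for T steps after n(G,1) and return Z_T."""
    m = G.shape[0]
    n_G = int(np.ceil(np.max(np.abs(np.diag(G)))))  # n(G,1)
    x = rng.integers(m)                             # start at random at time n(G,1)
    counts = np.zeros(m)
    for n in range(n_G + 1, n_G + T + 1):
        row = np.eye(m)[x] + G[x] / n               # row x of P^{G,1}_n = I + G/n
        x = rng.choice(m, p=row)
        counts[x] += 1
    return counts / T

# arbitrary stand-in generator; the paper's two examples appear in Figure 2
G = np.array([[-2.0,  1.0,  1.0],
              [ 1.0, -2.0,  1.0],
              [ 1.0,  1.0, -2.0]])

samples = np.array([occupation_vector(G) for _ in range(1000)])
# histogram `samples` on the simplex to approximate the empirical density of mu_G
```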
We now comment on the plan of the paper. The proofs of Theorems 1.1 and 1.2, 1.3, 1.4, and 1.5 (1) and (2) are in Sections 2, 3, 4, 5, and 6, respectively. These sections do not depend structurally on each other.

Proofs of Theorems 1.1 and 1.2
We first recall some results for nonhomogeneous Markov chains in the literature. For a stochastic matrix $P$ on $\Sigma$, define the "contraction coefficient"
$$c(P) = \frac{1}{2} \max_{1 \leq i,j \leq m} \sum_{k=1}^{m} \big| P(i,k) - P(j,k) \big|. \tag{2.1}$$
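In code, $c(P)$ in the form (2.1) is a one-liner; the following small sketch (ours) can be used to check the contraction bounds in this section numerically:

```python
import numpy as np

def contraction_coefficient(P):
    """Dobrushin coefficient c(P) = (1/2) max_{i,j} sum_k |P(i,k) - P(j,k)|."""
    m = P.shape[0]
    return 0.5 * max(np.abs(P[i] - P[j]).sum()
                     for i in range(m) for j in range(m))

# c(P) = 0 iff all rows of P agree; c(P) < 1 gives contraction in total variation
```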
The following is, for instance, Theorem 4.5.1 [28].

Proposition 2.1 Let $\{P_n\}$ be a sequence of stochastic matrices on $\Sigma$, with probability vectors $\{\nu_n\}$ satisfying $\nu_n^t P_n = \nu_n^t$ for $n \geq 1$, such that
$$\sum_{n \geq 1} \big( 1 - c(P_n) \big) = \infty \tag{2.2}$$
and $\sum_{n \geq 1} \|\nu_n - \nu_{n+1}\| < \infty$. Then, $\nu = \lim_{n \to \infty} \nu_n$ exists, and, starting from any initial distribution $\pi$, we have, for each $j \in \Sigma$, $\lim_{n \to \infty} (\pi^t P_1 \cdots P_n)(j) = \nu(j)$.

The following is stated in Section 2 [8] as a consequence of results (1.2.22) and Theorem 1.2.23 in [16].

Proposition 2.2 Given the setting of Proposition 2.1, suppose (2.2) is satisfied, and
$$c_n = \max_{n_0 \leq i \leq n} c(P_i) < 1 \quad \text{for all } n \geq n_0$$
for some $n_0 \geq 1$. Let $\pi$ be any initial distribution, and $f : \Sigma \to \mathbb{R}$ any function. Then, $(1/n)\sum_{i=1}^{n} f(X_i) \to E_\nu[f]$, both in probability and in $L^2$.

Proof of Theorem 1.1. We first consider when $\zeta > 1$. In this case there are only a finite number of movements, by Borel-Cantelli, since $\sum_{n \geq 1} P^{G,\zeta}_\pi(X_n \neq X_{n+1}) \leq C \sum_{n \geq 1} n^{-\zeta} < \infty$. Hence there is a time of last movement $N < \infty$ a.s. Then, $\lim X_n = X_N$ a.s., and, for $k \in \Sigma$, the limit distribution $\nu_{G,\pi,\zeta}$ is defined and given by $P^{G,\zeta}_\pi(X_N = k) = \nu_{G,\pi,\zeta}(k)$.

When $0 < \zeta \leq 1$, as $G \in \mathcal{G}$, by calculation with (2.1), $c(P^{G,\zeta}_n) = 1 - C_G/n^{\zeta}$ for all $n \geq n_0(G,\zeta)$ large enough and a constant $C_G > 0$. Then, $\sum_{n \geq 1} (1 - c(P^{G,\zeta}_n)) = \infty$, verifying (2.2). Since for $n > n(G,\zeta)$, $\nu^t_G P^{G,\zeta}_n = \nu^t_G (I + G/n^{\zeta}) = \nu^t_G$, the second condition of Proposition 2.1 is trivially satisfied, and hence the result follows.
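To see where the identity $c(P^{G,\zeta}_n) = 1 - C_G/n^{\zeta}$ comes from, one can use the equivalent form $c(P) = 1 - \min_{i \neq j} \sum_k \min(P(i,k), P(j,k))$, a standard rewriting of (2.1) for stochastic matrices; the following computation is our sketch, not the paper's display. For $n$ large enough that $1 + G(i,i)/n^{\zeta} \geq \max_j G(j,i)/n^{\zeta}$ for all $i$, and $i \neq j$,
$$\sum_{k} \min\big( P^{G,\zeta}_n(i,k),\, P^{G,\zeta}_n(j,k) \big) = \frac{1}{n^{\zeta}} \Big[ G(i,j) + G(j,i) + \sum_{k \neq i,j} \min\big( G(i,k), G(j,k) \big) \Big],$$
so $C_G$ is the minimum over $i \neq j$ of the bracketed quantity, positive as $G$ has positive off-diagonal entries, and $\sum_n (1 - c(P^{G,\zeta}_n)) = C_G \sum_n n^{-\zeta}$ diverges exactly when $\zeta \leq 1$.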
Proof of Theorem 1.2. When $\zeta > 1$, as mentioned in the proof of Theorem 1.1, there are only a finite number of moves a.s., and so a.s. $\lim Z_n = \sum_{k=1}^{m} 1_{[X_N = k]}\, \mathbf{k}$ concentrates on the basis vectors $\{\mathbf{k}\}$. Hence, with $P^{G,\zeta}_\pi(X_N = k) = \nu_{G,\pi,\zeta}(k)$ as defined in the proof of Theorem 1.1, the result follows.
Proof of Theorem 1.3

In this section, as $\zeta = 1$ is fixed, we suppress notational dependence on $\zeta$. Also, as $Z_n$ takes values in the compact set $\Delta_m$, the weak convergence in Theorem 1.3 follows from convergence of the moments. The next lemma establishes convergence of the first moments.
We now focus on a useful class of diagonalizable matrices,
$$\mathcal{G}^* = \big\{ G \in \mathbb{R}^{m \times m} : G \text{ is diagonalizable and } \mathrm{Re}(\lambda^G_l) \leq 0 \text{ for } 1 \leq l \leq m \big\},$$
where $\{\lambda^G_l\}$ are the eigenvalues of $G$. As $\mathrm{Re}(\lambda^G_l) \leq 0$ for $1 \leq l \leq m$ when $G \in \mathcal{G}$, certainly all diagonalizable $G \in \mathcal{G}$ belong to $\mathcal{G}^*$. The relevance of this class, in the subsequent arguments, is that for $G \in \mathcal{G}^*$ the resolvent $(xI - G)^{-1}$ exists for $x \geq 1$.
For $G \in \mathcal{G}^*$, let $V_G$ be the matrix of eigenvectors and $D_G$ the diagonal matrix with corresponding eigenvalue entries, so that $G = V_G D_G V_G^{-1}$. We also denote, for $a_1, \ldots, a_m \in \mathbb{C}$, by $\mathrm{Diag}(a_\cdot)$ the diagonal matrix with $i$th diagonal entry $a_i$ for $1 \leq i \leq m$. We also extend the definitions of $P^G_n$ and $P^G_{i,j}$ to $G \in \mathcal{G}^*$ with the same formulas. In the following, we use the principal value of the complex logarithm, and the usual convention $a^{b+ic} = e^{(b+ic)\log(a)}$ for $a, b, c \in \mathbb{R}$ with $a > 0$.
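Before the proof, it may help to record why diagonalization is useful here: assuming, as the notation suggests, that $P^G_{i,j}$ denotes the product $P^G_{i+1} \cdots P^G_j$ (with $\zeta = 1$), the matrix product reduces to scalar products over the eigenvalues. For $i$ large enough that every factor below is nonzero,
$$P^G_{i,j} = \prod_{k=i+1}^{j} \Big( I + \frac{G}{k} \Big) = V_G\, \mathrm{Diag}\Big( \prod_{k=i+1}^{j} \big( 1 + \lambda^G_s/k \big) \Big)\, V_G^{-1}, \qquad \prod_{k=i+1}^{j} \Big( 1 + \frac{\lambda}{k} \Big) = \frac{\Gamma(j+1+\lambda)\,\Gamma(i+1)}{\Gamma(i+1+\lambda)\,\Gamma(j+1)} \sim \Big( \frac{j}{i} \Big)^{\lambda},$$
with the asymptotics as $i, j \uparrow \infty$; the logarithmic expansion in the proof below makes this uniform and quantitative.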
Proof. Straightforwardly, $P^G_{i,j}$ diagonalizes through $V_G$, leaving scalar products over the eigenvalues to estimate. To expand further, we note for $z \in \mathbb{C}$ such that $|z - 1| < 1$, we have $\log z = \sum_{k \geq 1} (-1)^{k+1}(z-1)^k/k$. Let now $L$ be so large that $\max_{1 \leq u \leq m} |\lambda^G_u|/L < 1/2$. Then, for $1 \leq s \leq m$ and $k \geq L$, the expansion applies to $z = 1 + \lambda^G_s/k$, with error bounds holding uniformly over $j$ and $s$ as $i \uparrow \infty$. This allows us to write the scalar products in exponential form. Defining $\nu(s; i, j) = \exp(c(s; i, j) + d(s; i, j))$ gives, after multiplying out, the claimed expression, completing the proof.
The right-hand bound is integrable: indeed, this follows by Tonelli's Lemma and induction.
Hence, the lemma follows by dominated convergence and Fubini's Theorem.
At this point, by straightforwardly combining the previous lemmas, we have proved Theorem 1.3 for $G \in \mathcal{G}$ diagonalizable. The method of extending to non-diagonalizable generators is to approximate with suitable "lower" and "upper" diagonalizable matrices, using that (1) the spectrum varies continuously with respect to the matrix norm $\|\cdot\|_M$ (cf. Appendix D [13]), and (2) diagonalizable real matrices are dense (cf. Theorem 1 [12]).

Proof of Theorem 1.4
The proof follows by evaluating the moment expressions in Theorem 1.3, when $G = \Theta$, as those corresponding to the Dirichlet distribution with parameters $\theta_1, \ldots, \theta_m$ (1.1).
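For reference (a standard fact, stated here in rising-product notation): if $\langle Z_1, \ldots, Z_m \rangle$ is Dirichlet with parameters $\theta_1, \ldots, \theta_m$, and $\gamma_1, \ldots, \gamma_m$ are nonnegative integers with $\bar\theta = \sum_l \theta_l$ and $\bar\gamma = \sum_l \gamma_l$, then
$$E\Big[ \prod_{l=1}^{m} Z_l^{\gamma_l} \Big] = \frac{\prod_{l=1}^{m} \theta_l (\theta_l + 1) \cdots (\theta_l + \gamma_l - 1)}{\bar\theta (\bar\theta + 1) \cdots (\bar\theta + \bar\gamma - 1)},$$
with the convention that an empty product equals 1; these are the moments the evaluation below must reproduce.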
The next statement is an immediate corollary of Theorem 1.3 and Lemma 4.1.
We now evaluate the last expression of Lemma 4.2 by first specifying the value of $\sigma_{\bar\gamma}$. Recall, by convention, $\theta_l(\theta_l + 1) \cdots (\theta_l + \gamma_l - 1) = 1$ when $\gamma_l = 0$ for $1 \leq l \leq m$.
Proof. The proof will be by induction on $\bar\gamma$.

Induction Step. Without loss of generality and to ease notation, let $k = 1$. Then, by specifying the next-to-last element $\sigma_{\bar\gamma - 1}$, and simple counting, we obtain an expression whose right side we now evaluate, by induction, as $\theta_l(\theta_l + 1) \cdots (\theta_l + \gamma_l - 1)$.
By now adding over $1 \leq k \leq m$ in the previous lemma, we finish the proof of Theorem 1.4.

Proof of Theorem 1.5 (1)
Let $p = \langle p_1, \ldots, p_m \rangle \in \mathrm{Int}\,\Delta_m$ be a point in the interior of the simplex, with $p_i > 0$ for $1 \leq i \leq m$. For $\epsilon > 0$ small, let $B(p, \epsilon) \subset \mathrm{Int}\,\Delta_m$ be the ball with radius $\epsilon$ and center $p$. To prove Theorem 1.5 (1), it is enough to show, for all large $n$, the lower bound
$$P^G_\pi\big( Z_n \in B(p, \epsilon) \big) \geq C(p, \epsilon) > 0.$$
To this end, let $\bar p_0 = 0$ and $\bar p_i = \sum_{l=1}^{i} p_l$ for $1 \leq i \leq m$. Also, define, for $1 \leq k \leq l$, $X^l_k = \langle X_k, \ldots, X_l \rangle$. Then, there exist small $\delta, \beta > 0$ for which the event $\{Z_n \in B(p, \epsilon)\}$ can be bounded below through a prescribed path event, where $\bar k_a = \sum_{l=1}^{a} k_l$, and $\mathbf{i}$ is a vector with all coordinates equal to $i$ of the appropriate length. The last event represents the process being in the fixed location $j$ for times $\lfloor n\bar p_{j-1} \rfloor - \bar k_{j-1} + 1$ to $\lfloor n\bar p_j \rfloor - \bar k_j$ for $1 \leq j \leq m$, where we take $1 - \bar k_0 = \lfloor n\delta \rfloor$.

Now, as $G$ has strictly negative diagonal entries, $C_1 = \max_s |G(s,s)| > 0$, and so, for all large $n$, the probability of each prescribed sojourn is uniformly bounded below. Also, as $G$ has positive nondiagonal entries, $C_2 = \min_s G(s, s+1) > 0$ bounds the switch rates from below. Hence, for all large $n$, as $P^G_\pi(X_{\lfloor n\delta \rfloor} = 1) \geq \nu_G(1)/2$ (Theorem 1.1), the desired lower bound follows.

Proof of Theorem 1.5 (2)

The proof of Theorem 1.5 (2) follows from the next two propositions.
Note, as $p \in \Delta_m \setminus \{\mathbf{1}, \ldots, \mathbf{m}\}$, at least two coordinates of $p$ are positive. Then, as $\delta < \bar p/2$, when $(1/n)\sum_{i=1}^{n} \langle 1_1(\omega_i), \ldots, 1_m(\omega_i) \rangle \in B(p, \delta)$, at least one switch occurs in $\omega^n_1$. For $j \geq 1$ and a path in $T(j)$, let $\alpha_1, \ldots, \alpha_j$ denote the $j$ switch times in the sequence; let also $\theta_1, \ldots, \theta_{j+1}$ be the $j+1$ locations visited by the sequence. We now partition $T(j)$ into sets $A_j(U, V)$, where $U = \langle U_1, \ldots, U_{j-1} \rangle$ and $V = \langle V_1, \ldots, V_{j+1} \rangle$ denote possible switch times (up to the $(j-1)$st switch time) and visit locations respectively. In this decomposition, paths in $A_j(U, V)$ are in $1:1$ correspondence with $j$th switch times $\alpha_j$, the only feature allowed to vary. Now, for each set $A_j(U, V)$, we define a path $\eta(j, U, V) = \langle \eta_1, \ldots, \eta_n \rangle$ where the last, $j$th switch is "removed." Note that the sequence $\eta(j, U, V)$ belongs to $T(j-1)$, can be obtained no matter the location $V_{j+1}$ (which could range over the $m$ values in the state space), and is in $1:1$ correspondence with the pair $\langle U_1, \ldots, U_{j-1} \rangle$ and $\langle V_1, \ldots, V_j \rangle$. In particular, recalling $X^n_1 = \langle X_1, \ldots, X_n \rangle$ denotes the coordinate sequence up to time $n$, we have a corresponding sum over all $U, V$ in the decomposition into sets $A_j(U, V)$. The next lemma estimates the location of the last switch time $\alpha_j$, and the size of the set $A_j(U, V)$. The proof is deferred to the end.
A consequence of these bounds on the position and cardinality of the $\alpha_j$'s associated to a fixed set $A_j(U, V)$ is a comparison of path probabilities, where $\sum'$ refers to adding over all last switch times $\alpha_j$ associated to paths in $A_j(U, V)$. Let now $\hat G = \max\{|G(i,j)| : 1 \leq i, j \leq m\}$.
The proposition follows by taking the limit in $n$, and weak convergence.
Proof of Lemma 6.1. For a path $\omega^n_1 \in A_j(U, V)$ and $1 \leq k \leq j+1$, let $\tau_k$ be the number of visits to state $V_k$ (some $\tau_k$'s may be the same if $V_k$ is repeated). For $1 \leq i \leq \tau_k$, let $n^k_i$ and $\bar n^k_i$ be the start and end of the $i$th visit to $V_k$. Certainly,
$$n(p_{V_k} - \delta) \leq \sum_{i=1}^{\tau_k} (\bar n^k_i - n^k_i + 1) \leq n(p_{V_k} + \delta). \tag{6.4}$$
Hence, as the disjoint sojourns $\{[n^k_i, \bar n^k_i] : 1 \leq i \leq \tau_k\}$ occur between times 1 and $\bar n^k_{\tau_k}$, their total length is less than $\bar n^k_{\tau_k}$, and we deduce $n(p_{V_k} - \delta) \leq \bar n^k_{\tau_k}$.

Now, for $p \in \Delta_m \setminus \{\mathbf{1}, \ldots, \mathbf{m}\}$, at least one of the $\{p_{V_i} : V_i \neq V_{j+1}, 1 \leq i \leq j\}$ is positive: indeed, there are two coordinates of $p$, say $p_s$ and $p_t$, which are positive. Say $V_{j+1} \neq s$; then, as $|(1/n)\sum_{i=1}^{n} 1_s(\omega_i) - p_s| \leq \delta$ and $p_s - \delta > 0$, the path must visit state $s$ before time $\alpha_j$, i.e. $V_i = s$ for some $1 \leq i \leq j$.
Then, from the deduction just after (6.4), the first statement follows. For the second statement, note that $-n^j_{\tau_j} + \sum_{i=1}^{\tau_j - 1} (\bar n^j_i - n^j_i + 1)$ (with the convention that the sum vanishes when $\tau_j = 1$) is independent of paths in $A_j(U, V)$, being some combination of $\{U_i : 1 \leq i \leq j-1\}$. Hence, with $k = j$ in (6.4), we observe $\alpha_j = \bar n^j_{\tau_j} + 1$ takes on at most $\lfloor 2n\delta + 1 \rfloor$ distinct values. The result now follows as paths in $A_j(U, V)$ are in $1:1$ correspondence with last switch times $\alpha_j$.