Singular value distribution of dense random matrices with block Markovian dependence

A block Markov chain is a Markov chain whose state space can be partitioned into a finite number of clusters such that the transition probabilities only depend on the clusters. Block Markov chains thus serve as a model for Markov chains with communities. This paper establishes limiting laws for the singular value distributions of the empirical transition matrix and empirical frequency matrix associated to a sample path of the block Markov chain whenever the length of the sample path is $\Theta(n^2)$ with $n$ the size of the state space. The proof approach is split into two parts. First, we introduce a class of symmetric random matrices with dependent entries called approximately uncorrelated random matrices with variance profile. We establish their limiting eigenvalue distributions by means of the moment method. Second, we develop a coupling argument to show that this general-purpose result applies to the singular value distributions associated with the block Markov chain.

important component of the process which generated the data [48,56,64,68]. Another example occurs in principal component analysis, where the eigenvalue distribution of an empirical covariance matrix can be used to detect the appropriate number of principal components [46,75]. If the matrix built from the data is non-Hermitian, then algorithms typically employ singular values and singular vectors instead of eigenvalues and eigenvectors. Here, recall that the $i$th largest singular value of a real $n \times n$ matrix $M$ is $s_i(M) := \sqrt{\lambda_i(M M^{\top})}$, with $\lambda_i(M M^{\top})$ the $i$th largest eigenvalue of $M M^{\top}$. This paper concerns the empirical frequency matrix $\hat N_X$ built from a dependent data sequence $X_0, X_1, \ldots, X_\ell$ taking values in $\{1, \ldots, n\}$ for some positive integer $n \in \mathbb{Z}_{\geq 1}$. The term singular value distribution of $\hat N_X$ here refers to the measure $\nu_{\hat N_X}$ defined by $\nu_{\hat N_X}((a, b]) := \#\{i : a < s_i(\hat N_X) \leq b\}/n$ for any $a < b$. While the singular value distribution of a random matrix is well understood when the entries are independent, this is not so when the matrix is constructed from dependent sequential data, as is the case for $\hat N_X$. This paper models the sequential data $X_0, X_1, \ldots, X_\ell$ as being generated by a block Markov chain. Block Markov chains are a model for dependent sequential data with an underlying community structure and have been used to develop and analyse community detection algorithms for sequential data; see [66,78]. Besides these two papers and the present paper, the only other rigorous analysis of the spectral properties of $\hat N_X$ when $X$ is a block Markov chain can be found in [67]. There, an asymptotic distance between the $K$ largest singular values and the $n - K$ smallest singular values is established.
The current paper establishes a limiting law for the singular value distribution of block Markovian random matrices such as $\hat N_X$ as the size $n$ of the state space tends to infinity and the length of the data sequence satisfies $\ell = \Theta(n^2)$. For example, Theorem 1.1 describes the limiting law associated with $\hat N_X$, which is visualized in Figure 2. Theorem 1.1 furthermore implies that the singular value distribution of block Markovian random matrices behaves as one would predict assuming the entries $\hat N_{X,ij}$ are independent. This gives some legitimacy to the practice of assuming some independence, particularly when the dependencies are asymptotically equivalent to those of a block Markov chain and the data sequence is sufficiently long. One can however not ignore the Markovian dependence entirely, since it causes the frequencies of different transitions to have different distributions. Indeed, the singular value distribution following from Theorem 1.1 does not necessarily agree with the singular value distribution which one would find assuming that the different entries $\hat N_{X,ij}$ are identically distributed.
We next state our main results in Section 1.1. This is followed by an overview of the literature in Section 2. Section 3 then provides notation and preliminaries in preparation for the proofs. Proof outlines are given in Section 4, and a brief comparison between our theoretical predictions and the singular value distribution obtained from an actual dataset is done in Section 5. Finally, all details for the proof are provided in Section 6.

Figure 1. A schematic depiction of a block Markov chain. The thick arrows visualize the cluster transition probabilities $p_{k,l}$ and the thin arrows visualize the transitions $(X_t, X_{t+1})$ of a sample path $(X_t)_{t=0}^{\ell}$. The starting point $X_0$ was chosen to lie in the leftmost cluster $V_1$.
1.1. Results. Our main objects of study, which will subsequently be formally defined, are Markov chains which have a community structure. More precisely, the transition probabilities between states should only depend on the communities to which these states belong.
Fix a positive integer $K \in \mathbb{Z}_{\geq 1}$. For any $n \in \mathbb{Z}_{\geq K}$, pick a partition $(V_k)_{k=1}^K$ of $V := \{1, \ldots, n\}$ consisting only of nonempty sets. Let $p$ be the transition matrix of an irreducible acyclic Markov chain on $\{1, \ldots, K\}$ with equilibrium distribution $\pi$. The block Markov chain with cluster transition matrix $p$ and clusters $(V_k)_{k=1}^K$ is then the Markov chain on $V$ with transition probability $P_{i,j} := p_{x,y}/\#V_y$ for every $i \in V_x$ and $j \in V_y$. Figure 1 schematically depicts a block Markov chain.
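Though the analysis below is purely theoretical, the construction above is easy to simulate. The following sketch (with illustrative parameter values and helper names of our own choosing, not taken from the paper) builds the full transition matrix $P_{i,j} = p_{x,y}/\#V_y$ from a cluster transition matrix and a partition, and samples a path.

```python
import random

def build_P(p, clusters):
    """Full transition matrix of the block Markov chain: P[i][j] = p[x][y] / #V_y
    whenever i lies in cluster x and j lies in cluster y."""
    n = sum(len(V) for V in clusters)
    sigma = {v: k for k, V in enumerate(clusters) for v in V}  # state -> cluster index
    return [[p[sigma[i]][sigma[j]] / len(clusters[sigma[j]]) for j in range(n)]
            for i in range(n)]

def sample_path(P, x0, length, rng):
    """Sample (X_0, ..., X_length) from the chain with transition matrix P."""
    X = [x0]
    for _ in range(length):
        X.append(rng.choices(range(len(P)), weights=P[X[-1]])[0])
    return X

# Illustrative example with K = 2 clusters on n = 6 states.
clusters = [[0, 1, 2], [3, 4, 5]]
p = [[0.7, 0.3], [0.4, 0.6]]
P = build_P(p, clusters)
assert all(abs(sum(row) - 1.0) < 1e-12 for row in P)  # every row is a distribution
X = sample_path(P, 0, 100, random.Random(0))
```

Note that $P$ is constant on blocks $V_x \times V_y$, which is exactly the community structure exploited throughout the paper.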
In the subsequent results we are concerned with the asymptotic regime where $n$ tends to infinity. Fix a sequence of strictly positive real numbers $\alpha := (\alpha_1, \ldots, \alpha_K)$ with $\sum_{k=1}^K \alpha_k = 1$ and assume that for every $k \in \{1, \ldots, K\}$ it holds that $\#V_k = \alpha_k n + o(n)$. Let $X := (X_t)_{t=0}^{\ell}$ denote a sample path from the block Markov chain with an arbitrary starting distribution for $X_0$ and with length $\ell = \lambda n^2 + o(n^2)$ for some fixed $\lambda \in \mathbb{R}_{>0}$. Recall the definition of the empirical frequency matrix $\hat N_X$ from (1) and note that $\hat N_{X,ij}$ counts the number of traversals of the edge $(i, j)$. The empirical transition matrix $\hat P_X$ associated with the sample path is given by $\hat P_X := (\hat P_{X,ij})_{i,j=1}^n$ where
$$\hat P_{X,ij} := \frac{\hat N_{X,ij}}{\sum_{k=1}^n \hat N_{X,ik}}. \quad (3)$$
It will be shown in Corollary 6.10 that there is no division by zero in (3) asymptotically almost surely. Some terminology is required to state the main results. The Stieltjes transform of a finite nonzero measure $\mu$ on $\mathbb{R}$ is the analytic function $s : \mathbb{C}^+ \to \mathbb{C}^-$ given by $s(z) = \int 1/(z - x)\,d\mu(x)$, where $\mathbb{C}^+ := \{z \in \mathbb{C} : \operatorname{Im}(z) > 0\}$ is the upper half-plane and $\mathbb{C}^- := \{z \in \mathbb{C} : \operatorname{Im}(z) < 0\}$ is the lower half-plane. Let us remark that some authors refer to the Stieltjes transform by another name such as Cauchy transform or Cauchy-Stieltjes transform. Further, let us warn that some authors employ a convention which differs by a minus sign from the notation employed here; they instead consider the map $z \mapsto \int 1/(x - z)\,d\mu(x)$. The relevance of the Stieltjes transform for our purposes is that $\mu$ can be recovered from $s(z)$ by the Stieltjes inversion formula [8, Theorem B.8], which states that for any continuity points $a < b$ of $\mu$,
$$\mu((a, b]) = -\lim_{\varepsilon \downarrow 0} \frac{1}{\pi} \int_a^b \operatorname{Im}(s(x + \sqrt{-1}\,\varepsilon))\,dx.$$
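For concreteness, the empirical matrices $\hat N_X$ and $\hat P_X$ can be computed from a path as follows (a minimal sketch; the function names are ours, not the paper's).

```python
def frequency_matrix(X, n):
    """Empirical frequency matrix: entry (i, j) counts traversals of the edge (i, j)."""
    N = [[0] * n for _ in range(n)]
    for t in range(1, len(X)):
        N[X[t - 1]][X[t]] += 1
    return N

def transition_matrix(N):
    """Empirical transition matrix: row-normalize the frequency matrix. A row of
    zeros would cause division by zero; per Corollary 6.10 this asymptotically
    almost surely does not occur in the regime considered here."""
    return [[Nij / sum(row) for Nij in row] for row in N]

X = [0, 1, 0, 2, 1, 1, 0]  # a toy path on the states {0, 1, 2}
N = frequency_matrix(X, 3)
assert sum(map(sum, N)) == len(X) - 1       # a path of length l traverses l edges
assert transition_matrix(N)[1][0] == 2 / 3  # state 1 was left 3 times, twice to 0
```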
A sequence of random measures $\mu_n$ on $\mathbb{R}$ is said to converge weakly in probability to a finite measure $\mu$ if $\int f(x)\,d\mu_n(x) \to \int f(x)\,d\mu(x)$ in probability for every continuous bounded function $f \in C_b(\mathbb{R})$. Finally, the symmetrization of a measure $\mu$ on $\mathbb{R}_{\geq 0}$ is the measure $\operatorname{sym}(\mu)$ on $\mathbb{R}$ given by $A \mapsto (\mu(A \cap \mathbb{R}_{\geq 0}) + \mu((-A) \cap \mathbb{R}_{\geq 0}))/2$, where $A$ ranges over all measurable subsets of $\mathbb{R}$ and $-A := \{-a : a \in A\}$.
Visualizations of the following results are shown in Figure 2.
Theorem 1.1. The empirical singular value distribution $\nu_{\hat N_X/\sqrt{n}}$ converges weakly in probability to a compactly supported probability measure $\nu$ on $\mathbb{R}_{\geq 0}$. Moreover, the symmetrization $\operatorname{sym}(\nu)$ has Stieltjes transform $s(z) = \sum_{i=1}^K \alpha_i (a_i(z) + a_{K+i}(z))/2$ where $a_1, \ldots, a_{2K}$ are the unique analytic functions from $\mathbb{C}^+$ to $\mathbb{C}^-$ such that the following system of equations is satisfied:
$$a_i(z) = \frac{1}{z - \sum_{j=1}^K \lambda \pi(i) \alpha_i^{-1} p_{i,j}\, a_{K+j}(z)}, \qquad a_{K+i}(z) = \frac{1}{z - \sum_{j=1}^K \lambda \pi(j) \alpha_i^{-1} p_{j,i}\, a_j(z)} \quad (6)$$
for $i = 1, \ldots, K$.
Theorem 1.2. The empirical singular value distribution $\nu_{\sqrt{n}\hat P_X}$ converges weakly in probability to a compactly supported probability measure $\nu$ on $\mathbb{R}_{\geq 0}$. Moreover, the symmetrization $\operatorname{sym}(\nu)$ has Stieltjes transform $s(z) = \sum_{i=1}^K \alpha_i (a_i(z) + a_{K+i}(z))/2$ where $a_1, \ldots, a_{2K}$ are the unique analytic functions from $\mathbb{C}^+$ to $\mathbb{C}^-$ such that the following system of equations is satisfied for $i = 1, \ldots, K$.
Observe that the limiting law described in Theorem 1.1 is the same as that of a random matrix $M$ with mean-zero independent entries matching the variance profile of $\hat N_X$ [80, Theorem 6.1]. By matching variance profile, it is here meant that $\operatorname{Var}[M_{ij}] = \operatorname{Var}[\hat N_{X,ij}]$ for all $i, j$. Similarly, the limiting law in Theorem 1.2 corresponds to the limiting law of a random matrix with independent entries which instead matches the variance profile of $\sqrt{n}\hat P_X$. This hints at an underlying more general universality principle, which is commonplace in random matrix theory. Informally, universality states that the spectrum of a random matrix often depends only on the variances of the entries.
Indeed, the first step of our proof establishes a precise version of this universality statement in Corollary 4.3. The proof strategy is essentially a modification of the moment method. More specifically, we generalize a result from [40] concerning the eigenvalue distributions of approximately uncorrelated random matrices to include the possibility of a variance profile. The explicit description of the Stieltjes transform of the limiting laws relies on [80].
The second step of our proof establishes that this general-purpose universality principle applies to the block Markovian random matricesN X andP X . Proposition 4.7 contains the corresponding result. The key difficulty is to control the dependence. To this end, we use a coupling-based approach which is original to this paper and in fact the main new ingredient to establish the results. The reader is referred to Proposition 4.8 for a special case of Proposition 4.7 whose proof contains the key ideas.
For future research, an investigation into what happens when the sample path is much shorter, i.e., $\ell = o(n^2)$, can be considered. We anticipate that the results of this paper can be extended to such regimes, in which the empirical frequency matrix is sparser. Nontrivial modifications would however be required in the part of the proof which relies on the moment method. This is because, in a sparse random matrix, normalizing for the variance causes all higher moments of the entries to diverge. This issue can already be observed at the level of a scalar random variable. Namely, consider a sequence of Bernoulli random variables $\xi_n$ with probability of success $p_n = o(1)$ as $n$ tends to infinity, and set $\zeta_n := (\xi_n - p_n)/\sqrt{p_n(1 - p_n)}$. Then $\mathbb{E}[\zeta_n^k] = \Theta(p_n^{1 - k/2})$ diverges for every fixed $k \geq 3$.

[35] for an overview. Community detection problems within the context of block Markov chains received attention more recently [30,31,66,78,79,81]. Spectral clustering algorithms for learning low-rank structures in Markov chains from trajectories that specifically utilize the sample frequency matrix $\hat N_X$ or the empirical transition matrix $\hat P_X$ have been analyzed in [66,78]. The fact that $\hat N_X$ and $\hat P_X$ have previously been used in the context of community detection algorithms for Markov chains is also what motivated us to consider these two specific matrices. One could in principle however also build different matrices from the data. Appropriately modified, the methods of the current paper should still apply and thereby allow one to derive the associated singular value distributions.
In order to compare algorithmic performance to an information-theoretic lower bound on the detection error rate satisfied by any clustering algorithm, [66] required a sufficiently sharp upper bound on the largest singular value of $\hat N_X - \mathbb{E}[\hat N_X]$.
The singular values of $\hat N_X$ were recently also considered in [67]. It is established there that $\hat N_X$ has $K$ informative singular values of size $\Theta(\ell/n)$, and that the remaining $n - K$ singular values are of a strictly smaller order. Besides the dense regime $\ell = \Omega(n^2)$, the sparser regimes $\ell = \Omega(n \ln n)$ and $\ell = \omega(n)$ are also considered in [66,67].

2.2. Random matrices generated by stochastic processes. The spectral distributions of matrices whose entries are sampled by means of a stochastic process, such as a Markov chain, were considered in [22,34,53,54,60,62]. These results are similar to ours in that the randomness is due to the sampling noise of the stochastic process, but differ in the precise construction of the matrix.
Sample covariance matrices for time series have been considered in [9,10,44,55,61,76]. Let us note that covariance matrices of time series can also be viewed as an instance of the aforementioned study of random matrices with entries sampled from a stochastic process by consideration of the entries of the data matrix. Sample autocovariance matrices of time series have been considered in [12,17,18,49,50,77].
2.3. Coupling arguments. The critical new ingredient in our proof is the coupling argument used to establish Proposition 4.7. Coupling arguments are a natural way to deal with the dependence in a Markov chain. They have been used in this setting since the seminal paper [29]. Coupling arguments in random matrix theory are however not commonplace; exceptions we are aware of are [9,11].

2.4. Approximately uncorrelated random matrices. A class of self-adjoint random matrices with dependent entries and decaying covariances, called approximately uncorrelated random matrices, was studied in [40]. The authors establish that the empirical eigenvalue distribution of an approximately uncorrelated random matrix converges weakly in probability to the semicircular law. Our Corollary 4.3 generalizes their approach to admit the possibility that not all entries have the same variance. Improvements to weak almost sure convergence are established in [19,33] under additional assumptions. It would be interesting to establish similar results in the presence of variance profiles (see the remarks preceding Proposition 6.16).
2.5. Random matrices with a variance profile. Let $M$ be a random matrix with independent centered entries and variance profile $S_{ij} := \operatorname{Var}[M_{ij}]$. The classical results on the spectral properties of random matrices assume that this variance profile is constant, but extensions to nonconstant variance profiles have also been considered [2,4,27,28,32,37].
It is typically necessary to assume that the variance profile has tractable asymptotic behavior. One notion of tractable asymptotic behavior may be found by employing notions from graphon theory [23,26,80]. In this case the variance profile converges to an integrable function $W : [0,1]^2 \to \mathbb{R}$. Let us note that results characterizing eigenvalue distributions in this setting historically preceded the graphon-theoretic terminology; see [72]. Graphons were originally developed in [52] as limit objects for sequences of dense graphs.
Systems of self-consistent equations as in Theorem 1.1 frequently occur in the theory of random matrices with variance profiles [32,72,80]. However, solving these equations to determine an explicit expression for the Stieltjes transform $s(z)$ is rarely possible. A numerical method based on an iteration of contraction maps has been developed in the field of operator-valued free probability theory [39].

2.6. Poisson limit theorems for Markov chains. The variance profiles in our block Markovian setting follow from a Poisson limit theorem; see Theorem 4.5. This is in turn deduced from a nonasymptotic Poisson approximation theorem; see Theorem 6.11. Poisson limit theorems for Markov chains are a topic of study in their own right. We refer to [69, Section 2.3] and [38, Section 5] for an overview of the literature.
Distinct from the literature, the emphasis of Theorem 6.11 lies on the fact that the state space is growing. Compare this, e.g., to [63], which concerns the number of visits to an increasingly rare cylindrical set in the sample path of a Markov chain on a fixed finite state space. Theorem 6.11's proof relies on a general Poisson approximation result for sums of dependent random variables from [6], which in turn relies on a method from [24].

2.7. Random transition matrices. There has been recent interest in the spectral properties of random walks in a random environment. The setting of [15] is to first sample a random $n \times n$ matrix $M$ of independent and identically distributed nonnegative real random variables of finite variance, and to then construct the random transition matrix $P := D^{-1}M$ with $D$ the diagonal matrix containing the row sums of $M$. The results then concern limiting laws for the singular value distribution and eigenvalue distribution of $\sqrt{n}P$. Different models for random transition matrices have been considered in [13,14,16,20,21,25,42,43,47,56,58,82].
Our study of $\hat P_X$ differs from the study of $P$ in the source of randomness: the randomness in $\hat P_X$ is due to the observation noise in the sampled Markov chain in a deterministic environment, whereas the randomness in $P$ is due to a perfect observation of a random environment. Our situation involves the additional subtlety that one has to deal with the dependence intrinsic to the sampling noise of a Markov chain. It should however be mentioned that many of the aforementioned results also concern the distribution of eigenvalues, while we study singular values.

3. Notation and preliminaries
3.1. Block Markov chains. Block Markov chains were defined in Section 1.1. Denote by $\sigma : V \to \{1, \ldots, K\}$ the map which sends any $v \in V_k$ to the cluster index $k$. Note that the definition of a block Markov chain implies that $\Sigma_X := (\sigma(X_t))_{t=0}^{\ell}$ is a Markov chain on the space of clusters $\{1, \ldots, K\}$ with transition matrix $p$.
Recall that it is assumed that $|\#V_k - \alpha_k n| = o(n)$ with $\alpha_k > 0$ for all $k \in \{1, \ldots, K\}$. Further, recall that the $V_k$ are assumed to be nonempty for all $n$. Hence, there exists some $\alpha_{\min} \in \mathbb{R}_{>0}$, independent of $n$, such that $\#V_k > \alpha_{\min} n$ for all $k \in \{1, \ldots, K\}$.
We denote by $E_X := (E_{X,t})_{t=1}^{\ell}$ the chain of edges $E_{X,t} := (X_{t-1}, X_t)$ associated with $X$.

3.2. Graphs. All graphs in this paper are assumed to be finite and are allowed to have self-loops or multiple edges. We use the term simple to refer to the case where self-loops or multiple edges are not allowed. An ordered tree is a simple rooted tree such that every vertex is equipped with a total order on its descendants. The collection of all ordered trees on $k + 1$ vertices is denoted by $\mathcal{T}_k$.
For any $n \in \mathbb{Z}_{\geq 1}$ we denote by $E_n$ the set of directed edges $\{1, \ldots, n\}^2$ corresponding to the state space $V := \{1, \ldots, n\}$. Given a directed edge $e = (i, j)$ and an $n \times n$ matrix $M$, we denote $M_e := M_{ij}$. For two vectors of integers $m, m' \in \mathbb{Z}^R$ we write $m \leq m'$ if $m_i \leq m'_i$ for all $i \in \{1, \ldots, R\}$.

3.3. Graphon theory. A graphon is an integrable map $W : [0,1]^2 \to \mathbb{R}$. The cut norm of a graphon $W$ is defined by $\|W\|_{\square} := \sup_{S,T \subseteq [0,1]} |\int_{S \times T} W(x, y)\,dx\,dy|$, where the supremum runs over all measurable subsets $S, T$ of $[0,1]$. The cut metric $\delta_{\square}$ on the space of graphons is defined by $\delta_{\square}(W, W') := \inf_{\phi} \|W - W'^{\phi}\|_{\square}$, where the infimum runs over all measure-preserving bijections $\phi : [0,1] \to [0,1]$ and $W^{\phi}(x, y) := W(\phi(x), \phi(y))$. Given a symmetric matrix $M \in \mathbb{R}^{n \times n}$ one can define a graphon $W_M$ by setting $W_M(x, y) := M_{ij}$ for $(x, y) \in [(i-1)/n, i/n) \times [(j-1)/n, j/n)$. The graphon $W_M$ can be assigned values on the boundaries $x = 1$ and $y = 1$ by extending continuously.
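The step-function construction $W_M$ can be sketched as follows (our own minimal illustration, not part of the paper):

```python
def step_graphon(M):
    """W_M(x, y) := M[i][j] on the square [(i-1)/n, i/n) x [(j-1)/n, j/n),
    with the boundaries x = 1 and y = 1 handled by continuous extension."""
    n = len(M)
    def W(x, y):
        i = min(int(x * n), n - 1)  # min(...) implements the boundary extension
        j = min(int(y * n), n - 1)
        return M[i][j]
    return W

M = [[1.0, 2.0], [2.0, 3.0]]
W = step_graphon(M)
assert W(0.25, 0.75) == 2.0  # (0.25, 0.75) lies in the block of M[0][1]
assert W(1.0, 1.0) == 3.0    # boundary value by extension
```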
3.4. Measure theory. Recall that the Stieltjes transform of a finite measure on $\mathbb{R}$, weak convergence in probability of random measures, and the symmetrization of a measure on $\mathbb{R}_{\geq 0}$ were defined in Section 1.1. For two probability measures $\mu$ and $\nu$ on a countable space $\mathcal{V}$, we define the total variation distance as $\|\mu - \nu\|_{\mathrm{TV}} := \sup_{A \subseteq \mathcal{V}} |\mu(A) - \nu(A)|$. Whenever the $k$th moment of a measure $\mu$ on $\mathbb{R}$ exists, $k \in \mathbb{Z}_{\geq 0}$, the $k$th moment will be denoted $m_k(\mu) := \int x^k\,d\mu(x)$. Note that the definition of the empirical singular value distribution of an $n \times n$ real matrix $M$ with singular values $s_1(M) \geq \ldots \geq s_n(M)$ from (2) can be rephrased as stating that $\nu_M$ is the measure $\nu_M = n^{-1}\sum_{i=1}^n \delta_{s_i(M)}$.

3.5. Compressed notation for conditional probability. Let $A, B, C, D \in \mathcal{F}$ be events in some probability space $(\Omega, \mathcal{F}, \mathbb{P})$ with $\mathbb{P}(C \cap D) \neq 0$. We will on a few occasions encounter long expressions involving the associated conditional probability. To preserve readability we may then also use the compressed notation $\mathbb{P}(A, B \mid C, D) := \mathbb{P}(A \cap B \mid C \cap D)$. Similar notation may be used for conditional expectation and unconditional probability.
3.6. Asymptotic notation. We employ the usual conventions for big-O notation. Let $(x_n)_{n=0}^{\infty}$ and $(y_n)_{n=0}^{\infty}$ be two sequences of real numbers. Then $x_n = O(y_n)$ if and only if there exist $C, n_0 > 0$ such that $|x_n| \leq C|y_n|$ for all $n \geq n_0$. Similarly, $x_n = o(y_n)$ if and only if for every $C > 0$ there exists some $n_0$ such that $|x_n| \leq C|y_n|$ for all $n \geq n_0$; and $x_n = \Omega(y_n)$ if and only if there exist $C, n_0 > 0$ such that $|x_n| \geq C|y_n|$ for all $n \geq n_0$. Finally, $x_n = \Theta(y_n)$ if and only if $x_n = O(y_n)$ as well as $x_n = \Omega(y_n)$.
If $(x_n)_{n=0}^{\infty}$ depends on some parameters $a, b$, then the possible dependence of the constants on the parameters is expressed in the notation. For example, $x_n = O_a(y_n)$ means that there exist $C, n_0 > 0$, possibly dependent on $a$ but not on $b$, such that $|x_n| \leq C|y_n|$ for all $n \geq n_0$. When the dependence is emphasized in this manner, the parameters $a, b$ are assumed not to depend on $n$.

4. Proof outline
Our proof of Theorem 1.1 and Theorem 1.2 consists of two parts: a general-purpose universality result and a reduction argument.
The first part of our proof is given in Section 4.1 where we generalize the results concerning eigenvalues of approximately uncorrelated random matrices from [40] to admit the possibility of a variance profile. The corresponding general-purpose universality results are Theorem 4.2 and Corollary 4.3. Similar results for matrices with variance profile have previously also appeared in [80, Theorem 3.2 and Theorem 3.4] under the assumption that the entries are independent. The combination of variance profiles with dependence, i.e., the notion of approximately uncorrelated random matrices with a variance profile, is new.
The second part of our proof is given in Section 4.2 and consists of a reduction to Corollary 4.3. This involves two key difficulties. First, we need to determine the variance profiles associated with the block Markovian random matrices $\hat N_X$ and $\hat P_X$. These variance profiles are established using Theorem 4.5, which states that $\hat N_{X,ij}$ is asymptotically Poisson distributed with a rate that depends only on the clusters to which $i$ and $j$ belong. Second, we need to establish that the block Markovian random matrices are in the approximately uncorrelated universality regime. To do so, we develop a coupling argument which shows that the covariance between the numbers of traversals of different edges decays sufficiently quickly; see Proposition 4.7 for the corresponding result.

4.1. Eigenvalue distributions of approximately uncorrelated random matrices with variance profile. The results in this section concern the eigenvalues of a symmetric matrix, whereas we are interested in the singular values of the matrices $\hat N_X$ and $\hat P_X$. To this end, let us remind the reader of the fact that the study of the singular values of any matrix can be reduced to the study of the eigenvalues of a symmetric matrix by a Hermitian dilation (recall (12)).
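The Hermitian dilation referred to here is, up to a possible normalizing factor in the paper's (12), the standard block construction $H(M) = \left(\begin{smallmatrix} 0 & M \\ M^{\top} & 0 \end{smallmatrix}\right)$, whose spectrum is the symmetrized set $\{\pm s_i(M)\}$ of singular values. A quick numerical check of this fact (our own sketch, using NumPy):

```python
import numpy as np

def hermitian_dilation(M):
    """H(M) = [[0, M], [M^T, 0]]: a symmetric matrix whose eigenvalues are
    the singular values of M together with their negatives."""
    n, m = M.shape
    return np.block([[np.zeros((n, n)), M], [M.T, np.zeros((m, m))]])

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
eig = np.sort(np.linalg.eigvalsh(hermitian_dilation(M)))
sv = np.linalg.svd(M, compute_uv=False)
# The spectrum of H(M) equals the symmetrization {+/- s_i(M)} of the singular values.
assert np.allclose(eig, np.sort(np.concatenate([-sv, sv])))
```

This is also why the symmetrization $\operatorname{sym}(\nu)$ appears in Theorems 1.1 and 1.2.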
Definition 4.1. A family of symmetric random matrices $(A_n)_{n=1}^{\infty}$ is said to be approximately uncorrelated with variance profile if conditions (14) and (15) hold for any non-negative integers $0 \leq r \leq R$ and $m_1, \ldots, m_R \in \mathbb{Z}_{\geq 0}$ with $m_i = 1$ for $i = 1, \ldots, r$, where the maxima run over all values of $(i_1, j_1), \ldots, (i_R, j_R) \in \{1, \ldots, n\}^2$ with $\{i_k, j_k\} \neq \{i_l, j_l\}$ for all $k \neq l$.
In order to identify a limit of the empirical eigenvalue distribution $\mu_{A_n/\sqrt{n}}$ it is necessary to assume that the variance profile has tractable asymptotic behavior. We follow the approach taken in [80, Theorem 3.2] and employ the homomorphism density. The homomorphism density of a simple graph $F = (V, E)$ on $V = \{1, \ldots, R\}$ in a symmetric matrix $M \in \mathbb{R}^{n \times n}$ is defined by
$$t(F, M) := \frac{1}{n^R} \sum_{v : V \to \{1, \ldots, n\}} \prod_{\{a, b\} \in E} M_{v(a), v(b)}.$$
The name homomorphism density may be explained by the fact that if $A$ is the adjacency matrix of a graph $G$, then $n^R t(F, A)$ counts the number of graph homomorphisms from $F$ to $G$. A detailed proof for the following result may be found in Section 6.1.
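A brute-force evaluation of the homomorphism density can be sketched as follows (our own illustration; the sum has $n^R$ terms, so this is only usable for tiny graphs):

```python
from itertools import product

def hom_density(edges, R, M):
    """t(F, M): average over all maps v: {0,...,R-1} -> {0,...,n-1} of the
    product of M[v(a)][v(b)] over the edges {a, b} of F."""
    n = len(M)
    total = 0.0
    for v in product(range(n), repeat=R):
        w = 1.0
        for a, b in edges:
            w *= M[v[a]][v[b]]
        total += w
    return total / n ** R

# For the one-edge tree, t(F, M) is simply the average entry of M.
M = [[0.0, 1.0], [1.0, 0.0]]
assert hom_density([(0, 1)], 2, M) == 0.5
```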
Theorem 4.2. Let $(A_n)_{n=1}^{\infty}$ be a family of symmetric random matrices which are approximately uncorrelated with variance profile $S_n := (\operatorname{Var}[A_{n,ij}])_{i,j=1}^n$. Assume that for every ordered tree $T \in \cup_{m=0}^{\infty} \mathcal{T}_m$ it holds that $t(T, S_n)$ has a limit as $n \to \infty$. Then, the empirical eigenvalue distribution $\mu_{A_n/\sqrt{n}}$ converges weakly in probability to the unique probability measure $\mu$ whose moments are given by $m_{2m}(\mu) = \lim_{n \to \infty} \sum_{T \in \mathcal{T}_m} t(T, S_n)$ and $m_{2m+1}(\mu) = 0$ for every $m \in \mathbb{Z}_{\geq 0}$. Moreover, $\mu$ is compactly supported.
Proof sketch. Just as in the classical moment method, the key step is to show that
$$n^{-1-k/2}\,\mathbb{E}[\operatorname{Tr}(A_n^k)] = n^{-1-k/2} \sum_{i_1, \ldots, i_k = 1}^n \mathbb{E}[A_{n,i_1 i_2} \cdots A_{n,i_k i_1}] \to m_k(\mu) \quad (19)$$
for every $k \in \mathbb{Z}_{\geq 0}$. Given a sequence of indices $i := (i_1, \ldots, i_k, i_1)$ which occurs on the right-hand side of (19), let $P(i) := \mathbb{E}[A_{n,i_1 i_2} A_{n,i_2 i_3} \cdots A_{n,i_k i_1}]$ and let $G_i$ be the graph whose vertices are the distinct indices occurring in $i$ and whose edges are the pairs $\{i_t, i_{t+1}\}$. Viewing $i$ as a cycle on $G_i$, let $r_1(i)$ be the number of edges which are traversed exactly once and let $r_2(i)$ be the number of edges which are traversed exactly twice. Note that we could in principle also define $r_3(i), r_4(i), \ldots$, but these quantities will not be relevant in the proof.
In a classical application of the moment method, one assumes that all entries $A_{n,ij}$ with $i \leq j$ are independent and centered. Under such assumptions, it immediately follows that $P(i) = 0$ whenever $r_1(i) > 0$. The entries of $A_n$ are however not independent in our case. It may thus be that $P(i) \neq 0$. Instead, one has to rely on part (14) of the definition of an approximately uncorrelated random matrix to deduce that $P(i)$ is small whenever $r_1(i)$ is large. Combined with a bound on the number of terms with $r_1(i) = r$, which is stated in Lemma 6.4 and established in [40], this is still sufficient to argue that the contribution of the terms with $r_1(i) > 0$ is asymptotically negligible.
When $k = 2m + 1$ is odd, the number of terms with $r_1(i) = 0$ in (19) is of a smaller order than the normalizing factor $n^{-1-k/2}$. This yields that $n^{-1-k/2}\,\mathbb{E}[\operatorname{Tr}(A_n^k)] = o_k(1)$ for all odd values of $k$. When $k = 2m$ is even, the asymptotics are dominated by the contribution of those $P(i)$ for which $G_i$ is a tree and $r_2(i) = k/2$. This leads to the conclusion that $n^{-1-k/2}\,\mathbb{E}[\operatorname{Tr}(A_n^k)] = \sum_{T \in \mathcal{T}_m} t(T, S_n) + o_k(1)$.
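The bookkeeping in this proof sketch is easy to make concrete. The helper below (our own illustration) counts the traversals of the edges of $G_i$ along a closed index sequence and returns $r_1$ and $r_2$:

```python
from collections import Counter

def r1_r2(i):
    """Given a closed index sequence i = (i_1, ..., i_k, i_1), count how often
    each undirected edge of G_i is traversed and return (r_1, r_2): the number
    of edges traversed exactly once and exactly twice."""
    counts = Counter(frozenset((i[t], i[t + 1])) for t in range(len(i) - 1))
    r1 = sum(1 for c in counts.values() if c == 1)
    r2 = sum(1 for c in counts.values() if c == 2)
    return r1, r2

# A back-and-forth traversal of a path-shaped tree has r_1 = 0 and r_2 = k/2.
assert r1_r2([1, 2, 3, 2, 1]) == (0, 2)
# A triangle cycle traverses each of its three edges exactly once.
assert r1_r2([1, 2, 3, 1]) == (3, 0)
```

The dominant even-moment terms are exactly the sequences of the first kind: those whose $G_i$ is a tree with every edge crossed twice.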
Note that Theorem 4.2 also applies to random matrices with independent entries. Correspondingly, the limit µ may be explicitly identified whenever it is known for independent random matrices with the same variance profile. The following corollary is an instance of this principle and uses the description provided in [80,Theorem 3.4] for eigenvalues of matrices with independent entries. The details for the reduction argument are provided in Section 6.2.
Corollary 4.3. Let $(A_n)_{n=1}^{\infty}$ be a family of symmetric random matrices which are approximately uncorrelated with variance profile $S_n := (\operatorname{Var}[A_{n,ij}])_{i,j=1}^n$. Assume that there exists some graphon $W \in \mathcal{W}_0$ such that $\delta_{\square}(W_{S_n}, W) \to 0$. Then, the empirical eigenvalue distribution $\mu_{A_n/\sqrt{n}}$ converges weakly in probability to the probability measure $\mu$ whose Stieltjes transform $s(z)$ is given by $s(z) = \int_0^1 a(z, x)\,dx$, where $a(z, x)$ is the unique analytic function from $\mathbb{C}^+ \times [0,1]$ to $\mathbb{C}^-$ satisfying the self-consistent equation
$$a(z, x) = \frac{1}{z - \int_0^1 W(x, y)\,a(z, y)\,dy}.$$
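Self-consistent equations of this type can be solved numerically by fixed-point iteration, in the spirit of the contraction-map methods mentioned in Section 2.5. The sketch below is our own: it discretizes a step-function graphon into $K$ blocks with weights `alpha`, and recovers the semicircle law in the constant-profile case as a sanity check.

```python
import cmath

def solve_dyson(z, S, alpha, iters=2000):
    """Fixed-point iteration for a_i(z) = 1 / (z - sum_j alpha[j] S[i][j] a_j(z)),
    a block discretization of a(z, x) = 1 / (z - int W(x, y) a(z, y) dy)."""
    K = len(alpha)
    a = [1.0 / z] * K
    for _ in range(iters):
        a = [1.0 / (z - sum(alpha[j] * S[i][j] * a[j] for j in range(K)))
             for i in range(K)]
    return a

# Constant profile: the system collapses to a(z) = 1/(z - a(z)), the Stieltjes
# transform of the semicircle law, with closed form (z - sqrt(z^2 - 4))/2.
z = 2j
a = solve_dyson(z, [[1.0, 1.0], [1.0, 1.0]], [0.5, 0.5])
assert abs(a[0] - (z - cmath.sqrt(z * z - 4)) / 2) < 1e-9
```

The limiting measure can then be recovered from $s(z)$ on a grid of real points via the Stieltjes inversion formula from Section 1.1.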
Assume that $Z$ is irreducible and acyclic so that it has an equilibrium distribution $\Pi_Z$. Then, $Z$ is said to start in equilibrium if it has initial distribution $\Pi_Z$. Let $\hat D_X$ denote the $n \times n$ diagonal matrix whose $i$th diagonal value is the sum of the values on the $i$th row of $\hat N_X$: $\hat D_{X,ii} := \sum_{j=1}^n \hat N_{X,ij}$. Observe that we can write the definition of the empirical transition matrix in (3) as $\hat P_X = \hat D_X^{-1} \hat N_X$. The following lemma, whose proof is provided in Section 6.3.1 based on perturbative arguments, allows us to make the following two reductions. First, we may recenter $\hat N_X$ and pretend as if $\hat D_X$ is a deterministic matrix. Second, we may assume that $X$ starts in equilibrium. (i) Assume that $\nu_{M_X/\sqrt{n}}$ converges weakly in probability to a probability measure $\nu$. Then, $\nu_{\hat N_Y/\sqrt{n}}$ converges weakly in probability to $\nu$.
(ii) Assume that $\nu_{\sqrt{n}Q_X}$ converges weakly in probability to a probability measure $\nu$. Then, $\nu_{\sqrt{n}\hat P_Y}$ converges weakly in probability to $\nu$.
The strategy is now to apply Corollary 4.3 to the Hermitian dilations of the matrices $M_X$ and $nQ_X$, where $X$ is a block Markov chain which starts in equilibrium.

4.2.2. Determination of the variance profile. The limiting variance profile of $\hat N_X$ can be established by a direct calculation which shows that the covariance between the different terms of $\hat N_{X,e} = \sum_{t=1}^{\ell} \mathbf{1}\{E_{X,t} = e\}$ is asymptotically negligible. We instead take a different route: one that yields the stronger claim that $\hat N_{X,e}$ satisfies a Poisson limit theorem and is conceptually more satisfying.
Theorem 4.5. Assume that $X$ starts in equilibrium. Fix some $k_1, k_2 \in \{1, \ldots, K\}$ with $p_{k_1,k_2} > 0$ and let $(e_n)_{n=1}^{\infty}$ be a sequence of directed edges with $e_n \in V_{k_1} \times V_{k_2}$ for all $n$. Then $\hat N_{X,e_n}$ converges in distribution to a Poisson distribution with rate $\lambda \pi(k_1) p_{k_1,k_2} \alpha_{k_1}^{-1} \alpha_{k_2}^{-1}$.

A proof is provided in Section 6.3.2, where one can also find a nonasymptotic Theorem 6.11 which gives a precise upper bound on the total variation distance of $\hat N_{X,e_n}$ to a Poisson distribution. The proof relies on a reduction to a general Poisson approximation theorem for sums of dependent random variables from [6].
Let us remark that the proof of Theorem 4.5 may also be used to derive a Poisson limit theorem in scaling regimes different from the running assumption $\ell = \Theta(n^2)$ and $\#V_k = \Theta(n)$. More precisely, a Poisson limit theorem holds whenever $\ell$ and $\#V_{k_1} \times \#V_{k_2}$ tend to infinity in such a fashion that $\ell(\#V_{k_1}\#V_{k_2})^{-1}$ converges to a nonzero constant. For example, $\hat N_{X,e_n}$ also satisfies a Poisson limit theorem in a block Markov chain with two clusters of size $\#V_1 = \Theta(1)$ and $\#V_2 = \ell = \omega(1)$, respectively.
The variance profile of $\hat N_X$ now follows by a tightness argument which is provided in Section 6.3.3.
Corollary 4.6. Let $e_n$ be as in Theorem 4.5 and assume that $X$ starts in equilibrium. Then, as $n$ tends to infinity, it holds that $\operatorname{Var}[\hat N_{X,e_n}] \to \lambda \pi(k_1) p_{k_1,k_2} \alpha_{k_1}^{-1} \alpha_{k_2}^{-1}$.

Graphon limits for the variance profiles of $\sqrt{2}H(M_X)$ and $\sqrt{2}H(nQ_X)$ are immediate from Corollary 4.6. To be precise, by (10) and (12) the variance profile of $\sqrt{2}H(M_X)$ converges with respect to the cut metric (9) to a step graphon whose value on the block $[c_i, c_{i+1}) \times [c_j, c_{j+1})$ is determined by the limiting variances in Corollary 4.6. Here, $c_i := \sum_{k=1}^{i-1} \alpha_k$ and $i, j \in \{1, \ldots, K\}$. Similarly, note that $\Pi_X(v) = \pi(\sigma(v))/(n\alpha_{\sigma(v)}) + o(1)$, so that the variance profile of $\sqrt{2}H(nQ_X)$ converges to the graphon $W_Q$ specified by the corresponding limiting variance profile.

4.2.3. Approximately uncorrelated. It remains to show that $H(M_X)$ and $H(nQ_X)$ are approximately uncorrelated with variance profiles. In fact, since $nQ_X$ is derived from $M_X$ by rescaling with $n\operatorname{diag}((\ell + 1)\Pi_X)^{-1}$, which is a deterministic diagonal matrix with entries of size $\Theta(1)$, it is sufficient to establish that $H(M_X)$ is approximately uncorrelated with variance profile.
Proposition 4.7. Assume that $X$ starts in equilibrium. Then the sequence of self-adjoint random matrices $H(M_X)$ is approximately uncorrelated with variance profile.
Recall that Definition 4.1 of approximately uncorrelated random matrices with a variance profile consists of two properties, namely (14) and (15). The proof of Proposition 4.7 given in Section 6.3.4 thus comes down to a verification of these two properties: (14) is verified in Proposition 6.15 and (15) is verified in Proposition 6.14.
The proof of Theorem 1.1 and Theorem 1.2 is then complete. Indeed, by using the Hermitian dilation in (12) and the preliminary reduction from Lemma 4.4, it is sufficient to establish limiting laws for the eigenvalues of $\sqrt{2}H(M_X)/\sqrt{2n}$ and $\sqrt{2}H(nQ_X)/\sqrt{2n}$ when $X$ starts in equilibrium. This case follows from Corollary 4.3 with the limiting variance profiles in (23) and (24).

4.2.4. Demonstration of the coupling argument. Proposition 4.7 is the most important ingredient for our results. Let us provide an example of the method of proof by establishing a special case of (14): the covariance between two entries decays at an appropriate rate.
Proposition 4.8. Assume that $X$ starts in equilibrium. Then the covariance bound (14) holds in the special case of two distinct entries, where the maximum runs over all pairs of distinct edges $e_1, e_2 \in E_n$.
Proof. The proof is split into parts. The main ideas are contained in Part 2 and Part 3. In Part 2 we observe that it is sufficient to understand how much the expectation of $\hat N_{X,e_2}$ changes when it is conditioned on a traversal of $e_1$ at some predetermined time. This effect of conditioning on a traversal is then understood by a coupling argument in Part 3.

Part 1: Preliminary reduction to K ≥ 5
We claim that there is no loss in generality in assuming that K ≥ 5. The idea is to split a cluster into pieces.
Define (V_i)_{i=K}^{K+4} to be a partition of V_K into nonempty sets. It can here be ensured that we remain in the asymptotic regime where the clusters have size Θ(n). Indeed, this for instance follows if the subdivision of V_K is taken to be into clusters of roughly equal size so that the ratio #V_i/#V_K tends to 1/5 for every i ∈ {K, . . . , K + 4}. Further define a (K + 4) × (K + 4) stochastic matrix p by p_{i,j} := (#V_j/#V_{min{K,j}}) p_{min{K,i},min{K,j}} for all i, j = 1, . . . , K + 4.
The reduction to K ≥ 5 now follows by observing that the new clusters and cluster transition matrix p define exactly the same block Markov chain as we started with. Indeed, if P_{i,j} denotes the transition probabilities of the 'new' block Markov chain, then for any i ∈ V_x and j ∈ V_y it holds that

Pick two distinct edges e_1, e_2 ∈ E_n. Recall from Section 3.1 that E_X = (E_{X,t})_{t=1}^ℓ denotes the induced Markov chain of edges E_{X,t} = (X_{t−1}, X_t). It may be assumed that e_1 is such that P(E_{X,1} = e_1) ≠ 0; otherwise M_{X,e_1} = 0 and there is nothing to prove. The clusters of the block Markov chain allow us to ensure that the relevant conditional expectation is small, because a traversal of e_2 in this small period of time is unlikely.
Note that all edges whose starting point and ending point have the same clusters as the starting point and ending point of e_1, respectively, are equally likely to be traversed at time t_1. There are at least α_min² n² such edges. Hence, P(E_{X,t_1} = e_1) ≤ α_min^{−2} n^{−2}. Considering that there are ℓ = Θ(n²) terms on the right-hand side of (28), it remains to be shown that

Part 3: Construction of coupled chains (X, Y )
Recall that we ensured that K ≥ 5. In particular there exists some k ∈ {1, . . . , K} such that V_k does not contain any endpoint of e_1 or e_2. To study the difference we construct a pair of chains (X, Y); see Figure 3 for a visualization.
(i) Sample an infinitely long path X̃ := (X̃_t)_{t=−∞}^∞ from the block Markov chain and independently sample an infinitely long path Ỹ := (Ỹ_t)_{t=−∞}^∞ from the block Markov chain conditioned on E_{Ỹ,t_1} = e_1. Note that it is possible to sample at negative times by means of a time reversal of the Markov chain. Such a time reversal exists by the assumption that the Markov chain associated with p is irreducible and acyclic.

(ii) Define

and note that T_− and T_+ are finite with probability one due to the assumption that the Markov chain associated with p is irreducible and acyclic. Let

By construction, X is a sample path from the block Markov chain whereas Y is a sample path of the block Markov chain conditioned on the event E_{Y,t_1} = e_1. Now, by the law of total expectation,

Denote the number of times e_2 was traversed by

We will establish a bound on the conditional expectation of ∆_Y by using its definition in terms of 1{E_{Y,t} = e_2}. To this end we claim that

for any t ∈ {1, . . . , ℓ}. In case t = t_1, the left-hand side of (34) is zero and there is nothing to prove. Now consider the case where t ≠ t_1. For any edge e whose starting point is equal to the starting point of e_2 and whose ending point is in the same cluster as the ending point of e_2, the traversal probabilities agree. This implies (34) whenever t > t_1 since there are at least α_min n such edges e. The case t < t_1 may be deduced similarly by reversing the roles of the ending point and the starting point of e. Combine (34) with the expression of ∆_Y in terms of these indicators. The same conclusion applies to ∆_X. Combine finally with (32) and (33) to deduce that

Recall that the transition dynamics p for σ(X_t) and σ(Y_t) are assumed to be acyclic and irreducible. It follows that P(T_+ > t) decays exponentially in t. Because there are only K² possibilities for the initial state Σ^+_{(X,Y),0}, it follows that E[T_+] ≤ B_+ for some constant B_+ ∈ R_{>0} which does not depend on e_1, e_2 or n.
A similar argument shows that

From (28) and (37) it now follows that

Observe that the right-hand side of (38) is independent of e_1 and e_2. Since ℓ = Θ(n²), this concludes the proof.
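To make the model underlying this argument concrete, the block Markov chain itself is straightforward to simulate. The sketch below is our own illustration (all helper names and conventions are ours): from a state x, a destination cluster is drawn according to the cluster transition matrix p and then a uniform state within that cluster, which reproduces the transition probability p_{σ(x),k'}/#V_{k'}.

```python
import numpy as np

def sample_block_markov_chain(n, alpha, p, length, seed=0):
    """Sample a path X_0, ..., X_length of a block Markov chain with
    cluster proportions alpha and cluster transition matrix p: from
    state x, pick cluster k' ~ p[sigma(x), :], then a uniform state
    inside V_{k'}."""
    rng = np.random.default_rng(seed)
    sizes = np.maximum(1, (np.asarray(alpha) * n).astype(int))
    bounds = np.concatenate([[0], np.cumsum(sizes)])
    sigma = np.repeat(np.arange(len(alpha)), sizes)  # cluster of each state
    path = np.empty(length + 1, dtype=int)
    path[0] = rng.integers(bounds[-1])
    for t in range(length):
        k = rng.choice(len(alpha), p=p[sigma[path[t]]])
        path[t + 1] = rng.integers(bounds[k], bounds[k + 1])
    return path, sigma
```

Counting traversals of a fixed edge along such a path gives the entries of the empirical frequency matrix studied in the proposition above.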

Numerical experiment on Manhattan taxi trips
We will now demonstrate that Theorem 1.1 and Theorem 1.2 can give nontrivial predictions for the singular value distributions on an actual dataset. Specifically, we will analyze the first six months of 2016 in the New York City yellow cab dataset [57]. Each datapoint contains the pick-up and drop-off location of one trip. Here, pick-up locations are typically close to drop-off locations so that the dataset may be modelled as a fragmented sample path of a Markov chain.
Spectral clusterings using the Markovian structure of this dataset have previously been analyzed in [78]. Our preprocessing is similar to that of [78] and is as follows. The map is subdivided into a fine grid and we trim all states which have been visited fewer than 200 times. We further remove all self-transitions. This results in a state space of size n = 4486 with ℓ = 55 × 10⁶ transitions.
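The trimming step just described can be sketched as follows. This is a simplified illustration of ours, not the paper's code; the grid construction is omitted and the helper names are our own. In practice the trimming may need to be iterated, since removing states lowers the visit counts of their neighbours.

```python
import numpy as np

def preprocess(transitions, min_visits=200):
    """Drop self-transitions, trim states visited fewer than min_visits
    times, and relabel the surviving states as 0, ..., n-1."""
    pairs = [(s, t) for s, t in transitions if s != t]
    states, counts = np.unique(np.array(pairs).ravel(), return_counts=True)
    keep = set(states[counts >= min_visits])
    kept = [(s, t) for s, t in pairs if s in keep and t in keep]
    relabel = {s: i for i, s in enumerate(sorted({x for e in kept for x in e}))}
    return [(relabel[s], relabel[t]) for s, t in kept]
```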
Note that ℓ/n² ≈ 2.7. This empirical observation allows us to make a relevant remark concerning our theoretical assumption ℓ = Θ(n²). Some readers may be familiar with the literature on random graphs, where it is an empirical observation that most real-world graphs are sparse. This sparsity is correspondingly a key difficulty which one has to contend with in the setting of random graphs. Sparsity is not irrelevant in the setting of sequential data, but it has a different meaning; it relates to the amount of time that the process was observed. As opposed to random graphs, it is not unusual to encounter dense sequential data in the real world. The key novel difficulty is rather that sequential data exhibits dependence. This is precisely the difficulty which our proofs address; recall the coupling argument in the proof of Proposition 4.8.
A clustering (V_k)_{k=1}^K is found by applying both steps of the algorithm in [66] with K = 4 clusters; the result is displayed on the left-hand side of Figure 4. Having obtained these clusters we may estimate the parameters (such as λ̂) of the block Markov chain. These parameters may be substituted in Theorem 1.1 and Theorem 1.2 to yield predictions for the singular value distributions of N̂_X and P̂_X. The theoretical predictions and the empirical observations are displayed in Figure 4. For comparison, we have also displayed the quarter circle law, which is the universal law for the singular values of a random matrix with independent entries of identical variance. In other words, the quarter circle law is the prediction corresponding to K = 1.
Taking into account the fact that we used just K = 4 clusters, we conclude that the predictions match the shape of the singular value distributions fairly well. Observe that the quarter circle law does not even match the general shape of the distributions: the quarter circle law has a concave density whereas the observed empirical distributions have convex densities.
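For reference, the quarter circle law used in this comparison has an explicit density. The sketch below (our own normalization, for an n × n matrix with iid entries of variance 1/n) makes the concavity referred to above easy to inspect numerically.

```python
import numpy as np

def quarter_circle_density(x):
    """Density sqrt(4 - x^2)/pi on [0, 2]: the limiting singular value
    distribution of an n x n matrix with iid entries of variance 1/n."""
    x = np.asarray(x, dtype=float)
    return np.where((x >= 0) & (x <= 2),
                    np.sqrt(np.clip(4.0 - x * x, 0.0, None)) / np.pi,
                    0.0)
```

Evaluating the second differences of this density on a grid confirms that it is concave on (0, 2), in contrast with the convex empirical densities observed for the taxi data.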
On the right: √n P̂_X and its singular values compared to our theoretical predictions and the quarter-circle law.

This is to say that (A_n)_{n=1}^∞ is a family of symmetric n × n random matrices which are approximately uncorrelated with variance profile. Recall that for any fixed ordered tree T ∈ T_k it is assumed that t(T, S_n) has a limit as n → ∞, where S_n = (Var(A_{n,ij}))_{i,j=1}^n denotes the variance profile of A_n. The proof of Theorem 4.2 comes down to a modification of the proof in [40] to include the variance profile. Let us start by including some background on the moment method. These results are well known and included for the reader's convenience.
The following lemma is implicit in the proof of the Wigner semicircle law.

Lemma 6.1. Let (µ_n)_{n=1}^∞ be a sequence of random probability measures on R and let µ be a deterministic and compactly supported probability measure on R. If, for every k ∈ Z_{≥0}, m_k(µ_n) converges to m_k(µ) in probability, then µ_n converges weakly in probability to µ.
The following result is moreover a direct consequence of the Stone-Weierstrass theorem and the Riesz representation theorem [65, Theorem 2.14].
Lemma 6.2. For any compactly supported probability measure µ it holds that the sequence of moments (m_k(µ))_{k=0}^∞ satisfies the following properties: (i) It holds that m_0(µ) = 1.
(ii) The Hankel matrix (m_{i+j}(µ))_{i,j=0}^k is positive semi-definite for every k ∈ Z_{≥0}.
(iii) There exists a constant c ∈ R_{>0} such that |m_k(µ)| ≤ c^k for all k ∈ Z_{≥0}.
Moreover, for any sequence of real numbers (m_k)_{k=0}^∞ satisfying these properties there exists a unique probability measure µ with (m_k(µ))_{k=0}^∞ = (m_k)_{k=0}^∞, and this measure µ is compactly supported.

Lemma 6.3. There exists a unique probability measure µ whose moments are given by

for every m ∈ Z_{≥0}. Moreover, µ is compactly supported.
Proof. The existence of such a probability measure µ is known [80, Theorem 3.2]. However, it is not explicitly stated in [80] that µ is compactly supported so, for the sake of completeness, let us provide an argument based on Lemma 6.2. The condition that m_0(µ) = 1 is satisfied since µ is a probability measure. Further, the positive semi-definiteness of the Hankel matrix (m_{i+j}(µ))_{i,j=0}^k for every k ∈ Z_{≥0} is equivalent to the trivial statement that ∫ p(x)² dµ(x) ≥ 0 for every polynomial p. It remains to show that the moments of µ are exponentially bounded. It follows from (14) in the definition of an approximately uncorrelated random matrix with a variance profile that max_{i,j=1,...,n} S_{n,ij} ≤ c_1. Therefore, using the definition of homomorphism densities in (16), it holds for every m ∈ Z_{≥0} and any ordered tree T ∈ T_m that

It holds that #T_m = C_m where C_m is the mth Catalan number, and it is known that there exists some c_2 ∈ R_{>0} such that C_m ≤ c_2^m for all m ∈ Z_{≥0}. Hence,

for some constant c_3 ∈ R_{>0} and all m ≥ 1. Equation (42) provides an exponential bound on the rate of growth of the even moments m_{2m}(µ). Further, recall that we already know that m_0(µ) = 1 and m_{2m+1}(µ) = 0. It follows that |m_k(µ)| ≤ c_3^k for all k ∈ Z_{≥0}. Conclude by Lemma 6.2 that µ is compactly supported and unique.

We adopt the same notation as in the sketch of Theorem 4.2. This is to say that given integers i := (i_1, . . . , i_k, i_1) with i_j ∈ {1, . . . , n} for every j = 1, . . . , k, we denote by G_i the induced undirected graph with vertex set V(i). Viewing i as a cycle on G_i, we let r_1(i) be the number of edges which are traversed exactly once and r_2(i) the number of edges which are traversed exactly twice. The following combinatorial result will be essential.

Lemma 6.4. With i as above it holds that #V(i) ≤ (k + r_1(i))/2 + 1, with strict inequality whenever r_1(i) > 0.

Figure 5. Visualization of a sequence of integers i for which P(i) has an asymptotically relevant contribution to (45). The corresponding undirected graph G_i is the tree found by identifying the doubled edges.
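The counting facts used in this part of the argument can be verified directly: the number of ordered trees with m edges is the Catalan number C_m = binom(2m, m)/(m + 1), and C_m ≤ 4^m gives the exponential bound on #T_m. A small sketch of ours:

```python
from math import comb

def catalan(m):
    """m-th Catalan number: counts ordered (plane) trees with m edges,
    and equals the 2m-th moment of the standard semicircle law."""
    return comb(2 * m, m) // (m + 1)

# exponential bound on the number of ordered trees: C_m <= 4^m
assert all(catalan(m) <= 4 ** m for m in range(30))
```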
Proof. Let k ∈ Z_{≥1} and recall the expansion for E[m_k(µ_{A_n/√n})] in (19). Let i := (i_1, . . . , i_k, i_1) be a sequence of integers as occurs in the right-hand side of (19). By property (14) in the definition of an approximately uncorrelated random matrix it holds that P(i) = O_k(n^{−r/2}) whenever r_1(i) = r. By the part concerning the strict inequality in Lemma 6.4, for any r ∈ Z_{>0} there are o_k(n^{(k+r)/2+1}) terms in the right-hand side of (19) with r_1(i) = r. This means that only the terms with r_1(i) = 0 survive the normalization by n^{−1−k/2}:

Note that G_i is connected. Hence, for any i with r_1(i) = 0 it has to hold that #V(i) ≤ k/2 + 1, with equality if and only if k is even, G_i is a tree and r_2(i) = k/2. In particular, for k odd there are o_k(n^{1+k/2}) terms on the right-hand side of (45). Hence, for any m ∈ Z_{≥0}, taking k = 2m + 1 yields that

Now let k = 2m be even. As above, the contribution of the terms for which G_i is not a tree or r_2(i) ≠ m is asymptotically negligible. Consider some sequence of indices i = (i_1, . . . , i_k, i_1) for which G_i is a tree and r_2(i) = m; an example of such a sequence is depicted in Figure 5. Equip G_i with the unique order such that i is the path traversed by depth-first search. Now,

where the isomorphism condition is to be considered in the space of ordered trees. The ordering on any T ∈ T_m induces a canonical numbering of the vertex set corresponding to the order of visits in depth-first search. This is to say that it can be assumed that V(T) = {1, . . . , m + 1}. Note that (16) was used in the second step. By (47) it now follows that for every

Combine (46) and (50) to conclude the proof.
A resulting sequence is schematically depicted in Figure 6. By definition of approximately uncorrelated it holds that P(i, j) = O_k(n^{−r/2}) whenever r_1((i, j)) = r. Further, it follows from Lemma 6.4 that #(V(i) ∪ V(j)) ≤ (2k + r_1((i, j)))/2 + 1. Hence, for any r ∈ Z_{≥0} there are O_k(n^{(2k+r)/2+1}) terms with V(i) ∩ V(j) ≠ ∅ and r_1((i, j)) = r. This implies that their contribution is asymptotically negligible:

The arguments to deal with the terms with V(i) ∩ V(j) = ∅ are identical to those used in the proof of Lemma 6.5. In particular, it can be established that

where it is used that the contribution of those pairs of trees G_i, G_j which were removed by restricting our attention to the case V(i) ∩ V(j) = ∅ is asymptotically negligible, due to the fact that there are only O_k(n^{k+1}) values of i, j with V(i) ∩ V(j) ≠ ∅ and r_2(i) = r_2(j) = k/2.
Combine (55), (56), and the limit established in Lemma 6.5 for the expected moments to conclude the proof.

Proof of Theorem 4.2. By Lemma 6.3 there exists a compactly supported probability measure µ with the specified moments. The result is now immediate by Lemma 6.1, whose assumptions were verified in Lemma 6.5 and Lemma 6.6.

Proof of Corollary 4.3.
Proof. Recall that it is assumed that δ (W Sn , W ) → 0. Then, by [51,Theorem 11.5], it holds that t(F, W Sn ) converges as n → ∞ for every fixed tree. Consequently, by Theorem 4.2, it holds that µ An/ √ n converges weakly in probability to some limit µ and it remains to show that the Stieltjes transform of µ is given by (21).
To this end remark that Theorem 4.2 also applies to random matrices with independent entries. Consequently, if we consider a sequence of random matrices B_n with independent entries and the same variance profile as A_n, then µ_{B_n/√n} also converges weakly in probability to µ. The function in (21) with codomain C_− also admits a Laurent series. Let us warn here that the convention for the Stieltjes transform which was used in [3] differs by a minus sign from the convention used here and in [80]. One should correspondingly take m_x(z) = −a(z, x) when applying [3, Theorem 2.1]. This difference in sign also explains why the codomain in [3] is C_+ whereas we claim that the appropriate codomain is C_−. Now observe that a Laurent series of the form Σ_{k=1}^∞ c_k(x) z^{−k} satisfies (21) if and only if c_1(x) = 1, c_2(x) = 0, and

This implies that the coefficients of the Laurent series of a(z, x) and ã(z, x) have to be equal since both functions satisfy (21). It follows that a(z, x) and ã(z, x) are equal in a neighborhood of z = ∞ and consequently everywhere on C_+ by the identity theorem for analytic functions. This shows that the function a(z, x) provided in [80, Equation (3.4)] indeed has codomain C_−.

Lemma 6.7. Let A_n be a sequence of random n × n matrices such that ν_{A_n} converges weakly in probability to some probability measure ν.
(i) Let B_n be a sequence of random n × n matrices such that (1/n)‖B_n‖_F² converges to 0 in probability. Then ν_{A_n+B_n} converges weakly in probability to ν.
(ii) Let C_n be a sequence of random symmetric n × n matrices such that (1/n) rank(C_n) converges to 0 in probability. Then ν_{A_n+C_n} converges weakly in probability to ν.
(iii) Let D_n be a sequence of random diagonal n × n matrices such that ‖D_n − Id‖_op converges to 0 in probability. Then ν_{D_nA_n} converges weakly in probability to ν.
Proof. Statement (i) follows after a Hermitian dilation (12). For statement (iii) it suffices to establish the convergence of ∫ f dν_{D_nA_n} for every continuous bounded function f ∈ C_b(R). Since ν and ν_{D_nA_n} are probability measures we may further restrict ourselves to the case where f is compactly supported. This follows by consideration of f g with g a bump function; see e.g. [45, Proof of Lemma 6.21]. Let c_f ∈ R_{>0} be a sufficiently large constant so that supp(f) ⊆ [−c_f, c_f]. Since f is compactly supported and continuous, it is uniformly continuous. Therefore there exists some δ > 0 such that |f(x) − f(y)| < ε/2 whenever |x − y| < δ.
Observe that ‖D_n‖_op ≤ 1 + ‖D_n − Id‖_op and similarly for ‖D_n^{−1}‖_op. This is equivalent to saying that

Since ‖D_n − Id‖_op converges to zero in probability it holds that

Deduce from (59) that, whenever the event contained in the left-hand side of (60) holds, it follows that σ_i(D_nA_n) > c_f for any i with σ_i(A_n) > 2c_f and that |σ_i(D_nA_n) − σ_i(A_n)| < δ for any i with σ_i(A_n) ≤ 2c_f. Therefore, whenever the event contained in the left-hand side of (60) holds, it follows that

We used here that supp(f) ⊆ [−c_f, c_f]. Combine (60) with (61) and the assumption that ν_{A_n} converges to ν weakly in probability to conclude that (58) holds. This concludes the proof.
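The mechanism behind statement (ii) of Lemma 6.7 can be probed numerically: by Weyl's inequality, a rank-one additive perturbation interlaces with the original singular values (s_{i+1}(A + C) ≤ s_i(A) and s_{i+1}(A) ≤ s_i(A + C)), so it can move only a vanishing fraction of the empirical distribution. A sketch of ours:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
A = rng.standard_normal((n, n)) / np.sqrt(n)
C = np.outer(rng.standard_normal(n), rng.standard_normal(n))  # rank one

a = np.linalg.svd(A, compute_uv=False)        # descending order
b = np.linalg.svd(A + C, compute_uv=False)

# interlacing: each spectrum dominates the other shifted by one index
assert np.all(b[1:] <= a[:-1] + 1e-10)
assert np.all(a[1:] <= b[:-1] + 1e-10)
```

Consequently the counting measures of the two singular value families differ by at most one at any threshold, which is negligible after normalization by n.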
We will ultimately use Lemma 6.7(iii) with D_n := D̂_X^{−1} diag((ℓ + 1)Π_X) to replace D̂_X by a deterministic matrix. This requires that D̂_X ≈ diag((ℓ + 1)Π_X), which will be shown by means of a concentration inequality that follows from the short mixing times of block Markov chains.
Let Z be a Markov chain on the state space V which is irreducible and acyclic. For any ε ∈ [0, 1) the ε-mixing time of Z is defined as t_mix^Z(ε) := min{t ∈ Z_{≥0} : d(t) ≤ ε}, where

and d_TV denotes the total variation distance defined in (11). Set t_mix^Z := t_mix^Z(1/4). Observe that since X is a block Markov chain it holds that t_mix^X ≤ max{t_mix^{Σ_X}, 1}, where Σ_{X,t} := σ(X_t) is the induced Markov chain on the clusters {1, . . . , K}. Observe furthermore that the dynamics of Σ_X are independent of n by definition of a block Markov chain, so that t_mix^{Σ_X} is a constant. Thus t_mix^X = O(1). We will rely on a concentration inequality from [59] which we reproduce here for the reader's convenience. Similar proofs for concentration in block Markov chains using this concentration inequality may be found in [66, 67].
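For a small finite chain the ε-mixing time can be computed directly from matrix powers. The sketch below is our own helper and assumes the usual worst-case definition d(t) = max_x d_TV(P^t(x, ·), Π):

```python
import numpy as np

def tv_mixing_time(P, pi, eps=0.25, t_max=1000):
    """Smallest t with max_x d_TV(P^t(x, .), pi) <= eps, where
    d_TV(mu, nu) = (1/2) * sum_y |mu(y) - nu(y)|."""
    P = np.asarray(P, dtype=float)
    Pt = np.eye(len(pi))
    for t in range(1, t_max + 1):
        Pt = Pt @ P
        if 0.5 * np.abs(Pt - np.asarray(pi)).sum(axis=1).max() <= eps:
            return t
    raise RuntimeError("chain did not mix within t_max steps")
```

For a block Markov chain one would apply this to the K × K cluster chain, whose mixing time does not depend on n.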
The concentration inequality is provided in terms of an invariant γ_ps^Z called the pseudo-spectral gap:

Here P_Z denotes the transition matrix of the Markov chain Z and P_Z^T is the transpose of this matrix. The pseudo-spectral gap is closely related to the mixing time by [59, Proposition 3.4], whose assumptions are satisfied because Z, as an irreducible and acyclic chain on a finite state space, is trivially uniformly ergodic.
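The pseudo-spectral gap of a small chain can likewise be computed numerically. The sketch below is our own and follows Paulin's definition γ_ps = max_{k≥1} γ((P*)^k P^k)/k, where P* denotes the time reversal (adjoint with respect to the stationary distribution) and γ(·) the spectral gap 1 − λ_2; the truncation of the maximum at k_max is an assumption of this toy implementation.

```python
import numpy as np

def pseudo_spectral_gap(P, pi, k_max=50):
    """gamma_ps = max_k gamma((P*)^k P^k) / k, with P*(x, y) =
    pi(y) P(y, x) / pi(x) and gamma(M) = 1 - lambda_2(M)."""
    P, pi = np.asarray(P, float), np.asarray(pi, float)
    Pstar = (P * pi[:, None]).T / pi[:, None]  # time reversal of P
    best, Pk, Psk = 0.0, np.eye(len(pi)), np.eye(len(pi))
    for k in range(1, k_max + 1):
        Pk, Psk = Pk @ P, Psk @ Pstar
        lam2 = np.sort(np.real(np.linalg.eigvals(Psk @ Pk)))[-2]
        best = max(best, (1.0 - lam2) / k)
    return best
```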
For any function f : V → R we write E_{Π_Z}[f] and Var_{Π_Z}[f] for the expectation and variance of the random variable f(Z_0), respectively, where Z_0 has distribution Π_Z. The following proposition occurs in [59] in a more general setting with possibly infinite state spaces. We state the result here for the reader's convenience since it will be used multiple times.
for all r ∈ R_{≥0}.

Lemma 6.9. Let X := (X_t)_{t=0}^ℓ be a sample path of the block Markov chain with an arbitrary initial distribution. Then there exist constants c_4, c_5 ∈ R_{>0}, not depending on the initial distribution, such that

for all i = 1, . . . , n.
Observe that since the Markov chain associated with p is assumed to be irreducible and acyclic, it follows that P(T > t) ≤ c_6 exp(−c_7 t) for some c_6, c_7 ∈ R_{>0} which are independent of n and the initial state X_0. Write S_1 and S_2 for the two sums in this decomposition. We may apply Proposition 6.8 to derive a concentration inequality for S_2.
Observe that

for some c_8 ∈ R_{>0}. Recall that t_mix^X = O(1). Correspondingly, by (64), there exists some constant c_9 ∈ R_{>0} such that γ_ps^X ≥ c_9. By Proposition 6.8 and the fact that ℓ = Θ(n²), we find for any r ∈ R_{>0} that

≤ c_10 exp(−c_11 r).
Further, since T has exponential decay and f ≤ 1, there exist constants c_12, c_13 ∈ R_{>0} such that

The desired result now follows by the triangle inequality and the fact that

The following corollary is immediate by the union bound and the fact that (ℓ + 1)Π_X has entries of size Θ(n).

Corollary 6.10. Let X := (X_t)_{t=0}^ℓ be a sample path from the block Markov chain with an arbitrary initial distribution. Then ‖diag((ℓ + 1)Π_X)^{−1} D̂_X − Id‖_op converges to zero almost surely as n tends to infinity. In particular, D̂_X is asymptotically almost surely invertible and P̂_X thus well-defined.
Let us now proceed to the proof of Lemma 4.4. Recall that Lemma 4.4 concerns a reduction to centered random matrices when starting the chain from equilibrium.
Proof of Lemma 4.4. Let X̃ := (X̃_t)_{t=0}^∞ and Ỹ := (Ỹ_t)_{t=0}^∞ denote infinitely long sample paths of the block Markov chain, where X̃ starts in equilibrium and Ỹ has some arbitrary initial distribution. Define

and observe that since the Markov chain associated with p is irreducible and acyclic, it holds that P(T > t) ≤ c_14 exp(−c_15 t) for some constants c_14, c_15 ∈ R_{>0} which are independent of n. Let X := (X_t)_{t=0}^ℓ be the path defined by X_t = X̃_t for all t < T and X_t = Ỹ_t for all t ≥ T. Further, let Y := (Ỹ_t)_{t=0}^ℓ be the truncation of Ỹ to length ℓ + 1. Observe that X is a sample path from the block Markov chain starting in equilibrium whereas Y is a sample path from the block Markov chain with an arbitrary initial distribution. The desired results concern the singular value distributions associated with Y. Set

and observe that N̂_Y/√n = M_X/√n + B_n + C_n almost surely. The first claim now follows by Lemma 6.7 since ‖B_n‖_F²/n ≤ (2T/√n)²/n converges to zero in probability and rank(C_n) ≤ K.
For the second claim set D_n := D̂_Y^{−1} diag((ℓ + 1)Π_X) and observe the corresponding decomposition for Y. By Corollary 6.10 and the continuity of M ↦ M^{−1} in a neighborhood of Id, it holds that ‖D_n − Id‖_op converges to zero in probability. Hence, by Lemma 6.7(iii) it is sufficient to establish that the singular value distribution of √n diag((ℓ + 1)Π_X)^{−1} N̂_Y has the desired weak limit in probability. In this regard observe that, with notation as in (73),

= √n Q_X + n diag((ℓ + 1)Π_X)^{−1} B_n + n diag((ℓ + 1)Π_X)^{−1} C_n.

Here, it holds that

Hence, since we already know that ‖B_n‖_F²/n converges to zero in probability and since ‖n diag((ℓ + 1)Π_X)^{−1}‖_op = Θ(1), it follows that ‖n diag((ℓ + 1)Π_X)^{−1} B_n‖_F²/n converges to zero in probability. Furthermore, rank(n diag((ℓ + 1)Π_X)^{−1} C_n) ≤ K. Apply Lemma 6.7 to (74) to conclude the proof.
6.3.2. Proof of the Poisson limit Theorem 4.5. We will extract Theorem 4.5 from a nonasymptotic result. For such a nonasymptotic result one has to precisely quantify the mixing behavior of the chain of clusters Σ_X = (σ(X_t))_{t=0}^ℓ. For our purposes this is most naturally done in terms of the relative pointwise distance. Let Z be a Markov chain on the state space V which is irreducible and acyclic. The relative pointwise distance ∆_Z(r) after r ∈ Z_{≥1} steps is given by

Note that ∆_Z(r) is related to the quantity d(r) from (62) which was used to define the mixing time. Indeed, a direct calculation with the definitions shows that

Theorem 6.11. Let X = (X_t)_{t=0}^ℓ be a sample path from a block Markov chain which starts in equilibrium. Pick some ε ∈ [0, 1/2] and r_0 ∈ Z_{≥1} such that ∆_{Σ_X}(r) ≤ ε for all r ≥ r_0. Then, for any k_1, k_2 ∈ {1, . . . , K} and e ∈ V_{k_1} × V_{k_2} which is not a self-loop,

and for any self-loop e ∈ V_{k_1} × V_{k_1},

Proof. The proof consists of the following parts. Part 1 introduces notation which is used in [6] to quantify local and long-range dependence. The parameters quantifying the local dependence are estimated in Part 2. Finally, the parameter quantifying long-range dependence is estimated in Part 3.

Part 1: Notation for local and long-range dependence
For every t ∈ {1, . . . , ℓ} let B_{r_0}(t) := {t' ∈ {1, . . . , ℓ} : |t − t'| ≤ r_0} and consider the following parameters which quantify local and long-range dependencies: p_{t,t'} := P(E_{X,t} = e, E_{X,t'} = e),

Applying [6, Theorem 1] to the sum of dependent random variables N̂_{X,e} = Σ_{t=1}^ℓ 1{E_{X,t} = e} yields that

where

Here we used that the definition in [6] of the total variation distance differs from our definition (11) by a factor of two. Indeed, observe that sup_{A⊆Z_{≥0}} |µ(A) − ν(A)| = (1/2) Σ_{x∈Z_{≥0}} |µ(x) − ν(x)| for any two probability measures µ and ν, using that the supremum is realized by A = {x : µ(x) ≥ ν(x)}.

Part 2: Bounding the local dependence contributions

By definition of X as a block Markov chain it holds for any t ∈ {1, . . . , ℓ} that

From (84) it follows that

where it was used that π(k_1) p_{k_1,k_2} ≤ 1. It holds for any t, t' ∈ {1, . . . , ℓ} with |t − t'| > 1 that

For t' ∈ {t + 1, t − 1} it may similarly be deduced that if e is a self-loop then p_{t,t'} ≤ π(k_1) p_{k_1,k_1} (#V_{k_1})^{−3}, and if e is not a self-loop then p_{t,t'} = 0. Hence,

Part 3: Bounding the long-range dependence contribution (b_3)

The first step in this part will be to control s_t in terms of a simpler quantity; this step is achieved in (92). Thereafter, control over (92) is achieved in (101) and we deduce the desired bound for b_3 in (105).
Recall that X is uniformly distributed in ∏_{t=0}^ℓ V_{s_{X,t}} given that Σ_X = s_X. It follows that 1{E_{X,t} = e} is conditionally independent of (1{E_{X,t'} = e})_{t'∈{1,...,ℓ}\B_{r_0}(t)} given Σ_{X,t−r_0−1} and Σ_{X,t+r_0}. (If t − r_0 − 1 < 0 one can use a time reversal to make sense of Σ_{X,t−r_0−1}; a time reversal exists since Σ_X is assumed to be irreducible and acyclic.) Hence, by the law of total probability,

On the other hand, we may write

Observe that

Therefore, by using the triangle inequality on the definition of s_t together with (89)–(91), it follows that

Observe that for any s_1, s_2 ∈ {1, . . . , K} with P(Σ_{X,t+r_0} = s_2 | Σ_{X,t−r_0−1} = s_1) > 0, the definition of conditional probability yields that
Here, by definition of a block Markov chain,

Moreover, the Markov property applied to Σ_X yields that

Combining (93), (94) and (95) yields that
Proof of Theorem 4.5. Note that the total variation distance in (11) metrizes convergence in distribution. This is to say that N̂_{X,e_n} converges in distribution if and only if its distribution P(N̂_{X,e_n} = · ) converges with respect to the metric d_TV.
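Convergence in d_TV can also be probed numerically for simulated counts. The helper below is our own (name and truncation convention ours): it measures the total variation distance between the empirical law of integer counts and a Poisson law, which is the mode of convergence asserted in Theorem 4.5.

```python
import math
import numpy as np

def dtv_to_poisson(counts, lam):
    """Total variation distance, as in (11), between the empirical law
    of nonnegative integer counts and Poisson(lam)."""
    counts = np.asarray(counts, dtype=int)
    r_max = int(counts.max())
    emp = np.bincount(counts, minlength=r_max + 1)[: r_max + 1] / len(counts)
    poi = np.array([math.exp(-lam) * lam ** r / math.factorial(r)
                    for r in range(r_max + 1)])
    # Poisson mass beyond r_max is unmatched by the empirical law
    return 0.5 * (np.abs(emp - poi).sum() + max(0.0, 1.0 - poi.sum()))
```

Feeding it edge-traversal counts from many simulated block Markov chain paths would give a direct numerical check of the Poisson limit.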
Recall that Σ_X is assumed to be irreducible and acyclic. Hence, by (78) it holds that ∆_{Σ_X}(r) decays exponentially in r. Recall that ℓ = λn² + o(n²) and #V_k = α_k n + o(n). In particular, ℓ(#V_{k_1} #V_{k_2})^{−1} converges to α_{k_1}^{−1} α_{k_2}^{−1} λ as n tends to infinity. The result now follows by taking r_0 = log(#V_{k_1} #V_{k_2}) in Theorem 6.11.

Assume that X starts in equilibrium. Let e_n be as in Theorem 4.5, and let Y be a Poisson distributed random variable with rate λ π(k_1) α_{k_1}^{−1} α_{k_2}^{−1} p_{k_1,k_2}. Then for every m ∈ Z_{≥0} it holds that, as n tends to infinity,

Proof. We have already established in Theorem 4.5 that N̂_{X,e_n} converges in distribution to Y. Hence, to derive convergence of the moments it suffices to verify that for every m ∈ Z_{≥0} the sequence of random variables N̂_{X,e_n}^{m+1} is uniformly integrable [36, 7.10.(15)].
We will apply Proposition 6.8 to the Markov chain of edges E_X := (E_{X,t})_{t=1}^ℓ with the function f(e') = 1{e' = e_n}. Observe that

Note that E_X induces a chain Σ_{E_X} on the reduced space {1, . . . , K} × {1, . . . , K} and that t_mix^{E_X} ≤ max{2, t_mix^{Σ_{E_X}}} by X being a block Markov chain. Since X is a block Markov chain it follows that the dynamics of Σ_{E_X} do not depend on n, so t_mix^{Σ_{E_X}} is a constant. Correspondingly, by (64) it holds that 1/γ_ps^{E_X} = O(1). Hence, since ℓ = Θ(n²), Proposition 6.8 yields that

for some c_16, c_17 ∈ R_{>0} which do not depend on n. The constant c_16 is the same for all n, and the constant c_17 ≥ 2 is chosen so that c_17 exp(−c_16 r) ≥ 1 whenever r ≤ 1. Note that π(k_1)(#V_{k_1})^{−1}(#V_{k_2})^{−1} p_{k_1,k_2} converges to a finite constant as n tends to infinity. Hence, for some sufficiently large constant c_18 ∈ R_{>0},

P(N̂_{X,e_n} > r) ≤ c_18 exp(−c_16 r). (110)

In particular,

The right-hand side of (112) is finite and independent of n. This shows that N̂_{X,e_n}^{m+1} is uniformly integrable and concludes the proof.

Recall that the definition of an approximately uncorrelated random matrix with variance profile consists of two properties, (14) and (15). These two parts are established in Proposition 6.15 and Proposition 6.14 respectively.

Proposition 6.14. Assume that X starts in equilibrium and let R ∈ Z_{≥1}. Then, for all m ∈ Z_{≥0}^R, it holds as n tends to infinity that

max_{e_1,...,e_R ∈ E_n}

Proof. We proceed by induction on R. The base case R = 1 is trivial. Now assume that R > 1 and that the claim is known to hold for any product with strictly fewer than R edges e_j. The proof is subdivided into three main steps. First, we construct a pair of Markov chains X, Y which are equal most of the time but are sometimes allowed to diverge. The goal of this process is to ensure two properties: (i) It holds that M_{X,e_j} ≈ M_{Y,e_j} for every j ∈ {2, . . . , R}. (ii) It holds that X is independent of M_{Y,e_1}.
The approximation from the first item may be used to show that

The independence from the second item allows us to factorize

Combined, (114) and (115) determine E[M_{Y,e_1}^{m_1} · · · M_{Y,e_R}^{m_R}] in terms of E[M_{X,e_2}^{m_2} · · · M_{X,e_R}^{m_R}] up to some error term. The induction hypothesis then yields the desired result provided we show that the error term is small. A bound on the size of the error terms is established in the third step.
Some preliminary reductions are applicable. Firstly, precisely as in Part 1 of the proof of Proposition 4.8, it can be assumed that K ≥ 2R + 1 by splitting a cluster into pieces of asymptotically equal proportions. Further, the value of the expectation can only depend on the isomorphism type of the labeled directed subgraph G induced by {e_1, . . . , e_R}, with vertex labels given by the clusters and edge labels by the index j of e_j. As n tends to infinity the number of possible isomorphism types for G remains bounded. Hence, we can fix some isomorphism type for G. Informally, this allows us to pretend that the edges e_1, . . . , e_R ∈ E_n stay fixed as n tends to infinity. Due to the induction hypothesis it may also be assumed that the e_j are distinct.

Part 1: Construction of Markov chains (X, Y )
Recall that it was ensured that K ≥ 2R+1. Hence, there exists some k ∈ {1, . . . , K} such that V k does not contain any endpoint of e j for j = 1, . . . , R.
Use the following procedure to construct a triple of sample paths (X̃, Ỹ, Z̃) with random length at least ℓ + 1. These will be trimmed to paths X, Y and Z of length exactly ℓ + 1 afterwards. See Figure 7 for a visualization of the construction.
(i) Sample two independent infinite paths (X̃_t^0)_{t=0}^∞ and (Z̃_t^0)_{t=0}^∞ from the block Markov chain starting in equilibrium. Due to the assumption that the Markov chain associated with p is irreducible and acyclic, the time T_0 := inf{t ≥ 1 : · · · } is finite with probability one.
(ii) For i = 1, . . . , extend X̃, Ỹ and Z̃ by using the following procedure.
(a) Sample two independent infinite paths (X̃_t^i)_{t=0}^∞ and (Z̃_t^i)_{t=0}^∞ from the block Markov chain with X̃_0^i = Z̃_0^i chosen uniformly at random in V_k.

Figure 7. Visualization of the merging process in the construction of X and Y during the proof of Proposition 6.14. The fragments X̃^i and Ỹ^i are allowed to diverge whenever either one uses e_1. This ensures that all information about Y using e_1 is erased from X. Otherwise, the fragments are merged by taking X̃^i = Ỹ^i. This ensures that the two chains are often equal because using e_1 is a rare event. The cluster structure is exploited to ensure that the endpoints of the short fragments X̃^i and Ỹ^i can be glued after diverging.

Due to the assumption that the Markov chain associated with p is irreducible and acyclic it will hold that T_i is finite with probability one.
(c) If e_1 was traversed by either fragment, the ith fragments of X̃ and Ỹ are allowed to diverge; otherwise they are merged. Append the fragments to the previously constructed parts of X̃, Ỹ and Z̃ respectively. This is to say that we define X̃_{t+T_0+···+T_{i−1}} := X̃_t^i, Ỹ_{t+T_0+···+T_{i−1}} := Ỹ_t^i and Z̃_{t+T_0+···+T_{i−1}} := Z̃_t^i for t = 0, . . . , T_i − 1.

The sampled paths X̃^i := (X̃_t^i)_{t=0}^{T_i−1}, Ỹ^i := (Ỹ_t^i)_{t=0}^{T_i−1} and Z̃^i := (Z̃_t^i)_{t=0}^{T_i−1} used in the construction of X̃, Ỹ and Z̃ will be called fragments. For any i, the ith fragments are said to be merged if (X̃_t^i)_{t=0}^{T_i−1} and (Z̃_t^i)_{t=0}^{T_i−1} did not traverse e_1. Denote X := (X̃_t)_{t=0}^ℓ, Y := (Ỹ_t)_{t=0}^ℓ and Z := (Z̃_t)_{t=0}^ℓ for the truncations to length ℓ + 1. Note that these are block Markov chains from the specified model.

We further claim that X is independent of M_{Y,e_1}. Let us remark that this is due to the special role of e_1 in the above construction; it will typically not be true that X is independent of M_{Y,e} for any other edge e ≠ e_1. Indeed, let W̃ be the path found from Z̃ by replacing Z̃_{T_0+···+T_i} with a uniformly random node in V_k for every i ≥ 0. Then X is independent of W := (W̃_t)_{t=0}^ℓ. In particular X is independent of M_{W,e_1} since this is a function of W. Now note that M_{Y,e_1} is equal to M_{W,e_1} by definition of Y and the fact that V_k was chosen not to contain any of the endpoints of e_1. This establishes that X is independent of M_{Y,e_1}, as desired.

Part 2: Approximate factorization of $\mathbb{E}[M^{m_1}_{Y,e_1} \cdots M^{m_R}_{Y,e_R}]$

For any $j = 2, \dots, R$ define $\Delta_j$ to be the difference between the number of times $e_j$ was used in $Y$ and $X$. Observe that $M_{Y,e_j} = M_{X,e_j} + \Delta_j$ for every $j \geq 2$ since $\mathbb{E}[\widehat{N}_{Y,e_j}] = \mathbb{E}[\widehat{N}_{X,e_j}]$; after all, $X$ and $Y$ have the same distribution. Substitute the binomial expansion for every factor $M^{m_j}_{Y,e_j} = (M_{X,e_j} + \Delta_j)^{m_j}$ in $M^{m_2}_{Y,e_2} \cdots M^{m_R}_{Y,e_R}$ and use the fact that $M_{Y,e_1}$ is independent of the $M_{X,e_j}$ with $j \geq 2$ to find that … where the summation runs over vectors of integers of length $R$ and the $c_{m'}$ are absolute constants with $c_{m'} = 0$ whenever $m'_j = 0$ for all $j \geq 2$. Note that $M^{m_2}_{X,e_2} \cdots M^{m_R}_{X,e_R}$ is a product with $R - 1$ factors. It follows that the induction hypothesis is applicable to $\mathbb{E}[M^{m_2}_{X,e_2} \cdots M^{m_R}_{X,e_R}]$, so that it remains to show that … By the Cauchy–Schwarz inequality it holds that … Due to the independence of $X$ and $M_{Y,e_1}$ we can write … which is $O_m(1)$ by Corollary 6.13 and the induction hypothesis. It remains to show that …. Define $\mathcal{M}_i$ to be the event where $\widehat{X}^i$ and $\widehat{Y}^i$ were merged and denote $\neg\mathcal{M}_i$ for the complement of this event. For any fixed $j \geq 2$, let $\Delta_{X,j} := \widehat{N}_{\widehat{X}^0, e_j} + \sum_{i \geq 1} \mathbf{1}_{\neg\mathcal{M}_i} \widehat{N}_{\widehat{X}^i, e_j}$ denote the number of times $X$ used edge $e_j$ in a fragment where $X$ and $Y$ were not merged, and define $\Delta_{Y,j}$ similarly. Then … Substitute the definitions of $\Delta_{X,j}$ and $\Delta_{Y,j}$ in the resulting bound $\prod_{j=2}^{R} \Delta_j^{2 m_j} \leq \prod_{j=2}^{R} (\Delta_{Y,j} + \Delta_{X,j})^{2 m_j}$ and expand the monomial expression. Then, the Cauchy–Schwarz inequality reduces us to the statement that $\mathbb{E}[\widehat{N}^{\,q}_{\widehat{X}^0, e_j}] = o_q(1)$ and $\mathbb{E}[(\sum_{i \geq 1} \mathbf{1}_{\neg\mathcal{M}_i} \widehat{N}_{\widehat{X}^i, e_j})^q] = o_q(1)$, where it was used that $Y$ has the same distribution as $X$. These claims are established in the next part of the proof. In fact, we will establish the stronger claims that $\mathbb{E}[\widehat{N}^{\,q}_{\widehat{X}^i, e_j}] = O_q(n^{-2})$ for any $(i, j) \in \{0, 1\} \times \{1, \dots, R\}$ and $\mathbb{E}[(\sum_{i \geq 1} \mathbf{1}_{\neg\mathcal{M}_i} \widehat{N}_{\widehat{X}^i, e_j})^q] = O_q(n^{-1})$.
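The factorization step above can be made concrete with the following display. This is a hedged reconstruction consistent with the surrounding definitions (independence of $M_{Y,e_1}$ from the $M_{X,e_j}$, and $M_{Y,e_j} = M_{X,e_j} + \Delta_j$), not a verbatim restoration of the omitted equation:

```latex
\mathbb{E}\bigl[M^{m_1}_{Y,e_1}\cdots M^{m_R}_{Y,e_R}\bigr]
= \mathbb{E}\bigl[M^{m_1}_{Y,e_1}\bigr]\,
  \mathbb{E}\bigl[M^{m_2}_{X,e_2}\cdots M^{m_R}_{X,e_R}\bigr]
+ \sum_{m'} c_{m'}\,
  \mathbb{E}\Bigl[M^{m_1}_{Y,e_1}\prod_{j=2}^{R} M^{m_j - m'_j}_{X,e_j}\,\Delta_j^{m'_j}\Bigr],
```

where the sum runs over $0 \leq m' \leq m$ coordinatewise, the leading product is the term with all $m'_j = 0$ (which factorizes by independence), and the coefficients $c_{m'}$ are products of binomial coefficients with the convention that $c_{m'} = 0$ when $m'_j = 0$ for all $j \geq 2$.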
Fix some $j \in \{2, \dots, R\}$ and $q \in \mathbb{Z}_{\geq 1}$. By the law of total expectation, … For any edge $e$ whose starting and ending point are in the same clusters as the starting and ending point of $e_j$ it holds that … There are at least $\alpha_{\min}^2 n^2$ such edges, so it follows that … By (121) and (124) it follows that … Here, $\mathbb{E}[T_i^q]$ is a finite constant which does not depend on $n$. Indeed, consider the product chain $\Sigma(\widehat{X}^i, \widehat{Z}^i) := (\sigma(\widehat{X}^i_t), \sigma(\widehat{Z}^i_t))_{t=0}^{\infty}$ on the space of clusters $\{1, \dots, K\} \times \{1, \dots, K\}$. Then $T_i$ is the first strictly positive time $\Sigma(\widehat{X}^i, \widehat{Z}^i)$ is in $(k, k)$. In particular, $\mathbb{E}[T_i^q]$ is independent of $n$. The fact that $\mathbb{E}[T_i^q]$ is finite is immediate from the fact that $\mathbb{P}(T_i > t)$ decays exponentially in $t$, since the Markov chain associated with $p$ is assumed to be irreducible and acyclic. This concludes the proof of the statement that $\mathbb{E}[\widehat{N}^{\,q}_{\widehat{X}^i, e_j}] = O_q(n^{-2})$.
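The two bounds driving this part can be summarized as follows, where $t'$ ranges over the possible values of $T_i$. The display is a sketch consistent with the argument above, not the omitted equations themselves:

```latex
\widehat{N}^{\,q}_{\widehat{X}^i,e_j} \le T_i^{\,q-1}\,\widehat{N}_{\widehat{X}^i,e_j},
\qquad
\mathbb{E}\bigl[\widehat{N}_{\widehat{X}^i,e_j}\,\big|\,T_i = t'\bigr] \le \frac{t'}{\alpha_{\min}^{2}\, n^{2}},
```

so that $\mathbb{E}[\widehat{N}^{\,q}_{\widehat{X}^i,e_j}] \leq \alpha_{\min}^{-2} n^{-2}\, \mathbb{E}[T_i^q] = O_q(n^{-2})$, with $\mathbb{E}[T_i^q]$ independent of $n$ as argued above.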
Note that the random variables $\mathbf{1}_{\neg\mathcal{M}_i} \widehat{N}_{\widehat{X}^i, e_j}$ with $i \geq 1$ are independent and identically distributed due to the fact that $V_k$ was chosen not to contain any endpoint of $e_1$ or $e_j$. Correspondingly, a term with $d$ distinct $i_l$ values in the right-hand side of (127) may be factorized as $\prod_{l=1}^{d} \mathbb{E}[\mathbf{1}_{\neg\mathcal{M}_1} \widehat{N}^{\,q_l}_{\widehat{X}^1, e_j}]$ for some $q_1, \dots, q_d \in \{1, \dots, q\}$.
Consequently, we may bound (127) as … for some absolute constants $c_{q,d} \in \mathbb{R}_{>0}$. Since $\ell = \Theta(n^2)$, it now suffices to show that $\mathbb{E}[\mathbf{1}_{\neg\mathcal{M}_1} \widehat{N}^{\,q}_{\widehat{X}^1, e_j}] = O_q(n^{-3})$. This will be shown by an argument resembling Part 3.1.
By the law of total expectation, … The union bound implies that $\mathbb{P}(A \mid \cup_i B_i) \leq \sum_i \mathbb{P}(A \mid B_i)$ for any countable collection of events $A, B_i$ with $\mathbb{P}(B_i) \neq 0$. Observe that … where the second sum in (132) runs over all $t_0 \in \{1, \dots, t\}$ with $\mathbb{P}(T_1 = t, E_{\widehat{X}^1, t_0} = e_1) \neq 0$ and the sum in (133) runs over all $t_0 \in \{1, \dots, t\}$ with $\mathbb{P}(T_1 = t, E_{\widehat{Z}^1, t_0} = e_1) \neq 0$. First consider the terms with $E_{\widehat{Z}^1, t_0} = e_1$. Note that there are at least $\alpha_{\min}^2 n^2$ edges $e$ whose starting point and ending point have the same clusters as the starting point and ending point of $e_j$, respectively. Conditional on $T_1 = t$ and $E_{\widehat{Z}^1, t_0} = e_1$, any such edge $e$ is equally likely to be traversed at time $t_1$ by $\widehat{X}^1$. It follows that $\mathbb{P}(E_{\widehat{X}^1, t_1} = e_j \mid T_1 = t, E_{\widehat{Z}^1, t_0} = e_1) \leq \alpha_{\min}^{-2} n^{-2}$. Let us now consider the terms with $E_{\widehat{X}^1, t_0} = e_1$. When $|t_0 - t_1| > 1$ the foregoing argument applies word-for-word and yields that $\mathbb{P}(E_{\widehat{X}^1, t_1} = e_j \mid T_1 = t, E_{\widehat{X}^1, t_0} = e_1) \leq \alpha_{\min}^{-2} n^{-2}$. The cases $t_0 = t_1 - 1$ and $t_0 = t_1 + 1$ require a modification. When $t_0 = t_1 - 1$, note that there are at least $\alpha_{\min} n$ edges $e$ whose ending point is in the same cluster as the ending point of $e_j$ and whose starting point is equal to the starting point of $e_j$. Conditional on $T_1 = t$ and $E_{\widehat{X}^1, t_0} = e_1$, any such edge $e$ is equally likely to be traversed at time $t_1$. Hence, $\mathbb{P}(E_{\widehat{X}^1, t_1} = e_j \mid T_1 = t, E_{\widehat{X}^1, t_0} = e_1) \leq \alpha_{\min}^{-1} n^{-1}$ for $t_0 = t_1 - 1$. Similarly, when $t_0 = t_1 + 1$, note that there are at least $\alpha_{\min} n$ edges $e$ whose starting point is in the same cluster as the starting point of $e_j$ and whose ending point is equal to the ending point of $e_j$. Conditional on $T_1 = t$ and $E_{\widehat{X}^1, t_0} = e_1$, any such edge $e$ is equally likely to be traversed at time $t_1$. Hence, $\mathbb{P}(E_{\widehat{X}^1, t_1} = e_j \mid T_1 = t, E_{\widehat{X}^1, t_0} = e_1) \leq \alpha_{\min}^{-1} n^{-1}$ for $t_0 = t_1 + 1$.
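The conditional union bound quoted above has a one-line verification; the following display is a standard check, not part of the original argument (the $B_i$ need not be disjoint):

```latex
\mathbb{P}\bigl(A \,\big|\, \textstyle\bigcup_i B_i\bigr)
= \frac{\mathbb{P}\bigl(A \cap \bigcup_i B_i\bigr)}{\mathbb{P}\bigl(\bigcup_i B_i\bigr)}
\le \sum_i \frac{\mathbb{P}(A \cap B_i)}{\mathbb{P}\bigl(\bigcup_i B_i\bigr)}
\le \sum_i \frac{\mathbb{P}(A \cap B_i)}{\mathbb{P}(B_i)}
= \sum_i \mathbb{P}(A \mid B_i).
```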
Using these bounds in (132) yields that … Observe that … Note that $\widehat{X}^1$ and $\widehat{Z}^1$ follow the same distribution and apply (124) with $q = 1$ to deduce that … Hence, (129), (134) and (135) yield that … where, as in Part 3.1, $\mathbb{E}[T_1^{q+2}]$ is a finite constant which does not depend on $n$. Hence, $\mathbb{E}[\mathbf{1}_{\neg\mathcal{M}_1} \widehat{N}^{\,q}_{\widehat{X}^1, e_j}] = O_q(n^{-3})$, which may be combined with (128) and the fact that $\ell = \Theta(n^2)$ to deduce the desired result: … This concludes the proof.
Proposition 6.15. Assume that $X$ starts in equilibrium and let $0 \leq r \leq R$ be integers. Then, for any $m \in \mathbb{Z}^R_{\geq 0}$ with $m_i = 1$ for $i = 1, \dots, r$, it holds that, as $n$ tends to infinity, … where the maximum runs over all sequences of distinct edges $e_1, \dots, e_R \in E_n$, that is, over all sequences with $e_i \neq e_j$ for $i \neq j$.
We will prove a better bound than Proposition 6.15. Recall from Section 2.4 that [33] improved the convergence in probability of [40] to convergence almost surely under additional assumptions. It turns out that the most immediate generalization of their results would be too restrictive to include block Markov chains, because the correlations in Proposition 6.15 do not decay sufficiently quickly.
The correlation between edges $e_i$ and $e_j$ is maximal when the ending point of $e_i$ is equal to the starting point of $e_j$. Correspondingly, Proposition 6.15 can be improved when there are many edges whose ending point is not the starting point of some other edge. This improvement could serve as a point of departure for strengthening Theorem 1.1 and Theorem 1.2 to almost sure convergence.
For any integers $0 \leq r' \leq R$, let $E^R_{n,r'}$ denote the collection of sequences of distinct edges $e_1, \dots, e_R \in E_n$ such that for every $i \in \{1, \dots, r'\}$ the ending point of $e_i$ is not the starting point of any edge $e_j$ with $j \in \{1, \dots, R\} \setminus \{i\}$ and the starting point of $e_i$ is not the ending point of any edge $e_j$ with $j \in \{1, \dots, R\} \setminus \{i\}$.
See Figure 8 for an example with $R = 3$. Observe that $E^R_{n,R} = E^R_{n,R-1} \subseteq E^R_{n,R-2} \subseteq \cdots \subseteq E^R_{n,0}$, as is immediate from the definition. The following proposition includes Proposition 6.15 as the special case with $r' = 0$.

Proposition 6.16. Assume that $X$ starts in equilibrium and let $0 \leq r \leq r' \leq R$ be integers. Then, for any $m \in \mathbb{Z}^R_{\geq 0}$ with $m_i = 1$ for $i = 1, \dots, r$, it holds that, as $n$ tends to infinity, …

Proof. This proof combines the proof of Proposition 4.8 with an inductive argument as was used in the proof of Proposition 6.14. The main technical difference is that we can no longer use the Cauchy–Schwarz inequality during the inductive step as in (118): the resulting squares would reduce $r$ to zero, which weakens the conclusion of the induction hypothesis. Instead, the step analogous to (118) will employ conditional independence. The price we pay for this argument is that it necessitates a stronger induction hypothesis to account for the added conditioning.
The same preliminary reductions as in the proof of Proposition 6.14 are applicable. First, precisely as in Part 1 of the proof of Proposition 4.8, it can be assumed that $K \geq 2R + 1$ by splitting a cluster into pieces of asymptotically equal proportions. Further, by fixing an isomorphism type for the labeled directed graph $G$ induced by $\{e_1, \dots, e_R\}$, we may again pretend that the edges $e_1, \dots, e_R$ stay fixed as $n$ tends to infinity.

Part 1: Set-up of the inductive argument
Recall that it is ensured that $K \geq 2R + 1$. Hence, there exists some $k \in \{1, \dots, K\}$ such that $V_k$ does not contain any endpoint of $e_j$ for $j = 1, \dots, R$.
For any $d \in \mathbb{Z}_{\geq 0}$, $\ell' \leq \ell$ and $\tau \in \{0, \dots, \ell'\}^d$, denote $V_{X,\tau}$ for the event where $X_{\tau_i} \in V_k$ for every $i = 1, \dots, d$. When $d = 0$, it is to be understood that $V_{X,\tau}$ refers to the universal event; in particular, $\mathbb{P}(V_{X,\tau}) = 1$ in this case. Fix some $d$, $\ell'$ and $\tau$ with $\mathbb{P}(V_{X,\tau}) > 0$ and let $Y := (Y_t)_{t=0}^{\ell'}$ be a sample path from the block Markov chain conditioned on the event $V_{Y,\tau}$. We will show that there exists a constant $c_{19} \in \mathbb{R}_{\geq 0}$, depending on $d$ but not on $\ell'$ or $\tau$, such that … Taking $d = 0$ and $\ell' = \ell$ then recovers the proposition. The proof of the claim proceeds by induction on $r$. This is why we require (140) to hold for any $d \in \mathbb{Z}_{\geq 0}$ and $\ell' \leq \ell$ even though the proposition only concerns the case with $d = 0$ and $\ell' = \ell$: we will modify $d$ and $\ell'$ when reducing $r$ in the inductive step. The argument for the base case $r = 0$ is provided in Part 4.1. Now let $r \geq 1$ and assume that (140) is known to hold for any smaller value of $r$.

Figure 9. Visualization of the construction of the chains $Y$ and $Z$ in the proof of Proposition 6.16. The chain $Y'$ is found from $Y$ by cutting out the piece between $L_-$ and $L_+$. Both chains are conditioned to be in cluster $V_k$ at times $\tau_i$, but $Z$ has the additional condition of using edge $e_1$ at time $t_1$. The visualized process of gluing $e_1$ onto $Y'$ exploits the cluster structure to ensure that $L_+ - L_-$ is small.
Note that there are at least $\alpha_{\min}^2 n^2$ edges $e$ for which the starting and ending point are in the same clusters as the starting and ending point of $e_1$, respectively. These edges $e$ are equally likely to be traversed at time $t_1$ by $Y$. It follows that … It remains to show that conditioning on $E_{Y,t_1} = e_1$ in (141) has only a small effect on $M^{m_2}_{Y,e_2} \cdots M^{m_R}_{Y,e_R}$. That is, it has to be shown that $\mathbb{E}[M^{m_2}_{Y,e_2} \cdots M^{m_R}_{Y,e_R} \mid E_{Y,t_1} = e_1]$ is approximately $\mathbb{E}[M^{m_2}_{Y,e_2} \cdots M^{m_R}_{Y,e_R}]$, with approximation error uniform over all $t_1 \in \{1, \dots, \ell'\}$ with $\mathbb{P}(E_{Y,t_1} = e_1) > 0$. Fix such a value of $t_1$.
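The resulting bound can be sketched as follows, writing $e_1 = (x_1, x_2)$ and letting $\sigma$ denote the cluster membership map as elsewhere in the paper. This is a hedged reconstruction of the omitted display, not a verbatim restoration:

```latex
\mathbb{P}(E_{Y,t_1} = e_1)
\;\le\; \frac{\mathbb{P}\bigl(\sigma(Y_{t_1 - 1}) = \sigma(x_1),\; \sigma(Y_{t_1}) = \sigma(x_2)\bigr)}{\alpha_{\min}^{2}\, n^{2}}
\;\le\; \alpha_{\min}^{-2}\, n^{-2},
```

since, conditional on the relevant cluster pair being traversed at time $t_1$, each of the at least $\alpha_{\min}^2 n^2$ candidate edges is equally likely.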

Part 2: Construction of chains (Y, Y , Z)
Use the following procedure to construct a triple of chains $(Y, Y', Z)$ with $Y, Z$ of length $\ell' + 1$ and $Y'$ of a random length at most $\ell' + 1$. See also Figure 9 for a visualization of the construction.
(i) Sample an infinitely long path $(\widehat{Y}_t)_{t=-\infty}^{\infty}$ from the block Markov chain conditioned on $V_{\widehat{Y},\tau}$. Sample $(\widehat{W}_t)_{t=-\infty}^{\infty}$ as an infinitely long path from the block Markov chain conditioned on $E_{\widehat{W},t_1} = e_1$ and $V_{\widehat{W},\tau}$. Note that it is possible to sample at negative times by use of a time reversal, which exists by the assumption that the Markov chain associated with $p$ is acyclic and irreducible.
(ii) Define … and note that these values are finite with probability one by the assumption that the Markov chain associated with $p$ is irreducible and acyclic. Let … Thus, $Y'$ is equal to $Y$ except for the fact that a piece was cut out.
Observe that $Y$ and $Z$ are paths of length $\ell' + 1$ from the block Markov chain conditioned on $V_{Y,\tau}$ and the block Markov chain conditioned on $V_{Z,\tau}$ and $E_{Z,t_1} = e_1$, respectively. Correspondingly, $\widehat{N}_{Z,e_j}$ has the same distribution as $\widehat{N}_{Y,e_j}$ conditioned on the event $E_{Y,t_1} = e_1$.

Part 3: Bounding
We continue with the notation of the previous part; in particular, $t_1$ and $\tau$ are still considered to be fixed.
The distribution of $Y'$ is obscure unless one conditions on $L_-$ and $L_+$: even the length of $Y'$ is random without this conditioning. In particular, the usual notation averages out the critical dependence on $L_+$ and $L_-$. For this reason it is convenient to define a variant of $M_{Y'}$ which takes $L_-$ and $L_+$ into account: … For $j = 2, \dots, R$ define $\Delta_{Y,j} := \widehat{N}_{Y,e_j} - \widehat{N}_{Y',e_j}$, $\Delta_{Z,j} := \widehat{N}_{Z,e_j} - \widehat{N}_{Y',e_j}$ and set … Observe that $M_{Y,e_j} = \widetilde{M}_{Y',e_j} + E_{Y,j}$. Moreover, $M_{Y,e_j}$ conditioned on $E_{Y,t_1} = e_1$ has the same distribution as $\widehat{N}_{Z,e_j} - \mathbb{E}[\widehat{N}_{Y,e_j}] = \widetilde{M}_{Y',e_j} + E_{Z,j}$. Correspondingly, … The key observation is that the leading-order terms in the expansion of the right-hand side of (148) cancel out. This is to say that … where the summation runs over vectors of nonnegative integers $m' \in \mathbb{Z}^R_{\geq 0}$ and the $c_{m'}$ are absolute constants with $c_{m'} = 0$ if $m'_j = 0$ for all $j \in \{2, \dots, R\}$.
Fix some $m' \in \mathbb{Z}^R_{\geq 0}$ with $m'_j \neq 0$ for some $j \in \{2, \dots, R\}$. We will consider the terms with $E_{Z,j}$ in (149); those with $E_{Y,j}$ may be treated identically. By the law of total expectation, … The compressed notation for conditional expectation employed in (150) … We will use the induction hypothesis (140) to deal with … where $g_3(m', r') := \#\{j \in \{2, \dots, r'\} : m'_j \neq 0\}$. It will then follow from (150)–(153) that there exists a constant $c_{21} \in \mathbb{R}_{>0}$ such that … Observe that for any integers $q_1, q_2 \in \mathbb{Z}_{\geq 0}$ it holds that $(q_1 - q_2 - 1)/2 + \max\{1, q_2\} \geq q_1/2$. The condition below (149) stating that $c_{m'} = 0$ when $m'_j = 0$ for all $j \in \{2, \dots, R\}$ ensures that $g_1(m') \geq \max\{1, \#\{j \in \{2, \dots, r\} : m'_j \neq 0\}\}$ for all terms with $c_{m'} \neq 0$. Therefore, using the definition of $f_1$, … for all terms with $c_{m'} \neq 0$ in (149). Note that $f_2(r, r', m') + g_2(m', r') = 0$ when $r' = 0$. If $r' > 0$, note that $g_2(m', r') \geq \max\{1, \#\{j \in \{2, \dots, r'\} : m'_j \neq 0\}\}$ for all terms in (149) with $c_{m'} \neq 0$. Hence, using the definition of $f_2$, … for all terms with $c_{m'} \neq 0$ in (149). It will be shown in Part 4.2 that $\mathbb{E}[(T^+)^q] = O_{q,d}(1)$ for any $q \in \mathbb{Z}_{\geq 0}$, and a similar conclusion holds for $T^-$. Note that it is here also claimed that the bound is uniform in $t_1$ and $\tau$ provided that $d$ is fixed. Now observe that $(L_+ - L_-)^{3\|m'\|_1} \leq (T^+ + T^-)^{3\|m'\|_1}$. Expand $(T^+ + T^-)^{3\|m'\|_1}$ and apply the Cauchy–Schwarz inequality to the resulting monomial terms to derive that … It now follows by (149) and (155)–(157) that there exists a constant $c_{22} \in \mathbb{R}_{>0}$ such that (158) holds. Given that there are $\ell' \leq \ell = \Theta(n^2)$ terms in (141) and that $\mathbb{P}(E_{Y,t_1} = e_1) = O(n^{-2})$ by (142), it follows that there exists a constant $c_{23} \in \mathbb{R}_{>0}$ such that … which is the desired result.

Part 4.1: The base case $r = 0$

In this case $Y$ follows the same distribution as $X$ conditioned on the event $V_{X,\tau}$.
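The elementary integer inequality invoked above can be verified directly; the rearrangement below is a check, not part of the original text:

```latex
\frac{q_1 - q_2 - 1}{2} + \max\{1, q_2\} \;\ge\; \frac{q_1}{2}
\quad\Longleftrightarrow\quad
\max\{1, q_2\} \;\ge\; \frac{q_2 + 1}{2},
```

and the right-hand side holds for every $q_2 \in \mathbb{Z}_{\geq 0}$: for $q_2 \leq 1$ the maximum equals $1 \geq (q_2 + 1)/2$, while for $q_2 \geq 1$ it equals $q_2 \geq (q_2 + 1)/2$.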
We claim that for any fixed $d$ there exists a constant $c_{24} \in \mathbb{R}_{>0}$ independent of $n$ such that for any $\tau \in \{0, \dots, \ell'\}^d$ with $\mathbb{P}(V_{X,\tau}) > 0$ it holds that … Indeed, assume without loss of generality that the $\tau_i$ are nondecreasing in $i$. Since $X$ starts in equilibrium, it then holds that … This implies (160) since, by the Markov chain associated with $p$ being irreducible and acyclic, $p^t(k, k)$ tends to $\pi(k)$ as $t$ tends to infinity, and $\pi(k) > 0$. By the law of total expectation it follows that … which establishes the desired result since $\mathbb{E}[\widehat{N}^{\,q}_{X,e_j}] = O_q(1)$ by Corollary 6.13.

Part 4.2: Bounding $\mathbb{E}[(T^+)^q]$

… has the same distribution as $\Sigma^+(V, V')$ conditioned on the events $V_{V,\tau}$ and $V_{V',\tau}$. By (160) there exists a constant $c_{24} \in \mathbb{R}_{>0}$ independent of $n$ and $t_1$ such that $\mathbb{P}(V_{V,\tau}) \geq c_{24}$ for all $\tau \in \{0, \dots, \ell'\}^d$ with $\mathbb{P}(V_{V,\tau}) > 0$. It can similarly be deduced that there exists a constant $c_{25} \in \mathbb{R}_{>0}$ such that $\mathbb{P}(V_{V',\tau}) \geq c_{25}$ for all $\tau \in \{0, \dots, \ell'\}^d$ with $\mathbb{P}(V_{V',\tau}) > 0$. Now, by the law of total expectation and the fact that … which establishes the desired result.
Recall the definition of $E_{Z,j}$ in (147). Note that, conditional on $L_+ = \ell_+$ and $L_- = \ell_-$, … for certain absolute constants $c_{q'}$. The sum here runs over vectors of integers of length $R$.
Fix some $0 \leq q' \leq q$ and let us establish a bound on the conditional expectation of $\prod_{j=2}^{R} \Delta^{q'_j}_{Z,j}$. Observe that the product $\prod_{j=2}^{R} \Delta^{q'_j}_{Z,j}$ can only be nonzero if $\Delta_{Z,j} \neq 0$ for every $j$ with $q'_j \neq 0$. Recall from (154) that $g_1(q')$ is the number of nonzero $q'_j$ with $j \in \{2, \dots, R\}$ and $g_3(q', r')$ is the number of nonzero $q'_j$ with $j \in \{2, \dots, r'\}$.
Hence the product $\prod_{j=2}^{R} \Delta^{q'_j}_{Z,j}$ can only be nonzero if $\sum_{j=2}^{R} \mathbf{1}_{\Delta_{Z,j} > 0} \geq g_1(q')$ and $\sum_{j=2}^{r'} \mathbf{1}_{\Delta_{Z,j} > 0} \geq g_3(q', r')$. Therefore, by the law of total expectation, ….
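The total-expectation step can be sketched as follows, where $\mathcal{E}$ is hypothetical shorthand (not notation from the paper) for the event in question, and the additional conditioning carried along in the proof is suppressed. Since the product vanishes off $\mathcal{E}$, restricting to it is lossless:

```latex
\mathbb{E}\Bigl[\prod_{j=2}^{R} \Delta^{q'_j}_{Z,j}\Bigr]
= \mathbb{E}\Bigl[\prod_{j=2}^{R} \Delta^{q'_j}_{Z,j} \,\Big|\, \mathcal{E}\Bigr]\,\mathbb{P}(\mathcal{E}),
\qquad
\mathcal{E} := \Bigl\{\textstyle\sum_{j=2}^{R} \mathbf{1}_{\Delta_{Z,j} > 0} \ge g_1(q'),\;
\sum_{j=2}^{r'} \mathbf{1}_{\Delta_{Z,j} > 0} \ge g_3(q', r')\Bigr\},
```

with the convention that the right-hand side is zero when $\mathbb{P}(\mathcal{E}) = 0$.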

It follows from … The goal thus becomes to establish a bound on the probability in the right-hand side of (168). This bound will be established in two steps. First, we show that the event described by the probability implies that $(Z_t)_{t \in \{\ell_-, \dots, \ell_+\} \setminus \{t_1, t_1 - 1\}}$ has to visit the endpoints of $e_2, \dots, e_R$ often. This step is achieved in (182). Second, we show that visiting many endpoints is a rare event. This step is achieved in (190). For every $j \in \{1, \dots, R\}$ consider the following collections of times, which measure when we are in $e_j$ or its endpoints: …

Figure 10. Visualization of some 'worst case' realizations of $Z$ for which $\sum_{j=2}^{R} \mathbf{1}_{T_{e_j} \neq \emptyset} \geq g_1(q')$ and $\sum_{j=2}^{r'} \mathbf{1}_{T_{e_j} \neq \emptyset} \geq g_3(q', r')$ when $g_1(q') = R - 1 = 3$ and $g_3(q', r')$ varies. The points contributing to $\#(\cup_{j=2}^{R} T_{e_j} \setminus \{t_1, t_1 - 1\})$ are circled. Note that the number of circled points increases as $r'$ does. Indeed, $r' > 0$ avoids losing points due to the exclusion of $\{t_1, t_1 - 1\}$, whereas $r' \geq j$ with $j \geq 2$ avoids losses due to the possibility that $e_j$ shares endpoints with the other $e_i$.

Observe that in this notation it holds that $\Delta_{Z,j} = \#T_{e_j}$. Hence, the probability in the right-hand side of (168) may be rewritten as
$$\mathbb{P}\Bigl(\sum_{j=2}^{R} \mathbf{1}_{\Delta_{Z,j} > 0} \geq g_1(q'),\; \sum_{j=2}^{r'} \mathbf{1}_{\Delta_{Z,j} > 0} \geq g_3(q', r')\Bigr) = \mathbb{P}\Bigl(\sum_{j=2}^{R} \mathbf{1}_{T_{e_j} \neq \emptyset} \geq g_1(q'),\; \sum_{j=2}^{r'} \mathbf{1}_{T_{e_j} \neq \emptyset} \geq g_3(q', r')\Bigr).$$
We claim that whenever the event described in the probability in the right-hand side of (171) holds, it follows that $\#(\cup_{j=2}^{R} T_{e_j} \setminus \{t_1, t_1 - 1\}) \geq g_1(q') + g_2(q', r')$. The key difficulty is that the sets $T_{e_1}, \dots, T_{e_R}$ may not be disjoint due to the fact that the $e_j$ can share endpoints; see Figure 10.
Recall the definition of $E^R_{n,r'}$ from the paragraph preceding Proposition 6.16. It follows that for every $(i, j) \in \{1, \dots, R\} \times \{1, \dots, r'\}$ with $i \neq j$ it holds that $T_{e_i} \cap T_{e_j} = \emptyset$. Hence, … When $r' > 0$ it holds that $T_{e_1}$ is disjoint from $T_{e_j}$ for every $j \in \{2, \dots, R\}$. In particular, $\{t_1, t_1 - 1\}$ is disjoint from $T_{e_j}$ for every $j \in \{2, \dots, R\}$. Hence, … Whenever $\cup_{j=r'+1}^{R} T_{e_j} \neq \emptyset$ we can construct a subset of $\cup_{j=r'+1}^{R} T_{e_j}$ as … Observe that the left-hand side of (177) is a union of disjoint sets due to the fact that $e_2, \dots, e_R$ are distinct edges. It now follows from (176) that ….
Combine (168), (169) and (182) with the bound (190) to conclude that … where it was used that $g_1(q') \leq \|q'\|_1$ and $g_2(q', r') \leq \|q'\|_1$. This establishes the desired upper bound for the conditional expectation of $\prod_{j=2}^{R} \Delta^{q'_j}_{Z,j}$. Let us remark that the only property of the chain $Z$ which was used in the foregoing argument is that $(Z_t)_{t \in \{0, \dots, \ell'\} \setminus \{t_1, t_1 - 1\}}$ is uniformly distributed in $\prod_{t \in \{0, \dots, \ell'\} \setminus \{t_1, t_1 - 1\}} V_{s_Z(t)}$ when conditioned on $\Sigma_Z = s_Z$. Hence, the conclusion (191) also applies to other chains with this property, such as $Y$. This is to say that by repeating the argument for (191) word-for-word one finds a constant $c_{27} \in \mathbb{R}_{>0}$ such that … for any $0 \leq q' \leq q$. This term may be studied by means of a coupling argument. Construct a path $G$ of length $\ell' + 1$ by using the following procedure: (i) Let $\widehat{Y} := (\widehat{Y}_t)_{t=-\infty}^{\infty}$ be the path used in the construction of $Y$ in Part 2. Independently sample $\widehat{G} := (\widehat{G}_t)_{t=-\infty}^{\infty}$ from the same distribution as $\widehat{Y}$ conditioned on $L_- = \ell_-$ and $L_+ = \ell_+$. (ii) Define … and note that these values are finite with probability one by the assumption that the Markov chain associated with $p$ is irreducible and acyclic. Let $\widehat{L}_- = \max\{0, \ell_- - \widehat{T}^-\}$ and $\widehat{L}_+ = \min\{\ell', \ell_+ + \widehat{T}^+\}$.
(iii) Define $G := (G_t)_{t=0}^{\ell'}$ by $G_t := \widehat{G}_t$ for $t \in \{\widehat{L}_-, \dots, \widehat{L}_+\}$ and $G_t = \widehat{Y}_t$ otherwise. Note that $G$ is here implicitly dependent on $\ell_-$ and $\ell_+$, but this is suppressed in the notation. Indeed, $G$ has the same distribution as $Y$ conditioned on $L_- = \ell_-$ and $L_+ = \ell_+$. For any $j \in \{2, \dots, R\}$ let $\Delta_{G,j}$ and $\Delta_{Y,j}$ denote the number of times $(G_t)_{t=\widehat{L}_-}^{\widehat{L}_+}$ and $(Y_t)_{t=\widehat{L}_-}^{\widehat{L}_+}$ traversed edge $e_j$. Then, … As in (192) it may be established that for any $\ell_+, \ell_- \in \mathbb{Z}_{\geq 0}$ with $\mathbb{P}(\widehat{L}_+ = \ell_+, \widehat{L}_- = \ell_-) > 0$ it holds that … By the law of total expectation it now follows that … and a similar conclusion applies to $\Delta_{G,j}$. We find that … for some constant $c_{31} \in \mathbb{R}_{>0}$. Combine (191) and (206) to find a constant $c_{32} \in \mathbb{R}_{>0}$ such that for any $0 \leq q' \leq q$ … where it was used that $\|q'\|_1 + \|q - q'\|_1 = \|q\|_1$. Recall the definition of $g_1$ and $g_2$ in (154) and observe that each additional nonzero coordinate in $q'$ increases $g_1(q') + g_2(q', r')$ by at most $1 + \mathbf{1}_{r' > 0}$. Hence, the worst bound in (208) is attained at $q' = q$ and we may conclude that … $\leq c_{32} n^{-g_1(q) - g_2(q, r')} (\ell_+ - \ell_-)^{3\|q\|_1}$.
Plug (210) into (167) to find a constant $c_{33} \in \mathbb{R}_{>0}$ such that … which is the desired result.