Necessary and sufficient conditions for consistent root reconstruction in Markov models on trees

We establish necessary and sufficient conditions for consistent root reconstruction in continuous-time Markov models with countable state space on bounded-height trees. Here a root state estimator is said to be consistent if the probability that it returns the true root state converges to 1 as the number of leaves tends to infinity. We also derive quantitative bounds on the error of reconstruction. Our results answer a question of Gascuel and Steel and have implications for ancestral sequence reconstruction in a classical evolutionary model of nucleotide insertion and deletion.


Introduction
Background In biology, the inferred evolutionary history of organisms and their relationships is depicted diagrammatically as a phylogenetic tree, that is, a rooted tree whose leaves represent living species and branchings indicate past speciation events [Fel04]. The evolution of species features, such as protein sequences, linear arrangements of genes on a chromosome or the number of horns of a lizard, is commonly assumed to follow Markovian dynamics along this tree [Ste16]. That is, on each edge of the tree, the state of the feature changes according to a continuous-time Markov process; at bifurcations, two independent copies of the feature evolve along the outgoing edges starting from the state at the branching point. The length of an edge is a measure of the expected amount of change along it. See Section 1.1 for a formal definition.
In this paper, we are concerned with the problem of inferring an ancestral state from observations at the leaves of a given tree under known Markovian dynamics.

Basic definitions
Markov chains on trees We consider the following class of latent tree models arising in phylogenetics. The model has two main components:
• The first component is a tree. More precisely, throughout, by a tree we mean a finite, edge-weighted, rooted tree T = (V, E, ρ, ℓ), where V is the set of vertices, E is the set of edges oriented away from the root ρ, and ℓ : E → (0, +∞) is a positive edge-weighting function. We denote by ∂T the leaf set of T. No assumption is made on the degree of the vertices. We think of T as a continuous object, where each edge e is a line segment of length ℓ_e, and whose elements we refer to as points. We let Γ_T be the set of points of T.
• The second component is a time-homogeneous, continuous-time Markov process taking values in a countable state space S. Without loss of generality, we let S = {1, . . . , |S|} in the finite case and S = {1, 2, . . .} in the infinite case. We denote by P_t = (p_{ij}(t) : i, j ∈ S) the transition matrix at time t ∈ [0, ∞), that is, p_{ij}(t) is the probability that the state at time t is j given that it was i at time 0. We also let
p_i(t) = (p_{i1}(t), p_{i2}(t), . . .), (1.1)
be the i-th row of the transition matrix. We assume that (P_t)_t admits a Q-matrix Q = (q_{ij} : i, j ∈ S) which is stable and conservative, that is,
q_i := −q_{ii} = Σ_{j ≠ i} q_{ij} < +∞, for all i ∈ S. (1.2)
See, e.g., [Lig10, Chapter 2] or [And91] for more background on continuous-time Markov chains.
We consider the following stochastic process indexed by the points of T. The root is assigned a state X_ρ ∈ S, which is drawn from a probability distribution on S. This state is then propagated down the tree according to the following recursive process. Moving away from the root, along each edge e = (u, v) ∈ E, conditionally on the state X_u, we run the Markov process P_t started at X_u for an amount of time ℓ_{(u,v)}. We denote by X_γ the resulting state at γ ∈ e. We call the process X = (X_γ)_{γ∈Γ_T} a P_t-chain on T. For i ∈ S, we let P_i be the probability law when the root state X_ρ is i. If X_ρ is chosen according to a distribution π, then we denote the probability law by P_π. Note that the leaf distribution conditioned on the root state is given by
L^i_T[(x_u)_{u∈∂T}] := P_i[X_u = x_u for all u ∈ ∂T], (1.3)
for all (x_u)_{u∈∂T} ∈ S^{∂T}.
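To make the recursive definition concrete, here is a minimal simulation sketch (our own illustration, not from the paper): a P_t-chain on a small tree for the two-state symmetric chain with rate q, whose transition probabilities p_11(t) = (1 + e^{−2qt})/2 are available in closed form (cf. Example 1.8 below). The tree encoding and function names are illustrative choices.

```python
# Sketch: simulating a P_t-chain on a small rooted tree for the two-state
# symmetric chain with rate q, using the closed-form transition probability
# p_ii(t) = (1 + exp(-2qt)) / 2.  Tree encoding and names are illustrative.
import math
import random

def transition(state, t, q, rng):
    """Run the two-state (states 1, 2) symmetric rate-q chain for time t."""
    p_stay = (1.0 + math.exp(-2.0 * q * t)) / 2.0
    if rng.random() < p_stay:
        return state
    return 3 - state  # switch to the other state

def run_chain_on_tree(tree, root_state, q, rng):
    """tree maps each vertex to a list of (child, edge_length) pairs.
    Returns the dict of states at all vertices; leaves are the childless ones."""
    states = {"rho": root_state}
    stack = ["rho"]
    while stack:
        u = stack.pop()
        for v, length in tree.get(u, []):
            # Conditionally on X_u, run the chain for the edge length.
            states[v] = transition(states[u], length, q, rng)
            stack.append(v)
    return states

# A three-leaf tree: rho -> u (length 0.5), u -> x1, x2 (0.5 each), rho -> x3 (1.0).
tree = {"rho": [("u", 0.5), ("x3", 1.0)], "u": [("x1", 0.5), ("x2", 0.5)]}
states = run_chain_on_tree(tree, root_state=1, q=1.0, rng=random.Random(0))
leaf_states = {v: s for v, s in states.items() if v not in tree}
```

With q = 0 the chain never moves, so every point inherits the root state; this is a quick sanity check on the propagation logic.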

Root reconstruction
In the root reconstruction problem, we seek a good estimator of the root state X_ρ based on the leaf states X_{∂T}. More formally, let {T_k = (V_k, E_k, ρ_k, ℓ_k)}_{k≥1} be a sequence of trees with |∂T_k| → +∞ and let X^k = (X^k_γ)_{γ∈Γ_{T_k}} be a P_t-chain on T_k with root state distribution π.
Definition 1.1 (Consistent root reconstruction). A sequence of root estimators F_k : S^{∂T_k} → S, k ≥ 1, is said to be consistent for {T_k}_k, (P_t)_t and π if
P_π[F_k(X^k_{∂T_k}) = X^k_ρ] → 1, as k → +∞.
The basic question we address is the following.

Question 1.2.
Under what conditions on {T k } k , (P t ) t , and π does there exist a sequence of consistent root estimators?
Before stating our main theorems, we make some assumptions and introduce further notation.
Basic setup For concreteness, we let {T_k}_k be a nested sequence of trees with common root ρ. That is, for all k > 1, T_{k−1} is a restriction of T_k, as defined next.
Definition 1.3 (Restriction). Let T = (V, E, ρ, ℓ) be a tree. For a subset of leaves L ⊂ ∂T, the restriction of T to L is the tree obtained from T by keeping only those points on a path between the root ρ and a leaf u ∈ L.
Observe that a restriction of T is always rooted at ρ. Without loss of generality, we assume that |∂T_k| = k, so that T_k is obtained by adding a leaf edge to T_{k−1}. (More general sequences can be obtained as subsequences.) In a slight abuse of notation, we denote by ℓ the edge-weight function for all k. For γ ∈ Γ_T, we denote by ‖γ‖ the length of the unique path from the root ρ to γ. We refer to ‖γ‖ as the distance from γ to the root.
Our standing assumptions throughout this paper are as follows.
(i) (Uniformly bounded height) The sequence of trees {T_k}_k has uniformly bounded height. Denote by h_k := max{‖x‖ : x ∈ ∂T_k} the height of T_k. Then the bounded height assumption says that h* := sup_k h_k < +∞.
(ii) (Initial-state identifiability) The Markov process (P t ) t is initial-state identifiable, that is, all rows of the transition matrix P t are distinct for all t ∈ [0, ∞). In other words, given the distribution at time t, the initial state of the chain is uniquely determined.
Whether the last assumption holds in general for countable-space, continuous-time Markov processes (that are stable and conservative) seems to be open. We show in the appendix that it holds for two broad classes of chains: reversible chains and uniform chains, the latter including all chains on finite state spaces. (Observe, on the other hand, that in the discrete-time case it is easy to construct a transition matrix which does not satisfy initial-state identifiability.) We use the notation a ∧ b := min{a, b} and a ∨ b := max{a, b}. For two probability measures µ_1, µ_2 on S, let
‖µ_1 − µ_2‖_TV := sup_{A⊆S} |µ_1(A) − µ_2(A)| = (1/2) Σ_{j∈S} |µ_1(j) − µ_2(j)| = 1 − Σ_{j∈S} µ_1(j) ∧ µ_2(j), (1.4)
be the total variation distance between µ_1 and µ_2. (The last equality follows from noticing that Σ_j [µ_1(j) ∨ µ_2(j)] + Σ_j [µ_1(j) ∧ µ_2(j)] = 2.) In terms of total variation, initial-state identifiability says that
‖p_i(t) − p_j(t)‖_TV > 0, for all i ≠ j and all t ∈ [0, ∞), (1.5)
where recall that p_i(t) was defined in (1.1).
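The two standard expressions for the total variation distance (the half-ℓ¹ form and the one-minus-overlap form) can be checked numerically; the following sketch, with made-up finitely supported distributions, is our own illustration.

```python
# Sketch: total variation distance between two distributions on a countable
# space, via both standard expressions, which agree for probability vectors:
#   (1/2) * sum_j |mu1(j) - mu2(j)|   and   1 - sum_j min(mu1(j), mu2(j)).
def tv_distance(mu1, mu2):
    support = set(mu1) | set(mu2)
    return 0.5 * sum(abs(mu1.get(j, 0.0) - mu2.get(j, 0.0)) for j in support)

def tv_via_minima(mu1, mu2):
    support = set(mu1) | set(mu2)
    return 1.0 - sum(min(mu1.get(j, 0.0), mu2.get(j, 0.0)) for j in support)

mu = {1: 0.5, 2: 0.3, 3: 0.2}
nu = {1: 0.2, 2: 0.3, 4: 0.5}
```

For these two distributions both formulas give the same value, and the distance of a measure to itself is 0.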
Figure 1: A sequence of trees {T_k}_k (from left to right) satisfying the big bang condition. The distance from v_k to the root is 2^{−k}.
Big bang condition Our combinatorial condition for consistency says, roughly, that the trees T_k are arbitrarily dense around the root.
Definition 1.4 (Truncation). Let T = (V, E, ρ, ℓ) be a tree. For s ∈ (0, +∞), we let T(s) denote the tree obtained by truncating T at distance s from the root. We refer to T(s) as a truncation of T.
See the left-hand side of Figure 3 for an illustration. Note that, if s is greater than the height of T , then T (s) = T . Definition 1.5 (Big bang condition). We say that a sequence of trees {T k } k satisfies the big bang condition if: for all s ∈ (0, +∞), we have |∂T k (s)| → +∞ as k → +∞.
See Figure 1 for an illustration. For i ∈ S, let D i be the set of states reachable from i, that is, the states j for which p ij (t) > 0 for some t > 0 (and, therefore, for all t > 0; see e.g. [Lig10, Chapter 2]).

Statements of main results
Our main result is the following.
Theorem 1.6 (Consistent root reconstruction: necessary and sufficient conditions). Let {T_k}_k and (P_t)_t satisfy our standing assumptions (i) and (ii), and let π be a probability distribution on S. Then there exists a sequence of root estimators that is consistent for {T_k}_k, (P_t)_t and π if and only if at least one of the following conditions holds:
(a) (Downstream disjointness) For all i ≠ j such that π(i) ∧ π(j) > 0, the reachable sets D_i and D_j are disjoint.
(b) (Big bang) The sequence of trees {T k } k satisfies the big bang condition.
An application to DNA evolution by nucleotide insertion and deletion is detailed in Section B. We also derive error bounds under the big bang condition. For ε > 0, let n_ε < ∞ be the smallest integer such that Σ_{i>n_ε} π(i) < ε and let Λ_ε = {i ∈ S : i ≤ n_ε}.
Define also q*_ε = max_{i∈Λ_ε}(q_i ∨ 1) and ∆_ε = min{‖p_i(h*) − p_j(h*)‖_TV : i ≠ j ∈ Λ_ε}, which is positive under initial-state identifiability.
Theorem 1.7 (Root reconstruction: error bounds). Let {T_k}_k and (P_t)_t satisfy our standing assumptions (i) and (ii) as well as the big bang condition, and let π be a probability distribution on S. Fix ε > 0 and k ≥ 1. Then there exist universal constants C_0, C_1 > 0 and an estimator F_k such that the error bound (1.6) holds for all s > 0. Further, if the chain is uniform, that is, if q* = sup_{i∈S}(q_i ∨ 1) < +∞, then there exist universal constants C^U_0, C^U_1, C^U_2 > 0 and an estimator F^U_k such that the bound (1.7) holds for all s > 0 and all i. The following example gives some intuition for the terms in (1.6) and (1.7).
Example 1.8 (Two-state chain on a pinched star). Consider the following tree T. The root ρ is adjacent to a single vertex ρ′ through an edge of length s > 0. The vertex ρ′ is also adjacent to m vertices x_1, . . . , x_m through edges of length h − s > 0, where m is an odd integer. Consider the (P_t)_t-chain on T with state space S = {1, 2}, Q-matrix
Q = ( −q  q ; q  −q ),
and uniform root distribution π. It can be shown (see e.g. [SS03]) that under this chain
p_11(t) = (1 + e^{−2qt})/2 and p_12(t) = (1 − e^{−2qt})/2. (1.8)
Let N_1 be the number of leaves in state 1, let α = p_11(s) ∈ (1/2, 1) and let β = p_12(h − s) ∈ (0, 1/2). The estimator that maximizes the probability of correct reconstruction is the maximum a posteriori estimate (see Lemma 3.2), which in this case boils down to setting F(N_1) = 1 when state 1 has the larger posterior probability given N_1, and F(N_1) = 2 otherwise. Using that α > 1/2, one checks that F(N_1) = 1 if and only if N_1 > m/2. Hence, by symmetry, for i = 1, 2, the probability of error under P_i decays exponentially in m by Hoeffding's inequality [Hoe63]. By (1.8), as s → 0, we have α → 1, that is, hardly any information about the root is lost along the pinch edge.
EJP 23 (2018), paper 47.
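The exact error of the threshold rule F(N_1) = 1 iff N_1 > m/2 in Example 1.8 can be computed by conditioning on the state at ρ′: given root state 1, N_1 is a mixture of Bin(m, 1 − β) (with probability α) and Bin(m, β) (with probability 1 − α). The following sketch (our own, using the closed form (1.8)) evaluates this mixture exactly.

```python
# Sketch: exact error of the threshold rule on the pinched star, obtained by
# conditioning on the state at rho'.  Uses the closed form (1.8).
import math

def p11(t, q):  # two-state symmetric chain, cf. (1.8)
    return (1.0 + math.exp(-2.0 * q * t)) / 2.0

def binom_cdf(n, p, k):
    """P[Bin(n, p) <= k], computed exactly."""
    return sum(math.comb(n, j) * p**j * (1.0 - p)**(n - j) for j in range(k + 1))

def map_error(m, h, s, q):
    """Exact P_1[F(N_1) != 1] for the pinched star (m odd, so no ties)."""
    alpha = p11(s, q)             # P[state at rho' is 1 | root is 1]
    beta = 1.0 - p11(h - s, q)    # p_12(h - s)
    k = (m - 1) // 2              # error event: N_1 <= (m - 1)/2
    return alpha * binom_cdf(m, 1.0 - beta, k) + (1.0 - alpha) * binom_cdf(m, beta, k)

# Error shrinks as the number of leaves m grows (s, h, q fixed).
errors = [map_error(m, 1.0, 0.1, 1.0) for m in (11, 101)]
```

The decay in m reflects the Hoeffding bound quoted in the example; the residual error due to the pinch edge vanishes as s → 0.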
Figure 2: A (sub-)sequence of trees {T_k}_k (from left to right) satisfying the big bang condition, but such that Spr(T_k) does not tend to 0.

Spread
We begin the proof by relating the big bang condition to a notion of spread introduced in [GS10]. This connection captures the basic combinatorial insight behind the proof of Theorem 1.6. Let T = (V, E, ρ, ℓ) be a tree. For two leaves x, y ∈ ∂T, we let ℓ_{x∧y} be the length of the shared path from the root ρ to the leaves x and y. That is, if P(u, v) denotes the set of edges on the unique path between vertices u and v, then we have
ℓ_{x∧y} = Σ_{e ∈ P(ρ,x) ∩ P(ρ,y)} ℓ_e.
Roughly speaking, a tree is "well-spread" if the average value of ℓ_{x∧y} over all pairs (x, y) is small. The formal definition is as follows.
Definition 2.1 (Spread). The spread of a tree T is defined as
Spr(T) = (1 / [|∂T|(|∂T| − 1)]) Σ_{x ≠ y} ℓ_{x∧y},
where the summation is over all ordered pairs of distinct leaves x, y ∈ ∂T. We show below that, if {T_k}_k has vanishing spread, then the big bang condition holds.
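As an illustration (with our own tree encoding, and taking Spr(T) to be the average of the shared-path length over ordered pairs of distinct leaves), a star tree has zero spread while a caterpillar-like tree does not:

```python
# Sketch: computing the spread.  Each leaf is represented by its root-to-leaf
# path, a list of (edge_label, length) pairs; the shared-path length of two
# leaves is the total length of the common prefix of their paths.
from itertools import permutations

def shared_path_length(path_x, path_y):
    total = 0.0
    for ex, ey in zip(path_x, path_y):
        if ex[0] != ey[0]:
            break
        total += ex[1]
    return total

def spread(leaf_paths):
    """Average shared-path length over ordered pairs of distinct leaves."""
    n = len(leaf_paths)
    return sum(shared_path_length(x, y)
               for x, y in permutations(leaf_paths, 2)) / (n * (n - 1))

# Star tree: all pairs diverge at the root, so the spread is 0.
star = [[(f"e{i}", 1.0)] for i in range(4)]
# Caterpillar-like tree: leaves hang off a long path and share long prefixes.
caterpillar = [
    [("a", 0.5), ("x1", 0.5)],
    [("a", 0.5), ("b", 0.3), ("x2", 0.2)],
    [("a", 0.5), ("b", 0.3), ("x3", 0.2)],
]
```

Here the caterpillar's pairs share prefixes of lengths 0.5, 0.5 and 0.8, so its spread is 0.6, whereas the star is maximally well-spread.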
The converse is false, as illustrated in Figure 2, where the trees are arbitrarily dense around the root but the spread is dominated by a subtree away from the root. We show, however, that if the big bang condition holds, then one can find a sequence of arbitrarily large restrictions with vanishing spread. (Restrictions were introduced in Definition 1.3.) Our main result of this section is the following lemma.
Lemma 2.2 (Big bang and spread). A sequence of trees {T_k}_k satisfies the big bang condition if and only if there exists a nested sequence {T̃_k}_k of restrictions of {T_k}_k with |∂T̃_k| → +∞ and vanishing spread.
Proof. For the if part, we argue by contradiction. Assume the big bang condition fails and let {T̃_k}_k be a nested sequence of restrictions of {T_k}_k with vanishing spread such that |∂T̃_k| → ∞. Then there exist s_0 ∈ (0, 1), m_0 ≥ 1 and k_0 ≥ 1 such that
|∂T̃_k(s_0)| ≤ m_0, for all k ≥ k_0.
Figure 3: Consider again the second tree in Figure 2. On the left side, T_k(s) is shown where k = 3. On the right side, the subtree T̃_{k,s} is highlighted.
Also, by the nested property, the truncation T̃_k(s_0) remains the same for all k ≥ k_0. We show that at least one of the subtrees of T̃_k rooted at a point in ∂T̃_k(s_0) makes a large contribution to the spread. For k ≥ k_0 and z ∈ ∂T̃_k(s_0), let ∂T̃_k[z] be the leaves of T̃_k below z. Then, since |∂T̃_k| → ∞ while |∂T̃_k(s_0)| ≤ m_0, there is a point z_k ∈ ∂T̃_k(s_0) with |∂T̃_k[z_k]| ≥ |∂T̃_k|/m_0. Observe that, for all distinct x, y in ∂T̃_k[z_k], it holds that ℓ_{x∧y} ≥ s_0 because the paths to x and y share at least the path to z_k. Then, counting only the contribution from ∂T̃_k[z_k], we get the following bound on the spread of T̃_k:
Spr(T̃_k) ≥ s_0 |∂T̃_k[z_k]|(|∂T̃_k[z_k]| − 1) / [|∂T̃_k|(|∂T̃_k| − 1)] ≥ s_0/(2 m_0²),
for all k large enough, which contradicts the vanishing spread of {T̃_k}_k.
For the only if part, assume the big bang condition holds. For every k ≥ 1 and s ∈ (0, 1), we extract a (1 − s)-spread restriction T̃_{k,s} of T_k as follows. See Figure 3 for an illustration. Let m = |∂T_k(s)| and, for each point z ∈ ∂T_k(s), choose a single leaf of T_k below z; denote the chosen leaves by x_1, . . . , x_m. We let T̃_{k,s} be the restriction of T_k to {x_1, . . . , x_m}. Observe that T̃_{k,s} is (1 − s)-spread because the paths to each pair of leaves in ∂T̃_{k,s} diverge within T_k(s). To construct a sequence of restrictions with vanishing spread, we take a sequence of positive reals (s_i)_{i≥1} with s_i ↓ 0 and proceed as follows:
• Let k_1 ≥ 1 be such that |∂T_k(s_2)| ≥ 2 for all k > k_1. The value k_1 exists under the big bang condition. For all k ≤ k_1, let T̃_k = T̃_{k,s_1}.
• Let k_2 > k_1 be such that |∂T_k(s_3)| ≥ 3 for all k > k_2. The value k_2 exists under the big bang condition. For all k_1 < k ≤ k_2, let T̃_k = T̃_{k,s_2}.
• And so forth.

Impossibility of reconstruction
The goal of this section is to show that, in the absence of downstream disjointness, the big bang condition is necessary for consistent root reconstruction. The following proposition implies the only if part of Theorem 1.6.
Proposition 3.1 (Impossibility of reconstruction without the big bang condition). Let {T_k}_k and (P_t)_t satisfy our standing assumptions (i) and (ii), and let π be a probability distribution on S. Assume that neither downstream disjointness nor the big bang condition holds. Then consistent reconstruction of the root state is impossible, in the sense that there exists an ε > 0 such that, for all k ≥ 1,
sup_{F_k} P_π[F_k(X^k_{∂T_k}) = X^k_ρ] ≤ 1 − ε,
where the supremum is over all root estimators F_k : S^{∂T_k} → S.

Information-theoretic bounds
To prove Proposition 3.1, we need some information-theoretic bounds that relate the best achievable reconstruction probability to the total variation distance between the conditional distributions of pairs of initial states. Our first bound says roughly that the reconstruction probability is only as good as the worst total variation distance. Our second bound shows that a good reconstruction probability can be obtained from selecting a subset of initial states with high prior probability whose corresponding conditional distributions have "little overlap." See e.g. [CT06, Chapter 2] and [SS99,SS02] for some related results.
Lemma 3.2 (Information-theoretic bounds). Let Y 0 and Y 1 be random variables taking values in the countable spaces Y 0 and Y 1 respectively. Let µ 0 denote the distribution of Y 0 and let µ i 1 denote the distribution of Y 1 conditioned on {Y 0 = i}.

1. (Reconstruction upper bound) It holds that
sup_f P[f(Y_1) = Y_0] ≤ 1 − sup_{i_1 ≠ i_2} (µ_0(i_1) ∧ µ_0(i_2)) (1 − ‖µ^{i_1}_1 − µ^{i_2}_1‖_TV), (3.2)
where the supremum on the LHS is over all estimators f : Y_1 → Y_0.
2. (Reconstruction lower bound) For any finite Λ ⊆ Y_0,
sup_f P[f(Y_1) = Y_0] ≥ µ_0(Λ) − Σ_{i_1 ≠ i_2 ∈ Λ} (1 − ‖µ^{i_1}_1 − µ^{i_2}_1‖_TV), (3.3)
where the sum is over ordered pairs of distinct elements of Λ.
Proof. For both bounds, our starting point is the formula
P[f(Y_1) = Y_0] = Σ_{y∈Y_1} µ_0(f(y)) µ^{f(y)}_1(y), (3.4)
valid for any estimator f, which follows by conditioning on Y_0; the last equality in (1.4) is used below.
To derive (3.2), observe first that, by (3.4), for any f,
P[f(Y_1) = Y_0] ≤ Σ_{y∈Y_1} max_i µ_0(i) µ^i_1(y) = P[f*(Y_1) = Y_0], (3.5)
where f* is a maximum a posteriori estimate, that is, f*(y) ∈ argmax_i µ_0(i) µ^i_1(y). Fix i_1 ≠ i_2. Bounding the maximum over i by the sum over i ∉ {i_1, i_2} plus the maximum over the pair {i_1, i_2}, and using the identity Σ_y a(y) ∨ b(y) = Σ_y [a(y) + b(y)] − Σ_y a(y) ∧ b(y) together with the last equality in (1.4), gives
P[f*(Y_1) = Y_0] ≤ 1 − (µ_0(i_1) ∧ µ_0(i_2)) (1 − ‖µ^{i_1}_1 − µ^{i_2}_1‖_TV).
Bound (3.2) then follows from (3.5) and taking a supremum over i_1 ≠ i_2.
For (3.3), define the approximate maximum a posteriori estimator
f̃(y) ∈ argmax_{i∈Λ} µ_0(i) µ^i_1(y),
where note that, this time, the maximum is over Λ only. Then (3.4) applied to f̃, together with an argument similar to the one above, shows that
P[f̃(Y_1) = Y_0] ≥ µ_0(Λ) − Σ_{i_1 ≠ i_2 ∈ Λ} (1 − ‖µ^{i_1}_1 − µ^{i_2}_1‖_TV).
By (3.5), that implies (3.3) and concludes the proof.
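The two-point upper bound can be sanity-checked numerically on a small example. The exact constants below are our assumption on the form of (3.2): for any estimator f and any i_1 ≠ i_2, P[f(Y_1) = Y_0] ≤ 1 − (µ_0(i_1) ∧ µ_0(i_2))(1 − ‖µ^{i_1}_1 − µ^{i_2}_1‖_TV), with the MAP rule attaining the best success probability.

```python
# Sanity check (exact constants assumed): the best achievable success
# probability, attained by the MAP rule, is bounded by the two-point bound
#   1 - (mu0(i1) ^ mu0(i2)) * (1 - TV(mu1^{i1}, mu1^{i2}))   for every pair.
from itertools import combinations

mu0 = {1: 0.5, 2: 0.3, 3: 0.2}                     # prior on Y0
mu1 = {                                            # conditional laws of Y1
    1: {"a": 0.7, "b": 0.2, "c": 0.1},
    2: {"a": 0.2, "b": 0.6, "c": 0.2},
    3: {"a": 0.1, "b": 0.3, "c": 0.6},
}
ys = ["a", "b", "c"]

# Best achievable success probability: sum over y of max_i mu0(i) * mu1^i(y).
map_success = sum(max(mu0[i] * mu1[i][y] for i in mu0) for y in ys)

def tv(p, q):
    return 0.5 * sum(abs(p[y] - q[y]) for y in ys)

pair_bounds = {
    (i1, i2): 1.0 - min(mu0[i1], mu0[i2]) * (1.0 - tv(mu1[i1], mu1[i2]))
    for i1, i2 in combinations(mu0, 2)
}
```

For this example the MAP success probability is 0.65, below every pairwise bound, as expected.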

Characterization of consistent root reconstruction
From Lemma 3.2, we obtain a characterization of consistent root reconstruction in terms of total variation. This characterization is key to proving both directions of Theorem 1.6. Recall that L^i_T was defined in (1.3) as the leaf distribution on T given root state i.
Lemma 3.3 (Characterization of consistency). Let {T_k}_k and (P_t)_t satisfy our standing assumptions (i) and (ii), and let π be a probability distribution on S. Then there exists a sequence of root estimators that is consistent for {T_k}_k, (P_t)_t and π if and only if, for all i ≠ j ∈ S such that π(i) ∧ π(j) > 0,
lim inf_{k→+∞} ‖L^i_{T_k} − L^j_{T_k}‖_TV = 1. (3.7)
Proof. For the only if part, assume by contradiction that there are i_1 ≠ i_2 ∈ S with π(i_1) ∧ π(i_2) > 0, ε > 0 and k_0 ≥ 1 such that
‖L^{i_1}_{T_k} − L^{i_2}_{T_k}‖_TV ≤ 1 − ε, for all k ≥ k_0.
By (3.2) in Lemma 3.2, for all k ≥ k_0 and any root estimator F_k,
P_π[F_k(X^k_{∂T_k}) = X^k_ρ] ≤ 1 − (π(i_1) ∧ π(i_2)) ε.
That proves that consistent root estimation is not possible.
For the if part, assume (3.7) holds. Fix ε > 0 and let 1 ≤ n_ε < +∞ be the smallest integer such that Σ_{i≤n_ε} π(i) > 1 − ε, and set Λ = {i ∈ S : i ≤ n_ε}. By (3.3) in Lemma 3.2 together with (3.7),
lim inf_{k→+∞} sup_{F_k} P_π[F_k(X^k_{∂T_k}) = X^k_ρ] ≥ 1 − ε. (3.8)
Because ε is arbitrary, we have shown that a sequence of maximum a posteriori estimates is consistent for {T_k}_k, (P_t)_t and π.

Proof of Proposition 3.1
We now prove our main result of this section.
Proof of Proposition 3.1. Let {T_k}_k and (P_t)_t satisfy our standing assumptions (i) and (ii), and let π be a probability distribution on S. Assume that {T_k}_k satisfies neither downstream disjointness nor the big bang condition. Then, as we argued in the proof of Lemma 2.2, there exist s_0 ∈ (0, ∞) and k_0 ≥ 1 such that the truncation T_k(s_0) remains unchanged for all k ≥ k_0. Since downstream disjointness fails and ‖u‖ > 0 for all u ∈ ∂T_k (by the positivity assumption on ℓ), there are i_1 ≠ i_2 with π(i_1) > 0 and π(i_2) > 0 such that the supports of P_{i_1}[X^k_{∂T_k(s_0)} ∈ ·] and P_{i_2}[X^k_{∂T_k(s_0)} ∈ ·] have a non-empty intersection. This holds for all k and implies that
lim sup_{k→+∞} ‖P_{i_1}[X^k_{∂T_k(s_0)} ∈ ·] − P_{i_2}[X^k_{∂T_k(s_0)} ∈ ·]‖_TV < 1. (3.9)
Finally we observe that, by the triangle inequality and the conditional independence of X^k_ρ and X^k_{∂T_k} given X^k_{∂T_k(s_0)},
‖L^{i_1}_{T_k} − L^{i_2}_{T_k}‖_TV ≤ ‖P_{i_1}[X^k_{∂T_k(s_0)} ∈ ·] − P_{i_2}[X^k_{∂T_k(s_0)} ∈ ·]‖_TV. (3.10)
Combining this inequality with (3.9) shows, by Lemma 3.3, that consistent root estimation is not possible in this case. That concludes the proof.

Consistent root reconstruction
In this section, we prove the if part of Theorem 1.6. Observe first that, under downstream disjointness, the result is immediate. Let u ∈ ∂T 1 and I = {i : π(i) > 0}. Note that, by the nested property, u ∈ ∂T k for all k. Then, let F k (X k ∂T k ) be the state in I from which X k u is reachable. Downstream disjointness ensures that such a state exists and is unique. We then have P π [F k (X k ∂T k ) = X k ρ ] = 1, proving consistency in that case.
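The estimator used in the downstream-disjointness case admits a direct sketch (the reachable sets D_i are hard-coded here for illustration; in general they are determined by the chain):

```python
# Sketch of the estimator under downstream disjointness: fix one leaf and
# return the unique supported initial state from which the observed leaf state
# is reachable.  Reachable sets below are illustrative, not from the paper.
def disjoint_root_estimator(leaf_state, reachable, support):
    candidates = [i for i in support if leaf_state in reachable[i]]
    # Downstream disjointness guarantees exactly one candidate.
    assert len(candidates) == 1
    return candidates[0]

# Toy chain: from state 1 only {1, 2} are reachable, from state 3 only {3, 4}.
reachable = {1: {1, 2}, 3: {3, 4}}
support = [1, 3]  # states i with pi(i) > 0
```

Since the reachable sets of supported states are disjoint, the observed leaf state pins down the root state with probability 1, matching the argument above.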
Here we show that the big bang condition also suffices for consistent root reconstruction. We use the characterization in Lemma 3.3 to reduce the problem to pairs of initial states. Our strategy is then to extract a "well-spread" subtree of T_k, as we did in the proof of Lemma 2.2, and generalize results of [GS10] on root reconstruction for well-spread trees. Formally, we prove the following proposition which, together with Lemma 3.3 and the argument above in the downstream disjointness case, implies the if part of Theorem 1.6.
Proposition 4.1. Let {T_k}_k and (P_t)_t satisfy our standing assumptions (i) and (ii) as well as the big bang condition. Then, for all i ≠ j ∈ S,
lim inf_{k→+∞} ‖L^i_{T_k} − L^j_{T_k}‖_TV = 1. (4.1)

Well-spread restriction
We will use the following construction. We extract a well-spread restriction of T k and stretch the leaf edges to enforce that all leaves are at the same distance from the root.
Fix k ≥ 1 and s > 0. Recall that h* is a (uniform) bound on the height of the trees.
• Step 1: Well-spread restriction. By Lemma 2.2, there exists a nested sequence of restrictions with vanishing spread. Let T̃_{k,s} be the restriction of T_k constructed in the proof of Lemma 2.2. Recall that T̃_{k,s} is (1 − s)-spread and has |∂T_k(s)| leaves.
• Step 2: Stretching. We then modify T̃_{k,s} to make all leaves be at distance h* from the root, as follows. For each leaf x ∈ ∂T̃_{k,s}, we extend the corresponding leaf edge by h* − ‖x‖ and run the P_t-chain started at X^k_x for time h* − ‖x‖. We denote by T̄_{k,s} the resulting tree and assign the states generated above along the extensions.
Observe that T̄_{k,s}, like T̃_{k,s}, is (1 − s)-spread and has |∂T_k(s)| leaves.
Let N^{k,s}_j be the number of leaves of the stretched restriction T̄_{k,s} that are in state j ∈ S and let N^{k,s} = (N^{k,s}_1, N^{k,s}_2, · · ·). Denote by M^i_{T̄_{k,s}} the law of N^{k,s} when the root state is i. By a computation similar to (3.10), using the conditional independence of N^{k,s} and X^k_ρ given X^k_{∂T_k}, we have that
‖L^i_{T_k} − L^j_{T_k}‖_TV ≥ ‖M^i_{T̄_{k,s}} − M^j_{T̄_{k,s}}‖_TV.
Therefore, Proposition 4.1 follows from the following lemma.
Lemma 4.2 (Separation of frequencies). Under the assumptions of Proposition 4.1, for all i ≠ j ∈ S,
sup_{s>0} lim inf_{k→+∞} ‖M^i_{T̄_{k,s}} − M^j_{T̄_{k,s}}‖_TV = 1.
When all leaves of T_k are assumed to be at the same distance from the root, T_k is said to be ultrametric (see e.g. [SS03, Chapter 7]). Here we do not make this assumption on T_k. Instead we enforce it artificially through the stretching in Step 2. The reason we do this is that our proof relies on initial-state identifiability which, by (1.5), implies
‖p_i(h*) − p_j(h*)‖_TV > 0, for all i ≠ j. (4.2)
In contrast, it may not be the case that the expected state frequencies at ∂T̃_{k,s}, that is,
(1/|∂T̃_{k,s}|) Σ_{x∈∂T̃_{k,s}} p_i(‖x‖),
uniquely characterize the root state i.

Variance bound
The proof of Lemma 4.2 relies on the following variance bound, which generalizes a result of [GS10, Proof of Lemma 3.2]. Recall the definition of q i in (1.2).

Lemma 4.3 (Variance bound).
Let T = (V, E, ρ, ℓ) be a tree and let (X_γ)_{γ∈Γ_T} be a P_t-chain on T. Let N_j be the number of leaves of T in state j ∈ S. Then, for all i, j ∈ S,
Var_i(N_j) ≤ |∂T|/4 + (q_i ∨ 1) |∂T|(|∂T| − 1) Spr(T), (4.3)
where we denote by Var_i the variance under P_i.
Proof. Let θ^j_x be the indicator random variable for the event "leaf x is in state j." Then N_j = Σ_{x∈∂T} θ^j_x and, hence,
Var_i(N_j) = Σ_{x∈∂T} Var_i(θ^j_x) + Σ_{x≠y} Cov_i(θ^j_x, θ^j_y). (4.4)
For each leaf x, Var_i(θ^j_x) ≤ 1/4, leading to the first term on the RHS of (4.3). For x ≠ y, we express the covariance Cov_i(θ^j_x, θ^j_y) by conditioning on the state at the divergence point between the paths from the root to x and y, which lies at distance ℓ_{x∧y} from the root (4.5). Splitting the resulting sum according to whether that state equals i, we obtain (4.6). To see inequality (4.6), note that the second term on the RHS of (4.5) is bounded above by the probability that the state is changed at least once along the shared path from the root to x and y, which is equal to 1 − exp(−q_i ℓ_{x∧y}) ≤ (q_i ℓ_{x∧y}) ∧ 1 (see e.g. [Lig10, Chapter 2]). The proof is complete in view of (4.4) and the definition of the spread.

Proof of Lemma 4.2. We claim that (4.2) is equivalent to
∆^*_{i,j} := ‖p_i(h*) − p_j(h*)‖_* > 0, for all i ≠ j, (4.7)
where ‖µ − ν‖_* := sup_{k∈S} 2^{−k} |µ(k) − ν(k)|. Indeed, by the definition of the norms, we have ‖·‖_* ≤ ‖·‖_TV. For the other direction, note that, for any δ > 0, there exists M such that Σ_{k>M} 2^{−k} < δ/2 and so ‖µ − ν‖_TV ≤ δ/2 + 2^M ‖µ − ν‖_* for any probability distributions µ and ν. We consider the rescaled frequencies N^{k,s}_j/|∂T_k(s)|. Because T̄_{k,s} is (1 − s)-spread, the variance bound in Lemma 4.3 implies, for i, j ∈ S, a bound on the fluctuations of these frequencies around their means (4.8). By the Cauchy–Schwarz inequality and (4.8), the ‖·‖_*-distance between M^i_{T̄_{k,s}} and M^j_{T̄_{k,s}} is bounded from below in terms of ∆^*_{i,j} (4.9), where we used (4.9). By the big bang condition and (4.7), taking k → +∞ and then s → 0, we get
sup_{s>0} lim inf_{k→+∞} ‖M^i_{T̄_{k,s}} − M^j_{T̄_{k,s}}‖_TV = 1.
Similarly, noting that the same holds with the roles of i and j exchanged, by the triangle inequality and the definition of ∆^*_{i,j}, the claim follows. The proof is complete.

Error bounds
The proof of Lemma 4.2 actually implies an explicit bound on the error probability (see (4.10)). That bound decays like the inverse of |∂T_k(s)|. This is far from best possible: take for instance the star tree where, by conditional independence of the leaf states given the root state, one would expect an exponential inequality. Here we give an improved bound on the achievable error probability, which decays exponentially in |∂T_k(s)|. We also express this bound in terms of the more natural total variation distance. Our main result is the following proposition, which implies the first part of Theorem 1.7. (The second part of the theorem is proved in Section 5.3.) For ε > 0, recall that n_ε < ∞ is the smallest integer such that Σ_{i>n_ε} π(i) < ε and that Λ_ε = {i ∈ S : i ≤ n_ε}.
Proposition 5.1 (Achievable error bound). Fix ε > 0 and k ≥ 1. Then there exist universal constants C_0, C_1 > 0 and an estimator F_k such that the bound in (1.6) holds for all s > 0.

Deviation of frequencies
To prove Proposition 5.1, we devise a root estimator (described in detail in the next subsection) based on the combinatorial construction of Section 4.1. Fix k ≥ 1 and s > 0. Given the leaf states X^k_{∂T_k} ∈ S^{∂T_k} of the original tree T_k, we extract the restriction T̃_{k,s}, run a simulation of the P_t-chain on the extended tree T̄_{k,s}, and treat the leaf states of T̄_{k,s} as the observed leaf states. For a subset A ⊆ S, let N^{k,s}_A be the number of leaves of T̄_{k,s} whose state is in A. The proof of Proposition 5.1 requires a bound on the deviation of N^{k,s}_A. To obtain such a bound, we proceed by first controlling the number of points in ∂T_k(s) whose state coincides with the root state.
Let i be the state at the root. For any point v ∈ ∂T_k(s), let Z_v be 1 if the state at v is i, and let Z_v be 0 otherwise. Let W_i be the set of points in ∂T_k(s) in state i. In particular,
S_i := |W_i| = Σ_{v∈∂T_k(s)} Z_v.
Let N_A be the number of descendant leaves of W_i in T̄_{k,s} whose states are in A. We also let m = |∂T_k(s)|. Then, since each point of ∂T_k(s) has exactly one descendant leaf in T̄_{k,s}, we can bound N^{k,s}_A as follows:
N_A ≤ N^{k,s}_A ≤ N_A + (m − S_i). (5.1)
Conditioned on S_i, note that N_A is a binomial random variable, specifically Bin(S_i, p_{iA}(h* − s)), where p_{iA}(t) denotes the probability that the state is in A at time t given that initially it is i. To bound the probability that N^{k,s}_A is close to its expectation, we argue in two steps. We first bound the probability that S_i itself is close to its expectation, then we apply a concentration inequality to N^{k,s}_A conditioned on that event.
Lemma 5.2 (Control of S_i). Define the event
E^0_δ := { |S_i − m p_ii(s)| ≤ δ m }.
Then, we have the bound
P_i[(E^0_δ)^c] ≤ (1 − e^{−q_i s})/δ². (5.2)
Proof. We use Chebyshev's inequality to control the deviation of S_i; note that E_i S_i = m p_ii(s). By the Cauchy–Schwarz inequality, the variance of S_i is bounded by
Var_i(S_i) = Σ_{u,v∈∂T_k(s)} Cov_i(Z_u, Z_v) ≤ m² p_ii(s)(1 − p_ii(s)) ≤ m² (1 − e^{−q_i s}),
where on the last line we used that the probability of being at state i at time s is at least the probability of never having left state i up to time s, i.e., e^{−q_i s} ≤ p_ii(s) ≤ 1 (see e.g. [Lig10, Chapter 2]). The result follows by Chebyshev's inequality.

Lemma 5.3 (N^{k,s}_A is close to its expectation given E^0_δ). Fix a subset A ⊆ S. Let δ > 0.
Then, the following bound holds: the probability that N^{k,s}_A deviates from m p_ii(s) p_{iA}(h* − s) by more than a term of order δm + m(1 − e^{−q_i s}) is at most
(1 − e^{−q_i s})/δ² + exp( −(2δ²/(1 + δ)) m ). (5.3)
Proof. We proceed in three steps.
1. Conditional control of N_A. Condition on S_i and define the event that N_A is close to its conditional expectation (5.4). By Hoeffding's inequality [Hoe63], we then have a bound (5.5) on the probability that this event fails.
2. Relating N_A to N^{k,s}_A. To relate the conditional expectation of N_A to the expectation of N^{k,s}_A, we note the inequality (5.6), where we used p_ii(s) ≤ 1. In turn, by (5.6) and (5.1), on the event E^0_δ the deviation of N^{k,s}_A is controlled by the deviation of N_A up to the stated additive term.
3. Combining. Thus, by the above, the LHS of (5.3) is at most
(1 − e^{−q_i s})/δ² + exp( −(2δ²/(1 + δ)) m ),
by (5.2) and (5.5).
That concludes the proof.
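Hoeffding's inequality for a binomial count, as invoked in the proof, can be compared against the exact tail; the parameters below are illustrative choices, not from the paper.

```python
# Sketch: Hoeffding's inequality P(|Bin(n,p)/n - p| >= delta) <= 2 exp(-2 n delta^2),
# compared against the exact two-sided binomial tail.
import math

def binom_pmf(n, p, k):
    return math.comb(n, k) * p**k * (1.0 - p)**(n - k)

def exact_two_sided_tail(n, p, delta):
    return sum(binom_pmf(n, p, k) for k in range(n + 1) if abs(k / n - p) >= delta)

def hoeffding_bound(n, delta):
    return 2.0 * math.exp(-2.0 * n * delta**2)

n, p, delta = 200, 0.3, 0.08
exact = exact_two_sided_tail(n, p, delta)
bound = hoeffding_bound(n, delta)
```

The exact tail sits well below the bound, and both decay exponentially in n, which is the mechanism behind the exponential term in (5.3).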

Analysis of root estimator
We now describe our root estimator. In fact, we construct a randomized estimator (which can be made deterministic by choosing, for each input, the output most likely to be correct). We restrict ourselves to a subset of root states that has high probability under π and we estimate the frequencies of events achieving the total variation distance between the leaf distributions given different root states. Fix ε > 0 and let Λ = Λ_ε.

Root estimator
Our root estimator G Λ k : S ∂T k → S is defined as follows. Let N k,s A and m be defined as in the previous subsection.
• Define ∆ := min{‖p_{i_1}(h*) − p_{i_2}(h*)‖_TV : i_1 ≠ i_2 ∈ Λ}.
• For every distinct pair of states i_1, i_2 ∈ Λ, let A_{i_1→i_2} ⊆ S be an event achieving the total variation distance between p_{i_1}(h*) and p_{i_2}(h*), that is,
p_{i_1}(h*)[A_{i_1→i_2}] − p_{i_2}(h*)[A_{i_1→i_2}] = ‖p_{i_1}(h*) − p_{i_2}(h*)‖_TV,
where we also require that A_{i_1→i_2} = A^c_{i_2→i_1}.
• We let G^Λ_k(X^k_{∂T_k}) be the state i ∈ Λ passing the tests in (5.7), if such a state exists; otherwise we let G^Λ_k(X^k_{∂T_k}) be a state chosen uniformly at random in Λ.
Observe that at most one state can satisfy the condition in (5.7). Indeed, if two distinct states i, i′ ∈ Λ both passed the tests, combining the corresponding inequalities for the pair (i, i′) would yield a contradiction, where we use the definition of ∆ and the fact that A_{i→i′} = A^c_{i′→i}. Observe also that G^Λ_k is randomized as a function of X^k_{∂T_k}, since it depends on the states at the leaves of the extension T̄_{k,s}.
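A standard fact used in this construction is that the set A = {j : p_{i_1 j}(h*) > p_{i_2 j}(h*)} achieves the total variation distance (ties contribute zero). A quick numerical check, with made-up distributions:

```python
# Sketch: the event A = {j : mu1(j) > mu2(j)} achieves the total variation
# distance between mu1 and mu2, i.e., mu1(A) - mu2(A) = TV(mu1, mu2).
def achieving_event(mu1, mu2):
    return {j for j in set(mu1) | set(mu2) if mu1.get(j, 0.0) > mu2.get(j, 0.0)}

def tv(mu1, mu2):
    support = set(mu1) | set(mu2)
    return 0.5 * sum(abs(mu1.get(j, 0.0) - mu2.get(j, 0.0)) for j in support)

p1 = {1: 0.6, 2: 0.3, 3: 0.1}
p2 = {1: 0.2, 2: 0.3, 4: 0.5}
A = achieving_event(p1, p2)
gap = sum(p1.get(j, 0.0) for j in A) - sum(p2.get(j, 0.0) for j in A)
```

Note that, up to ties, the complement of A is exactly the achieving event for the pair in the reverse order, which is the symmetry A_{i_1→i_2} = A^c_{i_2→i_1} imposed above.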
Analysis We now prove our main result of this section.
Proof of Proposition 5.1. Let F_k = G^Λ_k be the estimator defined above, let the events A_{i→i′} be as defined above and let i be the state at the root. By Lemmas 5.2 and 5.3, the probability that i fails one of its tests in (5.7), or that some other state of Λ passes all of its tests, is bounded by the corresponding deviation probabilities, up to a term of order 1 − e^{−q*_ε s}. Take δ = ∆_ε/8 and s small enough that 1 − e^{−q*_ε s} ≤ ∆_ε/4. The result follows. Note finally that, if 1 − e^{−q*_ε s} ≤ ∆_ε/4 fails, then the bound in Proposition 5.1 is trivially true, as the RHS is then larger than 1. We leave that condition implicit in the statement.

Uniform chains: minimax error bound
Here we consider chains with uniformly bounded rates. We give a minimax error bound, that is, a bound uniform in the root state. We observe in Appendix A that uniform chains satisfy initial-state identifiability. We prove the following proposition, which implies the second part of Theorem 1.7.
Proposition 5.4 (Minimax error bound for uniform chains). Fix k ≥ 1. There exist universal constants C^U_0, C^U_1, C^U_2 > 0 and an estimator F^U_k such that the bound in (1.7) holds for all s > 0 and all i ∈ S.

Root estimator
We modify the root estimator from Section 5.2: we use the same estimator G^Λ_k, but we choose the set Λ depending on the leaf states of the extended restriction. More precisely, fix k ≥ 1 and s > 0, recall the notation of Section 5.1, and set F^U_k = G^Λ_k for the data-dependent choice of Λ described below.
Analysis Let i be the state at the root. Recall the definitions of S i and E 0 δ from Section 5.1. We show first that, conditioned on E 0 δ , the set Λ is highly likely to contain i, but highly unlikely to contain any state with low enough probability at the leaves. For α ∈ [0, 1], define J i,α = {j ∈ S : p ij (h * ) ≤ α} .
Ancestral sequence reconstruction in the TKF91 model
We consider the classical model of nucleotide insertion and deletion of Thorne, Kishino and Felsenstein [TKF91]. The state space is the set of DNA sequences
S = { x = (•, x_1, x_2, · · ·, x_M) : M ≥ 0 and x_i ∈ {A, T, C, G} for all i },
where the notation above indicates that all sequences begin with the immortal link "•" (and can otherwise be empty). We also refer to the positions of a sequence (including nucleotides and the immortal link) as sites. Let (ν, λ, µ) ∈ (0, ∞)³ with λ < µ and (π_A, π_T, π_C, π_G) ∈ [0, ∞)⁴ with π_A + π_T + π_C + π_G = 1 be given parameters. The continuous-time Markovian dynamic is described as follows: if the current state is the sequence x, then the following events occur independently:
• (Substitution) Each nucleotide (but not the immortal link) is substituted independently at rate ν > 0. When a substitution occurs, the corresponding nucleotide is replaced by A, T, C and G with probabilities π_A, π_T, π_C and π_G respectively.
• (Deletion) Each nucleotide (but not the immortal link) is removed independently at rate µ > 0. • (Insertion) Each site gives birth to a new nucleotide independently at rate λ > 0.
When a birth occurs, a nucleotide is added immediately to the right of its parent site. The newborn site has nucleotide A, T, C and G with probabilities π A , π T , π C and π G respectively. The length of a sequence x = (•, x 1 , x 2 , · · · , x M ) is defined as the number of nucleotides in x and is denoted by | x| = M (with the immortal link alone corresponding to M = 0). When M ≥ 1 we omit the immortal link for simplicity and write x = (x 1 , x 2 , · · · , x M ).
The TKF91 edge process is reversible [TKF91]. Suppose furthermore that 0 < λ < µ, an assumption we make throughout. Then it has a stationary distribution Π, given by
Π(x) = (1 − λ/µ)(λ/µ)^M Π_{i=1}^{M} π_{x_i},
for each x = (x_1, x_2, · · ·, x_M) ∈ {A, T, C, G}^M where M ≥ 1, and Π("•") = 1 − λ/µ. In words, under Π, the sequence length is geometrically distributed and, conditioned on the sequence length, all sites are independent with distribution (π_σ)_{σ∈{A,T,C,G}}. Hence, from the argument in Section A, initial-state identifiability holds for the TKF91 edge process. Theorem 1.6 gives:
Theorem B.2 (TKF91 process: consistent root estimation). Let {T_k}_k satisfy assumption (i) and the big bang condition. Let (P_t)_t be the TKF91 edge process with λ < µ and let π be the stationary distribution of the process. Then there exists a sequence of consistent root estimators.
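Under Π the sequence length is geometric with ratio λ/µ, so its mean is λ/(µ − λ); a quick numerical check on a truncated support (with illustrative parameters):

```python
# Sketch: the stationary sequence-length distribution of the TKF91 edge
# process (lambda < mu) is geometric with ratio lambda/mu.  We check the
# normalization and the mean length lambda/(mu - lambda) on a truncation.
lam, mu = 1.0, 2.0
r = lam / mu

def length_pmf(m):
    # m = 0 corresponds to the immortal link alone.
    return (1.0 - r) * r**m

total = sum(length_pmf(m) for m in range(200))
mean_len = sum(m * length_pmf(m) for m in range(200))
```

The truncation error is of order r^200 and hence negligible here; with λ = 1 and µ = 2 the mean sequence length is λ/(µ − λ) = 1.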
In a companion paper [FR], we give an alternative consistent root estimator that is also computationally efficient and provide error bounds that are explicit in the parameters of the model.