Theoretical guarantees for phylogeny inference from single-cell lineage tracing

Significance The recent application of CRISPR-Cas9 barcode editing technologies, by allowing for the continuous accumulation of information-rich mutations, has greatly increased the scale and resolution of studies tracking the developmental lineages of single cells. The inference of the unobserved lineage relationships from mutations observed in cells at one terminal time point is a rich computational problem made more difficult by distinct properties of these data that hinder classical phylogenetic approaches. To address these methodological needs for single-cell CRISPR-Cas9 lineage-tracing applications, we provide a theoretical analysis in this domain with the following key contributions: a robust generative model; algorithms with theoretical guarantees on accurate reconstruction; theoretically motivated experimental design principles; and simulations characterizing the empirical conditions for accurate reconstruction.


A.1.2 Proof of lemma 1:
Let (a, b|c) be a triplet with depth(LCA(a, b, c)) = d, and let α = dist(LCA(a, b, c), LCA(a, b)). Let Y = s(a, b) and X = s(b, c). For a particular character i, let Y_i be the indicator that a and b share the same mutated state at character i, and let X_i be the corresponding indicator for b and c. We use the following results:

lemma 4 If for every character i, P[Y_i − X_i = −1] is a decreasing function of α and P[Y_i − X_i = 1] is an increasing function of α for all α ∈ [0, 1], then for any t, P[Y − X ≥ t] is an increasing function of α.
Proof: For a given character i, the difference Y_i − X_i has 3 possible outcomes: {−1, 0, 1}. Thus, if P[Y_i − X_i = −1] is a decreasing function of α and P[Y_i − X_i = 1] is an increasing function of α for all α ∈ [0, 1], then P[Y_i − X_i ≥ t] for any t is an increasing function of α. Stating that P[Y_i − X_i ≥ t] is an increasing function of α is identical to stating that for any α_1 and α_2 such that α_1 ≥ α_2, P_{α_1}[Y_i − X_i ≥ t] ≥ P_{α_2}[Y_i − X_i ≥ t]. We use the known result from probability theory that for random variables A_i iid ∼ A and B_i iid ∼ B such that P[A ≥ t] ≥ P[B ≥ t] for all t, we have P[Σ_i A_i ≥ t] ≥ P[Σ_i B_i ≥ t]. Applying this result to Y − X = Σ_i (Y_i − X_i) completes the proof.

lemma 5 In each of the regimes considered below (no missing data, general missing data, and stochastic-only missing data), P[Y_i − X_i = −1] is a decreasing function of α and P[Y_i − X_i = 1] is an increasing function of α.

Proof: We begin with the case without missing data. First, we examine P[Y_i − X_i = −1]. Y_i − X_i = −1 for a character i corresponds to b and c sharing a mutation that a and b do not share: the same mutation must be acquired on the path from LCA(a, b) to b and on the path from LCA(a, b, c) to c while not being acquired on the path from LCA(a, b) to a; moreover, no mutation must be acquired at i on the path from r to LCA(a, b, c) nor on the path from LCA(a, b, c) to LCA(a, b). Thus we have:

P[Y_i − X_i = −1] = Σ_j e^{−λ(d+α)} (1 − e^{−λ(1−d−α)})q_j · (1 − e^{−λ(1−d)})q_j · (1 − (1 − e^{−λ(1−d−α)})q_j)

Taking only the terms that depend on α, we have:

e^{−λα}(1 − e^{−λ(1−d−α)})(1 − (1 − e^{−λ(1−d−α)})q_j)

To show that this value decreases with α, we show that its first derivative with respect to α is negative. The derivative can be written as λe^{−λα}(q_j(1 − e^{−2λ(1−d−α)}) − 1), whose negation is positive owing to the fact that q_j ∈ [0, 1] for all j. Thus, the function within the summation is decreasing in terms of α. Using the fact that the summation of decreasing functions is decreasing, the overall function is thus decreasing in terms of α.
Secondly, we examine P[Y_i − X_i = 1]. Y_i − X_i = 1 for i corresponds to a mutation occurring in both a and b, but not in c. A mutation can occur in a and b if it appears on the path from LCA(a, b, c) to LCA(a, b), or if it appears independently on the paths from LCA(a, b) to both a and b. Additionally, this mutation cannot appear on the path from LCA(a, b, c) to c, and no mutations can occur on the path from r to LCA(a, b, c). Taking only the terms that depend on α, we have:

(1 − e^{−λα}) + e^{−λα}(1 − e^{−λ(1−d−α)})² q_j

To show that this value is increasing with α, we show that its first derivative with respect to α is positive. The derivative can be written as λe^{−λα}(1 − q_j(1 − e^{−2λ(1−d−α)})), which is positive owing to the fact that q_j ∈ [0, 1]. Thus, the function within the summation is increasing in terms of α. Using the fact that the summation of increasing functions is increasing, the overall function is thus increasing in terms of α.
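As a sanity check on the two monotonicity claims above, the α-dependent factors can be evaluated numerically. A minimal sketch, assuming the reconstructed expressions (λ is the mutation rate, d the LCA depth, q_j the state probability; the specific parameter values are arbitrary):

```python
import numpy as np

lam, d, qj = 1.0, 0.3, 0.2
alphas = np.linspace(0.0, 1.0 - d - 1e-6, 200)

# alpha-dependent factor of P[Y_i - X_i = -1] (claimed decreasing in alpha)
minus = np.exp(-lam * alphas) * (1 - np.exp(-lam * (1 - d - alphas))) \
        * (1 - (1 - np.exp(-lam * (1 - d - alphas))) * qj)

# alpha-dependent factor of P[Y_i - X_i = 1] (claimed increasing in alpha)
plus = (1 - np.exp(-lam * alphas)) \
       + np.exp(-lam * alphas) * (1 - np.exp(-lam * (1 - d - alphas))) ** 2 * qj

assert np.all(np.diff(minus) <= 1e-12)   # decreasing
assert np.all(np.diff(plus) >= -1e-12)   # increasing
```

The same check passes for other choices of (λ, d, q_j) in the valid ranges, matching the sign computations of the derivatives above.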
General Missing Data Case: Next, we examine the general case with both stochastic and heritable missing data. In this case, we define s(a, b) (analogously s(b, c)) as the number of characters shared by a and b that do not have dropout in any of a, b, or c. As we now simply condition on the fact that a, b, c must all be present, we add an additional (1 − p_d) term to both P[Y_i − X_i = 1] and P[Y_i − X_i = −1]. As this term does not depend on α, both functions depend on α as they do in the case without missing data.
Stochastic-only Missing Data Case: Finally, we examine the case with only stochastic missing data.
Here we define s(a, b) as the number of mutations shared by a and b in characters that did not suffer dropout in either sample. Thus, in analyzing Y_i − X_i we must consider additional cases in which dropout in one cell can hide the fact that two cells inherited the same mutation.
First, we examine P[Y_i − X_i = −1]. Y_i − X_i = −1 for a character i corresponds to that character acquiring the same mutation in b and c, not acquiring dropout in either b or c, and either observing dropout or not observing that mutation in a. For this to occur, a mutation can occur on the path from r to LCA(a, b, c) while a acquires dropout, or the same mutation can occur on the path from LCA(a, b, c) to LCA(a, b) and on the path from LCA(a, b, c) to c while a acquires dropout, or the mutation can occur on the path from LCA(a, b) to b and on the path from LCA(a, b, c) to c while not appearing in a. Summing the probabilities of these cases and taking the terms that depend on α, we show that the resulting value decreases with α by showing that its first derivative with respect to α is negative; the negation of the derivative is positive owing to the fact that q_j ∈ [0, 1] for all j and that p_d ∈ [0, 1). Thus, the function within the summation is decreasing in terms of α. Using the fact that the summation of decreasing functions is decreasing, the overall function is thus decreasing in terms of α.
Secondly, we examine P[Y_i − X_i = 1]. Y_i − X_i = 1 for i corresponds to a mutation occurring in both a and b, not acquiring dropout in either a or b, and either observing dropout or not observing that mutation in c. For this to occur, a mutation can occur on the path from r to LCA(a, b, c) while c acquires dropout, on the path from LCA(a, b, c) to LCA(a, b) while the mutation is not acquired in c or c acquires dropout, or the mutation occurs independently on the paths from LCA(a, b) to a and to b while not appearing in c. Taking only the terms that depend on α, we obtain the same value as above in the case without missing data, and hence the function has the same dependence on α as in that case. Hence, the function is overall increasing with α.
Proof: By lemmas 4 and 5, P[s(a, b) − s(b, c) ≥ t] is an increasing function of α. Thus, the minimum value of α results in the minimum value of P[s(a, b) − s(b, c) ≥ t]. This minimum occurs at α = ℓ*, showing lemma 1.

A.1.3 Proof of lemma 2:
First, we will show that condition i) will hold with probability 1 − ζ if:

0.6kλ(1 − q + qe^{−2λ})ℓ*² / (32(λℓ* + (1 − e^{−λ})q)) ≥ 3 log(n) + log(1/ζ)

To see this, first note that dist(LCA(a, b, c), LCA(a, b)) ≥ ℓ*. By lemma 1, we can WLOG assume that dist(LCA(a, b, c), LCA(a, b)) = ℓ*, because that is the worst case, i.e. the case where P[s(a, b) − s(b, c) ≥ t] is minimized. Any condition sufficient for this case will be sufficient overall. Let Y = s_c(a, b) and X = s_a(b, c).
To bound the probability of both events, it suffices to upper bound P[Y ≤ E(Y) − t/2] and P[X ≥ E(X) + t/2] using the above versions of Hoeffding's inequality, together with the facts that φ(d) ≤ e^{−λd} and that e^{λd}φ(d) = 0.6(1 − q + qe^{−2λ(1−d)}) ≥ 0.6(1 − q + qe^{−2λ}). In order to ensure that both bad events have probability at most ζ/n³, it thus suffices to take k satisfying the condition above. Applying the same argument to s_c(a, b) − s_b(a, c) and combining both results gives the same guarantee for both pairs. Taking a union bound over all (n choose 3) = O(n³) triplets, we see that the probability of condition i) failing for any triplet is at most ζ.
To get guarantees on the second condition, note that condition i) implies that condition ii) holds for all triplets separated by an edge of length at least ℓ*. Thus we can focus on the triplets that are not covered by condition i). Let (a, b|c) be an arbitrary triplet such that dist(LCA(a, b, c), LCA(a, b)) < ℓ* and again let Y = s_c(a, b), X = s_a(b, c) and d be the depth of LCA(a, b, c) (note that we are focusing WLOG on s_a(b, c) since it has the same distribution as s_b(a, c)). We want to show that with high probability, X − Y < t. Again, it suffices to upper bound P[Y ≤ E(Y) − t/2] and P[X ≥ E(X) + t/2] because E(Y) ≥ E(X). Note that we have already bounded the second quantity. To bound the first quantity, note that the worst case scenario is that dist(LCA(a, b, c), LCA(a, b)) is as small as possible. Since in this lemma we make no assumption on the minimal edge length, this quantity can be arbitrarily small, and in the worst case dist(LCA(a, b, c), LCA(a, b)) = 0, which means Y has the same distribution as X. Note that this case technically cannot happen, as it would imply that T has a trifurcating branch, but it is possible to get arbitrarily close to this case with no restrictions on edge lengths. Thus, if we take:

k ≥ max( (96 log n + 32 log(1/ζ))q / (λ²ℓ*²), (96 log n + 32 log(1/ζ))(λℓ* + (1 − e^{−λ})q) / (0.6λℓ*²(1 − q + qe^{−2λ})) )

then we have P[X − Y ≥ t] < ζ/n³. By symmetry, this means

P[s_a(b, c) − s_c(a, b) ≥ t ∪ s_b(a, c) − s_c(a, b) ≥ t] ≤ 2ζ/n³

Since we can union bound over one bad event of probability at most 2ζ/n³ for each of the (n choose 3) triplets, we have that conditions i) and ii) both hold with probability at least 1 − ζ.
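To make the role of each parameter concrete, the two-term lower bound on k can be evaluated directly. A sketch, treating the reconstructed expressions above as given (the function name is an assumption, not the paper's code):

```python
import math

def k_bound(n, zeta, lam, ell_star, q):
    """Evaluate the (reconstructed) lemma 2 lower bound on the number of characters k."""
    logs = 96 * math.log(n) + 32 * math.log(1 / zeta)
    term1 = logs * q / (lam ** 2 * ell_star ** 2)
    term2 = logs * (lam * ell_star + (1 - math.exp(-lam)) * q) \
            / (0.6 * lam * ell_star ** 2 * (1 - q + q * math.exp(-2 * lam)))
    return max(term1, term2)

# k grows like log n, and shrinks as the minimum separation ell_star grows
assert k_bound(2000, 0.1, 1.0, 0.1, 0.2) > k_bound(1000, 0.1, 1.0, 0.1, 0.2)
assert k_bound(1000, 0.1, 1.0, 0.2, 0.2) < k_bound(1000, 0.1, 1.0, 0.1, 0.2)
```

The two assertions reflect the qualitative behavior discussed throughout: a logarithmic dependence on the number of cells n, and an inverse dependence on the worst-case edge separation ℓ*.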

A.1.4 Proof of lemma 3:
Given a triplet (a, b|c) with LCA at depth at most d*, we defined ε to be the probability that no dropout occurs in a particular character in any of the three cells. First we will prove that ε ≥ (1 − p_d)³, which is to say the dropout events are positively correlated. Let p_h be the probability that heritable dropout occurs on the path from r to a. Let p_b be the probability that a heritable dropout occurs on the path from LCA(a, b) to b given that no dropout has occurred on the path from r to LCA(a, b), and define p_c similarly for the path from LCA(a, b, c) to c. Note that p_b ≤ p_c ≤ p_h, since the probability that a dropout occurs along a path increases with the length of the path. Let p_s be the probability that a stochastic dropout occurs at a given character of a leaf given that no heritable dropout has occurred yet on that character. Then we have

ε = (1 − p_h)(1 − p_b)(1 − p_c)(1 − p_s)³ ≥ (1 − p_h)³(1 − p_s)³ = (1 − p_d)³

Since at least one of the cells in the triplet needs to not incur dropout at a character in order for all three of them to have no dropout, we also have ε ≤ 1 − p_d.
To prove the bounds on k we will proceed as in the proof of lemma 2 and assume that dist(LCA(a, b, c), LCA(a, b)) = ℓ*, noting that lemma 5 extends the result of lemma 1 to the general case with missing data. Let Y = s_c(a, b) and X = s_a(b, c), and apply the same Hoeffding bounds as before. Since E(Y) ≥ kεe^{−λd}λℓ* and E(X) ≤ kεe^{−λd}(1 − e^{−λ})²q, we see that both probabilities are at most ζ/n³ when

0.6kελ(1 − q + qe^{−2λ})ℓ*² / (32(λℓ* + (1 − e^{−λ})q)) ≥ 3 log(n) + log(1/ζ)

and since ε ≥ (1 − p_d)³, it suffices that

0.6k(1 − p_d)³λ(1 − q + qe^{−2λ})ℓ*² / (32(λℓ* + (1 − e^{−λ})q)) ≥ 3 log(n) + log(1/ζ)

If both bad events don't happen, then we have Y − X ≥ kελℓ*/2 ≥ k(1 − p_d)³λℓ*/2 = t. This gives the necessary bound for condition i) to hold.
To get guarantees on the second condition, let X = s_b(a, c) for any triplet (a, b|c) whose LCA has depth at most d* and where c is the outgroup. To ensure that the corresponding bad-event probability is at most ζ/n³, it suffices to take

k ≥ (96 log n + 32 log(1/ζ))q / (λ²ℓ*²(1 − p_d)^5)

The rest of the argument is exactly the same as in the proof of lemma 2.

A.1.5 Proof of Theorem 3
To prove Theorem 3, we must first bound the effects of incorrectly inferred mutations at internal nodes. Assume that r′ is an internal node generated in line 5 of the algorithm. Further assume that the tree rooted at r′ is correct (i.e., it is a sub-graph of T). The process of assigning a mutation state to r′ (lines 6-12) may incorrectly include mutations that emerged independently in its two child trees. To account for this, we define P_{i,j}(r′) to be the probability that every leaf beneath r′ has character i mutated to state j, given that character i is not mutated at r′. This probability is bounded according to the following lemma (proof shown later).
lemma 6 Given the constraints on λ, ℓ and q in Theorem 3, it follows that P_{i,j}(r′) ≤ 2λ²cℓq_j² for any internal node r′ ∈ T, character i and state j.
To guarantee that the algorithm has a low probability of making mistakes in forming cherries (line 13), we focus on a pair of "active" nodes u, w in a given iteration of the algorithm (i.e., nodes that are roots of trees in the current set S, defined in lines 2 and 14). We assume that up until this point, the algorithm did not form any incorrect cherries (i.e., all trees in S are sub-graphs of T). It suffices to show that with high probability, if u and w do not form a cherry in T, there must be some internal node v on the path from r′ = LCA(u, w) to u such that s(u, v) > s(u, w) (note that r′ is the LCA in T). If this holds, then if u, w are the first incorrect pair to be merged, there must be some descendant u′ of v such that s(u, u′) ≥ s(u, v) > s(u, w), contradicting the fact that u, w was the most similar pair. Note that u and w can be either leaves or internal nodes.
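The merging step analyzed here can be sketched in a few lines. This is a simplified stand-in for the Bottom-Up Algorithm, not the paper's exact pseudocode: the function name, the similarity score, and the rule of zeroing out characters on which the two children disagree are assumptions made for illustration.

```python
import numpy as np

def bottom_up(leaves):
    """leaves: dict name -> integer character vector (0 = unmutated state)."""
    S = {name: (vec.copy(), name) for name, vec in leaves.items()}
    while len(S) > 1:
        names = list(S)
        # pick the pair of active roots sharing the most mutated states (line 13)
        a, b = max(
            ((x, y) for i, x in enumerate(names) for y in names[i + 1:]),
            key=lambda p: int(np.sum((S[p[0]][0] == S[p[1]][0]) & (S[p[0]][0] != 0))),
        )
        (va, ta), (vb, tb) = S.pop(a), S.pop(b)
        # assign to the new root only the states on which both children agree
        parent = np.where(va == vb, va, 0)
        S[a + b] = (parent, (ta, tb))
    return next(iter(S.values()))[1]

leaves = {
    "a": np.array([1, 2, 0, 3]),
    "b": np.array([1, 2, 5, 3]),
    "c": np.array([1, 0, 0, 4]),
    "d": np.array([0, 7, 0, 4]),
}
tree = bottom_up(leaves)  # nested tuples encoding the inferred topology
```

On this toy input, a and b share three mutated states and are merged first, after which c and d are joined, so `tree` comes out as `(("a", "b"), ("c", "d"))`. The false-positive mutations analyzed via P_{i,j}(r′) correspond to states that survive the agreement rule at a parent despite arising independently in the two child trees.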
Let d be the depth of r′ and let α = dist(r′, v) (if this distance is greater than the corresponding quantity on w's side, we can simply switch the roles of u and w and redefine α to be the distance from r′ to the parent of w). Let Z be the number of mutations that occurred on the path from r′ to v. Note that due to irreversibility, the mutations tallied in Z will be present in all nodes under v, including u and u′. Let X be the number of character assignments that are shared between u and w (assigned by the algorithm per lines 7-11) and that are not present in r′ (per the ground truth). There are two ways in which a character can be assigned state j at u (or w). Firstly, in the ground truth tree, a character can mutate to state j on the path from r′ to u (or w), with probability at most (1 − e^{−λ(α+√(cℓ))})q_j. In that case, the mutation will be present in all downstream leaves, and thus assigned by the algorithm to u (or to w). Secondly, if it did not mutate on that path, it could instead have mutated in enough places in the sub-tree rooted beneath u (or w) such that every leaf in that sub-tree has the mutation at that character. By lemma 6, we see that the probability of this occurring is at most 2λ²cℓq_j². Requiring that in both u and w we have an occurrence of at least one of these scenarios for a given state j, and summing over all states, we obtain a bound on E(X). Let q_max = max_j q_j. Assume E(Z) > E(X) and let Δ = E(Z) − E(X). Again by the above versions of Hoeffding's inequality, we obtain concentration inequalities for Z and X, where the last step uses the fact that the maximum of the function λα + e^{−λα} over α ∈ [0, 1] occurs at α = 1. Now, it remains to find a bound on λ so that Δ > 0 and Δ²/E[Z] is lower bounded. Let C = e^{−λ} + 2λq_max. To ensure that Δ > 0, the final expression of the preceding bound needs to be positive.
Taking this inequality and rearranging terms yields a sufficient condition on λ and q. Note that

λℓ(e^{−λα} + q_max)² / (1 − e^{−λα}) ≤ λℓ(e^{−λℓ} + q_max)² / (1 − e^{−λℓ}) ≈ (e^{−λℓ} + q_max)²

so our sufficient condition can be written in terms of (e^{−λℓ} + q_max)². We then lower bound Δ²/E[Z]; in order for this bound to be positive, we need a corresponding constraint on λ and q, which also satisfies the above condition on Δ and thus implies Δ > 0. To ensure that the probability that Z < E[Z] − Δ/2 or X ≥ E(X) + Δ/2 is at most ζ/n², it suffices for k to satisfy the bound in the statement of Theorem 3. In other words, for any pair of vertices u, w that are not children of the same node, there will be a vertex u′ such that LCA(u, u′) is a descendant of LCA(u, w) and P[s(u, u′) ≤ s(u, w)] ≤ 2ζ/n². If s(u, u′) > s(u, w), then (u, w) cannot be the first pair of incorrectly joined vertices. Taking a union bound over at most n²/2 pairs of vertices, we see that with probability at least 1 − ζ, there is no first pair of incorrectly joined vertices, which means the algorithm is correct.
A.1.6 Proof of lemma 6: For this proof, we use the following results:

lemma 7 Let ρ be the maximum edge length in T. For any character, state pair (i, j), if there exists a number p > 0 which satisfies ((1 − e^{−λρ})q_j + p)² ≤ p, then p is an upper bound on P_{i,j}(v) for any node v ∈ T.
Proof: We will proceed by induction on T. Suppose v is a leaf. Then P_{i,j}(v) = 0, since if the i-th character does not mutate, it cannot take on state j. Now let v be an arbitrary non-leaf vertex, with children u and w. By our inductive hypothesis, P_{i,j}(w) ≤ p and P_{i,j}(u) ≤ p. Since the length of the edge from v to either of its children is at most ρ, the probability that the character mutates to state j on either edge is at most (1 − e^{−λρ})q_j. Thus, if we condition on the fact that i is not mutated at v, each child's subtree can have all of its leaves carry state j either via a mutation on the connecting edge or via the event bounded by the child's P_{i,j}, so that:

P_{i,j}(v) ≤ ((1 − e^{−λρ})q_j + p)² ≤ p

where the last inequality follows from our assumption on p. Given the above lemma, we can simply solve for p to find an upper bound on all P_{i,j}(v) in T.

lemma 8 Suppose that (1 − e^{−λρ})q_max ≤ 3/16. Then we have P_{i,j}(v) ≤ 2λ²ρ²q_j² for any node v ∈ T.

Proof: Let y = (1 − e^{−λρ})q_j. By our above assumption, y ≤ (1 − e^{−λρ})q_max ≤ 3/16. Next, by lemma 7, we know that if there is a p > 0 such that (y + p)² ≤ p, then such a p would be an upper bound on all P_{i,j}(v). We can find such a p by setting the inequality to an equality and finding the smallest root of the resulting polynomial.
Our assumption on ρ guarantees that 4y < 1, which means the smallest root of the polynomial above can be given as

p = ((1 − 2y) − √(1 − 4y)) / 2 = 2y² / ((1 − 2y) + √(1 − 4y)) ≤ 2y²

where the last inequality follows from the fact that y ≤ 3/16, which means that √(1 − 4y) ≥ 1/2 and hence the denominator is at least 1. Finally, we have 2y² = 2(1 − e^{−λρ})²q_j² ≤ 2λ²ρ²q_j². Now simply take ρ = √(cℓ) and apply lemma 8 to show lemma 6.
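Lemmas 7 and 8 reduce to a statement about the smallest root of (y + p)² = p, which is easy to check numerically. A sketch using the closed form above (the helper name is an assumption):

```python
import math

def p_bound(y):
    """Smallest root of (y + p)^2 = p, valid when 4y < 1."""
    return ((1 - 2 * y) - math.sqrt(1 - 4 * y)) / 2

for y in [0.01, 0.05, 0.1, 3 / 16]:
    p = p_bound(y)
    assert (y + p) ** 2 <= p + 1e-12    # p satisfies the fixed-point inequality
    assert p <= 2 * y ** 2 + 1e-12      # lemma 8's bound, valid for y <= 3/16
```

At the boundary y = 3/16 the root is exactly p = 1/16, comfortably below 2y² ≈ 0.0703, illustrating the slack in the bound.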

A.2.1 Alternative Strategy for Regime with only Stochastic Missing Data
Now we assume that missing data only occurs due to technical difficulties in reading out mutation sequences in cells, i.e. that only stochastic missing data occurs. Specifically, for any given cell and any given Cas9 recording site, we assume that the sequence of this site is missing from our data with probability p_s and that this omission occurs independently from all other cells and mutation sites. With this definition, we consider a slightly modified oracle by defining s(a, b) as the number of mutations shared by a and b in characters that did not suffer dropout in either sample. Note that this definition of s(a, b) is independent of a third cell c, unlike in the general case. Accordingly, we revise conditions i) and ii) by setting the decision threshold t to k(1 − p_s)²λℓ*/2. These modified definitions lead to a more relaxed dependency between the mutation rate and the extent of missing data:

lemma 9 In the presence of stochastic missing data at a rate of p_s, condition i) holds with probability at least 1 − ζ if we have the following guarantee on the parameters q, λ, ℓ*, d*, and k:

0.6(1 − p_s)²kλ(1 − q + qe^{−2λ})ℓ*² / (32(λℓ* + (1 − e^{−λ})q + e^{λd*}λd*)) ≥ 3 log(n) + log(1/ζ)

Both conditions i) and ii) hold with probability at least 1 − ζ if k additionally satisfies the analogous condition ii) requirement derived in the proof below. An empirical demonstration of the tightness of lemma 9 using simulations is provided in appendix Figure S4.
Proof: For a triplet (a, b|c) in the case without dropout, any mutation that occurred on the path from the root to LCA(a, b, c) is inherited by each member of the triplet, and thus s(a, b) − s(b, c) = s_c(a, b) − s_a(b, c). But in the case of missing data, this is no longer true, as mutations that occurred before LCA(a, b, c) may be obscured by dropout and therefore not present in the character information of a, b, or c. We must now account for these early mutations in our calculations.
Let p_s be the stochastic missing data rate and let s(a, b) be the number of mutations shared between a and b, ignoring characters that have dropout in either a or b. The number of mutations shared by a and b is now Binomial(k, p) where

p = (1 − p_s)²((1 − e^{−λd}) + e^{−λd}(1 − e^{−λℓ*}) + e^{−λ(d+ℓ*)}(1 − e^{−λ(1−ℓ*−d)})²q)

Here (1 − p_s)² is the probability that this character does not acquire dropout in either a or b, 1 − e^{−λd} is the probability that a given mutation occurred before d, and of the remaining terms the left term is the probability that the mutation occurred on the path from LCA(a, b, c) to LCA(a, b), and the right term is the probability that a given mutation is shared by a and b due to convergent evolution. Thus:

E(Y) = k(1 − p_s)²((1 − e^{−λd}) + e^{−λd}(1 − e^{−λℓ*}) + e^{−λ(d+ℓ*)}(1 − e^{−λ(1−ℓ*−d)})²q)

Similarly:

E(X) = k(1 − p_s)²((1 − e^{−λd}) + e^{−λd}(1 − e^{−λ(1−d)})²q)

Thus, we have that:

E(Y) − E(X) = (1 − p_s)²ke^{−λd}(1 − e^{−λℓ*} + q(e^{−λℓ*} − 1 + e^{−2λ(1−d)+λℓ*} − e^{−2λ(1−d)})) ≥ (1 − p_s)²kλℓ*φ(d)

where the last inequality follows from the fact that 0.6x ≤ 1 − e^{−x} ≤ x for x ∈ [0, 1], again taking φ(d) = 0.6(e^{−λd}(1 − q) + qe^{−λ(2−d)}). We then have, for any triplet (a, b|c) where depth(LCA(a, b, c)) = d and dist(LCA(a, b, c), LCA(a, b)) ≥ ℓ*, that condition i) will hold with probability 1 − ζ under the bound stated in lemma 9. To see this, let (a, b|c) be any triplet at depth at most d* with the distance between its LCAs at least ℓ*. By lemma 1, we can WLOG assume that dist(LCA(a, b, c), LCA(a, b)) = ℓ*, because that is the worst case, i.e. the case where P[s(a, b) − s(b, c) ≥ t] is minimized. Note that lemma 5 extends the result of lemma 1 to the case with only stochastic missing data. Any condition sufficient for this case will be sufficient overall. Let Y = s(a, b) and X = s(b, c); it suffices to have Y > E(Y) − t/2 and X < E(X) + t/2. To show that both occur with high probability, we apply the above versions of Hoeffding's inequality, using the facts that φ(d) ≤ e^{−λd} and

E(X) ≤ (1 − p_s)²k(e^{−λd}(1 − e^{−λ})²q + λd) ≤ (1 − p_s)²k(e^{−λd}(1 − e^{−λ})q + λd)
Since e^{λd}φ(d) = 0.6(1 − q + qe^{−2λ(1−d)}) ≥ 0.6(1 − q + qe^{−2λ}) and d < d* for any triplet under consideration, in order to ensure that both bad events have probability at most ζ/n³, it suffices to take

0.6(1 − p_s)²kλ(1 − q + qe^{−2λ})ℓ*² / (32(λℓ* + (1 − e^{−λ})q + e^{λd*}λd*)) ≥ 3 log(n) + log(1/ζ)

Applying the same argument to s(a, b) − s(a, c) and combining both results gives P[s(a, b) − max(s(a, c), s(b, c)) < t] ≤ 4ζ/n³. Taking a union bound over all (n choose 3) = O(n³) triplets, we see that the probability of condition i) failing for any triplet is at most ζ.
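The algebra collapsing E(Y) − E(X) into a single factored expression is routine but easy to get wrong; a quick numerical check of the reconstruction above (parameter values are arbitrary, subject to ℓ* + d ≤ 1):

```python
import math

lam, d, ell, q, ps, k = 0.8, 0.25, 0.1, 0.2, 0.1, 40

# per-character sharing probabilities for the ingroup pair (Y) and a cross pair (X)
pY = (1 - math.exp(-lam * d)) \
     + math.exp(-lam * d) * (1 - math.exp(-lam * ell)) \
     + math.exp(-lam * (d + ell)) * (1 - math.exp(-lam * (1 - ell - d))) ** 2 * q
pX = (1 - math.exp(-lam * d)) \
     + math.exp(-lam * d) * (1 - math.exp(-lam * (1 - d))) ** 2 * q

diff = (1 - ps) ** 2 * k * (pY - pX)

# factored form of E(Y) - E(X) from the derivation above
factored = (1 - ps) ** 2 * k * math.exp(-lam * d) * (
    1 - math.exp(-lam * ell)
    + q * (math.exp(-lam * ell) - 1
           + math.exp(-2 * lam * (1 - d) + lam * ell)
           - math.exp(-2 * lam * (1 - d)))
)
assert abs(diff - factored) < 1e-12
```

The same identity holds for any admissible (λ, d, ℓ*, q), since the two expressions are equal term by term after expanding the squares.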
To get guarantees on the second condition, note that condition i) implies that condition ii) holds for all triplets separated by an edge of length at least ℓ*. Thus we can focus on the triplets that are not covered by condition i). Let (a, b|c) be an arbitrary triplet such that dist(LCA(a, b, c), LCA(a, b)) < ℓ* and again let Y = s(a, b), X = s(b, c) and d be the depth of LCA(a, b, c) (note that we are focusing WLOG on s(b, c) since it has the same distribution as s(a, c)). We want to show that with high probability, X − Y < t. Again, it suffices to upper bound P[Y ≤ E(Y) − t/2] and P[X ≥ E(X) + t/2] because E(Y) ≥ E(X). Note that we have already bounded the second quantity. To bound the first quantity, note that the worst case scenario is that dist(LCA(a, b, c), LCA(a, b)) is as small as possible, but since this quantity can be arbitrarily small, we can assume that in the worst case dist(LCA(a, b, c), LCA(a, b)) = 0, which means Y has the same distribution as X. Note that this case technically cannot happen, as it would imply that T has a trifurcating branch, but it is possible to get arbitrarily close to this case with no restrictions on edge lengths. Thus, taking k as in the statement of lemma 9, we can union bound over one bad event of probability at most 2ζ/n³ for each of the (n choose 3) triplets, so that conditions i) and ii) both hold with probability at least 1 − ζ.

A.2.2 Alternative Analysis of the Bottom-up Algorithm
Theorem 4 The constraint on λ and q in Theorem 3 can be replaced with the requirement that both terms of the minimum below be positive, where Δ_1 = e^{−λ√(cℓ)} + 2λ²cℓq_max and Δ_2 = (√ℓ + √c + 2λc√ℓ·q_max)². In that case, the Bottom-Up Algorithm returns the correct tree with probability at least 1 − ζ if

k ≥ e^{λ}(20 log n + 10 log(1/ζ)) / min( λℓ(1 − λℓ/2 − 2λΔ_2 q), (1 − e^{−λ})(1 − 2(1 − e^{−λ} + Δ_1)²q) )

Note this means that as ℓ → 0, our constraint on λ and q becomes λq < 1/(2c), and our bound on k approaches e^{λ}(20 log n + 10 log(1/ζ)) up to a constant factor.

Proof of Theorem 4: To upper bound the probabilities of the bad events, it suffices to ensure that E(Z) > E(X) and bound the quantity

Δ²/E(Z) = (E(Z) − E(X))²/E(Z)

Since this quantity depends on α, we will first derive a lower bound on it and determine where the minimum occurs for α ∈ [ℓ, 1]. Let

f(α) = 1 − e^{−λα} − 2(1 − e^{−λ(α+√(cℓ))} + 2λ²cℓq_max)²q

Next we will show that for any interval [a, b], the minimum of f on [a, b] occurs at either a or b. Let y(α) = e^{−λα}, and let g(y) = 1 − y − 2(1 − ye^{−λ√(cℓ)} + 2λ²cℓq_max)²q. Then f = g ∘ y. Note that g′ is linear with negative slope, as g′(y) = −1 + 4(1 − ye^{−λ√(cℓ)} + 2λ²cℓq_max)qe^{−λ√(cℓ)}. Since f′(x) = g′(e^{−λx})·(−λe^{−λx}), f′(x) = 0 only when g′(e^{−λx}) = 0. Since e^{−λx} is a monotone function, there can be at most one point where f′ is 0, and by checking the sign of f′ in the limit one can verify that if there is any x where f′(x) = 0, then f′ is positive on one side of x and negative on the other, which means x must be a local maximum. Thus, any minimum of f on an interval [a, b] can only occur on the boundaries.

α = 1 case: In this case, our lower bound becomes

f(1) ≥ (1 − e^{−λ})(1 − 2(1 − e^{−λ} + Δ_1)²q)

Thus, in order for the bound to be non-trivial, we need 2(1 − e^{−λ} + Δ_1)²q < 1. Note that as ℓ → 0, the resulting constraint on λ approaches λ < ln(1/(1 − 1/(2q))), which is at least 1/(2q), and in general, when λ satisfies the constraint above, f(1) is lower bounded by a constant independent of ℓ. Also, note that the bound is trivial if the term inside the ln is negative.

α = ℓ case: In this case, our lower bound becomes

f(ℓ) ≥ λℓ(1 − λℓ/2 − 2λΔ_2 q)

Then in order for the bound to be non-trivial, we need λℓ/2 + 2λΔ_2 q < 1. Note that as ℓ → 0, the resulting constraint on λ becomes λ < 1/(2qc). Since Δ²/E[Z] ≥ ke^{−λd}f(α), when these constraints are satisfied, and if we take

k ≥ e^{λ}(20 log n + 10 log(1/ζ)) / min( λℓ(1 − λℓ/2 − 2λΔ_2 q), (1 − e^{−λ})(1 − 2(1 − e^{−λ} + Δ_1)²q) )

then the probability that either of the above events occurs is at most ζ/n². In other words, for any pair of vertices u, w that are not children of the same node, there will be a vertex u′ such that LCA(u, u′) is a descendant of LCA(u, w) and P[s(u, u′) ≤ s(u, w)] ≤ 2ζ/n². If s(u, u′) > s(u, w), then (u, w) cannot be the first pair of incorrectly joined vertices. Taking a union bound over at most n²/2 pairs of vertices, we see that with probability at least 1 − ζ, there is no first pair of incorrectly joined vertices, which means the algorithm is correct.

Since the bound on f(ℓ) depends on ℓ, it is in general much smaller than the bound on f(1), which is lower bounded by a constant. Thus, asymptotically, the bound on k is e^{λ}(20 log n + 10 log(1/ζ)) times a constant.

A.3 Additional Simulations
A.3.1 Visualization of the (ℓ*, d*) Oracle: The (ℓ*, d*) oracle presented above attempts to determine the outgroup of a triplet using the difference in the number of shared mutations between its members. In Figure S1 we visualize how well the decision rule holds for correctly and incorrectly resolved triplets. We plot s(a, b) − max(s(a, c), s(b, c)) against dist(LCA(a, b), LCA(a, b, c)) for triplets (a, b, c) on simulated trees. The blue points show the difference between the mutations shared by the ingroup versus the mutations shared by the outgroup, and the orange points show the difference between the mutations shared by the outgroup and the ingroup.
6.6% of triplets are such that s(a, b) − max(s(a, c), s(b, c)) ≤ t with (a, b) as the ingroup, thus violating condition i) for ℓ* = 0. 0.6% of triplets are such that s(a, b) − max(s(a, c), s(b, c)) ≤ t with (a, b) as the outgroup. Note that as s(a, b) − max(s(a, c), s(b, c)) ≤ s(a, b) − s(a, c), requiring that s(a, b) − max(s(a, c), s(b, c)) < t is a slightly weaker condition than condition ii), and thus at most 0.6% of triplets violate it. For the set of parameters used here, t is chosen such that the probability that a triplet is indeterminable is low and the probability that the outgroup is given incorrectly is low, showing that the oracle is relatively accurate in this regime for the number of characters (k = 50). For exact reconstruction of a tree, though, we require that all triplets on that tree be separated by the threshold t.
As dist(LCA(a, b), LCA(a, b, c)) increases, this difference in the number of shared mutations between the ingroup and the non-ingroup pairs grows, making the triplets more separable. This gives us the V-shape in the figure. As the distance increases, the number of triplets that cross the threshold and thus violate either condition decreases, and after a certain point no triplets cross the threshold. This illustrates the result of Theorem 2: since we are only interested in triplets where dist(LCA(a, b), LCA(a, b, c)) > ℓ*, for larger values of ℓ* the bound on the number of characters k needed to separate these triplets decreases.

Figure S1: s(a, b) − max(s(a, c), s(b, c)) for two cases: 1) when (a, b) denotes the true ingroup of the triplet (blue) and 2) when (a, b) denotes one member of the ingroup and the outgroup (orange). The histograms show the density of each case for each triplet along the axes. We also report the percentage of triplets for each case that fail the decision rule (cross the threshold t = kλℓ*/2) for ℓ* = 0.
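The decision rule visualized here is simple to state in code. A minimal sketch of the oracle's test on observed character vectors (function names are assumptions; 0 denotes the unmutated state):

```python
import numpy as np

def shared(u, v):
    """Number of characters carrying the same mutated (nonzero) state in u and v."""
    return int(np.sum((u == v) & (u != 0)))

def resolve_triplet(a, b, c, t):
    """Return the inferred outgroup, or None if no pair clears the threshold t."""
    scores = {"c": shared(a, b), "b": shared(a, c), "a": shared(b, c)}
    out, best = max(scores.items(), key=lambda kv: kv[1])
    runner_up = max(v for k, v in scores.items() if k != out)
    return out if best - runner_up >= t else None

a = np.array([1, 4, 2, 0, 5, 0])
b = np.array([1, 4, 2, 3, 0, 0])
c = np.array([1, 0, 6, 0, 0, 7])
assert resolve_triplet(a, b, c, t=1) == "c"   # (a, b) share 3 states, others share 1
```

Triplets for which `resolve_triplet` returns None are exactly the "indeterminable" ones counted in the figure.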

A.3.2 Validating Empirical k Found in Simulations:
In the simulations determining the empirical necessary k, we perform a coarse analysis by sampling 10 trees and determining if 90% of those trees satisfy the given reconstruction condition at each iteration of a binary search. To further reduce noise, we apply a logistic regression to the proportion correct for each k explored in the binary search and use the first integer k value that exceeds 90% correct according to the fitted curve. We further validate the reported empirical necessary values of k through a finer analysis. For each value of λ and q in each simulation regime and the corresponding necessary k found for those parameters, we simulated additional lineage data sets and calculated the proportion of those data sets that satisfy the reconstruction condition. We then report the difference between this proportion and the expected 0.9. The methods and results are described in Figure S2.
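The coarse search described above can be sketched as a binary search over k against a Monte Carlo success estimate. Here `simulate_success` is a hypothetical stand-in for one lineage-tracing simulation plus reconstruction (the real pipeline also applies the logistic-regression smoothing described in the text, which is omitted here):

```python
import random

def empirical_k(simulate_success, k_lo=1, k_hi=1024, trials=10, target=0.9):
    """Smallest k (by binary search) whose estimated success proportion reaches target."""
    while k_lo < k_hi:
        k = (k_lo + k_hi) // 2
        prop = sum(simulate_success(k) for _ in range(trials)) / trials
        if prop >= target:
            k_hi = k          # enough characters: try fewer
        else:
            k_lo = k + 1      # not enough: need more characters
    return k_lo

# toy stand-in: reconstruction succeeds more often as k grows
rng = random.Random(0)
toy = lambda k: rng.random() < min(1.0, k / 200)
k_star = empirical_k(toy, trials=50)
assert 1 <= k_star <= 1024
```

Because the success estimate is noisy, the returned k fluctuates between runs; this is exactly the noise the validation analysis in Figure S2 quantifies.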
In Figure S2, we see that the difference between the proportions correct in the validation simulations and the expected 0.9 is relatively centered around 0, with almost all negative entries deviating from 0.9 by less than 0.1. Deviations in the negative direction are larger than deviations in the positive direction, but this is somewhat expected, as there is a larger range in the distribution below 0 than above 0. In the case of the Asynchronous simulations, entries with large negative values tend to have high variance between the 10 trees, suggesting that a few outlier trees bring down the mean significantly. Overall, these results suggest that our coarse analysis yields empirical necessary values of k that are stable when expanded to a larger data set and mostly achieve, or are close to achieving, the intended 90% chance of meeting the reconstruction condition.

Figure S2: Validating the Criterion for k Found in Simulation. In the case of the Uniform simulation framework, in which the topology is constant across simulated experiments, 100 lineage-tracing data sets were simulated using the given λ, q, and the corresponding empirical necessary k found for those values in simulations using the given algorithm. The difference between the proportion of data sets meeting the specified reconstruction criterion (exact or partial reconstruction at d* = 0.5, 0.2) and the expected proportion of 0.90 is reported.
In the case of the Asynchronous simulation framework, 10 trees were generated using the Asynchronous framework with 100 lineage-tracing data sets each, with each entry containing the proportion of data sets meeting the reconstruction criterion averaged across the 10 trees for that λ, q, and k. The number in each cell is the variance in the proportion of data sets meeting the criterion across the 10 trees. The k used in each case is the k determined after applying the logistic regression.

A.3.3 Dependence on n:
To compare the asymptotic dependence of k on n in the theoretical bounds with the dependence of the necessary k in the empirical case, we simulate trees of varying size. In Figure S3 we plot the values of k in simulation against the bounds for both algorithms, in a regime where ℓ = 0.5, q = 0.05. From this figure it can be seen that the bounds are tight (within a constant factor) against the empirical values and share the same shape in the dependence of k on n. These results show that the asymptotic dependence of k in the bounds for both algorithms matches the empirical dependence.

Figure S3: Comparing the Dependence of k on n in Theory and Simulation. The minimum necessary k given n for the Threshold Algorithm (left) and the Bottom-Up Algorithm (right), with the theoretical bounds for each case (90% probability of full reconstruction). Simulations performed with the uniform edge length regime for trees of size 2^2, 2^3, ..., 2^11 leaves. For each value of n, edge lengths were re-scaled to 1/(log2(n)+1) to maintain uniform edge lengths. The bounds are rescaled by a constant factor, 265 for the Threshold Algorithm and 25 for the Bottom-Up Algorithm. Point-wise 95% confidence intervals are generated from the regression coefficients using the delta method; see Materials and Methods.

A.3.4 Missing Data:
Here we present simulations for the minimum k that gives a 0.9 probability of exact reconstruction in the case of stochastic missing data only (lemma 9). The simulations are performed with uniform edge lengths and use a proportion p_s = 0.1 of stochastic missing data. Visualizing the bounds for lemma 9 shows that indeed higher values of k are necessary to overcome the lost information (Figure S4A), consistent with the additional (1 − p_d)^2 term in the denominator and the gap term of e^{−λd*}d* when compared to the bounds of Theorem 1. Notably, reconstruction now becomes increasingly intractable for high values of λ, due to the gap term being exponential in λ. In simulation (Figure S4B) we see that the necessary k is higher across the board, especially for values with high λ, matching the theoretical trends.
Surprisingly, unlike in lemma 2, the bound on k in lemma 9 has no asymptotic dependence on q. Taking q to be arbitrarily small (or even q < ℓ/λ) causes the bound in lemma 2 to become O(log n/ℓ), yet an arbitrarily small q leaves the bound in lemma 9 in O(log n/ℓ^2). Examining the difference in the bounds between lemma 9 and lemma 2 (Figure S4C, Top), we see that the difference grows larger in regions where q < ℓ/λ, indicating that the bound changes asymptotically in these regions. This same pattern in the dependence on q is reflected in the empirical results (Figure S4C, Bottom).

Top: Difference in the theoretical bounds on k for a 0.9 probability of perfect reconstruction, with ℓ = 1/9, d* = 1, and p_s = 0.1. Bottom: Difference in the minimum k required for a 0.9 probability of perfect reconstruction in simulation with a cell-division topology with uniform branch lengths, ℓ = 1/9, d* = 1, and p_s = 0.1.

A.3.5 Triplets Correct:
Previous benchmarking works do not use exact reconstruction as a metric for the accuracy of phylogenetic reconstructions. A common, more relaxed metric is the triplets correct metric (or the closely related triples distance), which measures the proportion of sampled leaf triplets that are correctly (incorrectly, in the case of the triples distance) inferred by the reconstructed tree [6, 40, 31]. We present the minimum k necessary in simulation for a high probability of a high (≥ 95%) triplets correct score on uniformly sampled triplets, showing the necessary k when exact reconstruction is not required (Figure S5). We see that the empirically necessary k decreases substantially overall compared to the case of exact reconstruction for both algorithms, showing that these algorithms can perform well in practice with a low number of characters according to traditional standards of accuracy.
Regarding the Threshold Algorithm, the reduction in the necessary k is potentially due to the fact that if triplets are sampled uniformly, most of them will have an LCA close to the root. We saw from the partial reconstruction results that the necessary k decreases across the board as d* (the depth up to which triplets must be correct) decreases. We also see that large values of λ coincide with lower relative k in the triplets case than in the case of full reconstruction, just as in the case of d* ≪ 1. Thus, we can treat the case of uniformly sampled triplets as similar to setting a low d*. Both of these effects, the lower overall necessary k and the lower k for high values of λ, also hold for the Bottom-Up Algorithm in the case of uniformly sampled triplets. This indicates that reconstruction of triplets that diverge at the top of the tree is perhaps easier and less affected by mutation saturation in the case of the Bottom-Up Algorithm as well.

Entries are log10 scaled. Minimum k required for a 0.9 probability of ≥ 0.95 of 500 uniformly sampled triplets being correctly reconstructed in simulation. Top row: Results for the Threshold Algorithm in the case of (from left to right) a uniform edge-length topology without missing data (ℓ = 1/9), a uniform edge-length topology with missing data (ℓ = 1/9, p_s = 0.1), and an asynchronous topology (ℓ = 0.05). Bottom row: Results for the Bottom-Up Algorithm in the case of (from left to right) a uniform edge-length topology (ℓ = 1/9) and an asynchronous topology (ℓ = 0.05).
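The claim that uniformly sampled triplets tend to have an LCA close to the root can be checked with a quick Monte Carlo sketch; the perfect binary tree, integer leaf indexing, and function name are our simplifying assumptions, not the paper's simulation code:

```python
import random

def lca_depth(leaves, depth):
    """Depth of the LCA of a set of distinct leaves in a perfect binary
    tree of the given depth. Leaves are indexed 0..2^depth - 1, so the
    LCA depth equals the length of the longest common prefix of the
    leaves' binary paths from the root."""
    d = 0
    while d < depth:
        # The top (depth - d) bits identify the subtree at level d + 1.
        halves = {leaf >> (depth - d - 1) for leaf in leaves}
        if len(halves) > 1:
            return d
        d += 1
    return d

random.seed(0)
depth = 10                 # 2^10 = 1024 leaves
n = 2 ** depth
samples = [lca_depth(random.sample(range(n), 3), depth)
           for _ in range(10_000)]
# Roughly 3/4 of uniformly sampled triplets have their LCA at the root,
# and deeper LCA levels decay geometrically.
frac_root = sum(s == 0 for s in samples) / len(samples)
```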

A.4 Additional Implementation Details:
A.4.1 Threshold Algorithm:
Due to ties in the number of shared characters, the edge-removal procedure sometimes produces more than two connected components on the sample graph G. If this occurs, we enforce a bifurcation in the tree by merging the components C_1, C_2, ..., C_n into two groups. We expand the previous pseudocode as follows:

1: procedure SplitSamples(V)
2:     G ← complete graph over V
3:     while G is connected do
4:         (u*, v*) = argmin_{(u,v) ∈ E} s(u, v)

        ...
12:     return binary tree with T_1 and T_2 as children of the root
13: end procedure
14:
15: procedure MergeComponents(C_1, C_2, ..., C_n)
        ...
        return GroupA, GroupB
26: end procedure

Here, we use a naive parsimony approach. We infer the character states of the LCA of each component as the most parsimonious states given the states of the leaves in that component (a mutation appears in the LCA of a component only if it is shared by all samples in that component, discounting samples with missing data at that character). Then, the mutation shared by the most LCAs is found. Heuristically, this mutation occurred early in the phylogeny, and thus components sharing it are more closely related. We therefore separate the components into two groups based on whether or not each LCA has this mutation.
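The parsimony-based merging heuristic can be sketched as follows. This is a simplified stand-in, not the Cassiopeia implementation; the encoding (each component as a dict of samples to lists of integer character states, 0 for unmutated, -1 for missing) is our assumption:

```python
from collections import Counter

MISSING, UNMUTATED = -1, 0  # state encoding assumed for this sketch

def infer_lca_states(component):
    """Most-parsimonious LCA states for one component: a mutation is
    placed at the LCA only if every sample in the component carries it,
    discounting samples with missing data at that character."""
    num_chars = len(next(iter(component.values())))
    lca = []
    for i in range(num_chars):
        observed = {states[i] for states in component.values()
                    if states[i] != MISSING}
        lca.append(observed.pop() if len(observed) == 1 else UNMUTATED)
    return lca

def merge_components(components):
    """Split components into two groups by the (character, state)
    mutation shared by the most component LCAs; heuristically, that
    mutation occurred early in the phylogeny."""
    lcas = [infer_lca_states(c) for c in components]
    counts = Counter((i, state)
                     for lca in lcas
                     for i, state in enumerate(lca)
                     if state != UNMUTATED)
    (char, state), _ = counts.most_common(1)[0]
    group_a = [c for c, lca in zip(components, lcas) if lca[char] == state]
    group_b = [c for c, lca in zip(components, lcas) if lca[char] != state]
    return group_a, group_b
```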
This algorithm is implemented as the "PercolationSolver" class in the "solver" module of the Cassiopeia codebase. Here, the default arguments are used, with "joining_solver" specified as an instance of the "VanillaGreedySolver" class.

A.4.2 Bottom-Up Algorithm:
The implementation of the Bottom-Up Algorithm follows the description in the main text. If a tie occurs in the number of shared mutations between nodes, then an arbitrary pair is chosen to be merged first.
This algorithm is implemented as the "SharedMutationSolver" class in the "solver" module of the Cassiopeia codebase. Here, the default arguments are used.

A.4.3 Simulating the Asynchronous Cell-Division Topology
Note that, in the trees generated by this process, sister nodes need not have the same edge length, and the root has a singleton edge (as in the binary case) along which mutations can occur. We chose rates for the division and death waiting-time distributions (23.70 and 2.12, respectively) that gave an average of around 256 leaves over 1000 simulations. These rates were chosen by assuming that the rate of division was 10 times the rate of death, and then correcting the death rate to increase the mean waiting time by a = 0.05, matching the shift in the mean of the distribution of division times.
Due to the stochastic nature of this division process, we cannot exactly control for ℓ if we stop the experiment at a specified time. This is because edges at the leaves of the tree may be very short if the stopping criterion is reached before their length can reach a, making the minimum edge length in the tree potentially smaller than a. We contend, though, that these edges should not impact the analysis. Short edges make it difficult to discern which neighboring clades are actually more closely related, but they occur only at the bottom of the tree and affect only the edge lengths leading to single leaves, which are trivially discerned as cherries with their neighbors. Hence ℓ would still effectively be 0.05 in this case.
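A minimal sketch of a birth-death process of this kind, assuming shifted-exponential waiting times with the shift a = 0.05 applied to both division and death; the rates and shift are from the text, but the function name, the recursion, and returning only a leaf count (rather than building the tree) are our simplifications, so this sketch need not reproduce the reported average of roughly 256 leaves:

```python
import random

# Rates and shift from the text; everything else here is a simplification.
DIV_RATE, DEATH_RATE, SHIFT = 23.70, 2.12, 0.05

def simulate_lineage(t=0.0, stop=1.0, rng=random):
    """Recursively simulate one lineage: waiting times to division and
    death are shifted exponentials, whichever event comes first wins,
    and the experiment halts at time `stop`. Returns the number of
    leaves (surviving cells) under this lineage."""
    t_div = t + SHIFT + rng.expovariate(DIV_RATE)
    t_death = t + SHIFT + rng.expovariate(DEATH_RATE)
    if min(t_div, t_death) >= stop:
        return 1        # cell survives to the end of the experiment
    if t_death < t_div:
        return 0        # lineage dies out before dividing
    # Division: two independent daughter lineages starting at t_div.
    return simulate_lineage(t_div, stop, rng) + simulate_lineage(t_div, stop, rng)

random.seed(0)
sizes = [simulate_lineage() for _ in range(200)]
```

Because every event advances time by at least a = 0.05, the recursion depth over the unit-time experiment is bounded, which is also why leaf edges shorter than a arise only when the stopping time cuts a waiting period short.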

        ...
        return FALSE
16:     end if
17: end procedure

Next we prove its correctness:

Claim: All triplets whose LCA is at depth < d in T are resolved correctly in T′ iff for every node n at depth < d in T there exists a node n′ in T′ whose daughter clades partition the set of leaf descendants of n in the same way.
Proof: (if) For a triplet (a, b|c) whose LCA is the node n, a and b must be in the clade of one daughter of n and c in the other. If there exists a node n′ in T′ that partitions the leaf nodes into the same two clades, then for each triplet whose LCA is n, a and b will be grouped together in the same clade with c on the other side; hence every such triplet will be resolved correctly. If such an n′ exists for each n at depth < d, then all triplets with LCA at depth < d will be resolved correctly.

(only if) If there is a node n at depth < d in T with no node n′ in T′ inducing an analogous partition, this implies that some partition of the leaves in T′, starting from the root and at or above the depth of n, does not match the corresponding partition in T, as if all partitions were correct then n′ would have to exist. If there is a non-matching partition, there is some partition {a_1, ..., a_m} | {b_1, ..., b_n} in T where in T′ at least one member of one of the partitioned sets is incorrect: {a_1, ..., a_{m′}, b_1, ..., b_{n′}} | {a_{m′+1}, ..., a_m, b_{n′+1}, ..., b_n}. Then in T′, WLOG, b_1 is closer to some a_i than to some b_j, and all such triplets involving b_1 and a_i are incorrect in T′. As the node with the non-analogous partition is at depth < d in T, some triplets whose LCA is at depth < d in T are incorrectly resolved in T′.
The algorithm will find n′ for every n at depth < d by matching it to a node in T′ that has the same partition. If the algorithm does not find n′ for a certain n at depth < d, then n′ does not exist: if all partitions checked by the algorithm match up to and including n in T′, then n′ must exist, and the given partition cannot be formed lower in the tree, as leaf-descendant sets cannot gain members down the tree.
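The partition-matching check that this claim justifies can be sketched as follows, representing binary trees as nested tuples with string leaves (our encoding, not the paper's); for brevity the sketch compares all internal nodes rather than restricting to depth < d:

```python
def clades(tree):
    """Collect, for each internal node of a nested-tuple binary tree,
    the unordered pair of leaf sets under its two daughters. Leaves are
    strings; internal nodes are pairs (left, right)."""
    partitions = []
    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        left, right = (leaves(child) for child in node)
        partitions.append(frozenset([left, right]))
        return left | right
    leaves(tree)
    return partitions

def partitions_match(t, t_prime):
    """TRUE iff every daughter-clade partition induced by a node of t
    also appears at some node of t_prime (the depth-< d restriction
    from the claim is omitted for brevity)."""
    return set(clades(t)) <= set(clades(t_prime))
```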

Triplets Correct:
Additionally, we report the necessary k for a 95% triplets correct score in simulation. The triplets correct score is determined by sampling 1000 triplets uniformly from the ground-truth tree and counting the proportion of triplets resolved accurately in the reconstructed tree.
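As a sketch, this score could be computed as follows; the nested-tuple tree encoding and function names are our assumptions, not the paper's implementation:

```python
import random

def triplet_topology(tree, a, b, c):
    """Return the pair of leaves grouped together under the LCA of
    {a, b, c} in a nested-tuple binary tree with string leaves."""
    def leaves(node):
        if isinstance(node, str):
            return {node}
        return set().union(*map(leaves, node))
    def resolve(node):
        for child in node:
            under = leaves(child) & {a, b, c}
            if len(under) == 3:
                return resolve(child)   # LCA is deeper: descend
            if len(under) == 2:
                return frozenset(under)  # this pair splits off from the third
        return None
    return resolve(tree)

def triplets_correct(true_tree, inferred_tree, n_samples=1000, rng=random):
    """Proportion of uniformly sampled leaf triplets whose grouping
    matches between the ground-truth and reconstructed trees."""
    def leaf_list(node):
        if isinstance(node, str):
            return [node]
        return [x for child in node for x in leaf_list(child)]
    pool = leaf_list(true_tree)
    hits = 0
    for _ in range(n_samples):
        a, b, c = rng.sample(pool, 3)
        hits += (triplet_topology(true_tree, a, b, c)
                 == triplet_topology(inferred_tree, a, b, c))
    return hits / n_samples
```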