Testing for dependence on tree structures.

Tree structures, which capture hierarchical relationships and latent group structure between samples, are ubiquitous in the genomic and biomedical sciences. A common question in many studies is whether there is an association between a response variable measured on each sample and the latent group structure represented by a given tree. Currently, this is addressed on an ad hoc basis, usually requiring the user to decide on an appropriate number of clusters to prune out of the tree and test against the response variable. Here, we present a statistical method that tests for association between the response variable and a fixed tree structure across all levels of the tree hierarchy with high power, while controlling the overall false positive rate. This enhances the robustness and reproducibility of such findings.


Methods
For the following, we need two additional pieces of notation. For an inner node i ∈ VI, the two (direct) offspring nodes i1, i2 are denoted by off(i). For an inner node i ∈ VI, the nodes of T(i) (recall Figure S28) without i are denoted by OFF(i). An illustrative example of this notation is shown in Figure S1. This means that if l ∈ OFF(m) is an active node of p̂m, the solution p̂l of problem Pl equals p̂m on the offspring leaves of l. Thus, solving Pm only amounts to finding the position(s) of the last active node(s). Here, an active node is last if there is no other active node which is its ancestor. Employing this DP scheme, the estimator p̂ can be computed by successively solving the subproblems Pm, starting at the leaves and ending at the root m = 2n − 1, with final solution p̂ = p̂2n−1. In that way, one makes use of overlapping subproblems without having to run exhaustively over all possible O(n^n) tree segmentations. For each of the O(n) subproblems Pm, inner node m has at most O(n) offspring nodes. Thus, in the worst case, solving Pm amounts to considering O(n^k̂) configurations of last active nodes, where for each the multiscale constraint can be checked in O(n) time (1). To this end, note that k̂(m) ∈ {k̂(i1) + k̂(i2), k̂(i1) + k̂(i2) + 1}, where {i1, i2} = off(m) are the two direct offspring nodes of m and k̂(m) denotes the estimated number of active nodes in subproblem Pm. In total, this gives a worst-case run time of O(n^(k̂+2)). However, by keeping track of the individual confidence sets C^m_{1−α} of problem Pm, in practice only a small fraction of the n^k̂ possible segmentations has to be considered. This usually renders the actual runtime much faster and makes this DP scheme very efficient in practice.
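As a rough illustration of this bottom-up DP scheme (not the actual treeSeg algorithm: the multiscale constraint is replaced here by a simple per-segment penalty `gamma`, and the nested-tuple tree encoding is an assumption of the sketch), each subproblem either keeps its two children's segmentations or collapses the whole subtree into one constant segment:

```python
def sse(vals):
    """Sum of squared deviations from the segment mean."""
    m = sum(vals) / len(vals)
    return sum((v - m) ** 2 for v in vals)

def solve(node, gamma):
    """Bottom-up DP over a binary tree given as nested tuples:
    ('leaf', index, value) or ('node', left, right). Returns
    (cost, leaf_values, segmentation), where a segmentation is a list of
    leaf-index lists. Each subproblem either keeps the children's
    segmentations or collapses the subtree into one constant segment,
    so every subproblem is solved exactly once in a postorder pass."""
    if node[0] == "leaf":
        return gamma, [node[2]], [[node[1]]]          # sse of one point is 0
    c1, v1, s1 = solve(node[1], gamma)
    c2, v2, s2 = solve(node[2], gamma)
    vals = v1 + v2
    one_segment = sse(vals) + gamma                   # collapse subtree
    if one_segment <= c1 + c2:
        return one_segment, vals, [sorted(i for seg in s1 + s2 for i in seg)]
    return c1 + c2, vals, s1 + s2                     # keep children's solution

left = ("node", ("leaf", 0, 0.0), ("leaf", 1, 0.0))
right = ("node", ("leaf", 2, 1.0), ("leaf", 3, 1.0))
cost, _, segments = solve(("node", left, right), gamma=0.1)
# segments == [[0, 1], [2, 3]]: the two constant sub-branches are recovered
```

The overlapping-subproblem structure is the same as in the text: the root's solution reuses the children's solutions instead of enumerating all segmentations.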

B. Extensions for continuous and other discrete distributions.
The treeSeg methodology is not restricted to binary response variables and can be applied to various other types of continuous and discrete distributions. When the response distribution belongs to a one-parameter distributional family, that is, one observes independently Yi ∼ F_{θi}, where F_θ belongs to a distributional family with parameter θ, treeSeg can segment with respect to θ by simply replacing the log-LR tests T_i^j in Eq. 2 in the main text by the one for the testing problem H: θi = . . . = θj = θ̄|[i,j] vs. K: θi = . . . = θj ≠ θ̄|[i,j].
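For intuition, in the Gaussian case with known variance the local log-LR statistic for testing that an interval has a given constant mean θ0, against the best-fitting constant mean, has a simple closed form (a standard computation; the exact normalization used inside treeSeg's multiscale statistic may differ):

```python
def gauss_loglr(y, theta0, sigma=1.0):
    """Log-likelihood ratio for H: the common mean of all observations in y
    equals theta0, vs. the MLE constant mean ybar (i.i.d. Gaussian with
    known standard deviation sigma). Equals n * (ybar - theta0)^2 / (2 sigma^2)."""
    n = len(y)
    ybar = sum(y) / n
    return n * (ybar - theta0) ** 2 / (2 * sigma ** 2)

gauss_loglr([1.0, 1.0, 1.0, 1.0], 0.0)   # = 2.0
```

Replacing the Bernoulli log-LR by such a test on every interval is what the extension above amounts to.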
The respective multiscale statistic Tn in Eq. 2 in the main text still converges in distribution, with a limit bounded by M in Eq. 4 in the main text, under very weak assumptions (in particular, whenever F_θ belongs to an exponential family); hence, all confidence statements remain valid (1, 2). The current treeSeg R package provides an implementation for continuous traits under a Gaussian distribution assumption. Figure S2 shows an example similar to the one in the main text, but with normally distributed response. We are planning to add implementations for various other distributional families, such as exponential and Poisson, to the treeSeg R package in the future. Moreover, a multiscale procedure for quantile segmentation of arbitrary independent traits was proposed recently (3), and we are also planning to implement this for treeSeg in the future.

C. Proof of Theorem 2.
In the following we outline the necessary modifications of the proof of Theorem 2 for tree-structured signals p ∈ S compared to totally ordered signals (1). In the change-point detection problem for totally ordered structures (1), the number of active nodes corresponds to the number of change-points of the underlying signal. There, one shows the underestimation bound as in Theorem 2 as follows: Consider a true underlying signal with k change-points, with each of the k + 1 constant segments having length at least λn. For each of the k change-points, consider an interval Ii, i = 1, . . . , k, of length λn centered around this change-point. Note that those k intervals are disjoint and on each of them the true underlying signal makes a jump of size at least δ. Now, when an estimated segmentation has fewer than k change-points, this implies that for at least one of those k intervals Ii, the estimator is constant, taking some value c ∈ R. This implies that the two multiscale tests on the left and on the right half of Ii, with a true signal difference of at least δ, both accepted the constant value c.
It follows from power considerations of the LR test that this event has probability vanishing exponentially in n (for fixed λ, δ, α). In our setting, k corresponds to the number of active nodes, which, in general, can be different from the number of change-points in the leaf signal (pi), i = 1, . . . , n. Note that, depending on the location of the active node v ∈ V(p), it may induce either one or two change-points in p. In order to apply the same proof as in the totally ordered case (1), for a fixed underlying signal p ∈ S, we have to find intervals I1, . . . , Im such that a) p is non-constant on each Ii, b) p's constant segments in Ii have minimal length Cλn for some constant C, and c) any other p̃ ∈ S with fewer than k active nodes, that is, k(p̃) < k, is constant on at least one of the intervals I1, . . . , Im.
For p ∈ S, one can construct regions I1, . . . , Im which fulfill a)-c) as follows. To this end, formally write the root v0 = 2n − 1 as an additional active node. Further, let C be some sufficiently small constant (which may depend on λ).
2. Moreover, for each pair of true active nodes vi ≠ vj ∈ V(p), i, j = 0, . . . , k, select a union of two intervals I_{vi,vj}.

It is easy to see that these intervals satisfy a) and b). In the following we show by induction that these intervals also satisfy property c). The base case, k = 1, follows trivially. Now, as induction hypothesis, assume c) holds true for all k = 1, . . . , κ − 1. Let v1, . . . , vκ be the true active nodes of some underlying signal p ∈ S. Assume there exists some p̃ with active nodes ṽ1, . . . , ṽκ−1 such that p̃ is non-constant on every interval Ii, I_{vi,vj}. We define the influence region IR of a true (respectively, estimated) active node vi (respectively, ṽi) as those leaf nodes which are direct offspring of vi (respectively, ṽi), that is, which have the same success probability as vi (respectively, ṽi). See Figure S4 for an example. We say that an estimated active node ṽi is related to a true active node vi if the influence regions of ṽi and vi intersect, that is, IR(ṽi) ∩ IR(vi) ≠ ∅. See Figure S3 for an example. That is, vi is either a direct offspring of ṽi (without other estimated active nodes in between) or ṽi is a direct offspring of vi (without other true active nodes in between). Assume that there exists one estimated active node ṽi which is related to exactly one true active node vj, j = 1, . . . , κ. Then, for the reduced problem with true p ∈ S_{δ,λ} having active nodes {v1, . . . , vκ} \ {vj}, there exists p̃ with active nodes {ṽ1, . . . , ṽκ−1} \ {ṽi} such that p̃ is non-constant on all the intervals I1, . . . , Im of p. This is a contradiction to the induction hypothesis. The same contradiction follows when there exists one estimated active node ṽi which is related to no true active node. Consequently, we deduce the following.
• Every estimated active node ṽj is related to two true active nodes v_{j1}, v_{j2}, j1, j2 ∈ {1, . . . , κ}.

Now, consider some estimated active node ṽj which has no other offspring estimated active nodes. Let v_{j1} and v_{j2} be two related true active nodes of ṽj. If v_{j1} and v_{j2} are both offspring of ṽj, it follows that p̃ is constant on the interval Ii which connects v_{j1} and v_{j2}, which is a contradiction. If v_{j1} and v_{j2} are both ancestors of ṽj, it follows that ṽj is not related to one of v_{j1} and v_{j2}, which is also a contradiction. Consequently, without loss of generality, one can assume that v_{j1} is an ancestor of ṽj and v_{j2} is an offspring of ṽj, with j1, j2 = 1, . . . , κ. Applying the same argument successively to all estimated active nodes, starting at the leaves and ending at the root, one obtains the following.
• All estimated active nodes have both an offspring and an ancestor true active node.
In particular, it also follows that ṽj cannot be related to the root active node v0 (as it is an offspring of v_{j1}), which yields the following.
• Every true active node vi is related to at least one estimated active node ṽj.
To see this, note that otherwise p̃ would be constant on the region which connects vi and v0. Now, consider some true active node v_{i1} which has no other offspring true active nodes. Let ṽi be its related estimated active node. Note that ṽi has to be an ancestor of v_{i1}, as otherwise ṽi would have no offspring true active node. Further, let v_{i2} be an ancestor true active node of ṽi. As p̃ is non-constant on the interval Ii which connects v_{i1} and v_{i2}, it follows for the reduced problem with true signal p ∈ S_{δ,λ} having active nodes {v1, . . . , vκ} \ {v_{i1}} that there exists p̃ with active nodes {ṽ1, . . . , ṽκ−1} \ {ṽi} which is non-constant on all the intervals I1, . . . , Im of p. This is a contradiction to the induction hypothesis and finishes the induction step.

D. Comparison of estimation rates for tree structures vs. totally ordered change-point structures.
The treeSeg procedure incorporates into its analysis a specific ordering of the leaf nodes i = 1, . . . , n that is induced by the tree T. If one were only interested in the change-points of the leaf nodes that correspond to this ordering*, one could also just apply a standard change-point procedure to the ordered observations Y1, . . . , Yn, without taking any further information of the tree T into account.
A natural question is how the statistical change-point estimation rates of treeSeg compare to those of a simple change-point procedure. In particular, one would hope that the additional information from the tree T (beyond just such a simple ordering of the leaf nodes) improves the statistical estimation accuracy of the segmentation. In the following we provide two results which give insight into this question:
• First, we show that without making any further assumptions on the tree T, the statistical segmentation rate of order 1/n (up to the log(n) factor) that we show for treeSeg in Theorem 4 cannot be improved in general, see Theorem 5.
• Second, we show that for specific sub-classes of trees (see Definition 1 below) treeSeg does overcome the 1/n rate and even yields a perfect segmentation with high probability, see Theorem 6.
The minimax optimal change-point estimation rate for an independent Bernoulli signal with piecewise constant success probabilities is of order 1/n, see (1, 4). Thus, the first result, Theorem 5, shows that, for an arbitrary tree T, the additional tree information beyond the leaf ordering cannot improve the segmentation rate in general. On the other hand, the second result, Theorem 6, shows that when the tree structure is sufficiently regular, the 1/n lower bound can, indeed, be improved. Hence, treeSeg efficiently exploits the full information of the tree T and outperforms regular change-point estimation procedures also in terms of segmentation rates in those cases where the tree structure allows this. The overall improved segmentation performance of treeSeg compared to simple change-point procedures, such as (1), can also be observed empirically. We demonstrate this in a small simulation experiment in Figure S10. In the following, let τ1, . . . , τK ∈ {1, . . . , n} be the change-points of the true signal p, that is, p_{τi} ≠ p_{τi+1}, let τ̂1, . . . , τ̂_K̂ ∈ {1, . . . , n} be the change-points of p̂, and let d(p, p̂) denote the accuracy of the estimated change-point locations. Recall that Theorem 4 in the main text implies that for every change-point of p there exists a change-point of p̂ within a neighborhood of order at most log(n). More precisely, from Theorem 4 we obtain that for any fixed minimal scale λ, minimal success probability difference δ, confidence level α, and p ∈ S_{λ,δ}, it holds true that d(p, p̂) ≤ C3 log(n)/n (Eq. 2) with probability at least 1 − C1 e^{−C2 n}, where C1, C2, C3 are positive constants that only depend on α, λ, δ. Also, recall that it follows from Theorem 2 in the main text that k̂ = k with high probability whenever α is sufficiently small. The following theorem shows that this upper bound in Eq. 2 cannot be improved for a general tree T, except possibly up to the log(n) factor. To this end, let S_{λ,δ}(k) denote the set of signals p ∈ S_{λ,δ} with k active nodes.

Theorem 5.
For fixed minimal scale λ ∈ (0, 1) and minimal success probability difference δ ∈ (0, 1), it holds true that there exists a tree T and a signal p ∈ S_{λ,δ}(k) such that E_p(d(p, p̂)) ≥ C/n, where C is a positive constant that only depends on δ.

* Note that the change-points do not uniquely determine the active nodes, as an active node may induce one or two change-points depending on its location; thus, different numbers of active nodes can induce the same change-points.

Theorem 5 shows that, up to constants and possibly the log factor, it is not possible to improve the log(n)/n rate as in Eq. 2 obtained from Theorem 4. The proof of Theorem 5 follows directly from the observation that there always exists a tree (such as in Figure S5) which does not impose any additional restriction on the segmentation besides the overall leaf ordering. That means the totally ordered change-point setting is just a special case within the class of all tree structures, and hence any lower bound for the change-point setting also applies to general tree structures. For the sake of completeness, we give the explicit proof of Theorem 5 below.
Proof of Theorem 5. Let Tn denote a tree with n leaf nodes as in Figure S5 and assume that k = 1, that is, there is a single active node v which segments the leaf nodes into two segments. Note that the signal space S_{λ,δ}(k) depends on the given tree T, so that we may write S_{λ,δ}(k, T). Let v̄ and ṽ be two inner nodes in Tn such that v̄ is a direct offspring of ṽ, that is, v̄ ∈ off(ṽ), see Figure S5. Let p̄ and p̃ be success probability signals of length n with single active node v̄ and ṽ, respectively, as in Figure S5.
Clearly, for any arbitrary signal p with a single active node v (recall that we assume k = 1), it must hold true that the corresponding change-point of p has distance at least 1 to the change-point of p̄ or to that of p̃. Let i be the single leaf index in T which is an offspring of ṽ but not of v̄, see Figure S5. Let f_p̄(Y) and f_p̃(Y) denote the data likelihood under the signals p̄ and p̃, respectively. A direct computation of the likelihood ratio of f_p̄ and f_p̃ then yields the assertion. In the following we complement Theorem 5 with a result of a more positive flavor. Although the tree structure cannot improve the segmentation rate in general, when one imposes additional assumptions on the tree, treeSeg is able to overcome the 1/n lower bound and even obtains a perfect segmentation with high probability.

Definition 1. Let γ, λ ∈ (0, 1), k ∈ N. We call a binary tree T with n leaf nodes and root v0 = 2n − 1 γ-spreading at minimal scale λ if for any collection of inner nodes v1, . . . , vk ∈ VI that induce a segmentation with minimal scale λ, there exist disjoint intervals I^1_1, I^2_1, . . . , I^1_m, I^2_m ⊂ {1, . . . , n} of minimal length γλn such that
1. for all i = 1, . . . , m, l = 1, 2, the interval I^l_i belongs to the influence region of exactly one node vj, that is, I^l_i ⊂ IR(vj) for some j = 0, . . . , k,
2. for all i = 1, . . . , m, the intervals I^1_i and I^2_i belong to influence regions of different nodes v_{j1}, v_{j2}, j1, j2 = 0, . . . , k,
and for any other inner nodes {ṽ1, . . . , ṽk} that induce a different segmentation there exist an interval index i = 1, . . . , m and a node index j = 0, . . . , k such that I^1_i and I^2_i are both in the influence region of ṽj.

Definition 1 captures precisely what is needed for treeSeg to obtain a perfect segmentation with high probability, uniformly among all possible active nodes in the tree T, given a minimal scale λ.
The following example shows that γ-spreading trees can be seen as a generalization of perfect trees, see Figure S6.

Example 1. (Perfect trees) A simple example of a γ-spreading tree is a binary perfect tree, that is, a tree in which all internal nodes have two children and all leaf nodes are at the same level. Consider the case k = 1. Let v = v1 ∈ VI be some inner node of T which induces a segmentation of minimal scale λ, and assume that v is at level L, see Figure S6 for an example. Assume that L ≥ 2 (the case L = 1 is similar). Let vo, v′o be the two direct offspring nodes of v at level L + 1, that is, {vo, v′o} = off(v). Let vo1 ∈ off(vo) and vo2 ∈ off(v′o) be two offspring nodes of vo and v′o at level L + 2, respectively, see Figure S6. Let va be the direct ancestor node of v and let vs be the (unique) sibling node of v, that is, {v, vs} = off(va). Let v′s ∉ {v, vs} be any other inner node at level L. Note that, because the tree T is a perfect binary tree, the corresponding intervals I^l_i can be chosen as in Figure S6. Clearly, the intervals I^l_i fulfill conditions 1. and 2. of Definition 1. Let ṽ ≠ v ∈ VI be any other inner node of Tn. We distinguish three different cases.
• If ṽ is not an offspring of va, then I^1_1 and I^2_1 are both either in the influence region of ṽ or both in that of the root v0.
• If ṽ is an offspring of v, say ṽ is equal to or an offspring of vo (or of v′o, respectively), then I^1_2 and I^2_2 (or I^1_1 and I^2_1, respectively) are both in the influence region of the root v0.
• If ṽ is equal to or an offspring of vs, then I^1_2 and I^2_2 are both in the influence region of the root v0 = 2n − 1.
This shows that a perfect tree T is 0.25-spreading for a single active node k = 1.
The proof of Theorem 6 follows directly from the definition of γ-spreading (see Definition 1) together with power considerations for the LR test, analogous to the discussion in the proof of Theorem 2, see Section C. Theorem 6 shows that when the tree T is sufficiently regular (as in the definition of γ-spreading, of which perfect trees are a special case), then with high probability treeSeg estimates the locations of the active nodes exactly, overcoming the 1/n lower bound from the simple change-point setting (without tree structure).
Moreover, treeSeg's confidence sets C1−α for the active nodes directly capture this exact recovery property. That is, when a tree satisfies the conditions of Theorem 6, then with high probability |C1−α| = k̂ = k, that is, C1−α will contain only the true active nodes, C1−α = {v1, . . . , vk}, so that exact recovery is directly obtainable from treeSeg's output.

E. Confidence statements with random branch swapping.
Recall that the treeSeg algorithm does, in general, depend on the specific ordering of the leaves induced by the tree structure. This ordering, however, is not uniquely determined by the tree itself. For every inner node of the tree one can swap the two offspring nodes to generate a new ordering of the leaves which represents the tree equally well.
Our theoretical guarantees hold true for any (arbitrary) such ordering, as long as this ordering was chosen a priori, independently of the data Y. In a simulation study in Section B we also show that treeSeg's overall segmentation remains reasonably stable under random swapping of tree branches.
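A minimal illustration (a toy nested-tuple encoding, not treeSeg code) of why swapping the two offspring of a node changes the leaf order while the tree itself is unchanged:

```python
def leaves(t):
    """Left-to-right leaf order of a tree encoded as nested pairs."""
    if not isinstance(t, tuple):
        return [t]
    return leaves(t[0]) + leaves(t[1])

tree = ((1, 2), (3, (4, 5)))
swapped = (tree[1], tree[0])   # swap the two offspring of the root

assert leaves(tree) == [1, 2, 3, 4, 5]
assert leaves(swapped) == [3, 4, 5, 1, 2]   # new leaf order, same tree
```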
Clearly, if one chooses the specific branch ordering based on the observations Yi, the theoretical guarantees may be violated, as this essentially amounts to p-value hunting. This problem is not intrinsic to treeSeg, but is a universal problem in statistics whenever several tests for the same hypothesis problem are available. Deciding on one particular branch ordering amounts to deciding on one particular test, which opens the door to the so-called p-value lottery, i.e., p-value hunting.
A natural approach to avoid this arbitrary choice is to apply treeSeg repetitively to B different leaf-orders, obtained via randomly swapping the branches of the tree, and then to aggregate the results. When B is sufficiently large, this will make the analysis (at least approximately) independent of any specific branch swapping.
In the following we describe a simple-to-implement way of performing such an aggregation which preserves treeSeg's overall confidence statements, by adapting an approach from (5) developed for a similar problem for p-values in high-dimensional linear regression.
Fix some confidence level α ∈ (0, 1). For b = 1, . . . , B, obtain a randomly branch-swapped version of the tree T and of the data Y, corresponding to a different ordering of the Yi. Apply treeSeg at level α/2 to obtain, for each b = 1, . . . , B, an estimate of the number of active nodes k̂_b(q_{1−α/2}), a confidence set for the active nodes C^b_{1−α/2}, and confidence bands for the success probabilities. Aggregate these results by taking medians over b = 1, . . . , B, as in Eq. 4. The choice of the median together with the level correction α/2 in Eq. 4 is somewhat arbitrary and can be replaced with any other quantile γ ∈ (0, 1) together with level correction γα, see (5) for details.
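The aggregation step can be sketched as follows. Since Eq. 4 is not reproduced here, the set-aggregation rule below (a node is kept if it appears in at least half of the B confidence sets) is only one plausible median-style variant, and the function is an assumed helper, not part of the treeSeg package:

```python
import statistics

def aggregate_runs(k_hats, conf_sets):
    """Aggregate B branch-swapped treeSeg runs (each at level alpha/2):
    the estimated number of active nodes is the median of the k_hats; a
    node enters the aggregated confidence set if it appears in at least
    half of the individual confidence sets (assumed median-style rule)."""
    k_agg = int(statistics.median(k_hats))
    B = len(conf_sets)
    counts = {}
    for s in conf_sets:
        for v in s:
            counts[v] = counts.get(v, 0) + 1
    C_agg = {v for v, c in counts.items() if 2 * c >= B}
    return k_agg, C_agg

aggregate_runs([1, 2, 2], [{1}, {1, 2}, {2}])   # -> (2, {1, 2})
```

Taking the median over runs at the stricter level α/2 is what allows the Markov-inequality argument below to recover an overall level-α statement.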
It is easy to check that for the aggregated treeSeg results in Eq. 4 all confidence statements hold true at level α. We illustrate this for the estimated number of active nodes k̂ and for the confidence set C; the derivation for the confidence bands is analogous. For k̂, the first inequality follows from Markov's inequality and the second because, for every individual random branch swap b, treeSeg guarantees that P(k̂_b(q_{1−α}) > k) ≤ α. Similarly, for C, the first inequality follows because the maximum over sums is smaller than or equal to the sum over maxima, the second inequality follows from Markov's inequality, and the last from the coverage probability of each individual confidence set C^b_{1−α/2}.

F. Noisy tree structure.
The treeSeg algorithm takes as input a given tree structure T with n leaf nodes as well as the observations of interest Y1, . . . , Yn. Thereby, the tree T defines a neighborhood structure between the observations Y1, . . . , Yn. The question that treeSeg answers is whether the distribution of observation Yi depends on its relative location within the neighborhood structure defined by T. Clearly, in many (if not most) applications, the tree T is itself just a noisy surrogate of the actual neighborhood structure of interest. For example, a phylogenetic tree obtained from genetic sequencing data will usually not correspond exactly to the true ancestral chart of the individuals under consideration.
Similarly, for one and the same observations Y1, . . . , Yn, there might be different trees that correspond to different neighborhood structures of interest. Consider, for example, the three plots in Figure S7. They all show the same observations Y1, . . . , Y180, simulated from independent Bernoulli distributions with success probability as shown by the black dashed line. However, the trees which define their neighborhood structures are different. Arguably, in the rightmost plot there is no obvious dependence between the tree structure and the distribution of the observations Yi. In the other two plots, however, the neighborhood structure differs significantly and the success probability strongly depends on particular sub-branches of the tree.
The example in Figure S7 was generated as follows. We considered n = 180 Bernoulli random variables Y1, . . . , Y180 with group-wise constant success probabilities. For σ = 0.01, 0.07, and 1, we applied hierarchical clustering to noisy covariates Xi (with noise level σ) to obtain a tree structure as shown in Figure S7.
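A sketch of this kind of setup (the group centres, success probabilities, and group sizes below are assumptions for illustration; the paper's exact values are not reproduced here):

```python
import random

def simulate(sigma, n_per_group=60, seed=0):
    """Three groups of n_per_group samples: X_i = group centre + N(0, sigma)
    drives the clustering tree, Y_i ~ Bernoulli(p_group) is the response.
    Centres and probabilities are assumed, not taken from the paper."""
    rng = random.Random(seed)
    centres = [0.0, 1.0, 2.0]   # assumed group centres
    probs = [0.1, 0.5, 0.9]     # assumed group success probabilities
    X, Y = [], []
    for c, p in zip(centres, probs):
        for _ in range(n_per_group):
            X.append(c + rng.gauss(0, sigma))        # clustering covariate
            Y.append(1 if rng.random() < p else 0)   # Bernoulli response
    return X, Y   # X is then fed to hierarchical clustering to build T
```

For small σ the clustering of X recovers the three groups and the tree aligns with the response; for large σ the tree carries essentially no information about the grouping.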
For σ = 0.01, the tree T perfectly aligns observations from the same group into a joint sub-branch of the tree. Hence, treeSeg perfectly recovers the segmentation which is represented by the tree. At the other extreme, for σ = 1, the tree essentially does not capture any neighborhood structure from the original grouping of the observations. Hence, treeSeg, which analyzes the data conditioned on this particular tree T, does not estimate any active nodes and thus does not report any dependence between the tree and the observational distribution. Given that this tree essentially does not contain any dependence structure, this is a very reasonable result.
In summary, we stress that treeSeg does not make any assumptions on how the particular tree T was generated. It takes the tree T as given and performs its analysis conditioned on T. It is precisely this that enables treeSeg's wide applicability to various settings, where trees might be generated in very different ways.

Simulation results
In the following we report on simulation results for the treeSeg algorithm. We have performed two sets of simulation studies.
In the first study we assessed the power of the method in detecting clades on the tree with distinct phenotype distributions. In the second study we assessed the impact of the tree branch ordering on the outcome of the method.

A. Power of the method in detecting clades with distinct phenotype distributions.
A.1. Single active node is simulated. First, we study the detection power of treeSeg for active components on the tree, that is, subtrees where the underlying success probability of the binary response variables at the leaves differs from the rest of the tree. The power depends on two parameters. The first is the number of samples contained in the subtree: the larger the subtree, the easier it is to detect. Similarly, the larger the difference between the success probabilities in the active subtree and the rest of the tree, the easier it is to detect this subtree. Therefore, we can quantify the difficulty of the detection problem via two parameters: first, the size of the smallest subgroup, which we denote by λ; second, the size of the smallest pairwise difference of success probabilities, which we denote by δ. To study the ability of treeSeg to detect an active subgroup we consider the following setup: We fix the sample size n = 200 and a confidence level α = 0.1. Then, for each δ = 0.1, 0.2, . . . , 1 and each λ = 5, 15, 25, 35, . . . , 95 we do the following:
1. We generate a random binary tree with n leaves.
2. We select a random inner node which has exactly λ offspring nodes (if this does not exist we go back to 1.).
3. We select randomly two probabilities p1, p2 with |p1 − p2| = δ.
4. We draw binary observations with success probability p1 for the offspring of the selected node and with success probability p2 for the other nodes.
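The steps above can be sketched as follows (the random-tree generator and the way p1 is drawn are assumptions of the sketch, and the treeSeg call itself is omitted):

```python
import random

def random_binary_tree(n, rng):
    """Grow a random binary tree with n leaves by repeatedly splitting a
    uniformly chosen leaf (an assumed tree model; the paper does not
    specify its generator)."""
    root = {"leaf": True}
    leaves = [root]
    while len(leaves) < n:
        node = leaves.pop(rng.randrange(len(leaves)))
        node["leaf"] = False
        node["children"] = [{"leaf": True}, {"leaf": True}]
        leaves.extend(node["children"])
    return root

def subtree_leaves(node):
    if node["leaf"]:
        return [node]
    return subtree_leaves(node["children"][0]) + subtree_leaves(node["children"][1])

def inner_nodes(node):
    if node["leaf"]:
        return []
    return [node] + inner_nodes(node["children"][0]) + inner_nodes(node["children"][1])

def one_replicate(n=200, lam=5, delta=0.4, seed=1):
    """One replicate of steps 1-4 above. p1 is drawn uniformly so that
    p2 = p1 + delta lies in [0, 1]."""
    rng = random.Random(seed)
    while True:  # step 1 (retry until a node as in step 2 exists)
        tree = random_binary_tree(n, rng)
        cands = [v for v in inner_nodes(tree) if len(subtree_leaves(v)) == lam]
        if cands:
            break
    active = rng.choice(cands)                      # step 2
    p1 = rng.uniform(0, 1 - delta)                  # step 3
    p2 = p1 + delta                                 # |p1 - p2| = delta
    in_active = {id(l) for l in subtree_leaves(active)}
    return [1 if rng.random() < (p1 if id(l) in in_active else p2) else 0
            for l in subtree_leaves(tree)]          # step 4
```

The resulting binary vector (in the tree's leaf order) would then be passed, together with the tree, to treeSeg.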
Then, we repeat until convergence of the reported summary statistics (we found 1,000 repetitions to be sufficient). Figure S11 shows the average number of detected active nodes for different values of δ and λ. Note that, in this setting, the true number of active nodes is one. One can see that for reasonable values of λ and δ, treeSeg detects an active subgroup with very high probability and thus shows strong detection power. In particular, Figure S11 confirms that treeSeg reliably controls the probability of overestimation. Indeed, we found that for all combinations of δ and λ the empirical probability of overestimation was even smaller than the nominal level α = 0.1. This shows that active components which are detected by treeSeg are indeed present in the signal with very high probability. More importantly, this probability guarantee can be made explicit via the level α (in this setting α = 0.1). Figure S12 gives insight into the accuracy of the estimated active node (in a detection event). It displays the average distance in the tree of the estimated active node to the true active node. Thereby, as a distance function, we considered the minimum number of edges to be traversed to reach one node from the other. One can see that for reasonable λ and δ the estimated active node coincides with the true active node, which shows the high estimation accuracy of treeSeg.
Finally, we report on the estimation accuracy for the leaf success probabilities. Figure S13 shows the mean absolute error (MAE) of the estimated success probabilities for different λ and δ. Note that the MAE is small in general, revealing a high accuracy of treeSeg for success probability estimation. For small δ, the MAE slightly increases as λ increases. This is because when δ is small, the active node cannot be detected reliably, and the smaller λ is, the better one can approximate the true success probabilities via a constant. When δ is reasonably large, treeSeg reliably detects the active node and the MAE naturally increases as λ increases.
A.2. No active node is simulated. Next, we consider the situation where the true generating model has no active component, that is, all responses at the tree leaves have the same success probability. The following setup is considered. We fix the sample size n = 200 and a confidence level α = 0.1.
1. We generate a random binary tree with n leaves.
2. We select randomly a single success probability p1.
3. We draw n binary observations with success probability p1.
4. We apply the treeSeg algorithm.
Then, we repeat until convergence of the reported summary statistics (we found 1,000 repetitions to be sufficient). We found that in 98.6% of cases treeSeg agreed with the simulation truth. Note that this is even higher than our theoretical guarantee of 1 − α = 0.9. This again shows that if treeSeg does detect an active component, this is very strong evidence that it is indeed present in the signal.
A.3. Two active nodes are simulated. In this study, we considered generating models with two active nodes. More precisely, we considered the following setup. We fix the sample size n = 200 and a confidence level α = 0.1. Then, for each δ = 0.1, 0.2, . . . , 0.5 and each λ = 5, 15, 25, 35, . . . , 65 we do the following:
1. We generate a random binary tree with n leaves.
2. We select two random inner nodes, each having exactly λ offspring nodes (if these do not exist we go back to 1.).
Then, we repeat until convergence of the reported summary statistics (we found 1,000 repetitions to be sufficient). Figure S14 shows the average number of detected active nodes for different values of δ and λ. Note that, in this setting, the true number of active nodes is two. One can see that for reasonable values of λ and δ, treeSeg detects the active subgroups with high probability and thus shows strong detection power. Compared to the setting with just a single active component, δ and λ only need to be slightly larger to obtain detection of both active components simultaneously with high probability.
In particular, Figure S14 confirms that treeSeg reliably controls the probability of overestimation. Again, for all combinations of δ and λ the empirical probability of overestimation was even smaller than the nominal level α = 0.1. This shows that also when several active components are detected, they are, with very high probability, indeed present in the data, and this probability guarantee can be made explicit via the level α (in this setting α = 0.1). Figure S15 gives insight into the accuracy of the estimated active nodes (in a detection event). It displays the average distance in the tree of the estimated active node which was closest to the first true active node (for symmetry reasons the results for the second active node are the same). One can see that for reasonable λ and δ the estimated active nodes are very close to the true active nodes, which shows the high estimation accuracy of treeSeg.
For completeness, we also present simulation results for the MAE (analogous to those in Figure S13) in Figure S16. They show a similar pattern as before, with decent estimation accuracy in general.
B. Impact of tree branching order on the estimation procedure. Finally, we examine the impact of the tree branching order via simulations. Recall that the theoretical guarantees of treeSeg hold for any specific leaf ordering, as long as it was selected independently of the data, that is, not as in a p-value lottery. In general, however, treeSeg's output may differ for different orderings of the leaf nodes induced by the same tree. Recall the discussion in Section E.
For practical purposes, our current implementation of treeSeg (https://github.com/merlebehr/treeSeg) performs a standardized pre-ordering of the branches via the reorder.phylo(..., order = "cladewise") function of the ape R-package. In that way, results are independent of any specific ordering provided by the user. In order to still examine treeSeg's behavior under different leaf orderings, we also provide the option tipOrder = unchanged, which keeps the leaf order unchanged. As an example, reconsider the simulated observations from Figure 1 in the main text. The second active node induces a segment with 0.95 success probability at the very right of the leaf ordering. If one swaps the two branches of the direct ancestor of this active node, the same segment appears closer to the center, although the tree structure remains unchanged. This is shown in Figure S8 (same as Figure 1 in the main text, but with the branches corresponding to the second active node and its sibling node swapped), together with treeSeg's output. As can be seen, this branch swapping does not influence treeSeg's estimated segmentation in this example; only its confidence set C0.9 changes slightly. This is because, for a different ordering, the tests that treeSeg applies differ slightly and thus, in principle, may lead to slightly different results.
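The branch swap described above can be illustrated as follows. This is a minimal Python sketch on a hypothetical nested-tuple tree (treeSeg itself operates on ape phylo objects in R); it shows that swapping the two child branches of one inner node changes the induced leaf ordering while leaving the tree topology unchanged.

```python
def leaves(t):
    """Left-to-right leaf ordering induced by a nested-tuple tree."""
    if not isinstance(t, tuple):
        return [t]
    l, r = t
    return leaves(l) + leaves(r)

def swap_at(t, target):
    """Return a copy of t with the children of one inner node swapped.
    The node is identified by its leaf set `target` (a frozenset)."""
    if not isinstance(t, tuple):
        return t
    l, r = t
    if frozenset(leaves(t)) == target:
        l, r = r, l
    return (swap_at(l, target), swap_at(r, target))

tree = ((("a", "b"), "c"), ("d", "e"))
print(leaves(tree))                                 # → ['a', 'b', 'c', 'd', 'e']
swapped = swap_at(tree, frozenset({"d", "e"}))
print(leaves(swapped))                              # → ['a', 'b', 'c', 'e', 'd']
```

The swapped tree encodes exactly the same hierarchy, which is why any method with guarantees holding for every fixed leaf ordering must treat both versions as equally valid inputs.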
In the following, we show in a simulation study that such stable behavior is the usual situation. As long as branch swapping is done randomly, treeSeg's results typically do not change much. Thus, in order to avoid p-value hunting, we found it best practice to fix some pre-specified ordering, e.g., cladewise as in our current implementation. Another approach would be to aggregate results over several random branch swaps, as discussed in Section E. However, the overall branch-swapping stability of treeSeg observed in practice indicates that the additional computational burden of the second approach might not always be justified. To analyze the branch-swapping effect, we consider the following simulation setup. We fix the sample size n = 200 and the confidence level α = 0.1. Further, in order to guarantee some detection, we fix a minimal λ = 20 and δ = 0.3 (we later also consider other difficulty regimes via λ, δ). Then we do the following: 1. We generate a random binary tree with n leaves.
2. We select a random inner node with at least λ offspring nodes (if no such node exists, we go back to 1.).
3. We randomly select two probabilities p1, p2 with |p1 − p2| = δ. 4. We draw binary observations with success probability p1 for the offspring of the selected node and success probability p2 for all other nodes.
5. Then, for N = 100 repetitions: (a) With probability 1/2, independently swap the offspring of each branch of the tree. (b) Apply the treeSeg algorithm to the (swapped) data.
Figure S17 shows the histogram of the empirical standard deviation of the number of detected active nodes among the 100 randomly swapped leaf orderings, for 1,000 trees generated as explained above. One can see that for most of the trees the standard deviation was zero, meaning that in all 100 branch swaps the number of detected active nodes was the same. Similarly, Figure S18 shows the respective empirical 90% inter-quantile distance: for the majority of the 1,000 trees, at least 90 of the 100 randomly swapped leaf orderings resulted in the same number of estimated active nodes. Also in terms of estimation accuracy, branch swapping has very little influence. This can be seen in Figure S19, which shows that for the vast majority of trees the empirical standard deviation of the distance between true and estimated active nodes among the 100 randomly swapped leaf orderings was almost zero.
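Step 5(a), the random branch swap, can be sketched as follows. This is a minimal Python illustration on a hypothetical nested-tuple tree representation; it makes explicit that each swap permutes the induced leaf ordering but preserves the leaf set and the topology, so only the ordering seen by the multiscale tests changes.

```python
import random

def leaves(t):
    """Leaf ordering induced by a nested-tuple tree."""
    if not isinstance(t, tuple):
        return [t]
    return leaves(t[0]) + leaves(t[1])

def random_swap(t, rng):
    """Independently swap the two child branches of every inner node
    with probability 1/2 (step 5(a)); the topology is unchanged."""
    if not isinstance(t, tuple):
        return t
    l, r = (t[1], t[0]) if rng.random() < 0.5 else (t[0], t[1])
    return (random_swap(l, rng), random_swap(r, rng))

rng = random.Random(42)
tree = ((("a", "b"), "c"), ("d", "e"))
perm = random_swap(tree, rng)
# the leaf *set* is preserved; only the induced ordering may change
print(sorted(leaves(perm)) == sorted(leaves(tree)))  # → True
```

In the actual study, step 5(b) then runs treeSeg on the observations re-indexed by each swapped ordering, and the summary statistics (number of detected active nodes, distances to the true active node) are recorded across the N = 100 swaps.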
This robustness becomes even stronger in both the very high and very low signal-to-noise ratio regimes. We repeated the experiment, but with minimal λ = 50 and minimal δ = 0.4. Figures S20, S22, and S21 show the empirical standard deviation of the number of active nodes, the 90% inter-quantile distance, and the empirical standard deviation of the distance between true and estimated active nodes, respectively. In this regime, for almost all of the 1,000 trees, branch swapping never influenced the algorithm's outcome. The same holds true for the low signal-to-noise ratio regime, as seen in Figures S23, S25, and S24, where we considered a maximal λ = 20 and maximal δ = 0.2. In summary, the simulations suggest that treeSeg is quite robust under branch-swapping events, in particular in the low and high signal-to-noise regimes.
B.1. Optimal branch ordering. Although the above simulation study shows that treeSeg is typically robust under (random) branch swapping, a natural question is whether certain orderings of branches have uniformly higher detection power than others. In other words, are some orderings generally preferred by treeSeg? In the following, we give some insight into this question with a simulation example.
We reconsider the example from Figure 1 in the main text and Figure S8, respectively, but remove the first true active node and keep only the second one, with the same two branch orderings as in Figure 1 in the main text and Figure S8, respectively. This is shown in Figure S26, where the left column shows an ordering analogous to Figure 1 in the main text (in the following denoted as Ordering 1) and the right column shows an ordering analogous to Figure S8 (in the following denoted as Ordering 2). We decrease the signal-to-noise ratio such that treeSeg's chance of detecting this active node is roughly one half. More precisely, we assume that the offspring tips of the single active node have success probability 31.5% and the remaining tips have success probability 10%.
Among 1,000 Monte Carlo Bernoulli draws, we find that for Ordering 1 treeSeg detects the active node in 48.9% of cases and for Ordering 2 in 47.8% of cases. Thus, Ordering 1 is (slightly) more advantageous on average for this particular signal. This is because the multiscale test that underlies treeSeg is based on an interval system where Ordering 1 yields a larger interval covering a constant segment of this particular signal. However, if we change the signal slightly by replacing the active node with its sibling, the situation is exactly the other way around (see Figure S27). In that case, among 1,000 Monte Carlo Bernoulli draws, we find that for Ordering 1 treeSeg detects the active node in 46.6% of cases and for Ordering 2 in 46.9% of cases. Thus, Ordering 2 is (slightly) more advantageous on average for this other signal. Similarly, the actual detection power also depends on the particular observations. The first rows in Figures S26 and S27 show a realization where Ordering 1 leads to a detection but Ordering 2 does not. The second row shows a realization where it is the other way around.
In summary, the optimal ordering of the branches depends on both the signal and the particular observations. As the true signal is not known and it is not valid to pick the ordering depending on the observations (recall the discussion on p-value hunting), there generally does not exist any uniformly optimal branch ordering. However, if additional a priori knowledge about the signal were available, one might exploit it in order to optimize the particular branch ordering considered by the treeSeg algorithm. This is an interesting direction for future research, which clearly depends heavily on the particular application setting.

Additional Figures

Data from (6), where a total of 360 Salmonella enterica bacterial strains were studied. The tree shows a maximum-likelihood phylogeny based on whole-genome sequencing of the strains. The color code at the leaves of the tree indicates whether the sample was isolated from a human (black) or an animal (gray) host. Using α = 0.1, treeSeg estimates two active nodes. Localization of the first active node obtains a higher certainty than that of the second one, as seen in the confidence sets and bands.

. . . , 150 we simulated a random tree with a single active node splitting the leaf nodes into two parts of (approximately) equal size. For the two segments we generated Bernoulli random variables with success probabilities 0.1 and 0.6, respectively. The y-axis shows d(p, p̂) as in Eq. 1 for treeSeg (red) and SMUCE (black), both with α = 0.1, averaged over 10,000 Monte Carlo repetitions.

S12. Average distance of the detected active node to the true active node (conditioned on the detection event), with the distance function being the minimum number of edges to be traversed to reach one node from the other, for different λ and δ as in Figure S11.

S15. Average distance of some detected active node to the first true active node (conditioned on some detection event), with the distance function being the minimum number of edges to be traversed to reach one node from the other, for different λ and δ as in Figure S14.

Left column: Ordering 1, analogous to Figure 1 in the main text. Right column: Ordering 2, analogous to Figure S8. Rows 1 and 2 show two different Bernoulli realizations. When treeSeg (with α = 0.1) detects the active node, its estimate is shown as a red dot in the tree. Its estimated signal p̂ is shown as a red line and the true signal as a black dotted line. The confidence band is shown in orange. The two hyper-rectangles correspond to confidence boxes (induced by the tests on the respective intervals), which do not overlap in the detection case but do overlap in the non-detection case.

Example of a tree with two active nodes (in red) that are not identifiable, as removing one active node keeps the segmentation unchanged. Note that this is an example with k = 2 active nodes but only 2 < k + 1 different segments, which is thus excluded by our analysis.