A simpler linear-time algorithm for the common refinement of rooted phylogenetic trees on a common leaf set

Background The supertree problem, i.e., the task of finding a common refinement of a set of rooted trees is an important topic in mathematical phylogenetics. The special case of a common leaf set L is known to be solvable in linear time. Existing approaches refine one input tree using information of the others and then test whether the results are isomorphic. Results An O(k|L|) algorithm, LinCR, for constructing the common refinement T of k input trees with a common leaf set L is proposed that explicitly computes the parent function of T in a bottom-up approach. Conclusion LinCR is simpler to implement than other asymptotically optimal algorithms for the problem and outperforms the alternatives in empirical comparisons. Availability An implementation of LinCR in Python is freely available at https://github.com/david-schaller/tralda.


Introduction
Given a collection of rooted phylogenetic trees T 1 , T 2 , ...T k , the supertree problem in phylogenetics consists in determining whether there is a common tree T that "displays" all input trees T i , 1 ≤ i ≤ k , and if so, a supertree T is to be constructed [1,2]. In its most general form, the leaf sets L(T i ) , representing the taxonomic units (taxa), may differ, and the supertree T has the leaf set L(T ) = k i=1 L(T i ) . Writing n:=|L(T )| , N := k i=1 |L(T i )| , and R:= k i=1 |L(T i )| 2 , this problem is solved by the algorithm of Aho et al. [3], which is commonly called BUILD in the the phylogenetic literature [4], in O(Nn) time for binary trees and O(Rn) time in general.
An O(N 2 ) algorithm to compute all binary trees compatible with the input is described in [5]. Using sophisticated data structures, the effort for computing a single supertree was reduced to O(min(N √ n, N + n 2 log n)) for binary trees and (R log 2 R) for arbitrary input trees [6]. Recently, an O(N log 2 N ) algorithm has become available for the compatibility problem for general trees [7]. The compatibility problem for nested taxa in addition assigns labels to inner vertices and can also be solved in O(N log 2 N ) [8].
Here we consider the special case that the input trees share the same leaf set L(T 1 ) = L(T 2 ) = · · · = L(T k ) = L(T ) = L , and thus N = kn and R = kn 2 . While the general supertree problem arises naturally when attempting to reconcile phylogenetic trees produced in independent studies, the special case appears in particular when incompletely resolved trees are produced with different methods. In a recent work, we have shown that  16:23 such trees can be inferred e.g. as the least resolved trees from best match data [9,10] and from information of horizontal gene transfer [11,12]. Denoting with H(T ) the set of "clusters" in T, we recently showed that the latter type of data can be explained by a common evolutionary scenario if and only if (1) both the best match and the horizontal transfer data can be explained by least resolved trees T 1 and T 2 , respectively, and (2) the union H(T 1 ) ∪ H(T 2 ) is again a hierarchy. In this context it is of practical interest whether the latter property can be tested efficiently, and whether the common refinement T satisfying H(T ) = H(T 1 ) ∪ H(T 2 ) [13] can be constructed efficiently in the positive case. Several linear time, i.e., O(|L|) time, algorithms for the common refinement of two input trees T 1 and T 2 with a common leaf set have become available. The INSERT algorithm [14], which makes use of ideas from [15], inserts the clusters of T 2 into T 1 and vice versa and then uses a linear-time algorithm to check whether the two edited trees are isomorphic [16]. Assuming that the input trees are already known to be compatible, Merge_Trees [17,18] can also be applied to insert the clusters of one tree into the other. For both of these methods, an overall lineartime algorithm for the common refinement of k input trees is then obtained by iteratively computing the common refinement of the input tree T j and the common refinement of first j − 1 trees, resulting in a total effort of O(k|L|).
Here we describe an alternative algorithm that constructs in a single step a candidate refinement T of all k input trees. This is achieved by computing the parent-function of the potential refinement T in a bottom-up fashion. As we shall see, the algorithm is easy to implement and does not require elaborate data structures. The existence of a common refinement is then verified by checking that the parent function defines a tree T and, if so, that T displays each of the input trees T j . This test is also much simpler to implement than the isomorphism test for rooted trees [16].

Notation and preliminaries
Let T be a rooted tree. We write V(T) for its vertex set, E(T) for is edge set, L(T ) ⊆ V (T ) for its leaf set, V 0 (T ):=V (T ) \ L(T ) for the set of inner vertices and ρ ∈ V 0 (T ) for its root. An edge e = {u, v} ∈ E(T ) is an inner edge if u, v ∈ V 0 (T ) . The ancestor partial order T on V(T) is defined by x T y whenever y lies along the unique path connecting x and the root ρ . If x T y and x = y , we write x ≺ T y . For v ∈ V (T ) , we set then v is the unique parent of u. In this case, we write v = parent T (u) . All trees T considered in this contribution are phylogenetic, i.e., they satisfy |child T (v)| ≥ 2 for all v ∈ V 0 (T ).
We denote by T(u) the subtree of T rooted in u and write L(T(u)) for its leaf set. The last common ancestor of a vertex set W ⊆ V (T ) is the unique T -minimal vertex lca T (W ) ∈ V (T ) satisfying w T lca T (W ) for all w ∈ W . For brevity, we write lca T (x, y):=lca T ({x, y}) . Furthermore, we will sometimes write vu ∈ E(T ) as a shorthand for " There is a well-known bijection between rooted phylogenetic trees T with leaf set L and hierarchies on L, see e.g. [4,Thm. 3.5.2]. It is given by H(T ):={L(T (u)) | u ∈ V (T )} ; conversely, the tree T H corresponding to a hierarchy H is the Hasse diagram w.r.t. set inclusion. Thus, if v = lca T (A) for some A ⊆ L(T ) , then L(T(v)) is the inclusion-minimal cluster in H(T ) that contains A, see e.g. [19]. We call the elements of H(T ) clusters and say that two clusters C and C ′ are compatible if C ∩ C ′ ∈ {C, C ′ , ∅} . Note that, by (i), the clusters of the same tree are all pairwise compatible.
A (rooted) triple is a binary tree on three leaves. We say that a tree T displays a triple xy|z if lca T (x, y) ≺ T lca T (x, z) = lca T (y, z) , or equivalently, if there is a cluster C ∈ H(T ) such that x, y ∈ C and z / ∈ C . The set of all triples that are displayed by T is denoted by r(T). A set R of triples is consistent if there is a tree that displays all triples in R.
Let T and T * be phylogenetic trees with L(T ) = L(T * ) . We say that T * is a refinement of T if T can be obtained from T * by contracting a subset of inner edges. Equivalently, T * is a refinement of T if and only if In particular, therefore, T displays a tree T ′ with L(T ′ ) = L(T ) if and only if H(T ′ ) ⊆ H(T ) , i.e., if and only if T is a refinement of T ′ . The minimal common refinement of the trees [4] can be rephrased in the following form: Lemma 1 Let T 1 , T 2 , ..., T k be trees with common leaf set L(T i ) = L such that H:= k i=1 H(T i ) is a hierarchy. Then there is a unique tree T such that H(T ) = H . Furthermore, T is the unique "least resolved" tree in Proof By definition of H and the bijection between phylogenetic trees and hierarchies, there is a unique tree T such that H = H(T ) . Consider an inner edge e = uv . By construction, there is at least one tree T v such that C:=L(T (v)) ∈ H(T v ) . However, H(T e ) = H(T ) \ {C v } and thus T e does not display T v .
By Thm. 1 in [20], a tree T ′ is displayed by a tree T with L(T ′ ) ⊆ L(T ) if and only if r(T ′ ) ⊆ r(T ) . As an immediate consequence, a common refinement of trees with a common leaf set L exists if and only if the union L of their triple sets is consistent. The latter condition can be checked using the BUILD algorithm which, in the positive case, returns a tree BUILD(R, L) that displays all triples in R.

Lemma 2
Suppose that T is the unique least resolved common refinement of the trees T 1 , T 2 , ..., Proof The tree T :=BUILD(R, L) is a common refinement since, by the arguments above, it displays T 1 , T 2 , ..., T k . By Lemma 1, we therefore have H(T ) ⊆ H( T ) . Prop. 4.1 in [21] implies that T is least resolved w.r.t. R , i.e., every tree T ′ obtained from T by contraction of an edge no longer displays all input triples in R . By Thm. 6.4.1 in [4], T i is displayed by T ′ if and only if T ′ displays all triples of T i . Since this is not true for all input trees T i , T ′ does not display all input trees T i , We note that, given a set of triples R, "T is a least resolved displaying R" does not imply that vertex set V(T) is minimal among all such trees. It is possible in general that there is a tree T ′ displaying a given triple set R with |V (T ′ )| < |V (BUILD(R, L))| . In this case, BUILD(R, L) does not display T ′ , see [22] for details. However, uniqueness of the least resolved tree, Lemma 1, rules out this scenario in our setting.
The algorithm BuildST [7] computes the supertree of a set T :={T i | 1 ≤ i ≤ k} of rooted trees without first breaking down each tree to its triple set r(T i ) . Lemma 5 in [7] establishes that BuildST applied to a set of trees and BUILD applied to the triple set R:= k i=1 r(T i ) produce the same output for all instances. If R is consistent, BuildST computes the tree BUILD(R, L) . If all input trees have the same same leaf set L BuildST in particular computes their common refinement. The performance analysis in [7] shows that BuildST runs in O(k|L| log 2 (k|L|)) time for this special case. Linear-time algorithms for the special case of a common leaf set therefore offer a further improvement over the best known general purpose supertree algorithms.

A bottom-up linear time algorithm
The basic idea of our approach is to construct T by means of a simple bottom-up approach that computes the parent function parent T : of a candidate tree T in a stepwise manner. This algorithm is based on three simple observations: From here on, we simply say, by a slight abuse of notation, that v is also a vertex of

iii) T exists if and only if the sets L(T(x)) and
L(T(y)) for all x, y ∈ k i=1 V (T k ) are either comparable by set inclusion or disjoint, i.e., Observation (iii) makes it possible to access the ancestor order ≺ T on V(T) without knowing the common refinement T explicitly. Many of the upcoming definitions are illustrated in Fig. 1.
We introduce, for each v ∈ V (T ) , the index set Let us assume until further notice that a common refinement exists and let T = (V , E) be the unique least resolved common refinement of T 1 , T 2 , ..., T k on a common leaf set. Due to Lemma 1, T is uniquely determined by the parent function parent T . The key ingredient in our construction are the following vertices in T i : Note that in general also both cases in Obs. 4 are possible. Consider the set of vertices By construction and Obs. 4, we have v T x for all x ∈ A(v) . Since all ancestors of a vertex in a tree are mutually comparable w.r.t. the ancestor order, we have (1) Taken together, Observations 3-5 imply that the parent map of T can be expressed in the following form: where the minimum is taken w.r.t. the ancestor order T on T. Since the root ρ i of each T i coincides with the root ρ of T, v is the root of T iff parent T i (v) = ∅ is undefined for one and thus for all i. In this case, we set parent T (v) = ∅.
With this in hand, we show how to compute the maps p i for u:=parent T (v) for all i ∈ {1, . . . , k} . To this end, we distinguish three cases.
In the remaining case, i ∈J (v) , we already know that p i (v) is the T i -minimal ancestor of v. Thus, we have either p i (v) = u = parent T (v) , i.e., a sub-case of (1), or (3) u T p i (v) whenever v / ∈ V (T i ) and u / ∈ V (T i ) . In this case, the definition of p i implies p i (u) = p i (v) . Summarizing the three cases yields the following recursion: Note, although the cases in Eq. (3) are not exclusive (since J (v) ∩ J (u) � = ∅ is possible), they are not in conflict. To see this, observe that if i ∈ J (u) and i ∈ J (v) , then u = parent T i (v) as a consequence of the definition of u.
(2)   Initializing i ∈ J (v) for all i and all leaves v, we can compute J(u) for u = parent T (v) as a by-product by the minimum computation in Eq. (2) by simply keeping track of the equalities encountered since both parent T i (v) and p i (v) are vertices in T i . More precisely, each time a strictly T -smaller vertex u ′ , i.e., a proper set inclusion, is encountered in Eq. (2), the current list of equalities is discarded and re-initialized as {i} , where i is the index of the tree T i in which the new minimum u ′ was found. The indices of the trees T j with u ′ ∈ V (T j ) are then appended.
It remains to ensure that the vertices are processed in the correct order. To this end, we use a queue Q , which is initialized by enqueueing the leaf set. Upon dequeueing v, its parent u and the values p i (u) are computed. Except for the leaves, every vertex u ∈ V (T ) appears as parent of some v ∈ V (T ) . On the other hand, u may appear multiple times as parent. Thus we enqueue u in Q only if the same vertex has not been enqueued already in a previous step. We emphasize that it is not sufficient to check whether u ∈ Q since u may have already been dequeued from Q before re-appearance as a parent. We therefore keep track of all vertices that have ever been enqueued in a set V. To see that this is indeed necessary, consider a tree T i = (a, (b, c)v 1 )v 2 and an initial queue Q = (a, b, c) . Without the auxiliary set V, we obtain , etc., and thus v 2 is enqueued twice.
An implementation of this procedure also needs to keep track of the correspondence between vertices in V(T) and the vertices of V (T i ) . To this end, we can associate with each v ∈ V (T ) a list of pointers to v ∈ V (T i ) for i ∈ J (v) , and pointer from v ∈ V (T i ) back to v ∈ V (T ) . For the leaves, these are assigned upon initialization. Afterwards, they are obtained for u = parent T (v) as a by-product of computing J(u), since the pointers have to be set exactly for the i ∈ J (u) . In particular, whenever the pointer for u found T i has already been set, we know that u ∈ V .
Summarizing the discussion so far, we have shown: Next we observe that it is not necessary to explicitly compute set inclusions. As an immediate consequence of Obs. 5 and the fact that x = y implies L(T (x)) = L(T (y)) because all trees are phylogenetic by assumption, we obtain Observation 7 For any two x, y ∈ A(v) , we have x ≺ T y if and only if |L(T (x))| < |L(T (y))|.
Thus it suffices to evaluate the minimum in Eq.
Proof We construct parent T in Lines 1-24 as described in the proof of Cor. 8. In particular, we determine u:=parent T (v) by virtue of the smallest ℓ i (u) . Hence, we can process each enqueued vertex v in O(k). Moreover, if a common refinement T exists, then Cor. 8 guarantees that we obtain this tree in Line 25.
A tree on |L| leaves has at most |L| − 1 inner vertices with equality holding for binary trees. Therefore, the set V of distinct vertices encountered in Alg. 1, can contain at most 2|L| − 2 vertices (note that by construction the So far, we have assumed that a common refinement exists. By a slight abuse of notation, we also use the function parent T if the refinement T does not exist. In this case, we define parent T on the union of the V (T i ) recursively by Eqs. (2) and (3). Alg. 1 summarizes the procedure based on the leaf set cardinalities for the general case. If no common refinement T exists, then either parent T does not specify a tree, or the tree T defined by parent T is not a common refinement of T 1 , T 2 , ..., T k . The following result shows that we can always efficiently compute parent T and check whether it specifies a common refinement of the input trees.
Checking whether ℓ(v) < ℓ min (= ℓ(u)) in Line 15 ensures that G does not contain cycles and that parent T (v) = u and parent T (u) = v is not possible. Moreover, every vertex v ∈ V is enqueued to Q and receives a parent u such that ℓ(v) < ℓ(u) . Unless u = ρ , u in turn receives a parent u ′ with ℓ(u) < ℓ(u ′ ) . Since V is finite v, u, u ′ , ... are pairwise distinct as a consequence of the cardinality condition, and we conclude that eventually ρ is reached, i.e., a path to ρ exists for all v ∈ V . It follows that G is connected, acyclic, and simple, and thus a tree (with root ρ). i is an isomorphism. To this end, it suffices to traverse T i and to check that

It remains to check whether
Lines 31-32) using the pointers of v and all elements in child T i (v) to the corresponding vertices in T. Note that, in general, the pointer from a vertex v in T i to a vertex in T ′ i may not be set, in which case v / ∈ V (T ′ i ) and thus, we can terminate with a negative answer. The total effort thus is bounded by O(k|L|).
If T on L is a phylogenetic tree displaying all trees T 1 , T 2 , ..., T k , then it is a common refinement of these trees. Since every vertex v ∈ V (T ) is also contained in some T i , i.e., L(T (v)) = L(T i (v)) , we have H(T ) = H(T 1 ) ∪ H(T 2 ) ∪ · · · ∪ H(T k ) .

Computational results
We compare the running times for (a) BUILD [3], (b) BuildST [7], (c) Merge_Trees [18], (c') Loose_Cons_Tree [18], and (d) LinCR (Alg. 1). To this end, we implemented all of these algorithms in Python as part of the tralda library. We note that BUILD operates on a set of triples extracted from the input trees rather than the trees themselves. We use the union of the minimum cardinality sets of representative triples of every T i appearing in the proof of Thm. 2.8 in [23]. Therefore, we have R ∈ O(k|L| 2 ) [24, Thm. 6.4] and BUILD runs in O(k|L| 3 ) time. In the case of Merge_Trees , we implemented a variant that starts with T = T 1 and then iteratively merges the clusters of the tress T i , 2 ≤ i ≤ k , into T. Merge_Trees assumes that the input trees are compatible, which is guaranteed in our benchmarking data set. In practice, however, this condition may be violated, in which case the behavior of Merge_Trees is undefined. We therefore also implemented an O(k|L|) algorithm for constructing the loose consensus tree for a set of trees T 1 , T 2 , ..., T k on the same leaf set, Loose_Cons_Tree , following [18]. The loose consensus comprises all clusters that occur in at least one tree T i , 1 ≤ i ≤ k and that are compatible with all other clusters of the input trees (see [25][26][27] and the references therein). The loose consensus tree by definition coincides with the common refinement whenever the latter exists. Loose_Cons_Tree uses Merge_Trees as a subroutine but ensures compatibility in each step by first deleting incompatible clusters in one of the trees. This is implemented as the deletion of the corresponding inner vertex v followed by reconnecting the children of v to the parent of v. The input trees are compatible if and only if no deletion is necessary. The existence of a common refinement can therefore by checked by keeping track of the number of deletions. However, the subroutine that processes trees to remove incompatible clusters significantly adds to the running time of the Loose_Cons_Tree algorithm. The lineartime algorithms require O(k|L|) space.
We simulate test instances as follows: First, a random tree T * is generated recursively by starting from a single vertex (which becomes the root) and stepwise attaching new leaves to a randomly chosen vertex v until the desired number of leaves |L| is reached. In each step, we add two children to v if v is currently a leaf, and only a single new leaf otherwise. This way, the number of leaves increases by exactly one in each step and the resulting tree T * is phylogenetic (but in general not binary). From T * , we obtain k ∈ {2, 8, 32} trees T 1 , T 2 ,..., T k by random contraction of inner edges in (a copy of ) T * . Each edge is considered for contraction independently with a probability p ∈ {0.1, 0.5, 0.9} . Therefore, T * is a refinement of T i for all 1 ≤ i ≤ k , i.e., a common refinement exists by construction. However, in general we have H(T * ) � = k i=1 H(T i ) , i.e., T * is not necessarily the minimal common refinement of the T i . The trees T 1 , T 2 , ..., T k constructed in this manner serve as input for all algorithms.
The running time comparisons were performed using tralda on an off-the-shelf laptop (Intel ® Core TM i7-4702MQ processor, 16 GB RAM, Ubuntu 20.04, Python 3.7). The time required to compute a least resolved common refinement of the input trees is included in the respective total running time shown in Figs. 2 and 3 . The empirical performance data are consistent with the theoretical result that LinCR scales linearly in k|L|. In particular, the median running times scale linearly with |L|, as shown by the slopes of ≈ 1 in the log/ log plot for the running times of LinCR in Fig. 3.
In accordance with the theoretical complexity of O(k|L| log 2 (k|L|)) for the common refinement problem, the performance curve of BuildST is almost parallel to that of LinCR; however, its computation cost is higher by almost two orders of magnitude. Our implementation of BuildST uses an algorithm for dynamic graph connectivity often referred to as HDT data structure [28] as originally described in [7]. While we do not expect BuildST to become competitive with the other algorithms, we note that a recent experimental study showed that a simplified version of the HDT data structure (with a slightly worse asymptotic bound) outperforms the full version in practice [29]. For both LinCR and BuildST, the contraction probability p appears to have little effect on the running time. In both cases, a larger value of p (i.e., a lower average resolution of the input trees) leads to a moderate decrease of the running time.
In contrast, the resolution of the input trees has a large impact on the efficiency of BUILD. It also scales nearly linearly when the resolution of the individual input trees T i is comparably high (and even terminates faster than LinCR up until a few hundred leaves, cf. top-right panel), whereas its performance drops drastically with increasing p, i.e., for poorly resolved input trees. The reason for this is most likely the cardinality of a minimal triple set that represents the set of input trees. For binary trees, the cardinality of the triple set of T i equals the number of inner edges [23], i.e., there are O(|L|) triples. For very poorly resolved trees, on the other hand, O(|L| 2 ) triples are required [24], matching the differences of the slopes with p observed for BUILD in Fig. 3.
As expected, the curves of the two O(k|L|) algorithms Merge_Trees and Loose_Cons_Tree are also almost parallel to that of LinCR in Fig. 3. For k = 2 , we can even observe that Merge_Trees is slightly faster than LinCR. However, the smaller number of necessary tree traversals in LinCR apparently becomes a noticeable advantage with an increasing number k of input trees. The additional tree processing steps in the more practically relevant Loose_Cons_Tree algorithm, furthermore, result in a longer running time compared to our new approach.

Concluding remarks
We developed a linear-time algorithm to compute the common refinement of trees on the same leaf set. In contrast to the "classical" supertree algorithms BUILD and BuildST, LinCR uses a bottom-up instead of a top-down strategy. This is similar to Loose_Cons_Tree and its subroutine Merge_Trees [18], which can also be used to obtain the common refinement of trees on the same leaf set in linear time. LinCR, however, requires fewer tree traversals and is, in our opinion, simpler to implement. In contrast to Merge_Trees , LinCR in particular does not rely on a data structure that enables linear-time tree preprocessing and constant-time last common ancestor queries for the nodes in the tree [30]. All algorithms were implemented in Python and are freely available for download from https:// github. com/ david-schal ler/ tralda as part of the tralda library. Empirical comparisons of running times show that LinCR consistently outperforms the linear-time alternatives. Only BUILD is faster for very small instances and moderate-size trees that are nearly binary.
Although it may be possible to improve Alg. 1 by a constant factor, it is asymptotically optimal, since the input size is O(k|L|) for k trees with |L| leaves. Furthermore, trivial solutions can be obtained in some limiting cases. For instance, if |V (T i )| = 2|L| − 1 , then T i is binary, i.e., no further refinement is possible. In this case, we can immediately use T = T i as the only viable candidate and only check that T j displays all other T j . However, we cannot entirely omit Lines 1-24 in this case since we require the sets J(v) as well as the correspondence between the vertices in order to check whether T displays every T i .
It is worth noting that the idea behind LinCR does not generalize to more general supertree problems. The main reason is that the set inclusions employed to determine ≺ T do not carry over to the more general case because the inclusion order of C 1 , C 2 ∈ H(T ) cannot be determined from C 1 ∩ L(T i ) and C 2 ∩ L(T j ) for two trees with L(T i ), L(T j ) L(T ).
Depending on the application, a negative answer to the existence of a common refinement may not be sufficient. One possibility is to resort to the loose consensus tree or possibly other notions of consensus trees, see e.g. [25,31]. A natural alternative approach is to extract a maximum subset of consistent triples from k i=1 r(T i ) . This problem, however, is known to be NP-hard for arbitrary triple sets, see e.g. [32] and the references therein.