Agreement forests of caterpillar trees: complexity, kernelization and branching

Given a set $X$ of species, a phylogenetic tree is an unrooted binary tree whose leaves are bijectively labelled by $X$. Such trees can be used to show the way species evolve over time. One way of understanding how topologically different two phylogenetic trees are is to construct a minimum-size agreement forest: a partition of $X$ into the smallest number of blocks, such that the blocks induce homeomorphic, non-overlapping subtrees in both trees. This comparison yields insight into commonalities and differences in the evolution of $X$ across the two trees. Computing a smallest agreement forest is NP-hard (Hein, Jiang, Wang and Zhang, Discrete Applied Mathematics 71(1-3), 1996). In this work we study the problem on caterpillars, which are path-like phylogenetic trees. We demonstrate that, even if we restrict the input to this highly restricted subclass, the problem remains NP-hard and is in fact APX-hard. Furthermore we show that for caterpillars two standard reduction rules well known in the literature yield a tight kernel of size at most $7k$, compared to $15k$ for general trees (Kelk and Simone, SIAM Journal on Discrete Mathematics 33(3), 2019). Finally we demonstrate that we can determine if two caterpillars have an agreement forest with at most $k$ blocks in $O^*(2.49^k)$ time, compared to $O^*(3^k)$ for general trees (Chen, Fan and Sze, Theoretical Computer Science 562, 2015), where $O^*(\cdot)$ suppresses polynomial factors.


Introduction
In biology phylogenetic trees are commonplace. These are leaf-labelled trees which show how the entities $X$ at the leaves (most commonly, but not exclusively, species) evolve over time [15]. Unlabelled interior nodes of the trees represent points in time at which hypothetical common ancestors diversified into sub-lineages. Such trees are typically built from data that carries evolutionary signal, such as DNA sequences. However, in the real world there is no unique mapping from DNA sequences to the "true" tree; it depends on the quality of the available data, underlying biological phenomena and multiple model assumptions. Hence, for given data carrying evolutionary signal for a set of species $X$ it might be possible to generate multiple distinct, but equally plausible, trees for $X$. A significant part of the phylogenetics literature is therefore dedicated to understanding when and why trees differ, a phenomenon known as incongruence or discordance [6].
One model for summarizing the topological difference of two trees is the agreement forest. Freely translated, an agreement forest is a summary of the topological building blocks common to both trees. More formally, an agreement forest is a partition of the leaf set $X$ such that each block induces the same topology in both trees and, within each tree, the induced subtrees do not overlap. More details will be provided in section 2.
A partition of $X$ into singletons is vacuously a valid agreement forest, but this does not provide any insight. Rather, in the spirit of parsimony we wish to have an agreement forest with a minimum number of building blocks; this is called a maximum agreement forest (MAF), so called because it is an agreement forest that maximizes the agreement between the two trees. Maximum agreement forests have been studied extensively in recent years; we refer to [5,3,11] for overviews. Here we focus on the situation when the input consists of two unrooted, binary trees, writing uMAF to distinguish our problem from the rooted variant. Unfortunately, even in this limited setting finding the number of blocks in a uMAF, $d_{\mathrm{uMAF}}(T, T')$, is an NP-hard problem [8]. Nevertheless, it is appealing to try to solve the problem in practice, not least because of its close relationship to several other measurements used to compare two phylogenetic trees. In particular, it is closely related to the Tree Bisection and Reconnection (TBR) distance, $d_{\mathrm{TBR}}(T, T')$, which (informally) counts the number of times a subtree has to be detached and reconnected to transform $T$ into $T'$; specifically, we have $d_{\mathrm{TBR}}(T, T') = d_{\mathrm{uMAF}}(T, T') - 1$ [2]. Distances such as TBR help us to understand the underlying connectivity of tree space [9]. Interestingly, $d_{\mathrm{TBR}}(T, T')$ is in turn exactly equal to the hybridization number of $T$ and $T'$, which is the smallest value of $|E| - (|V| - 1)$ ranging over all phylogenetic networks $G = (V, E)$, generalizations of phylogenetic trees to graphs, that topologically simultaneously embed $T$ and $T'$ [16]. This graph-theoretic characterization has been central to recent parameterized complexity results for $d_{\mathrm{TBR}}$ and $d_{\mathrm{uMAF}}$ [11].
In this article we aim to develop a more fine-grained understanding of what makes computation of $d_{\mathrm{uMAF}}$ challenging. We do this by restricting our attention to the problem when the input consists of two caterpillars; these are path-like phylogenetic trees. An example is given in Figure 1.
Figure 1: Example of a caterpillar tree $T$.
We prove several results. In section 3 we will prove that computing $d_{\mathrm{uMAF}}$ for two caterpillars is NP-hard. In section 4 we extend this result to APX-hardness, thus excluding the existence of a polynomial-time approximation scheme for computation of $d_{\mathrm{uMAF}}$, unless P=NP. We note that the hardness is automatically inherited by the computation of $d_{\mathrm{uMAF}}$ on general trees. This is relevant because the APX-hardness of the general problem was stated, but not proven, in [8]. Our result thus closes this gap in the literature. In section 5 we will prove that for two caterpillars there is a tight $7k$ kernel, using just two reduction rules, where $k$ is equal to $d_{\mathrm{uMAF}}$. Specifically: when applied exhaustively to two caterpillars, the well-known subtree and chain reduction rules yield a smaller pair of caterpillars that have at most $7k$ leaves. The same two reduction rules yield a tight $15k$ kernel on general trees [10]. There are also smaller kernels of size $11k$ and $9k$ for general trees [11,12], but in order to obtain those smaller kernels far more complex reduction rules and analysis are needed. It is interesting that for caterpillars the subtree and chain reductions already bring us substantially below the smallest kernel for general trees. Next, in section 6 we will give a branching algorithm that is faster than, at the time of writing, the best branching algorithm by Chen et al. [4]. The algorithm of Chen et al. runs in time $O^*(3^k)$ on general trees, where the $*$ suppresses polynomial factors. In contrast, our algorithm for two caterpillars runs in time $O^*(2.49^k)$.
Finally, in section 7 we conclude with a number of discussion points and open problems.

Preliminaries
For general background on mathematical phylogenetics we refer to [15,7]. An unrooted binary phylogenetic $X$-tree is an undirected tree $T = (V(T), E(T))$ where every internal vertex has degree 3 and whose leaves are bijectively labelled by a set $X$, where $X$ is often called the set of taxa (representing the contemporary species, for example). We use $n$ to denote $|X|$ and often simply write phylogenetic tree when it is clear from the context that we are talking about an unrooted binary phylogenetic $X$-tree. Two phylogenetic trees $T, T'$ on $X$ are considered equal if there is an isomorphism between them that is the identity mapping on $X$, i.e. is label-preserving. A cherry of a tree $T$ on $X$ is a pair of distinct taxa $x, y \in X$ which have a common parent in $T$. A tree $T$ on $|X| \geq 4$ taxa is a caterpillar if it has exactly two cherries. For convenience we define all trees on $|X| \leq 3$ taxa to be caterpillars, too. An equivalent definition is that a tree $T$ is a caterpillar if, after deleting all taxa, the resulting tree is a path. Let $T$ be a tree on $X$. For $X' \subseteq X$ we write $T[X']$ to denote the minimal subtree of $T$ spanning $X'$, and write $T|X'$ to denote the phylogenetic tree obtained from $T[X']$ by suppressing nodes of degree 2.
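The two equivalent characterizations of a caterpillar given above can be checked mechanically. The following Python sketch is purely illustrative: the adjacency-dictionary encoding of a tree and the helper names `from_edges`, `leaves`, `cherries` and `is_caterpillar` are assumptions made for this example and do not come from the article.

```python
def from_edges(edges):
    """Build an undirected adjacency dictionary from a list of edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def leaves(adj):
    """The taxa are the degree-1 nodes of the tree."""
    return {u for u, nb in adj.items() if len(nb) == 1}


def cherries(adj):
    """Pairs of taxa with a common parent (intended for trees on >= 4 taxa)."""
    by_parent = {}
    for x in leaves(adj):
        (parent,) = adj[x]
        by_parent.setdefault(parent, []).append(x)
    return [frozenset(pair) for pair in by_parent.values() if len(pair) == 2]


def is_caterpillar(adj):
    """True if deleting all taxa leaves a path; trees on <= 3 taxa also count."""
    taxa = leaves(adj)
    if len(taxa) <= 3:
        return True
    internal = set(adj) - taxa
    # A tree minus its leaves is still a tree, so it is a path exactly
    # when no remaining node has more than two remaining neighbours.
    return all(len(adj[u] & internal) <= 2 for u in internal)


# A caterpillar on 5 taxa and a non-caterpillar on 6 taxa (three cherries).
caterpillar = from_edges([("a", "p"), ("b", "p"), ("p", "q"), ("c", "q"),
                          ("q", "r"), ("d", "r"), ("e", "r")])
non_caterpillar = from_edges([("a", "p"), ("b", "p"), ("c", "q"), ("d", "q"),
                              ("e", "r"), ("f", "r"),
                              ("p", "z"), ("q", "z"), ("r", "z")])
```

On these two inputs the path test and the cherry count agree: the first tree passes `is_caterpillar` and has two cherries, while the second fails it and has three.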
Let $T$ and $T'$ be two phylogenetic trees on $X$. Let $F = \{B_1, B_2, \ldots, B_k\}$ be a partition of $X$, where each block $B_i$ with $i \in \{1, 2, \ldots, k\}$ is referred to as a component of $F$. We say that $F$ is an agreement forest for $T$ and $T'$ if the following conditions hold.
1. For each $i \in \{1, 2, \ldots, k\}$, we have $T|B_i = T'|B_i$.
2. For each pair $i, j \in \{1, 2, \ldots, k\}$ with $i \neq j$, the subtrees $T[B_i]$ and $T[B_j]$ are vertex-disjoint in $T$, and $T'[B_i]$ and $T'[B_j]$ are vertex-disjoint in $T'$.
Let $F = \{B_1, B_2, \ldots, B_k\}$ be an agreement forest for $T$ and $T'$. The size of $F$ is simply its number of components, $k$. Moreover, an agreement forest with the minimum number of components (over all agreement forests for $T$ and $T'$) is called a maximum agreement forest (MAF) for $T$ and $T'$. The number of components of a maximum agreement forest for $T$ and $T'$ is denoted by $d_{\mathrm{uMAF}}(T, T')$. The Unrooted Maximum Agreement Forest (uMAF) problem is to compute $d_{\mathrm{uMAF}}(T, T')$. It is NP-hard [8], but permits a polynomial-time 3-approximation [17,18].
Subtrees and chains. Let $T$ be a phylogenetic tree on $X$. We say that a subtree of $T$ is pendant if it can be detached from $T$ by deleting a single edge. For $n \geq 2$, let $C = (\ell_1, \ell_2, \ldots, \ell_n)$ be a sequence of distinct taxa in $X$, and for each $i$ let $p_i$ denote the parent of $\ell_i$ in $T$. We call $C$ an $n$-chain of $T$ if there exists a walk $p_1, p_2, \ldots, p_n$ in $T$ and the elements in $p_2, p_3, \ldots, p_{n-1}$ are all pairwise distinct. Note that $\ell_1$ and $\ell_2$ may have a common parent and $\ell_{n-1}$ and $\ell_n$ may have a common parent. Furthermore, if $p_1 = p_2$ or $p_{n-1} = p_n$ then $C$ is said to be pendant in $T$. To ease reading, we sometimes write $C$ to denote the set $\{\ell_1, \ell_2, \ldots, \ell_n\}$. It will always be clear from the context whether $C$ refers to the associated sequence or set of taxa. If a pendant subtree $S$ (resp. an $n$-chain $C$) exists in two phylogenetic trees $T$ and $T'$ on $X$, we say that $S$ (resp. $C$) is a common subtree (resp. chain) of $T$ and $T'$.
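As an illustration of the chain definition, the Python sketch below tests whether a sequence of taxa is an $n$-chain, and whether it is pendant, in a tree stored as an adjacency dictionary. It rests on two assumptions that are our reading rather than statements from the article: $p_i$ is taken to be the parent of $\ell_i$, and consecutive parents in the walk are allowed to coincide (as in $p_1 = p_2$). The function names and the example tree are hypothetical.

```python
def parents_of(adj, seq):
    """Parent of each taxon in seq (every taxon must be a degree-1 node)."""
    return [next(iter(adj[x])) for x in seq]


def is_chain(adj, seq):
    """Check the n-chain condition for a sequence of distinct taxa, n >= 2."""
    if len(seq) < 2 or len(set(seq)) != len(seq):
        return False
    if any(len(adj[x]) != 1 for x in seq):
        return False
    parents = parents_of(adj, seq)
    # Walk condition: consecutive parents coincide or are adjacent.
    for p, q in zip(parents, parents[1:]):
        if p != q and q not in adj[p]:
            return False
    # The interior parents p_2, ..., p_{n-1} must be pairwise distinct.
    middle = parents[1:-1]
    return len(set(middle)) == len(middle)


def is_pendant_chain(adj, seq):
    """A chain is pendant if its first two or last two taxa share a parent."""
    if not is_chain(adj, seq):
        return False
    parents = parents_of(adj, seq)
    return parents[0] == parents[1] or parents[-2] == parents[-1]


# Caterpillar on taxa a..e: cherries {a, b} and {d, e}, with c in the middle.
tree = {
    "a": {"p"}, "b": {"p"}, "c": {"q"}, "d": {"r"}, "e": {"r"},
    "p": {"a", "b", "q"}, "q": {"p", "c", "r"}, "r": {"q", "d", "e"},
}
```

In this example `("a", "b", "c")` is a pendant 3-chain, `("b", "c", "d")` is a 3-chain that is not pendant, and `("a", "d")` is not a chain because the parents of `a` and `d` are not adjacent.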
Let $F = \{B_1, B_2, \ldots, B_k\}$ be an agreement forest for two phylogenetic trees $T$ and $T'$ on $X$, and let $Y$ be a subset of $X$. We say that $Y$ is preserved in $F$ if there exists an element $B_i$ in $F$ with $i \in \{1, 2, \ldots, k\}$ such that $Y \subseteq B_i$. Later in the article we will make use of the following theorem from [11], referred to as the chain preservation theorem.
Theorem 1 ([11]). Let $T$ and $T'$ be two phylogenetic trees on $X$. Let $K$ be an (arbitrary) set of mutually taxa-disjoint chains that are common to $T$ and $T'$. Then there exists a maximum agreement forest $F$ of $T$ and $T'$ such that
1. every $n$-chain in $K$ with $n \geq 3$ is preserved in $F$, and
2. every 2-chain in $K$ that is pendant in at least one of $T$ and $T'$ is preserved in $F$.

NP-hardness
In this section we will prove the following theorem.
Theorem 2. uMAF on caterpillars is NP-complete.
Recall that an independent set of an undirected graph $G = (V, E)$ is a set $I \subseteq V$ of mutually non-adjacent vertices. The problem of computing a maximum-size independent set (MIS) is a well-known NP-hard and APX-hard problem. We will establish our theorem by reducing from MIS on cubic graphs. It is well-known that in a cubic graph, the size of a MIS is at least $|V|/4$. Moreover, MIS remains NP-hard and APX-hard on cubic graphs [1].
Before we can prove our theorem we need to establish a lemma and an observation and describe how two caterpillars $T_G$ and $T'_G$ are built from a cubic graph $G$, which is the input to the MIS problem.
Let $G = (V, E)$ be a cubic graph, where $n = |V|$ and $m = |E| = 3n/2$. For each $v \in V$ we introduce three taxa $v_1, v_2, v_3$. For each vertex $v \in V$ and for each edge $e$ incident to $v$, we introduce two taxa $e_{v\rightarrow}, e_{v\leftarrow}$. Hence, there are in total $3n + (3n \cdot 2) = 9n$ taxa used to encode the actual graph. There will be an additional $6n + 6m = 6n + 9n = 15n$ taxa introduced that have an auxiliary function, so $24n$ taxa in total. The entire construction is as follows.
$D_v$: For each $v \in V$, let $D_v$ be a chain with the taxa $(v_1, v_2, v_3)$.
$C_i$: For $1 \leq i \leq 2(n + m)$, let $C_i$ be a chain with 3 taxa with arbitrary labels. These chains will be mirrored (i.e. have opposite orientations) in $T_G$ and $T'_G$.
$T_G$: Let $T_G$ be a caterpillar alternating each $A_v$ with two $C_i$; this uses the first $2n$ $C_i$ chains. Note that these chains are always in pairs of the form $C_i, C_{i+1}$ ($i$ odd). This is then followed by a block of the remaining $2m = 3n$ $C_i$ chains.
$T'_G$: Let $T'_G$ be a caterpillar which alternates each $D_v$ with two $C_i$, where the $C_i$ are mirrored with respect to their orientation in $T_G$. The second part consists of alternating each $B_e$ with two $C_i$, which are again mirrored with respect to their orientation in $T_G$. The $C_i$ chains in $T'_G$ are in the same pairs as in $T_G$.
Figure 2: The caterpillars $T_G$ and $T'_G$ constructed from $G$. Both caterpillars have $24n$ taxa in total, where $n$ is the number of vertices in $G$.
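The taxa bookkeeping of the construction can be reproduced with a few lines of arithmetic. The Python sketch below only recomputes the counts stated in the text; the function name `construction_sizes` and the dictionary keys are hypothetical.

```python
def construction_sizes(n):
    """Taxa counts for the caterpillars built from a cubic graph on n vertices."""
    if n < 4 or n % 2 != 0:
        raise ValueError("a cubic graph has an even number n >= 4 of vertices")
    m = 3 * n // 2                    # number of edges of a cubic graph
    vertex_taxa = 3 * n               # v_1, v_2, v_3 for every vertex v
    edge_taxa = 2 * 3 * n             # two taxa per (vertex, incident edge) pair
    chain_taxa = 3 * 2 * (n + m)      # 2(n + m) chains C_i with 3 taxa each
    return {
        "graph_taxa": vertex_taxa + edge_taxa,
        "auxiliary_taxa": chain_taxa,
        "total": vertex_taxa + edge_taxa + chain_taxa,
    }
```

For any even $n \geq 4$ the three entries evaluate to $9n$, $15n$ and $24n$, matching the counts above.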
In the hardness proof we will, given a uMAF $F$ of $T_G$ and $T'_G$, construct an independent set $I_F$ from it. $A_v$ will be used to determine whether vertex $v$ in $G$ is in this independent set or not. For each edge $e = \{u, v\}$, $B_e$ will be used to ensure that $u$ and $v$ are not both in the independent set. We return to this point later. First let us look at a small example (Figure 3). The following lemma will show that an agreement forest $F$ on $T_G$ and $T'_G$ that preserves all chains $C_i$ has nice properties which we can use to build an independent set $I_F$.
Proof. Observe that, as all the $C_i$ chains are preserved, every $C_i$ appears as a component of $F$. To see why, note that chain $C_i$ ($i$ odd) cannot be in a component together with the chain $C_{i+1}$. This is because of the opposite orientations of these two chains in $T_G$ and $T'_G$. A chain $C_i$ ($i$ odd) cannot be in a component with taxa that are to its left in $T_G$, because these taxa would have to be on its right in $T'_G$, but this would violate the preservation of $C_{i+1}$. Symmetrically, chain $C_{i+1}$ cannot be together in a component with taxa that are to its right in $T_G$. Hence, all the $C_i$ are components (without any additional taxa) in $F$. Combining this with the fact that each $A_v$ chain is sandwiched between two $C_i$ chains in $T_G$ proves that a component cannot simultaneously intersect with $A_v$ and something outside $A_v$. Exactly the same reasoning holds for the $D_v$ and $B_e$ sets, since these are also sandwiched between pairs of $C_i$ chains. This establishes (i).
Towards (ii), let $v$ be an arbitrary vertex in $V$ and let $e_1, e_2, e_3$ be the three edges it is incident to. Observe that $A_v$ consists of the disjoint union of the following four sets: $D_v$, $A_v \cap B_{e_1}$, $A_v \cap B_{e_2}$, $A_v \cap B_{e_3}$, containing 3, 2, 2, 2 taxa respectively. Due to (i), any component intersecting one of these four sets cannot intersect any of the other three.
Now, if a component $B$ of $F$ contains all three taxa from $D_v$, then two components each are required to cover the three $(A_v \cap B_{e_i})$ sets; so 7 components in total. It can be easily verified in a similar way that if $B$ contains 2 taxa from $D_v$, 7 components are required, and that if $B$ contains exactly 1 taxon from $D_v$ then at least 6 components are required. In fact the only way to cover $A_v$ with 6 components is as follows: the three singletons $\{v_1\}, \{v_2\}, \{v_3\}$, together with the component $\{e_{v\leftarrow}, e_{v\rightarrow}\}$ for each of the three edges $e$ incident to $v$.
The following observation, which we state without proof, will be useful for both the NP-hardness and APX-hardness reductions.
Observation 1. Let $F$ be an arbitrary agreement forest that preserves all the $C_i$ chains. If an $A_v$ set is covered by 7 or more components, then deleting all the components that intersect with $A_v$ and adding the following components yields a new agreement forest $F'$ such that $|F'| \leq |F|$ and all the $C_i$ chains are still preserved in $F'$: the component $\{v_1, v_2, v_3\}$, together with a singleton component for each of the six remaining taxa of $A_v$.
We can now move on to the proof of Theorem 2. The high-level idea is that if an $A_v$ set is covered by 6 components, then (by Lemma 1) there is a unique way of doing this, which in turn enforces that each $A_u$ set of a neighbour $u$ of $v$ requires at least 7 components. The neighbouring $A_u$ can then be assumed via Observation 1 to consist of the 7 components named in that observation. In this way the selection of independent sets with many elements is preferred.
Proof. For the input graph $G$ to the MIS problem we construct $T_G$ and $T'_G$ as described in Definition 1. We let $k$ be the size of a maximum independent set of $G$ and write $\mathrm{Opt}(T_G, T'_G)$ to denote the size of a uMAF of $T_G$ and $T'_G$. We prove that $\mathrm{Opt}(T_G, T'_G) = 12n - k$.
We first show that $\mathrm{Opt}(T_G, T'_G) \leq 12n - k$. Let $I$ be a maximum independent set of $G$. For each vertex $v \in I$, which has incident edges $e_1, e_2, e_3$, we introduce the 6 components that form the unique cover of $A_v$ by 6 components described in Lemma 1, and for each vertex $v \notin I$ we introduce the 7 components listed in Observation 1. Finally, we add each $C_i$ as its own component. Summarizing, our agreement forest consists of the following components:
number of components corresponding to $v \in I$: $6k$
number of components corresponding to $v \notin I$: $7(n - k)$
number of components corresponding to the $C_i$ chains: $2(n + m)$
Recalling that $m = 3n/2$ we obtain a total of $6k + 7(n - k) + 2n + 3n = 12n - k$ components. This concludes the proof that $\mathrm{Opt}(T_G, T'_G) \leq 12n - k$.
For the other direction, by applying Theorem 1 (the chain preservation theorem) to the $C_i$ chains, we know that there is a maximum agreement forest $F$ of $T_G$ and $T'_G$ in which all the $C_i$ are preserved. Fix such an $F$. As noted in Lemma 1 the components of $F$ that intersect with a given $A_v$ are all completely contained inside $A_v$. Now, we apply Observation 1 to all $A_v$ that are covered by 7 or more components; this does not increase the size of the forest. The remaining $A_v$ are covered by exactly 6 components each, and from Lemma 1 there is a unique way of doing this. Observe that it is not possible, whenever $u$ and $v$ are adjacent vertices, and $e$ is the edge between them, to cover both $A_u$ and $A_v$ with 6 components. This would require the agreement forest to contain both the components $\{e_{u\leftarrow}, e_{u\rightarrow}\}$ and $\{e_{v\leftarrow}, e_{v\rightarrow}\}$. However, this would mean that $B_e$ is covered by a single component. But $B_e$ contains taxa that are also in $A_u$. Thus, the component that covers $B_e$ must be fully contained in $A_v$ whilst containing taxa of $A_u$, which are not present in $A_v$. This is not possible. Hence, the $A_v$ that are covered by exactly 6 components point out an independent set.
Let $p$ (respectively, $q$) be the number of $A_v$ in $F$ that are covered by 6 (respectively, 7) components. Then $p + q = n$ and, because the $A_v$ covered by 6 components point out an independent set, $p \leq k$. This gives us $\mathrm{Opt}(T_G, T'_G) = |F| = 6p + 7q + 2(n + m) = 6p + 7(n - p) + 5n = 12n - p \geq 12n - k$.
Finally, we note that the uMAF problem is definitely in NP. Given a partition $F$ of the taxa $X$ of the two input trees, it is straightforward to check in time $O(|X|^4)$ that it induces a valid agreement forest. Specifically, we check that the induced subtrees are mutually disjoint in each input tree, and that the set of quartets (i.e. phylogenetic trees on subsets of 4 taxa) induced by each block of the partition is the same in both input trees: two phylogenetic trees are topologically equal if and only if they each induce the same set of quartets [15]. Combined with the above NP-hardness result, this proves that the problem is NP-complete.
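The membership check described above (disjoint induced subtrees plus equal quartets) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: trees are adjacency dictionaries, quartet topologies are read off from unit-length path distances, and all names (`caterpillar`, `spanning_nodes`, `quartet`, `is_agreement_forest`) are hypothetical helpers introduced for this example.

```python
from collections import deque
from itertools import combinations


def caterpillar(order):
    """Adjacency dict of the caterpillar with taxa in the given order (>= 4 taxa)."""
    adj = {}

    def link(u, v):
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    spine = [("spine", i) for i in range(len(order) - 2)]
    for s, t in zip(spine, spine[1:]):
        link(s, t)
    link(order[0], spine[0])
    link(order[-1], spine[-1])
    for taxon, s in zip(order[1:-1], spine):
        link(taxon, s)
    return adj


def distances_from(adj, source):
    """Breadth-first search distances (number of edges) from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def quartet(dist, a, b, c, d):
    """Induced split on {a, b, c, d}: the pairing with the smallest distance sum."""
    pairings = [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]
    best = min(pairings,
               key=lambda p: dist[p[0][0]][p[0][1]] + dist[p[1][0]][p[1][1]])
    return frozenset(frozenset(side) for side in best)


def spanning_nodes(adj, block):
    """Nodes of the minimal subtree spanning block, found by pruning other leaves."""
    keep = set(adj)
    degree = {u: len(adj[u]) for u in adj}
    stack = [u for u in adj if degree[u] <= 1 and u not in block]
    while stack:
        u = stack.pop()
        if u not in keep:
            continue
        keep.discard(u)
        for w in adj[u]:
            if w in keep:
                degree[w] -= 1
                if degree[w] <= 1 and w not in block:
                    stack.append(w)
    return keep


def is_agreement_forest(t1, t2, blocks):
    """Check both agreement-forest conditions for a candidate partition."""
    taxa = {u for u, nb in t1.items() if len(nb) == 1}
    blocks = [set(b) for b in blocks]
    if any(not b for b in blocks):
        return False
    if sorted(x for b in blocks for x in b) != sorted(taxa):
        return False  # not a partition of the taxa
    for tree in (t1, t2):
        spans = [spanning_nodes(tree, b) for b in blocks]
        if any(s & t for s, t in combinations(spans, 2)):
            return False  # induced subtrees overlap
    d1 = {x: distances_from(t1, x) for x in taxa}
    d2 = {x: distances_from(t2, x) for x in taxa}
    return all(quartet(d1, *q) == quartet(d2, *q)
               for b in blocks for q in combinations(sorted(b), 4))


T1 = caterpillar("abcdef")
T2 = caterpillar("abcedf")
```

For the two example caterpillars, the partition `[set("abcdf"), {"e"}]` is accepted, whereas the single block `[set("abcdef")]` is rejected (the quartet on `c, d, e, f` differs) and `[set("abef"), set("cd")]` is rejected because the induced subtrees overlap in `T1`.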

APX-completeness
In the previous section we have proven that uMAF on caterpillars is NP-complete. Now we will prove that uMAF on caterpillars is APX-complete. The (general) problem is known to have a polynomial-time 3-approximation [17,18], which immediately places the problem in APX. The APX-hardness of the problem on general trees was stated but not proven in [8]. Our result thus confirms and strengthens this claim to caterpillars. The reduction is essentially the same as in the NP-hardness proof, with some slight modifications because we are working with approximation algorithms rather than exact algorithms.
Before starting, it is helpful to establish the following corollary to Theorem 1. It differs from that theorem in the sense that it considers agreement forests that are not necessarily optimal, and it is explicitly algorithmic.
Corollary 1 ([11]). Let $T$ and $T'$ be two phylogenetic trees on $X$. Let $K$ be an (arbitrary) set of mutually taxa-disjoint chains that are common to $T$ and $T'$. Let $F$ be an arbitrary agreement forest of $T$ and $T'$. There exists an agreement forest $F'$ such that $|F'| \leq |F|$ and
1. every $n$-chain in $K$ with $n \geq 3$ is preserved in $F'$, and
2. every 2-chain in $K$ that is pendant in at least one of $T$ and $T'$ is preserved in $F'$.
Also, $F'$ can be constructed from $F$ in polynomial time.
Proof. Although not stated as such, the proof is implicit in the proof of Theorem 1 given in [11]. The proof there argues that, if one of the chains $C \in K$ is not preserved, then the agreement forest can be explicitly modified such that $C$ is preserved, the total number of components does not increase, and any chains that were preserved prior to this transformation are also preserved afterwards. It is easy to check in polynomial time whether a chain in $K$ is preserved, and the constructive modifications described in the proof can also be easily undertaken in polynomial time. The only subtlety in the re-use of the proof is that in several places a contradiction is triggered on the assumption that $F$ was a maximum agreement forest. However, a careful reading shows that it is not necessary to use proof by contradiction here at all, and that the assumption that $F$ is maximum is not required; it was simply an easy way to end the proof. Instead of deriving a contradiction, the chain $C$ can be preserved such that no other preserved chains are damaged and such that the number of components in the agreement forest decreases (which is fine for our purposes).

Theorem 3. uMAF on caterpillars is APX-complete.
Corollary 1 makes it fairly easy to prove APX-hardness. In particular, given a (not necessarily optimal) agreement forest $F$ for $T_G$ and $T'_G$, Corollary 1 shows that we can construct in polynomial time an agreement forest $F'$ that is no larger than $F$ and in which all the $C_i$ chains are preserved.
We give the definition of an L-reduction [13] and then show that there exists an L-reduction from MIS on cubic graphs to uMAF on caterpillars, proving that uMAF on caterpillars is APX-hard.

Definition 2. Define two mappings $f$ and $g$ and two positive constants $\alpha$ and $\beta$ such that the following holds:
1. $f$ is a function that in polynomial time maps the input $G$ to MIS to two trees $T_G$ and $T'_G$ that are the input for uMAF;
2. For any input $G$ we have $\mathrm{Opt}(T_G, T'_G) \leq \alpha \cdot \mathrm{Opt}(G)$, where here $\mathrm{Opt}(G)$ denotes the size of a maximum independent set of $G$;
3. $g$ is a function that maps in polynomial time an agreement forest $F$ for $T_G$ and $T'_G$ to an independent set $I_F$ of $G$;
4. For any agreement forest $F$ for $T_G$ and $T'_G$ we have $|\mathrm{Opt}(G) - |I_F|| \leq \beta \cdot |\mathrm{Opt}(T_G, T'_G) - |F||$.
Together this forms an L-reduction from MIS to uMAF.

Now for the proof of Theorem 3.
Proof. Let $f$ be the mapping described in Definition 1. Clearly $T_G$ and $T'_G$ can be constructed in polynomial time. This establishes the first property of the L-reduction.
For the second property, $G$ is a cubic graph so $\mathrm{Opt}(G) \geq n/4$. Let $\mathrm{Opt}(G) = k$. Then, as in the proof of Theorem 2, we have $\mathrm{Opt}(T_G, T'_G) = 12n - k$. We require $12n - k \leq \alpha k$, which is equivalent to $12n/k \leq \alpha + 1$. Given that $k \geq n/4$, the left hand side of the inequality is at most 48. Hence, it is sufficient to select $\alpha = 47$.
For the third property, let $g$ be the polynomial-time mapping defined as follows. Let $F$ be an arbitrary agreement forest of $T_G, T'_G$; this is the input to $g$. We apply Corollary 1 to it (letting $K$ be the set of the $C_i$ chains); this yields in polynomial time a new agreement forest $F'$ such that $|F'| \leq |F|$ and in which all the $C_i$ chains are preserved. We then apply Observation 1 to all $A_v$ in $F'$ which have 7 or more components. Let $F''$ denote this transformed agreement forest; we have $|F''| \leq |F'| \leq |F|$. We create an independent set $I_F \subseteq V$ as follows: $v \in I_F$ if and only if $A_v$ is covered by 6 components in $F''$. As argued in the NP-hardness proof, this will create an independent set.
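For the fourth property we need to find a value for $\beta$ such that $|\mathrm{Opt}(G) - |I_F|| \leq \beta \cdot |\mathrm{Opt}(T_G, T'_G) - |F||$. Let $\ell = |I_F|$ be the number of $A_v$ in $F''$ that are covered by 6 components. We know that $|F| \geq |F'| \geq |F''| = 12n - \ell$. Observe: $|\mathrm{Opt}(G) - |I_F|| = k - \ell = (12n - \ell) - (12n - k) = |F''| - \mathrm{Opt}(T_G, T'_G) \leq |F| - \mathrm{Opt}(T_G, T'_G)$. So we pick $\beta = 1$ and we are done.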
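The arithmetic behind the two constants can be sanity-checked numerically. The Python sketch below is only a worked check under assumptions drawn from the proof: a forest in which $p$ of the $A_v$ sets are covered by 6 components and the rest by 7 has $6p + 7(n - p) + 2(n + m)$ components, and $k$ ranges over values with $n/4 \leq k \leq n/2$ (the upper limit being the standard bound on independent sets in cubic graphs). The function names are hypothetical.

```python
ALPHA, BETA = 47, 1


def forest_size(n, six_covered):
    """Components of a forest in which `six_covered` sets A_v use 6 components."""
    m = 3 * n // 2
    return 6 * six_covered + 7 * (n - six_covered) + 2 * (n + m)


def l_reduction_holds(n):
    """Check both L-reduction inequalities for a cubic graph on n vertices."""
    if n < 4 or n % 2 != 0:
        raise ValueError("a cubic graph has an even number n >= 4 of vertices")
    smallest_k = -(-n // 4)  # ceiling of n / 4
    for k in range(smallest_k, n // 2 + 1):
        opt = forest_size(n, k)
        # Property 2: Opt(T_G, T'_G) = 12n - k <= alpha * Opt(G).
        if opt != 12 * n - k or opt > ALPHA * k:
            return False
        # Property 4, for a transformed forest with ell six-covered sets.
        for ell in range(k + 1):
            if abs(k - ell) > BETA * abs(opt - forest_size(n, ell)):
                return False
    return True
```

The case $n = 4$, $k = 1$ shows that $\alpha = 47$ cannot be lowered within this argument, since there $12n - k = 47 = 47k$.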

A tight 7k kernel
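Recall the definitions of common subtrees and common chains from the preliminaries. It is well-known that the following two polynomial-time reduction rules do not alter the size of the uMAF [2]:
Subtree reduction. If $T$ and $T'$ have a maximal common pendant subtree $S$ with at least two leaves, then reduce $T$ and $T'$ to $T_r$ and $T'_r$, respectively, by replacing $S$ with a single leaf with a new label.
Chain reduction. If $T$ and $T'$ have a maximal common $n$-chain $C = (\ell_1, \ell_2, \ldots, \ell_n)$ with $n \geq 4$, then reduce $T$ and $T'$ to $T_r = T|(X \setminus \{\ell_4, \ell_5, \ldots, \ell_n\})$ and $T'_r = T'|(X \setminus \{\ell_4, \ell_5, \ldots, \ell_n\})$, respectively.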
When applied to exhaustion on two unrooted binary trees, at which point we say the trees are fully reduced, these rules yield an instance with (ignoring additive terms) at most $15k$ taxa [10], where $k$ is the size of the uMAF, and the analysis is tight.
Note that applying the subtree or chain reduction to a caterpillar produces a new caterpillar. In this section we will show that, when applied to exhaustion on two caterpillars, a much smaller kernel is obtained than on general unrooted binary trees.
Theorem 4. There is a $7k$ kernel for uMAF on caterpillars using only the common chain and subtree reductions, and this is tight up to a constant additive term.
Proof. Let $F$ be a uMAF for caterpillars $T$ and $T'$ with $k$ components, where $T$ and $T'$ are fully reduced. We prove that $n \leq 7k$. Suppose that $B$ is a component of $F$ with at least 4 taxa. Observe that in at least one of $T$ and $T'$, say $T$, there is some taxon $x \notin B$ such that $\{x\} \in F$ (i.e. $x$ is a singleton component in $F$), $x$ is adjacent to $T[B]$ (i.e. there is an edge $\{x, u\}$ in $T$ such that $u$ is a node of $T[B]$), and every path in $T$ from $x$ to a taxon of $B$ contains at least three edges. If this were not so, then $B$ would be a common chain of length at least 4 and this would contradict the assumption that the chain reduction had been applied to exhaustion. Thus the existence of $B$ forces $x$ to be a singleton component in $F$. We say that $x$ has been orphaned by $B$ in $T$. Observe:
1. Within a given tree, say $T$, a taxon $x$ can be orphaned by at most one component.
2. A taxon $x$ can be orphaned by in total at most two different components of $F$: one in $T$, and one in $T'$.
3. Within a given tree, say $T$, a component $B \in F$ with at least 4 leaves orphans at least $\lfloor (|B| - 1)/3 \rfloor$ taxa.
The third observation is the result of the pigeon-hole principle and the fact, as noted above, that to avoid triggering the chain reduction every sequence of four taxa within a component of $F$ must orphan at least one taxon. So a sequence of 5 taxa needs to orphan at least one taxon, and the same is true for a sequence of 6; a sequence of 7 needs to orphan at least 2 taxa, and so on. Now, we are ready to bound $n$. Let $x_i$ (where $i \geq 1$) be the number of components in $F$ that contain exactly $i$ taxa. The total number of taxa is thus $\sum_i i \cdot x_i$. To obtain an upper bound, it is sufficient to maximize this sum subject to all $x_i$ being non-negative integers and two constraints: $\sum_{i \geq 1} x_i \leq k$ and $\sum_{i \geq 2} \lfloor (i - 1)/3 \rfloor \cdot x_i \leq 2 x_1$. The second constraint is the result of combining the second and third observations above. (Note that in the summations we could, if desired, take $6k + 3$ as a trivial, finite upper bound on values of $i$ that need to be considered. This is because if $i \geq 6k + 4$ then any feasible integral solution to the above constraints must have $x_i = 0$, because of the first constraint.) As we are only seeking an upper bound on the number of taxa, we can relax the integrality constraints on the $x_i$ variables and allow them to be fractional. This gives us a linear program (LP). We can use weak duality to place an upper bound on this LP, which is thus an upper bound on the original integral program and thus an upper bound on the size of the kernel (see e.g. [14] for more background on LP duality). Specifically, we obtain a dual LP with two dual variables $y_1, y_2$ (corresponding to the two constraints above), an objective function $k \cdot y_1$ and two types of constraints. The first constraint (corresponding to $x_1$ in the original LP) is $y_1 - 2y_2 \geq 1$. For $i \geq 2$, the corresponding dual constraint is $y_1 + \lfloor (i - 1)/3 \rfloor \cdot y_2 \geq i$.
It can be verified that taking $y_1 = 7$ and $y_2 = 3$ yields a feasible solution to all dual constraints, achieving an objective function value of $7k$. This completes the proof that the kernel has at most $7k$ taxa. Now we will prove that this bound is tight up to additive terms by giving for each $k \geq 3$ two fully reduced caterpillars on $7k - 8$ leaves, where $k$ is the size of the uMAF. See Figure 6.
Remark. We observe that the size of the kernel (for the same reduction rules) can alternatively be bounded by $7k + O(1)$ by leveraging the generator approach of [11]. The details of the generator machinery used there are beyond the scope of this article but the high-level idea is as follows; we ignore additive terms here. If the uMAF has $k$ components, this can be modelled as adding leaves to a cubic multigraph with $3k$ edges, and distributing $2k$ breakpoints (i.e. $k$ per tree) across those edges, such that each edge receives 0, 1 or 2 breakpoints. An edge with 0 breakpoints can receive at most 3 taxa. In the analysis of [11] it is observed that an edge with 1 breakpoint can receive at most 4 taxa, and the same for edges with 2 breakpoints. However, an edge with 1 or 2 breakpoints, and 2 or more taxa, necessarily includes a cherry (or contradicts the assumed optimality of the uMAF). Crucially, when the input consists of 2 caterpillars, there are only 4 cherries to divide across the edges. Hence, there are at most a constant number of edges that satisfy both properties: contains at least one breakpoint, and has 2 or more taxa. The counting equation is then optimized (i.e. describing the worst case: two caterpillars with the highest number of leaves possible) by taking $2k$ 0-breakpoint edges with 3 taxa on each, and $k$ edges each with 1 taxon and 2 breakpoints. This gives $7k + O(1)$ taxa.
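The dual solution $y_1 = 7$, $y_2 = 3$ used in the proof of Theorem 4 can be checked numerically, and the resulting bound can be compared with a brute-force search over small integral solutions. The Python sketch below is illustrative only. It assumes the primal constraints are read as "at most $k$ components" and "total orphan demand $\sum_i \lfloor (i-1)/3 \rfloor x_i$ at most $2x_1$", which is our reading of the dual constraints stated in the proof; the function names are hypothetical.

```python
from itertools import combinations_with_replacement


def orphans_needed(i):
    """Lower bound on taxa orphaned (per tree) by a component with i taxa."""
    return (i - 1) // 3


def dual_feasible(y1, y2, max_size):
    """Check the dual constraints for component sizes 1..max_size."""
    if y1 - 2 * y2 < 1:  # constraint corresponding to x_1
        return False
    return all(y1 + orphans_needed(i) * y2 >= i for i in range(2, max_size + 1))


def max_taxa_bruteforce(k):
    """Largest number of taxa over all integral solutions with <= k components."""
    best = 0
    sizes = range(1, 6 * k + 4)  # sizes above 6k + 3 are never feasible
    for count in range(1, k + 1):
        for comps in combinations_with_replacement(sizes, count):
            singletons = comps.count(1)
            if sum(orphans_needed(i) for i in comps) <= 2 * singletons:
                best = max(best, sum(comps))
    return best
```

By weak duality, `max_taxa_bruteforce(k)` can never exceed `7 * k` once `dual_feasible(7, 3, ...)` holds; the brute force is only practical for very small `k`.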

An improved FPT branching algorithm for caterpillars
In this section we prove the following theorem. As usual, $O^*$ notation suppresses polynomial factors.
Theorem 5. Let $T$ and $T'$ be caterpillars on the same set of taxa $X$. For each $k$, it can be determined in time $O^*(2.49^k)$ whether $T$ and $T'$ have an agreement forest with at most $k$ components.
We start with some simple observations. Recall that we vacuously allow trees on 3 or fewer taxa to be regarded as caterpillars.
Observation 2. Let $T$ be a caterpillar on $X$ and let $X' \subseteq X$. Then $T|X'$ is also a caterpillar.
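Observation 2 becomes very concrete if a caterpillar is stored as the left-to-right order of its taxa along the spine, where the first two and the last two taxa form the cherries. The Python sketch below uses this encoding, which is an assumption made for illustration (as are the names `restrict`, `canonical` and `same_caterpillar`): restricting to a subset simply filters the sequence, and two sequences describe the same caterpillar exactly when they agree up to reversal and up to swapping the two taxa inside an end cherry.

```python
def restrict(order, subset):
    """Sequence encoding of T|X': keep the taxa of the subset, in spine order."""
    return tuple(x for x in order if x in subset)


def canonical(order):
    """Order-independent key: end cherries are unordered, direction is ignored."""
    if len(order) <= 3:
        return frozenset(order)  # all trees on <= 3 taxa coincide
    left = frozenset(order[:2])
    middle = tuple(order[2:-2])
    right = frozenset(order[-2:])
    return frozenset({(left, middle, right), (right, middle[::-1], left)})


def same_caterpillar(order1, order2):
    """True if both sequences encode the same caterpillar on the same taxa."""
    return set(order1) == set(order2) and canonical(order1) == canonical(order2)


def block_agrees(order1, order2, block):
    """Check the condition T|B = T'|B for two caterpillars and a block B."""
    return same_caterpillar(restrict(order1, block), restrict(order2, block))
```

For instance, `block_agrees(tuple("abcdef"), tuple("abcedf"), set("abcdf"))` holds, whereas the two full sequences do not encode the same caterpillar.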
Lemma 2. Let $T$ and $T'$ both be caterpillars on $X$. Suppose $a \in X$ is part of a cherry in $T$. For an agreement forest $A$ which does not contain $\{a\}$ as a singleton component, let $B_a$ be the component of $A$ that contains $a$. Let $B_a = \{a\} \cup L \cup R$ where $L$ and $R$ are the taxa in the two subtrees sibling to $a$ in $T|B_a = T'|B_a$ (see Figure 7). Then $|L| \leq 1$ or $|R| \leq 1$. In particular: $a$ is part of a cherry in $T|B_a = T'|B_a$.
Proof. Towards a contradiction, suppose $|L| \geq 2$ and $|R| \geq 2$. Then $a$ is not part of a cherry in $T|B_a = T'|B_a$. However, let $c$ be the taxon from $L \cup R$ that is closest to $a$ in $T$. Due to the caterpillar structure of $T$, it follows that $\{a, c\}$ is a cherry in $T|B_a$, yielding a contradiction. Hence, $|L| \leq 1$ or $|R| \leq 1$. Combining this with the fact that $L \cup R \neq \emptyset$ (due to the assumption that $B_a \neq \{a\}$), we have that $a$ is part of a cherry in $T|B_a = T'|B_a$.
Figure 7: If $a$ is part of a cherry in $T$, then an agreement forest of $T$ and $T'$ in which $a$ is part of a non-singleton component $B_a$ has the property that $|L| \leq 1$ or $|R| \leq 1$.

High-level idea of the branching algorithm
We draw inspiration from the branching algorithm of Chen et al. [4] and earlier work in a similar vein such as by Whidden et al [17]. We start with two caterpillar trees T and T ′ on X. The high-level idea is to progressively cut edges in one of the caterpillars, say T ′ , to obtain a forest F ′ (initially F ′ = T ′ ) with an increasing number of components, until it becomes an agreement forest for T and T ′ . Each edge cut increases the size of the forest by 1. Hence, if we wish to know whether there exists an agreement forest with at most k components, we can make at most k − 1 edge cuts in T ′ . As in earlier work, at each step we apply various "tidying up" steps: 1. If a singleton component is created in F ′ comprising a single taxon a, then we delete a from both T and F ′ ; 2. Degree-2 nodes are always suppressed; 3. Common cherries are always reduced into a single taxon in both trees.
The way we will choose edges to cut, combined with the tidying up steps, ensures that (unlike T ′ ) T remains connected at every step. More formally: if, at a given step, T ′ has been cut into a forest F ′ and X ′ is the union of the taxa in F ′ , then T will have been transformed into T |X ′ . To keep notation light we will henceforth refer simply to T and F ′ , with the understanding that we are actually referring to the tree-forest pair encountered at a specific iteration of the algorithm.
The general decision problem is as follows. We are given a (T, F′) tree-forest pair on a set X′ ⊆ X of taxa, and a parameter k, and we wish to know: is it possible to transform F′ into an agreement forest making at most k − 1 cuts (more formally: an agreement forest for T|X′ and T′|X′)? If we can answer this question we will, in particular, be able to answer the original question of whether the original input (T, T′) has an agreement forest with at most k components. We use a recursive branching algorithm to answer this query and use T(k) to denote the running time required to answer the query.
We will show that T(k) is O*(2.49^k). We will make heavy use of the following fact.
Observation 3. At every step of the algorithm, T will always be a caterpillar and, due to Observation 2, F′ will always be a forest of caterpillars.
Our first branching rule is well-known in the literature; see also Figure 8.
Branching rule 0 ([17]). Suppose that T contains a cherry {a, b} and that, in F′, a and b are in different trees T′_a and T′_b. Then in any agreement forest obtained by cutting edges in F′, at least one of a and b is a singleton component. Hence we can branch by guessing whether to delete a, or to delete b. This immediately yields the recurrence T(k) ≤ 2T(k − 1), where the 2T(k − 1) term corresponds to the fact that we delete a or b, and each such guess requires one edge cut. Such branching yields a bound of at most O*(2^k) and is thus clearly compatible with our overall goal of O*(2.49^k). So we can safely apply Branching Rule 0 whenever we come across such a situation.
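The two-way guess in Branching Rule 0 has the same shape as the classic bounded search tree for hitting a set of conflicting pairs. The sketch below is a toy stand-in under that assumption, not the algorithm of this section: it takes a fixed list of pairs of which at least one member must be deleted, and it ignores the tidying up steps, which in the real algorithm can create new cherries after each cut. The function name `within_budget` and the input format are hypothetical.

```python
def within_budget(conflicts, budget):
    """Return True if deleting at most `budget` taxa can hit every conflicting pair.

    `conflicts` is a list of pairs (a, b); for each pair at least one taxon
    must be deleted, mirroring the guess "delete a, or delete b".
    """
    if not conflicts:
        return True           # nothing left to resolve
    if budget == 0:
        return False          # a conflict remains but no cuts are left
    a, b = conflicts[0]
    for deleted in (a, b):    # two branches, each costing one cut
        remaining = [p for p in conflicts if deleted not in p]
        if within_budget(remaining, budget - 1):
            return True
    return False


# Hypothetical conflicts between taxa.
pairs = [("a", "b"), ("b", "c"), ("c", "d")]
print(within_budget(pairs, 1))
print(within_budget(pairs, 2))
```

Each call spawns at most two recursive calls with the budget reduced by one, so the search tree has at most 2^budget leaves, which is the recurrence T(k) ≤ 2T(k − 1) in miniature.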
We henceforth assume Branching Rule 0 does not apply. This means that a and b are a cherry in T and are part of the same tree T′_ab in F′, as shown in Figure 9. We can also assume that a and b do not have a common parent in T′_ab, because then {a, b} would be a common cherry and would have been reduced by the "tidying up" steps. Due to Observation 3, T and T′_ab will always be caterpillars. This is the starting point for all our new branching rules.
In this case there are two branches. When we write "Cut off L" we mean: cut the edge e_La between the L subtree and the parent of a, as shown in Figure 9. "Cut off R" means: cut the edge e_Rb between the R subtree and the parent of b. The two branches are:
1. Cut off L
2. Cut off R
Figure 9: The new branching rules all consider the situation that a and b form a cherry in T, and are part of the same tree T′_ab in F′ where they do not form a cherry. Note that the parents of a and b might not be adjacent, and that one or both of L, R might be empty.
This yields at most T(k) ≤ 2T(k − 1), and thus T(k) ≤ 2^k. (Note that if L or R is empty, then {a, b} form a common cherry and would have been reduced.) To prove the correctness of this branching rule we will show that, if F′ can be transformed into an agreement forest A by cutting at most k − 1 additional edges, then there exists an agreement forest A′, obtained by making at most k − 1 additional cuts to F′, such that the following holds: there exists an edge e ∈ {e_La, e_Rb} such that every component of A′ avoids the edge e.
Proof. Let A be an agreement forest obtained from F′ with at most k − 1 additional cuts. First, suppose that a is not a singleton in A. Then, a is part of a component B_a ∈ A such that a is in a cherry with some taxon c in T|B_a = T′|B_a (Lemma 2). If c = b then it is not possible for B_a to simultaneously contain a, b, an element of L and an element of R, because T|B_a would contain {a, b} as a cherry, but T′|B_a would not. Hence, at least one of e_La, e_Rb is not used by B_a, and no other component of A uses that edge either (because B_a contains, in addition to a, at least one other taxon).
So, suppose that c ≠ b. We distinguish two subcases.
(i) b ∈ B_a. Then B_a = {a, b, c}. This is because T|B_a = T′|B_a contains {a, c} as a cherry (by assumption) but also {a, b} as a cherry (due to the topology of T), and a taxon can only be in two or more cherries of a phylogenetic tree if the tree has three taxa. Once again, this means that it is safe to cut at least one of e_La, e_Rb.
(ii) b ∉ B_a. Note that b is then necessarily a singleton component in A due to {a, b} being a cherry in T and c ∈ B_a. Now, we claim that deleting B_a and {b} from A, and replacing them with (B_a \ {c}) ∪ {b} and {c}, yields an agreement forest A′ (with the same number of components as A). The main reason for this is that in A, due to Lemma 2, |B_a ∩ L| ≤ 1 or |B_a ∩ R| ≤ 1. In particular: if B_a ∩ L = {c}, or B_a ∩ R = {c}, then it is easy to verify that A′ is still an agreement forest. Alternatively, suppose that B_a ∩ L contains c and at least one other taxon. (A symmetrical analysis holds if B_a ∩ R contains c and at least one other taxon.) Irrespective of how close c is to a in B_a, we have that |B_a ∩ R| = 0, because otherwise a and c do not form a cherry. It can then again be easily checked that, irrespective of the size of B_a, A′ is an agreement forest.
Crucially, there exists e ∈ {e La , e Rb } such that all components in A ′ avoid e. At this stage of the proof we have shown that, if a is not a singleton in A, then the branching rule is safe. Due to symmetry between a and b, the same proof shows that, if b is not a singleton in A, then the branching rule is safe. Hence, the only situation left to consider is when both a and b are singletons in A. We produce a new agreement forest

Branching Rule 2: There is at least one taxon between a and b in T′_ab
Assume a and b have a number of taxa between them in T′_ab. Denote this chain of taxa between a and b as C, with i ≥ 1 being the number of taxa in C, as shown in Figure 10. We note that one or both of L, R can potentially be empty. If L is empty then a is in a cherry in T′_ab with some taxon c ∈ C. The taxon c can then be moved out of C to take the role of L, reducing the number of taxa in C by 1. The same transformation can be applied if R is empty. This allows us to assume that both L and R are non-empty, but at the price of reducing the length of C by at most 2. Observe that if both L and R are empty then, before moving any taxa out of C, we have i ≥ 3 because otherwise we are in Branching Rule 1. In this branching rule we deal with the case i ≥ 2 first.
Figure 10: In the tree T′_ab taxa a and b have a chain C of taxa between them. Note that it is also possible that one or both of L, R are empty, but by moving taxa out of the chain C we can assume that L and R are both non-empty. After making sure both L and R are non-empty we assume that C still contains at least 2 taxa. If C only contains 1 taxon we will need to use the later branching rule.
Let A be an agreement forest obtained from F′ by at most k − 1 additional cuts. Either a is a singleton component in A; or b is a singleton component in A; or a and b are together in some component. If a and b are together in some component B_ab ∈ A, then the following implications hold:
• If B_ab contains at least one taxon from L, then it does not contain any taxa from R or C.
• If B_ab contains at least one taxon from R, then it does not contain any taxa from L or C.
• If B_ab contains some taxon from C, then it contains exactly one such taxon, and no taxa from L or R, i.e. |B_ab| = 3.
Note that, if B_ab contains exactly one taxon d ∈ C, then all the taxa outside B_ab whose parents lie on the path from cherry {a, b} to d in T will necessarily be singleton components.
There are a number of different instances we can come across. We will group them together into two main cases. Note that we can again assume that L and R are both non-empty. Remember that if either L or R is empty we end up in an instance for Branching Rule 1.
First assume that Y has no common taxa with T′_ab, Y ∩ T′_ab = ∅. With this assumption it is sufficient to branch off into three branches. For each branch we denote (an upper bound on) the magnitude of the parameter in the corresponding recursive call. This gives the recursion: It can be verified that the solution to the induced recurrence is the positive root of the polynomial
Proof of correctness. Let A be an agreement forest obtained by making at most k − 1 additional cuts in F′.
If A has a component B such that a, b ∈ B but B ≠ {a, b}, all taxa in Y have to be singletons, so branch 1 applies.
If B = {a, b} then c must be a singleton, and none of the taxa in L can be connected to the taxa in R, so branch 3 applies.
If there is a component B such that |B| ≥ 2, and exactly one of a, b ∈ B, then we are back in branch 1, since all the taxa in Y must be singletons.
Finally, if both a and b are singletons, we are in branch 2.
Now assume Y has taxa that are also present in T′_ab, Y ∩ T′_ab ≠ ∅. Define D = Y ∩ L and E = Y ∩ R. For d ∈ D define L_{d→} as the set of taxa in L whose parents are on the path from d to a in T′_ab. Define L_{←d} as the set of taxa in L left of d. Thus L = L_{←d} ∪ {d} ∪ L_{d→} (imagine reading T′_ab from the cherry on the left towards a). Define the same for e ∈ E but mirrored to get R = R_{←e} ∪ {e} ∪ R_{e→} (this time read T′_ab from b towards the cherry on the right).
We cannot do the same thing for the top two cases. There A can have {a, d, c, e} or {a, d, c} in the first case and {b, e, c, d} or {b, e, c} in the second case. Because we cannot know for sure what the best option is, we simply cut off {b} and {a} respectively and continue. All that remains are the cases where Y contains both taxa in and not in T′_ab, with L and R still consisting of a single element.
The next set of branches all depend on the orientation of d and e in Y (or just d or e in Y). We will go over one case, since the rest are analogous. Assume d ∈ Y and is closer to a, b than e (as in the left row of Figure 14, but now with Y containing more taxa). Looking at T′_ab we can conclude that if a, b are in the same component in A, two of c, d and e must be singletons in A. Much like in the beginning of this section we can conclude that it is optimal to pick the one closest to a, b and cut the other two. In this case that is d. This will form one branch. From T′_ab we can also conclude that if b is inside a component that does not contain a we can always change A to an agreement forest that contains the component {a, b, d}. Thus we do not need a branch that cuts off a. We only need a branch that cuts off b in this case and one that cuts off both a and b, because the rest of Y that is not in T′_ab might want to join together while c, d or e join together. Thus we end up with 3 branches:
1. Cut off b
2. Cut off a and b
3. Cut off c and e
This gives the recursion T(k) ≤ T(k − 1) + 2T(k − 2). It can be verified that the solution to the induced recurrence is the positive root of the polynomial x^2 − x − 2 = 0, namely x = 2. Hence T(k) ≤ 2^k. The same can be done for the situations where one of L or R has 2 taxa and the other has one, with Y having at most one taxon in either or both of L and R.
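The base of the exponential bound can be checked numerically. The sketch below assumes the three branches above cost one, two and two cuts respectively (a reading of the branch descriptions, consistent with the polynomial x^2 − x − 2), and computes the branching number of a branching vector (a_1, ..., a_r), that is, the unique x ≥ 1 with x^(−a_1) + ... + x^(−a_r) = 1, by bisection. The function name `branching_number` is hypothetical.

```python
def branching_number(vector, tol=1e-12):
    """Branching number of a branching vector of positive integers, by bisection."""
    def excess(x):
        # Sum of x^(-a) minus 1; strictly decreasing in x for x > 0.
        return sum(x ** (-a) for a in vector) - 1.0

    lo, hi = 1.0, 2.0
    while excess(hi) > 0:      # grow the bracket until the root is enclosed
        hi *= 2.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if excess(mid) > 0:
            lo = mid           # root lies to the right of mid
        else:
            hi = mid           # root lies at or to the left of mid
    return hi


print(branching_number((1, 1)))      # two branches, one cut each
print(branching_number((1, 2, 2)))   # one branch with one cut, two with two cuts
```

Both vectors have branching number 2, matching the bound T(k) ≤ 2^k for Branching Rule 0 and for the three-branch case here.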
From this point onward we can assume that L and R each contain at least 2 taxa and that D or E is non-empty, making L_{←d} or L_{d→} non-empty for each d ∈ D and R_{←e} or R_{e→} non-empty for each e ∈ E. With these definitions we create the following branches:
To finish the proof we state that we can use the same arguments starting from the assumption that B does contain b but not a, and b forms a cherry with a taxon labelled t. In all cases we either adjust A, ending up in branches 2-4, or we have t ∈ E and B = {b, t} containing only other taxa from L so branch 6 applies.
With these two observations we will now prove that our last set of branches are nicely bounded.
Proof. From Observations 4 and 5 we can conclude that for each value of d, e there exists an x_0 ∈ [2, √6] such that x_0 is a root of f_{d,e}(x) and for all x > x_0 we get f_{d,e}(x) > 0.
7 Discussion and conclusions
7.1 Caterpillars and TBR distance: a complex relationship
We recall the following definition of a tree bisection and reconnection move, defined on unrooted binary phylogenetic trees, and its corresponding distance. Let T be a phylogenetic tree on X. Apply the following three-step operation to T:
1. Delete an edge in T and suppress any resulting degree-2 vertex. Let T_1 and T_2 be the two resulting phylogenetic trees.
2. If T_1 (resp. T_2) has at least one edge, subdivide an edge in T_1 (resp. T_2) with a new vertex v_1 (resp. v_2) and otherwise set v_1 (resp. v_2) to be the single isolated vertex of T_1 (resp. T_2).
3. Add a new edge {v_1, v_2} to obtain a new phylogenetic tree T′ on X.
We say that T′ has been obtained from T by a single tree bisection and reconnection (TBR) operation (or, TBR move). We define the TBR distance between two phylogenetic trees T and T′ on X, denoted by d_TBR(T, T′), to be the minimum number of TBR operations that are required to transform T into T′. As mentioned earlier it is well known that, on unrooted binary trees, the TBR distance between two trees is equal to the size of a uMAF, minus one [2]. In the main part of this article we did not give a formal definition of TBR distance, focussing only on agreement forests. The reason for this is that when focussing on a restricted subset of tree topologies as we do here (caterpillars), the definition of TBR distance becomes more complex and consequently so does the relationship with agreement forests. In particular: when defining the TBR distance between two caterpillars, should (i) all the intermediate trees also be caterpillars, or (ii) is it permitted that the intermediate trees be general trees? In (ii) the equivalence between TBR distance and agreement forests remains. However, in (i) the relationship with agreement forests breaks down somewhat. In particular, it is possible that although two caterpillars have a maximum agreement forest with k components, the intermediate trees constructed by the k − 1 TBR moves are not all caterpillars.
Figure 15: Caterpillars T and T′ are made of blocks of chains, oriented in opposing directions in each tree. Each chain contains 3 taxa.
For example, the two caterpillars in Figure 15 have a maximum agreement forest of size 4, {A, B, C, D}, and we can thus obtain T′ from T after 3 TBR moves. However, the only way to obtain T′ from T with 3 moves is that the first intermediate tree is not a caterpillar; if we restrict to caterpillars, 4 or more moves are required. It would be interesting to elucidate this relationship further, and whether there is a variant of agreement forests that models this variant of TBR distance on caterpillars.
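Deciding whether an intermediate tree in a TBR sequence is itself a caterpillar is a simple structural test. The sketch below assumes the tree is given as an undirected adjacency map (vertex to set of neighbours) and uses the standard characterisation that a tree is a caterpillar exactly when deleting all of its leaves leaves a path. The names `tree_from_edges` and `is_caterpillar` and the two example trees are hypothetical; they are not the trees of Figure 15.

```python
def tree_from_edges(edges):
    """Build an undirected adjacency map from a list of edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def is_caterpillar(adj):
    """True if the tree `adj` is a caterpillar (assumes `adj` really is a tree)."""
    internal = {v for v, nbrs in adj.items() if len(nbrs) > 1}
    # The internal vertices of a tree induce a subtree; it is a path exactly
    # when no internal vertex has three or more internal neighbours.
    return all(len(adj[v] & internal) <= 2 for v in internal)


# Five taxa a..e attached to the spine u - v - w.
caterpillar = tree_from_edges([
    ("a", "u"), ("b", "u"), ("u", "v"), ("c", "v"),
    ("v", "w"), ("d", "w"), ("e", "w"),
])
# Six taxa in three cherries joined at a central vertex x.
three_cherries = tree_from_edges([
    ("x", "p"), ("x", "q"), ("x", "r"),
    ("l1", "p"), ("l2", "p"), ("l3", "q"),
    ("l4", "q"), ("l5", "r"), ("l6", "r"),
])

print(is_caterpillar(caterpillar))
print(is_caterpillar(three_cherries))
```

A check of this kind could be used to filter the intermediate trees of a candidate TBR sequence when exploring variant (i) of the distance.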

Future research
A number of interesting questions remain. We have shown that computation of uMAF on caterpillars remains hard; what kind of topological restrictions on input trees make uMAF easy? Can we develop new reduction rules which, for caterpillars, reduce the kernel bound below 7k? Similarly, what kind of new branching rules would be required to reduce the running time of the caterpillar branching algorithm below 2.49^k? Can the insights from our 2.49^k branching algorithm be leveraged to improve the current state-of-the-art 3^k branching algorithm for general trees? Finally, we echo the point made in [12] and elsewhere: can the analysis of branching rules, and reduction rules, be systematized somehow?