Journal of Graph Algorithms and Applications

Phylogenetic Incongruence through the Lens of Monadic Second Order Logic

Within the field of phylogenetics there is growing interest in measures for summarising the dissimilarity, or incongruence, of two or more phylogenetic trees. Many of these measures are NP-hard to compute and this has stimulated a considerable volume of research into fixed parameter tractable algorithms. In this article we use Monadic Second Order logic (MSOL) to give alternative, compact proofs of fixed parameter tractability for several well-known incongruence measures. In doing so we wish to demonstrate the considerable potential of MSOL (machinery still largely unknown outside the algorithmic graph theory community) within phylogenetics, introducing a number of "phylogenetics MSOL primitives" which will hopefully be of use to other researchers. A crucial component of this work is the observation that many incongruence measures, when bounded, imply the existence of an agreement forest of bounded size, which in turn implies that an auxiliary graph structure, the display graph, has bounded treewidth. It is this bound on treewidth that makes the machinery of MSOL available for proving fixed parameter tractability. Because all our formulations are of constant length, and are articulated in the restricted variant of MSOL known as MSO1, we actually obtain the stronger result that all these incongruence measures are fixed parameter tractable purely in the treewidth (in fact, if an appropriate decomposition is given: the cliquewidth) of the display graph. To highlight the potential importance of this, we re-analyse a well-known dataset and show that the treewidth of the display graph grows more slowly than the main incongruence measures analysed in this article.


Introduction
The central goal of phylogenetics is to accurately infer the evolutionary history of a set of species (or taxa) X from incomplete information. Classically, phylogenetic reconstruction has access to information about each element in X, such as DNA data, and seeks to infer a phylogenetic tree (a tree whose leaves are bijectively labeled by X) that best fits this data. There is a vast literature available on this topic and many different algorithms exist for constructing phylogenetic trees [16,26]. In practice, it is not uncommon for phylogenetic analysis to generate multiple phylogenetic trees as output. This can occur for various reasons, ranging from software engineering choices (many tree-building packages are designed to generate multiple optimal and near-optimal solutions) to more structural explanations (reticulate evolutionary signals that are composed of multiple distinct tree signals). Given two (or more) distinct phylogenetic trees, it is natural to compare them to determine whether the difference is significant. This explains the interest of the phylogenetics community in measures that can quantify the dissimilarity, or incongruence, of phylogenetic trees [20]. Some of these measures (such as Tree Bisection and Reconnection distance [1]) are studied to better understand how local-search heuristics, based on rearrangement operations, navigate the space of phylogenetic trees (e.g., [9]). Others, such as Hybridization Number [8], are studied because they assist with the inference of phylogenetic networks, which generalise phylogenetic trees to directed acyclic graphs [20,21].
Unfortunately, many of these measures are NP-hard and APX-hard to compute. On the positive side, however, the phylogenetics community has been quite successful in proving that these measures are fixed parameter tractable (FPT) in their natural parameterizations. Informally this means that a measure that evaluates to k can be computed in time f(k) · poly(n), where f is some function that depends only on k and n is the size of the instance (often taken to be |X|). Such

In this article we show that this technique has much broader potential within phylogenetics. To clarify the exposition we focus on binary trees (both rooted and unrooted) on the same set of taxa X. We begin by proving that if two trees have an agreement forest of size k (essentially a partition of the trees into k isomorphic subtrees) the treewidth of the display graph is bounded by a function of k. This simple observation is significant because of the prominent role of agreement forests within the phylogenetics literature. We use this insight to re-analyse three well-known NP-hard phylogenetics problems that were previously shown to be FPT using more conventional analysis. In particular, we give MSOL formulations for (1) Unrooted Maximum Agreement Forest (uMAF), which is equivalent to the problem of computing Tree Bisection and Reconnection distance (TBR) on unrooted trees, (2) rooted Maximum Agreement Forest (rMAF), which is equivalent to the problem of computing Rooted Subtree Prune and Regraft distance (rSPR) on rooted trees, and (3) Hybridization Number (HN) on rooted trees. The formulations for uMAF and rMAF are based on explicitly modelling agreement forests using quartets and edge cuts. The formulation for HN uses agreement forests implicitly to obtain the treewidth bound but, due to the difficulties in encoding acyclic agreement forests, then bypasses the agreement forest abstraction.
Instead, it encodes an equivalent "elimination ordering" formulation of HN which considers sequences of pruned common subtrees. Finally we consider (4) the Maximum Parsimony Distance on Binary Characters problem. This asks for a binary character f on X that maximizes the absolute difference between the parsimony scores of f on the two trees. It is NP-hard but not known to be FPT (in the parsimony distance). Here we give an optimization MSOL formulation which shows that the problem is FPT in parameter uMAF. Although this does not settle whether the natural parameterization of the problem is FPT, it does demonstrate a number of interesting principles. Firstly, it demonstrates the power of "simulating" the execution of polynomial-time algorithms (in this case, Fitch's algorithm [18]) within MSOL. Secondly, any subsequent proof that TBR distance is at most a bounded distance above $d^2_{MP}$ distance, and/or that $d^2_{MP}$ distance induces bounded treewidth display graphs, will automatically prove that $d^2_{MP}$ distance is FPT in its natural parameterization.
Summarizing, our formulations show the potential for MSOL to generate compact, logical FPT proofs for phylogenetics problems. The machinery of MSOL does not yield practical algorithms but it is an excellent classification tool. Once the existence of FPT algorithms has been confirmed via MSOL one can then switch efforts to finding a good FPT algorithm by more direct analysis, possibly (but not exclusively) through direct analysis of tree decompositions. Our formulations also introduce a number of phylogenetics "primitives" concerning quartets, clusters, subtrees and compatibility that we hope will be of use to other phylogenetics researchers.

Preliminaries
In this section, we define the main objects that will be manipulated in this paper.
An unrooted phylogenetic tree T (unrooted tree for short) is a tree in which no vertex has degree 2 and in which the leaves are bijectively labeled by a label set L(T ). The leaf labels are often called taxa and the symbol X is frequently used as shorthand for L(T ). Internal vertices are not labeled. A rooted phylogenetic tree (rooted tree for short) is defined similarly, except that it has exactly one vertex, called the root of the tree, that is permitted to have degree 2, and edges are directed away from the root. An unrooted tree is binary if every internal vertex has degree 3, and a rooted tree is binary if each internal vertex has indegree 1 and outdegree 2, and the root has outdegree 2 and indegree 0.
Given an unrooted tree T and a subset Y ⊆ L(T), we use T(Y) to denote the minimal subtree of T connecting Y. Moreover, we denote by $T|_Y$ the tree obtained from T(Y) by suppressing vertices of degree 2. We say that $T|_Y$ is the subtree of T induced by Y. In graph theory terms, $T|_Y$ is a label-preserving topological minor of T. Induced subtrees are defined in the same way for rooted trees, except that the root of $T|_Y$ becomes the vertex in the minimal connecting subgraph that is closest to the root of T, and we suppress all degree 2 vertices except the new root. We write T − Y to denote $T|_{L(T) - Y}$. For any node u of a rooted tree T, $T_u$ is the subtree of T rooted at u.
Given a label set X, a bipartition (or split) A|B on X is a partition of X into two non-empty sets. Each edge {u, v} of a tree T induces a split $L(T_u)|L(T_v)$, where $T_u$ and $T_v$ are the two trees obtained from T when {u, v} is deleted. Given a rooted tree T with label set X, a subset X′ of X is called a clade (or cluster) of T if T contains a node v such that $L(T_v) = X′$.
Given an unrooted binary tree T and a set of four distinct labels {u, v, w, y} in L(T ), T | {u,v,w,y} will be exactly one of the three possible unrooted binary trees on {u, v, w, y}. These are called quartets and are denoted respectively by uv|wy, uw|vy and wv|uy, depending on the bipartition induced by its central edge. In Figure 1(a) we see uv|wy and uw|vy. Given a rooted binary tree T and a set of three labels {u, v, w} in L(T ), T | {u,v,w} will be exactly one of the three possible rooted binary trees on {u, v, w}. These are called triplets and are denoted respectively by uv|w, uw|v and wv|u, where ij|k means that the leaf labelled k is incident to the root.
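The quartet induced by four taxa, as defined above, can be read off from path lengths in the tree: for the induced topology the two "cherry" paths avoid the central edge, so their combined length is strictly smaller than for the other two pairings. The following is a minimal sketch under the assumption that a tree is given as a list of undirected edges; the function names and the example tree are hypothetical and not part of the paper.

```python
from collections import deque

def bfs_distances(adj, source):
    """Edge-count distances from `source` to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def induced_quartet(edges, a, b, c, d):
    """Return the split of {a, b, c, d} induced by the tree, as a set of two pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    dist = {x: bfs_distances(adj, x) for x in (a, b, c)}
    # For the induced topology pq|rs, d(p,q) + d(r,s) is strictly smaller
    # than the corresponding sum for the other two pairings.
    candidates = {
        frozenset({frozenset({a, b}), frozenset({c, d})}): dist[a][b] + dist[c][d],
        frozenset({frozenset({a, c}), frozenset({b, d})}): dist[a][c] + dist[b][d],
        frozenset({frozenset({a, d}), frozenset({b, c})}): dist[a][d] + dist[b][c],
    }
    return min(candidates, key=candidates.get)

# Hypothetical example: the quartet tree uv|wy with internal vertices p and q.
example_tree = [("u", "p"), ("v", "p"), ("p", "q"), ("w", "q"), ("y", "q")]
example_split = induced_quartet(example_tree, "u", "w", "v", "y")
```

The result does not depend on the order in which the four taxa are passed, which mirrors the fact that a quartet is determined by the bipartition of its central edge.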
Let $\mathcal{T} = \{T_1, T_2, \ldots, T_k\}$ be a collection of unrooted trees, not necessarily on the same set of taxa. The display graph of $\mathcal{T}$ is obtained from the disjoint graph union of all trees in $\mathcal{T}$ by identifying vertices with the same label; see Figure 1.
Given an undirected graph G = (V, E), a bag is simply a subset of V. A tree decomposition of G consists of a tree $T_G = (V(T_G), E(T_G))$ where $V(T_G)$ is a collection of bags such that the following holds: (1) every vertex of V is in at least one bag; (2) for each edge {u, v} ∈ E, there exists some bag that contains both u and v; (3) for each vertex u ∈ V, the bags that contain u induce a connected subtree of $T_G$. The width of a tree decomposition is equal to the cardinality of its largest bag, minus 1. The treewidth of a graph G is equal to the minimum width, ranging over all possible tree decompositions of G. A tree with at least one edge has treewidth 1. For a fixed value of k one can determine in linear time whether a graph has treewidth at most k [5].
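The three conditions in the definition of a tree decomposition can be checked mechanically. The sketch below is illustrative only: it assumes bags are given as a dictionary from hypothetical bag identifiers to vertex sets, and that the decomposition tree is given as a list of pairs of bag identifiers. It verifies a given decomposition; it does not compute treewidth.

```python
def _connected(nodes, adj):
    """True iff the non-empty collection `nodes` induces a connected subgraph of `adj`."""
    nodes = set(nodes)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for w in adj[u]:
            if w in nodes and w not in seen:
                seen.add(w)
                stack.append(w)
    return seen == nodes

def is_tree_decomposition(vertices, edges, bags, bag_edges):
    """bags: dict bag_id -> set of graph vertices; bag_edges: pairs of bag ids."""
    adj = {b: set() for b in bags}
    for a, b in bag_edges:
        adj[a].add(b)
        adj[b].add(a)
    # The bags themselves must be arranged in a tree.
    if len(bag_edges) != len(bags) - 1 or not _connected(bags, adj):
        return False
    # (1) every vertex occurs in at least one bag
    if any(all(v not in bag for bag in bags.values()) for v in vertices):
        return False
    # (2) both endpoints of every edge share some bag
    if any(all(not (u in bag and v in bag) for bag in bags.values()) for u, v in edges):
        return False
    # (3) the bags containing a given vertex form a connected subtree
    return all(_connected([b for b, bag in bags.items() if v in bag], adj)
               for v in vertices)

def width(bags):
    """Width of a decomposition: size of the largest bag, minus 1."""
    return max(len(bag) for bag in bags.values()) - 1
```

For instance, a 4-cycle a, b, c, d admits the decomposition with bags {a, b, c} and {a, c, d} joined by one edge, which has width 2.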

Main results
Unless stated otherwise, we assume that $T_1 = (V_1, E_1)$ and $T_2 = (V_2, E_2)$ are both unrooted binary trees on X. Their display graph is denoted by D = (V, E) and $R_D$ denotes the vertex-edge incidence relation in D. We use adj to denote the vertex-vertex adjacency relation in D. Note that |V| = 3|X| − 4 and |E| = 4|X| − 6, since each tree has 2|X| − 2 vertices and 2|X| − 3 edges and the |X| leaves are shared.
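As a sanity check on these counts, the display graph of two trees can be built directly from its definition. This is a minimal sketch, assuming each tree is given as an edge list whose leaves carry the taxon labels; the function name and the tagging of internal vertices with the tree index are illustrative choices, not notation from the paper.

```python
def display_graph(tree1_edges, tree2_edges, taxa):
    """Disjoint union of two trees in which equally labelled leaves are identified."""
    def tag(vertex, index):
        # Taxa are shared between the trees; internal vertices are kept apart.
        return vertex if vertex in taxa else (index, vertex)

    vertices, edges = set(), []
    for index, tree in enumerate((tree1_edges, tree2_edges), start=1):
        for u, v in tree:
            a, b = tag(u, index), tag(v, index)
            vertices.update((a, b))
            edges.append((a, b))
    return vertices, edges

# Hypothetical example: the quartets uv|wx and ux|vw on X = {u, v, w, x}.
taxa = {"u", "v", "w", "x"}
t1 = [("u", "p"), ("v", "p"), ("p", "q"), ("w", "q"), ("x", "q")]
t2 = [("u", "p"), ("x", "p"), ("p", "q"), ("v", "q"), ("w", "q")]
V, E = display_graph(t1, t2, taxa)
```

With |X| = 4 the formulas give 3·4 − 4 = 8 vertices and 4·4 − 6 = 10 edges, which is what the construction produces for this example.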

TBR / MAF on unrooted trees
We will start by giving the definitions of a TBR move and of the TBR distance between two unrooted binary trees.
Definition 1 (TBR move). Given an unrooted binary tree T , a tree bisection and reconnection (TBR) move on T consists of removing an edge of T , say {u, v}, and then reconnecting the subtrees T u and T v as follows: subdividing an edge of T u with a new vertex p; subdividing an edge of T v with a new vertex q; connecting p to q; and finally suppressing any vertices of degree 2.
TBR distance is then defined naturally as follows:
Problem: $d_{TBR}(T_1, T_2)$
Input: Two unrooted binary trees $T_1, T_2$ on the same set of taxa X.
Output: The minimum number of TBR moves required to transform $T_1$ into $T_2$.
We will now give the definition of a uMAF for two unrooted binary trees $T_1, T_2$ on X. Any collection of trees whose label sets partition X is said to be a forest on X. Furthermore, we say that a set $F = \{F_1, \ldots, F_k\}$ of unrooted binary phylogenetic trees (with |F| referred to as the size of F) is a forest for T if F can be obtained from T by deleting a (k − 1)-sized subset E′ of E(T), suppressing any unlabeled leaves, and then finally suppressing any vertices with degree 2. To ease reading, we write F = T − E′ if F can be obtained in this way.
Definition 2 (uMAF). A set F of unrooted trees is an agreement forest for $T_1$ and $T_2$ (denoted uAF) if F is a forest for both $T_1$ and $T_2$. An unrooted maximum agreement forest (uMAF) is a uAF of minimum size.
So, the uMAF problem is defined as follows:
Problem: $uMAF(T_1, T_2)$
Input: Two unrooted binary trees $T_1, T_2$ on the same set of taxa X.
Output: A uMAF for $T_1$ and $T_2$.
The two problems defined above are closely related: the TBR distance between $T_1$ and $T_2$ equals the size of a uMAF for $T_1$ and $T_2$ minus one. Both problems are known to be NP-hard [1].
Fortunately, they have been proved to be FPT in their natural parameters [1], and fast algorithms have recently been proposed [27,13]. In this section, we will give a more compact proof of their fixed parameter tractability.
Theorem 2. Let T 1 , T 2 be two unrooted binary trees on the same set of taxa X such that a uAF of size k for these two trees exists. Then, the treewidth of their display graph D is at most k + 1.
Proof. From [19], we know that the display graph of two identical trees has treewidth at most 2. Thus, if we have a uAF $F = \{F_1, \ldots, F_k\}$ of size k, the display graph $D_0$ of F (which we define as the display graph constructed from two disjoint copies of F) has k connected components and treewidth at most 2, because the treewidth of a disconnected graph is equal to the largest treewidth ranging over its connected components. We can now construct a tree decomposition of D from a tree decomposition of $D_0$ as follows. Suppose F can be obtained by removing from $T_1$, respectively $T_2$, a subset of edges $K_1$, respectively $K_2$, and suppressing vertices with degree 2 and unlabeled leaves. First, note that we can reintroduce the suppressed vertices (and their corresponding edges) in F, obtaining a new forest F′, without changing the treewidth. Indeed, given an edge {u, v} in F that corresponded to a path $(u, x_1, \ldots, x_j, v)$ before the suppression of the vertices with degree 2, we know that there exists a bag B in the tree decomposition of $D_0$ such that u and v are in B. Then we can add a set of bags $\{B_1, \ldots, B_j\}$ such that $B_1 = \{u, x_1, v\}$, $B_2 = \{x_1, x_2, v\}$, ..., $B_j = \{x_{j-1}, x_j, v\}$, and add edges $\{B, B_1\}, \{B_1, B_2\}, \ldots, \{B_{j-1}, B_j\}$ to the tree decomposition. For a suppressed unlabeled leaf, say u, this is even easier: we add a bag {u, v} as a child of any of the bags containing v, where v is the vertex from which the suppressed leaf was hanging. It is easy to see that this is a tree decomposition of the display graph of F′ with width 2. Now, we can easily reintroduce the k − 1 edges in $K_1$ to the display graph, again without changing the width, by, for each edge {u, v} in $K_1$, adding a bag {u, v} between two existing bags, one containing u and the other containing v.
Note that the obtained decomposition is still a tree, since we are connecting two components of F′. When adding back the edges of $K_2$, this is no longer true: for each edge {u, v} in $K_2$ there already exists a path in the tree decomposition connecting a bag containing u to a bag containing v. Taking the shortest of these paths and adding u to those of its bags that do not contain u, we increase the width by at most 1. If we do this for all edges in $K_2$, we obtain a tree decomposition for the display graph of $T_1$ and $T_2$ with width at most 2 + (k − 1) = k + 1. Note that this bound is tight, as the following example shows: a uMAF of two quartets with different topologies, uv|wx and ux|vw say, contains 2 components, and the display graph of these two quartets has treewidth 3 (see also [19]). □

In the following, we will demonstrate that $|uMAF(T_1, T_2)| =: k$ can be computed in time O(f(k) · |X|) for some computable function f that depends only on k. We do this via the machinery of MSOL. The high-level idea is that we formulate a logical query to answer the question "Is k ≤ k′?" for increasing values of k′ until the answer is yes, and then stop: at this point k′ = k. We use the stronger variant of MSOL that allows quantification over both edges and vertices. In particular, we will use the extended MSOL framework of Arnborg et al. [2]. Following [10,25] we note that the sets $V_1$, $E_1$, $V_2$, $E_2$, X (and later, ρ) are all available to the MSOL query, i.e. within the query we can distinguish which vertices/edges of D belong to $T_1$, which belong to $T_2$, and which are taxa.
More formally, we construct an MSOL formula $\Phi(K_1, K_2)$ and a relational structure G such that $G \models \Phi(K_1, K_2)$ if and only if $K_1$ is a set of k′ − 1 edges of $E_1$, and $K_2$ is a set of k′ − 1 edges of $E_2$, such that, after deleting them, the resulting components form a uAF F for both $T_1$ and $T_2$. To model this, we need to have that: (1) the two forests $F_1$ and $F_2$ induce an identical partition of X and (2) the components of the two induced forests must have the same topology. To enforce (1) we observe that (in, say, $T_1$) two taxa $x_1$ and $x_2$ are in the same component of the forest resulting from deletion of $K_1$ if and only if they can still reach each other inside $T_1$ after deletion of those edges. In turn, this occurs if and only if there is a path from $x_1$ to $x_2$ entirely contained inside $T_1$ which avoids all the edges in $K_1$. To enforce (2) we demand that a quartet is in the first forest (i.e. the quartet is contained inside one of the trees in the forest) if and only if the quartet is in the second forest. This uses the fact that two unrooted binary trees on the same set of taxa are topologically identical if and only if they induce identical sets of quartets [12].
Before defining $\Phi(K_1, K_2)$, we need to introduce several intermediate predicates. These build on a number of basic predicates which we mainly list for the benefit of readers not familiar with MSOL. They are used to:
test that Z is equal to the union of two sets P and Q:
$$Z = P \cup Q := \forall z\,(z \in Z \Leftrightarrow (z \in P \vee z \in Q))$$
test if the sets P and Q are a bipartition of Z:
$$Bipartition(Z, P, Q) := (Z = P \cup Q) \wedge \neg\exists z\,(z \in P \wedge z \in Q)$$
check if the nodes p and q are adjacent in D:
$$adj(p, q) := (p \neq q) \wedge \exists e\,(e \in E \wedge R_D(e, p) \wedge R_D(e, q))$$
The predicate $PAC(Z, x_1, x_2, K_i)$ ("path avoids cuts?") asks: is there a path from $x_1$ to $x_2$ entirely contained inside vertices Z that avoids all the edges $K_i$? We model this by observing that this does not hold if you can partition Z into two pieces P and Q, with $x_1 \in P$ and $x_2 \in Q$, such that the only edges that cross the induced cut (if any) are in $K_i$:
$$PAC(Z, x_1, x_2, K_i) := (x_1 = x_2) \vee \neg\exists P, Q\,(Bipartition(Z, P, Q) \wedge x_1 \in P \wedge x_2 \in Q \wedge (\forall p, q\,(p \in P \wedge q \in Q \Rightarrow (\neg adj(p, q) \vee \exists e\,(e \in K_i \wedge R_D(e, p) \wedge R_D(e, q))))))$$
We model that a quartet is in the forest (of, say, $T_1$) by stipulating that there is an embedding (i.e. subdivision) of the quartet, completely contained inside $T_1$, which avoids all the edges in $K_1$. To model the embedding, we model the five edges of the quartet as five subsets of vertices A, B, C, D, P, representing the subdivisions of the five edges, with P being the central edge and u and v being its endpoints. We demand that (with the exception of u and v) these subsets are disjoint. This is all combined in the $QAC_1$ predicate ("quartet avoids cuts in $T_1$?"), which takes the four taxa and the cut set as arguments, $QAC_1(x_a, x_b, x_c, x_d, K_1)$, and returns true if and only if $T_1$ contains an embedding of $x_a x_b | x_c x_d$ that is disjoint from the edge cuts $K_1$.
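To make the semantics of the PAC predicate concrete, the sketch below evaluates it in two ways on an explicit graph: directly, as reachability inside Z with the edges of K removed, and by exhaustively searching for a separating bipartition as in the logical formulation. This is an illustration only, not the MSOL machinery itself; the function names are hypothetical, edges are assumed to be unordered pairs of vertex names, and both endpoints x1 and x2 are assumed to lie in Z.

```python
from itertools import combinations

def pac(Z, x1, x2, edges, K):
    """Direct check: can x1 reach x2 inside Z without using an edge of K?"""
    if x1 == x2:
        return True
    cut = {frozenset(e) for e in K}
    adj = {z: set() for z in Z}
    for u, v in edges:
        if u in adj and v in adj and frozenset((u, v)) not in cut:
            adj[u].add(v)
            adj[v].add(u)
    seen, stack = {x1}, [x1]
    while stack:
        u = stack.pop()
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return x2 in seen

def pac_by_bipartition(Z, x1, x2, edges, K):
    """Same answer, phrased as in the predicate: no separating bipartition exists."""
    if x1 == x2:
        return True
    cut = {frozenset(e) for e in K}
    others = [z for z in Z if z not in (x1, x2)]
    for r in range(len(others) + 1):
        for extra in combinations(others, r):
            P = {x1, *extra}  # x2 always falls in Q = Z - P
            crossing = [e for e in edges
                        if e[0] in Z and e[1] in Z and (e[0] in P) != (e[1] in P)]
            if all(frozenset(e) in cut for e in crossing):
                return False  # only cut edges cross (P, Q), so no path survives
    return True
```

The exhaustive version takes exponential time and is included only to show that the "no separating bipartition" phrasing agrees with ordinary reachability; the point of the MSOL formulation is that such a search never has to be carried out explicitly.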
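We can define $QAC_2(x_a, x_b, x_c, x_d, K_2)$ in a similar way. Note that, for every four taxa, we need to consider all three possible quartet topologies. Then we define $\Phi(K_1, K_2)$ as the conjunction of the following conditions: $K_1 \subseteq E_1$ and $K_2 \subseteq E_2$; $|K_1| = |K_2| = k' - 1$; for all taxa $x_1, x_2 \in X$, $PAC(V_1, x_1, x_2, K_1)$ holds if and only if $PAC(V_2, x_1, x_2, K_2)$ holds; and for all distinct taxa $x_a, x_b, x_c, x_d \in X$, $QAC_1(x_a, x_b, x_c, x_d, K_1)$ holds if and only if $QAC_2(x_a, x_b, x_c, x_d, K_2)$ holds. (The cardinality operator is permitted because the extended MSOL framework of [2] allows the incorporation of an evaluation relation which can test, amongst other things, the cardinalities of free set variables.)

Theorem 3. Computation of TBR distance / uMAF on two unrooted binary trees on the same set of taxa X is linear time FPT. That is, the optimum k can be computed in time O(f(k) · |X|), for some computable function f that only depends on k.
Proof. We have presented a logical query to answer the question "Is k ≤ k′?" for increasing values of k′. For each value of k′ the MSOL query, which examines the display graph D, has fixed length.
Combining this with the fact that the treewidth of D is bounded by a function of k (by Theorem 2), and that the size of D is a linear function of |X|, we have the desired result. (Note that the actual edge cuts, which can be used to construct a uMAF, can also be obtained in the same time bound by leveraging Theorem 4 of [10].) □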
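The query strategy used in this section (ask "Is k ≤ k′?" for k′ = 1, 2, ... and stop at the first yes) can be mimicked by brute force on tiny instances. The sketch below is not the MSOL approach and takes exponential time; it only illustrates the two conditions being tested, namely that the edge cuts induce the same partition of X in both trees and that every four taxa lying in a common component induce the same quartet in both trees. All function names are hypothetical, trees are assumed to be edge lists, and taxa are assumed to be strings.

```python
from collections import deque
from itertools import combinations

def _adjacency(edges, cut=()):
    adj = {}
    for e in edges:
        if e not in cut:
            adj.setdefault(e[0], set()).add(e[1])
            adj.setdefault(e[1], set()).add(e[0])
    return adj

def _reach(adj, source):
    """Distances from `source` to everything reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj.get(u, ()):
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def taxa_partition(edges, cut, taxa):
    """Partition of the taxa induced by deleting the edges in `cut`."""
    adj = _adjacency(edges, cut)
    return {frozenset(t for t in taxa if t in _reach(adj, x)) for x in taxa}

def quartet(dist, a, b, c, d):
    """Induced quartet of a, b, c, d, from leaf-to-leaf distances in the uncut tree."""
    options = {
        frozenset({frozenset({a, b}), frozenset({c, d})}): dist[a][b] + dist[c][d],
        frozenset({frozenset({a, c}), frozenset({b, d})}): dist[a][c] + dist[b][d],
        frozenset({frozenset({a, d}), frozenset({b, c})}): dist[a][d] + dist[b][c],
    }
    return min(options, key=options.get)

def umaf_size(edges1, edges2, taxa):
    """Smallest k' such that k' - 1 cuts per tree yield an agreement forest."""
    d1 = {x: _reach(_adjacency(edges1), x) for x in taxa}
    d2 = {x: _reach(_adjacency(edges2), x) for x in taxa}
    for k in range(1, len(taxa) + 1):
        for K1 in combinations(edges1, k - 1):
            blocks = taxa_partition(edges1, K1, taxa)
            for K2 in combinations(edges2, k - 1):
                if taxa_partition(edges2, K2, taxa) != blocks:
                    continue  # condition (1) fails: different partitions of X
                if all(quartet(d1, *q) == quartet(d2, *q)
                       for block in blocks
                       for q in combinations(sorted(block), 4)):
                    return k  # condition (2) holds: identical quartets per component
    return len(taxa)
```

On the two quartets uv|wx and ux|vw used in the tightness example of Theorem 2, no agreement forest of size 1 exists, while cutting off a single leaf in each tree yields one of size 2.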

rSPR / MAF on rooted trees
In this section, we will give a compact proof that the computation of rSPR distance is FPT in its natural parameter. Before that, we need to introduce some definitions.
Definition 3 (rSPR move). Given a rooted binary tree T, a rooted subtree prune and regraft (rSPR) move on T consists of removing an edge of T, say (u, v), yielding two trees $T_u$ and $T_v$, and then reconnecting them as follows: subdividing some edge of $T_u$ with a new vertex p; adding an edge directed from p to v; and then suppressing any vertices with indegree and outdegree both equal to 1.
rSPR distance is defined analogously to TBR distance, and a MAF for two rooted binary trees T 1 , T 2 is defined similarly to a uMAF, but in a rooted framework. We refer to [6] for precise definitions. The main difference is that a forest consists of rooted binary trees and this has to be taken into account when comparing the topology of the components. In the rooted context MAFs are mainly studied because of their close relationship to rSPR distance. To accurately model rSPR distance it is necessary to slightly modify each input tree T i as follows: we add a vertex with special label ρ at the end of a pendant edge adjoined to the original root of T i , see Figure 1(c). We then consider ρ to be part of the label set of the tree. Note that the addition of ρ means that we can equivalently view each T i as an unrooted binary tree, with ρ acting as a placeholder for the root location, and this is how the trees will be modelled in the display graph.
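The close relationship between MAF (assuming ρ has been added as described) and rSPR distance is summarized by the following well-known result: for two rooted binary trees $T_1, T_2$ on the same set of taxa X, the rSPR distance between $T_1$ and $T_2$ equals the size of a MAF for $T_1$ and $T_2$ minus one [6].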
Note that these problems have been proved NP-hard and FPT in their natural parameter [6]. The MSOL formulation is similar to the TBR formulation, but with the following changes. When checking that the components induced by the edge cuts partition the taxa in the same way in both T 1 and T 2 (i.e. by considering pairs of taxa that still have a path between them), we need to range over X ∪ {ρ} instead of just X. More significantly, we need predicates for triplets instead of quartets, because we are working in the rooted environment and two rooted binary trees are topologically equivalent if and only if they contain the same set of triplets [11]. Fortunately we can use the fact that triplet xy|z is in T i (x, y, z ∈ X) if and only if quartet xy|ρz is in the unrooted interpretation of T i .
However we cannot simply use ρ as the fourth parameter x d to QAC i because this will evaluate to false if the path from ρ to the rest of the quartet embedding has been cut. This is not what we need: ρ is in this context only there to indicate direction, so its particular arm of the quartet embedding can be cut without consequence. We can remedy this by introducing predicates Quartet i and T riplet i which check whether the corresponding quartet/triplet was in the original tree (i.e. before the edge cuts). We can then leverage the fact that, if three distinct taxa x, y, z are in the same component of the forest, the unique triplet topology they induce within the component will be the same topology as they induced in the original tree.
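The idea in the preceding paragraph can be illustrated procedurally: a triplet is read off from the original tree (via the quartet that uses ρ as fourth label), and separately one checks that the three taxa remain connected after the cuts, so that a cut on the arm leading to ρ does no harm. This is a minimal sketch with hypothetical function names, assuming the rooted tree is given as an undirected edge list that already includes the pendant vertex ρ.

```python
from collections import deque

def _adjacency(edges, removed=()):
    cut = {frozenset(e) for e in removed}
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set())
        adj.setdefault(v, set())
        if frozenset((u, v)) not in cut:
            adj[u].add(v)
            adj[v].add(u)
    return adj

def _distances(adj, source):
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def triplet_in_tree(edges, rho, x, y, z):
    """xy|z holds in the rooted tree iff the quartet xy|rho z holds in its unrooted version."""
    adj = _adjacency(edges)
    dx, dy, dz = (_distances(adj, t) for t in (x, y, z))
    cherry = dx[y] + dz[rho]      # pairing xy | rho z
    alt1 = dx[z] + dy[rho]        # pairing xz | rho y
    alt2 = dy[z] + dx[rho]        # pairing yz | rho x
    return alt1 > cherry and alt2 > cherry

def triplet_avoids_cuts(edges, rho, x, y, z, K):
    """xy|z is in the forest iff x, y, z stay connected after the cuts
    and xy|z was a triplet of the original tree; the path to rho may be cut."""
    reach = _distances(_adjacency(edges, K), x)
    return y in reach and z in reach and triplet_in_tree(edges, rho, x, y, z)
```

For the rooted tree ((a, b), c) with ρ attached to the root, cutting the pendant edge to ρ leaves the triplet ab|c intact, whereas cutting the edge to c destroys it.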
We first need the following predicate, which is a specialization of the earlier PAC predicate. It tests whether there is a path from $x_1$ to $x_2$ that is entirely contained inside vertex set Z:
$$path(Z, x_1, x_2) := (x_1 = x_2) \vee \neg\exists P, Q\,(Bipartition(Z, P, Q) \wedge x_1 \in P \wedge x_2 \in Q \wedge (\forall p, q\,(p \in P \wedge q \in Q \Rightarrow \neg adj(p, q))))$$
For each tree $T_i$, the predicate $Quartet_i$ checks whether the quartet $x_a x_b | x_c x_d$ is contained in $T_i$. For each rooted tree $T_i$, the predicate $Triplet_i$ checks whether the triplet $x_a x_b | x_c$ is contained in $T_i$ (simply by checking whether $x_a x_b | x_c \rho$ is contained in it). With these we define $TAC_i$ ("triplet avoids cuts in $T_i$?"), which models whether a triplet is in the forest of $T_i$ induced by the edge cuts. Here we use path rather than PAC to model the path from v to ρ, because it does not matter for the triplet whether this path is cut. The final MSOL formulation is then very similar to that given in Section 3.1.

Theorem 5. Computation of rSPR / MAF on two rooted binary trees on the same set of taxa X is linear time FPT. That is, the optimum k can be computed in time O(f(k) · |X|), for some computable function f that only depends on k.
Proof. An agreement forest of the two rooted trees $T_1$ and $T_2$ induces an agreement forest (consisting of unrooted binary trees) of the same size for the unrooted interpretations of these trees, simply by ignoring the orientation of edges. Hence the treewidth bound described in Theorem 2 is still applicable, and the theorem follows. (Again, if required one can obtain the actual edge cuts, which can be used to build a MAF, in the same time bound by leveraging Theorem 4 of [10].) □

Hybridization Number
In this section, we deal again with rooted trees, and thus we add a vertex labeled ρ to both trees to indicate the root location, as done for rSPR; see Figure 1(c). A rooted phylogenetic network (rooted network for short) N = (V(N), E(N)) on a set of taxa X is any rooted acyclic digraph in which no vertex has degree 2 (except possibly the root) and whose leaves are bijectively labeled by elements of X. The hybridization number of N, denoted by h(N), is defined as
$$h(N) = \sum_{v \in V(N):\, \delta^-(v) > 0} (\delta^-(v) - 1),$$
where $\delta^-(v)$ denotes the indegree of v and $\delta^+(v)$ its outdegree.
Given a rooted network N on X and a rooted binary tree T on X′, with X′ ⊆ X, we say that T is displayed by N if T can be obtained from N by deleting a subset of its edges and any resulting degree 0 vertices, and then suppressing vertices with $\delta^-(v) = \delta^+(v) = 1$.
We are now ready to define the hybridization number problem:
Problem: $HN(T_1, T_2)$
Input: Two rooted binary trees $T_1, T_2$ on the same set of taxa X.
Output: A rooted network N displaying $T_1$ and $T_2$ such that h(N) is minimum over all rooted networks with this property.
The hybridization number for T 1 and T 2 , denoted by h(T 1 , T 2 ), is defined as the hybridization number of this minimum network. As done for TBR and rSPR, we can give a characterization of the hybridization number in terms of agreement forests. To do so, we need to define acyclic agreement forests.
Let $F = \{F_1, F_2, \ldots, F_k\}$ be an agreement forest for two rooted binary trees $T_1$ and $T_2$ on the same set of taxa X, and let $AG(T_1, T_2, F)$ be the directed graph whose vertex set is F and for which $(F_i, F_j)$ is an arc iff i ≠ j, and either (1) the root of $T_1(L(F_i))$ is an ancestor of the root of $T_1(L(F_j))$ in $T_1$, or (2) the root of $T_2(L(F_i))$ is an ancestor of the root of $T_2(L(F_j))$ in $T_2$.
We call F an acyclic agreement forest (AAF) for $T_1$ and $T_2$ if $AG(T_1, T_2, F)$ does not contain any directed cycle. A maximum acyclic agreement forest (MAAF) is an AAF of minimum size.
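The acyclicity test on AG can be carried out directly from this definition. The sketch below is illustrative and rests on several assumptions: each rooted tree is given as a list of (parent, child) arcs, the forest is given as a list of taxon sets that is taken on trust to be an agreement forest (this is not checked), and "ancestor" is read as proper ancestor. The function names are hypothetical.

```python
def parents_from_edges(directed_edges):
    """Map each non-root vertex to its parent; arcs are (parent, child)."""
    return {child: parent for parent, child in directed_edges}

def ancestors(parent, v):
    """v followed by its proper ancestors, listed from v up to the root."""
    chain = [v]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain

def component_root(parent, labels):
    """Root of T(labels): the lowest common ancestor of the given leaves."""
    labels = list(labels)
    common = set(ancestors(parent, labels[0]))
    for x in labels[1:]:
        common &= set(ancestors(parent, x))
    # The lowest common ancestor is the first common vertex on a leaf-to-root chain.
    return next(v for v in ancestors(parent, labels[0]) if v in common)

def is_acyclic_agreement_forest(tree1, tree2, forest):
    """Build AG(T1, T2, F) and report whether it is free of directed cycles."""
    n = len(forest)
    arcs = {i: set() for i in range(n)}
    for tree in (tree1, tree2):
        parent = parents_from_edges(tree)
        roots = [component_root(parent, block) for block in forest]
        for i in range(n):
            for j in range(n):
                if i != j and roots[i] in ancestors(parent, roots[j])[1:]:
                    arcs[i].add(j)
    # Repeatedly remove vertices without incoming arcs; a cycle blocks this process.
    remaining = set(range(n))
    while remaining:
        sources = {i for i in remaining
                   if not any(i in arcs[j] for j in remaining)}
        if not sources:
            return False
        remaining -= sources
    return True
```

As a small example, for $T_1$ = ((a, (c, d)), b) and $T_2$ = ((c, (a, b)), d), the forest {a, b}, {c, d} produces arcs in both directions and is therefore not acyclic, whereas the forest {a}, {b}, {c, d} is.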
The acyclicity condition is used to model the fact that species cannot inherit genetic material from their own offspring. The two problems defined above are closely related, as the following well-known result shows.

Theorem 6 ([3]). Given two rooted binary trees $T_1, T_2$ on the same set of taxa X, we have that $h(T_1, T_2) = |MAAF(T_1, T_2)| - 1$.
The above equivalence formed the basis for results proving that both problems are NP-hard [8] and fixed parameter tractable [7].
Here we show an alternative proof that computation of hybridization number on two rooted binary trees with the same set of taxa X is FPT, again using MSOL. We will do this by demonstrating that |M AAF (T 1 , T 2 )| =: k can be computed in time O(f (k) · |X|) for some computable function f that depends only on k. Again, we will formulate a logical query on the display graph to answer the question "Is k ≤ k ′ ?" for increasing values of k ′ , until k ′ = k is reached and the answer to the query is "yes". Unlike the formulations given earlier for TBR and rSPR, the query has no free variables, and the length of each query will grow as a function of k ′ . However, given that k ′ ≤ k, the length will remain bounded by a function of k. Note that, if a MAAF of size k exists for T 1 and T 2 , then an AF of size k exists too, and as argued for rSPR, if two rooted trees have an agreement forest of size k then so do the underlying, unrooted trees. So the treewidth bound of Theorem 2 is still valid, where k = |M AAF (T 1 , T 2 )|, and this implies that an overall running time of the form O(f (k) · |X|) can be achieved.
The major challenge when modelling MAAF is to encode the acyclicity constraints. It is not clear whether the formulations from the previous sections, in which agreement forests are modelled directly as sets of edge-cuts, can be (elegantly) extended to include acyclicity constraints. For this reason we choose to discard the agreement forest abstraction, using it only to generate the treewidth upper bound. For the actual modelling we use an alternative "elimination-ordering" characterization of MAAF/HN, first presented in [23], which we briefly summarize here.
Given a rooted binary tree T on X, we say a subtree T ′ of T is pendant if there exists a vertex u of T such that T ′ = T u . In this case it is then natural to associate T ′ with the subset of X labeling its leaves, i.e. L(T ′ ). We say that T ′ is a common pendant subtree of T 1 and T 2 if it is a pendant subtree of both T 1 and T 2 . We call (S 1 , S 2 , . . . , S p ) (p ≥ 0) a common pendant subtree sequence of T 1 and T 2 of length p if for every 1 ≤ i ≤ p, S i is a common pendant subtree of T 1 − ∪ j<i L(S j ) and T 2 − ∪ j<i L(S j ). We say that such a sequence is additionally a tree sequence if the two trees T 1 − ∪ j≤p L(S j ) and T 2 − ∪ j≤p L(S j ) are identical. Informally, a tree sequence of length p describes a sequence of p common pendant subtrees that can be successively pruned from the original trees to reach a common core tree. If T 1 and T 2 are already identical then we use the empty tree sequence ∅, and take p = 0, to represent this.
The results in [23] establish that $h(T_1, T_2)$ is equal to the smallest p such that a tree sequence of length p exists. This is the characterization of optimality that we will use, i.e. each logical query will pose the question, "Does a tree sequence of length k′ exist?". There is no need to model acyclicity in this formulation. However, we do need to model the concept of a common pendant subtree and the impact of earlier pruning steps on the original trees.
Before writing down the MSOL formulation we need some new auxiliary predicates. The first predicate checks whether there is a path from x 1 to x 2 within Z that survives the deletion of vertex u. This is similar to the P AC predicate defined earlier.
For a vertex u ≠ ρ in a tree $T_i$ and a taxon x ∈ X, observe that x is in the clade rooted at u (i.e. in the label set of the pendant subtree rooted at u) if and only if (x = u) or deleting u from $T_i$ destroys all paths from ρ to x (inside $T_i$). This leads naturally to a predicate for testing whether C ⊆ X is a clade of $T_i$. As we shall see, it is useful to extend this predicate with an optional list $Z_1, Z_2, \ldots$ of subsets of X describing common pendant subtrees that have already been pruned from the tree. The statement $Clade_i(C, [Z_1, Z_2, \ldots])$ evaluates to true if and only if C is a clade of $T_i$ after the taxa in $Z_1, Z_2, \ldots$ have been pruned away. (To avoid ambiguity the predicate automatically returns false if C intersects with any of the $Z_i$.) Note that the list of $Z_i$ is shown in square brackets to emphasize that it is a "macro": there will be a different predicate for each possible list length t. The list of $Z_i$ will never be longer than $h(T_1, T_2)$, and the length of the generated predicate will be bounded by a function of the list length, so the length of the overall logical query remains bounded by a function of $h(T_1, T_2)$.
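The observation that clade membership can be tested by vertex deletion is easy to check on an explicit tree. The following is a minimal sketch for the case without previously pruned sets (an empty list of $Z_i$), assuming the tree is given as an undirected edge list that includes the vertex ρ; the function names are hypothetical.

```python
def in_clade(edges, rho, u, x):
    """x lies in the clade rooted at u iff x == u or removing u disconnects x from rho."""
    if x == u:
        return True
    adj = {}
    for a, b in edges:
        if u not in (a, b):  # deleting u removes all edges incident to it
            adj.setdefault(a, set()).add(b)
            adj.setdefault(b, set()).add(a)
    seen, stack = {rho}, [rho]
    while stack:
        v = stack.pop()
        for w in adj.get(v, ()):
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return x not in seen

def is_clade(edges, rho, taxa, C):
    """C is a clade iff some vertex u other than rho has exactly C as its clade."""
    vertices = {v for e in edges for v in e} - {rho}
    return any({x for x in taxa if in_clade(edges, rho, u, x)} == set(C)
               for u in vertices)
```

For the rooted tree ((a, b), c) with ρ attached to the root, the clades found this way are the singletons, {a, b} and {a, b, c}; a set such as {a, c} is rejected.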
We are now ready to define the CPS (i.e. "common pendant subtree") predicate. We do this by observing that $C \subseteq X$ corresponds to a common pendant subtree of $T_1$ and $T_2$ if and only if $C$ is a clade of both trees (this ensures that $C$ is pendant in both trees) and the set of triplets induced by $C$ is identical in both trees (this ensures that the pendant subtree has the same topology in both trees).
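The "clade in both trees plus identical triplets" criterion can be checked directly on small examples. The following sketch assumes rooted trees encoded as nested tuples with string leaves, and uses the standard fact that the triplet $xy|z$ holds in a rooted tree exactly when some clade contains $x$ and $y$ but not $z$. The names `clades`, `triplet` and `is_cps` are illustrative assumptions, not the paper's predicates.

```python
from itertools import combinations

def clades(t):
    """Return (leaf set of t, set of leaf sets of all vertices of t)."""
    if isinstance(t, str):
        leaf = frozenset([t])
        return leaf, {leaf}
    all_leaves, found = frozenset(), set()
    for child in t:
        child_leaves, child_clades = clades(child)
        all_leaves |= child_leaves
        found |= child_clades
    found.add(all_leaves)
    return all_leaves, found

def triplet(clade_set, x, y, z):
    """True iff the tree with these clades displays the rooted triplet xy|z."""
    return any(x in c and y in c and z not in c for c in clade_set)

def is_cps(t1, t2, c):
    """True iff taxon set c induces a common pendant subtree of t1 and t2."""
    c = frozenset(c)
    _, cl1 = clades(t1)
    _, cl2 = clades(t2)
    if c not in cl1 or c not in cl2:      # c must be a clade of both trees
        return False
    for x, y, z in combinations(sorted(c), 3):
        # compare all three possible resolutions of {x, y, z}
        for a, b, d in ((x, y, z), (x, z, y), (y, z, x)):
            if triplet(cl1, a, b, d) != triplet(cl2, a, b, d):
                return False
    return True
```

For example, `{"a", "b", "c"}` is a clade of both `((("a", "b"), "c"), "d")` and `((("a", "c"), "b"), "d")`, but the two trees resolve the triplet differently, so `is_cps` rejects it.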
We now extend this with a list of $Z_i$ representing the taxa we have already pruned. This new version of the predicate evaluates to true if and only if $C$ corresponds to a common pendant subtree in the two trees after all the $Z_i$ have been pruned away. (Here we make implicit use of the fact that the Clade predicate immediately returns false whenever $C$ intersects with the $Z_i$.)

$$CPS(T_1, T_2, C, [Z_1, \ldots, Z_t]) := Clade_1(C, [Z_1, \ldots, Z_t]) \wedge Clade_2(C, [Z_1, \ldots, Z_t]) \wedge \forall x \forall y \forall z \, \big( (x, y, z \in C \wedge allDiff(x, y, z)) \Rightarrow (Triplet_1(x, y, z) \Leftrightarrow Triplet_2(x, y, z)) \big)$$

We are now ready to directly pose the question: is there a tree sequence of length $k'$? We can assume $k' \geq 1$ because $k' = 0$ is trivial to check in polynomial time. To make the formulation slightly more compact we actually construct a list of length $k' + 1$, where $C_{k'+1}$ represents the taxa that still remain after the common pendant subtrees have been pruned away: we can then test that the sequence is a tree sequence (i.e. that a common core tree remains) by testing that $CPS(T_1, T_2, C_{k'+1}, [C_1, \ldots, C_{k'}])$ is true. Note that the HybNum predicate is again a macro, whose expansion depends on $k'$.
The Partition predicate has the expected meaning and definition.

Concluding, we have the following result:

Theorem 7. Computation of hybridization number / MAAF on two rooted binary trees on the same set of taxa $X$ is linear time FPT. That is, the optimum $k$ can be computed in time $O(f(k) \cdot |X|)$, for some computable function $f$ that only depends on $k$.

Parsimony distance on binary characters
Let $T$ be an unrooted binary tree on a set of taxa $X$. A binary character $f$ is simply a function $f : X \rightarrow \{red, blue\}$. An extension of $f$ to $T$ is a mapping $g : V(T) \rightarrow \{red, blue\}$ such that, for all $x \in X$, $g(x) = f(x)$. For a given character $f$, an optimal extension is any extension $g$ of $f$ such that the number of bichromatic edges is minimized. The number of bichromatic edges in an optimal extension is called the parsimony score of $f$ with respect to $T$, and denoted $l_f(T)$. The well-known algorithm by Fitch can be used to compute $l_f(T)$ (and an optimal extension) in polynomial time [18]. We shall describe Fitch's algorithm in due course. The parsimony distance problem on binary characters, denoted $d^2_{MP}$, is defined as follows [17].

Problem: $d^2_{MP}(T_1, T_2)$
Input: Two unrooted binary trees $T_1$, $T_2$ on the same set of taxa $X$.
Output: A binary character $f$ on $X$ such that the value $|l_f(T_1) - l_f(T_2)|$ is maximized.

We use $d^2_{MP}$ to denote the optimum value of $|l_f(T_1) - l_f(T_2)|$. The problem was recently shown to be NP-hard and APX-hard [22]. It is not known whether the problem is FPT in $d^2_{MP}$. The following result, however, is already known.

Lemma 1 ([17]). Let $T_1$, $T_2$ be two unrooted binary trees on the same set of taxa $X$. Then $d^2_{MP}(T_1, T_2) \leq d_{TBR}(T_1, T_2)$.

Given two trees $T_1$, $T_2$ as input to $d^2_{MP}$, it is not known whether the display graph $D$ of $T_1$ and $T_2$ has treewidth bounded by a function of $d^2_{MP}$. However, from Lemma 1 and earlier results in this article (Theorems 1 and 2) it is clear that $D$ has treewidth bounded by a function of $d_{TBR}(T_1, T_2)$. An MSOL formulation modelling $d^2_{MP}$, whose length is bounded by a function of $d^2_{MP}$, will therefore give a running time of the form $f(d_{TBR}(T_1, T_2)) \cdot O(|X|)$ for some computable function $f$ that only depends on $d_{TBR}(T_1, T_2)$. We now give such a formulation. We will remain within the framework of [2], this time using the ("linear extremum") optimization variant of MSOL. This allows us to maximize or minimize an affine function of (the cardinalities of) the free set variables in the query.
The MSOL formulation we give here, which is based on an ILP formulation from [22], maximizes $l_f(T_1) - l_f(T_2)$. (To compute $d^2_{MP}$ we need to use the MSOL machinery twice, once for $l_f(T_1) - l_f(T_2)$ and once for $l_f(T_2) - l_f(T_1)$, taking the maximum of the two results. The second call differs only in its objective function, so we omit the details.)
The basic idea is to range over all possible binary characters, simultaneously embedding two static formulations$^4$ of Fitch's algorithm to "compute" $l_f(T_1) - l_f(T_2)$.
Fitch's algorithm proceeds as follows. If $T$ is not rooted, we root it arbitrarily (by subdividing an arbitrary edge). The algorithm then works in two phases: a bottom-up phase which computes $l_f(T)$, and then a top-down phase which actually computes a corresponding extension. In the bottom-up phase, we start by assigning each taxon $x$ the singleton set of colours $S(x) := \{f(x)\}$. For an internal node $u$ with children $v_1$, $v_2$ we set $S(u) := S(v_1) \cap S(v_2)$ if $S(v_1) \cap S(v_2) \neq \emptyset$ (in which case we say that $u$ is an intersection node) and $S(u) := S(v_1) \cup S(v_2)$ if $S(v_1) \cap S(v_2) = \emptyset$ (in which case we say that $u$ is a union node). The value $l_f(T)$ is equal to the number of internal nodes that are union nodes. (We omit a description of the constructive top-down phase as it is not relevant for this article.)
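The bottom-up phase just described is short enough to sketch directly. The code below assumes an already rooted binary tree encoded as nested 2-tuples with string leaves, and a character given as a dictionary from taxa to colours. Both the encoding and the function name `fitch` are illustrative assumptions.

```python
def fitch(t, f):
    """Bottom-up phase of Fitch's algorithm.

    Returns (S(root of t), number of union nodes in t); the second
    component is the parsimony score of character f on t.
    """
    if isinstance(t, str):
        return {f[t]}, 0                      # taxon: singleton colour set
    (s1, n1), (s2, n2) = fitch(t[0], f), fitch(t[1], f)
    common = s1 & s2
    if common:
        return common, n1 + n2                # intersection node
    return s1 | s2, n1 + n2 + 1               # union node

tree = (("a", "b"), ("c", "d"))
clustered = {"a": "red", "b": "red", "c": "blue", "d": "blue"}
alternating = {"a": "red", "b": "blue", "c": "red", "d": "blue"}
```

Here `fitch(tree, clustered)` reports a single union node (the root), while `fitch(tree, alternating)` reports two union nodes below an intersection node at the root.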
To translate this into an MSOL formulation, we begin by arbitrarily rooting $T_1$ and $T_2$ and using $\rho$ as the placeholder for the root, in the usual fashion. The central idea is to partition the vertices of each tree $T_i$ into four possible subsets $R_i$, $B_i$, $RB_i^I$ and $RB_i^U$ corresponding to the set of colours that Fitch allocates to each node, and distinguishing union events from intersection events: red, blue, {red, blue} (intersection node) and {red, blue} (union node). We therefore ask the MSOL formulation to instantiate the free set variables $R_i$, $B_i$, $RB_i^I$ and $RB_i^U$ ($i \in \{1, 2\}$) such that the expression $|RB_1^U| - |RB_2^U|$ is maximized. (If desired, this can then be made constructive via Theorem 4 of [10].) The only significant work is simulating the bottom-up execution of Fitch's algorithm, in particular encoding expressions that describe the state of a parent node $u$ in terms of its two children $v_1$, $v_2$.
We introduce the auxiliary predicate $child_i(u, v)$, which says that $v$ is a child of $u$ in $T_i$. We can model this as follows: $v$ is a child of $u$ in $T_i$ if and only if there is an edge $e$ in $T_i$ such that $v$ and $u$ are both endpoints of $e$ and there does not exist a path from $\rho$ to $v$ that survives the edge cut $e$. (Here we have specialized the $PAC$ predicate from earlier so that it only takes a single edge, rather than a set of edges, as its fourth argument.)

$$child_i(u, v) := (u \neq v) \wedge \exists e \in E_i \, \big( R_D(e, u) \wedge R_D(e, v) \wedge \neg PAC(V_i, \rho, v, e) \big)$$

For each tree $T_i$ we add the following constraints, which encode (in this order):
- The four subsets $R$, $B$, $RB^I$ and $RB^U$ partition the vertices of the tree;
- A vertex in $X$ can only be in $R$ or $B$;
- An internal node is in $R$ if and only if (one child is in $R$ and the other child is not in $B$);
- An internal node is in $B$ if and only if (one child is in $B$ and the other child is not in $R$);
- An internal node is in $RB^I$ if and only if (neither child is in $R$ or $B$);
- An internal node is in $RB^U$ if and only if (one child is in $R$ and one child is in $B$).
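The local constraints listed above determine the class of every vertex from the classes of its children, so they can be simulated bottom-up. The sketch below does this for rooted trees encoded as nested 2-tuples (an assumed encoding) and then maximizes the number of $RB^U$ vertices in the first tree minus the number in the second by brute force over all characters. The brute-force enumeration merely stands in for the MSOL optimization machinery and is only feasible for tiny inputs. The names `classes`, `union_count` and `max_objective` are hypothetical.

```python
from itertools import product

def classes(t, f, acc):
    """Append the class of every vertex of t to acc; return the class of t's root."""
    if isinstance(t, str):
        cls = "R" if f[t] == "red" else "B"   # taxa are only in R or B
    else:
        left, right = classes(t[0], f, acc), classes(t[1], f, acc)
        if {left, right} == {"R", "B"}:
            cls = "RB_U"                      # one child in R, one in B
        elif "R" in (left, right):
            cls = "R"                         # one child in R, other not in B
        elif "B" in (left, right):
            cls = "B"                         # one child in B, other not in R
        else:
            cls = "RB_I"                      # neither child in R or B
    acc.append(cls)
    return cls

def union_count(t, f):
    """Number of vertices placed in RB_U, i.e. the parsimony score of f on t."""
    acc = []
    classes(t, f, acc)
    return acc.count("RB_U")

def max_objective(t1, t2, taxa):
    """Maximize |RB_U of t1| - |RB_U of t2| over all binary characters on taxa."""
    best = None
    for colours in product(("red", "blue"), repeat=len(taxa)):
        f = dict(zip(taxa, colours))          # the same character is used in both trees
        value = union_count(t1, f) - union_count(t2, f)
        if best is None or value > best[0]:
            best = (value, f)
    return best
```

Calling `max_objective` once in each direction and taking the larger value mirrors the two calls to the MSOL machinery mentioned earlier. For the two quartets `(("a", "b"), ("c", "d"))` and `(("a", "c"), ("b", "d"))` both directions give the value 1.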
Finally, we ensure that both trees select the same character as follows:

This concludes the formulation. Then we have the following result:

Conclusion
We have demonstrated how agreement forests, which are intensively studied objects in the phylogenetics literature, naturally lead to bounded treewidth in an auxiliary graph structure known as the display graph. This opens the door to compact, "declarative" proofs of fixed parameter tractability for a range of phylogenetics problems by formulating them in Monadic Second Order Logic (MSOL). Our formulations have introduced a number of logical predicates and design principles that will hopefully be of use to other phylogenetics researchers seeking to utilize this powerful machinery elsewhere in phylogenetics. Indeed, a natural follow-up question is to ask: what are the essential characteristics of phylogenetics problems that are amenable to this technique?

Acknowledgements
We thank Mathias Weller for helpful conversations.