Maximum parsimony distance on phylogenetic trees: a linear kernel and constant factor approximation algorithm

Maximum parsimony distance is a measure used to quantify the dissimilarity of two unrooted phylogenetic trees. It is NP-hard to compute, and very few positive algorithmic results are known due to its complex combinatorial structure. Here we address this shortcoming by showing that the problem is fixed parameter tractable. We do this by establishing a linear kernel, i.e., that after applying certain reduction rules the resulting instance has size that is bounded by a linear function of the distance. As powerful corollaries to this result we prove that the problem permits a polynomial-time constant-factor approximation algorithm; that the treewidth of a natural auxiliary graph structure encountered in phylogenetics is bounded by a function of the distance; and that the distance is within a constant factor of the size of a maximum agreement forest of the two trees, a well studied object in phylogenetics.


Introduction
Phylogenetics is the science of inferring and comparing trees (or more generally, graphs) that represent the evolutionary history of a set of species [35]. In this article we focus on trees. The inference problem has been comprehensively studied: given only data about the species in X (such as DNA data) construct a phylogenetic tree which optimizes a particular objective function [18,41]. Informally, a phylogenetic tree is simply a tree whose leaves are bijectively labelled by X. Due to different objective functions, multiple optima and the phenomenon that certain genomes are the result of several evolutionary paths (rather than just one) we are often confronted with multiple "good" phylogenetic trees [33]. In such cases we wish to formally quantify how dissimilar these trees really are. This leads naturally to the problem of defining and computing the distance between phylogenetic trees [37]. Many such distances have been proposed, some of which can be computed in polynomial-time, such as Robinson-Foulds (RF) distance [34], and some of which are NP-hard, such as Subtree Prune and Regraft (SPR) distance [9] or Tree Bisection and Reconnection (TBR) distance [1].
Interestingly, distances are not only relevant as a numerical quantification of difference: they also appear in constructive methods for the inference of phylogenetic networks [21], which generalise trees to graphs, and phylogenetic supertrees, which seek to merge multiple trees into a single summary tree [43]. In recent decades NP-hard phylogenetic distances have attracted quite some attention from the discrete optimization and parameterized complexity communities, see e.g. [12,17].
In this article we focus on a relatively new distance measure, maximum parsimony distance, henceforth denoted dMP . Let T1 and T2 be two unrooted (i.e. undirected) binary phylogenetic trees, with the same set of leaf labels X. Consider an arbitrary assignment of colours ("states") to X; we call such an assignment a character. The parsimony score of T1 with respect to the character is the minimum number of bichromatic edges in T1, ranging over all possible colourings of the internal vertices of T1. The parsimony distance of T1 and T2 is the maximum absolute difference between parsimony scores of T1 and T2, ranging over all characters [19,32].
The distance has several attractive properties; it is a metric, and (unlike e.g. RF distance) it is not confounded by the influence of horizontal evolutionary events [19]. Furthermore, the concept of parsimony, which lies at the heart of dMP , is fundamental in phylogenetics since it articulates the idea that explanations of evolutionary history should be no more complex than necessary. Alongside its historical significance for applied phylogenetics [18], the study of character-based parsimony has given rise to many beautiful combinatorial and algorithmic results; we refer to e.g. [38,30,39,2,31] for overviews.
Unfortunately, it is NP-hard to compute dMP [23]. A simple exponential-time algorithm is known [27], which runs in time O(φ^n · poly(n)), where |X| = n and φ ≈ 1.618 is the golden ratio, but beyond this few positive results are known. This is frustrating and surprising, since a number of results link dMP to the well-studied TBR distance, henceforth denoted dTBR. Namely, it has been proven that dMP is a lower bound on dTBR [19], which, informally, asks for the minimum number of topological rearrangement operations to transform one tree into the other; an empirical study has suggested that in practice the distances are often very close [24]. Also, dMP has been used to prove the tightness of the best-known kernelization results for dTBR [26,25]. What, exactly, is the relationship between dMP and dTBR? This is a pertinent question, which transcends the specifics of TBR distance because, crucially, dTBR can be characterized using the powerful maximum agreement forest abstraction.
Distances based on agreement forests have been intensively and successfully studied in recent years, as the use of the agreement forest abstraction almost always yields fixed parameter tractability and constant-factor approximation algorithms [10], many of which are effective in practice. We refer to [42,40,14,36] for recent overviews of the agreement forest literature, and books such as [15] for an introduction to fixed parameter tractability. In particular, dTBR can be computed in O(3^dTBR · poly(n)) time [13], permits a polynomial-time 3-approximation algorithm, and has a kernel of size 11dTBR − 9 [25].
In contrast, prior to this paper very little was known about dMP: nothing was known about the approximability of dMP; it was not known whether it is fixed parameter tractable (where dMP is the parameter); and, while, as mentioned above, it is known that dMP ≤ dTBR, it remained unclear how much smaller dMP can be than dTBR in the worst case. Despite promising partial results it even remained unclear whether questions such as "Is dMP ≥ k?" can be solved in polynomial time when k is a constant [8,24]. This is another important difference with distances such as dTBR, where corresponding questions are trivially polynomial time solvable for fixed k. The apparent extra complexity of dMP seems to stem from the unusual max-min definition of the problem, and the fact that unlike dTBR, which is based on topological rearrangements of subtrees, dMP is based only on characters.
In this article we take a significant step forward in understanding the deeper complexity of dMP and resolve all of the above questions. Our central result is that we prove that two common polynomial-time reduction rules encountered in phylogenetics, the subtree and chain reductions [1], are sufficient to produce a linear kernel for dMP . This means that, after exhaustive application of these rules, which preserve dMP , the reduced trees will have at most α · dMP leaves, with α = 560. The fixed parameter tractability of computing dMP (parameterized by itself) then follows, by solving the kernel using the exact algorithm from [27]. The fact that the reduction rules preserve dMP was already known [24]. However, proving the bound on the size of the reduced trees requires rather involved combinatorial arguments, which have a very different flavour to the arguments typically encountered in the maximum agreement forest literature. The main goal of this article is to present these arguments as clearly as possible, rather than to optimize the resulting constants.
The kernel confirms that questions such as "Is dMP ≥ k?" can, indeed, be solved in polynomial time: it is striking that here the proof of fixed parameter tractability has preceded the weaker result of polynomial-time solvability for fixed k.
Next, by producing a modified, constructive version of the bounding argument underpinning the kernelization, we are able to demonstrate a polynomial-time α(1 + 1/r)-factor approximation algorithm for computation of dMP for any constant r, placing the problem in APX.
A number of other powerful corollaries result from the kernelization. We leverage the fact that the reduction rules also preserve dTBR, to show that 1 ≤ dTBR/dMP ≤ 2α, which limits how much smaller dMP can be than dTBR. Subsequently, we show that the treewidth of an auxiliary graph structure known as the display graph [11] is bounded by a linear function of dMP, resolving an open question posed several times [29,24]. The treewidth bound, and the existence of a non-trivial approximation algorithm for dMP, were specified as sufficient conditions for proving the fixed parameter tractability of dMP via Courcelle's Theorem [24]; our linear kernel implies them. Summarising, our central result shows how kernelization can open the gateway to a host of strong auxiliary results and bypass intermediate steps in the algorithm design process.
[Figure caption: A character which assigns colour red to {a, b, c} and blue to {d, e, f, g} has parsimony score 1 on the left tree, and 2 on the right, proving that dMP(T1, T2) ≥ |1 − 2| = 1. In fact, it can be verified that no character can cause the parsimony scores of these two trees to differ by more, so dMP(T1, T2) = 1. As noted in Section 4.2, dTBR(T1, T2) = 2, because a maximum agreement forest of these two trees contains three blocks [24].]
The structure of the paper is as follows. In Section 2 we give formal definitions and preliminary results. In Section 3 we prove our main result: the linear kernel. The section starts with Subsection 3.1, which gives a high-level overview of how a sequence of lemmas and theorems leads to the kernel; these lemmas and theorems are proved in the rest of the section. Corollaries of the existence of a linear kernel are derived in Section 4: a constant-factor approximation algorithm in Section 4.1; a bound on the ratio between dMP and dTBR in Section 4.2; and a bound on the treewidth of the so-called display graph in terms of dMP in Section 4.3. Section 5 concludes with some directions for future research.

Definitions and Preliminaries
An unrooted binary phylogenetic tree on a set of species (or taxa) X is an undirected tree in which all internal vertices have degree 3, and the degree-1 vertices (the leaves) are bijectively labelled with elements from X. For brevity we will refer to unrooted binary phylogenetic trees as phylogenetic trees, or even shorter trees. See Figure 1 for an example.
Given a set S ⊆ X and a tree T on X, we denote by T [S] the spanning subtree on S in T , that is, the minimal connected subgraph T ′ of T such that T ′ contains every element of S. The induced subtree T |S by S in T is the tree derived from T [S] by suppressing any vertices of degree 2.
Given a subset S ⊆ X and a tree T on X, we say that S has degree d in T if there are exactly d edges uv in T for which u is in T[S] and v is not; in other words, d is the number of edges separating T[S] from the rest of T. We call these edges pending edges.
For two disjoint subsets S1, S2 ⊆ X, we say S1 and S2 are spanning-disjoint in T if the spanning subtrees T[S1] and T[S2] are edge-disjoint. (Observe that as T is binary, this also implies that T[S1] and T[S2] are vertex-disjoint.) Similarly, we say a collection S1, . . . , Sm of subsets of X are spanning-disjoint in T if Si, Sj are spanning-disjoint in T for any i ≠ j.
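The definitions of T[S], the degree of S and spanning-disjointness can be illustrated with a minimal sketch. It assumes that a tree is stored as an adjacency dictionary; the helper names build_tree, spanning_subtree, degree_of and spanning_disjoint are ours and are introduced only for illustration.

```python
from collections import deque


def build_tree(edges):
    """Adjacency-list representation of an undirected tree (illustrative helper)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj


def spanning_subtree(adj, S):
    """Vertex set of T[S]: repeatedly prune vertices outside S that have degree <= 1."""
    S = set(S)
    alive = set(adj)
    deg = {v: len(adj[v]) for v in adj}
    queue = deque(v for v in adj if deg[v] <= 1 and v not in S)
    while queue:
        v = queue.popleft()
        if v not in alive:
            continue
        alive.discard(v)
        for u in adj[v]:
            if u in alive:
                deg[u] -= 1
                if deg[u] <= 1 and u not in S:
                    queue.append(u)
    return alive


def degree_of(adj, S):
    """Number of pending edges: edges uv with u in T[S] and v outside T[S]."""
    inside = spanning_subtree(adj, S)
    return sum(1 for u in inside for v in adj[u] if v not in inside)


def spanning_disjoint(adj, S1, S2):
    """For a binary tree it suffices to test that T[S1] and T[S2] share no vertex."""
    return spanning_subtree(adj, S1).isdisjoint(spanning_subtree(adj, S2))


# Caterpillar tree on X = {a, b, c, d, e} with internal vertices u1, u2, u3.
T = build_tree([("a", "u1"), ("b", "u1"), ("u1", "u2"), ("c", "u2"),
                ("u2", "u3"), ("d", "u3"), ("e", "u3")])
print(degree_of(T, {"a", "b"}), degree_of(T, {"a", "c"}))
print(spanning_disjoint(T, {"a", "b"}, {"d", "e"}))
```

Here {a, b} forms a cherry with a single pending edge, whereas T[{a, c}] passes through u1 and u2 and therefore has two pending edges.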

Characters and parsimony
A character on X is a function χ : X → C, where C is a set of states. In this paper there is no limit on the size of C, in contrast to some contexts where |C| is assumed to be quite small (for example, in genetic data the nucleobases A,C,G,T). Think of the states as colours, say 1, 2, . . . , t =: [t].
For a given character χ and tree T on X, the parsimony score measures how well T fits χ. It is defined in the following way. Call a colouring φ^χ : V(T) → C an extension of χ to T if φ^χ(x) = χ(x) for all x ∈ X. We usually omit superscript χ of φ if the character is clear from the context. Denote by ∆T(φ) the number of bichromatic edges uv in T, i.e. those for which φ(u) ≠ φ(v). Again, we usually omit subscript T when the tree is clear from context. The parsimony score for T with respect to χ is defined as
$$l_\chi(T) = \min_{\varphi} \Delta_T(\varphi),$$
where the minimum is taken over all possible extensions φ of χ to T. An extension φ that achieves this bound is called an optimal extension of χ to T. An optimal extension, and thus the parsimony score, can be easily computed in polynomial time using dynamic programming or e.g. Fitch's algorithm [20].
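As an illustration of how the parsimony score can be computed, the following is a minimal sketch of Fitch's bottom-up pass. It assumes an unrooted binary tree stored as an adjacency dictionary and a character given as a dictionary from leaves to states; the names build_tree and parsimony_score are ours. Only the score is returned, not an optimal extension.

```python
def build_tree(edges):
    """Adjacency-list representation of an undirected tree (illustrative helper)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj


def parsimony_score(adj, chi):
    """Fitch's algorithm for a binary tree; chi maps every leaf to its state."""
    root = next(iter(adj))
    parent = {root: None}
    order = [root]
    for v in order:                      # breadth-first order: parents before children
        for u in adj[v]:
            if u != parent[v]:
                parent[u] = v
                order.append(u)
    state, cost = {}, 0
    for v in reversed(order):            # children are handled before their parent
        sets = [state[u] for u in adj[v] if u != parent[v]]
        if v in chi:                     # a leaf contributes its own state
            sets.append({chi[v]})
        current = sets[0]
        for s in sets[1:]:
            if current & s:
                current = current & s
            else:                        # disjoint state sets force one state change
                current = current | s
                cost += 1
        state[v] = current
    return cost


# Two quartet trees: T1 = ab|cd and T2 = ac|bd.
T1 = build_tree([("a", "u"), ("b", "u"), ("u", "v"), ("c", "v"), ("d", "v")])
T2 = build_tree([("a", "u"), ("c", "u"), ("u", "v"), ("b", "v"), ("d", "v")])
chi = {"a": "red", "b": "red", "c": "blue", "d": "blue"}
print(parsimony_score(T1, chi), parsimony_score(T2, chi))
```

The pairwise merging of state sets is only claimed to be correct for binary trees, where it corresponds to rooting the tree on an edge; for this character the two scores differ by 1.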
Observe that for any T and χ, the parsimony score for T with respect to χ is at least |χ(X)| − 1, i.e. the number of colours assigned by χ minus 1. If lχ(T) is exactly |χ(X)| − 1, we say that T is a perfect phylogeny for χ. For trees T1, T2 and a character χ on X, the parsimony distance with respect to χ is defined as dMPχ(T1, T2) = |lχ(T1) − lχ(T2)|. Now we are ready to define the maximum parsimony distance between two trees (see also Figure 1). For two trees T1, T2 on X, the maximum parsimony distance is defined as
$$d_{MP}(T_1, T_2) = \max_{\chi} d_{MP}^{\chi}(T_1, T_2),$$
where the maximum is taken over all possible characters χ on X [19,32]. Equivalently, we may write it as
$$d_{MP}(T_1, T_2) = \max_{\chi} \left| \Delta_{T_1}(\varphi_1^{\chi}) - \Delta_{T_2}(\varphi_2^{\chi}) \right|,$$
where φ1^χ is an optimal extension of χ to T1, and φ2^χ an optimal extension of χ to T2. This measure satisfies the properties of a distance metric on the space of unrooted binary phylogenetic trees [19,32]. For two trees on n taxa it is known that dMP is at most n − 2√n + 1 [19]. A weaker bound of n − 1 is easily obtained by observing that the parsimony score of a character on a tree is at least 0 and at most n − 1.
Given a tree T on X and a colouring φ : V(T) → [t], the forest induced by φ is derived from T by deleting every bichromatic edge under φ. Observe that the number of connected components in the forest induced by φ is exactly ∆(φ) + 1.
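Lemma 1. If χ : X → [t] is a character such that Si = χ^(−1)(i) is non-empty for each i ∈ [t], and T is a tree on X, then lχ(T) ≥ t − 1, with equality if and only if S1, . . . , St are spanning-disjoint in T.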
Proof. To see that lχ(T) ≥ t − 1, consider an optimal extension φ of χ to T, and let F be the forest induced by φ. As each connected component in F is monochromatically coloured by φ, there must be at least t connected components, and thus ∆(φ) ≥ t − 1, which implies lχ(T) ≥ t − 1. Now suppose that S1, . . . , St are spanning-disjoint in T. Then construct an extension φ of χ to T by first setting φ(u) = i for every vertex u in T[Si], for each i ∈ [t]. (As the spanning trees are edge-disjoint and thus vertex-disjoint in T, this is well-defined.) For any remaining unassigned vertex v, if v has a neighbour u for which φ(u) is defined, then set φ(v) = φ(u). Repeat this process until every vertex is assigned a colour by φ. Now observe that by construction, the vertices assigned colour i by φ form a connected subtree for each i ∈ [t]. Thus the forest induced by φ has exactly t connected components, and so ∆(φ) = t − 1.
Finally, suppose lχ(T ) = t − 1, and let φ be an optimal extension of χ. Then the forest F induced by φ has exactly t connected components, which implies by the pigeonhole principle that each Si is a subset of one connected component in F . Then as each Si is contained within a different connected component of F , the spanning trees T [Si] are also contained within these components, and so S1, . . . St are spanning-disjoint.

Parameterized complexity and kernelization
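A parameterized problem is a problem for which the inputs are of the form (x, k), where k is a non-negative integer, called the parameter. A parameterized problem is fixed-parameter tractable (FPT) if there exists an algorithm that solves any instance (x, k) in time f(k) · |x|^O(1), where f is a computable function depending only on k. A kernelization is a polynomial-time algorithm that maps any instance (x, k) to an equivalent instance (x′, k′) with |x′|, k′ ≤ g(k) for some computable function g; the resulting instance is called a kernel. If g(k) is a polynomial in k then we call this a polynomial kernel; if g(k) = O(k) then it is a linear kernel. It is well-known that a parameterized problem is fixed-parameter tractable if and only if it has a (not necessarily polynomial) kernel. For more information, we refer the reader to [16].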
For a maximization problem Π and ρ ≥ 1, we say Π has a constant factor approximation with approximation ratio ρ if there exists a polynomial-time algorithm such that for any instance π of Π, the following inequalities hold, where opt(π) denotes the maximum value of a solution to π, and alg(π) denotes the value of the solution to π returned by the algorithm:
$$\frac{1}{\rho}\,\mathrm{opt}(\pi) \le \mathrm{alg}(\pi) \le \mathrm{opt}(\pi).$$
In this paper we study the following maximization problem:
dmp
Input: Two trees T1, T2 on a set of taxa X.

Kernel bound

Overview
In this section we give an overview of the constituent parts of our kernelization result, and how they fit together. The first step is to apply two reduction rules, described in the next section. Rules 1 and 2 correspond roughly to the cherry and chain reduction rules that often appear in papers on computational phylogenetics. The correctness of these rules was proved in [24]; our contribution is to show that the exhaustive application of these rules grants a linear kernel, as stated in the following theorem.
Theorem 1. Let (T1, T2) be a pair of binary unrooted phylogenetic trees on X that are irreducible under Reduction Rules 1 and 2, let k be a positive integer, and let α = 560. Then if |X| ≥ αk, it holds that dMP(T1, T2) ≥ k, and we can find a witnessing character, i.e. a character χ yielding dMPχ(T1, T2) ≥ k, in polynomial time.
This theorem, together with the correctness of the reduction rules as proved in [24], immediately implies a linear kernel for dmp.
To show how we prove the theorem, we will need to introduce some terminology as we go.
A quartet Q is any set of 4 elements in X. If T1|Q ≠ T2|Q, we say that Q is a conflicting quartet for (T1, T2).
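To make the notion of a conflicting quartet concrete, the following minimal sketch tests whether T1|Q ≠ T2|Q and searches a set S for a conflicting quartet by brute force. It relies on the standard four-point condition: with unit edge lengths, the induced quartet is ab|cd exactly when d(a, b) + d(c, d) is the strictly smallest of the three pairwise sums. The adjacency-dictionary representation and the names build_tree, quartet_split and find_conflicting_quartet are our own assumptions. The search inspects O(|S|^4) quartets, which is polynomial but not optimized.

```python
from collections import deque
from itertools import combinations


def build_tree(edges):
    """Adjacency-list representation of an undirected tree (illustrative helper)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj


def distances_from(adj, source):
    """Breadth-first search distances (number of edges) from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return dist


def quartet_split(adj, quartet):
    """Return the split induced on the quartet, e.g. {{a, b}, {c, d}} for ab|cd."""
    a, b, c, d = quartet
    dist = {x: distances_from(adj, x) for x in quartet}
    pairings = [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]
    best = min(pairings, key=lambda p: dist[p[0][0]][p[0][1]] + dist[p[1][0]][p[1][1]])
    return frozenset(frozenset(pair) for pair in best)


def find_conflicting_quartet(adj1, adj2, S):
    """Return some quartet Q within S with T1|Q != T2|Q, or None if there is none."""
    for quartet in combinations(sorted(S), 4):
        if quartet_split(adj1, quartet) != quartet_split(adj2, quartet):
            return quartet
    return None


# Two caterpillars on {a, ..., e} that differ by swapping b and c.
T1 = build_tree([("a", "u1"), ("b", "u1"), ("u1", "u2"), ("c", "u2"),
                 ("u2", "u3"), ("d", "u3"), ("e", "u3")])
T2 = build_tree([("a", "u1"), ("c", "u1"), ("u1", "u2"), ("b", "u2"),
                 ("u2", "u3"), ("d", "u3"), ("e", "u3")])
print(find_conflicting_quartet(T1, T2, "abcde"))
```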
As a crucial step we prove that for any S large enough with respect to the degree of S in both T1 and T2, either there exists a conflicting quartet or one of the reduction rules applies.
Lemma 2. Let S be a subset of X with d1 the degree of S in T1, and d2 the degree of S in T2. If |S| > 9(d1 + d2) − 12, then either T1|S ≠ T2|S or one of Reduction Rules 1 or 2 applies to (T1, T2). In particular if (T1, T2) is irreducible under Rules 1 and 2 and |S| ≥ 9(d1 + d2) − 11, then there exists a conflicting quartet Q ⊆ S, and such a quartet can be found in polynomial time.
The next result implies that if we have a large enough number of conflicting quartets that are also spanning-disjoint in both T1 and T2, then we are done. While it is intuitively clear that such quartets can be leveraged to create a high parsimony score in one tree, some care has to be taken to keep the parsimony score low in the other tree.
Lemma 3. Let {Q1, . . . , Qk} be a set of conflicting quartets for (T1, T2), such that Q1, . . . , Qk are spanning-disjoint in T1 and in T2. Then dMP(T1, T2) ≥ k, and we can find a witnessing character in polynomial time.
In combination, Lemmas 2 and 3 allow us to show that dMP(T1, T2) ≥ k provided that we can find at least k sets S1, . . . , Sk that are spanning-disjoint in both trees and satisfy the conditions of Lemma 2.
We will find k such sets as part of the construction of a character that witnesses dMP (T1, T2) ≥ k, for any reduced instance with |X| ≥ αk. In order to construct this character, we first create a partition of X into large subsets, as described by the following lemma.
Lemma 4. Suppose that |X| ≥ 2ct for some integers c and t, and let T1 be a phylogenetic tree on X.
Then in polynomial time we can construct a partition S1, . . . , St of X with S1, . . . , St spanning-disjoint in T1, such that |Si| ≥ c for each i.
We note that there is a one-to-one correspondence between partitions and characters on X, in the following sense. Given a partition S1, . . . , St of X, we may define a character χ : X → [t] by setting χ(x) = i for each x ∈ Si and each i ∈ [t]. Call such a character the character defined by S1, . . . , St.
Thus let us consider the character χ on X defined by the partition described by Lemma 4. Since S1, . . . , St are spanning-disjoint in T1, Lemma 1 tells us that the parsimony score of T1 with respect to χ is exactly t − 1.
Lemma 5. Let χ be the character defined by the partition S1, . . . , St where S1, . . . , St are spanning-disjoint in T1, and assume
$$t \ge \frac{2d_1d_2 + d_1}{d_1d_2 - d_1 - d_2}\,k.$$
Then either dMPχ(T1, T2) ≥ k, or in polynomial time we can find a set of indices i_1, . . . , i_k′ with k′ ≥ k such that:
• S_{i_1}, . . . , S_{i_k′} are spanning-disjoint in T2 (as well as T1);
• each S_{i_j} has degree at most d1 in T1; and
• each S_{i_j} has degree at most d2 in T2.
We will prove Theorem 1 by combining these results in the following way. Fix integers d1, d2 to be determined later. Assume (T1, T2) is irreducible under Reduction Rules 1 and 2, and assume that |X| ≥ 2ct, where c and t are integers with c ≥ 9(d1 + d2) − 11 and t ≥ ((2d1d2 + d1)/(d1d2 − d1 − d2)) k. By Lemma 4, there exists a partition S1, . . . , St of X with S1, . . . , St spanning-disjoint in T1 and |Si| ≥ c for each i ∈ [t]. Let χ be the character defined by this partition. If dMPχ(T1, T2) ≥ k, we may return χ. Otherwise, we may apply Lemma 5 to get a set of indices i_1, . . . , i_k such that S_{i_1}, . . . , S_{i_k} are spanning-disjoint in T2 (as well as in T1), each S_{i_j} has degree at most d1 in T1, and each S_{i_j} has degree at most d2 in T2. But then each S_{i_j} satisfies the conditions of Lemma 2, and therefore for each j ∈ [k] there exists a conflicting quartet Qj ⊆ S_{i_j}. Moreover, as S_{i_1}, . . . , S_{i_k} are spanning-disjoint in T1 and T2, the quartets Q1, . . . , Qk are also spanning-disjoint in T1 and T2. Then Lemma 3 implies that dMP(T1, T2) ≥ k.
In the next subsections we prove each of these lemmas, and then the main theorem, in turn.

Reduction Rules
We begin by stating the reduction rules for our kernelization result.
Reduction Rule 1. [Cherry reduction rule] If there exist x, y ∈ X such that in each of T1, T2 there exists an internal vertex u adjacent to both x and y, then replace (T1, T2) with (T1| X\{x} , T2| X\{x} ).
The correctness of these rules was previously proved in [24]: applying either rule to (T1, T2) leaves dMP(T1, T2) unchanged. Correctness of the chain reduction rule follows from Theorem 3.1 in [24]. Correctness of the cherry reduction rule follows as a subcase of Theorem 4.1 in [24] (in particular, the cherry reduction is an instance of the "traditional" case of the generalized subtree reduction from [24], where the subtree has 2 leaves).
Our main contribution is to show that if an instance is reduced by these rules then its size is bounded by a linear function of dMP .

Small degree sets
In this section we prove Lemma 2.
Lemma 2. Let S be a subset of X with d1 the degree of S in T1, and d2 the degree of S in T2. If |S| > 9(d1 + d2) − 12, then either T1|S ≠ T2|S or one of Reduction Rules 1 or 2 applies to (T1, T2). In particular if (T1, T2) is irreducible under Rules 1 and 2 and |S| ≥ 9(d1 + d2) − 11, then there exists a conflicting quartet Q ⊆ S, and such a quartet can be found in polynomial time.
Proof. Since unrooted binary trees are characterized by their quartets [35, Theorem 6.3.5(iii)], the last statement of the lemma follows directly from the first. It therefore suffices to assume that T1|S = T2|S and that neither reduction rule applies, and to show that |S| ≤ 9(d1 + d2) − 12. We write T|S for the common induced tree.
Consider the backbone graph of T|S obtained by deleting all leaves. Let PC be the set of nodes having degree 1 on the backbone, which we refer to as parents of a cherry in T|S. Let PL be the set of nodes having degree 2 on the backbone, which we refer to as parents of a leaf of T|S. All remaining vertices on the backbone have degree 3. Thus |S|, the total number of leaves of T|S, is 2|PC| + |PL|. We call the path between any two odd degree vertices on the backbone, having internal nodes only in PL, a side of the backbone.
First notice that for each cherry in T |S, there must exist in T1[S], the spanning tree on S in T1, or in T2[S] a node, incident to a pending edge, between at least one of its two leaves and its corresponding node in PC . Otherwise Reduction Rule 1 can be applied. In particular this implies that |PC | ≤ d1 + d2.
Thus at least |PC| of the d1 + d2 pending edges must be used for "cutting" the cherries, each of them cutting 1 leaf of a cherry. Let us choose one such leaf from each cherry, and call these the cut-leaves.
After removing cut-leaves, every node in PC and PL is now the parent of 1 leaf in T |S. Every side of the backbone contains at most 4 vertices in PC and PL, unless T1[S] or T2[S] has a node of a pending edge or a node adjacent to a node of a pending edge on that side. We show that every such pending edge on a side may increase the number of PL-nodes on that side by at most 5 (see Figure 2). Indeed, suppose a side of the backbone has in total d pending edges in both T1 and T2, but more than 4 + 5d nodes in PL, i.e. at least 5(d + 1). Then T |S contains a chain of length 5(d + 1), which we can split up into d + 1 chains of length 5. Clearly at least one of these chains has no pending edge in either T1 or T2, and so T1, T2 have a common chain of length 5, a contradiction.
Thus the total number of nodes from PC and PL on a side is at most five times the number of pending edges (in T1[S] or T2[S]) on that side, plus 4. Otherwise Reduction Rule 2 can be applied. Given that we already used |PC | pending edges for cutting the cherries, we have d1 + d2 − |PC | pending edges left to be distributed over the sides.
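The number of sides on the backbone is the number of edges in an unrooted binary tree with |PC| leaves, which is 2|PC| − 3. Therefore the total number of leaves of T|S, counting the |PC| cut-leaves together with the remaining leaves along the sides, is at most
$$|S| \le |P_C| + 4\,(2|P_C| - 3) + 5\,(d_1 + d_2 - |P_C|) = 4|P_C| + 5(d_1 + d_2) - 12.$$
Clearly, this attains its largest value if |PC| = d1 + d2, in which case |S| ≤ 9(d1 + d2) − 12, as was to be proven.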

Combining conflicting quartets
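In this section we prove Lemma 3.
Lemma 3. Let {Q1, . . . , Qk} be a set of conflicting quartets for (T1, T2), such that Q1, . . . , Qk are spanning-disjoint in T1 and in T2. Then dMP(T1, T2) ≥ k, and we can find a witnessing character in polynomial time.
Proof. For a quartet Q = {a, b, c, d} and a tree T, we write T|Q = ab|cd if in T the path between a and b is disjoint from the path between c and d. Without loss of generality, we may assume Qi = {ai, bi, ci, di}, T1|Qi = aibi|cidi and T2|Qi = aici|bidi for each i ∈ [k].
The idea is to construct χ in such a way that, for each quartet Qi, χ(ai) = χ(bi) ≠ χ(ci) = χ(di). This will ensure that lχ(T2) is at least 2k, as T2 will have at least 2k edge-disjoint paths (from ai to ci and from bi to di, for each i ∈ [k]) that each require at least one change in state along some edge.
For each Qi, let eQi denote an edge in T1 that, in T1[Qi], lies on the path separating {ai, bi} from {ci, di}. Now we construct a function φ : V(T1) → {red, blue} as follows. Start by choosing an arbitrary leaf in T1, say without loss of generality a1, and set φ(a1) = red. Now proceed as follows. For any edge uv in T1 such that φ(u) is defined but φ(v) is not, we set φ(v) = φ(u), unless uv = eQi for some i. In that case, we set φ(v) = blue if φ(u) = red, and set φ(v) = red otherwise. Now we can let χ be the restriction of φ to X.
By construction, the bichromatic edges under φ are exactly eQ1, . . . , eQk, so lχ(T1) ≤ ∆(φ) = k. Moreover, as the quartets are spanning-disjoint in T1, the paths between ai and bi and between ci and di contain none of these edges, while the path between ai and ci contains exactly one of them, namely eQi. Hence χ(ai) = χ(bi) ≠ χ(ci) = χ(di) for each i ∈ [k], so lχ(T2) ≥ 2k and dMPχ(T1, T2) ≥ 2k − k = k.
Since each edge is processed at most once in the construction of χ, it is clear that this construction takes polynomial time.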
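The construction of χ in this proof can be sketched as follows. This is a minimal illustration under our own assumptions: T1 is an adjacency dictionary, each quartet is passed as a tuple (a, b, c, d) with T1|Q = ab|cd, and the function names are hypothetical. For eQ we take the first edge after the path from a to c leaves the path from a to b, which is one valid choice among possibly several.

```python
from collections import deque


def build_tree(edges):
    """Adjacency-list representation of an undirected tree (illustrative helper)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj


def path(adj, s, t):
    """Vertices on the unique path from s to t, found by breadth-first search."""
    parent = {s: None}
    queue = deque([s])
    while queue:
        v = queue.popleft()
        if v == t:
            break
        for u in adj[v]:
            if u not in parent:
                parent[u] = v
                queue.append(u)
    out = [t]
    while out[-1] != s:
        out.append(parent[out[-1]])
    return out[::-1]


def witness_character(adj1, quartets):
    """Two-state character whose colour flips exactly across one edge e_Q per quartet."""
    flip = set()
    for a, b, c, d in quartets:          # each tuple is assumed to satisfy T1|Q = ab|cd
        p_ac = path(adj1, a, c)
        on_ab = set(path(adj1, a, b))
        i = max(j for j, v in enumerate(p_ac) if v in on_ab)
        flip.add(frozenset((p_ac[i], p_ac[i + 1])))
    other = {"red": "blue", "blue": "red"}
    start = next(iter(adj1))
    colour = {start: "red"}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for u in adj1[v]:
            if u not in colour:
                flipped = frozenset((u, v)) in flip
                colour[u] = other[colour[v]] if flipped else colour[v]
                queue.append(u)
    return {x: colour[x] for x in adj1 if len(adj1[x]) == 1}


# T1 contains two spanning-disjoint quartets a1b1|c1d1 and a2b2|c2d2 plus a leaf z.
T1 = build_tree([("a1", "u1"), ("b1", "u1"), ("u1", "w1"), ("c1", "v1"), ("d1", "v1"),
                 ("v1", "w1"), ("w1", "m"), ("a2", "u2"), ("b2", "u2"), ("u2", "w2"),
                 ("c2", "v2"), ("d2", "v2"), ("v2", "w2"), ("w2", "m"), ("z", "m")])
quartets = [("a1", "b1", "c1", "d1"), ("a2", "b2", "c2", "d2")]
chi = witness_character(T1, quartets)
```

In this example the colouring has exactly two bichromatic edges in T1, one per quartet, matching the bound lχ(T1) ≤ k used in the proof.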

Constructing an initial partition
In this section we prove Lemma 4.

Lemma 4. Suppose that |X| ≥ 2ct for some integers c and t, and let T1 be a phylogenetic tree on X.
Then in polynomial time we can construct a partition S1, . . . , St of X with S1, . . . , St spanning-disjoint in T1, such that |Si| ≥ c for each i.
Proof. We prove the claim by induction on t. For the base case, if t = 1 then we may let S1 = X, and we have the desired partition.
For the inductive step, assume |X| ≥ 2ct and that the claim is true for smaller values of t. We first fix an arbitrary rooting on T1. That is, choose an arbitrary edge e in T1 and subdivide it with a new (temporary) vertex r, then orient all edges in T1 away from r. Under this rooting, let u be a lowest vertex in T1 for which u has at least c descendants in X. Let St ⊆ X be the set of these descendants. Note that since T1 is binary, |St| < 2c, as otherwise one of the two children of u would be a lower vertex with at least c descendants. Now consider the induced subtree T1|X′, where X′ = X \ St. As |St| < 2c, we have |X′| ≥ 2c(t − 1). Then by the inductive hypothesis, we can construct a partition S1, . . . , St−1 of X′ with S1, . . . , St−1 spanning-disjoint in T1|X′, such that |Si| ≥ c for each i. By construction it is clear that St is spanning-disjoint in T1 from S1, . . . , St−1. Thus S1, . . . , St is the desired partition.
As the construction of St can be done in polynomial time and this process is repeated t ≤ |X| times, the entire process takes polynomial time.
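The inductive construction can also be carried out in a single bottom-up pass, as in the following minimal sketch. It is a variant of the argument above rather than a literal transcription: instead of subdividing an edge we root T1 at a leaf, and instead of recursing on the induced subtree we keep, for each vertex, the list of leaves below it that are not yet assigned. The adjacency-dictionary representation and the name partition_leaves are our own assumptions.

```python
def build_tree(edges):
    """Adjacency-list representation of an undirected tree (illustrative helper)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj


def partition_leaves(adj, c, t):
    """Split the leaves of a binary tree into t blocks of size >= c, cut bottom-up."""
    leaves = sorted(v for v in adj if len(adj[v]) == 1)
    if len(leaves) < 2 * c * t:
        raise ValueError("the construction assumes |X| >= 2ct")
    root = leaves[0]                     # rooting at a leaf keeps every vertex binary
    parent = {root: None}
    order = [root]
    for v in order:                      # breadth-first order: parents before children
        for u in adj[v]:
            if u != parent[v]:
                parent[u] = v
                order.append(u)
    pending, blocks = {}, []
    for v in reversed(order):            # children are handled before their parent
        mine = [v] if len(adj[v]) == 1 else []
        for u in adj[v]:
            if u != parent[v]:
                mine.extend(pending.pop(u))
        # v is a lowest vertex with at least c unassigned leaves below it: cut a block.
        if len(mine) >= c and len(blocks) < t - 1:
            blocks.append(set(mine))
            mine = []
        pending[v] = mine
    blocks.append(set(pending[root]))    # the final block takes all remaining leaves
    return blocks


# Caterpillar with leaves l0, ..., l7 and internal vertices i1, ..., i6.
edges = [("l0", "i1"), ("l1", "i1")]
for j in range(1, 6):
    edges += [(f"i{j}", f"i{j + 1}"), (f"l{j + 1}", f"i{j + 1}")]
edges.append(("l7", "i6"))
T1 = build_tree(edges)
blocks = partition_leaves(T1, c=2, t=2)
```

Each block that is cut consists of all unassigned leaves below some vertex, so its spanning subtree stays below that vertex, while the remaining leaves all lie outside these subtrees; this is why the resulting blocks are spanning-disjoint in T1.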

Well-behaved sets
In this section we prove Lemma 5. We start with an observation:

Observation 1. For any (not necessarily binary) unrooted tree T with n vertices, and any integer d ≥ 1, the number of vertices in T with degree strictly greater than d is at most n/d.
Proof. For each vertex v in T let d(v) denote the degree of v. Recall that an unrooted tree with n vertices has exactly n − 1 edges. It follows that Σ_{v∈V(T)} d(v) = 2|E(T)| = 2n − 2. Now suppose that T has m > n/d vertices with degree strictly greater than d, i.e. at least d + 1. The remaining n − m vertices all have degree at least 1, from which it follows that Σ_{v∈V(T)} d(v) ≥ m(d + 1) + n − m = md + n > (n/d)d + n = 2n, a contradiction.
Lemma 5. Let χ be the character defined by the partition S1, . . . , St where S1, . . . , St are spanning-disjoint in T1, and assume
$$t \ge \frac{2d_1d_2 + d_1}{d_1d_2 - d_1 - d_2}\,k.$$
Then either dMPχ(T1, T2) ≥ k, or in polynomial time we can find a set of indices i_1, . . . , i_k′ with k′ ≥ k such that:
• S_{i_1}, . . . , S_{i_k′} are spanning-disjoint in T2 (as well as T1);
• each S_{i_j} has degree at most d1 in T1; and
• each S_{i_j} has degree at most d2 in T2.
Proof. Let δ = lχ(T2) − lχ(T1). By Lemma 1, lχ(T1) = t − 1 ≤ lχ(T2), so δ ≥ 0. If δ ≥ k then dMPχ(T1, T2) ≥ k and we are done, so we may assume that δ ≤ k − 1.
We now construct a partition P1, . . . , Ps of X which is spanning-disjoint in T2 (see Figure 3 for an illustration). Let φ2 be an optimal extension of χ to T2. As lχ(T2) = lχ(T1) + δ = t + δ − 1, the forest induced by φ2 has exactly t + δ connected components. Let s = t + δ, and let P1, . . . , Ps be the sets of leaves of these components (each component contains a leaf, as otherwise recolouring it to match a neighbouring component would reduce ∆(φ2)). Each Pj is monochromatic under χ and is therefore a subset of some Si. Let I be the set of indices i ∈ [t] such that:
• Si = Pj for some j ∈ [s];
• Si has degree at most d1 in T1; and
• Si has degree at most d2 in T2.
Note that since P1, . . . , Ps are spanning-disjoint in T2, the sets {Si : i ∈ I} are also spanning-disjoint in T2. Notice that it is sufficient to prove that |I| ≥ k, whence any subset of k indices from I satisfies the lemma. We will prove this by providing upper bounds on the number of indices in [t] that do not satisfy the conditions of I.
First, let I0 denote the set of indices i ∈ [t] for which Si is not equal to Pj for any j ∈ [s]. Each such Si is the union of at least two of the sets Pj, while every other Si is a single Pj, so s ≥ t + |I0| and hence |I0| ≤ s − t = δ.
Next, let I>d1 denote the set of indices i ∈ [t] for which Si has degree greater than d1 in T1. We will show that |I>d1| ≤ t/d1. For each i ∈ [t], compress the spanning subtree T1[Si] to a single vertex, and observe that the degree of this vertex is equal to the degree of Si in T1. Any vertex u which is not part of any T1[Si] is merged with one of its neighbours. Note that this merging process can only increase the degrees of the remaining vertices. Call the resulting tree T′1. See Figure 4. T′1 has t vertices, each of them corresponding to a subset Si, and having degree at least the degree of the corresponding Si in T1. Now by Observation 1, there are at most t/d1 vertices in T′1 with degree greater than d1. It follows that there are at most t/d1 values of i ∈ [t] for which Si has degree greater than d1 in T1, and thus |I>d1| ≤ t/d1 as we wanted to show.
[Figure 4 caption: Note that the internal vertex labelled u is not part of T1[Si] for any i, so we merge it with an arbitrary adjacent vertex. In this case we merge u into S1 = {a, b, c}, which is why S1 has degree 1 in T1 but degree 2 in T′1.]
Similarly let J>d2 denote the set of indices j ∈ [s] for which Pj has degree greater than d2 in T2. By similar arguments as used for I>d1 above, we can show that |J>d2| ≤ s/d2.
Notice that for any i ∈ [t], if i is not in I, then either i ∈ I0, or i ∈ I>d1, or there exists j ∈ J>d2 such that Si = Pj. We therefore have that
$$|I| \ge t - |I_0| - |I_{>d_1}| - |J_{>d_2}| \ge t - \delta - \frac{t}{d_1} - \frac{s}{d_2}.$$
Now, using that t ≥ ((2d1d2 + d1)/(d1d2 − d1 − d2)) k, s = t + δ and δ ≤ k − 1, we have:
$$|I| \ge t\,\frac{d_1d_2 - d_1 - d_2}{d_1d_2} - \delta\left(1 + \frac{1}{d_2}\right) \ge \left(2 + \frac{1}{d_2}\right)k - (k - 1)\left(1 + \frac{1}{d_2}\right) = k + 1 + \frac{1}{d_2} > k,$$
as we needed to prove. To see that I can be constructed in polynomial time, it suffices to observe that the partition P1, . . . , Ps can be constructed in polynomial time (as φ2 can be found in polynomial time), and after this each Si can be checked for membership in I in polynomial time.

Proof of Theorem 1
Lemma 6. Let d1, d2 be positive integers such that d1d2 − d1 − d2 > 0. Let (T1, T2) be a pair of binary unrooted phylogenetic trees on X that are irreducible under Reduction Rules 1 and 2. Let k be a positive integer, and let c and t be integers with c ≥ 9(d1 + d2) − 11 and t ≥ ((2d1d2 + d1)/(d1d2 − d1 − d2)) k. If |X| ≥ 2ct, then dMP(T1, T2) ≥ k, and we can find a witnessing character in polynomial time.
Proof. By Lemma 4, there exists a partition S1, . . . , St of X, all spanning-disjoint in T1, and with |Si| ≥ c for all i ∈ [t]. Let χ be the character defined by S1, . . . , St. If χ is a witness to dMP(T1, T2) ≥ k, then we may return χ and we are done. Otherwise, we may apply Lemma 5 to find indices i_1, . . . , i_k such that:
• S_{i_1}, . . . , S_{i_k} are all spanning-disjoint in T2 (as well as in T1);
• each S_{i_j} has degree at most d1 in T1; and
• each S_{i_j} has degree at most d2 in T2.
Now for each S_{i_j}, we have that S_{i_j} has degree d1^j ≤ d1 in T1 and d2^j ≤ d2 in T2, that |S_{i_j}| ≥ c ≥ 9(d1 + d2) − 11 ≥ 9(d1^j + d2^j) − 11, and that (T1, T2) is irreducible under Rules 1 and 2. Thus we may apply Lemma 2, to find a conflicting quartet Qj ⊆ S_{i_j} for each i_j.
Finally, as S_{i_1}, . . . , S_{i_k} are spanning-disjoint in both T1 and T2, and as each Qj is a subset of S_{i_j}, we have that Q1, . . . , Qk are also spanning-disjoint in both T1 and T2. Therefore we may apply Lemma 3 to find a witnessing character for dMP(T1, T2) ≥ k. As each step of this process takes polynomial time, the construction of a witnessing character takes polynomial time.
It remains to complete the proof of Theorem 1, that is, to show that if |X| ≥ αk, then dMP(T1, T2) ≥ k, and that we can find a witnessing character, i.e. a character χ yielding $d_{MP_\chi}(T_1, T_2) \ge k$, in polynomial time.
In the appendix, we show that d1 = 4, d2 = 5 is in fact the optimal choice of values for d1 and d2.
As a corollary to Theorem 1 and Theorem 2, we have that dmp is fixed-parameter tractable with respect to dMP. Specifically, the kernel can be solved using the exponential-time algorithm described in [27], which computes the maximum parsimony distance of two trees on n leaves in time $O(1.619^n \cdot \mathrm{poly}(n))$.

Corollary 1. dmp has a kernel of size αk, and can be solved in time $O(1.619^{\alpha k} \cdot \mathrm{poly}(\alpha k) + \mathrm{poly}(n))$, with k = dMP(T1, T2).
For completeness, we clarify that these results also prove that the decision problems "dMP ≤ k?", "dMP ≥ k?" and "dMP = k?" can all be answered in time f (k) · poly(n). To answer "dMP ≤ k?", note that if the kernel has size at least α(k + 1) the answer is definitely NO, and otherwise the algorithm from [27] can be applied to compute dMP directly; this can then be compared to k to resolve the question. The "dMP ≥ k?" question can be answered by asking "dMP ≤ k − 1?" and negating the answer; and "dMP = k?" can be answered by combining the answers to the ≤ and ≥ questions.
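The case analysis above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the instance is taken to be already reduced by Reduction Rules 1 and 2, `kernel_size` is its number of taxa, and `exact_distance` is a hypothetical callable standing in for the exponential-time algorithm of [27]; none of these names come from the paper.

```python
from typing import Callable

ALPHA = 560  # kernel constant alpha used in the text


def at_most(kernel_size: int, k: int,
            exact_distance: Callable[[], int], alpha: int = ALPHA) -> bool:
    """Answer 'dMP <= k?' for an already-kernelized instance."""
    if kernel_size >= alpha * (k + 1):
        # A kernel with at least alpha*(k+1) taxa forces dMP >= k + 1.
        return False
    # Otherwise the kernel is small: compute dMP exactly and compare.
    return exact_distance() <= k


def at_least(kernel_size: int, k: int,
             exact_distance: Callable[[], int], alpha: int = ALPHA) -> bool:
    """Answer 'dMP >= k?' by negating 'dMP <= k - 1?'."""
    return not at_most(kernel_size, k - 1, exact_distance, alpha)


def equals(kernel_size: int, k: int,
           exact_distance: Callable[[], int], alpha: int = ALPHA) -> bool:
    """Answer 'dMP = k?' by combining the two one-sided questions."""
    return (at_most(kernel_size, k, exact_distance, alpha)
            and at_least(kernel_size, k, exact_distance, alpha))
```

Note that the exact solver is only invoked when the kernel has fewer than α(k + 1) taxa, which is what keeps the running time of the form f(k) · poly(n).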

A polynomial-time constant-factor approximation algorithm for dmp
We now show how a constant-factor approximation algorithm for dmp can be designed using Theorem 1 together with Reduction Rules 1 and 2.
In order to incorporate Reduction Rules 1 and 2 into our approximation algorithm, we require a way to construct a witnessing character for the original instance from a witnessing character for the reduced instance.
Lemma. Let (T1, T2) be an instance of dmp on X, and let (T′1, T′2) be an instance of dmp derived from (T1, T2) by an application of Reduction Rule 1 or 2, with T′1, T′2 trees on X′ ⊂ X. Then given a character χ′ on X′, we can derive a character χ on X in polynomial time such that $d_{MP_\chi}(T_1, T_2) \ge d_{MP_{\chi'}}(T'_1, T'_2)$.
Proof. First observe that by definition of the reduction rules, we may assume that T′1 = T1|X′ and T′2 = T2|X′ for some X′ ⊆ X. Assume without loss of generality that $l_{\chi'}(T'_2) \ge l_{\chi'}(T'_1)$, and let φ′1 be an optimal extension of χ′ to T′1. We will now define a function φ1 that assigns a colour to every vertex of T1 and satisfies $\Delta_{T_1}(\phi_1) = \Delta_{T'_1}(\phi'_1)$. Recall that T1|X′ is derived from the spanning tree T1[X′] by suppressing vertices of degree 2, and therefore T1[X′] can be derived from T′1 = T1|X′ by repeatedly subdividing edges with degree-2 vertices. Now construct φ1 as follows.
First set φ1(v) = φ′1(v) for every vertex v of T′1. For every edge e = uv of T′1 that gets subdivided with one or more degree-2 vertices, set φ1(u′) = φ′1(u) for each such degree-2 vertex u′. Thus, φ1 assigns a colour to every vertex in T1[X′], and by construction the number of edges of T1[X′] on which φ1 changes colour is $\Delta_{T'_1}(\phi'_1)$. In order to assign φ1(v) to vertices v of T1 not in T1[X′], take any edge e = uv in T1 such that φ1(u) has been assigned but φ1(v) has not, and set φ1(v) = φ1(u); repeat until no such edge remains. After completing this process, we have that φ1 assigns a colour to every vertex in T1 (including its leaves) and $\Delta_{T_1}(\phi_1) = \Delta_{T'_1}(\phi'_1)$, as required.
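The last step of this construction, copying colours outward from the already-coloured spanning subtree, can be illustrated with a minimal Python sketch. It assumes the tree is given as an edge list and that the coloured vertices form a connected subtree (as T1[X′] does); the function names and the toy tree are hypothetical and chosen only for illustration.

```python
from collections import deque


def colour_changes(edges, colour):
    """Count edges whose two endpoints receive different colours."""
    return sum(1 for u, v in edges if colour[u] != colour[v])


def propagate_colours(edges, partial):
    """Extend a colouring of a connected subtree to the whole tree.

    Each uncoloured vertex copies the colour of an already-coloured
    neighbour, so no new colour change is introduced.
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    colour = dict(partial)
    queue = deque(colour)
    while queue:
        u = queue.popleft()
        for v in adj.get(u, []):
            if v not in colour:
                colour[v] = colour[u]
                queue.append(v)
    return colour


# Toy tree on leaves a..e; the spanning subtree on X' = {a, b, c, d} is coloured.
edges = [("a", "p"), ("b", "p"), ("p", "q"), ("q", "c"),
         ("q", "r"), ("r", "d"), ("r", "e")]
partial = {"a": 0, "b": 0, "p": 0, "q": 1, "c": 1, "r": 1, "d": 1}
full = propagate_colours(edges, partial)
```

In this toy instance the only colour change lies on the edge between p and q, both before and after the leaf e receives its colour from r.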
Theorem 3. For any positive integer r, given an instance (T1, T2) of dmp, we can find in polynomial time a character χ such that

$$d_{MP_\chi}(T_1, T_2) \ge \frac{1}{(1 + 1/r)\,\alpha}\, d_{MP}(T_1, T_2),$$

where α = 560. That is, dmp has a constant factor approximation with approximation ratio (1 + 1/r)α.

Bounding the distance between dTBR and dMP
Tree Bisection and Reconnection (TBR) distance, denoted dTBR, is a distance measure defined on two unrooted binary phylogenetic trees T1, T2. It is defined as the minimum number of "TBR-moves" required to transform T1 into T2 (or vice-versa): it is a metric [1]. Informally, a TBR-move consists of deleting an edge of a tree and then reconnecting the two resulting components via a new edge. This definition is motivated by the way software for constructing phylogenetic trees heuristically navigates through tree space in search of better trees [37]. However, for algorithmic and analytical purposes dTBR is most interesting because of its equivalence to the agreement forest abstraction. An agreement forest of T1 and T2 on the same set of taxa X is a partition of X into blocks S1, S2, . . . , St such that: (1) for each i, T1|Si = T2|Si; (2) S1, S2, . . . , St are spanning-disjoint in T1 and in T2. An (unrooted) maximum agreement forest is an agreement forest with a minimum number of blocks, and dTBR(T1, T2) is equal to this minimum, minus 1 [1]. A maximum agreement forest for the two trees in Figure 1 consists of three blocks {a, b}, {f, g} and {c, d, e}, so here dTBR is 2.
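Condition (2) of the agreement forest definition can be checked mechanically. The following Python sketch assumes that "spanning-disjoint" means that the minimal subtrees T[Si] connecting the blocks are pairwise vertex-disjoint, and that a tree is given as an edge list; it does not check condition (1). The function names and the caterpillar tree used below are hypothetical, and the tree is not one of the trees of Figure 1.

```python
def spanning_subtree(edges, block):
    """Vertex set of the minimal subtree connecting the leaves in block."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    block = set(block)
    # Repeatedly prune vertices of degree <= 1 that are not in the block.
    removable = [v for v in adj if len(adj[v]) <= 1 and v not in block]
    while removable:
        v = removable.pop()
        if v not in adj:
            continue
        for w in adj.pop(v):
            adj[w].discard(v)
            if len(adj[w]) <= 1 and w not in block:
                removable.append(w)
    return set(adj)


def spanning_disjoint(edges, blocks):
    """True if the spanning subtrees of the blocks are pairwise vertex-disjoint."""
    used = set()
    for block in blocks:
        subtree = spanning_subtree(edges, block)
        if used & subtree:
            return False
        used |= subtree
    return True


# Hypothetical caterpillar tree on taxa a..g with internal path u1..u5.
caterpillar = [("a", "u1"), ("b", "u1"), ("u1", "u2"), ("c", "u2"),
               ("u2", "u3"), ("d", "u3"), ("u3", "u4"), ("e", "u4"),
               ("u4", "u5"), ("f", "u5"), ("g", "u5")]
```

On this caterpillar the blocks {a, b}, {c, d, e}, {f, g} are spanning-disjoint, whereas {a, c} and {b, d} are not, since both spanning subtrees pass through u1 and u2.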
The characterization of dTBR via agreement forests is significant, because agreement forests have opened the door to a large number of positive FPT and approximation results in the phylogenetics literature, and they have also attracted attention from outside phylogenetics. We refer to [42,17,40,14,10,36,4] for recent results. Moreover, a number of other problems have been shown to be FPT when parameterized by dTBR, by leveraging properties of the dTBR kernel [24] and/or showing that, via agreement forests, the treewidth of a certain auxiliary graph structure is bounded by a function of dTBR (see the next section) [29]. dTBR is a lower bound on many phylogenetic dissimilarity measurements [29], which helps to prove FPT results for these larger parameters, but what about dMP? It has previously been shown that dMP(T1, T2) ≤ dTBR(T1, T2) for any pair of trees T1, T2 [19,32]. However, the possibility remained that dMP could be arbitrarily smaller than dTBR, and this hinders our ability to relate dMP to other phylogenetic parameters. Our contribution is to show that dMP and dTBR are in fact within a constant factor of each other: dTBR(T1, T2) ≤ 2α dMP(T1, T2).
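To show this, we use the fortunate fact that Reduction Rules 1 and 2, which we used to prove the kernel bound for dmp, preserve dTBR as well as dMP. That is, if (T′1, T′2) is the instance derived from (T1, T2) by exhaustively applying Reduction Rules 1 and 2, then dTBR(T′1, T′2) = dTBR(T1, T2). This statement is, modulo a small modification, due to [1].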
Proof. Theorem 3.4 of [1] shows that dTBR is preserved under reduction rules similar to Reduction Rules 1 and 2, except that common chains are reduced to length 3 instead of 4. For a pair of trees T1, T2 on X, let (T′′1, T′′2) with leaf set X′′ be the instance derived from (T1, T2) by exhaustively applying these reduction rules. Also let (T′1, T′2) with leaf set X′ be the instance derived from (T1, T2) by exhaustively applying Reduction Rules 1 and 2. Observe that we may assume X′′ ⊆ X′ ⊆ X, since any leaf deleted in an application of Reduction Rule 1 or 2 can also be deleted by an application of one of the reduction rules in [1]. Furthermore, by Lemma 2.1 of [1], dTBR is non-increasing on subtrees induced by subsets of X, which implies that dTBR(T′′1, T′′2) ≤ dTBR(T′1, T′2) ≤ dTBR(T1, T2). As Theorem 3.4 of [1] states that dTBR(T′′1, T′′2) = dTBR(T1, T2), the chain of inequalities becomes a chain of equalities and hence dTBR(T′1, T′2) = dTBR(T1, T2).

The treewidth of the display graph
Let G = (V, E) be an undirected graph. A tree decomposition of G consists of a multi-set of bags, B = {B1, . . . , Bt} where each Bi ⊆ V , and a tree T whose nodes are in bijection with B, such that: (1) Every vertex v ∈ V is in at least one bag; (2) for every edge {u, v}, at least one bag contains both u and v, and (3) for every vertex v ∈ V , the bags of T that contain v induce a connected subtree of T . The width of the tree decomposition is equal to the size of its largest bag, minus one, and the treewidth of G is the minimum width, ranging over all tree decompositions T of G [7]. Treewidth derives its importance in combinatorial optimization from the fact that many NP-hard problems on graphs become fixed parameter tractable when parameterized by the treewidth of the graph [6].
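The three conditions of a tree decomposition translate directly into a small checker. The Python sketch below is an illustration only: bags are assumed to be given as a dictionary from (hypothetical) node identifiers to vertex sets, and the decomposition tree as a list of edges between those identifiers. It verifies a given decomposition and reports its width; it does not compute the treewidth.

```python
def induces_connected(nodes, tree_edges):
    """True if the given nodes are non-empty and connected via tree_edges."""
    nodes = set(nodes)
    if not nodes:
        return False
    adj = {n: set() for n in nodes}
    for a, b in tree_edges:
        if a in nodes and b in nodes:
            adj[a].add(b)
            adj[b].add(a)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        n = stack.pop()
        for m in adj[n] - seen:
            seen.add(m)
            stack.append(m)
    return seen == nodes


def is_tree_decomposition(vertices, edges, bags, tree_edges):
    """Check conditions (1)-(3) and that the bag nodes form a tree."""
    is_tree = (len(tree_edges) == len(bags) - 1
               and induces_connected(bags, tree_edges))
    covers_vertices = all(any(v in bag for bag in bags.values())
                          for v in vertices)
    covers_edges = all(any(u in bag and v in bag for bag in bags.values())
                       for u, v in edges)
    subtrees_connected = all(
        induces_connected([n for n, bag in bags.items() if v in bag],
                          tree_edges)
        for v in vertices)
    return is_tree and covers_vertices and covers_edges and subtrees_connected


def width(bags):
    """Width of a decomposition: size of its largest bag, minus one."""
    return max(len(bag) for bag in bags.values()) - 1
```

For example, the 4-cycle on a, b, c, d has a valid decomposition with the two bags {a, b, c} and {a, c, d} joined by one tree edge, which has width 2.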
Given two phylogenetic trees T1, T2 on X, where |X| ≥ 3, the display graph of T1 and T2, denoted D(T1, T2), is the graph obtained by identifying the leaves of T1 and T2 with the same label. A sequence of articles have studied the treewidth of display graphs, expressed as a function of various phylogenetic parameters, and used this to prove FPT results for a number of NP-hard phylogenetics problems using Courcelle's Theorem [11,29,22] and explicit dynamic programming algorithms running over tree decompositions of the display graph [5]. However, the question remained whether the treewidth of the display graph, denoted by tw(D(T1, T2)) could be bounded by a function of dMP (T1, T2) [24].
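The display graph defined above is straightforward to build. The following Python sketch assumes each tree is given as an edge list in which leaves are named by their taxon label; internal vertices are tagged with the index of their tree so that only leaves are identified. The function name and the two quartet trees are hypothetical examples.

```python
def display_graph(tree1_edges, tree2_edges, taxa):
    """Edge set of D(T1, T2): leaves with the same label are identified."""
    taxa = set(taxa)

    def relabel(edges, tag):
        def rename(v):
            # Leaves keep their taxon label; internal vertices become
            # (tag, name) so the two trees do not share them.
            return v if v in taxa else (tag, v)
        return {frozenset((rename(u), rename(v))) for u, v in edges}

    return relabel(tree1_edges, 1) | relabel(tree2_edges, 2)


# Two quartet trees on X = {a, b, c, d}: ab|cd and ac|bd.
t1 = [("a", "u"), ("b", "u"), ("u", "v"), ("c", "v"), ("d", "v")]
t2 = [("a", "u"), ("c", "u"), ("u", "v"), ("b", "v"), ("d", "v")]
graph = display_graph(t1, t2, {"a", "b", "c", "d"})
```

Here the display graph has 8 vertices (four shared leaves and two internal vertices per tree) and 10 edges, and every leaf has degree 2.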
Note that Theorem 7.2 of [28] shows an infinite family of trees where the treewidth of the display graph is 3 but dMP is unbounded.

Conclusion
A natural question is how far the analysis can be tightened, or changed, to improve the existing bound on the size of the kernel. In any case, it can be shown that for these two reduction rules a bound smaller than 20k − 12 is not possible. That is because the family of fully-reduced instances described in [26] has exactly 15k − 9 taxa, where in this specific case k = dTBR = dMP. By replacing the length-3 chains with length-4 chains in this family we obtain the bound 20k − 12. We expect that, in practice, the achieved reduction on realistic trees will be far superior to the bounds proven in this paper.
From the perspective of algorithm design it would be useful to design an explicit algorithm with FPT runtime that does not rely on kernelization; for example, by branching or by dynamic programming over an appropriately defined decomposition. Similarly, in the quest for small constant approximation factors it would be interesting to design polynomial-time approximation algorithms that do not rely on kernelization. It is unlikely that through kernelization we will be able to achieve such truly small constant ratios.
The precise relationship between dMP and dTBR remains intriguing. Although we have now established that they are within a constant factor of each other, we are still a long way from proving or disproving the conjecture that dMP ≥ (1/2)dTBR [24]. There are pairs of trees for which equality holds (see Figure 1, based on [24, Figure 5]), so dMP ≥ (1/2)dTBR would be the best possible bound.
On a slightly different note, recent publications have reduced the dTBR kernel size from 28k to 15k − 9 [26], and then to 11k − 9 [25]. The 11k − 9 kernel augments the two reduction rules discussed in this article with five new reduction rules. Which of these new reduction rules work (possibly in a modified form) for dMP, and how might this help us obtain a smaller linear kernel for dMP?
Finally, we note that there are several slight variations of dMP in the literature. These include the "asymmetric" version $d_{AMP}(T_1, T_2) := \max_\chi (l_\chi(T_1) - l_\chi(T_2))$, in which T1 is required to have the higher parsimony score, and the "restricted states" version $d^2_{MP}(T_1, T_2) := \max_\chi d_{MP_\chi}(T_1, T_2)$, where the maximum is taken over all characters with at most 2 states [29,23]. Many of the results in this article will go through for $d_{AMP}(T_2, T_1)$, as the characters we construct consistently give a larger score to T2. It is less obvious how our results impact on $d^2_{MP}$. In particular, it is not immediately clear whether the reduction rules described in [24] go through for $d^2_{MP}$, or how one would prove an analogue of Lemma 5 for $d^2_{MP}$. Relatedly, it is unclear how much smaller $d^2_{MP}$ can be than dMP itself. Specifically, how important are additional states when attempting to maximize the parsimony distance between trees? It is known that 7dMP − 5 states are sufficient to obtain a character that witnesses dMP [8], but it is unclear what happens below this bound.