Computing phylogenetic trees using topologically related minimum spanning trees

Choi et al. [2] introduced a minimum spanning tree (MST)-based method called CLGrouping for constructing tree-structured probabilistic graphical models, a statistical framework that is commonly used for inferring phylogenetic trees. While CLGrouping works correctly if there is a unique MST, we observe an indeterminacy in the method when there are multiple MSTs. We demonstrate the indeterminacy of CLGrouping using a synthetic quartet tree and a tree over primate genera. The indeterminacy of CLGrouping can be removed if the input MST shares a topological relationship with the corresponding phylogenetic tree. We introduce so-called vertex order based MSTs (VMSTs) that are guaranteed to have the desired topological relationship. We relate the number of leaves in the VMST to the degree of parallelism that is offered by CLGrouping. We provide polynomial-time algorithms for constructing VMSTs and for selecting a VMST with the optimal number of leaves.


Introduction
Phylogenetic trees are tree-structured edge-weighted graphical models of evolutionary relationships containing two types of vertices: labeled vertices that represent observed organisms, and hidden vertices that represent ancestral, unobserved organisms. Phylogenies are usually constructed from homologous genomic sequences, protein sequences, and/or encoded morphological characteristics. Edges of phylogenetic trees are weighted with the average number of sequence changes (substitutions) per site.
Phylogenetic tree inference can be viewed as a combinatorial optimization problem. The most commonly used optimization criteria are maximum likelihood (ML), maximum parsimony (MP), minimum least-squares error (MLS), minimum evolution (ME), and balanced minimum evolution (BME). ML and MP are character-based approaches, whereas ME and MLS are distance-based approaches. We briefly introduce each methodology below.
In the likelihood framework, phylogenies are modeled as tree-structured probabilistic graphical models. The objective in ML is to search for a combination of tree topology and edge lengths that maximizes the marginal likelihood of the sequence data. In an MP analysis the objective is to search for a tree topology that minimizes the total number of character changes over the edges of the tree. The MP problem was formalized by Foulds and Graham [9] as a Steiner minimal tree problem and was shown to be NP-complete. The ML problem was shown to be NP-hard by reduction from MP [3,22].
The objectives MLS and ME are defined in terms of edge lengths that are fitted using least-squares regression to measures of evolutionary distance such as the Jukes-Cantor distance [14]. The MLS problem is to find a tree that minimizes the sum of squared errors, and was shown to be NP-complete by Day [5]. The ME problem is to find a shortest tree, that is, a tree with the smallest sum of edge lengths. ME was shown to be NP-complete by Bastkowski et al. [1]. The objective BME is closely related to ME, and defines edge lengths using a special case of weighted least-squares [6]. BME was shown to be NP-complete by Fiorini et al. [8].
Due to the computational intractability of the optimization problems stated above, scalable methods in phylogenetics are designed using heuristics and perform local optimization, e.g., FastTree2 [19]. Neighbor joining (NJ) [23] is a widely used distance-based method that performs a greedy search to find a BME tree [10].
Choi et al. [2] introduced a distance-based method called Chow-Liu grouping (CLGrouping). Briefly, CLGrouping operates in two phases. The first phase involves constructing a minimum spanning tree using the pairwise distances. In the second phase, each non-leaf vertex $v$ of the MST is visited, and the subgraph that is induced by $v$ and the neighbors $N_v$ of $v$ is replaced by a phylogenetic tree over the vertex group $\{v\} \cup N_v$.
Choi et al. [2] showed that CLGrouping is more accurate than NJ at reconstructing phylogenetic trees with large diameter. Huang et al. [13] showed that CLGrouping affords a high degree of parallelism, because phylogenetic tree reconstruction can be performed independently for each vertex group.
During our attempt to implement CLGrouping we discovered that there are instances where the tree that is reconstructed using CLGrouping differs from the phylogenetic tree $T$, even if the input distances are the tree metric of $T$.

Our contributions
We show that the indeterminacy of CLGrouping is due to a lack of topological correspondence between the MST and the phylogenetic tree. We demonstrate the indeterminacy of CLGrouping using a synthetic quartet tree and a published primate phylogenetic tree. We introduce so-called vertex order based MSTs (VMSTs) that are guaranteed to remove the indeterminacy of CLGrouping. We relate the number of leaves in the VMST to the degree of parallelism that is offered by CLGrouping. We provide polynomial-time algorithms for constructing VMSTs and for selecting a VMST with the minimum number of leaves.

Terminology
A phylogenetic tree $T = (V_T = \{L_T, H_T\}, E_T)$ is an undirected edge-weighted acyclic graph with two types of vertices: labeled vertices $L_T$ that represent observed organisms, and unlabeled hidden vertices $H_T$ that represent unobserved organisms. Information, e.g., in the form of genomic sequences, is only present at labeled vertices. We refer to the edge weights of a phylogenetic tree as edge lengths. The length of an edge quantifies the estimated evolutionary distance between the sequences corresponding to the respective incident vertices. All edge lengths are strictly positive. A phylogenetic tree is a leaf-labeled tree if all the labeled vertices are leaves; otherwise, the phylogenetic tree is a generally labeled tree [15].
Each phylogenetic tree $T = (\{L_T, H_T\}, E_T)$ is equipped with a length function $w_T : E_T \to (0, \infty)$ and a unique tree metric $d_T : L_T \times L_T \to [0, \infty)$ over the labeled vertices that is defined as follows.
For each $u$ and $v$ in $L_T$,

$$d_T(u, v) = \sum_{\{i, j\} \in p_T(u, v)} w_T(i, j),$$

where $w_T(i, j)$ is the length of the edge $\{i, j\}$, and $p_T(u, v)$ is the alternating sequence of vertices and edges that are visited when traversing the unique path in $T$ from $u$ to $v$.
A set of distances is additive in $T = (V_T, E_T)$ if the corresponding length function $w_T : E_T \to (0, \infty)$ gives rise to these distances. The distance graph $G = (V_G, E_G)$ of a phylogenetic tree $T$ is the complete graph over $L_T$ with each edge $\{u, v\}$ in $E_G$ weighted with the additive distance $d_T(u, v)$. A minimum spanning tree (MST) of an edge-weighted graph is a tree that spans all vertices of the graph and has the minimum sum of edge weights.
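The definitions above can be illustrated with a minimal Python sketch. It computes the tree metric by summing edge lengths along tree paths and then builds the distance graph over the labeled vertices. The quartet tree used here (vertex names `l1` to `l4`, `h1`, `h2`, and unit edge lengths) is a hypothetical example chosen for illustration and is not necessarily the tree of Fig. 1.

```python
from itertools import combinations


def tree_metric(edges, labeled):
    """Return d_T(u, v) for all pairs of labeled vertices.

    edges: dict mapping (a, b) -> positive edge length of a tree.
    labeled: list of labeled vertices.
    """
    adj = {}
    for (a, b), w in edges.items():
        adj.setdefault(a, []).append((b, w))
        adj.setdefault(b, []).append((a, w))

    def distances_from(source):
        # Depth-first traversal; in a tree each vertex is reached by one path.
        dist = {source: 0.0}
        stack = [source]
        while stack:
            x = stack.pop()
            for y, w in adj[x]:
                if y not in dist:
                    dist[y] = dist[x] + w
                    stack.append(y)
        return dist

    d = {}
    for u in labeled:
        du = distances_from(u)
        for v in labeled:
            d[(u, v)] = du[v]
    return d


# Hypothetical quartet tree with unit edge lengths (an assumption).
quartet_edges = {
    ("l1", "h1"): 1.0, ("l2", "h1"): 1.0,
    ("h1", "h2"): 1.0,
    ("l3", "h2"): 1.0, ("l4", "h2"): 1.0,
}
leaves = ["l1", "l2", "l3", "l4"]

d = tree_metric(quartet_edges, leaves)
# Distance graph: complete graph over the labeled vertices.
distance_graph = {frozenset((u, v)): d[(u, v)] for u, v in combinations(leaves, 2)}
```

In this example the four edges that join `{l1, l2}` to `{l3, l4}` all have the same weight, so the distance graph has more than one MST.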
Each phylogenetic tree $T = (\{L_T, H_T\}, E_T)$ is equipped with a split function $|_T$ that is defined as follows. The split $|_T : E_T \to \{2^{L_T}, 2^{L_T}\}$ of an edge $\{u, v\}$ is defined as the collection $\{A, B\}$ of the disjoint sets of labeled vertices such that $\{u, v\}$ is contained in each path from a vertex in $A$ to a vertex in $B$. A split $\{A, B\}$ is said to be contained in a tree $T = (V_T, E_T)$ if there is an edge $\{u, v\}$ in $E_T$ such that $|_T(u, v) = \{A, B\}$. $A$ and $B$ are referred to as the sides of the split. The most balanced edge of a tree is the edge that induces a split such that the difference in the set sizes of the sides is minimal.
A phylogenetic tree can be rooted by introducing a new hidden vertex $\rho$ called the root, removing an edge $\{u, v\}$ and adding the edges $\{\rho, u\}$ and $\{\rho, v\}$, with new edge lengths satisfying $w(u, \rho) + w(\rho, v) = w(u, v)$. Rooting a tree constructs a directed acyclic graph in which each edge is directed away from the root.
A phylogenetic tree is an ultrametric tree if the tree can be rooted in such a way that all leaves are equidistant from the root.

Indeterminacy of Chow-Liu grouping
Choi et al. [2] introduced the procedure Chow-Liu grouping (CLGrouping) for constructing latent tree graphical models, a framework that is used for inferring phylogenetic trees. CLGrouping can be used for constructing phylogenetic trees from estimates of evolutionary distances. The authors show that CLGrouping is better at reconstructing phylogenetic trees with large diameter when compared to NJ. If the input distances are additive in the phylogenetic tree $T$, then the authors claim that CLGrouping correctly reconstructs $T$.
CLGrouping consists of two stages. The first stage involves the construction of an MST $M$ of $G$. The second stage iterates over the internal vertices of $M$ and, for each internal vertex $i$ that is visited, a vertex set $V_i$ comprising $i$ and the neighbors of $i$ is constructed. Subsequently a phylogenetic tree $T_i$ is constructed using distances between vertices in $V_i$. In the final step of the iteration, the graph in $M$ that is induced by $V_i$ is replaced by $T_i$ (see Fig. 1E for an illustration). If $i$ is not the first vertex to be visited, then $V_i$ may contain newly introduced hidden vertices. Let $h_j$ be a hidden vertex that was introduced when processing the labeled vertex $j$. The distance from $h_j$ to a labeled vertex $l$ in $V_i$ is computed as $d_{h_j l} = d_{jl} - d_{j h_j}$. The distance between two hidden vertices $h_j$ and $h_k$ is computed as $d_{h_j h_k} = d_{jk} - d_{j h_j} - d_{k h_k}$. The order in which the internal vertices are visited is not specified by the authors and does not seem to be important. CLGrouping terminates once all the internal vertices of $M$ have been visited once.
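As a minimal sketch of the bookkeeping in the second stage, the Python fragment below lists the vertex group of every internal vertex of an MST and evaluates the hidden-to-labeled distance formula stated in the text. The function names and the example MST (over hypothetical leaves `l1` to `l4`) are assumptions made for illustration; the tree reconstruction within each group is not shown.

```python
def vertex_groups(mst_edges):
    """Map each internal (non-leaf) MST vertex v to its group {v} plus neighbors."""
    adj = {}
    for u, v in mst_edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # A vertex is internal if it has more than one neighbor.
    return {v: {v} | nbrs for v, nbrs in adj.items() if len(nbrs) > 1}


def hidden_to_labeled(d_jl, d_jh):
    """Distance from hidden vertex h_j to labeled vertex l: d_jl - d_{j h_j}."""
    return d_jl - d_jh


# Hypothetical MST over four labeled vertices.
mst = [("l1", "l2"), ("l1", "l3"), ("l3", "l4")]
groups = vertex_groups(mst)

# If h1 was introduced while processing l1 at distance 1 from l1, and
# d(l1, l3) = 3, then the distance from h1 to l3 is 3 - 1 = 2.
d_h1_l3 = hidden_to_labeled(3.0, 1.0)
```

Here the internal vertices are `l1` and `l3`, so this MST yields two vertex groups.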
This procedure is called Chow-Liu grouping because the MSTs that are constructed using additive distances are topologically equivalent to Chow-Liu trees [4] for certain probability distributions. See Choi et al. [2] for further detail.
We demonstrate the indeterminacy of CLGrouping for the quartet tree $T$ (Fig. 1). For the corresponding distance graph $G$ of $T$, two MSTs of $G$, $M_U$ and $M_O$, were constructed by hand. $M_O$ is a vertex order based MST that was constructed using the order $l_1 < l_2 < l_3 < l_4$. $M_U$ is not a vertex order based MST. The intermediate steps and the final result of applying CLGrouping to $M_U$ and $M_O$ are shown in Fig. 1E and Fig. 1F, respectively. CLGrouping reconstructs the original phylogenetic tree if it is applied to the VMST $M_O$ but not if it is applied to $M_U$.

A primate phylogenetic tree
In this subsection we demonstrate the indeterminacy of CLGrouping using the phylogeny over the primate genera [18].
Fig. 2. Left: … [11,18]. Right: A phylogeny that was constructed by applying Chow-Liu grouping to an MST of the distance graph of $T$. The edges that are highlighted in red correspond to splits that are contained in one tree but not the other tree. The branches in each phylogeny are scaled in units of million years.

Methodological details
The primate phylogeny was downloaded from the TimeTree database, which is a comprehensive collection of published phylogenies [11,12,17]. The branches of the primate phylogeny represent calendar time and are scaled in units of million years. The primate phylogeny contains three branches of length zero that cannot be inferred from the corresponding tree metric. A modified primate phylogenetic tree $T$ was constructed by contracting all branches of length zero. The edges of the distance graph of $T$ were arranged in order of increasing weight, and edges with identical weight were randomly shuffled. One hundred MSTs were constructed by applying Kruskal's algorithm [16] to the edges that were ordered as described above. We implemented CLGrouping such that the phylogeny over each vertex group was constructed using NJ. We applied CLGrouping to each MST, and computed the topological distance between each output phylogeny and the primate phylogeny using the Robinson-Foulds (RF) distance [21]. The Robinson-Foulds distance is defined as the fraction of unique splits that are present in one tree and not the other. We selected a Chow-Liu grouping tree that maximizes the Robinson-Foulds distance from the primate phylogeny. The selected Chow-Liu grouping tree has an RF distance of 0.4 from the primate phylogeny and is shown in Fig. 2. To enable a visual comparison we rooted the Chow-Liu grouping tree at the midpoint of the most balanced edge. The primate phylogeny is an ultrametric tree and has been rooted such that the root is equidistant from the leaves. As can be seen, the two trees in Fig. 2 differ substantially.
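The sampling of MSTs described above can be sketched in Python as follows. The sketch shuffles the edge list and then applies a stable sort by weight, so that edges of identical weight end up in random relative order before Kruskal's algorithm is run. The function names, the fixed seed, and the small quartet distance graph are assumptions made for illustration; this is not the code that was used for the primate analysis.

```python
import random


def kruskal(vertices, ordered_edges):
    """Kruskal's algorithm on edges (u, v, w) that are already ordered."""
    parent = {v: v for v in vertices}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    mst = []
    for u, v, w in ordered_edges:
        ru, rv = find(u), find(v)
        if ru != rv:  # the edge joins two components, so no cycle is created
            parent[ru] = rv
            mst.append((u, v, w))
    return mst


def random_tie_mst(vertices, edges, rng):
    """Order edges by weight, with equal-weight edges in random order."""
    shuffled = list(edges)
    rng.shuffle(shuffled)
    shuffled.sort(key=lambda e: e[2])  # list.sort is stable
    return kruskal(vertices, shuffled)


# Hypothetical quartet distance graph with tied edge weights.
vertices = ["l1", "l2", "l3", "l4"]
edges = [("l1", "l2", 2.0), ("l3", "l4", 2.0), ("l1", "l3", 3.0),
         ("l1", "l4", 3.0), ("l2", "l3", 3.0), ("l2", "l4", 3.0)]

rng = random.Random(0)  # fixed seed, chosen arbitrarily
msts = [random_tie_mst(vertices, edges, rng) for _ in range(100)]
distinct = {frozenset(frozenset((u, v)) for u, v, _ in m) for m in msts}
```

Every sampled tree has the same total weight, but `distinct` may contain several different topologies, which is the source of the indeterminacy discussed in this section.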

Topological relationship between MSTs and phylogenetic trees
The correctness of CLGrouping depends on a topological relationship between MSTs and phylogenetic trees. To establish this relationship, Choi et al. [2] introduced the notion of a surrogate vertex.
The surrogate vertex of a hidden vertex is the closest labeled vertex with respect to distances defined on the phylogenetic tree. Choi et al. [2] claim that minimum spanning trees can be constructed by contracting all edges along the path between each hidden vertex and the corresponding surrogate vertex. Since the procedure that constructs the MST is not aware of the true phylogenetic tree, the surrogate vertex of each hidden vertex must be selected implicitly.
In the example shown in Fig. 1, the MST $M_O$ can be constructed by contracting the edges $\{h_1, l_1\}$ and $\{h_2, l_3\}$. Clearly there is no selection of surrogate vertices such that $M_U$ can be constructed by contracting the path between each hidden vertex and the corresponding surrogate vertex.
Let the surrogate vertex set $S(h)$ of a vertex $h$ be the set of all labeled vertices that are closest to $h$. Consider two hidden vertices $h_1$ and $h_2$ such that there are multiple labeled vertices, $l_1$ and $l_2$, that are common to the corresponding surrogate vertex sets $S(h_1)$ and $S(h_2)$. Choi et al. [2] assume that it is always possible to apply the following tie-breaking rule for implicitly selecting the corresponding surrogate vertices: a labeled vertex that is common to $S(h_1)$ and $S(h_2)$ (either $l_1$ or $l_2$) is selected as the surrogate vertex of both $h_1$ and $h_2$. This rule for selecting surrogate vertices cannot be applied in general. We demonstrate this with an example. For the tree shown in Fig. 3 we have $S(h_1) = \{l_1, l_2\}$, $S(h_2) = \{l_4, l_5\}$, and $S(h_3) = \{l_1, l_2, l_3, l_4, l_5\}$. There is no selection of surrogate vertices that satisfies the tie-breaking rule.

Vertex order based MSTs
To construct an MST that is guaranteed to have the desired topological correspondence with the phylogenetic tree, we propose the following definition of a surrogate vertex.
Definition 1 Given a phylogenetic tree $T = (V_T = \{L_T, H_T\}, E_T)$ and the corresponding tree metric $d_T$, let there be a total order $<_V$ over the set of all labeled vertices of $T$. The vertex order based surrogate vertex $s(v)$ of a vertex $v$ in $V_T$ is the labeled vertex in $L_T$ that is closest with respect to the tree metric $d_T$, and smallest with respect to the vertex order $<_V$. That is,

$$s(v) = \operatorname*{arg\,min}_{l \in L_T} \left( d_T(v, l),\ l_{<_V} \right),$$

where $l_{<_V}$ is the rank of $l$ in the order $<_V$, and the lexicographic order is applied to the ordered pair following "argmin" in the formula.
The inverse surrogate set $S^{-1}(l)$ of a labeled vertex $l$ is the set of all vertices whose surrogate vertex is $l$. Note that each labeled vertex is contained in its inverse surrogate set.
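Definition 1 can be sketched in Python as a minimum over pairs (distance, rank), which Python compares lexicographically. The quartet tree with unit edge lengths, the vertex names, and the function names are assumptions made for illustration only.

```python
def path_lengths(edges, source):
    """Sum of edge lengths on the unique tree path from source to each vertex."""
    adj = {}
    for (a, b), w in edges.items():
        adj.setdefault(a, []).append((b, w))
        adj.setdefault(b, []).append((a, w))
    dist, stack = {source: 0.0}, [source]
    while stack:
        x = stack.pop()
        for y, w in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + w
                stack.append(y)
    return dist


def surrogate(v, edges, order):
    """Closest labeled vertex to v; ties are broken by rank in `order`."""
    dist = path_lengths(edges, v)
    rank = {l: r for r, l in enumerate(order)}
    # Tuples compare lexicographically: distance first, then rank.
    return min(order, key=lambda l: (dist[l], rank[l]))


# Hypothetical quartet tree with unit edge lengths (an assumption).
quartet = {
    ("l1", "h1"): 1.0, ("l2", "h1"): 1.0,
    ("h1", "h2"): 1.0,
    ("l3", "h2"): 1.0, ("l4", "h2"): 1.0,
}
order = ["l1", "l2", "l3", "l4"]  # the vertex order l1 < l2 < l3 < l4

s = {v: surrogate(v, quartet, order) for v in order + ["h1", "h2"]}

# Inverse surrogate sets S^{-1}(l).
inverse = {}
for v, l in s.items():
    inverse.setdefault(l, set()).add(v)
```

With this order, `h1` is assigned to `l1` and `h2` to `l3`; reversing the order would assign them to `l2` and `l4` instead.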
To ensure that the surrogate vertices are selected on the basis of tree metric and vertex order, it is necessary that information pertaining to vertex order is used when selecting the edges of the MST. We use Kruskal's algorithm for constructing the desired MST. It is easy to modify other algorithms for constructing MSTs in such a way that vertex order is taken into account. Since Kruskal's algorithm takes as input a set of edges sorted with respect to edge weight, we modify the input by sorting edges with respect to edge weight and vertex order as follows.
Definition 2 Given an edge-weighted graph $G = (V, E)$ and a total order $<_V$ over the vertices in $V$, let $w(u, v)$ be the weight of the edge $\{u, v\}$. Edges in $E$ are sorted with respect to edge weight and vertex order using the lexicographic order that is defined below. Let the sorting be defined using the total order $<_E$. For each pair of edges $\{a, b\}$ and $\{c, d\}$ in $E$, $\{a, b\} <_E \{c, d\}$ if and only if
The modified algorithm for constructing a vertex order based MST (VMST) is described in Algorithm 1.
Algorithm 1 Constructing a vertex order based MST (VMST)
Input: $(G = (V, E), <_V)$
$E_{<_V}$ ← edges in $E$ ordered with respect to edge weight and vertex order
$M_{<_V}$ ← MST constructed by applying Kruskal's algorithm to $E_{<_V}$
Using the notion of VMSTs we will prove Lemma 1, and consequently show that the indeterminacy of CLGrouping can be removed if CLGrouping is applied to a VMST.
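A minimal Python sketch of the idea behind Algorithm 1 is given below. The sort key used here, (edge weight, smaller endpoint rank, larger endpoint rank), is an assumption about how edge weight and vertex order are combined lexicographically; it should be read as one plausible instance of the ordering, not as the exact definition. The quartet distances and all names are hypothetical.

```python
def vmst(vertex_order, weights):
    """Kruskal's algorithm with ties between equal-weight edges broken by vertex order.

    vertex_order: list of vertices, smallest first.
    weights: dict mapping (u, v) -> edge weight.
    """
    rank = {v: r for r, v in enumerate(vertex_order)}

    def key(edge):
        u, v = edge
        lo, hi = sorted((rank[u], rank[v]))
        # Assumed lexicographic key: weight, then endpoint ranks.
        return (weights[edge], lo, hi)

    parent = {v: v for v in vertex_order}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = set()
    for u, v in sorted(weights, key=key):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            tree.add(frozenset((u, v)))
    return tree


# Distance graph of a hypothetical quartet tree with unit edge lengths.
weights = {("l1", "l2"): 2.0, ("l3", "l4"): 2.0, ("l1", "l3"): 3.0,
           ("l1", "l4"): 3.0, ("l2", "l3"): 3.0, ("l2", "l4"): 3.0}

m_forward = vmst(["l1", "l2", "l3", "l4"], weights)
m_reverse = vmst(["l4", "l3", "l2", "l1"], weights)
```

For the order l1 < l2 < l3 < l4 the sketch selects the edge between `l1` and `l3`, which is the tree obtained by contracting each hidden vertex of the hypothetical quartet into its vertex order based surrogate vertex.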
Lemma 1 Adapted from parts (i) and (ii) of Lemma 8 in Choi et al. [2]. Given a phylogenetic tree $T = (V_T, E_T)$ and a total order $<$ over the labeled vertices in $T$, let $G = (V_G, E_G)$ be the distance graph of $T$. Let $M = (V_M, E_M)$ be the VMST constructed by applying Algorithm 1 to $(G, <)$. The surrogate vertex of each hidden vertex is defined with respect to the tree metric $d_T$ and a vertex order as given in Definition 1. $M$ is related to $T$ as follows.
1. If $l \in V_M$ and $h \in S^{-1}(l)$ such that $h \neq l$, then every vertex on the path in $T$ that connects $l$ and $h$ belongs to the inverse surrogate set $S^{-1}(l)$.
2. For any two vertices that are adjacent in $T$, their surrogate vertices, if distinct, are adjacent in $M$, i.e., for all $i, j \in V_T$ with $s(i) \neq s(j)$, $\{i, j\} \in E_T \implies \{s(i), s(j)\} \in E_M$.
Proof: (i). Assume that there is a vertex $u$ on the path between $h$ and $l$ such that $s(u) = k$ with $k \neq l$. There are seven ways to position $k$ with respect to $h$, $u$, and $l$ (see Figure 4). We only consider the general positions.
Figure 4: Each case specifies one of the seven possible positions of a labeled vertex $k$ with respect to hidden vertices $h$ and $u$, and a labeled vertex $l$. Hidden vertices are represented with white circles and labeled vertices are represented with black circles. Each dashed line represents a path between the two vertices at its end points. The condition on top of each solid arrow describes how the special cases can be constructed from the corresponding general cases.
For case 1 we have For case 2 we have For case 3 we have For case 4 we have .
(ii). Consider the edge $\{i, j\}$ in $E_T$ such that $s(i) \neq s(j)$. Let $V_i$ and $V_j$ be the sides of the split that is induced by the edge $\{i, j\}$, such that $V_i$ and $V_j$ contain $i$ and $j$, respectively. Let $L_i$ and $L_j$ be sets of labeled vertices that are defined as $V_i \cap V_M$ and $V_j \cap V_M$, respectively. From part (i) of Lemma 1 we know that $s(i) \in L_i$ and $s(j) \in L_j$. Consider the labeled vertices $l_i \in L_i \setminus \{s(i)\}$ and $l_j \in L_j \setminus \{s(j)\}$.
We have with equality holding only if
The cut property of MSTs states that, given a graph $G = (V, E)$ and a partition of $V$ into $V_1$ and $V_2$, each MST of $G$ contains one of the smallest edges (with respect to edge weight) which have one endpoint in $V_1$ and the other endpoint in $V_2$. Thus $M$ contains at most one of the following edges: $\{l_i, l_j\}$, $\{s(i), l_j\}$, $\{l_i, s(j)\}$ and $\{s(i), s(j)\}$. Note that the vertex order based MST $M$ is constructed using edges that are sorted with respect to edge weight and the vertex order $<_V$. Let the ordered set of edges be defined using the total order $<_E$ over $E$.
From equations (1) and (2) we have Thus, according to Definition 2, it follows that $\{s(i), s(j)\} <_E \{l_i, l_j\}$. Through a similar construction it can be shown that $\{s(i), s(j)\} <_E \{s(i), l_j\}$ and $\{s(i), s(j)\} <_E \{l_i, s(j)\}$. It follows that $\{s(i), s(j)\} \in E_M$. □
CLGrouping can be shown to be correct using Lemma 1 and the rest of the proof that was provided by Choi et al. [2]. The authors of CLGrouping provide a MATLAB implementation of their algorithm. The implementation takes as input a distance matrix which has the following property: the row index and the column index of each labeled vertex are equal. The MST that is constructed in the authors' implementation is a vertex order based MST. The vertex order is equal to the order over the column/row indices of the labeled vertices. The implementation provided by Choi et al. [2] correctly reconstructs the model tree even if there are multiple MSTs in the underlying distance graph.

Selecting optimal VMSTs
In the context of parallel programming, Huang et al. [13] showed that it is possible to parallelize CLGrouping by independently constructing phylogenetic trees over the vertex group associated with each non-leaf vertex, and merging them in order to construct the full phylogenetic tree. The step involving tree mergers requires a shared memory architecture.
Thus, with respect to parallelism, an optimal VMST would have the maximum number of vertex groups, and equivalently, the minimum number of leaves. In order to relate the shape of a phylogenetic tree to the number of leaves in a corresponding VMST, we consider ultrametric caterpillar trees and ultrametric balanced trees [25]. Ultrametric trees are leaf-labeled rooted phylogenetic trees satisfying the condition that the root is equidistant from all leaves. A caterpillar tree is a phylogenetic tree for which all non-leaf vertices are contained in a single path. A balanced tree is a rooted phylogenetic tree for which the path from each leaf to the root contains the same number of edges.

Tree shape
Consider an ultrametric caterpillar tree. There exists a corresponding VMST which has a star topology that can be constructed by contracting edges between each hidden vertex and one labeled vertex that is in the surrogate vertex set of each hidden vertex (see Fig. 5 A). A star-shaped VMST has only one vertex group, comprising all the vertices in the VMST, and does not afford any parallelism.
If, instead, the VMST were constructed by contracting edges between each hidden vertex $h$ and a labeled vertex that is adjacent to $h$, then the number of vertex groups would be $n - 2$, where $n$ is the number of labeled vertices in the phylogenetic tree. The resulting VMST would have the minimum number of leaves (two).
Consider a phylogenetic tree $T = (\{L_T, H_T\}, E_T)$ which is an ultrametric balanced tree. For each leaf $l_1$ in $L_T$ there is another leaf $l_2$ in $L_T$ such that $l_1$ and $l_2$ are adjacent to the same hidden vertex $h$ in $H_T$. Since $l_1$ and $l_2$ are closest to $h$, the surrogate vertex of $h$ is either $l_1$ or $l_2$. In each VMST of $T$, either $l_1$ or $l_2$ will be a leaf in the VMST. Since this is true for all leaves in $L_T$, each VMST of $T$ will have $|L_T|/2$ leaves (see Fig. 5 B).
Whether or not the phylogenetic trees that are estimated from real data are ultrametric depends on the set of organisms that are being studied. Genetic sequences that are sampled from closely related organisms have been estimated to undergo substitutions at a similar rate, resulting in ultrametric phylogenetic trees [7]. With respect to the phenomenon of adaptation by natural selection, phylogenetic trees are caterpillar-like if there is strong selection; the longest path from the root represents the best-adapted lineage.
In the next section we will present an algorithm for constructing a VMST with the minimum number of leaves.

Overview of our approach
Our approach to selecting optimal VMSTs makes use of three notions: (i) the maximum degree $\delta_{\max}$ of each vertex across all MSTs, (ii) the so-called MST union graph, which is a graph containing all the edges that are present in at least one MST, and (iii) a common structure over the MSTs that can be defined as a laminar family.
The intuition behind our approach is as follows. From Lemma 1 it follows that each non-leaf vertex of a VMST is a surrogate vertex. Thus we want to choose a vertex order such that we maximize the number of distinct surrogate vertices. In Section 7, we show that such a vertex order can be constructed by arranging vertices in order of non-decreasing $\delta_{\max}$. In Section 6 we show how the common laminar family and the MST union graph can be used to compute $\delta_{\max}$. The construction is exemplified graphically in Fig. 5.2.
On a related note, the general problem of selecting an MST with the minimum number of leaves (MLMST) is NP-complete by reduction from the Hamiltonian path problem. MLMST specializes the problem of finding spanning trees with minimum number of leaves, which is also NP-complete by a similar reduction [24].

A common laminar family
In this section we will prove the existence of a so-called common laminar family over the vertex set of an edge-weighted graph $G$. A collection $F$ of subsets of a set $S$ is a laminar family over $S$ if, for any two intersecting sets in $F$, one set contains the other. That is to say, for each pair of sets $A, B \in F$, either $A \cap B = \emptyset$, $A \subseteq B$, or $B \subseteq A$. The common laminar family defines a representation of a tree structure that is common to each MST of $G$. The notion of a laminar family has been utilized previously by Ravi and Singh [20] for designing an approximation algorithm for constructing a minimum-degree MST.
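The laminar condition can be checked directly from its definition. The following Python sketch tests every pair of sets; the function name and the example families are chosen for illustration only.

```python
from itertools import combinations


def is_laminar(family):
    """True if every two sets in the family are disjoint or nested."""
    sets = [frozenset(s) for s in family]
    return all(a.isdisjoint(b) or a <= b or b <= a
               for a, b in combinations(sets, 2))


nested = [{1}, {2}, {1, 2}, {3, 4}, {1, 2, 3, 4}]   # disjoint or nested sets
crossing = [{1, 2}, {2, 3}]                          # overlap without nesting

nested_ok = is_laminar(nested)
crossing_ok = is_laminar(crossing)
```

The pairwise check takes time quadratic in the number of sets, which is adequate for small examples.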
Semple and Steel [25] note that each rooted phylogenetic tree can be uniquely described as a laminar family over the set of labeled vertices. Laminar family representations of rooted phylogenies are used for comparing and combining information from multiple rooted phylogenetic trees. Later in this section we show that the laminar family representation of an ultrametric tree is equivalent to the common laminar family.

A structure that is common to all MSTs of a graph
Lemma 2 Given an edge-weighted graph $G = (V, E)$ with $k$ distinct weight classes $W = \{w_1, w_2, \ldots, w_k\}$, and an MST $M$ of $G$, let $F_i$ be the forest that is formed by removing all edges in $M$ that are heavier than $w_i$. Let $C_i$ be the collection comprising the vertex set of each component of $F_i$. Consider the collection $F_C$ which is constructed as follows:

$$F_C = \bigcup_{i=1}^{k} C_i.$$

The following is true:
1. $F_C$ is a laminar family over $V$.
2. Each vertex set in $F_C$ induces a connected graph in each MST of $G$.
Proof: (i). Consider any two vertex sets $V_1$ and $V_2$ in $F_C$. Let $w_1$ and $w_2$ be the weights of the heaviest edges in the subgraphs of $M$ that are induced by $V_1$ and $V_2$, respectively. Let $F_1$ and $F_2$ be the forests that are formed by removing all edges in $M$ that are heavier than $w_1$ and $w_2$, respectively. Let $C_1$ and $C_2$ be the collections comprising the vertex set of each component in $F_1$ and $F_2$, respectively.
By construction, we have Consider the case where then without loss of generality, let $w_1 < w_2$. $F_2$ can be constructed by adding to $F_1$ all edges in $M$ that are no heavier than $w_2$. The vertex set of each component in
(ii). Let $V_i$ be the vertex set of a component in the subgraph $G_i$ of $G$ that is created by removing all edges in $G$ that are heavier than $w_i$. It follows that $V_i$ induces a connected graph in each minimum spanning forest of $G_i$. Consider an MST $M$ of $G$. Removing all edges in $M$ that are heavier than $w_i$ constructs a minimum spanning forest $F$ of $G_i$. Thus $V_i$ induces a connected graph in $M$. It follows that $V_i$ induces a connected graph in each MST of $G$. By construction

Ultrametric trees
Ultrametric trees are rooted phylogenetic trees that satisfy the condition that the root is equidistant from the leaves. All ultrametric trees are leaf-labeled. Semple and Steel [25] note that the hierarchical structure of a rooted tree can be represented using a laminar family. We show that the laminar family $F_T$ that represents an ultrametric tree $T$ is equivalent to the laminar family $F_C$ that is common to all the MSTs of the distance graph associated with $T$.

Lemma 3
We are given an ultrametric tree $T$ and the corresponding distance graph $G$. Let $F_C$ be the laminar family that is common to each MST of $G$. Let $F_T$ be the laminar family representation of $T$. The following is true.
Proof: Consider a vertex set $S \in F_T$. Let $w$ be the largest distance between vertices in $S$. Consider the forest $F$ that is constructed by removing all edges in $G$ that are heavier than $w$. $S$ induces a connected component $C$ in $F$ since each pairwise distance between vertices in $S$ is not larger than $w$. Since the distance between each vertex in $S$ and each vertex in $V_T \setminus S$ is larger than $w$, it follows that $C$ does not contain any vertex that is not in $S$. Since the common laminar family $F_C$ contains the vertex set of each component in $F$, it follows that $S \in F_C$. Since this is true for each set in $F_T$, it follows that $F_T \subseteq F_C$.
Note that the laminar family representation $F_T$ of a rooted tree, and the corresponding common laminar family $F_C$, are not equivalent in general. See Fig. 7 for an example.
Fig. 7: The equivalence between the laminar family representation $F_T$ of a rooted phylogenetic tree, and the common laminar family $F_C$, is not true in general.

An algorithm for constructing the common laminar family and the MST union graph
In this subsection we present an algorithm for constructing the common laminar family and the MST union graph. The MST union graph of a graph $G$ is the subgraph of $G$ that contains all the edges that are present in at least one MST of $G$.
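One way to compute both objects is sketched below in Python. It is a Kruskal-style sweep over weight classes and is offered as an illustration under stated assumptions, not as a transcription of Algorithm 2. It relies on the standard fact that an edge belongs to at least one MST exactly when its endpoints are not already connected by strictly lighter edges. Including every singleton vertex set in the family is an assumption of this sketch, and weights are compared for exact equality.

```python
from itertools import groupby


def laminar_family_and_union_graph(vertices, edges):
    """Return (family, union_edges) for edges given as (u, v, weight) triples."""
    parent = {v: v for v in vertices}
    members = {v: frozenset([v]) for v in vertices}  # vertex set stored per root

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    family = set(members.values())  # singletons included (assumption)
    union_edges = set()
    by_weight = sorted(edges, key=lambda e: e[2])
    for _, group in groupby(by_weight, key=lambda e: e[2]):
        group = list(group)
        # First pass: test edges against components formed by lighter edges only.
        for u, v, _ in group:
            if find(u) != find(v):
                union_edges.add(frozenset((u, v)))
        # Second pass: merge components using this weight class.
        for u, v, _ in group:
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
                members[rv] = members[ru] | members[rv]
        # Record the vertex set of every current component.
        family.update(members[find(v)] for v in vertices)
    return family, union_edges


# Distance graph of a hypothetical quartet tree with unit edge lengths.
vertices = ["l1", "l2", "l3", "l4"]
edges = [("l1", "l2", 2.0), ("l3", "l4", 2.0), ("l1", "l3", 3.0),
         ("l1", "l4", 3.0), ("l2", "l3", 3.0), ("l2", "l4", 3.0)]
family, union_graph = laminar_family_and_union_graph(vertices, edges)
```

For this example the family consists of the four singletons, `{l1, l2}`, `{l3, l4}` and the full vertex set, and all six edges appear in the union graph because each of them lies in some MST.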
Algorithm 2 Construct the common laminar family $F_C$ and the MST union graph $G_U$.
Input:
Lemma 4 Given an edge-weighted graph $G = (V_G, E_G)$ with $k$ distinct weight classes $W = \{w_1, w_2, \ldots, w_k\}$, the outputs $F_C$ and $G_U$ of Algorithm 2 are the common laminar family of $G$, and the MST union graph of $G$, respectively.
Proof: Algorithm 2 adds edges to the singleton graph $M$ in order of increasing weight, in such a way that $M$ does not contain any cycles. From Kruskal [16] we know that $M$ is an MST of $G$.
Consider the forest $F_i$ that is constructed by removing all edges in $M$ that are heavier than $w_i$. By construction, $F_C$ includes the vertex set of each component of $F_i$. Let $C_i$ be the collection comprising the vertex set of each component of
Lemma 5 We are given a phylogenetic tree $T$ and the corresponding distance graph $G = (V, E)$. Let $F_C$ be the common laminar family of $G$. Let $G_U = (V_U, E_U)$ be the MST union graph of $G$. Let $h$ be a hidden vertex in $T$ such that there is a leaf $l$ in $S(h)$, and $h$ is adjacent to $l$. Let $V_i$ be a vertex set in $F$ and let $w_i$ be the corresponding edge weight. Then the following is true:
1. Let $N(v)$ be the set of all vertices that are adjacent to vertex $v$ in $G_U$. Let $C(v)$ be a smallest sub-collection of $F$ that covers $N(v)$ but not $v$. Among all MSTs, the maximum vertex degree $\delta_{\max}(v)$ of $v$ is $|C(v)|$.
Proof: (i). Let $C(v) = \{c_1, \ldots, c_m\}$ be a smallest sub-collection of $F$ that covers $N(v)$ and does not include $v$.
Let $C(v)$ contain a set $c_i$ that covers multiple vertices in $N(v)$. Let $j_1$ and $j_2$ be any two vertices in $c_i$. Let $w_i$ be the heaviest weight on the path between $j_1$ and $j_2$ in $M$. The edges $\{v, j_1\}$ and $\{v, j_2\}$ are heavier than $w_i$. If they were not, then we would have $v \in c_i$. Since $v$, $j_1$ and $j_2$ are on a common cycle, each MST of $G$ can only contain one of the two edges $\{v, j_1\}$ and $\{v, j_2\}$. It follows that, for each set $c_i \in C(v)$, each MST can contain at most one edge which is incident to $v$ and to a vertex in $c_i$. Thus the maximum number of edges that can be incident to $v$ in any MST is the number of vertex sets in $C(v)$, i.e., $\delta_{\max}(v) = |C(v)|$.
(ii). Let $N(l)$ and $N(v)$ be the set of all vertices that are adjacent to $l$ and $v$ in $G_U$, respectively. Let $j \in N(l) \setminus S(h)$. The weight of the edge $\{j, l\} \in E_U$ is given by $d_{jl}$. $d_{jh} > d_{vh}$ since $j \notin S(h)$. Thus $d_{lj} > d_{lv}$, and consequently $v \in N(l)$. We have that contains the edges $\{l, v\}$ and $\{l, h\}$. Consider the spanning tree $M'$ that is formed by removing $\{l, h\}$ from $E_M$ and adding $\{v, h\}$. $M'$ and $M$ have the same sum of edge weights. Thus we also have $j \in N(v)$. Consequently $N(l) \subseteq N(v)$. Let $C(l)$ and $C(v)$ be the smallest sub-collections of $F$ such that $C(l)$ covers $N(l)$ but does not contain $l$, and
$\delta_{\max}(i) \leftarrow \delta_{\max}(i) + 1$
$N_i \leftarrow N_i \setminus C$
$<^*$ ← A total order over $V$ such that $u <^* v \implies \delta_{\max}(u) \le \delta_{\max}(v)$
$M^*$ ← VMST constructed by applying Algorithm 1 to $(G, <^*)$
Output: $M^*$
Theorem 1 We are given a phylogenetic tree $T$ and the corresponding distance graph $G$. Let $M$ be the vertex order based MST that is computed using Algorithm 3. Among all VMSTs of $G$, $M$ has the minimum number of leaves.
Proof: Let $S(h)$ be the set of vertices that are closest to $h$ with respect to the tree metric $d_T$ that is associated with $T$. From Lemma 5(ii), we know that, if there is a leaf $l$ in $S(h)$, then among all vertices in $S(h)$, $\delta_{\max}(l)$ is smallest. By construction of $<^*$, among all vertices in $S(h)$, $l$ is the smallest vertex with respect to $<^*$. It follows that Algorithm 3 implicitly selects $l$ as the surrogate vertex of $h$. Since each leaf in $T$ is adjacent to at most one hidden vertex, the vertex order that is selected by Algorithm 3 maximizes the number of distinct leaves that are selected as surrogate vertices. Contracting the path in $T$ between a hidden vertex and the corresponding surrogate vertex increases the degree of the surrogate vertex. Thus, among all vertex order based MSTs, $M$ has the minimum number of leaves. □
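The quantity $\delta_{\max}(v) = |C(v)|$ used above can be sketched in Python as follows. Because the family is laminar, the inclusion-maximal sets that exclude $v$ are pairwise disjoint, so a smallest cover of the neighbors of $v$ consists of those maximal sets that contain at least one neighbor. The sketch assumes that the family contains every singleton vertex set; the literal family and union graph below belong to a hypothetical quartet example, and the function name is illustrative.

```python
def delta_max(v, family, union_edges):
    """Size of a smallest sub-collection of `family` covering N(v) but not v."""
    # Neighbors of v in the MST union graph.
    neighbours = {next(iter(e - {v})) for e in union_edges if v in e}
    # Sets that exclude v, reduced to the inclusion-maximal ones.
    candidates = [s for s in family if v not in s]
    maximal = [s for s in candidates if not any(s < t for t in candidates)]
    # Count the maximal sets that contain at least one neighbor.
    return sum(1 for s in maximal if s & neighbours)


# Hypothetical common laminar family and MST union graph of a quartet.
family = {frozenset(s) for s in (["l1"], ["l2"], ["l3"], ["l4"],
                                 ["l1", "l2"], ["l3", "l4"],
                                 ["l1", "l2", "l3", "l4"])}
union_edges = {frozenset(e) for e in (("l1", "l2"), ("l3", "l4"), ("l1", "l3"),
                                      ("l1", "l4"), ("l2", "l3"), ("l2", "l4"))}

degrees = {v: delta_max(v, family, union_edges) for v in ["l1", "l2", "l3", "l4"]}
# Vertex order by non-decreasing delta_max; Python's sort keeps ties in input order.
order = sorted(degrees, key=degrees.get)
```

In this example every vertex has a maximum degree of two across all MSTs, so any vertex order is consistent with the non-decreasing rule.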

Implementation details and time complexity analysis
Algorithm 3 takes as input an edge-weighted graph G = (V, E) and performs the following actions. First, the common laminar family F_C and the MST union graph G_U are constructed by applying Algorithm 2 to G. Subsequently, a vertex order <_V is computed on the basis of F_C and G_U. Finally, a VMST is constructed by applying Algorithm 1 to (G, <_V). Algorithms 1 and 2 are variants of Kruskal's algorithm and were implemented using a disjoint-set data structure with balanced Union, and Find with path compression [26]. The functions F_M and U_M correspond to a Find operation and a Union operation, respectively. A disjoint-set data structure can be represented as a forest with self-loops and directed edges. Each vertex points to its parent. The root of a component points to itself. A Find operation on a vertex redirects the edge that pointed to the vertex's former parent so that it points to the root of the component containing the vertex. A Union operation takes as input the roots of two components and creates an edge pointing from the root of the smaller component to the root of the larger component. The function C_M(u) is designed to return the set of vertices that are in the same component as u. C_M is implemented as follows. We store the vertex set of a component in the root of the component. Each time we perform a Union operation U_M(r_1, r_2), we combine the vertex sets and store the combined vertex set in the root of the component containing r_1 and r_2.
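The disjoint-set structure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `DisjointSet` and the method names `find`, `union` and `component` are hypothetical stand-ins for F_M, U_M and C_M, and component size is used as the balancing criterion.

```python
class DisjointSet:
    """Disjoint-set forest that also stores each component's vertex set at its root."""

    def __init__(self, vertices):
        self.parent = {v: v for v in vertices}     # each root points to itself
        self.members = {v: {v} for v in vertices}  # vertex set kept at each root

    def find(self, v):
        """Stand-in for F_M: return the root of v's component, compressing the path."""
        root = v
        while self.parent[root] != root:
            root = self.parent[root]
        # Second pass: point every vertex on the path directly at the root.
        while self.parent[v] != root:
            self.parent[v], v = root, self.parent[v]
        return root

    def union(self, r1, r2):
        """Stand-in for U_M: merge the components rooted at r1 and r2."""
        if r1 == r2:
            return r1
        # Balanced union: the root of the smaller component points to the larger.
        if len(self.members[r1]) < len(self.members[r2]):
            r1, r2 = r2, r1
        self.parent[r2] = r1
        # Combine the vertex sets and keep the result at the surviving root.
        self.members[r1] |= self.members.pop(r2)
        return r1

    def component(self, v):
        """Stand-in for C_M: return the vertex set of the component containing v."""
        return self.members[self.find(v)]
```

For example, after `ds = DisjointSet("abcd")` and `ds.union(ds.find("a"), ds.find("b"))`, the call `ds.component("b")` returns the set containing `a` and `b`. Because the smaller vertex set is always merged into the larger one, each vertex is copied O(log n) times over all Union operations.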
The main steps of Algorithms 1 and 2 are (i) sorting O(n^2) edges and (ii) performing O(n^2) Find operations and O(n) Union operations, where n is the number of vertices in V. Step (i) can be done using mergesort in time O(n^2 log n^2) = O(n^2 log n). Step (ii) takes time O(n^2 α(n^2, n)), where α is the inverse of Ackermann's function [26]. Since α(n^2, n) < log n, both algorithms complete their computations in time O(n^2 log n).
In addition to calling Algorithms 1 and 2, Algorithm 3 sorts the sets in F_C and computes δ_max for each vertex in V. F_C has O(n) sets, which can be sorted using mergesort in time O(n log n). For each vertex, δ_max can be computed in time O(n).
Thus the total time complexity of Algorithm 3 is O(n^2 log n).
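The δ_max computation and the resulting vertex order can be illustrated with a short sketch. It assumes that the common laminar family is supplied as a collection of vertex sets that includes the singleton sets, and that the MST union graph is supplied as a neighbor map; the function names `delta_max` and `vertex_order` are hypothetical, and ties in δ_max are broken by vertex label only to make the output deterministic. The set operations used here favor readability and do not attain the O(n) per-vertex bound stated above.

```python
def delta_max(laminar_family, union_neighbors):
    """For each vertex i, count the sets needed to cover the neighbors of i in
    the MST union graph, scanning sets from largest to smallest and skipping
    every set that contains i."""
    sets_by_size = sorted((frozenset(c) for c in laminar_family),
                          key=len, reverse=True)
    result = {}
    for i, neighbors in union_neighbors.items():
        remaining = set(neighbors)  # neighbors of i not yet covered
        count = 0
        for c in sets_by_size:
            if not remaining:
                break
            if i not in c and remaining & c:
                count += 1       # c is one of the covering sets
                remaining -= c   # its vertices are now covered
        result[i] = count
    return result


def vertex_order(delta):
    """Return the vertices in an order in which delta_max never decreases."""
    return sorted(delta, key=lambda v: (delta[v], v))


# Triangle with weights ab = 1, ac = 2, bc = 2: every edge lies in some MST.
family = [{"a"}, {"b"}, {"c"}, {"a", "b"}, {"a", "b", "c"}]
neighbors = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
d = delta_max(family, neighbors)
print(d)                # {'a': 2, 'b': 2, 'c': 1}
print(vertex_order(d))  # ['c', 'a', 'b']
```

In this toy graph, vertex c can have degree at most one in any MST, so it comes first in the order.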

JGAA, 21
Figure 1: The example used to demonstrate that CLGrouping may not reconstruct the correct tree if there are multiple MSTs. The phylogenetic tree T that is used in this example is shown in panel A. The distance graph G of T is shown in panel B. Two MSTs of G, M_O and M_U, are shown in panels C and D, respectively. M_O is a vertex-order based MST (VMST) and M_U is not a VMST. Panels E and F show the intermediate steps and the final result of implementing CLGrouping using M_O and M_U, respectively. CLGrouping reconstructs the original phylogenetic tree if it uses M_O but not if it uses M_U.

Figure 2: Left: The empirically established phylogeny T over primate genera [11,18]. Right: A phylogeny that was constructed by applying Chow-Liu grouping to an MST of the distance graph of T. The edges that are highlighted in red correspond to splits that are contained in one tree but not the other tree. The branches in each phylogeny are scaled in units of million years.

Figure 3: The phylogenetic tree that is used to demonstrate that the tie-breaking rule as defined by Choi et al. [2] cannot be applied in general.

Figure 4: The cases that were considered in the proof of Lemma 1 part (i). Each case specifies one of the seven possible positions of a labeled vertex k with respect to hidden vertices h and u, and a labeled vertex l. Hidden vertices are represented with white circles and labeled vertices are represented with black circles. Each dashed line represents a path between the two vertices at its end points. The condition on top of each solid arrow describes how the special cases can be constructed from the corresponding general cases.

Figure 5: Both panels show ultrametric trees (left) and VMSTs with the maximum and the minimum number of leaves (right) that are constructed by contracting corresponding edges that are highlighted in orange and blue, respectively. The difference between the maximum and the minimum number of leaves in VMSTs is largest for the caterpillar tree shown in panel A, and smallest for the balanced tree shown in panel B.

Figure 6: Panel A shows a generally labeled phylogenetic tree T with surrogate vertices selected such that the edge contraction would construct the VMST with the minimum number of leaves shown in Panel B. Panel C shows the VMST (in red) superimposed with the common laminar family and the MST union graph. Additionally, each vertex has been labeled with the corresponding δ_max.

Figure 7: The equivalence between the laminar family representation F_T of a rooted phylogenetic tree and the common laminar family F_C does not hold in general.

Constructing a VMST with the minimum number of leaves

Algorithm 3 Construct a minimum leaves VMST (MLVMST)
Input: G = (V, E)
F_C ← the common laminar family of G
F_C^≥ ← sets of F_C ordered in order of decreasing size
G_U ← the MST union graph of G
δ_max ← empty array
for i in V:
    N_i ← neighbors of i in G_U
    δ_max(i) ← 0
    for C in F_C^≥:
        if C ∩ N_i ≠ ∅ and C ∩ {i} = ∅:
            [...]

[...] are sorted in order of increasing weight
w_previous ← weight of the lightest edge in E_G
V_w ← ∅
Functions:
    C_M(v): Returns the vertex set of the component of M containing v
    F_M(v): Returns id of the component of graph M containing v
    U_M(u, v): Adds edge {u, v} to E_M and updates component ids
for {u, v} in E_G^≤:
    w_current ← weight of {u, v}
    if w_current > w_previous:
        for {u, v} in E_w [...]

[...] Lemma 2, we know that F_C is the common laminar family of G. E_U is constructed by adding the lightest edges that are incident to vertices in different components. The cut property of MSTs states that given a graph [...] one of the lightest edges which have one endpoint in V_1 and the other endpoint in V_2. It follows that each edge in E_U is present in at least one MST of G. □
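The construction of E_U described above can be sketched as a Kruskal-style pass that processes edges in groups of equal weight. This is an illustrative sketch rather than a transcription of Algorithm 2: the function name `mst_union_edges` and the `(weight, u, v)` input format are assumptions, the graph is assumed to be connected, ties are detected by exact equality of weights, and the union step is left unbalanced for brevity.

```python
from itertools import groupby


def mst_union_edges(vertices, weighted_edges):
    """Return the edges that appear in at least one MST of a connected graph.

    weighted_edges is an iterable of (weight, u, v) triples.
    """
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    union_edges = set()
    ordered = sorted(weighted_edges, key=lambda e: e[0])
    for _, group in groupby(ordered, key=lambda e: e[0]):
        # First pass: keep every edge of this weight whose endpoints lie in
        # different components formed by strictly lighter edges.
        crossing = [(u, v) for _, u, v in group if find(u) != find(v)]
        union_edges.update(frozenset(e) for e in crossing)
        # Second pass: merge the components joined by those edges.
        for u, v in crossing:
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
    return union_edges


# Triangle with weights ab = 1, ac = 2, bc = 2: both weight-2 edges are kept,
# because each of them completes one of the two MSTs.
edges = [(1, "a", "b"), (2, "a", "c"), (2, "b", "c")]
print(len(mst_union_edges("abc", edges)))  # 3
```

Deferring the merges until all edges of the current weight have been examined is what distinguishes this pass from ordinary Kruskal, which would discard the second weight-2 edge.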