Tanglegrams: a reduction tool for mathematical phylogenetics

Many discrete mathematics problems in phylogenetics are defined in terms of the relative labeling of pairs of leaf-labeled trees. These relative labelings are naturally formalized as tanglegrams, which have previously been an object of study in coevolutionary analysis. Although there has been considerable work on planar drawings of tanglegrams, they have not been fully explored as combinatorial objects until recently. In this paper, we describe how many discrete mathematical questions on trees "factor" through a problem on tanglegrams, and how understanding that factoring can simplify analysis. Depending on the problem, it may be useful to consider an unordered version of tanglegrams, and/or their unrooted counterparts. For all of these definitions, we show how the isomorphism types of tanglegrams can be understood in terms of double cosets of the symmetric group, and we investigate their automorphisms. Understanding tanglegrams better will isolate the distinct problems on leaf-labeled pairs of trees and reveal natural symmetries of spaces associated with such problems.


Introduction
Consider the problem of computing the subtree-prune-regraft (SPR) distance between two leaf-labeled phylogenetic trees. An SPR move cuts one edge of the tree and then reattaches the resulting rooted subtree at another edge (Figure 1). The SPR distance between two (phylogenetic, meaning leaf-labeled) trees T_1 and T_2 is the minimum number of SPR moves required to transform T_1 into T_2. This distance is of fundamental importance in phylogenetics, and many papers have been written both applying [1,2] and investigating properties of [3][4][5] this distance.
Say that we wanted to calculate the SPR distance between every pair of trees on a certain number of leaves. Naïvely this would require a large number of SPR calculations, namely the number of leaf-labeled phylogenetic trees choose two. However, the distance between two such trees does not depend on the actual labels of T_1 and T_2, so one can apply the same permutation to the leaf labels of both trees without changing the distance. Furthermore, a path made by intermediate trees between the two trees could also have its labels permuted in order to give a path between the trees with permuted leaf labels. Thus, problems like SPR distance do not concern the actual leaf labels as such, but rather use the leaf labels as markers that can be used to map leaves of one phylogenetic tree on to another: the problem and its solutions are actually defined in terms of a relative leaf labeling (Figure 1).
Figure 1. Two equivalent subtree-prune-regraft moves applied to trees which are identical up to relabeling. The number of such moves required to transform one tree into another only depends on the relative leaf labeling between the two trees.
Analogous discrete mathematics problems and objects defined in terms of tuples of labeled combinatorial objects, but without direct reference to the labels themselves, are ubiquitous in computational biology. Any distance between pairs of trees that is computed in terms of tree modifications, such as (rooted or unrooted) subtree-prune-regraft described above, nearest-neighbor-interchange and tree bisection and reattachment (see [4] for a review), satisfies this condition. Such moves are used as the basis of both maximum-likelihood heuristic search and Bayesian Markov chain Monte Carlo (MCMC) tree reconstruction. The corresponding graph, in which trees form vertices and a collection of moves form edges, has natural symmetries relating pairs of points which have the same relative labeling.
For example, hitting times of simple random walks on graphs formed by such moves for given start and end trees [6][7][8] are defined in terms of relative labelings between the start and end trees. The same is true for more complex random walks such as Markov chain Monte Carlo using a label-invariant likelihood, as would be used for sampling from a prior distribution on trees [9]. Graph characteristics such as Ricci-Ollivier curvature [10] under simple random walks or MCMC with a label-invariant likelihood are expressed in terms of relative tree labelings [11]. Analogous considerations hold for the problem of species delimitation, which can naturally be phrased in terms of inference of a partition of relatively labeled objects: neither distances between partitions [12] nor the graphs underlying MCMC over these partitions [13] actually refer to labels themselves.
The concept of a pair of rooted phylogenetic trees with a relative leaf labeling has been formalized as a tanglegram [14,15]. A tanglegram is a pair of trees on the same set of leaves with a bijection between the leaves in the two trees [16] (Figure 2). There has been extensive work on the problem of finding the layout of a given tanglegram in the plane that minimizes crossings, with the goal of most clearly visualizing co-evolutionary relationships between species [16][17][18][19][20][21].
Figure 2. A tanglegram (see also Figure 1), with the bijection shown in gray. When considered as a graph, the black edges are called tree edges, and the gray edges are called between-leaf edges.
However, we are not aware of any work considering tanglegrams as a convenient formalization of the notion of a relative leaf labeling in the context of studying pairs of labeled phylogenetic trees. There has also been little work enumerating or finding other properties of tanglegrams until recently [22]. In addition, more challenging and important problems in mathematical phylogenetics reduce to questions on relatively-labeled collections of more than two trees, and correspondingly one can extend the notion of tanglegram to more than two trees. For example, "supertree" methods reconstruct a tree from collections of trees, each of which is typically considered to express information about the larger tree [2,23,24], which in fact is a problem on multi-tree tanglegrams. The same is true for the minimal hybridization network [25] and maximum agreement subtree [26,27] problems. Thus many problems in the discrete mathematics of phylogenetic trees "factor" through a problem concerning a generalized version of a tanglegram.
With this motivation for studying tanglegrams in more depth, here we formalize more general notions of tanglegram, describe their symmetries, observe that tanglegrams have a convenient algebraic formulation as double cosets of the symmetric group, and provide some enumeration results for four types of tanglegram.

Tanglegrams
An unrooted binary tree T is a finite graph for which there is a unique path between every pair of vertices, and such that every non-leaf vertex has degree three. A rooted tree is an unrooted tree with a distinguished node called the root. We will also make the assumption common in phylogenetics that the root of a rooted tree has degree two, and that there are no degree-two nodes other than the root (if there is a root). The leaves L(T) of a tree T are the degree-one vertices of the tree. A tanglegram Y = (T, φ, S) is an ordered triple consisting of two trees T and S with the same number of leaves and a bijection φ : L(T) → L(S). The graph of the tanglegram Y is the graph formed from the union of T and S by adding an edge from each leaf x in T to the corresponding leaf φ(x) in S. We will distinguish these between-leaf edges from the tree edges of T and S (Figure 2).
We have defined tanglegrams in terms of ordered triples Y = (T, φ, S), so Y′ = (S, φ^{-1}, T) is a different tanglegram. This is a sensible definition when considering sequences of trees with an inherent directionality. However, often there is no such directionality, such as for subtree-prune-regraft moves, which are easily reversed. This motivates the following concept: an unordered tanglegram is a pair ({T, S}, φ), where {T, S} is an unordered set of two trees, and φ is a bijection between L(T) and L(S).
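As an illustration of these definitions, the following is a minimal Python sketch of the graph of an ordered tanglegram. It assumes trees are written as nested tuples and φ as a dictionary; the function names and node-naming scheme are hypothetical choices made for this example, not part of the paper or its accompanying code.

```python
def leaves(tree):
    """Leaf labels of a nested-tuple tree, read left to right."""
    if not isinstance(tree, tuple):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]


def tree_edges(tree, side):
    """Tree edges; internal nodes are named by their path from the root."""
    edges = []

    def name(node, path):
        if isinstance(node, tuple):
            return (side, "node", path)
        return (side, "leaf", node)

    def walk(node, path):
        if not isinstance(node, tuple):
            return
        for i, child in enumerate(node):
            edges.append((name(node, path), name(child, path + (i,))))
            walk(child, path + (i,))

    walk(tree, ())
    return edges


def tanglegram_graph(T, phi, S):
    """Graph of the ordered tanglegram (T, phi, S).

    Returns (tree edges, between-leaf edges) as two separate lists.
    """
    assert sorted(phi) == sorted(leaves(T)), "phi must be defined on L(T)"
    assert sorted(phi.values()) == sorted(leaves(S)), "phi must map onto L(S)"
    between = [(("T", "leaf", x), ("S", "leaf", phi[x])) for x in leaves(T)]
    return tree_edges(T, "T") + tree_edges(S, "S"), between


def reverse(T, phi, S):
    """The ordered tanglegram (S, phi^{-1}, T)."""
    return S, {y: x for x, y in phi.items()}, T


# Illustrative input: the first tree is the Newick tree (1,((2,3),((4,5),6)));
# the second tree and the bijection are made up for this example.
T = (1, ((2, 3), ((4, 5), 6)))
S = ((("a", "b"), "c"), (("d", "e"), "f"))
phi = {1: "a", 2: "b", 3: "c", 4: "d", 5: "e", 6: "f"}
tree_part, between_part = tanglegram_graph(T, phi, S)
```

For an unordered tanglegram one would treat `(T, phi, S)` and `reverse(T, phi, S)` as the same object.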

2.1. Automorphisms and tanglegram equivalence. Let V(X) denote the vertex set of a graph X. An isomorphism between unrooted trees T and S is a bijective map h : V(T) → V(S) in which h maps edges of T to edges of S. For a rooted tree, we add the requirement that an isomorphism must map the root node of T to the root node of S. An automorphism of a tree T is an isomorphism of T with itself. It is clear that the degree of a node (i.e. the number of adjacent nodes) is preserved under isomorphisms. In phylogenetics, it is common that the root of a tree is the only node of degree two. In this case, there is no distinction between isomorphisms of rooted trees and isomorphisms of these trees as unrooted trees because degrees are preserved under isomorphism.
We start with an "obvious" lemma, the proof of which can be found in the Appendix. First note that any isomorphism between trees T and S preserves the leaf sets L(T ) and L(S), and therefore induces a bijection between L(T ) and L(S).
Lemma 3. An isomorphism between (rooted or unrooted) trees T and S is uniquely determined by the induced bijection between L(T ) and L(S). In particular, an automorphism of a tree T is uniquely determined by the induced permutation of the leaf set L(T ).
Thus we will often consider an isomorphism as such a bijection L(T ) → L(S).
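The condition in the definition, h ∘ φ = φ′ ∘ g for g ∈ A(T) and h ∈ A(S), can be visualized in the commutative diagram
\[
\begin{array}{ccc}
L(T) & \xrightarrow{\ \varphi\ } & L(S) \\
{\scriptstyle g}\,\downarrow & & \downarrow\,{\scriptstyle h} \\
L(T) & \xrightarrow{\ \varphi'\ } & L(S)
\end{array}
\]
Note that if two tanglegrams Y_1 and Y_2 are isomorphic, then there is a 1-1 map from the graph of Y_1 to the graph of Y_2 which maps between-leaf edges to between-leaf edges.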

Symmetries of trees.
In order to describe the ensemble of tanglegrams it is necessary to review the symmetries of the trees in the tanglegram. Although this material is classical, we were not able to find a simple presentation, and so provide one here. We will assume familiarity with the basics of group theory (covered by dozens of textbooks, e.g. [28]). Automorphisms of a tree T form a group under composition. Using S n to denote the symmetric group on n objects, leaf automorphisms of T form a subgroup A(T ) of S |L(T )| .
To enumerate symmetries of trees it is convenient to use the notion of a wreath product; we will only define and use wreath product in the case when the acting group is S k . Use G k to denote the k-fold direct product G × · · · × G.
Given a group G, the wreath product G ≀ S_k of G by S_k can be described as the set G^k × S_k with the following group operation. First recall that the group operation on G^k is defined by applying G's group operation component-wise. An element of S_k acts on G^k by permuting the components, such that the group action of σ ∈ S_k on g ∈ G^k is the element σ(g) ∈ G^k with ith component g_{σ(i)}. Given elements g, g′ in G^k and σ, σ′ ∈ S_k, the wreath group law is:
\[ (g, \sigma)\,(g', \sigma') = \bigl(g\,\sigma(g'),\ \sigma\sigma'\bigr). \]
For rooted trees, Jordan [29] and Pólya [30] observed that the automorphism group of any rooted tree can be built by repeated direct products and wreath products of symmetric groups as follows. In the simplest case, assume a rooted tree T for which the root has two daughter subtrees T_1 and T_2. If T_1 and T_2 are isomorphic (and thus have the same automorphism groups), the automorphism group of T is the wreath product A(T_1) ≀ S_2. That is, its symmetry group is two copies of A(T_1) along with the symmetry exchanging T_1 and T_2, equipped with the group operation that appropriately exchanges the subtrees before applying symmetries to the subtrees. If T_1 and T_2 are not isomorphic, then
\[ A(T) = A(T_1) \times A(T_2). \]
Now let T be a tree whose root has some number of daughters, each of which is the root of a subtree, giving subtrees T_1, . . . , T_r. We can reorder and partition the subtrees into N blocks:
\[ T_1, \dots, T_{i_1} \;\mid\; T_{i_1+1}, \dots, T_{i_2} \;\mid\; \cdots \;\mid\; T_{i_{N-1}+1}, \dots, T_{i_N} \]
such that the subtrees in each block are isomorphic to one another and the subtrees in different blocks are not isomorphic. This defines integers i_1, . . . , i_N; take i_0 to be zero. A more general version of the argument above establishes
Theorem 5.
\[ A(T) \cong \prod_{j=1}^{N} A(T_{i_j}) \wr S_{i_j - i_{j-1}}. \]
This defines the automorphism group of a rooted tree recursively, where of course the automorphism group of a single leaf is trivial.
Example 6. Let T_n denote the perfectly balanced binary tree on 2^n leaves and let G_n = A(T_n). Then G_1 = S_2 and for each n, G_{n+1} = G_n ≀ S_2.
Example 7. The symmetry group of the Newick-format [31] tree (1,((2,3),((4,5),6))); (shown as the upper-left tree of Figure 1) is the direct product of the symmetry groups of (2, 3) and ((4, 5), 6). Each of these symmetry groups is S_2.
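The recursive description above yields the order of the leaf automorphism group directly, because a wreath product of G by S_k has |G|^k · k! elements. The following is a minimal Python sketch of that computation, assuming trees are given as nested tuples; the representation and the function names are illustrative choices, not taken from the paper's code.

```python
from collections import Counter
from math import factorial


def canonical(tree):
    """Canonical form of a rooted tree shape.

    Leaves become '*' and the children of every node are sorted, so two
    rooted trees are isomorphic exactly when their canonical forms agree.
    """
    if not isinstance(tree, tuple):
        return "*"
    return tuple(sorted((canonical(child) for child in tree), key=repr))


def aut_order(tree):
    """Order of the automorphism group of a rooted tree (any node degrees)."""
    if not isinstance(tree, tuple):
        return 1  # a single leaf has trivial automorphism group
    # Group the daughter subtrees into isomorphism classes.
    class_sizes = Counter(canonical(child) for child in tree)
    representative = {canonical(child): child for child in tree}
    order = 1
    for shape, k in class_sizes.items():
        # Each class of k isomorphic subtrees contributes |A(T_i)|^k * k!.
        order *= aut_order(representative[shape]) ** k * factorial(k)
    return order


print(aut_order((1, ((2, 3), ((4, 5), 6)))))   # tree of Example 7
print(aut_order(((1, 2), (3, 4))))             # balanced tree on four leaves
```

Leaf labels are ignored by `canonical`, so only the tree shape matters, as expected for an automorphism count.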
The automorphism group of an unrooted tree will become clear after we describe a classical and mathematically natural way to root an unrooted tree: at the centroid. Let T be a tree, and let x be a node of T . If we remove x as well as the edges attached to x from T , we obtain a number of disjoint connected and rooted subtrees, X 1 , . . . , X k .
Definition 8. The weight of x, w(x), is defined as the maximum number of nodes of the subtrees X 1 , . . . , X k .
The node x is said to be a centroid of T if w(x) is minimal over all nodes of T .
It is clear that any automorphism of T maps a centroid to a centroid, a fact which we will use to find a root fixed under leaf automorphism. Centroids are unique or nearly so, as shown by the following theorem, the proof of which can be found as a guided exercise in [32, §2.3.4.4].
Theorem 10 (Jordan, 1869). Every tree has either: 1. a unique centroid or 2. two adjacent centroids. In case 2, every automorphism either preserves the centroids or exchanges them.
Let T be an unrooted tree, and let T_r be the rooted tree formed by rooting T either at its unique centroid or, when there are two centroids, at a new node inserted in the edge joining them.
Corollary 11. The automorphism group of an unrooted tree T is identical to the automorphism group of the associated rooted tree T r .
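The centroid rooting can be computed in linear time from subtree sizes. The sketch below is a minimal Python illustration, assuming the unrooted tree is given as an adjacency dictionary; the function name and input format are assumptions made for this example.

```python
def centroids(adj):
    """Return the centroid(s) of a tree given as {node: [neighbours]}.

    The weight of a node is the largest number of nodes in any component
    left after deleting it; centroids are the nodes of minimal weight.
    """
    n = len(adj)
    start = next(iter(adj))
    parent = {start: None}
    order = [start]
    for v in order:  # breadth-first traversal; `order` grows as we go
        for u in adj[v]:
            if u not in parent:
                parent[u] = v
                order.append(u)
    # Subtree sizes with respect to the arbitrary rooting at `start`.
    size = {v: 1 for v in adj}
    for v in reversed(order):
        if parent[v] is not None:
            size[parent[v]] += size[v]
    weight = {}
    for v in adj:
        parts = [size[u] for u in adj[v] if parent[u] == v]
        parts.append(n - size[v])  # the component containing the parent
        weight[v] = max(parts)
    best = min(weight.values())
    return sorted(v for v in adj if weight[v] == best)


# Unrooted four-leaf tree: internal nodes "u" and "v" joined by an edge.
quartet = {
    "u": ["a", "b", "v"], "v": ["c", "d", "u"],
    "a": ["u"], "b": ["u"], "c": ["v"], "d": ["v"],
}
print(centroids(quartet))
```

When two centroids are returned they are adjacent, and the rooted tree T_r is obtained by subdividing the edge between them.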

Double cosets and enumeration of tanglegrams.
We are now ready to algebraically describe the set of tanglegrams on a pair of n-leaf trees. Assume n-leaf trees T and S, which are both rooted or both unrooted. Arbitrarily mark the elements of the leaf sets L(T) and L(S) with the same set of n symbols, such that we can identify both A(T) and A(S) as subgroups of S_n. Using this same marking, we can also think of the bijections from L(T) to L(S) as being elements of S_n, thus these elements of S_n give tanglegrams on T and S. Recall Definition 4, stating that the bijections φ′ giving the same tanglegram as a given φ are those for which there exist automorphisms g ∈ A(T) and h ∈ A(S) such that h ∘ φ = φ′ ∘ g. This criterion is equivalent to φ′ = hφg^{-1} as group elements in S_n. The set of elements satisfying such a criterion is called a double coset [28].
Definition 13. Given a subgroup J of a group G and g ∈ G, the right coset Jg (resp. left coset gJ) of J in G is the set of elements of the form {jg | j ∈ J} (resp. {gj | j ∈ J}). The number of right cosets of J in G is equal to the number of left cosets. This number is defined as the index of J in G and is denoted [G : J]. Given two subgroups J and K of G, the double coset JgK for some g ∈ G is the set of elements {jgk | j ∈ J, k ∈ K}.
Any two right (left) cosets of J in G are either identical or disjoint and the number of elements in any coset is the same, i.e. |J|. In contrast to single cosets (left or right), the number of elements in a double coset may vary. We state these observations, and the equivalent observations in the unordered case, as a proposition.  Using the properties of double cosets, we find that the number of single cosets in the double coset GwG is the index [G : G ∩ w −1 Gw] = 2. Thus this double coset has 16 elements, and so there must be two double cosets, corresponding to the two tanglegrams.
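For small n the double cosets can be listed by brute force. The following Python sketch does so for the balanced rooted tree on four leaves, under the assumption that G in the computation above is the leaf automorphism group of that tree (the surrounding example does not survive in full here, so this identification is a guess); permutations are stored as tuples and all function names are illustrative.

```python
from itertools import permutations


def compose(p, q):
    """Composition p∘q of permutations stored as tuples: i -> p[q[i]]."""
    return tuple(p[i] for i in q)


def closure(generators, n):
    """Subgroup of S_n generated by the given permutations."""
    identity = tuple(range(n))
    group = {identity}
    frontier = [identity]
    while frontier:
        g = frontier.pop()
        for s in generators:
            h = compose(s, g)
            if h not in group:
                group.add(h)
                frontier.append(h)
    return group


def double_cosets(J, K, n):
    """All double cosets J w K in S_n, each returned as a set."""
    seen, cosets = set(), []
    for w in permutations(range(n)):
        if w not in seen:
            dc = {compose(compose(j, w), k) for j in J for k in K}
            seen |= dc
            cosets.append(dc)
    return cosets


# Leaf automorphisms of the balanced tree ((0,1),(2,3)): swap within each
# cherry, and exchange the two cherries.
G = closure([(1, 0, 2, 3), (0, 1, 3, 2), (2, 3, 0, 1)], 4)
sizes = sorted(len(dc) for dc in double_cosets(G, G, 4))
print(len(G), sizes)
```

The two double cosets found this way, of sizes 8 and 16, partition the 24 elements of S_4, in agreement with the count of two tanglegrams above. For larger n a computer algebra system is a better tool than this enumeration.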

Symmetries of tanglegrams.
Definition 17. An automorphism of an ordered tanglegram Y is an automorphism of the graph of Y which maps each tree to itself. An automorphism of an unordered tanglegram Y is an automorphism of the graph of Y which preserves the between-leaf edges, so an automorphism of an unordered tanglegram either maps each tree to itself or switches the two trees. If Y is a rooted tanglegram, then an automorphism of Y is required to preserve the roots of the two trees.
If the automorphism f : Y → Y exchanges the two trees, f is described by a pair of isomorphisms: g_1 : T → S and g_2 : S → T. For any leaf x of T, the bijective pair (x, φ(x)) must map to another bijective pair (g_2(φ(x)), g_1(x)). This implies that g_1(x) = φ(g_2(φ(x))), and thus in general that g_1 = φ ∘ g_2 ∘ φ. If we put the same set of distinguishing marks on the leaves of the trees T and S, we may consider the bijection φ to be an element of the symmetric group S_n. With these conventions, we have shown that there exist g_1 ∈ A(T) and g_2 ∈ A(T) such that g_1 = φ g_2 φ as group elements when there is an automorphism that switches the two trees. The converse follows from reversing this argument. In summary:
Proposition 18. If Y is an unordered tanglegram, then there exists an automorphism of Y that switches the two trees if and only if:
• the trees T and S are isomorphic, and
• there exist g_1, g_2 ∈ A(T) such that g_1 = φ g_2 φ as group elements.
On the other hand, if f : Y → Y is an automorphism which maps each tree to itself, then f is described by two automorphisms g : T → T and h : S → S satisfying φ ∘ g = h ∘ φ when restricted to the leaves, or g = φ^{-1} h φ as elements of the symmetric group.
Similar to the case for trees, tanglegram automorphisms are determined entirely by their action on the leaves of one of the trees. We define a labeled tanglegram to be a tanglegram together with a labeling of the leaves of one of its trees by distinct labels. This is analogous to the definition of a leaf-labeled phylogenetic tree [34]. The other tree can be considered to be labeled by the composition of the labeling with the bijection. Applying this labeling to both trees and then forgetting the bijection gives a pair of leaf-labeled trees on the same label set, and each such pair of leaf-labeled trees obviously determines a labeled tanglegram. Thus, labeled tanglegrams are in one-to-one correspondence with pairs of leaf-labeled phylogenetic trees. If the tanglegram is ordered, then this is an ordered pair of trees, and if unordered it is unordered.
It is natural to ask how many distinct labeled n-tanglegrams have the same underlying ordered or unordered tanglegram. Each leaf has a distinct label, such that the symmetric group acts freely on these labels. By the orbit-stabilizer theorem, Proposition 21. The number of leaf-distinct labelings of a given n-tanglegram Y is equal to n!/|A(Y )|. This is true for ordered and unordered tanglegrams, using their respective automorphism definitions. For example, there are 12 labelings for the ordered tanglegram (1,(2,(3,4))); (((1,2),3),4); but only 6 when considered as an unordered tanglegram.
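Proposition 21 can be checked by brute force on a small case. The Python sketch below handles ordered tanglegrams only, using the characterization g = φ^{-1} h φ of tree-preserving automorphisms; the choice of the balanced four-leaf tree, the tuple encoding of permutations, and the function names are assumptions made for this illustration.

```python
from math import factorial


def compose(p, q):
    """Composition p∘q of permutations stored as tuples: i -> p[q[i]]."""
    return tuple(p[i] for i in q)


def inverse(p):
    q = [0] * len(p)
    for i, image in enumerate(p):
        q[image] = i
    return tuple(q)


def closure(generators, n):
    """Subgroup of S_n generated by the given permutations."""
    identity = tuple(range(n))
    group = {identity}
    frontier = [identity]
    while frontier:
        g = frontier.pop()
        for s in generators:
            h = compose(s, g)
            if h not in group:
                group.add(h)
                frontier.append(h)
    return group


def ordered_automorphisms(A_T, A_S, phi):
    """Leaf automorphisms g of T for which phi∘g∘phi^{-1} lies in A(S)."""
    phi_inv = inverse(phi)
    return {g for g in A_T if compose(compose(phi, g), phi_inv) in A_S}


def labelings(A_T, A_S, phi):
    """n!/|A(Y)| for the ordered tanglegram Y = (T, phi, S)."""
    return factorial(len(phi)) // len(ordered_automorphisms(A_T, A_S, phi))


# Both trees are the balanced tree ((0,1),(2,3)).
G = closure([(1, 0, 2, 3), (0, 1, 3, 2), (2, 3, 0, 1)], 4)
parallel = labelings(G, G, (0, 1, 2, 3))   # phi is the identity
crossed = labelings(G, G, (0, 2, 1, 3))    # phi exchanges leaves 1 and 2
print(parallel, crossed)
```

The two counts sum to 9, which equals the number of ordered pairs of leaf-labeled balanced trees on four leaves (3 × 3), as the correspondence between labeled tanglegrams and pairs of labeled trees requires.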
Given a means of sampling uniformly from tanglegrams [22], we can use this proposition to obtain a weighted sampling scheme for the uniform distribution across pairs of phylogenetic trees on the same labeling set. For example, assume we wanted to approximate the expectation of a function f on uniformly sampled pairs of labeled trees, but which is constant on pairs of trees that make the same tanglegram (such as SPR distance). Then where if f (T 1 , T 2 ) = f (T 2 , T 1 ) for all T 1 , T 2 then the right hand sum can be over unordered tanglegrams Y , and otherwise it is over ordered tanglegrams Y . Here P(T 1 , T 2 |Y ) is simply the indicator function expressing if T 1 and T 2 make Y , divided by the number of pairs of labeled trees making Y as enumerated in Proposition 21. Rather than sampling pairs of trees uniformly and calculating an empirical expectation as on the left side, we can get a lower variance estimator by sampling tanglegrams uniformly and weighting them as on the right hand side. Such a means of sampling uniformly from tanglegrams in the rooted binary ordered case is given in [22].
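One way to write out the weighting explicitly is sketched below. It is stated for ordered pairs and ordered tanglegrams, and it assumes that f is constant on each tanglegram class; the notation f(Y) for that common value and the sample Y_1, ..., Y_K of uniformly drawn tanglegrams are introduced here for illustration and may differ from the paper's own display.

```latex
% Each tanglegram Y accounts for n!/|A(Y)| ordered pairs of labeled trees
% (Proposition 21), so the uniform expectation over pairs is a weighted
% average over tanglegrams; the factor n! cancels.
\[
  \mathbb{E}\bigl[f(T_1,T_2)\bigr]
  \;=\; \frac{\sum_{Y} \frac{n!}{|A(Y)|}\, f(Y)}{\sum_{Y} \frac{n!}{|A(Y)|}}
  \;=\; \frac{\sum_{Y} f(Y)/|A(Y)|}{\sum_{Y} 1/|A(Y)|}.
\]
% A self-normalized estimator from K tanglegrams drawn uniformly at random:
\[
  \widehat{\mathbb{E}}[f]
  \;=\; \frac{\sum_{k=1}^{K} f(Y_k)/|A(Y_k)|}{\sum_{k=1}^{K} 1/|A(Y_k)|}.
\]
```

Only the automorphism group orders |A(Y_k)| are needed as weights, so the estimator never has to enumerate labeled trees.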

Variants and special cases
3.1. Multiple trees. The definition of a tanglegram on two trees can be generalized to a version on multiple trees. Definition 22. Given trees T_1, . . . , T_n with the same number of leaves, a multitanglegram on this set of trees is given by a pair of tuples ((T_1, . . . , T_n), (φ_ij)_{i,j∈{1,...,n}}) in which φ_ij : L(T_i) → L(T_j) are bijections satisfying: 1. φ_ii is the identity for all i; 2. φ_ji = φ_ij^{-1} for all i, j; 3. φ_ik = φ_jk ∘ φ_ij, for all i, j, k.
We can also generalize the definition of isomorphism to multi-tanglegrams on n trees. Definition 23. Two multi-tanglegrams Y = ((T_1, . . . , T_n), (φ_ij)_{i,j∈{1,...,n}}) and Y′ = ((T_1, . . . , T_n), (φ′_ij)_{i,j∈{1,...,n}}) on the same list of trees are isomorphic if there exist automorphisms (g_i : T_i → T_i)_{i∈{1,...,n}} such that φ′_ij ∘ g_i = g_j ∘ φ_ij for all i, j. It is clear that the n^2 bijections φ_ij are completely determined by the n − 1 bijections φ_1i for i = 2, . . . , n, since φ_ij = φ_1j ∘ φ_1i^{-1}. With this observation, we can rephrase the definition of isomorphism above, which we will state as a proposition: Proposition 24. Using the notation above, multi-tanglegrams Y and Y′ are isomorphic if and only if there exist automorphisms g_i ∈ A(T_i), i = 1, . . . , n satisfying φ′_1i = g_i φ_1i g_1^{-1} for i = 2, . . . , n. Alternatively, the bijections φ_ij are completely determined by a sequence φ_12, φ_23, . . . , φ_{n−1,n}, and thus multi-tanglegrams are called tangled chains by [22].

3.2. More general classes of graphs. Another direction of generalization involves considering more general classes of graphs. For example, the tanglegram layout problem has been studied for rooted phylogenetic networks [35]. Given a natural number n, define an n-leaved graph as a graph U along with n distinguished vertices L(U) called leaves.
Definition 25. Given a natural number n, define a generalized n-tanglegram as a triple (U, φ, V ), where U and V are a pair of n-leaved graphs and φ is a bijection between L(U ) and L(V ).
Equivalent statements to those above can also hold in this more general setting. If we require that n-leaved graph automorphisms preserve the leaf set L(U ), we can again define the leaf automorphism group A(U ) to be the automorphism group of U restricted to L(U ). If the graphs are such that any graph automorphism is determined by its action on the leaf set, then generalized tanglegrams on a given pair of n-leaved graphs U and V are in one-to-one correspondence with double cosets A(V )wA(U ) in S n .

3.3. Partitions. Another line of inquiry in computational evolutionary biology concerns species delimitation, which can naturally be phrased in terms of inference of a partition of labeled objects. In a manner analogous to phylogenetic trees, researchers use MCMC to explore the posterior on such partitions [13], and comparison of the results can be performed using distances between the partitions [12]. Similar considerations hold for random walks and these distances as described in the introduction for trees. These partitions can also be thought of as a certain type of leaf-labeled tree of height two, thus pairs of partitions on the same underlying set also give a type of tanglegram.
All of the above conclusions hold for such partition tanglegrams as well. The automorphisms of a partition are a special case of Theorem 5. For example, the partition 123 | 456 | 78 has automorphism group (S_3 ≀ S_2) × S_2.

Enumeration
Using a computer algebra package such as GAP4 [36] which is able to enumerate double cosets, and a package such as Sage [37] which can obtain symmetry groups of graphs, one can apply Proposition 14 to directly enumerate any type of tanglegram on a given pair of trees. We have provided code to enumerate and work with tanglegrams at https://github.com/matsengrp/tangle.
For the case of binary ordered rooted tanglegrams, an elegant formula for the total number of tanglegrams on n leaves t n has recently been found [22]. One can use this formula, along with the number of tanglegrams on pairs of isomorphic trees, to compute the number of unordered tanglegrams as follows.
An unordered tanglegram is represented twice in the list of ordered tanglegrams on n leaves if the two trees are non-isomorphic, or if the trees are isomorphic and the coset is different when the representative is inverted as in Figure 4. For n leaves, we let s_n be the number of unordered tanglegrams, and then let t_n^iso be the number of ordered tanglegrams and s_n^iso the number of unordered tanglegrams on isomorphic pairs of trees. To get s_n, we start with t_n and subtract off half the number of ordered tanglegrams on non-isomorphic trees for the first case, and then subtract off t_n^iso − s_n^iso for the second. Simplifying t_n − (t_n − t_n^iso)/2 − (t_n^iso − s_n^iso), we get s_n = (t_n − t_n^iso)/2 + s_n^iso for any n ≥ 3. Such direct enumeration of various types of tanglegrams (Figure 5, Table 1) suggests that their number grows super-exponentially. In fact, the number of (binary ordered rooted) tanglegrams is O(n! 4^n n^{-3}), as shown by [22].
There are thus many fewer such tanglegrams than there are pairs of leaf-labeled trees. Indeed, a simplification of the argument establishing Corollary 8 of [22] shows that the ratio of the number of ordered pairs of leaf-labeled rooted trees to the number of binary ordered rooted tanglegrams is asymptotically a constant times n!, the order of the symmetric group. Intuitively, although the action of the symmetric group is not always free, "for most cases it is close" to free. This may suggest that for n leaves, the ratio of the number of ordered pairs of leaf-labeled unrooted trees to the number of binary ordered unrooted tanglegrams is also of order n!.
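The counting argument can be checked by brute force for very small n. The Python sketch below enumerates rooted binary tree shapes, builds their leaf automorphism groups from swaps of identical sibling subtrees, and counts double cosets; it is an independent illustration with hypothetical function names, not the GAP/Sage code referenced above, and it is only practical for a handful of leaves.

```python
from itertools import permutations


def shapes(n):
    """Rooted binary tree shapes on n leaves, one per isomorphism class."""
    if n == 1:
        return ["*"]
    out = []
    for k in range(1, n // 2 + 1):
        big, small = shapes(n - k), shapes(k)
        for i, left in enumerate(big):
            for j, right in enumerate(small):
                if n - k == k and j > i:
                    continue  # mirror image of a shape already listed
                out.append((left, right))
    return out


def size(shape):
    return 1 if shape == "*" else size(shape[0]) + size(shape[1])


def swap_generators(shape, n):
    """Leaf permutations exchanging identical sibling subtrees."""
    gens = []

    def walk(node, start):
        if node == "*":
            return
        left, right = node
        m = size(left)
        if left == right:
            g = list(range(n))
            for i in range(m):
                g[start + i], g[start + m + i] = start + m + i, start + i
            gens.append(tuple(g))
        walk(left, start)
        walk(right, start + m)

    walk(shape, 0)
    return gens


def compose(p, q):
    return tuple(p[i] for i in q)


def inverse(p):
    q = [0] * len(p)
    for i, image in enumerate(p):
        q[image] = i
    return tuple(q)


def closure(generators, n):
    group = {tuple(range(n))}
    frontier = list(group)
    while frontier:
        g = frontier.pop()
        for s in generators:
            h = compose(s, g)
            if h not in group:
                group.add(h)
                frontier.append(h)
    return group


def double_cosets(J, K, perms):
    seen, cosets = set(), []
    for w in perms:
        if w not in seen:
            dc = frozenset(compose(compose(j, w), k) for j in J for k in K)
            seen |= dc
            cosets.append(dc)
    return cosets


def count_tanglegrams(n):
    """Return (t_n, t_n^iso, s_n^iso, s_n) for rooted binary tanglegrams."""
    groups = [closure(swap_generators(s, n), n) for s in shapes(n)]
    perms = list(permutations(range(n)))
    t = t_iso = s_iso = 0
    for a, A_T in enumerate(groups):
        for b, A_S in enumerate(groups):
            cosets = double_cosets(A_S, A_T, perms)
            t += len(cosets)
            if a == b:
                t_iso += len(cosets)
                # Identify each double coset with its set of inverses.
                s_iso += len({frozenset([dc, frozenset(map(inverse, dc))])
                              for dc in cosets})
    return t, t_iso, s_iso, (t - t_iso) // 2 + s_iso


print(count_tanglegrams(4))
```

For four leaves this gives 13 ordered tanglegrams, 9 of them on isomorphic pairs of trees, and 10 unordered tanglegrams.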

Discussion
Tanglegrams have been an object of study since before DNA sequences were widely available for the reconstruction of phylogenetic trees [38]. So far they have been studied primarily in the context of co-evolutionary analyses, classically that between a host and a parasite, a subject of continuing interest [39,40]. As such, there has been extensive work on the case in which two rooted trees are distinguished from one another, as when one tree represents hosts and one parasites, which we call the ordered rooted case. Here we have broadened the definition of tanglegrams by considering a broader class of underlying graphs, including unordered and/or unrooted tanglegrams.
In this form, tanglegrams formalize statements concerning pairs of phylogenetic trees on the same leaf set that do not directly make reference to the labels themselves. Symmetric tanglegrams also do not make reference to the order of the trees. We observe that many problems in phylogenetic combinatorics "factor" through a problem on tanglegrams. As such, we believe tanglegrams to be a worthwhile object of study in phylogenetic combinatorics, and note that they have already been crucial in an analysis of the geometry of the subtree-prune-regraft graph [11].
These generalized notions of tanglegrams, which are equivalent to the collection of double cosets formed by the automorphism groups of the two trees, invite further investigation by combinatorialists. An elegant formula for the number of binary ordered rooted tanglegrams has recently been found [22], as well as for the multitanglegram case. Here we provide the first several terms of the analogous sequence for unordered and/or unrooted tanglegrams; Ira Gessel has used the theory of species to develop a means of enumerating unordered tanglegrams, which will be described in a forthcoming paper [41]. It would be helpful to have a means of efficiently sampling other classes of tanglegrams according to familiar distributions on labeled phylogenetic trees, perhaps building on the method of sampling binary ordered rooted tanglegrams uniformly at random in [22].

Acknowledgements
We would like to thank Steve Evans, Ira Gessel, Michael Landis, Chris Whidden, and Bianca Viray. We also thank the authors of the Sage and GAP4 software, especially Alexander Hulpke.