Variational Supertrees for Bayesian Phylogenetics

Bayesian phylogenetic inference is powerful but computationally intensive. Researchers may find themselves with two phylogenetic posteriors on overlapping data sets and may wish to approximate a combined result without having to re-run potentially expensive Markov chains on the combined data set. This raises the question: given overlapping subsets of a set of taxa (e.g. species or virus samples), and given posterior distributions on phylogenetic tree topologies for each of these taxon sets, how can we optimize a probability distribution on phylogenetic tree topologies for the entire taxon set? In this paper we develop a variational approach to this problem and demonstrate its effectiveness. Specifically, we develop an algorithm to find a suitable support of the variational tree topology distribution on the entire taxon set, as well as a gradient-descent algorithm to minimize the divergence from the restrictions of the variational distribution to each of the given per-subset probability distributions, in an effort to approximate the posterior distribution on the entire taxon set.


Introduction
Fields such as phylogenetics often work with a sort of abstracted family tree, called a phylogenetic tree, frequently abbreviated here as tree. These trees have different members of a population as their tips, and their branching points describe the relations between the tips and how recently they had a common ancestor. If some of the tips are censored, the tree topology simplifies in a process we refer to as restriction. If one has multiple trees restricted from the same original, uncensored tree, one may wish to reconstruct the original supertree. If one instead has multiple probability distributions of restricted trees, one may be interested in reconstructing the supertree probability distribution. This is a difficult problem, both theoretically and computationally, without additional structure. We take a variational approach by training a flexible model to approximate the true supertree distribution as closely as possible, while still maintaining computational tractability.
This problem falls in the domain of supertree analysis, a topic that has gone by this name since 1986 but has much earlier roots, as reviewed by Sanderson et al. (1998) and Bininda-Emonds (2004). Broadly speaking, there are two goals of supertree analysis. The first goal is to reduce computational complexity by dividing the ensemble of taxa into subsets, performing independent analyses on those subsets, and then combining these analyses into a single tree (Huson et al. 1999), or in the Bayesian case a single posterior distribution. The second goal is to combine information from multiple sources, such as different genes, which may have divergent phylogenetic signal and patterns of presence and absence. Although there is some overlap between these goals, the focus of this paper is on the first goal. Algorithms for the second goal are better served by methods that explicitly model the origins of different phylogenetic signal, such as via the multispecies coalescent (Liu and Pearl 2007; Heled and Drummond 2010).
The eventual goal of our work is to provide a divide-and-conquer strategy for Bayesian phylogenetics, in which taxa are divided into subsets, a Bayesian analysis is run on each, and the results are then knitted back together using a supertree approach. Although this approach in the non-Bayesian case has been a consistent theme in phylogenetics since the work of Huson et al. (1999), the equivalent idea in Bayesian phylogenetics is comparatively underdeveloped. This seems surprising given that Bayesian analyses are much more computationally demanding than their non-Bayesian counterparts, such that the lack of rapid Bayesian inference techniques is limiting their application in important realms such as genomic epidemiology. To motivate our method, we consider the case where researchers have already performed an expensive Bayesian phylogenetic analysis and have subsequently acquired more sequence data, as often happens in epidemiology. Our approach provides a method for combining a new, smaller phylogenetic analysis (including the new sequences) with the earlier results, reducing the need for a new, expensive analysis on all of the sequences at once.
The most relevant existing work, by Ronquist et al. (2004), summarizes phylogenetic posterior distributions in terms of one of two schemes. In the Weighted Independent Binary (WIB) scheme, a tree's probability is proportional to a product of terms, each term being included in the product if the corresponding bipartition is present in the tree. This scheme is in a sense a simpler version of the strategy presented here.
The Weighted Additive Binary (WAB) scheme is an extension of the long-standing tradition in supertree analysis of performing parsimony analysis on a data matrix formed by encoding the splits of the trees as binary characters. The weighting in WAB comes from assigning weights to the characters in such an encoding according to their confidences. One can then translate the corresponding parsimony objective into a Bayesian setting by assigning a log-likelihood penalty to each unit of parsimony cost. In total, the approach takes posterior distributions for trees on each of the subsets, summarizes them in terms of one of these schemes, and then uses products of these factors as an approximation of posterior probability. Ronquist et al. (2004) show some correlation of this method with actual tree posterior probabilities for example data sets on six and ten taxa, and that the WAB scheme outperforms the WIB scheme.
In this paper we develop a variational formulation for supertree estimation. Given a collection of reference distributions of tree topologies with overlapping tip sets (typically acquired via Bayesian phylogenetic inference), in order to approximate the posterior on the full tip set we find a supertree distribution that closely approximates each reference distribution when only considering the tips in that reference. We structure our supertree distribution using subsplit Bayesian networks (SBNs) (Zhang and Matsen IV 2018), reviewed below, which generalize previous formalisms for describing probability distributions on topologies (Höhna and Drummond 2012; Larget 2013). We note in passing that these formalisms, in turn, noted connections between their methods and the supertree work of Ronquist et al. (2004). We focus on the case where the reference distributions are originally given as, or subsequently approximated by, SBNs, but the method is generalizable to arbitrary reference distributions at the cost of computational efficiency. We accomplish our goal of training a supertree distribution using gradient descent to minimize the differences between our reference distributions and our supertree distribution (appropriately restricted). Moreover, we show that the method successfully trains a supertree distribution that is close to the original posterior (SBN) on both simulated and real-world phylogenetic sequence data.

Overview
Suppose we are given a set of probability distributions {p_i} on rooted, bifurcating phylogenetic tree topologies (abbreviated as tree topologies or simply topologies), each with a corresponding tip set X_i. We refer to these tree distributions on their respective tip sets as our reference distributions. These reference distributions should be thought of as the input data for this method. We construct a supertree distribution by finding a probability distribution q(τ) of topologies on the entire taxon set X = ∪_i X_i so that q is as close as possible to each of the reference distributions when the tips not present in that reference distribution are removed. Figure 1 illustrates the flow of information, taking two reference distributions and producing a supertree distribution on the union of its references' tip sets.
We now establish the formalisms necessary to achieve this goal. Given a taxon subset X̄ ⊂ X and a topology τ on X, define the restriction τ_X̄ to be the topology induced by removing all taxa that are not in X̄ from τ and removing all internal nodes with fewer than two children (Semple and Steel 2003). We illustrate an example of this process in Fig. 2. Given a probability distribution q on tree topologies with taxon set X and a topology τ̄ on X̄, we define the restriction of q to X̄ as a marginalization over the topologies that restrict to τ̄,

q_X̄(τ̄) := Σ_{τ : τ_X̄ = τ̄} q(τ). (1)

In this paper, our goal is to infer a distribution q(τ) for topologies on the entire taxon set X such that its restrictions {q_{X_i}} resemble the corresponding reference distributions {p_i}. In the use case where the reference distributions are all samples from restrictions of one phylogenetic tree posterior on the full tip set, our ultimate goal is to approximate the full posterior using the supertree distribution. Mechanically, our objective will be to minimize our loss function L: the sum of KL-divergences between each reference distribution and q restricted to its taxon subset,

L({p_i} ‖ q) := Σ_i D_KL(p_i ‖ q_{X_i}).

Also note that the KL-divergence will be undefined for any tree τ̄ where q_{X_i}(τ̄) = 0, so some care must be taken to ensure support compatibility.

Fig. 1 Illustrating a supertree method on distributions. The top row of two tree distributions are reference distributions, and the bottom tree distribution is the supertree distribution. The circles indicate tips shared by the two reference distributions, whereas the square and diamond represent tips that are only present in one of the two references.

Fig. 2 Illustration of restricting tree topologies. We restrict a tree τ with tip set {A, B, C, D} (left) to tip set {A, B, D}, resulting in the tree τ_{ABD} (right). We remove tip C and the internal node marked with an X as it no longer has two descendants.
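As a concrete illustration of the restriction operation (in the spirit of Fig. 2), here is a minimal sketch assuming a rooted bifurcating topology encoded as nested tuples of tip names; this encoding and the function name are ours, chosen for illustration only.

```python
def restrict(tree, tips):
    """Restrict `tree` to the tip set `tips`, suppressing internal nodes
    left with fewer than two children; returns None if no tips remain."""
    if isinstance(tree, str):                  # a tip
        return tree if tree in tips else None
    left, right = (restrict(child, tips) for child in tree)
    if left is None:
        return right                           # one remaining child: suppress node
    if right is None:
        return left
    return (left, right)

# A tree on {A, B, C, D} restricted to {A, B, D}: tip C and its now
# one-child parent node both disappear.
tau = ((("A", "B"), "C"), "D")
print(restrict(tau, {"A", "B", "D"}))          # (('A', 'B'), 'D')
```

Restricting a tree sample element-wise in this way is exactly the marginalization in Eq. 1: each full tree contributes its probability mass to the restricted tree it maps to.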
If we have reason to prioritize some reference distributions differently than others, due to differing confidence in the different distributions among other reasons, we can easily incorporate weights w_i > 0 into a weighted loss function,

L_w({p_i} ‖ q) := Σ_i w_i D_KL(p_i ‖ q_{X_i}).

For the algorithms that follow in this paper, any choice of positive weights would be appropriate. However, hereafter we focus on the unweighted version.
For parameterizations of q such that D_KL(p_i ‖ q_{X_i}) has an efficiently computable gradient with respect to q's parameters, gradient descent is available for minimizing the loss function. We describe one such family of parameterizations using SBNs below, and derive efficient KL-divergences and gradients later in this section.

Review of Subsplit Bayesian Networks
Here we review subsplit Bayesian networks (SBNs) in the case of rooted topologies. Our approach will have a different emphasis than the original Zhang and Matsen IV (2018) work. Where the previous work described SBNs as a very general class of Bayesian networks and concentrated on unrooted trees, we will focus on a simpler SBN structure parameterizing rooted, bifurcating (phylogenetic) trees.
We will use the term clade to refer to a subset of a taxon set X. A subsplit is a set containing two disjoint child clades, s = {Y, Z}. We define the parent clade of a subsplit as the union of its two child clades, with notation U(s) = Y ∪ Z. If we need to specify a particular child clade Y of a subsplit s = {Y, Z} as being the focus of attention (as opposed to the other child clade), we use the notation (s, Y) or {Z, Y}. Note that the parent clade of s is allowed to be a proper subset of X, in contrast to the traditional definition of a split as a bipartition of the entire taxon set X (Semple and Steel 2003). We will say that the subsplit s divides a clade W if U(s) = W, with notation W → s. Note that due to our bifurcating assumption, every clade of size two or larger will be divided by a subsplit. We also say that s is a child subsplit of parent subsplit t if U(s) is a child clade of t. We refer to t and s as a parent-child subsplit pair, or PCSP, with notation t → s, when s is a valid child of a subsplit t.
We also extend the concept of valid child subsplits to further descendants. Given subsplit a = {Y, Z}, we say that subsplit d is a valid descendant of (a, Y) if U(d) ⊆ Y, and we use the notation (a, Y) →* d. Additionally, we say that d is a valid descendant of a, with notation a →* d, if (a, Y) →* d, (a, Z) →* d, or a = d. Equivalently, we say a is a valid ancestor of d under the same conditions and with the same notation. Note that t → s implies t →* s but not vice versa.
We use the term path to refer to a sequence of subsplits such that each element is a descendant of the previous. For example, the path a →* t → s would refer to a sequence starting with a, proceeding via any number of subsplits to t (including zero if a = t), then directly to t's child s.
It will be convenient to also introduce singletons and trivial subsplits. A singleton corresponds to one of the tips of the tree and is represented by a clade of size one or a subsplit containing a singleton clade and the empty set. A subsplit is trivial if one of its child clades is empty. We typically exclude singletons and trivial subsplits from sets of subsplits, unless explicitly included.
Each bifurcating rooted topology can be uniquely represented as a set of the subsplits it contains. For example, the topology given by the Newick string (Felsenstein 1986) "((t1,t2),t3);" is described by the subsplits {{t1, t2}, {t3}} and {{t1}, {t2}}. We will use the notation s ∈ τ to mean that the subsplit s is found in τ, and the notation τ ⊆ S to mean that all of the subsplits in τ are in set S. The same holds true for specifying a topology in terms of PCSPs, and we will use the same notation in that case. Each subsplit s ∈ τ has two child clades, each of which must correspond to a singleton or a subsplit that divides it. Similarly, for a given topology τ, each tip or subsplit s has a parent subsplit t such that U(s) is one of the child clades of t. We will denote the parent subsplit of s by π_τ(s). In the above example, {{t1, t2}, {t3}} is the parent of {{t1}, {t2}}, which in turn is the parent of singletons {t1} and {t2}. In order to eliminate having to make a special case for the root subsplit r, we define its parent subsplit to be a special trivial subsplit of the entire taxon set, i.e. π_τ(r) = {X, ∅} = π_X.
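The subsplit representation of a topology can be computed with a simple recursion; here is an illustrative sketch (our own code, using a nested-tuple topology encoding that is not part of the paper) applied to the Newick example above.

```python
def subsplits(tree):
    """Return (tip clade, set of nontrivial subsplits) for a rooted
    bifurcating topology encoded as nested tuples of tip names.
    A subsplit is a frozenset of two child-clade frozensets."""
    if isinstance(tree, str):                     # a singleton tip
        return frozenset({tree}), set()
    left_clade, left_subs = subsplits(tree[0])
    right_clade, right_subs = subsplits(tree[1])
    here = frozenset({left_clade, right_clade})   # the subsplit at this node
    return left_clade | right_clade, left_subs | right_subs | {here}

# The "((t1,t2),t3);" example: exactly the two subsplits named in the text.
clade, subs = subsplits((("t1", "t2"), "t3"))
```

Because the recursion emits one subsplit per internal node, the resulting set uniquely determines the topology, matching the representation described above.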
In order to illustrate how to construct an SBN, we first describe how to sample a topology from an SBN. Starting from the root clade, recursively construct a topology: for any currently-childless clade W larger than a singleton, sample a subsplit that divides W from a probability distribution, supplied by the SBN, conditional on some subset of the ancestors of W. These conditional distributions can be parameterized in different ways using different subsets of the clades' ancestry (Zhang and Matsen IV 2018), with each parameterization defining a family of SBN probability distributions on tree topologies. In this paper, we focus on two families in particular: clade-conditional distributions (CCDs), where the subsplit distributions p(s | U(s)) are conditional on the subsplit's parent clade (Höhna and Drummond 2012; Larget 2013), and subsplit-conditional distributions (SCDs), where the subsplit distributions p(s | (t, U(s))) are conditional on the subsplit's parent subsplit and clade (Zhang and Matsen IV 2018). We choose these two parameterizations for different reasons: CCDs for their simplicity and elegance in illustrating the supertree algorithms, and SCDs because they can be trained to a higher fidelity to a target distribution (Zhang and Matsen IV 2018).
We use the notation p(s) := p(s ∈ τ) for the unconditional probability of a subsplit s being present in a topology τ randomly sampled according to p. Similarly, we use p(t → s) for the unconditional probability of PCSP t → s, namely p(s | (t, U(s))) p(t). For a CCD-parameterized SBN p, we define the subsplit support C as the set of subsplits that have positive probability under p. For an SCD-parameterized SBN p, we define the PCSP support P as a heterogeneous set containing the PCSPs that have positive probability under p, the subsplits that have positive probability under p, the singletons for p's tip set, and the empty subsplit.

KL-Divergence Between SBNs
Here we show that the KL-divergence between two SBN-parameterized distributions can be computed efficiently. If both p(τ) and q(τ) are CCD-parameterized SBNs, the divergence decomposes over the subsplit support C_p of p:

D_KL(p ‖ q) = Σ_{s ∈ C_p} p(s) log [ p(s | U(s)) / q(s | U(s)) ]. (2)

Computing this sum is linear time in the number of subsplits in the subsplit support of p.
Similarly, if both p(τ) and q(τ) are SCD-parameterized SBNs, the divergence decomposes over the PCSP support P_p of p:

D_KL(p ‖ q) = Σ_{(t → s) ∈ P_p} p(t → s) log [ p(s | (t, U(s))) / q(s | (t, U(s))) ]. (3)

Computing this sum is linear time in the number of PCSPs in the PCSP support of p.
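A sketch of how such a linear-time computation might look in the CCD case, assuming (our own illustrative data layout, not the paper's implementation) that p's unconditional and conditional subsplit probabilities and q's conditional probabilities are held in dictionaries keyed by subsplit:

```python
from math import log

def kl_ccd(p_uncond, p_cond, q_cond):
    """KL-divergence between two CCD-parameterized SBNs: the sum over
    subsplits s in p's support of p(s) * log(p(s|U(s)) / q(s|U(s))).
    q must give positive conditional probability to every subsplit of p."""
    return sum(ps * log(p_cond[s] / q_cond[s]) for s, ps in p_uncond.items())
```

A single pass over p's subsplit support suffices, matching the linear-time claim above; note again that a subsplit in p's support with q-probability zero makes the divergence undefined.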

Restricting SBNs
Equation 1 defines how to take a distribution q on trees with taxon set X and restrict it to its induced distribution on trees with taxon set X̄ ⊂ X. If q is an SBN-parameterized distribution, we can more efficiently calculate q_X̄ from the SBN parameters directly: we can restrict a subsplit s to taxon set X̄ by taking the intersection of both child clades with X̄. One consequence of this is that some subsplits on X will become trivial subsplits on X̄. On the other hand, if a restricted subsplit is nontrivial, then we know the original subsplit is nontrivial, because the restricted subsplit separates at least one pair of tips, so the original subsplit will separate those tips as well. Furthermore, subsplits represent recursive bipartitions of sets, so any pair of tips can only be partitioned by a subsplit once. Therefore, no two subsplits that restrict to the same nontrivial subsplit s̄ can appear in the same tree, since any subsplit that restricts to s̄ separates all the same tips that s̄ partitions. By this mutual exclusivity, the probability of a restricted subsplit s̄ under a restricted distribution q_X̄ is simply

q_X̄(s̄) = Σ_{s : s_X̄ = s̄} q(s). (4)

Similarly, subsplits with the same clade are mutually exclusive, so the unconditional probability of a clade Ū appearing is

q_X̄(Ū) = Σ_{s̄ : U(s̄) = Ū} q_X̄(s̄). (5)

In order to construct the restricted SBN, we need to compute the appropriate conditional probabilities, which we can easily calculate from unconditional probabilities. In a CCD context, subsplit probabilities are conditional on observing the subsplit's clade U(s). We can build upon Eq.
1 to find the restricted SBN induced by restricting to X̄ ⊂ X. We see that, under restriction, the conditional probability of a subsplit given its clade is

q_X̄(s̄ | U(s̄)) = q_X̄(s̄) / q_X̄(U(s̄)).

A slightly more involved construction is needed to gain an equivalent formula for the SCD case. For SCD parameterizations, we need the unconditional probability of a PCSP. PCSPs are mutually exclusive, so an argument similar to the above holds, but subsplits a and d that respectively restrict to the restricted parent t and child s do not themselves have to be a valid parent-child pair before restriction. We illustrate this possibility in Fig. 3. More formally, the following two statements are equivalent: (1) the ancestor-descendant pair a →* d restricts to the PCSP t → s, and (2) a sequence of parent-child pairs in q exists, starting with a, ending with d, with subsplits {t_i} in between, such that each t_i restricted to X̄ is trivial. The converse is elementary; to show that (1) implies (2), we know a and d exist by assumption, and a restricts to t, which is the parent of s. Then one of a's child clades restricts to U(s), and every subsplit between a and d must divide U(s) under restriction. Finally, since the tips in U(s) can only be partitioned once (in s), every subsplit between a and d must be trivial under restriction.

Fig. 3 Illustration of restricting PCSPs. We restrict a tree with tip set {A, B, C, D} (left) to tip set {A, B, D} (right). Note how the node with label a on the left corresponds to the node with label t on the right, just as the node with label d corresponds to the node with label s. Note that nodes a and d are not parent and child, but nodes t and s are. This is possible because the node with label * is trivial under restriction, and therefore has no corresponding node on the right.
We use the notation q(a →* d) to represent the probability of observing subsplits a and d in a random tree from q if d is a valid descendant of a, and zero otherwise. The unconditional probability of a PCSP under a restricted distribution is then

q_X̄(t → s) = Σ_{a →* d : (a →* d)_X̄ = (t → s)} q(a →* d).

Then under restriction, the conditional probability of a PCSP given its parent is

q_X̄(s | (t, U(s))) = q_X̄(t → s) / q_X̄(t).
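Under a dictionary representation of unconditional subsplit probabilities (an illustrative layout of our own, not the paper's), the subsplit marginalization of Eq. 4 is a single accumulation pass:

```python
def restrict_probs(q_uncond, tips):
    """Eq. 4: accumulate q(s) over all subsplits s whose restriction to
    `tips` (intersecting both child clades with `tips`) is the same
    nontrivial subsplit. A subsplit is a frozenset of two clade frozensets."""
    out = {}
    for s, prob in q_uncond.items():
        restricted = frozenset(clade & tips for clade in s)
        # keep only nontrivial restrictions: two distinct, nonempty child clades
        if len(restricted) == 2 and all(restricted):
            out[restricted] = out.get(restricted, 0.0) + prob
    return out
```

Subsplits whose restrictions become trivial simply drop out, and distinct subsplits restricting to the same s̄ have their probabilities summed, per the mutual-exclusivity argument above.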

Supertree Support
Our overall goal is to find a distribution q on topologies that is close to a set of reference distributions {p_i} on taxon sets X_i. An important part of that goal is to understand the supertree support, namely the set of building blocks (subsplits or PCSPs) that have positive probability under q, which determines the set of trees that have positive probability under q. We find the supports for all of our reference distributions p_i, which we will call our reference supports. We will refer to the process of finding a mutual supertree support for the entire taxon set as mutualizing or mutualization. The details of how this is done will depend on whether we are using a CCD or SCD parameterization.
We aim to construct a suitable supertree support for the sake of computational tractability, so we wish to have as few elements in our supertree support as reasonably possible. However, any tree that restricts to a tree in each reference support is as suitable for inclusion in the supertree support as any other, so we must include them all. We codify these objectives as a pair of Requirements, stated here generally and later more specifically in the CCD and SCD contexts.
Requirement 1: To allow an element into the supertree support, it must restrict to an element in each reference support. Requirement 2: Any tree that, under restriction, is a subset of every reference support must be included in the supertree support.
For more than two reference supports, we propose an incremental approach for building the mutual support: we start with taxon set X_1 and its reference support, extend to X_1 ∪ X_2 by mutualizing with the reference support for X_2, then continue to (X_1 ∪ X_2) ∪ X_3, etc. Thus we will only present an algorithm for the case of X = X_1 ∪ X_2. The algorithms extend to finding supertree supports for multiple sets simultaneously, but the computations grow exponentially in the number of simultaneous supports (see Discussion).

CCD Subsplit Supports
Assume that we have reference subsplit supports C_{X_i} for each taxon subset X_i ⊂ X and wish to find a good candidate subsplit support M({C_{X_i}}) for the supertree distribution q(τ). We now specialize the Requirements to the CCD case: Requirement 1 says that any subsplit in the mutualized support must exist (after restriction) in each of the input subsplit supports. Requirement 2 says that any topology that appears (after restriction) in each of the restricted supports must be present in the mutualized support. These are in fact fairly strong constraints. For example, if the reference supports do not agree on overlapping sets of taxa, then the supertree support can be too small or even empty. However, if the reference supports are restrictions of the true supertree support, or are supersets of the true support restrictions, then the mutualized support will cover the true support. Below we present Algorithm 1, which fulfills these requirements.
Next we explain the Requirements in more detail for the case of two reference supports. For any clade W ⊆ X̄, define C_X̄(W) as all subsplits in C_X̄ that divide W, including the trivial subsplit {W, ∅}. In order for a subsplit s to meet Requirement 1, s must, after restriction, be a member of both C_{X_1}(W ∩ X_1) and C_{X_2}(W ∩ X_2).
To find a collection of subsplits that satisfies Requirement 2, we take an iterative approach over possible clades from the root to the tips (Algorithm 1). Starting with the stack containing the clade X (the union of all tip sets), we iteratively pop the next clade W, and we consider every pairing of subsplits s_1 ∈ C_{X_1}(W ∩ X_1) and s_2 ∈ C_{X_2}(W ∩ X_2). For each pair s_1 = {Y_1, Z_1} and s_2 = {Y_2, Z_2}, we generate a set of two potential subsplits by combining the child clades in the two possible ways: {Y_1 ∪ Y_2, Z_1 ∪ Z_2} and {Y_1 ∪ Z_2, Z_1 ∪ Y_2}. Note that potential subsplits will frequently have shared taxa in both child clades, but we will exclude these invalid subsplits. We add each nontrivial valid subsplit s among the potential subsplits to the output, and we push each child clade of s onto the stack of clades to consider if the clade is size two or larger and has not been visited before.
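To make the iteration concrete, here is a runnable sketch of this procedure. The representations and names are ours (a subsplit as a frozenset of two disjoint child-clade frozensets, a support as a set of subsplits), and the pairing of child clades is our reading of the combination step; consult Algorithm 1 itself for the authoritative definition and edge cases.

```python
def clade_union(s):
    """U(s): the parent clade of subsplit s."""
    return frozenset().union(*s)

def dividing(support, w):
    """C_X(W): subsplits in `support` dividing clade W, plus the trivial
    subsplit {W, empty} (letting one reference stay silent on W)."""
    subs = {s for s in support if clade_union(s) == w}
    subs.add(frozenset({w, frozenset()}))
    return subs

def child_pair(s):
    kids = list(s)
    return (kids[0], kids[1]) if len(kids) == 2 else (kids[0], kids[0])

def mutualize_ccd(c1, x1, c2, x2):
    """Candidate supertree subsplit support for reference supports c1, c2
    on tip sets x1, x2, built root-to-tips over clades of x1 | x2."""
    output, visited, stack = set(), set(), [x1 | x2]
    while stack:
        w = stack.pop()
        if w in visited:
            continue
        visited.add(w)
        for s1 in dividing(c1, w & x1):
            y1, z1 = child_pair(s1)
            for s2 in dividing(c2, w & x2):
                y2, z2 = child_pair(s2)
                # the two ways to pair up child clades across references
                for a, b in ((y1 | y2, z1 | z2), (y1 | z2, z1 | y2)):
                    if a & b or not a or not b:
                        continue      # invalid (overlapping) or trivial
                    output.add(frozenset({a, b}))
                    for child in (a, b):
                        if len(child) >= 2 and child not in visited:
                            stack.append(child)
    return output
```

On reference supports obtained by restricting the single tree ((A,B),(C,D)) to {A,B,C} and {A,B,D}, this sketch recovers a support containing all of the original tree's subsplits, as Requirement 2 demands.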
In the Appendix we prove that Algorithm 1 meets both of our requirements for the supertree subsplit support. We also show that Algorithm 1 runs in O(n_{S_1} n_{S_2}) time, where n_{S_i} is the number of subsplits in reference support C_{X_i}.

SCD PCSP Supports
For SCD-parameterized SBNs, we need to consider the supertree PCSP support. Suppose that we have reference PCSP supports P_{X_i} for each taxon subset X_i ⊂ X. We have requirements for PCSP support mutualization that parallel our requirements for constructing a subsplit support. Our Requirements take the following form in the SCD case. SCD Requirement 1: To allow a PCSP t → s into the mutualized support, for each i the restriction s_{X_i} must appear in P_{X_i}, with a valid ancestor a_i of s and a subsplit u_i in P_{X_i} such that a_{i, X_i} = u_i and U(s_{X_i}) ∈ u_i (see Fig. 3 for an illustration). SCD Requirement 2: For every tree τ on X such that, for each i, τ_{X_i} ⊆ P_{X_i}, then τ ⊆ M({P_{X_i}}).
Our PCSP mutualization algorithm, laid out in Algorithm 2 and illustrated in Fig. 4, largely follows the structure of Algorithm 1, with a few subtle differences. We use the notation P_X̄((t, W)) to represent the set of all valid child subsplits of (t, W) in P_X̄, including the trivial subsplit {W, ∅}. Because parent subsplits can become trivial under restriction, we need to perform additional bookkeeping in Algorithm 2. The items in our recursion stack (line 5) contain three pieces of information: the parent subsplit and clade under consideration in the full taxon set, the most recent parent subsplit in P_{X_1}, and the most recent parent subsplit in P_{X_2}. The additional if statements (lines 12 and 14) in the main loop capture both most recent parent subsplits in their respective supports. The need for and operation of these additional constructions is best illustrated with an example, which we provide in the Appendix. This example uses only 4 taxa and is written in great detail.

Fig. 4 Illustration of finding the mutual PCSP support. On the top left and right we see the trees in the two reference distributions, on taxon sets {A, B, C, D, E} and {B, C, D, E, F}, respectively. Below them (boxed), we see the PCSP supports resulting from those trees. In the center (boxed), we see the PCSP support on {A, B, C, D, E, F} resulting from the mutualization algorithm. Below that, we see the trees that result from the mutual PCSP support. Note that all of the trees at the bottom restrict to trees in the references.
In the Appendix we prove that Algorithm 2 meets both of our requirements for the supertree PCSP support. We also show that Algorithm 2 runs in O(n_{P_1} n_{P_2}) time, where n_{P_i} is the number of PCSPs in reference support P_{X_i}.

Gradients
If we can calculate the gradient of L({p_i} ‖ q) = Σ_i D_KL(p_i ‖ q_{X_i}) with respect to the parameters of our SBN, then we can use gradient descent to minimize our objective and optimize our supertree distribution. In this section we describe how to perform such a gradient calculation in the CCD and the SCD parameterizations. We refer to our gradient-based approach as vbsupertree, and illustrate its optimization loop in Fig. 5.

CCD Parameterizations
Under CCDs, we parameterize the distribution of subsplits s conditional on their clade U(s) using a softmax transformation of a parameter vector v = {v_s}, i.e.

q(s | U(s)) := exp(v_s) / Σ_{s′ : U(s′) = U(s)} exp(v_{s′}).
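A minimal code sketch of this softmax parameterization (the per-clade dictionary layout and names are our own illustrative choices):

```python
from math import exp

def ccd_conditionals(v):
    """Softmax over the subsplits dividing each clade: maps a dict
    {clade: {subsplit: v_s}} to {clade: {subsplit: q(s | clade)}}."""
    out = {}
    for clade, params in v.items():
        m = max(params.values())                  # stabilize the exponentials
        e = {s: exp(vs - m) for s, vs in params.items()}
        z = sum(e.values())
        out[clade] = {s: es / z for s, es in e.items()}
    return out
```

By construction the conditional probabilities for each clade sum to one and are strictly positive, so the parameter vector v is unconstrained, which is what makes plain gradient descent applicable.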
We choose this softmax parameterization in order to have clean derivatives with respect to our parameters, facilitating taking the gradient of our objective function. Softmax is also very commonly used in other statistical applications, so additional features like regularization penalties are easy to implement. We use the shorthand ∂_s for the derivative of a function with respect to the CCD parameter v_s. Following the derivations shown in the Appendix, the gradient of the loss can be written in terms of the derivatives ∂_s q(s′) of the unconditional subsplit probabilities. This form clearly shows the algorithmic complexity of the gradient computation to be O(n_s^2), where n_s is the number of subsplits in the support, since both the summation and the derivative traverse every subsplit.

Fig. 5 Illustration of the vbsupertree optimization loop. On the top left and top right we see the two reference distributions. Below them (boxed), we see the SBNs resulting from those trees. In the center (boxed, above) we see a candidate supertree SBN on trees in the mutual support. We calculate the KL-divergences and gradients versus the reference SBNs (under restriction) and change the candidate SBN according to the gradients, resulting in the next candidate SBN (center, below). This can then be looped until a desired KL-divergence or a maximum number of iterations is reached.

SCD Parameterizations
Under SCDs, we parameterize the distribution of child subsplits s conditional on their parent subsplit and clade (t, U(s)) with parameter vector v = {v_{s|t}}, i.e.

q(s | (t, U(s))) := exp(v_{s|t}) / Σ_{s′ : U(s′) = U(s)} exp(v_{s′|t}).
We use the shorthand ∂_{s|t} for the derivative of a function with respect to the SCD parameter v_{s|t}.
Following the derivations shown in the Appendix, we obtain a closed-form expression for the gradient, in which k_t denotes the number of child clades of t of size 2 or larger. After a linear pass through the support accumulating path probabilities, the algorithmic efficiency of this calculation is O(n_p · n_{p,X̄}), where n_p is the number of PCSPs in the support, and n_{p,X̄} is the number of paths in the support that restrict to a PCSP on tip set X̄.

Simulated Data
We begin exploring the effectiveness of vbsupertree through a simulation study. We sample a phylogenetic tree with 40 tips from the classical isochronous, constant effective population size coalescent, and simulate a sequence data alignment using the Jukes-Cantor model (Jukes and Cantor 1969). We remove one sequence from the full alignment to make our first reference alignment, and remove a different sequence to create our second reference alignment. We approximate the posterior tree distributions for our full alignment and our two reference alignments by running Markov chain Monte Carlo (Hastings 1970) using the phylogenetic software BEAST (Drummond and Rambaut 2007; Suchard et al. 2018) on our three sequence datasets. We run BEAST for 10^7 steps, remove the first 50% as burn-in, and subsample every 1000th tree to reduce autocorrelation, resulting in a posterior sample of 5000 trees. Since we are not guaranteed to see every credible tree topology in every run due to the size of tree space, to maintain support compatibility for KL-divergence calculations we trim out all tree topologies that appear only once in a given BEAST output, in order to increase the proportion of PCSPs in common (under the appropriate restriction) between the three posterior samples. We train rooted, SCD-parameterized SBNs to use as our ground truth distribution and two reference distributions. Finally, we trim any PCSPs in our ground truth and our references that are not covered by the appropriate restriction of our mutualized support (no restriction in the case of the ground truth).
Applying vbsupertree to the two generated reference distributions leads to quick convergence of the loss function to a small value, as seen in the left panel of Fig. 6. Additionally, since we have a sample of the true posterior, we chart the progression of the KL-divergence of our supertree SBN from the truth (an SBN trained on the full posterior) in the right panel.

Real World Data
For an analysis on real-world data, we select 30 well-differentiated hepatitis C virus (HCV) sequences from the alignment previously analyzed by Pybus et al. (2003) and others. We remove one sequence from the full alignment to make our first reference alignment, and remove a different sequence to create our second reference alignment. From this stage forward, our approach is identical to our simulation study. We approximate the true posterior and our two references by running BEAST on our three sequence datasets. We run BEAST for 10^9 steps, remove the first 50% as burn-in, and subsample every 5000th tree to reduce autocorrelation, resulting in a posterior sample of 100,000 trees. In order to make the supports compatible for KL-divergence calculations, we trim out all tree topologies that appear only once in a given BEAST output. We train rooted, SCD-parameterized SBNs to use as our ground truth distribution and two reference distributions. Finally, we trim any PCSPs in our ground truth and our references that are not covered by the appropriate restriction of our mutualized support.
Applying vbsupertree to the two generated reference distributions leads to quick convergence of the loss function to a small value, as seen in Fig. 7, left panel. Additionally, since we have a sample from the true posterior, we chart the progression of the KL-divergence of our supertree SBN versus the truth (an SBN trained on the full posterior), resulting in the right panel.

Discussion
In this paper, we lay out an SBN-based framework for generating supertree supports and training variational supertree distributions. We apply our method to simulated sequence data and find that it trains an SBN that very closely approximates our target posterior distribution. We also apply our method to a subset of a well-known HCV dataset, and successfully train it to approximate our ground truth distribution.
Although the work of Ronquist et al. (2004) described in the introduction is the closest to that presented here, two other lines of research deserve mention in this context. First, De Oliveira et al. (2016) derive a Bayesian extension of previous work on maximum likelihood supertrees (Steel and Rodrigo 2008). In this strategy, one posits a likelihood model based on measures of disagreement between trees, such as an exponential likelihood in terms of some distance between tree topologies. This method is interesting in that it can incorporate a number of distances representing various aspects of tree disagreement (De Oliveira et al. 2016); however, this differs from the direct goal of reconstructing a posterior distribution on taxon set X given its projections onto subsets, as we have described. Our objective directly phrases a goal appropriate for divide-and-conquer Bayesian phylogenetics. Also, the work of Bryant (2001) shares some goals and concepts with our mutualization algorithm. However, in that setting one has one tree per taxon subset (rather than many, as is the case here) and one wishes to find the optimal tree among the potentially many trees that restrict to the stated tree for each taxon subset.
Another related line of research concerns sequential Monte Carlo inference by subtree merging (Bouchard-Côté et al. 2012; Wang et al. 2015). The state of such a sampler is described by a population of "particles," each of which consists of a collection of rooted trees on disjoint taxon subsets such that the union of the tree tips is the entire taxon set. In each step of the algorithm, particles are chosen from the previous generation, and for each particle a pair of subtrees are merged. These probabilistic choices and mergings are designed carefully such that after completion of all of the steps one obtains a sample from the phylogenetic posterior distribution. This method is in a sense a type of divide-and-conquer algorithm in that it finds solutions to phylogenetic problems on subsets of taxa before finding the entire posterior. However, it differs significantly from our current goal in that we assume that the taxon subsets and the posterior distributions on subtrees are delivered as part of the problem statement, whereas phylogenetic SMC ingests raw molecular sequence data.
One common obstacle for supertree methods is the fact that the compatibility of k tree topologies on k tip sets cannot be checked in polynomial time in k (Steel 1992).
For the methods we present, holding ourselves to two tip sets at a time, this is not an issue. Our subsplit- and PCSP-based approaches pool our topologies into two sets, which effectively sets k = 2. It is for this reason that we propose a one-at-a-time approach for applying our supertree support mutualization methods to k > 2 tip sets. We anticipate exploring the properties of one-at-a-time versus all-at-once mutualization in future work.
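The one-at-a-time strategy for k > 2 tip sets amounts to folding the pairwise mutualization over the list of reference supports. A minimal sketch under our own naming, with the pairwise operation passed in as a function; a toy set union stands in here for the actual mutualization algorithms described in the text:

```python
def mutualize_one_at_a_time(supports, mutualize_pair):
    """Merge k >= 2 reference supports by repeated pairwise
    mutualization, folding left to right one support at a time."""
    result = supports[0]
    for support in supports[1:]:
        result = mutualize_pair(result, support)
    return result

# Toy stand-in: supports as sets, pairwise mutualization as union.
toy_supports = [{1, 2}, {2, 3}, {4}]
merged = mutualize_one_at_a_time(toy_supports, lambda a, b: a | b)
assert merged == {1, 2, 3, 4}
```

Note that the result of a fold like this may in principle depend on the order in which supports are merged, which is part of why we plan to compare one-at-a-time against all-at-once mutualization.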
The k = 2 framework may also have its own "curse of dimensionality" if the reference tip sets have little overlap. In this case, the information content of the reference distributions might be very small compared to the dimensionality of the supertree distribution, due to the rapid expansion of tree space in the number of tips. In these cases, the choice of the initial supertree distribution (before gradient descent) may influence the final (converged) supertree distribution. In our main experiments, we have taken the approach of using high-entropy starting distributions, but in Appendix A.6 we incorporate a regularization penalty term to encourage conservative conditional distributions where the information from the reference distributions is insufficient.
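A high-entropy starting distribution can be obtained by setting all softmax parameters within each conditional distribution equal, which yields a uniform conditional over each parent's candidate children. A sketch assuming the softmax parameterization used in the appendix; the function name is ours:

```python
import numpy as np

def high_entropy_conditional(num_children):
    """Equal softmax parameters give a uniform (maximum-entropy)
    conditional distribution over a parent's candidate child subsplits."""
    v = np.zeros(num_children)   # equal parameters
    expv = np.exp(v)
    return expv / expv.sum()     # softmax

assert np.allclose(high_entropy_conditional(4), 0.25)
```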
One caveat for our supertree support mutualization methods arises when the reference supports do not completely cover the true restricted supports. When the references cover the truth, our results guarantee that the mutualized support contains every topology that we require without containing any extraneous elements. However, if the reference supports are missing elements from the true supports, then topologies will go missing from the mutualized supertree support and it is not guaranteed to cover the true support. Unfortunately, most tree-based Bayesian analyses will have enormous posterior topology supports, and Monte Carlo based methods will collect only a sample from the larger posterior. Thus in future work, we intend to broaden the inclusion criteria for our supertree support methods while still attempting to keep the mutual support as small as possible.
In general we view this work as providing a proof of concept of a new approach for divide-and-conquer Bayesian phylogenetics. To make this a more complete method, it will also require methods to merge variational branch length distributions (Zhang and Matsen IV 2019). Further refinement of these merged distributions with the complete data set, in terms of both support and continuous parameters, will likely be required. We also note that a perfect estimate of the variational distribution on topologies may not be necessary, as one can correct these variational distributions using importance sampling (Zhang and Matsen IV 2019), or perhaps use them as part of an MCMC proposal. However, high-quality importance sampling may require larger topology supports, potentially up to the full tree space for a given tip set. In that case, our mutualization and supertree methods can be adapted to include a nonzero probability for topologies outside of the mutual support. Future work will include exploration of the quality of importance sampling approximations using SBNs, including exploration of issues of support size and compatibility. Finally, we show that Algorithm 1 runs in O(n_{S_1} n_{S_2}) time, where n_{S_i} is the number of subsplits in reference support C_{X_i}. We know Algorithm 1 visits each clade W ⊆ X at most once, each W restricts to one pair of clades W_1, W_2, and each subsplit divides only one clade. In the worst case, where X_1 and X_2 are disjoint, any W_1 may be paired with any W_2, so every subsplit in C_{X_1} might have to be crossed with every subsplit in C_{X_2}. Therefore the algorithm runs in O(n_{S_1} n_{S_2}) time.
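The worst-case term in this complexity argument comes from crossing every subsplit in one reference support with every subsplit in the other. A minimal counting sketch (abstract collections stand in for subsplit supports; the function name is ours):

```python
from itertools import product

def worst_case_crossings(support_1, support_2):
    """Enumerate every pairing of subsplits across the two reference
    supports: the O(n_{S_1} n_{S_2}) worst case that arises when the
    tip sets X_1 and X_2 are disjoint."""
    return list(product(support_1, support_2))

# With 3 and 4 subsplits, the worst case considers 3 * 4 = 12 pairings.
assert len(worst_case_crossings(range(3), range(4))) == 12
```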

Mutual PCSP Support Example
We will now illustrate Algorithm 2 with a simple yet nontrivial example. To illustrate specific clades, subsplits, and PCSPs, we introduce some shorthand notations for this section. Tree tips will be represented by capital letters, such as A and B. Clades will be represented by concatenated tips, such as ABD and ACD. Subsplits will be represented by two clades in a set, such as {AB, CD}. Subsplits focusing on a specific child clade will be represented by two clades in a set, with the focal clade underlined and generally written second, such as {CD, AB}. PCSPs will then look like {CD, AB} → {A, B}.
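This shorthand maps naturally onto hashable Python values; the encodings below are our own illustration, not the paper's implementation. A clade becomes a frozenset of tips, a subsplit a frozenset of two clades, and a PCSP pairs a (parent subsplit, focal clade) with a child subsplit:

```python
def clade(tips):
    """A clade is an unordered set of tips, e.g. clade('ABD')."""
    return frozenset(tips)

# The subsplit {AB, CD} and the PCSP {CD, AB} -> {A, B} with focal clade AB.
AB, CD = clade("AB"), clade("CD")
subsplit = frozenset({AB, CD})
pcsp = ((subsplit, AB), frozenset({clade("A"), clade("B")}))

# Restriction of a clade to a tip subset is simply set intersection.
assert clade("ABD") & clade("ACD") == clade("AD")
```

Frozensets make clades and subsplits usable as dictionary keys, which is convenient when indexing supports by parent subsplit as the algorithms do.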
Suppose we are given the trees depicted in Fig. 8. The tree with tips ABD results in the first reference PCSP support and the tree with tips ACD results in the second, omitting singletons and the empty subsplit for brevity. The union of the tip sets is ABCD. The mutual PCSP support algorithm begins with stack [({∅, ABCD}, {∅, ABD}, {∅, ACD})].
We consider the children (including the trivial child) of {∅, ABD}, which are {{A, BD}, {∅, ABD}}, and the children of {∅, ACD}, namely {{A, CD}, {∅, ACD}}. The only nontrivial subsplit this crossing results in is {A, BCD}.

Theorem 4 For every (t → s) ∈ M(P_{X_1}, P_{X_2}) and each i = 1, 2, there exists a path (a_i →* t → s) ⊆ M(P_{X_1}, P_{X_2}) and a subsplit u_i in P_{X_i} such that a_i|_{X_i} = u_i and U(s|_{X_i}) ∈ u_i.
Proof For any parent subsplit t, every candidate child subsplit s is constructed from the children (possibly trivial) of the most recent ancestors present in their respective reference supports, with a_i|_{X_i} = u_i. By induction, we know a_i →* t. Finally, by construction t → s, a_i|_{X_i} = u_i, and U(s|_{X_i}) ∈ u_i.
We now begin preparations for the proof of the second Requirement in the SCD case.
Lemma 5 If τ|_X ⊆ P_X, then for every PCSP (t → s) ∈ τ, there exists a path (a →* t → s) ⊆ τ and a subsplit u in P_X such that a|_X = u and U(s|_X) ∈ u.

Proof If t|_X is a subsplit in P_X, we know U(s|_X) ∈ t|_X, so u = t|_X, and we are done. If t|_X is not a subsplit in P_X, i.e. t|_X is trivial but is neither π_X nor a singleton, then we know U(s|_X) = U(t|_X) ∈ π_τ(t)|_X. If π_τ(t)|_X is a subsplit in P_X then we are done. Otherwise, we can continue this reasoning and chain of equalities, proceeding up the tree until we reach a parent subsplit in P_X or π_X, which we know is a parent subsplit in P_X. If s|_X is nontrivial, then there is an uninterrupted path of trivial subsplits between s and the a we find above. Thus, by tree restriction, (u → s|_X) is a valid PCSP in P_X.

Lemma 6 Suppose there exists a tree topology τ with taxon set X such that τ|_{X_1} ⊆ P_{X_1} and τ|_{X_2} ⊆ P_{X_2}. If (t → s) ∈ τ and the algorithm reaches a state where a_1 is the most recent ancestor of s that restricts to a subsplit in P_{X_1}, and a_2 is the most recent ancestor of s that restricts to a subsplit in P_{X_2}, then (t → s) ∈ M(P_{X_1}, P_{X_2}).
Proof Similar to Algorithm 1, in general we know W_1 = U(s|_{X_1}) and, via the same logic, W_2 = U(s|_{X_2}). Let u_1 = a_1|_{X_1} and u_2 = a_2|_{X_2}. By Lemma 5, we know W_1 ∈ u_1 and W_2 ∈ u_2. If s_1 = s|_{X_1} is trivial, then U(s_1) = W_1 and is in P_{X_1}((t_1, W_1)) by construction. If s_1 is nontrivial, then also by Lemma 5 we know s_1 ∈ P_{X_1}((t_1, W_1)). Similarly, s_2 = s|_{X_2} ∈ P_{X_2}((t_2, W_2)), so Algorithm 2 considers s_1 and s_2 at this step. Then one of the subsplits on X that the algorithm generates is s itself. We know s is nontrivial, so (t → s) is added to M(P_{X_1}, P_{X_2}).
The following theorem establishes the second Requirement in the SCD case.
Theorem 7 If there exists a tree topology τ with taxon set X such that τ|_{X_1} ⊆ P_{X_1} and τ|_{X_2} ⊆ P_{X_2}, then τ ⊆ M(P_{X_1}, P_{X_2}).

Proof We proceed via recursive proof by induction over all PCSPs (t → s) ∈ τ, with s = {Y, Z}. Our base case is the algorithm's first state (t, W) = (π_X, X), t_1 = π_{X_1}, and t_2 = π_{X_2}. We know s only has one ancestor, π_X, which restricts to π_{X_1} in P_{X_1} and to π_{X_2} in P_{X_2}. This satisfies the criteria of Lemma 6, so (t → s) will be in M(P_{X_1}, P_{X_2}).
Our inductive step for general (t → s) ∈ τ uses the same argument, but uses the inductive assumption that PCSP (π_τ(t) → t) was previously added to the output. Given this assumption, we will show that the algorithm constructs the state triplet ((t, U(s)), (a_1|_{X_1}, U(s) ∩ X_1), (a_2|_{X_2}, U(s) ∩ X_2)). If the algorithm reaches this state, then Lemma 6 guarantees that (t → s) will be in M(P_{X_1}, P_{X_2}). Lemma 5 guarantees that such a_1|_{X_1} and a_2|_{X_2} exist.
Since (π_τ(t) → t) has already been visited and |U(s)| > 2, we know that there is a triplet in the stack with (t, U(s)) as the first component. Considering the other components of this triplet, if t restricts to a subsplit in P_{X_1}, then a_1 = t and u_1 = t|_{X_1}. If not, then t and s have the same most recent ancestor that restricts to a subsplit in P_{X_1}, so the algorithm passes it along as u_1. Either way, (a_1|_{X_1}, U(s) ∩ X_1) is the second component. The same argument holds for X_2 and the third component. This is exactly the state triplet we require, so by Lemma 6, we know (t → s) ∈ M(P_{X_1}, P_{X_2}). Therefore, by induction, all (t → s) ∈ τ are in M(P_{X_1}, P_{X_2}), so τ ⊆ M(P_{X_1}, P_{X_2}).
Finally, we show that Algorithm 2 runs in O(n_{P_1} n_{P_2}) time, where n_{P_i} is the number of PCSPs in reference support P_{X_i}. We know Algorithm 2 only visits each subsplit and clade triplet ((t, W), (t_1, W_1), (t_2, W_2)) at most once.
∂_{s'|t'} q|_X(t → s) = Σ_{a : a|_X = t} Σ_{d : d|_X = s} ∂_{s'|t'} q(a →* d). (A7)

One building block we need is D_q(s'|t'; a) = ∂_{s'|t'} q((t', U(s')) →* a | (t', U(s'))). Note that computing D_q(s'|t'; a) is a constant time calculation after accumulating a table of path probabilities in linear time before the gradient calculation. Following an argument similar to Eq. A4, we drop terms that are constant with respect to s'|t' and see that

∂_{s'|t'} q(a) = q(t') ∂_{s'|t'} q((t', U(s')) →* a | (t', U(s'))) = q(t') D_q(s'|t'; a). (A8)

Furthermore, by identical reasoning we calculate the derivative of the path probabilities,

∂_{s'|t'} q(a →* d) = ∂_{s'|t'} [q(a) q(a →* d | a)]
= q(a) ∂_{s'|t'} q(a →* d | a) + ∂_{s'|t'} q(a) · q(a →* d | a)
= q(a) q(a →* t' | a) ∂_{s'|t'} q((t', U(s')) →* d | (t', U(s'))) + q(t') D_q(s'|t'; a) q(a →* d | a)
= q(a) q(a →* t' | a) D_q(s'|t'; d) + q(t') D_q(s'|t'; a) q(a →* d | a). (A9)

We combine Eqs. A6 and A7 to find our KL derivative,

∂_{s'|t'} KL(p|_X ‖ q|_X) = -Σ_{(t→s)} [p(t → s) / q|_X(s|t)] ∂_{s'|t'} q|_X(s|t).

For regularization, we sum the squared differences between each softmax parameter v_{s|t} and the average v̄_{·|t} of all of the softmax parameters in its parent's conditional distribution:

R = Σ_t Σ_s (v_{s|t} - v̄_{·|t})^2.

Differentiation leads to the gradient ∂R/∂v_{s|t} = 2(v_{s|t} - v̄_{·|t}), which encourages conditional probabilities to become more similar when used in gradient descent. We use a regularization weight λ multiplied by the penalty function R.
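The penalty described here, restricted to one parent's conditional distribution, can be sketched numerically. This is our own illustration: since the deviations of the softmax parameters v_{s|t} from their mean sum to zero, the gradient of the sum of squared deviations simplifies to twice the deviation.

```python
import numpy as np

def penalty_and_grad(v, lam=0.05):
    """Return (lam * R, lam * dR/dv) for one conditional distribution's
    softmax parameters v, where R sums squared deviations from the mean.
    dR/dv_j = 2 (v_j - mean(v)) because the deviations sum to zero."""
    dev = v - v.mean()
    return lam * float(dev @ dev), lam * 2.0 * dev

# Equal parameters incur no penalty; unequal ones are pulled toward the mean.
p0, g0 = penalty_and_grad(np.array([1.0, 1.0, 1.0]))
assert p0 == 0.0 and np.allclose(g0, 0.0)
p1, g1 = penalty_and_grad(np.array([0.0, 2.0]))
assert np.isclose(p1, 0.1) and np.allclose(g1, [-0.1, 0.1])
```

During gradient descent, this term is simply added to the KL gradient with weight λ, nudging each conditional distribution toward uniformity where the references provide little information.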
Repeating our simulated and real-world data vbsupertree experiments with λ = 0 reproduces our results from Sect. 3. Increasing the penalty to λ = 0.05 leads to similar results for both our simulated data (Fig. 10) and our HCV data (Fig. 11). We see smooth operation of the vbsupertree algorithm, but with slightly higher converged KL divergence versus the truth. This is to be expected, as regularization penalties bias results on training data in return for better performance in other areas. Finally, raising the penalty higher to λ = 0.50 introduces an interesting effect. The KL versus truth plot consistently falls, then rebounds slightly. We intend to explore the effects of different λ values in future work adapting machine learning validation techniques to improve training SBNs (Figs. 12, 13).

Fig. 6
Fig. 6 The progression of the loss function (left panel) and the progression of the KL-divergence versus the truth (right panel) over 50 iterations of vbsupertree applied to simulated data

Fig. 7
Fig. 7 The progression of the loss function (left panel) and the progression of the KL-divergence versus the truth (right panel) over 50 iterations of vbsupertree applied to the HCV dataset

Fig. 8
Fig. 8 Example starting trees for finding the mutual PCSP support

Fig. 9
Fig. 9 Trees resulting from the mutual PCSP algorithm example applied to the trees from Fig. 8

Fig. 10
Fig. 10 The progression of the loss function (left panel) and the progression of the KL-divergence versus the truth (right panel) over 50 iterations of vbsupertree applied to simulated data with penalty 0.05

Fig. 12
Fig. 12 The progression of the loss function (left panel) and the progression of the KL-divergence versus the truth (right panel) over 50 iterations of vbsupertree applied to simulated data with penalty 0.50