Speeding up iterative applications of the BUILD supertree algorithm

The Open Tree of Life (OToL) project produces a supertree that summarizes phylogenetic knowledge from tree estimates published in the primary literature. The supertree construction algorithm iteratively calls Aho's Build algorithm thousands of times in order to assess the compatibility of different phylogenetic groupings. We describe an incrementalized version of the Build algorithm that is able to share work between successive calls to Build. We provide details that allow a programmer to implement the incremental algorithm BuildInc, including pseudo-code and a description of data structures. We assess the effect of BuildInc on our supertree algorithm by analyzing simulated data and by analyzing a supertree problem taken from the OpenTree 13.4 synthesis tree. We find that BuildInc provides up to a 550-fold speedup for our supertree algorithm.


INTRODUCTION
The Open Tree of Life (OToL) project summarizes phylogenetic knowledge from tree estimates published in the primary literature. Curators for the project import published trees, associate the tip labels of the trees with standardized taxonomic labels, and correct errors that occurred during the deposition of the trees into repositories. OToL also produces a synthesis tree (Hinchliff et al., 2015) that combines hundreds of input phylogenies with a comprehensive taxonomic tree from the Open Tree Taxonomy (Rees & Cranston, 2017, OTT hereafter). This ''synthetic tree'' is a supertree, that is, a tree produced by combining multiple input trees and having a leaf label-set that is the union of the leaf label-sets of the input trees (Gordon, 1986).
Our supertree method is intended to summarize and transparently represent the published input trees, not to produce a phylogeny estimate that is more accurate than the inputs (Redelings & Holder, 2017). Each edge of the supertree corresponds to a supporting branch in one of the input trees. The synthetic tree can be used as a comprehensive phylogeny of living and extinct taxa. It can also be used as a means of navigating the OToL curated collection of published input trees, and of exploring conflict between them. The OToL portal at https://tree.opentreeoflife.org allows browsing or downloading the latest release of the synthetic tree. It also allows for uploading and curating input phylogenies.
The OToL synthetic tree is created by an algorithm that will add a grouping from an input tree to the full tree if that grouping is compatible with the previously added groups. More specifically, the supertree algorithm iterates over branches of input phylogenies to determine if the groups subtending each internal branch can be added to the synthesis tree (Redelings & Holder, 2017). The order in which input phylogenies are considered is an input to the algorithm, and this order is provided by data curators for the OToL project. The taxonomy tree is always the last phylogeny to be considered.
The current implementation of the supertree algorithm is fast enough to allow the full supertree to be updated periodically. A faster implementation would allow users to explore the effects of differing inputs, such as a different set of phylogenies or a different ranking of trees. This would enable constructing alternative synthetic trees on demand via the web interface for a variety of users.
The core algorithm for determining compatibility of each potential grouping with previously added groups is the classic Build algorithm (Aho et al., 1981). Build is invoked thousands of times during the construction of the supertree using OToL's pipeline (Redelings & Holder, 2017). Each invocation applies Build to the set of all previously added groups (which are already known to be consistent) plus one new group. Here we describe improvements over naively calling Build iteratively. Implementing the incrementalized BuildInc algorithm has resulted in dramatic reductions in running times for the key steps in the supertree pipeline. This new algorithm will allow the OToL project to offer more frequent updates to the synthetic tree and to explore features such as on-demand supertree construction under the direct control of users of the project's web-services.
While we focus on the use of Build within the OToL project, Build is an ingredient in a wide range of algorithms, such as computing a consensus of equally likely trees (Sanderson, McMahon & Steel, 2011), inferring species trees from gene trees (Roch & Warnow, 2015), determining orthology and paralogy relations between genes in a gene family (Lafond & El-Mabrouk, 2014), and hierarchical clustering (Chatziafratis, Niazadeh & Charikar, 2018). In order to determine what opportunities are opened up by an incrementalized version of Build, researchers must examine each algorithm that uses Build. However, we highlight two possibilities. First, an incrementalized Build might be used to construct an online version of an existing algorithm, that is, an algorithm in which input is added in batches and that produces a complete result after each batch. A batch could represent a new data point, or it could represent the next gene in a genome-wide scan. Second, an incrementalized Build might aid in discovering a compatible subset of trees or bipartitions instead of merely declaring failure when a given set is incompatible. Thus, an incrementalized version of Build has implications beyond the OToL project.

Context for Build in the current OToL pipeline
The current pipeline divides the full supertree problem into subproblems as described in Redelings & Holder (2017). These subproblems cover regions of the taxonomy tree in such a way that adjacent sub-problems contain common nodes (higher taxa) but do not contain common edges. Each subproblem also contains the relevant region of any input tree that coincides with the region of the taxonomy tree. Repeated calls to Build produce a supertree solution to each subproblem. Solving the subproblems is a time-consuming step for the entire pipeline. Increasing the speed of the subproblem-solver would speed up the pipeline, but would also allow it to handle larger subproblems. Larger subproblems occur when OToL curators add more input phylogenies to the pipeline and this larger set of inputs conflicts with more groupings found in the taxonomy. Larger subproblems also occur when the final location of an incertae sedis taxon is far from its initial location in the taxonomy (see Redelings & Holder, 2019 for a discussion of the handling of incertae sedis taxa). The increased speed might eliminate the need for decomposing the supertree problem into subproblems.

METHODS
We begin the methods description with a review of the Build algorithm and a description of the new optimizations to the algorithm that are the focus of this paper. After the algorithmic discussion, we explain the simulation study used to evaluate performance.

Overview of the Build algorithm
The Build algorithm is a recursive algorithm that determines if a set of rooted triplets of the form x,y|•z are jointly compatible (Aho et al., 1981). Here, we use the symbol • to stand for the root of the tree to emphasize the rooting of the triple. If the triplets are compatible, the algorithm constructs a tree that displays all of them. The algorithm works by creating a graph for each recursive level. If the graph forms a single connected component, then the Build algorithm has detected an incompatibility between the set of triplets. The graph contains a node corresponding to each leaf that is relevant to the current recursion level. At the root-most level of the recursion every leaf is relevant. A rooted triple is relevant to a recursion level only if each of the three leaves is relevant at that level. For each relevant triple x,y|•z an edge is added between x and y. Since the leaves in the more closely related pair {x,y} are sometimes called the ''cluster'' of the triple, the resulting graph is called the ''cluster graph'' for the given set of triplets.
If no triplets are incompatible on a level, then the procedure applies Build to each connected component of the cluster graph. This is the recursive aspect of the algorithm. The nodes that form a connected component at one level constitute the relevant nodes for the next recursive call. Thus the relevant leaf set is partitioned at each level. However, the edges between nodes are not passed to the next recursive level. Instead they are constructed from the set of relevant triplets at each level. A rooted triple that is relevant at one level may not be relevant at a subsequent recursion level. Thus, not all rooted triplets have a corresponding edge at the next level. A set of triplets is jointly compatible only if the Build algorithm finds no incompatibility at any recursive level.
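As a concrete illustration, the recursion just described can be sketched in a few lines of Python. This is a naive reference implementation over rooted triplets, not the paper's optimized version; the function name and the triplet encoding ((x, y), z) for x,y|•z are our own.

```python
def build(taxa, triplets):
    """Naive sketch of Aho's BUILD: return a nested-tuple tree displaying all
    relevant triplets ((x, y), z) meaning x,y|.z, or None if incompatible."""
    taxa = set(taxa)
    if len(taxa) <= 2:
        return tuple(sorted(taxa))
    # Build the cluster graph implicitly with union-find: one edge x-y per
    # relevant triplet x,y|.z (all three leaves must be in the current taxa).
    parent = {t: t for t in taxa}
    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]  # path compression
            t = parent[t]
        return t
    for (x, y), z in triplets:
        if x in taxa and y in taxa and z in taxa:
            parent[find(x)] = find(y)
    components = {}
    for t in taxa:
        components.setdefault(find(t), set()).add(t)
    if len(components) == 1:
        return None  # single connected component => incompatible
    children = []
    for component in components.values():
        subtree = build(component, triplets)  # recurse on each component
        if subtree is None:
            return None
        children.append(subtree)
    return tuple(children)
```

For example, the single triplet a,b|•c is displayed by a tree, whereas the pair a,b|•c and a,c|•b connects all three leaves into one component and is rejected.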

Rooted splits
The Build algorithm was originally designed to work on sets of rooted triplets. However, when working with evolutionary trees, it is more convenient to work with sets of rooted splits, because each branch corresponds to a rooted split. A rooted split σ on taxon set T is written σ = σ1 |• σ2, where σ1 and σ2 are non-overlapping subsets of T. We refer to σ1 as the include group or ''cluster'', and σ2 as the exclude group. The root of the tree is on the side of the exclude group. Note that σ1 ∪ σ2 may be smaller than T.
It is straightforward to extend Build to operate on a set of rooted splits by treating each rooted split as shorthand for all the rooted triplets that it implies. Build is therefore commonly modified to take a set of rooted splits instead of a set of rooted triplets.
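For instance, a split with include group {a, b, c} and exclude group {d, e} implies the six triplets a,b|•d, a,c|•d, b,c|•d, a,b|•e, a,c|•e, and b,c|•e. A minimal sketch of this expansion follows (the helper name is our own; in practice Build is modified to add cluster-graph edges among the include group directly rather than enumerating triplets):

```python
def triplets_from_split(include, exclude):
    """Enumerate all rooted triplets ((x, y), z) meaning x,y|.z implied by the
    rooted split include |. exclude: x and y from the include group, z from
    the exclude group."""
    return [((x, y), z)
            for i, x in enumerate(include)
            for y in include[i + 1:]
            for z in exclude]
```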

The relationship of the current optimizations to previous work
Previous approaches to speeding up Build have tried to improve the order of computation for the analysis at each recursive level. Deng & Fernández-Baca (2018) decreased the order to O(M log² M) for a set of phylogenies with M = (number of edges) + (number of leaves). This is accomplished by decreasing the time for the analysis on each recursive level from O(M²) to O(log² M). It also involves sharing work between different levels of the recursive algorithm by retaining a graph between successive levels instead of creating a new graph from scratch on each level.
Our approach differs from previous work because we attempt to share work between successive calls to Build. When calling Build with one set of splits and then calling Build again with one additional split, we seek to reuse work from the first call while performing the second. This requires saving work from the first call in an object that represents the solution to Build, and passing that solution object as input to the second call. We do not attempt to decrease the order of the computation as a function of the number of leaves or edges. Instead we base our incremental approach on a naive algorithm that has total cost O(M³). It may be possible to create an algorithm that is both incremental and has a more favorable order of computation for large M, but we do not attempt that here.

Data structures used to explain implementations of Build
In order to explain our incremental Build algorithm in an understandable manner, we seek to introduce the full complexity of the algorithm in stages. We thus begin by describing a version of the traditional Build algorithm that saves its work in a solution data structure. This will allow us to focus on changes to the algorithm instead of changes to the data structures when we introduce the first incremental version of the algorithm at a later stage. Our version of the Build algorithm constructs the connected components of the cluster graph without explicitly constructing the cluster graph. That is, our algorithm does not directly represent the edges of the cluster graph in memory.
We begin by introducing the Solution and Component record types that we will use to store temporary work during the algorithm. Using the terminology common to object-oriented programming, we will refer to an instance of these record types residing in a computer's memory as an ''object.'' The following sections will describe the distinct steps in Build using some common programming notational conventions: (1) parentheses after the name of an algorithm denote the arguments supplied to that algorithm, and (2) the period (or ''dot'') notation after the name of a data object refers to a field within that record.
A Solution object S contains the following fields:
S.T: an array of taxon identifiers relevant to the solution level.
S.C: an array of Component objects, one for each non-trivial connected component of the relevant taxa for this level of Build.
S.M: an array that maps a leaf index to the component to which it is assigned.
A Component object C contains the following field:
C.T: a linked list of taxon identifiers for every taxon assigned to this component.
An initial Solution object can be created before applying the Build algorithm. Upon termination, the result of the algorithm will be emitted by storing it in that Solution object. As mentioned above in the overview of the algorithm, each level of Build's recursion starts with a set of taxa and a set of relevant splits. Connected components of taxa are created and merged as each input split is considered. In our object-oriented description of the algorithm, this corresponds to creation of a Solution object for the current level of recursion. That Solution object will hold a set of Component objects, each of which will store information about the connected components created during the algorithm. Each Component represents a sub-problem that needs to be solved. If no incompatibility is detected at the current level, then the algorithm recursively calls Build to create a Solution object for each Component.
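In Python, the two record types might be sketched as dataclasses. This is our own rendering, not the paper's implementation: the paper uses a leaf-indexed array for S.M and a linked list for C.T, whereas here S.M is a dict and both taxon collections are plain lists.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Component:
    taxa: list                                    # C.T: taxa assigned to this component
    solution: Optional['Solution'] = None         # C.S: solution to this sub-problem
    splits: list = field(default_factory=list)    # splits assigned to this component

@dataclass
class Solution:
    taxa: list                                        # S.T: taxa relevant at this level
    components: list = field(default_factory=list)    # S.C: non-trivial components
    component_of: dict = field(default_factory=dict)  # S.M: taxon -> component (or None)
    implied: list = field(default_factory=list)       # S.I: splits implied at this level
```

Creating a blank Solution then amounts to constructing `Solution(taxa=...)` with every other field empty.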
Thus the Solution object forms a tree by indirect recursion that mirrors the recursive structure of Build: each Solution object can contain some Component objects, and each Component object contains a Solution object (see Section B.2 of the Supplementary Information File). We refer to the tree associated with a Solution object as a ''solution data structure''.

The Solution and Component objects for a level of Build
The definitions of the Solution and Component record types are given in Fig. 1. Creating a new Solution object is done with a procedure referred to as CreateBlankSolution(T), where T is a collection of taxon identifiers. A new Solution object, S, has only its S.T field initialized; that field holds a copy of the taxon identifiers, so initialization has computational complexity O(|T|).

Defining the Build algorithm
To start the algorithm, an initial Solution object can be created and initialized by filling its S.T field with the complete taxon set, T. All other fields of the Solution object are initially empty. We can think of the entire Build algorithm as taking a taxon set T and a set of rooted splits, Σ. The full Build algorithm consists of the following steps: 1. initialize a Solution object, S, to contain the taxon set T (we will refer to this set of operations as a function: CreateBlankSolution(T)); 2. call a helper algorithm BuildA(S, Σ); and 3. return the result stored in object S.
The BuildA algorithm describes the set of operations to be performed at a single level of recursion as well as how to call the next levels of recursion.

Dissecting the operations required at each level of BuildA
The operations performed in BuildA(S, Σ) can be separated into several steps:
1. RemoveIrrelevantSplits(S, Σ): remove splits from Σ that are not relevant at the current level. Splits that are removed here as irrelevant are implied by the split S.T |• (T − S.T) and are recorded on the solution in the field S.I.
2. MergeComponents(S, Σ): find the connected components of the cluster graph by merging any two components that overlap the include group (''cluster'') of a split in Σ.
3. Fail(S): return Failure if there is only one connected component.
4. AssignSplitsToComponents(S, Σ): assign each remaining split in Σ to the component that contains its include group.
5. Recursively call BuildA on the sub-problem for each non-trivial component C, and return Failure if that call fails.
The RemoveIrrelevantSplits, MergeComponents, and AssignSplitsToComponents steps all have the form of a for-loop that iterates over splits in Σ and modifies S. This fact will be used later, when we seek to incrementalize these steps. It suggests a naive incrementalization strategy of simply iterating over any newly added splits to add their effects to S.
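Putting these steps together, a naive (non-incremental) level of Build over rooted splits might look like the following Python sketch. Splits are encoded as (include, exclude) pairs, the tree is returned as nested tuples, and failure is signalled by returning None; this is our own illustrative rendering, not the paper's implementation.

```python
def build_from_splits(taxa, splits):
    """Naive BUILD over rooted splits (include, exclude): returns a nested
    tuple tree displaying the splits, or None on incompatibility."""
    taxa = set(taxa)
    if len(taxa) <= 2:
        return tuple(sorted(taxa))
    # Step 1: a split stays relevant iff its exclude group intersects taxa.
    relevant = [(inc, exc) for inc, exc in splits if taxa & set(exc)]
    # Step 2: merge components over each include group (cluster-graph
    # connectivity via union-find, without materializing the graph).
    parent = {t: t for t in taxa}
    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t
    for inc, _exc in relevant:
        for t in inc[1:]:
            parent[find(t)] = find(inc[0])
    components = {}
    for t in taxa:
        components.setdefault(find(t), set()).add(t)
    # Step 3: Fail if the cluster graph is a single connected component.
    if len(components) == 1:
        return None
    # Steps 4-5: assign each split to its component and recurse.
    children = []
    for component in components.values():
        comp_splits = [(inc, exc) for inc, exc in relevant if inc[0] in component]
        subtree = build_from_splits(component, comp_splits)
        if subtree is None:
            return None
        children.append(subtree)
    return tuple(children)
```

Because the merging pass already connects every include group, looking up the first include-group taxon (`inc[0]`) is enough to route a split to its component.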

The RemoveIrrelevantSplits(S, Σ) step
We assume that each split in Σ has an include group fully contained in S.T. This means that a split is relevant at this recursive level if and only if its exclude group intersects the relevant taxon set S.T.
This procedure examines each split σ in Σ. If σ is not relevant, then σ is added to the collection S.I of implied splits and removed from Σ. Splits removed in this step were relevant at previous recursion levels, but are irrelevant for the solution at or below the recursive level S because they do not separate any taxa in S.T from other taxa in S.T. Determining which splits to remove has computational complexity O(|Σ| × |T|), since the exclude group of each split must be checked for intersection with S.T. Although this is not central to the operation of the algorithm, we note that the splits in S.I are implied by the branch of the solution tree leading to S. Thus the solution tree implies all the splits in Σ, and each split σ is recorded on the branch that implies it.
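The relevance test can be sketched as follows (our own function name; splits again as (include, exclude) pairs):

```python
def remove_irrelevant_splits(relevant_taxa, splits):
    """Partition splits into (still_relevant, implied): a split stays relevant
    at this level iff its exclude group intersects the relevant taxon set."""
    relevant_taxa = set(relevant_taxa)
    still_relevant, implied = [], []
    for include, exclude in splits:
        if relevant_taxa & set(exclude):
            still_relevant.append((include, exclude))
        else:
            implied.append((include, exclude))  # recorded in S.I
    return still_relevant, implied
```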
It is not necessary to perform this step at the root level of the recursion, as all tips are inside of T for that solution object.

The MergeComponents(S, Σ) step
The MergeComponents step partitions taxa according to which connected component of the cluster graph they are in. This step can also be thought of as partitioning taxa in T according to an equivalence relation, where taxa are equivalent if they are in the same connected component of the cluster graph. Note that in this version of Build we construct the connected components of the cluster graph without constructing the graph itself.
To perform the partitioning, we begin by assigning each taxon t ∈ T to its own connected component {t}. This corresponds to a graph with no edges. We then consider each split σ ∈ Σ and merge any components that overlap σ1 into a single connected component. This is equivalent to adding edges to the cluster graph connecting all taxa in σ1.
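The merging pass can be sketched in Python as follows. This is a simplified rendering of the scheme described in this section: components are lists of taxa, taxa map to their component through a dict standing in for S.M, smaller components are always merged into larger ones, and None marks a trivial singleton component (the optimization discussed later in this section). The function name and representations are our own.

```python
def merge_components(taxa, splits):
    """Partition taxa into connected components of the cluster graph by
    merging every component that overlaps a split's include group.
    Returns a map taxon -> component list (None = trivial singleton)."""
    comp_of = {t: None for t in taxa}
    for include, _exclude in splits:
        merged = None  # the growing component for this split's include group
        for t in include:
            c = comp_of[t]
            if c is None or c is merged:
                continue
            if merged is None:
                merged = c
            else:
                if len(c) > len(merged):
                    c, merged = merged, c  # always merge smaller into larger
                merged.extend(c)
                for u in c:
                    comp_of[u] = merged    # rewrite entries for the smaller side
        if merged is None:
            merged = []                    # only trivial components overlapped
        for t in include:
            if comp_of[t] is None:
                merged.append(t)
                comp_of[t] = merged
    return comp_of
```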
Our implementation represents the components as two related maps between components and elements: (1) the linked list of taxon identifiers included in each component (the C.T field of the component C), and (2) the array that maps each taxon to its component (stored in S.M).
Detecting mergers requires considering every taxon in every include group, and has order O(|Σ| × |T|). There can be at most |T| mergers. The total cost of merging linked lists is just O(|T|), since merging two linked lists is an O(1) operation. Unfortunately, when we merge two components C1 and C2 where C2 is smaller, we must also rewrite the array entry for each taxon in C2. The cost per taxon is the number of times that taxon is part of a merger in which it is in the smaller component. Since this can happen up to log2 |T| times per taxon in the worst case, the cost is O(|T| log2 |T|).

The Fail(S) step
Fail can be done in O(1) by having an array S.C of active non-trivial components, and checking the size of that array.If the size is 1, and the taxon set for the single component has the same size as S.T, then the operation fails.

The AssignSplitsToComponents(S, Σ) step
For each split σ ∈ Σ we have a guarantee that all the taxa in σ1 are in the same component, by the definition of the cluster graph. We refer to the component that contains the include group of σ as its ''corresponding component''. We can determine the corresponding component of any split σ by simply looking up the component for the first element of σ1 in S.M. We implement the assignment of a split σ to a component C by placing a reference to σ into C.
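This lookup-based assignment can be sketched as follows (our own rendering, with components identified by a representative label in a `comp_of` map standing in for S.M):

```python
def assign_splits_to_components(comp_of, splits):
    """Group each split with its corresponding component, found by looking up
    the component of the first taxon in the split's include group."""
    by_component = {}
    for include, exclude in splits:
        by_component.setdefault(comp_of[include[0]], []).append((include, exclude))
    return by_component
```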

Simple optimization: trivial and non-trivial components
In a solution object S, we store an array S.M of pointers to components. We also implement S.T as an array, so that if t = S.T[i] then taxon t belongs to component S.M[i].
Trivial components are defined as components that contain only one taxon. An optimization to improve memory usage and speed is to simply use a Null reference in the S.M array to indicate that the corresponding taxon is in a trivial component, rather than creating a component object for every trivial component. This requires only minor tweaks to the MergeComponents step to create a new Component object on-the-fly whenever a set of trivial components are merged into each other with no non-trivial component involved in the merger.

Example #1: creating a solution
The second-level BuildA removes the split and adds it to C1.S.I. It then calls MergeComponents, Fail, and AssignSplitsToComponents with no effects. No non-trivial components are created, and the call to BuildA succeeds; this level contains only the trivial components {A1} and {A2}. Since there are no non-trivial components, another round of recursion is not performed.

Stepwise construction of BuildInc
The goal of the incrementalized algorithm is to reuse previous work from computing Build(T, Σ) when computing Build(T, Σ + Δ). The incrementalized algorithm has the signature BuildInc(S, Δ). Although the taxon set T and the previously added splits Σ are not explicitly present in the signature of BuildInc, they are contained within S. S may also contain intermediate results from a previous BuildInc call, so that BuildInc can reuse previous work. If BuildInc(S, Δ) succeeds, then S is modified in place to contain the new splits Δ in addition to Σ. However, if BuildInc fails, then S is unmodified. In order to simplify the description of the incrementalized algorithm BuildInc, we first introduce two partially incrementalized algorithms BuildInc′ and BuildInc′′. This allows us to simplify the explanation of BuildInc by introducing concepts in a step-wise fashion. The BuildInc′ and BuildInc′′ algorithms both mutate the solution object even when the algorithm returns False. The fully incrementalized algorithm BuildInc adds the ability to track changes that are made to the solution object, and then roll them back if the algorithm ultimately returns False. We now describe the key differences in the steps of the first incrementalized algorithm BuildInc′. We will use the ′ and ′′ symbols to decorate the names of algorithms based on whether they are used in BuildInc′ or BuildInc′′.
The incrementalized algorithm allows adding input splits either one-at-a-time, or in larger batches. We assume that the input splits have been partitioned into a series of non-overlapping subsets Δi according to some strategy that is chosen in advance. Each subset constitutes a batch. The simplest strategy is to set Δi = {σi} for some split σi, so that splits are added one-at-a-time. However, it is possible to incorporate input splits into the solution object in larger batches, and we explore this option in the Results section below.
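The batching driver is independent of the details of the incremental algorithm itself; a minimal sketch follows, where `apply_batch` is a stand-in for an incremental call that returns True when the batch is accepted. All names here are illustrative, not the paper's API.

```python
def add_in_batches(solution, splits, apply_batch, batch_size=1):
    """Partition `splits` into consecutive batches of `batch_size` and apply
    each with apply_batch(solution, batch) -> bool; return accepted batches."""
    accepted = []
    for i in range(0, len(splits), batch_size):
        batch = splits[i:i + batch_size]
        if apply_batch(solution, batch):
            accepted.append(batch)
    return accepted
```

With `batch_size=1` this reproduces the one-at-a-time strategy Δi = {σi}; larger values trade fewer incremental calls against coarser rejection when a batch fails.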
Given a partition of the input splits, the general structure for all of the incrementalized algorithms is: 1. initialize a Solution object, S, to contain the taxon set T (we will refer to this set of operations as a function: CreateBlankSolution(T)); 2. call BuildIncA(S, ∅); 3. for each subset Δi, call BuildIncA(S, Δi); and 4. return the result stored in object S.
In the descriptions that follow, Σ will be used to denote the set of splits that have been added in previous increments of BuildInc without failing. Thus, Σ + Δ would be the set of splits included if the current round of BuildInc succeeds, and Δ is the set of splits that are new to the current increment of BuildInc.

BuildIncA′(S, Δ): RemoveIrrelevantSplits′(S, Δ) step
This is identical to RemoveIrrelevantSplits except that (for reasons described below) S.I may already contain some splits before this step begins.

BuildIncA′: MergeComponents′(S, Δ) step
We seek to construct the connected components of the cluster graph for the splits Σ(S) + Δ from the connected components for Σ(S). We will refer to connected components under the cluster graph for Σ(S) as ''original'' components. The addition of the splits Δ may add edges to the cluster graph, but it cannot remove any. Therefore, the addition of Δ may merge original components, but it cannot split them. In order to compute the new connected components, we simply need to iterate over splits in Δ and continue merging components.
Characterizing how the components for Σ + Δ are related to any original components is central to our approach. The non-trivial components for Σ + Δ can be characterized as new, modified, or unmodified:
• new (a component composed entirely of previously-trivial components)
• unmodified (a non-trivial component present in the original S.C)
• modified (a non-trivial component that contains at least one original non-trivial component)
If S is initially empty, then all non-trivial components will be new. If S is not initially empty, then all three types of non-trivial components can occur. If an original component C′ is a subset of a component C, then we say that C′ is subsumed by C. We now describe how these three classes of non-trivial components retain original solutions.
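This classification can be sketched as a small predicate (our own helper, with components represented as frozensets of taxa; the original non-trivial components are pairwise disjoint):

```python
def classify_component(component, original_nontrivial):
    """Classify a non-trivial component of the updated cluster graph relative
    to the original non-trivial components it subsumes."""
    subsumed = [c for c in original_nontrivial if c <= component]
    if not subsumed:
        return 'new'                 # built only from previously-trivial components
    if len(subsumed) == 1 and subsumed[0] == component:
        return 'unmodified'          # already present in the original S.C
    return 'modified'                # strictly contains an original component
```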
A new (non-trivial) component C will have an empty solution C.S. There is no original solution to retain, since all its subsumed components are trivial.
For unmodified (non-trivial) components, we retain the single original Solution object in the field C.S. We will modify the retained solution if any splits from Δ are assigned to C.
Modified (non-trivial) components have a taxon set C.T that was not a connected component in the previous iteration of BuildIncA′. However, the Component object does not need to be created de novo; instead, we repurpose the Component object from one of the subsumed non-trivial components to represent the new, larger equivalence class. The previous solution in C.S is not the solution for the enlarged taxon set. However, we retain the solutions for all subsumed non-trivial components in an additional field of the component data structure, C.O.

BuildIncA′: AssignSplitsToComponents′(S, Δ) step
Splits in Δ must be assigned to their corresponding components by placing them in C, just as in BuildA. However, splits in Σ are treated differently depending on which original component they were previously assigned to.
Splits from Σ(S) that were assigned to an unmodified component C are still associated with that component. This is because they were contained in C.S, and it has been retained. Splits from Σ(S) that were assigned to a modified component C are no longer associated with C because C.S has been discarded. We therefore iterate over original components C′ that were subsumed by C and append their splits to C (see Section B.2 of the Supplementary Information File).
New components cannot contain any splits from Σ(S), so we do not need to do anything extra for them.

BuildIncA′: recursive call to BuildIncA′ step
We seek to construct a solution for each of the components C ∈ S.C using BuildIncA′. In all cases we do this by calling BuildIncA′(C.S, C.Δ) to add the splits in C.Δ to the solution C.S.
For unmodified components, C.S is the original solution for C, so previous work is re-used. Any corresponding splits from Δ are added to this original solution.
For modified components, C.S is a Null reference indicating that no solution has been calculated. We construct an empty Solution with taxa C.T. The set C.Δ contains splits corresponding to C from both Σ(S) and Δ. Therefore, despite the existence of previous work in C.O, we do not manage to re-use any of it.
For new components, C.S is also Null. We again construct an empty Solution with taxa C.T. The set C.Δ contains only splits from Δ. There is no previous work that could be re-used here.

Example #2: adding a split to an unmodified component
When incrementally adding a split that resolves a node below the top level, BuildIncA′ passes the split down the tree until it reaches the level of the resolved node. At levels higher than the resolved node, the split is assigned to one of the components, but does not modify it. At the resolved node, the split either modifies components or causes the creation of a new non-trivial component.
To see this, let us consider incrementally adding the split σ2 = a1 a2 |• a3 to a Solution object that contains σ1 = a1 a2 a3 |• b (Fig. 3). Here the split σ2 resolves a node in the original solution data structure.
Running BuildIncA′(S1, {σ2}) leaves the only non-trivial Component C1 at the top level unchanged. σ2 is assigned to C1 and ends up in C1.Δ. This leads to a call to BuildIncA′(S2, {σ2}). σ2 is not removed, so S2.I is unchanged. However, σ2 leads to the creation of a new non-trivial component C2 in MergeComponents′. σ2 is assigned to C2 and ends up in C2.Δ. Finally, we get a call to BuildIncA′(S3, {σ2}). Here σ2 is finally removed, and placed into S3.I. No non-trivial components are created, so we are done.

Example #3: merging two components
When incrementally adding a split that groups two children of a node in the solution tree, the components that correspond to the grouped children tend to re-emerge at a lower level. However, BuildIncA′ is not able to recognize this, and so must solve each of these re-emergent components from scratch.
To see this, let us consider incrementally adding a split that merges two of the original components (Fig. 4). BuildIncA′ handles merged components by extracting the splits from subsumed solutions. Therefore, C1.Δ contains all three splits, and not just the incrementally added split. We can see that the original component {A1, A2} re-emerges at a lower level as C3. However, BuildIncA′ solves it from scratch. The component {B1, B2} does not re-emerge.

BuildIncA′′: MergeComponents′′(S, Δ, O) step
The taxa of an original solution S′ ∈ O are connected solely by edges corresponding to splits in Σ(S′). This is because any split whose include group overlaps the component is assigned to (and only relevant to) that component. Therefore, the effect of the splits Σ(S′) is only to connect the taxa in S′.T. We may therefore iterate over original solutions S′ ∈ O and merge any two components that both overlap S′.T. This makes it unnecessary to extract the splits σ ∈ Σ(S′).

BuildIncA′′: AssignSplitsToComponents′′(S, Δ, O) step
The BuildIncA′′ algorithm needs to assign splits in Σ(S), Δ, and Σ(O) to their corresponding components. Each split in Δ is assigned to a component C and stored in C.Δ, just as in BuildA and BuildIncA′. However, there are two major differences from BuildIncA′.
First, BuildIncA′′ does not need to do anything for splits in Σ(S), because they are already assigned to their correct component. BuildIncA′ needed to extract splits from C.O and add them to Δ if C was a merged component. However, BuildIncA′′ passes C.O down to the next recursive level directly, so this is unnecessary. Note that since BuildIncA′′ no longer assigns splits from Σ(S) to C.Δ, C.Δ will consist only of splits from Δ.
Second, BuildIncA′′ must assign splits in Σ(O) that were received from the previous recursive level to their corresponding component. However, all splits for a subsolution S′ ∈ O have to end up in the same component. This is because S′.T must be entirely contained in one component. The splits of S′ must then be in the same component. Therefore, we may simply assign original solutions S′ to components in their entirety. We can determine which component a solution S′ goes to by looking up the component for any element in S′.T.

BuildIncA′′: RemoveIrrelevantSplits′′(S, Δ, O) step
There are two differences between BuildIncA′(S, Δ) and BuildIncA′′(S, Δ, O). First, O might contain a solution S′ to the problem we are trying to solve. In that case, we would like to reuse that solution. Second, when checking for implied splits, we need to check inside each original solution S′ ∈ O as well as inside Δ. We now discuss these differences in greater detail.

Reusing previous work in BuildIncA′′(S, Δ, O)
Suppose that O contains a solution S′ with the same taxon set as S. In that case, we would like to re-use the previous work in S′ instead of solving the problem from scratch.
We first note that Σ(S) must be empty, since O was non-empty, so it is safe to discard S and use S′ instead. Second, O is empty after removing S′. This is because S′ contains all the taxa in S.T, and there are no remaining taxa in S.T for any other solutions to contain. We therefore replace the call to BuildIncA′′(S, Δ, O) with BuildIncA′′(S′, Δ, ∅). This preserves the invariant that Σ(S) and O cannot both be non-empty.

Removing implied splits from original solutions
Recall that a split σ should be placed into S.I if and only if σ₂ does not intersect S.T. For an original solution S′ ∈ O, we claim that only splits in S′.I can satisfy this condition. If a split σ from S′ is not in S′.I, then σ₂ must intersect S′.T. But S′.T is a subset of S.T, so σ₂ intersects S.T as well and cannot go into S.I. Therefore, for each original solution S′, we only need to check the splits in S′.I.
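The criterion can be expressed as a small predicate over the splits of S′.I. In this hypothetical Python sketch, a split is represented as a pair of taxon sets (s1, s2); the representation is an assumption for illustration:

```python
# Hypothetical sketch of the implied-split test: a split (s1, s2) belongs in
# S.I iff s2 does not intersect S.T. Per the argument above, only splits
# already in S'.I need to be examined.

def partition_implied(orig_implied, S_taxa):
    """Split S'.I into splits that move to S.I and splits that stay."""
    moved, kept = [], []
    for s1, s2 in orig_implied:
        (moved if not (s2 & S_taxa) else kept).append((s1, s2))
    return moved, kept
```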
If any split in S′.I meets the criterion, then we remove it from S′.I (conceptually) and move it to S.I. However, if one of its splits is removed from a solution, then the solution no longer has the property that its splits are all in the same connected component. It is no longer a solution. In such a case, we say that the solution is ''punctured''.
In order to retain our invariants, we remove punctured solutions from O. However, we must retain the splits that they contain. For each punctured original solution S′, we copy the splits in S′.I that were not moved to S.I into the set ∆Σ. We move the child solutions S′.C_i.S into O. In this way, all the splits of S′ are retained: some in S.I, some in ∆Σ, and some in O.
Original solutions in O that are not punctured may be retained unmodified.
Note that replacing an original solution S′ ∈ O with its child solutions retains the invariants that solutions in O (i) have non-overlapping taxon sets and (ii) are contained within S.T. The taxon sets of the child solutions of a solution S′ are contained within S′.T and are non-overlapping. Since S′.T is contained within S.T and does not overlap with any other solutions in O, its child solutions must also be contained within S.T and not overlap any other solutions in O.
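The puncture-handling steps above can be sketched as follows. This is a hypothetical Python illustration with assumed dictionary fields (`"I"` for the implied splits, `"children"` for the child solutions S′.C_i.S):

```python
# Hypothetical sketch of handling a punctured solution S': leftover implied
# splits join the working split set, and S''s child solutions replace it in O,
# so every split of S' is retained somewhere.

def puncture(S_prime, moved_to_S_I, delta_sigma, O):
    """Distribute the contents of a punctured original solution (mutates args)."""
    for split in S_prime["I"]:
        if split not in moved_to_S_I:
            delta_sigma.append(split)        # retained as a new split
    O.extend(S_prime["children"])            # child solutions S'.C_i.S
```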

Example #4: merging two components and re-using original solutions
When incrementally adding a split that groups two children of a node, the components that correspond to the grouped children tend to re-emerge at a lower level. While BuildIncA(S, ∆Σ) re-solves these components from scratch, BuildIncA(S, ∆Σ, O) is able to re-use the original solutions to these components.
To see this, we show how BuildIncA(S, ∆Σ, O) solves the same problem that BuildIncA(S, ∆Σ) solved in Example #3 (Fig. 4). Unlike BuildIncA(S, ∆Σ), BuildIncA(S, ∆Σ, O) leads to C₁.∆Σ containing only the single incrementally-added split. Instead of extracting the splits of the original solutions S₂ and S₃, BuildIncA(S, ∆Σ, O) passes down the original solutions themselves.
The original component {A₁, A₂} re-emerges at a lower level, and is solved by re-using the original solution S₂. The other original solution S₃ is punctured, and its split B₁B₂|•A is added to S₄.I.

THE FULLY INCREMENTALIZED ALGORITHM BUILDINC
The incrementalized Build algorithms described above modify the original solution data structure as they execute. When these algorithms discover that the additional splits are incompatible with the previous solution, we cannot simply revert to the previous solution, because it has been modified. We must therefore recreate the previous solution data structure from scratch, discarding all of the saved work.
We address this problem by recording any change that the algorithm makes to the original solution. We can then reverse these changes in the case of failure. We call the extended algorithm BuildInc. Much of the description of BuildInc thus boils down to (i) specifying just what information must be recorded to reverse changes made by BuildIncA and (ii) some optimizations that avoid spending time recording information that will never be used.

An overview of the rollback approach
All modifications to each Solution object S are complete before any modifications are made to its children. We can therefore represent modifications to the solution data structure as a sequence R of modifications to individual Solution objects. If BuildIncA returns failure at the top level, we can then walk this sequence in reverse order, and roll back the changes that it describes. We create the record type RollbackInfo to record modifications made to an individual Solution object (Fig. 5).
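The record-and-reverse principle can be illustrated with a minimal Python sketch. The undo-closure representation here is an assumption for illustration; the paper records typed RollbackInfo objects instead:

```python
# Hypothetical sketch of the rollback principle: every mutation logs an undo
# action, and on failure the log is replayed in reverse order.

def set_logged(state, key, value, log):
    """Mutate state[key], logging how to restore the previous value."""
    if key in state:
        old = state[key]
        log.append(lambda: state.__setitem__(key, old))
    else:
        log.append(lambda: state.pop(key))
    state[key] = value

def rollback(log):
    """Undo all logged changes, most recent first."""
    for undo in reversed(log):   # later changes must be undone first
        undo()
    log.clear()
```

Replaying in reverse order matters: a later change may overwrite an earlier one, so the earlier undo must run last.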
Component mergers are the most interesting type of modification that BuildIncA makes to Solution objects. We can represent component mergers for each Solution object as a sequence of individual mergers of two components. For each component merger, we record enough information about the two original components to reverse the merger. It is then possible to reverse the mergers by walking the sequence in reverse order, and undoing each merger in turn. We create the record type MergeRollbackInfo to record an individual merger of two components (Fig. 5).
Mergers are of two types: mergers of two non-trivial components, or mergers of a non-trivial component with a trivial component. We handle mergers of two trivial components by creating an empty ''non-trivial component'' data structure, and merging it with each trivial component in turn.

Details of rollback info
The RollbackInfo record type
Changes that occur to S during BuildIncA(S, ∆Σ) include (i) appending additional implied splits to S.I, (ii) modifying the component list S.C, and (iii) merging components. We can therefore record the changes that occur to S in terms of (i) the original set of implied splits S.I, (ii) the original list of components S.C, and (iii) a sequence of records that describe individual merges of two components. We note that implied splits are only ever appended to the end of S.I. Therefore, as an additional optimization, we can simply record the original length of S.I, and revert to the original version by truncating the array to that length.
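The append-only optimization for S.I is small enough to show directly. This Python sketch is illustrative; the list stands in for the paper's array of implied splits:

```python
# Hypothetical sketch of the append-only optimization for S.I: because
# implied splits are only ever appended, saving the old length is enough
# to undo all appends.

def save_implied_len(solution):
    return len(solution["I"])     # plays the role of r.n_old_implied_splits

def restore_implied(solution, n_old):
    del solution["I"][n_old:]     # truncate back to the saved length
```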

The MergeRollbackInfo record type
When merging two non-trivial components C₁ and C₂, let us assume that C₂ is the smaller component. The modifications that occur to C₁, C₂, and S are the following:
• the elements of C₂.T are removed and appended to C₁.T.
• S.M[t] is set to C₁ for each t ∈ C₂.T.
• C₁.S is set to Null.
We can restore C₁.S from the saved reference. The pointer to the first element of C₂.T allows us to split the linked list C₁.T in two at the proper place, and return the latter half to C₂.T. We can then walk the restored elements of C₂.T and set S.M[t] = C₂ for each t ∈ C₂.T.
Sometimes we merge a non-trivial component C₁ with a trivial component containing the taxon t′. In such cases, there is no non-trivial component C₂. We indicate such cases in the MergeRollbackInfo object by setting component2 = Null. Such mergers can be rolled back by removing the last element of C₁.T, setting S.M[t′] = Null and setting C₁.S = orig_solution.
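As a concrete illustration, the merge-and-undo cycle for two non-trivial components might look like the following Python sketch. The dict-based component representation, the taxon map `M`, and the function names are assumptions made for illustration; the paper's implementation splices linked lists, for which the saved index below plays the role of the pointer to the first element of C₂.T:

```python
# Hypothetical sketch of a component merger and its rollback, using Python
# lists in place of linked lists.

def merge(C1, C2, M, log):
    """Merge C2 into C1, recording enough information to undo the merge."""
    log.append({"C1": C1, "C2": C2,
                "first": len(C1["T"]),     # where C2.T will start inside C1.T
                "orig_solution": C1["S"]})
    C1["T"].extend(C2["T"])
    for t in C2["T"]:
        M[t] = C1                          # taxa of C2 now map to C1
    C2["T"] = []
    C1["S"] = None                         # merged component has no solution yet

def undo_merge(rec, M):
    """Reverse one recorded merger."""
    C1, C2 = rec["C1"], rec["C2"]
    C2["T"] = C1["T"][rec["first"]:]       # give the tail back to C2
    del C1["T"][rec["first"]:]
    for t in C2["T"]:
        M[t] = C2                          # restore the taxon-to-component map
    C1["S"] = rec["orig_solution"]
```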

Optimizations
Recording and replaying rollback info allows us to avoid discarding saved work. However, both recording and replaying rollback info also have a cost. In order to achieve the optimum speedup from rollback info, we must avoid paying this cost when we do not need to.

Optimization #1
We only need to undo changes to a Solution object if it was part of the previous solution data structure. In order to determine whether a Solution object S is part of the previous solution data structure, we initialize a counter S.visits to 0 when creating a new Solution object. We increment the counter each time the Solution object is visited by BuildIncA. We can then avoid appending the rollback info for S to the sequence of changes R if S.visits = 0.
Batching
The batching approach works by batch-adding the splits for each tree in order. To batch-add a group of splits, we try to add the whole group of splits. If that succeeds, then we keep the whole group. If it fails and the group has one split, then we are done. If it fails and the group has more than one split, then we batch-add the first half of the group, followed by the second half of the group. This ends up being both simpler and more efficient than trying fixed-size batches, because we do not need to determine the ideal batch size, and it provides exponential back-off.
Batching improves efficiency when ∆Σ is large because the cost of determining the compatibility of Σ + ∆Σ does not increase much as the size of ∆Σ increases. If all the splits in ∆Σ will be accepted, it is thus substantially more efficient to add them in one batch.
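The halving strategy described above can be sketched as follows. Here `try_add` is an assumed callback that attempts to add a whole batch atomically, rolling itself back and reporting failure if the batch is incompatible; in this stateless test harness it is a simple predicate, whereas the real operation is stateful:

```python
# Hypothetical sketch of batch-adding with recursive halving: try the whole
# batch; on failure, drop a singleton or split the batch in half and retry.

def batch_add(splits, try_add):
    """Add as many splits as possible; returns the list of accepted splits."""
    if not splits:
        return []
    if try_add(splits):               # optimistic: the whole batch at once
        return list(splits)
    if len(splits) == 1:              # a single incompatible split: reject it
        return []
    mid = len(splits) // 2            # otherwise halve and retry each part
    return batch_add(splits[:mid], try_add) + batch_add(splits[mid:], try_add)
```

When most splits are compatible, large batches succeed on the first try, and the recursion only descends around the few incompatible splits.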

Oracle
The oracle first runs conflict analysis on each input phylogeny T′ to identify branches of T′ that conflict with the tree of currently accepted splits. We then batch-add splits corresponding to the non-conflicting branches of T′. This makes batching more efficient by making it more likely that large batches do not contain any conflicting splits.
Unfortunately, the oracle cannot filter splits for the taxonomy tree if there are any incertae sedis taxa. This is because taxonomy branches may correspond to partial splits in the presence of incertae sedis taxa. Our current conflict analysis does not handle partial splits.

Simulation experiments
The simulation script (gen_subproblem.py) can be found in the otcetera repository on GitHub (https://github.com/OpenTreeOfLife/otcetera and via DOI: https://zenodo.org/doi/10.5281/zenodo.10041275). The user of the script specifies: (a) a number of leaves in the full tree, (b) a number of phylogenetic input trees to simulate, (c) a tip inclusion probability for each phylogenetic input, (d) a number of edge-contract-refine (ECR) moves to conduct on each input tree, and (e) an edge contraction probability for the taxonomy. For each replicate, the script uses DendroPy (Sukumaran & Holder, 2010) to generate a pure-birth (Yule) tree with the specified number of leaves as the true (model) full tree for that replicate. The specified number of phylogenetic inputs are created by sub-sampling the full tree (using the tip-inclusion probability to assess whether a tip remains in the sampled input); if all tips are deleted, a new phylogenetic input is drawn, rather than emitting an empty tree. Then the specified number of ECR moves are applied to the tree to mimic phylogenetic estimation noise. The last tree emitted for each replicate is designed to mimic the taxonomic input in the problems used by the Open Tree of Life project. The taxonomic input is complete (lacks any sub-sampling based on taxon inclusion probabilities). In addition to having errors introduced by ECR moves, the taxonomic input undergoes branch collapsing (using the user-supplied edge-contraction probability on each internal edge independently) to mimic the unresolved character of most taxonomies.
We simulated a collection of supertree problems containing 50-1,000 taxa, all with 20 phylo-inputs. Each OTU was included in the phylogenetic inputs with probability 0.5. Each edge in the taxonomy was collapsed with probability 0.75. We introduced two ECR errors per tree. For each simulation condition, we determined the run time by averaging across 15 simulated data sets.

RESULTS
We examined the effect of different optimizations by looking at their run time on simulated data sets and one real data set. Run times are for an Intel i7-5820K CPU with 32 GB RAM running Linux.

Simulated data
We first examined the effect of the batching and oracle optimizations on simulated data sets as the number of taxa increased (Fig. 6). Batching yields a speedup that increases from 1.6-fold at 50 taxa to 17-fold at 1,000 taxa. Using the oracle to eliminate inconsistent splits shows no speedup when not paired with batching.
However, when combined with batching, the oracle yields an additional speedup that increases from 1.3-fold at 50 taxa to 2.4-fold at 1,000 taxa.This indicates that the oracle allows larger batch sizes to succeed.
Given that our simulation protocol performs two ECR edits to each phylogeny, we expect about four splits to be inconsistent with the underlying tree per phylo-input. Therefore larger trees have a smaller fraction of inconsistent splits. That may explain why larger trees receive a bigger speedup from batching.
Incrementalizing Build yields a larger speedup than the oracle+batch optimization (Fig. 7). The speedup increases from 20-fold at 50 taxa to 398-fold at 1,000 taxa. This represents an additional 9.9-fold speedup over batching + oracle at 1,000 taxa. Much of this speedup relies on the ability to save work via rollback: without rollback, the incremental algorithm achieves only a 116-fold speedup at 1,000 taxa. Rollback thus provides an additional 3.4-fold speedup for BuildInc over the incremental algorithm without rollback. When the oracle is enabled, 162,517 splits are considered. This lowers the fraction of rejected splits to 1.1%. When using the oracle + batch optimizations, Build is called only 14,481 times.
Figure 8 shows that some phylo-inputs take much more time than others. Processing splits from the taxonomy tree takes a large fraction of the total time. However, when run time is graphed against the number of accepted splits, the relationship is more linear, indicating that the effect of large trees is at least partly driven by the fact that large trees have a large number of splits.
Figure 1 Definitions for the Solution and Component record types. C.∆Σ: an array of splits relevant to the component. C.S: either NULL (if no solution is possible) or a Solution object for the component. C.O: an array of pointers to original Solution objects subsumed by this component. The field C.O is not used by non-incremental Build. Full-size DOI: 10.7717/peerj.16624/fig-1

4. AssignSplitsToComponents(S, Σ): assign each split to the single component on the next recursive level where it may be relevant.
5. Iterate. For each non-trivial component C ∈ S.C: (a) use CreateBlankSolution(C.T) to create a new Solution object for each Component object and store this in C.S. (b) Recurse. Call BuildA(C.S, C.∆Σ).

Figure 2 The Solution object S₁ returned by calling Build([A₁, A₂, B], {A₁A₂|•B}). Solution objects have rounded edges and a pink border. Component objects have rectangular edges and a blue border. The temporary values C₁.∆Σ and C₁.O are shown with the values they contain before being cleared. Solid arrows indicate pointers. Dashed arrows show the correspondence between Solution objects and nodes on the solution tree. Full-size DOI: 10.7717/peerj.16624/fig-2

Consider calling Build(T, Σ) with taxa T = {A₁, A₂, B} and splits Σ = {A₁A₂|•B}. This creates the solution data structure shown in Fig. 2. Build begins by creating a Solution object S₁ where S₁.T = [A₁, A₂, B] and then calling BuildA(S₁, Σ). BuildA creates a single non-trivial component C₁ containing {A₁, A₂} and a single trivial component containing B. BuildA does not fail, because there is more than one component. The single split A₁A₂|•B is assigned to C₁. BuildA then creates a new solution C₁.S where C₁.S.T = [A₁, A₂] and then calls BuildA(C₁.S, C₁.∆Σ).

Figure 3 Adding a split to an unmodified component. Top: the result of adding the split A₁A₂A₃|•B. Bottom: the result of incrementally adding the split A₁A₂|•A₃, starting with the top Solution. Full-size DOI: 10.7717/peerj.16624/fig-3

As before, BuildIncA(S, ∆Σ, O) must iterate over the non-trivial components C ∈ S.C and construct a solution C.S for each of them. In all cases this is done by calling BuildIncA(C.S, C.∆Σ, C.O). Passing original solutions from modified components down the tree in C.O allows us to reuse previous work, as described in the section on RemoveIrrelevantSplits below. The only difference from BuildIncA(S, ∆Σ) is the third argument, the set of original solutions, O.

Figure 5
A RollbackInfo object r contains the following fields: r.S: a pointer to the Solution object that this rollback info is about. r.n_old_implied_splits: the previous number of implied splits in S.I. r.merge_rollback_info: the sequence of MergeRollbackInfo objects. r.n_orig_components: the original number of components. r.old_components: the array of pointers to components after merging but before removing empty components. A MergeRollbackInfo object m contains the following fields: m.component1: pointer to C₁. m.component2: pointer to C₂, or NULL if C₂ is trivial. m.component2_first: pointer to the first element of C₂.T, or NULL if C₂ is trivial. m.orig_solution: pointer to the original solution C₁.S.

Figure 6 Effect of oracle and batch optimizations on run time. The left panel is log-scaled, whereas the right panel is not. Run time versus number of OTUs, where each sample data set has 20 phylogenetic trees as inputs. Full-size DOI: 10.7717/peerj.16624/fig-6

Figure 8 Time taken versus number of input splits accepted (top) or input trees processed (bottom). The left panel is log-scaled, whereas the right panel is not. Full-size DOI: 10.7717/peerj.16624/fig-8