ASTRAL-Pro: Quartet-Based Species-Tree Inference despite Paralogy

Abstract Phylogenetic inference from genome-wide data (phylogenomics) has revolutionized the study of evolution because it enables accounting for discordance among evolutionary histories across the genome. To this end, summary methods have been developed to allow accurate and scalable inference of species trees from gene trees. However, most of these methods, including the widely used ASTRAL, can only handle single-copy gene trees and do not attempt to model gene duplication and gene loss. As a result, most phylogenomic studies have focused on single-copy genes and have discarded large parts of the data. Here, we first propose a measure of quartet similarity between single-copy and multicopy trees that accounts for orthology and paralogy. We then introduce a method called ASTRAL-Pro (ASTRAL for PaRalogs and Orthologs) to find the species tree that optimizes our quartet similarity measure using dynamic programing. By studying its performance on an extensive collection of simulated data sets and on real data sets, we show that ASTRAL-Pro is more accurate than alternative methods.


Introduction
Evolutionary histories of genes and species can differ for several reasons [1], including incomplete lineage sorting (ILS), duplication and loss (duploss), gene transfer, and hybridization. Species tree inference is a central question in evolutionary biology and dealing with these sources of gene tree discordance is crucial. Many approaches have been proposed for species tree inference, including gene trees-species tree co-estimation [2][3][4][5][6] and species tree inference from sequence data [7][8][9]. However, the most scalable approach has remained a two-step process: first infer gene trees independently from sequence data and then combine them using a summary method. The goal of a summary method is to find the species tree best explaining the gene trees according to a model of gene tree discordance. While the ultimate goal is to develop summary methods modelling all sources of discordance, the literature mostly focused on separate causes.
A major family of summary methods focuses on duplication and loss processes producing multi-copy gene trees [10][11][12][13][14][15]. Most of these summary methods rely on maximum parsimony reconciliation [16] and aim at finding the species tree with the minimum reconciliation cost. Example methods include DupTree [10], its later extension iGTP [11,12], DynaDup [15] and earlier similar dynamic programming algorithms [14]. Other methods take a more agnostic approach and minimize the distance between species trees and the gene trees without necessarily invoking specific reasons for discordance. Example methods of this type include MulRF [17] and guenomu [18]. However, a recent result asserts that the optimal solution to the optimization problem solved by MulRF is indeed a statistically consistent estimate of the species tree under a generic duplication-only model of gene evolution [19]. These methods are mostly designed to handle duplication and loss, and although in simulations some have reasonable accuracy under ILS and gene transfer [20], they have not been widely adopted by biologists.
Several summary methods target ILS as modelled by the multi-species coalescence (MSC) model [21,22], and many of them are statistically consistent [e.g., [23][24][25][26][27][28][29][30]. However, the most successful summary method for ILS has arguably been ASTRAL [31], which, due to its high accuracy [32][33][34] and scalability [35,36], has been used widely in biological analyses. ASTRAL, like several other methods [7,24,28], relies on dividing gene trees into unrooted four-taxon trees (called quartets), a feature that allows it to handle ILS and may contribute to its high accuracy. ASTRAL, however, was designed to handle single-copy gene trees reconstructed from sets of orthologous genes. This limitation has restrained the application scope of ASTRAL. For example, two studies on plant transcriptomes had to restrict species tree analyses with ASTRAL to the 400-800 putative single-copy gene trees, discarding thousands of available multi-copy genes [37,38]. A surprising result asserts that treating gene copies as alleles of a same gene, a feature ASTRAL supports [39], is statistically consistent under a standard parametric model of gene duplication and loss and may be accurate [40]. Others have shown that random sampling of leaves works well empirically [41]. Beyond ASTRAL, several methods have focused on dividing multi-copy gene trees into single-copy genes without apparent duplications [42][43][44][45][46]. However, to our knowledge, no quartet-based methods designed to handle duplication and loss currently exist. Extending quartet-based methods to multi-copy gene trees is not trivial if we seek to correctly model orthology and paralogy.
Here, we introduce a quartet-based species tree inference method called ASTRAL-Pro (AS-TRAL for PaRalogs and Orthologs). This method requires defining a measure of quartet similarity between single-copy and multi-copy trees accounting for orthology and paralogy. We define such a measure in a principled manner and show how to optimize it using dynamic programming. We test the method on an extensive collection of simulated datasets and on a real plant dataset.

Problem Definition
Let S be a set of n species. Let us suppose that we are given a set of binary gene trees G, and, for each tree G ∈ G with leaf set L G = {1 . . . m G }, we have a mapping α G : L G → S specifying in which species each gene is sampled. For a rooted tree G, we denote the set of internal nodes in G by I(G), and, for each u ∈ I(G), we define L G (u) as the set of leaves below u. We define two short-hands: α G (A) = {α G (i) : i ∈ A} for A ⊂ L G and α G (u) = α G (L G (u)) for a node u. The notation G A denotes G restricted to the set A.
We let Ω(G) be the multi-labelled tree obtained by replacing each leaf i ∈ L G with α G (i). Multiple copies of the same species in a gene tree G may be created by gene duplication. We assume that each duplication creates a new genomic locus (i.e., a position along the genome) and therefore, each locus, except the original one, has a parent locus (which may or may not have survived to the present day). Thus, each element of L G can be theoretically mapped to its parent locus, allowing us to "trace" the locus of each leaf to its ancestors.
In each gene tree G, we refer to a subset Q of four distinct elements of L G as a quartet. The subtree of a fully resolved tree G induced on a quartet Q exhibits two degree-three nodes. We refer to these nodes as anchors of Q on G. As shown in Fig. 1, for a rooted tree G and for a quartet Q, up to label permutations, G Q can only have two topologies: an unbalanced one (when one anchor descends from the other), denoted as Q ∠ G, and a balanced one (otherwise), denoted as Q ⊥ G. We say a tripartition (P 1 , P 2 , P 3 ) of S "can anchor" a quartet Q of G iff Definition 1 (Tagged trees). We say a rooted tree G is tagged if every internal node is tagged either as duplication or as speciation. A node u with children u 1 and u 2 can be a speciation if the three sets α G (u 1 ), α G (u 2 ), and α G (L G \ (L G (u 1 ) ∪ L G (u 2 ))) are mutually exclusive. . Anchors are u and v, and w = ψ G (Q) is the anchor LCA. While w has to be a speciation for Q to be considered a SQ, u and v can be either speciation or duplication. 2. An example of equivalent classes. Three equivalent classes are anchored on z: all eight quartets of the form {a i , b j , d k , e 3 }, of the form {a i , c j , d k , e 3 }, and of the form {b i , c j , d k , e 3 }, all with balanced topology. Anchored on x: two equivalent classes with unbalanced topology: Anchored on y: two equivalent classes: We note that these labels may or may not correspond to real speciation and duplication events. In particular, in the presence of deep coalescence across duplication events, a correct tagging corresponding to actual events may not be possible.
Definition 2 (SQ). A quartet Q on a rooted tagged gene tree G is called a speciation-driven quartet (SQ) iff |α G (Q)| = 4 and the LCA of any three out of four leaves of Q is a speciation node. Equivalently, a quartet with topology ab|cd is a SQ if and only if its genes are all contained in different species and the LCA of either a or b with either c or d is tagged as speciation.
Definition 3 (Quartet anchor LCA). Let u and v be anchors of a quartet Q on a rooted tree G. We refer to the LCA of u and v as the anchor LCA of Q on G and denote it as ψ G (Q).
The last definition is central to our approach. Note that anchors of a SQ can be speciations or duplications (Fig. 1) and thus SQs are not simply quartets with anchors being speciation nodes. Instead, they are quartets with a topology pre-determined by the speciation event represented by the anchor LCA, regardless of subsequent duplications and losses. Such subsequent duplications and losses may lead to multiple quartets originating from the same speciation event. Since these events include no new information on the speciation event, we count only SQs towards the quartet score of a species tree and weight them in a non-trivial way to avoid double-counting.
Definition 4 (Equivalent SQs). Two SQs on the same 4 species are equivalent if they have the same anchor LCA; i.e., for two SQs, Proposition 1. If Q 1 and Q 2 are equivalent SQs on G, then Ω(G Q 1 ) and Ω(G Q 2 ) are isomorphic.
Thus, equivalent SQs have the same quartet topology when mapped to species. Proofs of all propositions and lemmas and sketches of proofs of all claims can be found in Appendix A.
Proposition 1 tells us that equivalent SQs do not provide any extra information with respect to each other, and therefore, it is reasonable to count all equivalent SQs as one unit when computing the quartet score of a species tree. This intuition is backed by the following proposition: Proposition 2. Assuming a correctly tagged tree G, for all equivalent SQs with a shared anchor LCA w, the three (in the unbalanced case) or four (in the balanced case) quartet leaves below w will all share an ancestral locus at the time of the speciation event corresponding to w.
We can now provide a natural definition of the quartet score. The equivalence relation (Def. 4) partitions all quartets in equivalence classes and, by Proposition 1, for each equivalent class, we can define a unique quartet tree labelled by S. By Proposition 2, each class corresponds to an ancestral locus.
Definition 5 (Per-locus (PL) Quartet Score). The per-locus quartet score of a species tree S with respect to a tagged gene tree G is the number of equivalent quartet classes with a quartet topology that matches the quartet tree induced by S. More formally, The PL quartet score of S with respect to a set of tagged gene trees G is q(S, G) = G∈G q(S, G) . Claim 1. If all nodes on the path between the root r and a node u are tagged as speciations, changing the root to any branch on the path does not alter the PL quartet score.
Problem 1 (Maximum per-Locus Quartet score Species Tree (MLQST) problem). Given a set of rooted tagged gene trees G, find the species tree that maximizes the PL quartet score with respected to input gene trees, i.e., arg max S q(S, G).

Solving the MLQST problem using dynamic programming
We start by briefly describing the ASTRAL algorithm to solve a related problem (the MQSST problem), and then describe how we extend this approach to the MLQST problem.
3.1 Background: ASTRAL on single-copy gene trees. A node in a binary unrooted species tree forms a tripartition of S that implies the topology for all quartets anchored at that node, allowing us to score it against G. Let P = P 1 |P 2 |P 3 and M = M 1 |M 2 |M 3 be two tripartitions, and let I ij = |M i ∩ P j |. Any species tree that displays P will share a certain number of quartets with any gene tree that displays M , and we call this number QI(P, M ) (calculations below extends to multifurcations if M is a d-partition). Defining G 3 as the set of all permutations of {1, 2, 3}, we have [31,47]: and P(G) is the set of partitions representing internal nodes of G. The quartet score of a species tree is simply the sum of the weights of its tripartitions. The division by half in w(P ) is necessary because the sum counts each shared quartet twice (once at each anchor).
ASTRAL finds the tree S that maximizes the quartet score using dynamic programming. It recursively divides S into subsets, in each step, choosing the division that maximizes the sum of the weights. To avoid exponential running time, instead of considering all ways of partitioning a set A ⊂ S into A and A \ A , we constrain the recursion to a given set of bipartitions. Let X be this set and X = {A : A|(S \ A) ∈ X} and Y = {(C, D) : C ∈ X , D ∈ X , C ∩ D = ∅, C ∪ D ∈ X }. Let V (A) be the quartet score of an optimal subtree on the cluster A and set V ({a}) = 0. Then,

ASTRAL-Pro Algorithm
We extend here ASTRAL to multi-copy gene trees. The input to the new method, called ASTRAL-Pro (A-Pro for short), is a set of rooted tagged gene trees (see in Section 3.3 how unrooted gene trees can be rooted and tagged). This extension involves three changes. (i) To handle multi-copy gene trees, when computing the tripartition associated to each node, we use α G to map labels to S. Here, instead of multi-sets, we create sets (counting multiple copies on each side only once). (ii) We change the weight calculation w(P ) so that each equivalent class of quartets is counted once instead of twice, only at its LCA anchor. (iii) When computing w, we only sum over internal nodes tagged as speciations.

Weight calculation.
Let w be an internal node of G tagged as speciation and P = (P 1 |P 2 |P 3 ) be a tripartition of S.
Definition 6. We say that a SQ equivalent class with LCA anchor w in a gene tree G is mapped from left to a species tree tripartition P iff for each quartet Q in the equivalent class (i) P can anchor Q and (ii) the leaves a and b under the anchor of Q that appear first in a post-order traversal of G (e.g., u in Fig. 1) both map to the same side of P (that is, . We denote such quartets by Q w − → P .
We now state a set of lemmas, followed by the main result.
Lemma 2. For a speciation node w with left child w 1 and right child w 2 , let , and LCA of w and z is tagged as speciation}. Let The number of SQ quartet equivalent classes anchored to w and mapped from left to the species partition P can be counted as follows:  QI pro (P, M w ) × 1 speciation (w) .
Theorem 1. The ASTRAL-Pro algorithm obtained by replacing w(P ) function with w pro (P ) in the ASTRAL dynamic programming solves the MLQST problem exactly if X = 2 S .
Proof. By Lemma 4, arg max S q(S, G) = arg max S P ∈P(S) w pro (P ). Thus, ASTRAL dynamic programming can solve the optimization problem exactly given the full search space (the argument is identical to that of ASTRAL and follows from the additive nature of q(S, G)).
We now make two claims, and provide a sketch of proofs in Appendix A. Note that by Claim 3, ASTRAL-Pro has polynomial running time.
Claim 2. For a set of gene trees G including only speciations, the tree returned by ASTRAL-Pro is the same as the one returned by ASTRAL. Algorithm 1 Gene tree tagging and rooting.
procedure TagAndRoot(G) s ← ∞ for edge e in G do root G at e and let r e be the new root s e ← Tag(r e ) if s e < s then r ← r e s ← s e root at r Tag(r)

Tagging and rooting gene trees
Gene trees inferred from sequence data are neither rooted nor tagged. We use the heuristic Algorithm 1 to root and tag gene trees, noting that a partially-correct rooting suffices (Claim 1). Given a rooted tree, we tag a node as duplication only if the node cannot be tagged as speciation by Definition 1 (i.e., observable duplication nodes [43]); other nodes are assumed to be speciation.
For rooting, we seek the root position that minimizes the number of duplications and losses while allowing for free ILS. In each gene tree G, for two nodes u and v where α G (u) = α G (v), we explain all differences in topologies below u and v by invoking ILS (as opposed to duplication/loss). Then, three scenarios are possible for a node u with children u l and u r . (i) When u is duplication and α G (u l ) = α G (u l ), we do not need to invoke any loss. One duplication suffices. (ii) If α G (u l ) ⊂ α G (u r ) or vice versa, we need one loss on u l and an arbitrary amount of deep coalescence. (iii) Else, we need two losses (one in each side) and ILS to describe the differences. Algorithm 1 computes the number of duplication and loss events using this strategy, without penalizing ILS and fixing a cost of one for both duplications and losses. As described, it requires quadratic time per rooting and thus cubic to find optimal rooting. In our implementation, we used memoization to reduce this time to quadratic (details omitted). The LCA-based linear algorithm of Scornavacca et al. [43] could also be adapted.

Search Space
We need to constrain the ASTRAL search space to bipartitions in a set X. To define X, we use a heuristic method relying on several strategies (see Algorithm 2). First, we use a sampling algorithm (SampleFull procedure) to create single-copy versions of each gene tree, creating a set F. This sampling algorithm prunes the right (or left) subtrees below the highest duplication nodes in the tree, and recurses on each pruned tree, until no species has multiple copies. In addition, per gene, 2 C (default: C = 4) trees are sampled from F, creating a multiset I. This sampling can be probabilistic (taking each side of a duplication with probability 1 2 ) for high numbers of duplications. When the number of input trees is small, I may become too small; in these cases, I is augmented using another sampling algorithm (SampleExtra procedure). We provide I as input to the algorithms implemented in ASTRAL-III for building the set X (as if -i I is given to ASTRAL-III). Finally, we complete all trees from F using the tree completion algorithm of ASTRAL-III and add the resulting bipartitions to X. All methods used guarantee that |X| grows polynomially with the number of species, gene trees, and duplication nodes.
Algorithm 2 Building set X. Default constant parameters: C = 4, E m = 500, E s = 4. The algorithm uses the (arbitrary) left/right orientation of children of a node as given in the input.
X ← run all ASTRAL-III methods for building X with I as input (i.e., -i I) replace u with a star tree consisting of leaves from the set B u R = SampleExtra(G L , A l ) SampleExtra(G R , A r ) return R Table 1: Simulation settings for S25 dataset. n=number of ingroup species; k=number of genes; τ = tree height (generations); λ + = duplication rate; λ − = loss rate; N e = Haploid effective population size. Empirically, we estimate: C = mean number of copies per species minus one when λ − = 0 and n = 25; ILS= mean RF distance between true gene trees and the species tree when λ + = 0. MGTE = mean RF distance between true and estimated gene tree when λ + = 0. See Table S1 for full parameters and Figures S1-S6 for full statistics.

Statistical Consistency
When the input set G has only speciation nodes, the MLQST problem reduces to the Maximum Quartet Support Species Tree (MQSST) problem solved by ASTRAL [31]. Thus, like the MQSST, the MLQST is NP-hard [48]. Moreover, the solution to MQSST problem is a statistically consistent estimator of the species tree under the MSC model and thus ASTRAL-Pro is also statistically consistent in absence of duplication.
In the presence of gene duplication and losses only, let us consider the birth-death model proposed by Arvestad et al. [49] and refer to it as the GDL model. Since all quartets in every equivalence class of SQs match the species tree, the per-locus quartet score will be maximized by the species tree. The following theorem follows.
Theorem 2. Under the GDL model [49], the solution to the MLQST problem is a statistically consistent estimator of the species tree given correctly rooted and tagged gene trees.
In fact, we conjecture that ASTRAL-Pro is statistically consistent under the GDL model even when gene trees are imperfectly rooted and tagged. We leave the proof to future work. Finally, note that restricting to X does not impact statistical consistency, as each bipartition of the species tree has a non-zero chance of appearing in output of this algorithm.

Datasets
We use new and existing simulated datasets as well as a biological dataset to test A-Pro.

New simulated dataset (S25)
We perform a set of simulations using SimPhy [50] starting from a default model condition and adjusting five parameters (Table 1). We simulate 50 replicates per condition, and each replicate draws its parameters from prior distributions. Exact commands are given in Appendix B.
Default model: The species tree, simulated under the Yule process with birth rate 5 × 10 −9 and the number of generations sampled from a log-normal distribution (mean 2.9 × 10 9 ), has 25 ingroup and an outgroup species. Each replicate has 1000 true gene trees simulated under DLCoal with fixed haploid population size N e = 4.7 × 10 8 . Gene trees have mean ILS level in [60%, 80%] range (mean 70%) across replicates (Fig. S2). The duplication rate λ + = 4.9×10 −10 ; when there is no loss, gene trees on average include 145 leaves (≈ 5 extra copies per species). The loss rate λ − is set to λ + ; with loss, gene trees have on average 43 leaves. The average number of duplication and loss events are 11 and 9, respectively, but variance is high (Fig. S1). For each gene, we use Indelible [51] to simulate gap-free nucleotide sequences along the gene trees using the GTR+Γ model [52] with 2 different sequence lengths: 500bp and 100bp. We then use FastTree2 [53] to estimate maximum likelihood gene trees under the GTR+Γ model. Gene tree estimation error, measured by the FN rate between the true gene trees and the estimated gene trees, depends on the sequence length and fluctuates significantly (from 0-100%) both within and across replicates (Fig. S3); mean error is 36% and 15% for 100bp and 500bp, respectively.
Controlling λ + , N e : Here, we consider 3 × 5 = 15 conditions, fixing λ − to be equal to λ + , but changing λ + and ILS levels (controlled by N e ). Our λ + settings result in 0 to 5 extra copies per gene, and the mean ILS level between true and estimated gene trees varies between 0 and 70% RF. (Table 1; Fig. S5) All other parameters are identical to the default model.
Controlling n: Fixing all parameters, we vary the number of ingroup taxa n from 10 to 500.
Controlling k: Fixing all parameters, we vary the number of gene trees k from 25 to 10,000.
(v) Gene trees were estimated using RAxML instead of FastTree2.

Biological data (1kp)
A transcriptome analysis of 103 plant species has been performed on 424 single-copy gene trees (out of thousands of genes) using both concatenation and ASTRAL [37]. In preliminary analyses, the authors had inferred multi-copy gene trees using RAxML from 9683 genes for 83 of those species, ranging in size between 5 and 2395 leaves. However, not being able to obtain an accurate species tree from the multi-copy gene trees, they abandoned the strategy in later analyses. The gene trees are available on Cyverse [56]. We used gene trees inferred from AA or first two codon positions (C12) as the original study.

Methods compared
We implemented A-Pro by extending ASTRAL-MP [36] and implementing Algorithms 1 and 2 as part of its native C++ library. We compare A-Pro to the following methods. Another method, STAG [57], is not included because of its poor performance on the S100 dataset [54], including that it fails to run on some model conditions (Fig. S8).
DupTree [10] infers a species tree from rooted or unrooted gene trees minimizing the duplication reconciliation cost [1] under the duplication-only model, but it does not model ILS. We   Controlling the duplication rate (columns; labelled by C) and the ILS level (x-axis; NRF between true gene trees and the species tree for λ + = 0). A-Pro and ASTRAL-multi are identical with λ + = 0. See Table 1 for parameters and Fig. S7 for iGPT-duploss.
provide DupTree with unrooted gene trees. We also tried iGTP, minimizing Dup-Loss score, but we only show results in supplement (Fig. S7) as it was almost universally worse than DupTree.
MulRF [17], based on an extension of the RF distance [58] to multi-labelled trees, is a hillclimbing method that aims at finding the tree with the minimum RF distance to the input. We use MulRF because of its advantage over other methods shown in previous studies [20].
ASTRAL-multi [39] is a feature of ASTRAL designed for handling multiple individuals. A recent paper (concurrently submitted) proposes to use ASTRAL-multi for multi-copy data [40]. Due to its high memory requirements, we were able to include it in only one experiment of S25.

S25 dataset 5.1.1 Controlling duplication and loss rates and the level of ILS
We start by experiments that change the duplication and loss rates (λ + , λ − ) from the default condition (Fig. 2a). On true gene trees, A-Pro and DupTree are essentially tied in terms of accuracy, except for the case with no duplication and loss where A-Pro is perhaps slightly more accurate. Overall, the accuracy of A-Pro and DupTree is statistically indistinguishable under these conditions (p-value = 0.79 according to a multi-variate ANOVA test). Increasing λ + reduces error (p < 10 −5 ), perhaps because additional copies provide more information, akin to increasing the number of loci. Despite statistically significant increases (p = 0.006) in error as λ − increases, both methods are quite robust to loss rates, losing at most 1.5% accuracy on average when λ − = λ + compared to no losses. MulRF has much higher error than other two methods, with errors that range between 10% and 17% across model conditions (we remind the reader that all these conditions have high ILS, a process that MulRF does not model).  On estimated gene trees, the pattern changes, and the error of DupTree increases dramatically while A-Pro remains relatively accurate. When λ + = λ − = 0, DupTree has on average an 11.5% error whereas A-Pro has only a 4.5% error for 500bp. Adding duplications helps both methods but A-Pro remains more accurate. For example, with 100bp input gene trees, DupTree has an error between 50% to 260% higher than A-Pro. With low-error gene trees, differences are statistically significant (p < 10 −5 ) but are more modest in magnitude (the error increase from DupTree to A-Pro across conditions by a median of 28%). The relative accuracy of A-Pro and DupTree is not a function of λ − (p =0.8) but may depend on λ + (p =0.05).
In terms of running time, on the default model condition, we observe that A-Pro is the fastest method, taking less than a minute on this dataset, followed closely by DupTree (Fig. S9). We will revisit running time of A-Pro on larger datasets below. As we change the ILS level (Table 1), the reason for the poor performance of MulRF becomes clear (Fig. 2b). Without ILS, MulRF has excellent accuracy, often matching A-Pro and beating DupTree on low-error gene trees. As the ILS level increases (especially above 20%), the accuracy of MulRF deteriorates quickly. Overall, ILS has the strongest effect on accuracy (p 10 −5 ) but its impact on methods vary (p 10 −5 ). DupTree seems as tolerant of ILS as A-Pro, despite the fact that DupTree is not designed specifically for ILS, and both methods are much more tolerant of ILS than MulRF. Nevertheless, once again, DupTree shows extreme sensitivity to gene tree error. To summarize, DupTree is relatively tolerant of ILS but less tolerant of gene tree error; MulRF is tolerant of gene tree error but not of ILS; A-Pro is quite robust to both.

Controlling the number of genes and species
Increasing the number of genes k in the most difficult case of high λ + , λ − , and ILS results in continued improvement in accuracy for A-Pro for every value we tested up to k = 10 4 (Fig. 3a).   With true gene trees, the error reduces from 26% with k = 25 to below 1% with k = 10 4 . Even with less accurate gene trees, the error reduces to below 2% with increased numbers of genes. Increasing k increases running time, which empirically grows with k 1.4 (Fig. 3b). Nevertheless, using 28 cores, the running time was never more than 3.5 minutes even with k = 10 4 . Increasing n from 25 to 500 shows that A-Pro is relatively robust to a large number of species (Fig. 3c). With true gene trees, the error ranges between 2.5% with 10 species to 3.5% with 500 species. With estimated gene trees, error ranges between 4.1% to 9.5% (for 100bp) and between 2% and 5% (for 500bp). Note that as n increases, the gene tree error also increases ( Table 1; Fig S6). The running time of A-Pro increases roughly quadratically with n ( Fig. 3d) but is below 2 hours (given 28 cores) even for n = 500 (recall that k = 1000).

S100 dataset
Patterns of performance on the S100 dataset are consistent with the S25 dataset (Fig. 4). DupTree is highly accurate with true gene trees and gene trees with low estimation error but quickly degrades in accuracy as gene tree error increases. MulRF is less sensitive to gene tree error but is more sensitive to the ILS level (which is always moderate or low on this dataset). As in S25, here, we see that using ASTRAL-multi to handle duplication and loss is not beneficial. A-Pro works the best overall, ranking first in terms of mean error (rounded to two significant digits) in 105 out of 120 test conditions (Table S2). The second best method is MulRF on this dataset, which is not surprising given the low ILS levels in this dataset.  Figure 5: ASTRAL and ASTRAL-Pro on biological plant paper. ASTRAL-Pro is run on 9683 multi-copy AA gene trees available online [56]. ASTRAL is run on 424 single-copy gene trees and was reported previously [37]. Three genomes, shown in red, were present in multi-copy gene trees but not in the single-copy analyses. The single-copy tree includes 23 extra species that were not in the multi-copy data and are removed here. The two trees are very similar and differ in only four branches shown in red.

Biological plant dataset
On the plant dataset [37], A-Pro on AA gene trees returned a species tree (Fig. 5) that was largely similar to the main single-copy ASTRAL tree reported by the original study (only 4 differences). Using C12 gene trees resulted in only one change, swapping Chara and Coleochaetales. In contrast, DupTree differed from the ASTRAL tree in 33 out of 77 branches (21/77 for iGTP-DupLoss) and violated many known biological relationships (Fig. S10). The A-Pro trees are consistent with ASTRAL for major groups, including placing Zygnematales (not Chara) as sister to all land plants, the placement of Amboerlla as sister to the rest of angiosperms, and monophyly of Bryophytes (liverworts, mosses, and hornworts).
The four changes between the ASTRAL and A-Pro trees are interesting. In A-Pro, unlike ASTRAL, Rosmarinus and Ipomoea are grouped together, which is likely the correct result as these species are in the same order (Lamiales). The position of genus Yucca in the A-Pro tree has changed; interestingly, a recent update to this paper using > 1, 000 species [38] (which samples close genera Asparagales and Liliales) finds Yucca in a position identical to A-Pro. Thus, the A-Pro placement is more likely to be correct. Most consequentially, A-Pro, unlike ASTRAL, recovers the GnePine hypothesis combining Gnetales and Pinaceae, a hypothesis suggested by several studies [59][60][61][62] and all concatenation analyses of 1kp [37]. Interestingly, the new 1kp paper [38] uses DiscoVista [63] to examine quartet frequencies around this branch and detects that the second and third most frequent quartets do not match (0.4 vs. 0.1) and are heavily skewed towards GnePine, making the resolution obtained in ASTRAL less reliable.

Discussions
We developed a "per-locus" quartet-based measure of similarity between multi-copy gene trees and a species tree. The measure relies on internal nodes of gene trees being tagged as speciation or duplication. Somewhat counter-intuitively, despite being a quartet measure, it needs partially rooted trees (Claim 1). The measure defines an equivalence relationship on quartets and counts each equivalent class only once, avoiding double-counting quartets that are bound to have identical topologies. Avoiding double-counting is at the heart of the approach and likely is a main reason behind its high accuracy on simulated and empirical data we tested.
Astral-Pro, which maximizes the per-locus quartet score, is statistically consistent under MSC and GDL models. This makes one hope that it may also be consistent under both causes of discordance combined. The DLCoal model [55] accounts for ILS, duplication, and loss. Under this model, each duplication immediately creates a daughter locus, which is unlinked from the parent locus; the duplication event gets fixed in all species. Gene trees are seen as generated by first producing a locus tree via a birth-death process that runs on the species tree and then running a MSC process on the locus tree. Because the loci are considered as unlinked, the coalescence processes occur independently between the parent and daughter loci (but the daughter MSC process is "bounded" at the time of duplication). Due to the independence of loci, dividing a multi-copy gene family into its constituent loci can give us distributions on gene tree topologies that behave similarly (though not identically) to the MSC model. The per-locus metric seeks to count quartet topologies across loci as they existed at the time of speciation events relevant to a quartet (i.e., at the time of the anchor LCA). When successful, it counts only topologies that are drawn from independent coalescent processes. However, complicated scenarios involving a combination of duplications, losses and ILS can lead to incorrectly tagged gene trees. These scenarios create complications that need to be addressed. We leave it to the future work to study whether ASTRAL-Pro is statistically consistent under the DLCoal model.
To get rooted and tagged gene trees, we used the maximum parsimony principle, with duplication and loss each penalized equally and deep coalescence not penalized at all. There is a large literature on various ways of tagging and rooting gene trees [e.g., [64][65][66], including other penalties for the duplication and loss events (e.g., there is a suggestion of losses having half the penalty of duplications [67]). It may also be possible to improve tagging of gene trees using probabilistic orthology inference [68,69] or using synteny information [70,71]. However, these methods often require a species tree. It may possible to use A-pro in an iterative fashion, where the species tree is inferred, gene trees are re-tagged and re-rooted, and a new species tree is inferred Future work should explore these approaches.
Quartet-based methods for handling multi-copy gene trees are not abundant. Besides our method, one can attempt to sample single-copy gene trees [41], an approach that we plan to test in the future. Most recently, there has been theoretical and empirical evidence that simply treating gene copies as alleles may be sufficient [40]. We showed that this alternative, although attractive in theory, is less accurate and less scalable than A-Pro. We are unaware of other quartet-based species tree inference methods for multi-copy input.
A-Pro has some limitations. Most importantly, in its current form, it can only handle binary trees, which reduces its ability to handle gene tree error [47]. While A-Pro is more robust to gene tree error than alternatives, combining it with co-estimation [3] and tree fixing [72][73][74][75][76][77] may further improve its accuracy. Future work should also explore ways to extend A-Pro so that it can handle polytomies in input gene trees. Finally, with more algorithmic development, it should be possible to provide all the features ASTRAL provides, including branch length and Local-PP [78], polytomy test [79], and visualization of discordance [63]. All these features should be adopted to A-Pro in the future.

A Proofs
Proof of Proposition 1. Denote Q 1 = {a, b, c, d} and Q 2 = {ã,b,c,d} (with obvious correspondence of labels). Let w be the anchor LCA and note that anchor LCA is the LCA of three (if Q 1 ∠ G) or four (if Q 1 ⊥ G) of the quartet leaves; thus, by Definition 2, w is a speciation node or otherwise Q 1 would not be a SQ. Let denote by w 1 and w 2 the children of w; by Definition 1, α G (w 1 ), α G (w 2 ) and the remaining leaves (U = α G (L G − L G (w 1 ) − L G (w 2 ))) must be mutually exclusive. In the unbalanced case, given that a, b ∈ L G (w 1 ), c ∈ L G (w 2 ), d ∈ U , mutual exclusivity is possible only ifã,b ∈ L G (w 1 ),c ∈ L G (w 2 ),d ∈ U . In the case of balanced topology, mutual exclusivity of α G (w 1 ) and α G (w 2 ) and the fact that a, b ∈ L G (w 1 ) and c, d ∈ L G (w 2 ) implies thatã,b ∈ L G (w 1 ),c,d ∈ L G (w 2 ). Thus, in either case, Ω(G Q 1 ) Ω(G Q 2 ).
Proof of Proposition 2. Each node of a gene tree represents an ancestral or present-day gene and thus belongs to a locus. The children of a speciation node stay in the same locus that their parent, while for a duplication node we have that exactly one of the two children change locus and the other stays in the same locus than its parent. Therefore, all nodes under w, which is a speciation node, belong to the descendants (including itself) of the locus to which w belongs, and when tracing back to the time of speciation event w, they will lead to the same locus. Since all equivalent classes share the same anchor LCA, the result follows.
Proof of Lemma 1. Note that P can anchor Q 1 only iff any species tree that includes P must match the gene tree topology for Q 1 By Proposition 1, due to equivalence of Q 1 and Q 2 , we infer Q 2 must (i) match the same species quartet set as Q 1 and (ii) share the same anchor LCA w. Thus, P can also anchor Q 2 . (iii) When Q 1 ∠ G as shown in Figure 1,ã,b ∈ L G (w 1 ) are the leaves mapped to the quartet tree and thus mapped to the same partition as a, b; similarly, when Q 1 ⊥ G, the pair of leaves under left subtree of the anchor LCA of both quartets map to the same partition of P .
Proof of Lemma 2. First note that: We compute each part separately. Recall here that, since Q w − → P , Q is a SQ quartet and thus |Q| = |α G (Q)| = 4. When Since Q w − → P , leaves α G (a) and α G (b) must be in the same partition of P . When α G (a), α G (b) ∈ P 1 , leaves α G (c) and α G (d) must be in partition P 2 and P 3 respectively since P can anchor Q. W.l.o.g., we can assume α G (c) ∈ P 2 . Therefore, α G (a), Similarly, when Q ∠ G, let Q = {a, b, c, d} with α G (a) and α G (b) in the same partition of P . Notice that, in the unbalanced case, α G (a) and α G (b) can be both either in M 1 or either in M 2 , and since c and d are not interchangeable as in the balanced case, we can have α G (a), α G (b) ∈ P i , α G (c) ∈ P j , α G (d) ∈ P k for (i, j, k) with any permutation of (1, 2, 3), from the definition of P anchoring Q. All together we have 12 cases.
In the case that α G (a), α G (b) ∈ P 1 , α G (c) ∈ P 2 , α G (d) ∈ P 3 , and α G (a), α G (b) ∈ M 1 , we have α G (a), α G (b) ∈ M 1 ∩ P 1 , α G (c) ∈ M 2 ∩ P 2 , and α G (d) ∈ M 3 ∩ P 3 . The number of such Proof of Lemma 3. Let Ω(G Q) be designated by ab|cd and assume w.l.o.g that the anchor corresponding to a and b is the first anchor observed on the post-order traverse of G. It is easy to show (see [31]) that if Ω(G Q) S α G (Q) there exist exactly two tripartitions P 1 and P 2 in P(S) that imply a quartet topology that matches Ω(G Q) (condition (ii) of Definition 6). Each of the two tripartitions has two leaves of α G (Q) in one of its parts and the other two leaves fall on two different parts. Also, the two leaves that are together can only be a and b or c and d and thus, only one of P 1 and P 2 would include both a and b in the same part. Therefore, by condition (iii) of Definition 6, exactly one of Q = P ∈P(S) G∈G w∈I(G) QI pro (P, M w ) × 1 speciation (w) The first two lines are implied by Definition 5. Equation (8) follows from Lemma 1 and Lemma 3 that together establish that each equivalence class of quartets maps to exactly one P . Equation (9) follows from Definition 4 combined with a simple rearrangement obtained by counting unique tuples once. Equation (10) follows from the fact that when w is a duplication node, |{α G (Q) : Q ⊂ L G , Q w − → P }| = 0. Equation (11) follows from Lemma 2.
Proof sketch of Claim 1. The rooting that minimizes the number of duplications and losses (#duploss for short) in Alg. 1 may not be unique. In particular, if a rooted tree G minimizes #duploss, then rooting it at any branch such that the path between the parent node of the branch and the current root (including the two end nodes) does not contain any duplication node will also minimize #duploss. We call a correctly-tagged gene tree partially-correctlyrooted if the path between the parent node of the branch where it is rooted and the root in the correctly-rooted tree does not contain any duplication node. In particular, when gene trees do not have duplications, then any rooting of a gene tree is partially-correctly. We observe that the equivalent classes of quartets in all partially-correctly-rooted trees stay the same (although all quartet trees in the same equivalent class may change from balanced to unbalanced or vice versa), and thus any partially-correct-rooting of gene trees will result in the same species tree.
Sketch of proof of Claim 2. When G only includes speciation nodes, regardless of rooting, each quartet is a SQ. Since each leaf corresponds to distinct taxa in the species tree, each quartet equivalent class contains only one quartet. Therefore, each quartet is counted exactly once and thus P ∈P(S) w pro (P ) = P ∈P(S) w(P ) regardless of rooting. (by an argument that is identical to that provided for ASTRAL-III [31] and follows from results of Kane and Tao [80]). However, while ASTRAL-III guarantees |X| = O(nk) with k = |G|, in ASTRAL-Pro, in the presence of duplications, |X| can be large; in particular with our sampling algorithm (Alg. 2), |X| = O(nN ). Thus, the running time of A-Pro is O(D(nN ) 1.73 ). Note that this analysis is not tight and can be made more precise in the future. Also, in the future, we will explore sub-sampling a constant number of trees from the output of Alg. 2 per gene tree, which will limit the |X| = nk and thus limit the running time of ASTRAL-pro to O(D(nk) 1.73 ).
Proof of Proposition 3. Under GDL, besides leaves, each internal node u G ∈ I(G) in a gene tree G corresponds to an internal node u S ∈ I(S); if u G is a duplication node, u S is the node down the branch in S where the duplication event happened, and if u G is a speciation node, u S is the respective speciation node. It is easy to see that α G (u G ) ⊂ L S (u S ). For each SQ quartet Q = {a, b, c, d}, assuming w.o.l.g that G Q has unrooted topology ab|cd, let w G = ψ G (Q), and u G and v G be the children of w G . Let u G , v G , and w G correspond to u S , v S , and w S in S, respectively. Since w G is a correctly tagged speciation node, u S and v S are descendants from different children of w S .
When Q ⊥ G, assuming w.o.l.g. a, b ∈ L G (u G ) and c, d ∈ L G (v G ), we get α G (a), α G (b) ∈ L S (u S ) and α G (c), α G (d) ∈ L S (v S ) and thus α G (a)α G (b)|α G (c)α G (d) is induced by S.
When Q ∠ G, assuming w.o.l.g. a, b ∈ L G (u G ), c ∈ L G (v G ), and d / ∈ L G (w G ), we get α G (a), α G (b) ∈ L S (u S ) and α G (c) ∈ L S (v S ). Since d is not under w G , α G (d) and w S are under different children of the species tree node to which the LCA of d and w G corresponds. Therefore, α G (d) / ∈ L S (w S ) and thus α G (d) / ∈ L S (u S ); since α G (a) ∈ L S (u S ) and α G (b) ∈ L S (u S ), it follows that α G (a)α G (b)|α G (c)α G (d) in S.   Tables   1st 2nd 3rd 4th  MulRF  42  67  10  1  DupTree  28  8  15  69  A-Pro 105  14  1  0  ASTRAL-multi  12  14  71  23   Table S2: Rank of methods on S100 dataset over all 120 test conditions. Ranks are obtained using mean species tree error, rounded to two significant digits to create tie for cases where error values are extremely close.   Figure S3: Distribution of the gene tree errors (normalized RF distance between true gene trees and the estimated gene tree) for inferred trees with at least 14 leaves in the default condition. Results are divided by sequence length (100bps or 500bps) and by replicates, sorted by mean gene tree error of the 100bps condition.      Figure S8: Species tree error on S100 dataset. We compare the species tree error of the STAG method to A-Pro, showing mean and standard error over 10 replicates for each model condition, with varying numbers of genes (k) and sequence lengths (with Inf signifying true gene trees). Model conditions are labeled as a/b where a is the level of ILS (1 or 5) and b is the duplication/loss rate (1, 2, or 5). Cases with missing STAG results are due to STAG failing to run on those model conditions. Note that STAG infers a species tree from the input gene trees that have at least one leaf representing each species of interest; if none of the input gene trees satisfy this requirement, then STAG fails to return a tree.