Reconciliation feasibility in the presence of gene duplication, loss, and coalescence with multiple individuals per species

Background In phylogenetics, we often seek to reconcile gene trees with species trees within the framework of an evolutionary model. While the most popular models for eukaryotic species allow for only gene duplication and gene loss or only multispecies coalescence, recent work has combined these phenomena through a reconciliation structure, the labeled coalescent tree (LCT), that simultaneously describes the duplication-loss and coalescent history of a gene family. However, the LCT makes the simplifying assumption that only one individual is sampled per species whereas, with advances in gene sequencing, we now have access to multiple samples per species. Results We demonstrate that with these additional samples, there exist gene tree topologies that are impossible to reconcile with any species tree. In particular, the multiple samples enforce new constraints on the placement of duplications within a valid reconciliation. To model these constraints, we extend the LCT to a new structure, the partially labeled coalescent tree (PLCT) and demonstrate how to use the PLCT to evaluate the feasibility of a gene tree topology. We apply our algorithm to two clades of apes and flies to characterize possible sources of infeasibility. Conclusion Going forward, we believe that this model represents a first step towards understanding reconciliations in duplication-loss-coalescence models with multiple samples per species. Electronic supplementary material The online version of this article (doi:10.1186/s12859-017-1701-1) contains supplementary material, which is available to authorized users.

Constraint 1 ensures that genes from the same (species-specific) locus are assigned the same locus class and that the genes and edges assigned the same locus class form a subtree of the gene tree. This latter constraint follows from our assumption that each duplication creates a unique new locus. Without this assumption, two leaves could map to the same locus if, for example, along the path from their MRCA to one of the leaves, there was a duplication to a new locus followed by a duplication back to the original locus. Constraint 2 ensures that genes from irreconcilable loci are not assigned the same locus class. That is, paralogous genes by definition cannot be orthologous, so they must be in different locus classes.
Problem S1.1 (Reconciliation Feasibility). Given G and Le, determine whether G is reconcilable.
To solve Problem S1.1, we rely on two new data structures: Definition S1.4 (Partially Labeled Coalescent Tree). Let P(L) denote the power set of L. Given G and Le, the partially labeled coalescent tree (PLCT) is a mapping P : E(G) → P(L) that labels branches with the species-specific locus or loci to which the branch must belong. That is, for each e ∈ E(G), P(e) contains l if and only if there exists a pair of genes mapped to l and e lies along the path between those genes.
Remark. The PLCT can be thought of as a partial locus map that labels each gene tree branch with the locus in which it evolved. It is a partial map for two reasons: (1) There can exist gene tree branches without any label as branches are labeled only if they lie along the path between two alleles. (2) Branches are labeled with species-specific loci, but multiple species-specific loci may be equivalent and correspond to the same evolutionary locus.
Definition S1.5 (Locus Equivalence Graph). Given a PLCT P for G and Le, the locus equivalence graph (LEG) is a graph G where V (G) = L and for each pair l 1 ∈ L, l 2 ∈ L such that l 1 = l 2 , E(G) contains (l 1 , l 2 ) if and only if there exists an edge in the gene tree that is labeled with both l 1 and l 2 in the PLCT.
That is, because the locus classes are defined over the species-specific loci, the PLCT captures the allele constraints for each species-specific locus. These constraints are then put together in the LEG. In particular, if an edge of the gene tree is labeled with multiple loci in the PLCT, then the multiple loci must correspond to the same locus class and be equivalent. This equivalency constraint is captured in the LEG: the LEG contains an edge between two loci if the loci are equivalent. Note, however, that the LEG captures the loci pairs that must be equivalent, but these are not necessarily the only pairs that can be equivalent.
Next, we define the concept of a reconcilable LEG and relate it to gene tree reconcilability.
Definition S1.6 (Reconcilable Connected Component). A connected component C of the locus equivalence graph G is reconcilable if and only if all of the loci in that component are pairwise reconcilable.
Definition S1.7 (Reconcilable Locus Equivalence Graph). A locus equivalence graph G is reconcilable if and only if all of its connected components are reconcilable.
Theorem S1.1. A gene tree is reconcilable if and only if its locus equivalence graph is reconcilable.
In other words, Theorem S1.1 states that that a gene tree is reconcilable if and only if every connected component of the LEG contains no more than one locus from any species. Algorithm S1 summarizes how to generate the PLCT and LEG and determine feasibility. We can then analyze the time complexity of our algorithm.
Theorem S1.2. Gene tree reconcilability can be decided in a total time complexity of O(|L(G)| 3 ).

S1.1 Proofs
Lemma S1.3. If v is an internal node of the gene tree, and some branch incident with v has label l in the PLCT, then some other branch incident with v must also have label l.
Proof. Suppose some branch incident with v is labeled l. Then, by the method used to label branches of the PLCT in Algorithm S1, we know that v must lie on the path between two genes from the same speciesspecific locus. The path between two genes must start and end at leaves of the gene tree. Therefore, since v is an internal node of the gene tree, it must have two neighbors on this path, and it must have at least two incident branches labeled l.
Lemma S1.4. Suppose C 1 and C 2 are maximal disjoint connected components of the locus equivalence graph. Next, suppose that g 1 ∈ L(G) is a gene from locus l 1 ∈ C 1 and g 2 ∈ L(G) is a gene from locus l 2 ∈ C 2 . Then, there exists an edge between g 1 and g 2 of the gene tree that is unlabeled in the PLCT.
Proof. Note that l 1 = Le(g 1 ) and l 2 = Le(g 2 ) are the labels associated with the loci of g 1 and g 2 , respectively. Assume, by way of contradiction, that there is no unlabeled edge on the path p from g 1 to g 2 . Now consider the traversal of p from g 1 to g 2 . The first edge in this traversal must be labeled l 1 . To see this, note that every edge of p is labeled with some l i . By our definition of the PLCT, edges are only labeled when they lie on the path between two genes with the same species-specific locus. Since the first edge in p is incident to a leaf sampled from l 1 , it is impossible to label the first branch of p anything but l 1 . The same argument shows that the last edge in p must be labeled l 2 .
Thus, p begins with label l 1 and ends with label l 2 . Then, if we traverse p from g 1 to g 2 , there must be some "first edge" e f labeled l i , where l i and l 1 are from different connected components of the LEG. We know this is true because l 1 and l 2 are from disjoint connected components of the LEG.
Let v be the vertex immediately preceding edge e f in p. We know that v is incident with an edge with label l 1 and with an edge with label l i . Then, by Lemma S1.3, v must be incident with at least two edges with label l 1 and with at least two edges with label l i . However, since v is an internal vertex of a binary tree, it has degree 3. Therefore, some edge incident with v must be labeled both l 1 and l i , and thus, l 1 and l i must be connected in the LEG. This is a contradiction because we claimed that l 1 and l i were loci from different connected components of the LEG.
We have reached a contradiction, so there must be some unlabeled edge on the path p from g 1 to g 2 .
Lemma S1.5. A gene tree is reconcilable if and only if the following conditions hold: N1. There exists no duplication along the path between two genes from the same locus. (Allele Constraint) N2. There exists a duplication along the path between two genes from irreconcilable loci. (Paralog Constraint) Proof. We will show that these criteria are equivalent to our original criteria for a reconcilable gene tree, which are repeated here: 1. Genes from the same locus must be assigned the same locus class, and all edges along the path between these genes must be assigned the same locus class as the genes. (Allele Constraint) 2. Genes from irreconcilable loci must not be assigned the same locus class. (Paralog Constraint) Throughout this proof, note that each gene or edge of the gene tree is assigned a single locus class. Most importantly, we observe that a duplication along a path must result in a change of locus class; moreover, this is the only way for a branch to change locus class, and the change is not reversible -a subsequent duplication along the path cannot revert back to a previously used locus class. Thus, Constraint 1 and Constraint N1 are equivalent, as genes from the same locus are assigned the same locus class along with all the edges between them precisely when there is no duplication along the path. Similarly, Constraint 2 and Constraint N2 are equivalent since genes from irreconcilable loci are assigned different locus classes if and only if there is a duplication somewhere on the path between the two.
Proof of Theorem S1.1 Suppose the locus equivalence graph is reconcilable. By Lemma S1.5, we have a definition for a reconcilable gene tree in terms of duplications, the constraints of which are repeated here: N1. There exists no duplication along the path between two genes from the same locus.
N2. There exists a duplication along the path between two genes from irreconcilable loci. That is, if we have a procedure to add duplications to our gene tree such that these constraints hold, then we have demonstrated that the gene tree is reconcilable. Consider the following procedure. For each pair of genes g 1 ∈ L(G), g 2 ∈ L(G) such that g 1 = g 2 and l 1 = Le(g 1 ) and l 2 = Le(g 2 ) are irreconcilable, consider the path from g 1 to g 2 . Since the LEG is reconcilable, two irreconcilable loci must be found in different connected components of the LEG, so by Lemma S1.4, there is an unlabeled edge on this path. Add a duplication on some unlabeled edge. Now, we have satisfied Constraint N2. We also claim that we have satisfied Constraint N1 because we never added a duplication on an edge that was labeled by the PLCT; therefore, we never added a duplication on the path between two genes from the same locus.
For the converse, we will prove the contrapositive: if the locus equivalence graph is irreconcilable, then the gene tree is also irreconcilable. Suppose the LEG is irreconcilable, and assume, by way of contradiction, that the gene tree is reconcilable using our original definition for a reconcilable gene tree, the constraints of which are repeated here: 1. Genes from the same locus must be assigned the same locus class, and all edges along the path between these genes must be assigned the same locus class as the genes. 2. Genes from irreconcilable loci must not be assigned the same locus class. Throughout the remainder of this proof, we say that a (species-specific) locus is assigned a locus class if all genes associated with that locus are assigned the locus class, and similarly, two (species-specific) loci are assigned different locus classes if all pairs of genes, one from each loci, are assigned different locus classes.
Since the LEG is irreconcilable, there must be some connected component that is irreconcilable and this connected component must contain some pair of loci l 1 and l 2 that are irreconcilable. Thus, by Constraint 2, l 1 and l 2 must not be assigned the same locus class. Next, recall that the LEG contains an edge between two loci if the loci are equivalent. Since locus equivalency is transitive, any two loci in a given connected component must be equivalent. So, since l 1 and l 2 are in the same connected component, they must be equivalent (and correspond to the same evolutionary locus) and, by Constraint 1, be assigned the same locus class. We have reached a contradiction in that loci l 1 and l 2 must be assigned and not be assigned the same locus class, so the gene tree must be irreconcilable.
Remark. The proof for the converse does not rely on any lemmas and, in particular, does not require a binary gene tree. Thus, for a non-binary gene tree, it holds that if the LEG is irreconcilable, the gene tree is irreconcilable. But, it does not hold that if the LEG is reconcilable, the gene tree is reconcilable.
Remark. If a reconcilable gene tree exists, this proof yields a procedure for generating a valid labeling of the gene tree, that is, a labeling that satisfies Definition S1.3. We consider all pairs of irreconcilable loci and for each, insert a duplication on an unlabeled edge between the two loci. Next, we cut the tree at the duplications and assign each leaf and edge within the same subgraph to a single locus class. As stated in the proof, this necessarily satisfies the definition of a reconcilable gene tree, thus, yielding a valid labeling. Because there may be multiple ways to insert duplications according to this procedure, multiple valid labelings may exist.
Proof of Theorem S1.2 We consider the time complexity for each subroutine of Algorithm S1 line-by-line. Let n = |L(G)| and k = |L|. We assume that L is represented using non-negative integers and that the mapping of loci to species is implemented using an array of size k so that for l ∈ L, s(l) requires O(1) time. Further, for each e ∈ E(G), P(e) is implemented as a boolean array of size k using one-hot encoding: for each l ∈ L, index l of P(e) is TRUE if P(e) contains l and is FALSE otherwise. So, initializing P(e) to ∅ requires O(k) time and adding an element to P(e) requires O(1) time. Additionally, we assume that G is implemented as an adjacency matrix so that adding vertices and edges requires O(1) time.
The Put together, the total time complexity of the algorithm is O(n 3 + nk + nk 2 + k 3 ). Since k ≤ n, the complexity can alternatively be written as O(n 3 ).

S1.2 Optimization
The approach for generating the PLCT in Algorithm S1 can be decreased from a total time complexity of O(n 3 ) to a complexity of O(nk), where k = |L|. We accomplish this optimization by rooting the gene tree arbitrarily along any branch, allowing us to traverse it in preorder and postorder.
Throughout this section, let T = (V (T ), E(T )) be a rooted, full binary tree with a set V (T ) of nodes and a set E(T ) of directed branches. Let L(T ) ⊂ V (T ) be the set of leaves, I(T ) = V (T ) \ L(T ) be the set of internal nodes, and r(T ) ∈ I(T ) be the root node. For node v, let left(v) be its left child, right(v) be its right child, parent(v) be its parent, and T (v) be the (maximal) subtree of T rooted at v.
Additionally, we define a modified PLCT P : V (G) → P(L) that labels each gene tree node with the species-specific locus or loci to which the node must belong. That is, for each g ∈ L(G), P (g) contains Le(g), and for each g ∈ I(G), P (g) contains l if and only if there exists a pair of genes mapped to l and g lies along the path between those genes. Then, for each e = (u, v) ∈ E(G), the original PLCT can be defined as P(e) = P (u) ∩ P (v); that is, an edge is labeled with l if the path goes through both vertices of the edge.
Algorithm S2 summarizes the optimized algorithm for generating the PLCT. Note that substituting Algorithm S2 for the PLCT component of Algorithm S1 (lines 1-5) reduces the total time complexity for Algorithm S1 from O(n 3 ) to O(nk + nk 2 + k 3 ) = O(nk 2 + k 3 ), which is in general faster because k ≤ n.

Proof of Correctness
In the postorder traversal, for each gene g ∈ V (G), we compute set U(g), which, for leaves, contains the species-specific locus of g (line 4), and for internal nodes, contains the labels that appear in any leaf descended from g (line 7). Next, for each leaf g ∈ L(G), P (g) contains only Le(g) (line 5). For each internal node g ∈ I(G), if U(left(g)) and U(right(g)) contain the same label l, then there must exist a pair of genes g 1 ∈ L(T (left(g))) and g 2 ∈ L(T (right(g))) [that is, g 1 and g 2 are in the set of leaves in the left and right subtrees of g, respectively] such that g 1 and g 2 map to the same locus l. Thus, there exists a path between g 1 and g 2 , both mapped to l, that goes through g, and so P (g) must contain l (line 8). So, at the end of the postorder traversal, for each g ∈ L(G), P (g) is computed correctly, and for each g ∈ I(G), P (g) contains l if and only if there exists a pair of genes in the subtree of G rooted at g such that both genes map to l and g lies along the path between those genes. Note that since T (r(G)) = G, it must be that P (r(G)) is computed correctly.
Then, for each internal node g that is not the root, what remains is to update P (g) with the labels induced by pairs of genes, one in the subtree of G rooted at g and one in the rest of G, such that both genes map to l. We preorder traverse the gene tree to perform this update and prove by induction that we correctly update P (g). For the base case, observe that P (r(G)) is correct. Assume, without loss of generality, that we are updating P (g) where g = left(r(G)). Recall that U(g) contains labels that appear in any leaf descended from g. At the same time, P (parent(g)) contains labels that must be "propagated" down the tree. So, if U(g) and P (parent(g)) contain the same label l, then there must exist a pair of genes g 1 ∈ L(T (g)) and g 2 ∈ L(G) \ L(T (g)) such that g 1 and g 2 map to the same locus l. And, as before, there exists a path between g 1 and g 2 , both mapped to l that goes through g, and so P (g) must contain l (line 11). Thus, the sets P (g) for each g = left(r(G)) and g = right(r(G)) are computed correctly. Induction completes our proof.

Proof of Complexity
We consider the time complexity for each component of Algorithm S2 line-by-line. As before, we assume that, for each g ∈ V (G), U(g) and P (g) are implemented using arrays of size l with one-hot encoding. Thus, the union and intersection of two sets requires O(k) time.
Rooting a tree requires O(n) time. The first 'for' loop at line 2 is executed O(n) times, and lines 4-5 each require O(1) time and lines 7-8 each O(k) time, giving a total time complexity of O(nk) for lines 3-8. The second 'for' loop at line 9 is also executed O(n) times, and line 11 requires O(k) time, giving a total time complexity of O(nk) for line 11. Thus, the total time complexity of the algorithm is O(nk).

S2 Biological Gene Trees
We analyzed seven species or subspecies within the great apes clade: humans (Homo sapiens), Western chimpanzees (Pan troglodytes verus), bonobos (Pan paniscus), Eastern lowland gorilla (Gorilla beringei graueri ), Western lowland gorilla (Gorilla gorilla gorilla), Sumatran orangutans (Pongo abelii ), and Bornean organutans (Pongo pygmaeus). We obtained reference genomes and variants from Prado-Martinez et al. (2013) and genome annotations from Ensembl release 72 (Flicek et al. 2014). For species without a reference genome (bonobos and Bornean orangutans), we used as reference the genome of its closest relative (chimpanzees and Sumatran orangutans). For the non-human species, we retained only variants that passed all filters using VCFtools (Danecek et al. 2011), imputed and phased genotypes using Beagle 4.0 (Browning and Browning 2007), and extracted autosomal variants in protein-coding regions and applied them to the reference fasta sequences using VCFtools. For humans, we performed a similar procedure except that we imputed and phased genotypes against the 1000 Genomes Project (The 1000 Genomes Project Consortium 2012). To reduce the number of samples and proteins under consideration, we retained the two highest-depth samples from each species and analyzed the longest protein sequence per gene.
Next, we obtained 10,800 autosomal gene family definitions from Ensembl release 72 (Flicek et al. 2014) and retained 7988 families that contained a variant in at least one gene within the family, yielding 15,976 families across two haplotypes. Of these, we retained 6298 "non-trivial" families that contained at least two loci from any one species. We aligned protein sequences using MUSCLE (Edgar 2004), reverse translated the protein alignment to a (codon-aligned) nucleotide alignment, and reconstructed gene trees using four programs: PHYLIP (Felsenstein 1989), BioNJ (Gascuel 1997), PhyML (Guindon et al. 2010) with the GTR model, and RAxML (Stamatakis 2006) with 100 fast bootstraps and the GTRGAMMA model.

S3 Simulated Gene Trees
In Rasmussen and Kellis (2012), the authors used the species tree of the Drosophila 12 Genomes Consortium (2007) with estimated divergence times (Tamura et al. 2004), gene duplication and loss rates of 0.0012 events/gene/million years (Hahn et al. 2007), and a generation time of 10 generations/year (Sawyer andHartl 1992, Pollard et al. 2006). While Drosophila melanogaster is estimated to have an effective population size N e of ∼1.15 million individuals (Charlesworth 2009), they used a wide range of population sizes and also a range of duplication-loss rates to investigate the effects of varying levels of gene tree-species tree incongruence, with each set of parameters used to simulate 500 gene families. For each parameter setting, we retained only the "non-trivial" families that contained at least two loci from any one species.
As our algorithm relies on multiple samples per species, we used the locus trees of Rasmussen and Kellis (2012) and, for each locus tree, simulated gene trees with N samples per species for N = 2, 5, 10. These gene trees are necessarily reconcilable; that is, they have no topological error that could result in infeasible reconciliations. Thus, we simulated alignments of 1000 nucleotides for each gene tree under a HKY model (Hasegawa et al. 1985) and with a substitution rate of 5 × 10 −9 substitutions/site/generation (Kimura 1968, Haag-Liautard et al. 2007) using seq-gen (Rambaut and Grassly 1997), and finally, we reconstructed gene trees from these alignments using RAxML (Stamatakis 2006) with 100 fast bootstraps and the GTRGAMMA model.
Note that, in line with our model assumptions, our simulation procedure does not allow for events such as horizontal gene transfer or gene conversion. Those interested in this more general model may consider SimPhy (Mallo et al. 2016 Fig S1. Generative process for the DLCoal model. Given a species tree with known topology and divergence times, a top-down duplication-loss process generates a locus tree. Branches that evolve in the daughter locus (rather than in the mother locus) of a duplication are denoted in a separate color. From the locus tree, a bottom-up coalescent process generates a gene tree. Mappings between the trees indicate how one tree "fits inside" the other. This diagram depicts the same gene evolutionary history as Figure 1c 1. A locus set L contains the set of species-specific loci that have evolved within the gene family. In this example, L = {a 1 , b 1 , b 2 }, where species A has locus {a 1 } and species B has loci {b 1 , b 2 } such that the relationship between loci in different species is unknown. 2. Two loci are orthologous if their MRCA corresponds to a speciation event (white circle in the locus tree). For the depicted history, loci a 1 and b 1 and loci a 1 and b 2 are orthologous. 3. Two loci are equivalent if they derived from their MRCA by speciation events alone. For the depicted history, loci a 1 and b 1 are equivalent. Note that loci a 1 and b 2 are orthologous but not equivalent because a duplication (yellow star) created a new locus ("locus 2") distinct from the original locus ("locus 1"). 4. Two loci are reconcilable if they could potentially be orthologous. In this example, loci a 1 and b 1 and loci a 1 and b 2 are reconcilable. The depicted history shows how a 1 and b 1 would be orthologous. For a different history in which b 1 evolved in the new locus and b 2 in the original locus, a 1 and b 2 would be orthologous. In contrast, loci b 1 and b 2 from the same species must be paralogs; thus, they cannot be orthologous and are not reconcilable.