Sequence length bounds for resolving a deep phylogenetic divergence
Introduction
When sequence sites evolve independently under a Markov process along the branches of a tree, the sequences observed at the tips contain information about the underlying tree. This allows the tree to be reconstructed accurately from sufficiently long sequences, which is the basis of modern molecular systematics (Felsenstein, 2003). The number of sites required for accurate reconstruction depends on how long the edges of the tree are. More precisely, it depends on the expected number of substitutions on each branch (edge) e of the tree, which we refer to as the branch length of e (this is the product of the temporal duration of the branch and the substitution rate).
A number of authors (e.g. Churchill et al., 1992, Lecointre et al., 1994, Saitou and Nei, 1986, Townsend, 2007, Wortley et al., 2005, Xia et al., 2003, Yang, 1998) have considered various ways to quantify the phylogenetic signal in aligned DNA sequences, and to estimate the sequence length required to reconstruct a phylogenetic tree. Most of these studies have involved simulation or heuristic approaches, although some analytical bounds have also been obtained (Mossel and Steel, 2005, Steel and Szekely, 2002). Typically, these bounds state that if an interior branch length is very short, or if a terminal (external) branch length is long, then a large number of sites will be required.
In this paper we explore these results further by obtaining bounds that are expressed purely in terms of the relative sizes of the branch lengths, not their absolute values. One motivation for our approach is that different genes are known to evolve at different rates, so that any particular branch length will depend on which gene is considered; however, the ratios of the branch lengths will be unchanged if the gene-specific rate applies uniformly across the tree.
A particularly difficult tree reconstruction problem, requiring long sequences to resolve, arises when one has an interior edge with a short branch length incident with edges (or subtrees) having large branch lengths. Such a scenario occurs, for example, when speciation events in rapid succession (leading to short branch lengths) occurred in the distant past (leading to the large branch lengths for the incident edges). Several examples of this have been highlighted in the literature (Lockhart et al., 2006; Rokas and Carroll, 2006) and include the origin of metazoa and the origin of photosynthesis.
In this paper we analyse a scenario which, although somewhat idealised, nevertheless captures the essence of this problem: a four-taxon tree in which the terminal edges have equal branch lengths that are a fixed multiple of the branch length of the interior edge, together with a simple symmetric model of site evolution (specifically, we assume sites evolve independently according to a common two-state Markov process).
We provide a mathematical analysis of the question of how many sites are required to resolve the tree correctly (from the three possible resolved topologies on four taxa). We are particularly interested in how the required sequence length, k, depends on the ratio of the terminal to the interior branch length, independent of the absolute value of any particular edge length. We establish that k must grow at least quadratically in this ratio, which implies that regardless of how fast (or slow) any particular sequence is evolving, we can set explicit lower bounds on the length of sequences required to resolve the tree. We then show that, in our setting, the growth in k need not be any worse than quadratic in this ratio, because an existing method (namely, maximum parsimony) achieves this growth rate. This does not imply that maximum parsimony is the ‘best’ method for tree reconstruction; we chose it simply because we can analytically calculate tree reconstruction probabilities for this method. Our results complement an earlier simulation-based analysis (Yang, 1998). We contrast our results with a quite different model of site evolution (the infinite state model), for which we establish that a lower order of growth in k can sometimes suffice.
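The setting described above can be explored numerically. The following is a minimal simulation sketch, not the analytical calculation used in the paper: it assumes the standard two-state symmetric model, in which a branch of length l (expected number of substitutions) changes state with probability (1 - exp(-2l))/2, and it uses the fact that for four taxa and binary characters maximum parsimony selects the topology supported by the most two-versus-two site patterns. The function names, the parameter name `ratio` (terminal branch length divided by interior branch length), and the choice to count ties as failures are all assumptions made for illustration.

```python
import numpy as np

def flip_prob(length):
    """Probability that the two ends of a branch differ under the two-state
    symmetric model, for a branch with `length` expected substitutions."""
    return 0.5 * (1.0 - np.exp(-2.0 * length))

def evolve(states, length, rng):
    """Evolve an array of 0/1 states independently along one branch."""
    flips = rng.random(states.shape[0]) < flip_prob(length)
    return np.where(flips, 1 - states, states)

def simulate_leaves(k, x, ratio, rng):
    """Simulate k i.i.d. sites on the tree 12|34: interior branch length x,
    terminal branch lengths ratio * x (`ratio` is a hypothetical name)."""
    u = rng.integers(0, 2, size=k)   # interior vertex adjacent to taxa 1, 2
    v = evolve(u, x, rng)            # interior vertex adjacent to taxa 3, 4
    t = ratio * x
    return (evolve(u, t, rng), evolve(u, t, rng),
            evolve(v, t, rng), evolve(v, t, rng))

def parsimony_choice(leaves):
    """Return 0, 1 or 2 for topology 12|34, 13|24 or 14|23, or None on a tie.
    Only two-versus-two patterns distinguish the three topologies."""
    a, b, c, d = leaves
    counts = [
        int(np.sum((a == b) & (c == d) & (a != c))),  # supports 12|34
        int(np.sum((a == c) & (b == d) & (a != b))),  # supports 13|24
        int(np.sum((a == d) & (b == c) & (a != b))),  # supports 14|23
    ]
    best = max(counts)
    winners = [i for i, n in enumerate(counts) if n == best]
    return winners[0] if len(winners) == 1 else None  # ties count as failures

def mp_success_rate(k, x, ratio, replicates=200, seed=0):
    """Fraction of replicates in which parsimony recovers the true tree 12|34."""
    rng = np.random.default_rng(seed)
    hits = sum(parsimony_choice(simulate_leaves(k, x, ratio, rng)) == 0
               for _ in range(replicates))
    return hits / replicates

if __name__ == "__main__":
    # Fixed k and x; the success rate is expected to fall as the ratio grows.
    for ratio in (1, 5, 10, 20):
        print(ratio, mp_success_rate(k=1000, x=0.05, ratio=ratio))
```

Holding k fixed and increasing `ratio` in such a simulation gives an informal impression of how quickly the signal degrades; it does not replace the bounds derived in the paper.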
We also extend the approach to more general Markov processes on trees, obtaining exact but less explicit lower bounds on k, which involve absolute (rather than relative) branch lengths. Our arguments are based on standard techniques from probability theory, such as the central limit approximation, and on information-theoretic arguments based on the properties of Hellinger distance.
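To illustrate the role of Hellinger distance, the sketch below computes the exact site-pattern distributions generated by two competing four-taxon topologies under the two-state symmetric model and measures how far apart they are. It is an illustration under stated assumptions rather than the argument of the paper: the squared distance is taken here as the sum over patterns of (sqrt(p) - sqrt(q))^2, without the factor 1/2 that some authors include, and the function names and the parameter name `ratio` are hypothetical.

```python
import itertools
import math

def flip_prob(length):
    """Change probability along a branch under the two-state symmetric model."""
    return 0.5 * (1.0 - math.exp(-2.0 * length))

def pattern_distribution(pairing, x, ratio):
    """Exact distribution over the 16 binary site patterns on four taxa.

    `pairing` gives the 0-based leaf indices attached to each interior vertex,
    e.g. ((0, 1), (2, 3)) for topology 12|34. The interior branch has length x
    and every terminal branch has length ratio * x (assumed parameterisation).
    """
    q_int = flip_prob(x)
    q_term = flip_prob(ratio * x)
    dist = {}
    for leaves in itertools.product((0, 1), repeat=4):
        total = 0.0
        # Sum over the unobserved states u, v at the two interior vertices.
        for u, v in itertools.product((0, 1), repeat=2):
            prob = 0.5 * (q_int if u != v else 1.0 - q_int)
            for side, parent in zip(pairing, (u, v)):
                for leaf in side:
                    prob *= q_term if leaves[leaf] != parent else 1.0 - q_term
            total += prob
        dist[leaves] = total
    return dist

def hellinger_squared(d1, d2):
    """Squared Hellinger distance, using the convention without the 1/2 factor."""
    return sum((math.sqrt(d1[s]) - math.sqrt(d2[s])) ** 2 for s in d1)

if __name__ == "__main__":
    x = 0.1
    for ratio in (1, 2, 4, 8):
        d12 = pattern_distribution(((0, 1), (2, 3)), x, ratio)
        d13 = pattern_distribution(((0, 2), (1, 3)), x, ratio)
        print(ratio, hellinger_squared(d12, d13))
```

The smaller this distance is for a single site, the more independent sites are needed before the two topologies can be told apart reliably.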
Preliminaries
Consider an unrooted binary phylogenetic tree on four taxa, with branch length x for the interior edge and a common branch length, a fixed multiple of x, for each of the four terminal edges. This is illustrated in Fig. 1(a), and the topology of the tree is shown at the top of Fig. 1(b). The other two competing topologies are also shown in Fig. 1(b). Here branch length refers to the expected number of substitutions under some continuous time substitution process.
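For the two-state symmetric model considered here, the relation between a branch length and the probability of observing a change across that branch takes a standard form, sketched below. The symbols \ell (a branch length) and q(\ell) (the probability that the two endpoints of the branch are in different states) are notation introduced for this illustration and need not match the notation of the paper.

```latex
% Two-state symmetric model: probability that the endpoints of a branch
% with expected number of substitutions \ell are in different states.
q(\ell) = \tfrac{1}{2}\left(1 - e^{-2\ell}\right),
\qquad
q(\ell) \approx \ell \ \text{ for small } \ell,
\qquad
q(\ell) \to \tfrac{1}{2} \ \text{ as } \ell \to \infty .
```

As a terminal branch becomes long, q approaches 1/2 and the state at the tip becomes nearly independent of the state at the interior vertex, which is why long terminal edges make a short interior edge hard to detect.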
Recall that a binary character or site
Lower bounds
The main result of this section is the following: Theorem 3.1 Suppose k sites evolve i.i.d. under a symmetric two-state model on some (unknown) four-taxon tree that has branch length x on the interior edge and a common branch length, a fixed multiple of x, on each terminal edge. Then, for any x, any method that correctly identifies the underlying tree topology with at least a prescribed probability requires a sequence length k that grows at least quadratically in the ratio of the terminal to the interior branch length.
To establish this result we require some preliminary results. We begin with a general information-theoretic bound on
An upper bound: the performance of maximum parsimony
We now show that the lower bound described above is essentially ‘best possible’ (up to a constant factor) for the given model, as it can be achieved for a certain choice of x by a simple tree reconstruction method, namely maximum parsimony (MP). This method selects the tree that requires the smallest number of substitutions to extend the sequences at the tips of the tree to (ancestral) sequences at all the interior vertices of the tree (for further background, the reader can consult, for
Lower bounds for more general models
In this section we derive a lower bound on the sequence length required for tree reconstruction for a much wider range of Markov processes. However, unlike in the previous sections, our bound is expressed in terms of the absolute branch lengths (or bounds on these) rather than in terms of ratios, and it involves constants that depend on the details of the model.
We first derive a general lemma. Consider any continuous-time, stationary and reversible Markov process. Let denote its state space, and
Concluding remarks
In this paper we have provided precise results for a specific and simple model (the two-state symmetric process), along with less explicit results for more general Markov processes (and phrased in terms of absolute rather than relative branch lengths). The aim is to determine rigorous bounds on the sequence length required for resolving a deep divergence, which may shed light on debates as to whether some early radiations might be fundamentally unresolvable on the basis of current models and
Acknowledgements
We thank the Allan Wilson Centre for Molecular Ecology and Evolution for funding this work. We also thank the two anonymous referees for their helpful comments.
References (19)
- et al. How many nucleotides are required to resolve a phylogenetic problem? The use of a new statistical method applicable to available sequences. Mol. Phyl. Evol. (1994)
- et al. A phase transition for a random cluster model on phylogenetic trees. Math. Biosci. (2004)
- et al. An index of substitution saturation and its applications. Mol. Phyl. Evol. (2003)
- et al. The Probabilistic Method (2000)
- et al. Sample size for a phylogenetic inference. Mol. Biol. Evol. (1992)
- Inferring Phylogenies (2003)
- The relationship between simple evolutionary tree models and observable sequence data. Syst. Zool. (1989)
- et al. Heterotachy and tree building: a case study with plastids and eubacteria. Mol. Biol. Evol. (2006)
- et al. How much can evolved characters tell us about the tree that generated them?