- Split View
-
Views
-
Cite
Cite
Benny Chor, Michael D. Hendy, Sagi Snir, Maximum Likelihood Jukes-Cantor Triplets: Analytic Solutions, Molecular Biology and Evolution, Volume 23, Issue 3, March 2006, Pages 626–632, https://doi.org/10.1093/molbev/msj069
- Share Icon Share
Abstract
Maximum likelihood (ML) is a popular method for inferring a phylogenetic tree of the evolutionary relationship of a set of taxa, from observed homologous aligned genetic sequences of the taxa. Generally, the computation of the ML tree is based on numerical methods, which in a few cases, are known to converge to a local maximum on a tree, which is suboptimal. The extent of this problem is unknown, one approach is to attempt to derive algebraic equations for the likelihood equation and find the maximum points analytically. This approach has so far only been successful in the very simplest cases, of three or four taxa under the Neyman model of evolution of two-state characters. In this paper we extend this approach, for the first time, to four-state characters, the Jukes-Cantor model under a molecular clock, on a tree T on three taxa, a rooted triple. We employ spectral methods (Hadamard conjugation) to express the likelihood function parameterized by the path-length spectrum. Taking partial derivatives, we derive a set of polynomial equations whose simultaneous solution contains all critical points of the likelihood function. Using tools of algebraic geometry (the resultant of two polynomials) in the computer algebra packages (Maple), we are able to find all turning points analytically. We then employ this method on real sequence data and obtain realistic results on the primate-rodents divergence time.
Introduction
Maximum likelihood (ML) is increasingly used as an optimality criterion for selecting evolutionary trees (Felsenstein 1981), but finding the global optimum is a hard computational task, which led to using mostly numeric methods. So far, analytical solutions have been derived only for the simplest models (Yang 2000; Chor, Hendy, and Penny 2001; Chor, Khetan, and Snir 2003)—three and four taxa under a molecular clock, with just two-state characters (Neyman 1971). In this work we present, for the first time, the analytic solutions for a general family of trees with four-state characters, such as DNA or RNA. The model of substitution we use is the Jukes-Cantor model (Jukes and Cantor 1969) under a molecular clock, the simplest four-state model where all substitutions have the same rate. The trees we deal with are on just three taxa, namely, rooted triplets (see fig. 1b).
The change from two to four character states incurs a major increase in the complexity of the underlying algebraic system. Like previous works, our starting point is to present the general ML problem on phylogenetic trees as a constrained optimization problem. This gives rise to a complex system of polynomial equations, and the goal is to solve them. The complexity of this system prevents any manual solution, so we relied heavily on Maple, a symbolic mathematical system. However, even with Maple, a simple attempt to solve the system (e.g., just typing solve) fails, and additional tools are required. Spectral analysis (Hendy and Penny 1993; Hendy, Penny, and Steel 1994; Hendy and Snir 2005) uses Hadamard conjugation to express the expected frequencies of site patterns among sequences as a function of an evolutionary tree T and a model of sequence evolution along the edges of T. As in previous works, we use the Hadamard conjugation as a basic building block in our method of solution. However, it turns out that the specific way we represent the system, which is determined by the choice of variables, plays a crucial role in the ability to solve it. In previous works (Chor, Hendy, and Penny 2001; Chor, Khetan, and Snir 2003), the variables in the polynomials were based on the expected sequence spectrum (Hendy and Penny 1993). This representation leads to a system with two polynomials of degrees 11 and 10. This system is too complex to resolve with the available analytic and computational tools. By employing a different set of variables, based on the path-set spectrum (Hendy and Snir 2005), we were able to arrive at a simpler system of two polynomials whose degrees are 10 and 8. To finesse the construction, we use algebraic geometry combined with Maple. Specifically, we now compute the resultant of the two polynomials, which yields a single, degree 11 polynomial in one variable. The roots of this polynomial yield the desired ML solution. We remark that similar results to those shown here were obtained by Hosten, Khetan, and Sturmfels (2004), using somewhat different techniques. A set of computational algebraic tools for analyzing similar models on small trees based on the methods reported in Catanese et al. (2004); Hosten, Khetan, and Sturmfels (2004); and Sturmfels and Sullivant (2005) is detailed in the web page http://math.tamu.edu/∼lgp/small-trees/.
It is evident that the currently available heuristic methods can fail to infer the correct tree, even for small number of taxa. This is true not only in the presence of multiple ML points but also in cases where the neighborhood of the (single) ML point is relatively flat. Therefore, a practical consequence of this work is the use of rooted triplets in supertree methods (constructing a big tree from small subtrees). For unrooted trees, the subtrees must have at least four leaves (e.g., “the tree from quartets” problem). For rooted trees, it is sufficient to amalgamate a set of rooted triplets (Aho et al. 1981). Our work enables such approaches to rely on rooted ML triplets based on four characters states rather than two.
Additionally, analytic solutions are able to reveal properties of the ML points that are not obtainable numerically. For small trees, our result can serve as a method for checking the accuracy of the heuristic methods used in practice.
The remainder of this work is organized as following. In the next section we describe the materials and methods we used in this work. Specifically, in Definitions, Notations, and the Hadamard Conjugation we provide definitions and notations used throughout the rest of this work. In Definitions, Notations, and the Hadamard Conjugation we develop the identities and structures induced by the Jukes-Cantor model, while in Obtaining the ML Solution we develop ML on phylogenetic trees and subsequently solves the system for the special case of three species under Jukes-Cantor and molecular clock. In Results and Discussion we give results and discuss future research directions: Results on Genomic Sequences reports on experimental results of applying our method on real genomic sequences, while in Directions for Future Research we conclude with three open problems.
Materials and Methods
Here we detail on the methods and tools we employed in order to obtain our results. These are developed specifically for the tree T on three taxa referred to as 1, 2, and 3. We label the leaves of T as 1, 2, and 3 and the edges (branches) as e1, e2, and e3, as illustrated in figure 1a. Taxon 3 is chosen to be the reference taxon. Our analysis is focused on the site substitutions required to transform the reference sequence to those of 1 and 2 under Kimura's 3-substitution model (K3ST) (Kimura 1981). We will subsequently impose the constraints of the Jukes-Cantor model (Jukes and Cantor 1969), and under a molecular clock, on the rooted tree of figure 1b.
Definitions, Notations, and the Hadamard Conjugation
Given aligned nucleotide sequences for the three taxa 1, 2, and 3, there are 43 nucleotide combinations possible at each site. We refer to each of these as a “character pattern.” A character pattern
Kimura (1981) in his K3ST model, describes three classes of substitution: the transitions; and two types of transversions. Following Hendy and Snir (2005), we use α to denote a transition, β and γ to denote the two transversion types, together with ϵ to indicate no substitution, as described in figure 2. For example, the character pattern
In Hendy, Penny, and Steel (1994) it is shown that under the K3ST model, the probability of a substitution pattern is independent of the character state at taxon 3. We index each site pattern by a pair (X, Y) of subsets of {1, 2}, where X is the set of taxa with substitutions which cross the line aa′ in (fig. 2), and Y is the set of taxa with substitutions which cross the line bb′. We denote the probability of site pattern (X, Y) as sX,Y, which when not ambiguous, we index by the lists of elements of X and of Y, for brevity. Thus, for example, the substitution pattern
Equation (2) in Theorem 1 is equivalent to earlier expressions (Steel et al. 1992; Székely et al. 1993) of Hadamard conjugations for the K3ST model, which used Hadamard matrices of 2n rows and columns applied to vectors of 2n entries, with qi = qα(ei) = qβ(ei) = qγ(ei). The current expression has recently been developed in Hendy and Snir (2005). A full proof of this result follows directly from applying Theorem 7 of Hendy and Snir (2005) on the tree T. This approach has been followed in order to clarify the role of path sets, which explain the intermediate terms in the conjugations, enabling us to derive and interpret the values in equation (8) directly.
Obtaining the ML Solution
When we have an aligned sequence of length c for each of the three taxa 1, 2, and 3, we can count the relative frequencies (normalized to sum to 1) of the substitution patterns among the sites. For (X, Y) | X, Y ⊆ {1, 2}, we set fX,Y to be the relative frequency of observing the site pattern (X, Y), and let F be the 4 × 4 matrix with entries f(X, Y) indexed as in the sequence spectrum S. If the sequences were generated under the Jukes-Cantor model on the tree T with edge weights q1, q2, q2, then the expected number of substitution pattern (X, Y) is fX,Y = csX,Y, where c is the number of sites.
However, in c independent samples, the observed frequencies will include sampling error, so we cannot directly conclude S = F. ML provides a method of estimating the parameters q1, q2, and q3 from the observed frequencies F.
We now show that the system of two resulting polynomials {E1, E2} has only finitely many solutions, all of which we can find. The major tool used here is the “resultant” of two polynomials. Let
Computing the resultant is a classical technique for eliminating one variable from two equations. There is an elegant formula for computing it due to Sylvester and another due to Bezout, which have been implemented in the computer algebra package Maple.
The turning points of L (eq. 14) corresponding to realistic trees (namely, trees with positive edge lengths) are exactly the roots of P0.
The only factors in ER (eq. 20) that admit positive real roots are P0 and
The Jukes-Cantor triplet has a finite number of ML points.
P0 has at most 11 different solutions and, for each such solution, we can back substitute to obtain all the values of x3. □
Maxima on the Boundaries of the Parameter Space
In case III, with x = x3, x1 = x2 = 1, L = 0 (unless f1 = f2 = f4 = 0, whence L is undefined).
Results and Discussion
Here we report experimental results obtained by employing our method on genomic sequences. We conclude by discussion and pointing out future research directions.
Results on Genomic Sequences
In order to evaluate our method, we tested it on three sets of real genomic sequences. In each case, we found a single solution to P0 = 0 in the realistic parameter space, for each of the three rooted triples. By evaluating the likelihood at each of these turning points, we found each was a maximum.
We also calculated the ML value for each of the three rooted trees for the beta actin gene, for the three species guinea pig, goose, and Caenorhabditis elegans (accession numbers AF508792, M26111, and NM_076440, respectively), finding the ((guinea pig, goose), C. elegans) tree maximal, with q1 = q2 = 0.021819 and q3 = 0.050188 giving ln L = −1241.5. Finally, we calculated the ML value for each of the three rooted trees for the histone gene of Drosophila melangoster, Hydra vulgaris, and human (accession numbers AY383571, AY383572, and NM_002107, respectively), finding the ((D. melangoster, H. vulgaris), Human) tree maximal, with q1 = q2 = 0.001555 and q3 = 0.012740 with ln L = −86.835133.
Each of the results above agree closely with the numerical values obtained using the popular phylogenetic reconstruction packages PHYLIP (Felsenstein 1995) and PAUP* (Swofford 1998), which use iterative methods to estimate the maxima. In each case, the likelihood function had a unique maximum in the parameter range.
Directions for Future Research
The progress made here brings up a number of open problems:
Our ML solutions are derived from the roots of a univariate, degree 11 polynomial. This implies that the number of ML solutions is finite. It would be interesting to explore the question of “uniqueness” of the solution. If this is the case, it will most likely follow from the existence of a single solution corresponding to a realistic tree, as in Chor, Khetan, and Snir (2003).
The Jukes-Cantor substitution model is a special case of the family of Kimura substitution models. It would be interesting to further extend the result in this paper for the other models (two and three parameters) of the Kimura family.
It would be interesting to extend these results to rooted trees with “four leaves” under Jukes-Cantor model and a molecular clock. Here we have two different topologies—the fork and the comb (Chor, Khetan, and Snir 2003). It is expected that such extension will face substantial technical difficulties.
Appendix
William Martin, Associate Editor
Thanks to Joseph Felsenstein for his fruitful discussions, to Bernd Sturmfels for enlightening comments on this manuscript and informing us about his manuscript (Hosten, Khetan, and Sturmfels 2004), and to the two other reviewers for their constructive comments. This research was supported by the Israel Science Foundation grant 418/00.
References
Aho, A. V., Y. Sagiv, T. G. Szymanski, and J. D. Ullman.
Catanese, F., S. Hosten, A. Khetan, and B. Sturmfels.
Chor, B., M. Hendy, B. Holland, and D. Penny.
Chor, B., M. Hendy, and D. Penny.
Chor, B., A. Khetan, and S. Snir.
Chor, B., and S. Snir.
Felsenstein, J.
———.
Hendy, M. D., D. Penny, and M. A. Steel.
Hendy, M. D., and S. Snir.
Hosten, S., A. Khetan, and B. Sturmfels.
Jukes, T. H., and C. R. Cantor.
Kimura, M.
Neyman, J.
Steel, M., M. D. Hendy, and D. Penny.
Steel, M. A., M. D. Hendy, L. A. Székely, and P. L. Erdös.
Sturmfels, B., and S. Sullivant.
Székely, L., P. L. Erdös, M. A. Steel, and D. Penny.
Thompson, J. D., D. G. Higgins, and T. J. Gibson.
Author notes
*School of Computer Science, Tel-Aviv University, Israel; †Allan Wilson Centre for Molecular Ecology and Evolution, Massey University, Palmerston North, New Zealand; and ‡Mathematics Department, University of California, Berkeley