Maximum Likelihood Estimation for Unrooted 3-Leaf Trees: An Analytic Solution for the CFN Model

Maximum likelihood estimation is among the most widely-used methods for inferring phylogenetic trees from sequence data. This paper solves the problem of computing solutions to the maximum likelihood problem for 3-leaf trees under the 2-state symmetric mutation model (CFN model). Our main result is a closed-form solution to the maximum likelihood problem for unrooted 3-leaf trees, given generic data; this result characterizes all of the ways that a maximum likelihood estimate can fail to exist for generic data and provides theoretical validation for predictions made in Parks and Goldman (Syst Biol 63(5):798–811, 2014). Our proof makes use of both classical tools for studying group-based phylogenetic models such as Hadamard conjugation and reparameterization in terms of Fourier coordinates, as well as more recent results concerning the semi-algebraic constraints of the CFN model. To be able to put these into practice, we also give a complete characterization to test genericity.


Introduction and Preliminaries
This paper is concerned with inferring the evolutionary history of a set of a species or other taxa from sequence data using maximum likelihood.In practice, maximum likelihood inference is among the most commonly-used in phylogenetic analyses, and in contrast to the simple (but more analytically tractable) model considered in this paper, maximum likelihood estimation is typically undertaken with sophisticated models of site evolution, utilizing heuristic (e.g.hill-climbing) methods and multiple start points to explore the tree space in order to obtain parameters which maximize the likelihood.A variety of excellent and widely-used implementations exist, many of which have been used in thousands of studies (e.g., Nguyen et al. 2015;Stamatakis 2014;Yang 2007;Price et al. 2010).
Nonetheless, there remains interest in computing analytic (i.e., closed-form) solutions in simpler cases (Chor and Snir 2004;Chor et al. 2003Chor et al. , 2000;;Yang 2000;Chor et al. 2007;Hobolth and Wiuf 2024;Chor et al. 2005) as well as exact solutions via algebraic methods (Kosta and Kubjas 2019;Garcia Puente et al. 2022), with the goal of providing a more rigorous understanding of the properties of maximum likelihood estimation and the ways that it can fail.For example, it is well-known that the maximum likelihood tree need not be unique (Steel 1994), that for certain data there exists a continuum of trees which maximize the likelihood (Chor et al. 2000), and that there exist data for which the maximum likelihood estimate does not exist-or at least, is not a true tree with finite branch lengths (Kosta and Kubjas 2019).In the context of phylogenetic estimation, maximum likelihood exhibits complex behavior even in very simple cases; one important example of this is long-branch attraction, a form of estimation bias which has been the subject of much interest (see, e.g., Susko and Roger 2021;Parks and Goldman 2014;Bergsten 2005;Anderson and Swofford 2004), but is not yet fully understood.
The specific problem that we consider in this paper is that of using maximum likelihood to estimate the branch lengths of an unrooted 3-leaf tree from molecular data generated according to the Cavendar-Farris-Neyman (CFN) model, a binary symmetric model of site substitution.A related problem of estimating rooted 3-leaf trees under the molecular clock assumption was considered by Yang (2000).The problem considered here differs from that in Yang (2000) as we do not assume a molecular clock; instead our problem involves maximizing the likelihood over three independently varying branch length parameters.
The 3-leaf MLE problem in this paper was considered, though not fully solved, in Kosta and Kubjas (2019).There, the authors introduced a general algorithm for computing numerically exact solutions using semi-algebraic constraints (i.e.polynomial inequalities) satisfied by phylogenetic models, along with methods from numerical algebraic geometry.Applying their method to the 3-leaf maximum likelihood problem under the CFN model, they discovered a nontrivial example where the maximum likelihood estimate does not exist.A similar problem was also considered in Parks and Goldman (2014), who obtained a partial solution (i.e., involving simulations) for the 4-state Jukes-Cantor model, allowing them to use distance estimates to predict with high accuracy certain features of the maximum likelihood estimate for specific data.
In this paper, we go one step further, presenting a full solution to the 3-leaf maximum likelihood problem under the CFN model, for generic data.Our proofs have a similar flavor to the model boundary decompositions technique in Allman et al. (2019, Section 4), but we consider a submodel that is not full dimensional.The main result of this paper provides a full characterization of the behavior of maximum likelihood estimation in this setting, up to and including necessary and sufficient conditions for the MLE to exist, as well as detailed analysis describing the ways in which an MLE may fail to exist.Further, our results validate the predictions given in Parks and Goldman (2014), providing theoretical underpinning to an interesting connection between maximum likelihood and distance-based estimates appearing in that paper.
While finalizing this manuscript, the recent work of Hobolth and Wiuf (2024) came to our attention.Building on a connection between the multinomial distribution and maximum likelihood estimation of 3-leaf trees under the CFN model, the authors of Hobolth and Wiuf (2024) make a surprising and interesting discovery about the likelihood geometry for 3-leaf models: in the case of three leaves, estimation by maximum likelihood is equivalent to estimation by pairwise distances, an equivalence which does not hold for trees with four or more leaves.In particular, the authors use this to obtain an analytic solution to the MLE problem for 3-leaf trees which is applicable whenever the data (regarded here as a vector representing the empirical site pattern frequencies) lies in the interior of the parameter space of the model.The present paper provides a natural extension of the 3-leaf result in Hobolth and Wiuf (2024) in two ways: first, by providing a simple characterization of when the data lies in the interior of the model, and second, by analyzing in detail the case when it does not; the analysis of this non-interior case is both substantial and technically non-trivial, and provides an improved understanding of the settings where maximum likelihood is prone to failure, such as in the case of trees with very long or short branches.
The remainder of this paper is structured as follows.In Sect. 1 we introduce the model, problem statement, and some of the key tools that will be employed.In Sect.2, we present our main result, a closed-form "analytic" solution to the maximumlikelihood problem for 3-leaf trees under the CFN model.In Sect.3, we discuss the significance and novel contribution of this result.A proof of the main result is then presented in Sect. 4.

Tree Parameter
For any finite set X , an X -tree T is an ordered pair (T ; ϕ) where T is a tree with vertex set V (T ) and the labelling map ϕ : X → V is a map such that v ∈ ϕ(X ) whenever v ∈ V (T ) and deg(v) ≤ 2. If ϕ is a bijection into the leaves of T , then T is called a phylogenetic X -tree; in this case the elements of X are identified with the leaves of the tree (for a standard reference, see Semple and Steel 2003).In addition, we associate with T a vector of nonnegative branch lengths d := (d e ) e∈E (T ) , where E(T ) is the edge set of T .
We regard X as a set of taxa, with T representing a hypothesis about their evolutionary or genealogical history; the branch lengths are regarded as representing a measure of evolutionary distance measured in expected number of mutations per site.In this paper we consider exclusively the case X = [3].
Rather than using the evolutionary distances d as edge parameters of T , for our analyses it will be more convenient to use an alternative parameterization of the branch lengths as a vector θ := (θ e ) e∈E(T ) ∈ [0, 1] |E(T )| , where for all e ∈ E(T ).The numerical edge parameters (θ e ) e∈E(T ) have been referred to as "path-set variables" (Chor et al. 2003); we refer to them here as the Hadamard parameters.

Site Substitution Model
The site substitution model considered in this paper is the fully symmetric Cavendar-Farris-Neyman (CFN) model.Also known as the N 2 model (e.g., in Semple and Steel  (2003)), the CFN model takes as input a tree parameter T θ = (T , θ), consisting of a phylogenetic [n]-tree T = (T , ϕ) along with edge parameters θ = (θ e ) e∈E(T ) , and outputs a random vector X = (X 1 , . . ., X n ), whose entries X 1 , . . ., X n ∈ {−1, +1} are associated with the n leaves of T .
The CFN model, corresponding to a time-reversible Markov chain on a tree, is the simplest model of site substitution, possessing only two nucleotide states, which we denote by +1 (pyrimidine) and −1 (purine).Under this model, the probability of a nucleotide in state i transitioning to state j over an edge of length t can be shown to be (Yang 2000, and for a more general reference, see Semple and Steel (2003), p. 197).In other words, the transition probability from state i to j along a given edge e, denoted P i j (e), is for all i, j ∈ {+1, −1}.Moreover, we assume a uniform root distribution, from which it follows that for all σ ∈ {−1, 1} n (c.f.Lemma 8.6.1.(ii)in Semple and Steel (2003), see also Sturmfels and Sullivant (2004), p. 221).
The distribution of X depends on both the topology and branch lengths of the tree parameter.For a phylogenetic [n]-tree T with topology τ and Hadamard parameters θ , we use the notation where and P is the distribution of X under the CFN model with parameters T and θ .

Identifiability of Model Parameters
Another way to understand the CFN model is as a parameterized statistical model with the tree topology τ held fixed; in this case, the CFN model is regarded as the image of the map where τ ⊆ [0, 1] |E(T )| is the set of possible Hadamard parameters, and where r −1 ⊂ R r is the probability simplex of dimension r − 1; i.e., r −1 := x ∈ R r : x 1 , . . ., x r ≥ 0 and The usual assumption prescribed for the CFN model (which is not made in this paper) is that τ = (0, 1) |E(T )| (so that 0 < θ e < 1 for all e ∈ E(T ), or equivalently, that 0 < d e < ∞ for all e ∈ E(T )).Under that assumption, τ is injective (Hendy 1991), and hence the edge parameters θ = (θ e ) e∈E(T ) are identifiable.This means that if T θ := (T , θ) and T θ := (T , θ ) are two n-leaf trees with the same topology τ but with edge parameters θ and θ such that θ = θ , then the distributions of X will be different under T θ and T θ .
On the other hand in this paper, we consider an extension by allowing θ e ∈ [0, 1] for each edge e ∈ E(T ), rather than θ e ∈ (0, 1).This extension allows for branch lengths which are infinite or zero (when measured in expected number of mutations per site), in order to better understand the behavior of maximum likelihood estimation in the limit as one or more branch lengths tend to zero or to infinity.This seemingly slight extension of the model substantially adds to the complexity of the analysis.In particular, as a consequence of this extension, it is no longer the case that the numerical parameters θ are identifiable, which presents certain complications, described in detail later in the paper.On the other hand, it also has the effect of guaranteeing the existence of the maximum likelihood estimate.

Data
In practice, DNA sequence data is typically arranged as a multiple sequence alignment, an n × N matrix, with each row corresponding to a leaf of T and each column representing an aligned site position.It is standard (albeit unrealistic) to assume that sites evolved independently, and as such our data consists of N random column vectors where X is a random variable taking values in {+1, −1} n whose distribution will be described below, and which is regarded as a vector of nucleotides observed at the leaves of T such that X i is the nucleotide observed at the vertex with label i for each i ∈ [n].Under the CFN model, the distribution of X depends on the topology of T as well as the edge parameters θ .Due to the exchangeability of X (1) , . . ., X (N ) , the data can be summarized by a site frequency vector where
Analogously to the notation in Eq. ( 4), define the expected site pattern spectrum to be where pα (τ, θ ) := P X has α-split pattern .
By Eq. (3), the distribution vector p ∈ 2 n −1 is fully specified without loss of information by the lower dimensional vector p ∈ 2 n−1 −1 (c.f., Allman and Rhodes (2007)).The data may also be summarized by the lower-dimensional (but sufficient) statistic where ι) has α-split pattern .

The Maximum Likelihood Problem
Let T = (T , ϕ) be a phylogenetic [n]-tree with unrooted tree topology τ and Hadamard edge parameters θ = (θ e ) e∈E(T ) for some n ≥ 2. Given data taking the form of Eq. ( 5), or equivalently Eq. ( 7), the log-likelihood function is the function where p σ and pα are defined as in Eqs. ( 4) and ( 6).
When n = 3, there is only one unrooted tree topology, in which case this problem reduces to that of finding the numerical parameters θ ∈ [0, 1] 3 which maximize Eq. (8).

Hadamard Conjugation
In this section we introduce an important reparametrizaton of the CFN model, as well as a central tool in our analyses: Hadamard conjugation.
For any even subset Y ⊆ [n], define the path set P(T , Y ) induced by Y on T to be the set of 1 2 |Y | edge-disjoint paths in T , each of which connects a pair of leaves labelled by elements from Y , taking P(T , ∅) = ∅.This set is unique if T is a binary tree (Semple and Steel 2003).
The edge spectrum is the vector where Then H is a 2 n−1 × 2 n−1 matrix, and by our choice of ordering for [n − 1], we have H = (h α,β ) α,β⊆ [n−1] where In particular, H is a symmetric Hadamard matrix with 2 , multiplication by H can be regarded as a discrete Fourier transformation, a commonly-used tool in the study of the CFN model and other group-based substitution models [see Semple and Steel (2003, chp. 8) as well as Coons and Sullivant (2021), Sullivant (2018), Hendy et al. (1994)].Closely related to the discrete Fourier transform is the next theorem, which allows us to translate between the edge spectrum and the expected site pattern spectrum.
Theorem 1.1 [Hadamard conjugation (Hendy and Penny 1993;Evans and Speed 1993)] Let γ ∈ R 2 n−1 be the edge spectrum of a phylogenetic [n]-tree T , let H := H n−1 , and let p the expected site pattern spectrum as defined in Eq. (6).Then where the exponential function exp(•) is applied component-wise.
Hadamard conjugation, as formulated in this theorem, has proved to be an essential tool in a number of previous results and has analogues for substitution models other than the CFN [see, e.g., Chor et al. (2003Chor et al. ( , 2000))].For a proof and a detailed discussion of this theorem, we refer the reader to Semple and Steel (2003).In particular, we will utilize the following proposition, itself a consequence of Theorem 1.1.Proposition 1.2 [Corollary 8.6.6 in Semple and Steel (2003)] Let θ e ∈ [0, 1] for all e ∈ E(T ).Then for all subsets α of (10) Note that Proposition 1.2 holds even if the root distribution is not taken to be uniform, however we do not consider that case in this paper.When n = 3, Eq.(10) reduces to and by a change of notation, this can be rewritten as for all σ ∈ {−1, +1} 3 .Therefore by Eq. (3), for all σ ∈ {+1, −1} 3 .

Interpretation of Hadamard Parameters
We regard a vector of Hadamard parameters θ = (θ e ) e∈E(T ) as biologically plausible if θ e ∈ (0, 1) for all e ∈ E(T ).Since − 1 2 log θ e is the expected number of mutations on edge e, it follows that θ e ∈ (0, 1) if and only if d e ∈ (0, ∞).In other words, biologically plausible Hadamard parameters correspond to trees with branch lengths having positive and finite expected number of mutations per site.In this work, we allow for θ e ∈ [0, 1] in order to better study the ways that maximum likelihood can fail to return a tree with biologically plausible parameters; for example, trees with extremely short or long branches are of special interest, since it is in this setting that long-branch attraction is hypothesized to occur.
Observe that θ e ∈ [0, 1] measures the correlation between the state of the Markov process at the endpoints of the edge e. Suppose e = (u, v) ∈ E(T ) and let X u and X v denote the state of the Markov process at nodes u and v respectively.Equation (2) implies that if θ e = 1 then X u = X v with probability 1.On the other hand, if θ e = 0 then X u and X v are independent; to see why this is the case, observe that using Eq.
(2), it holds for all i, j ∈ {−1, 1} that The observation has an important consequence which is summarized in the next lemma.In particular, by conditioning on the state of the Markov process at the endpoints of e and using the Markov property, a straightforward calculation gives the following result:

Lemma 1.3 (Independence caused by "infinitely long" branches) Let T = (T , ϕ) be a phylogenetic [n]-tree. Let e ∈ E(T ) and let A e |B e denote the split induced by e on the leaf set [n].
If θ e = 0 then the random vectors (X i : i ∈ A e ) and (X i : i ∈ B e ) are independent.
A proof of Lemma 1.3 can be found in Appendix A.

Fourier Coordinates and Semi-algebraic Constraints
One key takeaway of 1.2 is that the probability of any site pattern can be computed as a polynomial function of the Hadamard parameters.
Hence the CFN model for a fixed tree topology τ may be regarded as the image of the map In our setting, we take τ = [0, 1] |E(T )| and regard the statistical model as the image An important monomial parameterization of the CFN model is obtained by means of the discrete Fourier representation (see Coons and Sullivant 2021;Sullivant 2018;Sturmfels and Sullivant 2004), which here is given by the matrix H .If q = (q 111 , q 101 , q 011 , q 110 ) is a vector of Fourier coordinates, then q = H 2 p.In the case with n = 3, we have γ For group-based models (see, e.g., Sullivant 2018) like the CFN model, τ is a polynomial map, and hence all points in M τ satisfy certain polynomial equalities, called phylogenetic invariants, and polynomial inequalities, called semi-algebraic constraints, both of which are usually formulated in terms of Fourier coordinates.
The semi-algebraic constraints of the CFN model were studied more generally by Matsen ( 2008) and Kosta and Kubjas (2019).When the assumption is made that the tree parameter has biologically plausible parameters, the semi-algebraic constraints for 3-leaf trees consist of the following inequalities: q > 0 (16) and q 110 q 101 < q 011 , q 110 q 011 < q 101 , and q 101 q 011 < q 110 .( The inequality Eq. ( 16) corresponds simply to the assumption that the evolutionary distance between each pair of leaves is finite.
The remaining semi-algebraic constraints in Eq. ( 17) have straightforward interpretation in terms of the additive evolutionary distances on the tree induced by the branch lengths d 1 , d 2 , and d 3 .To see this, observe that by Eq. ( 15), the inequalities in Eq. ( 17) can be written as θ 1 θ 2 θ 1 θ 3 < θ 2 θ 3 , θ 1 θ 2 θ 2 θ 3 < θ 1 θ 3 , and θ 1 θ 3 θ 2 θ 3 < θ 1 θ 2 , and by Eq. ( 1), these are equivalent to In other words, the semi-algebraic constraints in Eq. ( 17) are nothing but the triangle inequality in disguise. 1aken together, Eqs. ( 16) and ( 17) are equivalent to the tree having branch lengths in d 1 , d 2 , d 3 which are both positive and finite.Note this assumption is not made in this paper, as we allow for branch lengths to also be either zero or infinite, a relaxation which corresponds to using non-strict inequalities in Eqs. ( 16) and ( 17).Nonetheless, the strict inequalities will play an important role in our main result.

Remark 1.4
The semi-algebraic constraints in (17) show up in other settings as well.For example, in Matsen ( 2008) they appear as embeddability conditions for the Kimura 3-parameter model; in that setting, the inequalities are equivalent to the nonnegativity of the off-diagonal entries of the mutation rate matrix, and therefore-just as in our setting-implicitly specify that the branch lengths be nonnegative.For a generalization of these inequalities as embeddability conditions see the main result of Ardiyansyah et al. (2021).
Because the Fourier coordinates factorize into Hadamard parameters, as shown in Eq. ( 15) (and more generally: see, e.g., Semple and Steel 2003), the nontrivial Fourier coordinates thus have a simple biological interpretation.For each distinct pair i, j ∈ In other words, the nontrivial Fourier coordinates are covariances of nucleotides observed at the leaves of the tree.
Both the monomial parameterization in Eq. ( 15) and the semialgebraic constraints in Eqs.(16 and (17) will play an important role in the proof and interpretation of our main result for 3-leaf trees.In particular, our approach is to estimate the Fourier coordinates θ 1 θ 2 , θ 1 , θ 3 , θ 2 θ 3 directly from the data, and it turns out that whether or not these estimates satisfy inequalities corresponding to Eqs. ( 16) and ( 17) completely determines the qualitative properties of the maximum likelihood estimate.
2 Main Result: An Analytic Solution to the 3-Leaf MLE Problem

The 3-Leaf Maximum Likelihood Problem
Let T be an unrooted phylogenetic [3]-tree, with unknown numerical edge parameters as shown in Fig. 1.Let s be a site frequency vector obtained from N independent samples X (1) , . . ., X (N ) generated according to the CFN process on T .The 3-leaf maximum likelihood problem is to find all numerical parameters θ ∈ [0, 1] 3 which maximize Eq. ( 8) given the data s.
Note that since there is only one possible unrooted topology for a 3-leaf tree, the topology parameter τ does not play a role in this problem.

Key Definitions and Notation
The key statistics used are the following: as well as In words, M + i j is the number of samples for which leaves i and j share the same nucleotide state and M − i j is the number for which the nucleotides observed at leaves i and j differ.It follows by definition that The statistic B i j measures the observed correlation of the observations at leaves i and j of the tree.By the law of large numbers, almost surely as N → ∞.Therefore by Eq. ( 18), B i j is a consistent estimator of the Fourier coordinate θ i θ j , which is itself the covariance of X i and X j .Moreover, it is easy to check that Due to symmetries of the problem, it will be useful to index the statistics in Definition 2.1 using permutations.
For each π ∈ Alt(3), we write The use of the statistic B will permit us to obtain a simple criterion for when a maximum likelihood estimate corresponding to a 3-leaf tree with finite branch lengths exists.This criteria involves the following set: As we will show in the next theorem, it turns out that given some fixed data s, a maximum likelihood estimate with finite branch lengths exists precisely if and only if B ∈ D. Importantly, since B is an estimate of the nontrivial Fourier coordinates (q 110 , q 101 , q 011 ) = (θ 1 θ 2 , θ 1 θ 3 , θ 2 θ 3 ), the inequalities which define D correspond precisely to the semi-algebraic constraints in Eqs. ( 16) and (17).

Assumptions About the Data
We make two simplifying assumptions about the data s: A.2 B 12 , B 13 and B 23 are nonzero and distinct.
In words, assumption A.1 states that each site pattern aaa, aab, aba, abb is observed at least once in the data (where a and b represent different nucleotides).One consequence of this is that M + i j , M − i j ∈ (0, N ) whenever i = j.Since the number of site patterns is 2 n−1 , this assumption would be unrealistic for a tree with many more leaves (e.g., n > 30) given the size of genomic datasets (Chor et al. 2000), but for our purposes (i.e., with n = 3), this assumption is reasonable.Assumption A.2 is an assumption about the genericity of the data, in the sense that it is equivalent to assuming that These two assumptions considerably simplify the problem with very little loss of generality, as both assumptions are likely to be satisfied when N is large.

Statement of Main Result
Our main result is the following theorem, which solves the maximum likelihood problem for 3-leaf trees; further discussion of this result is given in Sect.3. Theorem 2.4 (Global MLE for the 3-leaf tree) Assume that A.1 and A.2 hold.Then has a maximizer on the set [0, 1] 3 .Denote the set of all such maximizers as with the third one taking any value On the other hand, if B / ∈ D, then the following trichotomy holds: and Enumerating out the possible cases of Theorem 2.4 immediately yields the following corollary.
Corollary 2.5 Under the assumptions of Theorem 2.4, the MLE can be determined from the values of B 12 , B 13 , B 23 using Table 1.
Before proving this theorem in Sect.4, we first discuss its significance and some implications.

Discussion of Novel Contribution
In addition to providing necessary and sufficient conditions for the MLE to exist as a tree with finite branch lengths, Theorem 2.4 also characterizes the ways that this can fail to occur, and highlights a subtle connection between the semi-algebraic constraints given in Eqs. ( 16) and ( 17) and properties of the maximum likelihood estimate.
In an important paper on long-branch attraction (Parks and Goldman 2014), a compelling connection was drawn between maximum likelihood and distance estimates on a 3-leaf tree under the Jukes-Cantor model of site substitution.Through a combination of analytic boundary case analysis and simulations, the authors argued that the failure of distance-based branch-length estimates to satisfy the triangle inequality and nonnegativity constraints was a good predictor of maximum likelihood failing to return a tree with biologically plausible branch lengths.
Due to the use of different substitution models (i.e., the CFN model considered here versus the 4-state Jukes-Cantor model considered in Parks and Goldman (2014)), some translation is necessary to recognize the connection between our results and those of Parks and Goldman (2014).
The distance estimates used in Parks and Goldman (2014) are related to the standard Jukes-Cantor correction (Jukes and Cantor 1969;Yang 2006), and are given by the formula which returns an estimate of evolutionary distance D i j between taxa i and j, measured in expected number of mutations per site.The variable M − i j is the number of samples such that the nucleotide states observed at taxa i and j differ (i.e., the same as in this paper).The convention used in Parks and Goldman (2014) . Formulated in terms of these distances, the predictions made in Parks and Goldman (2014) are given in Table 2.
By comparison, for the CFN model considered in this paper, the analogous distance estimate (see, e.g., Yang 2000) is An inspection of Tables 1 and 2 reveals that, modulo the aforementioned change to the distance estimates to account for the different substitution models, the results of Theorem 2.4 coincide precisely with the predictions made in Parks and Goldman (2014) for all data satisfying Assumptions A.1 and A.2. Thus, Theorem 2.4 puts the predictions in Parks and Goldman (2014) on a sound theoretical basis, as it proves that certain properties of the maximum likelihood estimate are fully determined by whether or not the data satisfies the semialgebraic constraints of the model (i.e., the inequalities of Eqs.(16 and (17), which are mirrored by the inequality conditions in Tables 1 and 2).
In addition, Theorem 2.4 provides a characterization and better understanding of the way that the maximum likelihood estimator can fail to return a tree with biologically plausible branch lengths, but instead returns an estimate with edge parameters θ / ∈ (0, 1) 3 .Consider the data point 17,5,27,5,16,5,19,6). Kosta and Kubjas (2019), it was shown using algebraic methods that maximum likelihood fails to return an estimate with biologically plausible edge parameters for this data point.Instead, it was shown that for this data, the likelihood is maximized as the branch length of leaf 2 goes to infinity.
This agrees with the result in Kosta and Kubjas (2019), as θ 2 = 0 if and only if d 2 = +∞.Indeed, Theorem 2.4 characterizes all the ways that the likelihood estimate might be maximized when one or more branch lengths tend to infinity, up to our simplifying Assumptions A.1 and A.2.
Further, Theorem 2.4 provides a succinct explanation of why, for this data, it would be unreasonable to expect maximum likelihood to return a tree with biologically plausible edge parameters.For any tree with edge parameters θ ∈ (0, 1) 3 , the Fourier coordinates must satisfy the semi-algebraic constraint in Eq. ( 17), but for this data point, since B 12 , B 23 < 0, the estimates of the Fourier coordinates fail to satisfy a corresponding positivity inequality.
The previous example elucidates one of the ways that long branches on a species tree can result in the MLE returning a boundary case: when data comes from a species tree with one or more very long branches relative to the size of N , it is more likely that one or more components of B will be negative, so that B / ∈ D and hence the MLE must be on the boundary.
Nonetheless, this is not the full story.In this case B i j > 0 for all i, j ∈ [3], so the analogous inequality to Eq. ( 16) is satisfied: based on the data, all of the pairs of nucleotides are positively correlated, so no infinitely-long branches are to be expected; the distance estimates D 12 , D 13 , D 23 are all finite and positive.
However, by Theorem 2.4, this is not sufficient to guarantee that maximum likelihood will return a tree with biologically plausible parameters, since there is another semi-algebraic constraint (i.e., Eq. ( 17)) which must be satisfied for such trees.Indeed, since ∈ D, and hence the data falls into case (i) of Theorem 2.4.More specifically, since this data corresponds to row 2 of Table 1, it follows that the likelihood is maximized when which does not correspond to a binary tree because the branch length of leaf 1 is zero.
One final and more general takeaway from Theorem 2.4 is that it highlights how the geometry of the statistical model, here determined by the semi-algebraic constraints Eqs. ( 16) and ( 17), influences the possible behavior of maximum likelihood estimation.Maximum likelihood returns a tree with biologically plausible branch lengths if and only if the data satisfies analogues of the polynomial inequalities Eqs. ( 16) and ( 17).In addition, when the data does not lie in the interior of the model, the question of which inequalities are not satisfied determines the different ways that maximum likelihood fails (i.e., which branches have lengths zero or infinity).This suggests that for phylogenetic trees with more than three leaves, a better understanding of the role that semialgebraic constraints play in maximum likelihood estimation may turn out to be useful in explaining some of the more complex behaviors of maximum likelihood estimation of phylogenetic trees.Of special interest to note are 4-leaf trees, due to the possibility of long-branch attraction.Indeed, recent work has shown that in the 4-leaf case, the study of semi-algebraic constraints for phylogentic models involves surprising subtleties that may be important for inference (Casanellas et al. 2021).
To be sure, in the cases of trees with n ≥ 4 leaves, to say nothing of more realistic and complicated substitution models, the increased algebraic complexity of the likelihood equations presents formidable obstacles.First, as shown in Hobolth and Wiuf (2024), when n ≥ 4, the likelihood equations do not have solutions which can be expressed in terms of pairwise sequence comparisons (as was done here and in Hobolth and Wiuf (2024)).Moreover, in many cases, closed form solutions are unlikely to exist at all; an example of this can be found in the analysis of the 4-leaf MC-comb in Chor et al. (2003), where it was shown that the critical points of the likelihood function correspond to zeros of a degree 9 polynomial which cannot be solved by radicals.
Despite these limitations, solutions can nonetheless be obtained using tools from numerical algebraic geometry which return theoretically correct solutions with probability one (see, e.g., Kosta and Kubjas 2019, Gross et al. 2016, Chor et al. 2005, Chor et al. 2003).Moreover, the number of solutions, called the maximum likelihood (ML) degree, can be computed using Gröbner basis techniques (Hoşten et al. 2005) and methods from singularity theory (Rodriguez and Wang 2017;Maxim et al. 2024).

Proof of the Main Result
Our proof of Theorem 2.4 considers the problem of maximizing the log-likelihood separately two cases: (1) the interior case, i.e., the problem of maximizing over all θ ∈ (0, 1) 3 , and (2) the boundary cases, corresponding to when θ ∈ ∂(0, 1) 3 .
For the boundary cases, we follow the approach taken by the authors of Parks and Goldman (2014), who analyzed the maximum likelihood problem in the context of the Jukes-Cantor model and obtained analytic solutions for boundary cases there by decomposing the boundary ∂(0, 1) 3 into 26 components, consisting of 8 vertices, 12 edges, and 6 faces, and then maximizing on each of those individually.Our approach is similar, though we group certain edges and faces together in those cases in which the analysis is similar.
The approach taken to proving Theorem 2.4 is as follows.First, the problems of maximizing in the interior case and boundary cases are considered separately in Eq. ( 44) , where π , π = Alt(3)\ {π } Eq. ( 39) Sects.4.1 and 4.2 respectively.Section 4.3 presents several lemmas which compute and compare log-likelihoods of local maxima in various cases, with results summarized in Table 3.Finally, in Sect.4.4, we utilize these results to prove Theorem 2.4.

Maximizing the Log-Likelihood on (0, 1) 3
In this subsection we consider the problem of maximizing Eq. ( 8) on the set (0, 1) 3 .This set corresponds to those trees whose branches are of finite and nonzero length, when measured in expected number of mutations per site.Since (0, 1) 3 is open, the existence of a local maximum is not guaranteed.The main result of this subsection gives, for generic positive data, necessary and sufficient conditions for to have a local maximum in (0, 1) 3 , and a formula if it exists; it also shows has at most one local maximum on (0, 1) 3 .We begin with an important definition and a technical lemma.Let φ : R 3 \ {x : x 1 x 2 x 3 = 0} → C 3 be defined by Further, let φ be the restriction of φ to D: The next lemma summarizes some useful properties of φ.
Next, to show that φ is injective, let x, x ∈ D such that φ(x) = φ( x).Let φ 1 and φ 2 denote the first and second components of φ respectively.Then Similar arguments show that x 2 = x2 and x 3 = x3 , and hence x = x.Therefore φ is injective.
Next we show that φ is surjective.Let y ∈ (0, 1) 3 be arbitrary.Then it is easy to see that the point y = (y 1 , y 2 , y 3 ) := (y 1 y 2 , y 1 y 3 , y 2 y 3 ) satisfies y i y j < y k for all choices of distinct i, j, k ∈ [3], and hence y ∈ D.Moreover, φ(y ) = y by definition of φ.Therefore φ is surjective and has an inverse given by the formula φ −1 (y) = y , which is precisely the formula in Eq. ( 23).
We are now ready to state the main lemma of this subsection, which solves the problem of maximizing on the set (0, 1) 3 , or in other words, solves the maximum likelihood problem for biologically plausible parameters.
If B ∈ D then θ * is the unique local maximum of in (0, 1) 3 and has log-likelihood sα log sα On the other hand, if B / ∈ D then has no local maximum on (0, 1) 3 .
Proof of Lemma 4.2 For ease of notation, we will write s1 := s∅ , s2 := s{1} , s3 := s{2} , and s4 := s{1,2} . (25) It follows by Eqs. ( 8) and ( 11) that An initial attempt to compute the critical points of (θ ) directly by taking the gradient of yields a polynomial system which is difficult to solve analytically, so instead we modify this approach by first considering a different function whose extrema are closely related to those of .
To define this function, first let D F ⊆ R 3 be the intersection of half-spaces defined by the following inequalities and let F : D F → R be defined by The significance of F is owed to the observation that = F • φ −1 , which is proved in the next claim.
To show that u ∈ D F , it suffices by Eq. ( 27) to show that The first of these four equations holds trivially.As for the other three, we will only prove 1 + w 1 w 2 − w 1 w 3 − w 2 w 3 > 0, as the other two inequalities can be proved in the same manner.Let h(w 1 , w 2 ) := 1 + w 1 w 2 − w 1 − w 2 .Since w 1 , w 2 > 0 and Therefore it will suffice to show that h(w 1 , w 2 ) ≥ 0 for all w 1 , w 2 ∈ [0, 1].Indeed, using calculus it is easy to see that h is minimized on [0, 1] × [0, 1] when at least one of the arguments w 1 , w 2 equals one, and that the minimum is zero.Therefore 1 + w 1 w 2 − w 1 w 3 − w 2 w 3 > 0. We conclude that D ⊆ D F , as required to prove the claim.
The next two claims serve to characterize the extrema of F. Claim 2. The point B = (B 12 , B 13 , B 23 ) is the unique critical point of F.

Proof of Claim 2
It is first necessary to verify that B is in the domain of F. To do this, it will suffice to show that B satisfies the inequalities in Eq. ( 27).Using Eq. ( 20) and the observation s1 + s2 + s3 + s4 = N , it is easy to check that (30) Therefore by A.1, it follows that B = (B 12 , B 13 , B 23 ) satisfies the inequalities in Eq. ( 27), as required.Therefore B is in the domain of F.
We now proceed with a standard critical point calculation.Letting , and taking partial derivatives of F in Eq. ( 28) with respect to the variables x, y and z, it follows that for all u = (x, y, z) where Setting ∇ F = 0 and using Eq. ( 31), we deduce that the following system of equations holds: Substituting the formulas for the A i 's from Eq. ( 32) and rearranging terms, we obtain Writing this out, this is the matrix equation Using the fact that s1 + s2 + s3 + s4 = N , one can check that the 3 × 3 matrix in the above equation has inverse By Eq. ( 20), the right-hand side is precisely the vector (B 12 , B 13 , B 23 ) , and therefore we conclude that B = (B 12 , B 13 , B 23 ) is the unique critical point of F on its domain P.This completes the proof of the claim.
Let H F denote the Hessian matrix of F; that is, It is clear that F is twice-differentiable on its domain, so H F (x) is defined for all x ∈ D F .Claim 3. H F (B) is negative definite.

Proof of Claim 3
Since H F (B) is a real symmetric matrix, it will suffice to show that its eigenvalues are all negative, as this will imply that H F (B) is negative definite.
Using the code in Appendix B, we first compute the characteristic polynomial of H F : where and s1 , s2 , s3 , s4 are defined in Eq. ( 25).We need to show that P char has only negative roots, as this will imply that all three eigenvalues of H F are negative, and hence that H F (B) is negative definite.Since H F is a real symmetric matrix, the roots of P char are all real numbers, and therefore it will be enough to show that P char has no nonnegative roots.Indeed, since all the coefficients of P char are positive, Decartes' rule of signs implies that P char has no positive roots.Moreover since (s 1 , s2 , s3 , s4 ) = (0, 0, 0, 0) by A.1, the constant term in P char is nonzero, and hence P char (0) = 0 as well.We conclude that P char has no nonnegative roots, as required to prove the claim.
Using the results of the previous three claims, the next two claims will together characterize the local maxima of on (0, 1) 3 .
Next we show that θ * is a local maximum of .Since = F • φ −1 by Eq. ( 29), and since F and φ −1 are both twice differentiable on their respective domains, therefore it follows by the chain rule for Hessian matrices (see, e.g., Magnus and Neudecker 2007, pp. 125-126), that is also twice differentiable and has Hessian matrix where Since H F (B) is a negative definite matrix by Claim 3, and since det J φ −1 (θ * ) = 0 by Eq. ( 34), we conclude that H (θ * ) and H F (B) are similar matrices, and hence H (θ * ) is negative definite as well.Therefore by the second derivative test (see, e.g., Magnus and Neudecker 2007, Theorem 4, p. 140), the point θ * is a local maximum of .This completes the proof of the claim.
We can now use Claims 4 and 5 to prove the theorem.If B ∈ D then Claims 4 and 5 imply that θ * is the unique local maximum in (0, 1) 3 .On other other hand, if B / ∈ D, then the contraposition of Claim 5 implies has no local maximum in (0, 1) 3 .This proves the first part of Lemma 4.2; it remains only to prove Eq. ( 24).Indeed, if B ∈ D then since θ * = φ(B 12 , B 13 , B 23 ), it follows by definition of φ that for all i, j ∈ [3] such that i < j.Plugging Eq. ( 35) into Eq.( 30), Therefore by Eq. ( 26), which is precisely Eq. ( 24).This completes the proof of Lemma 4.2.

4.2
Maximizing the Log-Likelihood on @(0, 1) 3 In this subsection we consider the problem of maximizing on the boundary ∂(0, 1) 3 .As discussed at the beginning of Sect.4, this corresponds to the boundary of the unit cube, consisting of 6 faces, 12 edges, and 8 vertices.The lemmas in this section consider the problem of maximizing on various groupings of these components.
The eight vertices of the unit cube are simply the elements of the set {0, 1} 3 .The twelve edges are the sets defined for j, k ∈ {0, 1}.The 6 faces are the sets of the form where π ∈ Alt(3) and i ∈ {0, 1}.
Proof Suppose θ ∈ E ind , and let i, j ∈ [3] such that i = j.Then the path between leaves i and j contains an edge e such that θ e = 0. Therefore by Lemma 1.3, the random variables X 1 , X 2 and X 3 , are mutually independent.Therefore for all σ ∈ {−1, 1} 3 , Substituting this into Eq.( 8) implies Eq. ( 36).(2) implies that no transitions can occur on leaf 3, and hence vertex 3 is regarded to lie on the path between leaves 1 and 2 The next lemma of this section considers the problem of maximzing on the three faces F π,1 , π ∈ Alt(3).An example of the graphical model correponding to these cases is shown in Fig. 3. Lemma 4.5 (Maximizers of on F π,1 ) Let π ∈ Alt(3), and let θ ∈ F π,1 .Then the set of local maxima of on F π,1 is or equivalently Proof Let π ∈ Alt(3), and let i = π(1), j = π(2), and k = π(3).Let θ i , θ j ∈ (0, 1) be arbitrary.Since θ k = 1, it follows by Eq. ( 12) Hence Eq. ( 8) can be written as Differentiating with respect to θ i and θ j , it follows that for each u ∈ {i, j}, Solving the system ∇ (θ ) = 0, we obtain the solution satisfying and if B ik , B jk ∈ (0, 1) then Eq. ( 41) determines the unique critical point of on the set F π,1 ; on the other hand, if B ik / ∈ (0, 1) or B jk / ∈ (0, 1), then has no critical points on F π,1 .
Moreover, this critical point on F π,1 is a maximum by the second derivative test since for all θ ∈ F π,1 , the Hessian matrix Finally, if Eq. (41) holds, plugging θ i = B ik and θ j = B jk into Eq.( 40) gives which is nothing but Eq. ( 38) written in a different notation.It remains to prove Eq. ( 39).Observe that 1 39) can be obtained by making these substitutions in Eq. ( 38) and then simplifying using logarithm properties and M The final lemma of this section considers the problem of maximizing when exactly one parameter in {θ 1 , θ 2 , θ 3 } is assumed to be zero.This pertains to each of the 3 remaining faces F π,0 , π ∈ Alt(3), as well as the remaining six edges and E (1,0,•) .Such boundaries represent one of two graphical models, like those shown in Fig. 2. We group the boundary cases up into the following three sets, defined for each π ∈ Alt(3): Fig. 3 The graphical model corresponding to a 3-leaf "tree" with edge parameters θ ∈ E (1,•,0) (left) or θ ∈ F (•,•,0) (right).In both cases, the biological meaning of θ 3 = 0 is that species 3 is "infinitely far away" from species 1 and 2, when distances are measured in expected number of mutations per site.By Lemma 1.3, (X 1 , X 2 ) ⊥ ⊥ X 3 , and for this reason we depict vertex 3 as a disconnected vertex Interpreted geometrically, G π consists of the union of one face and two of its adjacent edges:

Lemma 4.6 (Maximizers of on G π )
Let π ∈ Alt(3).The local maxima of on G π are the points in the set all of which have log-likelihood or equivalently In particular, this implies that if B π / ∈ (0, 1) then has no local maxima on F.
The proof of this lemma is similar to that of Lemma 4.5, and can be found in Appendix A.
The key results of Sects.4.2 and 4.1 are summarized in Table 3.In that table, computed are the maximizer(s) of on the interior and boundary of the unit cube (namely, the sets E int , E triv , E ind , G π , and F π,1 , π ∈ Alt(3), which together partition the closed unit cube).The corresponding sets of maximizers on each of these sets (E int , E triv , E ind , G π , and F π , π ∈ Alt(3), respectively) are level sets of , although some of them may be empty depending on the data.The necessary and sufficient conditions for each of them to be nonempty, as well as the value that the log-likelihood function takes on each of them, are shown in the second and third columns of the table.

Comparisons of Likelihoods
To prove Theorem 2.4 will require some comparisons of the log-likelihoods of elements of E int , E ind , G π , and F π , (π ∈ Alt(3)).In this subsection, we prove several lemmas toward this end.
We will make use of the information inequality, which we state next (for proof, see, e.g., Theorem 2.6.3 in Cover (2006)).The next two lemmas utilize the information inequality to show that elements of E int have greater log-likelihood than elements of F π and G π , for all π ∈ Alt(3).
Let θ ∈ F π .By Eq. ( 39), Observe that Therefore we can rewrite Eq. ( 45) as Regrouping terms gives To apply Theorem 4.7, it is first necessary to verify that Since the entries of this vector are clearly nonnegative, it suffices to show that they sum to 1. Indeed, using M + 13 + M − 13 = N and M + 23 + M − 23 = N , we have Therefore by applying Theorem 4.7 to the right hand side of Eq. ( 46), (θ ) ≤ α⊆ [2] sα log sα where the last equality follow from Eq. ( 24).
A similar application of Theorem 4.7 can be used to prove the following theorem; the details of the proof can be found in Appendix A. for all θ ∈ G π and all π ∈ Alt(3).
The next lemma compares the likelihoods of elements in F π and G π when π, π ∈ Alt(3) are distinct.Proof Suppose θ ∈ F π 1 , and suppose that θ ∈ G π for some π ∈ Alt(3)\ {π 1 }.Without loss of generality, assume π = π 3 .Then by Eqs. ( 38) and ( 43) , it follows that As a function of B π 2 , the right-hand side is strictly increasing on (0, 1), which can be seen by differentiating and observing that the derivative is positive on this interval.Moreover, we note that B π 2 ∈ (0, 1), a fact which follows from the hypothesis that F π 1 = ∅ (see Table 3).Therefore, since the right-hand side of Eq. ( 48) is strictly increasing on (0, 1), and since B π 2 ∈ (0, 1), it follows that This completes the proof of the lemma.
Since is upper semicontinuous, it has at least one maximizer on [0, 1] 3 .In order to find the maximizer(s), observe that since (51) In Lemmas 4.2, 4.4, 4.6, and 4.5, we computed the log-likelihood of the points in each of the eight sets in this disjoint union (see Table 3 for a summary of these results), so the rest of the proof will simply be a comparison of the likelihoods of elements of these sets.We start by proving that if B ∈ D, then the maximum is given by Eq. ( 21).Suppose B ∈ D.
Since the depends only on the product θ i θ j , and not on θ i or θ j independently, the log-likelihood restricted to the set G π may thus be regarded as function of the single variable x := θ i θ j ∈ (0, 1).Plugging Eq. ( 57) into Eq.( 8), we obtain (58) Differentiating gives Solving the equation (x) = 0, we find that f has at most one critical point on (0, 1), which is the point provided that B i j ∈ (0, 1).If this is the case, then x is a local maximum by the second derivative test, since (1 − x) 2 < 0.

Fig. 1
Fig. 1 Three-leaf tree T with Hadamard edge parameters θ T 1 , θ T 2 , θ T Lemma 4.2 (MLE for 3 Leaf Tree-Interior Case) Assume that A.1 and A.2 hold, and let Therefore x is a critical point of F. Since the only critical point of F is B by Claim 3, it follows that x = B. Since x ∈ D by Lemma 4.1, this implies that B ∈ D. Therefore θ = φ(x) by definition of x = φ(B) since B = x ∈ D = θ * by definition of θ * .