Implicit inequality constraints in a binary tree model

In this paper we investigate the geometry of a discrete Bayesian network whose graph is a tree, all of whose variables are binary, and whose only observed variables are those labeling the leaves. We provide a full geometric description of these models, given by a set of polynomial equations together with a set of complementary implied inequalities induced by the positivity of the probabilities of the hidden variables. The phylogenetic invariants given by the equations can be useful in the construction of simple diagnostic tests. However, in this paper we point out the importance of also incorporating the associated inequalities into any statistical analysis. The full characterization of these inequality constraints derived in this paper helps us determine how and why routine statistical methods can break down for this model class.


Introduction
Bayesian networks whose graphs are trees and all of whose inner nodes represent variables that are not directly observed form an important class of models, containing phylogenetic tree models and hidden Markov models. Inference for this model class tends to be challenging and often needs to employ fragile numerical algorithms. In [27] we established a useful new coordinate system for such models when all of the variables are binary. This analysis enabled us not only to address various identifiability issues but also to derive exact formulas for the maximum likelihood estimators whenever the sample proportions are consistent with the model class.
However, the application of this new coordinate system reaches far beyond identifiability: it can also be used to analyze the global structure of these tree models. For example, [7] gave an intriguing correspondence between, on the one hand, correlation systems on tree models and, on the other, distances induced by trees, where the distance between two nodes is given as the sum of the lengths of the edges on the path joining them. Our new coordinate system enables us to explore this relationship between probabilistic tree models and tree metrics in detail. It was already implicit in this correspondence that constraints on the possible distances between any two leaves in the tree imply inequality constraints on the possible covariances between the binary variables represented by the leaves. These inequalities follow from the four-point condition ([21], Definition 7.1.5) together with some other simple non-negativity constraints (cf. Equation (27)). However, in this paper we also show that these inequality constraints are not sufficient: there are additional constraints involving higher order tree cumulants. We provide the full set of defining constraints in Theorem 4.6, given by a list of polynomial equations and inequalities which describe the set of all probability distributions in the model.
Our approach here is founded in a geometric study of tree models through the method of phylogenetic invariants, first introduced by Lake [15] and by Cavender and Felsenstein [6]. These invariant algebraic relationships are expressed as a set of polynomial equations over the observed probability tables which must hold for a given phylogenetic model to be valid. We note that these algebraic techniques have also been embraced by computational algebraic geometers [1, 11, 26], enhancing the statistical and computational analyses of such models [5]. Similar problems can be solved for other model classes [8]. The main technical deficiency of using phylogenetic invariants in this way is that they do not give a full geometric description of the statistical model. The additional inequalities obtained as the main result of this paper complete this description. Where and how these inequality constraints can helpfully supplement an analysis based on phylogenetic invariants is illustrated by the simple example of a tripod tree, with a single inner node joined to three leaves. The inner node represents a binary hidden variable H and the leaves represent binary observable variables X_1, X_2, X_3. The tree represents the conditional independence statement X_1 ⊥⊥ X_2 ⊥⊥ X_3 | H. The model has full dimension over the observed margin (X_1, X_2, X_3) and consequently there are no equations defining it. However, it is not a saturated model, since not all marginal probability distributions over the observed vector (X_1, X_2, X_3) lie in the model. For example, Lazarsfeld [16, Section 3.1] showed that the second moments of the observed distribution must satisfy Cov(X_1, X_2) Cov(X_1, X_3) Cov(X_2, X_3) ≥ 0. This constraint, which clearly impacts the inferences we might want to make, is not acknowledged through the study of phylogenetic invariants.
Therefore inference based solely on these invariants is incomplete; in particular, naive estimates derived through these methods can be infeasible for the model class, in a sense illustrated later in this paper.
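Lazarsfeld's constraint is easy to probe numerically. The following sketch (Python; the perturbed-uniform table and all variable names are illustrative choices, not taken from the paper) constructs a joint distribution of three binary variables whose covariance product is negative, so that no tripod tree model can generate it:

```python
import itertools

# Illustrative 2x2x2 joint table (not from the paper): p(x) proportional to
# 1 + a(-1)^(x1+x2) + b(-1)^(x1+x3) + c(-1)^(x2+x3).
a, b, c = 0.1, 0.1, -0.1
p = {}
for x1, x2, x3 in itertools.product((0, 1), repeat=3):
    p[x1, x2, x3] = (1 + a * (-1) ** (x1 + x2)
                       + b * (-1) ** (x1 + x3)
                       + c * (-1) ** (x2 + x3)) / 8

assert abs(sum(p.values()) - 1) < 1e-12 and min(p.values()) > 0

def cov(i, j):
    """Cov(X_i, X_j) under the table p (variables indexed 0, 1, 2)."""
    e = [sum(q for x, q in p.items() if x[k] == 1) for k in range(3)]
    e_ij = sum(q for x, q in p.items() if x[i] == 1 and x[j] == 1)
    return e_ij - e[i] * e[j]

product = cov(0, 1) * cov(0, 2) * cov(1, 2)
print(product)      # negative: Lazarsfeld's constraint fails,
assert product < 0  # so the table lies outside the tripod model
```

Here Cov(X_1, X_2) = Cov(X_1, X_3) = 0.025 while Cov(X_2, X_3) = −0.025, so every polynomial invariant vanishes trivially (the model is full-dimensional) yet the distribution is infeasible.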
This example motivated a closer investigation of the semi-algebraic features of the geometry of binary tree models with hidden inner nodes. The main problem with the geometric analysis of these models is that in general it is hard to obtain the inequality constraints defining a model, even for very simple examples (see [9, Section 4.3], [12, Section 7]). Despite this, some results can be found in the literature. In the case of a binary naive Bayes model a somewhat complicated solution was given by Auvray et al. [2]. In the binary case there are also some partial results for general tree structures, given by Pearl and Tarsi [19] and by Steel and Faller [25]. The most important applications in biology involve variables that can take four values. Recently Matsen [17] gave a set of inequalities in this case for group-based phylogenetic models (where additional symmetries are assumed) using the Fourier transformation of the raw probabilities. Here we provide a simpler and more statistically transparent way to express the constrained space.
In Section 2 of this paper we briefly introduce conditional independence models on trees. We then describe a convenient change of coordinates for the models under consideration, following [27]. In the new coordinate system the parametrization of the model has an elegant product form, and in Section 3 we show how to use this to obtain the full semi-algebraic description of a simple naive Bayes model. In Section 4 we state the main result of the paper, Theorem 4.6, and give some necessary constraints on the probability distributions in the model class using a correspondence with tree metrics. In Section 5 we discuss these results for a simple quartet tree model. Finally, in Section 6 we use the parametrization developed earlier to find an alternative form of the equations given by Allman and Rhodes [1]. Our alternative specification is simpler from the algebraic point of view and has a more transparent statistical interpretation. We prove our main theorem in Appendix B. The paper concludes with a short discussion.

Tree models and tree cumulants
In this paper we always assume that random variables are binary taking values either 0 or 1. We consider models with hidden variables, i.e. variables whose values are never directly observed. The vector Y has as its components all variables in the graphical model, both those that are observed and those that are hidden. The subvector of Y of observed variables is denoted by X and the subvector of hidden variables by H.
A (directed) tree T = (V, E), where V is the set of vertices and E ⊆ V × V is the set of edges of T, is a connected (directed) graph with no cycles. A rooted tree is a directed tree that has one distinguished vertex called the root, denoted by the letter r, with all edges directed away from r. A rooted tree is usually denoted by T_r. By pa(v) we denote the node preceding v in T_r; in particular pa(r) = ∅. A vertex of T of degree one is called a leaf. A vertex of T that is not a leaf is called an inner node.
A Markov process on a rooted tree T_r is a sequence {Y_v : v ∈ V} of random variables such that the joint distribution factorizes as

(1) p_α(θ) = θ^{(r)}_{α_r} ∏_{v ∈ V\{r}} θ^{(v)}_{α_v | α_{pa(v)}} for each α ∈ {0, 1}^{|V|},

where θ^{(r)}_{α_r} = P(Y_r = α_r) and θ^{(v)}_{i|j} = P(Y_v = i | Y_{pa(v)} = j). Since θ^{(v)}_{0|i} + θ^{(v)}_{1|i} = 1 for all v ∈ V\{r} and i = 0, 1, the set of parameters consists of exactly 2|E| + 1 free parameters: two parameters θ^{(v)}_{1|0}, θ^{(v)}_{1|1} for each edge (u, v) ∈ E and one parameter θ^{(r)}_1 for the root. Let n be the number of leaves of T and let Δ_{2^n−1} = {p ∈ R^{2^n} : Σ_β p_β = 1, p_β ≥ 0}, with indices β ranging over {0, 1}^n, be the probability simplex of all possible distributions of X = (X_1, . . . , X_n) represented by the leaves of T. Equation (1) induces a polynomial map f_T : Θ_T → Δ_{2^n−1} obtained by marginalization over all the inner nodes of T,

(2) p_β(θ) = Σ_H p_{(β,H)}(θ),

where the sum is over all possible states H of the vector of hidden variables. We denote the image f_T(Θ_T) by M_T, calling it the general Markov model (cf. [21, Section 8.3]). A semi-algebraic set in R^d is any set given by a finite number of polynomial equations and inequalities. Since Θ_T is a semi-algebraic set and f_T is a polynomial map, by the Tarski–Seidenberg theorem [3, Section 2.5.2] M_T is a semi-algebraic set as well.
In [27] we described a convenient change of coordinates for directed tree models as a function of the usual parametrization (2) expressed in terms of the probabilities. The idea was to define a regular one-to-one polynomial map f_pκ from Δ_{2^n−1} to the space of new coordinates called tree cumulants, K_T. We defined a partially ordered set (poset) of all the partitions of the set of leaves induced by removing inner edges of the given tree T. The tree cumulants are then given as a one-to-one function of the probabilities induced by a Möbius function on this poset. The details of this change of coordinates are given in Appendix A and are illustrated below.
The tree cumulants are given by 2^n − 1 coordinates: μ̄_i for all i ∈ [n], and a set of real-valued coordinates {κ_I : I ⊆ [n], |I| ≥ 2}. The first n coordinates are linear functions of the means of the n observed variables, since μ̄_i = 1 − 2EX_i. Each formula for κ_I is expressed as a function of the higher order central moments of the observed variables. These formulas are given explicitly in (31) of Appendix A. By M^κ_T we will denote the image of the original model M_T in the space of tree cumulants. We note that since f_pκ is a one-to-one polynomial map, by the Tarski–Seidenberg theorem (see [3, Section 2.5.2]) M^κ_T is a semi-algebraic set. In this paper we provide the full semi-algebraic description of M^κ_T, i.e. the complete set of polynomial equations and inequalities in the tree cumulants which describes M^κ_T as a subset of K_T. Example 2.1. Consider the quartet tree model, i.e. the general Markov model given by the quartet tree (cf. Section 6 in [27]). The tree cumulants are given by 15 coordinates: μ̄_i = 1 − 2P(X_i = 1) for i = 1, 2, 3, 4 and κ_I for I ⊆ [4] such that |I| ≥ 2.
In particular κ_ij = µ_ij = Cov(X_i, X_j) for all 1 ≤ i < j ≤ 4, and κ_ijk = µ_ijk for all 1 ≤ i < j < k ≤ 4, which we note is a third order central moment. However, tree cumulants of higher order cannot be equated to the corresponding central moments but only expressed as functions of them. These functions are obtained by performing an appropriate Möbius inversion (see [27]); thus for example from Appendix A we have that κ_1234 = µ_1234 − µ_12 µ_34. Note that since the observed higher order central moments can be expressed as functions of probabilities, tree cumulants, being functions of such moments, can also be expressed as functions of these probabilities.
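Since tree cumulants of order at most three coincide with central moments, and the higher order ones are polynomial functions of central moments, a routine computing µ_I from a joint probability table is the basic building block for working in these coordinates. A minimal sketch (Python; the array layout and function name are illustrative conventions, not the paper's notation):

```python
import itertools
import numpy as np

def central_moment(p, I):
    """mu_I = E prod_{i in I} (X_i - E X_i) for a joint table p over {0,1}^n.

    p is an n-dimensional array with p[x_1, ..., x_n] = P(X_1 = x_1, ...).
    """
    n = p.ndim
    states = list(itertools.product((0, 1), repeat=n))
    means = [sum(p[x] * x[i] for x in states) for i in range(n)]
    total = 0.0
    for x in states:
        term = p[x]
        for i in I:
            term *= x[i] - means[i]
        total += term
    return total

# Sanity check: for an independent product distribution all mixed central
# moments vanish, e.g. mu_12 = Cov(X_1, X_2) = 0.
q = np.array([0.3, 0.7])                       # X_i ~ Bernoulli(0.7)
p = np.einsum('i,j,k->ijk', q, q, q)           # X_1, X_2, X_3 independent
print(abs(central_moment(p, (0, 1))) < 1e-12)  # True
```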
Now let Ω_T denote the set of parameters with coordinates given by μ̄_v for v ∈ V and η_{u,v} for (u, v) ∈ E. Define a reparametrization map f_θω : Θ_T → Ω_T by

(3) μ̄_v = 1 − 2λ_v, η_{u,v} = θ^{(v)}_{1|1} − θ^{(v)}_{1|0},

where λ_v = EY_v is a polynomial in the original parameters θ whose degree depends on the distance of v from the root r. Indeed, let r, v_1, . . . , v_k, v be a directed path in T; then

(4) λ_v = λ_{v_k} θ^{(v)}_{1|1} + (1 − λ_{v_k}) θ^{(v)}_{1|0}.

It can be easily checked that η_{u,v} = Cov(Y_u, Y_v)/Var(Y_u), which gives a clear statistical interpretation for the new parameters. The parameter space Ω_T is given by the constraints in (5), which are exactly the conditions ensuring that all the conditional probabilities in (1) lie in [0, 1]. In [27] we proved that there is a one-to-one polynomial map between the two spaces, giving the following diagram.
One motivation behind the change of coordinates and parameters is that the induced parametrization ψ_T : Ω_T → K_T has a particularly elegant product form (7) in terms of the new parameters.
Example 2.3. Consider the quartet tree model of Example 2.1 with a fixed numerical assignment of the parameters θ (in this example one of the edge parameters is θ^{(v)}_{1|1} = 0.3). Using (2) we obtain the corresponding probabilities over the observed nodes. The change of coordinates presented in Appendix A then gives the corresponding non-central moments λ_I = E ∏_{i∈I} X_i and the tree cumulants κ_I supplemented with the means λ_1, λ_2, λ_3, λ_4; for example κ_1234 = (1/4)(1 − μ̄_r^2) μ̄_r μ̄_a η_{r,1} η_{r,2} η_{r,a} η_{a,3} η_{a,4} = 0.0006. Proposition 2.2 has been formulated for trivalent trees; however, it can easily be extended to the general case. For a given tree, a contraction of an edge (u, v) results in another tree obtained from the original one by identifying the nodes u and v and removing the edge (u, v). Let T be a tree and let T′ be any trivalent tree such that T is obtained from T′ by edge contractions. Then M^κ_T ⊆ M^κ_{T′} ⊂ K_{T′} and, by Corollary 4 in [27], the parameterization in (7) remains valid for T when expressed in the coordinates of K_{T′}. Example 2.4. Let T be the star tree with four leaves, i.e. the tree with one inner node r and four leaves connected to r by edges (r, i) for i = 1, 2, 3, 4. This tree can be obtained from the quartet tree of Example 2.1, denote it by T′, by contracting (r, a). The model of the star tree can therefore be realized as a subset of K_{T′}, the space of tree cumulants for the quartet tree. The coordinates of K_{T′} are given in Example 2.1 and the parametrization of M^κ_T is given, for example, by κ_1234 = (1/4)(1 − μ̄_r^2) μ̄_r^2 η_{r,1} η_{r,2} η_{r,3} η_{r,4}.
Note however that this star tree may be obtained from many different trivalent trees by edge contraction. It follows that there exist many ways to embed the model and retain the parametrization.
Remark 2.5. Let T_r be a rooted tree and T its undirected version. Then M^κ_T depends only on T. Indeed, without loss of generality fix two different rootings r and r′, let T_r be the tree rooted in r and denote by T_{r′} its copy rooted in r′. Then M^κ_{T_r} = M^κ_{T_{r′}}, and the parameters (η_{u,v}), (μ̄_v) and (η′_{u,v}), (μ̄′_v) are related by reversing the edges on the path between r and r′, where for a reversed edge η_{v,u} = η_{u,v} (1 − μ̄_u^2)/(1 − μ̄_v^2). Note however that if for example η_{u,v} = 0 and μ̄_v^2 = 1 then η_{v,u} is not well defined, and in this case we set η_{v,u} = 0. From the form of the inequalities in (5), the constraints on μ̄_v and η_{u,v} are satisfied if and only if the constraints on μ̄′_v and η′_{u,v} are satisfied.

The semi-algebraic description of the tripod tree model
In this section we obtain the full semi-algebraic description of the tripod tree model. This result is not new (see [2, 22]). However, it is convenient to give a new proof of it, both to unify notation and to introduce the strategy used to attack the general case later. We begin with a definition.
If Σ_{i,j,k} a_{ijk} = 1 then, treating all entries formally as joint cell probabilities (without positivity constraints), we can simplify this formula using the change of coordinates to central moments. The reparameterizations in Appendix A are well defined for this extended space of probabilities and we have that

(8) Det A = µ_123^2 + 4 µ_12 µ_13 µ_23,

which can be verified by direct computation. We note in passing that a similar idea of treating moments formally lies behind the umbral calculus [20].
From the construction of tree cumulants (c.f. Appendix A) it follows that κ I = µ I for all I ⊆ [n] such that |I| ≤ 3. Henceforth, for clarity, these lower order tree cumulants will be written as their more familiar corresponding central moments.
Lemma 3.2 (The semi-algebraic description of the tripod model). Let M_T be the general Markov model on a tripod tree T rooted in any node of T. Let P be a 2×2×2 probability table for three binary random variables (X_1, X_2, X_3) with central moments µ_12, µ_13, µ_23, µ_123 (equal to the corresponding tree cumulants) and μ̄_i = 1 − 2EX_i for i = 1, 2, 3. Then M^κ_T is given by the parametrization in (9), where μ̄_h and η_{h,i} for i = 1, 2, 3 satisfy the inequalities in (5). Moreover, P ∈ M_T if and only if K = f_pκ(P) ∈ K_T satisfies the inequalities (10), (11) and (12). Proof. By Remark 2.5, M^κ_T does not depend on the rooting and hence we can assume that T is rooted in h. The parameterization in (9) follows from Proposition 2.2 by considering T rooted at h and the corresponding independence statements X_1 ⊥⊥ X_2 ⊥⊥ X_3 | H. Denote by M the subset of K_T given by the inequalities in (10), (11) and (12). We need to show that M = M^κ_T for any rooting of T. First we prove that M^κ_T ⊆ M.
Since μ̄_h ∈ [−1, 1] this implies the inequality in (10). Moreover, we have

(14) Det P = µ_123^2 + 4 µ_12 µ_13 µ_23 = (1/16)(1 − μ̄_h^2)^2 η_{h,1}^2 η_{h,2}^2 η_{h,3}^2.

Multiplying both sides by μ̄_h^2 and using the second equation in (9) gives

(15) μ̄_h^2 Det P = µ_123^2,  (1 − μ̄_h^2) Det P = 4 µ_12 µ_13 µ_23.

On the other hand, (9) and (14) also imply that

(16) η_{h,i}^2 µ_jk^2 = Det P for all i = 1, 2, 3.

Again substituting the parametrization (9) for the µ_ij, equations (15), (16) and (17) imply (18). To show that K satisfies (12), first divide this inequality by µ_jk^2 (if it is zero the inequality is trivially satisfied, since then Det P = µ_123^2). Using (16) and the fact that 1 ± μ̄_i ≥ 0, this in turn reduces to (5). Since by hypothesis (5) holds, we conclude that M^κ_T ⊆ M. Now we show that M ⊆ M^κ_T by proving that for every K ∈ M a parameter vector ω in (9) exists which satisfies the constraints defining Ω_T and K = ψ_T(ω). Let P = f_pκ^{−1}(K); then from (10) we know that Det P ≥ 0. Consider separately the two situations: first Det P = 0 and second Det P > 0. In the first case, again from (10), necessarily µ_123 = 0. The inequality (11) therefore implies that at least two covariances are zero. If all the covariances are zero then η_{h,1} = η_{h,2} = η_{h,3} = 0 and μ̄_h^2 = 1 is a valid choice of parameters in (9), and these values satisfy (5). When one covariance, say µ_12, is non-zero then any choice of parameters has to satisfy μ̄_h^2 = 1, η_{h,1}, η_{h,2} ≠ 0 and η_{h,3} = 0. Such a choice of parameters exists by Corollary 2 in [14], which states that if only µ_12 ≠ 0 then there always exists a choice of parameters for the model X_1 ⊥⊥ X_2 | H, where H is hidden. Assume now that Det P > 0, which by (11) implies that µ_ij ≠ 0 for each i < j; solving for the parameters gives expressions which coincide with (9) modulo the sign. It can easily be shown that µ_12 µ_13 µ_23 > 0 implies that there exists a choice of signs for the η_{h,i}, i = 1, 2, 3, such that the equations hold for all 1 ≤ i < j ≤ 3 as in (9).
For example, set sgn(η_{h,i}) = sgn(µ_jk) and use the fact that, by our assumption, sgn(µ_ij) = sgn(µ_ik) sgn(µ_jk). This choice of signs already determines the sign of μ̄_h so that the equation for µ_123 in (9) holds. It remains to show that the parameters set in this way satisfy the constraints defining Ω_T. First note that since 0 ≤ 4 µ_12 µ_13 µ_23 ≤ Det P we have μ̄_h^2 ∈ [0, 1] as required. From Appendix D in [27] we know that if (η_{h,1}, η_{h,2}, η_{h,3}, μ̄_h) is one choice of parameters then there exists only one alternative choice, namely (−η_{h,1}, −η_{h,2}, −η_{h,3}, −μ̄_h), obtained by switching the labels of the hidden variable. It remains to verify the inequality in (18); we show that (10) and (12) already imply that this inequality has to be satisfied.
Multiply both sides by µ_jk^2 Det P ± µ_jk µ_123 and then divide by 4 µ_12 µ_13 µ_23 (both expressions are strictly positive) to obtain an inequality which, since Det P > 0 and µ_jk^2 > 0, is equivalent to (20). The second inequality in (20) is exactly (12), so we have to show that the first inequality is already implied by (10) and (12). To see this, rewrite (12): since µ_12 µ_13 µ_23 ≥ 0, the right hand side of the resulting inequality has to be non-negative as well, and in particular (21) holds, noting that the left-hand sides are non-negative. For each of the two inequalities in (21): if the right-hand side is negative then the inequality is trivially satisfied; if the right-hand side is non-negative then in the first case −2 µ_123 µ_jk ≥ −µ_123 µ_jk and in the second case 2 µ_123 µ_jk ≥ µ_123 µ_jk. Hence (21) implies exactly the first inequality in (20), which shows that it follows from (10) and (12). Consequently, (18) and hence also (5) are satisfied. It follows that M ⊆ M^κ_T. ∎
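The necessity direction of Lemma 3.2 can also be probed by simulation: any distribution generated by the tripod parametrization must satisfy Det P ≥ 0 and the Lazarsfeld product constraint µ_12 µ_13 µ_23 ≥ 0. A sketch (Python; the random parameter sampling and variable names are illustrative, and only these two of the inequalities are verified):

```python
import itertools
import random

random.seed(0)

# Random tripod model: hidden H, leaves X_1, X_2, X_3 independent given H.
pi = random.random()                         # P(H = 1)
theta = [(random.random(), random.random())  # (P(X_i=1 | H=0), P(X_i=1 | H=1))
         for _ in range(3)]

p = {}
for x in itertools.product((0, 1), repeat=3):
    total = 0.0
    for h, ph in ((0, 1 - pi), (1, pi)):
        term = ph
        for i in range(3):
            t = theta[i][h]
            term *= t if x[i] == 1 else 1 - t
        total += term
    p[x] = total

def cmom(I):
    """Central moment mu_I of the observed table p."""
    e = [sum(q for x, q in p.items() if x[k] == 1) for k in range(3)]
    total = 0.0
    for x, q in p.items():
        term = q
        for i in I:
            term *= x[i] - e[i]
        total += term
    return total

m12, m13, m23 = cmom((0, 1)), cmom((0, 2)), cmom((1, 2))
det_p = cmom((0, 1, 2)) ** 2 + 4 * m12 * m13 * m23  # right-hand side of (14)

assert det_p >= -1e-12            # Det P >= 0
assert m12 * m13 * m23 >= -1e-12  # product of covariances is non-negative
```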

A connection with tree metrics
Now let T be a general tree with n leaves. Before stating the main theorem of the paper we first show how to obtain an elegant set of necessary constraints on M_T. In this section we assume that μ̄_v^2 < 1 for all v ∈ V, so that all the correlations are well defined. Let E(ij) denote the set of edges on the unique path joining i and j in T. Lemma 4.1. For each probability distribution in M^κ_T such that all the correlations are well defined we have

(23) ρ_ij = ∏_{(u,v) ∈ E(ij)} ρ_{u,v}.

Proof. By (7) applied to T(ij) we have µ_ij = (1/4)(1 − μ̄_r^2) ∏_{(u,v) ∈ E(ij)} η_{u,v}, where r is the root of the path between i and j. Now apply (22), i.e. η_{u,v} = ρ_{u,v} √((1 − μ̄_v^2)/(1 − μ̄_u^2)), to each η_{u,v} in the product above: the variance factors telescope along the two directed subpaths and (23) follows. ∎
The above equation allows us to demonstrate an interesting reformulation of our problem in term of tree metrics (c.f. [21,Section 7]) which we explain below (see also Cavender [7]).
The map d(i, j) = −log ρ_ij^2 is a tree metric in the sense of Definition 4.2. In our case we have a point in the model space defining all the second order correlations, and by Lemma 4.1

(24) d(i, j) = Σ_{(u,v) ∈ E(ij)} d(u, v).

The question is: what are the conditions on the "distances" between leaves so that there exists a tree T and edge lengths d(u, v) for all (u, v) ∈ E such that (24) is satisfied? Or equivalently: what are the conditions on the absolute values of the second order correlations in order that ρ_ij^2 = ∏_{e ∈ E(ij)} ρ_e^2 (for some edge correlations) is satisfied? We have the following theorem.
A map δ on pairs of leaves is a tree metric if and only if it satisfies the four-point condition (cf. [21]). Moreover, a tree metric defines the tree uniquely.
The question we may now ask is: for a given assignment of edge weights forming a tree metric, which of these correspond to a probability model on the tree defined as the image of (2)? From Lemma 4.1 we have seen that the tree metric itself induces (using Definition 4.2) some necessary conditions related to the four-point condition. Since δ(i, j) = −log ρ_ij^2 and log is a monotone function, these constraints translate into constraints on the correlations:

(25) ρ_ij^2 ρ_kl^2 ≥ min{ρ_ik^2 ρ_jl^2, ρ_il^2 ρ_jk^2}

for all not necessarily distinct leaves i, j, k, l ∈ [n]. However, later in Theorem 4.6 we show that these are not the only active constraints on the model M_T. Before we present this theorem it is helpful to make some simple observations about the relationship between correlations and probabilistic tree models. Since ρ_{u,v} can have different signs, we define a signed tree metric as a tree metric with an additional sign assignment s(u, v) ∈ {−1, 1} for each edge of T. There are additional natural constraints which assure that there exists a choice of signs for the edge correlations such that (23) is satisfied: writing σ(i, j) for the sign of ρ_ij, there exists a map s as defined above such that for all i, j ∈ [n]

(26) σ(i, j) = ∏_{(u,v) ∈ E(ij)} s(u, v)

if and only if for all triples i, j, k ∈ [n] we have σ(i, j)σ(i, k)σ(j, k) = 1.
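The multiplicative form of the four-point condition can be checked directly for correlations that genuinely come from a tree: among the three pairings of four leaves, the two smallest products of squared correlations must coincide. A sketch (Python; the quartet tree and its edge correlations are illustrative values, not from the paper):

```python
# Quartet tree: inner nodes r and a; edges r-1, r-2, r-a, a-3, a-4.
edge_rho = {('r', 1): 0.8, ('r', 2): 0.7, ('r', 'a'): 0.6,
            ('a', 3): 0.9, ('a', 4): 0.5}

paths = {  # edges on the unique path joining each pair of leaves
    (1, 2): [('r', 1), ('r', 2)],
    (1, 3): [('r', 1), ('r', 'a'), ('a', 3)],
    (1, 4): [('r', 1), ('r', 'a'), ('a', 4)],
    (2, 3): [('r', 2), ('r', 'a'), ('a', 3)],
    (2, 4): [('r', 2), ('r', 'a'), ('a', 4)],
    (3, 4): [('a', 3), ('a', 4)],
}

# Leaf-to-leaf correlations multiply along paths.
rho = {}
for pair, es in paths.items():
    r = 1.0
    for e in es:
        r *= edge_rho[e]
    rho[pair] = r

# Four-point condition, product form: the minimum of the three products of
# squared correlations over pairings of {1,2,3,4} is attained at least twice.
prods = sorted([(rho[1, 2] * rho[3, 4]) ** 2,
                (rho[1, 3] * rho[2, 4]) ** 2,
                (rho[1, 4] * rho[2, 3]) ** 2])
assert abs(prods[0] - prods[1]) < 1e-12  # the two smallest coincide
assert prods[2] >= prods[1]              # the split pairing gives the largest
```

The pairing (12)(34) matching the inner edge split gives the largest product, because it is the only pairing whose paths avoid the inner edge.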
Proof. First assume that there exists a map s as in the statement. It induces a map s : V × V → {−1, 1} (we use the same notation) such that s(k, l) = ∏_{(u,v) ∈ E(kl)} s(u, v). For any triple i, j, k there exists a unique inner node h which is the intersection of the three paths between i, j, k. By the above equation, the choice of signs s(u, v) for (u, v) ∈ E gives s(i, h), s(j, h) and s(k, h). Since s(i, j) = s(i, h)s(j, h), and similarly for the two other pairs, we get s(i, j)s(i, k)s(j, k) = s^2(i, h) s^2(j, h) s^2(k, h) = 1, and the result follows since by construction σ(i, j) = s(i, j) for all i, j ∈ [n].
To prove the converse we use an inductive argument with respect to the number of hidden nodes. Note that whenever there is a path E(uv) in T all of whose inner nodes have degree two, a sign assignment satisfying (26) exists if and only if one exists for the same tree with E(uv) contracted to the single edge (u, v). Hence we can assume that the degree of each inner node is at least three. First we show that the result holds for trees with one inner node (star trees); in this case we use induction with respect to the number of leaves. The result holds for the tripod tree, which can be checked directly. Assume it holds for all star trees with k ≤ m − 1 leaves and let T be a star tree with m leaves. By assumption, for any three leaves i, j, k we have σ(i, j)σ(i, k)σ(j, k) = 1. If we consider the subtree with (1, h) deleted then by the induction assumption we can find a consistent choice of signs for all the remaining edge correlations. A choice of sign for (1, h) consistent with (26) exists if for all i ≥ 2 we have σ(1, i) = s(1, h)s(i, h). This is true if either σ(1, i)s(i, h) = 1 for all i or σ(1, i)s(i, h) = −1 for all i. Suppose not, i.e. there exist two leaves i, j such that σ(1, i)s(i, h) = 1 and σ(1, j)s(j, h) = −1. Then, since σ(i, j) = s(i, h)s(j, h), we have σ(1, i)σ(1, j)σ(i, j) = −1, which contradicts our assumption.
If the number of inner nodes is greater than one, pick an inner node h adjacent to exactly one other inner node, h′, and let I be the subset of leaves adjacent to h. Let 1 ∈ I and consider the subtree T′ obtained by removing all leaves in I and their incident edges apart from 1 and (h, 1). By induction, since h has degree two in the resulting subtree, we can find signs for all edge correlations of T′. Set s(h, h′) = 1; then s(h, 1) = s(h′, 1) and we need only show that we can identify s(h, i) for all i ∈ I \ {1}. Let j, k be any two leaves not in I and let i ∈ I. Using exactly the same argument as in the star tree case we can show, by contradiction, that for each i ∈ I a consistent assignment of s(i, h) exists; since s(h, h′) = 1 we have s(i, h)s(h, h′) = s(i, h).
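The triple-product criterion can be verified mechanically: assign arbitrary signs to the edges of a tree, multiply them along leaf-to-leaf paths, and check that every triple of leaves has sign product +1. A sketch (Python; the quartet tree and the particular edge signs are arbitrary illustrative choices):

```python
import itertools

# Quartet tree with arbitrary edge signs s(u, v).
edge_sign = {('r', 1): -1, ('r', 2): 1, ('r', 'a'): -1,
             ('a', 3): 1, ('a', 4): -1}

paths = {  # edges on the unique path joining each pair of leaves
    (1, 2): [('r', 1), ('r', 2)],
    (1, 3): [('r', 1), ('r', 'a'), ('a', 3)],
    (1, 4): [('r', 1), ('r', 'a'), ('a', 4)],
    (2, 3): [('r', 2), ('r', 'a'), ('a', 3)],
    (2, 4): [('r', 2), ('r', 'a'), ('a', 4)],
    (3, 4): [('a', 3), ('a', 4)],
}

def sigma(i, j):
    """Sign of the correlation of leaves i, j: product of signs along the path."""
    s = 1
    for e in paths[min(i, j), max(i, j)]:
        s *= edge_sign[e]
    return s

# Each edge occurs an even number of times among the three paths joining a
# triple of leaves, so every triple product equals +1.
for i, j, k in itertools.combinations((1, 2, 3, 4), 3):
    assert sigma(i, j) * sigma(i, k) * sigma(j, k) == 1
```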
The lemma implies that for all i, j, k ∈ [n] necessarily ρ_ij ρ_ik ρ_jk ≥ 0, or equivalently that µ_ij µ_ik µ_jk ≥ 0. This in particular implies that µ_ik µ_jl µ_ij µ_kl ≥ 0 for all i, j, k, l ∈ [n]. By taking the square root in (25), these constraints can be combined and rearranged to give the inequalities in (27), which hold for all (not necessarily distinct) i, j, k, l ∈ [n]. We have now obtained the complete set of inequality constraints on M_T that involve only second order moments. However, the fact that additional constraints involving higher order moments exist is illustrated by the following simple example. Consider a point K in the tree-cumulant space of the tripod tree chosen so that the factorization (23) holds with edge correlations ρ_hi = 0.7 for each i = 1, 2, 3. Clearly K satisfies all the tree metric constraints in (27). We now show that despite this K ∉ M^κ_T. For if K ∈ M^κ_T we could find μ̄_h and η_{h,i} satisfying the constraints in (5) so that (9) held. Using the formulas in Corollary 11 in [27] it is easy to compute that μ̄_h = 0.86 and η_{h,i} ≈ 0.98. To confirm this, substitute these values into (9) and check that the equations are satisfied. However, K is not in the model since these parameters do not lie in Ω_T: indeed, (5) is not satisfied.
The consequence of the fact that the parameters do not lie in Ω_T is that this parametrization does not lead to a valid assignment of conditional probabilities to the edges of the tree. For example, with the numbers given above we can calculate that the induced marginal distribution for (X_i, H) would have to satisfy P(X_i = 0, H = 1) = −0.0043, which is obviously not a consistent assignment for a probability model. Thus there must exist other constraints, involving observed higher order moments, that need to hold for a probability model to be valid. We note that for the tripod tree these were given by Lemma 3.2.
In Appendix B we prove the following theorem, which gives the complete set of constraints that have to be satisfied by the tree cumulants of a point of M^κ_T in the case when T is a trivalent tree. For the statement of this result we need the following definitions. For each edge e of T we define the edge split (A)(B) as the partition of the set of leaves into two subsets A and B corresponding to the two connected components of T obtained after removing e; for example, in Example 2.1 removing (r, a) induces (12)(34). Moreover, let P ∈ Δ_{2^n−1} be the probability distribution of the vector (X_1, . . . , X_n); then for any i, j, k ∈ [n], P_ijk denotes the 2 × 2 × 2 table of the marginal distribution of (X_i, X_j, X_k). Theorem 4.6 states that K lies in the model if and only if (C1) for each edge split (A)(B) the equations hold for all (not necessarily disjoint) I_1, I_2 ⊆ A and J_1, J_2 ⊆ B, and (C2) for every triple of leaves i, j, k the corresponding inequalities hold for all three permutations σ of {i, j, k} such that σ(j) < σ(k).
Remark 4.7. In phylogenetic analysis it is often assumed that η_{u,v} > 0 for all (u, v) ∈ E and μ̄_v^2 ≠ 1 for all v ∈ V (cf. assumptions (M1)-(M3) in Sections 8.2 and 8.4 of [21]). In this case µ_ij µ_ik µ_jk > 0 for all i, j, k ∈ [n] and the second constraint in (C2) is not active. Moreover, the model is then globally identified (cf. Appendix D in [27]).
If we have only the tree cumulants we can still identify the parameters of the model up to label switching on the inner nodes using Corollary 11 in [27]; for example μ̄_r^2 = µ_123^2 / (µ_123^2 + 4 µ_12 µ_13 µ_23). Note that the numbers on the right-hand sides of such equations do not depend on the choices made: by Corollary 11 in [27], to compute μ̄_r^2 we can use any three leaves separated by r. Above we used 1, 2, 3, but we could also use 1, 2, 4, and in both cases we get the same result. It can be checked that for this point all the equations in (C1) are satisfied. However, it is not in the model: using the formulas in Corollary 11 in [27] it is simple to confirm that the point mapping to K satisfies θ^{(1)}_{1|0} = −0.3. This cannot be a probability and so θ ∉ Θ_T.

Phylogenetic invariants
In a seminal paper Allman and Rhodes [1] identified equations defining the general Markov model M_T in the case when T is a trivalent tree. In this section we relate their results to ours. To introduce their main theorem we need the following definition. Definition 6.1. Let X = (X_1, . . . , X_n) be a vector of binary random variables and let P = (p_γ), γ ∈ {0,1}^n, be the 2 × · · · × 2 table of the joint distribution of X. Let (A)(B) form a partition of [n]. Then the flattening of P induced by the partition is the 2^{|A|} × 2^{|B|} matrix whose rows are indexed by the states of the variables in A and whose columns are indexed by the states of the variables in B. Let T = (V, E) be a tree. In particular, for each e ∈ E, removing the edge e induces a partition of the set of leaves into two subsets corresponding to the two connected components of the resulting forest. Allman and Rhodes called this flattening an edge flattening and we denote it by P_e.
If P is the joint distribution of X = (X_1, . . . , X_n) then each of its flattenings is just a matrix representation of the joint distribution P and contains essentially the same probabilistic data. However, these different representations contain important geometric information about the model: the result of Allman and Rhodes (Theorem 6.2) is that the polynomial equations defining M_T are given by the vanishing of all 3 × 3 minors of each edge flattening P_e. Note that this result includes the case of the tripod tree model, since in that case each edge flattening of the joint probability table is a 2 × 4 matrix, so there are no 3 × 3 minors and hence there are no non-trivial polynomials vanishing on the model.
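The rank condition behind these minors is easy to check numerically: simulate a distribution from a quartet tree model, flatten it along the split (12)(34) induced by the inner edge, and compute the matrix rank. A sketch (Python; the sampler and all names are illustrative, not the paper's notation):

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)

# Quartet tree: hidden chain H_r -> H_a; leaves 1, 2 attached to H_r and
# leaves 3, 4 attached to H_a, all variables binary.
pr = rng.random()              # P(H_r = 1)
t_a = rng.random(2)            # P(H_a = 1 | H_r = h)
t_leaf = rng.random((4, 2))    # P(X_i = 1 | parent = h)

P = np.zeros((2, 2, 2, 2))
for x in itertools.product((0, 1), repeat=4):
    s = 0.0
    for hr, ha in itertools.product((0, 1), repeat=2):
        w = pr if hr else 1 - pr
        w *= t_a[hr] if ha else 1 - t_a[hr]
        for i, parent in zip(range(4), (hr, hr, ha, ha)):
            q = t_leaf[i][parent]
            w *= q if x[i] else 1 - q
        s += w
    P[x] = s

# Edge flattening for the split (12)(34): rows indexed by (x_1, x_2),
# columns by (x_3, x_4). The rank is at most 2 for a binary hidden edge.
Pe = P.reshape(4, 4)
print(np.linalg.matrix_rank(Pe, tol=1e-10))  # at most 2
```

A rank greater than 2 would immediately certify that a table cannot come from any binary quartet tree with this split.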
In an analogous way to the edge flattenings of tables representing probability distributions we can define edge flattenings of (κ_I), I ⊆ [n], where κ_∅ = 1 and κ_i = 0 for all i ∈ [n] (cf. Appendix A). Let e be an edge of T inducing a split (A)(B) ∈ Π_T such that |A| = r, |B| = n − r. Then N_e is the 2^r × 2^{n−r} matrix such that for any two subsets I ⊆ A, J ⊆ B the element of N_e in the I-th row and the J-th column is κ_{IJ}. Denote by N̄_e the submatrix of N_e obtained by removing the column and the row corresponding to the empty subsets of A and B. Here the labeling of the rows and columns is induced by the ordering of the rows and columns of P_e (cf. Definition 6.1), i.e. all the subsets of A and B are coded as {0, 1}-vectors and we use the lexicographic order on these vectors, with the vector of ones last.
The following result allows us to rephrase the equations in Theorem 6.2 in terms of our new coordinates. Proof. Let P_e = [p_αβ] be the matrix induced by a split (A_1)(A_2). We will show that rank(P_e) = rank(D_e), where D_e = [d_IJ] is the block diagonal matrix with 1 as its first 1 × 1 block (i.e. d_∅∅ = 1, d_∅J = 0, d_I∅ = 0 for all non-empty I ⊆ A_1, J ⊆ A_2) and the submatrix N̄_e defined above as its second block. It will then follow that rank(P_e) = 2 if and only if rank(N̄_e) = 1.
First note that the flattening matrix P_e can be transformed into the flattening of the non-central moments just by adding rows and columns according to (29), and then into the flattening of the central moments M_e = [µ_IJ], I ⊆ A_1, J ⊆ A_2, using (30). It therefore suffices to show that rank(M_e) = rank(D_e).
Let I ⊆ A_1, J ⊆ A_2. Then for each π ∈ Π_{T(IJ)} there is at most one block containing elements from both I and J, for otherwise removing e would increase the number of blocks in π by more than one, which is not possible. Denote this block by (I′J′), where I′ ⊆ I, J′ ⊆ J. Note that by construction either both I′, J′ are empty, which happens if π ≥ (A_1)(A_2) in Π_{T(IJ)}, or both I′, J′ ≠ ∅ otherwise. We can now rewrite (32) by splitting the blocks accordingly; we have d_{I′J′} = κ_{I′J′}, and the expression can be rewritten further.

In [10] Eriksson noted that some invariants usually prove to be better at discriminating between different tree topologies than others. His simulations showed that the invariants related to the four-point condition were especially powerful. The binary case we consider in this paper can give some partial understanding of why this might be so. Here the invariants related to the four-point condition are precisely those involving only second-order covariances (c.f. Section 4). Moreover, the estimates of the higher-order moments (or cumulants) are sensitive to outliers and their variance generally grows with the order of the moment. Let μ̂ be a sample estimator of the central moments µ and let f be one of the polynomials in Theorem 6.4, expressed in terms of the central moments. Then by the delta method

Var(f(μ̂)) ≈ ∇f(µ)ᵀ Var(μ̂) ∇f(µ).
Consequently, in this loose sense at least, the higher the order of the central moments, and hence of the tree cumulants, the higher the variability we might expect the invariant to exhibit (see [18, Section 4.5]).
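The delta-method heuristic above can be illustrated numerically. In this sketch (simulated data from a hypothetical star-tree model; the parameter values and the choice of invariant are ours for illustration) we estimate a tetrad-style covariance invariant and its delta-method variance:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 2000

# Hypothetical data: given hidden H, each of 4 binary leaves is 1 with
# probability 0.8 (H=1) or 0.2 (H=0)
h = rng.random(N) < 0.5
X = np.column_stack([rng.random(N) < np.where(h, 0.8, 0.2)
                     for _ in range(4)]).astype(float)
Xc = X - X.mean(axis=0)                      # center, to estimate central moments

# Sample covariances entering the tetrad-style invariant
# f = mu_13 * mu_24 - mu_14 * mu_23, which vanishes on this model
pairs = [(0, 2), (1, 3), (0, 3), (1, 2)]
prods = np.column_stack([Xc[:, i] * Xc[:, j] for i, j in pairs])
mu_hat = prods.mean(axis=0)

f = mu_hat[0] * mu_hat[1] - mu_hat[2] * mu_hat[3]
grad = np.array([mu_hat[1], mu_hat[0], -mu_hat[3], -mu_hat[2]])
cov_mu = np.cov(prods, rowvar=False) / N     # approximate Var(mu_hat)

var_f = grad @ cov_mu @ grad                 # delta method: grad^T Var(mu_hat) grad
print(f, var_f)
```

Repeating the exercise with third- or fourth-order moments in place of the covariances typically inflates var_f, in line with the remark above.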

Discussion
The new coordinate system proposed in [27] provides a better insight into the geometry of phylogenetic tree models with binary observations. The elegant form of the parameterization is useful and has already enabled us to obtain the full geometric description of the model class. One interesting implication of this result for phylogenetic tree models is that we could consider different, simpler model classes containing the original one in such a way that the whole evolutionary interpretation in terms of tree topologies remains valid. If we are interested only in the tree we could consider the model defined only by the subset of the constraints in Theorem 4.6 involving only covariances. The price of this reduction is that the conditional independencies induced by the original model no longer hold, which in turn affects the interpretation of the model. We note that this approach is in a similar spirit to that employed to motivate the MAG model class introduced in [23].
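The covariance-only submodel just described rests on the correspondence with tree metrics, where admissible distances are characterized by the four-point condition. That condition can be checked mechanically; a minimal sketch, with hypothetical unit edge lengths on a quartet tree 12|34:

```python
from itertools import combinations

# Path-length distances on a hypothetical quartet tree 12|34,
# all five edges of unit length
d = {frozenset(p): w for p, w in [((1, 2), 2), ((3, 4), 2), ((1, 3), 3),
                                  ((1, 4), 3), ((2, 3), 3), ((2, 4), 3)]}

def four_point(d, i, j, k, l):
    # Of the three pairwise sums, the two largest must coincide
    s = sorted([d[frozenset((i, j))] + d[frozenset((k, l))],
                d[frozenset((i, k))] + d[frozenset((j, l))],
                d[frozenset((i, l))] + d[frozenset((j, k))]])
    return s[1] == s[2]

print(all(four_point(d, *q) for q in combinations([1, 2, 3, 4], 4)))  # True
```

As the paper stresses, passing this metric-style check is necessary but not sufficient: the higher-order tree-cumulant inequalities of Theorem 4.6 must also hold.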
This work has encouraged us to use this reparametrization of the model class to estimate models within a Bayesian framework. When the sample proportions lie in the model class we have already noted that the MLEs are given by the formulas in Corollary 11 in [27]. However, these sample proportions rarely lie in the model exactly. In a later paper we provide various formal methods for incorporating the semi-algebraic geometry of a model to improve the prior specification of the tree model and hence enhance the estimation of the model parameters.

Acknowledgements
Diane Maclagan and John Rhodes contributed substantially to this paper. We would also like to thank Bernd Sturmfels for a stimulating discussion at the early stage of our work and Lior Pachter for pointing out reference [7].
This is a linear map f_{pλ}: R^{2^n} → R^{2^n} with determinant equal to one, where the components λ_α of λ = f_{pλ}(p) are defined by

λ_α = Σ_{α ≤ β ≤ 1} p_β,

where 1 denotes the vector of ones and the sum is over all binary vectors β such that α ≤ β ≤ 1, in the sense that α_i ≤ β_i ≤ 1 for all i = 1, . . . , n. In particular λ_0 = 1 for all probability distributions, so the image f_{pλ}(Δ_{2^n−1}) is contained in the hyperplane defined by λ_0 = 1. The linearity of the expectation implies that the central moments can be expressed in terms of the non-central moments.
where |β| = Σ_i β_i. Using these equations we can transform from the non-central moments λ = [λ_α] to another set of variables given by all the means λ_{e_1}, . . . , λ_{e_n}, where e_1, . . . , e_n are the standard basis vectors in R^n, together with the central moments [µ_α] for α ∈ {0,1}^n. The polynomial change of coordinates f_{λµ}: R^{2^n} → R^n × R^{2^n} is the identity on the first n coordinates corresponding to the means λ_{e_1}, . . . , λ_{e_n} and is defined on the remaining coordinates using the equations (30). Denote C_n = (f_{λµ} ∘ f_{pλ})(Δ_{2^n−1}), which is contained in the subspace of R^n × R^{2^n} given by µ_0 = 1 and µ_{e_1} = · · · = µ_{e_n} = 0.
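The two changes of coordinates just described can be verified directly on a small example. A sketch for n = 3 (a random point of the simplex), computing the non-central moments λ_α = Σ_{α≤β≤1} p_β and the central moments by direct expectation, and confirming λ_0 = µ_0 = 1 and µ_{e_i} = 0:

```python
import itertools
import numpy as np

n = 3
rng = np.random.default_rng(2)
p = rng.dirichlet(np.ones(2 ** n))           # a point of the simplex Delta_{2^n - 1}
states = list(itertools.product([0, 1], repeat=n))

# Non-central moments: lambda_alpha = sum of p_beta over beta with alpha <= beta <= 1
lam = {a: sum(p[k] for k, b in enumerate(states)
              if all(bi >= ai for ai, bi in zip(a, b)))
       for a in states}

# Central moments by direct expectation: mu_alpha = E prod_i (X_i - E X_i)^alpha_i
means = [lam[tuple(int(j == i) for j in range(n))] for i in range(n)]
mu = {a: sum(p[k] * np.prod([(b[i] - means[i]) ** a[i] for i in range(n)])
             for k, b in enumerate(states))
      for a in states}

print(abs(lam[(0, 0, 0)] - 1) < 1e-12)       # True: lambda_0 = 1
print(all(abs(mu[tuple(int(j == i) for j in range(n))]) < 1e-12
          for i in range(n)))                # True: mu_{e_i} = 0
```

This matches the claim that C_n lies in the subspace µ_0 = 1, µ_{e_1} = · · · = µ_{e_n} = 0.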
To simplify notation, henceforth we will index moments not by {0,1}^n but by the set of subsets of [n]. Here a set A ⊆ [n] is identified with the α ∈ {0,1}^n such that α_i = 1 for all i ∈ A and α_i = 0 otherwise. In particular, for each i ∈ [n] we write λ_i for λ_{e_i}. The coordinates of C_n are given by λ_1, . . . , λ_n together with µ_I for all I ⊆ [n] with |I| ≥ 2.
Note that the Jacobian of f_{λµ} ∘ f_{pλ}: Δ_{2^n−1} → C_n is constant and equal to one. The final change of coordinates requires some combinatorics. Let T = (V, E) be a tree with n leaves. A split induced by e ∈ E is a partition of [n] into two non-empty sets obtained by removing e from E and restricting [n] to the connected components of the resulting graph. By a multisplit we mean any partition (B_1) · · · (B_k) of the set of leaves induced by removing a subset of the set of edges of T. Each B_i is called a block of the partition.
By Π_T we denote the partially ordered set (poset) of all multisplits of the set of leaves induced by edges of T. The poset Π_T has a unique maximal element, induced by removing all edges in E, and a unique minimal element, with no edges removed, which is the single block [n]. The maximal element of a lattice is denoted by 1 and the minimal one by 0.
For any poset Π a Möbius function m_Π: Π × Π → R is defined so that m_Π(x, x) = 1 for every x ∈ Π, m_Π(x, y) = − Σ_{x ≤ z < y} m_Π(x, z) for x < y in Π, and m_Π(x, y) = 0 otherwise (c.f. [ ]). Note that f_{pλ}, f_{λµ}, f_{µκ}, after restriction to Δ_{2^n−1}, f_{pλ}(Δ_{2^n−1}) and C_n respectively, are all regular polynomial maps with regular inverses (c.f. [27], Appendix A). This therefore implies that there is a regular one-to-one polynomial map between Δ_{2^n−1} and K_T.

Since the rooting is not relevant we choose an arbitrary inner node as the root. We first show that M^κ_T ⊆ M. Let K ∈ M^κ_T, so K = ψ_T(ω) for some ω ∈ Ω_T. The equations in (C1) hold since by construction (c.f. Theorem 6.2) the variety in K_T defined by these equations contains the image of ψ_T and hence also K. To show that K satisfies (C2) and (C3), consider the projection K^{ijk} for each i, j, k ∈ [n]. By Lemma 1 in [27], M^κ_{T(ijk)} is equal to the tripod tree model. Since K^{ijk} ∈ M^κ_{T(ijk)}, by Lemma 3.2 both (C2) and (C3) must hold.

To show that K satisfies (C4), let i, j ∈ I be such that µ_{ij} = 0. In this case, from (7), ω is such that either η_{u,v} = 0 for some (u, v) ∈ E(ij) or μ̂²_{r(ij)} = 1. In the first case, since E(ij) ⊆ E(I), we have κ_I = 0 by (7). In the second case, if r(ij) = r(I) then again κ_I = 0 by (7). If r(ij) ≠ r(I) then the edge (v, r(ij)) pointing to r(ij) also lies in E(I). By (5), either η_{v,r(ij)} = 0, in which case we are done, or μ̂²_v = 1, in which case again there exists an edge in E(I) pointing into v. Since the tree is finite, eventually either μ̂²_{r(I)} = 1 or η_{u,v} = 0 for some (u, v) ∈ E(I). This shows that necessarily κ_I = 0 and hence K satisfies (C4).
To show that K satisfies (C5), let i, j, k, l ∈ [n] be the four leaves mentioned in the condition. Let u and v be the two inner nodes such that u separates i from j, v separates k from l, and {u, v} separates {i, j} from {k, l}. In other words, u, v are the only inner nodes of degree three in T(ijkl). By Lemma 2 in [27], T(ijkl) gives the same model as the quartet tree with four leaves i, j, k, l and two inner nodes u, v. Moreover, by Remark 2.5, M_{T(ijkl)} does not depend on the rooting, so we can assume that the tree is rooted at u.
Since K^{ijkl} ∈ M_{T(ijkl)}, for some parameter choices the coordinates of K^{ijkl} can be written in terms of the parameters; substitute these expressions into (C5). There are then two cases to consider: µ_{uv} ≥ 0 and µ_{uv} < 0. Laborious but elementary algebra shows that the condition in (C5) is equivalent to (5) applied to (1 − μ̂²_u)η_{u,v}, and hence (C5) holds by definition. Consequently M^κ_T ⊆ M.

We next show that M ⊆ M^κ_T. Let K ∈ M. We construct a point ω_0 ∈ R^{|V|+|E|} such that ω_0 ∈ Ω_T and ψ_T(ω_0) = K, i.e. ω_0 is such that for every I ⊆ [n] with |I| ≥ 2, κ_I can be written in terms of the parameters in ω_0 as in (7).

Case 1: Begin by assuming that K is such that µ_{ij} ≠ 0 for all i, j ∈ [n]. We first set the squares of the values of all the parameters in terms of the observed moments as in Corollary 11 in [27] and show that the equations in (7) must hold for their modulus values. Next we ensure that there is at least one assignment of signs for the parameters such that all the equations in (7) hold exactly. Finally we show that the parameter vector ω_0 defined in this way lies in Ω_T.
For each inner node h of T let i, j, k ∈ [n] be leaves separated by h in T. By (C2) we have µ_{ij}µ_{ik}µ_{jk} > 0 and hence also DetP_{ijk} > 0. Now set (μ̂⁰_h)² as in Corollary 11 in [27]. We show that (C1), which K satisfies by assumption, implies that the value of (μ̂⁰_h)² does not depend on the choice of i, j, k. It suffices to show that the value is unchanged if k is replaced by another leaf k′ such that i, j, k′ are separated by h in T. Since h has degree three in T there exists an edge e ∈ E inducing a split (A)(B) such that i, j ∈ A and k, k′ ∈ B. From (C1) it follows that

(34) µ_{ik}µ_{jk′} = µ_{ik′}µ_{jk}, µ_{ijk}µ_{ik′} = µ_{ijk′}µ_{ik}, µ_{ijk}µ_{jk′} = µ_{ijk′}µ_{jk}

and consequently

(35) DetP_{ijk} / (µ_{ij}µ_{ik}µ_{jk}) = DetP_{ijk′} / (µ_{ij}µ_{ik′}µ_{jk′}),

which gives the required equality.
For terminal edges (v, i) of T with i ∈ [n], let j, k ∈ [n] be any two leaves of T such that v separates i, j, k. Set η_{v,i} as in Corollary 11 in [27]. As in the previous case it is straightforward to check that, given (C1), this value does not depend on the choice of j, k. Without loss of generality assume that instead of k we take k′, where v separates i, j, k′ in T. Since there exists an edge inducing a split such that i, j and k, k′ are in different blocks, we have (34) and (35), and consequently the value is unchanged.

For inner edges (u, v) ∈ E, let i, j, k, l ∈ [n] be any four leaves such that u separates i from j, v separates k from l, and {u, v} separates {i, j} from {k, l}. Set η_{u,v} as in Corollary 11 in [27]; this is well-defined since µ²_{ij} and DetP_{ikl} are strictly positive. We now show that this value does not depend on the choice of i, j, k, l. By symmetry it suffices to show that we obtain the same value if instead of l we take another leaf l′ such that u, v are the only degree-three nodes in T(ijkl′). Since v has degree three, there must exist an inner edge separating i, j, k from l, l′. From (C1) it follows that µ_{il}µ_{kl′} = µ_{il′}µ_{kl}, together with the corresponding relation between DetP_{ikl} and DetP_{ikl′}, and hence η_{u,v} takes the same value, as required.
We now show that the equations (7) hold in modulus. First consider the case I = {i, j}. Label the inner nodes of E(ij) by v_1, . . . , v_k, beginning from the node adjacent to i. For each s = 1, . . . , k let i_s denote a leaf such that v_s separates i, j, i_s in T. We assume that the root r(ij) of this path is v_1; the analysis is the same for any other rooting by Remark 2.5. Since v_1 separates i, j, i_1 by construction, from (33) we obtain the moduli of the equations (7). It then suffices to check that the signs match, which follows directly from the construction: this proves (38), and similarly from (40) we obtain the corresponding identity. Multiplying both sides by sgn(µ_{ijk}) and using the fact that (∏_{(u,v)∈E(ijk)} s(u, v))² = 1 completes this case.

We now show (7) for |I| ≥ 4 by induction. Let (u, v) ∈ E be any edge splitting I into two subsets I_1 and I_2 such that |I_1|, |I_2| ≥ 2, with u the node closer to I_1. Let i ∈ I_1 and j ∈ I_2; then by (C1)

κ_{I_1 I_2} = κ_{I_1 j} κ_{i I_2} / κ_{ij}.
By induction we can assume that κ_{I_1 j}, κ_{i I_2} and κ_{ij} have the form given in (7). Moreover,

(1 − μ̂²_{r(iI_2)})(1 − μ̂²_{r(I_1 j)}) / (1 − μ̂²_{r(ij)}) = (1 − μ̂²_{r(I)}),

since the root of T(I) lies either in T(I_1 u) or in T(v I_2): in the first case r(I_1 j) = r(I) and r(iI_2) = r(ij), while in the second case r(I_1 j) = r(ij) and r(iI_2) = r(I). Hence in both cases the identity above holds and (48) has the required form given by (32). It follows that K = ψ_T(ω_0).

It now remains to show that the parameters defined in (43), (44) and (45) form a parameter vector ω_0 which lies in Ω_T. Since by (C2) µ²_{ijk} ≤ DetP_{ijk} for all i, j, k ∈ [n], it follows that for all inner nodes h we have μ̂⁰_h ∈ [−1, 1], as required. For a terminal edge (v, i), consider the marginal model induced by T(ijk), where j, k are any two leaves such that v separates i, j, k in T. From Lemma 3.2, the constraints (C2) and (C3) imply that η_{v,i} is a valid parameter. To show that (45) satisfies (5), we substitute it, together with the expressions for μ̂⁰_u and μ̂⁰_v given by (43), into (5). First assume s(u, v) = 1; then s(u, k) = s(v, k), s(v, i) = s(u, i), and the left-hand side of (5) becomes

4µ_{ik}µ_{jk}µ_{il} s(j, l) / √(DetP_{ijk} DetP_{ikl}) = 4µ²_{ik}µ_{jl} s(j, l) / √(DetP_{ijk} DetP_{ikl}).

Now multiply both sides by |µ_{jl}| √(DetP_{ijk} DetP_{ikl}); since s(u, l) = s(v, l), the resulting inequality is satisfied by (C5). It is easily calculated that the case s(u, v) = −1 leads to the same constraint. This finishes the proof in Case 1, when K is such that µ_{ij} ≠ 0 for all i, j ∈ [n].

Case 2: For the general case let K ∈ M be a tree cumulant and let Σ = [µ_{ij}] ∈ R^{n×n} be the matrix of all covariances between the leaves. We say that an edge e ∈ E is isolated relative to K if µ_{ij} = 0 for all i, j ∈ [n] such that e ∈ E(ij). By Ē ⊆ E we denote the set of all edges of T which are isolated relative to K. By T̄ = (V, E \ Ē) we denote the forest obtained from T by removing the edges in Ē, and we call it the K-forest.
We define relations on Ē and E \ Ē. For two edges e, e′ with either {e, e′} ⊂ Ē or {e, e′} ⊂ E \ Ē, write e ∼ e′ if either e = e′ or e and e′ are adjacent and all the edges incident with both e and e′ are isolated relative to K. Let us now take the transitive closure of ∼ restricted to pairs of edges in Ē to form an equivalence relation on Ē. This transitive closure is constructed as follows. Consider a graph whose nodes represent the elements of Ē and put an edge between e, e′ whenever e ∼ e′. Then the equivalence classes correspond to the connected components of this graph. Similarly, take the transitive closure of ∼ restricted to pairs of edges in E \ Ē to form an equivalence relation on E \ Ē. We let [Ē] and [E \ Ē] denote the sets of equivalence classes of Ē and E \ Ē respectively (for details see Section 5 in [27]).
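The transitive-closure construction above is simply a connected-components computation on the graph of directly related edges. A minimal sketch (the edge names and the ∼ pairs are hypothetical), using a small union-find:

```python
# Transitive closure of the relation ~ via connected components, using union-find
def equivalence_classes(elements, related_pairs):
    parent = {e: e for e in elements}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]    # path halving
            x = parent[x]
        return x

    for a, b in related_pairs:               # union each directly related pair
        parent[find(a)] = find(b)

    classes = {}
    for e in elements:
        classes.setdefault(find(e), []).append(e)
    return sorted(sorted(c) for c in classes.values())

print(equivalence_classes(["e1", "e2", "e3", "e4"],
                          [("e1", "e2"), ("e3", "e4")]))
# [['e1', 'e2'], ['e3', 'e4']]
```

Applied once to the pairs within Ē and once to the pairs within E \ Ē, this yields the classes [Ē] and [E \ Ē].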
Again we show that there exists ω_0 ∈ Ω_T such that ψ_T(ω_0) = K. Set η⁰_{u,v} = 0 for all (u, v) ∈ Ē and μ̂⁰_v = 0 for all inner nodes of T with degree zero in T̄. It then follows that (1 − μ̂²_u)η_{u,v} = 0 satisfies (5) for all (u, v) ∈ Ē and μ̂⁰_v ∈ [−1, 1] for all v ∈ V, and hence these parameters satisfy the constraints defining Ω_T. If I ⊆ [n] is such that E(I) ∩ Ē ≠ ∅ then κ_I = 0 by (C4), so in this case the defining equations hold. If I ⊆ [n] is such that E(I) ∩ Ē = ∅ then I ⊆ [n_l] for some l = 1, . . . , k. Since K_{[n_l]} ∈ M_{T_l}, there exists a choice of parameters such that κ_I can be written as in (7). Consequently K ∈ M^κ_T and we are finished.