Smoothness of marginal log-linear parameterizations

We provide results demonstrating the smoothness of some marginal log-linear parameterizations for distributions on multi-way contingency tables. First we give an analytical relationship between log-linear parameters defined within different margins, and use this to prove that some parameterizations are equivalent to ones already known to be smooth. Second we construct an iterative method for recovering joint probability distributions from marginal log-linear pieces, and prove its correctness in particular cases. Finally we use Markov chain theory to prove that certain cyclic conditional parameterizations are also smooth. These results are applied to show that certain conditional independence models are curved exponential families.


Introduction
Log-linear parameters provide an attractive, flexible, and variation independent way of parameterizing a multivariate probability distribution. Setting certain log-linear parameters to zero can be used to define several important classes of statistical model, including undirected graphical models and hierarchical models. Bergsma and Rudas (2002) study log-linear parameters defined within particular marginal distributions; this leads to much greater flexibility in model specification: for example, setting such parameters to zero can be used to define arbitrary conditional independence models (Rudas et al., 2010; Forcina et al., 2010). If those same parameters can be embedded into a larger smooth parameterization of the joint distribution, then the model defined by the conditional independence constraints is a curved exponential family.
Unfortunately, there exist models of conditional independence which cannot be shown to be curved exponential families using Bergsma and Rudas' results. Forcina (2012) studies examples of models defined by 'loops' of conditional independences; one such model is equivalent to setting a particular collection of marginal log-linear parameters to zero (see Section 6). There is no way to embed these parameters into a smooth parameterization of the kind studied by Bergsma and Rudas (2002), so we cannot use their results to demonstrate that this model is a curved exponential family. Forcina (2012) gives a numerical test which is highly suggestive of smoothness in these cases, but no formal proof is available.
In this paper we show that the class of smooth discrete parameterizations made up of marginal log-linear (MLL) parameters is considerably larger than had previously been known. We use this to parameterize conditional distributions and 'interventional' distributions of the kind studied in causal inference. As a consequence we are able to shed light on the relationship between log-linear parameters defined within different marginal distributions, which is generally complex. A closed-form expression for the partial derivative of one MLL parameter with respect to another is derived: this allows us to demonstrate that various fixed point maps are contractions which smoothly recover the joint probability distribution from certain collections of marginal log-linear parameters. In addition, we use Markov chain theory to demonstrate that certain conditional independence models, such as the one above, are curved exponential families of distributions.
The rest of the paper is organized as follows: Section 2 reviews marginal log-linear parameters and their properties, and Section 3 demonstrates how they may be used to parameterize conditional distributions. Section 4 gives results on the relationship between log-linear parameters defined within different margins, enabling certain parameterizations to be proven smooth. Section 5 extends these results to construct fixed point methods which will smoothly recover a joint distribution. Section 6 combines the results of Section 4 with Markov chain theory to demonstrate that certain conditional independence models are curved exponential families. Section 7 contains some discussion, and a conjecture on the precise characterization of smooth MLL parameterizations.

Marginal Log-Linear Parameters
We consider multivariate distributions over a finite collection of binary random variables X_v ∈ {0, 1}, for v ∈ V; we denote their joint distribution by p_V. All the results herein also hold (or have analogues) in the general finite discrete case, but the notation becomes more cumbersome. For M ⊆ V we denote the marginal distribution over X_M by p_M. Distributions are assumed to be strictly positive: p_V > 0.
Definition 2.1. Let Δ_k denote the set of strictly positive probability distributions p_V > 0, an open simplex of dimension k = 2^{|V|} − 1. We say that a homeomorphism θ : Δ_k → Θ ⊆ R^k onto an open set Θ is a smooth parameterization of Δ_k if θ is twice continuously differentiable, and its Jacobian has full rank k everywhere.
For ∅ ≠ L ⊆ V, define the ordinary log-linear parameter

η_L ≡ 2^{−|V|} Σ_{x_V} (−1)^{|x_L|} log p_V(x_V),     (1)

where |x_L| denotes the number of 1s among the entries of x_V indexed by L. It is well known that the collection η ≡ (η_L : ∅ ≠ L ⊆ V) provides a smooth parameterization of the joint distribution p_V.
We define a marginal log-linear parameter by analogy with (1), simply considering the ordinary log-linear parameter for a particular marginal distribution:

λ^M_L ≡ 2^{−|M|} Σ_{x_M} (−1)^{|x_L|} log p_M(x_M);

see Bergsma and Rudas (2002). Clearly λ^V_L = η_L and, for example,

λ^{13}_{13} = (1/4) log [p_{13}(0,0) p_{13}(1,1) / (p_{13}(0,1) p_{13}(1,0))],

which is (one quarter of) the log-odds ratio between X_1 and X_3.
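To make the definitions concrete, the following numerical sketch computes marginal log-linear parameters for binary variables, assuming the effect-coding convention above with sign (−1)^{|x_L|} and averaging factor 2^{−|M|}. The helper `mll` and the random example are our own, not from the paper; other texts differ in sign and scaling by constant factors.

```python
import numpy as np
from itertools import product

def mll(p, V, M, L):
    """Marginal log-linear parameter lambda^M_L for binary variables:
    2^{-|M|} * sum over x_M of (-1)^{|x_L|} * log p_M(x_M),
    where |x_L| counts the 1s of x_M in the coordinates indexed by L."""
    M = sorted(M)
    # marginalize the joint p (a dict: 0/1-tuple over V -> probability) onto M
    pM = {}
    for x, pr in p.items():
        key = tuple(x[V.index(v)] for v in M)
        pM[key] = pM.get(key, 0.0) + pr
    total = 0.0
    for xM, pr in pM.items():
        sign = (-1) ** sum(xM[M.index(v)] for v in L)
        total += sign * np.log(pr)
    return total / 2 ** len(M)

# a strictly positive joint distribution over three binary variables
V = [1, 2, 3]
rng = np.random.default_rng(0)
w = rng.random(8) + 0.1
w /= w.sum()
p = {x: w[i] for i, x in enumerate(product([0, 1], repeat=3))}

lam_1313 = mll(p, V, [1, 3], [1, 3])       # one quarter of the 1-3 log-odds ratio
eta_123 = mll(p, V, [1, 2, 3], [1, 2, 3])  # = lambda^V_{123} = eta_{123}
```

Under this convention, `lam_1313` coincides with one quarter of the log-odds ratio computed directly from the 13-margin.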
One way to characterize the main question considered by Bergsma and Rudas (2002) is as follows: given some arbitrary margins p_{M_1}, …, p_{M_k} of a joint distribution p_V, what additional information does one need to smoothly reconstruct the full joint distribution? They show that one possibility is to take the collection of log-linear parameters η_L = λ^V_L where L ⊈ M_i for every i = 1, …, k. It follows that given any sequence of margins M_1, …, M_k = V which respects inclusion (i.e. M_i ⊆ M_j only if i < j), we can smoothly parameterize the joint distribution by the marginal log-linear parameters of the form λ^{M_i}_L, where L ⊆ M_i but L ⊈ M_j for every j < i. For example, if we take the sequence of margins {1, 2}, {2, 3}, {1, 2, 3}, we obtain a parameterization consisting of the vector λ_P below; the pairs (L, M) are summarized in the adjacent table.

P:  M_i   L
    12    1, 2, 12
    23    3, 23
    123   13, 123
λ_P = (λ^{12}_1, λ^{12}_2, λ^{12}_{12}, λ^{23}_3, λ^{23}_{23}, λ^{123}_{13}, λ^{123}_{123})^T. Now, let P be an arbitrary collection of ordered pairs of subsets of V of the form (L, M) for ∅ ≠ L ⊆ M ⊆ V. We say that P is complete if every non-empty subset of V appears exactly once as the first entry of such a pair. If, in addition, the margins can be ordered so that each effect appears in the first margin of which it is a subset, we say the parameterization is hierarchical. Parameterizations which can be constructed from an inclusion respecting sequence of margins in the manner above are precisely those which are hierarchical.
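Both conditions are easy to check mechanically for small V; the sketch below (the helper functions are our own) tests completeness directly, and hierarchy by brute force over orderings of the margins.

```python
from itertools import combinations, permutations

def nonempty_subsets(V):
    V = sorted(V)
    return {frozenset(c) for r in range(1, len(V) + 1)
            for c in combinations(V, r)}

def is_complete(P, V):
    """Every non-empty subset of V appears exactly once as an effect L."""
    effects = [frozenset(L) for L, M in P]
    return len(effects) == len(set(effects)) and set(effects) == nonempty_subsets(V)

def is_hierarchical(P):
    """Some inclusion-respecting ordering of the margins assigns each
    effect to the first margin of which it is a subset."""
    margins = list({frozenset(M) for L, M in P})
    for order in permutations(margins):
        # the ordering must respect inclusion: no later margin inside an earlier one
        if any(order[j] <= order[i] for i in range(len(order))
               for j in range(i + 1, len(order))):
            continue
        if all(next(N for N in order if frozenset(L) <= N) == frozenset(M)
               for L, M in P):
            return True
    return False

# the parameterization lambda_P above: margins {1,2}, {2,3}, {1,2,3}
P = [({1}, {1, 2}), ({2}, {1, 2}), ({1, 2}, {1, 2}),
     ({3}, {2, 3}), ({2, 3}, {2, 3}),
     ({1, 3}, {1, 2, 3}), ({1, 2, 3}, {1, 2, 3})]
```

Applied to the parameterization λ_P above, both checks succeed; moving the effect 2 into the margin 123 breaks hierarchy but not completeness.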
Given P, define λ = λ_P(p) = (λ^M_L : (L, M) ∈ P) to be the corresponding vector of marginal log-linear parameters. Bergsma and Rudas (2002) show that any hierarchical and complete collection of marginal log-linear parameters gives a smooth parameterization of the joint distribution; in addition, they show that completeness is necessary for smoothness. However, there are examples of parameterizations which are not hierarchical but which are smooth (Forcina, 2012). In fact, to our knowledge, no example has been uncovered of a complete parameterization which is non-smooth.
The following proposition explains our general approach for demonstrating the smoothness of particular collections of parameters.
Proposition 2.2. A collection λ_P of 2^{|V|} − 1 marginal log-linear parameters is a smooth parameterization of p_V if and only if there is a differentiable map which recovers the distribution from the parameters.
Sketch. By hypothesis there is a differentiable map λ_P ↦ p_V onto the collection of positive joint distributions, and all marginal log-linear parameters are (infinitely) differentiable functions of the joint distribution, so the map in the reverse direction is also differentiable. Since the dimensions are the same, it follows that the Jacobians of both maps have full rank everywhere (see, e.g., Kass and Vos, 1997, Corollary A.3). In fact, since the map from probabilities to MLL parameters is infinitely differentiable, so is its inverse by the inverse function theorem.
Hence, we need only find a differentiable inverse map which recovers the joint probabilities from λ P in order to demonstrate that the parameterization is smooth. This forms the basis of most of the results in the rest of the paper.

Conditional Distributions
In order to parameterize a marginal distribution p_M, we can use the marginal log-linear parameters {λ^M_L : L ∈ P(M)}, where P(M) is the collection of non-empty subsets of M. In this section we take an analogous approach with conditional distributions. Define

C(A | B) ≡ {L ⊆ A ∪ B : L ∩ A ≠ ∅},

that is, all the subsets of A ∪ B which contain some element of A. Then, for example, C(12 | 3) = {1, 2, 12, 13, 23, 123}. Note that P(A) = C(A | ∅).
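The collection C(A | B) is easy to enumerate directly; a small sketch (the helper names are our own):

```python
from itertools import combinations

def nonempty_subsets(S):
    """All non-empty subsets of S, as frozensets."""
    S = sorted(S)
    return [frozenset(c) for r in range(1, len(S) + 1)
            for c in combinations(S, r)]

def C(A, B):
    """All non-empty subsets L of A ∪ B with L ∩ A non-empty."""
    A, B = frozenset(A), frozenset(B)
    return {L for L in nonempty_subsets(A | B) if L & A}
```

For instance, `C({1, 2}, {3})` returns the six sets listed above, and `C(A, set())` coincides with the non-empty subsets of A.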
Lemma 3.1. The marginal log-linear parameters λ_{A|B} ≡ (λ^{A∪B}_L : L ∈ C(A | B)) smoothly parameterize the conditional distribution of X_A | X_B.
Since the proof is rather easy we omit it, and in fact the result is a (very useful) corollary of Theorem 3.3 below. Note that, since we have assumed the probabilities are positive, all conditional probabilities are well defined.
Theorem 3.3. Given disjoint subsets A, B, let A_1, …, A_k = A be a sequence of non-empty subsets of A which respects inclusion, and let M_i ≡ A_i ∪ B. Then for fixed x_B, the collection of parameters λ^{M_i}_L, where L ∈ C(A_i | B) and L ∉ C(A_j | B) for every j < i, corresponds to a hierarchical and complete collection of marginal log-linear parameters for the conditional distribution p_{A|B}(· | x_B).
Remark 3.4. This extends Theorem 2 of Bergsma and Rudas (2002), which covers the case B = ∅. The conditions above are exactly analogous to the way a complete and hierarchical parameterization is constructed, but with the additional requirement that the conditioning set B be contained within every margin in the sequence.

Constructing Smooth Parameterizations
The results of the previous section turn out to be invaluable for exploring the relationship between log-linear parameters defined within different margins. One of the difficulties with marginal log-linear parameters is that these relationships are rather opaque. Theorem 3 of Bergsma and Rudas (2002) shows that distinct MLL parameters defining the same effect (i.e. λ^M_L and λ^N_L with M ≠ N) are linearly dependent at certain points of the parameter space. This is because they are in some sense conveying the same information, but from the perspective of two different margins. The following theorem elucidates the exact relationship between two such parameters defined within different margins, and can be used to show the smoothness of some non-hierarchical parameterizations.
Theorem 4.1. Let ∅ ≠ L ⊆ M ⊆ V, and let A ⊆ V \ M be non-empty. Then

λ^M_L = λ^{M∪A}_L + f(λ_{A|M})     (2)

for a smooth function f, which vanishes whenever X_A ⊥⊥ X_L | X_{M\L}. In addition, in the case A = V \ M,

∂λ^M_L / ∂η_K = 2^{−|M|} Σ_{x_V} (−1)^{|x_{L△K}|} p(x_{V\M} | x_M)     (3)

(where the parameters (η_J : J ≠ K) are held fixed).
Proof. We have

λ^M_L = 2^{−|M∪A|} Σ_{x_{M∪A}} (−1)^{|x_L|} log p_M(x_M) = λ^{M∪A}_L − 2^{−|M∪A|} Σ_{x_{M∪A}} (−1)^{|x_L|} log p(x_A | x_M),

using log p_M(x_M) = log p_{M∪A}(x_{M∪A}) − log p(x_A | x_M). Since the second term is a smooth function of the conditional probabilities p(x_A | x_M), it follows from Lemma 3.1 that it is a smooth function of the claimed parameters. The implication of independence follows from Lemma 2.9 of Evans and Richardson (2013).
For the second part, note that ∂ log p_V(x_V)/∂η_K = (−1)^{|x_K|}, and similarly

∂ log p_M(x_M)/∂η_K = Σ_{x'_{V\M}} (−1)^{|x_{K∩M}| + |x'_{K\M}|} p(x'_{V\M} | x_M).

Hence the derivative of (2) in the case A = V \ M becomes

∂λ^M_L/∂η_K = 2^{−|V|} Σ_{x_V} (−1)^{|x_L|} Σ_{x'_{V\M}} (−1)^{|x_{K∩M}| + |x'_{K\M}|} p(x'_{V\M} | x_M)

and, since there is no dependence upon x_{V\M}, this is the same as

2^{−|M|} Σ_{x_V} (−1)^{|x_L| + |x_{K∩M}| + |x_{K\M}|} p(x_{V\M} | x_M).

Then note that |x_L| + |x_{K∩M}| + |x_{K\M}| = |x_L| + |x_K| simply counts the number of 1s in L and in K, so |x_{L△K}| is even if and only if |x_L| + |x_K| is. Hence

∂λ^M_L/∂η_K = 2^{−|M|} Σ_{x_V} (−1)^{|x_{L△K}|} p(x_{V\M} | x_M),

which gives the required result.
Remark 4.2. This result is remarkably useful, because the relationship between parameters corresponding to effects in different margins is, in general, rather complex: however, we have shown that if the conditional distribution of X_A | X_M is fixed, then the relationship between λ^M_L and λ^{M∪A}_L (indeed, between any pair of parameters of the form λ^{M∪B}_L for B ⊆ A) reduces to a linear one.
In particular, if we know p_{A|M}, then λ^{M∪A}_L and λ^M_L become interchangeable as part of the parameterization, preserving smoothness and (when relevant) variation independence.
Example 4.3. Consider P_1 and P_2 below.

P_1:  M_i   L
      23    3, 23
      123   1, 2, 12, 13, 123

P_2:  M_i   L
      23    2, 3, 23
      123   1, 12, 13, 123

These are both complete, but P_1 is non-hierarchical because the effect 2 is given within the margin 123, which in an inclusion respecting ordering must come after the margin 23. It is therefore unclear a priori whether or not λ_{P_1} is smooth. However, given the parameters λ_{1|23} = (λ^{123}_1, λ^{123}_{12}, λ^{123}_{13}, λ^{123}_{123}), Theorem 4.1 shows that λ^{123}_2 and λ^{23}_2 are interchangeable. Hence λ_{P_1} is smooth if and only if λ_{P_2} is also smooth which, since P_2 satisfies the conditions of a hierarchical parameterization, it is. In addition, λ_{P_1} and λ_{P_2} are both variation independent parameterizations (i.e. any λ_{P_1} ∈ R^7 corresponds to a valid probability distribution).
We formalize the approach used in the preceding example with the following definition and proposition.
Definition 4.4. Let P be a collection of MLL parameters, and for v ∈ V define

P_{−v} ≡ {(L, M \ {v}) : (L, M) ∈ P, v ∉ L}.

That is, all effects involving v are removed, and any margin M containing v is replaced by M \ {v}.
Proposition 4.5. Let P be a complete collection of marginal log-linear parameters over V such that the variable v is not contained in any margin except V. Then λ_P is a smooth parameterization of X_V if and only if λ_{P−v} is a smooth parameterization of X_{V\v}. In addition, λ_P is variation independent if and only if λ_{P−v} is.
Proof. Since V is the only margin containing v and the parameterization is complete, we have the parameters λ_{v|V\v} = (λ^V_L : v ∈ L). Hence, by Lemma 3.1 we can smoothly parameterize the distribution of X_v | X_{V\v} with these parameters.
By Theorem 4.1, any other parameter λ^V_L with v ∉ L is (having fixed the distribution of X_v | X_{V\v}) a smooth function of λ^{V\v}_L. It follows that we have a smooth map between λ_P and (λ_{P−v}, λ_{v|V\v}). Since λ_{P−v} is a function of p_{V\v}, and λ_{v|V\v} smoothly parameterizes p_{v|V\v}, it follows that λ_P smoothly parameterizes p_V if and only if λ_{P−v} smoothly parameterizes p_{V\v}.
Lastly, the two pieces λ_{P−v} and λ_{v|V\v} are variation independent of one another, and the parameters within λ_{v|V\v} are themselves variation independent, since they are just ordinary log-linear parameters; therefore λ_{P−v} is variation independent if and only if λ_P is.
For example, if v = 1 appears in no margin other than 123, then P reduces to P_{−1}, a collection over {2, 3}; if this is a hierarchical parameterization of the margin 23, then λ_P is smooth.
Proposition 4.5 leads to the following corollary, which is equivalent to Lemma 6 of Forcina (2012). To our knowledge, this was previously the only analytical result showing that a non-hierarchical MLL parameterization is smooth.
Corollary 4.7. Any complete parameterization in which the margins are strictly nested, M_1 ⊂ M_2 ⊂ ⋯ ⊂ M_k = V, is smooth. In particular, any complete parameterization on precisely two different margins is smooth.
A second corollary shows that we may use MLL parameters to smoothly parameterize interventional distributions, which come in the form of products of conditional distributions.
Corollary 4.8. Let A_i, B_i for i = 1, …, k be a sequence of sets such that A_i ∪ B_i ⊂ B_{i+1} for each i = 1, …, k − 1, and let A_k ∪ B_k ⊆ V. Then the joint distribution of X_V is smoothly parameterized by the conditional distributions p(x_{A_i} | x_{B_i}) together with the log-linear parameters (η_L : L ∉ ∪_i C(A_i | B_i)).

Example 4.9. Suppose we wish to parameterize the joint distribution of four binary variables X_1, X_2, X_3, X_4, but are particularly interested in the piece defined by the product p(x_1) · p(x_3 | x_1, x_2). Such products of conditionals arise in the context of distributions which have been reweighted according to some causal order (see, e.g. Hernán and Robins, 2015). By Lemma 3.1, we can parameterize this piece with λ^1_1, λ^{123}_3, λ^{123}_{13}, λ^{123}_{23}, λ^{123}_{123}. Then, by Corollary 4.8, we can parameterize the joint distribution by adding the log-linear parameters corresponding to the other subsets within the full margin. We obtain

P:  M_i   L
    1     1
    123   3, 13, 23, 123
    1234  2, 12, 4, 14, 24, 124, 34, 134, 234, 1234

Then λ_P is a smooth (variation independent) parameterization of the joint distribution of X_1, X_2, X_3, X_4.

More Conditioning
If, for some v ∈ V, we always have the parameters λ^M_A and λ^M_{A∪{v}} available within the same margin M, we can effectively reduce the problem to one of parameterizing X_{V\v} | X_v. This result complements Proposition 4.5.
Proposition 4.10. Let P be a complete parameterization, and suppose that for some v ∈ V and every A ⊆ V \ {v}, the effects A ∪ {v} and A appear within the same margin in P.
Then λ_P is a smooth parameterization of X_V if and only if λ_{P−v} is a smooth parameterization of X_{V\v}. In addition, λ_P is variation independent if and only if λ_{P−v} is variation independent.
Proof. Since A ⊆ V \ {v} and A ∪ {v} always appear in the same margin, for each (A, M) ∈ P with v ∉ A set κ^M_A(x_v) to be the log-linear parameter for A computed within the conditional distribution given X_v = x_v. Then, as in the proof of Theorem 3.3, for fixed x_v the parameters (κ^M_A(x_v) : (A, M) ∈ P, v ∉ A) form a complete MLL collection of the form P_{−v} for the conditional distribution of X_{V\v} | X_v = x_v. If λ_{P−v} is smooth then we can smoothly recover the conditional distribution p_{V\v|v}. Furthermore, if the effect for v lies in a margin N ∪ {v}, then using (2) we can smoothly recover λ^v_v. In addition, λ^v_v is variation independent of p_{N|v} (since (p_v, p_{N|v}) constitutes a parameter cut) and has range R, so the same is true of λ^{N∪{v}}_v. Conversely, if λ_P is smooth, then given parameters λ_{P−v} we can set up a dummy distribution p_V in which κ^M_A(x_v) = λ^M_A for each x_v, and λ^{N∪{v}}_v = 0, thus smoothly recovering p_{V\v}.
Example 4.12. Suppose we wish to test the smoothness of the model defined by a pair of conditional independences; these can be imposed by setting the parameters in a collection P to zero (Rudas et al., 2010, Lemma 1). One way to prove that such a model is a curved exponential family is to embed it within a complete parameterization which is known to be smooth (Bergsma and Rudas, 2002, Theorem 5). In this case there is clearly no hierarchical parameterization containing these parameters, but we can instead embed them into a complete collection Q. This satisfies the conditions of Proposition 4.10 with v = 1, so it reduces to Q_{−1}, which is hierarchical. Hence λ_Q is smooth, and the model defined by λ_P = 0 is a curved exponential family.

Fixed Point Mappings
The results above are useful but, as far as demonstrating smoothness goes, they apply only to a relatively small set of parameterizations; in this section we build on Section 4 to develop these tools further.
Given a particular complete MLL parameterization λ_P, the identity (2) in Theorem 4.1 can be written in vector form as λ = η + f(η) for a smooth map f. Given λ, this suggests that η might be recovered by using fixed point methods, and the identity (3) gives us information about the Jacobian of the map f. For instance, consider a parameterization using the margins 12, 13 and 123: since η_{12}, η_{13}, η_{123} are given in the parameterization we can assume these to be fixed, so, abusing notation slightly, η_2 = λ^{12}_2 + h(η_3, η_{23}) for some smooth h. Similarly, η_1 = λ^{13}_1 + g(η_2, η_{23}) for some smooth g, so η is a solution to the fixed point equation η = Ψ(η) ≡ λ − f(η). If Ψ can be shown to be a contraction, then we are guaranteed to find a unique solution, and therefore recover the joint distribution.
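The iteration itself is elementary. The toy sketch below illustrates the scheme η ← λ − f(η) with an arbitrary smooth stand-in for f (our tanh map is purely illustrative and is not the f determined by Theorem 4.1), chosen to have Lipschitz constant below one so that the update is a contraction:

```python
import numpy as np

def f(eta):
    # illustrative smooth map with Lipschitz constant 0.3 < 1;
    # in the paper f is determined by identity (2), not by this formula
    return 0.3 * np.tanh(eta)

lam = np.array([0.5, -1.2, 0.8])   # 'observed' parameters lambda
eta = np.zeros(3)                  # arbitrary starting point
for _ in range(200):
    eta = lam - f(eta)             # Banach iteration eta <- Psi(eta)

# eta now satisfies eta + f(eta) = lam to machine precision
```

Because the update is a contraction, the starting point is irrelevant and convergence is geometric; this is exactly the guarantee the contraction arguments below provide for the genuine map Ψ.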
Define ε ≡ min_{x_V} p(x_V) to be the smallest amount of probability assigned to any cell in our joint distribution, and Δ_ε ≡ {p : min_{x_V} p(x_V) ≥ ε} to be the probability simplex consisting of such distributions. The Jacobian of an otherwise smooth parameterization can become singular on the boundary of the probability simplex, so it is useful to have control over this quantity.
Lemma 5.2 (proved in the appendix) allows us to control the magnitude of the columns (or rows) of the Jacobian of Ψ in certain examples. In the example above, the entries of the Jacobian of Ψ consist of terms of the form (3), which by applying the two parts of Lemma 5.2 each have magnitude at most 1 − ε. Hence |Ψ′(x)| ≤ 1 − ε, and Ψ is a contraction on Δ_ε for every ε > 0. It follows that the equation η = Ψ(η) has a unique solution among all positive probability distributions (and this can be found by iteratively applying Ψ to any initial point), and by the inverse function theorem the solution is a smooth function of λ. Hence λ_P is indeed smooth.
Remark 5.4. Forcina (2012) also uses fixed point methods to recover distributions from marginal log-linear parameters, but that approach is quite distinct from this one. We discuss those methods in Section 6.
Lemma 5.2 enables us to formulate the following general smoothness result, which formalizes the idea used in the example above.
Lemma 5.5. Let P be complete and such that for any effect (L, M) ∈ P with M ≠ V, there is at most one other margin N ≠ V in P with L ∩ (V \ N) ≠ ∅. Then λ_P is smooth.
Proof. By Theorem 4.1, λ^M_L = η_L + f(λ_{V\M|M}) for a smooth function f. Since N is the only other margin in P such that L ∩ (V \ N) ≠ ∅, it follows that all the effects in λ_{V\M|M} are known and fixed except for (λ^V_K : K ∈ L_N), where L_N is the set of effects contained in the margin N. Hence

η_L = λ^M_L + g_L(η_K : K ∈ L_N)     (4)

for some smooth function g_L. Now, consider the vector equation obtained by stacking (4) over all pairs (L, M) ∈ P. This defines a fixed point equation whose solution is η, and the column of the Jacobian corresponding to L has non-zero entries given by (3). From Lemma 5.2, each column has magnitude at most 1 − ε, and therefore the mapping is a contraction on Δ_ε for each ε > 0. It follows that the fixed point equation has a unique solution which, by the inverse function theorem, is a smooth function of λ.
From this result we produce a corollary which is easier to digest.
Corollary 5.6. Any complete parameterization with at most three margins is smooth.
Proof. Since one of the margins must be V , we can apply the previous Lemma.

Cyclic Parameterizations
This section takes a slightly different approach to determining smoothness, but one that is particularly applicable to conditional independence models.
Forcina (2012, Example 2) considers the model defined (up to some relabelling) by a 'loop' of conditional independences, which is equivalent to setting marginal log-linear parameters within the margins 123, 124 and 134 to zero. Note that we cannot embed these parameters into a larger hierarchical parameterization, because each pairwise effect will 'belong' to a margin preceding it; for example, 12 is a subset of both 123 and 124, so for hierarchy the margin 123 must precede 124; by a similar argument, 124 must precede 134, which must precede 123. We therefore have a kind of 'cyclic' parameterization, referred to as a 'loop' by Forcina. None of the methods used in the previous sections seem well suited to dealing with these parameterizations. Forcina (2012) presents an algorithm for recovering joint distributions given parameterizations of this kind, together with a condition under which it is guaranteed to converge to the unique solution. However, this condition is on the spectral radius of a complicated Jacobian, and is difficult to verify except in a few special cases: a numerical test is suggested, but this does not constitute a proof of smoothness. Here we show that, at least in some cases, Forcina's algorithm can be recast as a Markov chain whose stationary distribution is some margin of the relevant probability distribution.
Theorem 6.1. Let A_1, …, A_k be pairwise disjoint sets with k ≥ 2, and suppose that the conditional distributions p(x_{A_i} | x_{A_{i−1}}) > 0 for i = 2, …, k are known, together with p(x_{A_1} | x_{A_k}). Then the marginal distributions p(x_{A_i}) are smoothly recoverable.
Proof. Define a |X_{A_1}| × |X_{A_1}| matrix M with entries

M_{y,x} = Σ_{x_{A_2}, …, x_{A_k}} p(x_{A_2} | y) p(x_{A_3} | x_{A_2}) ⋯ p(x_{A_k} | x_{A_{k−1}}) p(x | x_{A_k}).

This is a (right) stochastic matrix with strictly positive entries, and the marginal distribution p(x_{A_1}) satisfies

Σ_y p(y) M_{y,x} = p(x)  for every x ∈ X_{A_1}.

In other words, p(x_{A_1}) is an invariant distribution for the Markov chain with transition matrix defined by M. Since M has a finite state-space and all transition probabilities are positive, the chain is positive recurrent and the invariance equations have a unique solution (see, e.g. Norris, 1997). Hence p(x_{A_1}) is determined by the kernel of the matrix I − M^T, and this is a smooth function of the original conditional probabilities.
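The construction in this proof is straightforward to simulate. The sketch below (all names our own) takes k = 3 with A_i = {i} and binary variables, extracts the cyclic conditionals from a known joint distribution, and recovers the margin p(x_1) by power iteration:

```python
import numpy as np

rng = np.random.default_rng(1)

# a strictly positive joint distribution over binary X1, X2, X3,
# indexed as p[x1, x2, x3]
p = rng.random((2, 2, 2)) + 0.1
p /= p.sum()

# the cyclic conditionals p(x2 | x1), p(x3 | x2), p(x1 | x3)
p12, p23, p13 = p.sum(axis=2), p.sum(axis=0), p.sum(axis=1)
p2_1 = p12 / p12.sum(axis=1, keepdims=True)       # rows indexed by x1
p3_2 = p23 / p23.sum(axis=1, keepdims=True)       # rows indexed by x2
p1_3 = (p13 / p13.sum(axis=0, keepdims=True)).T   # rows indexed by x3

# transition matrix M[y, x] = sum_{x2, x3} p(x2 | y) p(x3 | x2) p(x | x3)
M = p2_1 @ p3_2 @ p1_3

# power iteration: repeatedly right-multiply a positive vector by M
pi = np.full(2, 0.5)
for _ in range(200):
    pi = pi @ M
# pi converges to the stationary distribution, i.e. the margin p(x1)
```

Repeated right-multiplication by M is exactly the iterative solution described in Remark 6.2; in this toy case it recovers the margin p(x_1) to high accuracy.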
Remark 6.2. The Markov chain corresponding to M is that which would be obtained by picking some value of X_{A_1}, and then evolving X_{A_i} using p(x_{A_i} | x_{A_{i−1}}) until we return to i = 1. The invariance equations can be solved iteratively by repeatedly right multiplying any positive vector by M, so that it converges to the stationary distribution of the chain; this corresponds precisely to Forcina's algorithm.

Example 6.3. Consider the complete parameterization P below.

P:  M_i   L
    12    1, 12
    23    2, 23
    13    3, 13
    123   123

By Lemma 3.1 the first three margins are equivalent to the conditional distributions p_{1|2}, p_{2|3} and p_{3|1}. Using the conditionals in the manner suggested by Theorem 6.1, we can smoothly recover (for example) the margin p_3, and consequently P is equivalent to a hierarchical parameterization P′.
Example 6.4. The parameters (6) can be embedded in a complete parameterization P, which satisfies the conditions of Proposition 4.10 with v = 1 and so reduces to P_{−1}; this is isomorphic to the smooth parameterization in Example 6.3. Hence P is smooth, and the conditional independence model (5) is a curved exponential family.
Example 6.5. Consider the model defined by a similar loop of conditional independences over four variables; it consists of setting a collection P of marginal log-linear parameters to zero, and we can embed P in a complete parameterization Q. Note that using λ^{14}_4, λ^{14}_{14} and the fact that X_4 ⊥⊥ X_2 | X_1, we can construct the conditional distribution p(x_4 | x_1, x_2). Similarly we obtain p(x_3 | x_2, x_4), p(x_1 | x_3, x_4) and p(x_2 | x_1, x_3). In a manner analogous to the previous example, we can set up a Markov chain whose stationary distribution is the margin p(x_1, x_2) as follows. First pick values x_1^{(0)}, x_2^{(0)}; then, for each i ≥ 0, draw

x_4^{(i)} ~ p( · | x_1^{(i)}, x_2^{(i)}),   x_3^{(i)} ~ p( · | x_2^{(i)}, x_4^{(i)}),
x_1^{(i+1)} ~ p( · | x_3^{(i)}, x_4^{(i)}),   x_2^{(i+1)} ~ p( · | x_1^{(i+1)}, x_3^{(i)}).

Then the distribution of (x_1^{(i)}, x_2^{(i)}) converges to p_{12}. We can therefore smoothly recover a distribution satisfying the conditional independence constraints from the 7 free parameters. The dimension of the model is full, so we have a smooth parameterization of the model, which is therefore a curved exponential family (Lauritzen, 1996).
Note that the construction of the Markov chain above is only possible when the conditional independence constraints hold, so in this case we have not actually demonstrated that λ_Q is smooth in general, only that the model defined by setting λ_P = 0 is a curved exponential family.
We remark that all discrete conditional independence models on four variables either require repeated effects to be constrained in different margins, or can be shown to be smooth using the results of this section. However, the next example shows that for five variables the picture is incomplete.
Example 6.7. On five variables, there is a conditional independence model whose defining collection of marginal log-linear parameters contains no repeated effects, and yet which does not seem approachable using the methods outlined above. Empirically, Forcina's algorithm still seems to converge to the correct solution, which suggests that the model is indeed smooth.

Discussion
We have presented a variety of approaches to demonstrating that non-hierarchical but complete marginal log-linear parameterizations are smooth, but the picture is still incomplete. There are 104 complete MLL parameterizations on three variables, of which 23 are hierarchical. A further 9 can be shown to be smooth using Proposition 4.5, and one using Proposition 4.10. Another 26 can be dealt with using Lemma 5.5 in combination with other methods. Example 6.3 brings the total to 60 smooth parameterizations.
Consider P_1 and P_2 below.
P_1:  M_i   L
      1     1
      12    2
      13    3
      123   12, 13, 23, 123

P_2:  M_i   L
      1     1
      123   2, 12, 3, 13, 23, 123

Although P_1 does not satisfy the conditions of Lemma 5.5 directly, one can use the basic idea of this result to set up a smooth contraction mapping to the parameterization P_2; since P_2 is hierarchical, both parameterizations are smooth. This method shows that three more parameterizations are smooth, a total of 63. In addition, of the remaining 41 complete parameterizations, there are smooth mappings between a group of four and a group of three, so it remains to establish the smoothness (or otherwise) of at most 36 distinct parameterizations.
We conjecture that any complete parameterization is smooth, a result which would enable us to show that models such as that given in Example 6.7 are curved exponential families of distributions.


Appendix

Lemma A.1. Let (d_A : A ⊆ K) be real numbers indexed by the subsets of a set K with |K| = k, and suppose that, for every B ⊆ K,

|Σ_{A⊆K} (−1)^{|A∩B|} d_A| < 1.

Then the vector d = (d_A : A ⊆ K) has Euclidean length less than 1.
Proof. The 2^k × 2^k matrix M with (B, A)th entry M_{B,A} = 2^{−k/2} (−1)^{|A∩B|} is orthogonal, and therefore preserves vector lengths. The vector Md has entries with magnitude at most 2^{−k/2}, and therefore has total magnitude at most 1. The same is therefore true of d.
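The orthogonality of this sign matrix is easily verified numerically for small k (the enumeration of subsets is our own):

```python
import numpy as np
from itertools import combinations

k = 3
subsets = [frozenset(c) for r in range(k + 1)
           for c in combinations(range(k), r)]

# M_{B,A} = 2^{-k/2} * (-1)^{|A ∩ B|}
M = np.array([[(-1) ** len(A & B) for A in subsets]
              for B in subsets]) / 2 ** (k / 2)

# M is orthogonal (M @ M.T = I), so it preserves Euclidean length
```

The rows are the 2^k Walsh-type sign patterns, which are mutually orthogonal; this is the length-preservation property the proof relies on.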
Proof of Lemma 5.2. For C ⊆ M, let d_C denote the relevant normalized alternating sum of conditional probabilities appearing in (3). Each such sum is an alternating sum of probabilities which themselves sum to one, so it has absolute value at most 1 − ε. The result then follows from Lemma A.1. The second part is essentially identical, due to the symmetry between L and K in (3).