On the role of the overall effect in exponential families

Exponential families of discrete probability distributions when the normalizing constant (or overall effect) is added or removed are compared in this paper. The latter setup, in which the exponential family is curved, is particularly relevant when the sample space is an incomplete Cartesian product or when it is very large, so that the computational burden is significant. The lack or presence of the overall effect has a fundamental impact on the properties of the exponential family. When the overall effect is added, the family becomes the smallest regular exponential family containing the curved one. The procedure is related to the homogenization of an inhomogeneous variety discussed in algebraic geometry, of which a statistical interpretation is given as an augmentation of the sample space. The changes in the kernel basis representation when the overall effect is included or removed are derived. The geometry of maximum likelihood estimates, also allowing zero observed frequencies, is described with and without the overall effect, and various algorithms are compared. The importance of the results is illustrated by an example from cell biology, showing that routinely including the overall effect leads to estimates which are not in the model intended by the researchers.


Introduction
This paper deals with exponential families of probability distributions over discrete sample spaces. When defining such families, usually, a normalizing constant, which of course, is constant over the sample space but not over the family, is included. The presence of the normalizing constant implies that the parameter space may be an open set, which, in turn, is necessary for asymptotic normality of estimates and for the applicability of standard testing procedures. The normalizing constant, from an applied perspective, may be interpreted as a baseline or common effect, present everywhere on the sample space and is, therefore, also called the overall effect. The focus of the present work is to better understand the implications of having or not having an overall effect in such families, in particular how adding or removing the overall effect affects the properties of discrete exponential families.
Motivated by a number of important applications, ;  developed the theory of relational models, which generalize discrete exponential families, also called log-linear models, to situations when the sample space is not necessarily a full Cartesian product, the statistics defining the exponential family are not necessarily indicators of cylinder sets, and the overall effect is not necessarily present. Exponential families without the overall effect are particularly relevant, sometimes necessary, when the sample space is a proper subset of a Cartesian product. Several real examples, when certain combinations of the characteristics were either not possible logically or were left out from the design of the experiment were discussed in . A real problem of this structure from cell biology is analyzed in this paper, too. When the overall effect is not present, the standard normalization procedure to obtain probability distributions cannot be applied, because the family is curved . When, in spite of this, the standard normalization procedure is applied, as was done in this analysis, the resulting estimates do not possess the fundamental model properties.
The standardization of the estimates in exponential families is also an issue when the problem is very large and the computational burden is significant. Some neural probabilistic language models are relational models. Because of the high-dimensional sample space, the evaluation of the partition function, which is needed for normalization, may be intractable. Some methods of parameter estimation under such models are based on the removal of the partition function, that is, on removing the overall effect from the model and training the model without it. Approximations of estimates with and without the overall effect were studied, for example, by Mnih & Teh (2012) and Andreas & Klein (2015). A different approach to avoiding global normalization (i.e., having an overall effect) is described in Koller & Friedman (2009). However, the implications of the removal of the overall effect are not discussed in the existing literature.
Another area where removing or including the overall effect is relevant is that of context specific independence models, see, e.g., Højsgaard (2004) and Nyman, Pensar, Koski, & Corander (2016). When the sample space is an incomplete Cartesian product, removing the overall effect, as described in this paper, specifies different variants of conditional independence in the parts of the sample space, depending on whether or not the part is affected by the missing cells.
While including the overall effect in the definition of the statistical model to be investigated is seen by many researchers as "natural" or "harmless", we show in this paper that adding or removing the overall effect may dramatically change the characteristics of the exponential family, up to the point of altering the fundamental model property intended by the researcher.
The main results of the paper include showing that allowing the overall effect expands the curved exponential family to the smallest regular exponential family which contains it. When the overall effect is removed, the sample space may have to be reduced (if there were cells which contained the overall effect only), and the changes in the structure of the generalized odds ratios defining the model are described in both cases. In the language of algebraic geometry, the procedure of removing the overall effect is identical to the dehomogenization of the variety defining the model (Cox, Little, & O'Shea, 2007). An important area of applications of the results presented here is the case when several binary features are observed, but the combination in which no feature is present is either logically impossible or is possible but was left out from the study design. The converse of dehomogenization, that is, homogenizing a variety, involves including a new variable, and it is shown that in some cases this can be identified, from a statistical perspective, with augmenting the sample space by a cell which is characterized by no feature being present. For example, the Aitchison-Silvey independence (Aitchison & Silvey, 1960; Klimova & Rudas, 2015) is homogenized, through the augmentation of the sample space, into the standard independence model.
The paper is organized as follows. Section 2 gives a canonical definition of relational models using homogeneous and, if there is no overall effect included, one non-homogeneous generalized odds ratio, called the dual representation, and shows that including the overall effect is identical to omitting the non-homogeneous generalized odds ratio from it.
Section 3 contains the result that including the overall effect expands the curved exponential family into the smallest regular one containing it. For the case of the removal of the overall effect, the dual representation of the model is given, and the relevance of certain results in algebraic geometry to the statistical problem is discussed. In particular, the homogenization of a variety through including a new variable is identified with augmenting the sample space with a cell where no feature is present, when this is meaningful. It is proved that the homogenization of the Aitchison-Silvey (in the sequel, AS) independence model, which is defined on sample spaces where all combinations of features, except for the "no feature present" combination, are possible, is the usual model of mutual independence on the full Cartesian product obtained after augmenting the sample space with the missing cell. The relationship of these results with context specific independence is also described.
Section 4 compares the maximum likelihood (ML) estimates in geometrical terms for relational models with and without the overall effect and, based on the insight obtained, gives a modification of the algorithm proposed in Klimova & Rudas (2015). It is illustrated that the ML estimates under two models which differ only in the lack or presence of the overall effect may be very different, up to the point of the existence or non-existence of positive ML estimates when the data contain observed zeros. However, when the MLE exists in the model containing the overall effect, it also exists in the model obtained after the removal of the overall effect.
Finally, Section 5 discusses an example of applications of relational models in cell biology. The equal loss of potential model in hematopoiesis (Perié et al., 2014) is a relational model without the overall effect. The published analysis of this model added the overall effect to it, to simplify calculations, and with this changed the properties of the model so that the published estimates do not fulfill the fundamental model property.

A canonical form of relational models
Let $Y_1, \dots, Y_K$ be random variables taking values in finite sets $\mathcal{Y}_1, \dots, \mathcal{Y}_K$, respectively. Let the sample space $\mathcal{I}$ be a non-empty, proper or improper, subset of $\mathcal{Y}_1 \times \cdots \times \mathcal{Y}_K$, written as a sequence of length $I = |\mathcal{I}|$ in the lexicographic order. Assume that the population distribution is parameterized by cell probabilities $p = (p_1, \dots, p_I)$, where $p_i \in (0, 1)$ and $\sum_{i=1}^{I} p_i = 1$, and denote by $\mathcal{P}$ the set of all strictly positive distributions on $\mathcal{I}$. For simplicity of exposition, a distribution in $\mathcal{P}$ will be identified with its parameter, $p$, and
$$\mathcal{P} = \Big\{p > 0 : \sum_{i=1}^{I} p_i = 1\Big\}.$$
Let $A$ be a $J \times I$ matrix of full row rank with 0-1 elements and no zero columns. A relational model for probabilities $RM(A)$ generated by $A$ is the subset of $\mathcal{P}$ that satisfies:
$$RM(A) = \{p \in \mathcal{P} : \log p = A'\theta\},$$
where $\theta = (\theta_1, \dots, \theta_J)'$ is the vector of log-linear parameters of the model. A dual representation of a relational model can be obtained using a matrix, $D$, whose rows form a basis of $\mathrm{Ker}(A)$, and thus, $DA' = O$:
$$RM(A) = \{p \in \mathcal{P} : D \log p = 0\}.$$
The number of the degrees of freedom $K$ of the model is equal to $\dim(\mathrm{Ker}(A))$. In the sequel, $d_1, d_2, \dots, d_K$ denote the rows of $D$. The dual representation can also be expressed in terms of the generalized odds ratios:
$$p^{d_k^+} / p^{d_k^-} = 1, \quad k = 1, \dots, K,$$
or in terms of the cross-product differences:
$$p^{d_k^+} - p^{d_k^-} = 0, \quad k = 1, \dots, K,$$
where $p^{d} = \prod_{i=1}^{I} p_i^{d_i}$, and $d^+$ and $d^-$ stand for, respectively, the positive and negative parts of a vector $d$. The following dual representation is invariant of the choice of the kernel basis. Let $X_A$ denote the polynomial variety associated with $A$ (Sturmfels, 1996):
$$X_A = \{p \in \mathbb{R}^{I}_{\geq 0} : p^{d^+} = p^{d^-}, \ \forall d \in \mathrm{Ker}(A)\}.$$
The relational model generated by $A$ is the following set of distributions:
$$RM(A) = X_A \cap \mathrm{int}(\Delta_{I-1}),$$
where $\mathrm{int}(\Delta_{I-1})$ is the interior of the $(I - 1)$-dimensional simplex. Notice that the variety $X_A$ includes elements $p$ with zero components as well and can be used to extend the definition of the model to allow zero probabilities. The extended relational model, $\overline{RM}(A)$, is the intersection of the variety $X_A$ with the probability simplex:
$$\overline{RM}(A) = X_A \cap \Delta_{I-1}. \qquad (7)$$
See Klimova & Rudas (2016) for more detail on the extended relational models.
Let $\mathbf{1} = (1, \dots, 1)$ be the row of 1's of length $I$. If $\mathbf{1}$ does not belong to the space spanned by the rows of $A$, the relational model $RM(A)$ is said to be a model without the overall effect. Such models are specified using homogeneous and at least one non-homogeneous generalized odds ratios, and the corresponding variety $X_A$ is non-homogeneous.
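Whether a given model matrix carries the overall effect can be checked mechanically by comparing ranks: the row of 1's lies in the row space of $A$ exactly when appending it does not increase the rank. The following is a minimal sketch in plain Python; the helper names `rank` and `has_overall_effect` and the two illustrative matrices (an AS independence matrix on three cells and the independence matrix on a 2 × 2 table) are assumptions of this sketch, not notation from the paper.

```python
from fractions import Fraction

def rank(rows):
    """Rank of a matrix (given as a list of rows) by exact Gaussian elimination."""
    m = [[Fraction(x) for x in r] for r in rows]
    rk = 0
    for col in range(len(m[0])):
        # find a pivot in this column among the rows not yet used
        piv = next((i for i in range(rk, len(m)) if m[i][col] != 0), None)
        if piv is None:
            continue
        m[rk], m[piv] = m[piv], m[rk]
        for i in range(len(m)):
            if i != rk and m[i][col] != 0:
                f = m[i][col] / m[rk][col]
                m[i] = [a - f * b for a, b in zip(m[i], m[rk])]
        rk += 1
    return rk

def has_overall_effect(A):
    """True when the row of 1's lies in the row space of A."""
    ones = [1] * len(A[0])
    return rank(A + [ones]) == rank(A)

# Illustrative matrices (assumed for this sketch).
A_as = [[1, 0, 1], [0, 1, 1]]                          # cells (0,1), (1,0), (1,1)
A_indep = [[1, 1, 1, 1], [0, 1, 0, 1], [0, 0, 1, 1]]   # cells (0,0), (0,1), (1,0), (1,1)

print(has_overall_effect(A_as), has_overall_effect(A_indep))
```

Exact rational arithmetic is used so that the rank comparison is not affected by rounding.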
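Proposition 1. Let $RM(A)$ be a model without the overall effect. There exists a kernel basis matrix $D$ whose rows satisfy:
$$d_1\mathbf{1}' \neq 0, \qquad d_2\mathbf{1}' = \dots = d_K\mathbf{1}' = 0. \qquad (8)$$
Proof. A relational model does not contain the overall effect if and only if it can be written using non-homogeneous (and possibly homogeneous) generalized odds ratios. Therefore, $D$ has at least one row, say $d_1$, that is not orthogonal to $\mathbf{1}$: $C_1 = d_1\mathbf{1}' \neq 0$. Suppose there exists another row, say $d_2$, that is not orthogonal to $\mathbf{1}$, and thus $C_2 = d_2\mathbf{1}' \neq 0$. The vectors $d_1$ and $d_2$ are linearly independent, and so are the vectors $d_1$ and $C_1 d_2 - C_2 d_1$; the latter is orthogonal to $\mathbf{1}$ because $(C_1 d_2 - C_2 d_1)\mathbf{1}' = C_1 C_2 - C_2 C_1 = 0$. Replacing $d_2$ by this vector, and treating any remaining rows that are not orthogonal to $\mathbf{1}$ in the same way, yields a kernel basis of the form (8).
It is assumed in the sequel that $\mathbf{1}$ is not in the row space of $A$. Notice that, because $A$ is a 0-1 matrix without zero columns, this is only possible when $2 \leq J = \mathrm{rank}(A) \leq I - 1$. Throughout the entire paper, the kernel basis matrix $D$ is assumed to satisfy (8), and, without loss of generality, $d_1\mathbf{1}' = -1$.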
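The construction in the proof of Proposition 1 can be sketched in a few lines of Python. The function name `to_form_8`, the model matrix `A`, and the starting kernel basis `D` below are hypothetical and chosen only for illustration; the sketch rescales one row so that it sums to −1 and removes the component along the vector of 1's from the remaining rows.

```python
from fractions import Fraction

def to_form_8(D):
    """Return a kernel basis whose first row sums to -1 and whose other rows sum to 0."""
    D = [[Fraction(x) for x in d] for d in D]
    # pick a row that is not orthogonal to the vector of 1's
    k = next(i for i, d in enumerate(D) if sum(d) != 0)
    D[0], D[k] = D[k], D[0]
    c1 = sum(D[0])
    d1 = [-x / c1 for x in D[0]]             # rescaled so that the row sums to -1
    rest = []
    for d in D[1:]:
        c = sum(d)
        # d + c * d1 sums to c - c = 0 and stays in the kernel
        rest.append([x + c * y for x, y in zip(d, d1)])
    return [d1] + rest

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical model matrix without the overall effect, and a kernel basis
# in which both rows are non-orthogonal to the vector of 1's.
A = [[1, 0, 1, 0], [0, 1, 1, 1]]
D = [[-1, 1, 1, -2], [-1, 0, 1, -1]]

D8 = to_form_8(D)
A_bar = [[1, 1, 1, 1]] + A                   # A augmented with the row of 1's
print([[int(x) for x in d] for d in D8])
```

The rows of `D8` after the first one are orthogonal to every row of `A_bar`, which is the content of Theorem 1 for this small case.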
Some consequences of adding the overall effect to a relational model will be investigated by comparing the properties of the relational model generated by $A$ and the model generated by the matrix $\bar{A}$ obtained by augmenting the model matrix $A$ with the row $\mathbf{1}$:
$$\bar{A} = \begin{pmatrix} \mathbf{1} \\ A \end{pmatrix}.$$
Let $RM(\bar{A})$ be the relational model generated by $\bar{A}$. Because $\mathbf{1}$ is a row of $\bar{A}$, the corresponding polynomial variety $X_{\bar{A}}$ is homogeneous (cf. Sturmfels, 1996).
Theorem 1. The dual representation of RM (Ā) can be obtained from the dual representation of RM (A) by removing the constraint specified by a non-homogeneous odds ratio from the latter.
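Proof. Write the dual representation of $RM(A)$ in terms of the generalized log odds ratios:
$$d_1 \log p = 0, \quad d_2 \log p = 0, \quad \dots, \quad d_K \log p = 0.$$
By the previous assumption, $d_1\mathbf{1}' = -1$, and thus, the constraint $d_1 \log p = 0$ is specified by a non-homogeneous odds ratio. Define $\bar{D}$ as the matrix with rows $d_2, \dots, d_K$. Then $\bar{D}A' = O$ and $\bar{D}\mathbf{1}' = 0$, so that $\bar{D}\bar{A}' = O$, and thus, $d_2, \dots, d_K \in \mathrm{Ker}(\bar{A})$. Finally, as $\mathrm{rank}(\bar{D}) = K - 1 = \dim(\mathrm{Ker}(\bar{A}))$, $d_2, \dots, d_K$ is a basis of $\mathrm{Ker}(\bar{A})$, and therefore,
$$RM(\bar{A}) = \{p \in \mathcal{P} : \bar{D} \log p = 0\}.$$

The influence of the overall effect on the model structure
The consequences of adding or removing the overall effect will be studied separately. The changes in the model structure after the overall effect is added are considered first. Let $RM(A)$ be a relational model without the overall effect and $RM(\bar{A})$ be the corresponding augmented model. Let $A = (a_{ji})$ for $j = 1, \dots, J$, $i = 1, \dots, I$. For any $p \in RM(\bar{A})$:
$$\log p_i = \theta_0 + \sum_{j=1}^{J} a_{ji}\theta_j, \quad i = 1, \dots, I,$$
where $\theta_j = \theta_j(p)$, $j = 0, 1, \dots, J$, are the log-linear parameters of $p$. In particular, $\theta_0(p)$ is the overall effect of $p$.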
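Theorem 2. The augmented model, $RM(\bar{A})$, is the minimal regular exponential family which contains $RM(A)$, and
$$RM(A) = \{p \in RM(\bar{A}) : \theta_0(p) = 0\}.$$
Proof. The second claim is proved first. Denote $M_0 = \{p \in RM(\bar{A}) : \theta_0(p) = 0\}$. Let $D$ be a kernel basis matrix of $A$, having the form (8), and notice that, for any $p \in RM(\bar{A})$,
$$d_1 \log p = \theta_0(p)\, d_1\mathbf{1}' = -\theta_0(p), \qquad \bar{D} \log p = 0.$$
Therefore, any $p \in M_0$ satisfies $D \log p = 0$, and thus, belongs to $RM(A)$. On the other hand, for any $p \in RM(A)$, both $\bar{D} \log p = 0$ and $\theta_0(p) = 0$ must hold, which immediately implies that $p \in M_0$. The first claim is proved next. The relational model $RM(A)$ is a curved exponential family parameterized by
$$\Theta = \{\theta \in \mathbb{R}^J : \mathbf{1}\exp\{A'\theta\} = 1\}.$$
If the overall effect is added to $RM(A)$, the parameter space gets an additional parameter:
$$\Theta_1 = \{(\theta_0, \theta) : \theta \in \mathbb{R}^J, \ \theta_0 = -\log(\mathbf{1}\exp\{A'\theta\})\}.$$
Because $\Theta_1$ is the smallest open set in $\mathbb{R}^J$ that contains $\Theta$, it parameterizes the minimal regular exponential family containing $RM(A)$. This family is, in fact, $RM(\bar{A})$.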
Example 1. The relational models generated by the matrices consist of positive probability distribution which can be written in the following forms: where β 0 is the overall effect. The dual representations can be written in the log-linear form, By Theorem 1, after the overall effect is added, the model specification does not include the non-homogeneous constraint anymore. In terms of the generalized odds ratios: The second model may be defined using restrictions only on homogeneous odds ratios, and there is no need to place an explicit restriction on the non-homogeneous odds ratio.
The changes in the model structure after the overall effect is removed are examined next. A relational model with the overall effect can be reparameterized so that its model matrix has a row of 1's, and because of full row rank, this vector is not spanned by the other rows. The implications of the removal of the overall effect will be investigated using a model matrix of this structure, say $\bar{A}_1$. By the removal of the row $\mathbf{1}$, one may obtain a different model matrix on the same sample space, but it may happen that there exists a cell $i_0$ whose only parameter is the overall effect, so that after the removal the $i_0$-th column contains zeros only. In such cases, to have a proper model matrix, such columns, that is, such cells, need to be removed. Write $\mathcal{I}_0$ for the set of all such cells $i_0$, and let $I_0 = |\mathcal{I}_0|$. Then, the reduced model matrix, $A_1$, is obtained from $\bar{A}_1$ after removing the row of 1's and deleting the columns which, after this, contain only zeros. This is a model matrix on $\mathcal{I} \setminus \mathcal{I}_0$. Without loss of generality, the matrix $\bar{A}_1$ can be written as:
$$\bar{A}_1 = \begin{pmatrix} \mathbf{1}_{I - I_0} & \mathbf{1}_{I_0} \\ A_1 & O \end{pmatrix},$$
where the last $I_0$ columns correspond to the cells in $\mathcal{I}_0$. If the sample spaces of $RM(\bar{A}_1)$ and $RM(A_1)$ are the same, that is, when $\mathcal{I}_0$ is empty, the reduced model is the subset of the original one, consisting of the distributions whose overall effect is zero, see Theorem 2. If the sample space is reduced, the relationship between the kernel basis matrices is described in the next result.
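Theorem 3. The following holds: (i) $\dim(\mathrm{Ker}(A_1)) = \dim(\mathrm{Ker}(\bar{A}_1)) - I_0 + 1$. (ii) The kernel basis matrix $D_1$ of $A_1$ may be obtained from the kernel basis matrix $\bar{D}_1$ of $\bar{A}_1$ by deleting the columns in $\mathcal{I}_0$ and then leaving out the redundant rows.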
Proof. (i) BecauseĀ 1 is a J × I matrix of full row rank, dim(Ker(Ā 1 )) = I − J. The linear independence of its rows implies that the rows of A 1 are also linearly independent. Then, which implies that Suppose A 1 does not have the overall effect. Notice that each v i has length I 0 , and therefore, one can apply a non-singular linear transformation to the basis vectors d 1 , d 2 , . . . , d I−J to reduce them to the form: The equations (11)  The linear independence of d I 0 +1 , . . . , d I−J in R I entails the linear independence of u I 0 +1 , . . . , u I−J in R I−I 0 . Notice that u 1 , . . . , u I 0 are jointly linearly independent from u I 0 +1 , . . . , u I−J , but not necessarily linearly independent from each other. A kernel basis of A 1 comprises I − J − I 0 + 1 linearly independent vectors in Ker(A 1 ), and, for example, u I 0 , u I 0 +1 , . . . , u I−J form such a basis. Therefore, D 1 can be derived from a kernel basis matrix ofĀ 1 by removing the columns for I 0 and leaving out the I 0 − 1 redundant rows.
Remove the row $\mathbf{1}$ and the last two columns and consider the reduced matrix:
$$A_1 = \begin{pmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 1 & 1 \end{pmatrix}.$$
The model RM (A 1 ) does not have the overall effect and can be specified by two generalized odds ratios: p 2 /(p 1 p 4 ) = 1, p 2 /p 4 = 1.
These odds ratios are defined on the smaller probability space, and may be obtained by removing $p_5$ and $p_6$, and the redundant odds ratio, from the odds ratio specification of the original model. In terms of the generalized odds ratios, the model specification is $p_1/p_3 = 1$. Notice that $\bar{A}_1$ is row equivalent to
$$\bar{A}_2 = \begin{pmatrix} 0 & 0 & 0 & 1 \\ 1 & 1 & 1 & 0 \\ 1 & 0 & 1 & 0 \end{pmatrix}.$$
Because every d in Ker(A) is orthogonal to (0, 0, 0, 1), its last component has to be zero: d 4 = 0. Therefore, in any specification of RM (Ā 1 ) in terms of the generalized odds ratios, p 4 will not be present. Set The model RM (A 1 ) has the overall effect and can be specified by exactly the same generalized odds ratio as the model RM (Ā 1 ): p 1 /p 3 = 1.
The polynomial variety $X_{\bar{A}_1}$ defining the model $RM(\bar{A}_1)$ is homogeneous. If the removal of the cells comprising $\mathcal{I}_0$ leads to a model without the overall effect, the variety $X_{\bar{A}_1}$ is dehomogenized, yielding the affine variety $X_{A_1}$ (cf. Cox et al., 2007).
The converse to this procedure, homogenization of an affine variety, is also studied in algebraic geometry, and is performed by introducing a new variable in such a way that all equations defining the variety become homogeneous. The essence of this procedure is that all probabilities are multiplied by this new variable. This leaves the homogeneous odds ratios unchanged, as the new variable cancels out. The value of a non-homogeneous odds ratio becomes, instead of 1, the reciprocal of the new variable. For example, the odds ratio $p_2/(p_1 p_4) = 1$ in Example 2 becomes $v p_2/(v p_1 \cdot v p_4) = 1/v$, where $v$ is the new variable. If now $v$ could be seen as the probability of an additional cell, say $p_0$, then this would be a homogeneous odds ratio, $p_0 p_2/(p_1 p_4) = 1$.
Although a straightforward procedure in algebraic geometry, it does not necessarily have a clear interpretation in statistical inference. Introducing a new variable and a new cell for the purpose of homogenization can be made meaningful in some situations, if the sample space may be extended by one cell, and the new variable is the parameter (probability) of this cell. Homogenization requires this new variable to appear in every cell, too, so the parameter may be seen as the overall effect. The new cell has only the overall effect, thus no feature is present in this cell.
The augmentation of the sample space by an additional cell does make sense, if that cell exists in the population but was not observed because of the design of the data collection procedure, as in Example 4. The additional cell has the overall effect only, thus is a "no feature present" cell.
Example 4. In the study of swimming crabs by Kawamura, Matsuoka, Tajiri, Nishida, & Hayashi (1995), three types of baits were used in traps to catch crabs: fish alone, sugarcane alone, and a fish-sugarcane combination. The sample space consists of three cells, $\mathcal{I} = \{(0, 1), (1, 0), (1, 1)\}$, and the cell $(0, 0)$ is absent by design, because there were no traps without any bait. Under the AS independence, the cell parameter associated with both bait types present is the product of the parameters associated with the other two cells. This is a relational model without the overall effect, generated by the matrix (cf. Klimova & Rudas, 2015):
$$A = \begin{pmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \end{pmatrix}.$$
In fact, the overall effect cannot be included in this situation, because it would saturate the model. The affine variety associated with this model can be homogenized by including a new variable. The new variable may only be interpreted as the parameter associated with no bait present, and calls for an additional cell in the sample space (to avoid the saturation of the model), which may only be interpreted as setting up a trap without any bait, which would also be a plausible research design. The resulting model is generated by $A_0$:
$$A_0 = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 1 \end{pmatrix},$$
and indeed, is the model of traditional independence on the complete 2 × 2 contingency table.
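The relationship between AS independence on three cells and ordinary independence on the completed 2 × 2 table can be checked numerically. The sketch below is only an illustration under stated assumptions: the function name `augment` and the numerical values are hypothetical, and the added cell receives the value 1 before renormalization, which corresponds to the dehomogenization convention $p_0 = 1$.

```python
def augment(p_as):
    """Map (p01, p10, p11) under AS independence to a 2x2 table (p00, p01, p10, p11)."""
    tau = [1.0] + list(p_as)      # the 'no bait' cell gets the value 1 (assumption p0 = 1)
    z = sum(tau)
    return [t / z for t in tau]   # renormalize to a probability distribution

# An AS-independent distribution: p11 = p01 * p10 and p01 + p10 + p11 = 1.
p01 = 0.4
p10 = (1 - p01) / (1 + p01)       # solves p01 + p10 + p01 * p10 = 1
p_as = (p01, p10, p01 * p10)

q00, q01, q10, q11 = augment(p_as)
# On the augmented table the ordinary odds ratio equals 1, i.e. independence holds.
print(abs(q00 * q11 - q01 * q10) < 1e-12)
```

The check only illustrates the algebraic correspondence; it says nothing about whether the added cell is meaningful in a given study.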
For situations like the one in Example 4, the AS independence is a natural model, but it also applies to cases when the "no feature present" situation is logically impossible (like market basket analysis, or records of traffic violations, see Klimova & Rudas (2015), and also the biological example in Section 5), and in such cases, the cell augmentation procedure is not meaningful. There are, however, situations when the existence of the "no feature present" cell is not logically impossible, but its actual existence in the population is dubious.
For a more general discussion of the homogenization of AS independence, let $d_1, \dots, d_K$ be a kernel basis of $A$, satisfying (8) with $d_1\mathbf{1}' = -1$. The polynomial ideal $I_A$ associated with the matrix $A$ is generated by one non-homogeneous polynomial, $p^{d_1^+} - p^{d_1^-}$, and the homogeneous polynomials $p^{d_k^+} - p^{d_k^-}$, $k = 2, \dots, K$. Because $d_1\mathbf{1}' = -1$, the difference in the degrees of the monomials $p^{d_1^+}$ and $p^{d_1^-}$ is $-1$. Therefore, the polynomial $p^{d_1^+} - p^{d_1^-}$ can be homogenized by multiplying the first monomial by one additional variable, say $p_0$:
$$p_0\, p^{d_1^+} - p^{d_1^-}.$$
The polynomial ideal generated by
$$p_0\, p^{d_1^+} - p^{d_1^-}, \quad p^{d_2^+} - p^{d_2^-}, \quad \dots, \quad p^{d_K^+} - p^{d_K^-}$$
and the corresponding variety are homogeneous, and can be described by the matrix of size $(J + 1) \times (I + 1)$ of the following structure:
$$A_0 = \begin{pmatrix} 1 & \mathbf{1}_I \\ \mathbf{0}_J & A \end{pmatrix}.$$
Here, $\mathbf{1}_I$ is the row of 1's of length $I$, and $\mathbf{0}_J$ is the column of zeros of length $J$.
In fact, the homogeneous variety $X_{A_0}$ is the projective closure of the affine variety $X_A$ (Cox et al., 2007). The latter can be obtained from the former by dehomogenization via setting $p_0 = 1$.
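The block structure of the homogenized matrix and the effect of homogenization on kernel vectors can be illustrated with a short Python sketch. The function names `homogenize_matrix` and `homogenize_kernel_vector` are hypothetical, and the AS independence matrix for two features is used as an assumed example; the sketch prepends to each kernel vector the exponent of $p_0$ that balances the degrees of the two monomials.

```python
def homogenize_matrix(A):
    """Build A_0: a leading 'no feature present' column and a top row of 1's."""
    n_cells = len(A[0])
    return [[1] * (n_cells + 1)] + [[0] + list(row) for row in A]

def homogenize_kernel_vector(d):
    """Prepend the exponent of p_0 that makes the binomial p^{d+} - p^{d-} homogeneous."""
    return [-sum(d)] + list(d)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Assumed example: AS independence for two features, cells (0,1), (1,0), (1,1).
A = [[1, 0, 1], [0, 1, 1]]
d1 = [-1, -1, 1]                  # p11 / (p01 * p10) = 1, entries sum to -1

A0 = homogenize_matrix(A)
d1_h = homogenize_kernel_vector(d1)
print(A0)     # [[1, 1, 1, 1], [0, 1, 0, 1], [0, 0, 1, 1]]
print(d1_h)   # [1, -1, -1, 1], i.e. p00 * p11 / (p01 * p10) = 1
```

The homogenized vector is orthogonal to every row of `A0`, including the row of 1's, so the corresponding odds ratio is homogeneous.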
The homogenization of the model of AS independence for three features is discussed next.
Example 5. Consider the model of AS independence for three attributes, A, B, and C, described in . Consider the following kernel basis matrix which is of the form (8): The corresponding polynomial ideal is: The generating set of $I_A$ includes at least one non-homogeneous polynomial, due to $d_1$, and can be homogenized by introducing a new variable, say $p_{000}$. The resulting ideal, is homogeneous, and its zero set where
$$A_0 = \begin{pmatrix} 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ 0 & 1 & 0 & 0 & 1 & 1 & 0 & 1 \\ 0 & 0 & 1 & 0 & 1 & 0 & 1 & 1 \\ 0 & 0 & 0 & 1 & 0 & 1 & 1 & 1 \end{pmatrix}$$
is thus a homogeneous variety. The relational model $RM(A_0)$ is defined on a larger sample space, namely $\mathcal{I} \cup \{(0, 0, 0)\}$. The model has the overall effect and is the following set of distributions: The rows of $A_0$ are the indicators of the cylinder sets of the total (the row of 1's), and of the A, B, and C marginals. Therefore, the relational model $RM(A_0)$ is the traditional model of mutual independence.
The next theorem states in general what was seen in the example. Let X 1 , . . . , X T be the random variables taking values in {0, 1}. Write I 0 for the Cartesian product of their ranges, and let I = I 0 \ (0, . . . , 0). The variety X A is non-homogeneous, because among its generators there is at least one non-homogeneous polynomial. In order to obtain the projective closure of X A (cf. Cox et al., 2007), include the "no feature present" cell, indexed by 0, to the sample space, choose a Gröbner basis of the ideal I A , and homogenize all non-homogeneous polynomials in this basis using the cell probability p 0 . Because the projective closure of X A is the minimal homogeneous variety in the projective space whose dehomogenization is X A (Cox et al., 2007), Theorem 3(ii) implies that this closure can be described using the matrix Each distribution in RM (A) has the multiplicative structure prescribed by A (Klimova & Rudas, 2016), and during the homogenization, is mapped in a positive distribution in X A 0 . Because all strictly positive distributions in X A 0 have the multiplicative structure prescribed by A 0 , they comprise the relational model RM (A 0 ). This matrix describes the model of mutual independence between X 1 , . . . , X T in the effect coding, and the proof is complete.
The homogenization (in the language of algebraic geometry) or regularization (in the language of the exponential families) leads to a simpler structure, which allows a simpler calculation of the MLE. However, the additional cell was not observed in these cases, and assuming its frequency is zero is ungrounded and may lead to wrong inference.
The framework developed here may also be used to define context specific independence, so that in one context conditional independence holds, and in another one, AS independence does. To illustrate, let $X_1, X_2, X_3$ be random variables taking values in $\{0, 1\}$. Assume that the $(0, 0, 0)$ outcome is impossible, so the sample space can be expressed as:
$$\mathcal{I} = \{(0,0,1), (0,1,0), (0,1,1), (1,0,0), (1,0,1), (1,1,0), (1,1,1)\}.$$
Let $p = (p_{001}, p_{010}, p_{011}, p_{100}, p_{101}, p_{110}, p_{111})$, and consider the relational model without the overall effect generated by The kernel basis matrix is equal to:
$$D = \begin{pmatrix} 0 & -1 & 0 & -1 & 0 & 1 & 0 \\ 1 & 0 & -1 & 0 & -1 & 0 & 1 \end{pmatrix},$$
and thus, the model can be specified in terms of the following two generalized odds ratios:
$$COR(X_1 X_2 \mid X_3 = 0) = \frac{p_{110}}{p_{010}\, p_{100}} = 1, \qquad COR(X_1 X_2 \mid X_3 = 1) = \frac{p_{001}\, p_{111}}{p_{011}\, p_{101}} = 1.$$
The second constraint expresses the (conventional) context-specific independence of $X_1$ and $X_2$ given $X_3 = 1$. The first odds ratio is non-homogeneous, and the corresponding constraint may be seen as the context-specific AS independence of $X_1$ and $X_2$ given $X_3 = 0$.

ML estimation with and without the overall effect
The properties of the ML estimates under relational models, discussed in detail in Klimova & Rudas (2016), are summarized here in the language of the linear and multiplicative families defined by the model matrix and its kernel basis matrix. The conditions of existence of the MLE are reviewed first. Let $a_1, \dots, a_{|\mathcal{I}|}$ denote the columns of $A$, and let $C_A = \{t \in \mathbb{R}^J_{\geq 0} : \exists p \in \mathbb{R}^{|\mathcal{I}|}_{\geq 0}, \ t = Ap\}$ be the polyhedral cone whose relative interior comprises such $t \in \mathbb{R}^J_{>0}$ for which there exists a $p > 0$ that satisfies $t = Ap$. A set of indices $F = \{i_1, i_2, \dots, i_f\}$ is called facial if the columns $a_{i_1}, a_{i_2}, \dots, a_{i_f}$ are affinely independent and span a proper face of $C_A$ (cf. Grünbaum, 2003; Geiger, Meek, & Sturmfels, 2006; Fienberg & Rinaldo, 2012). It can be shown that a set $F$ is facial if and only if there exists a $c \in \mathbb{R}^J$, such that $c'a_i = 0$ for every $i \in F$ and $c'a_i > 0$ for every $i \notin F$. Let $q \in \mathcal{P}$ and let $\mathcal{K}$ be the set of $\kappa > 0$, such that, for a fixed $\kappa$, the linear family
$$\mathcal{F}(A, q, \kappa) = \{r \in \mathcal{P} : Ar = \kappa A q\} \qquad (17)$$
is not empty, and let $\mathcal{F}(A, q) = \bigcup_{\kappa \in \mathcal{K}} \mathcal{F}(A, q, \kappa)$. For each $\kappa > 0$, the linear family $\mathcal{F}(A, q, \kappa)$ is a polyhedron in the cone $C_A$.
Theorem 5. (Klimova & Rudas, 2016) Let $RM(A)$ be a relational model, with or without the overall effect, and let $q$ be the observed distribution.
1. The MLE $\hat{p}_q$ given $q$ exists if and only if: (i) $\mathrm{supp}(q) = \mathcal{I}$, or (ii) $\mathrm{supp}(q) \subsetneq \mathcal{I}$ and, for all facial sets $F$ of $A$, $\mathrm{supp}(q) \not\subseteq F$.
In either case, $\hat{p}_q = \mathcal{F}(A, q) \cap \mathrm{int}(X_A)$, and there exists a unique constant $\gamma_q > 0$, also depending on $A$, such that: $A\hat{p}_q = \gamma_q A q$, $\mathbf{1}\hat{p}_q = 1$.
2. The MLE under the extended model $\overline{RM}(A)$, defined in (7), always exists and is the unique point of $X_A$ which satisfies:
$$Ap = \gamma_q A q \ \text{ for some } \gamma_q > 0, \qquad \mathbf{1}p = 1. \qquad (18)$$
The statements follow from Theorem 4.1 in Klimova & Rudas (2016) and Corollary 4.2 in , and the proof is thus omitted. The constant γ q , called the adjustment factor, is the ratio between the subset sums of the MLE, Ap q , and the subset sums of the observed distribution, Aq. If the overall effect is present in the model, γ q = 1 for all q.
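Let $A$ be a model matrix whose row space does not contain $\mathbf{1}$, and let $\bar{A}$ be the matrix obtained by augmenting $A$ with the row $\mathbf{1}$. It will be shown in the proof of the next theorem that every facial set of $A$ is facial for $\bar{A}$. If the observed $q$ is positive, the MLEs under the models $RM(A)$ and $RM(\bar{A})$ both exist. However, as implied by the relationship between the facial sets of $A$ and $\bar{A}$, if $q$ has some zeros, the MLE may exist under $RM(A)$, but not under $RM(\bar{A})$, or neither of the MLEs may exist.
Theorem 6. Let $A$ be a model matrix whose row space does not contain $\mathbf{1}$, and let $\bar{A}$ be the matrix obtained by augmenting $A$ with the row $\mathbf{1}$. Let $q$ be the observed distribution. If, given $q$, the MLE under $RM(\bar{A})$ exists, so does the MLE under $RM(A)$.
Proof. If $q > 0$, both MLEs exist by Theorem 5. Assume that $q$ has some zeros, that is, $\mathrm{supp}(q) \subsetneq \mathcal{I}$, and that the MLE under $RM(\bar{A})$ exists. It will be shown next that for any facial set $F$ of $A$, $\mathrm{supp}(q) \not\subseteq F$. The proof is by contradiction. Let $F_0$ be a facial set of $A$, such that $\mathrm{supp}(q) \subseteq F_0$. Therefore, there exists a $c \in \mathbb{R}^J$, such that $c'a_i = 0$ for every $i \in F_0$ and $c'a_i > 0$ for every $i \notin F_0$. The columns of $\bar{A}$ are $\bar{a}_i = (1, a_i')'$, and the vector $\bar{c} = (0, c')' \in \mathbb{R}^{J+1}$ satisfies $\bar{c}'\bar{a}_i = c'a_i$ for every $i$. Hence, $F_0$ is also a facial set of $\bar{A}$, and it contains $\mathrm{supp}(q)$, which, by Theorem 5, contradicts the existence of the MLE under $RM(\bar{A})$. Thus, $\mathrm{supp}(q)$ is not contained in any facial set of $A$, and the MLE under $RM(A)$ exists.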
From a geometrical point of view, F(A, q) is a polyhedron which decomposes into polyhedra F(A, q, κ), with κ > 0; clearly, q ∈ F(A, q, 1). The MLE under RM (Ā) given r ∈ F(A, q, κ) is the unique point common to the polyhedron F(A, q, κ) and the variety XĀ. Among the feasible values of κ there exists a unique one, sayκ, such that the MLEp r , ∀r ∈ F(A, q,κ), coincides with the MLE of q under RM (A),p q . This happens whenγ r = 1 so that, from (ii) in Theorem 7,κ =γ q . This latter point,p q , is the intersection between F(A, q) and the non-homogeneous variety X A . This specific value of the adjustment factor γ q =κ, is the adjustment factor of the MLE under RM (A) given q. An illustration is given next.
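Relational models for probabilities without the overall effect are curved exponential families, and the computation of the MLE under such models is not straightforward. An extension of the iterative proportional fitting procedure, G-IPF, that can be used for both models with and models without the overall effect was proposed in Klimova & Rudas (2015) and is implemented in Klimova & Rudas (2014). Alternatively, the MLEs can be computed, for instance, using the Newton-Raphson algorithm or the algorithm of Evans & Forcina (2013). One of the algorithms described in Forcina (2017) gave an idea of a possible modification of G-IPF, called G-IPFm. A brief description of the original and modified versions of G-IPF is given below. In both versions, a value $\gamma > 0$ is fixed first, and then the following steps are repeated.
G-IPF: Run IPF($\gamma$) to obtain $p_\gamma$, where $Ap_\gamma = \gamma Aq$ and $D \log p_\gamma = 0$. Adjust $\gamma$, to approach the solution of $\mathbf{1}p_\gamma = 1$. Iterate with the new $\gamma$.
G-IPFm: Run IPF($\gamma$) to obtain $p_\gamma$, where $\bar{A}p_\gamma = (1, \gamma (Aq)')'$ and $\bar{D} \log p_\gamma = 0$. Adjust $\gamma$, to approach the solution of $f(\gamma) = d_1 \log p_\gamma = 0$. Iterate with the new $\gamma$.
Theorem 8. If $q > 0$, the G-IPFm algorithm converges, and its limit is equal to $\hat{p}_q$, the ML estimate of $p$ under $RM(A)$.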
The original G-IPF can be used whether or not $q$ has some zeros, and it computes a sequence whose elements are the unique intersections of the variety $X_A$ and each of the polyhedra defined by $A\tau = \gamma Aq$ for different $\gamma$. This sequence converges, and its limit belongs to the hyperplane $\mathbf{1}\tau = 1$ (Klimova & Rudas, 2016). G-IPFm produces a sequence whose elements are the unique intersections of the interior of the homogeneous variety $X_{\bar{A}}$ and each of the polyhedra $\mathcal{F}(A, q, \gamma)$. The limit of this sequence belongs to the interior of the non-homogeneous variety $X_A$. To ensure the existence, differentiability, and monotonicity of $f(\gamma)$, described above, the G-IPFm algorithm should be applied only when $q > 0$. If $q$ has some zero components, the positive MLE $\hat{p}_q$ may still exist, see Theorem 5(ii). However, for some $q$, because, in general, the matrices $A$ and $\bar{A}$ have different facial sets, no strictly positive $p_\gamma$ would satisfy $\bar{A}p_\gamma = (1, \gamma (Aq)')'$.
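The idea behind the original G-IPF can be illustrated with a simplified Python sketch. It is not the published implementation: the function names `ipf` and `g_ipf`, the use of bisection to adjust $\gamma$, the iteration limits, and the example data are all assumptions made here for illustration, and $q > 0$ is assumed throughout. For a fixed $\gamma$, the inner loop rescales the cells in the support of each row of $A$ until $A\tau = \gamma Aq$; starting from $\tau = 1$, every iterate keeps the multiplicative structure prescribed by $A$. The outer loop searches for the $\gamma$ at which the total of $\tau$ equals 1.

```python
def subset_sums(A, p):
    """Compute A p for a 0-1 matrix A given as a list of rows."""
    return [sum(a * x for a, x in zip(row, p)) for row in A]

def ipf(A, t, sweeps=5000, tol=1e-13):
    """Cyclic scaling: find tau = exp(A' theta) > 0 with A tau = t."""
    tau = [1.0] * len(A[0])
    for _ in range(sweeps):
        for j, row in enumerate(A):
            f = t[j] / sum(a * x for a, x in zip(row, tau))
            # rescale only the cells that belong to the j-th subset
            tau = [x * f if a == 1 else x for a, x in zip(row, tau)]
        if max(abs(s - tj) for s, tj in zip(subset_sums(A, tau), t)) < tol:
            break
    return tau

def g_ipf(A, q, steps=60):
    """Search for gamma such that the fitted tau sums to 1 (bisection is an assumption)."""
    Aq = subset_sums(A, q)
    # Since every cell lies in at least one subset:
    # gamma * max(Aq) <= total(tau) <= gamma * sum(Aq), which brackets the solution.
    lo, hi = 1.0 / sum(Aq), 1.0 / max(Aq)
    for _ in range(steps):
        gamma = 0.5 * (lo + hi)
        tau = ipf(A, [gamma * s for s in Aq])
        if sum(tau) < 1.0:
            lo = gamma
        else:
            hi = gamma
    return tau, gamma

# Assumed example: AS independence on cells (0,1), (1,0), (1,1) with a positive q.
A = [[1, 0, 1], [0, 1, 1]]
q = [0.2, 0.3, 0.5]
p_hat, gamma_hat = g_ipf(A, q)
print(p_hat, gamma_hat)
```

In this example the fitted distribution satisfies $p_{11} = p_{01} p_{10}$ up to numerical error, and the returned $\gamma$ plays the role of the adjustment factor $\gamma_q$.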
Some limitations and advantages of using the generalized IPF were addressed in Klimova & Rudas (2015), Section 2. In particular, while the assumption that the model matrix has full row rank can be relaxed for G-IPF, it is one of the major assumptions of the Newton-Raphson and Fisher scoring algorithms. The algorithms proposed in Forcina (2017) also require the model matrix to be of full row rank, and their convergence relies on the positivity of the observed distribution.
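The two-level structure of the original G-IPF (an inner IPF run for fixed γ, followed by an adjustment of γ until the fitted vector lies on the hyperplane 1ᵀτ = 1) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the 0-1 model matrix A and the distribution q below are hypothetical toy values, the function names are illustrative, and the adjustment of γ is done by plain bisection, which is swapped in here for simplicity in place of the update rule of the published algorithm. Bisection is applicable because the total 1ᵀp_γ increases with γ.

```python
import numpy as np

def ipf(A, target, tol=1e-12, max_iter=10_000):
    """Cyclic IPF for a 0-1 matrix A: find p on the variety X_A with A p = target."""
    p = np.ones(A.shape[1])            # all-ones start lies on X_A (log p = 0)
    for _ in range(max_iter):
        for j in range(A.shape[0]):
            cells = A[j] == 1          # cells belonging to generating subset j
            p[cells] *= target[j] / p[cells].sum()
        if np.max(np.abs(A @ p - target)) < tol:
            break
    return p

def g_ipf_bisect(A, q, n_bisect=60):
    """Find gamma such that p_gamma (A p = gamma * A q, p on X_A) sums to 1."""
    b = A @ q
    lo, hi = 0.0, 1.0
    while ipf(A, hi * b).sum() < 1.0:  # enlarge the bracket if needed
        hi *= 2.0
    for _ in range(n_bisect):          # bisection on the total of p_gamma
        gamma = 0.5 * (lo + hi)
        if ipf(A, gamma * b).sum() < 1.0:
            lo = gamma
        else:
            hi = gamma
    gamma = 0.5 * (lo + hi)
    return ipf(A, gamma * b), gamma

# Hypothetical toy model without the overall effect: p3 = p1 * p2.
A = np.array([[1, 0, 1],
              [0, 1, 1]])
q = np.array([0.2, 0.3, 0.5])          # hypothetical observed distribution
p_hat, gamma_hat = g_ipf_bisect(A, q)
print(p_hat, gamma_hat)
```

In this toy example the fitted vector sums to 1 and satisfies p3 = p1 p2, while the subset sums A p_hat equal γ A q with an adjustment factor γ different from 1, so the observed subset sums are preserved only up to a constant of proportionality.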

Loss of potentials in hematopoiesis
Hematopoietic stem cells (HSC) are able to become progenitors that, in turn, may develop into mature blood cells. Understanding the process of forming mature blood cells, called hematopoiesis, is one of the most important aims of cell biology, as it may help to develop new cancer treatments. The HSC progenitors can proliferate (produce cells of the same type) or differentiate (produce cells of different types). Multiple experiments suggested that HSC progenitors are multipotent cells and differentiate by losing one of the potentials. The mature blood cells, in contrast, are unipotent and neither proliferate nor differentiate. The differentiation is believed to be a hierarchical process, with HSC progenitors and mature blood cells at the highest and the lowest levels, respectively.
The models discussed below apply to the steady state of hematopoiesis, under the assumption that cells neither proliferate nor die and can undergo only the first phase of differentiation. Various hierarchical models for differentiation have been proposed (cf. Kawamoto, Wada, & Katsura, 2010; Ye, Huang, & Guo, 2017). The equal loss of potentials (ELP) model was introduced in Perié et al. (2014) and is described next. Denote by MDB the three-potential HSC progenitor of the M, D, and B mature blood cell types. During the first phase of differentiation, an MDB progenitor can differentiate by losing either one or two potentials at the same time, and thus produce a cell of one of six types, written here as *DB, M*B, MD*, **B, *D*, and M**, with * marking a lost potential. The ELP model is specified by the multiplicative constraints (19) and has the parametric form (20), where, using the notation in Perié et al. (2014), α_M, α_D, α_B are the parameters associated with the loss of the corresponding potential from MDB. It can be easily verified that the relational model generated by (20) does not have the overall effect, so the normalization has to be added as a separate condition: Z = p_{*DB} + p_{M*B} + p_{MD*} + p_{**B} + p_{*D*} + p_{M**} = 1.
That is, the authors rescaled the loss probabilities to force them to sum to 1. In fact, (22) is also a relational model; it is generated by the matrix Ā and can be obtained by adding the overall effect to the model defined by (20). Because the original model does not have the overall effect, adding a row of 1's changes the model. One can check by substitution that the probabilities in (22) do not satisfy the multiplicative constraints (19). The estimates of the probabilities of loss of potentials from the MDB cells are shown in Figure 3B of Perié et al. (2014). In the notation used here, p̂_{*DB} = 0.35, p̂_{M*B} = 0.08, p̂_{MD*} = 0.49, p̂_{**B} = 0.01, p̂_{*D*} = 0.06, p̂_{M**} = 0.01.
These probabilities sum to 1, but also do not satisfy (19).
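The check on the reported estimates can be reproduced with a short Python sketch. The six values are the ones quoted above from Figure 3B of Perié et al. (2014). The form of the multiplicative constraints is an assumption made for this illustration only: it is taken here that (19) requires the probability of losing two potentials at once to equal the product of the probabilities of losing each of them separately (for example, p_{**B} = p_{*DB} p_{M*B}). If (19) has a different form, the `constraints` dictionary below has to be changed accordingly.

```python
# Estimates quoted in the text (Figure 3B of Perie et al., 2014).
p = {"*DB": 0.35, "M*B": 0.08, "MD*": 0.49,
     "**B": 0.01, "*D*": 0.06, "M**": 0.01}

# Normalization: the six estimates should sum to 1.
total = sum(p.values())

# ASSUMPTION (not taken from the text): each two-loss probability is
# compared with the product of the two corresponding one-loss probabilities.
constraints = {
    "**B": ("*DB", "M*B"),   # both M and D lost
    "*D*": ("*DB", "MD*"),   # both M and B lost
    "M**": ("M*B", "MD*"),   # both D and B lost
}

# Difference between each estimate and the product it would have to equal.
gaps = {cell: p[cell] - p[a] * p[b] for cell, (a, b) in constraints.items()}

print(f"sum of estimates: {total:.2f}")
for cell, gap in gaps.items():
    a, b = constraints[cell]
    print(f"p[{cell}] = {p[cell]:.2f}, p[{a}] * p[{b}] = {p[a] * p[b]:.4f}, gap = {gap:+.4f}")
```

Under the assumed product form, all three products exceed the corresponding estimates, which is consistent with the statement that the normalized estimates fall outside the model without the overall effect.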