A class of models for multiple binary sequences under the hypothesis of Markov exchangeability

: We discuss inference for multiple binary sequences under the hypothesis of Markov exchangeability. So far, the only kind of models for this purpose have been the mixtures of Markov chains. We present a new class of hierarchicalmodels parameterized in terms of Bahadur/Lancaster’s interactions, and compare it to the mixtures of Markov chains models.


Introduction
The concept of Markov exchangeability was suggested for the first time as a particular case of partial exchangeability by de Finetti in [6]. This particular case has peculiar importance and is sometimes simply referred to as partial exchangeability. Since then, the main aim of the authors working on this specific topic has been to find a theorem characterizing the discrete time mixtures of Markov chains processes. Various papers ( [13,14,8,34,12]) outline different theoretical frameworks for extending de Finetti's representation theorem to Markov exchangeable (hereafter ME) processes, the main result being that a recurrent process is ME if, and only if, its law is a mixture of Markov chains. Besides this characterization problem, some particular ME processes have been studied in the context of certain random walks over graphs, namely the Edge Reinforced Random Walks (see [25] and references therein for a comprehensive survey), and in the Bayesian analysis of certain processes (see [24,10]). The inferential analysis of ME processes has not received the same attention.
A finite sequence of r.v.s (X 1 , . . . , X n ) taking values in a discrete state space I is said ME if its distribution assigns the same probability to all the I-valued sequences having the same starting state and the same number of transitions (i, j) for each couple of states in I. A process {X n } n∈N is said ME if the above holds for every n.
When we have to analyze a dataset consisting of multiple I-valued sequences, for example when we have a sequence of responses recorded for each unit, we may consider a hypothesis of Markov exchangeability. Rigorously, we should adopt it whenever, in a sequence of observations, we consider "the last outcome before any observation as a relevant attribute of the observation itself, and, once the observations are classified according to this attribute, time order becomes irrelevant" (see [12]). Even if we do not have such a precise idea on the data, we can adopt a hypothesis of Markov exchangeability for example whenever we deem it appropriate to use an homogeneous Markov chain, but this choice results in a poor fit to the data (possibly due to population's heterogeneity). Then we can look for a wider class of models maintaining the hypothesis of Markov exchangeability (a Markov chain is a particular ME process). Moreover, we can adopt it as a way to approximate a multivariate discrete distribution, reducing the complexity of the analysis. In fact, under Markov exchangeability, we a priori reduce the maximum number of free parameters the distribution can have (numerical details will be given on that). So, if one is interested, even if somehow informally, to the order of a set of I-valued variables, Markov exchangeability is a reasonable intermediate between considering the joint distribution in its greatest generality, and a simple Markov dependence (i.e., since I is discrete, a Markov chain).
To the author's knowledge, only Quintana and co-authors have explicitly considered this use of the hypothesis in several papers (see e.g. [26][27][28]). Other papers analyze the practical use of discrete time mixtures of Markov chains models for the analysis of multiple categorical sequences (e.g. [15,17,16]). Specifically, they analyze various kind of finite mixtures of Markov chains and their use in (model-based) cluster analysis, but they do not even mention the concept of Markov exchangeability.
This paper tries to fill the gap, analyzing the practical use of this hypothesis, and proposing an alternative to the mixtures of Markov chains models.
The need of an alternative to the mixture models is explained by the following argument: de Finetti's representation theorem asserts that an infinite sequence of r.v.s is exchangeable if, and only if, its law is a mixture of laws of i.i.d. variables. Not all the exchangeable finite sequences are initial segments of longer exchangeable sequences, i.e., as is said, not all of them are "extendible". Then, it may be the case that a mixture of i.i.d. model is not suitable to analyze the data at hand under the exchangeability assumption. As a simple example of this, in a mixture of i.i.d. r.v.s, the correlation among the variables is necessarily nonnegative, while it is not so in the not extendible case. The question of extendibility has been studied mainly for binary exchangeable r.v.s (e.g. [9,3,33]), and the concept has been extended to ME r.v.s in [35,36], so that we can say that a finite ME sequence is not necessarily the initial segment of a mixture of Markov chains process.
We will consider the case I = {0, 1}, as repeated binary variables often arise in practice, and we present a new class of hierarchical models for ME binary data. Such a class is presented as a reparameterization of the joint distribution of n ME r.v.s in terms of the "Bahadur/Lancaster's interactions". These interactions, also called "additive interactions", were first introduced in [2] and [21] and define the "additive models" which constitute an alternative to the loglinear models for the analysis of categorical variables. In fact, this class of additive models for the ME case is large enough to include all the distributions of the ME sequences of a fixed length disregarding their extendibility under a simple additional assumption, and nearly all the others.
The paper is structured as follows: In Section 2 we give some definitions and insights in ME distributions and mixtures of Markov chains, and we present a first simple parameterization of a ME distribution. In Section 3 we define a second parameterization which serves as a necessary intermediate step in order to construct the Bahadur/Lancaster's interaction parameters, and an assumption we should adopt to successfully accomplish that construction. In Section 4 we briefly introduce the additive models in general, then present the additive models for ME binary data. An application of the models presented concludes the paper.

Some definitions
Consider an I-valued sequence (x 1 , . . . , x n ). Define its transition counts n i,j for all i, j in I as Then, we will say that (X 1 , . . . , X n ) is ME (or n-ME when we need to highlight the length of the sequence), if its joint n-variate distribution assigns the same probability to all the (x 1 , . . . , x n ) in I n having the same value of the first step x 1 , and the same transition count matrix N . That is, x 1 and N together are a sufficient statistic, the probability of having any sequence starting in x 1 and consistent with N depends only on x 1 and N , and we will denote it p x1,N . Denote the set of all the distinct transition count matrices of all the I-valued sequences of n steps starting in x 1 as Φ(x 1 , n). For what we have said, an n-ME distribution is completely defined by the probabilities {p x1,N } for x 1 ranging in I and N ranging in Φ(x 1 , n), and that constitutes a first simple parameterization of an n-ME distribution.
When I = {0, 1}, we deal with 2 × 2 transition count matrices of the kind: N = n 0,0 n 0,1 n 1,0 n 1,1 Two cases are possible: n 0,1 and n 1,0 are equal or differ by one. In the first case, the sequences necessarily start and end at the same state; in the second, in different states. So, if we know x 1 and N , we also know x n . Let us denote as Φ 0 (0, n) the subset of Φ(0, n) of the matrices having n 0,1 = n 1,0 . The corresponding sequences all start and end in 0. Denote as Φ 1 (0, n) the subset of Φ(0, n) of the matrices having n 0,1 = n 1,0 + 1. The corresponding sequences all start in 0 and end in 1. We have Φ 0 (0, n) ∪ Φ 1 (0, n) = Φ(0, n). Symmetrically, for the sequences starting in 1, we define

1116
The number of probabilities defining an n-ME distribution is equal to the number of possible different transition count matrices for each fixed starting state. We have (for a proof see [7]) So, in general an n-ME binary distribution is defined by the 2 n 2 +2 probabilities {p 0,N } N∈Φ(0,n) and {p 1,N } N∈Φ(1,n) . We will also use the symbols p x1, n 0,0 n 0,1 n 1,0 n 1,1 . Let N be the transition count matrix intended as a r.v. Denote as w x1,N the probability of (X 1 = x 1 ) ∩ (N = N ) . Then any n-ME binary distribution is as well defined by the probabilities {w 0,N } N∈Φ(0,n) and {w 1,N } N∈Φ(1,n) . The relation with the previous parameterization is clear: the parameter w x1,N is the probability of having any sequence in I n consistent with (X 1 = x 1 )∩(N = N ) , i.e. starting in x 1 and having the transition count N , and all those sequences have the same probability p x1,N . The number of sequences such defined has been first computed by Whittle in [32]. When I = {0, 1} and N = If we do not add any assumption, the {w x1,N } are subject to the only restriction: x1∈I N∈Φ(x1,n) w x1,N = 1 Then we have 2 n 2 + 1 free parameters, and that is the maximum number of identifiable free parameters for an n-ME distribution. Any parameterization of an n-ME distribution without further assumptions would be a one-to-one transform of the {w x1,N } (or equivalently of the {p x1,N }), and so would have that number of free parameters. Then, in a fully parametric approach, we will say that a model for an n-ME sequence is saturated if it has 2 n 2 + 1 free parameters. That result allows us to appreciate the usefulness of a hypothesis of Markov exchangeability in terms of reduction of complexity: if we do not make any assumption on the joint distribution of n binary r.v.s, we would have 2 n − 1 free parameters. Under the only assumption of Markov exchangeability, we a priori reduce that number to about n 2 . In case of simple exchangeability it would be n, but we would have lost any information about the order of the variables.

Mixtures of Markov chains
We say that an I-valued process X = {X n } n∈N is ME if (X 1 , . . . , X n ) is ME for every n. Diaconis and Freedman in [8] demonstrated that a recurrent process (X 1 = X n i.o.) is ME if, and only if, its law is a mixture of Markov chains. That is, let P be the space of all the stochastic matrices Θ = {θ i,j } i,j on I × I. Then there exists a mixing measure ν on the Borel sets of I × P such that Let Γ i (k) be the step of the process at which the state i occurs for the k-th time. Let V i (k) be the k-th successor of the state i, i.e. the variable immediately subsequent the k-th occurrence of i: V i (k) = X Γi(k)+1 , and let v i (k) be the corresponding observed value. Originally de Finetti hinted at the possibility to characterize X as a mixture of Markov chains by the exchangeability of all the subprocesses {V i (k)} k , i ∈ I. Much later in [12] it has been demonstrated that the idea of de Finetti and the characterization of Diaconis and Freedman coincide in case of recurrent processes. In the following we will use the fact that in a ME process the {V i (k)} k are exchangeable.
The second class of mixtures of Markov chains models we will consider are the finite mixture models (see [17,28]), where the mixing distribution is conceived as a discrete distribution supported only on a finite number of points. A binary simple Markov chain is completely defined by three parameters: the probability of transition (0, 0), θ 0,0 , the probability of transition (1, 1), θ 1,1 , and by P (X 1 = 1) = q 1 . Then if the discrete mixing distribution has, say, d support points, it assigns masses λ h , h = 1, . . . , d, When dealing with mixtures of discrete components distributions, an identifiability problem may arise. That question has been studied and solved in the case of finite mixtures of Binomials, calculating the maximum number of components the mixture can have, that guarantee the identifiability of all the parameters (see, for example, [22] for a geometric approach). In our case, we have already calculated the maximum number of identifiable parameters 2 n 2 +1, then 4d −1 could not exceed that number and we have: In [28] the authors consider a mixing distribution over d points θ 0,0 (h), θ 1,1 (h) , h = 1, . . . , d, i.e., X 1 is analyzed separately and the total number of free parameters is 3d.

A first reparameterization
In an n-ME sequence (X 1 , . . . , X n ), the first k elements (X 1 , . . . , X k ), k < n, constitute a k-ME sequence, and we can obtain all the probabilities of the kind be the transition count matrix up to the first k steps, i.e. i,j∈I k i,j = k − 1 and let k 0,0 + k 0,1 = k + 0 and k 1,0 + k 1,1 = k + 1 . Then we have (for a proof see [7]) where the sums should be restricted over those matrices N in Φ(0, n) having n i,j ≥ k i,j , for all i, j in I. A similar formula holds for the sequences starting in 1. Consider now the probability p 0 , ( k 1 0 r ) of having the sequence of r + k + 2 steps starting in 0 with k transitions (0, 0), a single transition (0, 1) and ending with r transitions (1, 1), and denote it w 0,k,r . Applying the above formula we have We set w 0,n−1,0 = p 0 , n−1 0 0 0 . Introduce the operators ∆ 0 and ∆ 1 such that: then the inverse formula of (4), defining the {p 0,N } in terms of the {w 0,k,r }, is the following (see [7]): In an n-ME sequence the probabilities {w 0,k,r } are well defined for every couple of nonnegative integers (k, r) such that 0 ≤ k + r ≤ n − 2 together with the case w 0,n−1,0 . The two formulas above assure that the set {w 0,k,r } such defined constitutes a saturated parameterization of an n-ME distribution starting in 0. It is easily seen that the number of parameters defined is n 2 + 1. For the sequences starting in 1, we introduce the parameters {w 1,k,r }, defined as the probabilities of having the sequence starting in 1 with r transitions (1, 1), a single transition (1, 0) and ending with k transitions (0, 0). Formulas analogous to (4) and (5)  Let Y i,j (k) be the indicator function of the event the k-th successor of i is j considered as a r.v. and let y i,j (k) be the corresponding observed value. That is: In the particular case when (X 1 , . . . , X n ) is the initial segment of a mixture of Markov chains process, that parameters are restricted to satisfy: where E ν indicates the expectation w.r.t. the mixing measure ν(X 1 , θ 0,0 , θ 1,1 ).
Consider now the parameters m i,k,r defined as We have: Then, once you know the {m i,k,r } defined for i in {0, 1} and every couple (k, r) such that 0 ≤ k +r ≤ n−1, excluding the cases m 0,0,n−1 and m 1,n−1,0 , you can define an n-ME distribution. On the converse, it is easily seen that in an n-ME distribution it is not possible to single out all the values {m i,k,r } such defined without adding some restrictions, i.e., you cannot write them as a function of the {p i,N } and they are not identifiable. To realize that, one can simply note that their number is 2 n+1 2 − 2 which is greater than the maximum number of identifiable parameters for an n-ME distribution.
In order to reduce the total number of independent parameters and to make further constructions, we present an assumption which, though arbitrary, is reasonable and not particularly restrictive: we will call it "Independence Assumption", hereafter Ind.Ass. 1: Ind.Ass. 1. X 1 and N are independent.
In a mixture of Markov chains model that assumption corresponds to ν(X 1 , θ 0,0 , θ 1,1 ) = ν 1 (X 1 ) ν 2 (θ 0,0 , θ 1,1 ) If we adopt Ind.Ass. 1, we can bypass the identifiability problem we have mentioned above. In fact, we have so, by (6), the {m k,r } defined for every (k, r) such that 0 ≤ k + r ≤ n − 1 suffice to define an n-ME distribution under Ind.Ass. 1. On the converse, it is possible to write (with a recursive formula) all the probabilities {m k,r } such defined in terms of the {w 0,k,r } and {w 1,k,r }. In fact we have Then we have m 1,r = m 0,r − w 0,0,r q0 , and in general, by recurrence, So, we can parameterize any n-ME distribution satisfying Ind.Ass. 1 in terms of the {m k,r }, for k + r ≤ n − 1, together with q 1 .
In case of a mixture of Markov chains (2), the parameters m k,r are constrained to be the mixed moments of the mixing measure ν(θ 0,0 , θ 1,1 ): Another case of interest is when the two exchangeable subprocesses forming the ME process are independent. We will refer to that condition as:

The additive models
A typical approach to the analysis of multivariate categorical data is that of defining some kind of interactions between the variables, by means of which we decompose their joint distribution in a hierarchical model. We can start from the saturated model, that is, when all the identifiable interaction parameters are included, to analyze the dependence structure. Then, setting equal to zero some interaction parameters, we can construct all the reduced models, and choose a suitable, parsimonious one, tuning our target of goodness of fit. Note that it is required that the interaction parameters can consistently assume zero values. There are two main approaches to interaction: the so called multiplicative and the additive approach. For a comparison of the two see [4,5,18]. The most commonly used is the multiplicative approach and the respective models, namely the loglinear models. The additive definition was introduced in [2] and defined for general real variables in [21] and hence is sometimes called Bahadur/Lancaster's interaction. It has been studied in few isolated papers ( [37,31,30]). The respective models for categorical variables (we will call them additive models) have been rarely utilized ( [23,11]).
When dealing with {0, 1}-valued variables, the additive interaction of order k among the variables (X 1 , . . . , X k ) is It can be viewed as a generalization of the concept of covariance between two variables, so we will denote it Cov[X 1 , . . . , X k ].
When data consist of sequences of unequal sizes, we should specify model parameters in such a way that they have a consistent interpretation, whatever the sequence size, i.e. parameters with meanings that are invariant across different marginal distributions. That is, the joint distribution of (X 1 , . . . , X k ), k < n, should be defined by the same parameters, or a subset of the parameters, defining the joint distribution of (X 1 , . . . , X n ). Sometimes, those distributions defined by parameters invariant under marginalization are said "reproducible". For some detail on the concept of reproducibility see [29,18,11].
The main advantage of the additive models with respect to the loglinear models is that the former are reproducible. On the converse, consider the following example from [18]: Let (X 1 , X 2 , X 3 ) be binary variables and denote P (X 1 = i, X 2 = j, X 3 = h) as p i,j,h , i, j, h ∈ {0, 1}. Under the saturated loglinear model we have where for example u j,h (2, 3) is the interaction parameter resulting from the joint occurrences of state j in X 2 and state h in X 3 . The logarithm of its marginal distribution for (X 1 , X 2 ) is lg p i,j (1, 2) =ũ +ũ i (1) +ũ j (2) +ũ i,j (1, 2) butũ i (1) = u i (1),ũ j (2) = u j (2) andũ i,j (1, 2) = u i,j (1, 2). So loglinear models are not reproducible, and when data consist of multiple sequences of unequal sizes, we should prefer an additive model.

Additive models for Markov exchangeable data
The additive models for general r.v.s are a parameterization of the joint distribution of a set of variables. That parameterization is readily adaptable to the case of exchangeable binary variables. In fact, in that case the only difference is that things simplifies, as all the interactions of a same order are equal, i.e. they depend only on the number k of variables involved: Cov[X i1 , . . . , X ik ] = Cov k , ∀ i 1 , . . . , i k . For an application of the additive models in the case of binary exchangeable data see [1] and [20].
In our case of ME variables, we do not define the additive interactions between the X 1 , . . . , X n , but between the r.v.s {Y 0,0 (k)} k and {Y 1,1 (r)} r , and we need an additional assumption to define the parameterization, namely Ind.Ass. 1.
Say data consist of m binary sequences {x 1 , . . . , x m }. For each observed sequence x l = (x l,1 , . . . , x l,nl ), l = 1, . . . , m, consider its number of transitions (i, j), n i,j (x l ), such that i,j∈I n i,j (x l ) = n l − 1, and the quantities n + i (x l ) = j∈I n i,j (x l ). We will consider the observed values {y 0,0 (k)} k=1,...,n + 0 (xl) and {y 1,1 (r)} r=1,...,n + 1 (xl) as realizations of the r.v.s {Y 0,0 (k)} k and {Y 1,1 (r)} r . Note however that, even in the case when all the observed sequences {x 1 , . . . , x m } have the same length, (i.e. n l = n, ∀ l), the quantities n + i (x l ) for each fixed i ∈ {0, 1}, will not be, in general, constant amongst the various sequences. For that reason, reproducibility is a crucial property of the interaction model chosen, hence, we will not consider the multiplicative interactions, which are not reproducible. Note that the parameterizations in terms of the {p x1,N } or {w x1,N } presented in Section 2 are clearly not reproducible, while the parameterizations presented in Section 3 are reproducible, but it does not make sense to set some of their elements equal to zero. The mixture models presented in Section 2.1 are reproducible.
In case of ME variables, the additive interactions depend only on the numbers of variables involved for each exchangeable group {Y 0,0 (k)} k and {Y 1,1 (r)} r , that is, we can define By expanding the product, one can find the following formulas defining the one-to-one relation between parameters {m k,r } and {Cov k,r } (see [7]): As a consequence, the parameterization in terms of the {Cov k,r }, defined for every (k, r) such that 2 ≤ k + r ≤ n − 1, together with m 0,1 , m 1,0 and q 1 represents all the n-ME distributions satisfying Ind.Ass. 1. (Cov 0,0 is always equal to 1 while Cov 1,0 and Cov 0,1 are always 0). Note that the simplest case when all the Cov k,r are zero, represents the Markov chains models. In fact by (11), we would have m k,r = (m 1,0 ) k (m 0,1 ) r , that is, by (9), a Markov chain having probability of transition (0, 0) equal to m 1,0 , and probability of transition (1, 1) equal to m 0,1 .
Recall that, in case of a mixture of Markov chains (2), the {m k,r } are the mixed moments of the mixing measure ν. Formulas (10) and (11) link the ordinary mixed moments and the central mixed moments of a bivariate distribution (see e.g. equations (34.28) (34.29) in [19]). Consequently, in a mixture of Markov chains model we have: Actually, in a mixture model all the even order terms Cov 2k,2r are necessarily positive. These restrictions on the parameters constitute a simple test for the extendibility of a ME sequence under Ind.Ass. 1, and can help us in deciding for an additive model instead of a mixture model. We can write many similar necessary conditions for extendibility using moments inequalities. One that has been found to be useful is the following: In a mixture model we necessarily have Obviously, in the same conditions the ML estimates { w x1,N } of the {w x1,N } simply are the observed proportions of sequences consistent with (x 1 , N ): Since ML estimates are invariant under one-to-one functions, we can find the ML estimates of any saturated parameterization of an n-ME distribution by applying to the (14) the corresponding transformation. So, by (1) we can find the ML estimates of the {p x1,N }. Subsequently, by (4) we can obtain the ML estimates of the {w i,k,r }. Then, under Ind.Ass. 1, by (8) we can obtain the ML estimates of the {m k,r }, and by (10) the ML estimates of the {Cov k,r }.

Reduced additive models
To construct the reduced additive models under Ind.Ass. 1 and to calculate the ML estimate of the parameters, first consider the log-likelihood (13) in terms of the {p x1,N }. Then, by formulas (5), (6), (7) and (11), we can rewrite it in terms of the {Cov k,r } together with m 0,1 , m 1,0 and q 1 . Thus we can construct a reduced additive model, simply by setting as zero in the log-likelihood expression some of the Cov k,r , for example starting from those of higher orders, and finally find the ML estimates of the remaining parameters by numerically maximizing the log-likelihood with respect to them (note that the log-likelihood is always linear in the interaction parameters).
We can also construct the additive interactions without Ind.Ass. 1 defining and where P i is the distribution of (X 1 , . . . , X n ) conditioned on {X 1 = i} and E Pi is the respective expectation. Then formulas (10) and (11) hold with Cov i,k,r replacing Cov k,r and m i,k,r qi replacing m k,r . But, as we have said, without Ind.Ass. 1 we have an identifiability problem regarding the {m i,k,r }. Then, any one-to-one transform of the {m i,k,r } would have the same problem, and we cannot define a saturated parameterization in terms of the {Cov i,k,r }. Nevertheless, we can consider nearly all the corresponding reduced additive models.
In fact, if we a priori set as zero (at least) the parameters Cov 0,k,r and Cov 1,k,r for all the couples (k, r) such that k + r = n − 1, we can write the likelihood in terms of the {Cov 0,k,r } and {Cov 1,k,r } for k + r < n − 1 together with the four parameters m i,0,1 qi , m i,1,0 qi , i ∈ {0, 1} and numerically maximize it. In case of a mixture model, we have some restrictions on the parameters analogous to those we have seen before under Ind.Ass. 1. We can consider some of them. In fact, even if the parameters m i,k,r are not identifiable without Ind.Ass. 1, the particular cases m 0,k,0 and m 1,0,r (and hence Cov 0,k,0 and Cov 1,0,r ) actually are. In fact, they correspond respectively to p 0, ( k 0 0 0 ) and p 1, ( 0 0 0 r ), so we can obtain them by (3). In a mixture we necessarily have Then, given a real dataset, starting from the (14), we can calculate the sample values m 0,k,0 and m 1,0,r , and eventually derive some evidences against a mixture model if the inequalities (15) do not hold.

Numerical examples
We now present an application of the models analyzed to three datasets. The first one has been obtained from a longitudinal study: the National Longitudinal Survey of Youth (NLSY79) whose data are freely available on web. Over 12000 persons answered a questionnaire by interview for several years. We have considered a variable concerning the labor status of the respondent during the week preceding the interview. The second one is a simulated dataset generated from a mixture of Markov chains model. The third, analyzed in [27], concerns the results of two kinds of medical tests on diabetic pregnant patients to determine fetal oxygenation. On all the datasets, we have fitted a simple Markov chain (MC), a MBM model, some finite mixtures of Markov chains and some additive models (AM) with and without Ind.Ass. 1. To estimate the additive models, we have used a software that allows manipulation of mathematical expressions in symbolic form to write their log-likelihoods following the procedure described in Section 4.2.2. Then we utilized a Newton-Raphson maximization routine to maximize them. The ML estimates of the finite mixture models have been computed using the EM algorithm (for the equations utilized see [17] and [28]), while a numerical maximization routine has been used for the MBM model (see [27]). Since the fitted models have different numbers of parameters, we consider the Akaike and the Bayesian information criteria in order to compare their performances: wherel is the value of the log-likelihood of the model corresponding to the ML estimates of the parameters, m is the number of sequences in the dataset, and c is the number of free parameters to be estimated in the model. Concerning the finite mixture models and the additive models, we present just the models that resulted in the best performance w.r.t. the two criteria. In particular, we have not checked all the possible additive models, so the results we show may be suboptimal.

NLSY dataset
We have dichotomized the variable concerning the labor status of the respondent (1=working, 0=not working) and considered a period of 12 years. We successively have excluded the units having missing values, finally obtaining a sample consisting of 5657 binary sequences of length 12. We consider them as independent realizations of a 12-ME sequence (X 1 , . . . , X 12 ). With regard to the finite mixture models, the fit is just slightly improved by adding more than 6 component Markov chains, and the models with 5 and 6 components have the best BIC and AIC respectively in their class.
The total number of possible additive models is huge. So we have restricted our attention to the lower order interaction parameters: {Cov i,k,r : k + r ≤ 6}, as it seems that in general they contribute more to the fit. In addition, to make a direct comparison with the mixture models, we have fixed a number of parameters equal to that of the best performing mixture models (c = 19 and 23). AM 1 in Table 1 was found to be the best additive model with those restrictions under Ind.Ass. 1. The others (AM 2 and AM 3) without Ind.Ass. 1. AM 1 is defined by all the interaction parameters Cov k,r , for k, r ≤ 3 excluding Cov 2,2 , together with Cov 4,0 , Cov 0,4 , Cov 4,2 and Cov 2,4 . AM 2 is defined by the interaction parameters Cov i,2,0 , Cov i,0,2 , Cov i,2,1 , Cov i,1,2 and Cov i,4,0 , for both i = 0 and i = 1, together with Cov 0,1,1 , Cov 0,3,1 , Cov 0,1,3 and Cov 0,0,4 . AM 3 has all the parameters of AM 2 together with Cov 1,3,0 , Cov 1,0,3 , Cov 1,1,1 and Cov 0,2,3 . AM 1, AM 2 and AM 3 all have a better AIC and BIC than the mixture models. More generally, most of the additive models we have checked perform better than mixtures. To investigate such a clear difference we checked the criteria for extendibility for the empirical distribution. Let m 0,k,0 be the sample value of m 0,k,0 for this dataset. We have that m 0,k,0 / q 0 − ( m 0,1,0 / q 0 ) k is slightly positive for small values of k, it decreases with k and is negative for k ≥ 8. We know that in the mixture models (15) holds, and in addition, all the fitted mixture models show an opposite trend to that of the empirical distribution, as the difference increases with k. Consequently, it is not surprising that the additive models in general work better than the mixtures for this dataset.
From the results obtained with the additive models we can derive some information on the data. Let us call the terms Cov k,r having both k and r non null, the "cross" terms. The MBM satisfies Ind.Ass. 2 that amounts to: Since Cov 0,1 = Cov 1,0 = 0, all the cross terms having k or r equal to one are null under the MBM, and in general the other cross terms are small in value. But the cross terms seem to be important for the additive models in this dataset, and that suggests a certain dependence between the two exchangeable subprocesses forming the ME process exists, i.e. Ind.Ass. 2 is untenable. That explains the inadequacy of the MBM for this dataset.

Simulated dataset
In [33] it is demonstrated that, in the space of all the n-exchangeable binary distributions, the proportion of them (with respect to the Lebesgue measure of the space) which are mixtures of i.i.d. model quickly tends to zero with n. That fact, together with the results in [7] suggest that an analogous proportion should be valid in the ME case, i.e. if we pick uniformly at random a distribution in the simplex of the parameters w x1,N representing all the n-ME binary distributions, probably it will not be a mixture of Markov chains model. For that reason, and to test the versatility of the additive models, we simulated a dataset from a mixture of Markov chains. We have chosen a large family of mixtures, namely, a 8-components finite mixture whose parameters' values were randomly sampled with uniform probability over the parameters' space. The generated dataset consists of 100 binary sequences of length 6. Model AM 1 in Table 2 satisfies Ind.Ass. 1, models AM 2 does not. AM 1 is defined by the interaction parameters Cov 2,0 and Cov 0,2 . AM 2 is defined by Cov 0,2,0 , Cov 1,2,0 and Cov 1,0,2 . The MBM results as the best model for both the criteria. We note however that, even in this adversely simulated dataset, the additive models are competitive as they show a performance comparable to that of the MBM, and preferable to those of the finite mixtures. In general, it is not hard to find an additive model with this characteristics for this dataset. The reason lies in the possibility of choosing an intermediate number of parameters, while in the finite mixtures we directly pass from c = 7 to c = 11.
In the various additive models checked, the cross terms did not show particular importance. AM 1 and AM 2 do not retain any cross term. That may suggest a weak dependence between the two exchangeable subprocesses, and that is consistent with the good performance of the MBM.

Fetal Oxygenation dataset
So far we have analyzed datasets consisting of sequences of equal lengths, but it is important to note that it is possible to use the additive models and the mixture models even if the sequences at hand have different lengths, since those models are reproducible. Say the lengths of the observed sequences range from n min to n max , then we simply have to consider a log-likelihood of the kind nmax n=nmin x1∈I N∈Φ(x1,n) ♯[x 1 , N ] · lg (p x1,N ) instead of (13). The main difference with the case when all the sequences have the same length, is that the processing time of the algorithms increases, especially for the EM algorithm. For that reason, we now analyze a small dataset consisting of only 30 sequences whose lengths range from 2 to 7. Two kind of medical tests were repeatedly performed in different occasions on 30 units and for each occasion we fix a 1 if they resulted as concordant and a 0 if they resulted as discordant.
Model AM 1 in Table 3 satisfies Ind.Ass. 1; AM 2 does not. AM 1 is defined by parameters Cov 2,0 and Cov 0,2 . AM 2 retains no interaction parameters, i.e. is just defined by and q 1 . It substantially provides two transition probabilities matrices depending on whether the initial state is 0 or 1. Again, the additive models presented are preferable to the mixtures w.r.t. the AIC and BIC. In particular, the finite mixtures require too many parameters for this small dataset.

Conclusions
The mixtures of Markov chains do not represent all the ME distributions, and many cases not covered by them may be of practical interest. The models presented in this paper should fill the gap. The finite mixture models enjoy a great popularity mainly due to their neat interpretation in terms of cluster analysis. On the converse, the additive models suffer from the same problem of the loglinear models (or any interaction model) in terms of parameters selection and interpretability. Nevertheless, the additive models for ME data seem to be an interesting tool, their main strength lying in two points: First they represent an extremely flexible hierarchical class. In a finite mixture of Markov chains, we must add four parameters for every additional component. In an additive model we can consider any number of parameters ranging from a very simple model (a Markov chain, or even an i.i.d. model) to the saturated model. Moreover, for every fixed number of parameters we have several possible additive models among which we can choose. So, if we check a sufficiently large number of models (possibly all), we would likely find one resulting as preferable to any mixture, both in terms of goodness of fit, and in parsimony of parameters. The second advantage concerns the processing time of the estimation procedure. In fact, the ML estimates of the additive models are immediately obtained since Newton-Raphson methods, or similar devices, have a quadratic speed of convergence. That contrasts with the slowness of the EM algorithm which is a well-known problem. For example in the NLSY dataset the EM for the finite mixtures required several hours to converge and the processing time dramatically increases with the number of components.