Multivariate Generalizations of the Multiplicative Binomial Distribution: Introducing the MM Package

We present two natural generalizations of the multinomial and multivariate binomial distributions, which arise from the multiplicative binomial distribution of Altham (1978). The resulting two distributions are discussed and we introduce an R package, MM, which includes associated functionality.


Introduction
The uses of the binomial and multinomial distributions in statistical modelling are very well understood, with a huge variety of applications and appropriate software, but there are plenty of real-life examples where these simple models are inadequate.In the current paper, we first remind the reader of the two-parameter exponential family generalization of the binomial distribution first introduced by Altham (1978) to allow for over-or under-dispersion: P (Z = j) = n j p j q n−j θ j(n−j) f (p, θ, n) j = 0, . . ., n where n j p j q n−j θ j(n−j) . (2) Here, 0 p 1 is a probability, p + q = 1, and θ > 0 is the new parameter which controls the shape of the distribution; the standard binomial Bi(n, p) is recovered if θ = 1.Altham points out that this distribution is more sharply peaked than the binomial if θ > 1, and more diffuse if θ < 1.As far as we are aware, no other generalization has this type of flexibility; for example, the beta-binomial distribution (Johnson, Kemp, and Kotz 2005) only allows for over-dispersion relative to the corresponding binomial distribution.
We then introduce two different generalizations, both of which are of exponential family form.
We call these: The multivariate multinomial distribution Take non-negative integers y 1 , . . ., y k 0 with y i = y, a fixed integer.Then suppose that the probability mass function of Y 1 , . . ., Y k is where the free parameters are p = (p 1 , . . ., p k ) and θ = θ ij , with p i 0 for 1 i k, p i = 1, and θ ij > 0 for 1 i < j k [these restrictions on i, j understood henceforth].Here C = C (y, p, θ) is a normalization constant.Thus the standard multinomial is recovered if θ ij = 1, and in this case C = 1.

The multivariate multiplicative binomial distribution
For simplicity of notation, we restrict attention to the bivariate case.The proposed frequency function where p i + q i = 1 and x i + z i = m i , i = 1, 2; all parameters are strictly positive and C is again a normalization constant.Thus X 1 , X 2 are independent iff φ = 1.Furthermore, if φ = 1, then θ 1 = θ 2 = 1 corresponds to X 1 , X 2 independent Binomial: X i ∼ Bi (m i , p i ).
We then introduce an R (R Development Core Team 2011) package MM which implements some functionality for these distributions.Because of their simple exponential family forms, both these distributions may be fitted to appropriate count data by using the glm() R function with the Poisson distribution and log link function.This follows from the ingenious result of Lindsey and Mersch (1992), and has the very important consequence that the computationally expensive normalizing constants in Equations 3 and 4 above need never be evaluated.
Both these distributions are clearly exponential family-type distributions (Cox and Hinkley 1974), and may be seen as discrete analogues of the multivariate normal distribution in the following two respects, which we illustrate only for the multiplicative multinomial distribution.
Firstly, suppose we have observed frequencies n (y 1 , . . ., y k ) with y i = y, then the log likelihood of this dataset may be written as n (y 1 , . . ., y k ) log P (y 1 , . . ., y k ) . ( where the summation is over y 1 , . . ., y k 0 with y i = y.This gives rather simple expressions, essentially the sample mean and sample covariance matrix of the vector (y 1 , . . ., y k ), for the minimal sufficient statistics of this exponential family.Hence, by standard theory for exponential families, at the maximum likelihood value of the parameters, the observed and fitted values of the minimal sufficient statistics will agree exactly.
Secondly, as we show later, each of the distributions given in Equations 3 and 4 is reproductive under conditioning on components.

A probabilistic derivation of the Multiplicative Multinomial
It is possible to consider the MM distribution from the perspective of contingency tables, which for simplicity we will carry out for k = 3, y = 4: The general case is notationally challenging.
Our preferred interpretation is drawn from the field of psephology: Consider a household of 4 indistinguishable voters, each of whom votes for exactly one of 3 political parties, say ℘ 1 , ℘ 2 , ℘ 3 .Let y 1 , y 2 , y 3 be the total number of votes obtained from this household for ℘ 1 , ℘ 2 , ℘ 3 respectively, and so y 1 + y 2 + y 3 = 4.
The 4 voters in the household may be considered as corresponding to the rows, columns, layers and the 4 th dimension of a 3 × 3 × 3 × 3 contingency table, with cell probabilities P ijhl for 1 i, j, h, l 3, which we will assume have the following symmetric form where θ rs = θ sr for s, r ∈ {i, j, h, l} [notation is analogous to that used in Equation 8of Altham (1978), with θ written for φ], and without loss of generality θ rr = 1.
The parameters θ ij may be interpreted in terms of conditional cross-ratios.Recalling that P ijhl = P ijlh = . . .= P lhji we have, for example: By enumerating the possible voting results for a given family of size 4, we may find the resulting joint distribution of (Y 1 , Y 2 , Y 3 ), where random variable Y i is the household total of votes for party ℘ i , i = 1, 2, 3.For example, 1 is the probability that all 4 members of the household vote for ℘ 1 .Similarly, 12 is the probability that 3 members of the household vote for ℘ 1 and the remaining 1 member votes for ℘ 2 .This clearly corresponds to the given Multiplicative Multinomial distribution, so C = C .We return to this example with a synthetic dataset in Section 3.1 below.

Marginal and conditional distributions
There does not appear to be an elegant expression for the marginal distribution of (Y 1 , . . ., Y r ) where r < k.However, the multiplicative multinomial behaves 'elegantly' under conditioning on a subset of the variables (Y 1 , . . ., Y k ).For example, where Hence the distribution of Y 1 , conditional on the values of (y 3 , . . ., y k ) is multiplicative binomial in the sense of Altham (1978).Similarly, the distribution of (Y 1 , . . ., Y ν ), conditional on the values of (y ν+1 , . . ., y k ) is multiplicative multinomial in the sense defined above.

The normalization constant
The constant C in Equation 3 must be determined by numerical means: Although this is provided in the MM package [function NormC()], it is computationally expensive and difficult to evaluate for all but small k and y.
1.4.The Poisson method for fitting distributions with an intractable normalizing constant The parameters (p, θ) may be estimated without determining the normalizing constant C by transforming the problem into a Generalized Linear Model.The method presented here follows Lindsey and Mersch (1992); for simplicity of notation we take k = 3. Equation 3 is equivalent to where offset [y 1 , y 2 , y 3 ] = − log (y i !) accounts for the multinomial term on the right hand side.The log-likelihood L of the dataset n (y 1 , y 2 , y 3 ) is given by L = y 1 +y 2 +y 3 =y n (y 1 , y 2 , y 3 ) log P (y 1 , y 2 , y 3 ) .
1 The distribution of independent Poisson random variables conditional on their total is multinomial with probabilities equal to the scaled Poisson parameters.If Xi ∼ Po(λi), then elementary considerations show , the right hand side being recognisable as a multinomial distribution.Given that the distribution is of the exponential family, it is the case that n = λi, the normalizing constant is not needed.

Multivariate multiplicative binomial
Considering the bivariate case for simplicity, suppose (X 1 , X 2 ) to be non-negative integers not exceeding known fixed maxima m 1 , m 2 respectively.
We introduce a 5-parameter distribution of exponential family form.In common with the multiplicative binomial, it has the property that at the maximum likelihood values of these parameters, the observed and fitted values of the means of X i , i = 1, 2 will agree exactly, and similarly for the observed and fitted values of the covariance matrix of X 1 , X 2 .This distribution is easy to fit to frequency data (again using the Lindsey and Mersch Poisson device).The distribution has some nice properties, but there do not appear to be simple formulae for its moments.
The proposed frequency function where p i + q i = 1 and x i + z i = m i ; all parameters are strictly positive.
Here, C is the normalization constant: Although there does not seem to be a simple expression for the correlation between X 1 and X 2 , it is easily seen that φ controls their interdependence in a likelihood ratio fashion, with Indeed, following Lehmann (1966) we can prove a much stronger statement: The variables X 1 , X 2 are positive monotone likelihood ratio dependent2 for φ > 1, negative if φ < 1.
The conditional distribution is again of multiplicative binomial form, since we can write

The MM package
The MM package associated with this article provides R functionality for assessing the multiplicative multinomial and multivariate binomial.We have provided user-friendly wrappers to expedite use of the distributions in a data analysis setting.
The MM package uses an object-oriented approach: The set of free parameters (one vector, one upper-diagonal matrix) is not a standard R object, and is defined to be an object of S4 class paras.The objects thus defined are user-transparent and a number of manipulation methods are provided in the package.
For example, consider Equation 3 with k = 5 and p i = 1 5 , 1 i 5 and θ ij = 2 for 1 i < j 5.This distribution would be underdispersed compared with the corresponding multinomial.It is straightforward to create an object corresponding to the parameters for this distribution using the package: See how closely clustered the sample is around its mean of (4,4,4,4,4); compare the wider dispersion of the multinomial: ] 6 4 3 4 3  [2,] 4 3 5 4 4  [3,] 5 7 2 4  Thus sample2 is drawn from the classical multinomial.It is then straightforward to perform a likelihood ratio test on, say, sample1: Function MM_allsamesum() calculates the log likelihood for a specific parameter object (in this case, pm1 and pm2 respectively) and we see that, for sample1, hypothesis pm1 is preferable to pm2 on the grounds of a likelihood ratio of about Λ = 0.47 × 10 −6 , corresponding to 14.56 units of support.This would exceed the two units of support criterion suggested by Edwards (1992) and we could reject pm2.Alternatively, we could observe that −2 log Λ is in the tail region of its asymptotic distribution, χ 2 1 .The package includes a comprehensive suite of functionality which is documented through the R help system and accessible by typing library(help="MM") at the command prompt.

The package in use
The package comes with a number of datasets, four of which are illustrated here.We begin with a small synthetic dataset, then consider data taken from the social sciences, previously analyzed by Wilson (1989); analyze some pollen counts considered by Mosimann (1962) in the context of palaeoclimatology; and finally assess a marketing science dataset.

Synthetic voting dataset
We begin with a small synthetic dataset which is simple enough to illustrate the salient aspects of the multiplicative multinomial distribution, and the MM package.This dataset arises from 96 households each of size 4, in which each member of the household is noted as voting Lib, Con or Lab respectively.We take n(•, •, •) as the voting tally; thus n(0, 0, 4) = 5 [the first line] means that there are exactly 5 households in which all 4 members vote Labour; similarly n(0, 1, 3) = 8 means that there are exactly 8 households in which 1 member votes Conservative and the remaining 3 vote Labour.
R> data("voting") R> cbind(voting,voting_tally) One natural hypothesis is that the data are drawn from a multinomial distribution (alternative hypotheses might recognize that individuals within a given household may be non-independent of each other in their voting).
The multinomial hypothesis may be assessed using glm() following Lindsey and Mersch (1992) but without the interaction terms: Thus the model fails to fit (77.315 being much larger than the corresponding degrees of freedom, 12).This is because the observed frequencies of the cells in which all members of the household vote for the same party (namely for rows 1, 5 and 15 of the data) greatly exceed the corresponding expected numbers under the simple multinomial model.The next step is to take account of the fact that individuals within a given household may be non-independent of each other in their voting intentions (and may indeed tend to disagree with each other rather than all vote the same way).Positive dependence between individuals in a household could be modelled by the Dirichlet-multinomial distribution (Mosimann 1962), but by using the Multiplicative Multinomial introduced here, we are allowing dependence between individuals in a household to be positive or negative.
The MM parameters may be estimated, again following Lindsey and Mersch (1992)  Observe that the MLEs of p [viz (0.367, 0.315, 0.318)] are obtained as proportional to the exponential of the estimated regression coefficients: (e 1.375 , e 1.222 , e 1.231 ), normalized to add to 1.This model is quite a good fit in the sense that the null deviance 11.5 is not in the tail region of χ 2 9 , its null distribution; it can be seen that all 3 interaction parameters are significant and, for example, θ23 = 0.652 = exp (−0.428).

Housing satisfaction data
We now consider a small dataset taken from Table 1 of Wilson (1989), who analyzed the datset in the context of overdispersion (Wilson himself took the dataset from Table 1 of Brier (1980)).
In a non-metropolitan area, there were 18 independent neighbourhoods each of 5 households, and each household gave its response concerning its personal satisfaction with their home.The allowable responses were 'unsatisfied' (US), 'satisfied' (S), and 'very satisfied' (VS).
R> data("wilson") R> head(non_met) Thus the first neighbourhood had three households responding US, two reporting S, and zero reporting VS; the second neighbourhood had the same reporting pattern.
The MM parameters may be estimated, again following Lindsey and Mersch (1992)  Thus in this dataset, only the first interaction parameter US:S is significant.This might be interpreted as an absence of VS responses coupled with a broader than expected spread split between US and S. Note that the residual deviance is now less than the corresponding degrees of freedom.
In this case, the three categories US, S, and VS are ordered, a feature which is not used in the present approach.It is not clear at this stage how we could best include information about such ordering into our analysis.
We now check agreement of the observed and expected sufficient statistics: The summary() method gives normalized statistics so that, for example, the row_sums total y = 5.This may be compared with the expectation of the maximum likelihood MM distribution:

Mosimann's forest pollen dataset
Palynology offers a unique perspective on palaeoclimate; pollen is durable, easily identified, and informative about climate (Faegri and Iversen 1992).We now consider a dataset collected in the context of palaeoclimate (Sears and Clisby 1955;Clisby and Sears 1955), and further analyzed by Mosimann (1962).
We consider a dataset taken from the Bellas Artes core from the Valley of Mexico (Clisby and Sears 1955, Table 2); details of the site are given by Foreman (1955).The dataset comprises a matrix with N = 73 observations, each representing a depth in the core, and k = 4 columns, each representing a different type of pollen.We follow Mosimann in assuming that the 73 observations are independent, and in restricting the analysis to depths at which a full complement of 100 grains were identified.R> data("pollen") R> pollen <-as.data.frame(pollen)R> head(pollen) Thus each row is constructed to sum to 100, and there are 4 distinct types of pollen; hence in our notation y = 100 and k = 4.
Observe that this dataset, in common with the housing satisfaction data considered above, has to be coerced to histogram form; but this time the numbers are larger.The partitions package (Hankin 2006) uses generating functions to determine that there are exactly R> S(rep(100,4),100) [1] 176851 possible non-negative integer solutions to y 1 + y 2 + y 3 + y 4 = 100 (most of these have zero observed count).Each of these solutions must be generated and this is achieved using the compositions() function of the partitions package (Hankin 2007).
First we repeat some of Mosimann's calculations as a check.Using the ordinary multinomial distribution Mn(y, p), we find p to be R> p.hat <-colSums(pollen)/sum(pollen)

Pinus
Abies qqqqqqqqqqqq q q q q q q q q q q q q q q q q q q q q q qqqqq 70 75 80 85 90 95 100 0.00 0.04 0.08 0.12 Pinus qqqqqq qq qq q q q q q q q q q q q q q q q qq q q q q q q q q q qq q q q q q q q q q q q q q 0 2 4 6 8 10 0.00 0.10 0.20 0.30 Abies probability q q q q q q q q q q q q q q q multinomial MM q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q 0 5 10 15 20 0.00 0.04 0.08 0.12 Quercus q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q 0 2 4 6 8 10 0.00 0.05 0.10 0.15 0.20 0.25 Alnus probability q q q q q q q q q q q q q Figure 1: Marginal frequency distributions for numbers of each of four pollen types based on the multinomial distribution (black) and the multiplicative multinomial (red).
The Thus we arrive at an apparently rather symmetrical set of parameter estimates (in the sense of the elements of p being close to one another, and the elements of θ being close to unity).For this dataset, we observe that the asymptotic distribution of the residual deviance, χ 2 176841 , is not a good approximation for its actual distribution.This is because the frequency data is overwhelmingly comprised of zeros, with only 66 nonzero frequencies amongst the 176851 compositions.Function glm() tenders a warning to this effect.

Marketing science: An example of the multivariate multiplicative binomial
We now illustrate the multivariate multiplicative binomial with an example drawn from the field of economics.Danaher and Hardie (2005) considered a dataset obtained from a sample of N = 548 households over four consecutive store trips.For each household, they counted the total number of egg purchases in their four eligible shopping trips, and the total number of bacon purchases for the same trips; the hypothesis was that egg consumption was correlated with bacon consumption.and glm() gives a good fit in the sense that the residual deviance of 18.666 is compatible with its asymptotic null distribution χ 2 19 .The bacon:eggs coefficient (ie log φ = 0.3006) gives φ = e 0.3006 = 1.3507, showing strong positive association.
We can now verify that the expected (marginal) number of egg purchases and bacon purchases under the ML distribution match the observed.The first step is to create M, the expected contingency matrix:

Suggestions for further work
The multiplicative multinomial is readily generalized to a distribution with 2 k − 1 parameters: P (y 1 , . . ., y k ) = y y 1 . . .y k S⊆ [k] (Θ S ) i∈S y i (18) where [k] = {1, 2, . . ., k} is the set of all strictly positive integers not exceeding k.Here, the parameters are indexed by a subset of [k]; it is interesting to note that Θ ∅ formally represents the normalization constant C. In this notation, Equation 3becomes Equation 18 leads to a distribution of the exponential family type; but interpretation of the parameters is difficult, and further work would be needed to establish the usefulness of this extension.
Further, Equation 4 generalizes to t i=1 m i x i z i p x i i q z i i θ x i z i i i<j It is possible to generalize the equations in a slightly different way.Consider an r × c matrix n with entries n ij and fixed marginal totals.Now suppose that each row of n comprises independent observations from a multinomial distribution with probabilities p 1 , . . ., p r , and likewise the columns are multinomial q 1 , . . ., q c : This is the null of Fisher's exact test.Then one natural probability measure would be (the fixed known row-and column-sums mean that p i and q j , and the marginal multinomial terms, are absorbed into the normalizing constant C).With a slight abuse of notation this can be written where Θ governs row-wise departures from multinomial and Φ governs column-wise departures; there are a total of r(r − 1)/2 + c(c − 1)/2 free parameters.
The Poisson device of Lindsey and Mersch is again applicable, with the difference that the enumeration carried out by compositions() is replaced by enumeration of contingency tables with the correct marginal totals: Function allboards() of the aylmer package (West and Hankin 2008).A simple example is given under help(sweets).
As pointed out by an anonymous referee, it might be possible to extend either or both of the new distributions to the context of regression on covariates.

Conclusions
In this paper, we considered natural generalizations of the multiplicative binomial distribution to the multivariate case.The resulting distributions have a number of desirable features, including a more precise control over the variance than the multinomial, and a straightforward interpretation in terms of contingency tables.
The distributions belong to the exponential family; this makes fast calculation of maximum likelihood estimates possible using generalized linear model techniques; in R idiom, the glm() function is used.
Novel analyses are presented on data drawn from the fields of social science and palaeoclimatology.
NA NA 2 2 d NA NA NA NA 2 e NA NA NA NA NA Now we may sample repeatedly from the distribution (sampling is quick because it does not require evaluation of the normalization constant).Consider y = 20: R> set.seed(0)R> (sample1 <-rMM(n=10, Y=20, paras=pm1)) but this time admitting first-order interaction, using bespoke function Lindsey(): summary(suffstats(wilson)) We have corrected what seems to be a small typo in the 4th column of this matrix in Mosimann's Table2).It is particularly striking that the data show positive correlations for 3 entries.Such positive correlations could never arise from the Dirichlet-multinomial distribution, but they will be exactly matched by our new multiplicative multinomial model.The full sample covariance matrix for the dataset is The dataset is provided with the MM package: Thus 16 households purchased eggs twice and bacon once (this is Danaher and Hardie's Table1).The purchases of eggs and bacon are not independent 3 and we suggest fitting this data to the distribution given in Equation4; here m 1 = m 2 = 4.The Poisson device of Lindsey and Mersch is again applicable: