A Generalization of the Dirichlet Distribution

This paper discusses a generalization of the Dirichlet distribution, the 'hyperdirichlet', in which various types of incomplete observations may be incorporated. It is conjugate to the multinomial distribution when some observations are censored or grouped. The hyperdirichlet R package is introduced and examples given. A number of statistical tests are performed on the example datasets, which are drawn from diverse disciplines including sports statistics, the sociology of climate change, and psephology.


Introduction
The Dirichlet distribution is conjugate to the multinomial distribution in the following sense.If random variables p = (p 1 , . . ., p k ) satisfy k i=1 p i = 1 and are Dirichlet, that is, they have a prior distribution then if the p i are interpreted as the parameters of a multinomial distribution from which a 1 , . . ., a k independent observations of types 1, . . ., k are made, then the posterior PDF for p will be thus belonging to the same family as the prior, the Dirichlet.
It is convenient to denote the distribution of Equation 1 as D(α) where α = (α 1 , . . ., α k ) is the k-tuple of Dirichlet parameters; thus that of Equation 2 would be D(α + a), where a = (a 1 , . . ., a k ) is the k-tuple of observation counts.
In this paradigm, an observation is informative because it increases the Dirichlet parameter of its category by one.However, an observation may be informative even if it does not belong unambiguously to a single category: Consider making r = r 123 + r 456 censored observations whose exact classes are not observed but r 123 are known to be one of categories 1, 2, or 3, and r 456 are known to be one of categories 4,5, or 6.The posterior would satisfy and is not Dirichlet.Consider now the case where observations are made from a conditional multinomial.Suppose s 123 observations are made whose class is known a priori to be one of 1,2, and 3, and there are s i of class i where i = 1, 2, 3, then the posterior would be also not Dirichlet.These types of observation occur frequently in a wide range of contexts and naturally lead one to consider the following generalization of the Dirichlet distribution: where K is the set of positive integers not exceeding k, ℘(K) is its power set, and F is a function that maps ℘(K) to the real numbers 1 .Here, p i 0 for 1 i k and k i=1 p i = 1.We call this the hyperdirichlet distribution and denote it by H(F).
The first term is there so that defining results in H(F) being identical to D(α).
The distribution appears to have |℘(K)| = 2 k real parameters but the effective number of degrees of freedom is actually 2 k − 2: the first and last parameter correspond to the empty set and the complete set respectively, and so do not affect the PDF.

Normalizing constant
The normalizing factor of the PDF given in equation 5 is given by where p k = 1 − k−1 i=1 p i .This is given by function B() in the package.If the distribution is Dirichlet or Generalized Dirichlet, the closed form expression for the normalizing constant is used.If not, numerical methods are used 2 .
1 Here, a letter in a calligraphic font always denotes a function from ℘(K) to the real numbers; it is a generalization of the vector of parameters α used in the Dirichlet distribution; bold letters (such as p) always denote k-tuples.Taking F as an example, the hyperdirichlet distribution itself is denoted H(F) and its PDF would be f (p; F).
2 Certain special cases of the hyperdirichlet may be manipulated using multivariate polynomials so that closed-form expressions for the normalization constant become available (Altham 2009, page 88).However, further work would be required to translate Altham's insight into workable R idiom (Hankin 2008).
Numerical evaluation of Equation 7 is computationally expensive, especially when k becomes large- Evans and Swartz (2000) and others refer to "the curse of dimensionality" when discussing the difficulty of integrating over spaces of large dimension.
The determination of p-values often requires evaluating integrals of this type, in addition to more computationally demanding integrals such as evaluated in section 3.1 on page 9.The hyperdirichlet package provides functionality to calculate p-values for a wide range of natural hypotheses, but many such calculations are prohibitively time consuming.
An alternative to p-values is furnished by the Method of Support (Edwards 1992), which requires no integration for its calculation; examples of this are provided in Section 3 below.It is anticipated that practitioners using the package will be able to choose between computationally expensive p-value calculation and the much faster assessment provided by the Method of Support.

Moments
where Updating a prior H (F) in the light of observations is straightforward.If an observation i, drawn from a multinomial distribution, is made, then the posterior is H F + S {i} , where If the observation is restricted a priori to be in G ⊆ K, and subsequently specified to be amongst C ⊆ G, then the posterior is H (F + S C − S G ).

Restrictions
Not every F corresponds to a normalizable H(F), that is, a distribution with a finite integral.A sufficient condition is that for all nonempty G ⊆ K, H⊆G F(H) > 0. For example, for k = 4, If F(G) = 0 whenever |G| > 1 then H(F) reduces to a Dirichlet; likewise Equation 10 reduces to the standard Dirichlet restriction α i > 0 for 1 i k.
In this paper I discuss this natural generalization of the Dirichlet distribution and introduce an R (R Development Core Team 2009) package, hyperdirichlet, that provides some numerical functionality.

Generalizations of the Dirichlet distribution
Previous generalizations of the Dirichlet distribution include the work of Bradley and Terry (1952), who considered rank analysis of incomplete designs.In the case of pairs, ranking is equivalent to choosing a winner from two items, their likelihood function would correspond to in current notation (here there are a total of n ij + n ji Bernoulli trials beween player i and player j > i of which n ij are won by player i).This is a special case of Equation 5.
Censored observations, in which the class of an object is specified to be one of a subset of {1, . . ., k}, lead naturally to a likelihood function that is a generalization of Dirichlet's; a survey is given by Paulino (1991).Paulino and de Bragança Pereira (1995) present a comprehensive Bayesian methodology for censored observations and a simplified analysis of their sample dataset is provided exempli gratia in the package, documented under paulino.
A different generalization was presented by Connor and Mosimann (1969), who observed that the Dirichlet distribution was neutral3 and proved that [function gd() in the package] is the most general form of a random variable with neutrality.Wong (1998) extended this work and showed that the generalized Dirichlet distribution was conjugate to a particular type of sampling experiment.

Prior information and the hyperdirichlet distribution
The Bayesian paradigm allows one to use prior information in the form of a prior distribution on the parameters.There are many types of prior information that are expressed in a natural way using the hyperdirichlet distribution and some examples are discussed here.
Consider four tennis players P 1 through P 4 .When P i plays P j with i = j, the result is a single observation from a Bernoulli distribution with parameters 1929), where the p i are the unknown probabilities of victory; we require p i = 1.
A Dirichlet prior would be proportional to 4 i=1 p α i −1 i where α i > 0, but suppose our prior information is that P 1 and P 2 are considerably stronger than P 3 and P 4 (perhaps we know P 1 and P 2 to be strong squash players, and P 3 and P 4 weak badminton players-surely informative about the p i ) but remain ignorant of P 1 's strength relative to P 2 , and of P 3 's strength relative to P 4 .
Then an appropriate prior might be ∝ (p 1 + p 2 ) γ 12 where the magnitude of γ 12 reflects the strength of our prior beliefs.If γ 12 is large, then the probability density is small everywhere except near points with p 1 + p 2 = 1.
The best one could do with a standard Dirichlet prior would be to assign large values for α 1 and α 2 and small values for α 3 and α 4 .But this would have the disadvantage that one would have to have firm beliefs about the relative strengths of P 1 and P 2 , and in particular that a match between P 1 and P 2 would be a Bernoulli trial with unknown probability p, where p is itself drawn from a beta distribution with parameters (α 1 , α 2 ).Thus and one might not have sufficient information to make such an assertion-compare this with a prior ∝ (p 1 + p 2 ) γ 12 in which the density is uniform along lines of constant p 1 + p 2 .
Situations where one has prior information that is not representable with a Dirichlet distribution arise frequently, especially when the identities of the various players are not known.The special case of k = 3 is readily visualized because the system possesses two degress of freedom and the PDF may be plotted on a triangular plot.In the context of the sports estimation problem above, an example of prior information might be that a knowledgeable person observed the players and noted that two were very much stronger than the third; he in fact reported that "the guy with a red shirt got hammered" (West 2008).But whether it was player 2 or player 3 who wore the red shirt is not known; and no information about the relative strengths of the two non-red wearing players is available.Figure 1 shows an example of how observations affect prior information in this case.

Examples
This section presents the hyperdirichlet package in use.Examples drawn from diverse disciplines are given.

Chess
Many attributes of the hyperdirichlet distribution are evident in the simplest non-trivial case, that of k = 3.This case is also facilitated by the fact that, having two degrees of freedom, the distribution may be readily visualized.In addition, the normalization factor is easily evaluated, the integrand having arity two.
Consider Table 1 in which matches between three chess players are tabulated; this dataset has been used by West and Hankin (2008).
The likelihood function is (the symbol 'C' consistently stands for an undetermined constant), and this corresponds to α with α = 0.1 corresponding to one player being known to be weaker than the other two; see how the high-probability region adheres to the edges of the triangle, thus implying that at least one player is weak.(b), posterior PDF following the observation that p 1 beat p 2 7 times out of 10 (note the induced asymmetry between p 1 and p 2 ).(c), prior PDF ∝ , again with α = 0.1, corresponding to p 3 being good and one (but not both) of p 1 or p 2 being good.(d), posterior, again following p 1 beating p 2 7 times out of 10 This dataset is included in the aylmer package; it may be loaded and coerced to an S4 object of class hyperdirichlet: > data("chess") > (w <-as.hyperdirichlet(chess)) Normalizing constant not known thus R object w corresponds to H(W).This simple example shows how a matrix, each row of which corresponds to repeated multinomial trials (here restricted to two outcomes), may be coerced to a hyperdirichlet object.Each output line of the print method corresponds to a subset of {p 1 , p 2 , p 3 }; in columns 1-3, 0 means "not included" and 1 means "included"; thus, for example, the second line shows that Karpov won 22 (=10+12) games overall; and the fourth line shows that Anand and Karpov played 35 games.
The final two columns show the parameters and the powers of the 2 k = 8 subsets respectively.Although these two columns give identical information, having both displayed simultaneously avoids much confusion in practice.
The normalizing constant B is as yet unknown; it is unevaluated by default as its calculation is numerically expensive, especially when k becomes large.The R idiom to calculate B is > (w <-as.hyperdirichlet(w, calculate_NC = TRUE))

Normalizing constant not known
Thus object w now includes the normalizing constant.

Anand
Figure 2: Support function for the three chess players of Table 1.Each player has an associated p, and we demand p 1 + p 2 + p 3 = 1.When player i plays player j = i, the outcome is a Bernoulli trial with parameter p i / (p i + p j ).Each labelled corner corresponds to a canonical basis vector; the top corner, for example, is point (0, 1, 0): Anand wins all games (this point has zero likelihood as the dataset includes games in which Anand lost).Note that the support is unimodal This allows one to test various hypotheses using the standard methodology.For example, consider H 0 : p = 1 3 , 1 3 , 1 3 .The p-value for such a test is the integrated probability density, the integration proceeding over regions 'more extreme' (that is, regions with lower likelihood) than H 0 .The R idiom would be Here, function calculate_B() integrates over the domain of the distribution, but excluding regions where f() returns TRUE.In this case, the integration proceeds over regions of the simplex that are more extreme than H 0 , where a point is held to be 'more extreme' if its likelihood is lower than that of H 0 .The test has a p-value of about 0.395, indicating that there is insufficient evidence to reject H 0 at the 5% level (in practice one would use function probability() which achieves the same result more compactly).This functionality can be applied in a slightly different context.If w is interpreted as a probability density function with domain p = (p 1 , p 2 , p 3 ) where p i = 1, it is straightforward to use the Bayesian paradigm (taking a uniform prior for simplicity) to estimate the probability that p lies within any specified region.For example, the probability that Topalov is indeed a better player than Anand is merely the probability that p ∈ {p|p 1 p 2 }.This is given by which may be evaluated with function probability(): Note that this is not the probability that Topalov would beat Anand in a game.The figure is the posterior probability that the Bernoulli parameter for such a game would exceed 0.5 (recall that uncertain probabilities are held to be random variables in the Bayesian paradigm).
Examples are given below which illustrate inferential techniques that do not require the value of the normalizing constant (or indeed any integral) to be evaluated.

Public perception of climate change
Lay perception of climate change is a complex and interesting process (Moser and Dilling 2007); the issue of immediate practical import is the engagement of non-experts by the use of "icons"4 that illustrate different impacts of climate change.Methodological constraints dictated that each respondent could be presented with a maximum of four icons.Table 2 (dataset icons in the package) shows the experimental results.
One natural hypothesis H F is that there exist p = (p 1 , . . ., p 6 ) with p i = 1 such that the probability of choosing icon i is proportional to p i ; the subscript 'F ' indicates here and elsewhere that the p i may be freely chosen subject to their sum.The Aylmer test (West and Hankin 2008) shows that there is insufficient evidence to reject this hypothesis and we proceed on the assumption that such a p does in fact exist: This is the object of inference.This paper follows Esty (1992), who gives an example drawn from the field of psephology.In his voting model, k choices are evaluated by voters; the object of inference is the set p = (p 1 , . . ., p k ), where k i=1 p i = 1.If the voter has evaluated nominee j, then nominee j is selected with probability p j / p i , where the summation is over all evaluated nominees.
The maximum likelihood estimate for p is obtained straightforwardly in the package using maximum_likelihood() function; numerical techniques must be used because analytical solutions are not generally available 5 .> data("icons") > ic <-as.hyperdirichlet(icons)> (m.free <-maximum_likelihood(ic)) Observe how the first element, NB-corresponding to the Norfolk Broads-is the largest of the six; this is consistent with the sociological arguments presented by O'Neill in which "local" issues dominate more distant concerns (the test took place in Norwich).
One natural line of enquiry is to test whether the finding that the point estimate of NB is the largest of the six is statistically significant.
There seem to be a number of closely related alternative hypotheses.Firstly, one may consider the hypothesis H F : p i = 1, and compare with H 1 : Recalling that the normalizing factor is difficult to calculate, it is possible to use the Method of Support (Edwards 1992).
The maximum support for H 1 is given by the following R idiom; the disallowed argument to maximum_likelihood() prevents the optimization routine searching outside the domain of H 1 .
Observe that the MLE subject to H 1 is on the boundary of admissibility as (to within numerical accuracy) p 1 = 1 6 .The relevant statistic is thus indicating that the support at any point admissible under H 1 may be increased by 2.6 by the expedient of allowing the optimization to proceed freely over the domain of H F .Edwards's criterion of 2 units of support per degree of freedom is thus met and H 1 may be rejected.
Secondly, one might consider H 2 : p i = 1, p 1 > max (p 2 , . . ., p 6 ); thus p 1 is held to be greater than all the others.
Again observe that the MLE lies on the boundary of its restricted hypothesis [p 1 == p 3 ].We have indicating that there is insufficient evidence to reject H 2 : There are points within the region of admissibility of H 2 whence one can gain only a small amount of support (viz.0.0853) by optimizing over the whole of H F .

Low frequency responses
O'Neill argues that the fifth and sixth icons are both considered by her respondents to be "remote" (cf the first, which is definitely local).Thus one might consider H 3 : Thus indicating that the observed low frequencies of respondents choosing OA and WAIS are unlikely to be due to chance, consistent with O'Neill's sociological analysis.
As a final example, consider H 4 : p i = 1, max {p 5 , p 6 } min {p 1 , p 2 , p 3 , p 4 }.This corresponds to an assertion that the maximum of the two distant icons is less than any local icon.The support for this hypothesis is about 3.  3: First five results from a sports league comprising five players, p 1 to p 9 ; dataset volleyball in the package.On any given line, a '1' denotes that that player was on the winning side, a '0' that he was on the losing side, and NA that he did not take part for that game The same techniques can be applied to any dataset in which repeated conditional multinomial observations are made; observe that a numerical value for the normalizing constant is not necessary for this type of inference.

Team sports
Table 3 shows the result of a sports league in which up to n = 9 players compete.A 'game' is a disjoint pair of subsets of K = {1, 2, 3, 4, 5, 6, 7, 8, 9} together with an identification of one of these subsets as the winning side.
Thus the likelihood function for the first two games would be showing that the estimate is quite accurate: Esty (1992) points out that numerical means will find the maximum likelihood estimate easily if the data is irreducible, as here.
One topic frequently of interest in this context is the ranking of the players.On the basis of this point estimate, one might assert that p 1 p 2 p 3 p 4 ; observe that the ranks of the MLE are not correct beyond the fifth, even with the large amount of data used.How strong is the evidence for this ranking?shows that there is no strong statistical evidence to support the assertion that the players are ranked as in the MLE: There exist regions of parameter space with a different ranking for which less than two units of support are lost.

Tennis
The above analysis assumed that the strength of a team is proportional to the sum of the strengths of the players.However, many team sports appear to include an element of team cohesion; Carron, Bray, and Eys (2002) suggest that there is a 'strong relationship' between cohesion and team success.
In the current context, the simplest team is a pair.Doubles tennis appears to be a particularly favourable example: "if the two partners coordinate. . .well, they force their opponents to execute increasingly difficult shots" (Cayer 2004).Note that Cayer's assertion is independent of the individual players' strengths.
The hyperdirichlet distribution affords a direct way of assessing and quantifying such claims, using the likelihood function induced by teams' scorelines directly.Consider Table 4, in which results from repeated doubles tennis matches are shown.The likelihood function is where p i = 1 is understood.Players P 1 and P 2 are known to play together frequently and one might expect them to win more often when they play together than by chance.Indeed, each matching has a scoreline of roughly 50-50, except {P 1 , P 2 } vs {P 3 , P 4 }, which results in a win for {P 1 , P 2 } 9 times out of 11.Is this likely to have arisen if team cohesion is in fact absent?
which formalizes the effectiveness of team cohesion in terms of a 'ghost' player with skill p g who accounts for the additional skill arising when P 1 and P 2 play together; the null is then simply p g = 0.
It is straightforward to apply the method of support.Function maximum_likelihood() takes a zero argument that specifies which components of the p i are to be constrained at zero; here we specify that p g = 0: > data("doubles") > maximum_likelihood(doubles)$support -maximum_likelihood(doubles,zero=5)$support [1] 2.773369 thus one may reject the hypothesis the ghost player has zero strength.The inference is that P 1 and P 2 when playing together are stronger than one would expect on the basis of their performance either in singles matches, or doubles partnering with other players: The scoreline provides strong objective evidence that team cohesion is operating.
This technique may be applied to any of the datasets considered in this paper, and in the context of scorelines the ghost may be any factor whose existence is in doubt.Negative factors (for example, a member of the audience whose presence adversely affects one competitor's performance) may be assessed by recasting the negative effect as a helpful ghost whose skill is added to the opposition's.

Conclusions
The Dirichlet distribution is conjugate to the multinomial distribution.This paper presents a generalization of the Dirichlet distribution which is conjugate to a more general class of observations that arise naturally in a variety of contexts.The distribution is dubbed 'hyperdirichlet' as it is clearly the most general form of its type.
The hyperdirichlet package of R routines for analysis of the distribution is introduced and examples of the package in use are given.
One difficulty in using the distribution is that there does not appear to be a closed-form analytical expression for the normalizing constant; numerical methods must be used.The normalizing constant is difficult to calculate numerically, especially for distributions of large dimension.
The normalizing constant is needed for conventional statistical tests; but its evaluation is not necessary for the Method of Support, which is used to test a wide variety of plausible and interesting hypotheses using datasets drawn from a range of disciplines.
is.proper() in the package tests for normalizability].

Table 1 :
Results of 88 chess matches (dataset chess in the aylmer package) between three Grandmasters; entries show number of games won up to 2001 (draws are discarded).Topalov beats Anand 22-13; Anand beats Karpov 23-12; and Karpov beats Topalov 10-8 a hyperdirichlet distribution, say H(W)

Table 2 :
(O'Neill 2007)esults from O'Neill (2007) (dataset icons in the package): Respondents' choice of 'most concerning' icon of those presented.Thus the first row shows results from respondents presented with icons NB, L, THC, and WAIS; of the 15 respondents, 5 chose NB as the most concerning (see text for a key to the acronyms).Note the "0" in row 6, column 9: This option was available to the 18 respondents of that row, but none of them actually chose WAIS In one study(O'Neill 2007), subjects are presented with a set of icons of climate change and asked to identify which of them they find most concerning.Six icons were used: PB [polar bears, which face extinction through loss of ice floe hunting grounds], NB [the Norfolk Broads, which flood due to intense rainfall events], L [London flooding, as a result of sea level rise], THC [the thermo-haline circulation, which may slow or stop as a result of anthropogenic modification of the water cycle], OA [oceanic acidification as a result of anthropogenic emissions of CO 2 ], and WAIS [the West Antarctic Ice Sheet, which is rapidly calving as a result of climate change].
16, indicating that one may reject H 4 .