Partition structure and the A-hypergeometric distribution associated with the rational normal curve

A distribution whose normalization constant is an A-hypergeometric polynomial is called an A-hypergeometric distribution. Such a distribution is in turn a generalization of the generalized hypergeometric distribution on the contingency tables with fixed marginal sums. In this paper, we will see that an A-hypergeometric distribution with a homogeneous matrix of two rows, especially, that associated with the rational normal curve, appears in inferences involving exchangeable partition structures. An exact sampling algorithm is presented for the general (any number of rows) A-hypergeometric distributions. Then, the maximum likelihood estimation of the A-hypergeometric distribution associated with the rational normal curve, which is an algebraic exponential family, is discussed. The information geometry of the Newton polytope is useful for analyzing the full and the curved exponential family. Algebraic methods are provided for evaluating the A-hypergeometric polynomials.


Introduction
The A-hypergeometric function introduced by Gel'fand, Kapranov, and Zelevinsky [1] is a solution of the A-hypergeometric system of partial differential equations. The series solution around the origin is called the A-hypergeometric polynomial. Takayama et al. [2] called a distribution whose normalization constant is an A-hypergeometric polynomial as an A-hypergeometric distribution. Such a distribution is in turn a generalization of the generalized hypergeometric 1 The Institute of Statistical Mathematics, 10-3 Midori-cho, Tachikawa, Tokyo, 190-8562, Japan; E-mail: smano@ism.ac.jp 1 distribution on the contingency tables with fixed marginal sums, and so is of interest in algebraic statistics and information geometry. In this paper, we will see that this framework with a homogeneous matrix A of two rows helps inferences involving exchangeable partition structures.
Exchangeable partition structures appear in count data modeling and sampling theory, and play important roles in Bayesian statistics (see, e.g., [3,4,5,6]). They have been studied in the context of combinatorial stochastic processes (see , e.g., [7,8,9]). Thanks to known results on the A-hypergeometric system with a homogeneous matrix A of two rows in the contexts of commutative algebra and algebraic geometry, explicit results can be obtained and performance of computational methods can be examined accurately.
This paper is organized as follows. In Section 2, we will see that the A-hypergeometric system with a homogeneous matrix A of two rows is associated with an algebraic curve known as a monomial curve. The unique polynomial solutions of this A-hypergeometric system are constant multiples of the A-hypergeometric polynomial. In particular, for the A-hypergeometric system associated with a special monomial curve, called the rational normal curve, the Ahypergeometric polynomial is a constant multiple of the associated partial Bell polynomial, which was recently defined by the author [10].
In the following three sections, we discuss statistical applications. In Section 3, the computational aspects of similar tests that involve A-hypergeometric distributions will be discussed. As an alternative to the Markov chain Monte Carlo with moves by a Markov basis, an exact sampling algorithm for general (any number of rows) A-hypergeometric distributions is presented.
The algorithm is demonstrated in a goodness of fit test of a Poisson regression. Section 4 sheds light on a connection with exchangeable partition probability functions (EPPFs). The Ahypergeometric distribution associated with the rational normal curve appears as the conditional distribution of a general class of EPPFs given the sufficient statistics. In Section 5, the maximum likelihood estimation of the A-hypergeometric distribution will be discussed. The information geometry of the Newton polytope of the A-hypergeometric polynomial works effectively. The p.m.f. (probability mass function) is an algebraic exponential family. From geometric properties of the Newton polytope, we see an interesting observation (Theorem 5.1): the maximum likelihood estimator (MLE) of the full exponential family for a count vector does not exist with probability one. So, we consider a sample consisting of multiple count vectors and/or curved exponential families. Gradient-based methods to evaluate the MLE will be discussed. They are demonstrated in a problem associated with an EPPF that appears in an empirical Bayes approach.
All the above applications demand practical methods for evaluating the A-hypergeometric polynomials associated with the rational normal curve. Section 6 is devoted to tackling this issue. The A-hypergeometric polynomials satisfy a recurrence relation that comes from the enumerative combinatorial structure of partial Bell polynomials. Use of the recurrence relation is a method for evaluating the A-hypergeometric polynomials. Lemma 6.1 gives an explicit expression for a system of contiguity relations among the A-hypergeometric polynomials, called the Pfaffian system. By virtue of this explicit expression, alternative algebraic methods for evaluating the A-hypergeometric polynomials are presented. They are examples of methods called the holonomic gradient methods (HGMs) [11,12,13]. Roughly speaking, the difference HGM demands less computational cost, while the recurrence relation gives more accurate estimates. The performance of these methods are compared in applications to evaluating specific A-hypergeometric polynomials. If n − k is large, no method is feasible and asymptotic approximations are inevitable instead. The accuracy of known asymptotic form and that obtained by the method developed by Takayama et al. [2] are compared.

Partial Bell polynomials as A-hypergeometric polynomials
In this section, we will see that the unique polynomial solution of the A-hypergeometric system associated with the rational normal curve is a constant multiple of the associated partial Bell polynomials. The standard monomials for the left ideal of the A-hypergeometric system will be presented. They are useful for evaluating the A-hypergeometric polynomials.
Consider a partition of positive integer n with k positive integers: n = n 1 + · · · + n k . Here {n 1 , ..., n k } is a multiset. Support of the p.m.f. can be represented by the set of multiplicities s j := |{i : n i = j}|, j ∈ {1, ..., n}, that is, S n,k := (s 1 , ..., s n ) : This count vector (s 1 , ..., s n ) is the main concern of this paper. Let us call it the size index, following a terminology introduced by Sibuya [14]. The partial Bell polynomials are defined on the support (2.1) with a sequence of non-negative numbers w 1 , w 2 , ... [15]: with the convention B 0,k (w · ) = δ 0,k . The author has defined associated versions of the partial Bell polynomials [10]. They are generalizations of the partial Bell polynomials and come from setting restrictions on the support. The associated partial Bell polynomials are partial Bell polynomials with some terms of the non-negative sequence set to zero. Defining the associated versions is useful for the following discussion.
Definition 2.1 ( [10]). Consider a partial Bell polynomial B n,k (w) that is defined by an infinite sequence of non-negative numbers w 1 , w 2 , .... The associated partial Bell polynomials are defined as follows.
with the conventions B n,k,(r) (w) = 0, n < rk, B (r) n,k (w) = 0, n < k, n > rk, and B (r) n,k (w) = B n,k (w), n ≤ r + k − 1. The supports S n,k,(r) and S (r) n,k are defined as S n,k,(r) := (s 1 , ..., s n ) : The associated partial Bell polynomials (2.3) and (2.4) are represented by another partial Bell polynomial or as a linear combination of other partial Bell polynomials [10]. For later discussion, we present the following fact, which was not presented in [10].
Proposition 2.2. The associated partial Bell polynomial (2.3) can be represented by the following partial Bell polynomial: Here, symbols for factorials ( used. In addition, w ·+r−1 /(· + 1) r−1 means that the sequence w 1 , w 2 , ... in the definition of the Proof. Take (r − 1) elements for each cluster. Then the total number of remaining elements is n − (r − 1)k. The cluster sizes of the partition of the remaining elements into k clusters are free from restrictions. Denoting s j = t j−r+1 in (2.3), we have which is the assertion.
The associated partial Bell polynomials satisfy the following recurrence relation that comes from the enumerative combinatorial structure of the partial Bell polynomials.

Proposition 2.3 ([10]
). The partial Bell polynomials and the associated partial Bell polynomials Here, a ∨ b := max{a, b} and a ∧ b := min{a, b}.
The Weyl algebra of dimension m is the free associative C-algebra D m = C x 1 , ..., x m , ∂ 1 , ..., ∂ m modulo the commutation rules Let I be a left ideal in D m . It is known that the set of standard monomials of a Gröbner basis of I is a basis of the factor ring D m /I, which is a vector space of C(x 1 , ..., x m ). If I is a zerodimensional ideal, D m /I is finite dimensional. If a holomorphic function f satisfies a system of differential equations L•f = 0, L ∈ I, f is called a holonomic function. Gel'fand, Kapranov, and Zelevinsky [1] defined a class of holonomic functions known as GKZ-hypergeometric functions, which are also referred to as A-hypergeometric functions.
Definition 2.4. Let A be an integer-valued d× m-matrix of rank d, and fix a vector b ∈ C d . The A-hypergeometric system H A (b) is the following system of linear partial differential equations for an indeterminate function f (x): as a left ideal in the Weyl algebra D m . We call it the A-hypergeometric ideal. The second group of annihilators generates the toric ideal I A of A.
The series representation of the A-hypergeometric function around the origin, namely is called the A-hypergeometric polynomial. We set Z A (b; x) := 0 if b / ∈ A · N m as the convention.
For the associated partial Bell polynomial B (r) n,k (w), definition (2.4) is identical to n! times the A-hypergeometric polynomial with (2.9) A = 0 1 2 · · · (r − 1) ∧ (n − k) and the indeterminants are identified as . The indeterminants will be parameters in statistical contexts. For the associated partial Bell polynomial B n,k,(r) (w), the identity is not evident. However, identity (2.5) leads to an expression as the partial Bell polynomial, which is identical to n! times the A-hypergeometric polynomial with The indeterminants are identified as In general, a homogeneous matrix of two rows generates integer partitions. Let 0 < i 1 < i 2 < · · · < i m−1 be relatively prime integers (the greatest common divisor is one). Without loss of generality, we may assume The convex hull of the column vectors is a one-dimensional polytope, whose volume vol(A) is i m−1 . The toric ideal I A determines a degree i m−1 monomial curve in the projective space P m−1 . The monomial curve is normal if and only if i m−1 = m − 1. In this case, the monomial curve is the embedding of P 1 in P m−1 and called the rational normal curve; for background, see, e.g., [16]. The indeterminants of the A-hypergeometric system are identified as x j = w j /j!, j ∈ {1, i 1 + 1, ..., i m−1 + 1}, and the support is a set of integer partitions which is not empty if and only if b ∈ NA, where NA is the monoid spanned by the column vectors of A, which generate a lattice of N 2 . In this paper, we will focus on the A-hypergeometric systems associated with the rational normal curve, because they arise naturally in statistical applications.
Theories around the A-hypergeometric system with a homogeneous matrix A of two rows are well developed [17,18]. It is straightforward to see that Lemma 1.3 of [17] gives the following fact. The Buchberger algorithm and the elimination theory provide a method for computing the reduced Gröbner basis of the toric ideal I A (Algorithm 4.5 of [19]). The minimum fiber Markov basis associated with the toric ideal I A with the matrix (2.10) was obtained by [20]. In this paper, we will use the following minimal Gröbner basis. It is streightforward to obtain the reduced Gröbner basis from a minimal Markov basis. Throughout the present paper, we fix the term order as reverse lexicographic with ∂ 1 ≻ ∂ 2 ≻ · · · ≻ ∂ n−k+1 . The result is as follows.
Proposition 2.6. A minimal Gröbner basis of the toric ideal I A , where the matrix A is of the The standard monomials give solution bases of the A-hypergeometric system, and the cardinality is called the holonomic rank. The set of standard monomials are as follows.
Proposition 2.7. For a matrix A of the form (2.10) with i m−1 = m − 1 ≥ 2 and any vector b ∈ C 2 , the totality of the standard monomials of the initial ideal of the A-hypergeometric ideal Proof. For the annihilators (2.6), we have in ≺ (L 1 ) = ∂ 2 and in ≺ (L 2 ) = ∂ 1 . It follows from Proposition 2.6 that the initial ideal for the minimal Gröbner basis of the toric ideal I A is Therefore, the totality of the standard monomial of the A- [17] or Theorem 4.2.4 of [18]). Therefore, {1, ∂ i : 3 ≤ i ≤ m} is the totality of the standard monomials.
Before closing this section, let us see a connection between the exponential structures in enumerative combinatorics [21] and the A-hypergeometric system associated with the rational normal curve. The exponential structure is characterized by the exponential generating function P (z) of the number of possible structures p(n) satisfying with the convention p(0) = 1. Because B n,k (w) is the number of possible structures whose number of clusters is k, we have p(n) = n k=1 B n,k (w) (the Bell polynomial). Therefore, By an argument on the de Rham cohomology, the hypergeometric ideal H A (b) eliminates the A-hypergeometric integral [18]. For d = 2, the integral is Taking cycle C belonging to the homology group H 1 (z ∈ C\{0}|f (z, x) = 0) gives a solution basis (the inverse is not always true [1]). Suppose the matrix A and vector b are as in (2.9) with r = n. Letting C be a small cycle around the origin yields the residue of the origin: Comparing with (2.11) shows that this integral is a constant multiple of the partial Bell polynomial. Other solution bases do not have such integral representation, nevertheless, they can be obtained by perturbations of b (see Example 6.2).
The exponential structure is a facet of the A-hypergeometric system associated with the rational normal curve. Considering the exponential structure in the theory of the A-hypergeometric system provides us broader viewpoint than that given by the enumerative combinatorics. Section 6 will show that the framework in terms of the A-hypergeometric system gives us methods for evaluating the A-hypergeometric polynomials other than the method using the recurrence relation that comes from the enumerative combinatorial structure of partial Bell polynomials.
However, formulations in terms of the A-hypergeometric system sometimes involve unwanted generality. In Section 6, we will see that properties specific to the A-hypergeometric polynomial are helpful for avoiding difficulties caused by the unwanted generality in evaluating of the A-hypergeometric polynomials.

Samplers for similar tests
As the generalized hypergeometric distribution on the contingency table with fixed marginal sums, the A-hypergeometric distribution appears as a conditional distribution of some model with given sufficient statistics. For such a case, a similar test can be conducted with the conditional distribution [22] with the aid of samplers from the conditional distribution. Constructing a sampler with algebraic constraints has been one of the motivating problems in algebraic statistics to date [23] (recent developments in this line of research can be found in [24]). In this section, we will discuss the computational aspects of samplers for the A-hypergeometric distribution.
An exact sampling algorithm is proposed for general (any number of rows) A-hypergeometric distributions. Then an application to the A-hypergeometric distribution associated with the rational normal curve is presented.
Suppose we have a model whose conditional distribution given the sufficient statistics is the A-hypergeometric distribution, namely where A is an integer valued d × m-matrix of rank d, b ∈ C d , and c ∈ N m is a count vector of m categories with c 1 + · · · + c m = k. For a similar test of hypothesis H 0 : x = x 0 , consider using the probability function q(c; x) as the test statistic. The significance probability of the observation where C follows the A-hypergeometric distribution with parameter x 0 . To estimate the significance probability, we need an unbiased sampler from the A-hypergeometric distribution. called a Markov basis [23]. The move has one-to-one correspondence to the binomial ideal of In uses of an MCMC sampler, we must assess the convergence of the chain to the target distribution to guarantee that a sample is taken from the target distribution. However, such assessment is not always easy. In contrast to MCMC, the following algorithm can sample from the target distribution exactly. The cost we must pay is to evaluate the A-hypergeometric polynomials. This type of algorithm was proposed for a test that appeared in genetics and that involves an exchangeable partition probability function [25]. However, the following algorithm can apply to general A-hypergeometric distributions.
Algorithm 3.1. A count vector c with c 1 + · · · + c m = k is sampled from the A-hypergeometric distribution (3.1) by the following steps. Let I i ∈ {1, ..., m} be the indicator of the category of the i-th observation of the sample of size k, where c i = |{j : I j = i}|.
Let us consider an application of Algorithm 3.1 to the A-hypergeometric distribution associated with the rational normal curve. The p.m.f. is where the matrix A is given in (2.9). Now the count vector c is the size index s, and we put m = r ∧ (n − k + 1). We assume m ≥ 3 and k ≥ 2, since otherwise the sampling is trivial. A Markov basis and the Metropolis-Hastings ratio are as follows.
For j > i + 2, the Metropolis-Hastings ratio for the move from state s to state s + ǫz is Proof. By virtue of Theorem 3.1 of [23] the minimal Gröbner basis G A of the toric ideal I A , which is given by Proposition 2.6 while replacing ∂ i by x i , is a Markov basis. The Metropolis-Hastings ratio follows by a simple calculation.
Example 3.3. The data set considered is from [26] and concerns a goodness of fit of a regression model to effect of an insecticide. Consider the univariate Poisson regression with m levels of a covariate. The means that µ i , i ∈ {1, ..., m} of independent Poisson random variables S i were modeled as log µ i = α + βi. The sufficient statistics are the sample size k = m i=1 s i and the sum of the levels n = m i=1 is i . The conditional distribution given the sufficient statistics is the Ahypergeometric distribution (3.3) with r = m and x i = 1/i!, i ≥ 1. A chemical to control insects is sprayed on successive equally infested plots in increasing concentrations 1, 2, 3, 4, 5 (in some units). After the spraying the number of insects left alive on the plots are (s 1 , s 2 , s 3 , s 4 , s 5 ) = (44,25,21,19,11). k = 120 and n = 288. The similar test tells us how well the model fit to the data. An estimate of the significance probability of the χ 2 -statistic based on 900, 000 samples from the exact sampler (Algorithm 3.1) was 0.0258. To evaluate the A-hypergeometric polynomials, the recurrence relation in Proposition 2.3 was employed. This should be close to the true value. For the MCMC, an estimate based on a walk of 90, 000 steps (with the initial 10, 000 steps having been discarded to avoid sampling from the un-converged part of the chain, as was done in [26]) was 0.0231. We can say that the MCMC sampling scheme gives a reasonable estimate. This can be confirmed with the histograms shown in Figure 1. The histogram obtained by the MCMC sampler is fairly close to that obtained by the exact sampler.

Exchangeable partition probability functions
Chapter 1 of [9] is an extensive survey of the relationship between the partial Bell polynomials and the exchangeable partition probability functions (EPPFs). A typical application of EPPFs is in Bayesian statistics. For a multinomial sampling from a prior distribution, the marginal likelihood of a sample is an EPPF (see Example 4.3). In the context of Bayesian nonparametrics, a prior process characterized by an EPPF is called a species sampling prior [4,27]. In this section, we will see that the conditional distribution of a general class of EPPFs is the A-hypergeometric distribution associated with the rational normal curve.
Label each observation of a sample of size n with a positive integer, and consider a probability law on partitions of the set {1, 2, ..., n}. If we assume exchangeability, then cluster sizes are our concern. Hence, we consider a probability law on a set of integers whose sum is a positive integer n. Following Aldous [7], let us call such a probability law a random partition. We say that a random partition Π n is exchangeable if a symmetric function p n on a set of partitions of an integer n satisfies for a partition of {1, 2, ..., n} to be arbitrary k clusters {A 1 , ..., A k }. This p.m.f. p n is called an EPPF.
Let us consider a class of EPPFs that have a multiplicative form, namely The support is given by partitions of a fixed positive integer n with k positive integers. The parameters are two sequences of positive numbers (v n,k ) and (w i ), 1 ≤ i, k ≤ n. This EPPF is an example of multiplicative measures, which were studied by Vershik [28] as a model of statistical mechanics. Here, a cluster of size i has w i different microscopic structures. In terms of the size index, we have .., n − k + 1} and the support S n,k is given in (2.1). The number of clusters |Π n | is the sufficient statistic for v and is distributed as where B n,k (w) is the partial Bell polynomial. The conditional distribution is In [9,28], this p.m.f. was referred to as the microcanonical Gibbs distribution. If we consider logarithms of x are natural parameters, this is an exponential family. Moreover, this is the Ahypergeometric distribution, since the partial Bell polynomial is n! times the A-hypergeometric polynomial associated with the rational normal curve, where the matrix A and vector b are given Proof. Sufficiency is obvious by the factorization theorem. For completeness, assume for a function f (·) that the number of clusters |Π n | satisfies E(f (|Π n |)) = 0 for arbitrary v. Choose arbitrary k 0 in {1, ..., n} and fix the parameter as v n,k = δ k,k 0 (Z n,k 0 (w)) −1 . Then we have The following proposition is a generalization of Theorem 2.5 of [29].
Proof. The conditional distribution (4.2) yields where the multiplicity vector s ′ is s ′ j = s j , j = i and s ′ i = s i − r i . The joint moments are derived in the same manner. The Lehmann-Scheffé theorem [30] gives the assertion.
The sequences v and x may be parametrized by a few parameters. An important parametriza- Gnedin and Pitman [31] showed that an EPPF has infinite exchangeability if and only if x has this parametrization. Such a multiplicative measure is called the Gibbs random partition.
The Gibbs random partition characterizes an important class of prior processes in Bayesian nonparametrics [4]. The Gibbs random partition is the marginal likelihood of a sample taken from the prior process (see Example 4.3). The two-parameter Dirichlet process, which is also called the Pitman-Yor process [32,33], is a popular prior process in Bayesian nonparametrics [4]. The Pitman random partition [34] is a member of the Gibbs random partitions, which is the marginal likelihood for the two-parameter Dirichlet process. For nonzero α the partial Bell polynomial has the form Here, C(n, k; α) is the generalized factorial coefficient, which satisfies For α = 0, the partial Bell polynomial is the unsigned Stirling number of the first kind.
Estimating the number of unseen species is an intriguing classical problem (recent progress can be found in, for example, [35,36]). An empirical Bayes approach is as follows [29]. Here, k := |{i : n i > 0}| is the number of observed species. This EPPF is the Dirichletmultinomial or the negative hypergeometric distribution. In terms of the size index, we have As this expression shows, the Dirichlet-multinomial distribution is an example of a Gibbs random partition. The number of observed species k is the sufficient statistic of the total number of Applications to some data sets can be found in [29].

Maximum Likelihood Estimation
In this section, we will discuss the maximum likelihood estimation of the A-hypergeometric distribution associated with the rational normal curve. Takayama, et. al [2] gave a framework for the general A-hypergeometric distributions, while this section presents some results on the Ahypergeometric distribution associated with the rational normal curve. The main tools employed here are the same as those employed in [2], but more detailed analyses are possible thanks to specific properties of the A-hypergeometric system associated with the rational normal curve, such as the relationship with the partition polytopes. The information geometry of the Newton polytope of the A-hypergeometric polynomial plays important roles throughout this section.
The p.m.f. is an algebraic exponential family. The maximum likelihood estimation of the full and curved exponential families is discussed. Gradient-based methods to evaluate the maximum likelihood estimator (MLE) will be discussed. An application to a problem associated with an EPPF that appears in an empirical Bayes approach is then presented.
Let us consider a particular A-hypergeometric distribution associated with the rational normal curve, whose p.m.f. is where the matrix A is given in (2.9) with r = n ≥ k + 2 ≥ 4 and the support is (2.1). With this setting the A-hypergeometric polynomial is 1/n! times the partial Bell polynomial. Although this setting makes the discussion model-specific, the model covers important statistical applications.
As in the previous section, let . The p.m.f. is the exponential family and the log likelihood is where ψ n,k (ξ) := log Z n,k (e ξ ) is the potential, and a constant is omitted. Here and in the following Einstein's summation convention will be used; indices denoted by a repeated letter, where the one appears as a superscript while the other appears as a subscript, are summed up.
Under a transformation of the indeterminants the A-hypergeometric polynomial transforms as This transformation is known as the torus action, namely where a i is the i-th column vector of the matrix A. This is a known property of partial Bell polynomials [15]. Following Takayama et al. [2], let us introduce the generalized odds ratio to parametrize the quotient space R n−k+1 >0 /ImA ⊤ . The Gale transform of A, which will be denoted asĀ, satisfiesĀA ⊤ = 0. The explicit forms of the row vectors arē where e i is the (n − k + 1)-dimensional unit vector whose i-th component is unity. The Gale transformation provides the generalized odds ratios, namely The moment map is invariant under the torus action. It can be seen that the moment map E(S) : R n−k+1 /ImA ⊤ ∋ log y → η provides the dual (η-) coordinate system in the sense of information geometry. The dual coordinate and the Fisher metric are immediately given as respectively, where ∂ i := ∂/∂ξ i = θ i . Here, the torus action with s 1 = x −1 2 x 1 and s 2 = x −1 1 in (5.2) is used such that the vector y becomes (1, 1, y 1 , ..., y n−k−1 ). y −1 = y 0 = 1. Because of the dually flatness, an exponential family is e-flat and also m-flat [40].
Because the gradient of the log likelihood is ∂ i ℓ n,k (s; ξ) = s i −η i , finding the MLE is equivalent to finding the inverse image of the moment map where the Newton polytope New(Z n,k ) is the convex hull of the support S n,k given in (2.1). The following theorem comes from the fact that a size index never enters the right-hand side of (5.5).
Theorem 5.1. For the likelihood given by the A-hypergeometric distribution associated with the rational curve (5.1), the MLE does not exist with probability one.
Remark 5.2. This assertion might seem curious, but we have an analogy in the theory of exponential families as follows. If the sample size is one, MLE of the beta distribution and that of the gamma distribution do not exist with probability one. This is because the sufficient statistics are on the boundary of the parameter space (see Example 5.6 in [37] and Example 9.8 in [38]). For the A-hypergeometric distribution, the size index, a count vector of multiplicities, can be regarded as a multivariate sample of size one. A similar argument appeared in the context of algebraic statistics on a hierarchical log-linear models [41].
In the following discussion, the partition polytope is useful. Denote the set of possible partitions of positive integer n by S n := ∪ n i=1 S n,i . The convex hull of S n is called the partition polytope P n , which was discussed by [42]. The partition polytope has an important property, namely that P n is a pyramid with the apex e n . In other words, all vertices are on the faces of P n because the n-th coordinate of the apex is 1 and that of the other vertices are zero.
Proof. Because the p.m.f. (5.1) is regular, the MLE exists if and only if the sufficient statistics are in the interior of the convex hull of the support (Theorem 5.5 of [37] and Corollary 9.6 of [38]). The condition is s ∈ relint(New(Z n,k )). If n ≥ k ≥ n/2, there is a one-to-one affine map between the vertices in S n−k and those in S n,k : The map is easily confirmed with Young tableau, a collection of boxes arranged in left-justified boxes, with the row length in non-increasing order. Listing the number of boxes in each row gives a partition. The affine map (5.6) means that if we discard the rightmost column, we have a partition in S n−k . Because all vertices of S n−k are on the faces of P n−k , all vertices of S n,k are on the faces of (New(Z n,k )), so ∀s / ∈ relint(New(Z n,k )). For 2 ≤ k < n/2, the modified map S n−k ∋ (s 1 , ..., s n−k , 0) : is one-to-one, whereS n−k is a collection of all integer partitions of n − k with n−k i=1 s i ≤ k. It can be shown that all vertices ofS n−k are on the faces of P n−k and ∀s / ∈ relint(New(Z n,k )).
Remark 5.3. The fact that a size index never enters relint(New(Z n,k )) can be seen as an observation of the integer partition: the number of clusters whose sizes are equal to or greater than (n − k)/2 + 1 is at most one. In particular, we have the vertex e n−k+1 + (k − 1)e 1 and other vertices are 0 in the (n − k + 1)-th coordinate. This implies that every vertex of S n,k is on the boundary of New(Z n,k ).
where y 1 is the generalized odds ratio defined in (  Corollary 5.6. For the A-hypergeometric distribution associated with the rational normal curve (5.1), the image of the moment map (5.5) agrees with the relative interior of the Newton polytope New(Z n,k ). Moreover, the moment map is one-to-one.
Let us prepare the following lemma.
Lemma 5.7. The affine dimension of the Newton polytope New(Z n,k ) is n − k − 1 for n ≥ k + 2 ≥ 5.
Proof. When n ≥ k ≥ n/2, the one-to-one affine map (5.6) implies that the affine dimension of the Newton polytope equals the affine dimension of the partition polytope P n−k , which is n − k − 1 by Theorem 1 in [42]. If 3 ≤ k < n/2 it is sufficient to establish that there exists a basis of the vector space of size index s ∈ S m , m ≥ 2, which consists of vertices satisfying m j=1 s j ≤ 3. In fact, this is true. If m is even, a basis is given by {e m , e i + e m−i , 2e j + e m−2j : Proof of Corollary 5.6. If k ≥ 3, according to Lemma 5.7 the affine dimension of New(Z n,k ) is n − k − 1 and the condition of Theorem 5.5 is satisfied. For k = 2, the affine dimension of the Newton polytope is ⌊n/2⌋ − 1 and Theorem 5.5 does not work for n ≥ 5. But we can prove the assertion directly as follows. If n is even, the A-hypergeometric polynomial is Z n,2 = n/2−1 j=1 x j x n−j + x 2 n/2 /2. It can be seen that the Newton polytope is a pair of two simplices, one is the convex hull of {e 1 , e 2 , ..., e n/2 } and the other is that of {e n , e n−1 , ..., e n/2 }.
The following theorem is fundamental. The following corollary characterizes the MLE for the curved exponential family.  (3) If MLEs exist, they are consistent and first order asymptotically efficient as N → ∞.
The asymptotic covariance matrix is given by g −1 ab /N .
Proof where the inner product ·, · is in terms of the Fisher metric (5.4). If this condition is satisfied the orthogonal projection is possible and the MLEû exists uniquely.
As an example of the curved exponential family, let us consider the A-hypergeometric distribution (4.2) which emerges as the conditional distribution of the Gibbs random partition. The submanifold M is one-dimensional and parametrized by α ∈ (−∞, 1). By the parametrization (4.3) the generalized odds ratio becomes The image of the moment map M is now a smooth open curve in relint(New(Z n,k )). One of the limit points as α → 1, which is called the Fermi-Dirac limit, is η = (k − 1)e 1 + e n−k+1 . This is a vertex of New(Z n,k ). Here, the Fisher metric is 0. Another limit is α → −∞, which is called the Maxwell-Boltzmann limit. A simple calculation shows that the other limit is and the Fisher metric is where S(n, k) denotes the Stirling number of the second kind. The inverse of the N times the  Here, the orthogonal projection is the identity map. The MLE does not exist for a sample of size N = 1. For a sample of size N ≥ 2, ifs is within the interval, which is equivalent to 0 < 4N 2 < 3(n − 3)N 1 , the MLE exists uniquely. The asymptotic variance increases linearly with sample size n, as for the full exponential family discussed in Example 5.8.
One of the limit points of the curve M with α → 1 is (n − 4, 0, 0, 1) ⊤ , while the other limit point The latter point is in relint(New(Z n,k )), but in the limit n → ∞ it tends to (n − 6, 3, 0, 0), which is a vertex of New(Z n,k ). Because the curve M is not convex, it is not straightforward .
A sketch of the proof is as follows. It is obvious that no MLE exists fors = (n − 4, 0, 0, 1) ⊤ .
Let us assume thats = (n − 4, 0, 0, 1) ⊤ . Then f (1) = 8(2s 4 +s 3 − 2) < 0. Because of the nature of a cubic curve, it is obvious that the necessary condition of the existence of the MLE is that the coefficient of α 3 is negative, in which case there may be a possibility that two MLEs of the same likelihood exist. Necessary conditions for the existence of the two MLEs is ∂ α f (1) < 0 and the solution of ∂ α f = 0 is smaller than 1. However, we can check that the intersection of these conditions for s is empty. Therefore, the condition that the coefficient of α 3 is negative is sufficient for the unique existence of the MLE. Let us view (5.9) as being certainly the condition that determines the possibility for the orthogonal projection around α → −∞.
are the tangent vectors of M and A(−∞) expressed in terms of basis {∂ i }, respectively. Taking )∂ i , the condition of possibility for the orthogonal projection is which is equivalent to (5.9). The remarkable difference from the case of the full exponential family is that MLE exists even for the case of N = 1. In fact, it can be seen that s = (n−5, 1, 1, 0) with n ≥ 7 has the MLE. If the MLE exists, the asymptotic variance with N → ∞ is g αα /N ∼ n(α − 1) 3 (α − 2)/(4N ) for large n. The asymptotic variance increases linearly with sample size n, as in the case of n = k + 2 in Example 5.12. Figure 2 depicts the Newton polytope New(Z 10,7 ) projected onto the η 3 -η 4 plane, which is the lower triangle of the diagonal, and the submanifold M is the curve. The estimating manifold for the case ofs = (4.8, 1.6, 0.4, 0.2) is shown by the arrow, and the MLE isα = 0.073. The shaded region fors is the region in which no MLE exists, which is the normal fan at the limit point of α → −∞.
Remark 5.14. Essentially the same argument as ours here provides a classical result on the existence of the MLE for a sample from the Dirichlet distribution [43]. The log likelihood of the symmetric m-variate Dirichlet-multinomial distribution of parameter (−α) is given by where a constant term is omitted. This is a curved exponential family. Theorem 1 of [43], which was proved using the variation-diminishing property of the Laplace transform, says that the MLE exists uniquely if and only if is satisfied. In our context, the assertion is as follows. The moment map for the full exponential family is now and the image is the partition polytope P n instead of the Newton polytope New(Z n,k ). The submanifold M is parametrized by α ∈ (−∞, 0) and the two limit points are η = e n and which correspond to limits of α → 0 and α → −∞, respectively. The MLE exists if the size indexs is outside the normal fan at α = −∞, which is equivalent to (5.10).
Before closing this section, let us summarize numerical methods for evaluating the MLE. The discussion for the general A-hypergeometric distribution was given in [2]. For the full exponential family (5.7), the MLE iŝ The derivative is ∂f Evaluateŷ is equivalent to finding the inverse image of the maps = η(ŷ). A simple gradient descent algorithm is as follows.
If we use Newton's method, which is called the natural gradient method in information geometry, ∂f /∂y i may be replaced with With some tedious algebra, it can be seen that the Fisher metric g ij can be computed by using the Pfaffians, whose explicit form will be given in Section 6, and the dual coordinate: The symmetry of the Fisher metric g ij = g ji is equivalent to which is the integrability condition of the Pfaffian system (see Section 6). Compared with the simple gradient descent, Newton's method demands the cost of the matrix inversion.
For the curved exponential family, the algorithm needs to be modified slightly. As an example, we consider the parametrization given in (5.8). The gradient descent algorithm is now as follows.
(1) End if (2) Else set and go to (1) while incrementing j by 1, If we use Newton's method, ∂f /∂α may be replaced with (∂f /∂α) −1 ∂f /∂α, where Example 5.17. The data sets considered are from [44] and concern word usage of Lewis Carroll in two works, namely, Alice's Adventure in Wonderland (Alice in Wonderland) and Through the looking-glass and what Alice found there (Through the looking-glass). An empirical Bayes approach is as follows. In these data, the size index s i is the number of word types that occur al. [29]. Suppose we want to compare Alice in Wonderland and Through the looking-glass.
The latter is Carroll's second story about Alice. We might hypothesize that Carroll benefited from his experience in writing Alice in Wonderland, and that Through the looking-glass might be characterized by the greater vocabulary richness. This hypothesis is concordant with our result, because larger α implies stronger tendency to use word type that have never occurred (see Proposition 9 of [34]). Table 1 displays word frequency spectra of Alice in Wonderland and Through the looking-glass. Let us discuss methods to numerically evaluate the A-hypergeometric polynomial associated with the rational normal curve. We will present results for Z n,k (x) ≡ Z A ((n − k, k) ⊤ ; x), where the matrix A is given in (2.9) with r = n ≥ k+2 ≥ 4. It is straightforward to modify the following discussion for general A-hypergeometric polynomials associated with a monomial curve whose matrix A has the form of (2.10) and b ∈ NA, by fixing some of the indeterminants to be 0. The cases with n = k, k + 1 are trivial because the A-hypergeometric polynomials are monomials Z n,n−1 (x) = x n−2 1

Computation of
x 2 /(n − 2)! and Z n,n (x) = x n 1 /n!. A method to evaluate Z n,k (x) is to use the recurrence relation in Proposition 2.3. As another method, let us discuss applying the HGM [11,13]. The HGM is a method for evaluating holonomic functions numerically. For our problem, totality of the standard monomials of the A-hypergeometric ideal H A (b) is given in Proposition 2.7. Because the factor ring D n−k+1 /I is finite dimensional, we should have the following system of partial differential equations This system is called the Pfaffian system, and it represents contiguity relations among the Ahypergeometric system. The first step in developing the HGM is to obtain the Pfaffians P (n,k) .
In principle, Pfaffians can be obtained by the Buchberger algorithm and reductions of the standard monomials with the reduced Gröbner basis of H A (b) [18,45]. However, such general treatment is unrealistic because the computational cost grows rapidly with the holonomic rank.
In addition, it is non-trivial to treat the singular loci that appear in the Pfaffians. For actual applications, explicit expressions for the Pfaffians are inevitable for a specific solution rather than a general one. Goto and Matsumto obtained such an expression for the A-hypergeometric polynomial of type (i + 1, i + j + 2), which appears as the normalizing constant of the two-way contingency tables with fixed marginal sums [46]. Following them, we call the vector Q n,k (x) the Gauss-Manin vector.
Let us consider how to obtain explicit expressions for the Pfaffians in (6.1), The first rows are immediately determined with the annihilator (2.6). That is, However, other rows demand some consideration. Taking derivatives of the definition of the A-hypergeometric polynomial (2.8), we have Therefore, the Gauss-Manin vector becomes a simple expression, namely Because the A-hypergeometric polynomial has finite terms, higher-order differential operators provide annihilators. Using (6.3), it can be seen that the second derivative yields annihilators: Finally, the recurrence relation in Proposition 2.3 yields the following annihilators: By using the annihilators (2.7), the annihilators (6.4) and (6.5) are recast into for j ≤ i ≤ n − k + 1. Solving this system of annihilators for the second derivatives we can obtain the Pfaffian system (6.1).
Lemma 6.1. The elements of the Pfaffians for the A-hypergeometric polynomial Z n,k (x) are, for 1 ≤ l, m ≤ n − k and 1 ≤ i ≤ n − k + 1, where (P (n,k) i ) 1,m are given in (6.2) andP (n) i are upper triangular matrices with elements The following explicit example may help explain the discussion so far.
for n ≥ 5. For the case of n = 4, the two fake exponents degenerate and the result is (6.8) with the last term replaced by (−4y 1 ). The Pfaffian system is obtained by the Buchberger algorithm and reductions of the standard monomials {1, θ 3 } with the reduced Gröbner basis of the hypergeometric ideal H A (b), whose explicit expression is The Pfaffians for Q n,n−2 (x) = (Z n,n−2 , Z n−3,n−3 ) ⊤ are P (n,n−2) 1 = n − 4 1 The singular loci is y 1 = 1/4, which is on the boundary of the convergence radius of the expression (6.8). A linear combination of the above two solution bases satisfies the Pfaffian system (6.1) with Pfaffians (6.9). In contrast, the A-hypergeometric polynomial (6.7) satisfies the Pfaffian system (6.1) with Pfaffians (6.6): Let us discuss how to evaluate the Gauss-Manin vector Q n,k (x) at a given point of indetermi- . The original HGM is as follows [11]. Because the difference of Q n,k (x) can be approximated as a numerical integration method for difference equations, such as the Runge-Kutta method, provides the numerical value. For the implementation, the initial value at some initial point of indeterminants is needed. One method is direct evaluation of the series at the initial point [47].
However, for the computation of Z n,k (x), simple and exact expressions are available at some specific points of indeterminants, which comes for known results on the partial Bell polynomials [15]. For example, To compute the normalizing constant of the two-way contingency tables with fixed marginal sums, another type of HGM algorithm, which is based on difference equations among A-hypergeometric polynomials, was employed [46]. Following [12], we call this method the difference HGM.
Noting the derivative (6.3) for i = 1, 2, the Pfaffian system is recast into a difference equation: If 2 ≤ k < n/2, it is straightforward to see that the Gauss-Manin vector can be obtained by simple matrix multiplication: where the inverse of the Pfaffian P (i,j) 1 is given as For n/2 ≤ k ≤ n − 2, naive application of (6.11) fails because of the singularity in (P Nevertheless, the following algorithm provides the Gauss-Manin vector.
(4) Else we have . Accuracy is also an important concern, but it is difficult to give a general statement. In the following example, comparison of the performance of these three methods is demonstrated with a specific example. Improvements of implementation of the HGM algorithms will be discussed elsewhere.
Example 6.4. The generalized factorial coefficient, which appeared in Section 4, is 1/n! times the A-hypergeometric polynomial with x i = (1 − α) i−1 /i!, i = 1, 2, .... For the initial values of the HGM, the exact expressions in (6.10) can be used; the former corresponds to α = −1 and the latter corresponds to α = 1/2. For the Pfaffians, we have and executed by one core of a 2.66 GHz Intel Core2 Duo CPU P8800 processor. Table 2 gives the results for α = 1/2, which was chosen because we know the true values (6.10). Roughly speaking, the difference HGM demands less computational cost, while the recurrence relation gives more accurate estimates. Assume n − k is small. If k is large, the HGM and the difference HGM demand less computational cost than the use of the recurrence relation; otherwise, the HGM and the difference HGM would demand more computational cost. The HGM and the difference HGM lose accuracy for large n − k. In particular, the HGM gave negative value for n − k = 30, so we omit those results. The loss of accuracy comes from the fact that α = −1 and α = 1/2 are distant from each other. In fact, the HGM works for evaluation at α = 0.1 and gave similar values to those of the recurrence relation (Table 3), although we do not know the true values at α = 0.1.
If n − k is large all three methods presented above fail, in which case asymptotic approximation is inevitable. For specific parametrization of the indeterminants x, we can consider the asymptotic form. However, Theorem 6 in [2] established an asymptotic approximation for the general A-hypergeometric distributions in the regime of b = γβ, for some β ∈ int(R ≥0 A), to a Gaussian density. The asymptotic form of the A-hypergeometric polynomial comes from the normalization constant. In particular, if n = 2k,θ 1 = 0 andθ 2 = log{ky 1/3 1 /(1 + 2y 1/2 1 )}. Example 6.7. This is a continuation of Example 6.4. The accuracy of asymptotic forms of the generalized factorial coefficients is examined. An asymptotic form has been obtained by Keener et al. [29]. Here, we reproduce the result because the expressions in [29] contain some mistakes.