Extremes and gaps in sampling from a GEM random discrete distribution

We show that in a sample of size $n$ from a GEM$(0,\theta)$ random discrete distribution, the gaps $G_{i:n}:= X_{n-i+1:n} - X_{n-i:n}$ between order statistics $X_{1:n} \le \cdots \le X_{n:n}$ of the sample, with the convention $G_{n:n} := X_{1:n} - 1$, are distributed like the first $n$ terms of an infinite sequence of independent geometric$(i/(i+\theta))$ variables $G_i$. This extends a known result for the minimum $X_{1:n}$ to other gaps in the range of the sample, and implies that the maximum $X_{n:n}$ has the distribution of $1 + \sum_{i=1}^n G_i$, hence the known result that $X_{n:n}$ grows like $\theta\log(n)$ as $n\to\infty$, with an asymptotically normal distribution. Other consequences include most known formulas for the exact distributions of GEM$(0,\theta)$ sampling statistics, including the Ewens and Donnelly--Tavar\'e sampling formulas. For the two-parameter GEM$(\alpha,\theta)$ distribution we show that the maximal value grows like a random multiple of $n^{\alpha/(1-\alpha)}$ and find the limit distribution of the multiplier.

For an independent and identically distributed (i.i.d.) sequence (X n ) with a continuous distribution function F (x) := P(X n ≤ x), the probabilistic structure of order statistics is well understood. Many exact distributional identities are obtained by reduction to the simplest i.i.d. cases: • X i = U i , signifying F (x) = x for 0 ≤ x ≤ 1, the uniform distribution on (0, 1); • X i = ε i /λ for ε i i.i.d. exponential (1) and a constant λ > 0, in which case F (x) = 1 − e −λx for x ≥ 0, the exponential(λ) distribution on (0, ∞).
This reduction involves the identity of n-dimensional joint distributions which holds with almost sure identities if the U i and ε i are defined by U i := F (X i ) and ε i := − log(1 − U i ). For discrete distributions the situation is complicated by possible ties but also well understood. See [14] [53] for further background on order statistics. We are primarily interested here in the structure of the gaps between sample values which we list from the top of the sample down, as G i:n := X n+1−i:n − X n−i:n (1 ≤ i ≤ n) for distributions of X i whose support has a minimal value m 0 ≥ 0, with X 0:n := m 0 . So we interpret G n:n := X 1:n − m 0 as the gap below the minimum of the sample, with G n:n = 0 iff some sample value hits the minimum of the range. The order statistics are then encoded in the gaps as X k:n = m 0 +  According to a well known result of Sukhatme-Rényi [65], [53,Repr. 3.4], for i.i.d. sampling the structure of the gaps is simplest for exponential variables: for X i = ε i λ the G i:n are independent with (G i:n , 1 ≤ i ≤ n) Among absolutely continuous distributions, the family of shifted exponential distributions is characterized by quite weak forms of this assertion, for instance that G 1:2 is independent of G 2:2 . See Ferguson [27] and earlier work cited there, and [69]  which implies the convergence in distribution, with centering but no normalization where γ := lim n→∞ (− log n + n i=1 1/i) is Euler's constant. The infinite sum converges both almost surely and in mean square, by Kolmogorov's theorem for sums of independent random variables with mean 0, and the limit has the Gumbel distribution function F (x) = exp(−e −x ). More generally, the asymptotic behavior of the maximum M n of an i.i.d. sample from a continuous distribution is well understood. If after proper rescaling, the distribution of M n has a non-degenerate weak limit, that limit must have the distribution function F ρ (x) = exp(−(1 + xρ) −1/ρ ), for some ρ ∈ (−∞, ∞) and x such that 1 + xρ > 0 (and F ρ (x) equals 0 or 1 for other x), see, e.g., [53]. Here ρ (and the scaling) depends on the behavior of the distribution near the supremum of its support. Limits can be also degenerate, and there exist distributions for which no non-degenerate limit is possible.
The situation is quite different for an infinite exchangeable sequence X 1 , X 2 , . . .. In this case any distribution can appear as a limiting distribution of the finite sample maximum M n := X n:n , as shown by the following example, which seems to be folklore. Let Z 1 , Z 2 , . . . be a sequence of i.i.d. random variables and let M be a random variable, independent of this sequence, with some given distribution. Take X n = Z n + M to obtain an exchangeable sequence X 1 , X 2 , . . . . If the support of the distribution of Z 1 is bounded above then M n converges a.s. to a shift of M , without any rescaling. So to obtain results of any interest about limit distributions of maxima from an exchangeable sequence, some further structure must be involved.
(1. 6) We specify P • by the residual allocation model (RAM) or stick-breaking scheme [42], [68] [58, §5] (1 − H i ) = 0 almost surely. (1.8) The random variables H i may be called residual fractions, random discrete hazards, or factors. We use the term RAM to indicate that the H i are independent, but not necessarily that they are identically distributed. But we also consider these models in the broader context of H • subject to (1.8), corresponding to P • subject to (1.6), without any further dependence assumptions, which we call a generalized residual allocation model (GRAM) [58, §5].
The case of i.i.d. factors H i has been extensively studied by Gnedin and coauthors [36], [30], who call this model the Bernoulli sieve. Another RAM of particular interest, because of its invariance under size-biased permutation [59] is the GEM model with parameters (α, θ). In this model, H i has the beta(1 − α, θ + iα) density on (0, 1) where 0 ≤ α < 1 and θ > −α are real parameters, and B(·, ·) is Euler's beta function. The GEM hazard variables are i.i.d. only in the important special case α = 0 covered by the following theorem: . . be an exchangeable sequence obtained by i.i.d. sampling from the GEM(0, θ) distribution (1.7) for i.i.d. beta(1, θ) hazards H i . For each fixed n ≥ 1, the gaps G i:n between the order statistics of the sample, read from right to left, with G n:n + 1 := min 1≤i≤n X i , are independent geometric(i/(i + θ)) variables, with means θ/i for 1 ≤ i ≤ n.
As detailed in Section 3, this theorem contains most known results about GEM(0, θ) samples, including the well known sampling formulas for GEM(0, θ), due to Ewens [23], Antoniak [2], and Donnelly and Tavaré [18]. It is also very close to recent studies of random compositions derived from RAMs with i.i.d. factors, as we acknowledge further below. Our simple description of GEM(0, θ) gaps is hidden in these studies by different encodings of the values and their multiplicities in discrete random sampling, based on the count sequence N • •:n := (N • 1:n , N • 2:n , . . .) defined by (1.10) Here 1(A) denotes the indicator of an event or set A. The notion of a random sample from a random discrete distribution admits a variety of interpretations, some of which are recalled in Section 3. But as our primary metaphor for sampling, we follow recent studies of the Bernoulli sieve [30] in regarding the sample X 1 , . . . , X n as an allocation of n balls labeled by i = 1, 2, . . . , n into an unlimited number of boxes labeled by b ∈ {1, 2, . . .}. So X i is the label of the box into which ball i is thrown. Given P • the X i are independent allocations with P( :n is the number of balls thrown into box b, the sample maximum X n:n = max{b : N • b:n > 0} is the label of the rightmost occupied box, and so on. The key to Theorem 1.1 is the close parallel between the structure of gaps in sampling from GEM(0, θ), and from exponential(λ), as in (1.3). This parallel guided our choice to list the gaps from top down rather than bottom up, as well as the definition of the final gap G n:n := X 1:n − 1 in the discrete case. We show in Section 2 how Theorem 1.1 follows easily from its exponential(λ) analog, using Ignatov's construction [44] of GEM(0, θ) from a Poisson point process. This construction, and the change of variables (1.1), which maps sampling by independent uniforms in (0, 1) to sampling by independent exponentials in (0, ∞), was developed and applied in a number of previous works [44], [30], to deduce results for sampling from RAMs and related regenerative composition structures from corresponding results in renewal theory. The method yields also the following corollary of Theorem 1.1: The GEM(0, θ) models for 0 < θ < ∞ are the only RAMs with i.i.d. factors such that for all sufficiently large n the gaps between order statistics in a sample of size n are independent.
We conjecture that the GEM(0, θ) models are the only random discrete distributions of any kind subject to (1.6) with this property of independence of sample gaps for all n, or for all large n. But resolving this question seems beyond the reach of our current methods.
We discovered these properties of gaps in GEM(0, θ) samples by seeking an adequate explanation of the identity in distribution for the maximum M n of a GEM(0, θ) sample, presented in the next corollary of Theorem 1.1. We first found this identity by a different method indicated in Section 4, without consideration of gaps. But the gaps explain it much better: Corollary 1.3. For the maximum M n of a sample of size n from GEM(0, θ) for independent G i with the geometric(i/(i + θ)) distribution.
Since G i has mean EG i = θ/i and variance Var G i = θ(i + θ)/i 2 ∼ θ/i as i → ∞, Lindeberg's central limit theorem implies the limit in distribution (1.12) where Z has the standard normal distribution. This limit theorem for M n is known as an instance of a more general normal limit theorem for M n in sampling from a large class of In Section 3 we discuss further these different representations of K n , their interpretations in various applications, and related representations of the counts which for 1 ≤ j ≤ n gives the numbers of clusters of size j in the sample, and for j = 0 which is the total count of all values between 1 and M n that are missing in the sample of size n. (Here and below (x) + = 1 2 (x + |x|) is the positive part of x.) There is by now a substantial theory of asymptotics for these and related statistics of samples from RAMs with i.i.d. factors, developed by Gnedin and coauthors by various techniques. In particular, the work of [37] shows that the central limit theorems (1.12) for M n and the same result for K n hold jointly with the same limit variable Z, for the simple reason that for a large class of RAMs with i.i.d. factors, including GEM(0, θ), the difference M n − K n in (1.14) has a limit in distribution as n → ∞, without any centering or scaling. See [37] for further discussion, especially [37, Proposition 5.1] for a pretty formula involving the gamma function for the probability generating function of the large n limit in distribution of K 0:n for GEM(0, θ), and [29]  an exponential distribution, where M n has a limit in law with just centering and no normalization. So, even though we regard Theorem 1.1 as a discrete analog of the more familiar description of gaps in i.i.d. sampling from an exponential distribution, the limit theorems for M n implied by these results are quite different. Naively, it might be expected that the discrete analog of the simple structure of gaps in an exponential(λ) sample should be the gaps in sampling from a geometric(p) distribution, which is the RAM (1.7) with deterministic factors H i ≡ p. But apart from some simple results for a sample of size n = 2, which are easily seen to hold for any RAM with H 1 independent of H 2 , H 3 , . . ., the possibilities of various configurations of ties for n ≥ 3 makes the description of gaps and related statistics for geometric(p) sampling more complicated than might at first be expected. The distribution of the count of missing values K 0:n in (1.14), as well as the number L n of ties with the maximal value M n L n := n − max{j : X j:n < X n:n } = N • Mn:n = min{i : G i:n > 0} (1.15) have been the subject of many studies of i.i.d. sampling from fixed discrete distributions, especially from the geometric(p) distribution. See for instance [10] [41] and earlier literature cited there. In sampling from the geometric(p) distribution it is known the distributions of K 0:n and L n remain tight as n → ∞, but that they do not converge, due to a periodic phenomenon which has been extensively studied in this literature. As shown in [37] however, for a large class of RAMs including GEM(0, θ), the periodic phenomena which arise from i.i.d. geometric(p) sampling get smoothed out very nicely: the count L n and related statistics such as the K j:n have limits in distribution without any centering or normalization as n → ∞. The idea that such results should be informed by analysis of gaps as well as counts leads to some developments of those limit theorems explored in a sequel [61] of this article.
For RAMs with independent but non-identically distributed factors, our results are more limited. For the two parameter GEM(α, θ) distribution the representation (1.11) of M n as a sum of independent random variables is no longer valid. Nevertheless we show in Section 6 that in this case the maximum of a size n sample behaves as a random multiple of n α/(1−α) as n → ∞, and in Section 7 we indicate some companion limit theorems for L n in this case. See also a sequel to this article [62], where for sampling from GEM(α, θ) we derive a result presented here without proof as Theorem 3.4, which gives the distribution of the value-ranked frequencies, meaning the sequence of non-zero components of (N • b:n , b = 1, 2, . . .).

Point processes associated with random discrete distributions
For a random discrete distribution P • := (P j ) on the positive integers governed by the GRAM (1.7) let Thinking of P • as a random discrete distribution on the real line, which happens to be concentrated on positive integers, F j = P • (−∞, j] is the random cumulative distribution function evaluated at positive integers j. In the stick-breaking interpretation, the F j ∈ [0, 1] are the break points. But we prefer the language of the stars and bars model, discussed further in [61]. Following the method developed by Ignatov [44] for GEM(0, θ), and further developed in [31] [33] [34] for various other models of random discrete distributions, we treat the bars F j as the points of a simple point process N F on (0, 1), which counts bars not including endpoints of [0, 1]. So is the number of bars in (a, b]. The stars are the i.i.d. uniform on [0, 1] points U 1 , U 2 , . . . which fall between bars and define a random sample X 1 , In terms of Gnedin's balls in boxes model of [36], the bars F j in (0, 1) are the dividers between boxes [F j−1 , F j ) which collect the stars (balls) U i into clusters falling in the same box. For the F j , the prevailing assumption (1.6) becomes 0 < F 1 < F 2 < · · · ↑ 1 almost surely.
It is convenient to make the change of variable from [0, 1) to [0, ∞) by the map u → x = − log(1 − u). This is the inverse of the cumulative distribution function We regard these images S j of bars F j as the points of an associated point process N S on (0, ∞): Note that the assumption (1.6) translates into 0 < S 1 < S 2 < · · · ↑ ∞ a.s. For convenience we recall the Ignatov's construction of GEM(0, θ).
With above notation, the following conditions are equivalent: Proof of Theorem 1.1. Whatever the random discrete distribution P • subject to (1.6), for the sample (X 1 , . . . , X n ) constructed as above, the order statistics of the discrete sample are obtained by counting bars to the left of the order statistics of the corresponding uniform and exponential samples, according to the formula X i:n − 1 = N F (0, U i:n ] = N S (0, ε i:n ], (2.1) while the gaps between order statistics of the discrete sample are obtained by counting bars between corresponding order statistics of the uniform and exponential samples: According to Lemma 2.1 the point process N S is homogeneous Poisson with rate θ. By construction it is independent of the gaps between exponential order statistics appearing in (2.2), which due to (1.3) are independent and distributed like ε i /i for 1 ≤ i ≤ n. It follows that the G i:n are independent, with G i:n d = N S (0, ε/i] for ε standard exponential independent of N S . So the distribution of G i:n is the mixture of Poisson(θt) distributions with t assigned the exponential(i) distribution of ε/i. It is well known that such a mixture is geometric(p) for p determined by equating means: (1 − p)/p = θ/i. This gives p = i/(i + θ), and the conclusion follows.
Proof of Corollary 1.2. For a RAM with i.i.d. factors the point process N S is a renewal process. For each fixed δ with 0 < δ < 1, the number of points F j such that F j ≤ 1 − δ is N S (0, − log(δ)], which is well known to have finite exponential moments. On the other hand, by the law of large numbers for the sampling process, provided − log(δ) is chosen to be a continuity point of the renewal measure, with probability one, for sufficiently large n, both the sum of gaps n j=δn G j:n and the sum of indicators of non-zero gaps n j=δn 1(G j:n > 0) will eventually equal the number of renewals N S (0, − log(δ)], because every relevant box will be occupied. Assuming the gaps are independent, it then follows from the Poisson approximation to sums of independent Bernoulli(p i ) random variables with small parameters p i that N S (0, − log(δ)] is Poisson distributed. A variation of this argument shows that N S has independent, Poisson distributed increments. In other words, N S is a possibly inhomogeneous Poisson process. But the RAM assumption makes N S a renewal process. By comparison of the Poisson and renewal decompositions of N S at its first point S 1 , it is easily shown that the point processes that are both Poisson processes and renewal processes are the Poisson processes of constant rate θ for some θ > 0. So the conclusion follows from Ignatov's construction of GEM(0, θ) in Lemma 2.1.
As shown in [31] [29] [37] an asymptotic analysis of sampling statistics for a RAM with i.i.d. factors can be made by exploiting properties of the renewal counting process N S in this case. But for GEM(α, θ) with 0 < α < 1, the spacings between the S j are independent but not identically distributed, with S j − S j−1 converging almost surely to 0 as j → ∞, by a simple Borel-Cantelli argument. It follows that in sampling from GEM(α, θ) for 0 < α < 1 the behavior of the order statistics and gaps is very different from the case α = 0, and not approachable by methods of renewal theory.
Both to illustrate applications of Theorem 1.1 in the case α = 0, and to motivate study of the GEM order statistics for 0 < α < 1, before proceeding with that study we first show how Theorem 1.1 leads to some known results about the GEM model, and recall some of its diverse applications.

Applications
The GEM acronym was assigned by Warren Ewens [24, p. 321] to acknowledge the work of Griffiths [40], Engen [21] [20] and McCloskey [52] in developing the GEM model for random frequencies in genetics and ecology. The GEM(0, θ) with i.i.d. beta (1, θ) factors was studied first, following which the two parametric extension proposed by Engen [20] has also been extensively studied [56,60,26], motivated by its appearance in the structure of interval partitions generated by the zeros of various stochastic processes. New models of stationary reversible dynamics for population frequencies consistent with GEM(α, θ) for 0 < α < 1 have recently been developed and are of continuing interest [57] [11]. The GEM model has also been applied in Bayesian non-parametric statistics as a building block for Bayesian analysis and machine learning algorithms for inferences about clustered data. See the recent review by Crane [12] [13].

Species sampling
The GEM distributions were first studied in the setting of ecology, where a random discrete distribution P • may be regarded as a species abundance model, meaning an idealized listing of frequencies of an unlimited number of distinct species in a population of unlimited size. To justify the use of infinite models in this setting, some preliminary remarks may be in order. All actual populations are finite, with only a finite number of species. However ecologists discovered that while models with a large finite number of species are quite intractable, remarkable simplifications occur in some particular infinite models. In a sample of n individuals from a large population of size say m, one can suppose that size m population itself is a sample from some ideal infinite population.
Finite samples from this population may be assumed to exchangeable, and de Finetti's theorem leads to a representation of consistent models of species sampling from a large population in terms of limiting random frequencies, as shown by Kingman.
In the context of species sampling, in a sample of size n from some population, the data acquired is most naturally regarded as just the partition of n, that is a collection of positive integers that sum to n, obtained by classifying individuals by species, and ignoring any species labels. Such a partition of n can be described in two different ways.
The first is to list the number N i:n of individuals of type i, for 1 ≤ i ≤ k, where k = K n is the number of distinct species in the sample, and there is some convention for the ordering of types. The second is to provide for each j ∈ [n] the count of species with j representatives in the size n sample. The total number of distinct species is then The frequencies P i of species in the ideal infinite population then arise as almost sure limits of N i:n /n as n → ∞, for some consistently chosen ordering of N •:n .
It is an awkward aspect of random frequency models that most labeling of frequencies P i by integers or other countable sets are somewhat arbitrary. There are several possible workarounds for this problem. The easiest way is to list the ranked sample frequencies, meaning the numbers of representatives of various species in descending order as which is the decreasing rearrangement of any other listing of the sample frequencies (N 1:n , . . . , N k:n ). This is also a common way to list parts of partitions in combinatorics. Another way to arrange species is to obtain a size n sample by sampling the population one by one and listing the species in order of their appearance in the sampling process.
Given that any particular species in the whole population has m representatives in a sample of size n, the probability of that species being listed first in order of appearance is m/n, by the assumed exchangeability of the sampling process. This leads to the general notion of a size-biased random permutation of a finite or countably infinite index set I, or of a collection of components of some kind C i , i ∈ I that is indexed by I, for some notion of sizes V i := V (C i ) of the components being permuted [16] [59]. Typically V (C i ) is the number of elements for a finite set C i , or some measure such as length for infinite sets C i like intervals. The size function V of components is subject to the requirement that V i > 0 and that Σ := i V i < ∞, which needs to hold almost surely for a collection of random components (C i , i ∈ I). Given such random components (C i , i ∈ I), their size-biased permutation is a random indexing of these components (C σ(i) , i = 1, 2, . . .  (1) is called an size-biased pick from the components, and for each m ≥ 1 and j 1 , . . . , j m with is an size-biased pick from the remaining components indexed by I \ {j 1 , . . . , j m }: With component C i being a non-empty set of representatives of some species type i in an exchangeable random sample of size n, for some arbitrary labeling of species by i ∈ [k], the sequence of sample frequencies in order of appearance N * •:n = (N * 1:n , . . . , N * k:n ) is found to be the sequence of sizes N * i:n = V (C σ(i) ) of a size-biased random permutation of the clusters of different species found in the sample, with size V (C i ) = N i:n being the number of each species present in the sample. The notation N * •:n is a mnemonic for this size-biasing which is involved in any ordering of species by appearance in an exchangeable process of random sampling. This size-biased listing of sample cluster sizes in order of appearance turns out to be very convenient to work with, especially when the frequencies P • are in a size-biased random order themselves, an assumption which is quite natural in this context. This assumption holds for GEM(α, θ) model, as shown in [56], which is one of the reasons for use and study of this model.
In the context of species sampling, the values X i in a size n sample from a random discrete distribution such as GEM(α, θ) seem to be of little interest besides the way that clusters of these values define a partition or a composition of n. However a natural meaning for these sample values X i and their order statistics X •:n can be provided using the fact that GEM(α, θ) frequencies P • are already in size-biased random order, hence invariant in distribution under size-biased permutation (ISBP) [59], meaning that where (P * 1 , P * 2 , . . .) is a size-biased permutation of (P 1 , P 2 , . . .). Proposition 3.1. Consider an exchangeable process of species sampling (Y 1 , Y 2 , . . .) from random frequencies P • . Split (Y 1 , Y 2 , . . .) into an initial sample (Y 1 , . . . , Y n ) of size n, followed by a secondary sample (Y n+1 , Y n+2 , . . .) of unlimited size. For each 1 ≤ i ≤ n let X i be the number of species discovered by the secondary sample up to and including discovery of Y i in the secondary sample. Then (X 1 , . . . , X n ) is a sample of size n from P * • , a size-biased random permutation of P • . In particular if P • is ISBP, as is the case for GEM(α, θ), then (X 1 , . . . , X n ) is distributed like a sample of size n from P • . Moreover, for a sample (X 1 , . . . , X n ) from P * • so constructed, as well as the usual interpretation of K j:n as numbers of species with j representatives in the initial sample, and K n = n j=1 K j:n the total number of species found by the initial sample, various features of the order statistics in the sample can be interpreted as follows: • the minimum sample value X 1:n is the number of species encountered in the secondary sample up to and including when the first species in the primary sample is encountered.
• the maximum sample value M n := X n:n is the number of species encountered in the secondary sample up to and including the first time τ n that a member of each of the initial K n species has been encountered in the secondary sample.
is the index of the first individual of the second species to appear in the secondary sample, and so on. The relabeling of species by their order of appearance in the secondary sample yields a size-biased permutation P * • of P • with P * x the almost sure limiting relative frequency of C(x) for each x = 1, 2, . . .. It then follows by exchangeability that X 1 , . . . , X n is a sample of size n from P * • . The interpretations of the various statistics are then straightforward.
One can also give similar interpretations of the gaps G •:n , but this is a bit trickier. For a sample X 1 , . . . , X n let N X↑ •:n denote the list of sample frequencies in increasing order of X values, that is the subsequence of all strictly positive counts in the complete list of counts N • •:n derived from the sample X 1 , . . . , X n as in (1.10). So the count of values equal to X 1:n comes first, and the count for X n:n last. In any random sample X 1 , . . . , X n , it is clear from the definition that for any given composition (n 1 , . . . , n k ) of n, N X↑ •:n = (n 1 , . . . , n k ) if and only if the sequence of gaps G •:n between order statistics is such that In the species sampling setting of Proposition 3.1, X i is the number of species discovered by the secondary sample up to and including discovery of the species of the ith initial individual, and N X↑ •:n is the listing of cluster sizes in the primary sample in order of their discovery by the secondary sample. In this setting the gaps can be described as follows: • For ∈ {2, . . . , k}, G n +···+n k :n is the number of new species encountered in the secondary sample after − 1 species of the primary sample are found up to and including the time when -th species from the primary sample is found.
• G n:n := X 1:n − 1 is the number of species encountered in the secondary sample before some species present in the primary sample is found.
• All other gaps are zero according to (3.2).

Some corollaries of Theorem 1.1
We now explain how Theorem 1.1 implies a number of known results about GEM(0, θ) samples. Most of these were first discovered in the context of population genetics, where the age-ordering of alleles in a large population provides a natural indexing of allelic types, and the age-ordering of clusters of alleles found in a sample is of interest. The first result is an easy corollary of Theorem 1.1:  Consider now the three random compositions of n defined above in terms of a sample X 1 , . . . , X n from a random discrete distribution P • : • the value-ordered sample frequencies N X↑  for every composition (n 1 , . . . , n k ) of n into k ≤ n parts, with (x) n := Γ(x + n)/Γ(x) the Pochhammer symbol. The subscript notation P α,θ or E α,θ signals that probabilities or expectations are governed by the GEM(α, θ) model. As indicated by Donnelly and Tavaré, summing (3.4) over all compositions of n with a prescribed weakly decreasing rearrangement yields the celebrated Ewens sampling formula [2] [23] for the distribution of the partition of n generated by sampling from GEM(0, θ). That is, for K j:n the count of clusters of size j as in (3.1), for each weak composition (m 1 , . . . , m n ) of k, meaning m j ≥ 0 with n j=1 m j = k, with n j=1 jm j = n P 0,θ (K j:n = m j , 1 ≤ j ≤ n) = n! θ k (θ) n n j=1 1 m j !j mj .  The question of how to extend this result to GEM(α, θ) for 0 < α < 1 led us combine the representation of sampling from GEM(α, θ) provided by Proposition 3.1 with the known description of the distribution of N * •:n for GEM(α, θ) [58] [60, §3.2], to obtain the following theorem, whose proof will be detailed elsewhere [62]. The meaning of (3.6) is that given N * •:n = (n 1 , . . . , n k ), the frequency N X↑ 1:n of the minimal value is distributed like a random choice of (n 1 , . . . , n k ), with n i chosen with probability (n i − α)/(n − kα), and so on, as in the general definition of a size-biased permutation, just with the usual size n i of a cluster replaced by n i − α. In particular, for 0 < α < 1 and n ≥ 3 the two random compositions N X↑ •:n and N * •:n are not identically distributed. Remarkably, the conclusion (3.6) holds not only for GEM(α, θ), but also for the sampling from P • the size-biased presentation of frequencies in any of the Gibbs(α) models introduced in [60, Theorem 4.6] and studied further in [32].
We note that Theorem 1.1 for GEM(0, θ) yields also the following further corollary, which identifies the known distribution of the minimal order statistic X 1:n . This can be read from a result for the infinitely many alleles model, due to Saunders, Tavaré, and Watterson [67,Theorem 8] and the fact that this model generates age-ordered GEM(0, θ) frequencies. See Donnelly and Tavaré [18] and Donnelly [15,Proposition 3.5] for further discussion. The independence assertion in the corollary is easily shown to hold in sampling from any RAM with i.i.d. factors, due to the regenerative property of these models discussed in [31]. Corollary 3.5. In sampling from GEM(0, θ), the minimum of the sample, X 1:n = 1 + G n:n has a shifted geometric(n/(n + θ)) distribution, and X 1:n is independent of the pair of random compositions N X↑ •:n and N * •:n , hence also independent of the Ewens(θ) distributed partition of n generated by the sample.
Proof. The independence of G n:n and N X↑ •:n is clear, because N X↑ •:n is a function of the G i:n 1 ≤ i ≤ n − 1. But by exchangeability, conditionally given the entire collection of order statistics, N * •:n is just a size-biased permutation of N X↑ •:n . So the order statistics and N * •:n are conditionally independent given N X↑ •:n , from which the conclusion follows easily.

Combinatorial limit theorems
Combinatorial models often involve exchangeable random partitions of [n] into a collection of subsets of various sizes, typically connected components of a graph associated with the model, such as the cycles of a permutation, trees in a forest, or connected components of a mapping digraph. It is known [4] [60] that in many models for such a combinatorial structure picked uniformly at random, the sequence N ↓ •:n of ranked component sizes converges in law after scaling by n: for some (α, θ), where PD(α, θ), the Poisson-Dirichlet distribution with parameters (α, θ) is the distribution of the decreasing rearrangement of GEM(α, θ) [63]. According to the general theory of such limit distributions [17] [35], this is equivalent to the corresponding convergence  [63] that (3.7) holds for this α with θ = 0, for the Markov chain started in state 0, and with the same α with θ = α for the Markovian bridge obtained by further conditioning to return to state 0 at a late time n. In this setting, the lengths of excursions are most naturally listed in their order of creation by the Markov chain, which is neither ranked nor size-biased. Still, the analysis of such limit laws for excursions is assisted by the device of deliberately size-biasing the order of excursions, to create a GEM(α, θ) limit as in (3.8) which is much easier to deal with than the PD(α, θ) limit of ranked lengths.
In any of these settings where PD(α, θ) and GEM(α, θ) arise hand-in-hand as limit laws for ranked and size-biased counts of some kind, it was shown by Kingman [50] that the structure of the limit theorems extends to corresponding limit theorems for sampling components of the structure of size n. To set this up, consider a limited sample of n elements from a much larger set of size m supporting a combinatorial structure whose ranked relative component sizes m which implies also convergence in distribution of corresponding order statistics, counts and gaps to those derived from the GEM sample.
This proposition shows how any exact result for a sample of size n from a GEM can be turned into the conclusion of a limit theorem for sampling from various random combinatorial structures. Just that a double sampling process is involved, much as in Proposition 3.1, which acquires further interpretations in this context. The sample of size n may be regarded as an initial sample of size n, as in Proposition 3.1. Then there needs to be a secondary sample, to generate a size-biased labeling of components, run at least long enough to allocate a label to every component that intersects the initial sample.
The limit in distribution as m → ∞ of the list of secondary labels found in the primary sample of size n is then a size n sample from GEM(α, θ). The case of exchangeable partitions is particularly natural, as the initial sample of size n can be taken to be the set [n] instead of a random subset of size n, and the secondary labeling of components can be taken to be the order of least elements of components, starting the labeling afresh after the initial sample of size n, as in Proposition 3.1. As m → ∞ there is a negligible difference between this construction and a completely independent size-biased listing of components, so the conclusions of the above Proposition are valid in either setup.
For application to the excursions of a Markov chain, given the path of the Markov chain of length m, due to lack of exchangeability, two random samples are required, one to choose n sample times from [m], and the other to assign secondary labels to excursions in their order of discovery by a random permutation of [m]. As in the exchangeable case, it makes no difference if the second random sample is just a continuation of the first, avoiding the first n elements drawn and continuing without replacement until all m time points have been covered, and all the excursions found. In this scenario there is a tiny probability that the first n time points sampled might find excursions which were not part of the subsequent sample, and hence to which no label can be assigned, but the probability of this event and all other differences in the distribution of the first n sample labels is negligible in the large m limit, because the assumption of convergence to GEM(α, θ) proper frequencies means that with overwhelming probability these first n sample points will fall in some collection of K n excursions each of which acquires a significant fraction of the rest of the sample points in [m] and m → ∞. The interpretation in this setting of large n limit theorems for sampling a GEM distribution is not immediate. But such interpretations might still be made with adequate provision of a limit regime with both n and m tending to infinity. To adequately justify such a double limit theorem, some estimate of the adequacy of the approximation of m −1 N * •:m by GEM(α, θ) would be required, such as that provided in [5] for the random permutation statistics approaching GEM(0, 1).

The maximum of a sample from a random discrete distribution
This section develops some general formulas for the distribution of the maximum of a sample from a random discrete distribution on positive integers. These formulas allow us to check Corollary 1.3 without appeal to Ignatov's representation of GEM(0, θ). Our interest in this approach is that it at least gives us an explicit if difficult formula for the distribution of the maximum of a sample from GEM(α, θ).
We begin with a well known representation of the probability generating function of a discrete random variable in terms of its tail probabilities.
This allows us to provide the following general expression for the distribution of the maximum of a sample from a random discrete distribution: Lemma 4.2. Let M n = max 1≤k≤n X k for a sequence of exchangeable positive integer valued random variables X 1 , . . . , X n which are conditionally i.i.d. P • given some random discrete distribution P • with R k := 1 − k j=1 P j ↓ 0 a.s. Then the probability generating function of M n − 1 admits the representation Proof. We apply (4.2) to X = M n − 1. For k = 1, 2, . . . the term for m = k − 1 is evaluated by taking expectations in the following identity:
For j > 1 only in the case α = 0 does there seem to be much simplification in (4.3).
Then we can proceed as follows: Computational proof of Corollary 1.3. For α = 0 in (4.4) we find that .  which can be verified, for instance, by multiplying (4.9) by x + k and plugging in x = −k for k = 0, 1, . . . , n. Since the factors in the right-hand side of (4.8) are the probability generating functions of geometric variables G i with parameters i/(i + θ), the conclusion of Corollary 1.3 follows.
where the G i (b, θ) are independent with geometric(p i (b, θ)) distributions, for Proof. Consider first N F (0, W ] where W is a random variable with some arbitrary distribution on [0, 1], independent of N F . For W = u fixed, the distribution of N F (0, u] is Poisson(−θ log(1 − u)) with the probability generating function For general W the distribution of N F (0, W ] ranges over all mixed Poisson distributions. Explicitly, the probability generating function of W is In particular, if W = β a,b has the beta(a, b) distribution then If a = n is a positive integer, then Γ(n+b) θ))z for p i (n, θ) as in (5.2). Since the i-th factor is the probability generating function for the geometric(p i (n, θ)) distribution, the claim (5.1) follows.
6 The maximum of a GEM(α, θ) sample for 0 < α < 1 The technique of the previous sections does not seem to work for the case 0 < α < 1. However the asymptotics of the GEM distribution in this case are known sufficiently well to find the asymptotic behavior of M n as n → ∞. In particular, it is known that GEM(α, θ) frequencies P i almost surely decay as random factors of i −1/α . Similar behavior is also known for the sampling from a random branching process model introduced by Robert and Simatos [66] where different but similar aspects of samples are studied, such as the limit behavior of the first unoccupied box, as the sample size grows.
A key role in the study of the GEM(α, θ) distribution for the case 0 < α < 1 is played by the notion of the α-diversity of the exchangeable sample. It is known [60,Th. 3.8] that for K n defined by (1.13) from a GEM(α, θ) sample with 0 < α < 1 and θ > −α there exists a limit lim n→∞ K n n α = D α > 0 almost surely (P α,θ ) and also in p-th mean for every p > 0. The distribution of the limiting random variable D α , which depends on θ, is known as the α-diversity and is determined by its moments .
Moreover, the α-diversity D α is a.s. determined by P • and P α,θ [X 1 > k|P • ] ∼ αD 1/α α,θ k 1−1/α almost surely (P α,θ ) as k → ∞, see [28,Sec. 10] or [60,Lemma 3.11]. For such a power law it is well known that the maximum of an independent sample of size n converges in distribution to the Fréchet distribution. Namely, writing for short γ = 1/α − 1, for any fixed x > 0 x −γ n n almost surely (P α,θ ) Hence, by integration with respect to the distribution of the α-diversity, we have the following result.
Theorem 6.1. Let M n be the maximum of a size n GEM(α, θ) exchangeable sample with 0 < α < 1 and θ > −α. Then for each x > 0 Theorem 6.3. For the P α,θ distribution of D α determined by the moment function (6.1), where J θ is the Bessel function.
Proof. Writing for short y = αx −(1−α)/α we have, for any c > 0, because e −y and Γ(s) form the Mellin pair. We refer to [54] for the necessary information about Mellin's transform. By analyticity the expression (6.1) for P α,θ moments of D α is valid also for complex p at least with Re p > −1 − θ α . Hence taking expectation and applying Fubini's theorem yields y −s ds Plugging this into (6.4), changing the variable v = √ u and returning to the variable x yields the result.
The right-hand side of (6.3) does not seem to allow much simplification for general α. For some rational α Mathematica evaluates this integral in terms of the hypergeometric function. However for α = 1/2 the integral can be taken explicitly and leads to a simple expression. In this case the integral is the Mellin transform of f (v) = e −v √ 2x J θ (2v) evaluated at θ + 1. According to [54, I.10.7] x and the last expression simplifies for s = θ + 1 because for |z| < 1 and by analyticity also for all z with Re z < 1. Hence  Some simplification is also possible for rational α = 1/2 using the representation of the α-diversity in terms of product of random variables with beta and gamma distributions given in [48,Sec. 8].

Limit laws for the number of missing values and number of ties at the maximum
This section offers some complements to the analysis of limit laws for M n in GEM(0, θ) settings, following the work of Gnedin et al. [37].
It was observed in other notation in [37, (19)] that in sampling from GEM(0, θ), for the number of missing values in the range K 0:n := M n − K n there is the representation in distribution (1.14) in terms of independent geometric(p i ) random variables G i:n with p i := i/(θ + i). Writing now G(p i ) instead of G i:n to emphasize the lack of dependence on n in this representation, apart from the trivial term G n:n which obviously converges almost surely to 0 as n → ∞, we deduce easily that This is just an explicit presentation of a random variable with the limit distribution of K 0:n as n → ∞ which was described in [37, Proposition 5.1] by the probability generating function g θ (z) := Ez K0: which can be read from (7.1) as an infinite product of modified geometric generating functions. This product simplifies to (7.2)  That result may by understood as a refinement of (7.1) in which each term (G(p i )−1) + is replaced by the distributionally equivalent random variable N i (θB 1−pi ε i /i) where the ε i are independent standard exponential variables, the B 1−pi are independent Bernoulli variables with the indicated parameters, independent also of the ε i , and the N i are independent rate one Poisson processes independent of both the ε i and the B 1−pi . Then there is the identity in distribution which can be checked by computing the Laplace transform of both sides at λ > 0. This identity (7.3) is the instance a = 1, b = θ of the identity (7.4) presented in the following proposition, which is the simpler variant for log beta variables of a representation of log gamma variables due to Gordon [38]. This identity can be found in a not easily accessible text by Pakes [55, (32.17)].
Proposition 7.1. For each a, b > 0, for 0 < β a,b < 1 with density proportional to u a−1 (1 − u) b−1 at 0 < u < 1 there is the identity in distribution (7.4) where the ε j are i.i.d. standard exponential variables, independent of a sequence of independent Bernoulli variables B pj with parameters p j = b/(a + b + j).
Proof. The well known identity in distribution β a,b γ a+b d = γ a for independent beta and gamma variables with the indicated parameters, and known representations of log gamma variables, show that the distribution of the non-negative random variable − log β a,b is infinitely divisible with Lévy density at x > 0 which is given by the formula [8, p. 769] x −1 1 − e −x (e −ax − e −(a+b)x ) = ∞ j=0 x −1 (e −(a+j)x − e −(a+b+j)x ). (7.5) But it is also well known and easily checked that the jth term on the right side of (7.5) is the Lévy density of the infinitely divisible law of the jth term in (7.4). So the conclusion follows easily from the additivity of Lévy measures.  where the first d = holds only if θ = m ∈ N, but the next two d = hold for all θ > 0 by Theorem 1.1. Also for θ = m ∈ N, the right side of (7.6) is the inverse of the central binomial coefficient 2m m −1 ∼ 2 −2m √ πm as m → ∞, and the approach of this probability to 0 is similarly rapid for θ → ∞ through real values, due to Stirling's approximation to the gamma function. In particular, for θ = 1, the probability of a complete sample is simply 1/2. This is also known [43] to be the common value of P(K n = M n ) for every n in the case of i.i.d. sampling from geometric(1/2). Some extensions of these results to more general RAMs with i.i.d. factors will be given in [61].
Also either from (7.1) or from the probability generating function g θ given in (7.2) it is easy to find the generating function for the Lévy measure of K 0:∞ which has atoms λ k in k = 1, 2, . . .: ∞ k=1 λ k z k = − log g θ (0) + log g θ (z) = log Γ(1 + 2θ) Γ(1 + θ) Its Taylor's expansion gives an expression for the atoms λ k in terms of the polygamma function.
The tail probability generating function for L ∞ is the Gaussian hypergeometric function (1) k (1 + θ) k z k = 2 F 1 1, 1 1 + θ ; z from which the limiting mean is found to be lim n→∞ E(L n ) = E(L ∞ ) = ∞ k=0 P(L ∞ > k) = θ (θ − 1) + where the last expression should be read as θ/(θ − 1) < ∞ for θ > 1, and θ/0 = ∞ for θ ≤ 1. Similarly, the limiting second moment is finite only if θ > 2 with the simple limit formula for the second binomial moment It appears that this pattern continues, with finite third binomial moment 2!θ/(θ − 3) 3 for θ > 3, and so on. Since the number of ties at the maximum L n can be defined on a common probability space, one can also be interested in some stronger types of convergence for these random variables than the convergence in distribution. The answer to this question for independent random variables is well known: Brands et al. [9] conjectured and Baryshnikov et al. [7] confirmed that the number L n of maxima in a sample of n independent discrete random variables can exhibit just three types of behavior as n → ∞: either it converges to 1 or to ∞ in probability, or it does not have a limit. These three cases can be distinguished in terms of the discrete hazards (1.7) of the distribution of X 1 , which are nonrandom in this case, say, H j = h j . If h j → 0 as j → ∞, then the number of maxima converges in probability to 1, and this is the only possibility for convergence to a proper distribution. This result was extended to an almost sure convergence by Qi [64], who showed that a.s. convergence holds if and only if the series j h 2 j converges. Later, a probabilistic proof of this result was given by Eisenberg [19] along with some extensions. His results are also formulated in terms of the discrete hazards: This result can be immediately translated to the exchangeable GEM case, because in this case hazards are independent random variables. Heuristically, the next Theorem means that for α ∈ (0, 1), as k becomes large, the GEM(α, θ) probabilities P k a.s. tend to zero regularly and the maximum is unlikely to be hit twice before a new maximal value is reached, while for α = 0 the situation is opposite, there exist k such that P k is arbitrary large compared to the tail 1 − F k and such values k are repeated as maxima of the sample many times. (1 − α) k (1 + (i − 1)α + θ) k .