LIMIT DISTRIBUTIONS AND RANDOM TREES DERIVED FROM THE BIRTHDAY PROBLEM WITH UNEQUAL PROBABILITIES

Given an arbitrary distribution on a countable set S consider the number of independent samples required until the first repeated value is seen. Exact and asymptotic formulae are derived for the distribution of this time and of the times until subsequent repeats. Asymptotic properties of the repeat times are derived by embedding in a Poisson process. In particular, necessary and sufficient conditions for convergence are given and the possible limits explicitly described. Under the same conditions the finite dimensional distributions of the repeat times converge to the arrival times of suitably modified Poisson processes, and random trees derived from the sequence of independent trials converge in distribution to an inhomogeneous continuum random tree.


Introduction
Recall the classical birthday problem: given that each day of the year is equally likely as a possible birthday, and that birthdays of different people are independent, how many people are needed in a group to have a better than even chance that at least two people have the same birthday? The well known answer is 23. Here we consider a number of extensions of this problem. We allow the "birthdays" to fall in some finite or countable set S and let their common distribution be arbitrary on this set. We generalize the birthday problem in this setting as follows: in a stream of people, what is the distribution of the number who arrive before the mth person whose birthday is the same as that of some previous person in the stream? Our main motivation for studying the distributions of these random variables, which we call repeat times, is that they arise naturally in the study of certain kinds of random trees.
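The classical answer quoted above can be checked directly. The following is a minimal sketch, not part of the original paper; the function name `birthday_threshold` and its parameters are illustrative assumptions.

```python
def birthday_threshold(n_days=365, target=0.5):
    """Smallest group size whose chance of a shared birthday exceeds `target`.

    Assumes birthdays are independent and uniform over `n_days` days.
    """
    p_distinct = 1.0  # probability that the people counted so far are all distinct
    people = 0
    while 1.0 - p_distinct <= target:
        # the next person must avoid the `people` birthdays already taken
        p_distinct *= (n_days - people) / n_days
        people += 1
    return people


print(birthday_threshold())  # 23
```

The loop multiplies in one factor per arriving person, so it stops at the first group size for which the probability of a match exceeds one half.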
The distribution of the first repeat time has been studied widely. By truncating the Taylor series of the generating function Gail et al [14] derived an approximate distribution and applied their result to a problem of cell culture contamination. Using Newton polynomials Stein [28] derived the same approximation and supplied an error bound. Mase [21] used similar techniques to derive an approximation (with bound) in connection with the number of surnames in Japan. See also [18].
In the quota problem each possible value $j$ is assigned a quota, say $v_j$, and the problem is to describe the distribution of the time at which a quota is first met. If $v_j = 2$ for all $j$ this is the time of the first repeat. Using the technique of embedding in a Poisson process, Holst [15,16] found expressions for the moments of a general quota fulfilment time, and specialised to find the asymptotic distribution of the first $k$-fold repeat time under the assumption that the probabilities are uniform across values. Here we use a Poisson embedding to derive asymptotic repeat time distributions for an arbitrary sequence of underlying value distributions. These results can easily be extended to the setting of the general quota problem. Aldous [1] gave a heuristic derivation of the limiting distributions for $k$-fold repeats. The results of Section 4 are extended to $k$-fold repeats in the companion paper [9].
The birthday problem can also be approached by counting the number of matched pairs in a set. Theorem 5.G in Barbour, Holst and Janson [6] gives a Poisson approximation (with error bound) to the number of matched pairs, from which the "if" part of Corollary 5 below may be deduced.

Overview of Results
This section presents some of the main results of the paper, with pointers to following sections for details and further developments.
Let $p$ be a probability distribution on a finite or countable set $S$ with $p_s > 0$ for all $s \in S$. We refer to elements of $S$ as values. Let $Y_0, Y_1, \ldots$ be i.i.d.($p$), meaning independent and identically distributed with common distribution $p$. Let $R_m$ be the time of the $m$th repeat in this sequence, that is, the $m$th index $n \ge 1$ such that $Y_n \in \{Y_0, \ldots, Y_{n-1}\}$, and let
$$A_m := \{Y_0, Y_1, \ldots, Y_{R_m}\}$$
denote the random set of observed values at the time of the $m$th repeat.
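To make the definitions concrete, here is a minimal sketch (an illustration, not taken from the paper) that extracts the repeat times and the sets of observed values from a finite sequence. The function name `repeat_times` is a hypothetical helper, and indexing starts at 0 as in the sequence $Y_0, Y_1, \ldots$.

```python
def repeat_times(seq):
    """Return [(R_1, A_1), (R_2, A_2), ...] for a finite sequence.

    R_m is the index of the m-th element that equals some earlier element,
    and A_m is the set of values observed up to that index.
    """
    seen = set()
    out = []
    for n, y in enumerate(seq):
        if y in seen:
            out.append((n, frozenset(seen)))  # a repeat adds no new value
        else:
            seen.add(y)
    return out


for m, (r, a) in enumerate(repeat_times("abacb"), start=1):
    # the number of distinct values at the m-th repeat is R_m - m + 1
    print(m, r, sorted(a), len(a) == r - m + 1)
```

For the sequence a, b, a, c, b the repeats occur at indices 2 and 4, and in both cases the number of distinct values equals $R_m - m + 1$.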
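For an arbitrary $A \subseteq S$ let $|A|$ denote its cardinality, and define $p_A := \sum_{i \in A} p_i$ and $\Pi_A := \prod_{i \in A} p_i$. Section 3 derives some exact formulae for the distribution of $R_m$ by conditioning on $A_m$. In particular, for the first repeat $R_1$ there are the formulae
$$P(R_1 = k) = k! \sum_{|A| = k} \Pi_A \, p_A \qquad (1)$$
$$P(R_1 \ge k) = k! \sum_{|A| = k} \Pi_A \qquad (2)$$
where the sums are over all subsets $A$ of $S$ of size $k$. The $A$th term in (1) is $P(A_1 = A)$. Similarly, for the second repeat,
$$P(R_2 = k + 1) = \frac{(k+1)!}{2} \sum_{|A| = k} \Pi_A \, p_A^2 \qquad (3)$$
where the $A$th term is $P(A_2 = A)$. These formulae allow random variables with the same distribution as $R_m$ to be recognized in other contexts, where results of this paper concerning the asymptotic distribution of $R_m$ may be applied.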
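For finite $S$ the sum of $\Pi_A$ over all subsets $A$ of size $k$ is the $k$th elementary symmetric polynomial in the $p_i$, so the probability that the first $k$ samples are all distinct can be computed without enumerating subsets. The sketch below is an illustration under that observation; the function name `prob_first_k_distinct` is an assumption, not notation from the paper.

```python
import math


def prob_first_k_distinct(p, k_max):
    """Return [P(first k samples distinct) for k = 0..k_max] for a finite distribution p.

    Uses P = k! * e_k(p), where e_k is the k-th elementary symmetric polynomial,
    computed by the standard one-pass dynamic programme.
    """
    e = [1.0] + [0.0] * k_max
    for x in p:
        # update high degrees first so each p_i is used at most once per subset
        for k in range(k_max, 0, -1):
            e[k] += x * e[k - 1]
    return [math.factorial(k) * e[k] for k in range(k_max + 1)]


uniform = [1.0 / 365] * 365
d = prob_first_k_distinct(uniform, 23)
print(round(d[22], 4), round(d[23], 4))
```

In the uniform case with 365 values the probability drops below one half between $k = 22$ and $k = 23$, which recovers the classical birthday answer; the same routine accepts any non-uniform finite distribution.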
In particular, the distribution of $R_m$ arises in the study of random trees. Given a sequence of $S$-valued random variables $(Y_0, Y_1, \ldots)$ define a directed graph
$$T(Y_0, Y_1, \ldots) := \{(Y_{n-1}, Y_n) : n \ge 1,\ Y_n \notin \{Y_0, \ldots, Y_{n-1}\}\}. \qquad (4)$$
Then $T(Y_0, Y_1, \ldots)$ is a random tree labelled by $\{Y_0, Y_1, \ldots\}$ with root $Y_0$. Intuitively, the tree grows along the sequence until it encounters a repeat, at which point it backtracks to the first occurrence of the repeated value and continues its growth from there. The random tree $T(Y_0, Y_1, \ldots)$ has been studied for $(Y_0, Y_1, \ldots)$ a finite state Markov chain [8], [20, §6.1]. By specializing a general Markov chain formula to the present setting, and evaluating a constant of normalization by use of Cayley's multinomial expansion [26,25], there is the following result, an alternative proof of which is indicated after Lemma 7.
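Lemma 1 For finite $S$ and $(Y_0, Y_1, \ldots)$ i.i.d.($p$), the random tree $T(Y_0, Y_1, \ldots)$ has the following distribution on the set $T(S)$ of all rooted trees labelled by $S$:
$$P(T(Y_0, Y_1, \ldots) = t) = \prod_{s \in S} p_s^{C_s t} \qquad (5)$$
where $C_s t$ is the number of children (out-degree) of $s$ in $t$.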
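The growth-and-backtrack description of the tree can be sketched in a few lines. This is an illustrative reading of the construction, assuming that an edge is added from the previous term to the current term exactly when the current term is a new value; the helper names `tree_edges` and `out_degrees` are hypothetical.

```python
from collections import Counter


def tree_edges(seq):
    """Return (root, edges) of the tree built from a finite sequence.

    An edge (Y_{n-1}, Y_n) is added whenever Y_n has not been seen before,
    so after a repeat the tree continues growing from the repeated vertex.
    """
    seen = {seq[0]}
    edges = []
    for prev, cur in zip(seq, seq[1:]):
        if cur not in seen:
            edges.append((prev, cur))
            seen.add(cur)
    return seq[0], edges


def out_degrees(edges):
    """Number of children of each vertex that has at least one child."""
    return Counter(parent for parent, _ in edges)


root, edges = tree_edges(list("abcad"))
print(root, edges, dict(out_degrees(edges)))
```

For the sequence a, b, c, a, d the path a, b, c is grown first; the repeat of a sends the walk back to the root, and d is attached there, giving a two children and b one child.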
Properties of these random trees are linked to repeat times via the following two results, which are proved in Section 3.2.

.(p) and this collection of random variables is independent of the ran-
For a discrete distribution $p$ with support $S$ call a random tree $T$ labelled by $S$ a $p$-tree if $T$ has the same distribution as $T(Y_0, Y_1, \ldots)$ for an i.i.d.($p$) sequence $(Y_i)$. For finite $S$, the distribution of a $p$-tree $T$ on $T(S)$ is given by formula (5). See [25,23,24] regarding $p$-trees and related models for random forests.
which also holds jointly as $m$ varies. In particular, the number of vertices of $S_m$ has the same distribution as the number $R_m - m + 1$ of vertices of $T_m$, which is the number of distinct values before the $m$th repeat in an i.i.d.($p$) sequence.
The joint distribution featured in (6) is described explicitly in Section 3.2 by formula (18). According to Corollary 3 for $m = 1$, the distribution of $R_1$ described by (1) and (2) is also the distribution of the number of vertices on the path from $X_1$ to $X_2$ in a $p$-tree, for $X_1$ and $X_2$ with distribution $p$ picked independently of each other and of the tree. For $p$ the uniform distribution on a finite set this is equivalent to the formula of Meir and Moon [22] for the distribution of the distance between two distinct points in a uniform random tree. Another random variable with the same distribution as $R_1$ is the number $C$ of cyclic points generated by a random mapping $M : S \to S$ such that the $M(s)$ are i.i.d.($p$) as $s$ ranges over $S$. Jaworski [17] obtained an equivalent of (1) with $C$ in place of $R_1$ for finite $S$. As observed in [23], this identity in distribution is explained by Joyal's [19] bijection between $S^S$ and $S \times S \times U(S)$, where $U(S)$ is the set of unrooted trees labelled by $S$.
Consider now the problem of describing the asymptotic distribution of the first repeat time $R_1$ in an i.i.d.($p$) sequence, in a limiting regime with the probability distribution $p$ depending on a parameter $n = 1, 2, \ldots$. By an appropriate relabeling of the set of possible values by positive integers, there is no loss of generality in supposing that the $n$th distribution is a ranked discrete distribution $(p_{ni}, i \ge 1)$, meaning that
$$p_{n1} \ge p_{n2} \ge \cdots \ge 0 \quad \text{and} \quad \sum_i p_{ni} = 1.$$
For each $n$ let $Y_{nj}, j \ge 0$ be i.i.d. with this distribution, and for $m \ge 1$ define $R_{nm}$ to be the time of the $m$th repeat in the sequence $(Y_{nj}, j \ge 0)$. In the uniform case, when
$$p_{ni} = 1/n \quad \text{for } 1 \le i \le n, \qquad (7)$$
it is elementary and well known [12, p. 83] that for all $r \ge 0$
$$\lim_{n \to \infty} P(R_{n1}/\sqrt{n} > r) = e^{-r^2/2}. \qquad (8)$$
Consider more generally the problem of characterizing the set of all possible asymptotic distributions of $R_{n1}$ derived from a sequence of ranked distributions $(p_{ni}, i \ge 1)$ with $p_{n1} \to 0$ as $n \to \infty$. A central result of this paper, established in Section 4.2, is the solution to this problem provided by the following theorem:

Theorem 4 Let $R_{n1}$ be the index of the first repeated value in an i.i.d. sequence with discrete distribution whose point probabilities in non-increasing order are $(p_{ni}, i \ge 1)$. Let $s_n := \sqrt{\sum_i p_{ni}^2}$ and $\theta_{ni} := p_{ni}/s_n$.
(i) If
$$p_{n1} \to 0 \text{ as } n \to \infty \text{ and } \theta_i := \lim_n \theta_{ni} \text{ exists for each } i \qquad (9)$$
then for each $r \ge 0$
$$\lim_{n \to \infty} P(s_n R_{n1} > r) = e^{-\frac{1}{2}\left(1 - \sum_i \theta_i^2\right) r^2} \prod_i (1 + \theta_i r)\, e^{-\theta_i r}. \qquad (10)$$
(ii) Conversely, if there exist positive constants $c_n \to 0$ and $d_n$ such that the distribution of $c_n (R_{n1} - d_n)$ has a non-degenerate weak limit as $n \to \infty$, then $p_{n1} \to 0$ and limits $\theta_i$ exist as in (i), so the weak limit is just a rescaling of that described in (i), with $c_n/s_n \to \alpha$ for some $0 < \alpha < \infty$, and $c_n d_n \to 0$.
Thus for a general sequence of ranked discrete distributions $(p_{ni}, i \ge 1)$ with $p_{n1} \to 0$ the appropriate scaling constants for the first repeat times are $(s_n, n \ge 1)$. The quantity $\theta_{n1}$ measures the probability of the most probable value relative to this scaling. In particular, Theorem 4 shows when the limit distribution of $R_{n1}$ is the same as in the uniform case:

Corollary 5 With the notation of the previous theorem,
$$\lim_{n \to \infty} P(s_n R_{n1} > r) = e^{-r^2/2} \qquad (11)$$
for all $r \ge 0$ if and only if both $p_{n1} \to 0$ and $\theta_{n1} \to 0$ as $n \to \infty$.
This limiting Rayleigh distribution is that of the first point of a Poisson process on $[0, \infty)$ of rate $t$ at time $t$. It is implicit in the work of Aldous [3] that in the uniform case the rescaled repeat times $R_{n1}/\sqrt{n}, R_{n2}/\sqrt{n}, \ldots$ converge jointly in distribution to the arrival times of such a Poisson process. In Section 4.3 we establish a corresponding generalisation of Theorem 4:

Theorem 6 In the asymptotic regime (9),
$$(s_n R_{n1}, s_n R_{n2}, \ldots) \to_d (\eta_1, \eta_2, \ldots)$$
where $0 < \eta_1 < \eta_2 < \cdots$ are the arrival times for the superposition of independent point processes: a Poisson process on $[0, \infty)$ of rate $(1 - \sum_i \theta_i^2)\,t$ at time $t$ and, for each $i$ with $\theta_i > 0$, a homogeneous Poisson process of rate $\theta_i$ with its first point deleted.

Theorem 14 in Section 4.4 presents a refinement of this result in terms of a family of point processes in the plane constructed from independent Poisson processes. A corollary of Theorem 14, presented in Section 5, describes a sense in which the sequence of random trees $T(Y_{nj}, j \ge 0)$ converges in distribution in the same limit regime (9) to a continuum random tree (CRT) which can be constructed directly from the point processes in the plane. This leads to a new kind of CRT, an inhomogeneous continuum random tree (ICRT) $T^\theta$, parameterised by the ranked non-negative sequence $\theta := (\theta_i, i \ge 1)$ with $\sum_i \theta_i^2 \le 1$. See Aldous-Pitman [4] for the study of various distributional properties of the limiting ICRT $T^\theta$, and Aldous-Pitman [5] for the application of this ICRT to the study of a coalescent process.
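The limit law of the first repeat time can be explored numerically. The sketch below assumes that the limiting survival function takes the form $e^{-\frac{1}{2}(1 - \sum_i \theta_i^2) r^2} \prod_i (1 + \theta_i r) e^{-\theta_i r}$, which is the probability that the superposed point processes described above have no point in $[0, r]$; both function names are hypothetical, and `exact_survival` handles only finite distributions.

```python
import math


def limit_survival(r, theta=()):
    """Assumed limit of P(s_n * R_n1 > r) for a ranked sequence theta."""
    a = 1.0 - sum(t * t for t in theta)
    value = math.exp(-0.5 * a * r * r)          # Rayleigh-type factor
    for t in theta:
        value *= (1.0 + t * r) * math.exp(-t * r)  # at most one point of rate t in [0, r]
    return value


def exact_survival(p, r):
    """Exact P(s * R_1 > r) for a finite distribution p, with s = sqrt(sum p_i^2)."""
    s = math.sqrt(sum(x * x for x in p))
    k = math.floor(r / s) + 1                   # s * R_1 > r  iff  R_1 >= k
    e = [1.0] + [0.0] * k
    for x in p:
        for j in range(k, 0, -1):
            e[j] += x * e[j - 1]
    return math.factorial(k) * e[k]             # P(first k samples are distinct)


n = 2500
print(exact_survival([1.0 / n] * n, 1.0), limit_survival(1.0))
```

With `theta` empty the function reduces to the Rayleigh tail $e^{-r^2/2}$, and for the uniform distribution on 2500 values the exact probability is already close to that limit.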

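The exact distribution of $R_m$
Since each of the first $m$ repeats adds no new value to the set of observed values,
$$R_m = |A_m| + m - 1. \qquad (12)$$
Thus to describe the distribution of $R_m$ it is enough to describe the distribution of the random set $A_m$.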
If $A_1 = A$ then the first $|A|$ values taken by the $Y_i$ are distinct and are exactly the values in $A$. Note that $R_1 = |A|$. By independence,
$$P(A_1 = A) = |A|! \, \Pi_A \, p_A. \qquad (13)$$
This yields formula (1). More generally, if $A_m = A$ then $(Y_0, \ldots, Y_{R_m - 1})$ contains each of the elements of $A$ plus $m - 1$ repeated values. Again $Y_{R_m}$ takes a repeated value, which contributes a factor $p_A$ to $P(A_m = A)$. In particular, $(Y_0, Y_1, \ldots, Y_{R_2 - 1})$ contains exactly one repeated value. The number of permutations of $k$ objects with two indistinguishable and the rest distinct is $k!/2!$, thus for an arbitrary set $A$
$$P(A_2 = A) = \frac{(|A| + 1)!}{2!} \, \Pi_A \, p_A^2. \qquad (14)$$
Combined with (12) this yields (3).
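Similarly, $(Y_0, Y_1, \ldots, Y_{R_3 - 1})$ contains either one triple repeat or two values repeated once each. Hence
$$P(A_3 = A) = (|A| + 2)! \, \Pi_A \, p_A \left( \frac{1}{3!} \sum_{i \in A} p_i^2 + \frac{1}{2!\,2!} \sum_{\{i,j\} \subseteq A,\ i \ne j} p_i p_j \right) \qquad (15)$$
which combines with (12) to give a formula for the distribution of $R_3$.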
To present a general formula for the distribution of $A_m$ we need some notation involving partitions. Let $a := (a_1, a_2, \ldots)$ be a non-increasing sequence of non-negative integers with $|a| := a_1 + a_2 + \cdots < \infty$ and $l(a) := \max\{i : a_i \ne 0\}$. Call $a$ a partition of $|a|$ into $l(a)$ parts. Let By a straightforward extension of the argument which led for $m = 1, 2, 3$ to formulae (13), (14) and (15) respectively, there is the following general formula: for $m \ge 1$ where the sum is over all partitions $a = (a_1, a_2, \ldots)$ of $m - 1$. The distribution of $R_m$ is now determined by summing over appropriate sets $A$, as in formula (12). Alternatively, an expression for the tail probabilities of $R_m$ is obtained by conditioning on the partition of $k$ induced by values where the sum is over all partitions $b = (b_1, b_2, \ldots)$ of $k$ into more than $k - m$ parts. In the particular case $m = 1$ this gives formula (2).

Analysis of the tree
Recall the definition (4) of $T(Y_0, Y_1, \ldots)$. Theorem 2 and Corollary 3 are obtained by letting $m \to \infty$ in the following Lemma. Define $T^*(S)$ to be the set of all rooted trees labelled by some finite non-empty subset of $S$. For $t \in T^*(S)$ the set of leaves of $t$ is the set of all vertices of $t$ whose out-degree in $t$ is zero.
Then for each $t \in T^*(S)$ whose set of leaves is contained in the set and

Proof. Essentially the same inductive argument shows that for each given sequence of values $(y_i, 1 \le i \le m) \in S^m$, each tree $t \in T^*(S)$ with $a$ vertices whose set of leaves is contained in the set The probability of this event is therefore $\prod_{j=0}^{m+a-1} p_{w_j}$ and it is easily shown that this product can be rearranged as in the formula (18). Formula (19) now follows by summing (18). □

Proof of Theorem 2. For finite $S$ this is obtained by a reprise of the previous argument, using formula (19) and Lemma 1. The result for infinite $S$ follows using the fact that the σ-field generated by $T_m$ increases to the σ-field generated by $T(Y_0, Y_1, \ldots)$. □

Proof of Corollary 3. This follows immediately from Theorem 2 and the first sentence in the proof of Lemma 7. □

This is obvious for Then it is easily seen that for $y \ne z$ and every finite subset $A$ of $S - \{y, z\}$ Now (20) for $y \ne z$ follows from (21) and the following formula, which is valid for every subset $B$ of a countable set $S$, and every probability distribution $p$ on $S$, with $\Pi_A := \prod_{i \in A} p_i$:

Limit distributions
Throughout this section we work with the setting and notation introduced in Theorem 4.

Poisson embedding
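Without loss of generality, it will be assumed from now on that the i.i.d. sequences $(Y_{nj}, j \ge 0)$ are constructed simultaneously for all $n$ from the points of a homogeneous Poisson process $N$ of rate 1 on $[0, \infty)$ whose points carry independent uniform marks in $[0, 1]$. For $n \ge 1$ partition $[0, 1]$ into intervals $I_{n1}, I_{n2}, \ldots$ such that the length of $I_{ni}$ is $p_{ni}$, and let $Y_{nj} = i$ if the mark of the $j$th point of $N$ falls in $I_{ni}$, so for each $n$ the $Y_{nj}, j \ge 0$ are i.i.d. with distribution $(p_{ni}, i \ge 1)$. Let $(R_{nm}, m \ge 1)$ mark the repeats in this sequence and let $(T_{nm}, m \ge 1)$ be the corresponding times within $N$, that is
$$R_{nm} = N(T_{nm}-). \qquad (24)$$
The next lemma allows us to deduce limits in distribution for the finite dimensional distributions of $(R_{nm}, m \ge 1)$ from corresponding limits in distribution of $(T_{nm}, m \ge 1)$.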
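A Poisson embedding of this kind can be simulated directly. The sketch below is an illustration under stated assumptions rather than the paper's exact construction: it assumes a rate-1 Poisson process with independent uniform marks, a finite distribution `p`, and a hypothetical function name `poisson_embedded_repeats`.

```python
import bisect
import itertools
import random


def poisson_embedded_repeats(p, m, rng):
    """Return [(R_1, T_1), ..., (R_m, T_m)] for a finite distribution p.

    Samples arrive at the points of a rate-1 Poisson process; the j-th point
    (j = 0, 1, ...) carries a uniform mark that selects the value Y_j.
    R_k is the index of the k-th repeat and T_k its arrival time.
    """
    cum = list(itertools.accumulate(p))  # right endpoints of the intervals I_i
    seen, t, j, out = set(), 0.0, 0, []
    while len(out) < m:
        t += rng.expovariate(1.0)        # exponential gap to the next arrival
        u = rng.random()
        i = min(bisect.bisect_right(cum, u), len(p) - 1)  # interval containing u
        if i in seen:
            out.append((j, t))           # j points arrived strictly before time t
        else:
            seen.add(i)
        j += 1
    return out


print(poisson_embedded_repeats([0.5, 0.3, 0.2], 3, random.Random(1)))
```

Because arrivals occur at unit rate, the index of a repeat and its arrival time are close in ratio once the time is large, which is the comparison that the next lemma makes precise.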

Lemma 8
If $p_{n1} \to 0$, then for each $m \ge 1$ there is the convergence in probability
$$R_{nm}/T_{nm} \to 1 \quad \text{as } n \to \infty.$$
Proof. By the strong law of large numbers $N(t-)/t$ converges almost surely to 1 as $t \to \infty$ and hence by (24) it suffices to show that $T_{nm}$ converges in probability to infinity. Since $T_{n1} \le T_{nm}$ for each $m \ge 1$ it is enough to consider $m = 1$. But formulae (26) and (28) below imply that and the conclusion follows.

Lemma 9
Let θ := (θ 1 , θ 2 , . . .) be such that θ 1 ≥ θ 2 ≥ · · · ≥ 0 and i θ 2 i < ∞. Then for where the series is absolutely convergent; consequently, for such t |log g(t; θ)| ≤ t 2 2 and Proof. If 0 ≤ tθ 1 < 1 then also 0 ≤ tθ i < 1 for all i, so the expansion log(1 + z) = z − z 2 /2 + z 3 /3 − · · · for |z| < 1 yields which becomes (27) after switching the order of summation. To justify the switch by absolute convergence, let s 2 := i θ 2 i and note that for k ≥ 2 The estimates (28) and (29) follow easily by similar comparisons of (27) to a geometric series with common ratio tθ 1 . As a simple special case of the following proof, the case of Theorem 4 (i) when θ 1 = 0 and the conclusion is (11) follows immediately from this formula combined with the estimate (29) above and the substitution of T n1 for R n1 justified by Lemma 8.

Proof of Theorem 4 (i).
Fix $r > 0$ and let $j_r, n_r$ be such that $n > n_r$ implies $r\theta_{n j_r} < 1$. Clearly lim n→∞ i≤jr In view of (32) and Lemma 8 it only remains to show lim n→∞ i>jr From the choice of $j_r$, if $n > n_r$ equation (27) implies and it is easily checked, using the bound $\theta_{ni}^m \le \theta_{ni}^2 \theta_{nj}^{m-2}$ for $i \ge j$ with large $j$, and $\sum_i \theta_{ni}^2 = 1$ for all $n$, that for all $m > 2$ lim n→∞ i>jr The kind of bound used in equation (31) now allows the proof to be completed by dominated convergence. □

Proof of Theorem 4 (ii). By consideration of subsequential limits and convergence of types [7, Theorem 14.2], it is easily seen that it suffices to establish the following lemma.

Asymptotics of Joint Distributions
We start by proving the particular case of Theorem 6 when $\theta_i = 0$ for all $i \ge 1$. That is: For $n, i \ge 1$ let $\mathcal{F}^{ni} := (\mathcal{F}^{ni}_t, t \ge 0)$ be the natural filtration of $N_{ni}(\cdot/s_n)$ and let $\mathcal{F}^n := (\mathcal{F}^n_t, t \ge 0)$ be the smallest filtration containing $\{\mathcal{F}^{ni} : i \ge 1\}$. Let $(C_{ni}(t), t \ge 0)$ be the compensator of $N^-_{ni}(\cdot/s_n)$ with respect to the filtration $\mathcal{F}^{ni}$ and $(C_n(t), t \ge 0)$ the compensator of $X_n$ with respect to $\mathcal{F}^n$. Thus The compensator of $M$ with respect to its natural filtration is $C(t) := t^2/2$. By Theorem 13.4.IV of Daley and Vere-Jones [11] it is sufficient to show $C_n(t)$ The process $N_{ni} := (N_{ni}(r), r \ge 0)$ is a homogeneous Poisson process of rate $p_{ni}$, with compensator $(p_{ni} r, r \ge 0)$. Thus $(N_{ni}(t/s_n), t \ge 0)$ has compensator $(\theta_{ni} t, t \ge 0)$. If $T_{ni1}$ is the time of the first point of $N_{ni}$ then $(N^-_{ni}(t/s_n), t \ge 0)$ counts only those points that arrive after $t = s_n T_{ni1}$. Hence where $s_n T_{ni1}$ has an exponential distribution with rate $\theta_{ni}$. A little calculus and equations (35) and (36) yield For $x \ge 0$ there are the elementary inequalities which applied to (37) and (38) imply By hypothesis $\theta_{n1} \to 0$ as $n \to \infty$ and the proof is complete. □

Proof. Let $K_{nm}$ be the time corresponding to $J_{nm}$ in the Poisson embedding. It is easily seen using Lemma 8 that $J_{nm}$ and $K_{nm}$ have common asymptotics in any regime with $p_{n1} \to 0$. The claimed weak convergence therefore amounts to the following: for each $m \ge 1$ So it is enough to show that the limit of the $G_n$ satisfies these conditions. Condition (a) follows from Lemma 11 and (b) can be seen as follows. Given that none of the first $m$ repeats is a triple repeat, each of the pairs $(T_{nj}, K_{nj})$ is the first two points of some homogeneous Poisson process, so $K_{nj}$ given $T_{nj}$ is uniform on $[0, T_{nj}]$, and this feature passes easily to the limit. The argument is then completed by the following lemma.

Proof. This is a straightforward variation of the proof of Theorem 6.

Asymptotics for the tree
Let $T_{nm}$ denote the tree $T_m$ derived as in Corollary 3 from an i.i.d. sequence $(Y_{nj}, j \ge 0)$ with distribution $(p_{ni}, i \ge 1)$. So $T_{nm}$ is the subtree of $T(Y_{nj}, j \ge 0)$ spanned by $Y_{n0}$ and the $Y_{n, R_{ni} - 1}$ for $1 \le i \le m$. Consider the behaviour of the trees $T_{nm}$ in the asymptotic regime (9). In the uniform case (7), results of Aldous [2] describe the asymptotic behaviour of a suitably reduced version of $T_{nm}$, with edge lengths normalized by $1/\sqrt{n}$, in terms of a continuum random tree (CRT). It follows from the previous results that Aldous's description can be transferred to the case $p_{n1} \to 0$ and $\lim_n \theta_{n1} = 0$, with normalization of edge lengths of $T_{nm}$ by $s_n := \sqrt{\sum_i p_{ni}^2}$ instead of $1/\sqrt{n}$, and with the same limiting CRT. We now describe the limiting behaviour of $T_{nm}$ in the more general case, with $p_{n1} \to 0$ and $\lim_n \theta_{ni} = \theta_i$ for all $i$. This leads to a new kind of CRT, an inhomogeneous continuum random tree (ICRT) $T^\theta$, parameterised by the ranked non-negative sequence $\theta := (\theta_i, i \ge 1)$ with $\sum_i \theta_i^2 \le 1$. Following Aldous-Pitman [5], we first introduce an appropriate space of trees for the description of the limit process involved. For $k \ge 0$ and $m \ge 1$ let $T_{k,m}$ be the space of trees such that
(i) there are exactly $m + 1$ leaves (vertices of degree 1), labeled $0+, \ldots, m+$;
(ii) there may be extra labeled vertices, with distinct labels in $\{1, \ldots, k\}$;
(iii) there may be unlabeled vertices of degree 3 or more;
(iv) each edge $e$ has a length $l_e$, where $l_e$ is a strictly positive real number.
Let $E_{nm}$ denote the event that the vertices $Y_{n0}$ and $Y_{n, R_{ni} - 1}$ for $1 \le i \le m$ are $m + 1$ distinct leaves of $T_{nm}$, where edge directions in $T_{nm}$ are now ignored, so the root $Y_{n0}$ of $T_{nm}$ may be a leaf. It follows easily from the previous results that the event $E_{nm}$ has probability approaching 1 in the limit. If $E_{nm}$ occurs define a $T_{k,m}$-valued random tree $R_{nkm}$, as follows. First make $T_{nm}$ into a "tree with edge-lengths" by assigning length $s_n := \sqrt{\sum_i p_{ni}^2}$ to each edge. Relabel vertex $Y_{n0}$ as vertex $0+$ and, for each $1 \le j \le m$, relabel vertex $Y_{n, R_{nj} - 1}$ as vertex $j+$. Of the remaining vertices, those with labels $1 \le i \le k$ retain the label, and the others are unlabeled. Finally, unlabeled vertices of degree 2 are deleted. More precisely, each maximal $l$-edge path joining such vertices is replaced by a single edge of length $l s_n$. The resulting tree is $R_{nkm}$. See [5] for a more detailed account of this and the following construction, with diagrams. If $E_{nm}$ does not occur, we set $R_{nkm} = \partial$ for some conventional state $\partial$ not in $T_{k,m}$. We call $R_{nkm}$ a reduced tree derived from $T_{nm}$. To discuss weak convergence of the distribution of $R_{nkm}$ as $n \to \infty$, we put the following topology on $T_{k,m}$, then add $\partial$ as an isolated point. Each tree $t \in T_{k,m}$ has a shape $\mathrm{shape}(t)$, which is the combinatorial tree obtained by ignoring edge-lengths. The set $T^{\mathrm{shape}}_{k,m}$ of possible shapes is finite. One can formally regard $t$ as a vector $(\mathrm{shape}(t); l_e, e \text{ an edge of } \mathrm{shape}(t))$ and thereby $T_{k,m}$ inherits a topology from the discrete topology on $T^{\mathrm{shape}}_{k,m}$ and the usual product topology on $\mathbb{R}^d$.
By construction of $R_{nkm}$, given that $E_{nm}$ occurs, the total length of all edges of $R_{nkm}$ is $s_n R_{nm}$. According to Theorem 6, in the limit regime (9) the distribution of this total length converges as $n \to \infty$ to the distribution of the time $\eta_m$ of the $m$th arrival in a limiting point process. Theorem 14 allows this convergence in distribution of the total length of $R_{nkm}$ to be strengthened to convergence in distribution of $R_{nkm}$ to $R^\theta_{km}$ for a random element $R^\theta_{km}$ of $T_{k,m}$ which can be constructed directly from the Poisson point processes featured in Theorem 14. We state this formally in Corollary 15 below, following the construction of $R^\theta_{km}$ in the next paragraph from the Poisson processes featured in Theorem 14.
Fix $\theta := (\theta_1, \theta_2, \ldots)$ with $\theta_1 \ge \theta_2 \ge \cdots \ge 0$ and $\sum_i \theta_i^2 \le 1$, and define $a := 1 - \sum_i \theta_i^2$. So $0 \le a \le 1$. If $a > 0$ let $((U_j, V_j), 1 \le j < \infty)$ be the points of the Poisson point process $M_0$ of rate $a$ per unit area on the octant $\{(u, v) : 0 \le v \le u < \infty\}$, labeled so that $0 < U_1 < U_2 < \ldots$. In the case $a = 0$, ignore subsequent mentions of $U_j$ and $V_j$. For each $i$ such that $\theta_i > 0$, let $0 < \xi_{i,1} < \xi_{i,2} < \ldots$ be the points of the Poisson point process on $(0, \infty)$ of rate $\theta_i$ per unit length. Call each point $U_j$ a 0-cutpoint, and say that $V_j$ is the corresponding joinpoint. Call each point $\xi_{i,j}$ with $\theta_i > 0$ and $j \ge 2$ (note the 2) an $i$-cutpoint, and say that $\xi_{i,1}$ is the corresponding joinpoint. Note that there are (with probability 1, a qualification in effect throughout the construction) only finitely many cutpoints in any finite interval $[0, x]$, because for $i \ge 1$ the mean number of $i$-cutpoints in that interval equals $\theta_i x - (1 - \exp(-\theta_i x)) \le \theta_i^2 x^2$. We may therefore order the cutpoints as $0 < \eta_1 < \eta_2 < \ldots$, where $\eta_j \to \infty$ as $j \to \infty$. These $\eta_j$ then have the same joint distribution as the $\eta_j$ in Theorem 6. We now build a tree by starting with the branch $[0, \eta_1]$ and then, inductively on $j \ge 2$, attaching the left end of the branch $(\eta_{j-1}, \eta_j]$ to the joinpoint $\eta^*_{j-1}$ corresponding to the cutpoint $\eta_{j-1}$. After $m$ steps of this process, the interval $[0, \eta_m]$ has been randomly cut up and reassembled to form a random tree $T^\theta_m$ say, with vertex set $[0, \eta_m]$, with $m + 1$ leaves $0, \eta_1, \ldots, \eta_m$, and with a finite set of branchpoints $\{\eta^*_{j-1}, 2 \le j \le m\}$. For any finite subset $F$ of $[0, \eta_m]$ such that $F$ contains all the leaves and branchpoints of $T^\theta_m$, the tree $T^\theta_m$ can be regarded as a tree with edge-lengths and vertex set $F$.
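The cutpoints and joinpoints of this construction can be sampled on a finite window $[0, x]$. The following is a minimal sketch under stated assumptions: `theta` is a finite list, the function name `icrt_cut_and_join` is hypothetical, and the 0-cutpoints are generated by time-changing a rate-1 Poisson process, using the fact that the $u$-coordinates of a Poisson process of rate $a$ per unit area on the octant form a Poisson process of intensity $a u$ at $u$.

```python
import math
import random


def icrt_cut_and_join(theta, x_max, rng):
    """Return the sorted list of (cutpoint, joinpoint) pairs with cutpoint <= x_max."""
    a = 1.0 - sum(t * t for t in theta)
    pairs = []
    if a > 1e-12:  # guard against rounding when the squares sum to 1
        e = rng.expovariate(1.0)
        # cumulative intensity is a*u^2/2, so the j-th 0-cutpoint is sqrt(2*E_j/a)
        while math.sqrt(2.0 * e / a) <= x_max:
            u = math.sqrt(2.0 * e / a)
            pairs.append((u, rng.uniform(0.0, u)))  # joinpoint uniform on [0, u]
            e += rng.expovariate(1.0)
    for t in theta:
        if t <= 0.0:
            continue
        first = rng.expovariate(t)           # xi_{i,1}: the joinpoint, not a cutpoint
        xi = first + rng.expovariate(t)
        while xi <= x_max:
            pairs.append((xi, first))        # xi_{i,j}, j >= 2, are the i-cutpoints
            xi += rng.expovariate(t)
    pairs.sort()
    return pairs


for cut, join in icrt_cut_and_join([0.6, 0.3], 4.0, random.Random(0)):
    print(round(cut, 3), round(join, 3))
```

Reading the output in order gives the branches $(\eta_{j-1}, \eta_j]$ and the points at which each is attached; every joinpoint lies to the left of its cutpoint, so the attachment is always to a part of the tree that has already been built.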
For each $k \ge 0$ and $m \ge 1$ let $R^\theta_m$ be the tree with edge-lengths so obtained from $T^\theta_m$ and the almost surely finite set $F_{km}$ defined as the union of the set of all leaves and branchpoints of $T^\theta_m$ and the set of all $i$-joinpoints $\xi_{i,1}$ with $\xi_{i,1} < \eta_m$ and $1 \le i \le k$. Finally, let $R^\theta_{km}$ be the random element of $T_{k,m}$ derived from $R^\theta_m$ by relabeling $F_{km}$ as follows: let the leaves $0, \eta_1, \ldots, \eta_m$ be relabeled by $0+, 1+, \ldots, m+$; for each $i$ with $\xi_{i,1} < \eta_m$ and $1 \le i \le k$ let the $i$-joinpoint $\xi_{i,1}$ be relabeled by $i$, and let all remaining elements of $F_{km}$ (i.e. the 0-joinpoints and $i$-joinpoints with $i > k$ among $\{\eta^*_{j-1}, 2 \le j \le m\}$) be unlabeled. It can be checked that the various operations involved in this continuous analog of the construction of $R_{nkm}$ are appropriately continuous except on a set of probability zero in the limiting construction. Thus from the joint convergence of point processes underlying Theorem 14 we obtain:

Corollary 15
For each $k \ge 0$ and $m \ge 1$, in the asymptotic regime (9)
$$R_{nkm} \to_d R^\theta_{km}.$$
The random trees $T^\theta_m$, just used in the construction of $R^\theta_{km}$, are subtrees of an infinite tree with vertex set $[0, \infty)$, which defines the ICRT $T^\theta$ of [5] by completion in the metric on $[0, \infty)$ defined by path lengths in the infinite tree. The reduced trees $R^\theta_{km}$ then describe a consistent collection of finite-dimensional features of the infinite-dimensional ICRT $T^\theta$. See [5,4] for further developments.