Size bias for one and all

Size bias occurs famously in waiting-time paradoxes, undesirably in sampling schemes, and unexpectedly in connection with Stein's method, tightness, analysis of the lognormal distribution, Skorohod embedding, infinite divisibility, and number theory. In this paper we review the basics and survey some of these unexpected connections.


Size bias in
In the famous "waiting time paradox", see Feller [31, Section I.4], there are two plausible but conflicting analyses of the waiting time for the next bus, once you get to the bus stop. More formally, this paradox concerns the waiting time $W_t$ for the next arrival, starting from an arbitrary instant $t$, in a standard homogeneous Poisson process with intensity parameter $\lambda = 1$: (a) The lack of memory of the exponential interarrival time suggests that $E W_t$ is not sensitive to the choice of $t$; so $E W_t = E W_0 = 1$. (b) Since the starting time is chosen uniformly in the interval between two successive arrivals, an interval of mean length 1, symmetry suggests that $E W_t = 1/2$. As Feller shows, the reasoning behind both analyses is faulty, because it is the instant and not the interval which is arbitrary: a longer interval thereby becomes more likely than the relative frequencies of interarrival lengths would suggest, a canonical instance of size biasing. So an unqualified appeal to properties of the original interarrival distribution is fallacious.
In fact, as we will discuss, a reasonable but precise interpretation of "arbitrary instant" leads to the answer given in (a), though not for the reason given in (a).
Not just recreational chestnuts, but also practical matters, such as statistical sampling tasks, are bedeviled by size bias; we provide a few references later. Surprisingly, however, size bias plays a role in such unexpected contexts as Stein's method, Skorohod embedding, nonuniqueness in the method of moments, infinite divisibility of distributions, and number theory. We will return to the "paradox" shortly, after giving the basics of size bias. Then we will survey size bias as it appears in some of the non-sampling contexts.^1

In [8, pp. 78-80], the authors introduce their two and one half page survey of size bias by saying "Size-biasing arises naturally in statistical sampling theory (cf. Hansen and Hurwitz (1943) [39], Midzuno (1952) [52] and Gordon (1993) [37]), and the results we present below are all well known in the folk literature." In the present paper, we feel that we have contributed a number of new results: the conceptual heuristic given in Section 3 to explain (21), where a sum of independent variables is size biased by biasing only a single term; the explanation of an intimate connection between uniform integrability and tightness in Section 6; the size bias perspective on the Chihara-Leipnik example in Section 7 (in particular the Choquet simplex result therein); the size bias perspective on Skorohod embedding in Section 8; and the treatment of infinite divisibility in Section 9, at least the argument based on (60), size biasing a sum by size biasing a single summand.
Another survey of size bias, with a different focus, is [22].

Size bias basics
2.1. Bias in general. Let $h$ be a nonnegative function, and $X$ be a random variable taking values in the domain of $h$, with $E\,h(X) \in (0, \infty)$. For such $X$ and $h$, we say $X^h$ has the $h$-biased $X$ distribution if and only if the distribution of $X^h$, relative to the distribution of $X$, has Radon-Nikodym derivative given by

(1) $\frac{P(X^h \in dx)}{P(X \in dx)} = \frac{h(x)}{E\,h(X)}$.

The support of (the distribution of) $X^h$ is then a subset of the support of $X$, possibly a proper subset due to the set where $h = 0$:

(2) $\mathrm{supp}(X^h) = (\mathrm{supp}(X) \setminus h^{-1}(0))^{cl}$,

where $A^{cl}$ denotes the closure of $A$. A nice example which shows this visually is presented in Figure 2.4.1.
1 An early draft of the present paper, with the title 'Size biasing, when is the increment independent?', has been circulated since 1998, and was cited in [57]; an update, 'Size bias, sampling, the waiting time paradox, and infinite divisibility: when is the increment independent?' was cited in [55,56,61]. Both of these drafts are superseded by this paper.
The class of exponential functions, $h(x) = e^{\beta x}$ for various choices of $\beta \in (-\infty, \infty)$, is very important. This class is central to exponential families and large deviation theory, but no single value $\beta$ plays a special role. The family of power functions $h(x) = x^\beta$ for $\beta > 0$ might be viewed as runner up, behind the family of exponential functions, but here the choice $\beta = 1$ is truly special. We believe that $h(x) = x$ for $x \ge 0$ is the most important example of bias.

Size bias in particular.
When $h$ is the function $h(x) = x$ with domain $[0, \infty)$, the $h$-bias above is called size bias. Thus, one can size bias the distribution of any nonnegative random variable $X$ for which $a := E X \in (0, \infty)$. Instead of $X^h$ one writes $X^*$ or $X^s$ for a random variable with the size biased distribution of $X$. The characterization (1) reduces to

(3) $\frac{P(X^* \in dx)}{P(X \in dx)} = \frac{x}{a}$.

For the common special cases, where $X$ is discrete with probability mass function $f$, or where $X$ is absolutely continuous with density $f$, the formula

(4) $f_{X^*}(x) = \frac{x\, f(x)}{a}$

completely specifies the size biased distribution.
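The Bernoulli family, in which every Bernoulli($p$) random variable with $p \in (0, 1]$ has the constant 1 as its size biased version (see (5)), shows that the size bias transformation is not one to one.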
Even in the discrete and absolutely continuous cases, where the elementary identity (4) applies, the characterization of size bias via (7) is very handy for manipulations.
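As a concrete illustration of the discrete case of (4), here is a minimal Python sketch. The helper name `size_bias` and the example laws (a fair die and two Bernoulli variables) are our own illustrative choices, not taken from the references; exact rational arithmetic is used so that the identities hold without rounding.

```python
from fractions import Fraction


def size_bias(pmf):
    """Return the size biased pmf x * f(x) / E[X] of a nonnegative discrete law.

    `pmf` maps values x >= 0 to probabilities; the mean must be strictly positive.
    """
    mean = sum(x * p for x, p in pmf.items())
    if mean <= 0:
        raise ValueError("size bias requires a strictly positive mean")
    # Values with x = 0 receive weight 0 and drop out of the support.
    return {x: x * p / mean for x, p in pmf.items() if x * p != 0}


# A fair die: larger faces become proportionally more likely.
die = {k: Fraction(1, 6) for k in range(1, 7)}
biased_die = size_bias(die)

# Two different Bernoulli laws with the same size biased law, the constant 1.
bern_third = {0: Fraction(2, 3), 1: Fraction(1, 3)}
bern_half = {0: Fraction(1, 2), 1: Fraction(1, 2)}
```

Here `biased_die[k]` equals k/21, and both Bernoulli laws map to the point mass at 1, which is the many to one behaviour noted above.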

2.2.2.
Unbounded functions, and moments. Recall that for a real valued random variable, "E Y ∈ [−∞, ∞] exists" means that it is not the case that both the positive and the negative parts of Y have infinite expectation. We extend slightly the usual statement that, if E |Xg(X)| < ∞, then E g(X * ) = E (Xg(X))/E X.
Proof. In outline, the proof is: consider separately the positive and negative parts of g; for each of these, apply (7) to truncations, and apply monotone convergence.
In detail: when $g(x) \ge 0$, by applying (7) to the truncations $g_n(x) = \min(g(x), n)$, and taking limits, we conclude that $E\,g(X^*) = E(X g(X))/E X$ holds, including the case where both sides are infinite. Write $y^+$ and $y^-$ for the positive and negative parts of $y$. Then the functions $g^+$ and $g^-$ given by $g^+(x) = (g(x))^+$ and $g^-(x) = (g(x))^-$ are nonnegative. Note that on the domain $[0, \infty)$, $(x g(x))^+ = x g^+(x)$ and $(x g(x))^- = x g^-(x)$. Under the hypothesis that $E(X g(X)) \in [-\infty, \infty]$ exists, at least one of $h = g^+$ and $h = g^-$ has $E(X h(X)) < \infty$, and hence $E\,g(X^*) = E\,g^+(X^*) - E\,g^-(X^*)$ exists and equals $E(X g(X))/E X$.

In particular, taking $g(x) = x^n$ in (9), we have

(10) $E (X^*)^n = \frac{E X^{n+1}}{E X}$.

Apart from the extra scaling by $1/E X$, (10) says that the sequence of moments of $X^*$ is the sequence of moments of $X$, but shifted by one. Hence one way to recognize size biasing is through the shift of the moment sequence; this plays a role in two interesting examples, (13) and (40).

Stochastic monotonicity.
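It is easy to see that, in general, $X^*$ lies above $X$ in distribution, i.e., $P(X^* > t) \ge P(X > t)$ for all $t$. In detail: letting $g(x) = 1(x > t)$ in (7) for some fixed $t$,

$P(X^* > t) = \frac{E(X\, 1(X > t))}{E X} \ge \frac{E X \; E\, 1(X > t)}{E X} = P(X > t)$,

where the inequality above is the special case $f(x) = x$, $g(x) = 1(x > t)$ of the covariance inequality $E(f(X) g(X)) \ge E f(X)\, E g(X)$, valid for any random variable and any two increasing functions $f, g$.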
The notation used above, X = d Y , is often written L(X) = L(Y ), to say that random variables X and Y have the same law, or distribution. The simpler notation X = Y would imply a coupling, i.e., that X and Y are defined on the same probability space, with X(ω) = Y (ω) for all outcomes ω.
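The moment shift (10) and the stochastic ordering of $X^*$ over $X$ discussed above can be checked exactly on a small discrete law. The sketch below is only an illustration under our own assumptions: the three-point law and the helper names `size_bias`, `moment` and `tail` are hypothetical choices made for this example.

```python
from fractions import Fraction


def size_bias(pmf):
    """Size biased pmf x * f(x) / E[X] of a nonnegative discrete law."""
    mean = sum(x * p for x, p in pmf.items())
    return {x: x * p / mean for x, p in pmf.items() if x * p != 0}


def moment(pmf, n):
    """n-th moment E[X^n] of a discrete law."""
    return sum(x**n * p for x, p in pmf.items())


def tail(pmf, t):
    """Tail probability P(X > t)."""
    return sum(p for x, p in pmf.items() if x > t)


# An arbitrary three-point law with an atom at zero (illustrative choice).
pmf = {0: Fraction(1, 4), 1: Fraction(1, 4), 3: Fraction(1, 2)}
star = size_bias(pmf)

# Moment shift: E[(X*)^n] = E[X^(n+1)] / E[X] for n = 0, 1, ..., 5.
shift_ok = all(
    moment(star, n) == moment(pmf, n + 1) / moment(pmf, 1) for n in range(6)
)

# Stochastic ordering: P(X* > t) >= P(X > t) at a few test points t.
dominates = all(tail(star, t) >= tail(pmf, t) for t in [-1, 0, 0.5, 1, 2, 3])
```

For this law $E X = 7/4$, so `star` puts mass 1/7 at 1 and 6/7 at 3; the atom at zero disappears, as it must.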
It is also true that size bias respects convergence in distribution, provided one is careful to make the additional hypothesis that the means converge to the mean of the limit random variable, which is in this context equivalent to uniform integrability.
Theorem 2.2. Suppose that $X, X_1, X_2, \ldots$ are nonnegative random variables with $a := E X \in (0, \infty)$, $a_n := E X_n \in (0, \infty)$, that $X_n \Rightarrow X$, and that $a_n \to a$. Then $X_n^* \Rightarrow X^*$.

Proof. The desired convergence in distribution is implied by the condition that for all bounded continuous functions with compact support, $h : \mathbb{R} \to \mathbb{R}$, we have $E\,h(X_n^*) \to E\,h(X^*)$. The function $g$ given by $g(x) = x\,h(x)$ is bounded and continuous. Since $g$ is bounded, (7) applies, and since $g$ is continuous, the hypothesized distributional convergence implies $E\,g(X_n) \to E\,g(X)$. Using (7) with $h$ in the role of $g$, we have

$E\,h(X_n^*) = \frac{E\,g(X_n)}{a_n} \to \frac{E\,g(X)}{a} = E\,h(X^*)$.

The necessity of the hypothesis that $E X > 0$, in Theorem 2.2, is shown by the example with $X_n$ distributed as Bernoulli($1/n$), so that $X_n^* \Rightarrow 1$, and $X_n \Rightarrow X = 0$, but the limit random variable $X$ cannot be size biased.
The converse of Theorem 2.2 is false, since the correspondence $L(X) \to L(X^*)$ is many to one. In detail, take any $A, B$ with $A \ne_d B$ and $A^* =_d B^*$; then the sequence $X_1, X_2, X_3, X_4, \ldots = A, B, A, B, \ldots$, together with $X = A$, has $X_n^* \Rightarrow X^*$ but not $X_n \Rightarrow X$.
Suppose that $I \subset \mathbb{R}$, that $h$ is a probability measure on $I$, that for each $b \in I$, $\mu_b$ is a distribution for a nonnegative random variable $X_b$, with $m(b) := E X_b \in (0, \infty)$, and that $b \mapsto \mu_b$ is measurable. Note, we have assumed that for every $b$, $m(b) \in (0, \infty)$ in order that, for every $b$, the size biased distribution for $X_b^*$ be defined. We say that (the distribution of) $X$ is the mixture of (the distributions of) $X_b$, governed by $h$, if for all bounded measurable $g$,

$E\,g(X) = \int_I E\,g(X_b)\, dh(b)$.

Of course, for such a mixture, $E X = \int m(b)\, dh(b) \in (0, \infty]$, but since we are interested in size bias, we make the additional assumption that $a := E X < \infty$.

Lemma 2.3. Under the setup of the previous paragraph, with $a = \int m(b)\, dh(b) \in (0, \infty)$, the distribution of $X^*$ is a mixture of the distributions of the $X_b^*$. The measure $h'$ governing this mixture is defined in terms of the original governor $h$ via its Radon-Nikodym derivative, $dh'(b)/dh(b) = m(b)/a$. In particular, if $m(b)$ is constant, then $h' = h$, i.e., the measure governing $X^*$ as a mixture of the $X_b^*$ is equal to the measure governing $X$ as a mixture of the $X_b$.
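Proof. For bounded measurable $g$,

$E\,g(X^*) = \frac{1}{a} E(X g(X)) = \frac{1}{a} \int E(X_b\, g(X_b))\, dh(b) = \int E\,g(X_b^*)\, \frac{m(b)}{a}\, dh(b)$,

which exhibits the distribution of $X^*$ as a mixture of the distributions of the $X_b^*$, governed by the measure with density $m(b)/a$ relative to $h$.

In a different direction, the following result from [34] can be useful for constructing size bias couplings for continuous random variables that are not represented as sums, though it may also be noted that Lemma 2.4 implies (21) for sums of indicator variables, see [14, Lemma 2.6 ff].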
Lemma 2.4. Let X = Pr(A|F) where F is some σ-algebra and A is some event with 0 < Pr(A) < 1. Then X * has the distribution of X conditioned on A.
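Proof. For any bounded measurable $g$, since $X$ is $\mathcal{F}$-measurable and $E X = \Pr(A)$, we have

$E\,g(X^*) = \frac{E(X g(X))}{E X} = \frac{E(\Pr(A|\mathcal{F})\, g(X))}{\Pr(A)} = \frac{E(1_A\, g(X))}{\Pr(A)} = E(g(X) \mid A)$.

2.2.6. Many to one, one to one. We describe the preimage, under size biasing, of a random variable $Z$. Note first that if $Z =_d X^*$, then for any mixture $M = a\delta_0 + (1 - a) L(X)$ with $0 \le a < 1$, a random variable $Y$ with $L(Y) = M$ is also a preimage. We claim that changing the amount of point-mass at 0 is the only source of non-uniqueness.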
and E (1/Z) < ∞, and then there is a unique law for Y > 0 such that any X having

Proof. Let $Z =_d X^*$ for some $X$; this implies $X \ge 0$ and $0 < E X < \infty$. Let $b := P(X = 0)$, so clearly $b \in [0, 1)$. Let $Y$ have the distribution of $X$ conditioned on $X > 0$, so $Y > 0$, $Z =_d Y^*$, and $L(X) = b\delta_0 + (1 - b) L(Y)$. With $c := E Y \in (0, \infty)$, we have, as in (3), that the distributions $\nu$ of $Z$ and $\mu$ of $Y$, as measures on $(0, \infty)$, are mutually absolutely continuous, with Radon-Nikodym derivative $\nu(dy)/\mu(dy) = y/c$. This shows the uniqueness of the law for $Y$; that $E(1/Z) < \infty$ follows from the explicit calculation $E(1/Z) = \int (1/y)(y/c)\, \mu(dy) = 1/c$. Conversely, if $Z > 0$ with probability measure $\nu(dz)$ satisfies $0 < E(1/Z) < \infty$, then with $1/c = E(1/Z)$, the law $\mu$ on $(0, \infty)$ with $\mu(dy)/\nu(dy) = c/y$, as the distribution for $Y$, yields $Z =_d Y^*$.
A paraphrase of Lemma 2.5 is that size bias is a bijection, between equivalence classes of distributions for nonnegative random variables with strictly positive finite mean, modulo varying the size of the point mass at zero, and distributions for strictly positive random variables having finite minus first moment.

2.3.
To bias a process by one coordinate. The following is taken from [36]. Readers who dislike technicalities might prefer to jump directly to (18), and then come back only if they feel uncomfortable that our proof of (22) doesn't involve any limits! Suppose that $\mathbf{X} = (X_1, X_2, \ldots) \in [0, \infty)^\infty$ has joint law $\mu$, and for a particular choice of $i$, $a_i := E X_i \in (0, \infty)$. To bias by $X_i$ means, analogous to (3), to switch to the joint law $\mu^{(i)}$ on $[0, \infty)^\infty$ with Radon-Nikodym derivative

(14) $\frac{d\mu^{(i)}}{d\mu}(\mathbf{x}) = \frac{x_i}{a_i}$.

We write $\mathbf{X}^{(i)} = (X^{(i)}_1, X^{(i)}_2, \ldots)$ for a process having this joint distribution $\mu^{(i)}$. Equivalent to (14) is the following:

(15) for all bounded measurable $g$, $E\,g(\mathbf{X}^{(i)}) = \frac{1}{a_i} E(X_i\, g(\mathbf{X}))$,

which looks very much like (7), except that now we have $g : [0, \infty)^\infty \to \mathbb{R}$. Note that given a bounded measurable $h : [0, \infty) \to \mathbb{R}$, applying (15) to the special case $g(\mathbf{x}) := h(x_i)$ shows that our notion of process bias by one coordinate, restricted to viewing that coordinate, agrees with the original notion of size bias, i.e., $X^{(i)}_i =_d X_i^*$.

The preceding paragraph applies regardless of whether or not the joint law $\mu$ of $(X_1, X_2, \ldots)$ involves dependence. In case the original coordinates were independent, biasing by the $i$th coordinate preserves the property of mutual independence. Some might consider that statement to be so obvious as to not require a proof; nevertheless, we give a careful statement and proof, as Lemma 2.6.

Lemma 2.6. Fix a particular value $i$. Assume that $X_1, X_2, \ldots$ are mutually independent, nonnegative, and that $0 < E X_i < \infty$. For $j \ne i$ let $Y_j =_d X_j$, let $Y_i =_d X_i^*$, and let $Y_1, Y_2, \ldots$ be mutually independent. Then $\mathbf{Y} = (Y_1, Y_2, \ldots)$ is distributed according to the law $\mu^{(i)}$ for $\mathbf{X}^{(i)}$ given by (14), i.e., $\mathbf{X}^{(i)} =_d \mathbf{Y}$.

Proof. First we check that the marginals match, i.e., that for each $j$, $X^{(i)}_j =_d Y_j$. We already noted that this is so, for $j = i$, as a consequence of (15), even without the hypothesis of mutual independence.
For $j \ne i$, and a bounded measurable $h : [0, \infty) \to \mathbb{R}$, applying (15) to the special case $g(\mathbf{x}) := h(x_j)$ yields the relation $E\,h(X^{(i)}_j) = \frac{1}{a_i} E(X_i\, h(X_j))$. Using the independence of $X_i$ and $X_j$, we get $E\,h(X^{(i)}_j) = \frac{1}{a_i} E X_i \; E\,h(X_j) = E\,h(X_j) = E\,h(Y_j)$.

Next we show that $\mathbf{X}^{(i)}$ and $\mathbf{Y}$ have the same joint distribution, either by showing that $\mathbf{X}^{(i)}$ has independent coordinates, or by checking that for all measurable $C \subset [0, \infty)^\infty$, $P(\mathbf{X}^{(i)} \in C) = P(\mathbf{Y} \in C)$, first by checking finite-dimensional cylinder sets, then applying the $\pi$-$\lambda$ theorem; either route seems to require the same work. Without loss of generality, the cylinder set $C$ includes a restriction on the $i$th coordinate, i.e., it has the form $C = (X_i \in B_i) \cap \bigcap_{j \in J} (X_j \in B_j)$, where $i \notin J$. Write $g_1(\mathbf{x}) = 1(x_i \in B_i)$ and $g_2(\mathbf{x}) = 1(x_j \in B_j \text{ for } j \in J)$. With $g = g_1 g_2$ in (15), the calculation that $E\,g(\mathbf{X}^{(i)}) = E\,g(\mathbf{Y})$ is a simple extension of the calculation for the special case |C| = 1, given in the first paragraph of this proof.
Another technical issue involves the value infinity. It would have been possible to present the basic discussion of size bias, in particular (3) and (7), in terms of a random element Y with values in [0, ∞]. But since 0 < E Y < ∞ implies P(Y = ∞) = 0, it is of course possible, and simpler, to deal with Y taking values in [0, ∞), and this is what everyone does. However, in dealing with infinite sums of finite nonnegative random variables, one cannot simply declare that the space of values for the sum be taken as [0, ∞), even if one knows that the sum is finite with probability one.
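Our goal is to deal with the distribution of random variables $Y = h(\mathbf{X})$, such as $Y = X_1 + X_2 + \cdots$,^4 and to specify the distribution of $Y^{(i)}$, distributed as $Y$ with $\mu$ changed to $\mu^{(i)}$. Hence we consider measurable $h$ on $[0, \infty)^\infty$ with values in $[0, \infty]$, so that for bounded measurable $f$ the composition $g = f \circ h$ is bounded and measurable, and (15) applies. The distribution of $Y^{(i)}$ is then specified by

(16) for bounded measurable $f$: $E\,f(Y^{(i)}) = \frac{1}{a_i} E(X_i\, f(h(\mathbf{X})))$.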

2.4.
To size bias a sum. Consider a finite sum $S = X_1 + \cdots + X_n$, $n \ge 1$, or an infinite sum $S = X_1 + X_2 + \cdots$, with $X_i \ge 0$ and $a_i := E X_i > 0$, and $a = E S < \infty$. After biasing by $X_i$, we have a sum^5 $S^{(i)} = \sum_k X^{(i)}_k$, so that, as a special case of (16), for bounded nonnegative $g$,

$E\,g(S^{(i)}) = \frac{1}{a_i} E(X_i\, g(S))$,

and then with (7) to justify the first line, and elementary algebra (here using $g \ge 0$) to justify the second line,

(17) $E\,g(S^*) = \frac{1}{a} E(S\, g(S))$
$\qquad = \sum_i \frac{a_i}{a} \cdot \frac{1}{a_i} E(X_i\, g(S)) = \sum_i \frac{a_i}{a}\, E\,g(S^{(i)})$.

Suppose furthermore that the summands $X_1, X_2, \ldots$ are independent. If size biased random variables $X_1^*, X_2^*, \ldots$ are realized on the same probability space, with $(X_1, X_1^*), (X_2, X_2^*), \ldots$ mutually independent, then for each $i$, by Lemma 2.6, $S^{(i)} =_d S - X_i + X_i^*$, and hence

(18) $E\,g(S^*) = \sum_i \frac{a_i}{a}\, E\,g(S - X_i + X_i^*)$.

4 Thanks to only having nonnegative numbers for the coordinates of the domain, there are no convergence issues in dealing with the sum $X_1 + X_2 + \cdots \in [0, \infty]$.

5 Warning: our notation here conflicts with standard expositions of Stein's method, such as [23], [13, Theorem B.1], and [38], where notation $V_i$ refers to the sum, with $i$th term omitted, size biased by the $i$th term.
The result above says precisely that $S^*$ can be represented by the mixture of the distributions of $S + X_i^* - X_i$ with mixture probabilities $a_i/a$. With a random $I$ having distribution defined by

(19) $P(I = i) = a_i/a$,

and all of $I, (X_1, X_1^*), (X_2, X_2^*), \ldots$ mutually independent, the mixture formula (18) can be restated as

(20) $S^* =_d S - X_I + X_I^*$.

In the preceding coupling, for each $i$, marginal distributions of $X_i, X_i^*$ are specified, but the joint distribution of $(X_i, X_i^*)$ is otherwise arbitrary. Allowing such dependence is important for use with Stein's method; see Section 5. Of course, mutual independence for $I, X_1, X_1^*, X_2, X_2^*, \ldots$ is one valid choice. For each case, $S = X_1 + \cdots + X_n$ or $S = X_1 + X_2 + \cdots$, (20) can be written out with notation to emphasize that a single term has been biased:

(21) $(X_1 + \cdots + X_n)^* =_d X_1 + \cdots + X_{I-1} + X_I^* + X_{I+1} + \cdots + X_n$,

(22) $(X_1 + X_2 + \cdots)^* =_d X_1 + \cdots + X_{I-1} + X_I^* + X_{I+1} + \cdots$.

It is a natural abuse of notation to view (21) as a special case of (22). The reason that this is abuse, rather than the special case $X_{n+1} = X_{n+2} = \cdots = 0$, is that the identically zero random variable $X$ cannot be size biased. Specifically, $X = 0$ doesn't satisfy the conditions of the definition in (3), and size biasing this $X$, if allowed, would abrogate Lemma 2.5. Nonetheless, it is customary to follow the notational abuse that if $X = 0$ then $X^* =_d X = 0$, so that one can view (21) as the special case of (22), and later, write formulas such as (29) for a sum with infinitely many terms, without writing out a second instance for a sum with finitely many terms.
In contrast to a sum of independent nonnegative summands, which is size biased by biasing a single term, a product $W = X_1 X_2 \cdots X_n$, of independent, nonnegative random variables $X_1, \ldots, X_n$, each with finite, strictly positive mean, is size biased by biasing every factor: taking $X_1^*, \ldots, X_n^*$ independent, one has

(23) $W^* =_d X_1^* X_2^* \cdots X_n^*$.

Here, we leave the proof as an exercise; this result comes from [50]. For the case of dependent summands, the decomposition (17) is useful; in contrast, for dependent factors, we don't know of any useful relation.
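The contrast between sums and products can be verified exactly for two small independent discrete laws. The following Python sketch is an illustration under our own assumptions: the two laws `x1`, `x2` and the helper names `law_of`, `mixture`, `mean` and `size_bias` are hypothetical, and the summands in the mixture use the independent coupling, in which the biased term is drawn independently of the other term.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product


def size_bias(pmf):
    """Size biased pmf x * f(x) / E[X] of a nonnegative discrete law."""
    m = sum(x * p for x, p in pmf.items())
    return {x: x * p / m for x, p in pmf.items() if x * p != 0}


def mean(pmf):
    return sum(x * p for x, p in pmf.items())


def law_of(func, *pmfs):
    """Exact law of func(X_1, ..., X_k) for independent X_j with the given pmfs."""
    out = defaultdict(Fraction)
    for combo in product(*(pmf.items() for pmf in pmfs)):
        prob = Fraction(1)
        for _, p in combo:
            prob *= p
        out[func(*(x for x, _ in combo))] += prob
    return dict(out)


def mixture(weighted_pmfs):
    """Law obtained by choosing pmf number i with probability weight_i."""
    out = defaultdict(Fraction)
    for weight, pmf in weighted_pmfs:
        for x, p in pmf.items():
            out[x] += weight * p
    return dict(out)


def add(u, v):
    return u + v


def mul(u, v):
    return u * v


# Two independent, non identically distributed summands (illustrative choice).
x1 = {0: Fraction(1, 2), 2: Fraction(1, 2)}
x2 = {1: Fraction(1, 3), 4: Fraction(2, 3)}
a1, a2 = mean(x1), mean(x2)

# Sum: bias a single term, chosen with probability a_i / a.
s_star = size_bias(law_of(add, x1, x2))
s_mix = mixture([
    (a1 / (a1 + a2), law_of(add, size_bias(x1), x2)),
    (a2 / (a1 + a2), law_of(add, x1, size_bias(x2))),
])

# Product: bias every factor.
w_star = size_bias(law_of(mul, x1, x2))
w_all = law_of(mul, size_bias(x1), size_bias(x2))
```

Both comparisons, `s_star == s_mix` and `w_star == w_all`, hold as exact equalities of rational-valued dictionaries.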
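An interesting example of the use of (22) is the sum $S = \sum_{i \ge 1} X_i$ with $X_i = 2B_i/3^i$, where $B_1, B_2, \ldots$ are independent Bernoulli(1/2) random variables. The cumulative distribution of this sum $S$ is known as the Cantor function; the distribution of $S$ is, by all reasonable interpretations, the uniform distribution on the Cantor middle thirds set. By (19), the random index $I$ has the geometric distribution $P(I = i) = 2/3^i$ for $i = 1, 2, \ldots$, and by (5), the size biased version of $X_i$ is the constant $2/3^i$, so that (20) reads $S^* =_d S - X_I + 2/3^I$. A closely related example, using the same $B_i$, is the standard uniform (0,1) random variable $U = \sum_{i \ge 1} B_i/2^i$, for which $P(I = i) = 1/2^i$ and (20) simplifies to

(24) $U^* =_d U + (1 - B_I)/2^I$.

Of course, it is easy to calculate that the density of $U^*$ is $2x$ on (0,1), using (4): multiply the density of the uniform by $x$ and divide by $E U = 1/2$. But perhaps the following exercise is not easy.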
Exercise. Prove, without using size bias, that the sum on the right side of (24) has density $f(x) = 2x$ on (0,1).

Figure. Cumulative distribution functions for the uniform distribution on (0,1), the uniform distribution on the Cantor set, and the size biased versions of these. Image produced using Math-Studio [58].
For the case with a finite number of summands, where the summands are not only independent but also identically distributed, the recipe (21) simplifies. In this case it does not matter which summand is biased, as all the distributions in the mixture are the same; hence we may replace the random $I$ with the fixed $i = 1$, yielding

(25) $(X_1 + \cdots + X_n)^* =_d X_1^* + X_2 + \cdots + X_n$.

Here are some elementary consequences of (25). Recall (5), that for $p \in (0, 1]$, a Bernoulli random variable with mean $p$, size biased, is the constant 1. Summing $n$ independent copies gives us random variables $S_n$ whose distribution is Binomial($n, p$). Hence using (25),

(26) $S_n^* =_d 1 + S_{n-1}$.

Finally, taking $\lambda \in (0, \infty)$ fixed, $Z$ to be Poisson($\lambda$), and $X_n$ to be Binomial($n, \lambda/n$), the Poisson limit for the Binomial, together with Theorem 2.2 and (26), implies

(27) $Z^* =_d Z + 1$.

Of course, this equality may also be verified directly using (4), with mass function $f(k) := P(Z = k) = e^{-\lambda} \lambda^k / k!$, $k = 0, 1, 2, \ldots$, but the beauty of the argument via (26) is that it is purely conceptual. Perhaps less obvious is that (27) implies that there exists $\lambda \in (0, \infty)$ for which $Z$ is Poisson($\lambda$); see Section 5.
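The direct verification via (4) mentioned above is easy to carry out numerically. The sketch below checks the Binomial identity (26) in exact rational arithmetic and the Poisson identity (27) in floating point; the parameter values ($n = 5$, $p = 1/3$, $\lambda = 2.5$) and the helper names are our own illustrative assumptions.

```python
from fractions import Fraction
from math import comb, exp, factorial, isclose


def size_bias(pmf):
    """Size biased pmf x * f(x) / E[X] of a nonnegative discrete law."""
    m = sum(x * p for x, p in pmf.items())
    return {x: x * p / m for x, p in pmf.items() if x * p != 0}


def binomial_pmf(n, p):
    """Exact Binomial(n, p) mass function for a rational p."""
    return {k: comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)}


def shift(pmf, c):
    """Law of X + c."""
    return {x + c: q for x, q in pmf.items()}


# (26): the size biased Binomial(n, p) is 1 + Binomial(n - 1, p), exactly.
n, p = 5, Fraction(1, 3)
binomial_ok = size_bias(binomial_pmf(n, p)) == shift(binomial_pmf(n - 1, p), 1)


def poisson_mass(lam, k):
    return exp(-lam) * lam**k / factorial(k)


# (27): k * P(Z = k) / lam equals P(Z = k - 1), i.e. Z* has the law of Z + 1.
lam = 2.5
poisson_ok = all(
    isclose(k * poisson_mass(lam, k) / lam, poisson_mass(lam, k - 1), rel_tol=1e-12)
    for k in range(1, 31)
)
```

The Binomial check is an exact equality of dictionaries; the Poisson check is restricted to $k \le 30$ and a relative tolerance because it uses floating point.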

2.4.1.
Example: compound Poisson. Given the distribution for a discrete positive random variable $Y$ with finite mean, and $0 < a < \infty$, we will show how to construct a distribution for $S$ such that

(28) $S^* =_d S + Y$, with $S$, $Y$ independent and $E S = a$.

To specify the distribution of $Y$, suppose that $p_i = P(Y = y_i)$, for distinct constants $y_1, y_2, \ldots > 0$, with $\sum_i p_i = 1$. Then

(29) $S = \sum_i X_i$, where $X_i = y_i Z_i$, with $Z_1, Z_2, \ldots$ independent, $Z_i$ Poisson($\lambda_i$), and $\lambda_i = a p_i / y_i$,

gives a solution to (28), using only formula (27) for size biasing a single Poisson distributed random variable, the scaling property (12), formula (22) for size biasing a sum of independent, non identically distributed summands, and the trivial calculation that $a_i := E X_i = \lambda_i y_i$. First, using (27), $Z_i^* =_d Z_i + 1$. Second, using the scaling property (12), $X_i^* =_d X_i + y_i$. In the recipe (22), there is a random index $I$, independent of the $X_1, X_2, \ldots$, with

(30) $P(I = i) = E X_i / a = \lambda_i y_i / a = p_i$,

and we can take the coupling in which $X_i^* = X_i + y_i$ for each $i$. This yields $S^* =_d S + y_I$, with $S, I$ independent. Since the $y_i$ are distinct, for each $i$, as events, $(y_I = y_i) = (I = i)$, hence the distribution of $y_I$ is the given distribution for $Y$. To summarize, we were given the distribution for $Y$, and we constructed a distribution for $S$ so that (28) holds. We will revisit the relation $S^* =_d S + Y$ with $S, Y$ independent in Section 9; the preceding is then seen as an explicit example of (58), with the distribution of $Y$ specified in advance. In the standard literature, the random variable $S$ in (29) is said to have a compound Poisson distribution, given the further restriction that $\sum_i \lambda_i < \infty$. Compound Poisson with finite mean requires both $\sum_i \lambda_i < \infty$ and $\sum_i \lambda_i y_i < \infty$; in contrast, we require only the latter.
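Recall, if $Z$ is Poisson($\lambda$) then its probability generating function is $G_Z(s) := E\, s^Z = \exp(\lambda(s - 1))$. Substituting $s = e^\beta$, the moment generating function of $Z$ is $M_Z(\beta) := E\, e^{\beta Z} = \exp(\lambda(e^\beta - 1))$. Hence in (29), the moment generating function of $X_i$ is $M_{X_i}(\beta) = \exp(\lambda_i(e^{\beta y_i} - 1))$ and the moment generating function of $S$ is

$M_S(\beta) = \prod_i M_{X_i}(\beta) = \exp\Big(\sum_i \lambda_i (e^{\beta y_i} - 1)\Big) = \exp\Big(a\, E\, \frac{e^{\beta y_I} - 1}{y_I}\Big)$,

with the distribution of $I$ given by (30). Likewise, the characteristic function of $S$, $\varphi_S(u) := E\, e^{iuS}$, is given by

$\varphi_S(u) = \exp\Big(\sum_i \lambda_i (e^{i u y_i} - 1)\Big) = \exp\Big(a\, E\, \frac{e^{i u y_I} - 1}{y_I}\Big)$.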
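The relation $S^* =_d S + Y$ for the compound Poisson example can be checked numerically when the jump sizes are small positive integers. In the sketch below, the values $a = 2$ and the law of $Y$ on $\{1, 2, 3\}$ are hypothetical choices, the rates follow $\lambda_i = a p_i / y_i$ as forced by (30), and the mass function of $S$ is built by direct convolution rather than by any recursion, so that the final comparison is not circular. Probabilities are computed on the grid $0, \ldots, 40$, where they are exact up to floating point rounding.

```python
from math import exp, factorial, isclose


def scaled_poisson_pmf(lam, y, n_max):
    """Mass function of X = y * Z on the grid 0..n_max, Z Poisson(lam), y a positive integer."""
    pmf = [0.0] * (n_max + 1)
    for k in range(n_max // y + 1):
        pmf[k * y] = exp(-lam) * lam**k / factorial(k)
    return pmf


def convolve(f, g):
    """Mass function of a sum of independent variables, kept on the same grid."""
    n_max = len(f) - 1
    return [sum(f[j] * g[n - j] for j in range(n + 1)) for n in range(n_max + 1)]


a = 2.0                             # target mean E[S] (illustrative value)
y_law = {1: 0.5, 2: 0.3, 3: 0.2}    # P(Y = y_i) = p_i (illustrative values)
n_max = 40

# S = sum_i y_i * Z_i with Z_i Poisson(a * p_i / y_i), independent.
s_pmf = [1.0] + [0.0] * n_max       # point mass at 0, the empty sum
for y, p in y_law.items():
    s_pmf = convolve(s_pmf, scaled_poisson_pmf(a * p / y, y, n_max))

# Left side: size biased mass function n * P(S = n) / a.
s_star = [n * s_pmf[n] / a for n in range(n_max + 1)]

# Right side: mass function of S + Y with S and Y independent.
s_plus_y = [
    sum(p * s_pmf[n - y] for y, p in y_law.items() if n - y >= 0)
    for n in range(n_max + 1)
]

matches = all(
    isclose(u, v, rel_tol=1e-9, abs_tol=1e-15) for u, v in zip(s_star, s_plus_y)
)
```

The flag `matches` confirms agreement of the two mass functions at every grid point.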

Waiting time paradox: the renewal theory connection
We resolve the waiting time paradox from Section 1 in the general context of renewal processes, at the same time providing a conceptual explanation of the identities (21) and (25).
Let the interarrival times in Section 1 be denoted X i so that, starting from 0, arrivals occur at times X 1 , X 1 + X 2 , X 1 + X 2 + X 3 , . . ., and assume only that the X i are i.i.d., strictly positive random variables with finite mean; the paradox presented earlier was for the special case with X i exponentially distributed.
The following argument is heuristic. One way to model the "arbitrary instant t" is to choose a random T uniformly from 0 to l, independent of X 1 , X 2 , . . ., and then take the limit as l → ∞. For large but finite l, conditional on X 1 , X 2 , . . ., apart from possible cutoff at the extreme right 7 the probability of T landing in a given interarrival interval is proportional to its length. In other words, if the interarrival times X i have a distribution dF (x), the distribution of the length of the selected interval is approximately proportional to x dF (x). In the limit, it is precisely correct that the distribution of the length of the selected interval is the distribution of X * .
For the particular case of exponentially distributed interarrival times, the density of $X^*$ is $x e^{-x}$, with mean value 2, and so a right-left symmetry argument gives the answer in (a).

7 Conditional on $T = t$ and $X_1 + \cdots + X_{m-1} < t < X_1 + \cdots + X_{m-1} + X_m$, there are $m$ interarrival intervals, and for $i = 1$ to $m - 1$ interval $i$ is selected with probability proportional to $X_i$, but interval $m$ is selected with probability proportional to $t - (X_1 + \cdots + X_{m-1}) < X_m$.
A conceptual explanation of identity (25) is given by the following heuristic. Group the interarrival intervals into successive blocks of $n$ intervals. By considering the thinned process induced by ignoring arrivals internal to the grouping, the random time $T$ must find itself in a block with total length distributed as $(X_1 + \cdots + X_n)^*$. But regardless of the grouping, the random time $T$ still finds itself in an internal interval whose length is distributed as the size biased distribution of the interarrival times; the lengths of the other intervals in the same block are not affected. Thus the total block length must also be distributed as $X_1^* + X_2 + \cdots + X_n$. Further heuristic argument, that the internal interval $i$ is chosen with probability proportional to the contribution $E X_i$ makes to the total block size, may convince one of the identity (21).
The standard rigorous analysis of the waiting time paradox, for instance in [71], is a bit less direct, based on randomizing the starting point of the arrivals, so that the arrival times form a stationary sequence. Begin by extending $X_1, X_2, \ldots$ to an independent, identically distributed sequence $\ldots, X_{-2}, X_{-1}, X_0, X_1, X_2, \ldots$. Informally, if the arbitrary instant $t$ could be uniform on the whole line (or by adapting the above limiting argument) then $t$ would fall uniformly inside a size biased interarrival interval; relabeling, we call $t$ by the name zero, and the landing interval has length $X_0^*$. Then the prior arrival and next arrival would be at times $-(1 - U) X_0^*$ and $U X_0^*$ respectively, where the uniform $U \in [0, 1]$ is independent of the $X_i$'s. Thus motivated, we define a process by setting arrivals at positive times $U X_0^*,\; U X_0^* + X_1,\; U X_0^* + X_1 + X_2, \ldots$ and at negative times $-(1 - U) X_0^*,\; -((1 - U) X_0^* + X_{-1}),\; -((1 - U) X_0^* + X_{-1} + X_{-2}), \ldots$. It can be proved that this process is stationary, see [71], Theorem 9.1. Our desired waiting time $W_t$ is then equal in distribution to $W_0 = U X_0^*$.
The interval which covers the origin has expected length $E X_0^* = E X_0^2 / E X_0$ (by (10) with $n = 1$), and the ratio of this to $E X_0$ is $E X_0^* / E X_0 = E X_0^2 / (E X_0)^2$. By Cauchy-Schwarz, this ratio is at least 1, and every value in $[1, \infty]$ is feasible. See also (11). Since the mean waiting time is $E W_t = E(U X_0^*) = \frac{1}{2} E X_0^* = E X_0^2 / (2 E X_0)$, the ratio $E W_t / E X_0$ can be any value between 1/2 and infinity, depending on the distribution of $X_0$.
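The dependence of the mean waiting time on the interarrival distribution can be seen in a small numerical experiment. The sketch below follows the heuristic of choosing the instant uniformly over a long stretch of intervals: conditional on the intervals, the instant lands in an interval of length $x$ with probability proportional to $x$ and then waits $x/2$ on average. The function name `mean_wait`, the seed, the sample size and the three interarrival laws (all with mean 1) are our own illustrative assumptions.

```python
import random


def mean_wait(intervals):
    """Expected wait until the next arrival for an instant chosen uniformly
    over the span covered by `intervals` (consecutive interarrival lengths).

    The instant lands in an interval of length x with probability x / total,
    and then waits x / 2 on average, giving sum(x^2) / (2 * total).
    """
    total = sum(intervals)
    return sum(x * x for x in intervals) / (2.0 * total)


rng = random.Random(12345)   # fixed seed, so the experiment is reproducible
n = 200_000

constant = [1.0] * n                                   # E X^2 = 1
uniform = [rng.uniform(0.0, 2.0) for _ in range(n)]    # E X^2 = 4/3
exponential = [rng.expovariate(1.0) for _ in range(n)] # E X^2 = 2

waits = {
    "constant": mean_wait(constant),
    "uniform": mean_wait(uniform),
    "exponential": mean_wait(exponential),
}
```

With interarrival mean 1 in all three cases, the values in `waits` should be close to $E X^2 / 2$, that is 1/2, 2/3 and 1 respectively: only the constant case matches analysis (b) from the introduction, and only the exponential case matches analysis (a).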
The exponential case is very special, where "coincidences" effectively hide all the structure involved in size biasing. As suggested by Feller's argument (a) at the start of this paper, but now justified by stationarity, $E W_t = 1$. Furthermore, for the exponential case, where $X_0$ has density $e^{-x}$ for $x > 0$, one gets $X_0^*$ has density $x e^{-x}$ and the two summands $U X_0^*$ and $(1 - U) X_0^*$ are independent, each with the original exponential distribution.^8 Thus the general recipe for cooking up a stationary process, involving $X_0^*$ and $U$ in general, simplifies beyond recognition: the original simple process with arrivals at times $X_1, X_1 + X_2, X_1 + X_2 + X_3, \ldots$ forms half of a stationary process, which is completed by its other half, arrivals at $-X_1', -(X_1' + X_2'), \ldots$, with $X_1, X_2, \ldots, X_1', X_2', \ldots$ all independent and exponentially distributed.

8 Exercise for the reader: prove that if $U X^* =_d X$ when $U$ is independent of $X^*$ and $U$ is distributed uniformly on (0, 1), then $X$ has an exponential distribution, on some scale. Not hard; or, see [54].

Size bias in statistics
We now touch briefly on the topic of inadvertent or unavoidable size bias^9 in statistical sampling by citing two references from a vast literature.^10 We also discuss the deliberate use of size bias, as a sampling tool.

4.1. Inadvertent size bias. In a seminal 1969 paper [29] David Cox identifies, among other topics, length bias in a then-standard process for estimating the mean length of textile fibres: in outline, as he describes it, an assembly of fibres is gripped by a pincer, all ungripped fibres adhering to the gripped ones are carefully removed, and the remaining fibres are measured. Cox points out that since shorter fibres are more likely to be missed by the pincer, the distribution of the sampled lengths is length biased. He proposes some adapted estimators for getting at parameters of the original distribution if the sampling process itself cannot be refined.
Nearer to the present, the 2009 paper [42] considers issues arising in assessing the value of medical screening and the effects of subsequent early treatment on survival time. As discussed in [42], for reasons analogous to waiting-time bias, the durations of preclinical disease states detected by certain screening protocols are subject to length bias. Even though the durations themselves are not observed, longer durations are likely to derive from slower-acting instances of the disease under consideration, and hence are correlated a priori with longer survival times. Therefore improvement in survival time is likely to be overestimated by such studies if suitable adjustments are not made.

4.2.
Deliberate size bias to create something unbiased. Somewhat paradoxically, size biasing can occasionally be used to construct unbiased estimators of quantities that would seem, at first glance, difficult to estimate without bias. The following procedure for unbiased ratio estimation is due to Midzuno [52]; see also Cochran [28]. Suppose that for each individual $i$ in some large population there is a pair of numbers $(x_i, y_i)$, with the value $x_i$ easy to obtain but $y_i$ more difficult. Assume each $x_i \ge 0$, with not all zero. Suppose that it is desired to estimate the ratio $\sum_i y_i / \sum_i x_i$ without bias and without sampling the entire population. Perhaps $x_i$ is how much the $i$th customer was billed by their utility company last month, and $y_i$, say a smaller value than $x_i$, the amount they were supposed to have been billed. Suppose we would like to know just how severe the overbilling error is; that is, we would like to know the 'adjustment factor', the ratio $\sum_i y_i / \sum_i x_i$. Even though $\sum_i x_i$ is known, collecting the paired values for everyone is laborious and expensive, so we would like to be able to use a sample of $m < n$ pairs to make an estimate. It is not hard to verify that, if we select a set $R$ of $m$ indices, with all $\binom{n}{m}$ sets equally likely, then the estimate $\sum_{j \in R} y_j / \sum_{j \in R} x_j$ will be biased.
The following device gets around this difficulty. Draw a random set $R$ of size $m$ by first selecting $i$ with size-biased probability $x_i / \sum_j x_j$. Then draw $m - 1$ indices uniformly from the remaining $n - 1$. Though we are out of the independent framework, the principle of (25) is still at work: size biasing one element has size biased the sum. This is so because we have size biased the one, and then chosen the others from an appropriate conditional distribution. Thus, we have selected a set $r$ of indices with probability proportional to $\sum_{j \in r} x_j$. From this observation it follows that $E\big(\sum_{j \in R} y_j / \sum_{j \in R} x_j\big) = \sum_j y_j / \sum_j x_j$.

9 Or length bias, as it is sometimes called in sampling literature.
Here is Midzuno's procedure in a bit more detail. Let For the variance of the estimator, see [60].
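The unbiasedness of the size biased draw can be confirmed by exhaustive enumeration on a toy population. The sketch below is an illustration only: the billing figures in `x` and `y`, the sample size `m = 2` and the function names are hypothetical, and the set probability uses the fact that a set $r$ arises once for each choice of its first, size biased element, with the remaining $m - 1$ indices uniform among $\binom{n-1}{m-1}$ possibilities.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def midzuno_set_probability(r, x):
    """P(R = r) when one index is drawn with probability x_i / sum(x)
    and the other m - 1 indices are drawn uniformly from the rest."""
    n, m = len(x), len(r)
    return Fraction(sum(x[j] for j in r), sum(x) * comb(n - 1, m - 1))


def uniform_set_probability(r, x):
    """P(R = r) when all sets of size m are equally likely."""
    return Fraction(1, comb(len(x), len(r)))


def expected_ratio(x, y, m, set_probability):
    """Exact expectation of sum(y_j) / sum(x_j) over the random index set R."""
    total = Fraction(0)
    for r in combinations(range(len(x)), m):
        ratio = Fraction(sum(y[j] for j in r), sum(x[j] for j in r))
        total += set_probability(r, x) * ratio
    return total


# Hypothetical population: amounts billed (x) and amounts owed (y).
x = [10, 20, 30, 40, 100]
y = [9, 20, 25, 40, 80]
m = 2

target = Fraction(sum(y), sum(x))
midzuno_mean = expected_ratio(x, y, m, midzuno_set_probability)
uniform_mean = expected_ratio(x, y, m, uniform_set_probability)
```

Here `midzuno_mean` equals `target` exactly, while `uniform_mean` does not, which is the bias of the equally-likely-sets scheme on this toy population.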

Relation to Stein's method and concentration inequalities
Implicit in Chen 1975 [23], with improved constants due to [13], see also [38, Theorem 4.12.12], is the following result from [35], Theorem 1.1, see also [61,Theorem 4.10], which we paraphrase 11 here as Theorem 5.1. Let X be a nonnegative integer valued random variable with λ := E X ∈ (0, ∞); let Z be Poisson with parameter λ. Then for any coupling of X with X * , the total variation distance between the distributions of X and Z satisfies The total variation distance appearing in Theorem 5.1 is defined, for random variables X, Y in general, by d TV (X, Y ) = sup B (P(X ∈ B) − P(Y ∈ B)), with the supremum taken over all Borel sets.
Size biasing also has a connection with Stein's method for obtaining error bounds when approximating distributions by the normal distribution, see [12,11,24,36].
Size bias also plays a role in concentration inequalities, see [33,32,4,14]. The results from [33,4] include: if $X \ge 0$ with $a := E X \in (0, \infty)$ can be coupled to $X^*$ so that $P(X^* \le X + c) = 1$, then
To see how size bias enters, if a coupling satisfies $P(X^* \le X + c) = 1$, then for all $x$, the event $X^* \ge x$ is a subset of the event $X \ge x - c$. Hence for $x > 0$,
$x\,P(X \ge x) \le E(X; X \ge x) = a\,P(X^* \ge x) \le a\,P(X \ge x - c)$,
and dividing by $x$ we get
(33) $P(X \ge x) \le \frac{a}{x}\,P(X \ge x - c)$.
Iterating (33) leads to the sharp upper bounds on $P(X \ge x)$, for each $x \ge a$.
11 The theorem in [35] is stated with the condition that $X$ be a finite sum of indicator random variables. However, an arbitrary nonnegative integer valued $X$ is a sum of indicators, namely $X = \sum_{i \ge 1} 1(X \ge i)$, and the restriction on finite sum can be removed using Theorem 2.2 applied to $X_n := X \wedge n = \sum_{i=1}^{n} 1(X \ge i)$.
In the context of sums of independent random variables each with a bounded range, the concentration bounds based on bounded size bias couplings are stronger than the corresponding Chernoff-Hoeffding bounds, as well as being broader in scope; see [4]. Applications of these bounds to situations involving dependence, such as the number of relatively ordered subsequences of a random permutation, sliding window statistics including the number of m-runs in a sequence of coin tosses, the number of local maxima of a random function on a lattice, the number of urns containing exactly one ball in an urn allocation model, and the volume covered by the union of n balls placed uniformly over a volume n subset of R d , are discussed in [32]. An example showing that the size bias concentration bounds supply a desired uniform integrability, in a situation where the usual Azuma-Hoeffding bounded martingale difference inequality is not adequate, is given in [5].
Also not directly linked to Stein's method or concentration inequalities, but nevertheless worth mentioning: a beautiful treatment of size biasing for branching processes is [51] by Lyons, Pemantle, and Peres.

Size bias, tightness, and uniform integrability
Recall that a collection of random variables $\{Y_\alpha : \alpha \in I\}$, where $I$ is an arbitrary index set, is tight iff for all $\varepsilon > 0$ there exists $L < \infty$ such that
$P(Y_\alpha \notin [-L, L]) < \varepsilon$ for all $\alpha \in I$.
This definition looks quite similar to the definition of uniform integrability, where we say $\{X_\alpha : \alpha \in I\}$ is uniformly integrable, or UI, iff for all $\delta > 0$ there exists $L < \infty$ such that
$E(|X_\alpha|; X_\alpha \notin [-L, L]) < \delta$ for all $\alpha \in I$.
Intuitively, tightness for a family is that uniformly over the family, the probability mass due to large values is arbitrarily small. Similarly, uniform integrability is the condition that, uniformly over the family, the contribution to the expectation due to large values is arbitrarily small. Since size bias relates contribution to the expectation to probability mass, it should be possible to state a relation between uniform integrability and tightness.
We show, in Theorems 6.1 and 6.2, that for random variables, i.e., real valued random elements, there is an intimate connection between tightness and uniform integrability, and that this connection is made via size bias. But we must note, the concept of tightness is much broader than the concept of uniform integrability, in that tightness applies to random elements of metric and topological spaces, whereas uniform integrability is inherently a real valued notion. In more general spaces, to define tightness, the closed intervals [−L, L] are replaced by arbitrary compact sets, and the discussion below relates to such spaces only for metric spaces with the property that balls {x : d(x, y) ≤ L} are compact.
To discuss the connection between size biasing and uniform integrability, it is useful to restate the basic definitions in terms of nonnegative random variables. It is clear from the definition of tightness above that a family of nonnegative random variables $\{Y_\alpha : \alpha \in I\}$ is tight iff for all $\varepsilon > 0$ there exists $L < \infty$ such that
(34) $P(Y_\alpha > L) < \varepsilon$ for all $\alpha \in I$,
and from the definition of UI, that a family of nonnegative random variables $\{X_\alpha : \alpha \in I\}$ is uniformly integrable iff for all $\delta > 0$ there exists $L < \infty$ such that
(35) $E(X_\alpha; X_\alpha > L) < \delta$ for all $\alpha \in I$.
For general random variables, the family $\{G_\alpha : \alpha \in I\}$ is tight [respectively UI] iff $\{|G_\alpha| : \alpha \in I\}$ is tight [respectively UI]. Hence we specialize in the remainder of this section to random variables that are nonnegative.
Care must be taken to distinguish between the additive contribution to expectation, and the relative contribution to expectation. The following example makes this distinction clear. Let
$P(X_n = n) = 1/n^2$, $P(X_n = 0) = 1 - 1/n^2$, $n = 1, 2, \ldots$.
Here, $E X_n = 1/n$, the family $\{X_n\}$ is uniformly integrable, but $1 = P(X_n^* = n)$, so the family $\{X_n^*\}$ is not tight; the additive contribution to the expectation from large values of $X_n$ is small, but the relative contribution is large: one hundred percent! The following two theorems, which exclude this phenomenon, show that tightness and uniform integrability are very closely related.
Theorem 6.1. Assume that for $\alpha \in I$, where $I$ is an arbitrary index set, the random variables $X_\alpha$ satisfy $X_\alpha \ge 0$ and $0 < E X_\alpha < \infty$, and let $Y_\alpha =_d X_\alpha^*$. Then
$\{Y_\alpha : \alpha \in I\}$ is tight implies $\{X_\alpha : \alpha \in I\}$ is UI.
Assume further that the values $E X_\alpha$ are uniformly bounded away from 0, say $c > 0$ and $\forall \alpha$, $c \le E X_\alpha$. Then
$\{Y_\alpha : \alpha \in I\}$ is tight iff $\{X_\alpha : \alpha \in I\}$ is UI.
Proof. First, we show that tightness implies UI. Assume that $\{Y_\alpha : \alpha \in I\}$ is tight, and take $L_0 > 0$ to satisfy (34) with $\varepsilon = 1/2$, so that $P(Y_\alpha > L_0) < 1/2$ for all $\alpha \in I$. Hence, for all $\alpha \in I$, $E(X_\alpha; X_\alpha > L_0) = E X_\alpha \, P(Y_\alpha > L_0) < E X_\alpha / 2$, and therefore
$E X_\alpha = E(X_\alpha; X_\alpha \le L_0) + E(X_\alpha; X_\alpha > L_0) < L_0 + E X_\alpha / 2$,
and hence $E X_\alpha < 2 L_0$. Now given $\delta > 0$ let $L$ satisfy (34) for $\varepsilon = \delta/(2L_0)$. Hence $\forall \alpha \in I$,
$E(X_\alpha; X_\alpha > L) = E X_\alpha \, P(Y_\alpha > L) < 2 L_0 \cdot \delta/(2 L_0) = \delta$.
Second, we show that UI implies tightness, in the presence of means bounded uniformly away from zero. Assume that $\{X_\alpha : \alpha \in I\}$ is UI, and let $\varepsilon > 0$ be given to test tightness in (34). Let $L$ be such that (35) is satisfied with $\delta = \varepsilon c$. Now, using $E X_\alpha \ge c$, for every $\alpha \in I$,
$P(Y_\alpha > L) = E(X_\alpha; X_\alpha > L) / E X_\alpha < \varepsilon c / c = \varepsilon$.
As an alternate to Theorem 6.1, for the sake of having cleaner hypotheses and a cleaner conclusion, we also give the following theorem. Note below that the $X_\alpha$ to be involved in size bias are allowed to have $E X_\alpha = 0$ (it is not a typo) because we will be taking $(X_\alpha + c)^*$ for some $c > 0$.
Theorem 6.2. Assume that for $\alpha \in I$, where $I$ is an arbitrary index set, the random variables $X_\alpha$ satisfy $X_\alpha \ge 0$ and $E X_\alpha < \infty$. Pick any $c \in (0, \infty)$, and for each $\alpha$ let $Y_\alpha = (c + X_\alpha)^*$. Then
$\{X_\alpha : \alpha \in I\}$ is UI iff $\{Y_\alpha : \alpha \in I\}$ is tight.
Proof. By Theorem 6.1, the family $\{c + X_\alpha\}$ is UI iff the family $\{(c + X_\alpha)^*\}$ is tight. As it is easy to verify that the family $\{X_\alpha\}$ is tight [respectively UI] iff the family $\{c + X_\alpha\}$ is tight [respectively UI], Theorem 6.2 follows directly from Theorem 6.1.
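The two-point example of this section, with $P(X_n = n) = 1/n^2$, can be checked mechanically. The sketch below is illustrative only: the helper names are hypothetical, and a finite pmf is represented as a dictionary from atoms to exact rational masses. It computes the tail expectation $E(X_n; X_n > L)$, which is small uniformly in $n$, and the tail mass of the size-biased law, which equals 1 as soon as $n > L$.

```python
from fractions import Fraction


def size_bias(pmf):
    """Size-biased pmf: P(X* = x) = x P(X = x) / E X, for X >= 0, 0 < E X < inf."""
    mean = sum(x * p for x, p in pmf.items())
    return {x: x * p / mean for x, p in pmf.items() if x * p != 0}


def tail_mass(pmf, L):
    """P(X > L)."""
    return sum(p for x, p in pmf.items() if x > L)


def tail_expectation(pmf, L):
    """E(X; X > L), the additive contribution of large values to the mean."""
    return sum(x * p for x, p in pmf.items() if x > L)


def example(n):
    """The two-point law with P(X_n = n) = 1/n^2 and P(X_n = 0) = 1 - 1/n^2."""
    return {n: Fraction(1, n * n), 0: 1 - Fraction(1, n * n)}


L = 10
for n in (5, 11, 100):
    X = example(n)
    print(n, tail_expectation(X, L), tail_mass(size_bias(X), L))
```

For every $n > L$ the first printed quantity is $1/n \le 1/L$ while the second is exactly 1, which is the gap between additive and relative contribution that Theorems 6.1 and 6.2 rule out.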

Size bias, the lognormal, and Chihara-Leipnik
In this section we review a construction due to Chihara in 1970, [25], and Leipnik in 1979, [47,48], of a family of discrete distributions having the same moment sequence as the lognormal. Durrett [30] presents this result 12 with the comment "Somewhat remarkably, there is a family of discrete random variables with these moments." We hope here to show that, from the point of view of size bias, this construction is natural and inevitable, but we can only speculate that for the original discoverers, size bias played a role in the creative process, perhaps via (8); see [48, page 332, formula (16)]. As a reward for using size bias, we are able to show, in Theorem 7.4, that the lognormal itself is a mixture of these discrete distributions, and furthermore that these discrete distributions are the extreme points of a Choquet simplex (in this case, the set of solutions of (41), which is a subset of the closed convex set formed by all distributions having the same moments as the lognormal). The structure of the larger convex set is discussed in Conjecture 7.6.
12 specialized to $\sigma = 1$, $c = e$, $a = \sqrt{e}$
Throughout this section, we write $Z$ for a standard normal, with moment generating function $M(\beta) = e^{\beta^2/2}$. The standard lognormal is given by $X = e^Z$, with moments
(36) $E X^n = E \exp(nZ) = M(n) = e^{n^2/2}$.
Similarly, for $\sigma > 0$, the lognormal $X = e^{\sigma Z}$ obtained by exponentiating the normal with mean zero and variance $\sigma^2$ has moments $E X^n = E \exp(n \sigma Z) = M(\sigma n) = e^{n^2 \sigma^2/2}$. The famous fact that the lognormal distribution is not determined by its moments, and the family of examples in (39), are due to Stieltjes in 1894 [66, Section 56, page J. 106], reprinted in [67]. The family has sometimes been attributed to Heyde 1963 [41] (see, e.g., [30,31]), who rediscovered it much later. The alternate probability distributions having the same moments as the lognormal are continuous, with density presented via a perturbation of the lognormal density, as follows.
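We will write $f_{0,\sigma^2}$ for the density of the lognormal $X = e^{\sigma Z}$:
$f_{0,\sigma^2}(x) = \frac{1}{x \sigma \sqrt{2\pi}} \exp\left(-(\log x)^2/(2\sigma^2)\right)$, $x > 0$.
For positive integers $m$ and real $\delta \in [-1, 1]$ define
$g_{m,\delta}(x) = 1 + \delta \sin(2\pi m \log x / \sigma^2)$,
so in case $\delta = 0$, one has $g_{m,\delta}(x) = 1$ for all $x$. Finally, let $X_{m,\delta}$ have density given by
(39) $h_{m,\delta}(x) = f_{0,\sigma^2}(x)\, g_{m,\delta}(x)$.
One then checks that for integers $n$, $\int x^n h_{m,\delta}(x)\,dx = \int x^n f_{0,\sigma^2}(x)\,dx = e^{n^2\sigma^2/2}$.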
Let $X = e^Z$, and consider its size biased version, $X^*$. By (10) and (36), for integers $n$,
$E (X^*)^n = E X^{n+1} / E X = e^{(n+1)^2/2} / e^{1/2} = e^n e^{n^2/2} = E (eX)^n$.
Of course, since the lognormal distribution is not characterized by its moments, this only suggests, and does not prove, that $X^* =_d eX$. Similarly, the general lognormal is given by $X = \exp(\sigma Z + \mu)$, with moments $E X^n = (e^\mu)^n e^{\sigma^2 n^2/2}$, and calculation of the moments of $X^*$ suggests that for $X = \exp(\sigma Z + \mu)$ we have $X^* =_d e^{\sigma^2} X$. Simple computation with the density and (4) shows that indeed, for $X = \exp(\sigma Z + \mu)$ we have $X^* =_d cX$, with $c = \exp(\sigma^2)$. We leave this as an exercise for the reader, with our solution given in footnote 13.
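The density computation behind $X^* =_d cX$ can be spot-checked numerically. The sketch below is only an illustration with hypothetical function names: it evaluates the size-biased density $x f(x)/E X$, using $E X = \exp(\mu + \sigma^2/2)$, and the density $(1/c) f(x/c)$ of $cX$ with $c = \exp(\sigma^2)$, at a few points.

```python
import math


def lognormal_pdf(x, mu, sigma):
    """Density of exp(sigma * Z + mu) at x > 0, for Z standard normal."""
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2.0 * math.pi))


def size_biased_pdf(x, mu, sigma):
    """Density of X*: x f(x) / E X, with E X = exp(mu + sigma^2 / 2)."""
    mean = math.exp(mu + 0.5 * sigma**2)
    return x * lognormal_pdf(x, mu, sigma) / mean


def scaled_pdf(x, mu, sigma):
    """Density of c X with c = exp(sigma^2): (1/c) f(x / c)."""
    c = math.exp(sigma**2)
    return lognormal_pdf(x / c, mu, sigma) / c


for x in (0.3, 1.0, 2.5, 7.0):
    print(x, size_biased_pdf(x, 0.4, 0.8), scaled_pdf(x, 0.4, 0.8))
```

The two columns agree up to floating point rounding for every choice of $\mu$, $\sigma$, and $x$ tried here; a numerical match at sample points is of course a sanity check, not a proof.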
For the remainder of this section, we investigate: for $c \in (1, \infty)$, for an arbitrary $X \ge 0$ with finite strictly positive mean, what are the consequences of
(41) $X^* =_d cX$, with $a := E X = \sqrt{c}$?
As the first step in our investigation of (41), inspired by Feynman's maxim, 14 we note that our considerations lead twice to a homogeneous system of equations, of the form
(42) $\forall n \in \mathbb{Z}$, $s_{n+1} = a c^n s_n$,
which has solution $s_n = s_0\, a^n c^{n(n-1)/2}$.
For the first instance of (42), write $m_n := E X^n$, with $m_1 = E X = a$, so the moment shift relation (10) can be written as $E (X^*)^n = m_{n+1}/a$. Using (41), we have $E (X^*)^n = E (cX)^n = c^n m_n$, hence
(43) $m_{n+1} = a c^n m_n$.
Combining $m_0 = 1$ with the solution to (42), we have
(44) $m_n = a^n c^{n(n-1)/2} = c^{n^2/2}$ (using $a = \sqrt{c}$),
for all $n \in \mathbb{Z}$. In summary, so far we have shown that any solution of (41) has the same moments as the lognormal $e^{\sigma Z}$.
For the second instance of (42), if $X$ satisfying (41) has any pointmass at some $b > 0$, then it must have pointmass at every point $b c^n$ for $n \in \mathbb{Z}$. With the benefit of hindsight 15 we go doubly negative, and for $n \in \mathbb{Z}$ define $p_n$ and $r_n$ by $r_n = 1/p_n = 1/P(X = bc^{-n})$. We have $p_{n+1} = P(X = bc^{-n-1}) = P(cX = bc^{-n}) = P(X^* = bc^{-n}) = (bc^{-n}/a)\,P(X = bc^{-n}) = (bc^{-n}/a)\,p_n$, so that
(45) $r_{n+1} = (a/b)\, c^n r_n$.
This is (42) with $r_n$ in the role of $s_n$ and $a/b$ in the role of $a$, so quoting the solution, and using $a = \sqrt{c}$, we get $p_0/p_n = r_n/r_0 = (a/b)^n c^{n(n-1)/2} = b^{-n} c^{n^2/2}$. Finally, replacing $n$ by $-n$ in $p_n = P(X = bc^{-n})$, we have, for $n \in \mathbb{Z}$,
(46) $P(X = bc^n) = b^{-n} c^{-n^2/2}\, P(X = b)$.
With some fixed $c > 1$ in mind, for any $b \in (0, \infty)$ we call the set $\{b c^n : n \in \mathbb{Z}\}$ the "orbit of $b$," for short, or to say it fully, the orbit of $b$ modulo multiplication by powers of $c$. The language here comes from the theory of a group acting on a set; orbits are equivalence classes, and $(0, \infty)$ is a disjoint union of orbits. For a set containing exactly one representative for each orbit, the natural choice is $[1, c)$.
the same solutions."
15 Defining, for example, $s_n = P(X = bc^n)$ or $s_n = 1/P(X = bc^n)$ or $s_n = P(X = bc^{-n})$ does not lead directly to (42); the reader might enjoy trying these.
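If we want $X$ supported on a single orbit, that is, with $1 = \sum_{n \in \mathbb{Z}} p_n$, then by (46) we need
(47) $P(X = bc^n) = b^{-n} c^{-n^2/2} / t(b, c)$ for $n \in \mathbb{Z}$, where $t(b, c) := \sum_{n \in \mathbb{Z}} b^{-n} c^{-n^2/2}$.
The function $t$ is essentially the Jacobi theta function; the convergence of the series, for any $c > 1$, is obvious.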
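However, the calculation connecting (46) with (41) was done assuming that $E X = \sqrt{c}$, and we will only have succeeded, in getting a random variable with $X^* =_d cX$ and supported on a single orbit, if, and only if, it turns out that, under the mass function (47), $E X = \sqrt{c}$. This is indeed the case:
$E X = \sum_{n \in \mathbb{Z}} b c^n \cdot b^{-n} c^{-n^2/2}\, P(X = b) = \sqrt{c}\; P(X = b) \sum_{n \in \mathbb{Z}} b^{-(n-1)} c^{-(n-1)^2/2} = \sqrt{c}$,
using (46) and (47), with the change of variables $m = n - 1$ justifying the final equality.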
The above discussion shows how the use of size bias, particularly (41), makes it relatively straightforward to rediscover and prove the following theorem of Chihara and Leipnik:
Theorem 7.1 (Chihara-Leipnik). For any $\sigma > 0$, with $c := \exp(\sigma^2)$, and for any $b \in (0, \infty)$, there is a distribution $(b, c)$ for a discrete random variable $X_{b,c}$, whose support is the single orbit $\{\ldots, b/c^2, b/c, b, bc, bc^2, bc^3, \ldots\}$, with probability mass function given by (47). This random variable satisfies (41), which implies that for $n \in \mathbb{Z}$, $E X_{b,c}^n = \exp(n^2 \sigma^2/2)$, so taking $n \ge 0$ in particular, the discrete random variable $X_{b,c}$ has the same moments as the lognormal $\exp(\sigma Z)$, where $Z$ is standard normal.
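The moment identity for the single-orbit distributions can be checked numerically. The following sketch rests on stated assumptions: the mass function is taken to be (46) normalized over the orbit, that is $P(X = bc^n) \propto b^{-n} c^{-n^2/2}$; the series is truncated at $|n| \le 60$, which is a hypothetical cutoff adequate for moderate $c$ and small $|k|$; and the function names are illustrative. It computes $E X^k$ for several integers $k$, including negative ones, and compares with $c^{k^2/2}$.

```python
import math


def chihara_leipnik_pmf(b, c, n_max=60):
    """Return {n: P(X = b * c**n)} for |n| <= n_max.

    Weights are b**(-n) * c**(-n*n/2), normalized to sum to 1; the tail
    beyond n_max is dropped (negligible for moderate c).
    """
    log_b, log_c = math.log(b), math.log(c)
    weights = {
        n: math.exp(-n * log_b - 0.5 * n * n * log_c)
        for n in range(-n_max, n_max + 1)
    }
    total = sum(weights.values())
    return {n: w / total for n, w in weights.items()}


def moment(b, c, k, n_max=60):
    """Truncated-series value of E X^k for the single-orbit law on {b c^n}."""
    log_b, log_c = math.log(b), math.log(c)
    pmf = chihara_leipnik_pmf(b, c, n_max)
    # The atom is b * c**n, so the log of its k-th power is k * (log b + n log c).
    return sum(p * math.exp(k * (log_b + n * log_c)) for n, p in pmf.items())


c = math.e  # corresponds to sigma = 1
for b in (1.0, 1.7, 2.5):
    print(b, [moment(b, c, k) / c ** (k * k / 2) for k in range(-2, 5)])
```

Each printed ratio is 1 up to rounding, independently of $b$, which is the numerical face of the statement that every orbit carries a distribution with the lognormal moments.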
Another issue is whether the lognormal can be expressed as a mixture of these discrete distributions. Leipnik 1991, [48, page 337], wrote 16 "One hopes that for some mixing distribution dh(b) we have that the lognormal distribution for e σZ is a mixture, governed by h, of the single orbit distributions, and so too [The display above expresses the characteristic function of the lognormal as a mixture of the characteristic functions of the distributions (b, c).] Unfortunately, the necessary d h(b) is somewhat complicated and hence sheds little light on the sum distribution problem. However, the extraordinary non-uniqueness of the lognormal moment problem is apparent.
The words "one hopes" signal a conjecture; the sentence beginning "Unfortunately · · · " suggests that he may have had a proof too messy to publish. Whatever the case, we supply a proof here, in the form of Theorem 7.4 below. Conceivably, the complication encountered by Leipnik might have arisen from considering mixtures indexed by $(0, \infty)$, without exploiting the formula $(b, c) = (bc, c)$. It is natural, and simple, to take mixtures indexed by $[1, c)$; then there is a unique choice for $h$, with one simple computation to check. For notation, we follow Leipnik, and write $dh(b)$ to denote a general measure $h$ to govern a mixture; so that $h$ may be discrete, absolutely continuous, singular continuous, or a mixture of these. In the special case in Theorem 7.4 given by (51), expressing the lognormal as a mixture of the $(b, c)$, we have $h$ absolutely continuous, with density $h_c$ with respect to Lebesgue measure.
There is a related result, expressing a particular continuously distributed random variable, not the lognormal, but having the same moments, as a mixture of these discrete distributions, in [ (41) is also a distribution which satisfies (41).
we are in the situation for Lemma 2.3 where the measure h governing X * as a mixture of the X * b,c is the same as the original h, governing X as a mixture of the X b,c . Hence (41) holds, since, obviously, scaling respects mixtures, i.e., the law of cX is the mixture, governed by h, of the laws of c X b,c .
That the set of solutions to (41) is closed is a bit subtle: to invoke Theorem 2.2 when taking limits on the left side of $X^* =_d cX$, one must know that the set of solutions is uniformly integrable. Fortunately, (44) implies that $E X^2 = c^2$ for any solution of (41), which implies that the family is uniformly integrable. Finally, just as with mixtures of the $(b, c)$, Lemma 2.3 applies, with the same measure governing $X$ as mixture of solutions $X_\alpha$, governing $X^*$ as a mixture of the $X_\alpha^*$, and $cX$ as a mixture of the $cX_\alpha$.
Lemma 7.3. Suppose $c > 1$, and $X, Y$ are positive random variables which satisfy
$X^* =_d cX$, $Y^* =_d cY$, and $E X = E Y \in (0, \infty)$.
If the laws of $X$ and $Y$, both restricted to $[1, c)$, agree, even only up to a constant mass factor $k \ge 0$, i.e., if
(48) for all measurable $A \subset [1, c)$, $P(X \in A) = k\,P(Y \in A)$,
then $X =_d Y$. (The case $k = 0$ is specifically included in the hypothesis (48), but in every case, the conclusion implies that $k = 1$.)
Proof. Let $a := E X$, so by hypothesis, we also have $a = E Y$ (but unlike (41), we are not assuming that $a = \sqrt{c}$). Let $S(n)$ be the statement that for all bounded measurable $g$ which vanish outside $[c^n, c^{n+1})$, we have $E g(X) = k\,E g(Y)$. The hypothesis (48) clearly implies the statement $S(0)$. Assume now that $S(n)$ holds. Given a bounded measurable function $g$ which vanishes off of $[c^{n+1}, c^{n+2})$, we define new functions $g', g''$ by $g'(x) = g(x)/x$ and $g''(x) = g'(cx)$. Clearly $g''$ is bounded, and vanishes off of $[c^n, c^{n+1})$. We have
$E g(X) = E(X g'(X)) = a\,E g'(X^*) = a\,E g'(cX) = a\,E g''(X)$,
and similarly $E g(Y) = a\,E g''(Y)$. Invoking $S(n)$ for the function $g''$, we get $E g''(X) = k\,E g''(Y)$, and therefore
(49) $E g(X) = k\,E g(Y)$,
hence $S(n)$ implies $S(n + 1)$.
A similar argument shows that $S(n)$ implies $S(n - 1)$. In detail, given a bounded measurable function $g$ which vanishes off of $[c^{n-1}, c^n)$, we define new functions $g', g''$ by $g'(x) = g(x/c)$, so that $g'(cx) = g(x)$, and $g''(x) = x g'(x)$. Clearly $g', g''$ are bounded, and vanish off of $[c^n, c^{n+1})$. We have
$E g(X) = E g'(cX) = E g'(X^*) = (1/a)\,E(X g'(X)) = (1/a)\,E g''(X)$,
and similarly $E g(Y) = (1/a)\,E g''(Y)$; hence (49) holds exactly as before, but this time showing that $S(n)$ implies $S(n - 1)$.
Finally, knowing S(n) for all n ∈ Z implies that for bounded measurable g, E g(X) = kE g(Y ), and the special case g = 1 shows that k = 1, and hence X = d Y .
Theorem 7.4. Fix $\sigma > 0$, $c = \exp(\sigma^2)$, and let $X$ be the lognormal $e^{\sigma Z}$, or any other positive random variable which satisfies (41). Then there is a unique probability measure $h$ on $[1, c)$ such that the distribution of $X$ is the mixture, governed by $dh(b)$, of the Chihara-Leipnik single orbit distributions $(b, c)$ of Theorem 7.1, with point mass functions given by (47). The measure $h$ governing the mixture is specified as follows: let $B$ be distributed as $X$, conditional on $(X \in [1, c))$, and let the probability measure $h$ have Radon-Nikodym derivative, relative to the distribution of $B$, given by
(50) $\dfrac{h(db)}{P(B \in db)} = \dfrac{t(b, c)}{E\,t(B, c)}$.
For the lognormal, with density $f(x) = \frac{1}{x\sigma\sqrt{2\pi}} \exp\left(-(\log x)^2/(2\sigma^2)\right)$, the recipe (50) says that with normalizing constant $k_c$ and function $h_c$ with domain $[1, c)$, defined by the measure $h$ governing the mixture has density $h_c$, so that for measurable $A \subset$
Proof. First, we must show that the distribution of $B$ is well-defined, i.e., that $P(X \in [1, c)) > 0$. Here we argue by contradiction: if $P(X \in [1, c)) = 0$, then Lemma 7.3 could be invoked, with $Y = e^{\sigma Z}$, $k = 0$, to prove $X =_d Y$, a contradiction since $P(Y \in [1, c)) > 0$. Now write $Y$ for a random variable whose distribution is the mixture of the $(b, c)$, governed by $h$. We use the Dirac notation, that $\delta_x$ is unit mass at $x$, so that $\int g(z)\,\delta_x(dz) = g(x)$ for any measurable $g$. Restricting our attention to $b \in [1, c)$, the Chihara-Leipnik distributions are then expressed as Focus on the case $n = 0$, so that $\mu_{b,0}$ is mass $1/t(b, c)$ at the point $b$. The specification of $h$ in (50) implies directly that the hypothesis (48) holds, with $k = P(X \in [1, c)) \times E\,t(B, c)$. Hence by Lemma 7.3, we have $X =_d Y$.
The argument for uniqueness is essentially the same: suppose that $Y$ is a mixture of $(b, c)$, governed by some probability measure $h$ on $[1, c)$, and that $X =_d Y$, not assuming that $h$ is given by (50). Restricting the distributions of both $X$ and $Y$ to $[1, c)$, it is clear, from $\mu_{b,0} = (1/t(b, c))\,\delta_b$, that the Radon-Nikodym derivative $h(db)/P(X \in db \mid X \in [1, c))$ must be proportional to $t(b, c)$. The recipe in (50) gives the unique constant of proportionality to make such an $h$ into a probability measure. It is now natural to ask whether Stieltjes' examples, with density given by (39), lie in this Choquet simplex.
Proof. For random variables with a density, the size-bias-scaling relation in (41) can be expressed in terms of the density, as follows. First, when $X$ has density $f$, the scaled multiple $cX$ has density $(1/c) f(x/c)$. Second, when $X$ has density $f$ and mean $a = E X = \sqrt{c}$, (4) states that $X^*$ has density $(x/\sqrt{c}) f(x)$. Hence, if $X$ has density $f$, mean $\sqrt{c}$, and
(52) $(1/c)\, f(x/c) = (x/\sqrt{c})\, f(x)$ for all $x > 0$,
then $X$ satisfies (41). Now it is clear that (52) holds for $f = h_{m,\delta}$ given by (39): we have $c = e^{\sigma^2}$, and upon substituting $x/c$ for $x$, the lognormal factor $f_{0,\sigma^2}$ supplies the factor $x\sqrt{c}$, and the perturbation factor $g_{m,\delta}$ supplies no change, since dividing $x$ by $c$ causes $\log x$ to decrease by $\log c = \sigma^2$, so that the argument to the sine function, $2\pi m \log x/\sigma^2$, goes down by $2\pi m$.
To review: both the lognormal, and the examples given by Stieltjes, are solutions of (41) and hence lie in the Choquet simplex discussed in Theorem 7.4. Do all distributions having the lognormal moment sequence lie in this simplex? This question is answered, in the negative, by Berg [16,Proposition 2.1], with b = √ c, and an example that is the perturbation of (47) by a factor of (1 + s(−1) n ) for s ∈ [−1, 1]. That is, he shows that for any c > 1, leads to E X n s = c n 2 /2 for n ∈ Z. In particular, for b = √ c the Chihara-Leipnik distribution (b, c) is the midpoint of the line connecting the distributions of X −1 and X 1 . The construction is special to b = √ c as the only value of b ∈ [1, c) for which a line of distributions with moments E X n = c n 2 /2 can be constructed, with (b, c) as the midpoint.
Going out on a limb, we conjecture that apart from the above construction, the Chihara-Leipnik distributions are extreme points, relative to the given moment sequence. That is:
Conjecture 7.6. For any $\sigma > 0$, with $c = \exp(\sigma^2)$, consider the set $V$ of probability measures on $[0, \infty)$, serving as distributions for a nonnegative random variable $X$ with $E X^n = c^{n^2/2}$ for $n = 0, 1, 2, \ldots$. This set is a Choquet simplex, whose extreme points are precisely the distributions $(b, c)$ for $b \in [1, \sqrt{c}) \cup (\sqrt{c}, c)$, together with two additional distributions, those of $X_{-1}$ and $X_1$, as given by (53).
Any lognormal distribution is infinitely divisible; for background and references on this, see Examples 9.7 and 9.8. So, in the following conjecture, the emphasis is on the word only.
Conjecture 7.7. For each $c > 1$, in the Choquet simplex of solutions of (41), as described in Theorem 7.4, the only infinitely divisible distribution is the lognormal, that of $e^{\sigma Z}$ with $c = e^{\sigma^2}$.

Size bias and Skorohod embedding
Skorohod's embedding theorem states that given a nonconstant mean zero random variable $X$, there is a random time $T$ for Brownian motion $(W_t)_{t \ge 0}$ such that $X =_d W_T$. We discuss Skorohod's proof as presented, for example, in [30,53]. The proof is based on the construction of a joint distribution for a dependent pair $(U, V)$ with $U, V \ge 0$ so that, with the pair independent of the Brownian motion, the random time T := T U,V := inf{t : , and the function $(u, v) \mapsto u/(u + v)$ is nonlinear, it is somewhat surprising that a simple distribution of $(U, V)$ can satisfy $X =_d W_{T_{U,V}}$. That distribution, specified in [30,53] by the formula is the same 17 as the distribution (55) in our size bias treatment. Display (57) highlights how size bias overcomes the nonlinearity of $(u, v) \mapsto u/(u + v)$. The excellent survey by Obłój [53] should be consulted for the history and connections to the potential of a measure.
Since $X$ is nonconstant and mean zero, both $p_- := P(X < 0) > 0$ and $p_+ := P(X > 0) > 0$, so the conditioning is elementary. Note that Write $p_0 := P(X = 0)$. Since $A$ and $B$ have finite positive mean, the distributions of $A^*$ and $B^*$ are well defined. Couple so that $A, A^*, B, B^*$ are independent. The final recipe, writing $\delta_q$ for unit mass at the point $q$, is and then take $(U, V)$ to be independent of the Brownian motion $W$.
To prove that (55) and T = T U,V achieve X = d W T , first consider the case where P(X = 0) = 0. Given a bounded measurable function h : R → R, conditioning on U, V and using the exit distribution for Brownian motion from the interval [−u, v] we have (56) E Next, since we are in the case where p − + p + = 1, using (54) we have The size bias relation for processes from Section 2.3, together with the independence of A, B, justifies the transition from line 2 to line 3 below: for any bounded measurable g : . 17 apart from a notational switch between −u and u; we write −u ≤ 0 ≤ v and they write Using this identity for our function g defined in (56), and using the independence of A and B to go from line 3 to line 4, we have and hence L(W T ) = L(X), as claimed.
That $X =_d W_T$ in the general situation, allowing $P(X = 0) \in (0, 1)$, is easily seen, since the distribution of $X$ is then a mixture of pointmass at zero, and the distribution of $X$ conditional on $X \ne 0$, and the recipe (55) is the corresponding mixture of pointmass at $(0,0)$ and the distribution of $(U, V)$ treated above.
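For a discrete $X$, the embedding can be verified exactly, since Brownian motion started at 0 exits $[-u, v]$ at $v$ with probability $u/(u+v)$ and at $-u$ with probability $v/(u+v)$. The sketch below rests on assumptions that should be read as ours: $A$ is taken to be $-X$ conditional on $X < 0$, $B$ to be $X$ conditional on $X > 0$, and the law of $(U, V)$ is taken to be the mixture $p_0\,\delta_{(0,0)} + p_+\,\mathcal{L}(A^*, B) + p_-\,\mathcal{L}(A, B^*)$ with independent coordinates, which is our reading of the recipe (55). The function names and the example law are hypothetical.

```python
from fractions import Fraction


def size_bias(pmf):
    """Size-biased pmf of a positive discrete variable given as {atom: mass}."""
    mean = sum(x * p for x, p in pmf.items())
    return {x: x * p / mean for x, p in pmf.items() if x > 0}


def embedded_law(pmf_x):
    """Exact law of W_T, T the exit time of [-U, V], under the assumed mixture."""
    p_minus = sum(p for x, p in pmf_x.items() if x < 0)
    p_plus = sum(p for x, p in pmf_x.items() if x > 0)
    p_zero = pmf_x.get(0, Fraction(0))
    law_a = {-x: p / p_minus for x, p in pmf_x.items() if x < 0}  # A = -X | X < 0
    law_b = {x: p / p_plus for x, p in pmf_x.items() if x > 0}    # B =  X | X > 0

    out = {}

    def add(point, mass):
        out[point] = out.get(point, Fraction(0)) + mass

    if p_zero:
        add(0, p_zero)  # (U, V) = (0, 0) gives T = 0 and W_T = 0
    components = (
        (p_plus, size_bias(law_a), law_b),   # (U, V) = (A*, B)
        (p_minus, law_a, size_bias(law_b)),  # (U, V) = (A, B*)
    )
    for weight, law_u, law_v in components:
        for u, pu in law_u.items():
            for v, pv in law_v.items():
                mass = weight * pu * pv
                add(-u, mass * Fraction(v) / (u + v))  # exit at -u
                add(v, mass * Fraction(u) / (u + v))   # exit at  v
    return out


# Hypothetical mean zero law on integer atoms.
q = Fraction(1, 4)
law_x = {-2: q, -1: q, 0: q, 3: q}
print(embedded_law(law_x))
```

For this example the computed law of $W_T$ coincides with the law of $X$, atom by atom; the exact rational arithmetic makes the comparison an identity rather than an approximation.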

Size bias and infinite divisibility
Paul Lévy's theory of infinitely divisible distributions is celebrated; see any of [19,27,31,43] for introductory treatments, or [2,17,62] for advanced treatments. For the special case of nonnegative random variables with finite mean, size bias provides an easy handle on the theory. 9.1. Steutel revisited. Theorem 9.1. Suppose $X$ can be size biased, i.e., $X \ge 0$ and $a := E X \in (0, \infty)$. If $X$ is infinitely divisible, then there exists a distribution for $Y$ such that (58) $X^* =_d X + Y$, and $X, Y$ are independent.
Conversely, given that X can be size biased, and that (58) holds for some Y , then X is infinitely divisible.
In either case, the distribution of Y is unique, and P(Y ≥ 0) = 1.
Remark: In [64] (see also [63]), F. Steutel shows that a cumulative distribution for a non-decreasing K. Our decomposition (58) is clearly a consequence of his integral formula, though he does not use the language of size biasing -he does not, in fact, assume that F has finite mean-and his proof proceeds by way of the Levy representation formula, which we will derive instead as a corollary of (58). Steutel's result is also presented in Sato [62], Theorem 51.1, as well as in the book [65] by Steutel and van Harn.
Proof. We begin by assuming that $X$ is infinitely divisible, which by definition means that for each $n$ there exists a distribution such that if $X^{(n)}_1, \ldots, X^{(n)}_n$ are independent with this distribution, then
$X =_d X^{(n)}_1 + \cdots + X^{(n)}_n$.
Then by (25)
(60) $X^* =_d (X - X^{(n)}_1) + (X^{(n)}_1)^*$.
It is obvious that $X^{(n)}_1 \to 0$ in $L^1$ and hence in probability. Hence $X - X^{(n)}_1$ converges in distribution to $X$.
Next, the family of random variables $(X^{(n)}_1)^*$ is tight, because given $\varepsilon > 0$, there is a $K$ such that $P(X^* > K) < \varepsilon$, and by (60), for all $n$, $P((X^{(n)}_1)^* > K) \le P(X^* > K)$. Thus, by Helly's theorem, there exists a subsequence $n_k$ of the $n$'s along which $(X^{(n)}_1)^*$ converges in distribution, say $(X^{(n_k)}_1)^* \Rightarrow Y$. As $n \to \infty$ along this subsequence, the pair $(X - X^{(n)}_1, (X^{(n)}_1)^*)$ converges jointly to the pair $(X, Y)$ with $X$ and $Y$ independent. From $X^* =_d (X - X^{(n_k)}_1) + (X^{(n_k)}_1)^* \Rightarrow X + Y$ as $k \to \infty$ we conclude that $X^* =_d X + Y$, with $Y \ge 0$, and $X, Y$ independent. This completes the proof that if $X$ is infinitely divisible, then it satisfies (58).
That the law of $Y$ in (59) is unique requires a little work; we will need to know that the characteristic function $\varphi$ for $X$ satisfies $\varphi(u) \ne 0$ for all real $u$. Once we have this, uniqueness is easy: from (8) and (58), writing $\varphi_Y$ for the characteristic function of $Y$, we have two expressions for $\varphi_{X^*}(u)$, hence
(61) $\varphi'(u)/(i a) = \varphi(u)\,\varphi_Y(u)$.
This determines $\varphi_Y(u)$, provided we know that $\varphi(u) \ne 0$.
The characteristic function of any infinitely divisible $X$ has $\varphi(u) \ne 0$ for all $u$: Feller [31, p. 500 and pp. 555-557], and Chung [27, Theorem 7.6.1], give straightforward proofs. However, under the hypothesis that (58) holds and $E X$ is finite, there is a simpler proof, as follows. Suppose that $\varphi(u) \ne 0$ for all $u \in (-t, t)$, for some $t > 0$. From equation (61), for $u \in (-t, t)$,
$(\log \varphi(u))' = \varphi'(u)/\varphi(u) = i a\, \varphi_Y(u)$, so that $|(\log \varphi(u))'| \le a = E X$.
Since $\varphi$ is continuous with $\log \varphi(0) = 0$, it follows that for all $u \in [-t, t]$, $|\log \varphi(u)| \le t\, E X < \infty$. If it were the case that $\varphi(u) = 0$ for any $u$, we could take $t = \inf\{|u| : \varphi(u) = 0\} < \infty$ to get a contradiction. 18
Finally, we prove the converse statement, that (58) implies infinite divisibility. Starting with the assumption (58), we have (61), which (with details given in the next section) lets us solve for $(\log \varphi(u))'$, and integrate, to get (63). That (63) is the characteristic function of an infinitely divisible distribution is well-known, but to review, for the sake of a self-contained proof: the function in (63) can be expressed as the limit of characteristic functions of random variables with compound Poisson distribution, as in (32), and scaling all the Poisson parameters down by a factor of $n$, and then taking the limit, we get the distribution for the $n$th convolutional root $X^{(n)}_1$ for use in (59).
9.2. The Lévy representation. We continue to work with an X ≥ 0 with a := E X ∈ (0, ∞), assuming also that X is infinitely divisible, or equivalently, that X satisfies (58). Using (61), and since φ(0) = 1 with log φ(0) = 0, we get Let α be the distribution of Y in (59), so α is a probability measure on [0, ∞). We have Combining the three previous displayed equations, the characteristic function φ for X may be expressed as To review, a ∈ (0, ∞), α is the probability distribution of a nonnegative random variable Y , and φ(u) is the characteristic function of a random variable X, with a = E X, and, with X, Y independent, X * = d X + Y . We have derived this under the assumption that (58) holds. However, given a ∈ (0, ∞), and a probability distribution for a nonnegative random variable Y , it can be seen that (63) is the characteristic function of a random variable X, by taking distributional limits of the discrete compound Poisson sums in (32). Then, working back through (62), one sees easily that E X = a and, with X, Y independent, X * = d X + Y .
Here γ is a nonnegative measure on (0, ∞), with γ(dy)/α(dy) = 1/y, and this allows a broader class than (63). To get E X < ∞, there is the additional requirement that (0,∞) y γ(dy) < ∞ -this is the price one pays for being able to size bias. Regardless of whether E X = ∞ or E X < ∞, the nonnegative measure γ can have infinite mass, due to mass near zero, and the requirement, to get a nonnegative infinitely divisible X, allowing E X = ∞, is that (0,∞) (1 ∧ y)γ(dy) < ∞. Examples 9.10 and 9.11 illustrate this, where, in both cases, α is a uniform distribution on an interval, and E X < ∞.
We read (64) as: the random variable $X$ is the constant $a \alpha_0$, plus the sum of arrivals, in the Poisson process on $(0, \infty)$ with intensity measure $a \gamma$. Formula (64) 19 is called the Lévy-Khintchine formula in the survey paper on subordinators [18], the one difference being that the random variable $X$ representing the value of the subordinator at time $a$ is also allowed to have $P(X = \infty) = 1 - \exp(-ka) > 0$, where $k$ is called the killing rate. 20
9.3. The size bias equation. When $X, Y$ are both discrete or both absolutely continuous, it is worth highlighting how (4), together with (58), yields a simple relation satisfied by the mass functions or densities. Sato [62] Section 51, especially Corollary 51.2, already highlights these relations, though of course without referring to them as being size-bias relations.
In the discrete case, if (58) holds, so that $f_{X^*}$ is the convolution of $f_X$ and $f_Y$: $f_{X^*}(x) = \sum_y f_X(x - y) f_Y(y)$, combining with (4) yields, for all $x > 0$,
(65) $f_X(x) = \frac{a}{x} \sum_y f_X(x - y) f_Y(y)$.
A common special case is that $Y$ is supported on the positive integers, and $X$ on the nonnegative integers, so that (65) specifies a recursion: starting from $f(0) = c$, for $m = 0, 1, 2, \ldots$, and the normalizing constant is determined by $c^{-1} = \sum_{i \ge 0} f_X(i)$. Furthermore, from (32) and (63) The relation (66) was used in [9], where it was referred to as a result from [59]. The situation with $X = \sum_{1 \le i \le n} i Z_i$ with $Z_i$ independent Poisson($\lambda_i$) is universal to combinatorial assemblies; here $X$ is usually denoted as $T_n$, and conditional on the event $(T_n = n)$ one has a labelled combinatorial object of total size $n$, in which there are $Z_i$ components of size $i$, jointly for $i = 1$ to $n$. See [10,8].
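The discrete size bias equation can be exercised on $X = \sum_{1 \le i \le n} i Z_i$ with independent $Z_i \sim$ Poisson($\lambda_i$). The sketch below makes an assumption that is ours rather than a quotation: that in this case $a = \sum_i i\lambda_i$ and $P(Y = i) = i\lambda_i / a$, so that the relation $x f_X(x) = a \sum_y f_X(x-y) f_Y(y)$ becomes the recursion $m f(m) = \sum_i i \lambda_i f(m - i)$ with $f(0) = \exp(-\sum_i \lambda_i)$. The function names are hypothetical. The recursion is compared against a direct convolution of the laws of the $i Z_i$.

```python
import math


def pmf_by_recursion(lams, size):
    """f(0), ..., f(size-1) for X = sum_i i*Z_i via m f(m) = sum_i i*lam_i*f(m-i)."""
    f = [math.exp(-sum(lams))] + [0.0] * (size - 1)
    for m in range(1, size):
        f[m] = sum(
            i * lam * f[m - i]
            for i, lam in enumerate(lams, start=1)
            if i <= m
        ) / m
    return f


def pmf_by_convolution(lams, size):
    """Same pmf, by convolving the laws of i*Z_i restricted to {0, ..., size-1}."""
    f = [1.0] + [0.0] * (size - 1)
    for i, lam in enumerate(lams, start=1):
        g = [0.0] * size
        for k in range((size - 1) // i + 1):
            g[i * k] = math.exp(-lam) * lam**k / math.factorial(k)
        f = [sum(f[s - t] * g[t] for t in range(s + 1)) for s in range(size)]
    return f


lams = [0.7, 1.3, 0.4]  # hypothetical Poisson parameters lambda_1..lambda_3
rec = pmf_by_recursion(lams, 25)
conv = pmf_by_convolution(lams, 25)
print(max(abs(r - c) for r, c in zip(rec, conv)))
```

The recursion needs only $O(n)$ work per value and never forms the individual Poisson laws, which is one practical reason to phrase infinite divisibility through the size bias equation.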
Likewise, in the absolutely continuous case, where $X$ and $Y$ have densities, if (58) holds, then $f_{X^*}$ is the convolution of $f_X$ and $f_Y$: $f_{X^*}(x) = \int_y f_X(x - y) f_Y(y)\,dy$. Combined with (4), this says that for all $x > 0$,
(67) $f_X(x) = \frac{a}{x} \int_y f_X(x - y) f_Y(y)\,dy$.

9.4. Examples. Of course, the Lévy representation (64) yields all examples of nonnegative infinitely divisible distributions. However, recognizing when a given distribution for X takes the form (63) or (64) remains a nontrivial problem. We present our favorite examples in which Theorem 9.1 provides a convenient criterion, and we will use the notation from Theorem 9.1, in particular (58).
The infinite divisibility of geometric and negative binomial distributions plays a key role in estimates comparing logarithmic combinatorial structures with their limits; see [8]. The compound Poisson representation of the geometric is the starting point for a coupling, in [3], showing that a random integer may be chosen uniformly from 1 to n, on the same probability space with a Poisson-Dirichlet process $(L_1, L_2, \ldots)$, so that if $P_i$ is the i-th largest prime factor of the random integer,$^{21}$ then $E \sum_{i \ge 1} |\log P_i - (\log n) L_i| = O(\log \log n)$. This construction is analogous to Skorohod embedding: it starts with the continuum limit process (Poisson-Dirichlet instead of Brownian motion) and constructs the nearby (in the limit) discrete random object (the random integer expressed as a product of primes, instead of a random walk) as a deterministic function of the continuum limit process, together with a small amount of auxiliary randomization.
$^{21}$ With the convention that $P_i = 1$ when i exceeds the number of prime factors, including multiplicities.
A necessary and sufficient condition for a nonnegative integer valued random variable to be infinitely divisible is given in [45], and a useful sufficient condition is given in [72]. The sufficient condition is log-convexity: the support of X is the nonnegative integers, and for all n ≥ 1, $P(X = n-1)\,P(X = n+1) \ge P(X = n)^2$. Example 9.2 shows that the sufficient condition of log-convexity is not necessary: any Poisson distribution is log-concave, rather than log-convex. See [6] for a discussion of how the sufficiency of log-convexity is perhaps attributable to Kaluza [44]. Of course, for any constant c, X is infinitely divisible if, and only if, c + X is infinitely divisible; this remark is often used with c = ±1. There are several famous discrete distributions that can be seen to be infinitely divisible via log-convexity; some examples of this type are given in [72], and two of our favorite examples are the following:
Example 9.4. The zeta distributions: For s > 1, $P(X = n) = n^{-s}/\zeta(s)$, n ≥ 1.
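As a quick check of the log-convexity criterion for the zeta distributions, here is a sketch that assumes the criterion is applied to the shifted variable X − 1, whose support is the nonnegative integers (the remark above with c = −1). For every n ≥ 2,

```latex
\frac{P(X = n-1)\,P(X = n+1)}{P(X = n)^2}
  = \frac{(n-1)^{-s}\,(n+1)^{-s}}{n^{-2s}}
  = \left( \frac{n^2}{n^2 - 1} \right)^{s} > 1 .
```

The normalizing factor ζ(s) cancels in the ratio, so the inequality does not depend on its value.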

Continuous examples.
Example 9.6. Y is exponential, with $P(Y > t) = e^{-t}$ for t ≥ 0. When a = 1, $X \stackrel{d}{=} Y$, and $X^* \stackrel{d}{=} X + Y$ is the sum of two independent copies of X, as observed in Section 3 on the waiting time paradox. For positive integers a, X is the time of the a-th arrival in a standard Poisson process. For general a > 0, X has the Gamma distribution, with shape parameter a.
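For this Gamma case, the size bias relation of the absolutely continuous case, $f_X(x) = \frac{a}{x} \int f_X(x-y) f_Y(y)\,dy$, can be verified directly. The following is a sketch assuming the standard Gamma density with shape a and unit scale, $f_X(x) = x^{a-1} e^{-x}/\Gamma(a)$, which is not written out in the text, together with $f_Y(y) = e^{-y}$:

```latex
\frac{a}{x} \int_0^x \frac{(x-y)^{a-1} e^{-(x-y)}}{\Gamma(a)}\, e^{-y}\, dy
  = \frac{a}{x} \cdot \frac{e^{-x}}{\Gamma(a)} \int_0^x (x-y)^{a-1}\, dy
  = \frac{a}{x} \cdot \frac{e^{-x}}{\Gamma(a)} \cdot \frac{x^{a}}{a}
  = \frac{x^{a-1} e^{-x}}{\Gamma(a)} = f_X(x) .
```

The exponential factors combine to $e^{-x}$ regardless of y, which is what makes the integral elementary here.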
In the Lévy representation (64) for the characteristic function of the Gamma random variable X, we have $\gamma(dy) = (e^{-y}/y)\,dy$. This measure γ, or the increasing process it governs, is also known as the Moran subordinator, and is used to construct the Poisson-Dirichlet process; see [46].
Example 9.7. Pareto distributions, of the form $P(X > t) = (1 + t)^{-\alpha}$, α > 0. This is the example for which Thorin [70] first developed his theory of generalized Gamma convolutions, which form a subclass of the infinitely divisible distributions for positive random variables. See [20], as well as [21].
Example 9.8. The lognormal distributions. Again, this is from Thorin in 1977 [69], and his proof is based on a generalized Gamma convolution.
Example 9.9. Distributions with a log-convex density.
Taking limits of discrete distributions on the nonnegative integers with log-convex point mass function, Sato [62, Theorem 5.1.4] shows that if X has a density f on (0, ∞), such that log f is convex on (0, ∞), then X is infinitely divisible. This also shows that the Pareto distributions are infinitely divisible!
Example 9.10. Y is uniform (0, 1), leading to Dickman's function ρ.
In (58), take Y to be the standard uniform random variable on (0, 1). Then (63) specializes to
(68) $\phi_X(u) = \exp\left( a \int_0^1 \frac{e^{iuy} - 1}{y}\, dy \right)$.
Here as always, a = E X; the choice a = 1 yields $f_X(x) = e^{-\gamma} \rho(x)$, where γ in this formula is Euler's constant (not the measure γ of (64)) and ρ is Dickman's function, of central importance in the study of integers without large prime factors; see [68] and [8, Section 4.2]. For the general case a ∈ (0, ∞), the density $f_X$ is a "convolution power of Dickman's function," normalized to be a probability density; see [40].
Up to scaling, any uniform distribution on an interval of nonnegative numbers is either the uniform on (0, 1), or else on (b, 1) for some 0 < b < 1. In (58), the seemingly small change of replacing Y uniform on (0, 1) with Y uniform on (b, 1) for some fixed b ∈ (0, 1) leads to a substantial qualitative change: X is no longer absolutely continuous, since $P(X = 0) = b^{a/(1-b)} > 0$. This computation of P(X = 0) is easy to understand as follows, starting with a unified description of Examples 9.10 and 9.11: X has mean a ∈ (0, ∞), X satisfies (58), Y is uniform on (b, 1) for some fixed b ∈ [0, 1), so (64) becomes
(70) $\phi_X(u) = \exp\left( \frac{a}{1-b} \int_b^1 \frac{e^{iuy} - 1}{y}\, dy \right)$.
In the unified description, Example 9.10 is the case b = 0, and Example 9.11 is the case 0 < b < 1. Viewing (64) as the description of X as the sum of the arrivals in the Poisson process with arrival intensity measure aγ, the special case (70) is $a\gamma(dy) = \frac{a}{1-b}\, 1(b < y < 1)\, \frac{dy}{y}$, and the expected number of arrivals in this Poisson process is $\lambda = \int_b^1 \frac{a}{1-b}\, \frac{dy}{y}$, with λ = ∞ if b = 0 and λ < ∞ if b > 0, and of course $P(X = 0) = e^{-\lambda}$. For b > 0, $\lambda = \frac{a}{1-b} \log(1/b)$, so that $e^{-\lambda} = b^{a/(1-b)}$. See [7].
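The Poisson process description for 0 < b < 1 lends itself to a direct simulation. The following is a minimal sketch under stated assumptions, not code from the paper: the function names are hypothetical, the number of arrivals is generated by counting exponential interarrival times in a unit time window, and each arrival is drawn from the density proportional to 1/y on (b, 1) by inverting its distribution function, which gives $y = b^{1-U}$ for U uniform on (0, 1).

```python
import math
import random


def expected_arrivals(a, b):
    """lambda = integral over (b, 1) of a/(1-b) dy/y; requires 0 < b < 1."""
    return a / (1.0 - b) * math.log(1.0 / b)


def zero_mass(a, b):
    """P(X = 0) = exp(-lambda) = b ** (a / (1 - b))."""
    return b ** (a / (1.0 - b))


def sample_X(a, b, rng):
    """One draw of X: the sum of the arrivals of the Poisson process."""
    lam = expected_arrivals(a, b)
    # Poisson(lam) count: points of a rate-lam process falling in [0, 1].
    n = 0
    t = rng.expovariate(lam)
    while t <= 1.0:
        n += 1
        t += rng.expovariate(lam)
    # Each arrival has density proportional to 1/y on (b, 1);
    # inverse-CDF sampling gives y = b ** (1 - U).
    return sum(b ** (1.0 - rng.random()) for _ in range(n))
```

With a seeded `random.Random` instance passed as `rng`, the empirical fraction of draws equal to zero can be compared with `zero_mass(a, b)`, and the sample mean with a.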
The size bias equation, which was (69) for the case b = 0, is more complicated with 0 < b < 1: the distribution of X has point mass $b^{a/(1-b)} > 0$ at 0, and a defective density $f_X$ whose support is $\bigcup_{k \ge 1} [kb, k]$. The size bias equation obtained by combining (4) with (58) takes the form: for x > 0, (71) f