Applications of size biased couplings for concentration of measures

Let $Y$ be a nonnegative random variable with mean $\mu$ and finite positive variance $\sigma^2$, and let $Y^s$, defined on the same space as $Y$, have the $Y$ size biased distribution, that is, the distribution characterized by $$ E[Yf(Y)]=\mu E f(Y^s) \quad \mbox{for all functions $f$ for which these expectations exist.} $$ Under a variety of conditions on the coupling of $Y$ and $Y^s$, including combinations of boundedness and monotonicity, concentration of measure inequalities such as $$ P\left(\frac{Y-\mu}{\sigma}\ge t\right)\le \exp\left(-\frac{t^2}{2(A+Bt)}\right) \quad \mbox{for all $t \ge 0$} $$ are shown to hold for some explicit $A$ and $B$ in \cite{cnm}. Such concentration of measure results are applied to a number of new examples: the number of relatively ordered subsequences of a random permutation, sliding window statistics including the number of $m$-runs in a sequence of coin tosses, the number of local maxima of a random function on a lattice, the number of urns containing exactly one ball in an urn allocation model, and the volume covered by the union of $n$ balls placed uniformly over a volume $n$ subset of $\mathbb{R}^d$.


Introduction
Theorem 1.1, from [8], demonstrates that the existence of a bounded size bias coupling to a nonnegative random variable Y implies bounds on the concentration of the distribution of Y. In this work we explore a spectrum of new consequences of Theorem 1.1. The couplings required here, which yield concentration of measure results for Y, are to a random variable having the size biased distribution of Y, denoted Y^s. Size biasing a random variable is essentially sampling it proportionally to its size, and is a well known phenomenon in the literature of both probability and statistics; see, for example, the waiting time paradox in Feller [7], Section I.4, and the method of constructing unbiased ratio estimators in [16]. Size biased couplings are used in Stein's method for normal approximation (see, for instance, [11], [9] and [5]), a method which in some sense parallels the exchangeable pair technique. In fact, these two techniques are somewhat complementary, with size biasing useful for the approximation of distributions of nonnegative random variables such as counts, and the exchangeable pair for mean zero variates. Recently, the objects of Stein's method have also proved successful in deriving concentration of measure inequalities, that is, bounds on deviation probabilities of the form P(|Y − E(Y)| ≥ t√Var(Y)), where typically one seeks bounds that decay exponentially in t; see [14] for a detailed overview of the concentration of measure literature. As far as techniques related to Stein's method are concerned, Raič [22] obtained large deviation bounds for certain graph related statistics using the Stein equation (see [24]) along with the Cramér transform. Chatterjee [3] derived Gaussian and Poisson type tail bounds for Hoeffding's combinatorial CLT and for the net magnetization in the Curie-Weiss model of statistical physics using the exchangeable pair of Stein's method (see [24]).
Considering the complementary method, Ghosh and Goldstein [8] proved Theorem 1.1, which relies on the existence of bounded size bias couplings. Here we demonstrate the broad range of applicability of Theorem 1.1 by presenting a variety of examples. First recall that for a given nonnegative random variable Y with finite nonzero mean µ, we say that Y^s has the Y-size biased distribution if

E[Y f(Y)] = µ E[f(Y^s)] for all functions f for which these expectations exist. (1)
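As a quick numerical sanity check of this characterization (our illustration, not part of the original text), recall the standard fact that when Y is Poisson(λ), Y^s has the distribution of Y + 1. The following Python sketch verifies E[Y f(Y)] = µ E[f(Y^s)] for f(y) = y² by direct summation of the Poisson series.

```python
import math

def poisson_pmf(k, lam):
    """P(Y = k) for Y ~ Poisson(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def size_bias_identity(lam, f, cutoff=100):
    """Return (E[Y f(Y)], mu * E[f(Y^s)]) for Y ~ Poisson(lam),
    using the fact that Y^s = Y + 1 in distribution."""
    lhs = sum(k * f(k) * poisson_pmf(k, lam) for k in range(cutoff))
    rhs = lam * sum(f(k + 1) * poisson_pmf(k, lam) for k in range(cutoff))
    return lhs, rhs

lhs, rhs = size_bias_identity(3.0, lambda y: y * y)
print(lhs, rhs)  # both equal E[Y^3] = 57 for lam = 3
```

Here both sides reduce to E[Y³] = λ³ + 3λ² + λ = 57 for λ = 3, so the agreement can be checked against a closed form as well.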
Theorem 1.1. Let Y be a nonnegative random variable with mean µ and variance σ², both finite and positive, and suppose Y^s, having the Y-size biased distribution, can be coupled to Y so that |Y^s − Y| ≤ C almost surely for some constant C > 0. If Y^s ≥ Y with probability one, then

P((Y − µ)/σ ≤ −t) ≤ exp(−t²/(2A)) for all t > 0, where A = Cµ/σ². (2)

If the moment generating function m(θ) = E(e^{θY}) is finite at θ = 2/C, then

P((Y − µ)/σ ≥ t) ≤ exp(−t²/(2(A + Bt))) for all t > 0, where A = Cµ/σ² and B = C/(2σ). (3)

In typical examples the variable Y is indexed by n, and the ones we consider have the property that the ratio µ/σ² remains bounded as n → ∞, while C does not depend on n. In such cases the bound in (2) decreases at rate exp(−ct²) for some c > 0, and if σ → ∞ as n → ∞, the bound in (3) is of similar order, asymptotically.
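To illustrate the theorem numerically (an illustration of ours with hypothetical parameters, not an example from the paper), take Y a sum of n independent Bernoulli(p) variables. Size biasing a summand chosen proportionally to its mean sets that summand to 1, so Y^s ≥ Y and |Y^s − Y| ≤ 1, and the theorem applies with C = 1. The sketch below computes the right tail bound (3) and compares it with a Monte Carlo estimate of the tail.

```python
import math
import random

def tail_bound(t, mu, sigma2, C):
    """Right tail bound exp(-t^2 / (2(A + Bt))) with
    A = C*mu/sigma^2 and B = C/(2*sigma)."""
    A = C * mu / sigma2
    B = C / (2.0 * math.sqrt(sigma2))
    return math.exp(-t * t / (2.0 * (A + B * t)))

n, p, C = 100, 0.3, 1.0
mu, sigma2 = n * p, n * p * (1 - p)
sigma = math.sqrt(sigma2)

t = 2.0
bound = tail_bound(t, mu, sigma2, C)

# Monte Carlo estimate of P((Y - mu)/sigma >= t) for Y ~ Bin(n, p)
rng = random.Random(0)
reps = 20000
hits = sum(
    sum(rng.random() < p for _ in range(n)) >= mu + t * sigma
    for _ in range(reps)
)
print(hits / reps, bound)  # empirical tail lies below the bound
```

For these parameters the bound at t = 2 is about 0.30, while the true tail is roughly 0.02; the bound is not sharp, but it holds uniformly in t with explicit constants.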
In [8], the number of lightbulbs switched on at the terminal time of the lightbulb process was shown to satisfy the hypotheses of Theorem 1.1, and concentration of measure inequalities were obtained. In Section 3 we apply Theorem 1.1 to the number of relatively ordered subsequences of a random permutation, sliding window statistics including the number of m-runs in a sequence of coin tosses, the number of local maxima of a random function on the lattice, the number of urns containing exactly one ball in the uniform urn allocation model, and the volume covered by the union of n balls placed uniformly over a volume n subset of ℝ^d. In Section 2, we review the methods of [11] for the construction of size bias couplings in the presence of dependence, and then move to the examples.

Construction of size bias couplings
In this section we review the discussion in [11], which gives a procedure for constructing size bias couplings when Y is a sum; the method has its roots in the work of Baldi et al. [1]. The construction depends on being able to size bias a collection of nonnegative random variables in a given coordinate, as described in Definition 2.1. Letting F be the distribution of Y, first note that the characterization (1) of the size bias distribution F^s is equivalent to the specification of F^s by its Radon-Nikodym derivative

dF^s(y) = (y/µ) dF(y). (4)

Definition 2.1. Let A be an arbitrary index set and let X = {X_α : α ∈ A} be a collection of nonnegative random variables with finite, nonzero expectations EX_α = µ_α and joint distribution dF(x). For β ∈ A, we say that X^β = {X^β_α : α ∈ A} has the X distribution biased in direction β if

dF^β(x) = (x_β/µ_β) dF(x).

Just as (4) is related to (1), the random vector X^β has the X distribution biased in coordinate β if and only if

E[X_β f(X)] = µ_β E[f(X^β)]

for all functions f for which these expectations exist.
Letting f(X) = g(X_β) for some function g one recovers (1), showing that the β-th coordinate of X^β, that is, X^β_β, has the X_β size biased distribution. The factorization of the joint distribution of X suggests a way to construct X^β. To construct X itself, one may first generate X_β, a variable with distribution P(X_β ∈ dx), and then, given X_β = x, generate the remaining variates {X_α, α ≠ β} with distribution P(X ∈ dx | X_β = x). Now, by the factorization of dF(x), we have

dF^β(x) = (x_β/µ_β) dF(x) = P(X ∈ dx | X_β = x_β) (x_β/µ_β) P(X_β ∈ dx_β). (5)

Hence, to generate X^β with distribution dF^β, first generate a variable X^β_β with the X_β size biased distribution; then, when X^β_β = x, generate the remaining variables according to their original conditional distribution given that the β-th coordinate takes on the value x. Definition 2.1 and the following special case of a proposition from Section 2 of [11] will be applied in the subsequent constructions; the reader is referred there for the simple proof.
Proposition 2.1. Let A be an index set, let X = {X_α : α ∈ A} be a collection of nonnegative random variables with finite means µ_α = EX_α, and set Y = Σ_{α∈A} X_α and µ = EY, assumed finite and nonzero. Let I be a random index, independent of all other variables, with distribution P(I = β) = µ_β/µ, β ∈ A. Then if X^I has the mixture distribution Σ_{β∈A} P(I = β) ℒ(X^β), the variable Y^s = Σ_{α∈A} X^I_α has the Y-size biased distribution as in (1).
In our examples we use Proposition 2.1, the random index I, and (5) to obtain Y^s by first generating X^I_I with the size bias distribution of X_I, then, if X^I_I = x, generating the remaining variables {X^I_α, α ≠ I} from their original conditional distribution given that coordinate I takes on the value x.
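As an exact check of this construction (ours, for illustration), consider independent Bernoulli summands, where biasing in direction β simply sets X_β = 1 and leaves the remaining coordinates with their original distribution. The sketch below enumerates all outcomes for three Bernoulli variables with unequal, hypothetical success probabilities and verifies that Y^s built via the random index I of Proposition 2.1 satisfies the characterization (1).

```python
from itertools import product

p = [0.2, 0.5, 0.7]  # hypothetical success probabilities
mu = sum(p)

def prob(x):
    """Probability of outcome x under independent Bernoulli(p) coordinates."""
    out = 1.0
    for xi, pi in zip(x, p):
        out *= pi if xi == 1 else 1 - pi
    return out

def f(y):  # an arbitrary test function
    return y * y

# Left side of (1): E[Y f(Y)]
lhs = sum(sum(x) * f(sum(x)) * prob(x) for x in product((0, 1), repeat=3))

# Right side: mu * E[f(Y^s)], with Y^s from Proposition 2.1:
# choose I = beta with probability p[beta]/mu, set coordinate beta to 1
# (the size biased Bernoulli), keep the other independent coordinates.
rhs = 0.0
for beta in range(3):
    for x in product((0, 1), repeat=3):
        xs = list(x)
        xs[beta] = 1
        rhs += (p[beta] / mu) * prob(x) * f(sum(xs))
rhs *= mu
print(lhs, rhs)  # the two sides of (1) agree exactly
```

Summing over the original value of coordinate β correctly marginalizes it out, which is exactly the conditional step of the construction in the independent case.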

Applications
We now consider the application of Theorem 1.1 to derive concentration of measure results for the number of relatively ordered subsequences of a random permutation, the number of m-runs in a sequence of coin tosses, the number of local extrema on a graph, the number of nonisolated balls in an urn allocation model, and the covered volume in a binomial coverage process. Without further mention we will use the fact that if (2) and (3) hold for some A and B, then they also hold with these values replaced by any larger ones, again denoted A and B, and that the moment generating function of a bounded random variable is everywhere finite.

Relatively ordered sub-sequences of a random permutation
For n ≥ m ≥ 3, let π and τ be permutations of V = {1, . . . , n} and {1, . . . , m}, respectively, and let

V_α = {α, α + 1, . . . , α + m − 1}, (6)

where addition of elements of V is modulo n. We say the pattern τ appears at location α ∈ V if the values {π(v)}_{v∈V_α} and {τ(v)}_{v∈V_1} are in the same relative order. Equivalently, the pattern τ appears at α if and only if π(τ^{−1}(v) + α − 1), v ∈ V_1, is an increasing sequence. When τ = ι_m, the identity permutation of length m, we say that π has a rising sequence of length m at position α.
Rising sequences are studied in [2] in connection with card tricks and card shuffling. Letting π be chosen uniformly from all permutations of {1, . . . , n}, and X_α the indicator that τ appears at α, the sum Y = Σ_{α∈V} X_α counts the number of m-element-long segments of π that have the same relative order as τ; size bias couplings for such pattern counts were used for normal approximation in [9]. Let σ_α be the permutation of {1, . . . , m} that specifies how to reorder the values π(v), v ∈ V_α, so that they fall in the same relative order as τ; in other words, π^α is the permutation π with the values π(v), v ∈ V_α, reordered so that π^α(γ) for γ ∈ V_α are in the same relative order as τ. Now let X^α_β = X_β(π^α(v), v ∈ V_β), the indicator that τ appears at position β in the reordered permutation π^α. As π^α and π agree except perhaps for the m values in V_α, we have X^α_β = X_β for all β with V_β ∩ V_α = ∅. Hence, as at most 2m − 1 of the windows V_β intersect V_α, we may take C = 2m − 1 as the almost sure bound on the coupling of Y^s and Y.
Regarding the mean µ of Y, as all relative orders of π(v), v ∈ V_α, are equally likely, clearly for any τ,

µ = n/m!. (8)

To compute the variance, for 0 ≤ k ≤ m − 1, let I_k be the indicator that τ(1), . . . , τ(m − k) and τ(k + 1), . . . , τ(m) are in the same relative order. Clearly I_0 = 1, and for rising sequences I_k = 1 for all 0 ≤ k ≤ m − 1. For 1 ≤ k ≤ m − 1, if I_k = 0 then

E[X_α X_{α+k}] = 0,

as the joint event in this case demands two different relative orders on the segment of π of length m − k of which both X_α and X_{α+k} are a function. If I_k = 1 then a given, common, relative order is demanded for this same length of π, and relative orders also for the two segments of length k on which exactly one of X_α and X_{α+k} depend, and so, in total, a relative order on m − k + 2k = m + k values of π, and therefore

E[X_α X_{α+k}] = I_k/(m + k)!.

As the relative orders of non-overlapping segments of π are independent, now taking n ≥ 2m, the variance σ² of Y is given by

σ² = n ( (1/m!)(1 − 1/m!) + 2 Σ_{k=1}^{m−1} ( I_k/(m + k)! − 1/(m!)² ) ). (9)

Clearly Var(Y) is maximized for the identity permutation τ(k) = k, k = 1, . . . , m, as then I_k = 1 for all 1 ≤ k ≤ m − 1, and as mentioned, this case corresponds to counting the number of rising sequences. In contrast, the variance is smallest when I_k = 0 for all 1 ≤ k ≤ m − 1. Hence, the bound (3) of Theorem 1.1 holds where µ and σ² are given in (8) and (9), respectively, and C = 2m − 1.
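The mean formula (8) can be confirmed by brute force for small cases; the following sketch (ours, not from the paper) enumerates all permutations for n = 5, m = 3 and τ the identity, counting rising sequences over the cyclic windows, and recovers the average n/m! = 5/6.

```python
import math
from itertools import permutations

def pattern_count(pi, m):
    """Number of cyclic length-m windows of pi in increasing relative
    order, i.e. appearances of the identity pattern (rising sequences)."""
    n = len(pi)
    return sum(
        all(pi[(a + j) % n] < pi[(a + j + 1) % n] for j in range(m - 1))
        for a in range(n)
    )

n, m = 5, 3
total = sum(pattern_count(pi, m) for pi in permutations(range(n)))
print(total / math.factorial(n))  # equals n/m! = 5/6
```

Each of the 5 cyclic windows is increasing in exactly 1/3! of the 120 permutations, so the total count over all permutations is 5 · 120/6 = 100.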

Local Dependence
The following lemma shows how to construct a collection of variables X^α having the X distribution biased in direction α when each X_α is a function of a subset of a collection of independent random variables.
Lemma 3.1. Let {C_g : g ∈ G} be a collection of independent random variables, and for each α ∈ A let B_α ⊂ G and let X_α = X_α(C_g, g ∈ B_α) be a nonnegative random variable with a nonzero, finite expectation. If {C^α_g, g ∈ B_α} has the distribution

dP^α(c_g, g ∈ B_α) = (X_α(c_g, g ∈ B_α)/EX_α) dP(c_g, g ∈ B_α)

and is independent of {C_g, g ∈ G}, then, letting X^α_β be obtained from X_β by replacing C_g by C^α_g for g ∈ B_β ∩ B_α, that is,

X^α_β = X_β(C^α_g for g ∈ B_β ∩ B_α; C_g for g ∈ B_β \ B_α),

the collection X^α = {X^α_β, β ∈ A} has the X distribution biased in direction α. Furthermore, with I chosen proportional to EX_α, independent of the remaining variables, the sum Y^s = Σ_{β∈A} X^I_β has the Y size biased distribution, and when there exists M such that X_α ≤ M for all α,

|Y^s − Y| ≤ Mb, where b = max_{α∈A} |{β : B_β ∩ B_α ≠ ∅}|. (11)

Proof. By independence, the random variables {C^α_g, g ∈ B_α} ∪ {C_g, g ∉ B_α} have joint distribution

(X_α(c_g, g ∈ B_α)/EX_α) dP(c_g, g ∈ G).

Thus, with X^α as given, we find that for all functions f for which the expectations exist,

E[X_α f(X)] = EX_α E[f(X^α)].

That is, X^α has the X distribution biased in direction α, as in Definition 2.1. The claim on Y^s follows from Proposition 2.1, and finally, since X_β = X^α_β whenever B_β ∩ B_α = ∅,

|Y^s − Y| = |Σ_{β: B_β ∩ B_I ≠ ∅} (X^I_β − X_β)| ≤ Mb.

This completes the proof.

Sliding m window statistics
For n ≥ m ≥ 1, let V = {1, . . . , n}, considered modulo n, let {C_g : g ∈ V} be i.i.d. real valued random variables, and for each α ∈ V let V_α be as in (6). Then for X : ℝ^m → [0, 1], say, Lemma 3.1 may be applied to the sum Y = Σ_{α∈V} X_α of the m-dependent sequence X_α = X(C_α, . . . , C_{α+m−1}), formed by applying the function X to the variables in the 'm-window' V_α. As X_α ≤ 1 for all α, and each window V_α intersects at most 2m − 1 of the windows V_β, we may take C = 2m − 1 in Theorem 1.1, by Lemma 3.1. For a concrete example let Y be the number of m-runs of a sequence ξ_1, ξ_2, . . . , ξ_n of n i.i.d. Bernoulli(p) random variables with p ∈ (0, 1), given by

Y = Σ_{α=1}^n ξ_α ξ_{α+1} ··· ξ_{α+m−1},

with the periodic convention ξ_{n+k} = ξ_k. In [23], the authors develop smooth function bounds for normal approximation for Y. Note that the construction given in Lemma 3.1 is monotone in this case, as for any i, size biasing the Bernoulli variables ξ_j for j ∈ {i, . . . , i + m − 1} amounts to setting them all equal to 1, which can only increase Y, so Y^s ≥ Y. As EX_α = p^m, the mean is µ = np^m. For the variance, now letting n ≥ 2m and using the fact that non-overlapping segments of the sequence are independent,

σ² = n Σ_{j=−(m−1)}^{m−1} Cov(ξ_i ··· ξ_{i+m−1}, ξ_{i+j} ··· ξ_{i+j+m−1}) = n ( p^m(1 − p^m) + 2 Σ_{j=1}^{m−1} (p^{m+j} − p^{2m}) ).
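The mean and variance displayed above can be checked exactly by enumeration for small n; this sketch (ours, not from the paper) does so for n = 6, m = 2, p = 1/2, where µ = np^m = 3/2 and σ² = n(p²(1 − p²) + 2(p³ − p⁴)) = 15/8.

```python
from itertools import product

n, m = 6, 2

def runs_count(xi):
    """Y = number of cyclic positions where m consecutive bits are all 1."""
    return sum(
        all(xi[(a + j) % n] for j in range(m))
        for a in range(n)
    )

# enumerate all 2^n equally likely sequences (the p = 1/2 case)
counts = [runs_count(xi) for xi in product((0, 1), repeat=n)]
mean = sum(counts) / len(counts)
var = sum(c * c for c in counts) / len(counts) - mean ** 2
print(mean, var)  # 1.5 and 1.875
```

Each of the 6 cyclic products equals 1 in a quarter of the 64 sequences, giving the mean 6/4; the variance matches the two-term covariance sum since only adjacent windows overlap when m = 2.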

Local extrema on a lattice
Let G = (V, E) be a given graph, and for every v ∈ V let V_v ⊂ V be a collection of vertices depending on v; we think of V_v as a 'neighborhood' of the vertex v. Let {C_g, g ∈ V} be a collection of independent and identically distributed random variables, and let X_v be the indicator that vertex v corresponds to a local maximum value with respect to the neighborhood V_v, that is,

X_v = 1(C_v > C_w for all w ∈ V_v \ {v}).

The sum Y = Σ_{v∈V} X_v counts the number of local maxima. Size biased couplings to Y, for the purpose of normal approximation, were studied in [1] and [9]. In general one may define the neighbor distance d between two vertices v, w ∈ V by

d(v, w) = min{n : there exist v_0, . . . , v_n in V with v_0 = v, v_n = w and {v_k, v_{k+1}} ∈ E for k = 0, . . . , n − 1},

and, for r ∈ ℕ, the r-neighborhood of v ∈ V, consisting of vertices at distance at most r from v,

V_v(r) = {w ∈ V : d(w, v) ≤ r}.

We consider the case where there is some r such that V_v = V_v(r) for all v ∈ V. For example, for p ∈ {1, 2, . . .} and n ≥ 5, consider the lattice V = {1, . . . , n}^p modulo n in ℤ^p, with E = {{v, w} : v and w differ by +1 or −1 (mod n) in exactly one coordinate}. To consider the case where we call vertex v a local maximum if C_v exceeds the values C_w over the immediate neighbors w of v, we take r = 1 and obtain V_v = V_v(1), with |V_v(1)| = 1 + 2p, the 1 accounting for v itself, and 2p for the number of neighbors at distance 1 from v, which differ from v by either +1 or −1 in exactly one coordinate. Lemma 3.1, (11), and |X_v| ≤ 1 yield

C = |V_v(2)| = 1 + 2p + (2p + 2p(p − 1)), (12)

where the 1 counts v itself, the 2p again are the neighbors at distance 1, and the term in parentheses accounts for the neighbors at distance 2, 2p of them differing in exactly one coordinate by +2 or −2, and 2p(p − 1) of them differing by either +1 or −1 in exactly two coordinates. Note that we have used the assumption n ≥ 5 here, and continue to do so below. Now letting C_v have a continuous distribution, without loss of generality we may assume C_v ∼ U[0, 1].
As any vertex has chance 1/|V_v| = 1/(2p + 1) of having the largest value in its neighborhood, µ = EY satisfies

µ = n^p/(2p + 1). (13)

To begin the calculation of the variance, note that when v and w are neighbors they cannot both be maxima, so X_v X_w = 0 and therefore, for d(v, w) = 1,

Cov(X_v, X_w) = −1/(2p + 1)².

If the distance between v and w is 3 or more, X_v and X_w are functions of disjoint sets of independent variables, and hence are independent.
When d(w, v) = 2 there are two cases, as v and w may have either k = 1 or k = 2 neighbors in common. Let m denote the number of vertices over which each of v and w is extreme, so m = 2p, and let U_1, . . . , U_m denote the uniform values in the neighborhood of v, the last k of them shared with the neighborhood of w. For k = 1, 2, . . ., letting M_k = max{U_{m−k+1}, . . . , U_m}, the variables X_v and X_w are conditionally independent given U_{m−k+1}, . . . , U_m, and

P(X_v = 1 | M_k) = (1 − M_k^{m−k+1})/(m − k + 1), (14)

with the same expression holding for X_w. Hence, averaging (14) over U_{m−k+1}, . . . , U_m yields

E(X_v X_w) = E[P(X_v = 1 | M_k) P(X_w = 1 | M_k)] = 2/((m + 1)(2(m + 1) − k)).

For n ≥ 3, when m = 2p, for k = 1 and 2 we obtain

Cov(X_v, X_w) = 1/((2p + 1)²(2(2p + 1) − 1)) and Cov(X_v, X_w) = 2/((2p + 1)²(2(2p + 1) − 2)),

respectively.
For n ≥ 5, of the 2p + 2p(p − 1) vertices w that are at distance 2 from v, 2p of them share 1 neighbor in common with v, while the remaining 2p(p − 1) of them share 2 neighbors. Hence, writing s = 2p + 1,

σ² = n^p ( (1/s)(1 − 1/s) − 2p/s² + 2p/(s²(2s − 1)) + 2p(p − 1) · 2/(s²(2s − 2)) ). (15)

We conclude that the bound (3) of Theorem 1.1 holds with A = Cµ/σ² and B = C/(2σ), with µ, σ² and C given by (13), (15) and (12), respectively, that is,

P((Y − µ)/σ ≥ t) ≤ exp(−t²/(2(A + Bt))) for all t > 0.
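The mean formula (13) is easy to verify by enumeration in the one dimensional case p = 1 (our illustration, not from the paper): on a cycle, each vertex is a local maximum with probability 1/(2p + 1) = 1/3, since only the relative order of the three values in its neighborhood matters.

```python
from itertools import permutations

def local_maxima(vals):
    """Number of positions strictly larger than both cyclic neighbours."""
    n = len(vals)
    return sum(
        vals[i] > vals[(i - 1) % n] and vals[i] > vals[(i + 1) % n]
        for i in range(n)
    )

n = 5
total = sum(local_maxima(pi) for pi in permutations(range(n)))
print(total / 120)  # mean number of local maxima, n/(2p+1) = 5/3
```

Enumerating all 120 relative orders stands in for integrating over continuous i.i.d. values, which almost surely produce no ties.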

Urn allocation
In the classical urn allocation model n balls are thrown independently into one of m urns, where, for i = 1, . . . , m, the probability a ball lands in the i-th urn is p_i, with Σ_{i=1}^m p_i = 1. A much studied quantity is the number of nonempty urns, for which Kolmogorov distance bounds to the normal were obtained in [6] and [21]. In [6], bounds were obtained for the uniform case where p_i = 1/m for all i = 1, . . . , m, while the bounds in [21] hold for the nonuniform case as well. In [19] the author considers the normal approximation for the number of isolated balls, that is, the number of urns containing exactly one ball, and obtains Kolmogorov distance bounds to the normal. Using the coupling provided in [19], we derive right tail inequalities for the number of non-isolated balls, or, equivalently, left tail inequalities for the number of isolated balls. For i = 1, . . . , n let X_i denote the location of ball i, that is, the number of the urn into which ball i lands. The number Y of non-isolated balls is then given by

Y = Σ_{i=1}^n 1(X_j = X_i for some j ≠ i).

We first consider the uniform case. A construction in [19] produces a coupling of Y to Y^s, having the Y size biased distribution, which satisfies |Y^s − Y| ≤ 2. Given a realization of X = (X_1, X_2, . . . , X_n), the coupling proceeds by first selecting a ball I uniformly from {1, 2, . . . , n}, independently of X. Depending on the outcome of a Bernoulli variable B, whose distribution depends on the number M_I of other balls found in the urn containing ball I, a different ball J may be imported into the urn that contains ball I. In some additional detail, let B be a Bernoulli variable with success probability P(B = 1) = π_{M_I}, where the probabilities π_k, whose explicit form is given in [19], are determined by the distribution of N ∼ Bin(n − 1, 1/m). Now let J be uniformly chosen from {1, 2, . . . , n} \ {I}, independent of all other variables. Lastly, if B = 1, move ball J into the urn containing ball I. It is clear that |Y^s − Y| ≤ 2, as the movement of a single ball can affect the occupancy of at most two urns.
We also note that if M_I = 0, which happens when ball I is isolated, then π_0 = 1, so that ball I is no longer isolated after ball J is relocated. We refer the reader to [19] for a full proof that this procedure produces a coupling of Y to a variable with the Y size biased distribution. For the uniform case, explicit formulas for µ and σ² can be found in Theorem II.1.1 of [13]; the mean, for instance, is given by

µ = n(1 − (1 − 1/m)^{n−1}). (16)

Hence, with these µ and σ², the bound (3) of Theorem 1.1 holds with A = 2µ/σ² and B = 1/σ. In the nonuniform case similar results hold under some additional conditions. Letting ||p|| = sup_{1≤i≤m} p_i and γ = γ(n) = max(n||p||, 1), it is shown in [19] that when ||p|| ≤ 1/11 and n ≥ 83γ²(1 + 3γ + 3γ²)e
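In the uniform case the mean in (16) can be confirmed by exhaustive enumeration for small n and m; the sketch below (ours, not from the paper) checks that with n = 4 balls and m = 3 urns the expected number of isolated balls is n(1 − 1/m)^{n−1} = 32/27, so the mean number of non-isolated balls is n minus that quantity.

```python
from itertools import product

n_balls, n_urns = 4, 3

total_isolated = 0
for alloc in product(range(n_urns), repeat=n_balls):
    # a ball is isolated exactly when it is alone in its urn
    total_isolated += sum(alloc.count(u) == 1 for u in alloc)

mean_isolated = total_isolated / n_urns ** n_balls
mean_non_isolated = n_balls - mean_isolated
print(mean_isolated, mean_non_isolated)  # 32/27 and 4 - 32/27
```

Each ball is isolated precisely when the other n − 1 balls avoid its urn, an event of probability (1 − 1/m)^{n−1}, which the enumeration over all 3⁴ equally likely allocations reproduces exactly.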

An application to coverage processes
We consider the following coverage process, and associated coupling, from [10]. Given a collection U = {U_1, U_2, . . . , U_n} of independent, uniformly distributed points in the d-dimensional torus of volume n, that is, the cube C_n = [0, n^{1/d})^d ⊂ ℝ^d with periodic boundary conditions, let V denote the total volume of the union of the n balls of fixed radius ρ centered at these n points, and S the number of balls isolated at distance ρ, that is, those points for which none of the other n − 1 points lie within distance ρ. The random variables V and S are of fundamental interest in stochastic geometry; see [12] and [18]. If n → ∞ and ρ remains fixed, both V and S satisfy a central limit theorem [12, 17, 20]. The L¹ distance of V, properly standardized, to the normal is studied in [4] using Stein's method. The quality of the normal approximation to the distributions of both V and S, in the Kolmogorov metric, is studied in [10] using Stein's method via size bias couplings.
In more detail, for x ∈ C_n and r > 0 let B_r(x) denote the ball of radius r centered at x, and let B_{i,r} = B_r(U_i). The covered volume V and number of isolated balls S are given, respectively, by

V = Vol( ∪_{i=1}^n B_{i,ρ} ) and S = Σ_{i=1}^n 1( U_j ∉ B_{i,ρ} for all j ≠ i ).

We will derive concentration of measure inequalities for V and S with the help of the bounded size bias couplings in [10]. Assume d ≥ 1 and n ≥ 4. Denote the mean and variance of V by µ_V and σ²_V, respectively, and likewise for S, leaving their dependence on n and ρ implicit. Let π_d = π^{d/2}/Γ(1 + d/2), the volume of the unit ball in ℝ^d, and for fixed ρ let φ = π_d ρ^d. For 0 ≤ r ≤ 2 let ω_d(r) denote the volume of the union of two unit balls with centers r units apart. We have ω_1(r) = 2 + r, and

ω_d(r) = π_d + π_{d−1} ∫_0^r (1 − (t/2)²)^{(d−1)/2} dt for d ≥ 2.
From [10], the means of V and S are given by

µ_V = n(1 − (1 − φ/n)^n) and µ_S = n(1 − φ/n)^{n−1}, (18)

and their variances σ²_V and σ²_S are likewise given explicitly in [10]. It is shown there, by using a coupling similar to the one briefly described for the urn allocation problem in Section 3.3, that one can construct V^s with the V size bias distribution which satisfies |V^s − V| ≤ φ. Hence the bound (3) of Theorem 1.1 holds for V with

A = φµ_V/σ²_V and B = φ/(2σ_V),

where µ_V is given in (18) and σ²_V is as above. Similarly, with Y = n − S the number of non-isolated balls, it is shown that Y^s with the Y size bias distribution can be constructed so that
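As a final illustration (ours, not from the paper), the formula for µ_S in (18) can be checked by simulation in dimension d = 1, where φ = 2ρ; the sketch below estimates ES for points on a one dimensional torus and compares the estimate with n(1 − φ/n)^{n−1}.

```python
import random

def isolated_count(n, rho, rng):
    """Number of points, among n uniform points on a 1-d torus of
    length n, having no other point within distance rho."""
    pts = [rng.uniform(0, n) for _ in range(n)]
    def dist(x, y):
        d = abs(x - y)
        return min(d, n - d)  # toroidal distance
    return sum(
        all(dist(pts[i], pts[j]) > rho for j in range(n) if j != i)
        for i in range(n)
    )

rng = random.Random(42)
n, rho = 10, 0.5
phi = 2 * rho  # volume of a radius-rho ball in d = 1
reps = 10000
est = sum(isolated_count(n, rho, rng) for _ in range(reps)) / reps
mu_S = n * (1 - phi / n) ** (n - 1)
print(est, mu_S)
```

For each point, every other point independently lands within distance ρ with probability φ/n, which is exactly what the closed form µ_S = n(1 − φ/n)^{n−1} encodes.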