Large Deviations for the Empirical Distribution in the Branching Random Walk

We consider the branching random walk on the real line where the underlying motion is of a simple random walk and branching is at least binary and at most decaying exponentially in law. It is well known that the normalized empirical measure converges to the Gaussian distribution for typical sets A. We therefore analyze the probability that at step n the empirical distribution differs from the Gaussian distribution by a constant \epsilon. We show that the decay is doubly exponential in either n or \sqrt{n}, depending on the set A and \epsilon, and we find the leading coefficient in the top exponent. To the best of our knowledge, this is the first time such large deviation probabilities are treated in this model.


Introduction and Results
In this work we analyze the decay of probabilities of certain unlikely deviation events involving the Branching Random Walk (henceforth BRW). As far as we know, very little has been done in this direction, although, after optimal law of large numbers and central limit theorem type results have been fully obtained, both the question and the events we consider seem to us natural and fundamental. To fix notation and context, we begin by briefly describing the model (1.1) and giving a short account of some of the relevant results in its analysis (1.2). A precise statement of the contribution in this paper then follows (1.3), and finally the idea in the proof of the main theorem is conveyed (1.4). Complete proofs for all statements are given in Section 2.
1.1. Setup. The BRW model traces the evolution by means of reproduction and motion of a population of particles on the real line, carried out synchronously in discrete steps or generations. We denote by $Z_n$ (henceforth the particles measure) the population at time $n = 0, 1, \dots$, which we describe as a point measure on $\mathbb{R}$ with a mass 1 per particle. The process is formally defined as follows. Initially there is a single particle at the origin, $Z_0 = \delta_0$. It evolves in one generation to a random point measure $Z_1$. Although one may consider any law for $Z_1$, often, and in this paper as well, attention is restricted to evolution by means of independent reproduction and motion. That is, $Z_1$ is realized by the particle giving birth to a random number of descendants, dying, and then all descendants, independently of each other and of their number, moving according to some common spatial distribution $F$.
At any further generation $n \ge 2$ we have (conditioned on $Z_{n-1}$) $$Z_n = \sum_{x \in Z_{n-1}} Z^x_1, \tag{1}$$ where $Z^x_1(\cdot)$ has the same distribution as $Z_1(\cdot - x)$ and $\{Z^x_1 : x \in Z_{n-1}\}$ are independent. Here and later, for a point measure $\zeta$ with integer masses, we write $x \in \zeta$ iff $x$ is an atom of $\zeta$, that is if $\zeta(x) := \zeta(\{x\}) > 0$. We use $(x : x \in \zeta)$ for the multi-set of atoms of $\zeta$, where each atom $x$ is repeated $\zeta(x)$ times. Moreover, if this multi-set is used as an index set (as above), different copies of the same atom are considered different indices.
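The recursive definition above can be made concrete with a small simulation. The following is a minimal sketch, assuming a hypothetical offspring law that is uniform on {2, 3} (chosen only for illustration) and simple random walk steps; the names `brw_step`, `simulate_brw` and `empirical_fraction` are not from the paper. A particles measure is stored as a map from positions to masses.

```python
import random
from collections import Counter


def brw_step(z, rng):
    """Advance one generation: every particle dies and is replaced by
    2 or 3 children (hypothetical offspring law), each of which moves
    independently by +1 or -1 from the parent's position."""
    nxt = Counter()
    for x, mass in z.items():
        for _ in range(mass):  # atoms are repeated according to their mass
            children = rng.choice((2, 3))
            for _ in range(children):
                nxt[x + rng.choice((-1, 1))] += 1
    return nxt


def simulate_brw(n, seed=0):
    """Return the particles measure after n generations, started from delta_0."""
    rng = random.Random(seed)
    z = Counter({0: 1})
    for _ in range(n):
        z = brw_step(z, rng)
    return z


def empirical_fraction(z, n, lo, hi):
    """Fraction of the population located in sqrt(n) * (lo, hi]."""
    scale = n ** 0.5
    total = sum(z.values())
    inside = sum(mass for x, mass in z.items() if lo * scale < x <= hi * scale)
    return inside / total


if __name__ == "__main__":
    n = 10
    z = simulate_brw(n, seed=1)
    print("population size:", sum(z.values()))
    print("fraction in sqrt(n)*(-1, 1]:", empirical_fraction(z, n, -1.0, 1.0))
```

Because the population grows exponentially, this direct approach is only practical for a small number of generations; it is meant to illustrate the objects involved, not to estimate the rare-event probabilities studied here.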
Despite the old age of this model it is still quite central in pure and applied probability. It remains a popular model for describing and analyzing phenomena in various applied disciplines, such as biology, population dynamics and computer science. At the same time, due to the fundamentality of the stochastic dynamics it captures, it is frequently found in various seemingly unrelated mathematical models (e.g. the Gaussian Free Field [11], Interacting Particle System [18]). Finally, there are aspects of the model which are still not understood or only beginning to be understood now (e.g. its extremal process [2]). For the classical theory of BRW, we direct the reader to the survey by Ney [19] and the books by Révész [21] and Harris [13].
1.2. Known Results. Since the population-size process $(|Z_n|)_{n \ge 0} = (Z_n(\mathbb{R}))_{n \ge 0}$ is a standard Galton-Watson process, it is well known that once reproduction is super-critical, $$\beta := E|Z_1| > 1, \tag{2}$$ and assuming $$E\,|Z_1| \log^+ |Z_1| < \infty, \tag{3}$$ then for the normalized particles measure $\tilde Z_n = \beta^{-n} Z_n$ we have almost surely $$\lim_{n \to \infty} |\tilde Z_n| = |\tilde Z|, \tag{4}$$ where $|\tilde Z|$ is some non-negative random variable with $E|\tilde Z| = 1$. The optimal version of this theorem is due to Kesten and Stigum [17]. If $\beta \le 1$, the population dies out with probability 1; hence from now on, we shall assume (2). When displacement is considered as well, an analogous result to the above, conjectured by Harris [13], first proved by Stam [22], and then proved under optimal conditions by Kaplan [16], is $$\lim_{n \to \infty} \tilde Z_n(\sqrt{n}A) = \nu(A)\,|\tilde Z| \quad \text{a.s.}, \tag{5}$$ for all $$A \in \mathcal{A}_0 := \{(-\infty, x] : x \in \mathbb{R}\}. \tag{6}$$ Here $\nu$ is the standard Gaussian measure on $\mathbb{R}$, and the assumptions are (2), (3) for branching, and zero mean and unit variance for the motion, that is $$\int x \, dF(x) = 0, \qquad \int x^2 \, dF(x) = 1. \tag{7}$$ Combining (4) and (5) and denoting the empirical particles distribution by $\bar Z_n = Z_n/|Z_n|$, we have $$\lim_{n \to \infty} \bar Z_n(\sqrt{n}A) = \nu(A) \quad \text{a.s. on the event of survival, for all } A \in \mathcal{A}_0. \tag{8}$$ Once leading order asymptotics (4), (5) have been obtained, second-order terms, or the question of the rate of the convergence, can be approached. For the population size, Heyde [14] has shown that under $E|Z_1|^2 < \infty$, for some (explicit) $\alpha_0 > 0$, as $n \to \infty$ For the particles measures, more recently Chen [12] has proved that for all as $n \to \infty$, where $\alpha_1 > 0$, $\phi_1(\cdot)$ is a bounded function, and $M$ is some random variable, all explicitly defined. In the case he considered, motion is of a simple random walk and branching admits the same assumptions as in Heyde's.
Having settled the main questions in the "typical deviations" regime, it is natural to turn to the regime of atypical or large deviations. Results here are not as abundant. For $|Z_n|$, Athreya [3] has considered the following probabilities: for $\Delta > 0$ and under the assumptions of exponential moments and $|Z_1| \ge 1$. If $p := P(|Z_1| = 1) > 0$, he showed that the probability on the left is for some explicitly defined $\lambda_0(\Delta) > 0$ and otherwise, it is at most where $b$ is the first integer for which $P(|Z_1| = b) > 0$ and $\lambda_1(\Delta), \alpha_1(\Delta) > 0$. For the probability on the right, he obtained the bound Above $C, C' > 0$ are some universal constants. See also [20]. A different kind of atypicality is treated by Jones [15] and Biggins and Bingham [7], who investigate the left and right tails of $|\tilde Z|$.
For the BRW, much effort has been directed into estimating the number of particles which deviate linearly away from the mean displacement in the underlying motion. It is a classical result by Biggins [6] that for any $A \in \mathcal{A}_0$, if the r.h.s. is positive and otherwise $Z_n(nA) \to 0$ a.s. Here $\Lambda^*$ is the Legendre-Fenchel transform of $\Lambda(\theta) = \log E \int e^{\theta x} \, dZ_1(x)$, which is assumed to be finite. This can also be used to obtain the speed of the left (or right) most particle as $\inf\{x : \Lambda^*(x) < 0\}$, although to obtain sharper results, different methods have been used (cf. Bramson [9,10], and Addario-Berry and Reed [1]). Perhaps closest to the type of large-deviation analysis we do here is the result by Athreya and Kang in [4], where instead of a motion in $\mathbb{R}$, particles move according to some positive-recurrent Markov chain with invariant measure $\pi$. Along with a local version of (8), they find that the probability that at time $n$ the fraction of particles at state $s$ is at least $\Delta > 0$ away from $\pi(s)$ decays exponentially as $\lambda(\Delta) p^n$ for some explicit $\lambda(\Delta) > 0$ and with $p$ as in (12), which is assumed to be positive. Nevertheless, this is still quite far from what we do here. First, random walk is typically null recurrent (unless degenerate). Second, there is no spatial component (e.g. CLT-type phenomenon) to their problem. Third, we in fact assume $p_1 = 0$ and thus obtain very different decay scales.
1.3. New Results. In this work we analyze large deviation probabilities of the form: for some ∆ > 0. In light of (8), the above clearly decays in n and we aim to understand how fast.
Assumptions. We make the following assumptions. For branching, we shall assume that $|Z_1|$ is non-deterministic, that $E e^{\theta |Z_1|} < \infty$ for $\theta$ in some neighborhood of 0 and that $P(|Z_1| \ge 2) = 1$. The last condition guarantees that exponential growth of the population size is unavoidable. Although the case of $P(|Z_1| \ge 2) < 1$ is an interesting problem, it is of a different nature as it permits using strategies which suppress the branching in order to realize large deviation events. This will result in a different scale for the decay in (16). For the underlying motion, we shall assume simple random walk steps. The precise step distribution will not change the result, as long as it has mean zero and bounded or sufficiently decaying tails. Again, allowing for steps with fat tails would have given rise to strategies which exploit these tails for achieving the unlikely events, resulting again in a problem of a different nature and a different scale for the decay of (16).
We are now ready to state our main result. Let $\mathcal{A}$ be the algebra generated by $\mathcal{A}_0$ (defined in (6)). For $A \in \mathcal{A}$ non-empty and $p \in (0,1)$ define and with $b = \min\{k : P(|Z_1| = k) > 0\}$, Then, as $n \to \infty$.
Replacing $A$ with $A^c$ in Theorem 1, one has Theorem 1′. For all $A \in \mathcal{A} \setminus \{\mathbb{R}\}$ and $p \in (0,1)$ such that $p < \nu(A)$, as $n \to \infty$.
As follows from Proposition 3 below, for $A$ and $p$ as in the conditions of the theorems, either $I_A(p) \in (0, \infty)$ or $I_A(p) = \infty$ and $J_A(p) \in (0, \log b)$. Thus on a double-exponential scale, Theorems 1 and 1′ capture the right first-order asymptotics for the decay of the probability of a large deviation in the empirical distribution for such $A$'s and $p$'s.
The statement in the theorem still holds if we replace the weak inequality in (21) or (22) by a strict one. Our proof for the lower bound on $P(\bar Z_n(\sqrt{n}A) \ge p)$ essentially works The restriction to intervals of the form $(-\infty, x]$, $(y, x]$ and $(y, \infty)$ in $\mathcal{A}$ is quite arbitrary, and the theorem still holds if $\mathcal{A}$ is the algebra generated by sets of the form $(-\infty, x)$ or, more generally, the set of all finite unions of disjoint intervals, each of which may be finite or infinite and may contain both, one or neither of its endpoints, as long as its interior is non-empty.
On the other hand, (21) cannot be expected to hold for all Borel sets, nor even all continuity sets of ν. Indeed, the following shows that there are simple enough sets for which the decay in (16) has neither linear nor radical rate on a double exponential scale.
Proposition 2. For all $\alpha \in (1/2, 1)$ and $p \in (0, 1)$, there exists a set $A$, which is a countable union of disjoint finite intervals, such that Similarly, the restriction in our main theorem to values of $p$ in $(0, 1)$ is essential. In Theorem 1, for instance, in the case $p = 0$ the probability in the l.h.s. of (21) does not decay, and for certain sets in $\mathcal{A}$, the case $p = 1$ cannot be handled by the current proof nor by a straightforward modification of it.

1.4. Idea of Proof.
It is usually the case in the realm of large deviations that obtaining decay asymptotics for probabilities of unlikely events amounts to finding (and proving that it is such) an optimal (that is, least "costly" in terms of probability) "strategy" for realizing the unlikely event. Consider therefore $A \in \mathcal{A}$ and $p \in (\nu(A), 1)$ as in the conditions of Theorem 1. What is the optimal strategy for having at least a fraction $p$ of the population in the set $\sqrt{n}A$ at time $n$ instead of the likely $\nu(A)$?
As it turns out, among all possible strategies one needs to consider only two: a shift strategy and a dilation strategy. In the former, all particles move together in either the left or right direction for $w = |x|\sqrt{n}$ generations (up to integer rounding, $x \in \mathbb{R}$). This can be done with probability $\exp(-b^{|x|\sqrt{n}(1+o(1))})$ by keeping the number of particles at its minimum. Relative to the position of the particles at generation $w$, the target set has now "shifted" by $-x\sqrt{n}$. Therefore, after dividing by the CLT scaling of $\sqrt{n}$, each particle at generation $w$ will typically have (asymptotically) a fraction $\nu(A - x)$ of its descendants in $\sqrt{n}A$, and this will also be the fraction for the entire population. Consequently, if there exists $x$ for which $\nu(A - x) \ge p$, this strategy will realize the event $\{\bar Z_n(\sqrt{n}A) \ge p\}$ at the sole cost of "steering" the population for $w$ generations. This cost is $\exp(-e^{I_A(p)\sqrt{n}(1+o(1))})$ once $x$ is chosen closest to 0. If there is no $x$ for which $\nu(A - x) \ge p$, a "dilation" strategy is employed, whereby all particles move together for $w' = r'n + x'\sqrt{n}$ generations ($x' \in \mathbb{R}$, $r' \in (0, 1)$) such that at generation $w'$ they are all at position ≥ p then as in the shift case, the typical overall fraction in $\sqrt{n}A$ at a large time $n$ will be at least $p$. The probabilistic cost of this strategy is therefore incurred just in the first $w'$ generations, and by keeping reproduction at its minimum, it can be $\exp(-b^{r'n(1+o(1))})$. Choosing the smallest $r'$ possible, $\{\bar Z_n(\sqrt{n}A) \ge p\}$ can be achieved by a strategy which has probability $\exp(-e^{J_A(p)n(1+o(1))})$.
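The shift heuristic can be explored numerically for sets that are finite unions of intervals. The sketch below is an illustration only: the helper names `gauss_mass` and `min_shift`, the search window and the grid resolution are assumptions introduced here, and the returned value is simply the shift of smallest absolute value (on the grid) for which the shifted Gaussian mass reaches the target fraction.

```python
import math


def Phi(t):
    """Standard Gaussian distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def gauss_mass(intervals, shift=0.0):
    """nu(A - shift) for A a finite union of disjoint intervals (lo, hi);
    endpoints may be +/- infinity."""
    return sum(Phi(hi - shift) - Phi(lo - shift) for lo, hi in intervals)


def min_shift(intervals, p, max_abs=10.0, steps=20001):
    """Grid search (assumed window [-max_abs, max_abs]) for the shift x of
    smallest absolute value with nu(A - x) >= p.
    Returns None when no grid point qualifies."""
    best = None
    for i in range(steps):
        x = -max_abs + 2.0 * max_abs * i / (steps - 1)
        if gauss_mass(intervals, x) >= p:
            if best is None or abs(x) < abs(best):
                best = x
    return best


if __name__ == "__main__":
    half_line = [(1.0, math.inf)]   # A = (1, infinity)
    print(min_shift(half_line, 0.5))
    bounded = [(-1.0, 1.0)]         # A = (-1, 1)
    print(min_shift(bounded, 0.9))
```

For the half-line $(1, \infty)$ and $p = 1/2$ a shift exists, so the shift strategy applies. For the bounded interval $(-1, 1)$ and $p = 0.9$ no shift can raise the Gaussian mass above $\nu(A) \approx 0.683$, which is the situation in which the heuristic above turns to dilation.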
Of course these strategies only give lower bounds for the probability in question. One therefore must also show that other strategies would not cost less. In addition, to make the above heuristics precise, our proof requires certain uniform estimates for the probabilities of finding typical fractions as well as coarse (a priori) estimates for finding atypical ones.

Proofs
In this section we provide proofs for the statements in (1.3). We first introduce further notation (2.1) which will be used in the proofs, and then prove various preliminary statements (2.2) which are required in order to make the ideas from (1.4) precise. We then prove the main theorem (2.3) and finally prove Proposition 2 (2.4).
2.1. A bit more notation. The space of all particles measures, that is, finite point measures on $\mathbb{R}$ with integer masses, will be denoted by $\mathcal{Z}$. For $\zeta \in \mathcal{Z}$, we denote by $(Z^\zeta_n)_{n \ge 0}$ a BRW process with a similar evolution as $(Z_n)_{n \ge 0}$, only that initially $Z_0 = \zeta$. We will write $Z^x_n$ in place of $Z^{\delta_x}_n$ for short. $\nu_n$ is the distribution of the position of a simple random walk after $n$ steps. For $u \in \mathbb{R}$, as usual, $u^+ = \max(0, u)$ and $u^- = -(-u)^+$. We will use $C, C', C''$ to denote positive constants whose value is immaterial and changes from one use to the other. Constant values which are used more than once are denoted $C_0, C_1, \dots$, and their values become fixed the first time they appear in the text.
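As a concrete illustration of $\nu_n$, the following sketch computes the exact distribution function of the simple random walk after $n$ steps and compares it with the Gaussian one on the CLT scale. The function names `srw_cdf` and `scaled_gap` are introduced here for illustration. The $\sqrt{n}$ normalization of the gap is the classical Berry-Esseen scale, under which the difference stays bounded.

```python
import math


def Phi(t):
    """Standard Gaussian distribution function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def srw_cdf(n, t):
    """nu_n((-inf, t]): probability that a simple random walk started at 0
    is at a position <= t after n steps. With k up-steps the position is
    2k - n, and there are C(n, k) such paths out of 2^n."""
    favourable = sum(math.comb(n, k) for k in range(n + 1) if 2 * k - n <= t)
    return favourable / 2 ** n


def scaled_gap(n, x):
    """sqrt(n) * |nu_n(sqrt(n) * (-inf, x]) - nu((-inf, x])|."""
    return math.sqrt(n) * abs(srw_cdf(n, x * math.sqrt(n)) - Phi(x))


if __name__ == "__main__":
    for n in (100, 400, 1600):
        print(n, [round(scaled_gap(n, x), 3) for x in (-1.0, 0.0, 0.5, 2.0)])
```

The gap does not vanish after multiplication by $\sqrt{n}$ because the walk lives on a lattice; this is why only boundedness, rather than convergence to zero, can be expected at this scale.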
Proof. Part 1 follows from the dominated convergence theorem and standard arguments once we write For parts 2 and 3, if $A = \mathbb{R}$, then $\tilde I_A(p) = \tilde J_A(p) = 0$, and there is nothing to prove.
which, if non-empty, must contain a minimizer of | · |. This shows part 2.
For part 3, if $A$ contains a half-infinite interval, then since $\phi_A(0, x) \to 1 > p$ if $x \to +\infty$ or $x \to -\infty$, we must have $\tilde I_A(p) < \infty$. Therefore $\tilde J_A(p) = 0$, and (25) is satisfied with $r = 0$ and $x$ from part 2. Otherwise, $A$ is a finite union of finite intervals, and so there must exist $R < 1$, $M < \infty$ such that • $\phi_A(r, x) \ge p$ for some $0 \le r \le R$ and $x$ with $|x| \le M$.
Thus, $\tilde J_A(p)$ is the infimum of the continuous function $r$ over the non-empty compact set Proof. By Theorem 2 in [8], it is enough to check that where for a set $D \subset \mathbb{R}$, we set $D^\delta := \{x \in \mathbb{R} : \inf_{y \in D} |x - y| < \delta\}$ and the supremum is over $\rho$ and $\xi$ as in the statement of the proposition. Since $\nu$ is equivalent to $\lambda$, Lebesgue measure on $\mathbb{R}$, we may show (30) with $\lambda$ in place of $\nu$. But, The last term goes to 0 as $\delta \to 0$, since $\lambda(\partial A) = 0$.
We shall need the following uniform Chernoff-Cramér-type upper bound.
Lemma 5. Let $\mathcal{X}$ be a family of random variables on $\mathbb{R}$ with zero mean such that for some $\theta_0 > 0$ sup Then there exists $C > 0$ such that for any $\Delta > 0$ small enough, any $m \ge 1$ and $X_1, \dots, X_m$ independent copies of random variables in $\mathcal{X}$ Proof. Using the exponential Chebyshev inequality we may bound the l.h.s. in (33) for any $0 < \theta \le \theta_1 < \theta_0$ by where we use $L_X(\theta) = \log E e^{\theta X}$ for the log moment generating function of $X$. Since $L_X(\theta)$ is in $C^\infty([0, \theta_0))$ due to (32), we may use Taylor expansion to write (note that the first two terms are 0) for some $\tilde\theta \in (0, \theta)$. Now if we denote by $M_X(\tilde\theta) = E e^{\tilde\theta X}$ the moment generating function of $X$ then This follows since $M_X(\tilde\theta) \ge 1$ via Jensen's inequality and since for some $C > 0$ independent of $X \in \mathcal{X}$. Therefore (32) implies that there exists $K > 0$ for which sup and thus Using this bound with $\theta = \Delta/K$ in (34) and assuming $\Delta$ is small enough, the result follows with $C = (2K)^{-1}$ in (33).
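To see the shape of such a bound in the simplest case, one can take all variables to be fair $\pm 1$ steps. This is a special case chosen here for illustration, not the general family of the lemma: the log moment generating function is $\log\cosh\theta \le \theta^2/2$, so the optimization in the proof corresponds to $K = 1$ and gives the classical Hoeffding bound $\exp(-m\Delta^2/2)$ for the upper tail of the sample mean. The sketch below, with names introduced for illustration, compares this bound with the exact binomial tail.

```python
import math


def rademacher_upper_tail(m, delta):
    """Exact P(S_m / m > delta), where S_m is a sum of m independent
    fair +/-1 steps. With k steps equal to +1, S_m = 2k - m."""
    favourable = sum(
        math.comb(m, k) for k in range(m + 1) if 2 * k - m > delta * m
    )
    return favourable / 2 ** m


def chernoff_bound(m, delta, K=1.0):
    """exp(-m * delta^2 / (2K)): the value obtained by choosing
    theta = delta / K in the exponential Chebyshev bound, assuming the
    log moment generating function is at most K * theta^2 / 2."""
    return math.exp(-m * delta ** 2 / (2.0 * K))


if __name__ == "__main__":
    for m in (10, 50, 200):
        for delta in (0.05, 0.1, 0.3):
            exact = rademacher_upper_tail(m, delta)
            bound = chernoff_bound(m, delta)
            print(m, delta, exact, bound)
```

In this special case the bound holds for every $\Delta > 0$; the restriction to small $\Delta$ in the lemma reflects the fact that a general family only admits a quadratic control of the log moment generating function near the origin.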
The last lemma can be used to prove the following.
Lemma 6. There exist $C, C' > 0$ such that for all $\Delta > 0$ sufficiently small, $A \subset \mathbb{R}$, $\zeta \in \mathcal{Z}$ and $n \ge 1$, The same holds if we replace $>$ with $<$ and $+\Delta$ with $-\Delta$.
Proof. Starting with the first inequality and using the l.h.s. of (40) is bounded above by as long as $\Delta$ is small enough. Now Theorem 4 in [3] gives a uniform bound on the moment generating function $E e^{\theta \tilde Z_n(\mathbb{R})}$ for all $n \ge 1$ and $\theta \in [0, \theta_0]$, for some $\theta_0 > 0$. This uniform bound can be extended to include also the moment generating functions of (the stochastically smaller) $\tilde Z^x_n(A)$ for all $A \subseteq \mathbb{R}$ and $x \in \mathbb{R}$ in the same range of $\theta$. The non-negativity of all these random variables implies that we may extend the bound also to all $\theta < 0$. Thus, it is not difficult to see that the family of random variables satisfies the conditions in Lemma 5, whence (42) is bounded above by $Ce^{-C'\Delta^2|\zeta|}$ for some $C, C' > 0$ as desired.
We shall need the following uniform lower bound on the probability of a typical deviation of $\bar Z_n$ from the Gaussian distribution.
Moreover, we may choose the $\epsilon$'s such that for fixed $A \in \mathcal{A}$ and $t > 0$, and the above limit with $A'$ in place of $A$ is uniform in $A'$, where $A' = \rho A + \xi$ for $(\rho, \xi)$ in any compact subset of $(0, \infty) \times (-\infty, +\infty)$. The same result holds with $<$ in place of $>$ and $-t/\sqrt{n}$ in place of $+t/\sqrt{n}$.
Proof. Consider $A' = \rho A + \xi$ for some $(\rho, \xi) \in (0, \infty) \times (-\infty, +\infty)$. We may write $\sqrt{n}(\bar Z_n(\sqrt{n}A') - \nu(A'))$ as (recall the definition of $|\tilde Z|$ in (4)), Now Theorem 4.2 in [21] states that from which it follows by Borel-Cantelli that $\sqrt{n}(|\tilde Z_n| - |\tilde Z|) \to 0$ a.s. At the same time, Corollary 2.3 in [12] (notice the typo there: $O(1)$ appears instead of $o(1)$) implies that for some positive $C_0, C_1$, where $M = \lim_{n \to \infty} M_n$ and $M_n = \int x \, d\tilde Z_n(x)$. The sets considered in the corollary are of the form $(-\infty, y]$, but it is clear that by summation one can extend it to all sets in $\mathcal{A}$. Furthermore, it is immediate from the statement of the corollary that the constants $C_0, C_1$ can be chosen independently of $A' = \rho A + \xi$ as long as $(\rho, \xi)$ are chosen from a compact subset of $(0, \infty) \times (-\infty, +\infty)$. Less immediate, but still true, is that the proofs of the corollary and of Theorem 2.2, on which it is based, in fact give that the above limit is uniform in all such $A'$. Combining all the above and writing $\bar M$ for $M/|\tilde Z|$ we have, and it remains to show that $\bar M$ is unbounded from below.
To this end, note that $\bar M = \lim_{n \to \infty} \bar M_n$ where $\bar M_n = M_n/|\tilde Z_n|$, and that for any integers $r < n$, symmetry of $\bar M_{n-r}$ around zero entails where $C = C(r) > 0$ does not depend on $n$. Therefore $P(\bar M \le -r) > 0$, and since $r$ is arbitrary, $\bar M$ is indeed unbounded. This shows (44) and (45). Finally, applying the above results to $A^c$ in place of $A$, we obtain the same lower bound for the probability of a deviation to the opposite side.
and write The first factor can be lower bounded by $\exp\{-Cb^{|w|}\}$, as the event $\{Z_{|w|} = \zeta\}$ is equivalent to having all particles in the first $|w|$ generations give birth to $b$ children, all of whom take either a $+1$ step or a $-1$ step, depending on the sign of $x$. This requires that at most $C'b^{|w|}$ independent particles make certain branching/walking choices, all of which have a uniformly positive probability. The second factor in (52) can be bounded below by The probability in the above expression is further bounded below by which, for $\rho = \sqrt{n/m}$ and $\xi = (x\sqrt{n} - w)/\sqrt{m}$, is equal to , whence we may find $t > 0$ large enough such that (55) is bounded below by This is bounded away from 0 uniformly in $n$ via Lemma 7.
Plugging this back into (53), recalling that $|\zeta| = b^{|w|}$, the second factor in (52) is bounded below by $\exp\{-C'b^{|w|}\}$. Combining the bounds on both factors in (52) we arrive at as desired.
Upper bound. Let $\epsilon > 0$ be arbitrarily small and set Conditioning on the particles measure $\zeta$ at generation $|w_\epsilon|$, we have Any such $\zeta$ must satisfy $\mathrm{supp}(\zeta) \subseteq [-|w_\epsilon|, +|w_\epsilon|]$. Therefore there exists $\delta > 0$, such that for all such $\zeta$ and $z \in \zeta$, This follows from the choice of $x$ and Proposition 3.
Using this proposition and also Proposition 4, we further obtain for $n$ large, Then Lemma 6 implies that $P(\bar Z^\zeta_{m_\epsilon}(\sqrt{n}A) \ge p)$ is bounded above by As $|\zeta| \ge b^{|w_\epsilon|}$ we have from (59) for $n$ large enough, and this concludes the upper bound as $\epsilon$ was arbitrary.

2.3.2. The Case $I_A(p) = \infty$. The proof in this case is technically similar to the proof in the previous case, although the "optimal" strategy for achieving the desired deviation is different. We start by setting $r = \tilde J_A(p) \in (0, 1)$ and choosing $x \in \mathbb{R}$ such that This is guaranteed by Proposition 3.
Lower bound. Set and write (66) The first factor on the r.h.s. is at least $\exp\{-Cb^s\}$ since the event there can be achieved by having all particles give birth to $b$ children in the first $s$ generations, make only $+1$ or $-1$ steps in the first $|w|$ generations (depending on the sign of $x$), and then alternate between $+1$ and $-1$ steps in the succeeding $q$ generations. This requires that $C'b^s$ independent particles make certain branching/walking choices, all of which have a uniformly positive probability.
The second factor is bounded below by Setting and using (64), we may bound below the probability in (67) by Now $\rho = 1 + O(1/\sqrt{n})$ and $\xi = O(1/\sqrt{n})$, hence by Proposition 3 part 1, there exists $t > 0$ for which the last probability is bounded below by This is uniformly (in $n$, large enough) positive by virtue of Lemma 7. Therefore the second factor in (66) is bounded below by $e^{-C|\zeta|} \ge \exp\{-Cb^s\}$.
Plugging the two bounds in (66) we obtain as desired.
Upper bound. As in the previous case, let $\epsilon > 0$ be small enough and set This time we condition on the particles measure $\zeta$ at generation $q_\epsilon$: Now, from the definition of $r$ it follows that there exists $\delta > 0$ such that for all $\epsilon' \in [\epsilon, 2\epsilon]$ and $z \in \mathbb{R}$, Therefore, for any measure $\zeta$ and $n$ large enough by Propositions 3 and 4 1 Using Lemma 6 we have that $P(\bar Z^\zeta_{m_\epsilon}(\sqrt{n}A) \ge p)$ is bounded above by But if $\zeta$ is a possible particle measure at generation $q_\epsilon$, then $|\zeta| \ge b^{q_\epsilon}$. Hence from (73) we obtain for $n$ large enough, and since $\epsilon$ is arbitrary the upper bound follows.

2.4. Proof of Proposition 2. Let $\alpha \in (1/2, 1)$ and $p \in (0, 1)$ be given and choose $a > 0$ such that $\nu(A_0) = p$ where $A_0 = [-a, +a]$. Fix some small $\delta > 0$ and for any integer $k \ge 1$ set: Finally, for some $k_0 > 0$ to be chosen later, set We shall now argue that (23) is satisfied with the above $A$, $\alpha$ and $p$.
Lower bound. For any $n$ large enough, set $k = \lceil n^{(\alpha - 1/2)/(1+\delta)} \rceil$, $w = \lfloor x_k \sqrt{n} \rfloor$, $m = n - w$, $\zeta = b^w \delta_w$ and write The first factor on the r.h.s. is at least $\exp\{-Cb^w\} \ge \exp\{-b^{n^\alpha(1+o(1))}\}$, as the event can be achieved by all particles multiplying at rate $b$ and having their descendants take a $+1$ step for $w$ generations. Therefore, as in the proof of the lower bound in the $I_A(p) < \infty$ case, it is enough to show that $P(\bar Z_m(\sqrt{n}A - w) \ge p)$ is bounded away from 0 independently of $n$. This, in turn, follows from Lemma 7 since $\nu((\sqrt{n}A - w)/\sqrt{m})$ is bounded below by
Upper bound. Let $\epsilon > 0$ be arbitrarily small and set $w_\epsilon = \lfloor (1 - \epsilon)n^\alpha \rfloor$ and $m_\epsilon = n - w_\epsilon$. By conditioning on the particles measure in generation $w_\epsilon$, it is clear that where the maximum is taken over all feasible particles measures $\zeta$ for generation $w_\epsilon$. For such $\zeta$, we may write 1 where for the second inequality, we have used $\limsup_{m \to \infty} \sup_{\rho \in [1/2, 2]} \sup_{\xi \in \mathbb{R}} m^{1/2} |\nu_m(\sqrt{m}(\rho A + \xi)) - \nu(\rho A + \xi)| < \infty$, which holds for the set $A$ in light of (2.5) of [5].
Consider now some $y$ in the range of the maximum in (88) and find the index $k$ of the closest point to $y$ among $(x_k)_{k \ge k_0}$. We can then write For the second set in (90), note that from the definition of $A$ it follows that for large enough $k$. Then, using a standard bound on the tails of $\nu$, we obtain Combining the two bounds, we have Now if $k_0$ is chosen large enough, the r.h.s. above is maximized when $k$ is the largest possible. At the same time, the choices of $k$ and $y$ imply $(k - 1)^{1+\delta} < y \le (1 - \epsilon)n^{\alpha - 1/2}$ (97), which gives an upper bound on $k$. Using this in (96) we infer that the r.h.s. of (88) is bounded above by We may now use Lemma 6 and the fact that $|\zeta| \ge b^{w_\epsilon}$ to conclude that This finishes the proof as $\epsilon$ was arbitrary.