Large deviations for homozygosity

For any $m \ge 2$, the homozygosity of order $m$ of a population is the probability that a sample of size $m$ from the population consists of individuals of the same type. Assume that the type proportions follow Kingman's Poisson–Dirichlet distribution with parameter $\theta$. In this paper we establish the large deviation principle for the naturally scaled homozygosity as $\theta$ tends to infinity. The key step in the proof is a new representation of the homozygosity. This settles an open problem raised in [1]. The result is then generalized to the two-parameter Poisson–Dirichlet distribution.

Let $\gamma(t)$, $t \ge 0$, denote the gamma subordinator, i.e., the subordinator with Lévy measure $x^{-1}e^{-x}\,dx$ on $(0,\infty)$. For any $\theta > 0$, let $J_1(\theta) \ge J_2(\theta) \ge \cdots$ denote the jump sizes of $\gamma(t)$ over the interval $[0, \theta]$ in descending order. If we set $P_i(\theta) = J_i(\theta)/\gamma(\theta)$, $i \ge 1$, then the law of $P(\theta) = (P_1(\theta), P_2(\theta), \ldots)$ is Kingman's Poisson–Dirichlet distribution $PD(\theta)$ (cf. [10]). It is a probability on the infinite-dimensional ordered simplex
$$\overline{\nabla} = \Big\{p = (p_1, p_2, \ldots) : p_1 \ge p_2 \ge \cdots \ge 0,\ \sum_{i=1}^{\infty} p_i = 1\Big\}.$$
For any $p \in \overline{\nabla}$ and integer $m \ge 2$, the function
$$H(p; m) = \sum_{i=1}^{\infty} p_i^m$$
is loosely called the homozygosity of order $m$. The name is taken from population genetics, where the homozygosity corresponds to $m = 2$. The function is closely associated with the Shannon entropy in communication theory, the Herfindahl–Hirschman index in economics, and the Gini–Simpson index in ecology. It can be used to measure population diversity in terms of the number of different types and the evenness of the distribution among those types. The value of $H(p; m)$ decreases when the number of types increases and the distribution among those types becomes more even.
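As an illustration, $PD(\theta)$ can be sampled approximately through the GEM$(\theta)$ stick-breaking construction, whose decreasing rearrangement has law $PD(\theta)$; since $H(p; m)$ is invariant under rearrangement, the homozygosity can be computed from the unranked weights. The following is a minimal sketch (the function names and the truncation tolerance `tol` are our own choices, not from the paper):

```python
import random

def gem_weights(theta, tol=1e-12):
    """GEM(theta) stick-breaking weights; their decreasing
    rearrangement follows the Poisson-Dirichlet law PD(theta)."""
    weights, stick = [], 1.0
    while stick > tol:
        v = random.betavariate(1.0, theta)  # V_i ~ Beta(1, theta)
        weights.append(stick * v)
        stick *= 1.0 - v
    return weights

def homozygosity(p, m):
    """H(p; m) = sum_i p_i^m: the probability that m individuals
    sampled from proportions p are all of the same type."""
    return sum(x ** m for x in p)
```

Truncating once the residual stick mass drops below `tol` perturbs $H$ by at most `tol` times the largest remaining weight, which is negligible here.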
In this paper we are interested in the behaviour of the random variable $H(P(\theta); m)$ when $\theta$ tends to infinity. When a random sample of size $m$ is selected from a population whose type proportions have distribution $PD(\theta)$, the probability that all sampled individuals are of the same type is given by $H(P(\theta); m)$. Since $H(P(\theta); m) \le P_1^{m-1}(\theta)$, it follows that $H(P(\theta); m)$ converges to zero as $\theta$ approaches infinity. In [7] and [9] it is shown that $H(P(\theta); m)$ goes to zero at a magnitude of $\theta^{-(m-1)}$; more precisely,
$$\sqrt{\theta}\Big(\frac{\theta^{m-1}}{\Gamma(m)} H(P(\theta); m) - 1\Big) \Rightarrow Z_m,$$
where $\Rightarrow$ denotes convergence in distribution and $Z_m$ is a normal random variable with mean zero and variance $\frac{\Gamma(2m)}{\Gamma(m)^2} - m^2$. It is natural to investigate more refined structures associated with these limits. In [1], a full large deviation principle is established for $H(P(\theta); m)$, describing the deviations from zero. For $l$ in $(0, 1/2)$, the quantity $\theta^{l}\big(\frac{\theta^{m-1}}{\Gamma(m)} H(P(\theta); m) - 1\big)$ converges to zero in probability as $\theta$ tends to infinity. Large deviations associated with this limit are called moderate deviation principles for $\{\frac{\theta^{m-1}}{\Gamma(m)} H(P(\theta); m) : \theta > 0\}$. In [5], the moderate deviation principles are shown to hold for $l$ in $\big(\frac{m-1}{2m-1}, \frac{1}{2}\big)$. The large deviation principle corresponding to $l = 0$ remains an open problem.
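The law of large numbers behind this scaling, $\theta^{m-1} H(P(\theta); m)/\Gamma(m) \to 1$, is easy to check by simulation. The sketch below uses the stick-breaking approximation of $PD(\theta)$; the helper name, sample size, and tolerance are our own choices:

```python
import math
import random

def gem_weights(theta, tol=1e-12):
    """GEM(theta) stick-breaking weights, an unranked sample of PD(theta)."""
    weights, stick = [], 1.0
    while stick > tol:
        v = random.betavariate(1.0, theta)
        weights.append(stick * v)
        stick *= 1.0 - v
    return weights

def scaled_homozygosity(theta, m, n_samples=200):
    """Monte Carlo estimate of E[theta^{m-1} H(P(theta); m) / Gamma(m)],
    which tends to 1 as theta grows."""
    total = 0.0
    for _ in range(n_samples):
        h = sum(w ** m for w in gem_weights(theta))
        total += theta ** (m - 1) * h / math.gamma(m)
    return total / n_samples
```

For $m = 2$ the exact mean is $\theta\,\Gamma(\theta+1)/\Gamma(\theta+2) = \theta/(\theta+1)$, so the estimate should sit just below 1 for moderately large $\theta$.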
In this paper we solve this open problem; namely, we establish the large deviation principle for $\frac{\theta^{m-1}}{\Gamma(m)} H(P(\theta); m)$, describing deviations from 1. The two-parameter generalization is also obtained. The key step in the proof is a new representation of the homozygosity.

Large deviations
Let $m$ be any integer greater than or equal to 2. The objective of this section is to establish the large deviation principle for
$$L(P(\theta); m) = \frac{\theta^{m-1}}{\Gamma(m)} H(P(\theta); m).$$
We begin with the case where $\theta$ takes integer values. For any $1 \le k \le \theta$, let $J_i^k$, $i = 1, 2, \ldots$, denote all the jump sizes of $\gamma(t)$ over $[k-1, k]$. Since the subordinator $\gamma(t)$ does not jump at $t = 0, 1, \ldots, \theta$ with probability one, it follows that
$$H(P(\theta); m) = \frac{1}{\gamma(\theta)^m} \sum_{k=1}^{\theta} W_k^m H_k, \qquad (2.1)$$
where $W_k = \gamma(k) - \gamma(k-1) = \sum_i J_i^k$ and $H_k = \sum_i (J_i^k/W_k)^m$. Here $W_1, \ldots, W_\theta$ are independent copies of $\gamma(1)$, and, independently, $H_1, \ldots, H_\theta$ are independent copies of $H(P(1); m)$. Set
$$L_0(P(\theta); m) = \frac{1}{\theta\,\Gamma(m)} \sum_{k=1}^{\theta} W_k^m H_k,$$
so that $L(P(\theta); m) = \big(\theta/\gamma(\theta)\big)^m L_0(P(\theta); m)$. Then we have

Theorem 2.1. A large deviation principle holds for $L(P(\theta); m)$ as $\theta$ converges to infinity on the space $\mathbb{R}$ with speed $\theta^{1/m}$ and good rate function
$$I(x) = \begin{cases} \big(\Gamma(m)(x-1)\big)^{1/m}, & x \ge 1, \\ +\infty, & x < 1. \end{cases}$$

Proof: By the Ewens sampling formula and direct calculation we obtain the moments of $W_1^m H_1$. Thus there exists $\lambda < 0$ such that $F(\lambda) < 1$, which implies that $J(y) > 0$ for $y < 1$. This, combined with (2.2) and the fact that $J(\cdot)$ is non-increasing, implies the required upper bound, for any $x < 1$, on the limsup. On the other hand, for any $\epsilon > 0$ and $0 < \delta < 1$ we bound $P\{L_0(P(\theta); m) > x\}$ from below, which combined with (2.4), (2.7) and Theorem (P) in [13] implies the large deviation principle for $L_0(P(\theta); m)$ with speed $\theta^{1/m}$ and good rate function $I(\cdot)$.
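The block decomposition used above — the jumps over $[k-1, k]$ contribute $W_k^m H_k$, with $W_k$ the block's total jump size and $H_k$ the homozygosity of its normalized jumps — can be checked numerically. The sketch below builds the jump sizes block by block, using that $W_k$ is a copy of $\gamma(1)$ (standard exponential) and, by the independence property of the gamma subordinator, independent of the normalized jumps, which we approximate here by GEM(1) stick-breaking (names and tolerances are our own):

```python
import random

def gem_weights(theta, tol=1e-12):
    """GEM(theta) stick-breaking weights, an unranked sample of PD(theta)."""
    weights, stick = [], 1.0
    while stick > tol:
        v = random.betavariate(1.0, theta)
        weights.append(stick * v)
        stick *= 1.0 - v
    return weights

def blockwise_representation(theta_int, m):
    """Build the jumps of gamma(t) over [0, theta] block by block and
    return (H computed from all jumps, H computed block by block);
    the two agree up to floating-point rounding."""
    all_jumps, rhs_sum, total = [], 0.0, 0.0
    for _ in range(theta_int):
        w = random.expovariate(1.0)      # W_k = gamma(k) - gamma(k-1) ~ Exp(1)
        q = gem_weights(1.0)             # normalized jumps over [k-1, k]
        h = sum(x ** m for x in q)       # H_k, a copy of H(P(1); m)
        all_jumps.extend(w * x for x in q)
        rhs_sum += w ** m * h
        total += w                       # total = gamma(theta)
    lhs = sum(j ** m for j in all_jumps) / total ** m
    rhs = rhs_sum / total ** m
    return lhs, rhs
```

The agreement of the two return values is a pure algebraic identity, so it holds for every realization, not just on average.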
By direct calculation and Lemma 2.1 in [5], the large deviation principle for $L(P(\theta); m)$ is the same as that for $L_0(P(\theta); m)$. For non-integer $\theta$, write $[\theta]$ for the integer part of $\theta$. For any $0 < \delta < 1$, the difference between $L(P(\theta); m)$ and $L(P([\theta]); m)$ can be controlled, where the key equality follows from the fact that $\gamma(\theta) - \gamma([\theta])$ and $\gamma(\theta - [\theta])$ have the same distribution. This implies that for any $0 < r < 1$ the two families are exponentially equivalent at speed $\theta^{1/m}$; the matching lower bound is proved similarly.

Large deviation estimates were obtained in [3] for the scaled probability of two randomly selected individuals at time zero having the same ancestor at time $T_n$. In our notation this probability has the form
$$\frac{n}{\gamma(n)^2} \sum_{k=1}^{n} W_k^2.$$
This is the same as $L_0(P(n); 2)$ except that $H_k$ is replaced by 1. Our result shows that the corresponding work in [3] can be generalized to any $m \ge 2$.
The two-parameter Poisson–Dirichlet distribution
For $0 < \alpha < 1$ and $\theta > -\alpha$, let $P(\alpha, \theta) = (P_1(\alpha, \theta), P_2(\alpha, \theta), \ldots)$ follow the two-parameter Poisson–Dirichlet distribution $PD(\alpha, \theta)$. The two-parameter homozygosity of order $m$ is defined as
$$H(P(\alpha, \theta); m) = \sum_{i=1}^{\infty} P_i^m(\alpha, \theta),$$
and we set
$$L(P(\alpha, \theta); m) = \frac{\Gamma(1-\alpha)\,\theta^{m-1}}{\Gamma(m-\alpha)} H(P(\alpha, \theta); m).$$
It is known that
$$\sqrt{\theta}\big(L(P(\alpha, \theta); m) - 1\big) \Rightarrow Z_m^{\alpha},$$
where $Z_m^{\alpha}$ is a normal random variable with mean zero and variance depending on $\alpha$ and $m$. As in the one-parameter case, the moderate deviation principles hold for the two-parameter homozygosity. Our next result establishes the large deviation principle for $L(P(\alpha, \theta); m)$.
Theorem 3.1. A large deviation principle holds for $L(P(\alpha, \theta); m)$ as $\theta$ converges to infinity on the space $\mathbb{R}$ with speed $\theta^{1/m}$ and good rate function
$$I_\alpha(x) = \begin{cases} \Big(\frac{\Gamma(m-\alpha)}{\Gamma(1-\alpha)}(x-1)\Big)^{1/m}, & x \ge 1, \\ +\infty, & x < 1. \end{cases}$$
With $\theta$ a positive integer, the two-parameter homozygosity can now be written as
$$H(P(\alpha, \theta); m) = \frac{1}{\sigma_{\alpha,\theta}^m} \sum_{k=1}^{\theta} (W_k^{\alpha})^m H_{\alpha,k}, \qquad (3.2)$$
where $\sigma_{\alpha,\theta} = \sum_{k=1}^{\theta} W_k^{\alpha}$.

Remark 3.2. Both representations (2.1) and (3.2) can be generalized to other subordinator-based models. But the independence between the total jump size and the normalized individual jump sizes may no longer hold, and it is not clear whether our result can be generalized to these situations.

Remark 3.3. For $0 < \alpha < 1$ and $x > 1$, we have $I_\alpha(x) < I(x)$. Thus $L(P(\alpha, \theta); m)$ is more spread out from 1 than $L(P(\theta); m)$, and $\alpha$ can then be used to describe the diversity of the population in terms of large deviations.
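For the two-parameter case, $PD(\alpha, \theta)$ can be sampled through the Pitman–Yor residual allocation scheme with $V_k \sim \mathrm{Beta}(1-\alpha, \theta + k\alpha)$. The sketch below is our own sanity check (it does not use the representation (3.2)): it verifies by Monte Carlo the exact sampling identity $E[H(P(\alpha, \theta); 2)] = (1-\alpha)/(1+\theta)$, the probability that two sampled individuals share a type:

```python
import random

def pitman_yor_weights(alpha, theta, tol=1e-6):
    """Pitman-Yor (GEM(alpha, theta)) stick-breaking weights; their
    decreasing rearrangement follows PD(alpha, theta)."""
    weights, stick, k = [], 1.0, 1
    while stick > tol:
        v = random.betavariate(1.0 - alpha, theta + k * alpha)
        weights.append(stick * v)
        stick *= 1.0 - v
        k += 1
    return weights

def mean_homozygosity(alpha, theta, m, n_samples=400):
    """Monte Carlo estimate of E[H(P(alpha, theta); m)].
    For m = 2 the exact value is (1 - alpha) / (1 + theta)."""
    total = 0.0
    for _ in range(n_samples):
        total += sum(w ** m for w in pitman_yor_weights(alpha, theta))
    return total / n_samples
```

The looser tolerance `tol=1e-6` is needed because the two-parameter sticks decay only polynomially; the residual mass contributes at most $\mathrm{tol}^2$ to the order-2 homozygosity, so the truncation bias is negligible.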