The third moments of the site frequency spectrum

The analysis of patterns of segregating (i.e. polymorphic) sites in aligned sequences is routine in population genetics. Quantities of interest include the total number of segregating sites and the number of sites with mutations of different frequencies, the so-called site frequency spectrum. For neutrally evolving sequences, some classical results are available, including the expected value and variance of the spectrum in the Kingman coalescent model without recombination as calculated by Fu (1995). In this work, we use similar techniques to compute the third moments of the site frequency spectrum without recombination. We also account for the linkage pattern of mutations, yielding the full haplotype spectrum of three polymorphic sites. Based on these results, we derive analytical results for the bias of Tajima’s D and other neutrality tests. As an application, we obtain the second moments of the spectrum of linked sites, which is related to the neutral spectrum of chromosomal inversions and other structural variants. These moments can be used for the normalisation of new neutrality tests relying on these spectra.


76
Remark 2. Fu (1995) showed in his eq. (34), that for β n (i) exists a more compact form, namely where the summation over k is hidden in the a n . We do not have a similar form 77 for the other expressions, however β 3 n and β 4 n can be expressed in terms of g n , Remark 3. The sum over permutations simplifies the fractions in t b resp. t bb Remark 4. The central third moments can be obtained by Remark 5. For the folded spectrum the corresponding third moments can be computed in a simnilar way as the second moments (eq. (9) in Fu (1995))  1 i m for the n-th harmonic number of order m, the third moment (resp. central moment) of the number of segregating sites S for a sample of size n is: following from Theorem 1 and Theorem 2, the corresponding coefficients for θ, 112 θ 2 and θ 3 have to be the same. We give in the supplement an explicit proof of 113 the non-trivial identities of the coefficients for θ 2 and θ 3 stated as: n−1 j=1 τ hij = H 3 n−1,1 + 3H n−1,1 H n−1,2 + 2H n−1,3 .

119
Their general form is where the variance in the denominator is a linear combination of θ and θ 2 . These two quantities, if unknown, are 121 usually estimated from S and S 2 by the method of moments:θ = S/H n−1,1 122 andθ 2 = S(S − 1)/(H 2 n−1,1 + H n−1,2 ).

123
In this section, we explore the additional information that the third moments 124 of the spectrum reveal about the distribution of neutrality tests, in particular 125 about their skewness and bias.

126
[ The weights Ω i for some commonly used neutrality tests are given in Table   134 1. Figure 1 shows, that analytical results and those from simulations with 'ms' 135 (Hudson, 2002) agree well. However, when the parameter θ has to be estimated 136 from the data, as it is usually the case, the denominator of the test is a function 137 of the estimator, contributing to the skewness. This has a relatively large effect, 138 but surprisingly for most considered values of θ it reduces the skewness. For θ unknown and estimated from S, we can still make use of the third moments. In this case, we can compute an approximate result for the bias of the test. We apply the following formula for the Taylor expansion of moments and the fact that E[ n−1 k=1 kΩ k ξ k ] = 0 to obtain the bias: 2.6. Numerical results

164
In Figure One could naively assume, that this means that in the limit of large n the ξ k s could be treated as independent Gaussian random variables with mean θ/k and variance θ 2 ln(n)/n, leading to the approximation This is however incorrect. The distribution of each component of the spectrum is correlated between lines of the same state because of their shared lengths.

209
The combined sum over the two quantities yields the desired second moments.

210
We re-use method and notation of Fu (1995) with appropriate extensions. A 211 thorough explanation of the main ingredients of his proof, albeit with somewhat 212 different notation, has been given in Durrett (2008). An extended "reprint" We define index variables kl (i), that indicate if the line l of state k has i 217 descendants at state n, (e.g. they take the values 1 resp. 0). It follows that (cf. In the following we use the fact, that the index l serves only to distinguish lines 220 of the same state, but otherwise has no meaning, since all lines of the same 221 state are equivalent. The indicator variables are idempotent ( kl (i) 2 = kl (i)) 222 and independent of the length (resp. mutation rate) ξ kl . The expectation values 223 of the indicator variables correspond to probabilities, which we will define in the 224 following subsection. We recall, that the number of descendants of lines in the coalescent is equivalent to that of balls of a specific colour in a so-called Pólya urn model whose probability distribution is known and reviewed in e.g. Griffiths and Tavaré (2003). We introduce the following notation: p k n (t i) is the probability that t lines at state k have i descendants at state n. This probability is At this point it is helpful to define −1 −1 = 1, while binomial coefficients con-  The probability that t and u (different) lines at state k have respectively i and j descendants at state n is And for three such (non-overlapping) sets of lines the probability yields Using this notation we can now state the probabilities for different configurations. We start with those derived by Fu: The probability, that one line at state k has i descendants at state n is (Fu, 1995, eq. (14)) The joint probability that one line at state k and one nested line at state k ≥ k have i respective j descendants at state n is (Fu, 1995, eq. (18)) The joint probability that one line at state k and one disjoint (not nested) line at state k ≥ k have i resp. j descendants at state n is (Fu, 1995, eq. (19) and (20)) In the latter two cases the summation index t runs over the possible numbers of descendants that the line of state k may have at state k . Since no single line can be ancestor of all k lines, this number has an upper limit of k − 1. There are more constraints on t as detailled by Fu (1995) (e.g. a line from state k can have at most k − k + 1 descendants at state k , hence only t ≤ k − k + 1 is meaningful), however these are already implicit in the binomial coefficients.
Note, that Fu defined these equations only for the case k < k . Using the special definition for the binomial coefficient, they include the case k = k (Durrett, 2008): if the lines are from the same state, then t = 1 and we . These two equations correspond to eq. (14) and (15) of (Fu, 1995).
Hence the probability, that a line at k and a line at state k ≤ k have i resp. j descendants at state n yields for 2 ≤ k ≤ k ≤ n: Now we derive the probabilities involving three lines. These may be all of 235 the same state, of two different states or of three different states. We assume 236 k ≤ k ≤ k . We take a single line at each state k, k and k respectively and p aa (k, h; k , i; k , j) Since the six cases cover all possible combinations, the total probablity that three lines at state k, k and k resp. (with k ≤ k ≤ k ) have h, i and j resp. descendants at state n is given by p(k, h; k , i; k , j) =p aa (k, h; k , i; k , j) + p ab (k, h; k , i; k , j) + p We now relate the indicator variables of eq. (24) to the above probabilities. For two lines we have the three cases distinguished by Fu (1995, text and equations without number, before eq. (22)) With three lines (still assuming k ≤ k ≤ k ), this extends to: 3.1.3. Averaging over line lengths 246 Proposition 1. For any 1 ≤ k, k , k < n, 1 ≤ l ≤ k, 1 ≤ l ≤ k , 1 ≤ l ≤ k the following equation holds: Proof. Let X be a random variable. It can be easily shown that, if X is exponentially distributed (X ∼ Exp(λ)), then the first three moments of X In agreement with the definition of the coalescent the ξ kl are distributed as ξ kl ∼ P oisson( θ 2 T k ) with T k ∼ Exp( 2 k(k−1) ). ξ kl and ξ k l are independent if k = k while ξ kl and ξ kl are independent conditional on T k for l = l . We follow here an analogous derivation as in Wakeley (2008).

Combining results
We insert now the results for averaged topologies and averaged line lengths into eq. (24): Applying eq. (22) of (Fu, 1995) to the first term of (39) yields eq. (2): and applying his eq. (23) to the next three terms of (39) yields eq. (4): We now define the remaining terms (39) as functions where x stands for {aa, ab, ba (3) , ba (2) , ba (1) , bb} and finally we set In the supplement we transform these functions to yield (6).

248
We offer an implementation in C++ for numerical calculation of the third 249 moments, given n and θ, using the expressions (1)-(6). Just for control, we  We derive the third moments of segregating sites S using the method of Watterson (1975). He showed (his eq. (1.3a)), that the probability generating function of S can be approximated for large population size N and small sample size n by: Hence: Setting s = 1 gives and inserting Wattersons results for the first and second moment (his eq. (1.4a) and (1.5a)) yields our theorem 2:

258
Kingman's coalescent (Kingman, 1982) is an extremely useful model to de-259 scribe the patterns of mutations in neutral populations. For this reason, coales-260 cent methods were used to compute analytically the expectation and covariance 261 of the frequency spectrum (Fu, 1995). Here, we derive for the first time the 262 third moments of the full frequency spectrum. We think, the third moments 263 add a valuable building block to coalescent theory.

264
Beyond their fundamental interest, our results have several applications. 265 We show how to compute analytically the bias of neutrality tests. Moreover, we 266 describe the joint frequency spectrum for triplets of sites (fully characterising 267 their expected haplotype structure). In turn, these results can be used to im- we derived.

286
The main limit of our results is that they do not account for recombination

309
The results in this paper apply to a sample of size n much smaller than the  analytical approximation simulation Figure 2: The bias of the tests mentioned in table 1 with sample size n = 50 (top) and n = 500 (bottom). Shown are the values of our analytical approximation and numerical data, obtained by simulation with 'ms', averaged over 10 6 genealogies.  The left middle shows the covariances between mutations nested within the focal mutation, the right middle the covariances of mutations both disjoint and the rightmost the covariance between nested and disjoint mutations.   (2)