An Improved Fst Estimator

The fixation index F st plays a central role in ecological and evolutionary genetic studies. The estimators of Wright (F^st1), Weir and Cockerham (F^st2), and Hudson et al. (F^st3) are widely used to measure genetic differences among different populations, but all have limitations. We propose a minimum variance estimator F^stm using F^st1 and F^st2. We tested F^stm in simulations and applied it to 120 unrelated East African individuals from Ethiopia and 11 subpopulations in HapMap 3 with 464,642 SNPs. Our simulation study showed that F^stm has smaller bias than F^st2 for small sample sizes and smaller bias than F^st1 for large sample sizes. Also, F^stm has smaller variance than F^st2 for small F st values and smaller variance than F^st1 for large F st values. We demonstrated that approximately 30 subpopulations and 30 individuals per subpopulation are required in order to accurately estimate F st.


Introduction
The fixation index F st is widely used as a measure of population differentiation due to genetic structure. Wright [1,2] defined F st as the ratio of the observed variance of allele frequencies between subpopulations to the expected variance of allele frequencies assuming panmixis. Wright's estimator of F st is biased, because a priori expected allele frequencies are unknown and the numerator and denominator terms in the equation are not independent. In practice, various frameworks have been proposed to improve estimation of F st . Weir and Cockerham used an analysis of variance (ANOVA) approach to estimate within-and between-population variance components [3,4]. Weir and Cockerham's estimator is widely used because their estimator can describe the genetic population structure in a single summary statistic, is asymptotically unbiased with respect to sample size, and can compensate for overestimates particularly at low levels of genetic differentiation unlike Wright's estimator [5]. However, it can be upwardly biased unless adjustment is done for intralocus sampling error, the number of subpopulations sampled, time of divergence, etc. [6]. In the present study, we propose a method that improves F st estimation by combining Wright's and Weir and Cockerham's estimators to achieve a minimum variance estimate. For comparison, we also include Hudson et al.'s estimator [7], which recently has been recommended by Bhatia et al. [8]. We demonstrate application of our modified estimator in analysis of real data.

Methods
For a diallelic marker, let p be the true minor allele frequency in the total population. Let the true subpopulation allele frequencies be p 1 , . . ., p r in r ! 2 subpopulations. Let σ 2 be the true population variance in allele frequencies across subpopulations. Suppose the observed sample frequencies arep 1 ; . . . ;p r and the sample sizes are n 1 , . . ., n r . Let n ¼ P r j¼1 n j and n ¼ P r j¼1 n j =r. Let ϑ be the difference in allele frequencies, such that for two subpopulations, W ¼p 1 Àp 2 .
Wright's F st [2] is defined as and is estimated asF For the special case of two subpopulations, Rosenberg et al. [9] showed that by algebraic rearrangement Thus, F st is a function of the difference in allele frequencies and is proportional to ϑ 2 . Weir and Cockerham's estimator [4], assuming a random union of gametes or equivalently no individual-level inbreeding, is based on p ¼ X r j¼1 n jpj =n; and The definition of F st of Hudson et al. [7] is An unbiased estimate of p j (1 − p j ) is thus given by ½2n j =ð2n j À 1Þp j ð1 Àp j Þ. However, Note that under the null hypothesis of p 1 = Á Á Á = p r , both nF st 1 and nF st 2 are asymptotically zero. Our goal is to construct an estimator based on a linear combination of nF st 1 and nF st 2 such that the new estimator has the smallest variance among all such linear combinations. Let s 2 1 and s 2 2 be the asymptotic variances of nF st 1 and nF st 2 , and σ 12 be the asymptotic covariance.
We propose the following weighted version ofF st : where b > 0 is a fixed number to be chosen later. We choose a = a 0 such that VarðF st ðaÞÞ is minimized: It is seen that VarðF st ða 0 ÞÞ minfVarðF st 1 Þ; VarðF st 2 Þg and hence is more precise in estimation. From the proof of the Proposition we see that Eq (1) is equivalent to, which gives, with b = (δ − 1)/(δ + 1), At the end of the proof of the following Proposition, we show that δ ! 1 with equality if and only if n 1 = Á Á Á = n r . When n 1 = Á Á Á = n r , we have Proposition. Assume that 0 < p 0 < 1 and that the n j 's are not all equal (so that δ > 1).
In the above Proposition, take a = a 0 , then a 0 +δ(1 − a 0 ) = 0 and δ (1 − a 0 ) = −δ/(δ − 1), and we get Corollary 1. Under conditions of the Proposition, ii. If a = 1, then iii. If a = 0, then Simulations Under the Balding-Nichols model [10], the allele frequency in each of r subpopulations conditional on p and F st is a random deviate from the beta distribution β 2p. Therefore, we first assigned the true value for p by drawing a random uniform deviate from the interval (0, 0.5) and the true value for F st by independently drawing a random uniform deviate from the interval (0, 2p). Conditional on the true values of p and F st , we randomly generated p j from the beta distribution. We next assigned the number of individuals per subpopulation n j = [5, 10, 20, 50, 100, 110]. We then randomly drew alleles from the binomial distribution Bin(2n j , p j ). We generated 10,000 independent replicate data sets. Based on the above formulae, the four estimatorsF st 1 ,F st 2 ,F st 3 , andF st m were calculated. Linear regression models were used to evaluate the relationship between F st andF st and betweenF st andŴ 2 . We assessed the fit in a linear regression model with the F-test, r 2 , and the root mean squared error (RMSE), which is the square root of the sum of the variance and the squared bias.

Application to data
We included genotype data from a total of 120 unrelated individuals from the Wolaita (WETH) ethnic group from southern Ethiopia who served as controls in a genome-wide association study of podoconiosis [11]. The Wolaita ethnic group speaks an Omotic language, and comparison with HapMap African populations has shown that it has the closest genetic similarity with the Maasai from Kenya and the lowest genetic similarity with the Yoruba in Nigeria [12]. Genotyping was performed by deCODE Genetics using the Illumina HumanHap 610 Bead Chip, which assays > 620,000 single-nucleotide polymorphisms (SNPs). Of the 551,840 autosomal SNPs in the raw genotype data, we excluded 39,249 SNPs that had a minor allele frequency of < 0.05, 378 that were missing in > 0.05 of individuals, and 321 that had a Hardy-Weinberg p-value < 0.001. The remaining 511,892 SNPs were merged with genotype data for ASW

Results
Simulation 1: We first comparedF st with the true F st for the worst-case scenario of r = 2. For small sample sizes,F st 1 was the least biased estimator, followed byF st m ,F st 2 , andF st 3 (Table 1).
For large sample sizes,F st m andF st 1 were comparably good, andF st 2 andF st 3 were identically worse ( Table 1). None of the four estimators was strongly sensitive to equal vs. unequal sample sizes (Fig 1). WhenŴ was close to 0,F st 3 yielded the most negative estimates, followed byF st 2 andF st m . As expected, all four estimators showed a quadratic relationship withŴ (Fig 1). With respect toŴ 2 , by all four measuresF st 1 was the best estimator whereasF st 2 was the worst estimator (Table 1).
An assessment of bias by the total sample size (n 1 and n 2 ) for r = 2 is presented in Fig 2.F st 1 was biased and this bias was constant across total sample size, as expected given that this estimator does not account for n j . In contrast,F st 2 ,F st m , andF st 3 were less biased as the total sample size increased. When the total sample size exceeded 30,F st 3 was the least biased estimator; otherwise,F st 1 was the least biased estimator. For r = 2, the magnitude of bias for all four estimators was constant when the total sample size was at least 60.  Table 2. The means forF st 1 , F st 2 ,F st 3 , andF st m equaled the expected values, consistent with all four estimators being asymptotically unbiased. First, we investigated the relationship between F st and the variance of the four estimators. Given p = 0.2 and F st < 0.5,F st 1 had the smallest variance, followed byF st m and F st 2 (Fig 3). Given p = 0.2 and F st > 0.5,F st 2 had the smallest variance, followed byF st m andF st 1 . Similar results were obtained for p = 0.1, 0.3, 0.4, and 0.5 (S1 Table).
Second, we investigated how the number of subpopulations and the number of individuals per subpopulation affected bias. When the number of subpopulations was approximately 30, no matter the number of individuals per subpopulation, bias was stable (Fig 3). For r > 30 and  small n j , all four estimators were biased, with the order ofF st 1 <F st 3 <F st m <F st 2 . For r > 30 and large n j , all four estimators were unbiased. For n j > 30, all four estimators were stable and bias decreased as r increased, withF st 3 the best estimator andF st 1 the worst estimator (Fig 4).
Application to Data: The means and variances ofF st values between the WETH and 11 samples in HapMap 3 are presented in Table 3. The WETH sample was closest to the MKK sample, consistent with shared Cushitic and Nilo-Saharan ancestry [13].F st 2 andF st m yielded the same order for all pairs of relationships and all four estimators yielded the same order of relationships for the five HapMap samples closest to the WETH sample. The order of the means waŝ F st 1 <F st m <F st 2 <F st 3 .F st m was approximately 30% larger thanF st 1 and approximately 20% smaller thanF st 2 , which has corresponding effects on divergence time estimates. Given that F st 2 andF st m are less downward biased thanF st 1 for these sample sizes (Fig 2), the larger values are more likely to be correct.

Discussion
F st is directly related to the variance in allele frequencies among subpopulations. The dependence of F st on allele frequencies and genetic diversity has been observed [14]. In our study, an approximately linear relationship betweenF st 1 ,F st 2 ,F st m , andF st 3 with the squared difference of allele frequencies (Ŵ 2 ) was observed, as expected. By simulation, we found that all four estimators were unbiased for large numbers of subpopulations and individuals per subpopulation but that no one estimator was uniformly better than the others. For F st < 0.5,F st 1 had smaller  [5]. We found that the number of subpopulations was more important than the number of individuals per subpopulation. Estimation of F st , both in terms of means and variances, stabilized with approximately 30 subpopulations, regardless of the number of individuals per subpopulation. This behavior occurs because there are r estimates ofp j with which to estimate p and σ 2 . Estimation was biased for r = 2 and improved as r increased, according to the Central Limit Theorem. Estimation was biased for n j < 30 and improved as n j increased (except for Wright's estimator), also according to the Central Limit Theorem. Our proposed estimator is a minimum variance combination of Wright's and Weir and Cockerham's estimators and is less