An Improved F_st Estimator

Guanjie Chen; Ao Yuan; Daniel Shriner; Fasil Tekola-Ayele; Jie Zhou; Amy R. Bentley; Yanxun Zhou; Chuntao Wang; Melanie J. Newport; Adebowale Adeyemo; Charles N. Rotimi

doi:10.1371/journal.pone.0135368

Abstract

The fixation index F_st plays a central role in ecological and evolutionary genetic studies. The estimators of Wright, Weir and Cockerham, and Hudson et al. are widely used to measure genetic differences among populations, but all have limitations. We propose a minimum variance estimator that combines the estimators of Wright and of Weir and Cockerham. We tested the proposed estimator in simulations and applied it to 120 unrelated East African individuals from Ethiopia and 11 subpopulations in HapMap 3 with 464,642 SNPs. Our simulation study showed that the proposed estimator has smaller bias than Weir and Cockerham’s estimator for small sample sizes and smaller bias than Wright’s estimator for large sample sizes. Also, the proposed estimator has smaller variance than one of the classical estimators for small F_st values and smaller variance than another for large F_st values. We demonstrated that approximately 30 subpopulations and 30 individuals per subpopulation are required in order to accurately estimate F_st.

Citation: Chen G, Yuan A, Shriner D, Tekola-Ayele F, Zhou J, Bentley AR, et al. (2015) An Improved F_st Estimator. PLoS ONE 10(8): e0135368. https://doi.org/10.1371/journal.pone.0135368

Editor: Francesc Calafell, Universitat Pompeu Fabra, SPAIN

Received: November 12, 2014; Accepted: July 21, 2015; Published: August 28, 2015

This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication

Data Availability: The real dataset from the Wolaita was obtained in 2007 with written informed consent that limits data access to the investigators only. Therefore, public deposition of data would breach ethical compliance because participants did not give consent for deposition of their genotypes. Please contact Dr. Fasil Tekola Ayele (fasil.ayele2@nih.gov) for details about the real data.

Funding: This research was supported by the Intramural Research Program of the Center for Research on Genomics and Global Health (CRGGH). The CRGGH is supported by the National Human Genome Research Institute, the National Institute of Diabetes and Digestive and Kidney Diseases, the Center for Information Technology, and the Office of the Director at the National Institutes of Health (Z01HG200362). The data from the Wolaita came from a study supported by the Wellcome Trust (Grant #079791). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

The fixation index F_st is widely used as a measure of population differentiation due to genetic structure. Wright [1, 2] defined F_st as the ratio of the observed variance of allele frequencies between subpopulations to the expected variance of allele frequencies assuming panmixis. Wright’s estimator of F_st is biased, because a priori expected allele frequencies are unknown and the numerator and denominator terms in the equation are not independent. In practice, various frameworks have been proposed to improve estimation of F_st. Weir and Cockerham used an analysis of variance (ANOVA) approach to estimate within- and between-population variance components [3, 4]. Weir and Cockerham’s estimator is widely used because their estimator can describe the genetic population structure in a single summary statistic, is asymptotically unbiased with respect to sample size, and can compensate for overestimates particularly at low levels of genetic differentiation unlike Wright’s estimator [5]. However, it can be upwardly biased unless adjustment is done for intralocus sampling error, the number of subpopulations sampled, time of divergence, etc. [6]. In the present study, we propose a method that improves F_st estimation by combining Wright’s and Weir and Cockerham’s estimators to achieve a minimum variance estimate. For comparison, we also include Hudson et al.’s estimator [7], which recently has been recommended by Bhatia et al. [8]. We demonstrate application of our modified estimator in analysis of real data.

Methods

For a diallelic marker, let p be the true minor allele frequency in the total population. Let the true subpopulation allele frequencies be p₁, …, p_r in r ≥ 2 subpopulations. Let σ² be the true population variance in allele frequencies across subpopulations. Suppose the observed sample frequencies are p̂₁, …, p̂_r and the sample sizes are n₁, …, n_r. Let n = Σ_j n_j and p̄ = Σ_j n_j p̂_j / n. Let ϑ be the difference in allele frequencies, such that for two subpopulations, ϑ = p₁ − p₂.

Wright’s F_st [2] is defined as F_st = σ²/[p(1 − p)] and is estimated by substituting the sample mean allele frequency for p and the sample variance of the p̂_j for σ². For the special case of two subpopulations, Rosenberg et al. [9] showed that by algebraic rearrangement F_st = ϑ²/[4p(1 − p)], where p = (p₁ + p₂)/2. Thus, F_st is a function of the difference in allele frequencies and is proportional to ϑ².
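As an illustration of the two-subpopulation identity, the following minimal Python sketch computes Wright’s F_st from true subpopulation frequencies and compares it with the shortcut based on ϑ. It assumes the standard definition F_st = σ²/[p(1 − p)] with equally weighted subpopulations, so that p is the simple average of the p_j and σ² is their population variance. The function names `wright_fst` and `wright_fst_two` are hypothetical and are not taken from the original analysis.

```python
def wright_fst(freqs):
    """Wright's F_st = sigma^2 / (p (1 - p)), subpopulations weighted equally."""
    r = len(freqs)
    p = sum(freqs) / r                                # mean allele frequency
    sigma2 = sum((pj - p) ** 2 for pj in freqs) / r   # variance across subpopulations
    return sigma2 / (p * (1 - p))


def wright_fst_two(p1, p2):
    """Two-subpopulation shortcut: theta^2 / (4 p (1 - p)), theta = p1 - p2."""
    p = (p1 + p2) / 2
    theta = p1 - p2
    return theta ** 2 / (4 * p * (1 - p))


if __name__ == "__main__":
    # Both routes should agree for any pair of frequencies with 0 < p < 1.
    print(wright_fst([0.1, 0.3]), wright_fst_two(0.1, 0.3))
```

For p₁ = 0.1 and p₂ = 0.3, both functions return 0.0625 up to floating-point rounding, which makes the proportionality to ϑ² concrete.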

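Weir and Cockerham’s estimator [4], assuming a random union of gametes or equivalently no individual-level inbreeding, is based on ANOVA estimates of the between- and within-subpopulation components of variance in allele frequencies, and is given by the ratio of the between-subpopulation component to the total.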

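The definition of F_st of Hudson et al. [7] is F_st = 1 − H_W/H_B. Given observed sample estimates p̂_j, the plug-in estimate that substitutes p̂_j for p_j is a biased estimate of H_W, because the expectation of p̂_j(1 − p̂_j) is smaller than p_j(1 − p_j) in finite samples. An unbiased estimate of p_j(1 − p_j) is thus given by rescaling p̂_j(1 − p̂_j) to remove this finite-sample bias. However, the plug-in estimate of H_B is unbiased if p₁ = ⋯ = p_r, i.e., under the null hypothesis. Therefore, we estimate F_st by substituting these estimates of H_W and H_B into the definition, which gives a ratio of unbiased estimates. This estimator generalizes Bhatia et al.’s [8] version of Hudson et al.’s [7] estimator for r > 2.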
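The finite-sample correction can be made concrete for two subpopulations. The sketch below is an illustration under stated assumptions rather than the estimator used in this study: it follows the two-population form published by Bhatia et al. [8], takes m_j to be the number of alleles sampled from subpopulation j (an assumption about how sample size is counted), and uses the hypothetical function names `unbiased_het`, `hudson_fst_two`, and `expected_unbiased_het`. The last function checks unbiasedness exactly by summing over the binomial distribution.

```python
from math import comb


def unbiased_het(p_hat, m):
    """Finite-sample corrected estimate of p(1 - p) from m sampled alleles."""
    return m / (m - 1) * p_hat * (1 - p_hat)


def hudson_fst_two(p1_hat, p2_hat, m1, m2):
    """1 - H_W / H_B for two subpopulations, with corrected within-population terms.

    Undefined (division by zero) when both samples are fixed for the same allele.
    """
    h_w = unbiased_het(p1_hat, m1) + unbiased_het(p2_hat, m2)
    h_b = p1_hat * (1 - p2_hat) + p2_hat * (1 - p1_hat)
    return 1 - h_w / h_b


def expected_unbiased_het(p, m):
    """Exact expectation of unbiased_het when the allele count is Binomial(m, p)."""
    return sum(
        comb(m, k) * p ** k * (1 - p) ** (m - k) * unbiased_het(k / m, m)
        for k in range(m + 1)
    )


if __name__ == "__main__":
    print(hudson_fst_two(0.2, 0.5, 40, 60))
    print(expected_unbiased_het(0.3, 10), 0.3 * 0.7)
```

Because H_B − H_W equals (p₁ − p₂)² in this parameterization, the function is algebraically the same as the numerator-over-denominator form given by Bhatia et al. [8].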

Note that under the null hypothesis of p₁ = ⋯ = p_r, both Wright’s estimator and Weir and Cockerham’s estimator are asymptotically zero. Our goal is to construct an estimator based on a linear combination of these two estimators such that the new estimator has the smallest variance among all such linear combinations. Let σ₁² and σ₂² be the asymptotic variances of the two estimators, and σ₁₂ be their asymptotic covariance. We propose a weighted version in which the two estimators receive weights a and 1 − a, and in which b > 0 is a fixed number to be chosen later. We choose a = a₀ such that the asymptotic variance of the weighted estimator is minimized (Eq 1). It is seen that the resulting estimator has asymptotic variance no larger than that of either component estimator, and hence is more precise in estimation. From the proof of the Proposition we see that Eq (1) is equivalent to a₀ + δ(1 − a₀) = 0, i.e., a₀ = δ/(δ − 1) (Eq 2), which, with b = (δ − 1)/(δ + 1), determines the proposed estimator. At the end of the proof of the following Proposition, we show that δ ≥ 1 with equality if and only if n₁ = ⋯ = n_r. When n₁ = ⋯ = n_r, we have δ = 1. Let →_D denote convergence in distribution.

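Proposition. Assume that 0 < p₀ < 1 and that the n_j’s are not all equal (so that δ > 1). If p₁ = ⋯ = p_r = p₀, then the suitably normalized weighted estimator converges in distribution to a linear combination of independent chi-squared random variables with one degree of freedom, with coefficients determined by a, δ, and λ₁, …, λ_r, where λ₁, …, λ_r are the eigenvalues of Ω′^1/2 BΩ^1/2, Ω = (ω_ij)_{r × r} with ω_ij = p₀(1 − p₀) if i = j and ω_ij = 0 if i ≠ j, Ω^1/2 is the square root of Ω: Ω = Ω′^1/2Ω^1/2, and B is constructed from the vectors b_j = (−γ₁, …, −γ_j−1, (1 − γ_j), −γ_j+1, …, −γ_r)′ with γ_j = n_j/n.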

In the above Proposition, take a = a₀; then a₀ + δ(1 − a₀) = 0 and δ(1 − a₀) = −δ/(δ − 1), and we get

Corollary 1. Under conditions of the Proposition,

If a = 1, then
If a = 0, then
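The choice of a₀ can be checked numerically. The sketch below assumes a generic combination with weights a and 1 − a on two estimators with variances σ₁² and σ₂² and covariance σ₁₂, for which the variance-minimizing weight is (σ₂² − σ₁₂)/(σ₁² + σ₂² − 2σ₁₂). This closed form is a standard result for such combinations and is an assumption about the form used here. As a further assumption for illustration only, the second estimator is taken to behave asymptotically as δ times the first, which reproduces a₀ = δ/(δ − 1) and the identity a₀ + δ(1 − a₀) = 0. The function names are hypothetical.

```python
def min_variance_weight(var1, var2, cov12):
    """Weight a on estimator 1 that minimizes Var[a*X1 + (1 - a)*X2]."""
    denom = var1 + var2 - 2 * cov12
    if denom == 0:
        raise ValueError("variance of the combination does not depend on a")
    return (var2 - cov12) / denom


def combined_variance(a, var1, var2, cov12):
    """Variance of a*X1 + (1 - a)*X2."""
    return a * a * var1 + (1 - a) ** 2 * var2 + 2 * a * (1 - a) * cov12


if __name__ == "__main__":
    # Illustrative assumption: X2 behaves like delta * X1 (perfect correlation).
    delta = 2.0
    var1, var2, cov12 = 1.0, delta ** 2, delta
    a0 = min_variance_weight(var1, var2, cov12)
    print(a0, delta / (delta - 1))          # the two values coincide
    print(a0 + delta * (1 - a0))            # identity a0 + delta * (1 - a0) = 0
    print(combined_variance(a0, var1, var2, cov12))
```

In the general case where the two estimators are not perfectly correlated, the minimized variance is positive but still no larger than the variance at a = 0 or a = 1.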

Simulations

Under the Balding-Nichols model [10], the allele frequency in each of r subpopulations conditional on p and F_st is a random deviate from the beta distribution Beta(p(1 − F_st)/F_st, (1 − p)(1 − F_st)/F_st), which has mean p and variance p(1 − p)F_st = σ².
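A minimal Python sketch of this sampling step is given below. It assumes the usual Balding-Nichols parameterization, Beta(p(1 − F_st)/F_st, (1 − p)(1 − F_st)/F_st), and uses only the standard library; the function name `balding_nichols_freqs` and the seed are hypothetical choices for illustration. The printed sample mean and variance can be compared with p and p(1 − p)F_st.

```python
import random
from statistics import fmean, pvariance


def balding_nichols_freqs(p, fst, r, rng):
    """Draw r subpopulation allele frequencies under the Balding-Nichols model."""
    alpha = p * (1 - fst) / fst
    beta = (1 - p) * (1 - fst) / fst
    return [rng.betavariate(alpha, beta) for _ in range(r)]


if __name__ == "__main__":
    rng = random.Random(1)                 # fixed seed, for reproducibility only
    p, fst = 0.2, 0.1
    freqs = balding_nichols_freqs(p, fst, 20000, rng)
    print(fmean(freqs), p)                            # sample mean vs. p
    print(pvariance(freqs), p * (1 - p) * fst)        # sample variance vs. sigma^2
```

With p = 0.2 and F_st = 0.1 the shape parameters are 1.8 and 7.2, giving a theoretical variance of 0.016.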

Simulation 1. This simulation was designed to estimate bias in the worst case scenario of two subpopulations. We evaluated the relationships between the estimated and true F_st and between the estimated F_st and ϑ². First, given the true average allele frequency p for r = 2, F_st reaches its maximum value for p_j values of 0 and 2p. The estimator [4] yields a constrained range of values from 0 to 2p. Therefore, we first assigned the true value for p by drawing a random uniform deviate from the interval (0, 0.5) and the true value for F_st by independently drawing a random uniform deviate from the interval (0, 2p). Conditional on the true values of p and F_st, we randomly generated p_j from the beta distribution. We next assigned the number of individuals per subpopulation n_j = [5, 10, 20, 50, 100, 110]. We then randomly drew alleles from the binomial distribution Bin(2n_j, p_j). We generated 10,000 independent replicate data sets. Based on the above formulae, the four estimators (Wright’s, Weir and Cockerham’s, the modified, and Hudson et al.’s) were calculated. Linear regression models were used to evaluate the relationship between F_st and its estimates and between the estimates and ϑ². We assessed the fit in a linear regression model with the F-test, r², and the root mean squared error (RMSE), which is the square root of the sum of the variance and the squared bias.
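The decomposition of the RMSE into variance and squared bias can be verified directly. The sketch below is illustrative only: it treats a set of replicate estimates of a single true value, uses the population (divide-by-N) variance so that the identity holds exactly, and the function name `bias_variance_rmse` and the example numbers are hypothetical.

```python
from math import sqrt
from statistics import fmean, pvariance


def bias_variance_rmse(estimates, truth):
    """Return (bias, variance, RMSE) of replicate estimates of one true value."""
    bias = fmean(estimates) - truth
    variance = pvariance(estimates)        # population variance (divide by N)
    rmse = sqrt(fmean([(e - truth) ** 2 for e in estimates]))
    return bias, variance, rmse


if __name__ == "__main__":
    estimates = [0.08, 0.12, 0.10, 0.14]   # made-up replicate estimates
    bias, variance, rmse = bias_variance_rmse(estimates, truth=0.10)
    # RMSE^2 equals variance + bias^2 up to floating-point rounding.
    print(rmse ** 2, variance + bias ** 2)
```

Using the sample (divide-by-N − 1) variance instead would make the identity approximate rather than exact.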

Simulation 2. This simulation was designed to evaluate variance under sampling conditions approaching unbiasedness, i.e., large numbers of subpopulations and individuals per subpopulation. We evaluated the relationships between the estimated F_st and the number of subpopulations (r) and between the estimated F_st and the number of individuals per subpopulation (n_j). Conditional on the average allele frequency p, F_st, the number of subpopulations r = [5, 10, 20, 50, 100, 250], and the number of individuals per subpopulation n_j = [5, 10, 20, 50, 100, 250, 1000], we randomly generated r allele frequencies as in Simulation 1 and calculated the four estimators.
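A reduced version of this design can be sketched in Python as follows. The sketch is an assumption-laden illustration, not the simulation code used in this study: it draws subpopulation frequencies from Beta(p(1 − F_st)/F_st, (1 − p)(1 − F_st)/F_st), samples 2n alleles per subpopulation, and evaluates only a simple unweighted plug-in form of Wright’s estimator. All function names, seeds, and settings are hypothetical.

```python
import random
from statistics import fmean, pvariance


def simulate_sample_freqs(p, fst, r, n, rng):
    """Sample allele frequencies for r subpopulations of n diploid individuals."""
    alpha = p * (1 - fst) / fst
    beta = (1 - p) * (1 - fst) / fst
    freqs = []
    for _ in range(r):
        pj = rng.betavariate(alpha, beta)                  # true frequency
        count = sum(rng.random() < pj for _ in range(2 * n))  # Bin(2n, pj) draw
        freqs.append(count / (2 * n))
    return freqs


def wright_plugin(freqs):
    """Unweighted plug-in estimate s^2 / (p_bar (1 - p_bar)); NaN if monomorphic."""
    r = len(freqs)
    p_bar = sum(freqs) / r
    if p_bar in (0.0, 1.0):
        return float("nan")
    s2 = sum((x - p_bar) ** 2 for x in freqs) / r
    return s2 / (p_bar * (1 - p_bar))


def replicate(p, fst, r, n, reps, seed=0):
    """Replicate estimates for one (r, n) setting, dropping monomorphic draws."""
    rng = random.Random(seed)
    out = [wright_plugin(simulate_sample_freqs(p, fst, r, n, rng)) for _ in range(reps)]
    return [x for x in out if x == x]                      # x == x is False for NaN


if __name__ == "__main__":
    for r in (5, 50):
        est = replicate(p=0.2, fst=0.1, r=r, n=20, reps=200, seed=2)
        print(r, fmean(est), pvariance(est))
```

Comparing the printed variances across values of r gives a rough sense of how the spread of the estimate depends on the number of subpopulations at a fixed number of individuals.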

Application to data

We included genotype data from a total of 120 unrelated individuals from the Wolaita (WETH) ethnic group from southern Ethiopia who served as controls in a genome-wide association study of podoconiosis [11]. The Wolaita ethnic group speaks an Omotic language, and comparison with HapMap African populations has shown that it has the closest genetic similarity with the Maasai from Kenya and the lowest genetic similarity with the Yoruba in Nigeria [12]. Genotyping was performed by deCODE Genetics using the Illumina HumanHap 610 Bead Chip, which assays > 620,000 single-nucleotide polymorphisms (SNPs). Of the 551,840 autosomal SNPs in the raw genotype data, we excluded 39,249 SNPs that had a minor allele frequency of < 0.05, 378 that were missing in more than 5% of individuals, and 321 that had a Hardy-Weinberg p-value < 0.001. The remaining 511,892 SNPs were merged with genotype data for ASW (n = 49), CEU (n = 112), CHB (n = 84), CHD (n = 85), GIH (n = 88), JPT (n = 86), LWK (n = 90), MKK (n = 143), MXL (n = 50), TSI (n = 88), and YRI (n = 113) in HapMap phase 3, release 2, which contained 1,440,616 SNPs. A total of 464,642 SNPs were common to both the WETH and HapMap data sets. The F_st estimators were calculated per marker.

Results

Simulation 1: We first compared the estimates with the true F_st for the worst-case scenario of r = 2. For small sample sizes, was the least biased estimator, followed by , , and (Table 1). For large sample sizes, and were comparably good, and and were identically worse (Table 1). None of the four estimators was strongly sensitive to equal vs. unequal sample sizes (Fig 1). When was close to 0, yielded the most negative estimates, followed by and . As expected, all four estimators showed a quadratic relationship with ϑ (Fig 1). With respect to ϑ², by all four measures was the best estimator whereas was the worst estimator (Table 1).

Table 1. Estimated F_st vs. true F_st and ϑ² for two subpopulations.

https://doi.org/10.1371/journal.pone.0135368.t001

Fig 1. The relationship between estimated F_st and the difference in allele frequencies for simulated data.

The x-axis shows the difference of allele frequencies between two subpopulations (left plots) and the squared difference (right plots); the y-axis shows values for Wright’s (top row), Weir and Cockerham’s (second row), the modified (third row), and Hudson et al.’s estimators (bottom row), and the legend indicates the sample sizes n₁ (before hyphen) and n₂ (after hyphen).

https://doi.org/10.1371/journal.pone.0135368.g001

An assessment of bias by the total sample size (n₁ + n₂) for r = 2 is presented in Fig 2. Wright’s estimator was biased and this bias was constant across total sample size, as expected given that this estimator does not account for n_j. In contrast, the other three estimators became less biased as the total sample size increased. When the total sample size exceeded 30, was the least biased estimator; otherwise, was the least biased estimator. For r = 2, the magnitude of bias for all four estimators was constant when the total sample size was at least 60.

Fig 2. Bias as a function of total sample size.

The x-axis shows the total sample size (n₁ + n₂). The y-axis shows the bias of the four estimators (red, blue, green, and orange) for r = 2.

https://doi.org/10.1371/journal.pone.0135368.g002

Simulation 2: Given p = 0.2, F_st = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], n = 1000 individuals, and r = 200 subpopulations, mean values are presented in Table 2. The means for the four estimators equaled the expected values, consistent with all four estimators being asymptotically unbiased. First, we investigated the relationship between F_st and the variance of the four estimators. Given p = 0.2 and F_st < 0.5, had the smallest variance, followed by and (Fig 3). Given p = 0.2 and F_st > 0.5, had the smallest variance, followed by and . Similar results were obtained for p = 0.1, 0.3, 0.4, and 0.5 (S1 Table).

Table 2. Means, Variances, and MSEs of the four estimators in simulation 2.

https://doi.org/10.1371/journal.pone.0135368.t002

Fig 3. Effect of the number of subpopulations on bias.

The x-axis shows the number of subpopulations. The y-axis shows the mean (left) and variance (right) of the four estimators (red, blue, green, and orange), given F_st = 0.5 and average allele frequency p = 0.2. The top plot represents 5 individuals per subpopulation and the bottom plot represents 1000 individuals per subpopulation.

https://doi.org/10.1371/journal.pone.0135368.g003

Second, we investigated how the number of subpopulations and the number of individuals per subpopulation affected bias. When the number of subpopulations was approximately 30, no matter the number of individuals per subpopulation, bias was stable (Fig 3). For r > 30 and small n_j, all four estimators were biased, with the order of . For r > 30 and large n_j, all four estimators were unbiased. For n_j > 30, all four estimators were stable and bias decreased as r increased, with the best estimator and the worst estimator (Fig 4).

Fig 4. Effect of the number of individuals per subpopulation on bias.

The x-axis shows the number of individuals per subpopulation. The y-axis shows the mean (left) and variance (right) of the four estimators (red, blue, green, and orange), given F_st = 0.5 and an average allele frequency p = 0.2. From top to bottom, the plots represent the number of subpopulations r = 10, 20, and 40, respectively.

https://doi.org/10.1371/journal.pone.0135368.g004

Application to Data: The means and variances of the F_st estimates between the WETH and 11 samples in HapMap 3 are presented in Table 3. The WETH sample was closest to the MKK sample, consistent with shared Cushitic and Nilo-Saharan ancestry [13]. and yielded the same order for all pairs of relationships and all four estimators yielded the same order of relationships for the five HapMap samples closest to the WETH sample. The order of the means was < < < . was approximately 30% larger than and approximately 20% smaller than , which has corresponding effects on divergence time estimates. Given that and are less downward biased than for these sample sizes (Fig 2), the larger values are more likely to be correct.

Table 3. F_st estimates between WETH and HapMap 3 samples.

https://doi.org/10.1371/journal.pone.0135368.t003

Discussion

F_st is directly related to the variance in allele frequencies among subpopulations. The dependence of F_st on allele frequencies and genetic diversity has been observed [14]. In our study, an approximately linear relationship of the four estimators with the squared difference of allele frequencies (ϑ²) was observed, as expected. By simulation, we found that all four estimators were unbiased for large numbers of subpopulations and individuals per subpopulation but that no one estimator was uniformly better than the others. For F_st < 0.5, had smaller variances and MSE values. For F_st > 0.5, had smaller variances and MSE values. For F_st ≈ 0.5, , , and had similar variance and MSE values.

The numbers of individuals and markers have been reported to affect F_st estimation [5]. We found that the number of subpopulations was more important than the number of individuals per subpopulation. Estimation of F_st, both in terms of means and variances, stabilized with approximately 30 subpopulations, regardless of the number of individuals per subpopulation. This behavior occurs because there are r estimates p̂_j with which to estimate p and σ². Estimation was biased for r = 2 and improved as r increased, according to the Central Limit Theorem. Estimation was biased for n_j < 30 and improved as n_j increased (except for Wright’s estimator), also according to the Central Limit Theorem. Our proposed estimator is a minimum variance combination of Wright’s and Weir and Cockerham’s estimators and is less biased than Weir and Cockerham’s estimator for small sample sizes and less biased than Wright’s estimator for large sample sizes.

Conclusion

A modified F_st estimator is proposed, which combines Wright’s and Weir and Cockerham’s estimators. It splits the difference in biases present in Wright’s and Weir and Cockerham’s estimators. We propose the routine use of this new and improved estimator of F_st as a way to reduce the biases and limitations of the classical estimators. We demonstrated that, in order to estimate F_st accurately, at least 30 subpopulations and 30 individuals per subpopulation are required.

Appendix

Proof of the Proposition

As , is asymptotically a chi-squared random variable, , , and Thus and

Let and be the asymptotic variance of , then the asymptotic variance of is , and the asymptotic covariance of is . Now we have

If p₁ = ⋯ = p_r, . Note the p̂_j’s are independent, and Let p̂ = (p̂₁, …, p̂_r)′, p₀ = (p₀, …, p₀)′, γ_j = n_j/n, and b_j = (−γ₁, …, −γ_j−1, (1 − γ_j), −γ_j+1, …, −γ_r)′, then for j = 1, …, r, and so by the Central Limit Theorem, where , Ω = (ω_ij)_r×r with ω_ij = p₀(1 − p₀) if i = j and ω_ij = 0 if i ≠ j.

Now we have, with ,

Let Ω^1/2 be the square root of Ω: Ω = Ω′^1/2Ω^1/2, λ₁, …, λ_r be all the eigenvalues of Ω′^1/2 BΩ^1/2, and Λ = diag(λ₁, …, λ_r); then there is an orthonormal matrix Q such that Ω′^1/2 BΩ^1/2 = Q′ΛQ, and so the limiting quadratic form reduces to a weighted sum of χ_j² terms with weights λ_j, where the χ_j²’s are independent chi-squared random variables with one degree of freedom. This gives the desired result.

Lastly, we prove δ ≥ 1. In fact, It is known that for r = 1 or 2, with “=” if and only if n₁ = ⋯ = n_r. Now we use induction to prove this is true for all integers r. In fact, suppose the above conclusion is true for some integer r ≥ 2, then for integer r + 1, and Since by assumption , with “=” if and only if n₁ = ⋯ = n_r+1, since , with “=” if and only if n_j = n_r+1.

This gives δ ≥ 1 with “=” if and only if n₁ = ⋯ = n_r.

Supporting Information

S1 Table. Means, Variances, and MSEs of the four estimators in simulation 2.

https://doi.org/10.1371/journal.pone.0135368.s001

(PDF)

Acknowledgments

We thank Professor Gail Davey for allowing us to use the Wolaita GWAS dataset in this study.

Author Contributions

Conceived and designed the experiments: GC AY DS CNR. Analyzed the data: GC AY DS FT JZ AB YZ CW AA. Contributed reagents/materials/analysis tools: MJN. Wrote the paper: GC AY DS FT AB AA CNR.

References

1. Wright S. Genetical structure of populations. Nature 1950; 166(4215): 247–249.
2. Wright S. The genetical structure of populations. Ann Eugen 1951; 15(4): 323–354. pmid:24540312
3. Cockerham CC. Variance of Gene Frequencies. Evolution 1969; 23(1): 72–84.
4. Weir BS, Cockerham CC. Estimating F-Statistics for the Analysis of Population Structure. Evolution 1984; 38(6): 1358–1370.
5. Willing E-M, Dreyer C, van Oosterhout C. Estimates of Genetic Differentiation Measured by F_st Do Not Necessarily Require Large Sample Sizes When Using Many SNP Markers. PLOS ONE 2012; 7(8): e42649. pmid:22905157
6. Waples RS. Separating the wheat from the chaff: patterns of genetic differentiation in high gene flow species. J Hered 1998; 89(5): 438–450.
7. Hudson RR, Slatkin M, Maddison WP. Estimation of Levels of Gene Flow from DNA Sequence Data. Genetics 1992; 132(2): 583–589. pmid:1427045
8. Bhatia G, Patterson N, Sankararaman S, Price AL. Estimating and interpreting F_ST: The impact of rare variants. Genome Res 2013; 23: 1514–1521. pmid:23861382
9. Rosenberg NA, Li LM, Ward R, Pritchard JK. Informativeness of Genetic Markers for Inference of Ancestry. Am J Hum Genet 2003; 73(6): 1402–1422. pmid:14631557
10. Balding DJ, Nichols RA. A method for quantifying differentiation between populations at multi-allelic loci and its implications for identity and paternity. Genetica 1995; 96: 3–11.
11. Tekola Ayele F, Adeyemo A, Finan C, Hailu E, Sinnott P, Burlinson ND, et al. HLA class II locus and susceptibility to podoconiosis. N Engl J Med 2012; 366(13): 1200–1208. pmid:22455414
12. Tekola-Ayele F, Adeyemo A, Aseffa A, Hailu E, Finan C, Davey G, et al. Clinical and pharmacogenomic implications of genetic variation in a Southern Ethiopian population. Pharmacogenomics J 2014; 15(1): 101–108. pmid:25069476
13. Shriner D, Tekola-Ayele F, Adeyemo A, Rotimi CN. Genome-wide genotype and sequence-based reconstruction of the 140,000 year history of modern human ancestry. Sci Rep 2014; 4: 6055. pmid:25116736
14. Jakobsson M, Edge MD, Rosenberg NA. The Relationship between F_st and the Frequency of the Most Frequent Allele. Genetics 2013; 193(2): 513–528.
