Genetic Sampling Error of Distance (δμ)² and Variation in Mutation Rate Among Microsatellite Loci

Zhivotovsky, Lev A.; Goldstein, David B.; Feldman, Marcus W.

doi:10.1093/oxfordjournals.molbev.a003759

Abstract

An expression is obtained for the time-dependent variance of the microsatellite genetic distance (δμ)² when the mutation rate is allowed to vary randomly among loci. An estimator is presented for the coefficient of variation, C_w, in the mutation rate. Estimated values of C_w from genetic distances between African and non-African populations were less than 100%. Caveats to this conclusion are discussed.

Introduction

In order to estimate the time of divergence of two contemporary populations from a single ancestral lineage, a genetic distance that is a known function of this time is desirable. When the populations are assayed for microsatellite polymorphism, the genetic distance (δμ)², based on the average squared differences in the sizes of alleles sampled in pairs, one from each population, has an expectation that increases linearly with time at a rate equal to twice the mutation rate in the case of one-step mutations (Goldstein et al. 1995 ). For multistep mutations, the rate of increase is twice the effective mutation rate, which is the product of the mutation rate and the variance of changes in allele size due to mutation (Zhivotovsky and Feldman 1995 ).

The usual way to analyze a set of microsatellite loci from individuals sampled in two populations is to compute (δμ)² for each locus and average across loci. If the mutation rate (or the effective mutation rate) is the same at all loci, and is known, then simple division gives an estimate of the expected time since separation of the populations. Variation across loci in the mutation rate affects the variance of (δμ)² (but not its expectation).

The evolutionary process involves genetic sampling error due to random genetic drift and mutation, and thus the variance among the possible evolutionary replicates of the distance is an important issue. Zhivotovsky and Feldman (1995) implied that among replicates, the distance follows a chi-square distribution. In fact, the variance of the distance does asymptotically satisfy the most important property of the chi-square distribution, namely, that its variance approaches twice the square of its expectation as time increases (Zhivotovsky, Feldman, and Grishechkin 1997 ), but the actual distribution is not exactly chi-square.

From their analysis of properties of (δμ)² in a study of more than 200 human microsatellite loci, Cooper et al. (1999) found strong evidence for variation among loci in the mutation rate. Our purpose in this paper is to obtain an analytical expression for the variance of (δμ)² when the mutation rate is variable. An important application of this expression is the estimation of the extent of variation in mutation rate among microsatellite loci. Our analysis also allows us to compute the time-dependent dynamics of the variance of (δμ)² and to assess how sensitive these dynamics are to the assumption of a mutation rate that is constant across loci.

Results

Consider a randomly mating diploid population of constant size N with nonoverlapping generations and an autosomal microsatellite locus undergoing multiple-step mutation with mutation rate μ and, possibly, constant mutation bias, as measured by the difference between the mean size of mutations and the size of the parental allele. (There is no bias if the difference is zero.) Let η⁽²⁾_m be the expectation of the square of mutational gains and losses (Di Rienzo et al. 1998), which in the case of no average mutation bias becomes the variance in mutational changes, σ²_m (Slatkin 1995). We call w = μη⁽²⁾_m the effective mutation rate. Also, introduce k = μη⁽⁴⁾_m, where η⁽⁴⁾_m is the fourth noncentral moment of mutational changes in repeat score; w = k = μ in the case of one-step symmetric mutation. Assume for now that the mutation parameters do not vary between loci.

The within-population variation at a microsatellite locus can be characterized by the mean allele size (r), the variance of allele size (the second central moment) (V), and the unnormalized kurtosis (the fourth central moment) (K) (Zhivotovsky and Feldman 1995 ). The between-population variation can be measured by analogs of F_ST (Slatkin 1995 ; see also Michalakis and Excoffier 1996 ; Rousset 1996 ; Feldman, Kumm, and Pritchard 1999 ). For two populations, the (δμ)² distance is defined as the squared difference of the mean values of their repeat scores: (δμ)² = (r₁ − r₂)² (Goldstein et al. 1995 ).
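The definition (δμ)² = (r₁ − r₂)² can be made concrete with a short Python sketch. The function name `delta_mu_squared` and the repeat counts are hypothetical, chosen only for illustration, and the sketch applies no correction for finite sample size.

```python
from statistics import fmean


def delta_mu_squared(sizes_pop1, sizes_pop2):
    """Return (r1 - r2)^2 for one locus.

    sizes_pop1, sizes_pop2: repeat scores of the alleles sampled
    from each population (hypothetical input format).
    """
    if not sizes_pop1 or not sizes_pop2:
        raise ValueError("each population needs at least one sampled allele")
    r1 = fmean(sizes_pop1)  # mean repeat score in population 1
    r2 = fmean(sizes_pop2)  # mean repeat score in population 2
    return (r1 - r2) ** 2


# Hypothetical repeat counts at a single locus.
pop1 = [10, 11, 12, 11]
pop2 = [13, 14, 13, 14]
d = delta_mu_squared(pop1, pop2)
```

Here the two means are 11 and 13.5, so the per-locus distance is 2.5² = 6.25.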

Suppose that two populations diverged from an ancestral population at initial time t = 0 at which the profile of allele frequencies was represented by 𝒫₀ (i.e., 𝒫₀ produced specific values of the variance V₀, unnormalized kurtosis K₀, etc.) and then evolved independently under random genetic drift and multistep mutation. Given 𝒫₀, ℰ_r{S | 𝒫₀} is an expectation operator that averages the statistic S over all possible realizations (replicates) of the drift-mutation process. Averaging with ℰ_r may then be followed by the operator ℰ₀, which averages over all possible genetic structures 𝒫₀ of the unknown ancestral population, i.e., values of V₀, K₀, etc. Thus, ℰ_r averages over loci having identical mutation parameters and identical starting conditions, and ℰ₀ averages over the different initial conditions. We assume that prior to divergence, the ancestral population had attained mutation-drift equilibrium, where the expectations of the within-locus variances, the between-locus variance of variances, and the unnormalized within-locus kurtosis are

respectively, with κ = (2N − 1)k (Zhivotovsky and Feldman 1995); k is defined in equation (6) of the appendix. Recall that expressions (1) are valid if mutation parameters do not vary among loci (see eq. 12 if they vary among loci).

After τ generations of divergence, the expected distance, ℰ₀ℰ_r((δμ)²), equals 2wτ (Zhivotovsky and Feldman 1995; see also Feldman, Kumm, and Pritchard 1999; Zhivotovsky 2001), which becomes 2μτ with one-step symmetric mutation (Goldstein et al. 1995).
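Because the expected distance is 2wτ, a moment estimate of the divergence time follows by averaging (δμ)² over loci and dividing by 2w. The sketch below assumes that w is known and identical across loci; the function name `divergence_time` and the per-locus values are hypothetical.

```python
from statistics import fmean


def divergence_time(distances, w):
    """Estimate tau (in generations) as mean((delta mu)^2) / (2 w).

    distances: per-locus (delta mu)^2 values.
    w: effective mutation rate, assumed known and equal at all loci.
    """
    if w <= 0:
        raise ValueError("effective mutation rate must be positive")
    return fmean(distances) / (2.0 * w)


# Hypothetical per-locus distances with an assumed w = 0.001.
per_locus = [4.0, 9.0, 2.0, 5.0]
tau_hat = divergence_time(per_locus, w=0.001)
```

With a mean distance of 5 and w = 0.001, the estimate is about 2,500 generations.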

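The square of the genetic sampling error of a statistic is its variance over replicates (Weir 1996). Therefore, the within-locus variance of (δμ)² is defined as

\[Var_{W} = \mathcal{E}_{0}\{\mathcal{E}_{r}\{[(\delta\mu)^{2} - \mathcal{E}_{r}\{(\delta\mu)^{2} \mid \mathcal{P}_{0}\}]^{2} \mid \mathcal{P}_{0}\}\}.\]

The between-locus variance,

\[Var_{B} = \mathcal{E}_{0}\{[\mathcal{E}_{r}\{(\delta\mu)^{2} \mid \mathcal{P}_{0}\} - \mathcal{E}_{0}\mathcal{E}_{r}\{(\delta\mu)^{2}\}]^{2}\},\]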

is due to variation in initial conditions 𝒫₀ and together with the within-locus variance makes up the total variance in the case of no variation in mutation rate among loci, Var_T = Var_W + Var_B. Var_T is in fact the quantity of interest in assessing the reliability of (δμ)² as a distance measure in the presence of variation across loci in the mutation rate. Analytical expressions for both variances are given in the appendix. It can be shown that the total variance is greater than 8w²τ² near the beginning of the process, although this value is ultimately approached, as was shown by Zhivotovsky, Feldman, and Grishechkin (1997) and has been observed in numerical simulations.

Suppose the effective mutation rate varies across loci with mean w̄, variance σ²_w, and k̄, the mean value of k over loci. It is proved in the appendix that

assuming mutation-drift equilibrium, which entails that the expected distance is 2w̄τ at generation τ. As time increases, Var_T asymptotically approaches 8w̄²τ² × [1 + 3σ²_w/(2w̄²) + 𝒪(1/τ)]. In order to evaluate the accuracy of equation (2), we carried out a simulation using coalescent techniques following the algorithm of Hudson (1990), modified to include the stepwise mutation process with and without variation among loci in the mutation rate. In figure 1, we see that the simulated data produce values for Var_T that are close to those expected from equation (2).

Rewrite equation (2) as

\[Var_{T} = \mathcal{A} + \mathcal{B}\sigma^{2}_{w}, \quad (3)\]

where 𝒜 is the expression in the first three lines of the right-hand side of equation (2), and ℬ is the multiplier of σ²_w in the third line of equation (2). Then, given the observed variance in genetic distances across loci, Var_obs, the variance in mutation rates can be estimated as

\[\hat{\sigma}^{2}_{w} = \frac{Var_{obs} - \mathcal{A}}{\mathcal{B}}. \quad (4)\]

As time increases, σ²_w and the coefficient of variation of w, C_w = σ_w/w̄, asymptotically satisfy

where C_(δμ)² is the coefficient of variation of (δμ)². Expression (5) can also provide an upper estimate of C²_w if the asymptote has not been approached (fig. 2).
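The asymptotic link between C_w and C_(δμ)² can be sketched by dividing the long-time limit of Var_T, 8w̄²τ²[1 + 3σ²_w/(2w̄²)], by the square of the expected distance, 2w̄τ. The lines below use only that stated limit and drop the 𝒪(1/τ) terms, so they should be read as a worked approximation rather than as the full time-dependent expression.

\[C^{2}_{(\delta\mu)^{2}} = \frac{Var_{T}}{(2\bar{w}\tau)^{2}} \approx \frac{8\bar{w}^{2}\tau^{2}\,[1 + 3\sigma^{2}_{w}/(2\bar{w}^{2})]}{4\bar{w}^{2}\tau^{2}} = 2 + 3C^{2}_{w}\]

\[C^{2}_{w} \approx \frac{C^{2}_{(\delta\mu)^{2}} - 2}{3}, \qquad \sigma^{2}_{w} \approx \frac{Var_{T} - 8\bar{w}^{2}\tau^{2}}{12\tau^{2}}\]

The first relation agrees with the coefficient of variation [(2 + 3C²_w)/L]^½ quoted in the Discussion for a distance averaged over L loci.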

Discussion

We can use expression (5) to estimate C_w from data. Table 1 shows the estimates for different sets of di- and tetranucleotide loci based on genetic distances between African and non-African human populations. Two of three sets show substantial values of C_w. However, probably not more than 10,000 generations have passed since the divergence of Africans and non-Africans, and thus the values of C_w in table 1 are overestimated (see fig. 2). Therefore, on average, variation in mutation rate does not seem to be very extensive, although it cannot be excluded that some microsatellite loci show much higher or lower mutation rates than an average locus. For example, Forster et al. (2000) found that the average mutation rate at the Y-chromosome loci could be taken as 0.26 × 10⁻³ if locus DYS392 was omitted because of its unusual behavior; otherwise, it was about 10 times as high. However, we should emphasize that our findings concern the effective mutation rate, i.e., the product of mutation rate and the variance in the number of repeats due to mutation, while Forster et al. (2000) considered only the mutation rate.
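As an illustration of how such an estimate might be computed, the sketch below assumes the asymptotic relation C²_(δμ)² ≈ 2 + 3C²_w, which is implied by the long-time limit of Var_T and by the accuracy formula [(2 + 3C²_w)/L]^½. The function name `estimate_cw`, the use of the sample variance, and the clipping of negative values to zero are assumptions of this sketch, and the toy distances are not real data.

```python
import math
from statistics import fmean, variance


def estimate_cw(distances):
    """Asymptotic estimate of C_w from per-locus (delta mu)^2 values.

    Uses C_w^2 ~ (CV^2 - 2) / 3, where CV is the coefficient of
    variation of the per-locus distances. Negative estimates of
    C_w^2 are clipped to zero (a choice made for this sketch).
    """
    mean_d = fmean(distances)
    if mean_d <= 0:
        raise ValueError("mean distance must be positive")
    cv_sq = variance(distances) / mean_d ** 2  # sample variance / mean^2
    cw_sq = (cv_sq - 2.0) / 3.0
    return math.sqrt(max(cw_sq, 0.0))


# Toy per-locus distances (illustrative only).
toy = [0.0, 0.0, 0.0, 0.0, 0.0, 6.0]
cw_hat = estimate_cw(toy)
```

Because the relation holds only asymptotically, values obtained this way before the asymptote is reached are upper estimates, as noted for expression (5).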

Two caveats should be noted in connection with the above remarks on the size of C_w. First, our estimates were made under the assumption of constant population size, which is surely erroneous for humans in the last 4,000 generations. Second, since the variance of C_w is likely to be large over this time range and with the number of loci considered here, our confidence that C_w is indeed small cannot be great.

Earlier, Zhivotovsky and Feldman (1995) pointed out that hundreds of loci are required to estimate the genetic distance (δμ)² with reasonable accuracy, and with variable mutation rates, the number of loci must be even greater. Indeed, as follows from equation (5), the coefficient of variation of genetic distance (δμ)² averaged over L loci, which can be used as a measure of the relative accuracy (R) of estimation of the genetic distance, is approximated by [(2 + 3C²_w)/L]^½, so that L = (2 + 3C²_w)/R². For instance, if the relative accuracy is 10%, i.e., R = 0.1, then 200 loci with identical mutation rates would be needed, whereas 500 loci are required to estimate genetic distance with the same precision if the relative variation in mutation rates is 100%, i.e., if C_w = 1. As an example, using combined data on 131 di-, tri-, and tetranucleotide microsatellite loci, Zhivotovsky (2001, table 1) estimated approximately 14% for the accuracy of genetic distances between African and non-African populations. It should be noted, however, that in the analyses of Jin et al. (2000), (δμ)² was not able to reliably distinguish continental groups in trees made using the 28 loci of Bowcock et al. (1994), although its performance was comparable with other distance measures with 64 microsatellite loci. Again, this reinforces our view that several hundred loci would be needed to produce satisfactory estimates of (δμ)² and C_w.
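The arithmetic behind these locus counts can be reproduced with a few lines of Python. The function names `loci_needed` and `relative_accuracy` are hypothetical; the formulas are the ones stated above, L = (2 + 3C²_w)/R² and R = [(2 + 3C²_w)/L]^½.

```python
import math


def loci_needed(rel_accuracy, cw=0.0):
    """Number of loci L = (2 + 3 C_w^2) / R^2, rounded up."""
    if rel_accuracy <= 0:
        raise ValueError("relative accuracy must be positive")
    raw = (2.0 + 3.0 * cw ** 2) / rel_accuracy ** 2
    # Round first so floating-point noise cannot push an exact
    # integer answer up by one when taking the ceiling.
    return math.ceil(round(raw, 9))


def relative_accuracy(n_loci, cw=0.0):
    """Relative accuracy R = sqrt((2 + 3 C_w^2) / L)."""
    if n_loci <= 0:
        raise ValueError("number of loci must be positive")
    return math.sqrt((2.0 + 3.0 * cw ** 2) / n_loci)


same_rates = loci_needed(0.1)          # identical mutation rates
varying_rates = loci_needed(0.1, 1.0)  # C_w = 1
```

These calls return 200 and 500, the two values given in the text for R = 0.1.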

It should be strongly emphasized that expression (2) , as well as expressions (4) and (5) , derived from it, are only valid for reproductively isolated populations of constant size at mutation-drift equilibrium. Otherwise, if we consider a process of subdivision of a parental population into two populations that subsequently evolve under mutation and genetic drift, the genetic distance (δμ)² becomes a nonlinear function of time; in particular, it underestimates the divergence time if the two populations are growing in size and/or are connected by gene flow (Zhivotovsky 2001 ). Therefore, our estimates in table 1 have to be regarded with caution.

Appendix

A Case of Constant Mutation Bias

We permit a constant bias in mutation; that is, the expected average repeat score in progeny may be larger (or smaller) than the size of a parental allele by a constant value that is independent of the parental allele. As noted by Di Rienzo et al. (1998) and Zhivotovsky (2001), if η⁽²⁾_m and η⁽⁴⁾_m are the second and fourth noncentral moments of mutational changes in repeat score, equations (3)–(8) of Zhivotovsky and Feldman (1995), as well as the expectation of (δμ)² (namely, 2wτ), remain valid with the parameters

\[w = \mu\eta^{(2)}_{m}, \quad k = \mu\eta^{(4)}_{m}, \quad (6)\]

neglecting terms of order μ² and smaller. In the case of no mutation bias, η⁽²⁾_m is σ²_m, the variance in mutation changes. In particular, the relationships (eq. 1 ) that were obtained by Zhivotovsky and Feldman (1995) under the assumption of no mutation bias remain valid in the case of constant bias if the moments are taken with respect to zero instead of with respect to the mean (Zhivotovsky 2001 ). Expressions for V̂ and Var(V) were extended to the case of constant mutation bias by Kimmel and Chakraborty (1996) and Di Rienzo et al. (1998) .

The Within-Locus Variance of (δμ)²

Using the expression for the within-locus variance Var{Δ(t)} (Zhivotovsky, Feldman, and Grishechkin 1997 , p. 932, right column), which remains valid with the moments taken with respect to zero, and taking the limit as the regression coefficient β → +0, we obtain

(Note that in Zhivotovsky, Feldman, and Grishechkin [1997, p. 932, right column], the expressions Var{Δ(τ)} and (ℰ{Δ(τ)})² in the above notation are ℰ₀{Var_r{Δ(τ)}} and ℰ₀{(ℰ_r{Δ(τ)})²}, respectively, and the symbol k should read k_m.) Zhivotovsky, Feldman, and Grishechkin (1997) showed that this variance is approximated as

\[Var_{W} \approx \mathcal{E}_{0}\{2[\mathcal{E}_{r}\{(\delta\mu)^{2} \mid \mathcal{P}_{0}\}]^{2}\} \quad (8)\]

when τ/2N either is small or increases without bound. (Earlier, Zhivotovsky and Feldman [1995, corollaries 1 and 2] had suggested that the variance was twice the squared distance expected at equilibrium, namely, 2(2wτ)² = 8w²τ². However, the latter is 2[ℰ₀{ℰ_r{(δμ)²}}]², which is not the right-hand side of eq. 8.)

The Between-Locus Variance of (δμ)²

From equations (4) and (14) of Zhivotovsky and Feldman (1995), the changes in the expected values of the distance and the variance are ℰ_r((δμ)²(τ + 1) | 𝒫₀) − ℰ_r((δμ)²(τ) | 𝒫₀) ≈ (1/N)ℰ_r(V(τ) | 𝒫₀) and ℰ_r(V(τ + 1) | 𝒫₀) − ℰ_r(V(τ) | 𝒫₀) ≈ w − (1/2N)ℰ_r(V(τ) | 𝒫₀), neglecting terms of smaller order than 1/N and recalling that w is defined by equation (6). Replacing the differences in the left-hand sides of these approximations with corresponding differentials and solving the resulting linear differential equations, we have

As follows from the definition of the between-locus variance, Var_B is equal to the expectation ℰ₀ of the square of 2(V₀ − V̂)(1 − e^{−τ/2N}). Then, using equation (1), we obtain
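The transient term 2(V₀ − V̂)(1 − e^{−τ/2N}) can be checked numerically. The sketch below assumes the per-generation updates D(τ + 1) − D(τ) = V(τ)/N and V(τ + 1) − V(τ) = w − V(τ)/(2N), with V̂ ≈ 2Nw and D(0) = 0, and compares the iterated distance with the closed form 2wτ + 2(V₀ − V̂)(1 − e^{−τ/2N}). The function names are hypothetical, and the parameter values follow figure 2 (w = 0.001, 2N = 5,000).

```python
import math


def expected_distance_closed_form(tau, w, n, v0):
    """2 w tau + 2 (V0 - V_hat) (1 - exp(-tau / 2N)), with V_hat = 2 N w."""
    v_hat = 2.0 * n * w
    return 2.0 * w * tau + 2.0 * (v0 - v_hat) * (1.0 - math.exp(-tau / (2.0 * n)))


def expected_distance_iterated(tau, w, n, v0):
    """Iterate the assumed per-generation updates for tau generations."""
    v, d = v0, 0.0
    for _ in range(tau):
        d += v / n              # drift in both populations adds V / N
        v += w - v / (2.0 * n)  # mutation adds w, drift removes V / (2N)
    return d


w, n, tau = 0.001, 2500, 10_000
v_hat = 2.0 * n * w
d_equil = expected_distance_closed_form(tau, w, n, v0=v_hat)  # ancestor at equilibrium
d_zero = expected_distance_closed_form(tau, w, n, v0=0.0)     # monomorphic ancestor
d_iter = expected_distance_iterated(tau, w, n, v0=0.0)
```

When V₀ = V̂ the transient vanishes and the distance is exactly 2wτ; an ancestral variance below V̂ makes the distance lag behind that line, which is why averaging over 𝒫₀ contributes a between-locus variance.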

Variation in Mutation Rate

The well-known partitioning of conditional variance (e.g., Rice 1995 ) can be extended to the case of three random values: for an arbitrary function f(x, y, z), its variance, ℰ_zℰ_yℰ_x(f − ℰ_zℰ_yℰ_xf)², is

Now, consider ℰ_x, ℰ_y, and ℰ_z, respectively, as ℰ_r, ℰ₀, and the expectation operator averaging over varying values of the mutation parameters, ℰ_m, and take the distance (δμ)² as function f. The first two terms in the right-hand side of equation (11) represent the expectation ℰ_m of Var_W in equation (7) and Var_B in equation (10) , respectively. The third term is Var_m(ℰ₀((δμ)²)), the variance of the expected distance in equation (9) with respect to mutation parameters. Taking the expectations and summing in equation (11) , we obtain equation (2) .
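For orientation, the three-level partition invoked here is the standard law of total variance applied twice. Written in the notation of this section, it reads as follows; this is a sketch of the textbook identity, not a transcription of equation (11).

\[Var(f) = \mathcal{E}_{z}\mathcal{E}_{y}\{Var_{x}(f \mid y, z)\} + \mathcal{E}_{z}\{Var_{y}(\mathcal{E}_{x}\{f \mid y, z\} \mid z)\} + Var_{z}(\mathcal{E}_{y}\mathcal{E}_{x}\{f \mid z\})\]

With x, y, and z read as replicates, initial conditions, and mutation parameters, the three terms are ℰ_m(Var_W), ℰ_m(Var_B), and Var_m(ℰ₀ℰ_r((δμ)²)). Since the expected distance for given mutation parameters is 2wτ, the last term equals

\[Var_{m}(2w\tau) = 4\tau^{2}\sigma^{2}_{w}.\]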

Additionally, note that at mutation-drift equilibrium, the within-locus variance, the unnormalized within-locus kurtosis, and the between-locus variance of variances in the case of varying mutation rate become (using the same notation as in eq. 1 )

Di Rienzo et al. (1998) obtained the same expression for Var(V).

Keith Crandall, Reviewing Editor

Keywords: microsatellite loci mutation rate genetic distance

Address for correspondence and reprints: Marcus W. Feldman, Department of Biological Sciences, Stanford University, Stanford, California 94305. marc@charles.stanford.edu .

Table 1 Estimates of the Coefficient of Variation Among Loci of the Effective Mutation Rate, C_w, Based on Genetic Distances Between African and Non-African Populations for Different Sets of Data


Fig. 1.—To evaluate the accuracy of equation (2) , we ran coalescent simulations following the algorithm of Hudson (1990) including the stepwise mutation process with and without rate variation. Separation times are given in units of 2N. In the case of a constant mutation rate, 𝛉 is set to 3.5. For rate variation, the average 𝛉 is again 3.5, but the thetas are now drawn from a gamma distribution with a variance of 10. Each of 200 replications involves 30 loci and 30 sampled alleles at each locus. White triangles and circles represent analytical results for no variation in mutation and for variation in mutation among loci, respectively. Black triangles and circles are the corresponding simulated values
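The rate-variation setting in this caption (gamma-distributed θ with mean 3.5 and variance 10) fixes the gamma shape and scale through mean = shape × scale and variance = shape × scale². The sketch below shows one way to draw per-locus θ values under that reading; the function names and the seed are hypothetical, and this is not the simulation code used for the figure.

```python
import random


def gamma_shape_scale(mean, var):
    """Shape and scale of a gamma distribution with the given mean and variance."""
    if mean <= 0 or var <= 0:
        raise ValueError("mean and variance must be positive")
    shape = mean ** 2 / var
    scale = var / mean
    return shape, scale


def draw_thetas(n_loci, mean=3.5, var=10.0, seed=None):
    """Draw one theta per locus from the gamma distribution."""
    rng = random.Random(seed)
    shape, scale = gamma_shape_scale(mean, var)
    # random.gammavariate takes (alpha, beta) = (shape, scale).
    return [rng.gammavariate(shape, scale) for _ in range(n_loci)]


shape, scale = gamma_shape_scale(3.5, 10.0)
thetas = draw_thetas(30, seed=1)  # 30 loci, as in each replication
```

The resulting shape is 1.225 and the scale is about 2.857, which corresponds to a coefficient of variation of √10/3.5 ≈ 0.9 among loci.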


Fig. 2.—Dynamics of the coefficient of variation (%) of the effective mutation rate C_w (equation 5 ). The parameters are w = 0.001, σ_w = 0.0005 (hence, C_w = 0.5), and 2N = 5,000. Mutation is single-step and symmetric

We are indebted to two anonymous reviewers for helpful comments and constructive suggestions. This research was supported in part by the National Institutes of Health (grants GM 28016, GM 28428, and 1 R03 TW005540), the Russian Foundation of Basic Research (grants 01-04-48441 and 01-07-90197), and the Russian State Program “Human Genome” (grant 26/01).

References

Bowcock A. M., A. Ruiz-Linares, J. Tomfohrde, E. Minch, J. R. Kidd, L. L. Cavalli-Sforza. 1994. High resolution of human evolutionary trees with polymorphic microsatellites. Nature 368:455–457.

Cooper G., W. Amos, R. Bellamy, M. R. Siddiqui, A. Frodsham, A. V. S. Hill, D. C. Rubinsztein. 1999. An empirical exploration of the (δμ)² genetic distance for 213 human microsatellite markers. Am. J. Hum. Genet. 65:1125–1133.

Di Rienzo A., P. Donnelly, C. Toomajian, B. Sisk, A. Hill, M. L. Petzl-Erler, G. K. Haines, D. H. Barch. 1998. Heterogeneity of microsatellite mutations within and between loci, and implications for human demographic histories. Genetics 148:1269–1281.

Feldman M. W., J. Kumm, J. K. Pritchard. 1999. Mutation and migration in models of microsatellite evolution. Pp. 98–115 in D. G. Goldstein and C. Schlotterer, eds. Microsatellites: evolution and applications. Oxford University Press, Oxford.

Forster P., A. Rohl, P. L. Lunnermann, C. Brinkmann, T. Zerjal, C. Tyler-Smith, B. Brinkmann. 2000. A short tandem repeat-based phylogeny for the human Y chromosome. Am. J. Hum. Genet. 67:182–196.

Goldstein D. B., A. R. Linares, L. L. Cavalli-Sforza, M. W. Feldman. 1995. Genetic absolute dating based on microsatellites and the origin of modern humans. Proc. Natl. Acad. Sci. USA 92:6723–6727.

Hudson R. R. 1990. Gene genealogies and the coalescent process. Oxf. Surv. Evol. Biol. 7:1–45.

Jin L., M. L. Baskett, L. L. Cavalli-Sforza, L. A. Zhivotovsky, M. W. Feldman, N. A. Rosenberg. 2000. Microsatellite evolution in modern humans: a comparison of two data sets from the same populations. Ann. Hum. Genet. 64:117–134.

Jorde L. B., A. R. Rogers, M. Bamshad, W. S. Watkins, P. Krakowiak, S. Sung, J. Kere, H. Harpending. 1997. Microsatellite diversity and the demographic history of modern humans. Proc. Natl. Acad. Sci. USA 94:3100–3103.

Kimmel M., R. Chakraborty. 1996. Measures of variation at DNA repeat loci under a general stepwise mutation model. Theor. Popul. Biol. 50:345–367.

Michalakis Y., L. A. Excoffier. 1996. Generic estimation of population subdivision using distances between alleles with special reference for microsatellite loci. Genetics 142:1061–1064.

Rice J. A. 1995. Mathematical statistics and data analysis. 2nd edition. Duxbury Press, Belmont, Calif.

Rousset F. 1996. Equilibrium values of measures of population subdivision for stepwise mutation processes. Genetics 142:1357–1362.

Slatkin M. 1995. A measure of population subdivision based on microsatellite allele frequencies. Genetics 139:457–462.

Weir B. S. 1996. Genetic data analysis II: Methods for discrete population genetic data. Sinauer, Sunderland, Mass.

Zhivotovsky L. A. 2001. Estimating divergence time with the use of microsatellite genetic distances: impacts of population growth and gene flow. Mol. Biol. Evol. 18:700–709.

Zhivotovsky L. A., M. W. Feldman. 1995. Microsatellite variability and genetic distances. Proc. Natl. Acad. Sci. USA 92:11549–11552.

Zhivotovsky L. A., M. W. Feldman, S. A. Grishechkin. 1997. Biased mutations and microsatellite variation. Mol. Biol. Evol. 14:926–933.
