A vectorized method of importance sampling with applications to models of mutation and migration

https://doi.org/10.1016/S0040-5809(02)00007-2Get rights and content

Abstract

An importance-sampling method is presented for computing the likelihood of the configuration of population genetic data under general assumptions about population history and transitions among states. The configuration of the data is the number of chromosomes sampled that are in each of a finite set of states. Transitions among states are governed by a Markov chain with transition probabilities dependent on one or more parameters. The method assumes that the joint distribution of coalescence times of the underlying gene genealogy is independent of the genetic state of each lineage. Given a set of coalescence times, the probability that a pair of lineages is chosen to coalesce in each replicate is proportional to the contribution that the coalescence event makes to the probability of the data. This method can be applied to gene genealogies generated by the neutral coalescent process and to genealogies generated by other processes, such as a linear birth–death process which provides a good approximation to the dynamics of low-frequency alleles. Two applications are described. In the first, the fit of allele frequencies at two microsatellite loci sampled in a Sardinian population to the one-step mutation model is tested. The one-step model is rejected for one locus but not for the other. The second application is to low-frequency alleles in a geographically subdivided population. The geographic location is the allelic state, and the alleles are assumed to be sufficiently rare that their dynamics can be approximated by a linear birth–death process in which the birth and death rates are independent of geographic location. The analysis of eight low-frequency allozyme alleles found in the glaucous-winged gull, Larus glaucescens, illustrates how geographically restricted dispersal can be detected.

Introduction

The explosive growth of molecular analysis of genetic variation in human and other populations has led to the development of new statistical methods of data analysis. One class of methods calculates the likelihood of one or more population genetic parameters as a function of the observed configuration of data. Such methods make full use of data rather than relying on summary statistics and provide a statistical framework within which to test hypotheses and estimate parameters. Except in a few special cases, likelihoods cannot be computed analytically, so the focus of recent theoretical efforts has been on the development of efficient computer-intensive methods that rely on randomly generated replicates of population genetic models. At present, these methods fall into two categories: methods based on the Metropolis–Hastings algorithm (MH) and methods based on importance sampling (IS). Both classes of methods rely on coalescent theory, and both have been called Markov chain Monte Carlo (MCMC) methods, although some reserve MCMC for MH methods only. The important distinction between MH and IS methods is that different replicates are independent in IS methods and are correlated in MH methods. Felsenstein and his collaborators (Kuhner 1995, Kuhner 2000; Beerli and Felsenstein, 1999) initiated the use of MH methods in population genetics and have developed several programs that analyze a variety of processes, including population growth, migration and recombination. Recently, several other papers employing MH methods have appeared (e.g., Nielsen, 2000; Pritchard et al., 2000; Rannala and Reeve, 2001).

Griffiths and Tavaré 1994a, Griffiths and Tavaré 1994b introduced an IS method that has been widely used. Their method, which they called an MCMC method, chooses among coalescence and mutation events based on prior probabilities of occurrence. Stephens and Donnelly (2000) examined the general theory of IS as applied to the neutral coalescent and introduced another IS method that is more efficient than the one proposed by Griffiths and Tavaré. The Stephens–Donnelly method uses an approximation to the posterior probabilities of coalescence and mutation events as a guide to sampling gene genealogies. In this paper, I will introduce still another IS method which differs from those of Griffiths and Tavaré and of Stephens and Donnelly. In this method, for each replicate a set of coalescence times is randomly generated and then, given the coalescence times, a gene genealogy is generated by non-randomly choosing among coalescence events based on the contribution each event makes to the overall likelihood. Associated with each branch of the gene genealogy is a vector whose elements are the probabilities of being each of the genetic states. Transitions among states on each branch are modeled by taking the appropriate power of the Markovian transition matrix. Generating coalescence times separately allows this method to be easily applied to models other than the neutral coalescent. Any process, such as a linear birth–death process, for generating a random set of coalescence times can be used. Representing the state of each lineage by a vector and employing efficient methods of matrix multiplication make it possible to allow for an arbitrarily large number of transitions on each branch with no increase in running time.

In this paper, I will introduce the general method and then apply it to two data sets. The first application is to two samples of microsatellite alleles in a human population. Microsatellite alleles are assumed to be neutral and the coalescence times are generated by a neutral coalescent model in an exponentially growing population. The goal is to determine whether the data can be accounted for by a mutation model that assumes a change in only one repeat unit each generation (the one-step model) or whether multiple steps must be allowed for. In this case, the genetic state is the number of repeat units of the microsatellite motif and transitions among states are governed by a mutation matrix that allows for changes in allele size by one or more repeat units.

The second application is to the numbers of allozyme alleles in a geographically subdivided population. In this case, a birth–death model is used to approximate the dynamics of a rare allele. Because the birth–death model assumes that each copy of the allele reproduces independently of all others, allelic reproduction is independent of geographic location. The genetic state is the geographic location of each copy and transitions among states are modeled by a migration matrix. In this example, the goal is to determine whether the data show evidence of geographically restricted dispersal, i.e., whether the migration pattern differs from an island model of migration.

Section snippets

The model

The data consist of the genetic states of n chromosomes in a sample. Each chromosome can be in one of d states, so the data set is a set of numbers D={i1,…,in}, where 1⩽ijd. For example, for a microsatellite locus at which allele sizes differ by the number of repeats units, ij is the number of repeats of the allele on chromosome j. For microsatellite loci, the total number of states may not be known but d can be chosen to be sufficiently large that its value does not affect the results. The

Applications

The method described in the previous section provides a way to approximate the probability of the data as a function of parameters, i.e., the likelihood, from which a maximum likelihood estimate and support interval can be obtained. If a prior distribution of the parameter or parameters is assumed, then the likelihood provides the basis for computing the posterior distribution and carrying out a Bayesian analysis.

Discussion and conclusion

In many applications of population genetics theory to genetic data, the goal is to compute the likelihood of one or more parameters as a function of the data. The method presented in this paper employs importance sampling (IS) to allow efficient approximation of the likelihood in cases in which numerous transitions among genetic states can occur. In choosing a computer-intensive method for approximating the likelihood, there are three considerations: confidence in the results, running time, and

Acknowledgements

This research was supported in part by NIH Grant R01-GM40282. I thank J. Felsenstein and M. Stephens for helpful discussions of this topic, and A. Di Rienzo and D.A. Bell for permission to use their data. D.A. Bell and two referees made helpful comments on an earlier version of this paper.

References (20)

There are more references available in the full text version of this article.

Cited by (0)

View full text