A vectorized method of importance sampling with applications to models of mutation and migration
Introduction
The explosive growth of molecular analysis of genetic variation in human and other populations has led to the development of new statistical methods of data analysis. One class of methods calculates the likelihood of one or more population genetic parameters as a function of the observed configuration of data. Such methods make full use of data rather than relying on summary statistics and provide a statistical framework within which to test hypotheses and estimate parameters. Except in a few special cases, likelihoods cannot be computed analytically, so the focus of recent theoretical efforts has been on the development of efficient computer-intensive methods that rely on randomly generated replicates of population genetic models. At present, these methods fall into two categories: methods based on the Metropolis–Hastings algorithm (MH) and methods based on importance sampling (IS). Both classes of methods rely on coalescent theory, and both have been called Markov chain Monte Carlo (MCMC) methods, although some reserve MCMC for MH methods only. The important distinction between MH and IS methods is that different replicates are independent in IS methods and are correlated in MH methods. Felsenstein and his collaborators (Kuhner 1995, Kuhner 2000; Beerli and Felsenstein, 1999) initiated the use of MH methods in population genetics and have developed several programs that analyze a variety of processes, including population growth, migration and recombination. Recently, several other papers employing MH methods have appeared (e.g., Nielsen, 2000; Pritchard et al., 2000; Rannala and Reeve, 2001).
Griffiths and Tavaré 1994a, Griffiths and Tavaré 1994b introduced an IS method that has been widely used. Their method, which they called an MCMC method, chooses among coalescence and mutation events based on prior probabilities of occurrence. Stephens and Donnelly (2000) examined the general theory of IS as applied to the neutral coalescent and introduced another IS method that is more efficient than the one proposed by Griffiths and Tavaré. The Stephens–Donnelly method uses an approximation to the posterior probabilities of coalescence and mutation events as a guide to sampling gene genealogies. In this paper, I will introduce still another IS method which differs from those of Griffiths and Tavaré and of Stephens and Donnelly. In this method, for each replicate a set of coalescence times is randomly generated and then, given the coalescence times, a gene genealogy is generated by non-randomly choosing among coalescence events based on the contribution each event makes to the overall likelihood. Associated with each branch of the gene genealogy is a vector whose elements are the probabilities of being each of the genetic states. Transitions among states on each branch are modeled by taking the appropriate power of the Markovian transition matrix. Generating coalescence times separately allows this method to be easily applied to models other than the neutral coalescent. Any process, such as a linear birth–death process, for generating a random set of coalescence times can be used. Representing the state of each lineage by a vector and employing efficient methods of matrix multiplication make it possible to allow for an arbitrarily large number of transitions on each branch with no increase in running time.
In this paper, I will introduce the general method and then apply it to two data sets. The first application is to two samples of microsatellite alleles in a human population. Microsatellite alleles are assumed to be neutral and the coalescence times are generated by a neutral coalescent model in an exponentially growing population. The goal is to determine whether the data can be accounted for by a mutation model that assumes a change in only one repeat unit each generation (the one-step model) or whether multiple steps must be allowed for. In this case, the genetic state is the number of repeat units of the microsatellite motif and transitions among states are governed by a mutation matrix that allows for changes in allele size by one or more repeat units.
The second application is to the numbers of allozyme alleles in a geographically subdivided population. In this case, a birth–death model is used to approximate the dynamics of a rare allele. Because the birth–death model assumes that each copy of the allele reproduces independently of all others, allelic reproduction is independent of geographic location. The genetic state is the geographic location of each copy and transitions among states are modeled by a migration matrix. In this example, the goal is to determine whether the data show evidence of geographically restricted dispersal, i.e., whether the migration pattern differs from an island model of migration.
Section snippets
The model
The data consist of the genetic states of n chromosomes in a sample. Each chromosome can be in one of d states, so the data set is a set of numbers D={i1,…,in}, where 1⩽ij⩽d. For example, for a microsatellite locus at which allele sizes differ by the number of repeats units, ij is the number of repeats of the allele on chromosome j. For microsatellite loci, the total number of states may not be known but d can be chosen to be sufficiently large that its value does not affect the results. The
Applications
The method described in the previous section provides a way to approximate the probability of the data as a function of parameters, i.e., the likelihood, from which a maximum likelihood estimate and support interval can be obtained. If a prior distribution of the parameter or parameters is assumed, then the likelihood provides the basis for computing the posterior distribution and carrying out a Bayesian analysis.
Discussion and conclusion
In many applications of population genetics theory to genetic data, the goal is to compute the likelihood of one or more parameters as a function of the data. The method presented in this paper employs importance sampling (IS) to allow efficient approximation of the likelihood in cases in which numerous transitions among genetic states can occur. In choosing a computer-intensive method for approximating the likelihood, there are three considerations: confidence in the results, running time, and
Acknowledgements
This research was supported in part by NIH Grant R01-GM40282. I thank J. Felsenstein and M. Stephens for helpful discussions of this topic, and A. Di Rienzo and D.A. Bell for permission to use their data. D.A. Bell and two referees made helpful comments on an earlier version of this paper.
References (20)
The sampling theory of selectively neutral alleles
Theor. Popul. Biol.
(1972)- et al.
Simulating probability distributions in the coalescent
Theor. Popul. Biol.
(1994) - et al.
High-resolution multipoint linkage-disequilibrium mapping in the context of a human genome sequence
Am. J. Hum. Genet.
(2001) - et al.
Likelihood analysis of disequilibrium mapping, and related problems
Am. J. Hum. Genet.
(1998) Rare alleles with selection
Theor. Popul. Biol.
(2001)- et al.
Maximum-likelihood estimation of migration rates and effective population numbers in two populations using a coalescent approach
Genetics
(1999) - Bell, D.A., 1992. Hybridization and sympatry in the Western Gull/Glaucous-winged Gull Complex. Ph.D. Thesis, University...
Genetic differentiation, geographic variation and hybridization in gulls of the Larus glaucescens-occidentalis complex
Condor
(1996)- et al.
Mutational presses of simple-sequence repeat loci in human populations
Proc. Natl. Acad. Sci. USA
(1994) Maximum likelihood and minimum-steps methods for estimating evolutionary trees from data on discrete characters
Syst. Zool.
(1973)