An approximate likelihood for genetic data under a model with recombination and population splitting

https://doi.org/10.1016/j.tpb.2009.04.001Get rights and content

Abstract

We describe a new approximate likelihood for population genetic data under a model in which a single ancestral population has split into two daughter populations. The approximate likelihood is based on the ‘Product of Approximate Conditionals’ likelihood and ‘copying model’ of Li and Stephens [Li, N., Stephens, M., 2003. Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data. Genetics 165 (4), 2213–2233]. The approach developed here may be used for efficient approximate likelihood-based analyses of unlinked data. However our copying model also considers the effects of recombination. Hence, a more important application is to loosely-linked haplotype data, for which efficient statistical models explicitly featuring non-equilibrium population structure have so far been unavailable. Thus, in addition to the information in allele frequency differences about the timing of the population split, the method can also extract information from the lengths of haplotypes shared between the populations. There are a number of challenges posed by extracting such information, which makes parameter estimation difficult. We discuss how the approach could be extended to identify haplotypes introduced by migrants.

Introduction

Population structure is a common feature of natural genetic and phenotypic variation (Mayr, 1942). For some applications, summarizing this structure by identifying subgroups and quantifying the extent of differentiation between them may be sufficient (e.g. Pritchard et al., 2000). However, the aim is often to make more explicit statements about the evolutionary history of the populations. While some structured populations may be modeled as a system of populations at equilibrium with respect to gene flow, researchers are often interested in non-equilibrium situations. In particular, at the interface of population genetics and phylogeny we are faced with the challenge of modeling population splitting. Accurate estimates of parameters such as the sizes of the populations, the times at which they separated, and how much subsequent interbreeding there has been would be very valuable. If mechanisms limiting interbreeding between the populations have arisen, then fitting these models may allow us to decide whether this has occurred in parapatry or allopatry (Hey, 2006, Becquet and Przeworski, 2009). Furthermore, the identification of functional loci (such as those involved in reproductive isolation or local adaptation) will be facilitated by knowledge of population history, effective population size and gene flow elsewhere in the genome (Hey and Nielsen, 2004, Becquet and Przeworski, 2009).

Although there is a well-developed theory of population genetic processes that generate data under these types of scenarios (see e.g. Wakeley, 2008), it is often very difficult to compute likelihoods for models of interest. Therefore, in this paper we describe a promising alternative approach that approximates the standard likelihood function. We start by specifying a particular model of population structure, and describing some of the approaches that have been developed for this type of problem.

We focus on the most basic model of population splitting (see for example Wakeley and Hey, 1997), in which an ancestral population splits instantaneously into two daughter populations (see Fig. 1). The parameters of the model are the three effective population sizes (Na in the ancestral population; N1 and N2 in the two daughter populations), the number of generations since the splitting event (G), and the per-generation per-base pair probabilities of mutation and recombination (μ and r) respectively. Since our aim in this paper is to explore the utility of a new approximate likelihood, we focus on a simple version of the model in which all three population sizes are equal (Na=N1=N2=N). The model with unequal population sizes is a straightforward generalization, described briefly in Section 3.2. As is typically the case in population genetics, the data in fact contain information about the parameters (G, μ, r) on the time-scale on which genetic drift occurs, rather than on a time-scale of generations (see e.g. Wakeley, 2008). Thus our model in fact uses the relative rate parameters θ=2Nμ and ρ=2Nr, and the parameter F=G/2N which represents the amount of drift that has occurred in the daughter populations since the split. Note that, as in many coalescent-based models of population genetics, our model of N diploids is equivalent to a model of 2N haploids (Wakeley, 2008). Also note that θ and ρ are often defined to be twice the values that we use in this paper.

Historically, non-recombining loci such as mitochondrial DNA have often been used to fit these models, but there is a growing awareness of the statistical and biological limitations of such data sets (e.g. Hey and Machado, 2003). Therefore much research now focuses on using data from multiple unlinked regions of the genome, which reflect multiple realizations of the genealogical process. Often researchers will type a set of completely unlinked markers (e.g. microsatellites or SNPs) scattered around the genome, in which case the data can be summarized without loss of information by the counts of the different alleles in each population at each locus. The expectations of various quantities can be derived under a model without migration (Wakeley and Hey, 1997), and the likelihood of a particular configuration of allele counts at a locus may be computed analytically for a model with no migration and where mutation since divergence can be ignored (e.g. Nielsen and Slatkin, 2000, Nicholson et al., 2002, Roychoudhury et al., 2008), or alternatively can be estimated accurately under more general models using coalescent simulations.

While useful, completely unlinked loci offer only incomplete information about the genealogical process. One consequence is that there is relatively low power to distinguish between isolation models with and without migration (Nielsen and Slatkin, 2000), although multiallelic markers, such as microsatellites, hold additional information. With linked markers, the data are potentially informative about both migration and splitting times, as one can hope to learn about the variability (i.e. the distribution), over loci, of the pairwise coalescence times between the two populations (Wakeley, 1996). The distribution is informative, because under a model with no gene flow, the coalescence times between different populations necessarily predate the time at which the population split; in contrast, if there has been some low rate of gene flow, the coalescence times between populations are more variable, as some lineages migrate and thus coalesce more recently (Wakeley, 1996). However, unlinked sites provide little more information than the expectation of these times. For loosely linked data, information about the timing of the population split and subsequent migration is captured by the lengths and similarities of haplotypes shared between populations, as ancestrally shared or migrant haplotypes are broken up by recombination and diverge by mutation over time (e.g. Pool and Nielsen, 2008). For example, if two individuals in different populations are found to be identical across a large chromosomal region, then this may be strong evidence for recent gene flow, since such data may be unlikely under a pure split model.

Therefore, attention has turned to developing methods that consider a collection of genomic regions with linkage within each region, and free recombination between regions. Such data contain information about the joint distribution of times in the genealogies underlying the data, and thus potentially contain much more information about the parameters of interest. However, statistical inference in this setting is difficult: the likelihoods cannot be computed analytically and are difficult to estimate by simulation since the observed data will be very improbable, or impossible, on the vast majority of genealogies simulated from the coalescent prior (see e.g. Stephens, 2001). This problem has given rise to a large literature on full-, summary- and approximate-likelihood methods for linked data, a very brief overview of which now follows.

Nielsen and Wakeley (2001) and Hey and Nielsen (2004) developed a full likelihood inference scheme for the isolation and migration model, implemented by the IM software which can handle a set of independent fully linked loci. However, these approaches are limited, as extending them to allow intralocus recombination is challenging even under simple demographic models (Fearnhead and Donnelly, 2001, Nielsen, 2000). This requirement of full linkage is a potentially serious drawback. Firstly, low-recombining chromosomal regions may be atypical (Hilton et al., 1994); secondly, authors frequently trim the regions used in order to fit the no-recombination requirement, which may result in bias (Hey and Nielsen, 2004); and finally, this requirement limits the size of contiguous region that can be analyzed and hence the information available.

Summary likelihood methods are based on replacing the data with low-dimensional summary statistics, which allow likelihoods and posterior densities to be estimated, typically by simulation (Pritchard et al., 1999, Beaumont et al., 2002, Cornuet et al., 2008). This approach has been extended to models of population splitting both with gene flow (Becquet and Przeworski, 2007) and without (Leman et al., 2005, Putnam et al., 2007). Intralocus recombination can be incorporated straightforwardly, simply by allowing recombination in the simulated genealogies (e.g. Becquet and Przeworski, 2007). However, the flexibility and relative ease of computation come at the expense of losing information, and none of the existing approaches use summaries of the data that capture detailed information about haplotype structure.

A promising recent development in population genetics inference is the use of approximate likelihood approaches. One such approach was developed by Li and Stephens (2003, henceforth, “LS”) for inferring recombination rates. They developed a new model for population genetic data that is simpler–and more computationally tractable–than standard models. An attractive feature of the LS formulation is that it contains a formal model of haplotype structure, described in detail below.

The LS approach has been applied to inference in a diverse set of problems including estimating the parameters of recombination (Li and Stephens, 2003), mutation (Cornuet and Beaumont, 2007, Roychoudhury and Stephens, 2007), gene conversion (Gay et al., 2007, Yin et al., 2009) and diversifying selection (Wilson and Mcvean, 2006). It has also been used as a tool for modeling haplotype structure–rather than for formal parameter estimation–in genotype imputation and association mapping (Marchini et al., 2007), HLA typing (Leslie et al., 2008) and in models of population admixture (Price et al., 2009) and clustering (Hellenthal et al., 2008).

In this study, we extend the LS approach to estimate the parameters of a model of population splitting. While LS approaches are computationally convenient, they rely on ad hoc simplifications of standard population genetic models, such as the coalescent. As such, a key challenge in extending the approach is to develop approximations that are suitable for the new problem. Our approach to this will be a central focus of the paper.

There are two main innovations in the LS approach. The first is that it computes a likelihood for the full data by breaking it down into a ‘product of approximate conditional’ (PAC) probabilities. Consider an observed set of haplotypes, H=h1,,hn. The basic inference problem is to compute (and maximize) the likelihood of the haplotype data H, with respect to the parameters ϕ of a population genetic model. The likelihood can be written as a product of conditional probabilities: L(H;ϕ)=pϕ(h1)pϕ(h2|h1)pϕ(hn|h1,,hn1). However, these conditional probabilities are unknown for most models of interest, and the framework proposed by LS is based instead on the PAC likelihood Lpac(H;ϕ)=pˆϕ(h2|h1)pˆϕ(hn|h1,,hn1). Here, terms of the form pˆϕ(hk+1|h1,,hk) denote the (approximate) conditional probability of the (k+1)th haplotype, conditional on the first k haplotypes, as a function of ϕ. We will refer to these as ‘AC probabilities’. Note that under neutrality the unconditional likelihood of the first haplotype does not usually depend strongly on the model of population history. Therefore, following LS, we set pˆϕ(h1)=1, and omit this term from our notation.

The second innovation of LS was to introduce a simple model for population genetic data under which these AC probabilities may be computed. We will refer to this as their “copying model”. The model is an approximation which captures many aspects of the coalescent with recombination, without suffering from the computational difficulties that have hindered attempts to incorporate recombination into coalescent-based MCMC and importance sampling schemes. At a basic level, LS can be thought of as providing a simple model for simulating haplotype data. We will first briefly describe LS in terms of the simulation problem, and then turn to how the LS model can be used for inference.

Consider first the simple case of data at a single SNP. Given k allele copies simulated so far, a new one is simulated by choosing one of the k uniformly at random: the new allele is said to ‘copy’ the chosen allele and, unless a mutation occurs, it is assigned the same allelic state as the copied allele. (The genotype of the first allele is set arbitrarily; e.g., to carry allele 0.) LS set the probability of mutation to be a decreasing function of k, to reflect the expectation that alleles added later in the order tend to match those already sampled (see Li and Stephens, 2003, for details).

Now consider simulating haplotypes at a set of loosely linked sites. Let Xl be the label of the haplotype copied at site l by the new haplotype. LS extended their model to include recombination by introducing correlation between the Xs at nearby sites: unless a ‘switch’ occurs, Xl+1 is the same as Xl. If there is a switch, then the haplotype that is copied at the next site is a draw from the uniform prior on the k previously-sampled haplotypes (including the one that was being copied at site l). Thus the new haplotype is modeled as a mosaic formed of stretches copied from the haplotypes already simulated. Specifically, the probability distribution of the random variables denoting who is copied at each site (X1,X2,,XL) is a Markov chain along the sites. The switch events were intended to mimic the effects of recombination, and the transition probabilities of this Markov chain are controlled by a recombination parameter ρ. As for mutation, the switch probability is also a decreasing function of k. This reflects the fact that a haplotype added into a large sample tends to be similar to at least one other haplotype over a large genetic distance.

Computing the likelihood. There is a large set of possible values of the sequence X1,,XL, i.e. which haplotype is copied by the new haplotype at every site along the sequence, which we will refer to as ‘paths’ through the missing data. To compute the AC probability of the new haplotype pˆϕ(hk+1|h1,,hk) under this copying model, the paths are treated as missing data. Thus the AC probability is computed by averaging the data probability over the prior probability distribution on the paths. The Markov chain prior means that this can be done efficiently using standard hidden Markov model (HMM) methods (e.g. Rabiner, 1989). It is then straightforward to calculate the PAC likelihood from these AC probabilities according to Eq. (2).

One drawback of the scheme is that the use of approximate probabilities in Eq. (2) means that the likelihood depends on the order in which the haplotypes are added to the sample. LS found it satisfactory for inference to sum over a small set of random orders, keeping the set of orders the same over the parameter values the likelihood was estimated for.

Full likelihood approaches for linked data, such as that implemented in the IM software (Hey and Nielsen, 2004, Hey and Nielsen, 2007), typically involve explicitly modeling aspects of the ancestral history of the sample such as the genealogical topology, the branch lengths, details of movements of ancestral lineages, or the types of ancestral haplotypes (see e.g. Stephens, 2001). The PAC likelihood and copying model of LS may be seen as an ad hoc approximation of such a full likelihood scheme. From this point of view, when modeling the new haplotype, its descent from the ancestral lineages of the existing sample is mimicked by forming it as a mosaic of chunks copied from present-day haplotypes with occasional ‘mutations’, and the uncertainty regarding its relationships with ancestral lineages of the existing sample is dealt with by averaging over all possible copying paths.

Throughout this paper we make this link between the copying process and the coalescent fairly explicit. That is, we consider copying haplotype X to be analogous to coalescing with the ancestral lineage of haplotype X at some time in the past. With this analogy in mind, we construct our copying model by deriving approximate probabilities and expectations under a coalescent model. This interpretation also lies behind the work of LS, who modeled the probabilities of mutation and switching (i.e. recombining) as decreasing functions of k, reflecting the fact that haplotypes added into an already large sample coalesce rapidly with other haplotypes, leaving little time for mutation or recombination (see Ewens (1990) for a discussion of this sequential sampling construction of the coalescent process).

However, it should be noted that there is no formal correspondence between the coalescent process and the LS copying process. For example, it will often be the case that the ancestral lineage of the new haplotype coalesces with a lineage that is ancestral to several sampled lineages (haplotypes) at site l. In that case it is less clear which haplotype the new haplotype should be said to be copying at site l. This lack of exact correspondence between the genealogical process and the process under which the approximate likelihood is computed is a feature of the ‘copying’ approximation in general; it is not specific to the model of population history considered here.

Section snippets

Overview of the new method

In this paper we develop a PAC copying model for data sampled from two populations that have split from a common ancestral population, using the same framework of ‘copying’ and ‘switching’ outlined above. We first give a brief intuitive overview of the method. We then fully specify our model for unlinked sites and present results for unlinked data, and subsequently present our model of linkage and results for linked data. A schematic representation of our model is given in Fig. 2. In order to

Methods: Unlinked

In the unlinked case the likelihood for the alignment of haplotypes may be computed as the product of the likelihoods for individual sites. Therefore, in this section we will use hi to refer to an allele copy at a single site rather than an entire haplotype. The problem of computing the likelihood is now reduced to computing the approximate probability of a new allele hk1+k2+1 given the alleles h1,,hk1+k2 observed so far, as a function of the model parameters. To do so we now specify our prior

Methods: Linked

With loosely linked data, correlation is anticipated between the patterns of polymorphism at nearby sites (over and above that induced by population structure), as a result of limited recombination. Modeling this phenomenon is in general challenging, but failure to do so (i.e by treating the sites as unlinked) will have two undesirable effects: firstly, valuable information that is present in the data about the genealogy along the chromosome will be lost; secondly, confidence intervals can be

Discussion

In principle, the data sets of choice for learning about the evolutionary history of populations are those with physical linkage. This preference stems from the value of information about local genealogical structure (i.e. along the chromosome) when trying to learn about such things as the timing of population splitting, gene flow and admixture events. As a result, even in non-intensively studied species many data sets of linked variation from the nuclear genome have now been assembled with the

Acknowledgments

The authors would like to thank the Pritchard and Przeworski labs for helpful discussions. DD was supported in the final stages of this work by grant no. 072974/Z03/Z from the Wellcome Trust. JKP and GC were supported by grants from the David and Lucile Packard Foundation, and the Howard Hughes Medical Institute. GC was also supported by funds from UC Davis. Two anonymous reviewers provided many helpful comments.

References (48)

  • M.G. Blum et al.

    Estimating the number of ancestral lineages using a maximum-likelihood method based on rejection sampling

    Genetics

    (2007)
  • D. Conrad et al.

    A worldwide survey of haplotype variation and linkage disequilibrium in the human genome

    Nature Genetics

    (2006)
  • J.M. Cornuet et al.

    Inferring population history with DIY ABC: A user-friendly approach to approximate Bayesian computation

    Bioinformatics

    (2008)
  • W.J. Ewens

    Population genetics theory—The past and the future

  • D. Falush et al.

    Inference of population structure using multilocus genotype data: Linked loci and correlated allele frequencies

    Genetics

    (2003)
  • P. Fearnhead et al.

    Estimating recombination rates from population genetic data

    Genetics

    (2001)
  • Y.-X. Fu. et al.

    Statistical tests of neutrality of mutations

    Genetics

    (1993)
  • J. Gay et al.

    Estimating meiotic gene conversion rates from population genetic data

    Genetics

    (2007)
  • G. Hellenthal et al.

    Inferring human colonization history using a copying model

    PLoS Genet.

    (2008)
  • J. Hey

    On the number of new world founders: A population genetic portrait of the peopling of the Americas

    PLoS Biol

    (2005)
  • J. Hey et al.

    The study of subdivided populations—New hope for a difficult and divided science

    Nature Reviews. Genetics

    (2003)
  • J. Hey et al.

    Multilocus methods for estimating population sizes migration rates and divergence time, with applications to the divergence of Drosophila pseudoobscura and D. persimilis

    Genetics

    (2004)
  • J. Hey et al.

    Integration within the Felsenstein equation for improved Markov chain Monte Carlo methods in population genetics

    Proceedings of the National Academy of Sciences

    (2007)
  • H. Hilton et al.

    Using hitchhiking genes to study adaptation and divergence during speciation within the Drosophila melanogaster species complex

    Evolution

    (1994)
  • Cited by (20)

    • Inference of population history using coalescent HMMs: review and outlook

      2018, Current Opinion in Genetics and Development
      Citation Excerpt :

      In particular, diCal version 2 allows for the parametric inference of more complex demographic models involving multiple populations, and SMC++ and ASMC push the boundaries of scalability for coalescent-HMMs. Building on diCal v1 [50] and advances to the CSD framework [54,55], diCal v2 [56] was developed to perform parametric inference of essentially arbitrarily complex demographic models, including estimating divergence times, continuous and pulse migration, and population sizes with possible exponential growth. The method can scale to tens of haplotypes and has been used on models with three populations, but can handle arbitrarily many populations at increased computational cost.

    • Recombination hotspots: Models and tools for detection

      2016, DNA Repair
      Citation Excerpt :

      It relies on the ‘Product of approximate conditionals’ likelihood and the ‘copying model’ of Li and Stephens [109]. The LS model (developed by Davison et al.) of approximate likelihood approach enabled estimation of the various parameters viz. recombination, mutation, gene conversion and diversifying selection [110]. Fearnhead and Smith also detected the site of hotspot based on approximate likelihood approach [111].

    • Epidemiology and evolution of fungal pathogens in plants and animals

      2011, Genetics and Evolution of Infectious Diseases
    View all citing articles on Scopus
    1

    Current address: Department of Statistics, University of Oxford, United Kingdom.

    2

    Current address: Department of Evolution and Ecology, and Center for Population Biology, University of California, Davis, United States.

    View full text