Modeling population structure under hierarchical Dirichlet processes

We propose a Bayesian nonparametric model to infer population admixture, extending the Hierarchical Dirichlet Process to allow for correlation between loci due to Linkage Disequilibrium. Given multilocus genotype data from a sample of individuals, the model allows inferring classifying individuals as unadmixed or admixed, inferring the number of subpopulations ancestral to an admixed population and the population of origin of chromosomal regions. Our model does not assume any specific mutation process and can be applied to most of the commonly used genetic markers. We present a MCMC algorithm to perform posterior inference from the model and discuss methods to summarise the MCMC output for the analysis of population admixture. We demonstrate the performance of the proposed model in simulations and in a real application, using genetic data from the EDAR gene, which is considered to be ancestry-informative due to well-known variations in allele frequency as well as phenotypic effects across ancestry. The structure analysis of this dataset leads to the identification of a rare haplotype in Europeans.


Introduction
Population stratification, or population structure, refers to the presence of a systematic difference in genetic markers' allele frequencies between subpopulations due to variation in ancestry.This phenomenon arises from the bio-geographical distribution of human populations.The analysis of population structure presents an important problem in population genetics and arises in many contexts.It is central to the understanding of human migratory history and the genesis of modern populations [49,47].The associated admixture analysis of individuals is important in correcting the confounding effects of population ancestry on gene mapping [60] and association studies [42].It is also useful in the analysis of gene flow in hybridization zones [16] and invasive species [46], conservation genetics [59] and domestication events [35].The establishment of inexpensive single nucleotide polymorphism (SNP) genotyping platforms in recent years has facilitated collection of markers to assess genetic ancestry in human populations and in general to investigate genetic relationships in living organisms.This paper focuses on a particular form of population structure: admixture.Genetic admixtures occur when two or more previously isolated populations begin interbreeding, resulting in the introduction of new genetic lineages into a population (e.g. the African-American population).
Broadly speaking, the aim of population structure analysis based on genetic data are: (i) detection of population structure; (ii) estimating the number of subpopulations in a sample; (iii) assigning individual in subpopulations; (iv) defining the number of ancestral populations in admixed populations; (v) inferring ancestral population proportions to admixed individuals; (vi) identifying the genetic ancestry of distinct chromosomal segments within an individual.A variety of modelling approaches have been proposed in the literature.Two of the most widely used approaches are Principal Component Analysis (PCA) and modelbased estimation of ancestry, mainly involving clustering techniques or hidden Markov models.The PCA approach has been used to infer population structure for several decades.In the PCA approach, the individuals' genotypes are projected onto a lower dimensional space so that the locations of individuals in the projected space reflects their genetic similarities [38,34].It should be noted that the top principal components do not always capture population structure but may reflect family relatedness, long range linkage disequilibrium or simply genotyping artefacts.Model based methods aim to reconstruct historical events and therefore to infer explicitly genetic ancestry (e.g.[43,55,2]).In the structured association approach samples are assigned to sub-population clusters, possibly allowing for fractional membership.
An influential early approach is STRUCTURE by Pritchard et al. (2000) [43], which assumes that individuals come from one of K sub-populations.Based on Bayesian mixture models, population membership and population specific allele frequencies are jointly estimated from the data.This simple framework can be extended to genetic admixtures, allowing individuals to have ancestry from more than one population.For each individual, STRUCTURE determines what proportion of the individuals' genome comes from each population, while the alleles at different loci are modelled conditionally independently given these admixture proportions.Taking a Bayesian approach to inference, independent priors on the allelic profile parameters of each population are specified and posterior inference is performed through Markov chain Monte Carlo (MCMC).Various extensions to STRUCTURE have been proposed to address a number of shortcomings.
One problem with STRUCTURE, which we address in this paper , is that of admixture linkage disequilibrium among neighbouring loci.When individuals from different groups admix, their offsprings' DNA become a mixture of the DNA from each admixing group.Chunks of DNA are passed along through subsequent generations, up to the present day.Therefore, the genomes of the descendants contain segments of DNA inherited from each of the original populations.The shorter the distance between two loci, the higher the probability that the population of ancestry will be the same at these two loci.This means that ancestry states are autocorrelated.The lengths of uninterrupted DNA segments inherited from each sub-population reflect how long ago the admixture event occurred.In general long uninterrupted segments from each population imply a recent admixture event.The original version of STRUCTURE did not deal with admixture linkage disequilibrium and as a result it is necessary to thin out tightly-linked loci to reduce correlations which can affect the quality of inference.In [13], Falush et al. (2003) improved on this issue by introducing a module to model linkage locally among neighbouring loci, using a Markov model which segments each chromosome into contiguous regions with shared genetic ancestry.This allows local genetic ancestry from genotype data to be inferred, as opposed to the global admixture proportions in [43].Such local ancestry estimation gives more fine-grained information about the admixture process.In [43] and [13], each population is characterised by an allele frequency profile and they assume that the alleles at each locus are independent.Further dependence can be modelled by introducing dependence among allelic profiles due to common ancestry [13] or by using a Markov model for the alleles conditionally on the ancestral segmentation [55].Another class of models uses haplotypic profiles as richer representations for the allelic dependence within populations [21].
Another important statistical concern in admixture modelling, which we address in this paper, is the determination of the number of ancestral populations.In Pritchard et al. (2000) [43], this is achieved using a model selection criteria based on MCMC estimates of the log marginal probabilities of the data and the Bayesian deviance information criterion, though it has been noted by [13] that such estimates are highly sensitive to prior specifications regarding the relatedness of populations.See also Corander et al. (2003) [8] and Evanno et al. (2005) [12] for other parametric approaches to determining the number of populations.One way in which such model selection can be sidestepped is by using Bayesian nonparametric models [22], which offer a flexible framework and do not require the specification of a fixed and finite model size (which, in our context, is the number of populations).Rather, one assumes an unbounded potential model size, of which only a finite part is observed on a given finite dataset.Such ideas have been applied to population structure inference by Huelsenbeck and Andolfatto [25] who used the Dirichlet Process (DP) to define a Bayesian nonparametric counterpart to the "no-admixture" model of Pritchard et al. (2000) [43] (see also Dawson and Belkhir [9] and Pella and Masuda [39] for extensions to polyploid data), and by Sohn et al. (2012) [53] who used infinite hidden Markov models [6] for modelling linkage disequilibrium in admixture modelling.
In this paper, we propose a method for modelling population structure that simultaneously gives estimates of local ancestries, bypasses difficult model selection issues using Bayesian nonparametrics, and is designed to be computationally more scalable than current Bayesian nonparametric methods.Our approach is based on using the hierarchical Dirichlet process (HDP) [56] to non-parametrically model the unknown and uncertain number of populations without having to perform costly model selection.Unlike [53], we use a simplified transition model in which, during a transition event, the founder identities on either side of the transition are independent.This transition model requires a linear rather than quadratic (in the number of founders) number of parameters, as well as a forward-backward algorithm which scales linearly in the number of extant populations while introducing less auxiliary variables which can slow down convergence.
In Section 2 we introduce our Bayesian nonparametric model, as well as a novel Markov chain Monte Carlo method which allows efficient exact inference using a retrospective slice sampling truncation scheme.Section 3 describes results of simulation studies, and Section 4 describes population structure analyses of genotype data from the EDAR gene region.Section 5 closes with a discussion of our findings as well as potential future work.

Method
We assume we have multilocus genotype data from a sample of admixed individuals arising from a number of populations.For simplicity, we suppose that there are N haploid individuals genotyped at L loci, and we denote by X = (x il ) 1≤i≤N,1≤l≤L the observed data, where x il is the allele of individual i at locus l.

Model specification
Let K be the number of ancestral populations.We denote by Q i = (q ik ) 1≤k≤K the vector of admixture proportions of individual i, where q ik denotes the proportion of the genome of individual i which can be traced to population k.While previous works used a finite value for K, we will take a Bayesian nonparametric approach and let K → ∞, so that there is an unbounded number of potential populations in the model.To account for dependence among loci, we use the model of linkage proposed by Falush et al. (2003) [13].This employs a hidden Markov model which splits the genome into contiguous chunks with common ancestry.The model is parameterised by: d l , the genetic distance between locus l and locus l + 1, for each l = 1, . . ., L − 1, and r, the rate at which splits occur.Let z il be a variable which denotes the population ancestry at locus l of individual i, and s il be a binary variable which denotes whether locus l and locus l + 1 are in the same chunk (s il = 1) or not (s il = 0).The variables S il can be thought of as linkage indicators.The transition model is as follows: The probability of a split between loci l and l+1 is 1−e −rd l , and the ancestral populations of each chromosome segment are independent and identically distributed, with probability q ik for the ancestral population to be k for each chunk.The admixture model of Prichard et al. (2000) [43] can be recovered as r → ∞, as all loci become independent and the chunks consist of a single locus. 3 The model is completed by specifying the likelihood function for the observed alleles.We will assume that within each population Hardy-Weinberg equilibrium holds, and we can model the the allelic profile of the kth population simply by specifying the vector of allele frequencies, that is θ k = (θ kla ) 1≤l≤L,1≤a≤A , where θ kla is the probability for allele a at locus l in population k.That is, where θ kl = (θ kla ) 1≤a≤A .For example, in the case of single nucleotide polymorphism (SNP) data, x il are binary valued and modelled using Bernoulli distributions with means given by θ kl1 .

Prior specification and Bayesian nonparametrics
We use a Dirichlet prior for the admixture proportions Q i .The typical prior in previous works [5,45,43,13] is given by a symmetric Dirichlet distribution, which assumes that all populations have a priori equal contribution to the observed genomes.We will use an asymmetric Dirichlet with mean Q 0 = (q 0k ) 1≤k≤K instead, to capture the assumption that some populations may be more prevalent than others, so has a priori higher chance of contributing more genetic material to each individual (see also Anderson [3] and Anderson and Thompson [4]): where α > 0 is a parameter which controls the concentration of the Dirichlet prior around its mean Q 0 .The asymmetric Dirichlet also allows for a Bayesian nonparametric model, in which the number of populations K is taken to be infinite, while the corresponding infinite K limit does not lead to a mathematically well-defined model for the symmetric Dirichlet.Specifically, consider a hierarchical prior on Q 0 expressed as the so-called stick-breaking representation [52], where α 0 is a hyperparameter which controls the overall diversity of populations, with larger α 0 corresponding to a larger number of populations with more uniform proportions.The conditional distribution of Q i given Q 0 is still a Dirichlet as given in (2.3), though we need to extend the definition to one for infinite-dimensional vectors.Specifically, a constructive definition of such an infinite-dimensional Dirichlet is given as follows: While our model assumes theoretically the existence of an infinite number of populations, given a particular finite-sized dataset, only a finite (but random) number of populations will be used to model the data, and the posterior distribution over this number can be used to estimate the number of populations exhibited in the data; see Subsection 2.4.1 for details.
The model is completed by specifying a prior on α, α 0 , r and θ kl , k = 1, . . ., K; l = 1, . . ., L. For each population k, we use independent Dirichlet for the allele frequencies at each locus.In the case of SNP data, this implies assuming independent Beta prior for each locus in each subpopulation.In our simulations and our application to the EDAR data, we take θ kl1 ∼ Beta(cµ l , c(1 − µ l )), where m l denotes the prior mean for the allele frequency, assumed to be the same for all ancestral populations and c is a concentration parameter.We choose independent Gamma priors for α and α 0 for computational reasons.We specify a uniform prior on log r, on a fairly large interval.Recall that d l denotes the genetic distance between adjacent markers.If it is measured in morgans, then r can be interpreted as an estimate of t, the number of generations since the admixture event [13].When the genetic distance between loci is not available, we can use as a proxy the physical distance measured in nucleotides.In this case r be interpreted as an estimate of the product of t and the recombination rate (expected number of crossovers per base pair per meiosis).
Another important issue which arises is the computational requirements for inference in a model with an infinite number of populations.In this regard, a range of recent truncation and marginalisation techniques can be applied allowing for exact inference using finite computational resources [32,58,37,14].We propose a particular approach in Subsection 2.4, after discussing in the next subsection the theoretical motivation for the hierarchical Dirichlet prior described in (2.3) and (2.4).

Hierarchical Dirichlet processes
The stick-breaking prior for the overall population prevalences (2.4) imposes a particular ordering on the populations, in which populations with higher index have a priori lower prevalences.This is undesirable from a modelling perspective as the induced ordering is artificial, while from a computational perspective it is also undesirable as it introduces a label switching problem into the inference, which can slow down convergence of inference algorithms [26,37].In this section we address this issue by developing a more abstract formalism for the model based on a construction of coupled random probability measures called the hierarchical Dirichlet process [56] (see also Teh and Jordan [57] for a more recent review).
Let (Θ, Ω) be a measurable space.The Dirichlet process G 0 ∼ DP(α 0 , H) is a random probability measure over (Θ, Ω) with the property that for any measurable partition (A 1 , . . ., A L ) of Θ the random probability vector (G 0 (A 1 ), . . ., G 0 (A L )) is distributed according to a Dirichlet distribution with parameters (α 0 H(A 1 ), . . ., α 0 H(A L )) [15].The parameters of the process consist of a positive concentration parameter α 0 and a base probability measure H over (Θ, Ω).A variety of more constructive representations exist for the Dirichlet process, and the reader is referred to Ghoshal [19] for a review on the DP and to Lijoi and Prünster [29] for a review of nonparametric prior distributions generalizing the DP.
One of the noteworthy properties of the Dirichlet process is that the random probability measure G 0 is discrete almost surely, and can be written in the form (2.6) The atoms (θ k ) k≥1 are independent and identically distributed according to the base probability measure H, while the atom masses are independent of the atoms, and have distribution given by the stick-breaking representation (2.4) [52].
In the context of admixture modelling, we will suppose that each atom in G 0 corresponds to a population with allelic frequencies parameterised by the atom, while the masses correspond to the population proportions or prevalences.In other words, θ k denotes the vector of the population specific allele frequencies for the L loci under investigation.As each individual has its own population proportions while the collection of populations are shared across individuals, we can model this using the hierarchical Dirichlet process (HDP).For each individual i, let G i be an individual-specific atomic random probability measure.These measures are conditionally independent and identically distributed given a common base probability measure G 0 : Since each atom in G i is drawn from G 0 , the collection of atoms in G i is precisely those in G 0 , while each G i has its own specific atom masses: where the masses (q ik ) k≥1 have distribution as given in (2.5).The HDP allows sharing of the ancestral among the indivual distributions as the G i place atoms at the same discrete locations determined by G 0 (see Teh et al. (2006) [56] for details).
We refer to the proposed model as HDPStructure.In summary, G i describes the proportion of the alleles on x i = (x i1 , . . ., x iL ) coming from each of the populations, as well as the parameters of the populations.We model the sequence x i given G i as follows: (i) first we place segment boundaries according to an nonhomogeneous Poisson process with rate rd l , (ii) then the alleles on each segment are generated by picking a population of origin according to G i , then sampling the alleles according to the population distribution.We have expressed the hierarchical prior over the population proportions (2.4), (2.5) as the joint distribution of atom masses in a HDP, while the atoms correspond to the population parameters.Further, while the stick-breaking representation imposes a particular ordering among the atoms, there is no ordering of atoms in the representation as random probability measures themselves.As we will see next, this allows for an efficient Markov chain Monte Carlo algorithm for posterior simulation.

Markov Chain Monte Carlo
In this section we describe a Markov chain Monte Carlo (MCMC) algorithm for posterior simulation in the HDPStructure model.The MCMC sampling algorithm iterates between updates to the random probability measures (G i ) 0≤i≤N , the latent state sequences (s il , z il ) 1≤i≤N,1≤l<L , and the model parameters in turn, each update conditional upon all the other variables in the model.Updates to the random probability measures make use of another representation of the HDP called the Chinese restaurant franchise [56], as well as a retrospective slice sampling technique which allows for a finite truncation to the random probability measures while retaining exactness of the procedure.Updates to the latent state sequences make use of an efficient forward filtering-backward sampling procedure as a Metropolis-Hastings proposal distribution.Finally, updates to model parameters are straightforward one-dimensional Metropolis-Hastings updates.Detailed descriptions of these updates are included in the Appendix.MATLAB software implementing this MCMC scheme is freely available at http://BigBayes.github.io/HDPStructure.

Updates to random probability measures
Conditioned on the model parameters and latent state sequences, the update to the random probability measures (G i ) 0≤i≤N follow standard results for the hierarchical Dirichlet process [56].As noted previously, since the data is finite, the number of populations used to model the data is finite as well.Conditioned on the latent state sequences {z il }, suppose the number of such populations (as a function of the latent state sequences) is K * .For simplicity, we may index these populations as 1, . . ., K * .The random probability measures can be expressed as: for each i = 1, . . ., N , where w i is the total mass of all other atoms in G i , which are collected, after normalising by w i , in a random probability measure G i .
For each i = 1, . . ., N and k = 1, . . ., K * , let n ik be the number of DNA segments in sequence i assigned to population k.In the Chinese restaurant franchise representation of the HDP, we introduce a set of discrete auxiliary variables m ik , taking value 0 if n ik = 0, and values in the range {1, . . ., n ik } when n ik > 0. Define n 0k = N i=1 m ik .Then the conditional distributions of the random probability measures given (n ik , m ik ) 0≤i≤N,1≤k≤K * is described by the following [56], (q 01 , . . ., q 0K * , w 0 )|(n ik , m ik ) ∼ Dirichlet(n 01 , . . ., n 0K * , α 0 ) (2.10) (q i1 , . . ., q iK * , w i )|(n ik , m ik ), (q 01 , . . ., q 0K * ), w 0 ∼ Dirichlet(αq 01 + n i1 , . . ., αq where the masses form a hierarchy of finite-dimensional Dirichlet distributions while the random probability measures are independent of the masses and form a hierarchy of DPs as in the prior. A final point of consideration relates to the fact that the random probability measures G 0 , (G i ) have infinitely many atoms, so not all can be simulated explicitly with finite computational resources.We address this using a retrospective slice sampling technique to truncate the random probability measures while retaining exactness [58,37,20].For each individual i, we introduce an auxiliary slice variable C i , with conditional distribution given the other variables in the model: (2.12) (2.13) The slice variables are sampled just before the latent state sequences (whose updates are described in the next subsection).Further, conditioned on the slice variables, populations whose mass fall below C i will have zero probability to be selected when the latent state sequence for individual i is updated.As a consequence, only the (finitely many) atoms with mass above the minimum threshold min i C i need be simulated.This can be achieved by simulating G 0 and (G i ) using the hierarchical stick-breaking representation (2.4), (2.5) until the left-over mass falls below the threshold.

Updates to latent state sequences
We will use a forward-filtering backward-sampling algorithm to resample the latent state sequences one at a time.Conditioned on the slice variable C i , only populations with q ik > C i will have positive probability of being selected, and so the forward-backward algorithm can be computationally tractable.However, as the slice variable depends on all latent state variables, conditioning on the slice variable introduces complex dependencies among the latent state variables which precludes an exact and efficient forward filtering algorithm.We propose instead to ignore the dependencies caused by the slice variable, and use the resulting efficient forward-backward algorithm as a Metropolis-Hastings proposal.Suppose that there are K i populations with proportions above the slice threshold C i .For simplicity of exposition, we will reindex the populations such that their indices are simply {1, . . ., K i }.The forward-backward algorithm efficiently samples from the following proposal distribution which ignores the slice threshold C i : where the population indicators range only over 1, . . ., K i .The forward filtering phase first computes the following probabilities using dynamic programming: ) with b ∈ {0, 1} and k ∈ {1, . . ., K i }.The dynamic programme starts at l = 1: and proceeds with l = 2, . . ., L: Recall that θ klx il is the probability that locus l in population k assume the observed value x il .The backward phase samples from the proposal distribution, starting at l = L: and iterates backwards, for l = L − 1, . . ., 1: where 1(•) denotes the indicator function and s i1 = 0 by construction.In this way we obtain a new sample for the collection of s il and z il .Finally, the Metropolis-Hastings acceptance probability is a simple expression which accounts for the effect of conditioning on C i : min 1, q min-cur i q min-prop i (2.16) where q min-cur i and q min-prop i indicate the minimum population proportions (2.12) under the current and proposed states respectively.
The forward-backward algorithm has a computational scaling of O(LK i ), linear in both the length of the sequence and the number of potential populations, and is the most computationally expensive part of the MCMC algorithm.It must be noted that, since the (G i ) are conditionally independent given G 0 , the algorithm can be easily parallelised so as to exploit modern parallel computation technology.

Extensions
The model can be straightforwardly extended to diploid or polyploid data, by assuming that, for each individual i, the z i along each of individual i's chromosomes form independent Markov chains satisfying (2.2).Other extensions of the Bayesian nonparametric admixture model can be introduced to allow correlated allele frequencies.For instance, following the approaches of Pritchard et al. (2000) [43] and Falush et al. (2003) [13], it is straightforward to introduce a Bayesian nonparametric admixture model with correlated allele frequencies.Specifically, we can assume that allele frequencies in one population provide information about the allele frequencies in another population, i.e. frequencies in the different populations are likely to be similar (probably due to migration or shared ancestry).This can be achieved by specifying a more complex prior structure on θ kl., for example employing the correlated allele frequencies model of Falush et al. (2003) [13], which assumes that allele frequencies at locus l in different populations are deviations from allele frequencies in a hypothetical ancestral population.
At the moment, we use only genetic data to infer admixture parameters.Often it can be useful to include in the model extra information such as physical characteristics (e.g.ethnicity) of sampled individuals or geographic sampling locations [23].These new sources of information would modify the clustering structure and would allow the proportion of individuals assigned to a particular cluster to depend on the new information.This would require a specification of a spatiality dependent model on the weights of the random measures in the HDP.
From a Bayesian parametric perspective, we could also employ other priors such as the Pitman-Yor process [40] and the hierarchical Pitman-Yor process [57].The Pitman-Yor process is a two-parameter generalization of the DP, for which a stick-breaking construction and a Chinese restaurant representation still hold.Under certain assumptions, it can be shown that in the Pitman-Yor process the number of clusters grows much faster than for a standard DP and that the cluster sizes decay according to a power law.This property makes the Pitman-Yor process a more suitable choice in many applications.The implementation of this more flexible prior would require more expensive computations due to the larger number of extant populations possible.

Simulation studies
In order to assess the performance of the model in recovering the number of ancestral populations, we perform simulations based on three demographic scenarios.Each scenario consists of 200 haploid sequences.
We consider 60 bi-allelic genetic markers on a 100Kbp segment.Coalescent simulations were performed using the software ms [24].Mutation and recombination rates were set to 2 × 10 −8 and 10 −8 per base pair per generation respectively.The focus is to identify the populations of origin and to infer demographic history.In particular, the main goal is inference on K.The three simulated scenarios we consider are: (i) sample from a single random mating population; (ii) admixture model with two parental populations and admixture proportion of 0.6 and 0.4; (iii) admixture model with three parental populations and admixture proportion 0.5, 0.4 and 0.1.Figure 5 shows the posterior distribution of K (after burn-in) under the three scenarios.HDPStructure successfully recovers the number of ancestral populations.
In the simulations we have set the parameters for the Gamma priors on α and α 0 as follows: α ∼ Gamma(1, 1) and α 0 ∼ Gamma (5,1).Although inference can be sensitive to the choice of the prior on the number of populations, i.e. to the prior specification on α and α 0 , we note that as the number of sequences and/or markers increases the model tends to generate spurious clusters, i.e. clusters with very few individuals in them.This is in line with recent results on the clustering properties of the Dirichlet Process [31].Nevertheless, the number of clusters that covers the majority of the data, i.e. 95-99%, is quite robust to prior specifications (sensitivity analysis results not shown).In general, the biological interpretation of K is difficult.This is in agreement with the suggestion of Pritchard et al. (2010) [44]: We may not be able to know the TRUE value of K, but we should aim for the smallest value of K that captures the major structure in the data.
We compare our results with the software STRUCTURE [43,13,23], which is arguably the most widely used software in applications to infer population structure.This software implements the parametric version of our model, with fixed K (http://pritchardlab.stanford.edu/structure.html).We have run the Linkage model described in Falush et al. (2003) [13], using default settings and assuming independent allele frequencies between markers.To estimate K, Pritchard et al. (2000) [43] suggested using the value of K which maximises the estimated model log-likelihood, log Pr(Data | K).This latter quantity is estimated by the MCMC run using an approximation based on the harmonic mean estimator of the Bayesian deviance.For each scenario, we have run STRUCTURE for each value of K, K = 1, . . ., 8. Figure shows the estimated log Pr(Data | K) for the three simulated examples.The value K = 1 seems to maximises the model loglikelihood under all scenarios.In general, the authors of the software warn against possible drawback of using this criterion and to interpret the results with caution and give suggestions for improvement.

Experiments
We demonstrate our model on a dataset of 372 Colombians recently genotyped on the Illumina Human610-Quadv1 B SNP array as part of a genome-wide association study [51].Latin American samples are uniquely advantageous for this purpose [50] because of their well-documented history of extensive mixing between Native Americans and people arriving from Europe and Africa.This continental admixture, which has occurred for the past 500 years (or about 20-25 generations), gives rise to haplotype blocks which are about the right length for such analysis.Ancient admixture produces very short haplotype fragments which are hard to assign ancestry with certainty, while very recent admixture allows only large haplotype blocks and there is not sufficient variation in ancestry for individuals.
The Native American population arose as a branch of the East Asian populations who were separated over 15,000 years ago and consequently isolation and genetic drift shaped their genetic landscape.This caused many SNPs to drift even more than their East Asian counterparts, eventually becoming fixed at the alternative allele.The Ectodysplasin-A receptor (EDAR) gene, located on chromosome 2, is a common example, in particular SNP rs3827760 [30], whose ancestral A allele is 100% prevalent in European and African populations, but the alternative G allele is seen at 94% frequency in Han Chinese and 100% in Native Americans.The SNP, a missense mutation, has been observed to have a range of functional effects in humans and replicated in other mammals such as mice, including the characteristic straight hair shape in East Asians [18,54] and dental morphology [27,36].Our dataset does not contain rs3827760, but neighbouring SNPs in LD to rs3827760 are included in the chip panel.This shows a strength of our model does as we manage to capture the ancestries even in absence of SNP rs3827760, the well-known causal and ancestry-informative SNP, by making good use of LD information.Figure 5 shows the LD plot for the EDAR region in the  1 Correlation between ancestry proportions and cluster occurrence proportion from the Bayesian nonparametric model.
Colombian samples.Overall, EDAR signalling acts during prenatal development to specify the location, size and shape of ectodermal appendages, such as hair follicles, teeth and glands [30].Therefore, we considered EDAR to be an interesting candidate for admixture analysis as it carries information regarding ancestry due to its variation across ancestry as well as its range of functional effects, which means it may be showing some signal of selection.
Genotype information on 372 individuals for 16 SNPs in the EDAR region was available from our Illumina chip data.Genotypes were phased for conversion to haplotype format by ShapeIt2 [10].Data from a total of 828 individuals sampled in putative parental populations were used as reference ancestral groups.These were selected from HAPMAP, the CEPH-HDGP cell panel [28] and from published Native American data [48] as follows: 169 Africans (from 5 populations from Sub-Saharan West Africa), 299 Europeans (from 7 West and South European populations) and 360 Native Americans (from 47 populations from Mxico Southwards).
We ran the MCMC sampler for 50,000 iterations.We collected samples after a burn-in of 20,0000 iterations and thinned every 30 iterations.We specify the following prior distributions for the precision parameters in the HDP: a 0 ∼ Gamma(1, 1), α ∼ Gamma (10,20).We centre the prior for the mean parameters of the Beta base measure of the HDP around the overall observed allele frequencies, with c = 0.01.The prior for log r is a Uniform on the interval [−500 , 5].
The posterior analysis shows evidence of four major ancestral populations in the set of 744 Colombian haplotypes (see Figure 5).We use the MCMC output to estimate the cluster assignment, i.e. population allocation, to each of the 4 major ancestral populations for each haplotype sequence and each marker.In Figure 5, we summarise the MCMC output by reporting the clustering that minimizes the posterior expectation of Binder's loss as described by Fritsch and Ickstadt (2009) [17], who also discuss possible alternatives such as Maximum a posteriori clustering.The four major clusters have admixture coefficients, i.e. relative proportion of occurrences of each of the clusters, 51.8%, 32.1%, 11.4% and 4.7% respectively.As we have used a reference panel, we are able to identify in the first cluster, in terms of cardinality, European-origin haplotypes in the sampled Colombians.The second and third clusters correspond to Amerindian and African respectively.This is also confirmed by looking at the "most frequent" haplotype in each cluster.Figure 5 shows the raw data assigned to each of the four major ancestral populations.We verify our findings in two ways.Firstly, we calculate genetic ancestry proportions using reference genotypes as reference ancestral groups.EDAR-specific ancestry proportions for each of the 372 Colombian samples were estimated using Admixture software [2], which provides a faster implementation of the same model in STRUCTURE.We correlated these ancestry proportions to our cluster occurrence proportion (see Table 4).The correlation values are very high and support our assignment of ancestry category to the first three clusters.The average European, Amerindian and African ancestry across Colombian samples are 53.6%,30.8% and 16.6% respectively, which is also very close to our cluster proportions.However, Admixture is a supervised approach and so it cannot give us further details about the fourth, rarer cluster.To explore it further, and also for another line of verification, we calculate genetic principal components, in which SNP genotypes for each person is recoded into 0/1/2 by an additive count of the minor allele on two chromosomes, and this SNP genotype count matrix is converted to principal components (PC) via the usual method.As European+Amerindian continental genetic mixing is the primary source for admixture in our data, the first PC axis reflects this, being positively correlated with European samples and negatively with American samples.The second PC captures the other continental component in our samples, namely the African samples.More specifically, PC1 captures the European-Amerindian axis of variation, and PC2 captures the African-European axis.In Table 4 we show correlations of PCs with supervised ancestry values.As further PCs are orthogonal to these, they do not show high correlation with any ancestry component.Consequently, the first PC shows high correlations with the first two clusters, and PC2 with the third cluster.As shown in  3 Correlations between principal components and cluster assignment.
highly correlated with the new cluster, which validates the signal we capture as genuine genetic component and not a statistical artefact of our method.To investigate the genetic source of the new cluster, we look at the average allele frequency for each SNP in all the clusters, and then take the difference for the new cluster vs. all the others.Figure 5, top panel, shows the absolute differences: we see clearly that only SNPs 11 and 13 primarily contribute to this cluster.The same is seen when we plot the weights given by PC3 onto each SNP (Figure 5 bottom panel).These two SNPs -rs260693 and rs260696 -are rare SNPs, i.e. their minor allele which is recognized in cluster 4 and PC3 is rare.For example, rs260693 has a global MAF of 3.8%, with the minor allele only primarily seen in Europe (9%) while nearly being absent in Africans (1.8%) and East Asians (0%).Their two minor alleles are highly in LD, with a D' of 1 in European populations.This shows that the haplotype that contains the two minor alleles for these two SNPs is also rare, and is being identified as the fourth separate cluster by our model.Table 4 reports the posterior mean of the θ kl of the 16 SNPs in each of the four major populations while Figure 5 shows the posterior mean of admixture proportion π ik for few randomly selected individuals.

Discussion and concluding remarks
We have presented the Bayesian nonparametric counterpart of the Linkage model of Falush et al. (2003) [13] to infer genetic admixtures.The model allows for both the number of ancestral population and the assignment vector to be random, avoiding the use of model selection criteria.The model can be applied to commonly used genetic markers and does not rely on specific assumption on the mutation model.We incorporate dependence between markers due to correlation of ancestry by specifying an inhomogeneous Poisson process on the DNA sequence.Each population is modelled using a simple and independent allele-frequency profile.We have developed an MCMC algorithm which allows us to perform posterior inference on the number of ancestral populations, the population of origin of chromosomal regions, the proportion of an individual's genome coming from each population, the admixture proportions in the population and the allele frequencies in ancestral populations.We have demonstrated the model in simulations and on real data from the EDAR gene.The model has been able to highlight the existence of a rare European haplotype.We have highlighted possible extensions to our method.An interesting direction for future development is to relax the assumption of independent allele-frequency profile in each population by incorporating ideas from Sohn et al. (2012) [53] and model each population as a hidden Markov model over a set of founder haplotypes.
In this article we have devoted considerable attention to inferring K and shown how Bayesian nonparamertic methods automatically provide posterior inference on the number of ancestral populations.Nevertheless, we must be careful when interpreting K.The nonparametric setup will generally yield sensible clustering but clusters will not necessarily correspond to "real" populations.This problem is in common with most of the model-based structure algorithms [43].

Table 4
Posterior mean of θ kl , l = 1, . . ., 16 in each of the four major populations.The colour gradient in each cell is proportional to the numerical value.

ti.
If α is given a Γ(a, b) hyperprior, then given w i and t i , α has a Gamma distribution with parameters a + m .. − N i=1 t i and b − N i=1 log w i .Update of α 0 .Given the total number m .. = i,k m ik of the θ kl 's, the concentration parameter α 0 governs the distribution of the number of population K.We use the auxiliary method of Escobar and West[11].If α 0 is given a Γ(a, b) hyperprior, it can be resampled by introducing a latent variable γ ∼ Beta(α 0 + 1, m .. ) andp(α 0 | K, γ) = πΓ(a + K, b − log(γ)) + (1 − π)Γ(a + K − 1, b − log(γ))whereπ/(1 − π) = (a + K − 1)/m .. (b − log(γ)).Update of r.The rate r of the Poisson process is given a Uniform prior on some interval [r L , r U ].We use a random walk Metropolis step to update r with proposal distribution centred around the current value.Update of hyperparameters in the base measure H.The proportion of the model involving the hyperparameters in H is a convential parametric model.Hence, conditioning on all the other variables usually leaves us with a standard Bayesian model, often in conjugate form.In the case of SNP data, we have takenθ k ∼ H = L l=1 Beta(cµ l , c(1 − µ l )).We specify independent Beta(a l , b l ) for each µ l , l = 1, . . ., L, and update µ l using a random walk Metropolis step with proposal distribution centred around the current value.

Fig 5 .
Fig 5.The top panel shows the genetic data, with black representing the 0 allele for each SNP.The bottom panel presents summary of the posterior population assignment obtained by minimising the Binder loss function.

Fig 6 .
Fig 6.Each panel shows the data distribution for each of the four major populations: white indicates markers on each sequences not assigned to the specific population, grey denotes the 1 allele and black denotes the 0 allele.

Fig 7 .
Fig 7.  Top panel: absolute value of the difference between the average allele frequency for each SNP in all the clusters and the frequency in the rare cluster.Bottom panel: loadings for each SNP of the third PC.

Table 2
Table 4 the third PC is European anc.Amerindian anc.Correlations between principal components and supervised ancestry values.