Inference of Locus-Specific Population Mixtures from Linked Genome-Wide Allele Frequencies

Abstract Admixture between populations and species is common in nature. Since the influx of new genetic material might be either facilitated or hindered by selection, variation in mixture proportions along the genome is expected in organisms undergoing recombination. Various graph-based models have been developed to better understand these evolutionary dynamics of population splits and mixtures. However, current models assume a single mixture rate for the entire genome and do not explicitly account for linkage. Here, we introduce TreeSwirl, a novel method for inferring branch lengths and locus-specific mixture proportions by using genome-wide allele frequency data, assuming that the admixture graph is known or has been inferred. TreeSwirl builds upon TreeMix that uses Gaussian processes to estimate the presence of gene flow between diverged populations. However, in contrast to TreeMix, our model infers locus-specific mixture proportions employing a hidden Markov model that accounts for linkage. Through simulated data, we demonstrate that TreeSwirl can accurately estimate locus-specific mixture proportions and handle complex demographic scenarios. It also outperforms related D- and f-statistics in terms of accuracy and sensitivity to detect introgressed loci.


Inference
We have to infer the transition and emission probabilities from the observations. Since we do not know through which hidden states the trajectories passed, we apply the Baum-Welch algorithm to estimate the following parameters θ of the HMM:
1. the initial state distribution η_j = P(z_1 = j);
2. the transition matrix parameters, where i is associated with each migration edge: κ_i (scaling factors), ϕ_i (controlling the number of sites not in the "neutral state"), ζ_i (controlling the extent of sites not in the "neutral state"), and a_i (the attractors, or "neutral states");
3. the branch lengths c = (c_1, . . ., c_k);
4. the parameters µ and σ² of the root state prior.
Applying the Baum-Welch algorithm to an HMM is rather standard (Baum et al., 1970), but it involves lengthy calculations for this rather complex model, as we show in the following. The Baum-Welch algorithm is a variant of the Expectation-Maximization (EM) algorithm that updates the parameters by alternating between an E-step and an M-step. Let θ′ denote the old and θ the new parameters. In the E-step of the algorithm, we calculate the expected complete-data log-likelihood

Q(θ, θ′) = Σ_j γ_1(j) log η_j + Σ_{l=2..L} Σ_{i,j} ξ_l(i, j) log P(z_l = j | z_{l−1} = i) + Σ_{l=1..L} Σ_j γ_l(j) log π(d_l | µ, σ², G_j),

where we used the notation from the main text and where

γ_l(j) = P(z_l = j | d_1, . . ., d_L, θ′) and ξ_l(i, j) = P(z_{l−1} = i, z_l = j | d_1, . . ., d_L, θ′).

These γ's can be calculated by the standard forward-backward algorithm for HMMs (Murphy, 2012).
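The forward-backward pass of the E-step can be sketched as follows. This is a minimal numpy implementation in our own notation, not TreeSwirl's code: `B[l, j]` stands for the emission likelihood π(d_l | state j), which in the full model would come from the Gaussian emission densities.

```python
import numpy as np

def forward_backward(eta, A, B):
    """Scaled forward-backward pass for a discrete-state HMM.

    eta : (J,)   initial state distribution
    A   : (J, J) transition matrix, A[i, j] = P(z_l = j | z_{l-1} = i)
    B   : (L, J) emission likelihoods, B[l, j] = pi(d_l | state j)

    Returns gamma[l, j] = P(z_l = j | data) and
    xi[l, i, j] = P(z_l = i, z_{l+1} = j | data).
    """
    L, J = B.shape
    alpha = np.zeros((L, J))
    scale = np.zeros(L)

    # forward pass with per-site scaling to avoid numerical underflow
    alpha[0] = eta * B[0]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for l in range(1, L):
        alpha[l] = (alpha[l - 1] @ A) * B[l]
        scale[l] = alpha[l].sum()
        alpha[l] /= scale[l]

    # backward pass, reusing the forward scaling factors
    beta = np.zeros((L, J))
    beta[-1] = 1.0
    for l in range(L - 2, -1, -1):
        beta[l] = A @ (B[l + 1] * beta[l + 1]) / scale[l + 1]

    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)

    # pairwise posteriors for consecutive sites
    xi = alpha[:-1, :, None] * A[None] * (B[1:] * beta[1:])[:, None, :]
    xi /= xi.sum(axis=(1, 2), keepdims=True)
    return gamma, xi
```

The scaling trick keeps the recursion stable over many loci, where unscaled forward probabilities would underflow.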
Maximization with respect to the new parameters is standard for the initial state distribution: η*_j = γ_1(j). For each EM iteration, we also need to store all γ's to calculate the derivatives shown below.

Optimization of Emission Probabilities
To infer the branch lengths, we use the Newton-Raphson (NR) algorithm (Nocedal and Wright, 2006; Lange, 2010). To avoid boundary issues, we allow for negative branch lengths as long as they result in valid variance-covariance matrices W_j. Specifically, we check that i) all diagonal entries of W_j are ≥ 0 and that ii) the matrix is positive semi-definite. We do so for all boundary cases, i.e. for the W_j of all combinations with either zero or full migration on each migration edge (e.g. (0,0), (0,1), (1,0) and (1,1) in the case of two migration edges). However, we output a warning whenever negative branch lengths are inferred, as they are often a sign of a misspecified tree.
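The two validity conditions and the enumeration of boundary cases can be sketched as below. The helper `W_of_w`, mapping a migration-rate combination to the implied covariance matrix, is a hypothetical stand-in for the graph-based computation.

```python
import numpy as np
from itertools import product

def is_valid_covariance(W, tol=1e-10):
    """Conditions i) and ii) from the text: non-negative diagonal and
    positive semi-definiteness, checked via the eigenvalues of the
    symmetrized matrix for numerical stability."""
    W = np.asarray(W, dtype=float)
    if np.any(np.diag(W) < -tol):
        return False
    eigvals = np.linalg.eigvalsh((W + W.T) / 2)
    return bool(np.all(eigvals >= -tol))

def valid_at_boundaries(W_of_w, n_edges):
    """Check every zero/full-migration combination, e.g. (0,0), (0,1),
    (1,0) and (1,1) for two migration edges. `W_of_w` is a hypothetical
    callable returning the W_j implied by migration rates w."""
    return all(is_valid_covariance(W_of_w(w))
               for w in product((0.0, 1.0), repeat=n_edges))
```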
The gradient of Q(θ, θ′) with respect to the branch lengths is

∂Q/∂c_k = Σ_l Σ_j γ_l(j) ∂/∂c_k log π(d_l | µ, σ², G_j), (S.2)

where π(d_l | µ, σ², G_j) is the density of N(µ1, S_lj) with S_lj = Σ_l + W_j + σ²11′, see Eq. (14). The terms ∂/∂c_k log π(d_l | µ, σ², G_j) can be evaluated by defining the matrix J_jk given by (6). (Note that there we did not yet have the index j, because at this stage the matrix was defined for the value w instead of w_j.) From (7) we can now calculate the partial derivatives of all the terms in (15). The derivative of the identity S_lj S_lj⁻¹ = I yields

∂S_lj⁻¹/∂c_k = −S_lj⁻¹ J_jk S_lj⁻¹,

and using Jacobi's formula we get

∂ log det S_lj/∂c_k = tr(S_lj⁻¹ J_jk).

With this information we obtain the partial derivatives that form a gradient function F of the branch lengths:

F_k(c) = ∂Q/∂c_k = Σ_l Σ_j γ_l(j) [−½ tr(S_lj⁻¹ J_jk) + ½ (d_l − µ1)′ S_lj⁻¹ J_jk S_lj⁻¹ (d_l − µ1)]. (S.3)

As mentioned, we use the NR algorithm to solve F(c) = (0, . . ., 0)′, for which we also need the Jacobian of F, which is obtained from (eq. S.2) and (eq. S.3) after introducing suitable shorthand notation.

The parameter µ of the root state prior can be inferred explicitly if we know σ². Setting ∂Q/∂µ = 0 gives

µ* = (Σ_l Σ_j γ_l(j) 1′ S_lj⁻¹ d_l) / (Σ_l Σ_j γ_l(j) 1′ S_lj⁻¹ 1),

with S_lj as defined above.
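The NR step for the branch lengths is an ordinary multivariate Newton iteration. A generic sketch (our own helper, not TreeSwirl's implementation) solving F(c) = 0 given the gradient function F and its Jacobian:

```python
import numpy as np

def newton_system(F, J, c0, tol=1e-10, max_iter=50):
    """Multivariate Newton-Raphson: solve F(c) = 0 given the Jacobian J.

    F : callable returning the gradient vector F(c)
    J : callable returning the Jacobian matrix of F at c
    """
    c = np.asarray(c0, dtype=float)
    for _ in range(max_iter):
        # Newton update: c <- c - J(c)^{-1} F(c), via a linear solve
        step = np.linalg.solve(J(c), F(c))
        c = c - step
        if np.max(np.abs(step)) < tol:
            break
    return c
```

In the model above, F would be the branch-length gradient (S.3) and J its Jacobian; here we only show the iteration itself.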
There is no simple formula for σ², and we optimize it with the Newton-Raphson algorithm. To this end, we need the first and second derivatives of (eq. S.1) with respect to σ². Since ∂S_lj/∂σ² = 11′, using Jacobi's formula as above we obtain

∂Q/∂σ² = Σ_l Σ_j γ_l(j) [−½ tr(T_lj) + ½ (d_l − µ1)′ S_lj⁻¹ 11′ S_lj⁻¹ (d_l − µ1)],

wherein we write T_lj = S_lj⁻¹ 11′. For the second derivative of Q with respect to σ² we get

∂²Q/∂(σ²)² = Σ_l Σ_j γ_l(j) [½ tr(T_lj²) − (d_l − µ1)′ T_lj² S_lj⁻¹ (d_l − µ1)].
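The one-dimensional NR update for σ² only needs these two derivatives. A minimal sketch of the iteration, with `grad` and `hess` standing in for the first and second derivatives of Q with respect to σ²:

```python
def newton_raphson_1d(grad, hess, x0, tol=1e-8, max_iter=100):
    """One-dimensional Newton-Raphson: find a stationary point of Q
    given its first derivative `grad` and second derivative `hess`."""
    x = x0
    for _ in range(max_iter):
        x_new = x - grad(x) / hess(x)  # Newton step
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x
```

In practice one would additionally guard against non-positive proposals for σ², since it is a variance.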

Optimization of Transition Probabilities
The proposed transition matrix is parametrized by κ_i, ϕ_i and ζ_i, and additionally by an index a_i that denotes the row index of the attractor (0 corresponding to the bottom row, 1 to the second row, and so on), with the attractor row given by

(0 . . . 0  ϕ_i  −2ϕ_i  ϕ_i  0 . . . 0).
The parameter ζ_i models the rate of moving away from the periphery towards the attractor. If ζ_i = 0, moving up or down is equally likely; if ζ_i = 1, it is impossible to move towards the periphery. The parameter ϕ_i models the rate of leaving the attractor. If ϕ_i = 0, it is impossible to leave the attractor; if ϕ_i = 1, the rate of leaving the attractor matches that of leaving any other state.
If there are only three states, the generating matrix reduces to a 3 × 3 matrix; note that in the case of three states and a_i = 1 there is no ζ_i.
The stationary distribution is complicated to calculate explicitly, and we obtain it by solving a system of normal equations. For optimizing the transition matrix parameters κ_i, ϕ_i and ζ_i for a specific a_i, we use the Nelder-Mead algorithm (Nelder and Mead, 1965).
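Solving for the stationary distribution numerically can be sketched as follows: we stack the balance equations with the normalization constraint and solve the overdetermined system by least squares, which amounts to solving the normal equations. This is a generic illustration for a transition matrix P; for a generating (rate) matrix, one would replace P − I by the generator itself.

```python
import numpy as np

def stationary_distribution(P):
    """Stationary distribution of a transition matrix P.

    Solves pi P = pi together with sum(pi) = 1 as a least-squares
    problem (i.e. via the normal equations of the stacked system).
    """
    n = P.shape[0]
    # stack (P' - I) pi = 0 with the constraint 1' pi = 1
    A = np.vstack([P.T - np.eye(n), np.ones(n)])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi
```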

Initialization
As the Baum-Welch algorithm is sensitive to the choice of starting values, we developed a multi-step strategy to initialize the branch lengths c, the root priors µ and σ², as well as the transition matrix parameters to reasonable values.

Step 1: A Gaussian Mixture Model
In the first step, we seek a good estimate of the variance-covariance matrices W_l without imposing any constraints given by the graph G. To account for possible variation in W_l across loci due to variation in migration strengths, we partition the loci into R classes and infer the variance-covariance matrix W_r across populations for each class r = 1, . . ., R.
We proceed as follows: we assume that the ancestral frequencies µ_l, l = 1, . . ., L, are drawn i.i.d. from the prior N(µ_0, σ²_0), and that the observations are random vectors d_l drawn from a series of Gaussian mixture models (GMM). That is, each d_l is drawn with probability (mixing weight) π_r from one of the R classes, each of which produces random vectors from a multivariate normal distribution with mean µ_l 1 and variance matrix W_r. Clearly, Σ_r π_r = 1. The parameters of this model are thus:
1. the variance matrices W_r, r = 1, . . ., R, for each of the R classes of the GMM;
2. the mean µ_0 and the variance σ²_0 of the Gaussian prior for µ_l;
3. the mixing weights π_r.
To simplify the notation, we call this whole set of parameters θ. Writing ν_lr and ρ²_r for the mean and variance of the posterior π(µ_l | d_l, r, θ′), we have

∫ µ_l π(µ_l | d_l, r, θ′) dµ_l = ν_lr, (S.8)
∫ µ²_l π(µ_l | d_l, r, θ′) dµ_l = ν²_lr + ρ²_r. (S.9)

The relevant part of the expected complete-data log-likelihood (eq. S.6) for the optimization with respect to µ_0 and σ²_0 can, thanks to (eq. S.8) and (eq. S.9), be solved explicitly, yielding the updates (eq. S.10)–(eq. S.12). The new estimate for the mixing weights is the average responsibility of class r across loci (eq. S.13).
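The E-step of this GMM can be sketched by computing class responsibilities with the ancestral frequency µ_l integrated out, so that marginally d_l | class r ~ N(µ_0 1, W_r + σ²_0 11′). This is a sketch under our reading of the model; all variable names are ours.

```python
import numpy as np

def gaussian_logpdf(D, mean, cov):
    """Log-density of the rows of D under N(mean, cov)."""
    K = D.shape[1]
    diff = D - mean
    _, logdet = np.linalg.slogdet(cov)
    sol = np.linalg.solve(cov, diff.T).T
    quad = np.einsum('ij,ij->i', diff, sol)
    return -0.5 * (K * np.log(2.0 * np.pi) + logdet + quad)

def responsibilities(D, pis, Ws, mu0, sigma0_sq):
    """Posterior class probabilities gamma[l, r] = P(class r | d_l)
    for the Step-1 GMM, with mu_l integrated out of the likelihood."""
    L, K = D.shape
    ones = np.ones((K, K))
    logp = np.column_stack([
        np.log(pis[r]) + gaussian_logpdf(D, mu0 * np.ones(K),
                                         Ws[r] + sigma0_sq * ones)
        for r in range(len(pis))])
    logp -= logp.max(axis=1, keepdims=True)  # stabilize before exponentiating
    gamma = np.exp(logp)
    gamma /= gamma.sum(axis=1, keepdims=True)
    return gamma
```

Averaging the columns of `gamma` over loci then gives the standard mixing-weight update.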

Initial guess and choice of number of classes
We use the observed variance-covariance matrix of the transformed observed frequencies as an initial guess of the variance-covariance matrix W_0, and slightly perturb it to get initial guesses for the other classes.
As it is unclear how many classes to use, we start with a rather large number of classes (usually R = 5) and then reduce the number of classes until the mixing weight π_r of every remaining class exceeds a pre-defined threshold π*. We found that a relatively large value of π* = 0.2 gave the best results.
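The class-reduction loop can be sketched as below. The routine `fit_gmm(R)`, which fits the Step-1 GMM with R classes and returns its mixing weights, is a hypothetical placeholder.

```python
def choose_num_classes(fit_gmm, r_start=5, pi_star=0.2):
    """Reduce the number of GMM classes until every mixing weight
    exceeds the threshold pi_star.

    fit_gmm : hypothetical callable R -> list of mixing weights
    """
    for r in range(r_start, 1, -1):
        pis = fit_gmm(r)
        if min(pis) > pi_star:
            return r, pis
    return 1, fit_gmm(1)  # a single class always satisfies the criterion
```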

Step 2: Guessing initial branch length
The W_r inferred in the previous step are a function of two different components of our model, namely the variance-covariance implied by the graph G through the branch lengths c and migration rates w, as well as the variance that stems from binomial sampling of the data (Σ_l). We therefore estimate the branch lengths c with the Nelder-Mead algorithm by minimizing the weighted residual sum of squares (RSS) between Ŵ_r and W_r = f(c) given by (7) of the main text, while accounting for Σ_l. More precisely, let c = (c_1, . . ., c_K)′, and define the vectorizations ŵ = vect(Ŵ), j_k = vect(J_k),
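The minimization itself can be sketched with scipy's Nelder-Mead implementation. Here `W_of_c`, the covariance matrix as a function of the branch lengths, is a hypothetical stand-in for eq. (7) of the main text, and the weights are left generic:

```python
import numpy as np
from scipy.optimize import minimize

def init_branch_lengths(W_hats, weights, W_of_c, c0):
    """Initial branch-length guess: minimize the weighted RSS between
    the class-wise covariance estimates W_hat_r and the graph-implied
    covariance W(c), using the derivative-free Nelder-Mead algorithm.

    W_hats  : list of estimated covariance matrices, one per class
    weights : per-class weights for the RSS
    W_of_c  : hypothetical callable c -> implied covariance matrix
    c0      : starting values for the branch lengths
    """
    def rss(c):
        Wc = W_of_c(c)
        return sum(w * np.sum((W_hat - Wc) ** 2)
                   for w, W_hat in zip(weights, W_hats))

    res = minimize(rss, c0, method='Nelder-Mead')
    return res.x
```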