Approximate Bayesian computational methods to estimate the strength of divergent selection in population genomics models

Statistical estimation of parameters in large models of evolutionary processes is often too computationally inefficient to pursue using exact model likelihoods, even with single-nucleotide polymorphism (SNP) data, which offers a way to reduce the size of genetic data while retaining relevant information. Approximate Bayesian Computation (ABC) to perform statistical inference about parameters of large models takes the advantage of simulations to bypass direct evaluation of model likelihoods. We develop a mechanistic model to simulate forward-in-time divergent selection with variable migration rates, modes of reproduction (sexual, asexual), length and number of migration-selection cycles. We investigate the computational feasibility of ABC to perform statistical inference and study the quality of estimates on the position of loci under selection and the strength of selection. To expand the parameter space of positions under selection, we enhance the model by implementing an outlier scan on summarized observed data. We evaluate the usefulness of summary statistics well-known to capture the strength of selection, and assess their informativeness under divergent selection. We also evaluate the effect of genetic drift with respect to an idealized deterministic model with single-locus selection. We discuss the role of the recombination rate as a confounding factor in estimating the strength of divergent selection, and emphasize its importance in break down of linkage disequilibrium (LD). We answer the question for which part of the parameter space of the model we recover strong signal for estimating the selection, and determine whether population differentiation-based summary statistics or LD–based summary statistics perform well in estimating selection.


Introduction
Divergent selection occurs when populations adapt to contrasting environments, causing the accumulation of genomic differences due to differential selective pressure in these environments.Identity of loci under divergent selection and estimating the strength of divergent selection at these loci play a key role in detecting divergent selection, which can be a driving force of speciation [1].Here, we aim to build theoretical and experimental models of divergent selection and assess the computational feasibility and quality of statistical inference under these models.Specifically we investigate whether the loci under divergent selection can be identified, and the strength of selection can be estimated with reasonable precision using state of the art simulation-based statistical methods.
Our experimental system is the baker's yeast (Saccharomyces cerevisiae) as a fast-evolving model organism.Yeast is an eukaryotic model organism with a moderate number of linear chromosomes (n=16) and a genome-wide number of cross-over events per meiosis that is comparable to larger eukaryotes [2], but with small genome size (12 Mb).Yeast can undergo both asexual and sexual reproduction.Crossing of yeast strains and meiosis (sporulation) can be experimentally controlled.The sporulation when sexual reproduction takes place requires starvation of yeast from nitrogen, glucose, and carbon source [3].Single haploid spores can be isolated and sequenced to determine haplotype phase across the genome as well as precisely map meiotic crossover, gene conversion events, and recombination rate heterogeneity across the genome [2].Some of the challenges in detecting divergent selection are as follows.On the statistical side, divergent selection models that we investigate are the result of stochastic processes on genomes through generations and therefore, they are large models.This causes two main challenges for statistical inference.The first is formulating a workable exact likelihood function.Simulation-based computational methods working with approximate model likelihoods as opposed to exact likelihoods partially solves this problem.The second is making inference scalable for large simulation studies to investigate model properties so that we know what to expect when we perform inference using data.A well-known simulationbased method based on approximate likelihoods is Approximate Bayesian Computation experimental setup and they differ by over 70,000 SNPs, an average of 1 SNP per ~165bp [8].We treat each SNP as a locus.The two strains form a diploid pool, without recombination, followed by sporulation of randomly selected two parents, during which recombination occurs, to create an offspring ancestral pool at time t = 0. Half of the ancestral becomes a founding population assigned to evolve in sodium dodecyl sulfate (SDS) and the other half of the ancestral pool becomes a founding population assigned to evolve in sodium chloride (NaCl).These environments induce differential selective environmental pressures in two populations.
The biological experiment was performed under four different migration-selection cycles treatments, with three replicates per treatment.The four different treatments were as follows: no migration and no sporulation (asexual reproduction), no migration and sporulation (sexual reproduction), 20% migration and sporulation, and 50% migration and sporulation.During the divergence with gene flow, the populations evolve asexually in one of the mediums for 5 days, which is assumed to be equivalent to t* = 50 generations.Then, one of the four scenarios of migration conditional on the type of sexual reproduction is implemented for one generation.This cycle is repeated four times, resulting in evolution under divergent selection for approximately 200 generations.

Description of the theoretical population genetics model
We assume diploid organisms that differ by L bi-allelic loci.Each population is of constant effective population size N e , and they evolve in non-overlapping generations.We obtain the populations X 0 and Y 0 at time t = 0 by crossing the founding populations at all loci consisting of all private alleles at L loci, then incorporating the recombination events into the process, and finally we evolve the populations through generations as follows.
We assume that the two parental genomes for an offspring are uniformly randomly (independently of each other) distributed, one parent from each population.The number of recombinations on the offspring's genome, n r , is binomially distributed with probability r.We define the position vector on which these n r events happen by L r , ∥ L r ∥ ≤ L, at which recombination events are uniformly randomly (independently of each other) distributed on L loci.The two parental genomes are recombined in positions defined by L r to obtain the offspring genome.
The reproduction in recurring cycles of t* generations: We start from t = 0, isolate X t and Y t from each other and implement asexual reproduction for a sequence of t* − 1 generations.These t* − 1 generations allow for population and loci-specific divergent selection to act on each population.
We let s i to denote the selection coefficient at locus i ∈ 1, 2, …, ∥ L s ∥ , where the fitness of the reference allele a i under selection is 1 + s i if population carrier has a i copy, else (1 + 0).We assume that selection effects are additive across loci on the genome such that the fitness of an individual at time t − 1, in population j j ∈ X, Y (Part A of Fig. 1) is ω n j, t − 1 = ∑ i = 1 L s 1 + I n a i ∈ j, t − 1 s i . (1) Then, for each distinct genome, the probability of including an offspring at generation t is multinomially distributed with the probability of successes proportional to their normalized fitnesses given by p n = ω n j, t − 1 n = 1 N e ω n j, t − 1 . (2) From generation t* − 1 to t* populations undergo asexual or sexual reproduction with recombination and symmetric migration.Migration rate from X t* − 1 to Y t* and from Y t* − 1 to X t* is denoted by m.We sample uniformly randomly (independently of each other) N e m parents to migrate from population j to the other population.After migration, reproduction is either sexual sex = 1 , or asexual sex = 0 .If sex = 0, an offspring is an exact copy of a single parent chosen uniformly randomly.If sex = 1, we choose two parents uniformly randomly (independently of each other) from the same population.The recombination steps are the same as described above (Part B of Fig. 1).
The next reproduction cycle starts at generation t* + 1 (Part C of Fig. 1), for a total of specified number of cycles n cycles , with final generation occurring before migration, i.e. t final = n cycles t* − 1. Visual representation of the experimental design of recombination rates and modes of reproduction is seen in Fig. 2 3. Model parametrization, the data, and the divergent selection simulator In this section, we delineate the constraints on the data-generating process, define the parameters of the population genetic model, and finally build a simulator.The divergent selection simulator allows us to explore the divergent selection model motivated by the yeast populations built in the lab and it constitutes one of our main contributions.

Model parametrization and the data
To capture differences in population X relative to population Y , we economically use signed selection coefficients.We arbitrarily fix the reference allele a i at locus i with selection coefficient s i i ∈ 1, 2, …, ∥ L s ∥ for the other allele at that locus in population X.A negative s i means that for allele a i at locus i, an individual in population X has a lower fitness in comparison to an individual in population Y .Unlike in population X, the fitness of the reference allele a i under selection in population X is always (1 + 0) for a carrier in population Y , regardless of whether the carrier in population Y has the a i copy or not.
SNPs in yeast occur for approximately one in every 165bp.We assume equally spaced SNPs in the genome and rescale the genome-wide recombination rate proportionally.We further assume that there is at most one crossover event between consecutive SNPs.We let SNP spacing be the spacings between L SNPs, which allows us to denote the total genome length by L × SNP spacing .The recombination rate is a function of L and SNP spacing is a constant: r = f L , In the experimental laboratory setup, the migration rate and the length of migration-selection cycles t* are controlled, and the recombination rate can be estimated as described earlier in Section 1, and in Section 2.1.The migration rates are fixed at {0,0.2,0.5}.For the recombination rate r, we use values from the literature and assume them fixed and known.
A typical computational procedure mimicking sequenced yeast genome informed by our laboratory procedures is as follows.We picked to simulate L = 1, 500 SNPs because it is computationally scalable.This number translates into about one-third on mean of SNPs of a chromosome that the two yeast strains in our biological model system from Section 2.1 differ by over 70,000 SNPs [8].The mean recombination rate of Saccharomyces cerevisiae has been estimated as 3.5×10 −6 Morgans/bp in the literature [9], with inferred genome-wide recombination profiles from sequenced isolates from an advanced intercross line (AIL) to be as high as 3.0 × 10 −5 Morgans/bp for a two-way cross at genome hotspots [10].
Here, for our laboratory procedures mimicking the sequenced yeast, we fixed the genome recombination rate of r = 2.0 × 10 −5 Morgans/bp.One way to think about this value is as a best statistical estimate for recombination hotspots in yeast.For the other scenarios with a smaller number of SNPs for computational time efficiency, we scaled the recombination rate to r = 3.0 × 10 −4 , proportionally to the expected number of recombination events.
Fixing unknown parameter values to their point estimates instead of jointly estimating them is not ideal.Recombination rate is known to be a particularly problematic parameter in population genetic models with selection and recombination.This is due to the fact that the two evolutionary processes might generate statistically similar signals in genetics.In models the recombination rate r was varied in our simulations the signal from divergent selection was confounded to the degree that there were no useful statistical inferences to distinguish true loci under selection.However, our work is a first serious effort to model divergent selection and to explore the statistical properties of estimates of a parameter of interest (strength of divergent selection) in a large evolutionary model.Free model parameters of not direct interest such as recombination rate cause statistical identifiability issues that jeopardize the statistical inference.These issues have not been solved in population genetics (except in small models) and their discussion is beyond the scope of this paper.Statistical identifiability of model parameters kicks in when there is a large number of interacting parameters in a population genetics model that are not fixed, but vary.
Table 3 describes the full parametrization with assumed values for fixed and known parameters and prior distributions for unknown and to be estimated parameters.We tried to choose reasonable support for the prior distributions based on yeast literature [10][11][12][13][14][15].To be explicit in probability functions of the model representation, in addition to standard conditionality notation separating the observables and parameters, we denote the fixed and known parameters by K = r, t*, n cycles , N e , L, SNP spacing where r is the recombination rate per genome per generation, t* is the number of generations between reproduction cycles, n cyclec is the number of reproduction cycles, N e is the effective population size, and L is the number of SNPs, SNP spacing is the spacing on the genome between each SNP.We write the joint probability mass function generating the data as where, x is an N e by L matrix of zeros and ones for the SNP data for each population, s = s 1 , s 2 , …, s ∥ L s ∥ is the vector of signed selection coefficients at each locus, m is the symmetric migration rate per generation between the two populations, sex is the mode of reproduction, l = l 1 , l 2 , …, l ∥ L s ∥ is the vector of indicators to denote loci under selection.

Simulator for the data generating process
Below we describe how the data are generated with the simulator.We input effective population size N e , number of SNPs per carrier to be simulated The model parameters that are estimated from the output simulator are the loci under selection and corresponding selection coefficients, although we allow the migration rate between populations m and the type of reproduction during migration generations sex to vary between simulations but we do not estimate them.The estimated model parameters are sampled from the joint prior distribution, i.e. θ i * P (s, m, sex, l; K).
Our simulated data of bi-allelic SNPs are N e × L matrices per population represented by x.
The SNPs of each carrier in the population correspond to the matrix row.For each of N e carries per population an allele a is sampled from discrete uniform distribution bound on [0,1] and replicated L times, corresponding to the probability of 0.5 of carriers from founding F 0 population X and of 0.5 from founding F 0 population Y , to build F 1 populations X and Y respectively.
Offsprings are then generated from the recombination of two parental genomes per offspring.For 1 to N e per population, the first parent is sampled with equal probability from X F 1 , and the second parent is sampled with equal probability from Then, the selection and asexual reproduction occur for t* − 1 generations.Absolute fitness of parents ω n j, t − 1 for each population j in X, Y for each carrier n is calculated.For loci under selection specified by an input vector L s , absolute fitness of parent at t − 1

Inference about model parameters via approximate Bayesian computation (ABC)
The probability distribution given in expression (3) is not available in closed form and the joint likelihood of the data cannot be evaluated given the parameters.There are no exact methods to perform statistical inference about the unknown model parameters in this case.As a practicable solution to the problem of performing inference about the model parameters, we employ ABC to sample the posterior distribution of parameters [4,[18][19][20][21][22][23][24].ABC bypasses the explicit evaluation of the joint likelihood thereby making simulationbased inference feasible when model likelihoods cannot be evaluated.Statistical inference in ABC is characterized by two main approximations.The first approximation is due to substituting the exact likelihood of the data with a kernel-based numerical approximation.
The second approximation is due to substituting the likelihood of the data with the likelihood of the summary statistics.Whether and by how much these approximations affect the quality of the inference depends on the size of the model generating the data and the computational budget available to increase accuracy of the approximation.For the first approximation, good practices have been established.Assessing the quality of the second approximation, however, is particularly challenging in a class of models where there are no known sufficient statistics for unknown parameters.Our divergent selection model falls into this class.In this section, we investigate the usefulness of some population differentiation statistics to perform inference about the parameters of our divergent selection model.
Based on expression (3) for the probability model generating the data, we denote the joint posterior distribution of parameters given the data x and fixed and known parameters as P s, m, sex, l | x; K ∝ P x | s, m, sex, l; K P s, m, sex, l; K , where P x | s, m, sex, l; K is the joint likelihood of the data and P s, m, sex, l; K is the joint prior distribution of unknown parameters.Incorporating the two approximations described in the previous paragraph, we write the likelihood as where SumStat is summary statistics.In the next two subsections, we evaluate some useful summary statistics in the context of our model.

Summary statistics and outlier scan on summarized observed data
The effect of divergent selection on genomes between samples of two populations can be quantified by well-known statistics that measure of genetic differentiation.An example is Wright's fixation index [25], F ST .For bi-allelic loci F ST defined by: which measures allele frequency differentiation among the sampled populations [26].Here, σ p 2 is the variance in allele frequency among sampled populations, and p ‾ is the mean allele frequency in sampled populations.A signed version of F ST , which we denote by signF ST obeys the following: if p X − p Y < 0, signF ST = − F ST , else signF ST = F ST , such that signF ST ∈ − 1, 1 .This statistic captures the information about which sampled population is undergoing selection advantageous with respect to the other sampled population.
At the genomic scale, it is often computationally infeasible to approximate the likelihood based on F ST jointly for all loci.A practicable remedy, which we follow here, is first to determine a set of candidate loci under selection manifesting only outlier values of F ST and consider the likelihood based on these loci.We take the outlier cutoff to be F ST outside of the 95% of all F ST values in the data [27].The selection coefficients can take on values within a range defined in Table 1, it is not a single value, for which we can derive an estimated F ST cut-off value, therefore we do not consider just large F ST values.In the simulator, the summary statistics outliers correspond to specific SNPs.Only the specific loci of SNPs that were detected by the outlier test are considered as potentially under selection when simulating data sets for ABC.The rest of the loci (non-outliers) are fixed to as no selection, i.e. s i = 0.
Sample F ST is not a sufficient statistic for any parameter of a divergent selection model since large values of sample F ST are not only the result of divergent selection.Sample F ST from expression (6) measures genetic variation among sampled populations by assessment of variance between and within sampled populations by calculating allele frequency differences.The sample F ST values have been found to be correlated with the recombination rate [28].Recombination rate between loci and the number of generations of recombination influence LD decay with genetic distance.Under a neutral evolution model, genetic drift is the only driving force of changes in allele frequencies.Many generations will be required for a new variant to reach a high frequency, and the surrounding LD will decay due to recombination events [29][30][31].
In our paper we test an LD-based summary statistic -Cross Population Extended Haplotype Homozygosity XP − EHH [32,33] -along with the sample F ST .In order to derive sample XP − EHH, one must first calculate Extended Haplotype Homozygosity EHH summary statistics for each population.
The EHH between two loci is defined as the probability that the number of distinct haplotypes G v in a genomic region up to a distance v from the locus are equal to each other.For N e carriers per population with possible alleles of either 0 or 1 per locus, with each group z, z = 1, 2, …, G v , having n z haplotypes, EHH is: [34].At distance, v = 0 the EHH v = 0 for each locus with respect to itself is always 1 and the EHH v values decay as the v increases, which is the decay of LD from each core haplotype [32,35].The XP − EHH, compares the integrated extended haplotype homozygosity, EHH, between two populations [32,34].Specifically, XP − EHH is the ratio of the EHH between populations X and Y integrated over the genome.If recombination rates in the model were allowed to vary widely across the genome between and within populations, the EHH statistic can be interpreted as a measure of selection only after suitable normalization [34,36].Our model assumes a mean recombination rate for simulated L SNPs, thus normalization is not required in our model.
Calculating EHH v values for locus 1 to L, and corresponding distances v 0 to L − 1, it yields an L by L symmetric matrix for each of the two populations.The XP − EHH then for population X and Y combined for each locus is just a vector of length L, and is given by: The integration domain D is a cut-off threshold below which the EHH values are set to 0.0, for which we pick D of 0.05 [35,37].The XP − EHH has the advantage of detecting selection on alleles that near fixation in one population but not both, population X and Y [34].This fits our model where the population X evolves under the positive or negative selection with respect to population Y , with occasional gene flow between two populations during migration generations.

Assessment of summary statistics
Following the outlier scan on the summarized observed data set (F ST statistic) outside of the 95% of all F ST values as described in the previous subsection, only the outlier loci in the data sets simulations identified by the outlier scan are considered as potentially under selection, with selection coefficients at non-outlier loci fixed to 0, and only the summary statistics corresponding to loci identified by outlier scan are inputted to Algorithm 1 and Algorithm 2. We assess the performance of summary statistics in terms of how well they capture the signal of observed selection coefficients by using the summary statistics in the ABC, then calculating the standard deviation of posterior distributions of selection coefficients, and mean square errors (MSEs), variance and squared-bias.We compare performance of the following summary statics: signF ST , XP − EHH, p X − p Y , and signF ST with XP − EHH.
We plot SNPs vs. MSE, observed selection coefficients vs. MSE, SNPs vs. squared-bias, observed selection coefficients vs. squared-bias, SNPs vs. variance, observed selection coefficients vs. variance, for the four summary statistics combinations, and for the mode of reproduction and migration rate in Section 5.
In our model p X and p Y represent allele frequency per locus in sampled population X and Y respectively, such that p X − p Y is the difference between the allele proportions per locus between the two populations, with the expected value of zero for the founding F 2 populations, i.e. at generation t = 0. Additionally to testing the performance of summary statistics from the outlier scan on summarized observed data described above, we assessed the effect of genetic drift and found that the population size is large enough for genetic drift not to be an issue as the mean of 100,000 data sets converge to the deterministic model, as verified by simulations.Due to the convergence of the mean of 100,000 data sets and the expected value from the deterministic model, we determined most informative summary statistics performed of simulator output data t = t final on the deterministic model based on the MSEs.We derived a summary statistic called signF ST : if p X − p Y < 0, signF ST = − F ST , else signF ST = F ST .Due to the strong fit of superimposed plots of simulations and deterministic values [38], we evaluated signF ST , F ST , and p X − p Y based on deterministic single locus model.We performed 1000 ABC tests with a tolerance rate of 0.1%.For each of the ABC iterations, a single simulated data set with known parameters was randomly drawn from the 100,000 data sets and assumed as observed data set.For each, the top 0.1% of data sets with the smallest Euclidean distances between observed and simulated summary statistics were accepted.On average of the 1000 ABC iterations, the lowest error was achieved with signF ST MSE = 5.50 × 10 −8 .We compared ABC results to empirical results from the simulator from 100,000 simulations and achieved on average MSE = 2.00 × 10 −3 with signF ST and MSE = 1.10 × 10 −2 with p X − p Y .We also examined parameter space via plots of s L/2 vs. summary statistics and the relations are more often 1:1 with signF ST than with p X − p Y .In Section 5 we address XP − EHH and signF ST and expand it to a much larger scope of parameter space.

Inference about model parameters
We estimate the free model parameters in P s, m, sex, l | SumStats; K given by expression (5), using an ABC-rejection algorithm (Algorithm 1) and the ABC with the linear regression adjustment (Algorithm 2).The term x represents, the data from our bi-allelic SNPs of individuals in two populations.The simulated SNPs are N e × L matrices per population represented by x, and x summarized by summary statistics SumStat by mapping SumStat = S x .The nsim observations denoted by x generated from the model are independent and identically distributed, i.i.d.The x ∈ X, where X is the space in which the data sits.
ABC makes two approximations, the first one is the mapping of x to SumStat, and the second is accepting summary statistics SumStat within a tolerance rate from the observed SumStat obs .The simulations are a mechanistic process that involves random sampling, which can be thought of as an influence of stochastic processes such as genetic drift.The ABC facilitates in model parameters estimation by accepting parameters corresponding to summarized data sets within a tolerance rate that is partly due to the stochastic effect.The ABC outputs a posterior distribution of accepted parameters P s, m, sex, l | SumStat m ; K , m Discrete Unif set m , where set m ∈ 0, 1 .In the ABC-rejection (Algorithm 1) we calculate the Euclidean distance d i , i = 1, 2, …, nsim technique for each of the simulated and summarized by the summary statistics data set [39,40] to scale summary statistics across the nstat dimensional space.The summary statistics are standardized by the median absolute deviation SD, j, j = 1, 2, …, nstat, such that each of the nstat summary statistics per dimension approximately equally contribute to the ABC analysis.Further details on the Euclidean distance are described in Appendix A.
Algorithm 1 ABC-rejection algorithm for summary statistics calculated from data sets simulated from the Simulator and their corresponding parameters from the prior distribution.Input: proportion of simulations to accept, number of simulated data sets, number of summary statistics per simulation.
1: Input: tolerance rate, number of simulations, number of summary statistics.
▷ Sampled from the Simulator.
5: Calculate Euclidean distances between simulated and observed summary statistics.
6: Accept M data sets with the smallest Euclidean distances.
7: Return M accepted data sets.
Following the rejection algorithm, we applied the linear regression correction to compare ABC performance with and without linear regression correction.The input of the ABClinear regression is the output of the ABC-rejection.In ABC-rejection we assign weights ) and standardized observed summary statistics (SumStat obs scaled ), and d M > 0 is the bandwidth parameter [43], in this case the largest of Mth Euclidean distance, order d 1 , d 2 , …, d nsim M .The purpose of the kernel weights calculations is to apply in the calculation of weighted least squares regression coefficients β ˆW LS for the linear regression correction analysis in Algorithm 2, where the accepted data set with smallest Euclidean distance d i from Algorithm 1 output is adjusted the least and the accepted data set with largest d i out of the M accepted is adjusted the most.The beta estimates vector is an approximate draw from the posterior.In our ABC-linear regression correction, the user can choose to calculate the kernel weights either based on Gaussian [41,44], or Epanechnikov [45] kernel and we later show in the results that the two kernels are similarly as effective.Further details on kernels, kernel weights, weighted least squares coefficients, and adjustment of accepted data sets parameter estimates are described in Appendix A.
The ABC-linear regression correction steps are shown in Algorithm 2.
Algorithm 2 ABC-linear regression correction algorithm performed on the output of the ABC-rejection.Input: standardized observed summary statistics, M accepted standardized summary statistics with corresponding M accepted parameters from ABC-rejection.
1: Input: standardized observed summary statistics, M accepted standardized summary statistics with corresponding M accepted parameters.
▷ Calculate each of M weights based on kernel type.
4: Return adjusted parameter values from the linear regression adjustment.

Overview
Below we describe two methods for estimating strength of selection under four migration m and mode of reproduction sex combinations, and 0, 1, or 2 loci under selection combinations.
In the 1stmethod we describe, t* and t final are fixed 5 and 19 respectively, and the selection coefficients conditional on locus span on s i | I = i Unif − 0.25, 0.25 (see Section 3.2).The parameter space for loci under selection is at genomic locus corresponding to SNP number that can only take on values on L/3, L/2, 2L/3 .An advantage for this method is no need for the initial summary statistic outlier scan (as described in Section 4.1) which would require simulation of nsim = 100, 000 data sets per observed data set, with parameter space of potential loci under selection reduced to those identified by the outlier scan.In Section 5.2 and Section 5.2.1 we describe and perform an ABC analysis for each of total of n ABC = 10,000 observed data sets, where we re-use simulated same nsim = 100, 000 data sets for each analysis.We plot observed signF ST summary statistics for all 10,000 observed data sets and visually represent how the parameter space for the strength of selection differs depending on migration and mode of reproduction combinations (Fig. 3), and depending on number of loci under selection (Fig. 4).To reduce the complexity of variable parameters, we visually verify of this, simplified method case that ABC-linear regression correction estimation outperforms ABC-rejection estimation, and that there is no difference in performance of ABC-linear regression with Gaussian kernel versus with Epanechnikov kernel (Appendix C).
In the 2nd method, we performed four scenarios total of t*, t final , number of SNPs L, recombination rate r, and parameter space of selection conditional on locus s i | I = i combinations (see Table 2), unlike where in 1st method we performed only one combination.Additionally, we build on the technique from the 1stmethod by allowing the loci under selection of the observed data to span anywhere between locus corresponding to SNP 1 to L, then performing outlier scan as described in Section 4.1 which requires simulation of nsim = 100, 000 data sets per observed data set.Based on the results from the 1st method about ABC-rejection versus ABC-linear regression, and corresponding ABClinear regression two kernels, we performed analysis only with ABC-linear regression (after the ABC-rejection but without comparing to ABC-rejection) and compare ABC performance with four summary statistics combinations instead of one like in the 1st method, and found that signF ST summary statistics in ABC analysis is has low squared-bias and variance between observed and median accepted posterior selection parameters relative to the other three summary statistics combinations (Fig. 9).

Observed data, model simulations, and ABC for fixed potential loci under selection
In this 1st method, we assessed how the signal of selection from summary statistics changes when the mode of reproduction, strength of migration, as well as number of loci under selections change.As Wright's fixation index [25] measures the allele frequency differentiation among sampled populations [26] of sequenced data, and demographic history of Saccharomyces cerevisiae has been reported to play a role in gene expansion and contraction based on phylogeny reconstruction [46], here we investigate this demographic history and its effect on signature of selection.
For this, we assessed signal strength equivalent to scenario 1 for the 2nd method from Table 2 but for 10,000 ABC iterations n ABC = 10, 000 and without the initial outlier scan on summarized observed data.This method has a computational time advantage as instead of simulating unique 100,000 nsim = 100, 000 data sets per one observed data, the simulated data sets are re-used.We present the results with signF ST summary statistics based on ABC-linear regression with Gaussian kernel (see Fig. 5, and Fig. 6), but we also compare performance of Gaussian versus Epanechnikov kernels, Gaussian kernel versus rejection, and Epanechnikov kernel versus rejection (see Appendix C).
The details of the experimental design of parameter space are described below.

Experimental design of parameter space-
The experimental design of the parameter space which was the same as for scenario 1 in sites as those we used in scenario 4 (Table 1, and Table 3 in Appendix A), resembling a biological yeast system where YPS128 and DBVPG1106 yeast strains of 12Mb genome length differ by over 70,000 SNPs or 1 SNP per ~ 165 bp [8], as described in Section 2.1.The migration rate was randomly chosen to be 0.0, 0.2, or 0.5, sexual or asexual reproduction during migration generation.The migration generation cycles n cycles = 4 took place every fifth generation t* = 5 for total of t final = n cycles t* − 1 = 19 generations.Both, the observed and simulated data can have either 0, 1, or 2 loci under selection with prior parameter space of selection coefficients conditional on locus s i | I = i Unif − 0.25, 0.25 .
The experimental design part that differed from scenario 1 was: We performed 10,000 ABC iterations instead of 100.Out of the potential 0, 1, or 2 loci under selection, the possible loci that the selection could act on were L/3, L/2, 2L/3 , more specifically {33 rd SNP,50 th SNP,67 th SNP}.For this small set of the only possible loci under selection, no outlier scan on observed summarized data was performed, which enabled us to re-use the same 100,000 simulated data sets for the ABC iterations.
The experimental design of the parameters space for selection coefficients and rates of migration, with possible models of reproduction during migration generations of the model are seen in Fig. 2.
For the signature of the strength of selection due to mode of reproduction and strength of migration, we have evaluated signF ST summary statistic of each of 10,000 observed data sets under four variable modes of reproduction (sex) and strength of migration (m) combinations: sex = 0 and m = 0, sex = 1 and m = 0, sex = 1 and m = 0.2, sex = 1 and m = 0.5, while equally probable seven combinations of 0,1,2 loci under selection within a set of L/3, L/2, 2L/3 possible SNP loci (see Fig. 2).In Fig. 3 we see a visible pattern in an increase in summary statistics values away from 0 (signF ST = 0 when the fixation index F ST = 0) at selected loci with a decrease in migration rate, and increase in genetic hitchhiking effect with an asexual mode of reproduction sex = 0 given same m of m = 0.0 as the recombinational distance from the loci under selection is absent.
For the signature of strength of selection due to a variable number of loci under selection, we have evaluated signF ST summary statistic of each of 10,000 observed data sets where 0, 1, and 2 loci are under selection within a set of L/3, L/2, 2L/3 possible SNP loci, while equally probable four combinations of mode of reproduction and migration described above (see Fig. 2).In Fig. 4 we see a visible pattern in summary statistics values closest to 0 for no loci under selection, increase in the magnitude of summary statistics values away from 0 at single locus under selection, and in-between the magnitude of summary statistics values away from 0 and increase in genetic hitchhiking effect for two loci under selection.
In Fig. 3 we see a break down of LD when recombination rate is present versus absent, given no migration rate, and in Fig. 4 we see more break down in LD for one locus under selection instead of two loci, given same average genomic recombination rate, which is consistent with recombination as the primary source of LD break down [7].
To further examine the signature of LD decay and distance between loci under selection, given fixed average genomic recombination rate, we looked at cases with loci under selection in closer proximity to each other.We evaluated effect of number of loci and distance between loci under selection for two loci cases, i.e. four combinations: none, one locus, two loci L/6 distance apart, and two loci L/3 distance apart, as well as each of seven loci combinations, i.e: none, L/3, L/2, 2L/3, L/3, L/2 , L/3, 2L/3 , L/2, 2L/3 but we could not distinguish visually further differences than those seen in Fig. 4 in observed summary statistics values due to position (seven combinations), nor the distance (four combinations) (see Appendix C).
In order to answer the question how well the signal we recover the signal of strength of selection, with variable selection coefficients priors, when the mode of reproduction, strength of migration, and number of loci under selection change, we evaluated the bias and the variability in the estimates of selection coefficients from the ABC.For this, we plotted the true observed selection coefficients versus: posterior medians, MSEs between true observed versus median of posteriors, variance of the posterior medians, and bias squared between true observed and posterior medians based on the ABC-linear regression Gaussian kernel.We picked Gaussian kernel because the PoPoolation software -a pipeline for analyzing pooled next generation sequencing data [47] -uses Gaussian kernel smoothing [48].We assessed them based on four modes of reproduction and migration combinations, and based on number of loci under selection.For the mode of reproduction and migration (Fig. 5), we see a positive relationship between the magnitude of true observed selection coefficient values for strongest migration rate m = 0.5 and variance as well bias (squared), with high migration rate contributing to opposing the effect of divergent selection.For the same range of selection coefficient values, the range of summary statistic values is small of data generated with high m, in compare to low m (Fig. 3).The signal of the strength of divergent selection also diminishes with number of loci under selection.In (Fig. 6), we see a positive relationship between the magnitude of true observed selection coefficient values for two loci, and variance as well bias (squared).
We compared to Epanechnikov kernel, and ABC-rejection and found no visual difference between kernel types, but an improvement in selection coefficient estimations for both Gaussian and Epanechnikov in comparison to ABC-rejection (Appendix C).

Observed data, model simulations, and ABC for potential loci under selection identified by outlier scan
In the 2nd method, to answer question how expansion of the parameter space for the positions of loci under selection contributes to accuracy in estimation of selection coefficients, we performed ABC evaluations for estimating strength of selection for model scenarios described in Table 2.The observed data sets had randomly selected 0, 1, or 2 loci under selection, and randomly selected loci under selection along SNP loci 1 to L (see Model section on parameter methodology).We identified candidates for loci under selection on the n ABC observed data sets via the outlier scan using the F ST summary statistic (see Section 4.1) first, followed by the model simulations, then ABC.
We performed a total of four scenarios.For scenarios 1-3, we simulated same number of SNPs, average recombination rate, as described in 1st method in sub-Section 5.2.1, and with same expected number of crossovers and average spacing of polymorphic sites on the genome equivalent to the biological yeast system as in scenario 4 (Table 1, and Table 3 in Appendix A).Because of relatively low number of SNPs simulated, which referred which required lower computational time per simulated data set, we performed 100 ABC iterations (n ABC ), which translated into 100 observed data sets (one ABC iteration per one observed data set), with nsim = 100, 000 unique data sets simulated conditional on outlier scan on potential locus under selection per each observed data set.
For scenario 1, the migration-selection cycle occurred every t* = 5 generations, with total number of cycles n cycles = 4, and final generation t final = n cycles t* − 1 = 19.The prior parameter space of selection conditional on locus is s i | I = i Unif − 0.25, 0.25 .This parameter space implies that if an overall population fitness ω is ω = 1 at t = 0, that is the average fitness of all carriers N e ω n = 1 at t = 0, for extreme strength of selection if a carrier of one locus under selection has s i | I = i = − 0.25 or s i | I = i = 0.25, it has 25% lower or higher fitness respectively than the population average at t = 0.For two loci under selection, this translates to 50% lower or higher fitness respectively than the population average at t = 0.
For scenario 2, the only parameters that differ from scenario 1 is the length of the selection cycle t* = 50, a 10-fold increase from scenario 1, and the prior parameter space of selection conditional on locus is s i | I = i Unif − 0.025, 0.025 , a 10-fold decrease from scenario 1, as one would expect fixation to reach slower with weaker selection.
For scenario 3, the only parameter that differs from scenario 2 is number of cycles n cycles = 2, thus final generation t final = n cycles t* − 1 = 99.This scenario is to test whether the tested number of mixing (migration) times plays a significant role in the signature of selection in comparison to scenario 2.
For scenario 4, the recombination rate resembles closer to a genomic recombination rate of sequenced yeast data.The average recombination rate of Saccharomyces cerevisiae has been reported 3.5×10 −6 Morgans/bp in the literature [9], with inferred genome-wide recombination profiles from sequenced isolates from an advanced intercross line (AIL) to be as high as 3.0×10 −5 Morgans/bp for a two-way cross at genome hotspots [10].
Here, for the scenario resembling sequenced yeast, we fixed genome recombination rate of r = 2.0 × 10 −5 Morgans/bp, a realistic value for recombination hotspots in yeast [10].
The simulated number of SNPs for scenario 4 for the genomic chunk is L = 1, 500, around an average number of SNPs of a yeast cross for one-third a chromosome [8], which still holds true that the expected number of cross-over events is μ n r = 4.95 (Table 1, and Table 3 in Appendix A) as for scenarios 1-3, and as for the non-outlier scan earlier scenario with smaller parameter space of potential loci under selection L/3, L/2, 2L/3 .The pior parameter space of length of selection cycles, number of cycles (and thus final generation), and selection coefficients conditional on loci are equivalent to scenario 2 t* = 50, n cycles = 4, t final = n cycles t* − 1 = 199, s i | I = i Unif − 0.025, 0.025 .Due to longer computational time because we simulated more SNPs in scenario 4 than in scenarios 1-3, we picked 10 instead of 100 data sets as observed and thus performed n ABC = 10 instead of n ABC = 100 ABC iterations.
Full list of scenarios with initial outlier scan for which the evaluations were performed are described in Table 2.The Fig. 7 and Fig. 8 show the summary statistics of observed data sets based on mode of reproduction and migration, and number of loci under selection respectively for each of the four scenarios.The ABC performance of MSE, variance and squared-bias between true selection parameter values under which the data are generated, and the values of posterior distributions for each of the four scenarios are shown in Fig. 9.In ABC analysis estimators are set to medians of the posterior distributions.
Out of the four scenarios, we see strongest pattern of observed summary statistics for scenario 1, where the range of selection coefficients is 10-fold greater, with 10-fold shorter migration-selection cycles, characterized by sharp peaks (Fig. 7 and Fig. 8).We also see that the observed XP − EHH is almost exclusively positive.From expression (8) with population X in the numerator and population Y in the denominator, population X and Y are modeled under selection and under neutrality respectively.The numerator is larger due larger sum of the extended haplotype homozygosity decay around the locus the selection is acting on.LD-based statistic XP − EHH has been shown to be more effective when one variant of allele is near fixation within one population [49,50].In scenario 1 with stronger selection, and shorter migration-selection cycles, the genetic drift is expected to have smaller effect proportionally to strength of selection and length of migration-selection cycles.With absent migration m = 0 , a build-up of selection over time is expected to occur, and a break down of LD when recombination is present sex = 1 instead of absent sex = 0 , shown most clearly in our results with XP − EHH.
For scenario 4 (resembling a biological yeast system), where 15-fold larger number of SNPs are simulated than for scenarios 1-3, we see peaks of observed summary statistics values with more stochastic effect between the SNPs unlike a more smooth pattern in observed summary statistics of neighboring SNPs in scenarios 1-3.This pattern resembles more so the pattern of the level of heterozygosity [51] on domestication and divergence of S. cerevisiae, and less so in scenarios 1-3.
Our results presented show some difference in genetic divergence dependent on number of mixing cycles.We see larger range of observed summary statistics values for scenario 2 compared to scenario 3, with four migration-selection cycles instead of two.The observed summary statistics values deviate the most from zero for data generated under no migration in both scenarios, where the process of genetic drift and/or divergent selection affects the accumulation genetic differences [52].
Besides the differences in patterns of observed summary statistics between the scenarios, we see a consistent pattern across the scenarios.The lowest level of divergence, expressed as the lowest magnitude of observed summary statistics, is seen for the strongest rate of migration for all summary statistics (Fig. 7), and no loci under selection (Fig. 8) for signF ST and (p X − p Y ).
After examining patterns of observed summary statistics by mode of reproduction with migration, and by number of loci under selection, we compared the ABC performance with these observed summary statistics by evaluating the variance of ABC posteriors, squared-bias between of the posterior estimates and true observed parameters.We find a clear pattern of posterior estimates to be less biased for scenario 1, where the range of selection coefficients is 10-fold of those in scenarios 2-4, and where the migration-selection cycles are shorter (Fig. 9).Once explanation could be that short migration-selection cycles do not allow for the significant build-up of genetic drift, and/or that strong selection has too significant of effect for the genetic drift to act on the population.We also determine a pattern of XP − EHH capturing least information about the signal of divergent selection, and discuss potential explanation in Section 6.2.

Application
By combining ABC methods that incorporate summary statistics on large simulation study from developed simulator, we present an approach of estimating model parameters, which can bypass evaluation of exact likelihood function.We show it on simulations with variable migration rates, modes of reproduction, and number of loci under selection, where fixation index summary statistics outperformed cross-population extended haplotype homozygosity in terms of precision and accuracy.We also recommend fixation index over the LD-based statistic cross-population extended haplotype homozygosity, because calculating the linkage disequilibria is computationally expensive [53].
Moreover, when considering evolutionary models of complex systems, they provide valuable insights into how systems adapt, change, and evolve [54].These models, by simulating intricate interactions among diverse components, unlock crucial patterns shaping the adaptive nature of systems, thereby laying the groundwork for understanding where the strength of recovered signals lies in representing divergent selection.An important question about the application of these methods is: for which part of the parameter space of the model the recovered signal about the selection coefficients works well?In Fig. 9 scenario 1 for instance the posterior median estimates of selection coefficients are closer to the values of the true observed selection coefficients in compare to scenarios 2-4.In scenario 1, the migration-selection cycles are 10-fold smaller, and the selection coefficients range is 10-fold larger thus affected less by the genetic drift, supporting that selection can only be assessed if it is high enough to outperform the effect of genetic drift [55].The shorter migrationselection cycles with overall less generations simulated, took less time to simulate.We showed in scenario 4 that our model is scalable to recombination rate parameter resembling sequenced yeast data [10], with same expected number of recombination events Table 3, Table 1 in Appendix A) as for scenarios 1-3.If the parameter space of possible loci under selection can be assumed to those in Fig. 2, a more robust number of ABC iterations on data resembling biological yeast recombination rate is feasible in terms of computational time.

Recombination rate
A limitation of the study where the main focus was estimation of selection coefficients, was fixing the recombination rate to an expected number of 4.95 recombination events (Table 3, Table 1 in Appendix A) for recombination rates 3.0 × 10 −4 Morgans/bp and resembling biological data 2.0 × 10 −5 Morgans/bp [10] for 100 and 1500 SNPs respectively.We fixed the recombination rate to same for all simulations, such that the parameter r represents the average recombination rate on the simulated genome section.We explain out motive for fixing the recombination rate below.
With our model we attempted to estimate strength of selection with variable recombination rates of simulated data sets, however, we were unable to get consistent estimates.While the buildup of LD (i.e. the correlation between nearby variants of alleles as opposed to random association of alleles [56]) can be a result of several population genetic forces, recombination is the only primary method to break it down [7].The absence of recombination between sites under selection will reduce the overall effectiveness of selection, a phenomenon known as the Hill-Robertson (HR) effect [7,57,58].Our main focus was estimation of strength of selection, with successfully applied variability in mode of reproduction, and in strength of migration, but not with variability in recombination rate.As the XP − EHH was least effective in estimation of selection with fixed recombination, and XP − EHH has been used to measure the haplotype lengths between two populations [32,33], we would expect XP − EHH to perform better in estimation of recombination without taking into account varying strength of selection, mode of reproduction and migration combined.
Future work direction would be an exploration of variable recombination rates.
), m XY , m Y X , t m , sex, SNP spacing ifrv Unif 0, 1 < 0.5then ABC-rejection algorithm for summary statistics calculated from datasets simulated from the Simulator and their corresponding parameters from the prior distribution.Input: proportion of simulations to accept tol, number of simulated data sets nsim, number of summary statistics per simulation nstat.
▷ Simulated from the prior θ i * P (s, m, sex, l; K) in the Simulator.

Table 3
Description of conversion between genomic recombination rate, and recombination rate with respect to SNPs.
One crossover of actual genome that occurs between 2 consecutive SNPs actually would be one crossover between simulated SNPs only.Two crossovers between consecutive SNPs on the genome would look like no crossover between simulated SNPs, which we do not want to consider because it is very rare.Below is a note of how the recombination conversion is performed: For yeast cross from the lab as example, genome length=12.The parents reproduce sexually and parental genome recombination takes place.We denote probability of genome recombination with r (Morgans/bp units).We sample number of recombination events n r from Binomial distribution with a sequence of independent L × SNP spacing − 1 experiments with probability r.For each of the n r recombination events, we sample genome loci of recombination events L r , from Discrete Uniform distribution from locus 1 to locus L × SNP spacing − 1, but we assume no more than one recombination event between two consecutive SNPs.The cross of two populations at t = 0 (end Part II and start Part A of Algorithm 3) serves as an ancestral pool for X and Y populations.
Euclidean distance: In the ABC-rejection we apply the Euclidean distance [39,40] technique to scale summary statistics: with SD, j, j = 1, 2, of the median absolute deviation.The SumStat i, j and SumStat obs, j are points in nstat-dimensional space.
We scale nsim × nstat summary statistics matrix such that each of the summary statistics contributes equally when SD on the nstat-dimensional space: where k is a constant scale factor dependent of the distribution of summary statistics [59] such that the median absolute deviation is a consistent estimator for the standard deviation of each summary statistic for j = 1, 2, …, nstat: For Gaussian distribution k = 1/Φ −1 3/4 ≈ 1.4826 [59] (estimated after converting to standard normal), where for uniform distribution k = 2/ 3 (for standard continuous uniform which spans on − 3, 3 , with k • ( 3/2) = 1).All of the parameters are simulated from uniform priors, therefore we used k = 2/ 3.
[43], and [45], or the Gaussian kernel: (15) [41,44].Next, we implemented lsfit R function [60,61] to calculate weighted least squares estimates of regression coefficients, with indeces of accepted ordered M values denoted by id 1: M : β ˆW LS = ( 1, SumStat obs scaled ⊤ W SumStat obs scaled ) −1 1, SumStat obs scaled ⊤ W θ id 1: M * * (16) ,where W is a diagonal matrix, and diag W = w 1 , w 2 , …, w M corresponding to kernel weights.The beta estimates vector is an approximate draw from the posterior with  17) [4,62,63], such that accepted data set with smallest Euclidean distance d i from Algorithm 1 output is adjusted the least and the accepted data set with largest d i out of the M accepted is adjusted the most.SNP loci vs. signF ST summary statistics for the 1st method, i.e. without the initial outlier scan.Plots for each of 10,000 simulator output data sets assumed as observed under four of migration m and mode of reproduction sex combinations selected randomly with equal probability: sex = 0 and m = 0.0 in yellow, sex = 1 and m = 0.0 in red, sex = 1 and m = 0.2 in turquoise, sex = 1 and m = 0.5 in black, and selected randomly with equal probability scenarios of loci under selection as seen in Fig. 2. A visible pattern is shown of an increase in summary statistics values away from 0 at selected loci with a decrease in migration rate, and an increase in genetic hitchhiking effect with an asexual mode of reproduction (sex = 0) (absence of recombination) given same m of m = 0.0.SNP loci vs. signF ST summary statistics for the 1st method, i.e. without the initial outlier scan.Plots for each of 10,000 simulator output data sets assumed as observed under four of migration m and mode of reproduction sex combinations selected randomly with equal probability scenarios of loci under selection: none in black, one locus (L/3, or L/2, or 2L/3) in red, two loci (L/3 with L/2, or L/3 with 2L/3, or L/2 with 2L/3) in blue, as seen in Fig. 2. A visible pattern is shown of summary statistics values closest to 0 for no loci under selection, largest magnitude in summary statistics values away from 0 at single locus under selection, and in-between the magnitude of summary statistics values away from 0 and increase in genetic hitchhiking effect for two loci under selection (lower observed frequency of recombination events between two loci under selection at a confined distance apart).True parameter value under which the observed parameter is generated (x-axis) vs. median, MSE, variance, bias-squared for L/3 th, L/2 th and 2L/3 th SNP respectively for the 1st method, i.e. without the initial outlier scan, from n ABC = 10,000 ABC tests from signF ST summary statistics from Algorithm 2 with Gaussian kernel.Colors based on three combinations of number of loci under selection of observed data sets: none in black, one locus in red, and two loci in blue.A pattern is shown of a positive relationship between the magnitude of true observed selection coefficient values for two loci, and variance as well bias (squared), with greater genetic hitchhiking effect on two loci under selection.SNP loci vs. observed summary statistics for the 2nd method, i.e. each of four scenarios with initial outlier scan, (Table 4).Simulator output data sets assumed as observed under four of migration m and mode of reproduction (sex) combinations selected randomly with equal probability: sex = 0 and m = 0.0 in yellow; sex = 1 and: m = 0.0 in red, m = 0.2 in turquoise, and m = 0.5 in black.Data selected randomly with equal probability scenarios of loci under selection.A visible pattern is shown of summary statistics values closest to 0 for the strongest migration rate of m = 0.5 across all scenarios, constraining the genetic divergence.Clear summary statistics peaks shown in scenario 1, with smaller effect of genetic drift on strong selection, and with m = 0 most clear pattern of LD break down when sex = 1 instead of sex = 0 seen with XP − EHH.Pattern of more stochastic effect between the peaks of summary statistics values between neighboring SNPs seen in scenario 4, resembling more of heterozygosity pattern seen on a S. cerevisiae genome.A larger range of summary statistic values seen for scenario 2 compared to scenario 3, with four migration-selection cycles instead of two.Colors of observed summary statistics labels correspond to colors of plots in Fig. 9. SNP loci vs. summary statistics for the 2nd method, i.e. each of four scenarios with initial outlier scan, (Table 2).Simulator output data sets assumed as observed under four of migration m and mode of reproduction sex combinations selected randomly with equal probability scenarios of number loci under selection occuring within the loci identified by the outlier scan: none in black, one in red, and two blue.Clear summary statistics peaks shown in scenario 1, with smaller effect of genetic drift on strong selection.Pattern of more stochastic effect between the peaks of summary statistics values between neighboring SNPs seen in scenario 4, resembling more of heterozygosity pattern seen on a S. cerevisiae genome.A larger range of summary statistic values seen for scenario 2 compared to scenario 3, with four migration-selection cycles instead of two.A visible pattern of signF ST and p X − p Y values closest to 0 for no loci under selection across all scenarios.Colors of observed summary statistics labels correspond to colors of plots in Fig. 9. True parameter value under which the observed parameter is generated (x-axis) vs. median, MSE, variance, bias-squared for loci corresponding to SNP for the 2nd method, i.e. determined by initial outlier scan, followed by simulation of data sets and ABC from Algorithm 2 with Gaussian kernel correction.Colors based on four combinations of summary statistics inputted to ABC are described in the figure legend.Plots for four scenarios, where in scenario 1 the range of selection coefficient is 10-fold larger ([−0.25,0.25]):thus the y-axis of the posterior median is 10-fold larger, the MSE, variance and bias-squared is (10-fold)-squared larger than for scenarios 2-4.A pattern is shown of posterior median parameter values closest to true observed values based on signF ST , p X − p Y , and signF ST , p X − p Y for scenario 1, and least information captured with XP − EHH.

11 :
allele a Discrete Unif 0, 1 and replicate L times for each 5: genome carrier n, n ∈ 1, 2, …, N e in X F 1 and Y F probability parent, p 1 Discrete Unif 1, N e from X F 1 Sample with equal probability parent, p 2 Discrete Unif 1, N e from Y
066Mb and strains differ by 73,015 SNPs SNP spacing = 12.066Mb/73, 015 ≈ 165 The recombination rate conversion inside the function: avg numrec = L × SNP spacing − 1 × r (This is the average number of recombination events of genome chunk on which the SNPs are spaced) Example: For L = 100, and r = 3.0 × 10 −4 avg numrec = 4.95 Convert r SNP = avg numrec / L − 1 = 4.95/99 = 0.05 (This is the recombination rate of SNPs alone) avg numrecSNP = L − 1 × r SNP = 99 × 0.05 = 4.95 = avg numrec (This is the average number of recombination events of SNPs alone) Parameters per SNP equivalent to line 13 and line 15 ofAlgorithm 3: Sample number of recombination events independently of each other (line 13): n r Bin L − 1, r SNP Sample loci of recombination events (line 15): L rSNP Discrete Unif n r , L − 1 What is written in line 13 and line 15 ofAlgorithm 3: Sample number of recombination events independently of each other (line 13): n r Bin L × SNP spacing − 1, r Sample loci of recombination events (line 15): L r Discrete Unif n r , L × SNP spacing − 1

[
41], such that ∑ m = 1 M w m = 1, to obtain a weighted sample (θ m adj * , w m ).The w m can be calculated based on either Epanechnikov kernel: SumStat obs scaled )β ˆW LS

Fig. 10 .
Fig. 10.Mean of summary statistic p X − p Y values per locus at SNP 1 to L = 20 for nsim = 100, 000 simulated data sets, with selection at single locus at SNP 10.The decrease in p X − p Y is shown to be proportional to distance away from locus under selection, and with drops of p X − p Y values at migration m generations (left).Superimposed expected values (deterministic model), and mean of 100,000 simulated data sets of p X − p Y at locus under selection (upper right).Visible genetic drift for a randomly chosen simulated data set.

Fig. 1 .
Fig. 1.Population genetic model for divergent selection: A. Population divergence for t* − 1 generations during which reproduction is asexual and the absolute fitness depends on the allele-specific copy under which selection acts upon.Each offspring is genetically identical to its only parent and an individual is chosen to be a parent with probability proportional to its fitness.B. Symmetric migration rates m XY and m Y X between t* − 1 and t* generation.Neutral evolution with recombination.C. Second population divergence.

Fig. 2 .
Fig. 2.Possible parameter space of selection coefficients for the 1st method, i.e. without the initial outlier scan on summarized observed data.Scheme of selection coefficients conditional on genome loci s i | I = i , rates of migration m , and possible modes of reproduction during migration generations sex .The positions under selection can only take on values on L/3, L/2, 2L/3 , being able to re-use the same nsim = 100, 000 data sets for the ABC iterations.

Fig. 5 .
Fig. 5. True parameter value under which the observed parameter is generated (x-axis) vs. median, MSE, variance, bias-squared for L/3 th, L/2 th and 2L/3 th SNP respectively for the 1st method, i.e. without the initial outlier scan, from n ABC = 10,000 ABC tests, based on signF ST summary statistics from Algorithm 2 with Gaussian kernel.Colors based on four migration m and mode of reproduction sex combinations of observed data sets: sex = 0 and m = 0.0 in yellow; sex = 1 and: m = 0.0 in red, m = 0.2 in turquoise, m = 0.5 in black.A pattern is shown of a positive relationship between the magnitude of true observed selection
L, loci corresponding to SNPs under selection L s , alleles under selection a, recombination rate r, selection coefficients at loci under selection s, migration rate between populations m, number of generations between migration-selection cycles t*, type of reproduction during migration generations sex, spacing of SNPs on the genome, assumed equal spacing SNP spacing , and number total number of generations to simulate t final .
. The number of recombination events n r between p 1 and p 2 are sampled from Bin L × SNP spacing − 1, r , with r corresponding to genomic recombination rate and SNP spacing corresponding to SNP spacing on the genome.Loci of recombination events, L r , are sampled n r times from L r Discrete Unif n r , L × SNP spacing − 1 .If a random variable rv sampled from Uniform distribution on [0, 1] is less than 0.5, i.e. if rv Unif 0, 1 < 0.5, p 1 is reassigned to p 2 and p 2 is reassigned to p 1 .The recombination of p 1 and p 2 then starts from the genome of p 1 and alternates between two parents to form an offspring.This concludes founding F 2 at time

in Part A and Part B respectively
are repeated for a total specified number of cycles n cycles , with ending on the final generation before migration, i.e. t final = n cycles t* − 1.The output of the data of the x matrices of dimension N e × L per population is then ready for model inference by mapping the data to summary statistics and the ABC.
w m to accepted data sets from ABC-rejection [41].The weight w m for each of accepted pairs from Algorithm 1 output are calculated using kernel κ d M d m [42], where d m is Table 2 is as follows: for L = 100 SNPs, with recombination rate of 3.0×10 −4 Morgans/bp.The expected number of crossovers was μ n r = 4.95 and average spacing of polymorphic sites on the genome every SNP spacing = 165