Blockwise Site Frequency Spectra for Inferring Complex Population Histories and Recombination

We introduce ABLE (Approximate Blockwise Likelihood Estimation), a novel composite likelihood framework based on a recently introduced summary of sequence variation: the blockwise site frequency spectrum (bSFS). This simulation-based framework uses the the frequencies of bSFS configurations to jointly model demographic history and recombination and is explicitly designed to make inference using multiple whole genomes or genome-wide multi-locus data (e.g. RADSeq) catering to the needs of researchers studying model or non-model organisms respectively. The flexible nature of our method further allows for arbitrarily complex population histories using unphased and unpolarized whole genome sequences. In silico experiments demonstrate accurate parameter estimates across a range of divergence models with increasing complexity, and as a proof of principle, we infer the demographic history of the two species of orangutan from multiple genome sequences (over 160 Mbp in length) from each species. Our results indicate that the two orangutan species split approximately 650-950 thousand years ago but experienced a pulse of secondary contact much more recently, most likely during a period of low sea-level South East Asia (∼300,000 years ago). Unlike previous analyses we can reject a history of continuous gene flow and co-estimate genome-wide recombination. ABLE is available for download at https://github.com/champost/ABLE.


Introduction
Demographic history has a major role in shaping genetic variation. However, using this information 18 in an efficient way to infer even very simple models of population history remains challenging: 19 a complete description of the history of genomic samples includes both the ancestral process of 20 coalescence and recombination, as captured by the ancestral recombination graph (ARG). While the 21 ARG is straightforward to simulate, in practice, the number of recombination and coalescent events 22 in any stretch of genome generally exceeds the information (i.e. number of mutations) available to 23 reconstruct them. Thus, it is currently not feasible to perform demographic inference by integrating 24 over all realizations of the ARG that are compatible with a genomic dataset [1].  Other methods seek to use linkage information by approximating recombination, i.e. the sequential 33 transitions between local genealogies along the genome, as a Markov process [8,9]. Methods based 34 on the Sequential Markov Coalescent (SMC, [10]) are computationally intensive, limited to relatively 35 simple models [11] or small samples [8,12,13] and require good genome assemblies which are presently 36 available only for a handful of species. 37 Multi-locus methods exploit information contained in short-range linkage by assuming that 38 recombination has is negligible effects within short blocks of sequence [14][15][16][17][18]. However, this approach 39 potentially biases demographic inference and still looses information contained in longer range linkage 40 disequilibrium (LD), which is expected to result from historical admixture or drastic changes in 41 population size. While recombination within blocks has been included in multi-locus inference, this 42 currently does not scale to whole genome data [19]. Interestingly, the few methods capable of jointly 43 inferring recombination (using the SMC) and demography using whole genomes [12,13] can only 44 analyze a couple of samples or are restricted to specific population histories [32,33]. 45 To overcome these limitations, we introduce a composite likelihood (CL) framework which is 46 highly flexible both in terms of the demographic histories and data that can be accommodated. We 47 can infer arbitrarily complex demographic histories along with the average recombination rate using 48 multiple whole genomes or genome-wide multi-locus data (e.g. RADSeq) catering to the needs of 49 researchers studying model or non-model organisms respectively. Our method builds upon an existing 50 analytic approach [16,18] that partitions the genome into blocks of equal (and arbitrary) size and 51 summarizes the genome-wide pattern of linked polymorphism as a frequency distribution of blockwise 52 site frequency spectra. We refer to this straightforward extension of the SFS, as the distribution 53 of blockwise SFS configurations, or simply the bSFS. The bSFS is a richer summary of sequence 54 variation than the SFS which retains information about the variation in genealogies contained in 55 linkage within blocks. We use Monte Carlo simulations from the coalescent with recombination to 56 approximate the bSFS. This overcomes the limitations of exact likelihood calculations [18,20] based 57 on the bSFS by accommodating larger samples of genomes and including recombination within 58 blocks as a free parameter. Our approach is implemented in the software Approximate Blockwise 59 Likelihood Estimation (ABLE) which is freely available (https://github.com/champost/ABLE). 60 The paper is structured as follows: we first describe how the bSFS can be approximated for 61 samples from single and multiple populations both with and without recombination. The accuracy 62 of our approximation is assessed by comparing it to analytic results for small samples in the absence 63 of intra-block recombination under three different demographic models. We then illustrate the 64 performance of ABLE on real data by analysing whole genomes of the two species of orangutan 65 (Pongo pygmaeus and P. abelii ) which inhabit the islands of Borneo and Sumatra respectively [21,22]. 66 These sister taxa represent an excellent test case as their demographic history has been the subject 67 of several previous analyses [12,13,19,[21][22][23] and the geological knowledge of the Sunda shelf 68 is extensive [24]. The best supported history we infer is a previously unexplored scenario of a 69 population divergence (about a million years ago) followed by a discrete pulse of bidirectional 70 admixture which coincides with a cyclical sea-level changes in South East Asia [24]. We also obtain 71 plausible estimates for the per generation genome-wide recombination rate. Finally, we make use of 72 extensive simulations to asses the inferential power of our approach. We explore the ability of ABLE 73 to distinguish between various two population models and investigate the effects of sample and 74 block size on parameter estimates. We also compare the performance of a small-sample inference 75 with ABLE to that based on the SFS (∂a∂i [3]) using larger samples. 76

77
The blockwise SFS (bSFS) 78 Consider a random sample of sequence blocks of fixed length. In practice, such sequence blocks 79 (coloured segments in Fig. 1  the bSFS is essentially a frequency spectrum of site frequency spectra types across blocks (i.e. a 86 histogram of histograms) and can be thought of as a straightforward extension of the SFS that 87 accounts for linkage over a fixed length of sequence block (Fig. 1).

88
The bSFS readily extends to samples from multiple populations where the entries of k are counts 89 of mutation types defined by the joint SFS [6]. One advantage of the bSFS is that we only require 90 unphased data as mutations are not distinguished based on unique branches but branch classes 91 (singletons, doubletons, etc., see Fig. S1). In the absence of outgroup information and/or to avoid 92 biases due to errors when polarizing with distant outgroups, the bSFS may be folded.

106
To study how the number of sampled ARGs summarized by the bSFS affects the convergence of the 107 approximate CL to the analytical expectations, we considered small samples under three simple 108 demographic models: a single population (b = 4, no outgroup) which doubled in effective size (N e ) 109 at time T = 0.2 (Fig. 2a), a history of isolation between two populations A and B (at time T = 1.  isolation between populations A and B followed by continuous unidirectional migration (from A to B) at rate M migrants per generation and (c) IUA: isolation between three populations A, B and C followed be unidirectional admixture of a fraction f from A to B. Analytical expectations for these models can be found in [18,20,25].
The Monte Carlo approximation to the distribution of bSFS configurations matches the analytic 121 prediction extremely well (Fig. 3) even when only small samples of genealogies are used, e.g. 1000 122 simulated replicates. This is perhaps surprising, given that this sample size is on the same order 123 as the number of unique bSFS configurations. For example, for a sample of b = 2 from the two 124 populations IM model (Fig. 2b) and counting up to k max = 4 mutations per SFS-type and block, 125 there are 396 unique bSFS configurations. Interestingly, the probability of bSFS configurations 126 involving fixed differences ( Fig. 3; green middle row) can be approximated accurately with fewer 127 sampled genealogies than the probability of rare configurations that include shared polymorphism 128 ( Fig. 3; blue middle row). The latter are due to either incomplete lineage sorting or admixture. 129 Likewise, for the IUA model, the probability of bSFS configurations involving mutations shared by 130 A and B was harder to approximate than that of (B, (A, C)) configurations (blue vs. magenta in 131 Fig. 3

133
To demonstrate the performance of our new bSFS approach using the ABLE framework on real data, 134 we re-analysed whole genome data [21,22] for the two species of orangutan (Pongo pygmaeus and P. 135 abelii ) which inhabit Borneo and Sumatra respectively. These sister taxa are an excellent test case 136 given that their demographic history has been the subject of several previous analyses [12,13,19,[21][22][23]. 137 We selected a subsample consisting of two diploid genomes per species (i.e. b = 4 per island) and 138 inference, all analyses were repeated using shorter blocks (500bp; 9,085 unique bSFS configurations) 142 which were obtained by dividing each 2kb block.  Table 1). The only exception was the IM model (M4) which did 159 not increase support compared to a strict divergence history (M2). Interestingly, we found greater 160 support for instantaneous admixture (M6) compared to a history of Isolation and Initial Migration 161 (IIM) up to a time T 2 (M5).   Fig. 4) in Table 2 (see also Fig. S2) indicate that we have relatively greater power to infer more recent 175 aspects of orangutan history (N S , N B and T 2 ) compared to the time of initial divergence (T ) and 176 the size of the common ancestral population (N A ). While the admixture fraction estimated from 177 Sumatra to Borneo (f S→B ) was not significantly different from 0, admixture estimates in the reverse 178 direction had much tighter CI which clearly excluded zero.
179 T 2 284,000 -306,000 Effect of block length and sample size 180 We assessed how block and sample size affects ABLE's ability to infer two-population histories and 181 recombination in two ways. First, we repeated the orangutan analyses using shorter blocks (500bp). 182 Second, we used simulations to investigate how sampling additional genomes per population affects 183 our inferential power.

184
Block length

185
Comparing estimates based on 2kb blocks (Table 1) to shorter 500bp blocks (Table S1) suggest that 186 most, but not all, aspects of the inference were fairly robust to block length. As expected, shorter 187 blocks led to a greater uncertainty in model and parameter estimates (Table S2). Importantly 188 however, even with 500bp blocks, M6 was identified as the best fitting model and we found broad 189 overlap in 95 % CIs of parameter estimates with the 2kb analysis.

190
Both the divergence time T and the genome-wide recombination rate r were poorly estimated 191 with 500bp blocks. While the 2kb analyses resulted in fairly stable inferences for r (≈ 2 × 10 −8 bp −1 192 per generation) that agree with recombination estimates for humans [31], the 500bp estimates were 193 2-4 times greater and had very wide 95 % CIs (Table S2). However, note that the 95 % CIs of T 194 under 2kb and 500bp do overlap for periods < 1 Mya.

195
To test whether our method has any inherent bias to overestimate recombination with shorter 196 blocks, we simulated block-wise data under model M6 using the r estimates obtained from the 197 2kb data (Table 1). Applying ABLE to these simulated datasets and after taking into account the 198 sampling strategy (i.e. M6 (2dp), Table S3), we noticed no significant overestimation of recombination 199 rates. This may point to assumptions made by our approach on the type of recombination which 200 may have been violated when considering 500bp blocks from the Orangutan data (see Discussion). 201

Sample size 202
As expected, point estimates and power generally improved (Table S3 and Fig. S3) with increasing 203 sample sizes. While some parameters, in particular r, appear non-identifiable with minimal sampling 204 (a single diploid genome per species), all eight parameters of M6 are well estimated with just two or 205 three diploid genomes. We observed a five fold improvement in accuracy for r and an up to two 206 fold improvement for demographic parameters when increasing sampling effort from a single to two 207 diploid genomes per population.

208
Perhaps surprisingly however, Fig. S3 suggests that for histories similar to that inferred for the two 209 orangutan species, we can expect at best slight improvements in power when adding a third diploid 210 genome per population. Given that analysing three diploid samples per population almost triples 211 the computation time ( Fig. S4), this suggests that (at least in the case of orangutans) analysing a 212 total of four diploid genomes is a good compromise between information and computational cost.  Table S4) and compared the overall fit 218 to the true and alternative models. As expected (given that models were nested) data generated 219 under simple models did not give a better fit to more complex histories (Fig. 6). In contrast, data 220 generated under more complex histories showed a worse fit to simpler scenarios than the truth.

221
However, given the increasing dimensionality of the likelihood surface for more complex models, 222 the similar LnL values for nested models did not imply that the actual ML estimates of demographic 223 parameters under the simpler models were a subset of the corresponding estimates under the more 224 complex models (see Fig. S5, Fig. S6 and Fig. S7). For instance, given M1 as the true model 225 (Fig. S5), the population split time was largely overestimated under M6 as this model also consists 226 of another confounding demographic feature, a pulsed admixture event subsequent to divergence. 227 Interestingly, the genome-wide recombination rate was fairly consistently estimated among the 228 various models while the ancestral population size was consistently underestimated. To further investigate the ability to correctly identify complex demographies involving post-divergence230 admixture, we generated 20 simulated datasets under the most complex model considered in the 231 orangutan analysis (M6). We considered 9 different divergence/admixture times which varied from 232 900-150Kya and 600-75Kya respectively with all other values being kept constant (Table S4) Table S4). Each dataset 244 consisted of 5 diploid genomes/population. The ∂a∂i analyses were based on the whole sample, 245 while ABLE was only used on subsamples of two diploid genomes per population.

246
Despite the fact that ABLE used less than half of the data, it performed as well and in some 247 cases better as ∂a∂i (Fig. S9, Fig. S10  calculation or approximation of the bSFS relies on the assumption that blocks are statistically 290 exchangeable, and so ignores heterogeneity in mutation and recombination rates. 291 We can visualize the absolute fit of our demographic model to the data by comparing the observed 292 distribution of bSFS configurations to that expected under M6 (obtained using 50 million simulated 293 blockwise ARGs). If the data were generated entirely by the inferred demographic history alone, 294 we would expect the most common bSFS configurations to fit this expectation most closely (see 295  If background selection at linked sites is indeed responsible and a fraction of selective targets is 301 weakly deleterious, we would expect increasingly biased N e estimates going backwards in time. This 302 could explain why we estimated a much smaller effective size for the ancestral population (Table 1 303 and consider a bSFS with reduced block size (Fig. S13). Going further, it will be interesting to explore 306 the possibility of jointly inferring demography and various forms of selection using the bSFS [35]. improve the power to infer recent demographic events.

318
Our finding of larger r estimates when using shorter blocks for the Orangtuan data was surprising. 319 Given that our method ignores heterogeneity in both r and µ, both of which increase auto-correlation 320 across short distances, we expected to find the opposite, i.e. a decrease in r estimates for shorter 321 blocks. However, our simulation analysis showed that ABLE gives relatively unbiased estimates 322 of r for short (500bp) blocks when inference was performed using 2 or more diploid samples per 323 population (Table S3). A plausible explanation for the larger estimates in Orangutan data could 324 be gene conversion as it could potentially inflate r at short blocks with a diminishing effect on the 325 bSFS for blocks that are longer than the typical conversion tract length of several hundred bases 326 (see Table 2 in [36]). It should be possible to use this dependence on block length to develop explicit 327 estimators for gene conversion and cross-over rates.

328
Even under a complex demography such as M6, our simulation-based power analyses indicate 329 that most demographic parameters can be reasonably recovered with only a single diploid genome 330 per population (Table S3). Increasing sample size to two diploid genomes more than halved the 331 standard deviation in estimates for some parameters, most notably the recombination rate (Fig. S3). 332 However, a further increase in sample size gave a negligible improvement, despite the considerable 333 computational cost (Fig. S4)  in similar inferences of the population history and with a slight improvement for admixture rates 348 when using ABLE (Fig. S9, Fig. S10 and Fig. S11). Of note however, we only make use of a random 349 pilot analyses varying some or all of the factors mentioned above (see Fig. S15 and Fig. S14) should 368 help to inform the inference strategy.

369
It is also clear that, independent of the inference approach, the information in the data is finite 370 so there must be a hard limit on how realistic a history one can hope to infer. Thus, the fact that 371 The branches of a given genealogy corresponding to our population sample can be partitioned 393 into a vector t whose entries t i ∈ [t 1 , t 2 , . . . , t b−1 ] denote the total length of all branches with i 394 descendants (Fig. S1). The probability of observing k i mutations on a branch class t i is given by a 395 Poisson distribution with rate parameter θt i > 0: Because mutations occur independently on different branch types, the joint probability of seeing 397 a specific configuration k j = {k 1,j , k 2,j , . . . , k b−1,j } in a sequence block j and for a given branch 398 length vector t is then a product of Poisson distributions The likelihood L(Θ) at a point in parameter space Θ ∈ R + is calculated as where G is the (unknown) genealogy and D the data [44]. Summarizing genealogies G by t and D by 401 k j and drawing M random samples of t from p(t | Θ), the Monte Carlo approximation of Eq. 3 can 402 be obtained In theory, each block in a dataset might have a unique bSFS configuration. In practice however, 404 for short blocks spanning a handful of SNPs (e.g. < 10), the number of observed bSFS configurations 405 will be much smaller than the number of blocks. Assuming that blocks are equivalent and independent, 406 that is, they have the same length, per base mutation and recombination rates and are unlinked, we 407 can summarize the entire genome into blockwise data (Fig. 1) by counting the number of each unique 408 bSFS type n k j . Thus, the approximate joint composite log likelihood for a sample of n genomes is 409 given as ). However note that Θ ∈ R 2+ can be too restrictive a criterion for some parameters of 431 complex demographies such as coefficients of exponential population expansion/contraction where 432 α S , α B ∈ R (see Fig. 4).
and  All sites that did not pass this criteria (0 < VAF < 0.3) were coded as missing (N). 471 We also masked any non reference allele that were within 3bp of an indel and checked for allele 472 balance at heterozygous calls (e.g. each allele was covered by > 20% of the reads at that position).  Table 1 and Table S1, we report the best MCLEs whose likelihoods have been evaluated using 1M 491 ARGs. For some models for which replicate local searches did not converge sufficiently, a second 492 round of local searches was used.  and under the best model (i.e. M6) and MCLE (Table 1 and Table S1). Blocks in each dataset 511 were assumed to be completely linked (given our estimate of per site r) across 0.5 Mb stretches of 512 sequence. These simulations represent an extreme case of linkage and are thus conservative. Indeed 513 our real data contain large gaps between blocks especially due to the highly repetitive nature of 514 the orangutan genome. As we wish to know the local variability of the bootstrap inferences around 515 the MCLE obtained from the orangutan data, we only carried out local searches for each bootstrap 516 replicate (using the boundaries and step sizes obtained in the analysis of real data, see above).

517
The simulation based power test exploring the effect of sample size (1 -3 diploid genomes per 518 population) was based on inferences using simulated data followed by a full parametric bootstrap. 519 Given the computational effort required (see Fig. S4), we restricted our study to 500bp blocks with 520 values for the demographic parameters chosen to represent the results inferred from the real data 521 under M6 (Table 1 and Table S1). Parametric bootstrap datasets were generated with linkage (under 522 the full ARG) exactly analogous to the bootstrap in the real data analysis. Mb in size). A total of 10 datasets were simulated for each of the three scenarios (see Table S4). 548 Each dataset was either summarized as the folded SFS (for a subsequent ∂a∂i analysis) and by the 549 folded bSFS by randomly sampling 2 diploid genomes from each population (for a subsequent ABLE 550 analysis   Table S1 Point estimates for the demographic history of orangutan species obtained 689 from 500bp blockwise data (cf. Fig. 4).

690
Model  Simulations were performed with the values specified below for studying model misspecification under 700 progressively nested models (cf. Fig. 4). r stands for the recombination rate/base pair/generation 701 and we assumed 20 years/generation. All event times (T and T 2 ) have been specified in years. where the divergence time between populations (T ) was set to be that of the true admixture time(i.e. 740 T 2 ). Above each plot, the true split and admixture times are given in a T : T 2 format and given in 741 the inferred MCLEs from the orangutan dataset (Table S1 and Table 1  The expected bSFS (x -axis) was generated with ABLE using 50 million ARG's at the MCLE for 766 each model (Table S1) and plotted against the observed bSFS (y-axis) from the orangutan data. 767 The diagonal black line indicates the perfect match between the expected and observed. The colours 768 represent the total number of SNPs contained in each configuration. under model M6 estimated from the 2kb MCLE data (see Table 1). For each block length the 772 number of unique bSFS configurations is plotted on the y-axis. For this history, the number of 773 configurations defined by the bSFS is maximized for blocks between 1kb -10kb.  (Table S1) and 2kb datasets (Table 1) for an 776 increasing number of sampled ARGs (x -axis). Violin plots show the entire spread around the 777 respective averages (red dots) over 100 log likelihood evaluations. Note that the y-axis represents 778 the per block CLs i.e. downscaled with respect to the number of blocks found in each dataset.

779
Fig. S16 Relative fit between models using 500bp blocks. Histograms of 100 evaluations 780 of the Composite Likelihood (CL) using 1 million ARG's and at the MCLE corresponding to each 781 model ( Table Table S1). Note that the x -axis represents the per block CLs i.e. downscaled with 782 respect to the number of blocks found in each dataset. Models further to the right are a relatively 783 better fit than those to the left. Model M1 is ignored (appears much further to the left) as it has 784 the worst fit among all models.
we are using 2Kb blocks from the Orangutan data consisting of 2 diploid samples each from the 803 Bornean and Sumatran populations. The data files and further documentation is available for 804 download from https://github.com/champost/ABLE/data. we are using 2Kb blocks from the Orangutan data consisting of 2 diploid samples each from the 808 Bornean and Sumatran populations. The data files and further documentation is available for 809 download from https://github.com/champost/ABLE/data. Text. S10 Python code (for a ∂a∂i analysis) defining models M6 (see Fig. 4).