ABLE: blockwise site frequency spectra for inferring complex population histories and recombination

We introduce ABLE (Approximate Blockwise Likelihood Estimation), a novel simulation-based composite likelihood method that uses the blockwise site frequency spectrum to jointly infer past demography and recombination. ABLE is explicitly designed for a wide variety of data from unphased diploid genomes to genome-wide multi-locus data (for example, RADSeq) and can also accommodate arbitrarily large samples. We use simulations to demonstrate the accuracy of this method to infer complex histories of divergence and gene flow and reanalyze whole genome data from two species of orangutan. ABLE is available for download at https://github.com/champost/ABLE. Electronic supplementary material The online version of this article (10.1186/s13059-018-1517-y) contains supplementary material, which is available to authorized users.

Correlations between the parametric bootstrap estimates obtained for the 2kb dataset ( Table 2). The scatter plots in the lower and upper half of the figure (are symmetrically identical) show the pairwise correlations among parameter estimates under M6. Density plots for each parameter are respectively given along the diagonal. Accuracy for each parameter is given as the base-2 logarithm of the relative error (i.e. estimated value divided by the truth): 0 (red line) corresponds to the truth, increments of +1/-1 to two-fold over-or under-estimates on the coalescent time scale. The black dots show the inferred point estimates for the 1, 2 or 3 diploid genomes per population schemes given in Table S3. The blue dots correspond to the inferred point estimates on data simulated with gene conversion (Table S3).  Table S3). Two types of block sizes were considered (500bp and 2kb) by keeping the whole genome size fixed at 200Mb. A single likelihood was evaluated at the true MCLE for each sampling configuration and block size by varying the number of sampled ARGs (x -axis). Times have been approximated to the nearest second. The y-axis represents inference accuracy as given as the base-2 logarithm of the relative error (i.e. estimated value divided by the truth): 0 (red line) corresponds to the truth, increments of +1/-1 to two-fold over/under-estimates respectively.

Fig. S5 Parameter inference under the true and alternative models.
The y-axis represents inference accuracy as given as the base-2 logarithm of the relative error (i.e. estimated value divided by the truth): 0 (red line) corresponds to the truth, increments of +1/-1 to two-fold over/under-estimates respectively.

Fig. S6 Parameter inference under the true and alternative models.
The y-axis represents inference accuracy as given as the base-2 logarithm of the relative error (i.e. estimated value divided by the truth): 0 (red line) corresponds to the truth, increments of +1/-1 to two-fold over/under-estimates respectively.   The y-axis represents inference accuracy as given as the base-2 logarithm of the relative error (i.e. estimated value divided by the truth): 0 (red line) corresponds to the truth, increments of +1/-1 to two-fold over/under-estimates respectively.
Fig. S10 Comparison of ∂a∂i and ABLE under model M6. The y-axis represents inference accuracy as given as the base-2 logarithm of the relative error (i.e. estimated value divided by the truth): 0 (red line) corresponds to the truth, increments of +1/-1 to two-fold over/under-estimates respectively.
Fig. S11 Effect of long range linkage on the absolute model fit using simulations and for the most common configurations. Two datasets were simulated under M6 using long (0.5 Mb) contiguous sequences totaling 200 Mb (cut into 500bp and 2kb blocks respectively) and with the inferred MCLEs from the orangutan dataset (Table S1 and Table 1 respectively). The expected bSFS (x -axis) was generated by sampling 50 million ARG's at the inferred MCLEs and using the same values as for the simulated bSFS (y-axis). The diagonal black line indicates the perfect match between the expected and simulated. The colours represent the total number of SNPs contained in each configuration.
Fig. S12 Absolute model fit to the observed 500bp bSFS for the most common configurations. Each point represents a unique mutational configuration making up the bSFS. The expected bSFS (x -axis) was generated with ABLE using 50 million ARG's at the MCLE for each model (Table S1) and plotted against the observed bSFS (y-axis) from the orangutan data. The diagonal black line indicates the perfect match between the expected and observed. The colours represent the total number of SNPs contained in each configuration. Fig. S13 The number of unique bSFS configurations for different block sizes. Data sets of a total length of 200Mb sized were simulated as independent blocks of a fixed length (x -axis) under model M6 estimated from the 2kb MCLE data (see Table 1). For each block length the number of unique bSFS configurations is plotted on the y-axis. For this history, the number of configurations defined by the bSFS is maximized for blocks between 1kb -10kb.
Fig. S14 Accuracy of the log likelihood (LnL) approximation for M6. The LnL was evaluated at the MCLEs obtained for the 500bp (Table S1) and 2kb datasets (Table 1) for an increasing number of sampled ARGs (x -axis). Violin plots show the entire spread around the respective averages (red dots) over 100 log likelihood evaluations. Note that the y-axis represents the per block CLs i.e. downscaled with respect to the number of blocks found in each dataset.
Fig. S15 Relative fit between models using 500bp blocks. Histograms of 100 evaluations of the Composite Likelihood (CL) using 1 million ARG's and at the MCLE corresponding to each model (Table S1). Note that the x -axis represents the per block CLs i.e. downscaled with respect to the number of blocks found in each dataset. Models further to the right are a relatively better fit than those to the left. Model M1 is ignored (appears much further to the left) as it has the worst fit among all models.