HaploBlocks: Efficient Detection of Positive Selection in Large Population Genomic Datasets

Abstract Genomic regions under positive selection harbor variation linked for example to adaptation. Most tools for detecting positively selected variants have computational resource requirements rendering them impractical on population genomic datasets with hundreds of thousands of individuals or more. We have developed and implemented an efficient haplotype-based approach able to scan large datasets and accurately detect positive selection. We achieve this by combining a pattern matching approach based on the positional Burrows–Wheeler transform with model-based inference which only requires the evaluation of closed-form expressions. We evaluate our approach with simulations, and find it to be both sensitive and specific. The computational resource requirements quantified using UK Biobank data indicate that our implementation is scalable to population genomic datasets with millions of individuals. Our approach may serve as an algorithmic blueprint for the era of “big data” genomics: a combinatorial core coupled with statistical inference in closed form.


Population genetic model and inference scheme-derivation of Equation 6
We start by expanding the logarithm of Equation 4

Filtering blocks-choice of thresholds
In order to determine filter thresholds that minimise false positive calls while achieving good accuracy, we considered all haploblocks that overlap the midpoint of the simulated chromosome segment detected in four arbitrarily chosen simulations per simulated selection coe cient generated as part of Figure S9, at time points where the selected allele has frequencies 0% and 60%. Note that the former e↵ectively corresponds to a neutral simulation.
For various combinations of thresholds, the subset of these blocks that do not have the selected allele are counted as false positives if they pass the filters, and as true negatives if they do not; while those MBE that do have the selected allele are counted as true positives if they pass the filters, and as false negatives if they do not. We note that unlike in the confusion matrices for example shown in Figures S5-S8, a low true positive rate is not a problem, as in principle a single block is enough to infer selection at the midpoint. The main criteria for the choice of thresholds is therefore to minimise false positives, followed by the accuracy of the point estimation of the selection coe cient.
In a first step, we set the threshold ont 1 to 1, e↵ectively disabling the filter defined by Equations 7-12, and try various thresholds ont 2 as defined by Equation 17. The results are summarised in Table S1, and show that on its own no threshold is enough to fully avoid false positives. This filtering step is intended primarily to catch long blocks with few haplotypes, expected under recent common ancestry, and we therefore opt to continue with a threshold of 1%. As shown in Figures S3 and S4, this is indeed enough to filter long blocks containing only few haplotypes.
In a second step, we look for the strongest threshold ont 1 that leads to no false yet some true positives when applied after the previous filtering step with threshold of 1%. As can be seen in Table S2, only thresholds of 1% and the more stringent adaptive threshold defined by Equations 10-12 satisfy these conditions, and we therefore chose the latter. The adaptive threshold ont 1 is parametrised by a minimal detectable selection coe cient s min . The underlying quantile is at least 1%, but not lower than 0.01% to avoid comparing very small numbers. The reasoning is that in relation to selection the e↵ect of drift, which is not modelled as part of our sweep model, is stronger for low frequency alleles, and such an adaptive threshold therefore represents an additional safeguard against false positives.
FIG. S1. Probability of a haplotype in a haplotype block. The red and blue lines indicate haplotypes A and B respectively and dots indicate SNPs. SNPs S 1 , S 2 , S 3 , S 4 define the regions that lie between or contain the breakpoints. The probability of observing the haplotype denoted by 'obs.' (observed) is the product of two independent recombination events R 1 and R 2 , i.e. the probability of no event in a region with recombination fraction r times the probability of one event in a region with recombination fraction r. Table S1. Haploblocks and inferred selection coe cients at di↵erent filter thresholds. We count all haploblocks that overlap the midpoint of the simulated chromosome segment detected in four arbitrarily chosen simulations per simulated selection coe cient generated as part of Figure S9, at time points where the selected allele has frequencies 0% and 60%.        We count each simulation once, and infer selection if at least one block that carries the selected allele covers the midpoint, i.e. the locus simulated with selection. We show true positive and false negative proportions both for time points where the frequency f(a) of the beneficial allele a is at 60% (middle row), and jointly for the 12 frequencies above 0% (third row; frequencies 2%, 5%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100%      Figure S24 for the 88,906 blocks that passed both filtering steps in the neutral simulation with constant population size mimicking chromosome 2 of the UK Biobank (see Materials and Methods for details on the simulation FIG. S26. Proportion of artificial chromosomes covered by haplotypes inferred to be under selection in a neutral simulation of chromosome 2 with bottleneck and exponential population growth downsampled to ⇠48k SNPs. Results corresponding to the analysis presented in Figure S24 for the 1,252,333 blocks that passed both filtering steps in the neutral simulation with bottleneck and exponential population growth mimicking chromosome 2 of the UK Biobank (see Materials and Methods for details on the simulation).