Enhancing the mathematical properties of new haplotype homozygosity statistics for the detection of selective sweeps
Introduction
A selective sweep, the process whereby beneficial mutations at a locus that contribute to the fitness of an organism rise in frequency to become prevalent in a population, can occur through two main mechanisms that leave distinct genomic signatures (Pritchard et al., 2010, Cutter and Payseur, 2013, Messer and Petrov, 2013). A relatively new adaptive allele can proliferate so that the single haplotype on which it has occurred reaches a high frequency, resulting in a signature of a “hard” selective sweep (Maynard Smith and Haigh, 1974, Kaplan et al., 1989, Kim and Stephan, 2002). Alternatively, a mutation that arises de novo multiple times or exists as standing genetic variation on several haplotype backgrounds before the onset of positive selection can increase in frequency; in these cases, multiple favored haplotypes have relatively high frequencies, generating a signature of a “soft” selective sweep (Hermisson and Pennings, 2005, Przeworski et al., 2005, Pennings and Hermisson, 2006a). Soft sweeps can provide an effective mechanism for natural selection and might explain a sizeable fraction of selective events in many systems (Orr and Betancourt, 2001, Innan and Kim, 2004, Pritchard et al., 2010, Messer and Petrov, 2013).
Most statistical methods that have been designed to detect selective sweeps from patterns of genetic polymorphism search for patterns expected under a hard-sweep model, such as the presence of a single common haplotype (Hudson et al., 1994), high haplotype homozygosity (Depaulis and Veuille, 1998, Sabeti et al., 2002, Voight et al., 2006), high-frequency derived variants and related features of site-frequency spectra (Tajima, 1989, Braverman et al., 1995, Fay and Wu, 2000, Nielsen et al., 2005), or local loss of variation near a putative selected site (Maynard Smith and Haigh, 1974, Begun and Aquadro, 1992, Kim and Stephan, 2002). Many methods that search for patterns expected with hard sweeps, however, can be less well suited to the problem of identifying soft sweeps (Pennings and Hermisson, 2006b, Teshima et al., 2006, Cutter and Payseur, 2013). Therefore, current genomic scans for selective sweeps might be limited in their ability to uncover an important class of adaptive events.
Recently, it has been shown that statistics based on haplotype homozygosity can identify both hard and soft sweeps from population-genomic data (Ferrer-Admetlla et al., 2014, Garud et al., 2015). Garud et al. (2015) developed a haplotype homozygosity statistic, $H_{12}$, relying on the principle that in a soft sweep, the most frequent haplotype might not predominate in frequency, and instead, multiple frequent haplotypes might be present. In terms of haplotype frequencies $p_i$ for $i = 1, 2, \ldots$, with $p_1 \geq p_2 \geq \ldots$ and $\sum_i p_i = 1$, Garud et al. (2015) defined $H_{12}$ as
$$H_{12} = (p_1 + p_2)^2 + \sum_{i > 2} p_i^2.$$
This statistic calculates homozygosity by combining the two largest haplotype frequencies into a single value and then computing a haplotype homozygosity. Garud et al. (2015) determined that $H_{12}$ has reasonable power to detect both hard and soft sweeps, applying the statistic to Drosophila population-genomic data and identifying abundant signatures of natural selection.
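As a minimal sketch of the pooling step described above, the following Python function computes the statistic from a list of haplotype frequencies. The function name `h12` and the two example frequency vectors are hypothetical and are not taken from the Drosophila data; they are chosen only to contrast one dominant haplotype with two co-dominant haplotypes.

```python
def h12(freqs):
    """Haplotype homozygosity after pooling the two most frequent haplotypes.

    freqs: haplotype frequencies in any order; they must sum to 1.
    """
    p = sorted(freqs, reverse=True)  # p[0] >= p[1] >= ...
    if abs(sum(p) - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")
    pooled = sum(p[:2])  # treat the two largest frequencies as one class
    return pooled ** 2 + sum(x ** 2 for x in p[2:])


# Hypothetical frequency vectors (illustration only, not real data).
hard_like = [0.80, 0.05, 0.05, 0.05, 0.05]        # one dominant haplotype
soft_like = [0.40, 0.40, 0.05, 0.05, 0.05, 0.05]  # two co-dominant haplotypes

h12_hard = h12(hard_like)
h12_soft = h12(soft_like)
```

In this illustration both vectors give fairly high values (0.73 and 0.65), whereas the ordinary homozygosity $\sum_i p_i^2$ would fall from 0.65 to 0.33 for the second vector, which is the motivation for pooling the two largest frequencies.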
To determine whether the genomic regions with the highest values of $H_{12}$ were compatible with either a hard-sweep or soft-sweep pattern, Garud et al. (2015) examined a second statistic, $H_2/H_1$, a ratio of a haplotype homozygosity that excludes the most frequent haplotype and a haplotype homozygosity that includes this haplotype:
$$\frac{H_2}{H_1} = \frac{\sum_{i > 1} p_i^2}{\sum_{i} p_i^2}.$$
For high values of $H_{12}$, hard sweeps are expected to produce relatively low values of $H_2/H_1$ because they produce a single high-frequency haplotype (very high $p_1$, low $p_2$). Soft sweeps, on the other hand, produce multiple high-frequency haplotypes (high $p_1$, $p_2$, and perhaps others), and are expected to produce higher values of $H_2/H_1$.
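The contrast between the two sweep types can be sketched numerically. The function name `h2_over_h1` and the frequency vectors below are hypothetical, chosen only to mimic a single dominant haplotype versus two co-dominant haplotypes.

```python
def h2_over_h1(freqs):
    """Ratio of homozygosity excluding the most frequent haplotype (H2)
    to homozygosity including it (H1)."""
    p = sorted(freqs, reverse=True)   # p[0] is the most frequent haplotype
    h1 = sum(x ** 2 for x in p)       # includes the most frequent haplotype
    h2 = sum(x ** 2 for x in p[1:])   # excludes the most frequent haplotype
    return h2 / h1


# Hypothetical frequency vectors (illustration only, not real data).
hard_like = [0.80, 0.05, 0.05, 0.05, 0.05]
soft_like = [0.40, 0.40, 0.05, 0.05, 0.05, 0.05]

ratio_hard = h2_over_h1(hard_like)  # 0.01 / 0.65, about 0.015
ratio_soft = h2_over_h1(soft_like)  # 0.17 / 0.33, about 0.515
```

The hard-sweep-like vector yields a ratio near zero because almost all of the homozygosity comes from the single dominant haplotype, while the soft-sweep-like vector yields a ratio above one half.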
Garud et al. (2015) found that this two-step process (identification of regions with high $H_{12}$ followed by examination of $H_2/H_1$) could both detect selective sweeps in general and distinguish hard and soft sweeps. As we will show, however, a complication in the approach is that the permissible range of $H_2/H_1$ varies with the value of $H_{12}$. Thus, the magnitude of $H_2/H_1$ that might be regarded as indicative of a soft or hard sweep can depend on the associated values of $H_{12}$. This potential difference in interpretations for values of $H_2/H_1$ as a function of $H_{12}$ can present a particular challenge when comparing $H_2/H_1$ at multiple loci with a wide range of $H_{12}$ values.
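A small worked example, which is not taken from the original analysis, may help to show why the permissible range depends on the pooled homozygosity. It uses the definitions $H_{12} = (p_1 + p_2)^2 + \sum_{i>2} p_i^2$ and $H_2/H_1 = \sum_{i>1} p_i^2 / \sum_i p_i^2$ with $p_1 \geq p_2 \geq \ldots$; the symbol $K$ for the number of haplotypes is introduced here only for illustration.

```latex
% Case 1: H_{12} = 1, which requires p_1 + p_2 = 1 (at most two haplotypes present).
\[
\frac{H_2}{H_1} = \frac{p_2^2}{p_1^2 + p_2^2} \le \frac{1}{2},
\quad \text{with equality when } p_1 = p_2 = \tfrac{1}{2}.
\]
% Case 2: K >= 2 equally frequent haplotypes, p_i = 1/K for every i.
\[
H_{12} = \left(\frac{2}{K}\right)^2 + \frac{K-2}{K^2} = \frac{K+2}{K^2},
\qquad
\frac{H_2}{H_1} = \frac{(K-1)/K^2}{K/K^2} = \frac{K-1}{K}.
\]
% For K = 10: H_{12} = 0.12 and H_2/H_1 = 0.9.
```

In this illustration a ratio of 0.5 is the largest value attainable when $H_{12} = 1$, yet it is well below values such as 0.9 that are attainable when $H_{12} = 0.12$.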
In a line of work separate from the use by Garud et al. (2015) of homozygosity-based soft sweep statistics, Rosenberg and Jakobsson (2008) and Reddy and Rosenberg (2012) analyzed the properties of homozygosity statistics in relation to the frequency of the most frequent allele, identifying upper and lower bounds on homozygosity given the frequency of the most frequent allele. This work, along with related work on other statistics (Long and Kittles, 2003, Hedrick, 2005, Jost, 2008, VanLiere and Rosenberg, 2008, Maruki et al., 2012, Jakobsson et al., 2013), seeks to understand mathematical bounds on population-genetic statistics, so that their application and interpretation can be suitably informed by the mathematical constraints on their numerical values.
Here, to facilitate the interpretation of the statistics of Garud et al. (2015) and to enhance comparisons among values of these statistics at loci with different haplotype homozygosities, we use a result from Rosenberg and Jakobsson (2008) to determine the upper and lower bounds on $H_2/H_1$ as a function of $H_{12}$. The upper bound provides a basis for normalization of $H_2/H_1$ to produce a statistic with the same range, from 0 to 1, irrespective of the value of $H_{12}$. Using the upper bound and the new normalized statistic, we reexamine Drosophila data analyzed by Garud et al. (2015), demonstrating that the upper bound and the normalized statistic enable improved insights regarding soft selective sweeps on the basis of genetic polymorphism data.
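The normalization idea can be sketched as follows, assuming that the normalized statistic is obtained by dividing the observed ratio by its upper bound, which is one natural reading of the description above. The upper bound as a function of the pooled homozygosity is derived in the Theory section and is not reproduced in this sketch; it is passed in as a number. The function name `normalize_by_upper_bound` is hypothetical.

```python
def normalize_by_upper_bound(value, upper_bound):
    """Rescale a statistic by its upper bound so that the result lies in [0, 1].

    Assumption: `upper_bound` is the largest value the statistic can take
    under the conditions of interest; it must be supplied by the caller.
    """
    if upper_bound <= 0:
        raise ValueError("upper bound must be positive")
    if value < 0 or value > upper_bound + 1e-12:
        raise ValueError("value must lie between 0 and the upper bound")
    return min(value / upper_bound, 1.0)  # guard against rounding above 1


# Elementary special case: when the pooled homozygosity equals 1, at most two
# haplotypes are present, so the ratio p2^2 / (p1^2 + p2^2) cannot exceed 1/2.
normalized = normalize_by_upper_bound(0.25, 0.5)
```

Under this assumption, an observed ratio of 0.25 at a locus whose bound is 0.5 is rescaled to 0.5, so values become comparable across loci whose bounds differ.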
Theory
Our goal is to determine the maximum of $H_2/H_1$ given the value of $H_{12}$, for . For convenience, we denote . We denote the desired upper bound by .
For generality in our description, we consider “alleles” at a locus. These distinct “alleles” can be viewed as representing distinct haplotypes at a specific location in the genome; the assumption is that a set of distinct genetic types is considered, representing perhaps distinct haplotypes or distinct alleles in the traditional sense,
Application to data
We illustrate the bounds on $H_2/H_1$ as functions of $H_{12}$ by reexamining two Drosophila melanogaster data sets studied by Garud et al. (2015), each containing fully sequenced genomes of inbred lines generated from samples taken in North Carolina. First, we consider the Drosophila Genetic Reference Panel (DGRP) data set consisting of sequences of 145 inbred lines (Mackay et al., 2012). Next, we examine the Drosophila Population Genomic Panel (DPGP) consisting of 40 strains. We consider these two
Discussion
Statistical methods for detecting selective sweeps from genomic data have enabled the identification of cases of adaptation in multiple organisms. Many statistics have been developed to identify hard selective sweeps, and recent attention has now also focused on detecting soft sweeps (Messer and Neher, 2012, Peter et al., 2012, Fu and Akey, 2013, Messer and Petrov, 2013, Vitti et al., 2013, Ferrer-Admetlla et al., 2014, Jensen, 2014, Wilson et al., 2014). Garud et al. (2015) recently proposed
Acknowledgments
We thank Doc Edge, Arbel Harpak, Rajiv McCoy, Pleuni Pennings, Dmitri Petrov, and Ben Wilson for helpful comments. Support was provided by NIH grants R01 GM089926, R01 GM097415, R01 GM100366, and R01 HG005855, and by graduate fellowships from the National Science Foundation and the Stanford Center for Computational, Evolutionary, and Human Genomics. Part of this work was completed in the Petrov Lab at Stanford University.
References
- et al. Upper bounds on $F_{ST}$ in terms of the frequency of the most frequent allele and total homozygosity: the case of a specified number of alleles. Theor. Popul. Biol. (2014)
- et al. Population genomics of rapid adaptation by soft selective sweeps. Trends Ecol. Evolut. (2013)
- et al. The genetics of human adaptation: hard sweeps, soft sweeps, and polygenic adaptation. Curr. Biol. (2010)
- et al. Informativeness of genetic markers for inference of ancestry. Am. J. Hum. Genet. (2003)
- et al. Mathematical properties of the $r^2$ measure of linkage disequilibrium. Theor. Popul. Biol. (2008)
- et al. Pesticide resistance via transposition-mediated adaptive gene truncation in Drosophila. Science (2005)
- et al. Levels of naturally occurring DNA polymorphism correlate with recombination rates in D. melanogaster. Nature (1992)
- et al. The hitchhiking effect on the site frequency spectrum of DNA polymorphisms. Genetics (1995)
- et al. World-wide survey of an Accord insertion and its association with DDT resistance in Drosophila melanogaster. Mol. Ecol. (2004)
- et al. The many landscapes of recombination in Drosophila melanogaster. PLoS Genet. (2012)
- Genomic signatures of selection at linked sites: unifying the disparity among species. Nature Rev. Genet.
- DDT resistance in Drosophila correlates with Cyp6g1 over-expression and confers cross-resistance to the neonicotinoid imidacloprid. Mol. Genet. Genomics
- Neutrality tests based on the distribution of haplotypes under an infinite-site model. Mol. Biol. Evol.
- Hitchhiking under positive Darwinian selection. Genetics
- On detecting incomplete soft or hard selective sweeps using haplotype structure. Mol. Biol. Evol.
- Selection and adaptation in the human genome. Annu. Rev. Genomics Hum. Genet.
- Recent selective sweeps in North American Drosophila melanogaster show signatures of soft sweeps. PLoS Genet.
- A standardized genetic differentiation measure. Evolution
- Soft sweeps: molecular population genetics of adaptation from standing genetic variation. Genetics
- Evidence for positive selection in the superoxide dismutase (Sod) region of Drosophila melanogaster. Genetics
- Pattern of polymorphism after strong artificial selection in a domestication event. Proc. Natl. Acad. Sci. USA
- The relationship between $F_{ST}$ and the frequency of the most frequent allele. Genetics
- On the unfounded enthusiasm for soft selective sweeps. Nature Commun.
- $G_{ST}$ and its relatives do not measure differentiation. Mol. Ecol.