Enhancing the mathematical properties of new haplotype homozygosity statistics for the detection of selective sweeps

https://doi.org/10.1016/j.tpb.2015.04.001

Abstract

Soft selective sweeps represent an important form of adaptation in which multiple haplotypes bearing adaptive alleles rise to high frequency. Most statistical methods for detecting selective sweeps from genetic polymorphism data, however, have focused on identifying hard selective sweeps in which a favored allele appears on a single haplotypic background; these methods might be underpowered to detect soft sweeps. Among the exceptions is the set of haplotype homozygosity statistics introduced for the detection of soft sweeps by Garud et al. (2015). These statistics, which examine the frequencies of multiple haplotypes in relation to each other, include H12, a statistic designed to identify both hard and soft selective sweeps, and H2/H1, a statistic that, conditional on high H12 values, seeks to distinguish between hard and soft sweeps. A challenge in the use of H2/H1 is that its range depends on the associated value of H12, so that equal H2/H1 values might provide different levels of support for a soft sweep model at different values of H12. Here, we enhance the H12 and H2/H1 haplotype homozygosity statistics for selective sweep detection by deriving the upper bound on H2/H1 as a function of H12, thereby generating a statistic that normalizes H2/H1 to lie between 0 and 1. Through a reanalysis of resequencing data from inbred lines of Drosophila, we show that the enhanced statistic both strengthens interpretations obtained with the unnormalized statistic and leads to empirical insights that are less readily apparent without the normalization.

Introduction

A selective sweep, the process whereby beneficial mutations at a locus that contribute to the fitness of an organism rise in frequency to become prevalent in a population, can occur through two main mechanisms that leave distinct genomic signatures (Pritchard et al., 2010, Cutter and Payseur, 2013, Messer and Petrov, 2013). A relatively new adaptive allele can proliferate so that the single haplotype on which it has occurred reaches a high frequency, resulting in a signature of a “hard” selective sweep (Maynard Smith and Haigh, 1974, Kaplan et al., 1989, Kim and Stephan, 2002). Alternatively, a mutation that arises de novo multiple times or exists as standing genetic variation on several haplotype backgrounds before the onset of positive selection can increase in frequency; in these cases, multiple favored haplotypes have relatively high frequencies, generating a signature of a “soft” selective sweep (Hermisson and Pennings, 2005, Przeworski et al., 2005, Pennings and Hermisson, 2006a). Soft sweeps can provide an effective mechanism for natural selection and might explain a sizeable fraction of selective events in many systems (Orr and Betancourt, 2001, Innan and Kim, 2004, Pritchard et al., 2010, Messer and Petrov, 2013).

Most statistical methods that have been designed to detect selective sweeps from patterns of genetic polymorphism search for patterns expected under a hard-sweep model, such as the presence of a single common haplotype (Hudson et al., 1994), high haplotype homozygosity (Depaulis and Veuille, 1998, Sabeti et al., 2002, Voight et al., 2006), high-frequency derived variants and related features of site-frequency spectra (Tajima, 1989, Braverman et al., 1995, Fay and Wu, 2000, Nielsen et al., 2005), or local loss of variation near a putative selected site (Maynard Smith and Haigh, 1974, Begun and Aquadro, 1992, Kim and Stephan, 2002). Many methods that search for patterns expected with hard sweeps, however, can be less well suited to the problem of identifying soft sweeps (Pennings and Hermisson, 2006b, Teshima et al., 2006, Cutter and Payseur, 2013). Therefore, current genomic scans for selective sweeps might be limited in their ability to uncover an important class of adaptive events.

Recently, it has been shown that statistics based on haplotype homozygosity can identify both hard and soft sweeps from population-genomic data (Ferrer-Admetlla et al., 2014, Garud et al., 2015). Garud et al. (2015) developed a haplotype homozygosity statistic, H12, relying on the principle that in a soft sweep, the most frequent haplotype might not predominate in frequency, and instead, multiple frequent haplotypes might be present. In terms of frequencies $p_i \geq 0$ for $i = 1, 2, 3, \ldots$, with $\sum_{i \geq 1} p_i = 1$ and $p_1 \geq p_2 \geq p_3 \geq \cdots$, Garud et al. (2015) defined H12 as $H_{12} = (p_1 + p_2)^2 + \sum_{i \geq 3} p_i^2$. This statistic calculates homozygosity by combining the two largest haplotype frequencies into a single value and then computing a haplotype homozygosity. Garud et al. (2015) determined that H12 has reasonable power to detect both hard and soft sweeps, applying the statistic to Drosophila population-genomic data and identifying abundant signatures of natural selection.

To determine whether the genomic regions with the highest values of H12 were compatible with either a hard-sweep or soft-sweep pattern, Garud et al. (2015) examined a second statistic, H2/H1, a ratio of a haplotype homozygosity H2 that excludes the most frequent haplotype and a haplotype homozygosity H1 that includes this haplotype: $H_1 = p_1^2 + p_2^2 + \sum_{i \geq 3} p_i^2$ and $H_2 = p_2^2 + \sum_{i \geq 3} p_i^2$. For high values of H12, hard sweeps are expected to produce relatively low values of H2/H1 because they produce a single high-frequency haplotype (very high $p_1$, low $p_2$). Soft sweeps, on the other hand, produce multiple high-frequency haplotypes (high $p_1$, $p_2$, and perhaps others), and are expected to produce higher values of H2/H1.
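The definitions of H1, H2, H12, and H2/H1 can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: haplotypes are given as a list of labels, frequencies are plain sample proportions, and the function name `haplotype_homozygosity` is hypothetical rather than taken from Garud et al. (2015).

```python
from collections import Counter


def haplotype_homozygosity(haplotypes):
    """Return (H1, H2, H12, H2/H1) for a list of haplotype labels.

    Hypothetical helper: frequencies are simple sample proportions.
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("need at least one haplotype")
    # Frequencies sorted so that p[0] >= p[1] >= p[2] >= ...
    p = sorted((c / n for c in Counter(haplotypes).values()), reverse=True)
    p1 = p[0]
    p2 = p[1] if len(p) > 1 else 0.0
    h1 = sum(q * q for q in p)   # homozygosity over all haplotypes
    h2 = h1 - p1 * p1            # same sum without the most frequent haplotype
    h12 = h1 + 2.0 * p1 * p2     # (p1 + p2)^2 replaces p1^2 + p2^2
    return h1, h2, h12, h2 / h1


# One dominant haplotype (hard-sweep-like): H12 ~ 0.905, H2/H1 ~ 0.006
hard = haplotype_homozygosity(["A"] * 18 + ["B"] + ["C"])
# Two common haplotypes (soft-sweep-like): H12 ~ 0.815, H2/H1 ~ 0.506
soft = haplotype_homozygosity(["A"] * 9 + ["B"] * 9 + ["C"] + ["D"])
```

Both toy samples have high H12, but H2/H1 is far larger when two haplotypes share the high frequency, which is the contrast the two-statistic approach relies on.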

Garud et al. (2015) found that this two-step process (identification of regions with high H12, followed by examination of H2/H1) could both detect selective sweeps in general and distinguish hard and soft sweeps. As we will show, however, a complication in the approach is that the permissible range of H2/H1 varies with the value of H12. Thus, the magnitude of H2/H1 that might be regarded as indicative of a soft or hard sweep can depend on the associated value of H12. This potential difference in interpretations for values of H2/H1 as a function of H12 can present a particular challenge when comparing H2/H1 at multiple loci with a wide range of H12 values.
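Two special cases, worked directly from the definitions of H12, H1, and H2, illustrate how the attainable values of H2/H1 shift with H12. They are offered only as an illustration and are not the general bound derived in the Theory section; the symbol $K$ (a number of equally frequent haplotypes) is introduced here for the example.

```latex
% Case 1: H12 = 1. Writing s = p_1 + p_2, we have
%   H_{12} \le s^2 + (1 - s)^2 \le 1, with equality only if s = 1,
% so H12 = 1 forces p_1 + p_2 = 1 and all other frequencies to be 0.
\[
  H_{12} = 1 \;\Longrightarrow\;
  \frac{H_2}{H_1} = \frac{p_2^2}{p_1^2 + p_2^2} \le \frac{1}{2},
  \qquad \text{with equality at } p_1 = p_2 = \tfrac{1}{2}.
\]

% Case 2: K equally frequent haplotypes, p_i = 1/K for i = 1, ..., K.
\[
  H_{12} = \frac{4}{K^2} + \frac{K - 2}{K^2} = \frac{K + 2}{K^2},
  \qquad
  \frac{H_2}{H_1} = \frac{(K - 1)/K^2}{1/K} = \frac{K - 1}{K}.
\]

% For K = 10: H12 = 0.12 and H2/H1 = 0.9.
```

A value of H2/H1 = 0.5 is therefore the largest possible when H12 = 1, yet it lies well below values that are attainable when H12 = 0.12.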

In a line of work separate from the use by Garud et al. (2015) of homozygosity-based soft sweep statistics, Rosenberg and Jakobsson (2008) and Reddy and Rosenberg (2012) analyzed the properties of homozygosity statistics in relation to the frequency of the most frequent allele, identifying upper and lower bounds on homozygosity given the frequency of the most frequent allele. This work, along with related work on other statistics (Long and Kittles, 2003, Hedrick, 2005, Jost, 2008, VanLiere and Rosenberg, 2008, Maruki et al., 2012, Jakobsson et al., 2013), seeks to understand mathematical bounds on population-genetic statistics, so that their application and interpretation can be suitably informed by the mathematical constraints on their numerical values.

Here, to facilitate the interpretation of the statistics of Garud et al. (2015) and to enhance comparisons among values of these statistics at loci with different haplotype homozygosities, we use a result from Rosenberg and Jakobsson (2008) to determine the upper and lower bounds on H2/H1 as a function of H12. The upper bound provides a basis for normalization of H2/H1 to produce a statistic with the same range, from 0 to 1, irrespective of the value of H12. Using the upper bound and the new normalized statistic, we reexamine Drosophila data analyzed by Garud et al. (2015), demonstrating that the upper bound, $(H_2/H_1)_{\max}$, and the normalized statistic, $(H_2/H_1)'$, enable improved insights regarding soft selective sweeps on the basis of genetic polymorphism data.

Section snippets

Theory

Our goal is to determine the maximum of H2/H1 given the value of H12, for $0 < H_{12} \leq 1$. For convenience, we denote $Z = H_2/H_1$. We denote the desired upper bound by $Z_{\max}$.

For generality in our description, we consider “alleles” at a locus. These distinct “alleles” can be viewed as representing distinct haplotypes at a specific location in the genome; the assumption is that a set of distinct genetic types is considered, representing perhaps distinct haplotypes or distinct alleles in the traditional sense,

Application to data

We illustrate the bounds on H2/H1 as functions of H12 by reexamining two Drosophila melanogaster data sets studied by Garud et al. (2015), each containing fully sequenced genomes of inbred lines generated from samples taken in North Carolina. First, we consider the Drosophila Genetic Reference Panel (DGRP) data set consisting of sequences of 145 inbred lines (Mackay et al., 2012). Next, we examine the Drosophila Population Genomic Panel (DPGP) consisting of 40 strains. We consider these two

Discussion

Statistical methods for detecting selective sweeps from genomic data have enabled the identification of cases of adaptation in multiple organisms. Many statistics have been developed to identify hard selective sweeps, and recent attention has now also focused on detecting soft sweeps (Messer and Neher, 2012, Peter et al., 2012, Fu and Akey, 2013, Messer and Petrov, 2013, Vitti et al., 2013, Ferrer-Admetlla et al., 2014, Jensen, 2014, Wilson et al., 2014). Garud et al. (2015) recently proposed

Acknowledgments

We thank Doc Edge, Arbel Harpak, Rajiv McCoy, Pleuni Pennings, Dmitri Petrov, and Ben Wilson for helpful comments. Support was provided by NIH grants R01 GM089926, R01 GM097415, R01 GM100366, and R01 HG005855, and by graduate fellowships from the National Science Foundation and the Stanford Center for Computational, Evolutionary, and Human Genomics. Part of this work was completed in the Petrov Lab at Stanford University.

References (49)

  • A.D. Cutter et al. Genomic signatures of selection at linked sites: unifying the disparity among species. Nature Rev. Genet. (2013)
  • P. Daborn et al. DDT resistance in Drosophila correlates with Cyp6g1 over-expression and confers cross-resistance to the neonicotinoid imidacloprid. Mol. Genet. Genomics (2001)
  • F. Depaulis et al. Neutrality tests based on the distribution of haplotypes under an infinite-site model. Mol. Biol. Evol. (1998)
  • J.C. Fay et al. Hitchhiking under positive Darwinian selection. Genetics (2000)
  • A. Ferrer-Admetlla et al. On detecting incomplete soft or hard selective sweeps using haplotype structure. Mol. Biol. Evol. (2014)
  • W. Fu et al. Selection and adaptation in the human genome. Annu. Rev. Genomics Hum. Genet. (2013)
  • N.R. Garud et al. Recent selective sweeps in North American Drosophila melanogaster show signatures of soft sweeps. PLoS Genet. (2015)
  • P.W. Hedrick. A standardized genetic differentiation measure. Evolution (2005)
  • J. Hermisson et al. Soft sweeps: molecular population genetics of adaptation from standing genetic variation. Genetics (2005)
  • R.R. Hudson et al. Evidence for positive selection in the superoxide dismutase (Sod) region of Drosophila melanogaster. Genetics (1994)
  • H. Innan et al. Pattern of polymorphism after strong artificial selection in a domestication event. Proc. Natl. Acad. Sci. USA (2004)
  • M. Jakobsson et al. The relationship between FST and the frequency of the most frequent allele. Genetics (2013)
  • J.D. Jensen. On the unfounded enthusiasm for soft selective sweeps. Nature Commun. (2014)
  • L. Jost. GST and its relatives do not measure differentiation. Mol. Ecol. (2008)