Introduction

A characterization of the adaptive history of human populations requires knowledge of the genes that have been affected by positive natural selection, which is also important for an analysis of the genetic causes behind human disease.1, 2 During the past decade, various statistical approaches have been developed and used for scanning the entire genome for traces of selective sweeps, and hundreds of loci with strong evidence of selection have been discovered with these methods.3, 4, 5, 6, 7, 8, 9, 10, 11

Most of these studies have used the HapMap12 or Perlegen13 data sets consisting of millions of single nucleotide polymorphisms (SNPs). Their good coverage of SNPs allows a relatively precise identification of genes and even the actual variants under selection,7 but their weakness is the limited sample set, which currently allows the analysis of only a few populations from each continent. Additional large data sets have recently become available through genome-wide SNP scans, providing data on large numbers of individuals from a variety of populations. These data have also provided a powerful tool for analyzing population differentiation and structure,14, 15, 16, 17, 18 but to date, only a few studies have used similar data sets for analyzing traces of positive natural selection.5, 10, 19, 20

In this study, we used a genome-wide data set from Eastern and Western Finland, Sweden, Northern Germany, and Great Britain, and a combination of three statistical methods to search for loci that have been affected by recent natural selection.

Materials and methods

Data sets

We analyzed Affymetrix (Santa Clara, CA, USA) 250K Sty array data from Eastern and Western Finland, Sweden,18 Northern Germany,21 and Great Britain22 (hereafter, 250K data), combined with 250K Nsp array data from the Germans and the British (hereafter, 500K data). We also used HapMap (refs 12, 23) 250K and 500K data provided by Affymetrix for estimating genetic differentiation between continents – selection scans for the HapMap populations have been performed previously. The quality control of the data followed common standards and is briefly described in Supplementary Methods. The data sets and samples used in this study are outlined in Table 1.

Table 1 The data sets used in different analyses of the study, and the numbers of samples and markers in the analyses after quality control and filtering

Analysis of natural selection

We employed two statistics for scanning the genome-wide data for signs of positive natural selection: the integrated haplotype score test (iHS),9 and the single-SNP long-range haplotype test (LRH),7 both based on comparing the extended haplotype homozygosity (EHH) score24 of the ancestral and derived allele of each marker. The tests are designed to detect sites in which one allele is surrounded by a much longer haplotype than expected for alleles of corresponding frequency evolving neutrally. Such a situation may arise when natural selection is driving one haplotype to high frequency, leaving recombination little time to break the haplotype. The test statistics were calculated for each marker with Sweep 1.1 (ref. 7), separately in each population.
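As an illustration of the quantity underlying both tests, the following minimal sketch computes EHH for the chromosomes carrying a given core allele, defined as the probability that two randomly chosen carriers are identical across all markers between the core and a chosen endpoint. It is not the Sweep implementation used in this study; the function name `ehh`, the string encoding of haplotypes ('0' ancestral, '1' derived) and the toy data are assumptions made for the example.

```python
from collections import Counter
from math import comb


def ehh(haplotypes, core, allele, end):
    """EHH of the chromosomes carrying `allele` at marker index `core`,
    evaluated over the markers from `core` to `end` (inclusive).

    haplotypes: sequences of allele codes, one per phased chromosome
    (hypothetical encoding: '0' = ancestral, '1' = derived).
    """
    carriers = [h for h in haplotypes if h[core] == allele]
    n = len(carriers)
    if n < 2:
        return 0.0  # homozygosity is undefined for fewer than two carriers
    lo, hi = min(core, end), max(core, end)
    # Group carriers by their extended haplotype over the interval
    counts = Counter(tuple(h[lo:hi + 1]) for h in carriers)
    # Fraction of carrier pairs that share an identical extended haplotype
    return sum(comb(c, 2) for c in counts.values()) / comb(n, 2)


# Toy data: six phased chromosomes, core SNP at index 0
haps = ["1111", "1111", "1110", "0101", "0010", "0111"]
derived_ehh = ehh(haps, core=0, allele="1", end=2)
ancestral_ehh = ehh(haps, core=0, allele="0", end=2)
```

In this toy example the derived allele retains a longer shared haplotype than the ancestral allele, which is the kind of asymmetry that iHS and LRH quantify after integration over distance and standardization by allele frequency.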

Population differentiation across the genome was estimated by calculating the FST statistic4, 25 for each marker in several population combinations: between each European population and the other Europeans pooled together, and between different continents using the HapMap YRI, CHB+JPT, and all the European samples.
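A per-marker FST between two groups can be sketched as follows. The estimator shown is the simple heterozygosity-based form, FST = (HT − HS)/HT with equal weighting of the two groups; this is an assumption made for illustration, and the estimator actually used (refs 4, 25) may include sample-size corrections that are omitted here.

```python
def fst_two_groups(p1, p2):
    """Heterozygosity-based FST for one biallelic marker.

    p1, p2: frequencies of the same allele in the two groups.
    Equal group weights are assumed (no sample-size correction).
    """
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)          # expected heterozygosity, pooled
    if h_total == 0:
        return 0.0                             # marker is monomorphic overall
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    return (h_total - h_within) / h_total


# One population against the other populations pooled (hypothetical frequencies)
example = fst_two_groups(0.2, 0.8)
```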

To find the genomic regions with multiple SNPs with high iHS, LRH, or FST scores, the single-SNP absolute values from each population were analyzed in 200-kb windows with a 100-kb overlap. Each window was classified as either an extreme or a suggestive outlier based on each of the three statistics: the most extreme outliers included the windows with iHS or LRH ≥3.2 for at least two or four SNPs in the 250K and 500K analysis, respectively, in any of the populations. The extreme outlier windows for FST included those with at least two or four SNPs among the highest 1/2000 of each population comparison. Similarly, the suggestive category contained windows with at least two or four SNPs with iHS or LRH ≥2.6, or FST among the 1/500 highest values.
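The window classification for iHS or LRH can be sketched as below. The window size, overlap, thresholds of 3.2 and 2.6, and minimum SNP counts are taken from the text; treating the thresholds as inclusive, the half-open window boundaries, and all function names are assumptions of this example. For FST, the same counting would apply after replacing the fixed thresholds with the top 1/2000 and 1/500 quantiles of each comparison.

```python
from collections import defaultdict

WINDOW = 200_000  # window length in bp
STEP = 100_000    # distance between window starts (100-kb overlap)


def window_starts(pos):
    """Starts of the overlapping windows [start, start + WINDOW) containing pos."""
    last = (pos // STEP) * STEP
    return [s for s in (last - STEP, last) if s >= 0]


def classify_windows(snps, min_snps, extreme=3.2, suggestive=2.6):
    """snps: iterable of (position, score) for one chromosome and population.

    min_snps: 2 for the 250K analysis, 4 for the 500K analysis.
    Returns {window_start: 'extreme' or 'suggestive'} for outlier windows.
    """
    n_extreme = defaultdict(int)
    n_suggestive = defaultdict(int)
    for pos, score in snps:
        for start in window_starts(pos):
            if abs(score) >= extreme:
                n_extreme[start] += 1
            if abs(score) >= suggestive:
                n_suggestive[start] += 1
    labels = {}
    for start, n in n_suggestive.items():
        if n_extreme[start] >= min_snps:
            labels[start] = "extreme"
        elif n >= min_snps:
            labels[start] = "suggestive"
    return labels


# Hypothetical scores for five SNPs on one chromosome
toy = [(120_000, 3.5), (180_000, -3.3), (260_000, 2.7), (290_000, 2.9), (900_000, 1.0)]
labels = classify_windows(toy, min_snps=2)
```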

To extract the windows most likely to be affected by natural selection, an overlap of at least two statistics was required, because it has been observed that overlapping false positives in the iHS and LRH tests are rare.7 Thus, a window had to fall in the category of extreme outliers based on iHS or LRH in at least one population, and have at least a suggestive signal in any of the populations in the other test or FST. The overlapping or adjacent windows fulfilling these criteria were combined to form regions. Regions with low SNP densities or known inversions were excluded (see Supplementary Methods).
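The overlap criterion and the merging of windows into regions can be expressed compactly. In this sketch each window is assumed to carry, for every statistic, the strongest label observed in any population (None, 'suggestive' or 'extreme'); the function names and this data layout are illustrative assumptions, and the exclusion of low-density regions and inversions is not shown.

```python
RANK = {None: 0, "suggestive": 1, "extreme": 2}


def passes_overlap(ihs, lrh, fst):
    """True if iHS or LRH is extreme and another statistic is at least suggestive."""
    r_ihs, r_lrh, r_fst = RANK[ihs], RANK[lrh], RANK[fst]
    ihs_led = r_ihs == 2 and (r_lrh >= 1 or r_fst >= 1)
    lrh_led = r_lrh == 2 and (r_ihs >= 1 or r_fst >= 1)
    return ihs_led or lrh_led


def merge_windows(starts, window=200_000):
    """Combine overlapping or adjacent windows into (start, end) regions."""
    regions = []
    for s in sorted(starts):
        if regions and s <= regions[-1][1]:
            # Window overlaps or touches the current region: extend it
            regions[-1][1] = max(regions[-1][1], s + window)
        else:
            regions.append([s, s + window])
    return [tuple(r) for r in regions]


regions = merge_windows([0, 100_000, 200_000, 600_000])
```

Note that under this rule an extreme FST signal alone does not qualify a window; FST only supports a window that is already an extreme outlier for iHS or LRH.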

To visualize haplotype variation in the selected loci, median-joining networks were constructed with Network 4.5.0.2 (fluxus-engineering.com),26, 27 using European as well as HapMap 250K data. To analyze the extent of overlap between selection signals and association with disease, we counted the bins with and without signs of selection that contained at least one positively associating gene, as listed in the NHGRI catalog of genome-wide association studies (http://www.genome.gov/gwastudies) and Genetic Association Database (http://geneticassociationdb.nih.gov/).
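The enrichment comparison amounts to a test on a 2×2 table of bins (with or without signs of selection, with or without a disease-associated gene). A minimal sketch using Pearson's χ² statistic with one degree of freedom is given below; whether a continuity correction was applied is not stated, so none is used here, and the counts in the example are hypothetical rather than the study's data.

```python
from math import erfc, sqrt


def chi2_2x2(a, b, c, d):
    """Pearson chi-square test for a 2x2 table without continuity correction.

    a: selected bins with a disease gene     b: selected bins without
    c: non-selected bins with a disease gene d: non-selected bins without
    All four margins are assumed to be non-zero.
    Returns (chi2, p_value) for one degree of freedom.
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of the chi-square distribution with 1 d.f.
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts, for illustration only
chi2, p = chi2_2x2(30, 70, 100, 800)
```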

Empirical analysis of different sample sizes

We estimated the effect of sample size empirically by calculating iHS and LRH for each marker in chromosomes 1–3 from the British data sampled to seven different sizes (Table 1). We calculated the correlations of the standardized iHS and LRH values between the largest sample of 700 individuals and the smaller samples. As the analysis showed that sample size affected the reliability of the statistics (Figure 3), we wanted to assign more weight to populations with larger sample sizes. This was achieved by multiplying the iHS and LRH values of each population by the correlation coefficient of the British test sample of corresponding size before extracting the outlying genomic regions as described above.
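The weighting step can be sketched as follows, assuming that the correlation in question is the Pearson coefficient between per-SNP scores of the full British sample and those of a subsample of matching size. The function names and the numerical values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long score vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def weight_scores(scores, r):
    """Scale the scores of one population by the reliability coefficient r."""
    return [s * r for s in scores]


# Hypothetical scores for the same SNPs in the full sample and a subsample
full = [0.2, -1.5, 3.1, 0.8, -0.4]
sub = [0.5, -1.1, 2.4, 1.2, -0.9]
r = pearson(full, sub)
weighted = weight_scores([3.4, -2.8], r)
```

Because r is at most 1, the scaling can only shrink scores, so a population with a small sample needs a stronger raw signal to pass the fixed thresholds of 3.2 and 2.6.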

Coalescent simulations

Previous studies have analyzed the performance of iHS and LRH tests for different allele frequencies, scenarios of natural selection, demographies,7 and sample size.20 We sought to characterize the applicability of the tests for genome-wide data sets by analyzing the joint effects of SNP density and sample size. We performed coalescent simulations using the SelSim software,28 simulating genomic segments with a neutral model and with natural selection (see Supplementary Methods for details of the simulations and subsequent analyses). We constructed data sets of four different SNP densities corresponding to the median densities of HapMap II and Affymetrix arrays 6.0, 500K and 250K, and sample sizes of 50, 100, 150, and 200 individuals. The SNP ascertainment bias was accounted for by matching the SNP frequency spectrum of the simulations to that observed in real data.9 We compared the power obtained with different SNP densities and sample sizes by adjusting the false discovery rate (FDR) to approximately 1% in each data set.

Allele frequency simulations

As the North European populations analyzed in this study are much more closely related than the populations in most previous scans of natural selection, we wanted to analyze the expected extent of population differences caused by positive selection in Northern Europe. For this purpose, we simulated allele frequencies with demographic models corresponding to two population pairs, one modelling an Asian and a European population, and the other modelling two North European populations. We adjusted the demographic parameters by matching the simulated allele frequency differences with empirical distributions (see Supplementary Methods). Several scenarios of selection were analyzed, including different time spans, allele frequencies at the start of selection, and selection coefficients. Selection was applied to only one of the populations of each pair. The distribution of FST was calculated from the resulting allele frequencies. The simulations were performed in R (http://www.r-project.org/), and the demographic parameters are shown in Supplementary Table 1.
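The logic of these simulations can be illustrated with a two-population model in which one population is under selection and the two exchange migrants. The simulations in this study were run in R with the demographic parameters given in Supplementary Table 1; the sketch below is written in Python for illustration only and makes several simplifying assumptions: genic selection with fitness 1 + s, symmetric migration at rate m, constant population size, optional binomial drift, and the simple heterozygosity-based FST. All parameter values other than the starting frequency of 0.1 and the 480 generations are hypothetical.

```python
import random


def fst(p1, p2):
    """Heterozygosity-based FST for two equally weighted populations."""
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)
    if h_total == 0:
        return 0.0
    h_within = p1 * (1 - p1) + p2 * (1 - p2)
    return (h_total - h_within) / h_total


def simulate_pair(p0, s, m, generations, n_chrom=None, seed=1):
    """Allele frequencies (p1, p2) after selection acting in population 1 only.

    p0: starting frequency in both populations
    s: selection coefficient, m: symmetric migration rate per generation
    n_chrom: chromosomes sampled per generation for drift (None = deterministic)
    """
    rng = random.Random(seed)
    p1 = p2 = p0
    for _ in range(generations):
        p1 = p1 * (1 + s) / (1 + s * p1)                          # selection
        p1, p2 = (1 - m) * p1 + m * p2, (1 - m) * p2 + m * p1     # migration
        if n_chrom is not None:                                   # drift
            p1 = sum(rng.random() < p1 for _ in range(n_chrom)) / n_chrom
            p2 = sum(rng.random() < p2 for _ in range(n_chrom)) / n_chrom
    return p1, p2


# Low migration (continent-like pair) versus high migration (closely related pair)
fst_low_m = fst(*simulate_pair(0.1, s=0.01, m=1e-5, generations=480))
fst_high_m = fst(*simulate_pair(0.1, s=0.01, m=0.01, generations=480))
```

With identical selection, the pair exchanging many migrants ends with a much lower FST than the nearly isolated pair, because migration carries the favored allele into the population that is not under selection.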

Results

Signs of selection across the genome

The results of the 250K analysis for all the populations are visualized in Figure 1, showing that the signal of selection is shared among several populations for many but not all the regions. Usually a high FST does not overlap with high iHS or LRH. A total of 60 regions had strong signs of selection, with at least one clearly outlying iHS or LRH score as well as at least one suggestive signal in another statistic (Table 2). However, the employed SNP density does not result in a full coverage of the euchromatic regions of the genome: in the 250K analysis, 64–74% (median 72%) of the euchromatin was covered by at least five markers per 200-kb window, and in the 500K data the corresponding figure was 84%. The coverage differs between populations because differing numbers of markers were excluded for having a minor allele frequency below 5%.

Figure 1

The iHS and LRH signals in the different populations in 200-kb windows across the genome, and FST signals over all population comparisons. The larger symbol denotes the most extreme outliers, whereas the smaller symbol denotes suggestive signals. The grey boxes and horizontal lines denote heterochromatin, centromere, and telomere regions, and the grey vertical lines correspond to windows with at least five SNPs per 200-kb window.

Table 2 The genomic regions showing the strongest signs of positive natural selection

In total, the 60 regions contain 121 genes. The windows with signs of selection showed a statistically significant enrichment of genes associated with disease (χ² test, P<10⁻⁴) – however, given the relatively large windows (200 kb), the disease-associated genes and variants are not necessarily the targets of selection. Figure 2 shows examples of median-joining networks for the genes RAB38 and PPP2R2B, both having a combination of characteristic signs of positive natural selection: population differentiation, enrichment of high-frequency derived alleles demonstrated by the long branches from the ancestral haplotype to the high-frequency clusters, and star-like haplotype patterns with one high-frequency haplotype surrounded by rare haplotypes.

Figure 2

Median-joining networks of haplotypes in the regions Chr11:87 480 000–87 590 000 containing 15 SNPs in the RAB38 gene (a), and Chr5:145 970 000–146 030 000 containing 13 SNPs in the PPP2R2B gene (b). Nodes denote the haplotypes, with their size corresponding to the overall frequency. The legend showing the color codes of the population frequencies also shows the relative sizes of the study samples in the entire data to assist the interpretation of haplotype frequency differences. The branches connecting the haplotypes denote the SNPs differing between haplotypes. The ancestral (chimpanzee) haplotype is marked in yellow.

Power and performance of iHS and LRH

The analysis of iHS and LRH values from the British, who were sampled to different sizes, showed that the reliability of both statistics improved markedly as the sample size increased, and the LRH statistic in particular appeared to lack robustness for smaller samples (Figure 3). The denser marker set of the 500K data improved the correlation, but not dramatically. The differences in robustness between the sample sizes were accounted for by scaling down the genomic values of iHS and LRH of the smaller samples, which explains the much stronger signs of selection among the British and German samples than among the Finns or Swedes (Table 2). In the coalescent simulations, natural selection increased both iHS and LRH values, and the values calculated from real data fell between the neutral and selected simulations, as expected (Supplementary Figure 1). The power to detect a selection signal increased with both sample size and SNP density for the iHS and LRH statistics and their combination, ranging from about 10% to over 80% (Figure 4). In the simulated selection scenario, the highest power was reached by using the iHS statistic alone – however, the simulations encompass only a single population, which can underestimate the power of the combined iHS+LRH analysis performed on a data set with several populations.

Figure 3

Correlation of the iHS and LRH values of SNPs between a sample of 700 British individuals and samples of various sizes in 250K and 500K data sets. The arrows indicate the sizes of the population samples in this study.

Figure 4

Power of the iHS test alone (a), LRH test alone (b), and iHS and LRH combined (c) in data sets of different SNP densities and sample sizes based on coalescent simulations.

Allele frequency simulations

Simulations of allele frequency differences between populations showed that for closely related populations – such as the North European population pair in the analysis – recent natural selection acting on one population has to be very strong to increase the allele frequency differences notably, because migration efficiently evens out allele frequency differences. Between continents – with a very low migration rate between the populations – even weak selection increases FST, and strong selection may lead to extreme differentiation (Figure 5, Supplementary Figure 2). Thus, as expected, the relatedness of populations has a major effect on the possibility of using population differentiation-based tests for detecting positive selection.

Figure 5

Distributions of FST in simulations of allele frequencies after 480 generations (about 12 000 years), with a starting allele frequency of 0.1 and three different selection coefficients (s) in two populations with demographies corresponding to a European–Asian population pair and two North European populations.

Discussion

In this study, we used genome-wide data from 250 000 and 500 000 SNPs to search for loci affected by recent positive natural selection in North European populations. We found convincing evidence of selection in 60 loci, 21 of which have not been discovered in previous scans for selection.

Many of the regions with strong signs of selection contain several genes with particularly interesting functions, although further studies are needed to fully determine which are the actual genes and variants behind the selective advantage. Of the two examples visualized as median-joining networks, the RAB38 gene is expressed in melanocytes, and its disruption in mice causes oculocutaneous albinism, lung disease, and platelet deficiency.29, 30, 31 This makes RAB38 an interesting novel candidate locus for human pigmentation. The PPP2R2B gene regulates neuronal apoptosis and may affect adenovirus replication.32, 33 Yet another intriguing gene is RGS9, which affects vision adaptation in different light conditions.34, 35 However, lack of a selection signal in individual loci in this study should not be interpreted as absence of selection, as there are several interesting regions lacking sufficient SNP coverage – including the loci of, for example, the LCT and OCA2 genes and the CYP3A region, all well-known candidates for recent natural selection.9, 36, 37 In addition, the iHS and LRH tests have good power to detect selected haplotypes only in a relatively narrow frequency range, which may be the reason why some other well-established candidate genes, such as SLC24A5 (ref. 38) and MYO5A (ref. 9), show a signal below the chosen thresholds of this study.

Many of the selected regions contain genes associated with human disease, such as interferon gamma (IFNG), nitric oxide synthase 1 (neuronal) adaptor protein (NOS1AP), cadherin 13 (CDH13), and the APOE cluster (Table 2).39, 40, 41, 42, 43, 44, 45 Altogether, we observed a statistically significant enrichment of genes associated with complex disease, which is consistent with earlier studies showing a connection between natural selection and certain types of human disease.1, 2, 20, 46 However, the increased population differentiation because of positive selection makes these loci particularly vulnerable to false-positive associations.47, 48, 49 As an example, in our data set, the haplotype in the PDE11A gene showing very high population differentiation has been reported to associate with depression in Mexican Americans50 – possibly due to confounding population structure. Thus, additional caution is necessary when disease association is observed in selected genes, which poses a further challenge for discovering functions of genes under natural selection by genome-wide analysis.46

Both empirical analyses and simulation studies showed that sample size and SNP density affect the performance of the iHS and LRH statistics. Although very dense data sets clearly yield the best power, the available genome-wide data sets with large sample sizes also seem adequate for successful selection scans, which is consistent with earlier results.20 In reality, however, the power of the tests is likely to be lower than our simulations indicate: non-African human populations have not been of constant size, patterns of genomic variation are much more complex than in the simulations, SNP density varies between regions, and the simulated selection scenario was one in which the statistics have been shown to have good power.7 On the other hand, the power in the simulations may be lower than in the real data because of the absence of multipopulation comparisons.

Selection signals can be compared between populations to detect adaptive differences between populations, both worldwide and locally between closely related populations.20 The populations of Northern Europe have been distinct and subject to partly different environmental conditions for some 10 000 years, and the selective events detected by iHS scans have been estimated to be younger than that, on average 6600 years old in non-African populations.9 Thus, some of the observed differences in the selection signals between populations may represent local adaptations. However, many are likely to arise from false negatives caused by the low power to detect selection, and differences in power between populations because of different allele frequencies, demography, and sample size.7, 9, 18, 51

This study is not intended to be an exhaustive analysis of positive selection in Northern Europe. Statistical methods to scan for signs of positive natural selection are plagued by several limitations, and no method is suitable for covering the full spectrum of natural selection.46 Specifically, the iHS and LRH statistics are efficient only when the selected haplotype has not yet reached fixation. This may be at least a partial explanation for the non-overlapping patterns of iHS or LRH with FST, which may detect more ancient selection than the other tests. However, FST alone has been suggested to be a poor indicator of selection.52 Complementary methods that look for fixation in one population and segregation in others,7, 10 such as XP-EHH, are unlikely to be effective for closely related populations in which the allele frequency differences, even in the presence of selection, are much smaller than between continents, as shown by our simulations. Furthermore, the SNP selection by array providers poses a limitation: although Affymetrix 500K arrays are not based on the tag-SNP approach that is problematic for LD-based selection testing due to low marker density in regions with high LD,20 the bias toward common alleles limits the choice of statistics, makes local adaptations unlikely to be represented in the arrays, and may lead to lack of coverage in interesting regions. At present, the field is lacking comprehensive studies comparing the performance of different methods in scenarios of different kinds of positive selection, data sets, and populations, making it difficult to choose the most effective combination of statistics.

Despite these limitations, we have used data from genome-wide arrays to detect 60 regions – many previously undiscovered – with strong evidence of recent natural selection in Northern Europe, and these loci will be interesting targets for follow-up studies. Our study demonstrates the usefulness of genome-wide data sets for analysis of natural selection; particularly, the possibility to analyze large samples from a wide spectrum of human populations is a significant advantage. Furthermore, these data sets may prove useful for studying differences between populations due to local adaptations. However, the precise identification of the genes and variants behind the selective advantage will require data from genomic sequencing, as well as functional studies.