Introduction

Telomeres consist of (TTAGGG)n nucleotide repeats and an associated protein complex located at chromosome ends. They are essential for maintaining chromosomal integrity. Telomere-associated proteins include the telomerase reverse transcriptase (TERT) and its RNA component (TERC), plus an ordered protein complex, or shelterin, consisting of six proteins: TERF1, TERF2, TINF2, TERF2IP, ACD and POT1 (Collins and Mitchell, 2002; de Lange, 2005). This telomere complex, and many other associated proteins, are responsible for preserving chromosome ends, and thus genomic stability, by protecting chromosomes from end-to-end fusion, atypical recombination and degradation (Moon and Jarstfer, 2007). Many of the components of the telomeric complex are highly conserved across species in comparative sequence and functional investigations (Nakamura and Cech, 1998; Li et al., 2000; Kanoh and Ishikawa, 2003; de Lange, 2004; Savage et al., 2005). It was also shown that seven of these genes (TERT, POT1, TNKS, TERF1, TINF2, TERF2 and TERF2IP) had lower nucleotide diversity compared with other gene families; they were also highly conserved and the most common allele was ancestral (Savage et al., 2005).

Telomere nucleotide repeats progressively shorten with each cell division due to incomplete replication of the 3′ end by DNA polymerases. When they become critically short, cellular senescence or cellular crisis is induced in normal cells, but in malignant cells this pathway is bypassed through the activation of telomerase or alternative pathways (Gilley et al., 2005; Rodier et al., 2005). Short telomeres induce genetic instability and thereby promote the initiation and development of cancer (Blasco et al., 1997; Rudolph et al., 1999, 2001; Wu et al., 2003; Plentz et al., 2003, 2004). Telomere attrition has also been associated with aging, many diseases (including diabetes mellitus and cardiovascular disease), inflammation, oxidative stress, an unhealthy lifestyle and smoking (von Zglinicki, 2002; Wong and Collins, 2003; Morlá et al., 2006; Aubert and Lansdorp, 2008; Mirabello et al., 2009). Several disorders are associated with mutations in telomere biology genes (Crabbe et al., 2004; Blasco, 2007; Vulliamy et al., 2008; Armanios, 2009; Savage and Alter, 2009). Patients with dyskeratosis congenita, a heterogeneous inherited bone marrow failure and cancer predisposition syndrome, have extremely short telomeres and germline mutations in genes important in the maintenance of telomeres (DKC1, TERC, TERT, NOLA3 (alias NOP10), TINF2 or NOLA2 (alias NHP2)) (Crabbe et al., 2004; Vulliamy et al., 2008; Armanios, 2009; Savage and Alter, 2009). In addition, recent genome-wide association studies found that genetic variation at 5p15.33 (TERT-CLPTM1L locus) was associated with risk of glioma (Shete et al., 2009), basal cell carcinoma (Stacey et al., 2008, 2009), testicular cancer (Turnbull et al., 2010), pancreatic cancer (Petersen et al., 2010) and lung cancer (McKay et al., 2008; Jin et al., 2009; Landi et al., 2009); an association study of multiple tumor types suggests that this region may contain important markers of overall cancer risk (Rafnar et al., 2009).

The extent to which disease-associated alleles differ in frequency between populations and the evolutionary forces responsible for the observed degree of population differentiation may provide keys to further understanding disease pathogenesis. Allele frequencies for many genetic variants differ by geographical regions (Guthery et al., 2007; Lan et al., 2007; Myles et al., 2008), possibly the result of several factors including natural selection and neutral genetic drift. There may be functional consequences of a particular variant that leads to a more favorable response and thus certain variants may be under selective pressure. Searching for a signature of selection has the potential to identify functional and disease related variants (Bamshad and Wooding, 2003; Hurst, 2009).

We examined patterns of differentiation, allele frequencies and the haplotype structure of 37 genes important in telomere biology among 53 worldwide populations from Africa, the Middle East, Europe, Central/South (C/S) Asia, East Asia, Oceania and the Americas. Data from 1168 unrelated individuals were obtained from the genome-wide scan of the Human Genome Diversity Panel (HGDP-CEPH) (Cann et al., 2002; Li et al., 2008) and from the International HapMap Project (The International HapMap Consortium, 2003) at 716 single-nucleotide polymorphism (SNP) loci. We additionally compared our telomere gene set with HGDP-CEPH data for 174 innate immunity genes at 1396 SNPs, which allowed us to determine whether two sets of genes grouped by function have similar patterns of genetic variation. We hypothesized that genetic variation in telomere biology genes may be constrained because of both the high degree of sequence similarity previously observed across species and the critical roles their protein products have in chromosomal stability.

Materials and methods

Data set

We obtained SNP data for each gene, including 20 kbp upstream and 10 kbp downstream, from the HGDP-CEPH (Cann et al., 2002) genome-wide scan of 650 000 common SNPs (Li et al., 2008) and the HapMap Phase 2 (The International HapMap Consortium, 2003) public database. Genotype data were retrieved for 37 gene regions: ACD, ATM, BLM, DDX1, DDX11, DKC1, MRE11A, NBN, NOLA1, NOLA2, NOLA3, PARP1, PARP2, PINX1, POT1, PRKDC, RAD50, RAD51AP1, RAD51C, RAD51L1, RAD51L3, RAD54L, RECQL, RECQL4, RECQL5, RTEL1, TEP1, TERC, TERF1, TERF2, TERF2IP, TERT, TINF2, TNKS, TNKS2, WRN and XRCC6, as well as the region between PARP2/TEP1. Genes were chosen based on their involvement in telomere biology or presumed interaction with telomeres as reported in the literature. Telomerase complex genes include TERC, TERT, DKC1, TEP1, NOLA1, NOLA2 and NOLA3; shelterin genes include TERF1, TERF2, TERF2IP, POT1, TINF2 and ACD; DNA repair genes include XRCC6, NBN, RAD50, ATM, RAD54L, RAD51L3, RAD51C, RAD51AP1, RAD51L1 and MRE11A; helicase genes include WRN, BLM, RECQL, RECQL4, RECQL5, DDX1 and DDX11; and other telomere-associated genes include PRKDC, PARP1, PARP2, PINX1, TNKS, TNKS2 and RTEL1. All SNPs, regardless of minor allele frequency, were included in the analysis because only a few SNPs were available for many of these genes. Data were retrieved for all individuals in the 52 populations (952 individuals) included in the HGDP-CEPH 952 panel and the four populations (270 individuals) of the HapMap project for the same 716 SNPs. Atypical and related individuals were removed (Rosenberg, 2006), which resulted in 898 individuals from the HGDP-CEPH panel and 270 from the HapMap project. The final data set included 1168 unrelated individuals from 53 unique populations. We did not limit the SNP data to SNPs within exons, introns, promoters or 3′ regions because the goal of this study was to understand the gene regions, including upstream and downstream regions.
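The region definition above (gene plus 20 kbp upstream and 10 kbp downstream) can be illustrated with a minimal sketch. The text does not state how strand was handled, so the strand-aware logic here is an assumption, and the function names `gene_window` and `snps_in_window` are hypothetical helpers rather than part of any tool used in the study.

```python
def gene_window(start, end, strand, upstream=20_000, downstream=10_000):
    """Return the (lo, hi) interval covering a gene plus its flanks.

    Coordinates are 1-based and inclusive. "Upstream" is taken relative to
    the direction of transcription, which is an assumption of this sketch.
    """
    if strand == "+":
        lo, hi = start - upstream, end + downstream
    elif strand == "-":
        lo, hi = start - downstream, end + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    return max(1, lo), hi  # never extend past the start of the chromosome


def snps_in_window(snp_positions, window):
    """Keep the SNP ids whose position falls inside the window."""
    lo, hi = window
    return [snp for snp, pos in snp_positions.items() if lo <= pos <= hi]


# Illustrative coordinates only; not taken from the study.
window = gene_window(100_000, 150_000, "+")
print(window)
print(snps_in_window({"snp_a": 79_999, "snp_b": 80_000, "snp_c": 155_000}, window))
```

Filtering on position alone keeps every SNP in the window regardless of allele frequency, which mirrors the inclusion rule described above.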

We also obtained genotype data for 174 genes involved in innate immunity as a comparison set for our telomere biology genes. SNP data for each gene were acquired from the HGDP-CEPH (Cann et al., 2002) genome-wide scan for all individuals in the 52 populations, and cleaned as described above. The immune gene set was chosen as a comparison gene set as these genes are often highly variable. Additional comparisons were made with data reported in the literature. Supplementary Table 1 lists the 174 innate immune gene regions evaluated.

HapMap phase 3 (The International HapMap Consortium, 2003) SNP data for 11 populations (1115 individuals) were also retrieved for a subset of the telomere maintenance genes that were potential candidates for evolutionary selection (defined in results): ACD, NOLA2, RECQL4, POT1, TERF2IP and TNKS, as well as for TERT. Individuals in this phase do not overlap with HapMap phase 2 participants.

Data analysis

Haplotype and SNP frequencies were estimated using a Bayesian algorithm implemented in PHASE version 2.1 (Stephens et al., 2001; Stephens and Scheet, 2005). Haplotypes determined by PHASE were used as input for all other analyses. The package ARLEQUIN version 3.11 (Excoffier et al., 2005) was used to compute haplotype diversity, FST values, the Mantel test, analysis of molecular variance (AMOVA) and heterozygosity. FST values based on allele frequencies were calculated as a measure of population differentiation and significance was estimated with 10 000 permutations. A Mantel test was used to test the significance of the regression of genetic distance on geographic distance between population pairs with 10 000 permutations. To apportion the fraction of the genetic variance due to differences between and within continental groups and to infer the genetic structure of the populations, AMOVA was performed with 10 000 permutations. MEGA version 4.0 (Tamura et al., 2007) was used to construct a neighbor-joining tree based on genetic differentiation. Population structure was inferred by a Bayesian clustering analysis performed with STRUCTURE version 2.2 (Pritchard et al., 2000; Falush et al., 2007) using the following settings: admixture model, correlated markers, K=1–10, a length of 100 000 for the burn-in period, and 100 000 repetitions following the burn-in period. Haploview version 4.1 (Barrett et al., 2005) was used to determine the degree of linkage disequilibrium (LD) and minor allele frequencies (MAF). LD P values (with s.e.) were estimated by Monte Carlo approximation with 10 000 steps in the Markov Chain using ARLEQUIN. This LD calculation is an extension of Fisher's exact probability test on contingency tables, and the results are given as a significance level of LD for each pair of loci, with a small P value (<0.05) indicating high LD (Excoffier et al., 2005; Santos-Lopes et al., 2007). Differences between the telomere and immune gene set results were tested for significance with parametric (t-test) and non-parametric tests (Mann–Whitney U-test).
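To make the idea of allele-frequency-based FST concrete, the following is a minimal sketch for a single biallelic SNP. It uses the simple heterozygosity-based definition FST = (HT − HS)/HT with equal population weights. This is an illustrative simplification: it is not the AMOVA-based estimator implemented in ARLEQUIN, it ignores sample-size corrections, and the function name `wright_fst` is hypothetical.

```python
def wright_fst(freqs):
    """Heterozygosity-based FST for one biallelic SNP.

    freqs: frequency of the same allele in each population (equal weights,
    no sample-size correction; both are simplifying assumptions).
    """
    k = len(freqs)
    if k < 2:
        raise ValueError("need at least two populations")
    p_bar = sum(freqs) / k
    h_total = 2 * p_bar * (1 - p_bar)                  # expected heterozygosity, pooled
    if h_total == 0:
        return 0.0                                     # monomorphic everywhere
    h_within = sum(2 * p * (1 - p) for p in freqs) / k  # mean within-population value
    return (h_total - h_within) / h_total


print(wright_fst([0.5, 0.5]))   # identical frequencies: no differentiation
print(wright_fst([0.2, 0.8]))   # divergent frequencies: intermediate FST
print(wright_fst([0.0, 1.0]))   # fixed differences: maximal FST
```

Values near 0 indicate that populations share similar allele frequencies, and values near 1 indicate nearly fixed differences, which is the scale on which the per-gene FST estimates in this study are interpreted.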

We retrieved ancestral (chimpanzee) alleles for 98.2% of the SNPs using the UCSC Genome Browser (March 2006 Assembly: http://genome.ucsc.edu/) and/or Ensembl (release 50, Jul 2008: http://www.ensembl.org/index.html). In cases where neither human allele corresponded to the chimpanzee allele or when the chimpanzee allele was unknown we excluded these SNPs from the analysis. Pairwise geographic distance between populations and distance from Addis Ababa, Ethiopia (the putative point of origin of modern humans (White et al., 2003)) was estimated in kilometers (km) following the likely colonization route (shortest path through landmasses) as in Prugnolle et al. (2005).
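As a rough illustration of measuring distance along a colonization route, the sketch below sums great-circle (haversine) segments through a list of waypoints. This is a simplification, not the shortest-path-through-landmasses procedure of Prugnolle et al. (2005): the waypoint, the destination coordinates, the Earth radius constant and the function names are all assumptions chosen for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(h)))


def path_km(points):
    """Total length of a route visiting the points in order."""
    return sum(haversine_km(p, q) for p, q in zip(points, points[1:]))


ADDIS_ABABA = (9.03, 38.74)    # approximate coordinates
CAIRO_WAYPOINT = (30.0, 31.0)  # hypothetical waypoint keeping the route on land
DESTINATION = (48.9, 2.3)      # hypothetical European sampling location

direct = haversine_km(ADDIS_ABABA, DESTINATION)
via_land = path_km([ADDIS_ABABA, CAIRO_WAYPOINT, DESTINATION])
print(round(direct), round(via_land))
```

Routing through a waypoint can only lengthen the path relative to the direct great-circle distance, which is why land-constrained distances are at least as large as straight-line ones.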

Selection was evaluated with the following analyses: (1) between-population differentiation (FST), which can be inflated when environmental pressures on populations cause local adaptation and allele frequency changes (positive directional selection), whereas negative or balancing selection can decrease the differentiation of selected loci (Akey et al., 2002; Nielsen, 2005; Sabeti et al., 2007); (2) genetic diversity, where a significant decrease points to positive selection favoring a particular allele and an increase may reflect balancing selection, with more diversity being potentially adaptive; (3) LD across populations, because selection can increase LD; and (4) MAF and derived allele frequency (DAF) tests (Walsh et al., 2006). The integrated haplotype score (iHS) for HapMap phase 2 data was retrieved with HAPLOTTER (The International HapMap Consortium, 2005; Voight et al., 2006).
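The DAF and MAF quantities used in test (4) can be sketched as follows. The polarization rule follows the description above: a SNP is skipped when the chimpanzee allele is unknown or matches neither human allele. The function names and the dictionary-based input format are assumptions made for this illustration.

```python
def derived_allele_frequency(allele_counts, ancestral):
    """DAF for a biallelic SNP, or None if the SNP cannot be polarized.

    allele_counts: dict mapping each of the two human alleles to its count.
    ancestral: the chimpanzee allele, or None if unknown.
    """
    if ancestral is None or len(allele_counts) != 2 or ancestral not in allele_counts:
        return None  # excluded, as described for unmatched or unknown alleles
    total = sum(allele_counts.values())
    if total == 0:
        return None
    return (total - allele_counts[ancestral]) / total


def minor_allele_frequency(allele_counts):
    """Frequency of the less common allele of a biallelic SNP."""
    total = sum(allele_counts.values())
    if total == 0:
        raise ValueError("no observed alleles")
    return min(allele_counts.values()) / total


counts = {"T": 30, "G": 70}                   # illustrative counts only
print(derived_allele_frequency(counts, "T"))  # G is derived
print(derived_allele_frequency(counts, "A"))  # neither allele matches: None
print(minor_allele_frequency(counts))
```

Per-gene summaries then count how many SNPs fall in the tails of these distributions, for example a DAF above 80% or below 20%, or a MAF above 40% or below 10%, which are the thresholds reported in the Results.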

Results

We analyzed SNP data from the HGDP-CEPH (Cann et al., 2002) genome-wide scan for 37 gene regions involved in telomere biology and from Phase 2 of the International HapMap project (The International HapMap Consortium, 2003). The telomere data set consisted of a total of 716 SNPs in 1168 individuals from 53 worldwide populations. Supplementary Tables 1 and 2 give summary statistics for the telomere biology and innate immune gene regions analyzed (for example, alleles, MAF, heterozygosity, variation components).

Population structure and differentiation

Bayesian cluster analysis and a distance-based neighbor-joining tree segregated individuals into five genetic clusters: Africa, Eurasia (Middle East, Europe, Utah, C/S Asia), East Asia, Oceania and America (Supplementary Figure 1). The Utah, USA population clustered within Eurasia, with high genetic similarity to the European populations. We tested for isolation by distance using a Mantel test on FST estimates and found a significant positive correlation between the degree of genetic divergence and the pairwise geographical distance (correlation coefficient (r)=0.64, P<0.001).
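The isolation-by-distance test can be illustrated with a generic permutation Mantel test. The analysis reported here was run in ARLEQUIN; the sketch below is an independent, simplified version in which the function name `mantel`, the one-sided P value and the toy matrices are assumptions for illustration only.

```python
import math
import random


def _upper(matrix, order):
    """Upper-triangle values of a symmetric matrix after relabeling rows/columns."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def _pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mantel(genetic, geographic, permutations=10_000, seed=0):
    """Return (r, one-sided P) for the correlation of two distance matrices."""
    rng = random.Random(seed)
    identity = list(range(len(genetic)))
    geo_vec = _upper(geographic, identity)
    r_obs = _pearson(_upper(genetic, identity), geo_vec)
    hits = 0
    for _ in range(permutations):
        perm = identity[:]
        rng.shuffle(perm)  # permute population labels of one matrix only
        if _pearson(_upper(genetic, perm), geo_vec) >= r_obs:
            hits += 1
    return r_obs, (hits + 1) / (permutations + 1)


# Toy example: four populations on a line, genetic distance proportional to geography.
geo = [[abs(i - j) * 1000.0 for j in range(4)] for i in range(4)]
gen = [[abs(i - j) * 0.02 for j in range(4)] for i in range(4)]
print(mantel(gen, geo, permutations=999))
```

Permuting the population labels of one matrix, rather than shuffling individual cells, preserves the dependence among pairwise distances that share a population, which is the reason a Mantel test is used instead of an ordinary correlation test.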

Table 1 shows the levels of differentiation (FST) by gene sorted in descending order by among regions differentiation. PRKDC and POT1 have the lowest levels of differentiation, and nearly half (0.44) of the POT1 SNPs have global FST estimates less than the 0.05 percentile of the overall FST distribution. ACD and TERF2IP had the highest levels of differentiation, and a large portion of their SNPs were above 0.95 (0.60 and 0.50, respectively) and 0.99 (0.40 and 0.38, respectively) percentiles. ACD had very high levels of differentiation observed between HapMap Yoruba and Utah populations and between Yoruba and Chinese/Japanese populations (0.52 and 0.75, respectively), and TERF2IP between Utah and Chinese/Japanese populations (0.56).

Table 1 Levels of differentiation (FST) by gene using HapMap 2 and HGDP data

Overall, TERT had average levels of differentiation among regions and HapMap populations (Table 1 and Supplementary Table 3). When the SNPs were limited to those within the TERT gene (introns, exons and UTRs, n=4 SNPs), the levels of differentiation were lower among regions (FST=0.072) and within populations (FST=0.089). We further evaluated the recently identified TERT SNP, rs2736100 (localized to intron 2: at chromosome 5p15.33, position 1 339 516) as it appears to be associated with risk of lung cancer, testicular cancer and glioma (McKay et al., 2008; Jin et al., 2009; Landi et al., 2009; Shete et al., 2009; Turnbull et al., 2010). rs2736100 had variable levels of differentiation among geographical regions. Its genotype frequencies varied among regions, and levels of pairwise differentiation were particularly high between Oceania and all other regions, as well as between America and Eurasia, and low among Africa, East Asia and Eurasia (Supplementary Table 3 and Supplementary Figure 2).

As a gene set, the AMOVA partitions variation among the seven geographical regions similar to that observed for the entire 650k autosomal SNP panel (Li et al., 2008) with within-population variation accounting for the majority of the genetic diversity (Figure 1). There was some disparity in how variation is partitioned among individual genes. There was substantially higher among-regions variation observed in ACD and TERF2IP and the least in POT1 (Figure 1). There was significantly lower differentiation observed among geographical regions for the telomere biology genes compared with the innate immune genes (P=0.0002) (Figure 1), and the distributions of FST values among all the HGDP populations showed a shift down towards lower FST for the telomere biology genes (Figure 2). AMOVA variation components and differentiation by locus are shown in Supplementary Tables 1 and 2 for the innate immune and telomere biology genes, respectively. Grouping the genes by function showed that telomerase complex genes had the lowest FST values (among regions=0.07), followed by helicase genes, and other telomere associated genes compared with the genome-wide average for autosomal SNPs (0.10–0.15 (Akey et al., 2002; Shriver et al., 2004, 2005; Weir et al., 2005)) and the innate immune gene set, as well as among HapMap 2 populations in comparison to other gene sets (Table 2).

Figure 1

Analysis of molecular variance by gene using HapMap 2 and HGDP data. Partitioning variation into three components: within population (WP), among region (AR) and among-population-within-region (APWR). Populations are assigned to the seven main geographic regions from the HGDP-CEPH panel; *HGDP panel at 650k autosomal SNPs (Li et al., 2008).

Figure 2

Differentiation among the HGDP populations for the telomere biology and innate immune gene sets.

Table 2 Genetic differentiation (FST) among major continental groups and HapMap 2 populations in comparison to other data sets

Haplotype diversity and LD

The number of haplotypes and diversity estimates by gene and region are shown in Table 3 and Supplementary Table 4. Overall, the haplotype diversity was highest in Africa (0.844) and lowest in Oceania (0.634; Table 3). The majority of genes had very low to moderate haplotype diversity (Supplementary Table 4). DKC1, TERC and XRCC6 had very low haplotype diversity of less than 50%, and TINF2, NOLA2 and RECQL4 had low diversity estimates of 60–70%. Surprisingly, TERT had high haplotype diversity (ranging from 81 to 96%). Using only the SNPs within the TERT gene, the haplotype diversity was lower in East Asia, Oceania and America (ranging from 37 to 74%), and higher in Eurasia and Africa (88% and 90%, respectively) (Supplementary Figure 2). The mean haplotype diversity for all of the 53 populations was negatively correlated with geographic distance from Ethiopia (r=−0.89, slope=−1.17e−5). Heterozygosity in the telomere gene set was significantly lower than in the immune gene set by geographical region (P=0.014; Table 3).
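Haplotype diversity of the kind summarized above is commonly computed with Nei's unbiased estimator, H = n/(n − 1) × (1 − Σ p_i²), where p_i is the frequency of the i-th distinct haplotype among n sampled chromosomes. The sketch below assumes phased haplotypes are available as strings; the function name `haplotype_diversity` is hypothetical and the exact estimator used by ARLEQUIN should be checked against its manual.

```python
from collections import Counter


def haplotype_diversity(haplotypes):
    """Nei's unbiased haplotype (gene) diversity for a sample of phased haplotypes."""
    n = len(haplotypes)
    if n < 2:
        raise ValueError("need at least two haplotypes")
    counts = Counter(haplotypes)
    sum_sq = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1.0 - sum_sq)


# Illustrative samples only.
print(haplotype_diversity(["AC", "AC", "GT", "GT"]))  # two equally common haplotypes
print(haplotype_diversity(["AC", "AT", "GC", "GT"]))  # every haplotype unique
print(haplotype_diversity(["AC", "AC", "AC", "AC"]))  # a single haplotype
```

The estimate is the probability that two haplotypes drawn at random from the sample differ, so it ranges from 0 (one haplotype only) to 1 (all haplotypes distinct).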

Table 3 Telomere biology genes average diversity of the major continental regions from the HapMap 2 project and HGDP-CEPH panel

LD P values were estimated for all marker pairs for each gene and a summary of the proportion of SNPs with significant LD (P<0.05) is presented in Table 4. The proportion of marker pairs with significant LD varied among geographic regions and genes. A low proportion of marker pairs with LD was often observed in Oceania. The lowest LD was observed in RAD51L1 (0.4 of marker pairs in all populations) and the highest in POT1 (>0.9). However, this analysis is limited by the small number of SNPs in many of these genes and the limited population sizes in Oceania and America.

Table 4 Proportion of marker pairs with significant linkage disequilibrium using HapMap 2 and HGDP data

Ancestral alleles

Comparing the ancestral allele frequency (AAF) spectra among the HapMap 2 and HGDP populations, we found that populations in Africa had the most SNPs with high AAFs and populations in America had the fewest (Figure 3). A steeper slope of SNP counts in the midrange of the distribution reflects more SNPs with high AAFs (Li et al., 2008). The slopes of SNP counts in the range of 0.2–0.8 AAF for all of the populations progressively declined moving away from Ethiopia (3.8–0.1; Figure 3b). This AAF pattern did not change after limiting the SNPs to only those in exons, introns and UTR regions. The average AAF was highest in African populations (0.735) and lowest in American populations (0.655) (data not shown). Average AAFs for the majority of our genes were high (>60%). For TERT, the average AAF was 64%, and for the TERT SNP, rs2736100, it was 49%. The AAF for rs2736100 (ancestral allele: T) varied by geographic region: the highest AAF was observed in America (0.88) and the lowest in Oceania (0.045).
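One way to obtain a midrange slope of the AAF spectrum is to bin the per-SNP AAFs, then fit a least-squares line to the SNP counts of the bins lying between 0.2 and 0.8. The sketch below does this under stated assumptions: the bin width of 0.1, the use of bin midpoints as the x axis and the function name `aaf_midrange_slope` are illustrative choices, and the resulting units need not match those of the slopes reported here or in Li et al. (2008).

```python
def aaf_midrange_slope(aafs, lo=0.2, hi=0.8, num_bins=10):
    """Least-squares slope of SNP counts versus AAF for bins within [lo, hi]."""
    counts = [0] * num_bins
    for f in aafs:
        idx = min(int(f * num_bins), num_bins - 1)  # AAF of 1.0 goes in the last bin
        counts[idx] += 1
    first, last = round(lo * num_bins), round(hi * num_bins)
    xs = [(i + 0.5) / num_bins for i in range(first, last)]  # bin midpoints
    ys = counts[first:last]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Toy spectrum: counts rise by one SNP per bin across the midrange.
toy = []
for k, mid in enumerate([0.25, 0.35, 0.45, 0.55, 0.65, 0.75], start=1):
    toy += [mid] * k
print(aaf_midrange_slope(toy))
```

A positive slope means SNP counts increase toward high AAF across the midrange, so a steeper slope corresponds to a population that retains ancestral alleles at more loci.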

Figure 3

Ancestral allele frequency spectrum using HapMap 2 and HGDP data. (a) Histograms of AAFs for four populations: Yoruba, USA, Han and OC (the two populations in Oceania were combined due to small sample sizes). N is the number of individuals and the slope is for the SNP counts in the range of 0.2–0.8 AAF. (b) Slopes of AAFs between 0.2 and 0.8 for all of the 53 populations versus geographic distance from Ethiopia.

Test for selection

For evaluating FST, we concentrated on regions that show high or low values among multiple markers, as individual SNPs show considerable variation. According to the cut-points estimated by Akey et al. (2002), ACD and TERF2IP had the highest proportion of SNPs with high FST (0.45) (0.6 and 0.5, respectively), and TNKS, RAD51L1 and RECQL all had very low FST (two SNPs with an FST of 0 and one SNP with an FST of 0.005). We also plotted the average FST versus the average heterozygosity by region. There were three outliers with high FST and low heterozygosity (ACD, TERF2IP and TERF2) and two outliers with low FST and high heterozygosity (POT1 and NOLA2) (data not shown). There were three outliers in the DAF test with a large proportion of derived alleles (>80%): ACD (in CEU and Han populations), NOLA2 (in CEU and America populations) and RECQL4 (in Oceania) (data not shown). All of the loci in TERC (n=3) and XRCC6 (n=5) had DAFs of <20% in all populations and in CEU and Han populations, respectively. The MAF test suggests that RECQL4 (Han), POT1 (Han and CEU) and RAD54L (America) have an excess of SNPs with high MAF (>40%) and that ACD (Han), TERF2IP (CEU) and RAD51L3 (America) have an excess of SNPs with low MAF (<10%). The strong LD observed for POT1 supports the existence of balancing selection. Overall, TERT did not show evidence of selection.

HapMap 3 data for select genes

HapMap 3 (The International HapMap Consortium, 2003) SNP data for 11 populations were retrieved for genes identified as potential evolutionary selection candidates in this study (based on at least two tests: ACD, NOLA2, RECQL4, POT1 and TERF2IP) and in previous studies (TNKS; Savage et al., 2005; The International HapMap Consortium, 2007), to confirm our findings in an additional data set with denser SNP coverage. We also retrieved SNP data for TERT. Assigning the populations to the main geographic regions (identified by a distance-based neighbor-joining tree, Supplementary Figure 3), the AMOVA partitions the majority of the genetic diversity to within-population variation (91%); there is less variance attributable to among regions in NOLA2, POT1 and TNKS (<5%), more variance among regions in TERF2IP and ACD (>20%) and average in RECQL4 (14%), as observed in the HapMap 2 and HGDP data set.

TERT had high haplotype diversity (96–99%) and heterozygosity (0.26–0.35) in these 11 populations. There was average differentiation among geographical regions and within populations based on allele frequencies (FST=0.118 and 0.138, respectively), similar to the HapMap 2 and HGDP data set. However, population comparisons based on haplotype frequencies were all low, with FST <0.025.

There was evidence of positive selection (high FST, low heterozygosity, high or low DAFs and low MAFs) in ACD and TERF2IP, and evidence of balancing selection (low FST, high heterozygosity and high MAFs) in POT1, NOLA2 and regions of TNKS (Supplementary Figure 4). There were mixed signals in RECQL4, with a proportion of SNPs with extreme high and low FST, heterozygosity and MAFs. The patterns of variance attributable to among regions are also consistent, with extremely high values in ACD and TERF2IP and low in POT1, NOLA2 and TNKS.

Discussion

In this study, we examined allele frequency distributions, diversity, differentiation, LD and population structure among 53 worldwide populations by combining HapMap 2 and HGDP-CEPH genome-wide scan data of 37 genes vital for telomere stability. This extensive data set allowed us to create a comprehensive catalog of worldwide genetic variation for these genes. Overall, most telomere biology genes had low to moderate diversity and less than average differentiation. There was significantly lower differentiation among HGDP populations and heterozygosity in the telomere biology genes compared with innate immunity genes. Differentiation among geographical regions in the telomere biology genes grouped by function showed the lowest values in the telomerase complex genes compared with other gene sets and the genome average. These genes are required for telomere elongation and maintaining chromosomal stability.

As a gene set, there is a specific population structure; cluster analyses segregated individuals into five genetic clusters, concordant with larger analyses with the HGDP-CEPH panel (Rosenberg et al., 2002; Jakobsson et al., 2008; Li et al., 2008). The significant positive correlation between the degree of genetic divergence and the pairwise geographical distance suggests that the observed genetic differentiation can be partially explained by isolation by geographic distance, which agrees with previous data (Ramachandran et al., 2005; Jakobsson et al., 2008). As expected, the mean haplotype diversity and AAFs were highest in Sub-Saharan Africa. For all populations, diversity and AAF slopes were negatively correlated with geographic distance from Addis Ababa, Ethiopia, consistent with a serial founder model during a spatial expansion from Africa (Ramachandran et al., 2005). The AAFs for the majority of telomere maintenance genes were high, with most having an average AAF>60%, and the AAF slopes were much higher (range of 0.1–3.8) than observed by Li et al. (2008) (range of 0.001–0.004).

The high AAF, low diversity and differentiation in many of these genes and gene sets suggest that they may be constrained, possibly because of their essential roles in chromosomal stability. Several telomere maintenance genes have been previously shown to be highly conserved across species (Nakamura and Cech, 1998; Li et al., 2000; Kanoh and Ishikawa, 2003; de Lange, 2004; Savage et al., 2005). This conservation can be explained by a low mutation rate and/or negative selection; however, distinguishing the two is difficult because both result in little sequence change (Hurst, 2009). Savage et al. (2005) also found that seven of these genes had more synonymous than non-synonymous mutations per site. A plausible explanation for the lower levels of diversity and differentiation observed in many telomere maintenance genes is that negative selection acts to maintain the status quo of these essential genes. Perhaps these genes were highly conserved during evolution because of their important function and the accumulation of new mutations was not tolerable.

Negative, positive and balancing selection can each leave a specific signature on allele frequency patterns and LD (Walsh et al., 2006; Hurst, 2009). Using these patterns, we found evidence suggestive of positive selection in two separate data sets for ACD and TERF2IP, and evidence of balancing selection in POT1, NOLA2 and TNKS. Regions of low recombination, and thus long-range LD, as observed in POT1 and regions of TNKS, could be the result of balancing selection; alleles under balancing selection can drag linked alleles with them and cause increased LD (Hurst, 2009). Two additional studies also found evidence of selection in POT1, TNKS and TERF2IP. POT1 and TNKS were found to have significantly positive Tajima's D (Tajima, 1989) using sequence data (Savage et al., 2005), POT1 in non-Hispanic Caucasians and TNKS in individuals of Pacific Rim ancestry, suggestive of balancing selection. POT1 (in Europeans), TERF2IP and TNKS (both in East Asian and African populations) were also identified as candidate regions for recent selection with the powerful long-range haplotype and iHS tests in the HapMap genome-wide study based on over 3.1 million SNPs (The International HapMap Consortium, 2007).

Allele-frequency-based tests are not considered the most powerful methods to detect a recent selective sweep (Hanchard et al., 2006) and there is no statistical significance associated with these results. However, they highlight regions that might justify further investigation. There is also the possibility of SNP ascertainment bias that may result in false positive signals. Some of the genome-wide platform-selected SNPs are chosen based on their location in and around specific genes as well as based on haplotype-tagging SNPs in the region. However, we did not limit our analyses to only SNPs within gene exons, introns and UTRs because the goal of the study was to understand the gene and its surrounding region. The HGDP data were generated based on common SNPs and the HapMap data are also skewed toward common alleles, making it more difficult to detect an excess of rare or derived alleles near fixation. However, the identification of these genes as candidates for selection in other studies suggests that selection may indeed be present. It has been suggested (Barreiro et al., 2008) that positive selection has ensured the regional adaptation of human populations by increasing population differentiation in gene regions, and that these loci likely contribute to disease-related phenotypic diversity among these different human populations.

We further explored genetic variation in TERT because several studies have identified both SNPs and mutations in TERT as important in cancer and telomere biology disorders (Armanios, 2009; Rafnar et al., 2009; Savage and Alter, 2009). Others have observed a high degree of TERT sequence similarity across species, hence we hypothesized that there would be limited genetic variation in these populations. However, we observed high haplotype diversity and heterozygosity in TERT. The level of TERT differentiation among populations was average or lower than average (genome-wide average for autosomal SNPs), which may reflect a lack of LD and likely a high recombination rate in this region. The cancer-associated SNP, rs2736100, varied substantially in genotype frequency across major continental regions, which could correlate to varying disease risk.

In conclusion, this study suggests that, as a group, telomere biology genes have less diversity and differentiation than genes with different functions. Data suggest that TERT may be an exception to this hypothesis. The identification of telomere biology genes under selection (for example, ACD, TERF2IP, POT1 and TNKS) might provide clues to their roles in telomere and chromosomal stability. It is possible that higher levels of genetic variation may not be tolerated in these genes, possibly due to their critical role in telomere maintenance.