Tag SNP selection for candidate gene association studies using HapMap and gene resequencing data

Xu, Zongli; Kaplan, Norman L; Taylor, Jack A

doi:10.1038/sj.ejhg.5201875

Published: 13 June 2007

European Journal of Human Genetics volume 15, pages 1063–1070 (2007)

Abstract

HapMap provides linkage disequilibrium (LD) information on a sample of 3.7 million SNPs that can be used for tag SNP selection in whole-genome association studies. HapMap can also be used for tag SNP selection in candidate genes, although its performance has yet to be evaluated against gene resequencing data, where there is near-complete SNP ascertainment. The Environmental Genome Project (EGP) is the largest gene resequencing effort to date with over 500 resequenced genes. We used HapMap data to select tag SNPs and calculated the proportions of common SNPs (MAF≥0.05) tagged (ρ²≥0.8) for each of 127 EGP Panel 2 genes where individual ethnic information was available. Median gene-tagging proportions are 50, 80 and 74% for African, Asian, and European groups, respectively. These low gene-tagging proportions may be problematic for some candidate gene studies. In addition, although HapMap targeted nonsynonymous SNPs (nsSNPs), we estimate only ∼30% of nonsynonymous SNPs in EGP are in high LD with any HapMap SNP. We show that gene-tagging proportions can be improved by adding a relatively small number of tag SNPs that were selected based on resequencing data. We also demonstrate that ethnic-mixed data can be used to improve HapMap gene-tagging proportions, but are not as efficient as ethnic-specific data. Finally, we generalized the greedy algorithm proposed by Carlson et al (2004) to select tag SNPs for multiple populations and implemented the algorithm into a freely available software package mPopTag.

Introduction

The International HapMap Project has detailed information on genetic variation across the genome.¹ An important use of these data is to help identify genetic determinants of disease. HapMap Release 20 has genotype data for more than 3.7 million SNPs for several populations (http://www.hapmap.org/). Simulations with HapMap ENCODE (Encyclopedia of DNA Elements) Project data (resequencing of ten 500-kb genomic regions in 48 individuals, followed by genotyping of all discovered SNPs, as well as all SNPs in dbSNP at the time, in the 270 HapMap DNA samples) estimated that 94% of the common SNPs (minor allele frequency, MAF≥0.05) in non-African populations and 81% in the Yoruba from Ibadan, Nigeria (YRI) are in high linkage disequilibrium (LD) with at least one of the SNPs in HapMap.¹ These simulations suggest that HapMap SNP density may be adequate for whole-genome association studies.

Investigators are also using HapMap data for SNP selection in candidate gene association studies.^{1, 2} Because HapMap genotypes only a sample of the SNPs that have been deposited into dbSNP (http://www.ncbi.nlm.nih.gov/SNP), it has only partial information on gene polymorphisms, whereas near-complete ascertainment of common SNPs in genes can be obtained through gene resequencing. The largest gene resequencing effort to date is the Environmental Genome Project (EGP) sponsored by the National Institute of Environmental Health Sciences³ (http://www.niehs.nih.gov/envgenom/home.htm), which at the time of this study had resequenced 518 genes in 90 to 95 people of different ethnic backgrounds and had identified more than 70 000 SNPs. In total, EGP has resequenced more than 12 Mb of the human genome, although individual ethnic information is available only for the 127 genes resequenced in EGP Panel 2. We used HapMap data to identify tag SNPs for each of these 127 genes and then, using the catalog of common SNPs identified through EGP resequencing, we estimated gene-tagging proportions of HapMap tag SNPs in each of three ethnic groups. In addition, we considered strategies to improve gene-tagging proportions beyond those obtained using HapMap tag SNPs.

The 391 genes in EGP Panel 1 were resequenced in 90 individuals drawn from the ethnically diverse Polymorphism Discovery Resource.⁴ Because of ethical, legal, and social implications (ELSI), ethnic identifiers were removed, resulting in an ethnic-mixed sample, but one with known ethnic proportions. The utility of tag SNPs chosen from an ethnic-mixed sample is unclear, because allele frequencies and/or underlying LD patterns may differ between populations.⁵ We investigated this problem by using ethnic-pooled data for EGP Panel 2 genes, for which we have individual ethnic data.

Candidate gene studies often include individuals from multiple ethnic groups, which may require the use of different ethnic-specific panels of tag SNPs. It would be reasonable and certainly more convenient to have one set of tag SNPs that can be used in multiple populations. Similar in purpose to the TagIT⁶ and MultiPop-TagSelect⁷ methods, we generalized the greedy algorithm proposed by Carlson et al (2004)²¹ to select tag SNPs for multiple populations. We used this algorithm to choose HapMap multipopulation tag SNPs and evaluated gene-tagging proportions for EGP Panel 2 genes.

Because nonsynonymous coding SNPs (nsSNPs) are a high priority for candidate gene-association studies,^{8, 9} HapMap made a special effort to include as many nsSNPs as possible.¹ Despite HapMap's effort, its information on nsSNPs may be limited, because most nsSNPs have low MAF.^{8, 9, 10, 11} To quantify HapMap's success in capturing nsSNPs, we used EGP resequence data to estimate the fraction of nsSNPs that are either in HapMap or in high LD with a SNP in HapMap.

Materials and methods

Data

The EGP selected for resequencing those genes thought to be involved in susceptibility to environmentally associated disease. The major focus of this effort was on genes associated with DNA repair, cell cycle regulation, apoptosis, and metabolism. These genes are widely distributed across all chromosomes, except for the Y chromosome. At the time of this study, genotypes based on resequencing data were available from the EGP website for 52 387 SNPs in 391 genes from EGP Panel 1 and for 18 850 SNPs in 127 genes from EGP Panel 2. By examining the date of deposit, we found 52 352 (73%) of the SNPs in Panel 1 and 2 were novel at the time of their deposit into dbSNP.¹² Approximately 17% of the novel SNPs were common (MAF≥0.05) in EGP data. We considered only biallelic SNPs with less than 20% missing genotype data, resulting in 48 697 SNPs in EGP Panel 1 and 17 495 SNPs in EGP Panel 2. The EGP resequencing effort applied a number of measures to assure data quality and had an average base call Phred score >45 (99.998% accuracy of the base call).¹³

EGP Panel 1 has DNA from 90 individuals, including 24 African-Americans, 24 Asian-Americans, 24 European-Americans, 12 Hispanic-Americans, and 6 Native-Americans, with equal numbers of males and females drawn from the Polymorphism Discovery Resource.⁴ EGP Panel 2 has DNA from an independent set of 95 individuals (http://egp.gs.washington.edu/), including 15 African-Americans (AA), 12 YRI, 12 Japanese in Tokyo, Japan (JPT), 12 Han Chinese in Beijing, China (CHB), 22 CEPH (Utah residents with ancestry from northern and western Europe), and 22 Hispanics (HISP). Fifty-eight of the individuals (12 YRI, 12 JPT, 12 CHB, and 22 CEPH) in EGP Panel 2 were also included in HapMap. Although African-Americans have an admixed ancestry,⁴ a recent study has shown that the LD pattern of African-Americans was similar to that of YRI,¹⁴ and therefore, we combined the two groups as ‘African’. Similarly, Chinese and Japanese data were combined as ‘Asian’. To mimic the EGP Panel 1 ethnic-mixed sample, we also formed an EGP Panel 2 ‘Pool’ group composed of all Panel 2 subjects.

SNP genotype data were coded 1, 0, and −1 corresponding to major allele homozygote, heterozygote, and minor allele homozygote. For consistency of the genotype code across populations, major and minor alleles were always classified by the allele frequency in the Pool data. For population-specific data, we calculated MAF within each population. For Panel 1 and pooled Panel 2 data, we calculated MAF using ethnically mixed data. We divided SNPs into two groups: common SNPs where the MAF was ≥0.05, and rare SNPs with MAF <0.05.
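The genotype coding and the common/rare split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, missing genotypes are represented as `None`, and the within-population MAF is taken as the smaller of the two allele frequencies, because an allele labelled minor in the Pool may be the more frequent allele in a single population.

```python
def minor_allele_frequency(genotypes):
    """MAF for genotypes coded 1 (major homozygote), 0 (heterozygote),
    -1 (minor homozygote); None marks a missing call (an assumption)."""
    called = [g for g in genotypes if g is not None]
    if not called:
        raise ValueError("no called genotypes")
    # Copies of the Pool-defined minor allele carried by each genotype code.
    minor_copies = sum({1: 0, 0: 1, -1: 2}[g] for g in called)
    freq = minor_copies / (2 * len(called))
    # The Pool-defined minor allele can be the major allele in one population.
    return min(freq, 1.0 - freq)


def classify_snp(genotypes, threshold=0.05):
    """Label a SNP 'common' (MAF >= threshold) or 'rare' (MAF < threshold)."""
    return "common" if minor_allele_frequency(genotypes) >= threshold else "rare"
```

For example, `classify_snp([1] * 19 + [0])` gives a MAF of 1/40 and therefore labels the SNP rare.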

HapMap SNPs were genotyped in four population samples, including 30 CEPH trios, 45 unrelated JPT, 45 unrelated CHB, and 30 YRI trios. HapMap Public Release 20 has genotype information for about 3.7 million SNPs (∼1.2 SNP/kb across the human genome). SNP genotype data were downloaded from the HapMap website. Only the 210 unrelated individuals were included in our analysis. As with the EGP data, we combined CHB and JPT data as ‘Asian’. We matched HapMap and EGP SNPs according to reference chromosome positions in dbSNP build 124. If the genotyping orientation was different between EGP and HapMap, HapMap SNP nucleotide data were converted into the complementary nucleotide code.
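The strand conversion mentioned above amounts to complementing each nucleotide call. The following is a minimal sketch, not the authors' code: the function names and the two-letter genotype strings (for example `"AG"`) are assumptions made for illustration, and the orientation flag is taken as given. Note that for A/T and C/G SNPs the complement is indistinguishable from an allele swap, so orientation cannot be inferred from the alleles alone.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def complement_genotype(genotype):
    """Return the complementary-strand call for a genotype string such as 'AG'."""
    return "".join(COMPLEMENT[base] for base in genotype.upper())


def harmonize(hapmap_genotypes, same_orientation):
    """Convert HapMap calls to the EGP orientation when the two differ."""
    if same_orientation:
        return list(hapmap_genotypes)
    return [complement_genotype(g) for g in hapmap_genotypes]
```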

Composite linkage disequilibrium

Standard measures of LD, including r² and D′, require assumptions of random mating and Hardy–Weinberg equilibrium (HWE) for phase-unknown data.¹⁵ These assumptions may not be met for EGP Panel 1 data, where ethnic identifiers have been removed from individual samples, or for Panel 2 Pool data, where ethnic identifiers were ignored. Therefore, instead of r², we used in our analysis a measure of composite LD proposed by Weir and Cockerham.^{16, 17} Composite LD (Δ_AB=D_AB+D_A/B) measures the association of alleles from different loci A and B on the same gamete (gametic LD, D_AB), as well as on different gametes (nongametic LD, D_A/B). D_AB is the usual measure of LD, D_AB=p_AB−p_Ap_B, whereas nongametic LD is D_A/B=p_A/B−p_Ap_B, where p_AB is the frequency of gamete AB, p_A/B is the frequency of alleles A and B on two different gametes, and p_A and p_B are the frequencies of alleles A and B at the two loci. An advantage of the composite LD measure (Δ_AB) is that it can be calculated from genotype data directly without requiring an assumption of random mating. In addition, it provides a robust method to test for LD that maintains the correct type I error rate whether or not there is departure from HWE at either locus.^{18, 19} In the case of random mating, D_A/B=0, and the composite LD reduces to the usual gametic LD D_AB. A test statistic for composite LD proposed by Weir,¹⁵

$$\rho = \frac{\Delta_{AB}}{\sqrt{\left[p_A(1-p_A)+D_A\right]\left[p_B(1-p_B)+D_B\right]}},$$

is based on a normalization of Δ_AB. In this expression, D_A=p_AA−p_A² and D_B=p_BB−p_B² are the deviations from HWE at each locus, and p_AA and p_BB are the frequencies of genotypes AA and BB. For n individuals, nρ² has an approximate χ₁² distribution when Δ_AB=0.¹⁵ In most cases, ρ² and the gametic LD measure r² are very similar.¹⁹ Finally, with our genotype coding, ρ is equivalent to the simple linear correlation coefficient of genotype data at two loci.²⁰
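Because ρ equals the linear correlation of the genotype codes, it can be computed directly from unphased genotypes. The sketch below assumes the 1/0/−1 coding described in the Data section, represents missing calls as `None`, and uses hypothetical function names; treating a monomorphic locus as ρ=0 is a convenience choice made here, not something stated in the text.

```python
from math import sqrt


def composite_ld_rho(geno_a, geno_b):
    """Pearson correlation of two genotype vectors coded 1/0/-1.
    Individuals with a missing call (None) at either locus are skipped."""
    pairs = [(a, b) for a, b in zip(geno_a, geno_b)
             if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two individuals typed at both loci")
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    if var_a == 0 or var_b == 0:
        return 0.0  # monomorphic locus: correlation undefined, treated as 0 here
    return cov / sqrt(var_a * var_b)


def ld_test_statistic(geno_a, geno_b):
    """n * rho^2, to be compared with a chi-square distribution with 1 df."""
    n = sum(1 for a, b in zip(geno_a, geno_b)
            if a is not None and b is not None)
    return n * composite_ld_rho(geno_a, geno_b) ** 2
```

Two SNPs with identical genotype vectors give ρ=1, and the statistic nρ² then simply equals the number of individuals typed at both loci.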

Tag SNPs

We employed a greedy algorithm proposed by Carlson et al²¹ to select tag SNPs from the set of common SNPs for each gene. First, we calculated ρ² for all possible pairs of common SNPs within a gene. For each gene, the greedy algorithm selects the SNP that has ρ²≥0.8 with the largest number of other SNPs, and places that SNP and its correlated SNPs into one bin. The binning process is iterated for all remaining unbinned SNPs, and continues until ρ² is less than 0.8 for all remaining pairs of SNPs. These remaining SNPs are each placed into singleton bins containing only themselves.
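A compact sketch of this greedy binning is given below. It is an illustration rather than the published implementation: the function name and the input layout (a dictionary keyed by `frozenset` pairs holding ρ²) are assumptions, ties are broken by input order, and the selected SNP is simply reported as the tag of its bin.

```python
def greedy_ld_bins(snps, r2, threshold=0.8):
    """Greedy LD binning.

    snps: list of SNP identifiers.
    r2:   dict mapping frozenset({snp_a, snp_b}) -> rho^2 (missing pairs = 0).
    Returns a list of (tag_snp, members) tuples; members includes the tag.
    """
    def linked(a, b):
        return r2.get(frozenset((a, b)), 0.0) >= threshold

    unbinned = list(snps)
    bins = []
    while unbinned:
        # Pick the SNP correlated with the most other unbinned SNPs
        # (max() keeps the first SNP in input order when counts tie).
        best = max(unbinned,
                   key=lambda s: sum(linked(s, t) for t in unbinned if t != s))
        members = [best] + [t for t in unbinned if t != best and linked(best, t)]
        bins.append((best, members))
        unbinned = [s for s in unbinned if s not in members]
    return bins
```

Once no remaining pair reaches the threshold, every pass of the loop selects a SNP with zero partners, which yields the singleton bins described above.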

We generalized the greedy algorithm to construct a parsimonious set of tag SNPs for multiple populations. As before, we first calculate ρ² for all pairs of common SNPs within a genome region separately for each ethnic group. We then execute the following three steps.

1. For each SNP, we count the number of SNPs that have ρ² greater than or equal to a specified threshold with that SNP. This is done independently for each ethnic group.
2. We sum the counts for each SNP across ethnic groups. The SNP with the largest sum is selected as a tag SNP.
3. For each ethnic group, we bin the SNPs whose ρ² with the tag SNP meets or exceeds the threshold.

Steps 1–3 are iterated for all remaining unbinned SNPs within each ethnic group until the only remaining SNPs are those whose sum equals 1. These SNPs are placed into singleton bins containing only themselves.
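The three steps can be sketched as follows. This is a hedged illustration, not the mPopTag source: the function name and data layout are hypothetical, a SNP's count excludes the SNP itself, a candidate is scored and binned only in populations where it is still an unbinned common SNP, and ties are broken by input order. Each of these choices is an assumption where the description leaves room for interpretation.

```python
def multipop_tag_snps(common_snps, r2, threshold=0.8):
    """Greedy multipopulation tag SNP selection.

    common_snps: dict population -> list of common SNP ids in that population.
    r2:          dict population -> dict frozenset({a, b}) -> rho^2.
    Returns a list of (tag_snp, {population: binned SNPs}) tuples.
    """
    def linked(pop, a, b):
        return r2[pop].get(frozenset((a, b)), 0.0) >= threshold

    unbinned = {pop: list(snps) for pop, snps in common_snps.items()}
    selected = []
    while any(unbinned.values()):
        # Candidate tags: every SNP still unbinned in at least one population.
        candidates = []
        for snps in unbinned.values():
            for s in snps:
                if s not in candidates:
                    candidates.append(s)

        def score(s):
            # Steps 1 and 2: per-population partner counts, summed over populations.
            return sum(sum(linked(pop, s, t) for t in snps if t != s)
                       for pop, snps in unbinned.items() if s in snps)

        tag = max(candidates, key=score)
        bins = {}
        for pop, snps in list(unbinned.items()):
            if tag in snps:
                # Step 3: bin the tag and its partners separately in each population.
                members = [tag] + [t for t in snps
                                   if t != tag and linked(pop, tag, t)]
                bins[pop] = members
                unbinned[pop] = [s for s in snps if s not in members]
        selected.append((tag, bins))
    return selected
```

The input populations need not share the same set of common SNPs, and the bin attached to a given tag can differ between populations, as noted in the text.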

We note that this algorithm does not require that the different ethnic groups start with the same set of common SNPs. Furthermore, LD patterns may vary between populations so that the set of SNPs binned at each step may differ by ethnic group. We implemented this algorithm in the freely available software mPopTag (http://dir.niehs.nih.gov/direb/mpoptag).

For each of the gene regions resequenced by EGP, we used HapMap data to select tag SNPs. We evaluated these tag SNPs against EGP genotype data by calculating the ‘gene-tagging proportion’, that is, the percent of common EGP SNPs in a gene that are in high LD (ρ²≥0.8) with at least one tag SNP. We investigated a simple strategy to increase gene-tagging proportions by supplementing HapMap tag SNPs. For EGP common SNPs that were not in high LD (ρ² <0.8) with any HapMap tag, we used the greedy algorithm to construct LD bins. The supplemental tag SNPs were chosen either to tag all bins or only multi-SNP bins.
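The gene-tagging proportion defined above reduces to a simple count. The sketch below assumes ρ² values between EGP common SNPs and tag SNPs are available in a dictionary keyed by `frozenset` pairs (a hypothetical layout), and it counts a common SNP that is itself a tag as tagged.

```python
def gene_tagging_proportion(common_snps, tag_snps, r2, threshold=0.8):
    """Fraction of common SNPs in a gene that are a tag SNP or have
    rho^2 >= threshold with at least one tag SNP."""
    if not common_snps:
        raise ValueError("gene has no common SNPs")
    tags = set(tag_snps)

    def tagged(snp):
        return snp in tags or any(
            r2.get(frozenset((snp, t)), 0.0) >= threshold for t in tags)

    return sum(tagged(s) for s in common_snps) / len(common_snps)
```

For a gene with five common SNPs of which three are tagged, the function returns 0.6; the median of such values across genes corresponds to the summary reported in the Results.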

Simulations

EGP gene resequencing often excluded portions of large introns.¹³ HapMap may have SNPs within such unresequenced ‘holes’ and inclusion of these SNPs might improve HapMap gene-tagging proportions.²² To estimate the effect of HapMap SNPs in holes on our estimation of gene-tagging proportion, we simulated genes with and without holes using ENCODE data. First, we simulated HapMap SNPs by randomly sampling common SNPs in ENCODE regions at a density comparable to HapMap. To better approximate HapMap SNPs, we restricted sampling to ‘RS SNP’ (ie SNPs that were in dbSNP before ENCODE resequencing, http://www.hapmap.org/downloads/encode1.html.en). Second, the contiguous region of resequenced and unresequenced segments of each EGP Panel 2 gene was simulated by randomly placing it within ENCODE regions. For simulated genes both with and without holes, we applied the greedy algorithm to select tag SNPs. We then calculated tagging proportions for common SNPs in the resequenced regions. A similar simulation strategy was used to investigate the effect of HapMap SNPs in flanking regions on gene-tagging proportions.

The small sample size of EGP ethnic groups could lead to biased estimates of gene-tagging proportions. We used the coalescent method implemented in COSI software²³ to simulate genotype data of a 50-kb gene for 10 000 individuals in each of four ethnic groups (European, African-American, African, and Asian). Analogous to HapMap, we randomly sampled 90 individuals for each ethnic group and selected tag SNPs for each ethnic group. Finally, for each ethnic group, we compared tagging proportion estimates for a large sample of 1 000 individuals to tagging proportion estimates for a small sample of 24 individuals (EGP sample size).

We also performed simulations using EGP Panel 2 data to evaluate the effect of a small number of HapMap SNPs that were missing from EGP. We randomly sampled a small subset of EGP common SNPs and added them to the set of HapMap SNPs found in EGP. For both these sets of SNPs, we used EGP genotype data to select tag SNPs. We then calculated gene-tagging proportions in each of the ethnic populations for the two tag SNP sets.

Results

For common SNPs (MAF≥0.05), SNP density differs only slightly among EGP Panel 1, EGP Panel 2, and ENCODE (Table 1). On a genome-wide basis, HapMap Release 20 has approximately 45% of the common SNP density found in EGP and ENCODE. We also examined HapMap SNP densities in the specific regions resequenced by EGP, and found that HapMap has a slightly higher SNP density in Panel 1 regions than Panel 2 regions (Table 1). There were 8852 SNPs in EGP Panel 2 that had MAF≥0.05 in one or more of the three ethnic groups within EGP. Of these SNPs, HapMap had genotyped 2710 (31%). Conversely, there were 3073 HapMap SNPs found in EGP Panel 2 resequenced gene regions for the 127 genes that had MAF≥0.05 in one or more of the three ethnic groups within HapMap. Of these SNPs, EGP had genotype information for 2916 (95%).

Table 1 SNP density

We selected tag SNPs using HapMap genotype data for 127 genes in EGP Panel 2, and used EGP resequencing data to evaluate their performance in each of the three ethnic groups. Based on HapMap tag SNPs for each gene in each ethnic group, we found that gene-tagging proportions differed by ethnic group. The median gene-tagging proportions were 48, 78, and 72% for African, Asian, and European groups, respectively (Figure 1). We also investigated our decision to pool the ethnically admixed African-American individuals with the YRI individuals into a single ‘African’ group. We found that median gene-tagging proportions for the African-American, YRI, and pooled ‘African’ groups differed only minimally (data not shown).

In general, EGP resequenced the entire genomic sequence for genes whose size was <30 kb, whereas resequencing of genes >30 kb excluded portions of large introns.¹³ Because HapMap has genotype data on SNPs that were in unresequenced regions and thus were not included in our analysis, HapMap-tagging proportions for genes with unresequenced ‘holes’ may be biased downward.²² Using ENCODE data, we simulated the effect of unresequenced holes on tagging proportions and found little evidence of bias (Figure 2). We also examined whether inclusion of additional SNPs available within HapMap beyond the 3′ and 5′ flanking regions resequenced by EGP would substantially improve gene-tagging proportions. Simulation results suggested inclusion of an additional 5 kb to both flanking regions provides only modest improvement in gene-tagging proportions (Figure 2). Increasing flanking regions to as much as 20 kb provided very little additional improvement and required many more tag SNPs (data not shown).

HapMap SNPs that are not included in EGP could lead to underestimation of the gene-tagging proportions. In total, there were 157 HapMap SNPs that were common in at least one HapMap ethnic group (117, 87, and 105 in African, Asian, and European, respectively), but were not found in EGP. However, the majority of the missed SNPs (72, 67, and 83 in the three ethnic groups, respectively) were in high LD (ρ²≥0.8) with another HapMap SNP that did have a match in EGP. The results of 100 simulations suggest that the 157 HapMap SNPs missing from EGP have minimal effect on gene-tagging proportions and, on average, result in a 2% increase in median tagging proportion in the three ethnic groups.

The small sample size of EGP might bias gene-tagging proportion estimates. We used simulations to compare tagging proportions from a sample size of 24 or 1000. The results of 100 simulations suggest that gene-tagging proportion estimates at EGP sample sizes of 24 individuals have minimal bias (data not shown).

We applied the strategy described in Materials and methods for supplementing the set of HapMap tag SNPs. If supplemental tag SNPs for all untagged LD bins are included, then all gene-tagging proportions are increased to 1.0, but this requires a large number of additional tag SNPs, because there are many LD bins with a single SNP. We therefore considered the more efficient strategy of only adding tag SNPs for untagged multi-SNP LD bins. The results in Figure 3 show that this strategy improves the tagging proportions with a modest increase in the number of tag SNPs. For example, in Europeans, an increase from 792 to 962 tag SNPs (1.3 additional tag SNPs/gene) resulted in an increase in the median gene-tagging proportion from 0.72 to 0.87. In contrast, a total of 1537 SNPs would be required to tag all LD bins.
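The per-gene figure quoted for Europeans follows from dividing the increase in tag SNPs by the 127 Panel 2 genes. A small check, using a hypothetical helper name:

```python
N_GENES = 127  # EGP Panel 2 genes evaluated


def extra_tags_per_gene(tags_before, tags_after, n_genes=N_GENES):
    """Average number of supplemental tag SNPs added per gene."""
    return (tags_after - tags_before) / n_genes


multi_snp_bins_only = extra_tags_per_gene(792, 962)   # about 1.3 per gene
all_bins = extra_tags_per_gene(792, 1537)             # about 5.9 per gene
```

On these numbers, tagging every LD bin, singletons included, would cost roughly four times as many additional SNPs per gene as tagging only the multi-SNP bins.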

For the 391 genes in EGP Panel 1, ethnic-specific data are not available. To investigate the utility of using ethnic-mixed Panel 1 genotype data to pick tag SNPs, we pooled Panel 2 genotype data and compared the results of the Pool against the ethnic-specific standards. Only 47 (0.3%) of the 16 195 correlated SNP pairs (ρ²≥0.8) in the Pool were not correlated in any ethnic group, suggesting there are minimal false-positive correlations. In addition, 90% of correlated SNP pairs in the Pool were correlated in three or more ethnic groups. Thus, ρ² calculated from Pool data appears to reflect LD structure in component populations. The results in Figure 3 show that adding tag SNPs for multi-SNP bins identified from Panel 2 Pool genotype data can improve the gene-tagging proportions of HapMap. For example, in the European sample, an increase from 792 to 1255 tag SNPs (3.6 additional SNPs/gene) resulted in an increase in the median gene-tagging proportion from 0.72 to 0.83. A total of 2228 tag SNPs would be required to include all singleton bins from the Pool, which would increase the median gene-tagging proportion from 0.72 to 0.93.

We applied the generalized greedy algorithm described in Materials and methods to select multipopulation tag SNPs for the three HapMap populations and identified 1674 tag SNPs, of which 959 tagged multi-SNP bins. We evaluated gene-tagging proportions of these 959 tag SNPs in EGP Panel 2 data (Figure 4). The results show that the median gene-tagging proportions were 0.42, 0.74, and 0.74 for African, Asian, and European populations, respectively. Median gene-tagging proportions could be increased to 0.55, 0.8, and 0.78, respectively, by using all 1674 tag SNPs. Using the supplemental tag SNP strategy described in Materials and methods and applying the multipopulation tag SNP algorithm to EGP Panel 2 ethnic-specific data, we added 1219 multi-SNP bin tag SNPs (for a total of 2178 tag SNPs), with resulting median gene-tagging proportions of 0.81, 0.94, and 0.92 for the three populations. We also augmented the multipopulation HapMap tag SNPs with 369 SNPs from Panel 2 Pool multi-SNP bins and obtained median gene-tagging proportions of 0.52, 0.84, and 0.82 for the three populations. Adding all tag SNPs from the Pool to the multipopulation HapMap tags (for a total of 2284 tag SNPs) increased tagging proportions to 0.64, 0.93, and 0.93 in the three populations (Figure 4).

For EGP Panel 2 genes, there were on average approximately 3 nsSNPs per gene. The majority of these nsSNPs (∼82% in non-African groups and ∼72% in the African group) were rare (MAF <0.05). HapMap did not have genotype data on roughly 40% of common and 87% of rare nsSNPs (Table 2). About 30% of the missed common nsSNPs are in high LD with a common SNP in HapMap, but only a very small proportion of rare nsSNPs are in high LD with a common HapMap SNP. Therefore, approximately 70% of common nsSNPs and 15% of all (rare plus common) nsSNPs in EGP can be tagged by a common HapMap SNP. Even if we augment all common HapMap SNPs with all rare nsSNPs in HapMap, only 26% of all nsSNPs in EGP are tagged at ρ²≥0.8. Using this same augmented set of tag SNPs, we found that the multi-marker evaluation method incorporated in Haploview software²⁴ increased the tagging proportion to 30%.
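The percentages in this paragraph can be roughly reproduced from its rounded inputs. The sketch below is a back-of-envelope consistency check, not a reanalysis: it assumes that about 20% of nsSNPs are common (an approximation of the ∼72 to 82% rare figures), ignores the very small proportion of rare nsSNPs in high LD with a common HapMap SNP, and uses hypothetical names throughout.

```python
def tagged_fraction(frac_genotyped, frac_missed_in_high_ld):
    """Fraction captured = genotyped directly + missed but in high LD with a tag."""
    return frac_genotyped + (1 - frac_genotyped) * frac_missed_in_high_ld


FRAC_COMMON = 0.20            # assumed share of nsSNPs that are common
RARE_IN_HAPMAP = 1 - 0.87     # rare nsSNPs genotyped by HapMap

# Common nsSNPs: 60% genotyped, plus 30% of the 40% that were missed.
common_tagged = tagged_fraction(0.60, 0.30)                  # about 0.72

# All nsSNPs tagged by common HapMap SNPs (rare ones contribute ~0 here).
all_tagged = FRAC_COMMON * common_tagged                     # about 0.14

# Adding the rare nsSNPs that HapMap genotyped directly.
augmented = all_tagged + (1 - FRAC_COMMON) * RARE_IN_HAPMAP  # about 0.25
```

With these rounded inputs the three values come out near the reported 70%, 15%, and 26%; the small gaps reflect the rounding and the assumed common fraction rather than any independent estimate.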

Table 2 Number of nonsynonymous SNPs per gene among EGP and EGP-matched HapMap SNPs

Discussion

Using ENCODE data, it has been argued that HapMap has adequate SNP density for whole-genome scans. However, HapMap SNP density may pose a problem for some individual candidate genes. ENCODE regions include fewer than 20 genes, which is an inadequate sample to assess gene-tagging proportions. Using tag SNPs selected from HapMap and applying them to EGP genotype data of 127 genes, we found that tagging proportions were low for nearly half of the genes, particularly when evaluated in African samples.

Our estimation of HapMap-tagging proportions could be biased downward for several reasons. First, EGP did not resequence portions of large introns (holes) and had limited data on flanking regions. We evaluated the possibilities that the inclusion of HapMap SNPs in these regions might improve gene-tagging proportions. Simulations based on ENCODE data suggest that accounting for HapMap SNPs in holes, or in an additional 5 to 20 kb of both 5′ and 3′ flanking sequence, would provide only modest improvements in HapMap gene-tagging proportions for EGP resequenced gene regions. Although inclusion of larger flanking regions might improve gene-tagging proportions, such inclusion might not be cost effective for candidate gene studies. Second, using simulations in EGP Panel 2 data, we evaluated whether the small number of common HapMap SNPs that are missing from EGP affect tagging proportion. Our results suggest that their inclusion would provide minimal improvement in tagging proportions. Finally, we used simulation to investigate the effect of small EGP sample size, but found minimal bias in gene-tagging proportion estimates.

Tagging proportion is a commonly used threshold metric of how well a set of genotyped SNPs captures ungenotyped variants.^{1, 25, 26} However, one must be cautious when using the specified threshold to estimate sample size required for an association study. Because sample size and power to detect a causal variant are not linearly related, merely adjusting sample size requirements by the reciprocal of the threshold is not sufficient to achieve a specified power.²⁶ The summary metric average maximum r² suffers the same problem.²⁶ For a more complete discussion of this issue and a strategy for obtaining more accurate estimates of sample size, the reader should consult the papers of Jorgenson and Witte.^{26, 27}

HapMap-tagging proportions can be improved by adding supplemental tag SNPs based on ethnic-specific resequencing data. We noted that gene-tagging proportions in Asians and Europeans can be substantially improved by adding a small number of tag SNPs for multi-SNP bins not yet tagged by HapMap. Gene-tagging proportions can also be improved for Africans but, because of the fine-grained LD structure, require many more tag SNPs.

Despite its lack of individual ethnicity information, EGP Panel 1 represents a rich resource of SNP information that might be useful for tag SNP selection. To examine this possibility, we pooled the EGP Panel 2 genotype data and used these data as a surrogate for EGP Panel 1. This is an appropriate surrogate given that EGP Panel 1 and 2 are similar in the number of people from different ethnic groups, gene function, gene size, and SNP density (http://www.genome.utah.edu/genesnps). EGP Panel 2 Pool data showed that the vast majority of SNP pairs that were correlated in the Pool were also correlated in each of several ethnic groups. We show that tag SNPs from EGP Panel 2 Pool data can augment HapMap tag SNPs to increase gene-tagging proportions, although these tags are not as efficient as tag SNPs from ethnic-specific data. We conclude from these results that the detailed resequencing information on 391 EGP Panel 1 genes may be used to select tag SNPs for multiple populations.

An advantage of multipopulation tag SNPs is that a single set of SNPs can be genotyped in multiple populations, rather than developing different panels of tag SNPs for each population. A disadvantage is that the number of tag SNPs will be larger than the number of tag SNPs in any one ethnic-specific group. Furthermore, the SNPs in an LD bin defined by a multipopulation tag SNP can differ by population, and thus multiple LD or haplotype maps are still needed to analyze the genotype data of multipopulation tag SNPs.

HapMap contained a much higher percentage of rare nsSNPs in EGP Panel 1 gene regions than in EGP Panel 2 gene regions (Table 2). We believe the difference arises because Panel 1 data were deposited into dbSNP before HapMap, whereas Panel 2 data were deposited after the creation of HapMap. Thus, Panel 2 data are likely to be representative of the vast majority of genes that have not been extensively resequenced. Although HapMap was not intended to provide coverage for rare SNPs, efforts were made to genotype all known nsSNPs.¹ Similar to the results of Barrett et al,¹¹ our results based on EGP Panel 2 data suggest that HapMap provides information for the majority of common nsSNPs, but is of marginal value for the 80% of nsSNPs that are rare. Using a multimarker tag SNP evaluation method provided some improvement in nsSNP-tagging proportion, but the majority of nsSNPs remained untagged.

HapMap is a resource for whole-genome association studies,¹ and is also a powerful resource for other uses, including the selection of tag SNPs for candidate gene studies. But because HapMap is an incomplete catalog of SNPs, its successful use in candidate gene studies depends on whether this incomplete catalog provides adequate information on the untyped SNPs in genes. Our results suggest HapMap-tagging proportions are low for many genes and that investigators may wish to augment HapMap SNPs with additional SNPs from gene resequencing data. Both EGP Panel 1 and Panel 2 provide a rich SNP resource for a large selection of genes that investigators can use to supplement HapMap tag SNPs. As the cost of resequencing continues to decline, such resources will be available on a larger selection of genes.

References

The International HapMap Consortium: a haplotype map of the human genome. Nature 2005; 437: 1299–1320.
The International HapMap Consortium: The International HapMap Project. Nature 2003; 426: 789–796.
Olden K, Wilson S : Environmental health and genomics: visions and implications. Nat Rev Genet 2000; 1: 149–153.
Article CAS Google Scholar
Collins FS, Brooks LD, Chakravarti A : A DNA polymorphism discovery resource for research on human genetic variation. Genome Res 1998; 8: 1229–1231.
Article CAS Google Scholar
Evans DM, Cardon LR : A comparison of linkage disequilibrium patterns and estimated population recombination rates across multiple populations. Am J Hum Genet 2005; 76: 681–687.
Article CAS Google Scholar
Ahmadi KR, Weale ME, Xue ZY et al: A single-nucleotide polymorphism tagging set for human drug metabolism and transport. Nat Genet 2005; 37: 84–89.
Article CAS Google Scholar
Howie BN, Carlson CS, Rieder MJ, Nickerson DA : Efficient selection of tagging single-nucleotide polymorphisms in multiple populations. Hum Genet 2006; 120: 58–68.
Article Google Scholar
Ireland J, Carlton VE, Falkowski M et al: Large-scale characterization of public database SNPs causing non-synonymous changes in three ethnic groups. Hum Genet 2006; 119: 75–83.
Article Google Scholar
Cargill M, Altshuler D, Ireland J et al: Characterization of single-nucleotide polymorphisms in coding regions of human genes. Nat Genet 1999; 22: 231–238.
Article CAS Google Scholar
Glatt CE, DeYoung JA, Delgado S et al: Screening a large reference sample to identify very low frequency sequence variants: comparisons between two genes. Nat Genet 2001; 27: 435–438.
Article CAS Google Scholar
Barrett JC, Cardon LR : Evaluating coverage of genome-wide association studies. Nat Genet 2006; 38: 659–662.
Article CAS Google Scholar
Taylor JA, Xu Z, Kaplan NL, Morris RW : How well do HapMap haplotypes identify Common haplotypes of genes? A comparison with haplotypes of 334 genes resequenced in the Environmental Genome Project. Cancer Epidemiol Biomarkers Prev 2006; 15: 133–137.
Article CAS Google Scholar
Livingston RJ, von Niederhausern A, Jegga AG et al: Pattern of sequence variation across 213 environmental response genes. Genome Res 2004; 14: 1821–1831.
Gabriel SB, Schaffner SF, Nguyen H et al: The structure of haplotype blocks in the human genome. Science 2002; 296: 2225–2229.
Weir B : Genetic Data Analysis II. Sunderland, MA: Sinauer Associates, 1996.
Weir BS, Cockerham CC : Complete characterization of disequilibrium at two loci. Mathematical Evolutionary Theory 1989.
Weir BS : Inferences about linkage disequilibrium. Biometrics 1979; 35: 235–254.
Schaid DJ : Linkage disequilibrium testing when linkage phase is unknown. Genetics 2004; 166: 505–512.
Weir BS, Hill WG, Cardon LR : Allelic association patterns for a dense SNP map. Genet Epidemiol 2004; 27: 442–450.
Meng Z, Zaykin DV, Xu CF, Wagner M, Ehm MG : Selection of genetic markers for association analyses, using linkage disequilibrium and haplotypes. Am J Hum Genet 2003; 73: 115–130.
Carlson CS, Eberle MA, Rieder MJ, Yi Q, Kruglyak L, Nickerson DA : Selecting a maximally informative set of single-nucleotide polymorphisms for association analyses using linkage disequilibrium. Am J Hum Genet 2004; 74: 106–120.
Pe’er I, Chretien YR, de Bakker PI, Barrett JC, Daly MJ, Altshuler DM : Biases and reconciliation in estimates of linkage disequilibrium in the human genome. Am J Hum Genet 2006; 78: 588–603.
Schaffner SF, Foo C, Gabriel S, Reich D, Daly MJ, Altshuler D : Calibrating a coalescent simulation of human genome sequence variation. Genome Res 2005; 15: 1576–1583.
Barrett JC, Fry B, Maller J, Daly MJ : Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics 2005; 21: 263–265.
Zeggini E, Rayner W, Morris AP et al: An evaluation of HapMap sample size and tagging SNP performance in large-scale empirical and simulated data sets. Nat Genet 2005; 37: 1320–1322.
Jorgenson E, Witte JS : Coverage and power in genomewide association studies. Am J Hum Genet 2006; 78: 884–888.
Jorgenson E, Witte JS : A gene-centric approach to genome-wide association studies. Nat Rev Genet 2006; 7: 885–891.

Acknowledgements

We thank the anonymous reviewers, whose comments and suggestions greatly improved the manuscript. This research was supported by the Intramural Research Program of the NIH, National Institute of Environmental Health Sciences.

Author information

Authors and Affiliations

Epidemiology Branch, National Institute of Environmental Health Sciences, Research Triangle Park, NC, USA
Zongli Xu & Jack A Taylor
Biostatistics Branch, National Institute of Environmental Health Sciences, Research Triangle Park, NC, USA
Norman L Kaplan
Laboratory of Molecular Carcinogenesis, National Institute of Environmental Health Sciences, Research Triangle Park, NC, USA
Jack A Taylor

Corresponding author

Correspondence to Jack A Taylor.

About this article

Cite this article

Xu, Z., Kaplan, N. & Taylor, J. Tag SNP selection for candidate gene association studies using HapMap and gene resequencing data. Eur J Hum Genet 15, 1063–1070 (2007). https://doi.org/10.1038/sj.ejhg.5201875

Received: 12 July 2006
Revised: 30 March 2007
Accepted: 11 May 2007
Published: 13 June 2007
Issue Date: October 2007
DOI: https://doi.org/10.1038/sj.ejhg.5201875
