Methods to impute missing genotypes for population data

  • Original Investigation
  • Human Genetics

Abstract

In large-scale genotyping studies, most subjects commonly have some missing genetic markers, even when the missing rate per marker is low. This compromises association analyses, because varying numbers of subjects contribute to single-marker and multi-marker analyses. In this paper, we consider eight methods to infer missing genotypes: two haplotype reconstruction methods (local expectation maximization, EM, and fastPHASE); two k-nearest neighbor methods (the original k-nearest neighbor, KNN, and a weighted k-nearest neighbor, wtKNN); three linear regression methods (backward variable selection, LM.back; least angle regression, LM.lars; and singular value decomposition, LM.svd); and a regression tree, Rtree. We evaluate their accuracy using single nucleotide polymorphism (SNP) data from the HapMap project under a variety of conditions and parameters. We find that fastPHASE has the lowest error rates across different analysis panels and marker densities. LM.lars gives slightly less accurate estimates of missing genotypes than fastPHASE but performs better than the other methods.
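To make the nearest-neighbor idea and the masking-based accuracy check concrete, here is a minimal sketch in Python. It is not the authors' implementation. The genotype coding (0/1/2 minor-allele counts, `None` for missing), the distance measure, the inverse-distance weighting, and all function names are assumptions chosen for illustration, and the toy data are invented.

```python
from collections import defaultdict


def distance(a, b):
    """Mean absolute genotype difference over markers observed in both subjects."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return float("inf")  # no overlap: treat as maximally distant
    return sum(abs(x - y) for x, y in shared) / len(shared)


def knn_impute(genotypes, k=3, weighted=False):
    """Fill each missing genotype by a vote among the k nearest subjects.

    With weighted=True, each neighbor's vote is scaled by 1 / distance
    (an assumed weighting, used here only to illustrate the idea).
    """
    imputed = [row[:] for row in genotypes]
    for i, row in enumerate(genotypes):
        for j, g in enumerate(row):
            if g is not None:
                continue
            # Candidate donors: other subjects with marker j observed.
            donors = [(distance(row, other), other[j])
                      for m, other in enumerate(genotypes)
                      if m != i and other[j] is not None]
            donors.sort(key=lambda d: d[0])
            votes = defaultdict(float)
            for dist, value in donors[:k]:
                votes[value] += 1.0 / (dist + 1e-6) if weighted else 1.0
            if votes:
                # Ties are broken in favor of the smallest genotype code.
                imputed[i][j] = max(sorted(votes), key=votes.get)
    return imputed


def error_rate(truth, imputed, masked):
    """Fraction of masked genotypes that were imputed incorrectly."""
    wrong = sum(truth[i][j] != imputed[i][j] for i, j in masked)
    return wrong / len(masked)


# Toy data (invented): rows are subjects, columns are SNPs.
truth = [
    [0, 0, 1, 2],
    [0, 0, 1, 2],
    [2, 2, 1, 0],
    [2, 2, 1, 0],
    [0, 0, 1, 2],
]
masked = [(0, 3), (2, 0)]  # positions hidden to mimic missing genotypes
observed = [row[:] for row in truth]
for i, j in masked:
    observed[i][j] = None

knn_error = error_rate(truth, knn_impute(observed, k=3), masked)
wtknn_error = error_rate(truth, knn_impute(observed, k=3, weighted=True), masked)
print(knn_error, wtknn_error)
```

The evaluation follows the general pattern of hiding known genotypes, imputing them, and counting mismatches. The result on this toy matrix says nothing about the relative performance of the methods on real data.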

Acknowledgments

The authors are grateful to the three anonymous reviewers for their constructive suggestions. This work was supported by the U.S. Public Health Service, National Institutes of Health, contract grant number GM065450.

Author information

Corresponding author

Correspondence to Daniel J. Schaid.

About this article

Cite this article

Yu, Z., Schaid, D.J. Methods to impute missing genotypes for population data. Hum Genet 122, 495–504 (2007). https://doi.org/10.1007/s00439-007-0427-y

  • DOI: https://doi.org/10.1007/s00439-007-0427-y
