Methods to impute missing genotypes for population data

Yu, Zhaoxia; Schaid, Daniel J.

doi:10.1007/s00439-007-0427-y

Original Investigation
Published: 13 September 2007

Volume 122, pages 495–504, (2007)
Human Genetics

Zhaoxia Yu¹ & Daniel J. Schaid²

Abstract

In large-scale genotyping studies, most subjects commonly have some missing genetic markers, even when the missing rate per marker is low. This compromises association analyses, because varying numbers of subjects contribute to single-marker and multi-marker analyses. In this paper, we consider eight methods to infer missing genotypes: two haplotype reconstruction methods (local expectation maximization, EM, and fastPHASE); two k-nearest neighbor methods (the original k-nearest neighbor, KNN, and a weighted k-nearest neighbor, wtKNN); three linear regression methods (backward variable selection, LM.back; least angle regression, LM.lars; and singular value decomposition, LM.svd); and a regression tree, Rtree. We evaluate their accuracy using single nucleotide polymorphism (SNP) data from the HapMap project under a variety of conditions and parameters. We find that fastPHASE has the lowest error rates across different analysis panels and marker densities. LM.lars gives slightly less accurate estimates of missing genotypes than fastPHASE, but performs better than the other methods.
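To make the k-nearest neighbor idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes genotypes are coded as minor-allele counts (0, 1, 2), uses a mean absolute difference over jointly typed markers as the distance, and fills each missing entry by a vote among the k closest subjects; the `weighted` option, which weights votes by inverse distance, is only one plausible reading of a weighted KNN. The function names, the toy data, and the masking-based error rate are all hypothetical and chosen for illustration.

```python
from collections import defaultdict

MISSING = None  # hypothetical code for an untyped genotype


def distance(a, b):
    """Mean absolute genotype difference over markers typed in both subjects."""
    shared = [(x, y) for x, y in zip(a, b) if x is not MISSING and y is not MISSING]
    if not shared:
        return float("inf")  # no shared markers: treat as maximally distant
    return sum(abs(x - y) for x, y in shared) / len(shared)


def knn_impute(genotypes, k=3, weighted=False):
    """Return a copy of `genotypes` with missing entries filled by a k-nearest-neighbor vote."""
    imputed = [row[:] for row in genotypes]
    for i, row in enumerate(genotypes):
        for j, g in enumerate(row):
            if g is not MISSING:
                continue
            # Candidate donors: every other subject that is typed at marker j.
            donors = [
                (distance(row, other), other[j])
                for n, other in enumerate(genotypes)
                if n != i and other[j] is not MISSING
            ]
            donors.sort(key=lambda d: d[0])
            votes = defaultdict(float)
            for dist, donor_genotype in donors[:k]:
                # Assumed weighting scheme: inverse distance (small constant avoids division by zero).
                votes[donor_genotype] += 1.0 / (dist + 1e-6) if weighted else 1.0
            if votes:
                # Ties go to the smallest genotype code so the result is deterministic.
                imputed[i][j] = max(sorted(votes), key=votes.get)
    return imputed


def error_rate(truth, imputed, masked):
    """Fraction of masked entries whose imputed genotype differs from the truth."""
    wrong = sum(1 for i, j in masked if imputed[i][j] != truth[i][j])
    return wrong / len(masked)


# Toy data (fabricated for illustration only): six subjects, four markers.
truth = [
    [0, 0, 1, 2],
    [0, 0, 1, 2],
    [0, 0, 1, 2],
    [2, 2, 1, 0],
    [2, 2, 1, 0],
    [2, 2, 1, 0],
]
masked = [(0, 3), (4, 0)]  # entries hidden to mimic missing genotypes
observed = [row[:] for row in truth]
for i, j in masked:
    observed[i][j] = MISSING

imputed = knn_impute(observed, k=3)
imputed_weighted = knn_impute(observed, k=3, weighted=True)
rate = error_rate(truth, imputed, masked)
```

Masking known genotypes and comparing the imputed values against them, as `error_rate` does here, is a common way to estimate imputation accuracy; the details of the evaluation reported in the article are not reproduced in this sketch.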

Acknowledgments

The authors are grateful to the three anonymous reviewers for their constructive suggestions. This work was supported by the U.S. Public Health Service, National Institutes of Health, contract grant number GM065450.

Author information

Authors and Affiliations

Department of Statistics, University of California, Irvine, CA, 92697, USA
Zhaoxia Yu
Harwick 775, Division of Biostatistics, Department of Health Sciences Research, Mayo Clinic College of Medicine, 200 First Street, SW, Rochester, MN, 55905, USA
Daniel J. Schaid

Corresponding author

Correspondence to Daniel J. Schaid.

About this article

Cite this article

Yu, Z., Schaid, D.J. Methods to impute missing genotypes for population data. Hum Genet 122, 495–504 (2007). https://doi.org/10.1007/s00439-007-0427-y

Received: 19 June 2007
Accepted: 30 August 2007
Published: 13 September 2007
Issue Date: December 2007
DOI: https://doi.org/10.1007/s00439-007-0427-y
