From association to prediction: statistical methods for the dissection and selection of complex traits in plants

https://doi.org/10.1016/j.pbi.2015.02.010

Highlights

  • Controlling for spurious associations in statistical models is essential.

  • Computationally efficient approaches are critical for large data sets.

  • Statistical genetic models that predict phenotypes help accelerate breeding cycles.

  • Co-evolution between statistical models and sequencing and phenotyping advances is ongoing.

Quantification of genotype-to-phenotype associations is central to many scientific investigations, yet the ability to obtain consistent results may be thwarted without appropriate statistical analyses. Models for association can consider confounding effects in the materials and complex genetic interactions. Selecting optimal models enables accurate evaluation of associations between marker loci and numerous phenotypes including gene expression. Significant improvements in QTL discovery via association mapping and acceleration of breeding cycles through genomic selection are two successful applications of models using genome-wide markers. Given recent advances in genotyping and phenotyping technologies, further refinement of these approaches is needed to model genetic architecture more accurately and run analyses in a computationally efficient manner, all while accounting for false positives and maximizing statistical power.

Introduction

The ability to understand the genetic basis of biological phenomena requires the development and refinement of statistical approaches that identify associations between genetic markers and phenotypes. As genotypic data become easier to obtain, a more complete and complex landscape of genetic diversity is revealed, requiring more comprehensive statistical models that have the capacity to distinguish true biological associations from false positives arising from population structure and linkage disequilibrium (LD). Applications for the association of genetic markers with phenotypes have also expanded. In particular, the development of models predicting phenotypic values with genome-wide data sets can significantly accelerate plant breeding cycles. Although computationally efficient approaches have been developed to fit these models, the full potential of such analytical approaches has yet to be realized.

Genotyping technologies built on next-generation sequencing (NGS) have made it possible to obtain an unprecedented number of genetic markers and have enabled accurate quantification of gene expression at low cost. However, NGS has introduced new statistical challenges that need to be addressed, including consideration of rare alleles, the increase in computational time, and appropriate treatment of the multiple testing problem. In this review, we provide guidelines and suggestions for conducting an optimal statistical analysis that addresses these issues. Additionally, we highlight the most promising statistical approaches that should become more widespread in the plant genetics research community.
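As a rough illustration of the multiple testing problem, the following sketch compares a Bonferroni correction with the Benjamini-Hochberg false discovery rate procedure on a handful of marker p-values. The choice of these two procedures is an assumption made for illustration (the text above does not name a specific correction), and the p-values and function names are hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Indices of tests significant after Bonferroni correction."""
    m = len(p_values)
    return [i for i, p in enumerate(p_values) if p <= alpha / m]

def benjamini_hochberg(p_values, alpha=0.05):
    """Indices of tests declared significant at FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k with p_(k) <= (k / m) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    # Every test ranked at or below the cutoff is rejected.
    return sorted(order[:cutoff_rank])

# Hypothetical p-values from ten single-marker tests.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

print("Bonferroni:", bonferroni(p))
print("Benjamini-Hochberg:", benjamini_hochberg(p))
```

With genome-wide marker sets the number of tests runs into the hundreds of thousands or more, so the gap between a family-wise threshold and an FDR threshold can matter considerably for statistical power.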

Section snippets

Genome-wide association study (GWAS): current practices and future perspectives

The two most widely used data sets for studying genetic variability are those derived from biparental crosses (e.g. F2 populations or recombinant inbred lines [RILs], Figure 1a) and those that consist of individuals assembled with complex relatedness or geographical origin (e.g. diversity populations, Figure 1b). These two data sets differ with respect to the number of recombination events they capture (reviewed in [1]). While biparental crosses only exploit recent recombination events that

Genomic selection to improve plant breeding practices

One of the most exciting new approaches is genomic selection (GS) [56•, 57•], which uses statistical models to predict which individuals will have optimal phenotypes based on marker data. This approach holds great promise for plant breeding efforts because it can theoretically achieve multiple cycles of selection in the amount of time required to complete one cycle using phenotype-based selection approaches [58•, 59•]. In contrast to the statistical models typically used in a GWAS, GS models
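The idea of predicting phenotypes from marker data can be sketched with ridge regression, one commonly used family of genomic prediction models. This is a minimal sketch under stated assumptions, not the models discussed in the review: the training lines, phenotypes, candidate genotypes, penalty value `lam`, and the helper names `solve`, `fit_ridge`, and `predict` are all hypothetical.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x

def fit_ridge(markers, phenotypes, lam=1.0):
    """Estimate shrunken marker effects from centered markers and phenotypes."""
    n, p = len(markers), len(markers[0])
    col_means = [sum(row[j] for row in markers) / n for j in range(p)]
    y_mean = sum(phenotypes) / n
    xc = [[row[j] - col_means[j] for j in range(p)] for row in markers]
    yc = [y - y_mean for y in phenotypes]
    # Normal equations with a ridge penalty: (X'X + lam * I) b = X'y
    xtx = [[sum(xc[i][j] * xc[i][k] for i in range(n)) + (lam if j == k else 0.0)
            for k in range(p)] for j in range(p)]
    xty = [sum(xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    return y_mean, col_means, solve(xtx, xty)

def predict(model, genotype):
    """Predicted phenotype for one genotype (list of allele counts)."""
    y_mean, col_means, effects = model
    return y_mean + sum(e * (g - m) for e, g, m in zip(effects, genotype, col_means))

# Hypothetical training set: 8 inbred lines scored at 4 markers (0 or 2 copies).
train_markers = [
    [0, 0, 0, 0], [0, 0, 2, 2], [0, 2, 0, 2], [0, 2, 2, 0],
    [2, 0, 0, 2], [2, 0, 2, 0], [2, 2, 0, 0], [2, 2, 2, 2],
]
train_phenotypes = [10.0, 10.0, 11.0, 11.0, 12.0, 12.0, 13.0, 13.0]

model = fit_ridge(train_markers, train_phenotypes, lam=2.0)

# Rank unphenotyped candidates by predicted value (hypothetical genotypes).
candidates = {"cand1": [2, 2, 2, 0], "cand2": [0, 2, 2, 2], "cand3": [0, 0, 0, 2]}
ranked = sorted(candidates, key=lambda k: predict(model, candidates[k]), reverse=True)
print(ranked)
```

All markers are fitted jointly and their effects are shrunk toward zero rather than tested one at a time, which is what allows candidates to be ranked before they are phenotyped.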

Maximizing genomic information from latest sequencing technologies

The advent of NGS has made it possible to affordably obtain markers with genome-wide coverage in any species using techniques such as genotyping-by-sequencing (GBS) [67], restriction site associated DNA sequencing (RAD-seq) [68], target region resequencing, or whole-genome resequencing. Previously, markers for a GWAS were obtained through high-density SNP arrays such as diversity array technology (DArT) [69], Illumina Infinium or Affymetrix, which still dominate human studies

Concluding remarks

The current statistical approaches that associate genetic markers with phenotypes are sufficient to identify genomic signals of moderate to large effect and to predict phenotypic values accurately enough for GS to make significant genetic gains in plant breeding programs. However, current studies usually lack the statistical power and mapping resolution to detect causative variants controlling a trait and could be prone to false positives. It is therefore important to understand the shortcomings of

References and recommended reading

Papers of particular interest, published within the period of review, have been highlighted as:

  • • of special interest

  • •• of outstanding interest

Acknowledgements

This research was supported by National Science Foundation awards #0922493 and #1238142, University of Illinois starting funds (A.E.L.), and Cornell University startup funds (M.A.G.). We acknowledge the assistance of Patrick J. Brown in providing insight into NGS technologies and Christine H. Diepenbrock for invaluable feedback on the GS section.

References (72)

  • S. Myles et al.

    Association mapping: critical considerations shift from genotyping to experimental design

    Plant Cell

    (2009)
  • S. Atwell et al.

    Genome-wide association study of 107 phenotypes in Arabidopsis thaliana inbred lines

    Nature

    (2010)
  • S.A. Flint-Garcia et al.

    Maize association population: a high-resolution platform for quantitative trait locus dissection

    Plant J

    (2005)
  • M.C. Romay et al.

    Comprehensive genotyping of the USA national maize inbred seed bank

    Genome Biol

    (2013)
  • M.D. McMullen et al.

    Genetic properties of the maize nested association mapping population

    Science

    (2009)
  • J.M. Yu et al.

    Genetic design and statistical power of nested association mapping in maize

    Genetics

    (2008)
  • E.S. Buckler et al.

    The genetic architecture of maize flowering time

    Science

    (2009)
  • C.S. Zhu et al.

    Status and prospects of association mapping in plants

    Plant Genome

    (2008)
  • J.K. Pritchard et al.

    Association mapping in structured populations

    Am J Hum Genet

    (2000)
  • A.L. Price et al.

    Principal components analysis corrects for stratification in genome-wide association studies

    Nat Genet

    (2006)
  • T. Jombart et al.

    Discriminant analysis of principal components: a new method for the analysis of genetically structured populations

    BMC Genet

    (2010)
  • A.E. Lipka et al.

    GAPIT: genome association and prediction integrated tool

    Bioinformatics

    (2012)
  • B.A. Loiselle et al.

    Spatial genetic-structure of a tropical understory shrub, Psychotria officinalis (Rubiaceae)

    Am J Bot

    (1995)
  • A.J. Garris et al.

    Genetic structure and diversity in Oryza sativa L.

    Genetics

    (2005)
  • J.M. Yu et al.

    A unified mixed-model method for association mapping that accounts for multiple levels of relatedness

    Nat Genet

    (2006)
  • C. Lippert et al.

    FaST linear mixed models for genome-wide association studies

    Nat Methods

    (2011)
  • Q. Wang et al.

    A SUPER powerful method for genome wide association study

    PLOS ONE

    (2014)
  • H.M. Kang et al.

    Efficient control of population structure in model organism association mapping

    Genetics

    (2008)
  • H.M. Kang et al.

    Variance component model to account for sample structure in genome-wide association studies

    Nat Genet

    (2010)
  • Z.W. Zhang et al.

    Mixed linear model approach adapted for genome-wide association studies

    Nat Genet

    (2010)
  • M. Li et al.

    Enrichment of statistical power for genome-wide association studies

    BMC Biol

    (2014)
  • P.J. Bradbury et al.

    TASSEL: software for association mapping of complex traits in diverse samples

    Bioinformatics

    (2007)
  • X. Zhou et al.

    Genome-wide efficient mixed-model analysis for association studies

    Nat Genet

    (2012)
  • A. Platt et al.

    Conditions under which genome-wide association studies will be positively misleading

    Genetics

    (2010)
  • S.P. Dickson et al.

    Rare variants create synthetic genome-wide associations

    PLoS Biol

    (2010)
  • G. Orozco et al.

    Synthetic associations in the context of genome-wide association scan signals

    Hum Mol Genet

    (2010)
  • V. Segura et al.

    An efficient multi-locus mixed-model approach for genome-wide association studies in structured populations

    Nat Genet

    (2012)
  • P. Pérez et al.

    Genome-wide regression and prediction with the BGLR statistical package

    Genetics

    (2014)
  • T.A. Manolio et al.

    Finding the missing heritability of complex diseases

    Nature

    (2009)
  • A. Gyenesei et al.

    High-throughput analysis of epistasis in genome-wide association studies with BiForce

    Bioinformatics

    (2012)
  • B. Goudey et al.

    GWIS – model-free, fast and exhaustive search for epistatic interactions in case-control GWAS

    BMC Genomics

    (2013)
  • T. Schupbach et al.

    FastEpistasis: a high performance computing solution for quantitative trait epistasis

    Bioinformatics

    (2010)
  • R.J. Wisser et al.

    Multivariate analysis of maize disease resistances suggests a pleiotropic genetic basis and implicates a GST gene

    Proc Natl Acad Sci USA

    (2011)
  • X. Zhou et al.

    Efficient multivariate linear mixed model algorithms for genome-wide association studies

    Nat Methods

    (2014)
  • A. Korte et al.

    A mixed-model approach for genome-wide association studies of correlated traits in structured populations

    Nat Genet

    (2012)
  • C.R. Moore et al.

    High-throughput computer vision introduces the time axis to a quantitative trait map of a plant growth response

    Genetics

    (2013)