From association to prediction: statistical methods for the dissection and selection of complex traits in plants
Introduction
Understanding the genetic basis of biological phenomena requires developing and refining statistical approaches that identify associations between genetic markers and phenotypes. As genotypic data become easier to obtain, they reveal a more complete and complex landscape of genetic diversity, which calls for more comprehensive statistical models that can distinguish true biological associations from false positives arising from population structure and linkage disequilibrium (LD). Applications of marker-phenotype association have also expanded. In particular, models that predict phenotypic values from genome-wide data sets can significantly accelerate plant breeding cycles. Although computationally efficient approaches have been developed to fit these models, their full potential has yet to be realized.
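As a rough illustration of how a genome-wide prediction model can be fit efficiently when markers far outnumber individuals, the sketch below fits a ridge regression on all markers at once. It is a minimal example under stated assumptions, not a method taken from this review: the function names, the 0/1/2 genotype coding, the simulated data, and the fixed shrinkage value are all hypothetical choices made for illustration.

```python
import numpy as np


def fit_ridge_marker_effects(genotypes, phenotypes, shrinkage):
    """Estimate all marker effects jointly by ridge regression.

    genotypes  : (n_individuals, n_markers) array, assumed coded 0/1/2
    phenotypes : (n_individuals,) array of trait values
    shrinkage  : positive ridge penalty (hypothetical tuning value)
    """
    X = np.asarray(genotypes, dtype=float)
    y = np.asarray(phenotypes, dtype=float)
    marker_means = X.mean(axis=0)
    Xc = X - marker_means              # center each marker column
    intercept = y.mean()
    # Dual form: solve an n x n system rather than an n_markers x n_markers one.
    n = X.shape[0]
    weights = np.linalg.solve(Xc @ Xc.T + shrinkage * np.eye(n), y - intercept)
    effects = Xc.T @ weights
    return intercept, marker_means, effects


def predict_phenotypes(genotypes, intercept, marker_means, effects):
    """Predict trait values for new individuals from their marker data."""
    X = np.asarray(genotypes, dtype=float)
    return intercept + (X - marker_means) @ effects


if __name__ == "__main__":
    # Simulated data only; none of these numbers come from the review.
    rng = np.random.default_rng(0)
    n_lines, n_markers, n_causal = 100, 500, 10
    genotypes = rng.integers(0, 3, size=(n_lines, n_markers))
    true_effects = np.zeros(n_markers)
    causal = rng.choice(n_markers, n_causal, replace=False)
    true_effects[causal] = rng.normal(size=n_causal)
    phenotypes = genotypes @ true_effects + rng.normal(size=n_lines)

    train, test = np.arange(80), np.arange(80, 100)
    model = fit_ridge_marker_effects(genotypes[train], phenotypes[train], shrinkage=50.0)
    predicted = predict_phenotypes(genotypes[test], *model)
    print("correlation of predicted and observed:",
          np.corrcoef(predicted, phenotypes[test])[0, 1])
```

Solving the n × n system is algebraically equivalent to the usual marker-dimension ridge solution but cheaper when markers outnumber individuals. In practice the penalty would be chosen by cross-validation or estimated from variance components rather than fixed by hand.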
Genotyping technologies built on next-generation sequencing (NGS) have made it possible to obtain an unprecedented number of genetic markers and have enabled accurate quantification of gene expression at low cost. However, NGS has introduced new statistical challenges that need to be addressed, including consideration of rare alleles, the increase in computational time, and appropriate treatment of the multiple testing problem. In this review, we provide guidelines and suggestions for conducting an optimal statistical analysis that addresses these issues. Additionally, we highlight the most promising statistical approaches that should become more widespread in the plant genetics research community.
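To make the multiple testing problem concrete, the sketch below contrasts two standard corrections applied to a vector of per-marker P-values: a Bonferroni threshold and the Benjamini-Hochberg step-up procedure for controlling the false discovery rate. The function names and the example P-values are hypothetical, and the choice of these two procedures is an assumption made for illustration rather than a recommendation drawn from this review.

```python
import numpy as np


def bonferroni_significant(p_values, alpha=0.05):
    """Flag tests whose P-value is at or below alpha divided by the number of tests."""
    p = np.asarray(p_values, dtype=float)
    return p <= alpha / p.size


def benjamini_hochberg(p_values, fdr=0.05):
    """Flag tests rejected by the Benjamini-Hochberg step-up procedure."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                      # indices of P-values, smallest first
    critical = fdr * np.arange(1, m + 1) / m   # rank-specific thresholds
    passing = np.nonzero(p[order] <= critical)[0]
    rejected = np.zeros(m, dtype=bool)
    if passing.size:
        # Reject every test ranked at or below the largest passing rank.
        rejected[order[: passing[-1] + 1]] = True
    return rejected


if __name__ == "__main__":
    # Made-up P-values for ten markers, purely for illustration.
    p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
    print("Bonferroni:", bonferroni_significant(p_values))
    print("Benjamini-Hochberg:", benjamini_hochberg(p_values))
```

With hundreds of thousands of NGS-derived markers, the Bonferroni threshold becomes very stringent, which is one reason false discovery rate procedures are often considered alongside it. Both treat the tests as if they were independent or nearly so, an assumption that LD between neighboring markers can violate.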
Section snippets
Genome-wide association study (GWAS): current practices and future perspectives
The two most widely used data sets for studying genetic variability are those derived from biparental crosses (e.g. F2 populations or recombinant inbred lines [RILs], Figure 1a) and those that consist of individuals assembled with complex relatedness or geographical origin (e.g. diversity populations, Figure 1b). These two data sets differ with respect to the number of recombination events they capture (reviewed in [1••]). While biparental crosses only exploit recent recombination events that
Genomic selection to improve plant breeding practices
One of the most exciting new approaches is genomic selection (GS) [56•, 57•], which uses statistical models to predict which individuals will have optimal phenotypes based on marker data. This approach holds great promise for plant breeding efforts because it can theoretically achieve multiple cycles of selection in the amount of time required to complete one cycle using phenotype-based selection approaches [58•, 59•]. In contrast to the statistical models typically used in a GWAS, GS models
Maximizing genomic information from latest sequencing technologies
The advent of NGS has made it possible to affordably obtain markers with genome-wide coverage from any species, using techniques such as genotyping-by-sequencing (GBS) [67•], restriction site associated DNA sequencing (RAD-seq) [68], target region resequencing, or whole-genome resequencing. Previously, markers for a GWAS were obtained through high-density SNP arrays such as diversity array technology (DArT) [69], Illumina Infinium or Affymetrix, which still dominate human studies
Concluding remarks
The current statistical approaches that associate genetic markers with phenotypes are sufficient to identify genomic signals of moderate to large effect and to predict phenotypic values accurately enough for GS to make significant genetic gains in plant breeding programs. However, current studies usually lack the statistical power and mapping resolution to detect the causative variants controlling a trait and can be prone to false positives. It is therefore important to understand the shortcomings of
References and recommended reading
Papers of particular interest, published within the period of review, have been highlighted as:
• of special interest
•• of outstanding interest
Acknowledgements
This research was supported by National Science Foundation awards #0922493 and #1238142, University of Illinois starting funds (A.E.L.), and Cornell University startup funds (M.A.G.). We acknowledge the assistance of Patrick J. Brown in providing insight into NGS technologies and Christine H. Diepenbrock for invaluable feedback on the GS section.
References (72)
- et al. Association mapping: critical considerations shift from genotyping to experimental design. Plant Cell (2009)
- et al. Genome-wide association study of 107 phenotypes in Arabidopsis thaliana inbred lines. Nature (2010)
- et al. Maize association population: a high-resolution platform for quantitative trait locus dissection. Plant J (2005)
- et al. Comprehensive genotyping of the USA national maize inbred seed bank. Genome Biol (2013)
- et al. Genetic properties of the maize nested association mapping population. Science (2009)
- et al. Genetic design and statistical power of nested association mapping in maize. Genetics (2008)
- et al. The genetic architecture of maize flowering time. Science (2009)
- et al. Status and prospects of association mapping in plants. Plant Genome (2008)
- et al. Association mapping in structured populations. Am J Hum Genet (2000)
- et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet (2006)
- Discriminant analysis of principal components: a new method for the analysis of genetically structured populations. BMC Genet
- GAPIT: genome association and prediction integrated tool. Bioinformatics
- Spatial genetic-structure of a tropical understory shrub, Psychotria officinalis (Rubiaceae). Am J Bot
- Genetic structure and diversity in Oryza sativa L. Genetics
- A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nat Genet
- FaST linear mixed models for genome-wide association studies. Nat Methods
- A SUPER powerful method for genome wide association study. PLOS ONE
- Efficient control of population structure in model organism association mapping. Genetics
- Variance component model to account for sample structure in genome-wide association studies. Nat Genet
- Mixed linear model approach adapted for genome-wide association studies. Nat Genet
- Enrichment of statistical power for genome-wide association studies. BMC Biol
- TASSEL: software for association mapping of complex traits in diverse samples. Bioinformatics
- Genome-wide efficient mixed-model analysis for association studies. Nat Genet
- Conditions under which genome-wide association studies will be positively misleading. Genetics
- Rare variants create synthetic genome-wide associations. PLoS Biol
- Synthetic associations in the context of genome-wide association scan signals. Hum Mol Genet
- An efficient multi-locus mixed-model approach for genome-wide association studies in structured populations. Nat Genet
- Genome-wide regression & prediction with the BGLR statistical package. Genetics
- Finding the missing heritability of complex diseases. Nature
- High-throughput analysis of epistasis in genome-wide association studies with BiForce. Bioinformatics
- GWIS – model-free, fast and exhaustive search for epistatic interactions in case-control GWAS. BMC Genomics
- FastEpistasis: a high performance computing solution for quantitative trait epistasis. Bioinformatics
- Multivariate analysis of maize disease resistances suggests a pleiotropic genetic basis and implicates a GST gene. Proc Natl Acad Sci USA
- Efficient multivariate linear mixed model algorithms for genome-wide association studies. Nat Methods
- A mixed-model approach for genome-wide association studies of correlated traits in structured populations. Nat Genet
- High-throughput computer vision introduces the time axis to a quantitative trait map of a plant growth response. Genetics