From association to prediction: statistical methods for the dissection and selection of complex traits in plants

doi:10.1016/j.pbi.2015.02.010

Current Opinion in Plant Biology

Volume 24, April 2015, Pages 110-118

Highlights

• Controlling for spurious associations in statistical models is essential.
• Computationally efficient approaches are critical for large data sets.
• Statistical genetic models that predict phenotypes help accelerate breeding cycles.
• Co-evolution between statistical models and sequencing and phenotyping advances is ongoing.

Quantification of genotype-to-phenotype associations is central to many scientific investigations, yet consistent results are difficult to obtain without appropriate statistical analyses. Association models can account for confounding effects in the study materials and for complex genetic interactions. Selecting optimal models enables accurate evaluation of associations between marker loci and numerous phenotypes, including gene expression. Significant improvements in QTL discovery via association mapping and acceleration of breeding cycles through genomic selection are two successful applications of models using genome-wide markers. Given recent advances in genotyping and phenotyping technologies, further refinement of these approaches is needed to model genetic architecture more accurately and to run analyses in a computationally efficient manner, all while controlling false positives and maximizing statistical power.

Introduction

The ability to understand the genetic basis of biological phenomena requires the development and refinement of statistical approaches that identify associations between genetic markers and phenotypes. As genotypic data become easier to obtain, a more complete and complex landscape of genetic diversity is revealed, requiring more comprehensive statistical models that have the capacity to distinguish true biological associations from false positives arising from population structure and linkage disequilibrium (LD). Applications for the association of genetic markers with phenotypes have also expanded. In particular, the development of models predicting phenotypic values with genome-wide data sets can significantly accelerate plant breeding cycles. Although computationally efficient approaches have been developed to fit these models, the full potential of such analytical approaches has yet to be realized.

Genotyping technologies built on next-generation sequencing (NGS) have made it possible to obtain an unprecedented number of genetic markers and have enabled accurate quantification of gene expression at low cost. However, NGS has introduced new statistical challenges that need to be addressed, including consideration of rare alleles, the increase in computational time, and appropriate treatment of the multiple testing problem. In this review, we provide guidelines and suggestions for conducting an optimal statistical analysis that addresses these issues. Additionally, we highlight the most promising statistical approaches that should become more widespread in the plant genetics research community.
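The multiple testing problem raised above can be illustrated with two standard corrections applied to a vector of per-marker p-values: the Bonferroni correction and the Benjamini-Hochberg false discovery rate procedure. This is a minimal sketch, not the procedure recommended in the review; the function names and the example p-values are hypothetical and chosen only for illustration.

```python
def bonferroni(p_values, alpha=0.05):
    """Flag tests whose p-value is at most alpha divided by the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Flag tests declared significant at false discovery rate alpha.

    Sort the p-values, find the largest rank k (1-based) such that
    p_(k) <= (k / m) * alpha, and reject every test ranked k or lower.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    significant = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            significant[idx] = True
    return significant


if __name__ == "__main__":
    # Hypothetical p-values for ten markers (illustration only).
    p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
    print("Bonferroni:", sum(bonferroni(p)), "markers flagged")
    print("Benjamini-Hochberg:", sum(benjamini_hochberg(p)), "markers flagged")
```

Bonferroni controls the probability of any false positive and becomes very strict as the number of markers grows, whereas Benjamini-Hochberg controls the expected proportion of false positives among the flagged markers. Both treat the tests as a simple list and ignore LD between markers.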

Section snippets

Genome-wide association study (GWAS): current practices and future perspectives

The two most widely used data sets for studying genetic variability are those derived from biparental crosses (e.g. F₂ populations or recombinant inbred lines [RILs], Figure 1a) and those that consist of individuals assembled with complex relatedness or geographical origin (e.g. diversity populations, Figure 1b). These two data sets differ with respect to the number of recombination events they capture (reviewed in [1••]). While biparental crosses only exploit recent recombination events that

Genomic selection to improve plant breeding practices

One of the most exciting new approaches is genomic selection (GS) [56•, 57•], which uses statistical models to predict which individuals will have optimal phenotypes based on marker data. This approach holds great promise for plant breeding efforts because it can theoretically achieve multiple cycles of selection in the amount of time required to complete one cycle using phenotype-based selection approaches [58•, 59•]. In contrast to the statistical models typically used in a GWAS, GS models
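The idea of predicting phenotypes from marker data can be sketched with ridge regression, in which all marker effects are estimated jointly and shrunk toward zero. Ridge regression is used here as one common choice of prediction model; the snippet above does not say which model the authors have in mind. The function names, the 0/1/2 genotype coding, the toy data, and the penalty value are all assumptions made for illustration, and the solver is a dependency-free stand-in for a numerical library.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are not modified.
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def fit_ridge(markers, phenotypes, penalty):
    """Estimate marker effects by ridge regression on centered data.

    penalty must be positive so that the system is always solvable.
    """
    n, p = len(markers), len(markers[0])
    col_means = [sum(row[j] for row in markers) / n for j in range(p)]
    y_mean = sum(phenotypes) / n
    x = [[row[j] - col_means[j] for j in range(p)] for row in markers]
    y = [v - y_mean for v in phenotypes]
    # Normal equations: (X'X + penalty * I) effects = X'y
    xtx = [[sum(x[i][j] * x[i][k] for i in range(n)) + (penalty if j == k else 0.0)
            for k in range(p)] for j in range(p)]
    xty = [sum(x[i][j] * y[i] for i in range(n)) for j in range(p)]
    effects = solve_linear_system(xtx, xty)
    return effects, col_means, y_mean


def predict(markers, effects, col_means, y_mean):
    """Predict phenotypes for genotyped individuals without phenotype records."""
    return [y_mean + sum((row[j] - col_means[j]) * effects[j]
                         for j in range(len(effects)))
            for row in markers]


if __name__ == "__main__":
    # Hypothetical training set: six lines, two markers coded 0/1/2.
    train_markers = [[0, 2], [1, 0], [2, 1], [0, 0], [2, 2], [1, 1]]
    train_phenotypes = [9.0, 11.0, 11.5, 10.0, 11.0, 10.5]
    effects, means, mu = fit_ridge(train_markers, train_phenotypes, penalty=1.0)
    # Rank unphenotyped candidates by their predicted values.
    candidates = [[1, 2], [2, 0]]
    print(predict(candidates, effects, means, mu))
```

In practice the number of markers far exceeds the number of individuals, and the penalty is usually chosen by cross-validation or estimated from variance components rather than fixed by hand.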

Maximizing genomic information from latest sequencing technologies

The advent of NGS has made it possible to affordably obtain markers with genome-wide coverage in any species, using techniques such as genotyping-by-sequencing (GBS) [67•], restriction site associated DNA sequencing (RAD-seq) [68], target region resequencing, or whole-genome resequencing. Previously, markers for a GWAS were obtained through high-density SNP arrays such as diversity array technology (DArT) [69], Illumina Infinium or Affymetrix, which still dominate human studies

Concluding remarks

The current statistical approaches that associate genetic markers with phenotypes are sufficient to identify genomic signals of moderate to large effect and to predict phenotypic values accurately enough for GS to make significant genetic gains in plant breeding programs. However, current studies usually lack the statistical power and mapping resolution to detect causative variants controlling a trait and could be prone to false positives. It is therefore important to understand the shortcomings of

References and recommended reading

Papers of particular interest, published within the period of review, have been highlighted as:

• of special interest
•• of outstanding interest

Acknowledgements

This research was supported by National Science Foundation awards #0922493 and #1238142, University of Illinois starting funds (A.E.L.), and Cornell University startup funds (M.A.G.). We acknowledge the assistance of Patrick J. Brown in providing insight into NGS technologies and Christine H. Diepenbrock for invaluable feedback on the GS section.

References (72)

S. Myles et al. Association mapping: critical considerations shift from genotyping to experimental design. Plant Cell (2009)
S. Atwell et al. Genome-wide association study of 107 phenotypes in Arabidopsis thaliana inbred lines. Nature (2010)
S.A. Flint-Garcia et al. Maize association population: a high-resolution platform for quantitative trait locus dissection. Plant J (2005)
M.C. Romay et al. Comprehensive genotyping of the USA national maize inbred seed bank. Genome Biol (2013)
M.D. McMullen et al. Genetic properties of the maize nested association mapping population. Science (2009)
J.M. Yu et al. Genetic design and statistical power of nested association mapping in maize. Genetics (2008)
E.S. Buckler et al. The genetic architecture of maize flowering time. Science (2009)
C.S. Zhu et al. Status and prospects of association mapping in plants. Plant Genome (2008)
J.K. Pritchard et al. Association mapping in structured populations. Am J Hum Genet (2000)
A.L. Price et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet (2006)
T. Jombart et al. Discriminant analysis of principal components: a new method for the analysis of genetically structured populations. BMC Genet (2010)
A.E. Lipka et al. GAPIT: genome association and prediction integrated tool. Bioinformatics (2012)
B.A. Loiselle et al. Spatial genetic-structure of a tropical understory shrub, Psychotria officinalis (Rubiaceae). Am J Bot (1995)
A.J. Garris et al. Genetic structure and diversity in Oryza sativa L. Genetics (2005)
J.M. Yu et al. A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nat Genet (2006)
C. Lippert et al. FaST linear mixed models for genome-wide association studies. Nat Methods (2011)
Q. Wang et al. A SUPER powerful method for genome wide association study. PLOS ONE (2014)
H. Kang et al. Efficient control of population structure in model organism association mapping. Genetics (2008)
H.M. Kang et al. Variance component model to account for sample structure in genome-wide association studies. Nat Genet (2010)
Z.W. Zhang et al. Mixed linear model approach adapted for genome-wide association studies. Nat Genet (2010)
M. Li et al. Enrichment of statistical power for genome-wide association studies. BMC Biol (2014)
P.J. Bradbury et al. TASSEL: software for association mapping of complex traits in diverse samples. Bioinformatics (2007)
X. Zhou et al. Genome-wide efficient mixed-model analysis for association studies. Nat Genet (2012)
A. Platt et al. Conditions under which genome-wide association studies will be positively misleading. Genetics (2010)
S.P. Dickson et al. Rare variants create synthetic genome-wide associations. PLoS Biol (2010)
G. Orozco et al. Synthetic associations in the context of genome-wide association scan signals. Hum Mol Genet (2010)
V. Segura et al. An efficient multi-locus mixed-model approach for genome-wide association studies in structured populations. Nat Genet (2012)
P. Pérez et al. Genome-wide regression & prediction with the BGLR statistical package. Genetics (2014)
T.A. Manolio et al. Finding the missing heritability of complex diseases. Nature (2009)
A. Gyenesei et al. High-throughput analysis of epistasis in genome-wide association studies with BiForce. Bioinformatics (2012)
B. Goudey et al. GWIS – model-free, fast and exhaustive search for epistatic interactions in case-control GWAS. BMC Genomics (2013)
T. Schupbach et al. FastEpistasis: a high performance computing solution for quantitative trait epistasis. Bioinformatics (2010)
R.J. Wisser et al. Multivariate analysis of maize disease resistances suggests a pleiotropic genetic basis and implicates a GST gene. Proc Natl Acad Sci USA (2011)
X. Zhou et al. Efficient multivariate linear mixed model algorithms for genome-wide association studies. Nat Methods (2014)
A. Korte et al. A mixed-model approach for genome-wide association studies of correlated traits in structured populations. Nat Genet (2012)
C.R. Moore et al. High-throughput computer vision introduces the time axis to a quantitative trait map of a plant growth response. Genetics (2013)

