Skip to main content

Advertisement

Log in

Fast and efficient correction for population stratification in multi-locus genome-wide association studies

  • Original Paper
  • Published:
Genetica Aims and scope Submit manuscript

Abstract

Reducing false discoveries caused by population stratification (PS) has always been a challenge in genome-wide association studies (GWAS). The current literature established several single marker approaches including genomic control (GC), EIGENSTRAT and generalized linear mixed model association test (GMMAT) and multi-marker methods such as LASSO mixed model (LASSOMM). However, the single-marker methods require prespecifying an arbitrary p value threshold in the selection process, likely resulting in suboptimal precision or recall. On the other hand, it appears that LASSOMM is extremely computationally intensive and may not suitable for large-scale GWAS. In this paper, we proposed a simple multi-marker approach (PCA-LASSO) combining principal component analysis (PCA) and least absolute shrinkage and selection operator (LASSO). We utilize PCA to correct for the confounding effects of PS and LASSO with built-in cross-validation for a data-driven selection. Compared to the current single-marker approaches, the proposed PCA-LASSO provides optimal balance between precision and recall, and consequently superior F1 scores. Similarly, compared to LASSOMM, PCA-LASSO markedly increases the precision while minimizing the loss of recall, and therefore improves the overall F1 score in presence of PS. More importantly, PCA-LASSO drastically reduces the computational time by > 1000 times when compared to LASSOMM. We applied PCA-LASSO to a real dataset of Alzheimer’s disease and successfully identified SNP rs429358 (Gene APOE4) which has been widely reported to be associated with the onset and elevated risk of Alzheimer’s disease. In conclusion, PCA-LASSO is a simple, fast, but accurate approach for GWAS in presence of latent PS.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3

Similar content being viewed by others

References

  • Bhatnagar SR et al (2020) Simultaneous SNP selection and adjustment for population structure in high dimensional prediction models. PLOS Genet 16(5):e1008766

    Article  CAS  Google Scholar 

  • Chen H et al (2016) Control for population structure and relatedness for binary traits in genetic association studies via logistic mixed models. Am J Human Genet 98(4):653–666

    Article  CAS  Google Scholar 

  • Devlin B, Roeder K (1999) Genomic control for association studies. Biometrics 55(4):997–1004

    Article  CAS  Google Scholar 

  • Freedman ML et al (2004) Assessing the impact of population stratification on genetic association studies. Nat Genet 36(4):388–393

    Article  CAS  Google Scholar 

  • Genin E et al (2011) APOE and Alzheimer disease: a major gene with semi-dominant inheritance. Mol Psychiatry 16(9):903–907

    Article  CAS  Google Scholar 

  • He H et al (2011) Effect of population stratification analysis on false-positive rates for common and rare variants. BMC Proc 5(9):S116

    Article  Google Scholar 

  • Hellwege JN et al (2017) Population stratification in genetic association studies. Curr Protoc Human Genet. https://doi.org/10.1002/cphg.48

    Article  Google Scholar 

  • Kamboh MI et al (2012) Genome-wide association study of Alzheimer’s disease. Transl Psychiatry 2(5):e117–e117

    Article  CAS  Google Scholar 

  • Price AL et al (2006) Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 38(8):904–909

    Article  CAS  Google Scholar 

  • Rakitsch B et al (2013) A lasso multi-marker mixed model for association mapping with population structure correction. Bioinformatics (oxford, England) 29(2):206–214

    Article  CAS  Google Scholar 

  • St George-Hyslop PH (2000) Genetic factors in the genesis of Alzheimer’s disease. Ann N Y Acad Sci 924(1):1–7

    Article  CAS  Google Scholar 

Download references

Acknowledgements

This work was partially supported by the Natural Science Foundation of Anhui Province (No. 2008085MA09) and the Doctoral research funding of Anhui Medical University (No. XJ201710).

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Yaning Yang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix

Genomic control

Let Xn × m denote the genotype matrix with n samples and m candidate markers. The Cochran-Armitage Trend test association analysis between the phenotype (denoted as Y) and the j-th SNP (\({X}_{j.col}\) represents the j-th column of genotype matrix X) is

$$T_{j} = n \times R_{j}^{2} ,$$

where \(R_{j} = corr\left( {Y,X_{j \cdot col} } \right)\) is the correlation. Under the null hypothesis of no association, Tj is supposed to have a \(\chi_{1}^{2}\) distribution which fails to hold true in the presence of PS.

In order to correct for the biased distribution of Tj caused by latent population structure, GC was proposed with two basic presumptions. The first one is that under the null hypothesis the distribution of Tj with hidden population structure is an inflated λ\(\chi_{1}^{2}\) distribution with an inflation factor λ. For the sake of estimating λ, the second assumption follows as: the vast majority of the analyzed marker is under the null hypothesis where no association exists.

Let \(T_{1} ,T_{2} , \cdots ,T_{m}\) denote all the association test statistics for total m markers and most of them are under the null hypothesis following an inflated λ\(\chi_{1}^{2}\) distribution. Then we have

$$median\,\left( {T_{1} ,T_{2} , \cdots ,T_{m} } \right) = median\,\left( {\lambda \chi_{1}^{2} } \right) = 0.456\lambda.$$

The inflation factor λGC can be estimated by

$$\lambda_{GC} = \,\frac{{median\left( {T_{1} ,T_{2} , \cdots ,T_{m} } \right)}}{0.456}.$$

And the adjusted test statistic by GC is defined as

$$T_{j}^{adj} = \frac{{T_{j} }}{{\lambda_{GC} }}\left( {j = 1,2, \cdots ,m} \right).$$

EIGENSTRAT

PCA has been proven to be effective capturing the underlying PS factors. Subtracting the first few principal components from the original genotype data is based on the hypothesis that they carry the majority of the latent PS information which contributes mostly to the inflation of the correlation test. The EIGENSTRAT method was proposed by Alkes L Price,et al.in 2006 using the PCAs to correct for PS based on this perception. It also showed that since the principal components are constructed to be orthogonal, the results were insensitive to the number of components inferred.

As we defined in the previous section, Xn × m represents the genotype matrix and Yn × 1 denotes the phenotype vector. The first step of EIGENSTRAT is to perform the SVD process on the genotype matrix X.

$$X_{n \times m} = U_{n \times n} D_{n \times m} V_{m \times m}^{T} .$$

Let Uk denote the first k columns of U supposed to carry the latent information of PS. The second step of EIGENSTRAT is to subtract the ancestry information from the original genotype and phenotype data. For the j-th candidate marker, let \({X}_{j \cdot col}\) (the j-th column of the genotype matrix X) be the genotype and we have

$$X_{j \cdot col}^{adj} = X_{j \cdot col} - U_{k} U_{k}^{T} X_{j \cdot col} ,$$
$$Y^{adj} = Y - U_{k} U_{k}^{T} Y,$$

where \(X_{j \cdot col}^{adj}\) and \(Y^{adj}\) is the adjusted genotype and phenotype vectors. The last step of EIGENSTRAT constructing the association test statistic is to perform the Cochran–Armitage Trend test on the adjusted data. The adjusted correlation is defined as \(R_{j}^{adj} = corr\left( {Y^{adj} ,X_{j \cdot col}^{adj} } \right)\) and the adjusted chi-squared test for testing association between genotype of the j-th marker and the phenotype is

$$T_{j}^{adj} = \left( {n - k - 1} \right) \times \left( {R_{j}^{adj} } \right)^{2} ,$$

where \(T_{j}^{adj}\) is asymptotically distributed as \(\chi_{1}^{2}\) under the null hypothesis of no association.

GMMAT

Using mixed effect models to control the PS is becoming prevalent with more and more computationally effective algorithms being proposed. The GMMAT utilizes the logistic mixed model to fit the original data with latent population structure. Compared to linear mixed models (LMMs), the generalized linear mixed models (GLMMs) are proven to be more accurate for binary traits in GWAS especially in the presence of non-constant residual trait variance (Chen et al. 2016). Inheriting the notations from the previous sections, we use the index \(i\left( {i = 1,2, \cdots ,n} \right)\) to represent the i-th sample and the index \(j\left( {j = 1,2, \cdots ,m} \right)\) for the j-th SNP. For some specific candidate marker, the logistic mixed model for single-SNP association test is

$$\log it\left( {\pi_{i} } \right) = X_{i} \alpha + C_{i}^{T} \beta + b_{i} ,$$

where Xi denotes the i-th sample’s genotype of the candidate marker and α is the fixed genetic effect parameter. Ci is a p × 1 column vector of covariates and β is the p × 1 covariate effect vector including an intercept. \(b = \left( {b_{1} ,b_{2} , \cdots ,b_{n} } \right)^{{\text{T}}}\) is the random effect vector assumed to be distributed as \(N\left( {0,\sum\nolimits_{k = 1}^{K} {\tau_{k} V_{k} } } \right)\) where \(\tau_{k}^{\prime } s\left( {k = 1,2, \cdots ,K} \right)\) are the variance component parameters and \({V}_{k}\) ′s are known \(n\times n\) relatedness matrices. \(\pi_{i} = P\left( {y_{i} = 1\left| {X_{i} ,C_{i} ,b_i} \right.} \right)\) is the probability of the binary phenotype (in this case yi is the disease status of the i-th sample), conditional on the genotype, covariates and random effects.

In order to construct the score test statistic for the null hypothesis \(H_{0} :\alpha = 0\), we need to fit the logistic mixed model under H0, which is the same for all the investigated candidate markers, as

$$\log it\left( {\pi_{i0} } \right) = C_{i}^{\rm T} \beta + b_{i} ,$$

where \(\pi_{i0} = P\left( {y_{i} = 1\left| {C_{i} ,b_{i} } \right.} \right)\). We can use the effective AI-REML algorithm to estimate the parameters though iteration process and obtain the fitted value of \(\widehat{{\uppi }_{i0}}\). For the detailed information of the fitting process, one can refer to the original GMMAT research paper (Chen et al. 2016). The score test statistic for \({H}_{0}\) is

$$T = X^{{\rm T}} \left( {Y - \overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\frown}$}}{\pi } _{0} } \right)$$

where \(X = \left( {X_{1} ,X_{2} , \cdots ,X_{n} } \right)^{\rm T}\) is the n × 1 column vector of genotypes and similarly Y is the phenotype vector. \(\overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\frown}$}}{\pi }_{0} = \left( {\widehat{\pi}_{10} ,\widehat{\pi}_{20} , \cdots ,\widehat{\pi}_{n0} } \right)\) is the fitted value of the conditional phenotype probability under the null logistic model, which is the same for all SNPs.

Simulation results under independent design

We did the same simulation under the independent SNP scenarios with all six methods and the results resemble to the correlation design presented in the previous section. All the aforementioned three methods (GC, EIGENSTRAT, GMMAT) have two features in common: they are all single-marker methods adjusting for the marginal association test for each marker separately, and efficiency of all the three methods relies on one or several threshold values. As a result, dependent information among different markers is not fully utilized and as it is shown, the precision and recall level of the testing procedures are significantly affected by the choice of the threshold. It can be anticipated that a strict threshold will result in a high level of precision and a low level of recall, which means that the majority of the tested-positive markers are indeed true positive markers, however a great number of true positive markers with relatively moderate effects miss out in the meantime (see Fig. 4). 

Fig. 4
figure 4

a Independent SNP when no PS is present (λGC = 1). b Independent SNP when weak PS is present (λGC = 1.5). c Independent SNP when modest PS is present (λGC = 2). d Independent SNP when strong PS is present (λGC = 2.5)

Additional simulation of PCA-LASSO with various parameters

We first explored the performance of our method PCA-LASSO changing with the signal/noise ratio. In the simulation, relative risk (R) between the case group and control group of the causal SNPs controls the signal/noise ratio. We simulated five different scenarios with 100 random causal SNPs using 5 different relative risk, R = 1.2, 1.4, 1.6, 1.8 and 2.0, respectively (the rest parameters are the same with the simulation in the original manuscript). We repeated each scenario for 100 times.

Figure 5a is the performance of our proposed method PCA-LASSO under various disease risks. It is apparent that the precision level has minimal changes at different disease risks while recall and F1 score markedly increases with increasing disease risk. We also did the simulations with different sample sizes. We chose five different value of sample sizes (m = 500, 700, 1000, 1200 and 1500) and repeated each scenario for 100 times.

Fig. 5
figure 5

a Performance under different disease risks. b Performance under different sample size. c Performance under different gene number

Figure 5b shows the performance of PCA-LASSO under various scenarios with different sample sizes from 500 to 1500. The precision level is relatively stable when the sample size changes, while recall and F1 score gradually increase with the sample size. For the simulation on the number of genes, we tested on scenarios with 10,000, 20,000, 40,000, 60,000, 80,000 and 100,000 genes respectively and repeated each scenario for 100 times.

Figure 5c presents the results of PCA-LASSO’s performance under situations with different gene numbers from 10,000 to 100,000. Once again, the precision level remains stable regardless of number of genes. However, recall and F1 score drops as the gene number increases.

Sociodemographic characteristics of study population from ADNI

(See Table 4).

Table 4 Sociodemographic characteristics of study population from ADNI

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Liu, R., Yuan, M., Xu, X.S. et al. Fast and efficient correction for population stratification in multi-locus genome-wide association studies. Genetica 149, 313–325 (2021). https://doi.org/10.1007/s10709-021-00129-3

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10709-021-00129-3

Keywords

Navigation