Abstract
Reducing false discoveries caused by population stratification (PS) has always been a challenge in genome-wide association studies (GWAS). The current literature established several single marker approaches including genomic control (GC), EIGENSTRAT and generalized linear mixed model association test (GMMAT) and multi-marker methods such as LASSO mixed model (LASSOMM). However, the single-marker methods require prespecifying an arbitrary p value threshold in the selection process, likely resulting in suboptimal precision or recall. On the other hand, it appears that LASSOMM is extremely computationally intensive and may not suitable for large-scale GWAS. In this paper, we proposed a simple multi-marker approach (PCA-LASSO) combining principal component analysis (PCA) and least absolute shrinkage and selection operator (LASSO). We utilize PCA to correct for the confounding effects of PS and LASSO with built-in cross-validation for a data-driven selection. Compared to the current single-marker approaches, the proposed PCA-LASSO provides optimal balance between precision and recall, and consequently superior F1 scores. Similarly, compared to LASSOMM, PCA-LASSO markedly increases the precision while minimizing the loss of recall, and therefore improves the overall F1 score in presence of PS. More importantly, PCA-LASSO drastically reduces the computational time by > 1000 times when compared to LASSOMM. We applied PCA-LASSO to a real dataset of Alzheimer’s disease and successfully identified SNP rs429358 (Gene APOE4) which has been widely reported to be associated with the onset and elevated risk of Alzheimer’s disease. In conclusion, PCA-LASSO is a simple, fast, but accurate approach for GWAS in presence of latent PS.
Similar content being viewed by others
References
Bhatnagar SR et al (2020) Simultaneous SNP selection and adjustment for population structure in high dimensional prediction models. PLOS Genet 16(5):e1008766
Chen H et al (2016) Control for population structure and relatedness for binary traits in genetic association studies via logistic mixed models. Am J Human Genet 98(4):653–666
Devlin B, Roeder K (1999) Genomic control for association studies. Biometrics 55(4):997–1004
Freedman ML et al (2004) Assessing the impact of population stratification on genetic association studies. Nat Genet 36(4):388–393
Genin E et al (2011) APOE and Alzheimer disease: a major gene with semi-dominant inheritance. Mol Psychiatry 16(9):903–907
He H et al (2011) Effect of population stratification analysis on false-positive rates for common and rare variants. BMC Proc 5(9):S116
Hellwege JN et al (2017) Population stratification in genetic association studies. Curr Protoc Human Genet. https://doi.org/10.1002/cphg.48
Kamboh MI et al (2012) Genome-wide association study of Alzheimer’s disease. Transl Psychiatry 2(5):e117–e117
Price AL et al (2006) Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 38(8):904–909
Rakitsch B et al (2013) A lasso multi-marker mixed model for association mapping with population structure correction. Bioinformatics (oxford, England) 29(2):206–214
St George-Hyslop PH (2000) Genetic factors in the genesis of Alzheimer’s disease. Ann N Y Acad Sci 924(1):1–7
Acknowledgements
This work was partially supported by the Natural Science Foundation of Anhui Province (No. 2008085MA09) and the Doctoral research funding of Anhui Medical University (No. XJ201710).
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix
Genomic control
Let Xn × m denote the genotype matrix with n samples and m candidate markers. The Cochran-Armitage Trend test association analysis between the phenotype (denoted as Y) and the j-th SNP (\({X}_{j.col}\) represents the j-th column of genotype matrix X) is
where \(R_{j} = corr\left( {Y,X_{j \cdot col} } \right)\) is the correlation. Under the null hypothesis of no association, Tj is supposed to have a \(\chi_{1}^{2}\) distribution which fails to hold true in the presence of PS.
In order to correct for the biased distribution of Tj caused by latent population structure, GC was proposed with two basic presumptions. The first one is that under the null hypothesis the distribution of Tj with hidden population structure is an inflated λ\(\chi_{1}^{2}\) distribution with an inflation factor λ. For the sake of estimating λ, the second assumption follows as: the vast majority of the analyzed marker is under the null hypothesis where no association exists.
Let \(T_{1} ,T_{2} , \cdots ,T_{m}\) denote all the association test statistics for total m markers and most of them are under the null hypothesis following an inflated λ\(\chi_{1}^{2}\) distribution. Then we have
The inflation factor λGC can be estimated by
And the adjusted test statistic by GC is defined as
EIGENSTRAT
PCA has been proven to be effective capturing the underlying PS factors. Subtracting the first few principal components from the original genotype data is based on the hypothesis that they carry the majority of the latent PS information which contributes mostly to the inflation of the correlation test. The EIGENSTRAT method was proposed by Alkes L Price,et al.in 2006 using the PCAs to correct for PS based on this perception. It also showed that since the principal components are constructed to be orthogonal, the results were insensitive to the number of components inferred.
As we defined in the previous section, Xn × m represents the genotype matrix and Yn × 1 denotes the phenotype vector. The first step of EIGENSTRAT is to perform the SVD process on the genotype matrix X.
Let Uk denote the first k columns of U supposed to carry the latent information of PS. The second step of EIGENSTRAT is to subtract the ancestry information from the original genotype and phenotype data. For the j-th candidate marker, let \({X}_{j \cdot col}\) (the j-th column of the genotype matrix X) be the genotype and we have
where \(X_{j \cdot col}^{adj}\) and \(Y^{adj}\) is the adjusted genotype and phenotype vectors. The last step of EIGENSTRAT constructing the association test statistic is to perform the Cochran–Armitage Trend test on the adjusted data. The adjusted correlation is defined as \(R_{j}^{adj} = corr\left( {Y^{adj} ,X_{j \cdot col}^{adj} } \right)\) and the adjusted chi-squared test for testing association between genotype of the j-th marker and the phenotype is
where \(T_{j}^{adj}\) is asymptotically distributed as \(\chi_{1}^{2}\) under the null hypothesis of no association.
GMMAT
Using mixed effect models to control the PS is becoming prevalent with more and more computationally effective algorithms being proposed. The GMMAT utilizes the logistic mixed model to fit the original data with latent population structure. Compared to linear mixed models (LMMs), the generalized linear mixed models (GLMMs) are proven to be more accurate for binary traits in GWAS especially in the presence of non-constant residual trait variance (Chen et al. 2016). Inheriting the notations from the previous sections, we use the index \(i\left( {i = 1,2, \cdots ,n} \right)\) to represent the i-th sample and the index \(j\left( {j = 1,2, \cdots ,m} \right)\) for the j-th SNP. For some specific candidate marker, the logistic mixed model for single-SNP association test is
where Xi denotes the i-th sample’s genotype of the candidate marker and α is the fixed genetic effect parameter. Ci is a p × 1 column vector of covariates and β is the p × 1 covariate effect vector including an intercept. \(b = \left( {b_{1} ,b_{2} , \cdots ,b_{n} } \right)^{{\text{T}}}\) is the random effect vector assumed to be distributed as \(N\left( {0,\sum\nolimits_{k = 1}^{K} {\tau_{k} V_{k} } } \right)\) where \(\tau_{k}^{\prime } s\left( {k = 1,2, \cdots ,K} \right)\) are the variance component parameters and \({V}_{k}\) ′s are known \(n\times n\) relatedness matrices. \(\pi_{i} = P\left( {y_{i} = 1\left| {X_{i} ,C_{i} ,b_i} \right.} \right)\) is the probability of the binary phenotype (in this case yi is the disease status of the i-th sample), conditional on the genotype, covariates and random effects.
In order to construct the score test statistic for the null hypothesis \(H_{0} :\alpha = 0\), we need to fit the logistic mixed model under H0, which is the same for all the investigated candidate markers, as
where \(\pi_{i0} = P\left( {y_{i} = 1\left| {C_{i} ,b_{i} } \right.} \right)\). We can use the effective AI-REML algorithm to estimate the parameters though iteration process and obtain the fitted value of \(\widehat{{\uppi }_{i0}}\). For the detailed information of the fitting process, one can refer to the original GMMAT research paper (Chen et al. 2016). The score test statistic for \({H}_{0}\) is
where \(X = \left( {X_{1} ,X_{2} , \cdots ,X_{n} } \right)^{\rm T}\) is the n × 1 column vector of genotypes and similarly Y is the phenotype vector. \(\overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\frown}$}}{\pi }_{0} = \left( {\widehat{\pi}_{10} ,\widehat{\pi}_{20} , \cdots ,\widehat{\pi}_{n0} } \right)\) is the fitted value of the conditional phenotype probability under the null logistic model, which is the same for all SNPs.
Simulation results under independent design
We did the same simulation under the independent SNP scenarios with all six methods and the results resemble to the correlation design presented in the previous section. All the aforementioned three methods (GC, EIGENSTRAT, GMMAT) have two features in common: they are all single-marker methods adjusting for the marginal association test for each marker separately, and efficiency of all the three methods relies on one or several threshold values. As a result, dependent information among different markers is not fully utilized and as it is shown, the precision and recall level of the testing procedures are significantly affected by the choice of the threshold. It can be anticipated that a strict threshold will result in a high level of precision and a low level of recall, which means that the majority of the tested-positive markers are indeed true positive markers, however a great number of true positive markers with relatively moderate effects miss out in the meantime (see Fig. 4).
Additional simulation of PCA-LASSO with various parameters
We first explored the performance of our method PCA-LASSO changing with the signal/noise ratio. In the simulation, relative risk (R) between the case group and control group of the causal SNPs controls the signal/noise ratio. We simulated five different scenarios with 100 random causal SNPs using 5 different relative risk, R = 1.2, 1.4, 1.6, 1.8 and 2.0, respectively (the rest parameters are the same with the simulation in the original manuscript). We repeated each scenario for 100 times.
Figure 5a is the performance of our proposed method PCA-LASSO under various disease risks. It is apparent that the precision level has minimal changes at different disease risks while recall and F1 score markedly increases with increasing disease risk. We also did the simulations with different sample sizes. We chose five different value of sample sizes (m = 500, 700, 1000, 1200 and 1500) and repeated each scenario for 100 times.
Figure 5b shows the performance of PCA-LASSO under various scenarios with different sample sizes from 500 to 1500. The precision level is relatively stable when the sample size changes, while recall and F1 score gradually increase with the sample size. For the simulation on the number of genes, we tested on scenarios with 10,000, 20,000, 40,000, 60,000, 80,000 and 100,000 genes respectively and repeated each scenario for 100 times.
Figure 5c presents the results of PCA-LASSO’s performance under situations with different gene numbers from 10,000 to 100,000. Once again, the precision level remains stable regardless of number of genes. However, recall and F1 score drops as the gene number increases.
Sociodemographic characteristics of study population from ADNI
(See Table 4).
Rights and permissions
About this article
Cite this article
Liu, R., Yuan, M., Xu, X.S. et al. Fast and efficient correction for population stratification in multi-locus genome-wide association studies. Genetica 149, 313–325 (2021). https://doi.org/10.1007/s10709-021-00129-3
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10709-021-00129-3