Fast and efficient correction for population stratification in multi-locus genome-wide association studies

Liu, Rui; Yuan, Min; Xu, Xu Steven; Yang, Yaning

doi:10.1007/s10709-021-00129-3

Fast and efficient correction for population stratification in multi-locus genome-wide association studies

Original Paper
Published: 04 September 2021

Volume 149, pages 313–325, (2021)
Cite this article

Genetica Aims and scope Submit manuscript

Rui Liu¹^na1,
Min Yuan²^na1,
Xu Steven Xu³ &
…
Yaning Yang¹

640 Accesses
1 Citation
2 Altmetric
Explore all metrics

Abstract

Reducing false discoveries caused by population stratification (PS) has always been a challenge in genome-wide association studies (GWAS). The current literature established several single marker approaches including genomic control (GC), EIGENSTRAT and generalized linear mixed model association test (GMMAT) and multi-marker methods such as LASSO mixed model (LASSOMM). However, the single-marker methods require prespecifying an arbitrary p value threshold in the selection process, likely resulting in suboptimal precision or recall. On the other hand, it appears that LASSOMM is extremely computationally intensive and may not suitable for large-scale GWAS. In this paper, we proposed a simple multi-marker approach (PCA-LASSO) combining principal component analysis (PCA) and least absolute shrinkage and selection operator (LASSO). We utilize PCA to correct for the confounding effects of PS and LASSO with built-in cross-validation for a data-driven selection. Compared to the current single-marker approaches, the proposed PCA-LASSO provides optimal balance between precision and recall, and consequently superior F₁ scores. Similarly, compared to LASSOMM, PCA-LASSO markedly increases the precision while minimizing the loss of recall, and therefore improves the overall F₁ score in presence of PS. More importantly, PCA-LASSO drastically reduces the computational time by > 1000 times when compared to LASSOMM. We applied PCA-LASSO to a real dataset of Alzheimer’s disease and successfully identified SNP rs429358 (Gene APOE4) which has been widely reported to be associated with the onset and elevated risk of Alzheimer’s disease. In conclusion, PCA-LASSO is a simple, fast, but accurate approach for GWAS in presence of latent PS.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

GhostKnockoff inference empowers identification of putative causal variants in genome-wide association studies

Article Open access 23 November 2022

Discovering weaker genetic associations guided by known associations

Article Open access 24 February 2020

Novel genetic matching methods for handling population stratification in genome-wide association studies

Article Open access 14 March 2015

References

Bhatnagar SR et al (2020) Simultaneous SNP selection and adjustment for population structure in high dimensional prediction models. PLOS Genet 16(5):e1008766
Article CAS Google Scholar
Chen H et al (2016) Control for population structure and relatedness for binary traits in genetic association studies via logistic mixed models. Am J Human Genet 98(4):653–666
Article CAS Google Scholar
Devlin B, Roeder K (1999) Genomic control for association studies. Biometrics 55(4):997–1004
Article CAS Google Scholar
Freedman ML et al (2004) Assessing the impact of population stratification on genetic association studies. Nat Genet 36(4):388–393
Article CAS Google Scholar
Genin E et al (2011) APOE and Alzheimer disease: a major gene with semi-dominant inheritance. Mol Psychiatry 16(9):903–907
Article CAS Google Scholar
He H et al (2011) Effect of population stratification analysis on false-positive rates for common and rare variants. BMC Proc 5(9):S116
Article Google Scholar
Hellwege JN et al (2017) Population stratification in genetic association studies. Curr Protoc Human Genet. https://doi.org/10.1002/cphg.48
Article Google Scholar
Kamboh MI et al (2012) Genome-wide association study of Alzheimer’s disease. Transl Psychiatry 2(5):e117–e117
Article CAS Google Scholar
Price AL et al (2006) Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 38(8):904–909
Article CAS Google Scholar
Rakitsch B et al (2013) A lasso multi-marker mixed model for association mapping with population structure correction. Bioinformatics (oxford, England) 29(2):206–214
Article CAS Google Scholar
St George-Hyslop PH (2000) Genetic factors in the genesis of Alzheimer’s disease. Ann N Y Acad Sci 924(1):1–7
Article CAS Google Scholar

Download references

Acknowledgements

This work was partially supported by the Natural Science Foundation of Anhui Province (No. 2008085MA09) and the Doctoral research funding of Anhui Medical University (No. XJ201710).

Author information

Rui Liu and Min Yuan have contributed equally to this work.

Authors and Affiliations

Department of Statistics and Finance, University of Science and Technology of China, Hefei, 230026, Anhui, China
Rui Liu & Yaning Yang
Center for Data Science in Health, School of Public Health Administration, Anhui Medical University, Hefei, 230032, Anhui, China
Min Yuan
Genmab US, Inc., Princeton, NJ, 08540, USA
Xu Steven Xu

Authors

Rui Liu
View author publications
You can also search for this author in PubMed Google Scholar
Min Yuan
View author publications
You can also search for this author in PubMed Google Scholar
Xu Steven Xu
View author publications
You can also search for this author in PubMed Google Scholar
Yaning Yang
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Yaning Yang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix

Genomic control

Let X_n × m denote the genotype matrix with n samples and m candidate markers. The Cochran-Armitage Trend test association analysis between the phenotype (denoted as Y) and the j-th SNP (${X}_{j.col}$ represents the j-th column of genotype matrix X) is

$$T_{j} = n \times R_{j}^{2} ,$$

where $R_{j} = corr\left( {Y,X_{j \cdot col} } \right)$ is the correlation. Under the null hypothesis of no association, T_j is supposed to have a $\chi_{1}^{2}$ distribution which fails to hold true in the presence of PS.

In order to correct for the biased distribution of T_j caused by latent population structure, GC was proposed with two basic presumptions. The first one is that under the null hypothesis the distribution of T_j with hidden population structure is an inflated λ$\chi_{1}^{2}$ distribution with an inflation factor λ. For the sake of estimating λ, the second assumption follows as: the vast majority of the analyzed marker is under the null hypothesis where no association exists.

Let $T_{1} ,T_{2} , \cdots ,T_{m}$ denote all the association test statistics for total m markers and most of them are under the null hypothesis following an inflated λ$\chi_{1}^{2}$ distribution. Then we have

$$median\,\left( {T_{1} ,T_{2} , \cdots ,T_{m} } \right) = median\,\left( {\lambda \chi_{1}^{2} } \right) = 0.456\lambda.$$

The inflation factor λ_GC can be estimated by

$$\lambda_{GC} = \,\frac{{median\left( {T_{1} ,T_{2} , \cdots ,T_{m} } \right)}}{0.456}.$$

And the adjusted test statistic by GC is defined as

$$T_{j}^{adj} = \frac{{T_{j} }}{{\lambda_{GC} }}\left( {j = 1,2, \cdots ,m} \right).$$

EIGENSTRAT

PCA has been proven to be effective capturing the underlying PS factors. Subtracting the first few principal components from the original genotype data is based on the hypothesis that they carry the majority of the latent PS information which contributes mostly to the inflation of the correlation test. The EIGENSTRAT method was proposed by Alkes L Price,et al.in 2006 using the PCAs to correct for PS based on this perception. It also showed that since the principal components are constructed to be orthogonal, the results were insensitive to the number of components inferred.

As we defined in the previous section, X_n × m represents the genotype matrix and Y_n × 1 denotes the phenotype vector. The first step of EIGENSTRAT is to perform the SVD process on the genotype matrix X.

$$X_{n \times m} = U_{n \times n} D_{n \times m} V_{m \times m}^{T} .$$

Let U_k denote the first k columns of U supposed to carry the latent information of PS. The second step of EIGENSTRAT is to subtract the ancestry information from the original genotype and phenotype data. For the j-th candidate marker, let ${X}_{j \cdot col}$ (the j-th column of the genotype matrix X) be the genotype and we have

$$X_{j \cdot col}^{adj} = X_{j \cdot col} - U_{k} U_{k}^{T} X_{j \cdot col} ,$$

$$Y^{adj} = Y - U_{k} U_{k}^{T} Y,$$

where $X_{j \cdot col}^{adj}$ and $Y^{adj}$ is the adjusted genotype and phenotype vectors. The last step of EIGENSTRAT constructing the association test statistic is to perform the Cochran–Armitage Trend test on the adjusted data. The adjusted correlation is defined as $R_{j}^{adj} = corr\left( {Y^{adj} ,X_{j \cdot col}^{adj} } \right)$ and the adjusted chi-squared test for testing association between genotype of the j-th marker and the phenotype is

$$T_{j}^{adj} = \left( {n - k - 1} \right) \times \left( {R_{j}^{adj} } \right)^{2} ,$$

where $T_{j}^{adj}$ is asymptotically distributed as $\chi_{1}^{2}$ under the null hypothesis of no association.

GMMAT

Using mixed effect models to control the PS is becoming prevalent with more and more computationally effective algorithms being proposed. The GMMAT utilizes the logistic mixed model to fit the original data with latent population structure. Compared to linear mixed models (LMMs), the generalized linear mixed models (GLMMs) are proven to be more accurate for binary traits in GWAS especially in the presence of non-constant residual trait variance (Chen et al. 2016). Inheriting the notations from the previous sections, we use the index $i\left( {i = 1,2, \cdots ,n} \right)$ to represent the i-th sample and the index $j\left( {j = 1,2, \cdots ,m} \right)$ for the j-th SNP. For some specific candidate marker, the logistic mixed model for single-SNP association test is

$$\log it\left( {\pi_{i} } \right) = X_{i} \alpha + C_{i}^{T} \beta + b_{i} ,$$

where X_i denotes the i-th sample’s genotype of the candidate marker and α is the fixed genetic effect parameter. C_i is a p × 1 column vector of covariates and β is the p × 1 covariate effect vector including an intercept. $b = \left( {b_{1} ,b_{2} , \cdots ,b_{n} } \right)^{{\text{T}}}$ is the random effect vector assumed to be distributed as $N\left( {0,\sum\nolimits_{k = 1}^{K} {\tau_{k} V_{k} } } \right)$ where $\tau_{k}^{\prime } s\left( {k = 1,2, \cdots ,K} \right)$ are the variance component parameters and ${V}_{k}$ ′s are known $n\times n$ relatedness matrices. $\pi_{i} = P\left( {y_{i} = 1\left| {X_{i} ,C_{i} ,b_i} \right.} \right)$ is the probability of the binary phenotype (in this case y_i is the disease status of the i-th sample), conditional on the genotype, covariates and random effects.

In order to construct the score test statistic for the null hypothesis $H_{0} :\alpha = 0$, we need to fit the logistic mixed model under H₀, which is the same for all the investigated candidate markers, as

$$\log it\left( {\pi_{i0} } \right) = C_{i}^{\rm T} \beta + b_{i} ,$$

where $\pi_{i0} = P\left( {y_{i} = 1\left| {C_{i} ,b_{i} } \right.} \right)$. We can use the effective AI-REML algorithm to estimate the parameters though iteration process and obtain the fitted value of $\widehat{{\uppi }_{i0}}$. For the detailed information of the fitting process, one can refer to the original GMMAT research paper (Chen et al. 2016). The score test statistic for ${H}_{0}$ is

$$T = X^{{\rm T}} \left( {Y - \overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\frown}$}}{\pi } _{0} } \right)$$

where $X = \left( {X_{1} ,X_{2} , \cdots ,X_{n} } \right)^{\rm T}$ is the n × 1 column vector of genotypes and similarly Y is the phenotype vector. $\overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\frown}$}}{\pi }_{0} = \left( {\widehat{\pi}_{10} ,\widehat{\pi}_{20} , \cdots ,\widehat{\pi}_{n0} } \right)$ is the fitted value of the conditional phenotype probability under the null logistic model, which is the same for all SNPs.

Simulation results under independent design

We did the same simulation under the independent SNP scenarios with all six methods and the results resemble to the correlation design presented in the previous section. All the aforementioned three methods (GC, EIGENSTRAT, GMMAT) have two features in common: they are all single-marker methods adjusting for the marginal association test for each marker separately, and efficiency of all the three methods relies on one or several threshold values. As a result, dependent information among different markers is not fully utilized and as it is shown, the precision and recall level of the testing procedures are significantly affected by the choice of the threshold. It can be anticipated that a strict threshold will result in a high level of precision and a low level of recall, which means that the majority of the tested-positive markers are indeed true positive markers, however a great number of true positive markers with relatively moderate effects miss out in the meantime (see Fig. 4).

Additional simulation of PCA-LASSO with various parameters

We first explored the performance of our method PCA-LASSO changing with the signal/noise ratio. In the simulation, relative risk (R) between the case group and control group of the causal SNPs controls the signal/noise ratio. We simulated five different scenarios with 100 random causal SNPs using 5 different relative risk, R = 1.2, 1.4, 1.6, 1.8 and 2.0, respectively (the rest parameters are the same with the simulation in the original manuscript). We repeated each scenario for 100 times.

Figure 5a is the performance of our proposed method PCA-LASSO under various disease risks. It is apparent that the precision level has minimal changes at different disease risks while recall and F₁ score markedly increases with increasing disease risk. We also did the simulations with different sample sizes. We chose five different value of sample sizes (m = 500, 700, 1000, 1200 and 1500) and repeated each scenario for 100 times.

Figure 5b shows the performance of PCA-LASSO under various scenarios with different sample sizes from 500 to 1500. The precision level is relatively stable when the sample size changes, while recall and F₁ score gradually increase with the sample size. For the simulation on the number of genes, we tested on scenarios with 10,000, 20,000, 40,000, 60,000, 80,000 and 100,000 genes respectively and repeated each scenario for 100 times.

Figure 5c presents the results of PCA-LASSO’s performance under situations with different gene numbers from 10,000 to 100,000. Once again, the precision level remains stable regardless of number of genes. However, recall and F₁ score drops as the gene number increases.

Sociodemographic characteristics of study population from ADNI

(See Table 4).

Table 4 Sociodemographic characteristics of study population from ADNI

Full size table

Rights and permissions

Reprints and permissions

About this article

Cite this article

Liu, R., Yuan, M., Xu, X.S. et al. Fast and efficient correction for population stratification in multi-locus genome-wide association studies. Genetica 149, 313–325 (2021). https://doi.org/10.1007/s10709-021-00129-3

Download citation

Received: 15 March 2021
Accepted: 26 July 2021
Published: 04 September 2021
Issue Date: December 2021
DOI: https://doi.org/10.1007/s10709-021-00129-3

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Fast and efficient correction for population stratification in multi-locus genome-wide association studies

Abstract

Access this article

Similar content being viewed by others

GhostKnockoff inference empowers identification of putative causal variants in genome-wide association studies

Discovering weaker genetic associations guided by known associations

Novel genetic matching methods for handling population stratification in genome-wide association studies

References

Acknowledgements