ABSTRACT
Genome-wide association studies (GWAS) have been widely used to unravel connections between genetic variants and diseases. Larger sample sizes in GWAS can lead to the discovery of more associations and to more accurate genetic predictors. However, sharing and combining distributed genomic data to increase the sample size is often challenging or even impossible due to privacy concerns and privacy protection laws such as the GDPR. While meta-analysis has been established as an effective approach to combine summary statistics of several GWAS, its accuracy can be attenuated in the presence of cross-study heterogeneity. Here, we present sPLINK (safe PLINK), a user-friendly tool that performs federated GWAS on distributed datasets while preserving the privacy of the data and the accuracy of the results. sPLINK neither exchanges raw data nor relies on summary statistics. Instead, it performs model training in a federated manner, communicating only model parameters between cohorts and a central server. We verify that the federated results from sPLINK are the same as those from aggregated analyses conducted with PLINK. We demonstrate that sPLINK is robust against heterogeneous distributions of phenotypes and confounding factors across cohorts, whereas existing meta-analysis tools lose considerable accuracy in such scenarios. We also show that sPLINK achieves practical runtime, on the order of minutes to hours, and acceptable network bandwidth consumption for chi-square and linear/logistic regression tests. Federated analysis with sPLINK thus has the potential to replace meta-analysis as the gold standard for collaborative GWAS. The user-friendly, readily usable sPLINK tool is available at https://exbio.wzw.tum.de/splink.
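The claim that a federated analysis can reproduce the pooled result exactly can be illustrated with a minimal sketch for the chi-square test. This is not sPLINK's actual protocol or API: it assumes, purely for illustration, that each cohort sends the server a 2x2 allele-by-phenotype count table, and the names `local_counts`, `aggregate`, and `chi_square` as well as the toy genotypes are hypothetical.

```python
from typing import List, Tuple

# Hypothetical 2x2 allele-count table for one SNP:
# (case_minor, case_major, control_minor, control_major)
Counts = Tuple[int, int, int, int]


def local_counts(genotypes: List[int], is_case: List[bool]) -> Counts:
    """Runs inside a cohort. Each genotype is the minor-allele count (0, 1 or 2).

    Only the four aggregated counts are returned; individual-level data stays local.
    """
    case_minor = case_major = ctrl_minor = ctrl_major = 0
    for g, case in zip(genotypes, is_case):
        if case:
            case_minor += g
            case_major += 2 - g
        else:
            ctrl_minor += g
            ctrl_major += 2 - g
    return (case_minor, case_major, ctrl_minor, ctrl_major)


def aggregate(tables: List[Counts]) -> Counts:
    """Runs on the server: element-wise sum of the per-cohort tables."""
    return tuple(sum(column) for column in zip(*tables))


def chi_square(table: Counts) -> float:
    """Pearson chi-square statistic for a 2x2 table (no continuity correction)."""
    a, b, c, d = table
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # degenerate table: an empty row or column
    return n * (a * d - b * c) ** 2 / denom


# Toy data (assumed): two cohorts with different case/control ratios.
cohort1 = ([0, 1, 2, 1], [True, True, False, False])
cohort2 = ([2, 2, 0, 0, 1], [True, False, False, True, False])

# Federated: each cohort shares only its counts, the server sums them.
federated = chi_square(aggregate([local_counts(*cohort1), local_counts(*cohort2)]))

# Pooled reference: what one would compute if the raw data could be combined.
pooled = chi_square(local_counts(cohort1[0] + cohort2[0], cohort1[1] + cohort2[1]))
```

Because the counts are additive across cohorts, `federated` and `pooled` are identical by construction, regardless of how cases and controls are distributed between sites. Iterative tests such as logistic regression would need repeated rounds of parameter exchange, which this sketch does not cover.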
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
* Joint last authors
Detailed comparison between sPLINK and meta-analysis for the heterogeneous confounding factor scenario added; runtime and network bandwidth usage results for sPLINK added; Figure 7 and Figure 8 revised; author list updated; concise comparison between sPLINK and state-of-the-art approaches, including PLINK, meta-analysis, homomorphic-encryption-based methods, and secure-multiparty-computing-based frameworks, added.
3 This value was computed based on the authors’ claim that their runtime depends linearly on the sample size and that it takes 80 days to compute the results for a dataset with 1M individuals and 500k SNPs (ref. 18).