High performance solutions for big-data GWAS
Introduction
Genome-wide association (GWA) analyses are a powerful statistical tool to identify certain locations of significance in the genome: typically, they aim at determining which single-nucleotide polymorphisms (SNPs) influence specific traits of interest. Thanks to these studies, hundreds of SNPs for dozens of complex human diseases and quantitative traits have been discovered [1]. In GWA studies (GWAS), one of the most widely used methods to account for the genetic substructure due to relatedness and population stratification is the variance component approach based on mixed models [2], [3]. While effective, mixed-model-based methods are demanding in terms of both data management and computation. The objective of this research is to make large-scale GWA analyses affordable.
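Computationally, a mixed-model based GWAS on n individuals, m genetic markers (SNPs), and t traits boils down to the solution of the generalized least-squares (GLS) problems

$$ b_{ij} = (X_i^T M_j^{-1} X_i)^{-1} X_i^T M_j^{-1} y_j, \quad 1 \le i \le m, \; 1 \le j \le t, \tag{1} $$

where $X_i \in \mathbb{R}^{n \times p}$ is the design matrix, $M_j \in \mathbb{R}^{n \times n}$ is the covariance matrix, $y_j \in \mathbb{R}^{n}$ contains the vector of observations, and $b_{ij} \in \mathbb{R}^{p}$ quantifies the relation between a variation in an SNP ($X_i$) and a variation in a trait ($y_j$). Furthermore, $M_j$ is a symmetric positive definite (SPD) matrix, and the full rank matrix $X_i$ can be viewed as composed of two parts: $X_i = [X_L \mid X_{R_i}]$, with $X_L \in \mathbb{R}^{n \times (p-1)}$ and $X_{R_i} \in \mathbb{R}^{n \times 1}$, where $X_L$, which contains covariates, is fixed, and only $X_{R_i}$ varies with SNP $i$. Moreover, the relationship among the individuals is taken into account by the covariance matrix $M_j$:

$$ M_j = \sigma_j^2 \left( h_j^2 \Phi + (1 - h_j^2) I \right). \tag{2} $$

Here, $I$ is the identity matrix, the kinship matrix $\Phi \in \mathbb{R}^{n \times n}$ contains the relationship among all studied individuals, and $\sigma_j^2$ and $h_j^2$ are trait-dependent scalar estimates. Finally, in common problem sizes, t is either 1 (single-trait analysis) or in the range of thousands (multi-trait analysis).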
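To make the structure of these problems concrete, the following is a minimal NumPy sketch for a single trait. It assumes the standard GLS form b_i = (X_i^T M^-1 X_i)^-1 X_i^T M^-1 y and a covariance of the form M = sigma2 * (h2 * Phi + (1 - h2) * I). The function name `gls_sweep`, its argument names, and the column-per-SNP layout of `X_R` are illustrative choices and are not taken from the paper. The sketch factors M once and reuses the factor for every SNP; it is not the optimized algorithm described in the following sections.

```python
import numpy as np


def gls_sweep(Phi, X_L, X_R, y, sigma2, h2):
    """Solve the GLS problem for every SNP of one trait.

    Phi : (n, n) SPD kinship matrix
    X_L : (n, p-1) fixed covariates
    X_R : (n, m) one SNP per column (hypothetical layout)
    y   : (n,) trait vector
    Returns B of shape (m, p); row i is the solution for SNP i.
    """
    n = Phi.shape[0]
    M = sigma2 * (h2 * Phi + (1.0 - h2) * np.eye(n))

    # Factor M = L L^T once; the factor is shared by all m problems.
    L = np.linalg.cholesky(M)

    # Whiten everything with L^-1 (a triangular solver would be
    # preferable in practice; np.linalg.solve keeps the sketch short).
    Xl = np.linalg.solve(L, X_L)
    Xr = np.linalg.solve(L, X_R)
    yt = np.linalg.solve(L, y)

    m = X_R.shape[1]
    p = X_L.shape[1] + 1
    B = np.empty((m, p))
    for i in range(m):
        # Whitened design matrix for SNP i: fixed part plus one SNP column.
        Xi = np.column_stack([Xl, Xr[:, i]])
        # Ordinary least squares on the whitened data equals GLS on the original.
        B[i] = np.linalg.solve(Xi.T @ Xi, Xi.T @ yt)
    return B
```

Because only the last column of the design matrix changes with the SNP, the whitening of the fixed covariates and of the trait vector is done once, outside the loop.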
The first reported GWA study dates back to 2005: 146 individuals were genotyped, and their SNPs were analyzed with respect to one trait [4]. Since then, as the catalog of published GWA analyses shows [5], [6], the number of publications has increased steadily, up to 2404 in 2011 and 3307 in 2012. A similar growth can be observed in both the population size and the number of genetic markers per study. More recently, advances in technology have made it affordable to assess “omics” phenotypes in large populations, resulting in the challenge of analyzing thousands, potentially hundreds of thousands, of traits. From the perspective of Eqs. (1), (2), these trends present concrete challenges, especially in terms of memory requirements. As the n × n covariance matrix, the m design matrices, and the t trait vectors compete for main memory, two distinct scenarios arise: (1) if n is small enough for the covariance matrix to fit in main memory, the design matrices and trait vectors are streamed from disk; (2) if the covariance matrix does not fit in main memory, both data and computation have to be distributed over multiple compute nodes. In this paper, we present efficient strategies for utilizing distributed architectures (such as clusters, Cloud-based systems, and supercomputers) to execute single-trait and multi-trait GWA analyses with arbitrarily large population size, number of SNPs, and number of traits.
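Scenario (1), in which SNP and trait data are streamed from disk, is commonly handled by overlapping disk reads with computation. The following is a minimal, standard-library sketch of one such scheme, double buffering: while one block is being processed, the next one is loaded in a background thread, and at most two blocks exist at any time. The names `stream_blocks`, `load_block`, and `process` are hypothetical, and the sketch is not the paper's implementation.

```python
import threading
from queue import Queue


def stream_blocks(load_block, num_blocks, process):
    """Apply `process` to blocks 0..num_blocks-1, loading ahead by one block.

    load_block(k) -> block k (e.g. a slab of SNP columns read from disk)
    process(block) -> result for that block
    Results are returned in block order.
    """
    free = threading.Semaphore(2)  # two buffers in total
    ready = Queue()                # loaded blocks waiting to be processed

    def producer():
        try:
            for k in range(num_blocks):
                free.acquire()             # wait until a buffer is free
                ready.put(load_block(k))
        finally:
            ready.put(None)                # sentinel: nothing more to read

    reader = threading.Thread(target=producer, daemon=True)
    reader.start()

    results = []
    while True:
        block = ready.get()
        if block is None:
            break
        results.append(process(block))
        free.release()                     # this buffer may now be reused
    reader.join()
    return results
```

The semaphore, rather than the queue, bounds memory use: a block counts as occupying a buffer from the moment its load starts until its processing ends.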
Related work. Several freely available libraries exist for performing GWA studies. Among them, we highlight GenABEL, a widely used framework for statistical genomics [7], and FaST-LMM, a high-performance software package targeting single-trait analyses [8]. More recently, Fabregat et al. developed OmicABEL, a package for the GenABEL suite that implements optimized solutions for shared-memory architectures [9], [10]. However, those algorithms do not support distributed-memory computations, and are only applicable when the kinship matrix fits in the local memory of a single node.
Organization of the paper. The rest of this paper is structured as follows. Section 2 is devoted to single-trait GWA analyses (i.e., t = 1): in Section 2.1, the mathematical algorithm is first introduced; in Section 2.2, we discuss the shared-memory implementation GWAS-1D-SMP, an extension to accommodate analyses with an arbitrarily large number of SNPs (large m); then, in Section 2.3, we present GWAS-1D-MPI, a distributed-memory extension for analyses with large population size (large n). Section 3 addresses multi-trait studies (i.e., t > 1): a core algorithm that exploits invariants across multiple traits is discussed in Section 3.1; to allow the solution of problems of arbitrary size (in terms of n, m, and t), we apply out-of-core (Section 3.2) and distributed-memory techniques (Section 3.3), thus yielding GWAS-2D-MPI. Conclusions are drawn in Section 4.
Single-trait GWAS
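We consider Eq. (1) restricted to the study of a single trait y, with a single covariance matrix M:

$$ b_i = (X_i^T M^{-1} X_i)^{-1} X_i^T M^{-1} y, \quad 1 \le i \le m. $$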
Multi-trait GWAS
In an important class of GWAS (analysis of “omics” phenotypes), the studies involve many traits [12], [13], [14], [15]. In this case, the set of generalized least squares problems in Eq. (1) extends into the second dimension j:

$$ b_{ij} = (X_i^T M_j^{-1} X_i)^{-1} X_i^T M_j^{-1} y_j, \quad 1 \le i \le m, \; 1 \le j \le t. $$

This extra dimension is not only reflected in the traits $y_j$, but it also introduces varying matrices $M_j$. Such symmetric positive definite $M_j$'s share the common structure

$$ M_j = \sigma_j^2 \left( h_j^2 \Phi + (1 - h_j^2) I \right), $$

where the so-called kinship
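One standard way to exploit a shared covariance structure across traits can be sketched as follows. The sketch assumes that every covariance matrix has the form M_j = sigma2_j * (h2_j * Phi + (1 - h2_j) * I), so that a single eigendecomposition of Phi diagonalizes all of them at once. Whether this matches the core algorithm of Section 3.1 is not confirmed by this excerpt; the function name `multi_trait_gls` and the data layout are illustrative assumptions.

```python
import numpy as np


def multi_trait_gls(Phi, X_L, X_R, Y, sigma2, h2):
    """GLS solutions for all SNP/trait pairs with one eigendecomposition.

    Phi    : (n, n) SPD kinship matrix
    X_L    : (n, p-1) fixed covariates
    X_R    : (n, m) one SNP per column (hypothetical layout)
    Y      : (n, t) one trait per column
    sigma2 : (t,) per-trait scale estimates
    h2     : (t,) per-trait estimates in [0, 1]
    Returns B of shape (m, t, p).
    """
    # Phi = Z diag(lam) Z^T is computed once and reused for every trait.
    lam, Z = np.linalg.eigh(Phi)

    # Rotations into the eigenbasis are also trait-independent.
    Xl, Xr, Yr = Z.T @ X_L, Z.T @ X_R, Z.T @ Y

    m, t, p = X_R.shape[1], Y.shape[1], X_L.shape[1] + 1
    B = np.empty((m, t, p))
    for j in range(t):
        # In the eigenbasis M_j is diagonal with entries
        # sigma2_j * (h2_j * lam + 1 - h2_j); w holds their inverse square roots.
        w = 1.0 / np.sqrt(sigma2[j] * (h2[j] * lam + 1.0 - h2[j]))
        Xl_w = w[:, None] * Xl
        Xr_w = w[:, None] * Xr
        y_w = w * Yr[:, j]
        for i in range(m):
            Xi = np.column_stack([Xl_w, Xr_w[:, i]])
            B[i, j] = np.linalg.solve(Xi.T @ Xi, Xi.T @ y_w)
    return B
```

With this arrangement the only cubic-cost operation, the eigendecomposition, is independent of both i and j; each trait then costs only diagonal scalings and small p × p solves.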
Conclusion
We presented parallel algorithms for computing genome-wide association studies (GWAS) based on linear mixed models. They address the issue of growing dataset sizes due to the number of studied polymorphisms m, the population size n, and/or the number of traits t.
The first algorithm uses a double-buffering technique in order to process datasets with arbitrarily large numbers of genetic polymorphisms. Compared to other widespread GWAS codes, our shared-memory implementation, GWAS-1D-SMP,
Acknowledgments
Financial support from the Deutsche Forschungsgemeinschaft (German Research Foundation) through Grant GSC 111 and from the Deutsche Telekom Stiftung is gratefully acknowledged. The authors thank Yurii Aulchenko for fruitful discussions on the biological background of GWAS.
References (16)
- et al., Solving sequences of generalized least-squares problems on multi-threaded architectures, Appl. Math. Comput. (AMC) (2014)
- et al., Potential etiologic and functional implications of genome-wide association loci for human diseases and traits, Proc. Natl. Acad. Sci. U.S.A. (2009)
- et al., The use of measured genotype information in the analysis of quantitative phenotypes in man. I. Models and analytical methods, Ann. Hum. Genet. (1986)
- et al., A unified mixed-model method for association mapping that accounts for multiple levels of relatedness, Nat. Genet. (2006)
- et al., Complement factor H polymorphism in age-related macular degeneration, Science (2005)
- L.A. Hindorff, J. MacArthur, J. Morales, H.A. Junkins, P.N. Hall, A.K. Klemm, T.A. Manolio, A catalog of published...
- T.A. Manolio, Published GWAS reports, 2005–6/2012....
- et al., GenABEL: an R library for genome-wide association analysis, Bioinformatics (2007)