Elsevier

Parallel Computing

Volume 42, February 2015, Pages 75-87

High performance solutions for big-data GWAS

https://doi.org/10.1016/j.parco.2014.09.005

Highlights

  • We consider large-scale genome-wide association studies (GWAS) based on mixed models.

  • We address both single-trait and multi-trait GWAS.

  • GWAS with arbitrarily large populations and arbitrarily many SNPs and traits are enabled.

  • Distributed memory architectures, such as Cloud, clusters, and supercomputers, are used.

  • Scalability with respect to problem size and resources is demonstrated.

Abstract

To associate complex traits with genetic polymorphisms, genome-wide association studies process huge datasets involving tens of thousands of individuals genotyped for millions of polymorphisms. When handling these datasets, which exceed the main memory of contemporary computers, one faces two distinct challenges: (1) millions of polymorphisms and thousands of phenotypes come at the cost of hundreds of gigabytes of data, which can only be kept in secondary storage; (2) the relatedness of the test population is represented by a relationship matrix, which, for large populations, can only fit in the combined main memory of a distributed architecture. In this paper, by using distributed resources such as Cloud or clusters, we address both challenges: the genotype and phenotype data are streamed from secondary storage using the double-buffering technique, while the relationship matrix is kept across the main memory of a distributed memory system. With the help of these solutions, we develop separate algorithms for studies involving either a single trait or a multitude of traits. We show that these algorithms sustain high performance and allow the analysis of enormous datasets.

Introduction

Genome-wide association (GWA) analyses are a powerful statistical tool to identify locations of significance in the genome: typically, they aim at determining which single-nucleotide polymorphisms (SNPs) influence specific traits of interest. Thanks to these studies, hundreds of SNPs for dozens of complex human diseases and quantitative traits have been discovered [1]. In GWA studies (GWAS), one of the most widely used methods to account for the genetic substructure due to relatedness and population stratification is the variance component approach based on mixed models [2], [3]. While effective, mixed-model based methods are demanding in terms of both data management and computation. The objective of this research is to make large-scale GWA analyses affordable.

Computationally, a mixed-model based GWAS on $n$ individuals, $m$ genetic markers (SNPs), and $t$ traits boils down to the solution of the $m \times t$ generalized least-squares (GLS) problems

$$b_{ij} := \left(X_i^T M_j^{-1} X_i\right)^{-1} X_i^T M_j^{-1} y_j, \quad \text{with } i = 1, \ldots, m \text{ and } j = 1, \ldots, t, \tag{1}$$

where $X_i \in \mathbb{R}^{n \times p}$ is the design matrix, $M_j \in \mathbb{R}^{n \times n}$ is the covariance matrix, $y_j \in \mathbb{R}^{n}$ contains the vector of observations, and $b_{ij} \in \mathbb{R}^{p}$ quantifies the relation between a variation in an SNP ($X_i$) and a variation in a trait ($y_j$). Furthermore, $M_j$ is a symmetric positive definite (SPD) matrix, and the full rank matrix $X_i$ can be viewed as composed of two parts: $X_i = (X_L \mid X_{R_i})$, with $X_L \in \mathbb{R}^{n \times (p-1)}$ and $X_{R_i} \in \mathbb{R}^{n \times 1}$, where $X_L$, which contains $p-1$ covariates, is fixed, and only $X_{R_i}$ varies with SNP$_i$. Moreover, the relationship among the individuals is taken into account by the covariance matrix $M_j$:

$$M_j = \sigma_j^2 \left(h_j^2 \Phi + (1 - h_j^2) I\right). \tag{2}$$

Here, $I$ is the identity matrix, the kinship matrix $\Phi \in \mathbb{R}^{n \times n}$ contains the relationship among all studied individuals, and $\sigma_j^2$ and $h_j^2$ are trait-dependent scalar estimates. Finally, common problem sizes are: $10^3 \le n \le 10^5$, $2 \le p \le 20$, $10^5 \le m \le 10^8$, and $t$ is either 1 (single-trait analysis) or in the range of thousands (multi-trait analysis).
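To make Eqs. (1) and (2) concrete, the following is a minimal NumPy sketch that solves a single GLS problem for one (SNP, trait) pair via a Cholesky factorization of the covariance matrix. It is an illustration under stated assumptions, not the implementation discussed in this paper: the function names `covariance` and `gls`, the synthetic data, and the parameter values are all hypothetical, and NumPy's general-purpose `solve` is used where an optimized code would call a dedicated triangular solver.

```python
import numpy as np

def covariance(phi, sigma2, h2):
    """Eq. (2): M = sigma^2 * (h^2 * Phi + (1 - h^2) * I)."""
    n = phi.shape[0]
    return sigma2 * (h2 * phi + (1.0 - h2) * np.eye(n))

def gls(X, M, y):
    """Eq. (1) for one pair: b = (X^T M^-1 X)^-1 X^T M^-1 y."""
    L = np.linalg.cholesky(M)      # M = L L^T, valid because M is SPD
    Xw = np.linalg.solve(L, X)     # L^-1 X (whitened design matrix)
    yw = np.linalg.solve(L, y)     # L^-1 y (whitened observations)
    # X^T M^-1 X equals Xw^T Xw, and X^T M^-1 y equals Xw^T yw.
    return np.linalg.solve(Xw.T @ Xw, Xw.T @ yw)

# Synthetic data (assumption): a random SPD matrix stands in for the kinship.
rng = np.random.default_rng(0)
n, p = 50, 4
A = rng.standard_normal((n, n))
phi = A @ A.T / n
M = covariance(phi, sigma2=1.3, h2=0.4)
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
b = gls(X, M, y)
print(b)
```

Factoring $M_j$ instead of inverting it explicitly is the usual choice for SPD systems, since it is cheaper and numerically better behaved than forming $M_j^{-1}$.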

The first reported GWA study dates back to 2005: 146 individuals were genotyped, and about 103,000 SNPs were analyzed with respect to one trait [4]. Since then, as the catalog of published GWA analyses shows [5], [6], the number of publications has increased steadily, up to 2404 in 2011 and 3307 in 2012. A similar growth can be observed in both the population size and the number of SNPs: across all the GWAS published in 2012, the studies comprised on average 15,471 individuals, with a maximum of 133,154, and on average 1,252,222 genetic markers, with a maximum of 7,422,970. More recently, advances in technology have made it affordable to assess “omics” phenotypes in large populations, resulting in the challenge of analyzing thousands (potentially hundreds of thousands) of traits. From the perspective of Eqs. (1) and (2), these trends present concrete challenges, especially in terms of memory requirements. As $M_j \in \mathbb{R}^{n \times n}$, the $m$ matrices $X_i$, and the $t$ vectors $y_j$ compete for main memory, two distinct scenarios arise: (1) if $n$ is small enough for $M_j$ to fit in main memory, the $X_i$'s and $y_j$'s are to be streamed from disk; (2) if $M_j$ does not fit in main memory, both data and computation have to be distributed over multiple compute nodes. In this paper, we present efficient strategies for utilizing distributed architectures (such as clusters, Cloud-based systems, and supercomputers) to execute single-trait and multi-trait GWA analyses with arbitrarily large population size, number of SNPs, and number of traits.
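The two memory scenarios can be made tangible with a back-of-the-envelope calculation. The sketch below assumes dense storage in 8-byte double precision, which is an assumption made here for illustration (real genotype files may be stored more compactly); the helper names are hypothetical.

```python
BYTES_PER_DOUBLE = 8   # assumption: dense double-precision storage
GIB = 2**30

def kinship_gib(n):
    """Size of one dense n x n covariance (or kinship) matrix in GiB."""
    return n * n * BYTES_PER_DOUBLE / GIB

def snp_stream_gib(n, m):
    """Size of the m varying SNP columns X_Ri (length n each) in GiB."""
    return n * m * BYTES_PER_DOUBLE / GIB

# Population sizes spanning the range quoted in the text.
for n in (10**3, 10**4, 10**5):
    print(f"n = {n:>6}: M_j needs {kinship_gib(n):8.3f} GiB")

# Average 2012 study size quoted in the text.
print(f"SNP data: {snp_stream_gib(15_471, 1_252_222):.1f} GiB")
```

Under these assumptions, a matrix for $n = 10^5$ occupies roughly 75 GiB, beyond the main memory of a typical single node, while the SNP data of an average 2012 study already exceeds 100 GiB and therefore has to be streamed from secondary storage.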

Related work. Several freely available libraries exist for performing GWA studies. Among them, we highlight GenABEL, a widely used framework for statistical genomics [7], and FaST-LMM, a high-performance software targeting single-trait analyses [8]. More recently, Fabregat et al. developed OmicABEL, a package for the GenABEL suite, which implements optimized solutions for shared memory architectures [9], [10]. However, those algorithms do not support distributed-memory computations, and are only applicable when the kinship matrix fits in the local memory of a single node.

Organization of the paper. The rest of this paper is structured as follows. Section 2 is devoted to single-trait GWAS (i.e., t=1): in Section 2.1, the mathematical algorithm is first introduced; in Section 2.2, we discuss the shared memory implementation GWAS-1D-SMP, an extension to accommodate analyses with an arbitrarily large number of SNPs (large m); then, in Section 2.3, we present GWAS-1D-MPI, a distributed memory extension for analyses with large population size (large n). Section 3 addresses multi-trait studies (i.e., t>1): a discussion of a core algorithm that exploits invariants across multiple traits is given in Section 3.1; to allow the solution of problems of arbitrary size (in terms of m, n, and t), we apply out-of-core (Section 3.2) and distributed-memory techniques (Section 3.3), thus yielding GWAS-2D-MPI. Conclusions are drawn in Section 4.

Single-trait GWAS

We consider Eq. (1) restricted to the study of a single trait $y$:

$$b_i := \left(X_i^T M^{-1} X_i\right)^{-1} X_i^T M^{-1} y, \quad \text{with } i = 1, \ldots, m.$$
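In the single-trait case, the covariance matrix, the trait, and the fixed covariates $X_L$ do not depend on $i$, so the work involving them can be done once and reused for every SNP. The following NumPy sketch illustrates this reuse; it is a simplified illustration, not necessarily the algorithm developed in the paper, and the function name `single_trait_gwas` and the synthetic inputs are hypothetical.

```python
import numpy as np

def single_trait_gwas(M, XL, XR, y):
    """Return an (m, p) array whose row i is b_i for X_i = [XL | XR[:, i]]."""
    L = np.linalg.cholesky(M)        # computed once: M does not depend on i
    XLw = np.linalg.solve(L, XL)     # whitened fixed covariates, reused
    yw = np.linalg.solve(L, y)       # whitened trait, reused
    XRw = np.linalg.solve(L, XR)     # all SNP columns whitened in one call
    m = XR.shape[1]
    p = XL.shape[1] + 1
    B = np.empty((m, p))
    for i in range(m):
        # Only the last column of the whitened design matrix changes with i.
        Xw = np.column_stack([XLw, XRw[:, i]])
        B[i] = np.linalg.solve(Xw.T @ Xw, Xw.T @ yw)
    return B

# Synthetic data (assumption): small sizes chosen only for illustration.
rng = np.random.default_rng(1)
n, m = 40, 6
A = rng.standard_normal((n, n))
M = 0.5 * (A @ A.T / n) + 0.5 * np.eye(n)   # SPD by construction
XL = rng.standard_normal((n, 3))
XR = rng.standard_normal((n, m))
y = rng.standard_normal(n)
B = single_trait_gwas(M, XL, XR, y)
print(B.shape)
```

Whitening all SNP columns in a single call turns many small vector operations into one large matrix operation, which is generally the more efficient formulation on modern hardware.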

Multi-trait GWAS

In an important class of GWAS (analysis of “omics” phenotypes), the studies involve many traits $y_j$ [12], [13], [14], [15]. In this case, the set of generalized least squares problems in Eq. (1) extends into the second dimension $j$:

$$b_{ij} := \left(X_i^T M_j^{-1} X_i\right)^{-1} X_i^T M_j^{-1} y_j, \quad \text{with } i = 1, \ldots, m \text{ and } j = 1, \ldots, t.$$

This extra dimension is not only reflected in the traits $y_j$, but it also introduces varying matrices $M_j$. Such symmetric positive definite $M_j$'s share the common structure

$$M_j = \sigma_j^2 \left(h_j^2 \Phi + (1 - h_j^2) I\right),$$

where the so-called kinship
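One standard way to exploit the shared structure of the $M_j$'s is an eigendecomposition of the kinship matrix, $\Phi = Z W Z^T$: every $M_j$ then has the same eigenvectors $Z$ and eigenvalues $\sigma_j^2 (h_j^2 w_k + 1 - h_j^2)$, so the expensive factorization is performed once rather than once per trait. The NumPy sketch below illustrates this idea under that assumption; it is not presented as the paper's algorithm, and the function name `multi_trait_gwas` and the synthetic inputs are hypothetical.

```python
import numpy as np

def multi_trait_gwas(phi, XL, XR, Y, sigma2, h2):
    """Return B with B[i, j] = b_ij for X_i = [XL | XR[:, i]] and trait Y[:, j]."""
    w, Z = np.linalg.eigh(phi)       # Phi = Z diag(w) Z^T, computed once
    XLr = Z.T @ XL                   # rotate all data once into the eigenbasis
    XRr = Z.T @ XR
    Yr = Z.T @ Y
    m, t = XR.shape[1], Y.shape[1]
    p = XL.shape[1] + 1
    B = np.empty((m, t, p))
    for j in range(t):
        # Eigenvalues of M_j; in the rotated basis M_j is diagonal.
        d = sigma2[j] * (h2[j] * w + (1.0 - h2[j]))
        s = 1.0 / np.sqrt(d)
        XLs = XLr * s[:, None]       # diagonal scaling replaces a factorization
        XRs = XRr * s[:, None]
        ys = Yr[:, j] * s
        for i in range(m):
            Xs = np.column_stack([XLs, XRs[:, i]])
            B[i, j] = np.linalg.solve(Xs.T @ Xs, Xs.T @ ys)
    return B

# Synthetic data (assumption): small sizes chosen only for illustration.
rng = np.random.default_rng(2)
n, m, t = 30, 5, 3
A = rng.standard_normal((n, n))
phi = A @ A.T / n
XL = rng.standard_normal((n, 2))
XR = rng.standard_normal((n, m))
Y = rng.standard_normal((n, t))
sigma2 = np.array([1.0, 2.5, 0.7])
h2 = np.array([0.2, 0.5, 0.8])
B = multi_trait_gwas(phi, XL, XR, Y, sigma2, h2)
print(B.shape)
```

With this formulation, the per-trait cost of handling $M_j$ drops from a cubic-cost factorization to a linear-cost diagonal scaling, at the price of one up-front eigendecomposition.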

Conclusion

We presented parallel algorithms for the computation of linear mixed-models based genome-wide association studies (GWAS). They address the issue of growing dataset sizes due to the number of studied polymorphisms m, the population size n, and/or the number of traits t.

The first algorithm uses a double-buffering technique to process datasets with arbitrarily large numbers of genetic polymorphisms. Compared to other widespread GWAS codes, our shared memory implementation, GWAS-1D-SMP,
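The double-buffering technique named above can be sketched generically: while one block of data is being processed, the next block is loaded in the background, so at most two blocks reside in memory at a time. The Python sketch below shows the pattern with the standard library only; the function `process_stream` and its `load_block` and `compute` callbacks are hypothetical names introduced for illustration, and in-memory lists stand in for blocks of SNPs read from disk.

```python
from concurrent.futures import ThreadPoolExecutor

def process_stream(load_block, compute, num_blocks):
    """Overlap the loading of block k+1 with the computation on block k."""
    results = []
    if num_blocks == 0:
        return results
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load_block, 0)       # fill the first buffer
        for k in range(num_blocks):
            block = pending.result()               # wait until block k is loaded
            if k + 1 < num_blocks:
                # Start filling the second buffer in the background.
                pending = pool.submit(load_block, k + 1)
            results.append(compute(block))         # runs while block k+1 loads
    return results

# Stand-in data (assumption): lists replace blocks read from secondary storage.
blocks = [[1, 2, 3], [4, 5], [6]]
sums = process_stream(lambda k: blocks[k], sum, len(blocks))
print(sums)  # prints [6, 9, 6]
```

When loading and computing take comparable time, this overlap can hide most of the I/O latency, so throughput is bounded by the slower of the two stages rather than by their sum.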

Acknowledgments

Financial support from the Deutsche Forschungsgemeinschaft (German Research Association) through Grant GSC 111 and Deutsche Telekom Stiftung is gratefully acknowledged. The authors thank Yurii Aulchenko for fruitful discussions on the biological background of GWAS.

References (16)

  • D. Fabregat-Traver et al., Solving sequences of generalized least-squares problems on multi-threaded architectures, Appl. Math. Comput. (AMC) (2014)
  • L.A. Hindorff et al., Potential etiologic and functional implications of genome-wide association loci for human diseases and traits, Proc. Natl. Acad. Sci. U.S.A. (2009)
  • E. Boerwinkle et al., The use of measured genotype information in the analysis of quantitative phenotypes in man. I. Models and analytical methods, Ann. Hum. Genet. (1986)
  • J. Yu et al., A unified mixed-model method for association mapping that accounts for multiple levels of relatedness, Nat. Genet. (2006)
  • R.J. Klein et al., Complement factor H polymorphism in age-related macular degeneration, Science (2005)
  • L.A. Hindorff, J. MacArthur, J. Morales, H.A. Junkins, P.N. Hall, A.K. Klemm, T.A. Manolio, A catalog of published...
  • T.A. Manolio, Published GWAS reports, 2005–6/2012....
  • Y.S. Aulchenko et al., GenABEL: an R library for genome-wide association analysis, Bioinformatics (2007)
There are more references available in the full text version of this article.
