rMVP: A Memory-efficient, Visualization-enhanced, and Parallel-accelerated Tool for Genome-wide Association Study

Along with the development of high-throughput sequencing technologies, both sample size and SNP number are increasing rapidly in genome-wide association studies (GWAS), and the associated computation is more challenging than ever. Here, we present a memory-efficient, visualization-enhanced, and parallel-accelerated R package called “rMVP” to address the need for improved GWAS computation. rMVP can 1) effectively process large GWAS data, 2) rapidly evaluate population structure, 3) efficiently estimate variance components by Efficient Mixed-Model Association eXpedited (EMMAX), Factored Spectrally Transformed Linear Mixed Models (FaST-LMM), and Haseman-Elston (HE) regression algorithms, 4) implement parallel-accelerated association tests of markers using general linear model (GLM), mixed linear model (MLM), and fixed and random model circulating probability unification (FarmCPU) methods, 5) compute fast with a globally efficient design in the GWAS processes, and 6) generate various visualizations of GWAS-related information. Accelerated by block matrix multiplication strategy and multiple threads, the association test methods embedded in rMVP are significantly faster than PLINK, GEMMA, and FarmCPU_pkg. rMVP is freely available at https://github.com/xiaolei-lab/rMVP.


30
The computation burden of GWAS is partially caused by the increasing sample size and 31 marker density applied for these studies. As a result, how to efficiently analyse the big data is 32 Introducing the population structure concept into GWAS has dramatically improved 39 accuracy of detection. For example, incorporating the fractions of individuals belonging to 40 subpopulations, namely Q matrix, reduces both false positive and false negative signals [4]. 41 Principal components (PCs) are widely used to represent subpopulations and to enable the 42 incorporation of population structure into GWAS [5]. Implementing the General Linear 43 Model (GLM) to incorporate either the Q matrix or PCs as covariates, PLINK has become the 44 most popular software package for GWAS [6]. are implemented in rMVP for association tests. When there is more than one covariate (e.g. 119 PCs) added to association test models, the inverse of the design matrix corresponding to the 120 covariates will be calculated n times, where n is marker size. Block matrix multiplication 121 strategy can be used to speed up the processes including inverse of the design matrix 122 corresponding to the covariates and the testing markers. This strategy is used in all available 123 association test methods in rMVP. Take GLM as an example, the fixed effect model function 124 can be written as: where y is a vector of phenotype, X is a matrix of fixed effects and test SNP, b is 127 an incidence matrix for X , and e is a vector of residual that followed a normal distribution 128 with mean of zero and 2 e Iσ covariance, where I is the identity matrix and 2 e σ is the 129 unknown residual variance. Equation (1) can be reformulated by following steps: 130
If we use w and x represent 1 2 3 , , ,..., k C C C C and SNP , respectively, the inverse of 143 M matrix can be written as: where 146 is always a key part in commonly used GWAS procedure, which has been precomputed 171 already. Not only that, as shown in Figure

Memory-efficient: Efficient memory usage in data loading and parallel computation 194
Genotype matrices are the biggest datasets for GWAS. In rMVP, genotype data in multiple 195 formats are converted to 'big.matrix', which can minimize RAM usage through generating a 196 bridge that facilitates RAM accessing the data on the hard disk instead of loading it to RAM 197 directly as the most software tools do. rMVP achieves this goal by using the 'bigmemory' 198 package to build data mirrors that are accessible to RAM, while the actual data remain on the 199 hard drive. In this way, very little RAM capacity is needed for the temporary storage of the 200 data. Once the data mirrors are built, users will never need to re-build them again and the 201 time of loading input data is negligible. When multiple threads are used to accelerate the 202 association tests, no additional data mirrors will be copied for each thread as all threads will 203 share the same data mirrors. 204 Here, we made a rough illustration of 'big.matrix' based memory storage of one and 205 multiple threads for rMVP. The complete GWAS procedure of three methods was recorded 206 for RAM usage test in a Linux server ('RES' -'SHR'). In this test, the product of genotype 207 data size was measured in standard R matrix format, and 'theoretical RAM cost' for multiple 208 threads in 'fork' mode is defined as r × c × t × 8 bytes, where r and c are the number of rows 209 and columns of a matrix respectively, t is the number of threads. From the Figure 1 Microsoft R Open platform. However, the big genotype data is loaded into RAM and 239 resulting in a big memory cost as most of the GWAS software tools do, e.g., GEMMA, 240 GCTA, and GAPIT. For the "memory" mode, all the matrices that required for constructing 241 the GRM are stored in the 'big.matrix' format and the matrix multiplication of 'big.matrix' is 242 implemented by our newly developed C++ function, which could be parallel accelerated by 243 using the OpenMP (Open Multi-Processing) technology. Although it can significantly 244 decrease the cost of memory, a little bit more computing time is required (Table 1). Users can 245 easily adjust the "priority" parameter to get rid of the memory limit or the fastest speed 246 depending on the data size and computing resources. 247 Table 1. Comparison of memory and time cost under modes of "speed" and "memory" 9 single tested marker. This process involves the inverse of the design matrix for covariates and 257 the tested markers. Since the covariates were the same for every tested marker, we partitioned 258 the design matrix into sub-matrices according to the covariates and the testing markers. The 259 inverse of the entire design matrix was calculated from the one-time calculation of the inverse 260 of the sub-matrix of covariates. As the number of covariates and markers increased, sub-261 matrices partitioning significantly saved computing time (Table 2). Block matrix 262 multiplication strategy has been used in all association tests including GLM, MLM, and 263 FarmCPU. 264  FarmCPU_pkg (R package written in pure R, http://zzlab.net/FarmCPU/), respectively. The 284 rMVP (written in R and C++) was compared with these software packages for speed 285 performance and the computing time was recorded for each software from loading data to generating results files (Figure 2, Table S1). Detailed software version and scripts for 287 computing speed test are provided in Supplementary Table S2 Table 3. At the moment, rMVP does not provide functions of imputation and quality 327 control, which need to be done before association tests. Instead, rMVP provides functions for 328 flexible data conversion that can easily accept the data from other software, such as 329 Beagle [28], which also accepts data in VCF format and provides imputation and quality 330 control functions. 331    Table S1. Computation speed performances of PLINK, GEMMA, and rMVP for five 449 simulated datasets. 450 Table S2. Software versions and codes used in performance tests. 451 Table S3. Parameter details for flexible visualization of GWAS related information. 452

Figure S1
453 Title: Comparisons of association results between rMVP and related software. 454 Legend: x-axis is the computed p-value in -log10 format for different GWAS models, y-axis is the computed 455 p-value in -log10 format of related software for corresponding GWAS model, the experiment was performed on 456 the simulated 16 data units (16,000 samples and 1,600,000 SNPs).

457
Figure S2 458 Title: Comparisons of power and false positive discovery for different GWAS models 459 between rMVP and related software. 460 Legend: The experiment was performed using an Arabidopsis dataset, which includes 1178 individuals and 461 208794 SNPs, the phenotype was simulated by randomly selected 10 QTNs following a normal distribution with 462 mean 0 and variance 0.1, the heritability was 0.5. The final results were the average of 100 replicates.

463
Figure S3 464 Title: The road mapping of whole GWAS procedures in rMVP. 465 Legend: K is the Kinship matrix, also known as Genomic relationship matrix (GRM