mrMLM v4.0.2: An R Platform for Multi-locus Genome-wide Association Studies

Previous studies have reported that some important loci are missed in single-locus genome-wide association studies (GWAS), especially because of the large phenotypic error in field experiments. To solve this issue, multi-locus GWAS methods have been recommended. However, only a few software packages for multi-locus GWAS are available. Therefore, we developed an R software named mrMLM v4.0.2. This software integrates mrMLM, FASTmrMLM, FASTmrEMMA, pLARmEB, pKWmEB, and ISIS EM-BLASSO methods developed by our lab. There are four components in mrMLM v4.0.2, including dataset input, parameter setting, software running, and result output. The fread function in data.table is used to quickly read datasets, especially big datasets, and the doParallel package is used to conduct parallel computation using multiple CPUs. In addition, the graphical user interface software mrMLM.GUI v4.0.2, built upon Shiny, is also available. To confirm the correctness of the aforementioned programs, all the methods in mrMLM v4.0.2 and three widely-used methods were used to analyze real and simulated datasets. The results confirm the superior performance of mrMLM v4.0.2 to other methods currently available. False positive rates are effectively controlled, albeit with a less stringent significance threshold. mrMLM v4.0.2 is publicly available at BioCode (https://bigd.big.ac.cn/biocode/tools/BT007077) or R (https://cran.r-project.org/web/packages/mrMLM.GUI/index.html) as an open-source software.


Introduction
Since the establishment of the mixed linear model (MLM) framework of genome-wide association studies (GWAS) [1,2], the MLM-based GWAS methodologies have been widely used to identify many important loci for complex traits in animals, plants, and humans. With the technological advances in molecular biology, a huge number of markers are easily obtained. However, this brings new computational and analytic challenges. The MLM-based single-marker association in genome-wide scans proves its feasibility. To increase statistical power and decrease running time in quantitative trait nucleotide (QTN) detection, a series of additional MLMbased methods have been proposed. For example, Kang et al. [3] proposed an efficient mixed model association (EMMA), which was then extended to generate EMMAX [4] and GEMMA [5]. Meanwhile, Zhang et al. [6] reported a compressed MLM (CMLM) method, which was then extended to develop ECMLM [7] and SUPER [8]. In addition, other methods have also been developed, e.g., GRAMMAR-Gamma [9], FaST-LMM [10], FaST-LMM-Select [11], and BOLT-LMM [12]. All the aforementioned methods have been subjected to multiple testing. To control the false positive rate in such tests, the Bonferroni correction is frequently adopted. However, this correction is often too conservative to detect many important loci.
To detect more QTNs with a low false positive rate, multilocus methods have been recommended. This recommendation was implemented for the first time by Segura and his colleagues [13]. Thereafter, Liu et al. [14] developed FarmCPU. Based on the advantages of the random model of QTN effect over the fixed model [15], we have recently developed six multi-locus methods: mrMLM [16], FASTmrMLM [17] (File S1), FAS-TmrEMMA [18], ISIS EM-BLASSO [19], pLARmEB [20], and pKWmEB [21] (File S2). These methods include two stages. First, various algorithms are used to select all the potentially associated markers. Second, the selected markers are put in one model, then all the effects in this model are estimated by empirical Bayes, and all the non-zero effects are further identified by likelihood ratio test for true QTNs. Although a less stringent significance threshold is adopted, these methods have high power and accuracy, and a low false positive rate.
Many packages are available in the GWAS software, e.g., PLINK [22], TASSEL [23], EMMA [3], EMMAX [4], GEMMA [5], and GAPIT [24,25] (File S2). However, these packages are almost all based on single-marker association in genome scans. To popularize our multi-locus GWAS methods, we integrated all the six multi-locus approaches into one R package named mrMLM v4.0.2 ( Figure S1). Implementation mrMLM v4.0.2 includes four parts ( Figure 1): dataset input, parameter setting, software running, and result output. In the dataset input module, users need to input trait phenotypes and marker genotypes. The two types of datasets are input by the filePhe and fileGen files, respectively, and the available file formats are *.csv and *.txt. Marker genotypes may be indicated by mrMLM numeric (or character) and Hapmap formats, and are used to calculate both kinship (using mrMLM or EMMA [3]) and population structure (using Structure [26] or fastSTRUCTURE [27]) matrices. This software also has an option to input kinship matrix, population structure matrix, and covariate table. The three types of datasets are input by the fileKin, filePS, and fileCov files, respectively. In the parameter setting module, users need to set 17
Influence of various factors on QTN detection using mrMLM v4.0.2 To investigate the effect of the number of markers on running time, four samples with various numbers of markers (0.2, 0.5, 0.8, and 1.01 million) and a fixed sample size (500 accessions) were sampled from the real dataset from rice [28]. As a result, it took 0.23, 0.66, 1.18, and 1.61 hours, respectively (Figure 2A). This indicates the increase of running time with the increase of the number of markers. To investigate the effect of sample size on running time, 300, 600, 900, 1200, and 2262 accessions were sampled from 2262 accessions each with 1.01 million markers from the rice dataset [28]. As a result, it took 0.37, 0.78, 1.30, 2.04, and 9.56 hours, respectively (Figure 2B). This indicates that larger sample size requires much more running time than smaller ones.
To investigate the effect of the number of CPUs on speedup, one sample with 500 accessions and 1.01 million  Table S1). This indicates the effectiveness of parallel computing. The relatively small speedups with 5-7 CPUs for pLAR-mEB and ISIS EM-BLASSO may be due to the fact that their potentially associated markers were determined at the chromosome and genome levels, respectively (Table S1). To compare the running time of various methods, one sample with 500 accessions and 1.01 million markers was analyzed by seven methods (mrMLM, FASTmrMLM, FASTmrEMMA, pLAR-mEB, pKWmEB, ISIS EM-BLASSO, and FarmCPU). As a result, it took 1.43, 1.08, 1.45, 1.53, 3.07, 1.06, and 1.09 hours, respectively ( Figure 2D). This indicates that ISIS EM-BLASSO is the fastest one, and FASTmrMLM is equivalent with FarmCPU and faster than mrMLM.
The first to fourth experiments were conducted on the first to fourth servers, respectively (File S5).

Real data analyses in rice, maize, and Simmental beef cattle
We re-analyzed the aforementioned three datasets in rice [28], maize [29], and Simmental beef cattle [30]. The details can be found in File S5.
tively (Table S2). Around these QTNs, some genes have been reported to be associated with grain width. Among these reported genes, two were identified both by mrMLM and by Wang and his colleagues [28], and eleven were detected only by mrMLM ( Figure 3A; Table S3). In addition, two genes were predicted to be associated with grain width in this study (Figure 3A; Table S4).
The total numbers of QTNs detected by the aforementioned six methods for oil concentration in maize are 42, 43, 31, 29, 17, and 6, respectively (Table S5). Around these QTNs, some genes have been reported to be associated with maize oil concentration. Among these reported genes, ten were identified both by mrMLM and by Li and his colleagues [29], thirteen were detected only by mrMLM, and four were identified only by Li and his colleagues [29] (Figure 3B; Table S6).
The total numbers of QTNs identified by the aforementioned six methods for kidney weight in Simmental beef cattle are 4, 55, 167, 117, 8, and 48, respectively (Table S7). Around these QTNs, some genes have been reported to be associated with kidney weight. Among these reported genes, MECOM was identified both by mrMLM and by An and his colleagues [31]. LCORL and NCAPG, which are very important genes for kidney weight in cattle, were detected only by mrMLM (Figure 3C; Table S8).

Discussion
To confirm the correctness of our software mrMLM v4.0.2, the same simulation datasets (https://doi.org/10.5061/dryad. sk652) from Zhang et al. [20] (File S6) were re-analyzed by the aforementioned six methods and three current methods (GEMMA [5], FarmCPU [14], and EMMAX [4]). As a result, our six methods are better than the three current methods (Figures S2-S4; Tables S9-S11). The conclusion was also confirmed by the studies of Zhang and his colleagues [32]. As compared with the original packages of our multi-locus GWAS methods, there have been some improvements in the new version. First, the FASTmrMLM algorithm is described for the first time in this study (File S1). Then, the new package is faster in reading datasets and efficient in parallel computing ( Figure 2C). Even if the sample size is larger than 2000, FAS-TmrEMMA is fast as well. This is because it is unnecessary to solve eigenvector at genome scan. Finally, the option for continuous covariates has been set up in order to analyze animal and human GWAS datasets. The new package works well for continuous variables in plant, animal, and human GWAS, although the current version doesn't work for the case-control datasets in human genetics. In addition, we correct one mistake in the determination of the potentially associated SNPs in the Monte Carlo simulation studies of Zhang and his colleagues [20].
In the work of Zhang's group [32], several major concerns in GWAS have been discussed, i.e., methodological selection, the critical probability value or log of odds (LOD) score, reliable candidate genes, and heritability missing.
Using mrMLM v4.0.2, individual parameters may be changed in order to obtain the best results (Files S3 and S4). For example, the number of potentially associated SNPs for each chromosome in pLARmEB [20] is set at 50, and the search radius in mrMLM [16] and FASTmrMLM [17] is set at 20 kb in real data analysis. In addition, users should under- Figure 3 Manhattan and QQ plots for grain width, oil concentration, and kidney weight in GWAS using mrMLM v4.0.2 Left is Manhattan plot, while right is QQ plot. A. Grain width in rice [28]. B. Oil concentration in maize [29]. C. Kidney weight in Simmental beef cattle [30]. The dots were used to indicate the known genes detected both by mrMLM and in original studies (black), only by the software mrMLM (red), and only in original studies (grey), as well as candidate genes around QTNs from the software mrMLM (blue). QQ, quantile-quantile. stand some parameter settings. For example, the maximum number of CPUs in parallel computation is set at 10. If users want to use more CPU cores, this parameter needs to be modified in the codes. Of course, the accuracy, size, and color of the GWAS figures and the critical LOD score line of significant QTNs may be changed as well.

Conclusion
To popularize our multi-locus GWAS methods, six multi-locus methods have been integrated into the software mrMLM v4.0.2. In this package, three genotypic data formats are available, big dataset can be analyzed at server, parallel computation with multiple CPUs can be performed, and parameters in the GWAS figures may be set. In addition, the graphical user interface software, mrMLM.GUI v4.0.2, built upon Shiny, is available as well. Real data analyses and Monte Carlo simulation studies confirmed the advantages of our multi-locus GWAS methods.