RAINBOW: Haplotype-based genome-wide association study using a novel SNP-set method

Difficulty in detecting rare variants is one of the problems in conventional genome-wide association studies (GWAS). The problem is closely related to complex gene compositions comprising multiple alleles, such as haplotypes. Several single nucleotide polymorphism (SNP) set approaches have been proposed to solve this problem. These methods, however, have rarely been discussed in connection with haplotypes. In this study, we developed a novel SNP-set method named “RAINBOW” and applied the method to haplotype-based GWAS by regarding a haplotype block as a SNP-set. By combining haplotype block estimation and SNP-set GWAS, haplotype-based GWAS can be conducted without prior information on haplotypes. We prepared 100 datasets of simulated phenotypic data and real marker genotype data of Oryza sativa subsp. indica, and performed GWAS of the datasets. We compared the power of our method with that of conventional single-SNP GWAS, conventional haplotype-based GWAS, and conventional SNP-set GWAS. Our proposed method was shown to be superior to these in three aspects: (1) controlling false positives; (2) detecting causal variants without relying on linkage disequilibrium when the causal variants were genotyped in the dataset; and (3) achieving greater power than the other methods, i.e., detecting causal variants that were not detected by the others, primarily when the causal variants were located very close to each other and the directions of their effects were opposite. By using the SNP-set approach as in this study, we expect that detecting not only rare variants but also genes with complex mechanisms, such as genes with multiple causal variants, can be realized. RAINBOW was implemented as an R package named “RAINBOWR” and is available from CRAN (https://cran.r-project.org/web/packages/RAINBOWR/index.html) and GitHub (https://github.com/KosukeHamazaki/RAINBOWR).


numerator relationship matrix A.
To fit this model in practice, RAINBOW converts it to a single random-effect model in which the variance-covariance of the random effect is proportional to a weighted average of the variance-covariance matrices of $Z_c u_c$ (all markers) and $Z_{r_i} u_{r_i}$ (markers tested). This single random-effect model is fitted with EMMA/GEMMA. The process is repeated to find the optimal weights, using L-BFGS optimization of the full or restricted log-likelihood.
Their method of fitting the model, which is rather ad hoc, arises, I suspect, from the restriction that EMMA/GEMMA can only fit a model with a single random effect. What I question about this fitting process is how good the resulting estimates are compared with, say, estimating the model parameters directly via REML. Some existing software, such as SAS and asreml, can estimate variance components directly via REML even when there are multiple random effects. These are, of course, commercial software, and the authors would likely want RAINBOW to be free of those restrictions, but I think a comparison of the authors' variance component estimates with REML estimates is warranted.
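As a minimal sketch of the weighted-average construction described above, assume the two weights are non-negative and sum to one, so that the covariance of the combined random effect is sigma2_u * (w * K_c + (1 - w) * K_r). Under that assumption the single random-effect form is a reparameterization of the two-component form with sigma2_u = sigma2_c + sigma2_r and w = sigma2_c / sigma2_u. The function names and the toy matrices K_C and K_R are hypothetical and are not taken from RAINBOW; whether the package parameterizes the weights this way is an assumption.

```python
# Hypothetical illustration only: names and toy matrices are not from RAINBOW.

def weighted_kernel(k_c, k_r, w):
    """Return w * K_c + (1 - w) * K_r for square matrices given as nested lists."""
    return [[w * a + (1.0 - w) * b for a, b in zip(row_c, row_r)]
            for row_c, row_r in zip(k_c, k_r)]


def to_single_effect(sigma2_c, sigma2_r):
    """Map two variance components to (sigma2_u, w), assuming weights sum to one."""
    total = sigma2_c + sigma2_r
    if total <= 0.0:
        raise ValueError("at least one variance component must be positive")
    return total, sigma2_c / total


def two_component_cov(k_c, k_r, sigma2_c, sigma2_r):
    """Covariance of the summed random effects: sigma2_c * K_c + sigma2_r * K_r."""
    return [[sigma2_c * a + sigma2_r * b for a, b in zip(row_c, row_r)]
            for row_c, row_r in zip(k_c, k_r)]


def single_effect_cov(k_c, k_r, sigma2_u, w):
    """Covariance under the single random-effect form: sigma2_u * weighted kernel."""
    combined = weighted_kernel(k_c, k_r, w)
    return [[sigma2_u * x for x in row] for row in combined]


# Toy 3 x 3 relationship matrices (made up for the example).
K_C = [[1.0, 0.2, 0.1],
       [0.2, 1.0, 0.3],
       [0.1, 0.3, 1.0]]
K_R = [[1.0, 0.8, 0.0],
       [0.8, 1.0, 0.0],
       [0.0, 0.0, 1.0]]

sigma2_u, w = to_single_effect(2.0, 0.5)
cov_two = two_component_cov(K_C, K_R, 2.0, 0.5)
cov_one = single_effect_cov(K_C, K_R, sigma2_u, w)

# The two forms give the same covariance matrix up to rounding error.
max_diff = max(abs(x - y)
               for row_a, row_b in zip(cov_two, cov_one)
               for x, y in zip(row_a, row_b))
print(sigma2_u, w, max_diff)
```

If the weights are parameterized this way, any difference between the reported estimates and direct REML estimates would come from the optimization rather than from the model form, which is what the requested comparison would check.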

General comments:
• $\rho_{12}$ in Eqns (16) and (17) does not seem to be defined. I assumed it to be Pearson's correlation coefficient between $X_1$ and $X_2$, but could the authors clarify?
• P4. L82 & L89 & L95. $\sigma^2_c$ and $\sigma^2_{r_i}$ are not estimated by solving the mixed effect model Eq (1). It is important to distinguish between the model and the process of fitting the model. The fitting process estimates the variance parameters, usually by REML, although not always, as in the authors' approach described in the "Estimation of variance components" section. The authors could simply refer to the "Estimation of variance components" section rather than stating that the parameters are obtained by solving model Eq (1).
• Any reference to back up the statement "the score test is not necessarily the best method for testing the random effects in the mixed effects model" in L41?

Citations
• Please cite the software used, e.g. cite R, and cite ggplot2 for creating the figures. Citations for R packages are easily obtained in R, e.g. by using citation("ggplot2") for the ggplot2 R package.
• Please cite relevant statistical papers. E.g. REML is attributed to Patterson & Thompson (1971); for testing of variance components at the boundary, the asymptotic distribution is discussed in Self and Liang (1987), with further discussion in Stram and Lee (1994) (the latter may be more relevant as it pertains directly to LR tests). The score test, ML, and so on are also missing citations.
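
To illustrate why the boundary references matter, here is a minimal sketch of a likelihood-ratio p-value when a single variance component is tested against zero. It assumes the 50:50 mixture of a point mass at zero and a chi-square distribution with 1 degree of freedom discussed by Stram and Lee (1994) applies. The function names are hypothetical, and whether the manuscript's test satisfies the conditions for this mixture is not established here.

```python
import math


def chi2_sf_1df(x):
    """Upper-tail probability P(X > x) for a chi-square variable with 1 df."""
    return math.erfc(math.sqrt(x / 2.0))


def boundary_lrt_pvalue(lr_stat):
    """P-value under the assumed 0.5 * chi2(0) + 0.5 * chi2(1) mixture.

    chi2(0) is a point mass at zero, so it contributes to the tail
    probability only when the statistic is not positive.
    """
    if lr_stat <= 0.0:
        return 1.0
    return 0.5 * chi2_sf_1df(lr_stat)


# Compare the naive chi2(1) p-value with the mixture-based one.
lr = 3.84
naive = chi2_sf_1df(lr)
adjusted = boundary_lrt_pvalue(lr)
print(naive, adjusted)
```

Under this assumption the mixture p-value is half the naive chi-square(1) p-value for any positive statistic, so ignoring the boundary makes the test conservative.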

Minor comments:
Abstract:
• "The results indicated that the proposed method was able to control false positives than the others." -> control false positives better than other methods.
• "The proposed method was also excellent at detecting causal variants without relying on the linkage disequilibrium if causal variants were genotyped in the dataset. Moreover, the proposed method showed greater power than the other methods, i.e., it was able to detect causal variants that were not detected by the others, primarily when the causal variants were located very close to each other, and the directions of their effects were opposite. The proposed method, RAINBOW, is exceptionally superior in controlling false positives, detecting causal variants, and detecting nearby causal variants with opposite effects. By using the SNP-set approach as the proposed method, we expect that detecting not only rare variants but also genes with complex mechanisms, such as genes with multiple causal variants, can be realized." -> Rather than repeating "the proposed method" so many times, I suggest you make it a list instead, e.g. "Our proposed method was superior in three aspects: (1) superior in detecting causal variants without ...; (2) ...".