Further investigations of the W-test for pairwise epistasis testing

Background: In a recent paper, a novel W-test for pairwise epistasis testing was proposed that appeared, in computer simulations, to have higher power than competing alternatives. Application to genome-wide bipolar data detected significant epistasis between SNPs in genes of relevant biological function. Network analysis indicated that the implicated genes formed two separate interaction networks, each containing genes highly related to autism and neurodegenerative disorders.

Methods: Here we investigate further the properties and performance of the W-test via theoretical evaluation, computer simulations and application to real data.

Results: We demonstrate that, for common variants, the W-test is closely related to several existing tests of association allowing for interaction, including logistic regression on 8 degrees of freedom, although logistic regression can show inflated type I error at low minor allele frequencies, whereas the W-test shows good, if slightly conservative, type I error control. Although in some situations the W-test can show higher power, logistic regression is not limited to tests on 8 degrees of freedom but can instead be tailored to impose greater structure on the assumed alternative hypothesis, offering a power advantage when the imposed structure matches the true structure.

Conclusions: The W-test is a potentially useful method for testing for association - without necessarily implying interaction - between genetic variants and disease, particularly when one or more of the genetic variants is rare. For common variants, the advantages of the W-test are less clear and, indeed, there are situations where existing methods perform better. In our investigations, we further uncover a number of problems with the practical implementation and application of the W-test (to bipolar disorder) previously described, apparently due to inadequate use of standard data quality-control procedures. This observation leads us to urge caution in the interpretation of the previously presented results, most of which we consider highly likely to be artefacts.


Introduction
In a paper recently published in the journal Nucleic Acids Research, Wang and colleagues 1 proposed a novel W-test for pairwise epistasis testing. The thrust of the proposed method was to compare, between cases and controls, the distributions of the k observed genotype combinations at L = 2 diallelic genetic loci such as single nucleotide polymorphisms (SNPs) (see Table 1). In general, for L = 2 loci, the number of possible categories is k = 9, although this number can be lower if any cells are empty in cases and/or controls. The comparison of genotype distributions between cases and controls is achieved by constructing a cell-specific measure for each of the observed genotype categories, corresponding to the normalized (i.e. divided by its estimated standard error) log odds ratio for that category (i = 1, . . . , k). These cell-specific quantities are then combined to construct a scaled χ 2 test statistic whose value and (possibly non-integer) degrees of freedom (df) can be calculated as a function of two parameters (h and f) that are estimated using a bootstrapping approach. Although in principle applicable to other orders of SNP combinations, including single SNPs (where k = 3) and combinations of more than two SNPs (where k = 3^L), the main focus of the published paper 1 and accompanying software was on the pairwise test.
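The core construction can be sketched in a few lines of code. The following Python sketch reflects our reading of the published description and is not the authors' implementation: each cell's log odds ratio is standardized by the usual (Woolf) standard error, the squared standardized quantities are summed, and the sum is scaled by h and referred to a χ2 distribution on f df. Here we fall back on the package defaults h = (k − 1)/k and f = k − 1 rather than bootstrap-estimated values, and assume no genotype cell is empty (the real implementation adapts k to the observed table).

```python
import numpy as np
from scipy.stats import chi2

def w_test(case_counts, ctrl_counts, h=None, f=None):
    """Sketch of a W-test-style statistic for a 2 x k genotype table.

    case_counts, ctrl_counts: length-k sequences of genotype-category
    counts (all assumed non-zero here). h, f: scale and df parameters;
    Wang et al. estimate these by bootstrapping, whereas this sketch
    defaults to h = (k - 1)/k and f = k - 1.
    """
    n1 = np.asarray(case_counts, dtype=float)
    n0 = np.asarray(ctrl_counts, dtype=float)
    k = len(n1)
    N1, N0 = n1.sum(), n0.sum()
    # Cell-specific log odds ratio (cell i versus the remaining cells)
    log_or = np.log((n1 / (N1 - n1)) / (n0 / (N0 - n0)))
    # Usual (Woolf) standard error of a log odds ratio
    se = np.sqrt(1 / n1 + 1 / (N1 - n1) + 1 / n0 + 1 / (N0 - n0))
    if h is None:
        h = (k - 1) / k
    if f is None:
        f = k - 1
    w = h * np.sum((log_or / se) ** 2)
    return w, chi2.sf(w, f)
```

For a single SNP (k = 3), for example, `w_test([10, 20, 30], [30, 20, 10])` returns a clearly significant result, since the genotype distributions in the two groups differ markedly.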
Analysis of 2 × k contingency tables, such as Table 1, in order to detect association between the row and column variables, is a classical problem in statistics that has had a long history of investigation in the statistical and epidemiological literatures. The usual treatment depends on whether one wishes to condition on the row margins (here, the number of cases and controls -as would be natural in a case/control study), on the column margins (the 'exposure' variables -as would be natural in a clinical trial or a cohort study), or both 2, 3 . Conditioning on the number of cases and controls leads to considering the cell counts as coming from two independent multinomial distributions. Conditioning on the column margins leads to considering the cell counts as coming from k independent binomial distributions. Conditioning on both row and column margins leads to a hypergeometric distribution for the cell counts.
Tests of association within each of the above formulations depend on whether one wishes to use an exact or an asymptotic test 2, 3 . Interestingly, all formulations result in a χ 2 test statistic on k − 1 = 8 df (provided all cells are represented in the data), reflecting the fact that there are k − 1 additional independent parameters to be estimated under the alternative hypothesis (where the row and column variables are allowed to be associated) compared to the null hypothesis (where there is no association between the row and column variables). This contrasts with the W-test proposed by Wang et al. 1 , in which k = 9 non-independent (log odds ratio) quantities are combined, resulting in the necessity for a scaled χ 2 test statistic (with parameters h and f estimated using bootstrapping) in order to account for the non-independence between the k = 9 normalized log odds ratios.
Arguably the most natural way to analyse data from a 2 × 9 contingency table is to perform a standard Pearson's χ 2 test 4 on 8 df, testing the independence of the column variable (here, genotype) and the row variable (here, case/control status). We note that both Pearson's χ 2 test and the W-test actually test for association (between genotype category and phenotype) rather than testing specifically for epistasis or statistical interaction [5][6][7][8][9] between the genotypes at the two loci in relation to phenotype. Unfortunately, depending on the software implementation used, Pearson's χ 2 test can fail to produce a test statistic for sparse data (i.e. cells with low or zero genotype frequencies), and continuity corrections 10 have only been developed for 2×2 and 2×1 contingency tables. Wang et al. 1 point out that one important advantage of their proposed W-test is its adaptive ability to cope with sparse data, through the data-dependent bootstrap estimation of the scaling factor h and the degrees of freedom parameter f.
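For a concrete illustration, both the 8 df Pearson test and its failure mode on sparse tables can be reproduced with standard software. The counts below are hypothetical, chosen only to give a well-filled 2 × 9 table:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2 x 9 genotype-pair table: rows = cases/controls,
# columns = the 9 two-locus genotype combinations.
table = np.array([
    [90, 60, 12, 55, 40, 9, 14, 10, 4],   # cases
    [80, 65, 15, 60, 45, 8, 12, 11, 5],   # controls
])
stat, p, df, expected = chi2_contingency(table, correction=False)
# df = (2 - 1) * (9 - 1) = 8

# A genotype category that is empty in both cases and controls gives
# expected counts of zero, and the statistic is undefined - mirroring
# the CHI-f failures described in the text.
sparse = table.copy()
sparse[:, 8] = 0  # category unobserved in both groups
try:
    chi2_contingency(sparse, correction=False)
    failed = False
except ValueError:
    failed = True
```

In practice one would either drop the empty columns before testing (the CHI-r behaviour) or use a method, such as the W-test or logistic regression, that adapts to the observed categories.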
The precise details of Wang et al.'s bootstrap procedure are not fully delineated in their manuscript, but perusal of both the manuscript and the R code provided indicates that the bootstrap involves using a default of B = 200 bootstrap replicates, each of which uses genotype data from N_B = min(1000, N) randomly chosen individuals and P_B = min(1000, P) randomly chosen pairs of SNPs (where N is the total number of individuals and P the total number of pairs of SNPs in the data set under study), along with phenotype data (case/control status) that are resampled under the null hypothesis (i.e. independent of genotype). Thus, in order to implement the proposed bootstrap procedure, one needs real data from a reasonable number of 'other' pairs of SNPs, which are used as 'surrogates' (to estimate the distributional properties of the test) for every 'test' pair of SNPs. The bootstrap procedure is therefore data-adaptive in the sense that real GWAS data (at a number of pairs of SNPs - possibly but not necessarily including the test pair) from the current data set are used to estimate the parameters h and f.

In theory, the covariance matrix of the vector of estimated genotype-category proportions in group j (where j = 1 refers to cases and j = 0 to controls, and considering the cell counts in cases and controls as coming from two independent multinomial distributions) can be calculated 3 as [Diag(π) - ππ^T]/N_j, where π is the vector of true underlying multinomial probability parameters, π^T = (p_j1, p_j2, . . . , p_j9), whose maximum likelihood estimates are the observed cell proportions n_ji/N_j. This suggests that the covariances of the normalized log odds ratios, which are functions of these estimated proportions, could perhaps be better estimated on the basis of the observed data at the test pair of SNPs alone, possibly through use of a bootstrap. This would presumably result in a data-adaptive approach that adapts to the properties of the specific SNP pair under test, rather than to the properties of all (or a sample of) other SNP pairs in the data set, many of which may have quite different properties (e.g. different minor allele frequencies (MAFs)) from the SNP pair under test.
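To make the scaled-χ2 idea concrete, one natural moment-matching scheme (not necessarily identical to Wang et al.'s exact estimator) recovers a scale parameter and degrees of freedom from a sample of statistics generated under the null, e.g. by permuting the phenotype:

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_scaled_chi2(null_stats):
    """Moment-matching for a scaled chi-squared distribution.

    If W ~ c * chi2_f, then E[W] = c*f and Var[W] = 2*c^2*f, so
    c = Var / (2*E) and f = 2*E^2 / Var. The scale c plays the role
    of a parameter like Wang et al.'s h (up to parameterisation).
    """
    m = np.mean(null_stats)
    v = np.var(null_stats)
    return v / (2 * m), 2 * m**2 / v

# Sanity check on statistics that really are chi-squared on 8 df,
# so the scheme should recover c close to 1 and f close to 8.
null_stats = rng.chisquare(df=8, size=200_000)
c, f = fit_scaled_chi2(null_stats)
```

With a large enough null sample the recovered (c, f) are close to their true values; with only B = 200 bootstrap replicates, as in the default W-test settings, the estimates would naturally be much noisier.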
In the GWAS literature, use of the full set of (or a sample of) observed test statistics as, in some sense, 'surrogates' for the test statistic at any specific SNP (in order to estimate the distributional properties of the test) is not unusual. Devlin and Roeder 11 showed that, in the presence of population stratification, the genome-wide distribution of χ 2 test statistics on 1 df (when testing for allelic association between genotype and phenotype) is inflated by a constant multiplicative factor λ. Devlin and Roeder therefore proposed estimating this 'genomic control' factor λ on the basis of a sample of observed test statistics, and producing a set of adjusted test statistics by dividing each of the observed test statistics by λ. This suggestion is closely related to the now popular approach 12 of using quantile-quantile (Q-Q) plots to investigate whether there is any evidence for population stratification (or indeed some other phenomenon that causes departure from the expected genome-wide distribution of test statistics) in a GWAS. If the resulting plot of the observed test statistics versus their expected values shows sufficient departure from the line of equality, then each observed test statistic can be adjusted by dividing it by an estimate of λ. The bootstrap procedure proposed by Wang et al. 1 for estimating h and f would therefore appear to fall into this general framework. Thus, while theoretical arguments would suggest that a data-adaptive estimation of h and f might most naturally depend only on the data at the SNP pair currently under test, the use of data from a large number of other pairs of SNPs to estimate the relevant distribution can perhaps be motivated by comparison to Devlin and Roeder's genomic control procedure. 
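Devlin and Roeder's procedure is easily sketched: λ is typically estimated as the median of the observed 1-df statistics divided by the median of a χ2 variable on 1 df (≈0.455). The statistics below are simulated, hypothetical values, inflated by a known factor so that the estimate can be checked:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(7)

# Simulated genome-wide 1-df association statistics, inflated by a
# constant multiplicative factor (as under population stratification).
true_lambda = 1.3
stats = true_lambda * rng.chisquare(df=1, size=100_000)

# Genomic control estimate: median of the observed statistics divided
# by the median of a chi-squared distribution on 1 df (about 0.4549).
lambda_hat = np.median(stats) / chi2.ppf(0.5, df=1)

# Adjusted statistics: after division by lambda_hat, the median of the
# adjusted statistics matches the null chi2_1 median by construction.
adjusted = stats / lambda_hat
```

The median is used rather than the mean because it is robust to the minority of SNPs with genuine associations.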
Moreover, given that the implementation of the proposed bootstrap procedure actually generates 9 different values of h and f (dependent on the number of observed genotype categories, k), one could argue that the only 'other' SNPs that contribute to determining the distribution for any test pair of SNPs are those that have similar properties to the test SNP, at least in terms of the sparsity of the observed genotype table.
In their paper, Wang et al. compared their proposed W-test to Pearson's χ 2 test, as well as to Multifactor Dimensionality Reduction 13 and to an unspecified (possibly a linear allelic) logistic regression test. Another method, not - as far as we are aware - considered by Wang et al., for comparing the distributions of k = 9 (or possibly fewer) genotype categories between two groups (cases and controls) would be to carry out a saturated logistic regression test on 8 (or possibly fewer) df, i.e. comparing (via a likelihood ratio test) the model

logit(p) = α + β1 I(x1 = 1) + β2 I(x1 = 2) + γ1 I(x2 = 1) + γ2 I(x2 = 2) + i11 I(x1 = 1)I(x2 = 1) + i12 I(x1 = 1)I(x2 = 2) + i21 I(x1 = 2)I(x2 = 1) + i22 I(x1 = 2)I(x2 = 2)

with a null model logit(p) = α, where p denotes the probability of an observation being a case (rather than a control), x1 and x2 denote the genotypes (coded 0, 1, 2) at locus 1 and 2 respectively, and I represents an indicator function. The parameters (β1, β2) and (γ1, γ2) correspond to the main effects of locus 1 and locus 2 respectively, and the four i_st parameters correspond to statistical interaction effects (on the logit scale). The 9 parameters (α, β_s, γ_t, i_st) (where s and t each take values 1 or 2) are essentially reparameterisations of the 9 independent parameters δ_uv obtained when modelling the log odds of disease given genotype (where u and v each take values 0, 1 or 2 according to genotype at locus 1 and 2) as logit(p) = δ_uv - in other words, allowing the log odds (and thus the probability) of disease to take 9 different values according to the genotype category to which an individual belongs. Any completely missing genotype categories in cases and/or controls result in one or more parameters being dropped from the model, thus the method automatically adapts to the sparsity of the observed data (albeit in a different way from Wang et al.'s data-adaptive procedure). 
This logistic regression formulation emphasizes the fact that the 8 df test actually tests for association (which could correspond to main effects, interaction effects, or both), rather than testing for statistical interaction per se.
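A convenient consequence of the alternative model being saturated (one fitted case-probability per observed genotype category, against a single intercept under the null) is that this 8 df likelihood ratio test can be computed directly from the 2 × 9 table, without iterative model fitting: the statistic is exactly the likelihood-ratio χ2 (G2) test of independence. A sketch:

```python
import numpy as np
from scipy.stats import chi2

def lr8_from_table(case_counts, ctrl_counts):
    """Likelihood ratio test of the saturated logistic model against
    the intercept-only model, computed directly from the 2 x k table.

    The statistic equals the G^2 (likelihood-ratio chi-squared) test
    of independence; df = (number of non-empty categories) - 1, so the
    test automatically adapts to sparse data, as described in the text.
    """
    obs = np.array([case_counts, ctrl_counts], dtype=float)
    obs = obs[:, obs.sum(axis=0) > 0]          # drop empty categories
    expected = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / obs.sum()
    with np.errstate(divide='ignore', invalid='ignore'):
        # Cells with a zero observed count contribute zero to G^2
        terms = np.where(obs > 0, obs * np.log(obs / expected), 0.0)
    g2 = 2 * terms.sum()
    df = obs.shape[1] - 1
    return g2, df, chi2.sf(g2, df)
```

The equivalence follows because the maximised log-likelihood of the saturated model depends only on the per-category case/control counts, and that of the null model only on the table margins.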
This 'prospective' logistic regression model (modelling the log odds of outcome or phenotype, given exposure or genotype) is in contrast to Wang et al.'s 'retrospective' model, which models (the ratio of) the log odds for genotype given phenotype. The prospective model is most natural in the context of cohort studies or clinical trials, but is arguably less natural in the context of case control studies, where subjects are ascertained based on their phenotype (case or control status). However, it has been shown 14 that valid estimates of the parameters of interest (β s , γ t , i st ) -which correspond to ratios of the log odds for phenotype (i.e. the log odds of being a case rather than a control) at different levels of the exposure variables -are achieved when this prospective model is applied to retrospectively ascertained case/control data. This convenient property has resulted in the enduring popularity of logistic regression as the standard method of choice in the epidemiological literature for analysing case/control data.

Methods and Results
Application of W-test to real and simulated example data sets
To compare the newly-proposed W-test with other, more standard, analysis options, we applied the W-test and three alternative methods (Pearson's χ 2 test using two alternative implementations, and logistic regression on 8 df (denoted LR8)) to two different data sets. The first data set was distributed with the W-test software (R version) developed by Wang et al. 1 . This example data set consists of 50 SNPs (resulting in 1225 SNP pairs) genotyped in 1000 individuals, and presumably corresponds to a single replicate of the simulated data (simulated under an interaction model, using real genotype data) described by Wang et al. 1 . The second data set was a simulated data set that we constructed ourselves using real genotype data from Wellcome Trust Case Control Consortium 2 (WTCCC2) controls 15 , with phenotype (case/control status) simulated under the null hypothesis of no difference in genotype distribution between cases and controls. Specifically, we selected 1000 female founder individuals and randomly assigned 500 as cases and 500 as controls. Real genotype data were selected at 50 SNPs for these individuals by first LD pruning using the PLINK 16 command "--indep-pairwise 50 5 0.5" and then extracting the first and last five SNPs from chromosomes 1 to 5, giving 50 SNPs in total.

We can see from Figure 1 and Figure 2 that the -log 10 P-values achieved by the different methods are highly correlated and largely comparable in both data sets, except when Pearson's χ 2 test fails to give a result (indicated in the plots by a cross and a -log 10 P-value that we set arbitrarily to 4.5 or 28); this can occur with the χ 2 full (CHI-f) implementation when one or more of the nine genotype categories does not appear. However, in these situations, the W-test, logistic regression and the reduced χ 2 (CHI-r) implementation all produce a result, and their results are seen to be largely comparable. 
The W-test does show slightly lower P-values at the most stringent significance thresholds when applied to the W-test demo data, suggesting a possible power advantage for the W-test over logistic regression and CHI-r for data generated under the simulation model assumed by Wang et al.
The W-test also shows slightly lower P-values at the most stringent significance thresholds when applied to the WTCCC2 data. However, since these data were simulated under the null hypothesis of no association (or interaction), we cannot interpret this behaviour as implying higher power for the W-test.

Application of W-test to simulated data sets
To further compare the performance of the W-test with previously proposed tests of association and/or interaction, we simulated 1000 replicates of genotype data for 500 cases and 500 controls at two diallelic loci with alleles denoted A and a (at locus 1) and B and b (at locus 2) respectively, under various generating models. For simulating under the null hypothesis, a higher number of replicates (5000) was used. Initially we assumed allele frequencies of (0.4, 0.6, 0.4, 0.6) for alleles (a, A, b, B) and assumed no linkage disequilibrium (LD) between the loci; these settings ensured that none of the tests were affected by issues of sparse data, so all tests could be computed in all simulation replicates.
For the W-test, the bootstrap procedure proposed by Wang et al. 1 is not possible when only two SNPs are being evaluated. We therefore considered two alternative approaches for specifying h and f. In the first (denoted as W), we assumed the test pair of SNPs had come from the real Wellcome Trust Case Control Consortium (WTCCC) 12 data set analysed by Wang et al. 1 , and used values of h and f estimated from those genome-wide data. In the second (denoted as W′), we used the default values of h and f specified within the W-test software package (R version), i.e. h = (k − 1)/k and f = k − 1 (where k is the number of genotype categories observed).
In addition to performing the W-test, Pearson's χ 2 test (full and reduced versions) and logistic regression on 8 df (LR8), we also considered three additional logistic regression-based tests:

• LR3: logistic regression on 3 df, comparing (via a likelihood ratio test) the models logit(p) = α + βx1 + γx2 + ix1x2 and logit(p) = α

• LR1: logistic regression on 1 df, comparing (via a likelihood ratio test) the models logit(p) = α + βx1 + γx2 + ix1x2 and logit(p) = α + βx1 + γx2

• LRI: logistic regression on 1 df with only an interaction term, comparing (via a likelihood ratio test) the models logit(p) = α + ix1x2 and logit(p) = α

These logistic regression-based tests all use an allelic (rather than a genotypic) coding of the genotype variables in order to reduce the df 6,8,17 . This assumption of allelic effects has proved extremely effective in GWAS analysis of single SNPs 12 and is generally the default option used in most GWAS; even if the true effects do not precisely follow an allelic model, the reduction in df achieved can lead to higher power for tests that make this assumption 18 . The overall power of the W-test is seen to be similar to that of LR8 and Pearson's χ 2 test; in some cases (e.g. Figure 3, bottom panels) the power of the W-test is higher than that of LR8 and Pearson's χ 2 test, whereas in other cases (e.g. Figure 3, middle and top right panels) it is lower. The W-test with values of h and f estimated from genome-wide (WTCCC) data (denoted W) gave consistently slightly higher power than the W-test using the default values (denoted W′). 
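The allelic tests are straightforward to compute by direct maximisation of the logistic likelihood. The sketch below, with simulated, hypothetical data, implements the LR3 comparison; LR1 and LRI follow simply by changing the two design matrices:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import chi2

def logistic_lrt(X_alt, X_null, y):
    """Likelihood ratio test between two nested logistic models.

    X_alt / X_null: design matrices without the intercept column
    (None = intercept-only); y: 0/1 phenotype. Fitting is by direct
    maximisation of the log-likelihood (a sketch; a dedicated package
    such as statsmodels would do the same job).
    """
    def negloglik(beta, X):
        eta = X @ beta
        # Negative logistic log-likelihood, written stably:
        # sum over i of log(1 + exp(eta_i)) - y_i * eta_i
        return np.sum(np.logaddexp(0, eta) - y * eta)

    def fit(X):
        X1 = (np.column_stack([np.ones(len(y)), X])
              if X is not None else np.ones((len(y), 1)))
        res = minimize(negloglik, np.zeros(X1.shape[1]),
                       args=(X1,), method='BFGS')
        return -res.fun, X1.shape[1]

    ll_alt, p_alt = fit(X_alt)
    ll_null, p_null = fit(X_null)
    stat = 2 * (ll_alt - ll_null)
    df = p_alt - p_null
    return stat, df, chi2.sf(stat, df)

rng = np.random.default_rng(3)
n = 1000
x1 = rng.integers(0, 3, size=n)   # allele count at locus 1 (0, 1, 2)
x2 = rng.integers(0, 3, size=n)   # allele count at locus 2 (0, 1, 2)
# Phenotype simulated with a main effect of locus 1 only
p_case = 1 / (1 + np.exp(-(-0.5 + 0.6 * x1)))
y = rng.binomial(1, p_case)

# LR3: alpha + b*x1 + g*x2 + i*x1*x2  versus  alpha alone (3 df)
X_alt = np.column_stack([x1, x2, x1 * x2])
stat, df, pval = logistic_lrt(X_alt, None, y)
```

Because the generating model here contains a main effect only, LR3 detects the association despite there being no interaction, illustrating that these are tests of association allowing for interaction rather than tests of interaction per se.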
For simulations under complex effects, where the generating model is not well-approximated by an allelic model (Figure 3, bottom panels), the three genotypic tests (W-test, χ 2 test, LR8) of association, allowing for interaction, all have higher power than the corresponding allelic tests (LR3, LRI), with the W-test (with values of h and f estimated from genome-wide data) showing the overall highest power. However, for simulations where the structure of the generating model matches an allelic model (Figure 3, middle and top right panels), the allelic tests show higher power than the genotypic tests.
Similar results to those described above were found when we repeated our simulations under further scenarios where the two loci were assumed to be in LD (data not visualised). We note that neither our simulations nor those presented by Wang et al. can be considered a fully comprehensive evaluation; given the effectively infinite number of possible association models that could exist between two SNPs and a disease phenotype, such an evaluation would be quite hard to achieve. It is thus very difficult to know in practice, in any given situation, which method is likely to have the highest power, making it difficult to specify in advance which test to use. The simulation results presented here are not intended to address the question of how often the underlying association model will be more or less amenable to testing using the W-test compared to alternative methods, but rather to point out that there are situations where existing alternative approaches show higher power. Given that this power difference was most profound at low MAFs, we wondered if it could be explained by the fact that, at low MAFs, we expect many χ 2 tests to be undefined (as demonstrated in Figure 1 and Figure 2) if implemented based on the full genotype table (CHI-f implementation). If one uses the CHI-f implementation and counts an undefined result (an 'NA') as a non-detection, while using the total number of simulation replicates (detections and non-detections) as the denominator, then this will result in a decrease in both power and type I error. We illustrate this phenomenon in Figure 4. Here we repeated our simulations using lower MAFs of 0.1 for the two SNPs and considered a variety of different scenarios both with and without LD. We denote by CHI-f′ the power obtained when one counts an undefined result as a non-detection, while using the total number of simulation replicates as the denominator. We denote by CHI-f the power obtained when one instead ignores undefined results, using the number of detections as the numerator and the number of simulation replicates in which a result was obtained as the denominator. Figure 4 demonstrates that the CHI-f′ method of counting does indeed result in an apparent lower power for Pearson's χ 2 test, although it is unclear whether this effect is sufficient to explain the differences in power presented by Wang et al. The allelic tests again show the highest power overall (Figure 4), reflecting the fact that the generating model in these simulations follows an allelic structure.
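The two counting conventions involve only simple arithmetic; a toy numerical illustration, with hypothetical counts chosen only to show the calculation:

```python
# Toy illustration of the two ways of counting power when some
# simulation replicates give an undefined (NA) test result.
n_replicates = 1000
n_detections = 450   # replicates where the test ran and rejected H0
n_undefined = 100    # replicates where the statistic was undefined

# CHI-f': undefined results count as non-detections, with the total
# number of replicates as the denominator.
power_chif_prime = n_detections / n_replicates            # 0.45

# CHI-f: undefined results are excluded from the denominator.
power_chif = n_detections / (n_replicates - n_undefined)  # 0.50
```

The same distinction applies to empirical type I error, which is why the CHI-f′ convention deflates both power and type I error simultaneously.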
As a final comparison of methods under a situation that the W-test is specifically designed to address, we performed an additional set of simulations for pairs of SNPs with very low MAFs (MAF≈0.01 in controls, MAF≈0.03 in cases), in strong LD (R 2 = 0.64 in controls, R 2 = 0.83 in combined cases and controls) and operating via complex effects (see generating model shown in the bottom left panel of Figure 3). The results are shown in Figure 5. In this instance, we can see that Pearson's χ 2 test based on the full genotype table (CHI-f) is again hampered by sparse data.

Application of W-test to real WTCCC bipolar disorder data
Examination of the P-values shown in Table 2 (and comparison of these P-values to those given in the original WTCCC publication 12) sounds an immediate warning note: these P-values seem suspiciously small for a modestly-sized GWAS of a complex neuropsychiatric disorder, and, more importantly, do not seem compatible with the results presented in Figure 4, Table 3 and Supplementary Figure S5. Although the most extreme single-SNP results are truncated in Supplementary Figure S5 (owing to the choice of an upper limit of 10 for the y axis), the large number of remaining isolated significant SNPs, not supported by other SNPs in the same genetic region in LD with the significant SNP, suggests that these significant single-SNP results are highly likely to be artefacts, most likely due to genotyping errors.
Together with raw genotype data, the WTCCC distributes a list of SNPs (exclusion-list-snps-26_04_2007.txt) that have failed various quality control checks and should therefore be considered unreliable. We compared the SNPs listed in Table 2 with those appearing in exclusion-list-snps-26_04_2007.txt and found that all but three of the rows of Table 2 contain at least one SNP that appears on the WTCCC recommended exclusion list. An additional list of 561 SNPs that failed manual visual checking of the cluster (intensity) plots was obtained from the WTCCC (Jeffrey Barrett, personal communication; Supplementary File 1). This list includes rs1048194, which appears in each of the remaining three rows of Table 2. Thus, every SNP pair identified by Wang et al. as significant in a pairwise W-test (of association allowing for interaction) in the WTCCC data contains at least one SNP that has been flagged as unreliable by the WTCCC. We therefore consider the WTCCC results presented by Wang et al. as highly suspect and likely to be explained by poor quality genotyping in cases, controls or both. We note that it is not necessary to obtain such lists of SNPs failing quality control from the WTCCC in order to spot this problem; visual inspection of the Manhattan plots (and comparison to the original WTCCC publication) is enough to at least flag up the problem, and implementation of standard GWAS quality control procedures would, in any case, eliminate the vast majority of these suspect SNPs.
The remaining row of GAIN results (see Table 3) contains SNP_A-2050329 and SNP_A-8715766, corresponding to rs3787282 and rs17070836 on chromosomes 20 and 8, respectively. The pairwise W-test P-value is given as 5.70E-11, with single-SNP P-values of 0.00125821 and 0.00749938. Although we have no particular reason to distrust this result, the fact that it is being interpreted as a replication of an interaction between PTPRT and CSMD1 (genes which appear in our list of untrustworthy 'significant' WTCCC results shown in Table 2) means that, at best, we would consider this as an isolated finding requiring further replication.

The legend to Figure 3 of Wang et al. 1 states that the W-test was computed on real genome-wide data with 'permuted phenotype'. This seems an odd procedure; Q-Q plots are generally plotted using the real observed results (based on the real phenotypes) in order to determine whether the genome-wide distribution of test statistics is as expected 12 . By definition, if one permutes the phenotype (to mimic data generated under the null hypothesis), one would indeed expect the Q-Q plot to follow the line of equality, but this does not provide any information about whether the observed results follow their expected distribution (and may thus be considered reliable). On closer investigation of the W-test software (R version) developed by Wang et al. 1 , we found that the Q-Q plot function takes as input genotype data, but not phenotype data, and does indeed calculate W-test results using randomly permuted phenotypes with an equal number of cases and controls. Therefore, the function does not actually plot a Q-Q plot of the real observed results. As an example, Figure 6 shows the W-test demo data plotted as a standard Q-Q plot and then plotted three times using the Q-Q plot function from the W-test (R package) software. 
In contrast to the true W-test demo data results (which are highly significant, presumably because of the simulation model chosen by Wang et al.), the Q-Q plots generated within the W-test software suggest (somewhat misleadingly) that the observed test statistics are largely consistent with the null hypothesis of no association.

In a real-data application, Wang et al. apply their W-test to genome-wide association data for bipolar disorder and highlight a number of significant detections of pairwise epistasis. We have not ourselves re-analysed these data using alternative methods, but, given the high level of similarity between the W-test and alternative methods found in analysis of real (Figure 1 and Figure 2) and simulated (Figure 3-Figure 5) genotype data, we anticipate that similarly significant results would be obtained when applying alternative methods such as logistic regression to the bipolar data. Unfortunately, all but one of the results presented by Wang et al. can most likely be attributed to SNP genotyping error (resulting in highly significant single-SNP P-values for one or both SNPs of a pair) and so are probably artefactual. We therefore consider the subsequent network analysis performed by Wang et al. of the identified interactions as being, at best, uninterpretable and, at worst, potentially highly misleading, and we urge researchers to exercise caution in their interpretation of these results. This warning illustrates the importance of using standard quality control checks (such as Q-Q and Manhattan plots of real observed test statistics) when analysing genome-wide association data, even when the primary focus of a study is on presenting novel methodology.
In summary, our investigations of the W-test suggest that this test remains an attractive option for testing for association, while allowing for interaction (although it does not test for pure interaction -the test can be sensitive to main effects of any or all SNPs included in the model), and it does offer some advantages over alternative tests (such as Pearson's χ 2 test and logistic regression on 8 df) at lower minor allele frequencies. The fact that in some scenarios the W-test shows higher power, whereas in other scenarios alternative approaches show higher power, makes it difficult to specify in advance which test should be preferred. One attraction of logistic regression is its flexibility, allowing one to tailor the test to impose greater structure on the assumed alternative hypothesis (such as assuming multiplicative allelic effects), which can offer a power advantage when the imposed structure matches the true underlying data structure. Extensions of the W-test that would allow similar flexibility might be an interesting topic for further investigation.

Data availability
Referee report

Howey and Cordell have done an elegant job in assessing the W-test method recently published as a powerful approach for detecting pairwise epistatic interactions. Their comprehensive assessment covered the properties and caveats of the W-test, the key results of simulation and real data analysis in the original paper, as well as additional comparisons of the W-test against a set of existing methods. The assessment led them to conclude that the W-test had virtually no advantages over the existing methods in explicitly testing pairwise epistatic interactions, but could be a useful alternative in GWAS for detection of low frequency or rare variants. I would like to thank the authors for their efforts, because detection of epistasis or statistical gene-gene interaction has been challenging for a long time, and is longing for not only innovative but also robust methods and examples (Zietz H., Am J Hum Genet 100: 379-84, 2017). The study is well designed, well executed and technically sound. The paper is well written and easy to follow. I hope the comments below are helpful to improve the paper:

Slightly surprisingly, neither the current nor the original W-test paper cited the recent review of detecting epistasis in human complex traits (Wei et al., Nat Rev Genet 15: 722-33, 2014), where methods for testing interactions, rather than the whole pair effects, have been discussed. Issues of LD, marginal/main effects and capturing rare or low frequency variants have also been discussed in the review. Addition of the review in citation would make the discussion more interesting.
It seems important to make clear that testing epistasis means testing the interaction terms, not the overall effects of the whole pair. From that perspective, it might help to slightly reshape the presentation of the power results; e.g. should the high power of the W-test when no interactions are simulated be interpreted as a disadvantage or problem instead? Nonetheless, it is good to point out that the W-test can be a useful filter to select candidate pairs for explicit interaction tests, at the price of missing true interactions without important main effects.
It seems also important to make clear that two markers in high LD carry little epistasis, although the pair could be statistically significant through haplotype effects. The property that a pair of closely located markers may capture rare haplotypes in the form of significant statistical epistasis has been explored for identifying rare variants and/or functional regulatory mechanisms (Wei et al., PLoS One 8: e71203, 2013; Hum Mol Genet 23: 5061-8, 2014). The W-test seems to have advantages in this aspect, particularly in situations of multiple markers.

If applicable, is the statistical analysis and its interpretation appropriate? Yes

Are all the source data underlying the results available to ensure full reproducibility? Yes

Are the conclusions drawn adequately supported by the results? Yes

Competing Interests: No competing interests were disclosed.

Referee Expertise: Statistical Genetics

I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.