Imperfect Linkage Disequilibrium Generates Phantom Epistasis (& Perils of Big Data)

The genetic architecture of complex human traits and diseases is affected by large number of possibly interacting genes, but detecting epistatic interactions can be challenging. In the last decade, several studies have alluded to problems that linkage disequilibrium can create when testing for epistatic interactions between DNA markers. However, these problems have not been formalized nor have their consequences been quantified in a precise manner. Here we use a conceptually simple three locus model involving a causal locus and two markers to show that imperfect LD can generate the illusion of epistasis, even when the underlying genetic architecture is purely additive. We describe necessary conditions for such “phantom epistasis” to emerge and quantify its relevance using simulations. Our empirical results demonstrate that phantom epistasis can be a very serious problem in GWAS studies (with rejection rates against the additive model greater than 0.28 for nominal p-values of 0.05, even when the model is purely additive). Some studies have sought to avoid this problem by only testing interactions between SNPs with R-sq. <0.1. We show that this threshold is not appropriate and demonstrate that the magnitude of the problem is even greater with large sample size, intermediate allele frequencies, and when the causal locus explains a large amount of phenotypic variance. We conclude that caution must be exercised when interpreting GWAS results derived from very large data sets showing strong evidence in support of epistatic interactions between markers.

definite evidence of epistasis at causal loci. Specifically, when the SNPs used in association analyses are in imperfect LD with the alleles at causal loci, linear regression on SNPs may lead to unaccounted variance, or missing heritability (e.g., Manolio et al. 2009;de Los Campos et al. 2015). When this unaccounted additive signal is correlated with interaction contrasts, the "illusion" of epistasis is created even for traits that are purely additive.
Several authors have expressed concerns about the role that LD can have on the detection of epistasis (Wood et al. 2014;W.-H. Wei, Hemani, and Haley 2014). However, these problems have not been quantified nor have they been given a precise mathematical treatment. In this study, we present a simple three locus model involving a causal (unobserved) locus and two markers that makes explicit how phantom epistasis may emerge even in systems that are strictly additive. We use this model to derive a set of conditions that are necessary for the occurrence of phantom epistasis, and quantify the magnitude of the problem using simulations based on real human genotypes from the UK-Biobank. The existence of phantom epistasis is also studied using extensions of the model that include dominance and multiple loci. Our results indicate that imperfect LD can lead to seriously inflated type-I error rates. We also show that the rate of detection of phantom epistatic interactions increases with sample size; this should be considered when testing for epistatic interactions using big data sets such as the ones that are becoming available.

MATERIALS AND METHODS
We begin by considering a simple model with three biallelic loci. One of them, denoted as z i , represents a causal locus (also referred as to the 'quantitative trait locus', QTL) on observation i and has a direct effect on the expression of the phenotype y i . The other two loci, denoted as x 1i and x 2i , are markers that are possibly in LD with the QTL but have no causal effect on the phenotype. For SNPs, a standard practice is to code genotypes by counting at each of the loci the number of copies of a reference allele carried by the i th individual. Here, to facilitate the presentation we assume that genotypic codes and phenotypes are expressed as deviations from their corresponding means; therefore Eðz i Þ ¼ Eðx 1i Þ ¼ Eðx 2i Þ ¼ Eðy i Þ ¼ 0. In this setting, a single locus strictly additive model takes the form where b is the additive effect of an allele substitution at locus z, and d i is an error term. Evidently, with only one causal locus there is no epistasis. Next, suppose that an instrumental regression of the form is used to investigate the presence of epistasis. Here, the b ' s are regression coefficients that are functions of the QTL effect (b) and of the (multilocus) LD involving the two markers and the QTL genotypes.
In the population, given the centered genotype codes, the regression coefficients entering in the right-hand-side of [2] are 2 6 4 Required conditions for phantom epistasis If the QTL is in LE with the two markers, then pðz i ; Therefore, all elements of the right-hand-side of [3] are equal to zero and, thus b 1 ¼ b 2 ¼ b 12 ¼ 0. Therefore, for phantom epistasis to emerge a first necessary condition is that the QTL must be in LD with at least one of the SNPs. (This condition is a special case of a more general condition that we discuss below.) On the other extreme, if there is perfect LD between the QTL and the marker pair ðx 1i x 2i Þ, then the QTL genotype can be expressed as a linear function of the two marker genotypes z i ¼ x 1i b 1 þ x 2i b 2 . In this case, a linear regression on the two markers captures fully the QTL variance and therefore the interaction term will be equal to zero. (A derivation of this intuitive result is presented section 2 of the Supplementary Methods.) Therefore, perfect LD is a sufficient condition for b 12 ¼ 0. Consequently, a second necessary condition for phantom epistasis to emerge is imperfect LD between the QTL and the marker pair. This guarantees that some fraction of the QTL variance is not captured by linear regression on the two marker genotypes. Furthermore, if the left-out QTL signal is not orthogonal to the interaction contrast x 1i x 2i , then b 12 6 ¼ 0: Consider now an intermediate case where one of the markers (say x 2i ) is independent of the pair formed by the QTL and the other marker ðz i ; x 1i Þ. This implies that pðz i ; x 1i ; x 2i Þ ¼ pðz i ; x 1i Þpðx 2i Þ. Under this condition, because the two markers are in LE, the coefficient matrix and its inverse (T 21 ) is diagonal; therefore, b 12 ¼ Therefore, a third necessary condition for phantom epistasis to emerge is that the three loci must be jointly in LD.
In conclusion, in the system discussed above, phantom epistasis can emerge if the three loci are in mutual but imperfect LD. Unfortunately, this condition cannot be assessed when the QTL genotype is unknown. Empirically only LD between the two markers can be assessed.

Phantom epistasis in multi-locus models
In an additive multi-locus model expression [1] becomes If one of the markers is in LE with the other marker and with all the QTL then Eðx 1i x 2i z ij Þ ¼ 0 " j, thus there will not be phantom epistasis.
In the presence of dominance, the causal (single locus) model becomes y i ¼ az i þ dz 2 i þ d i where a and d are additive and dominance values, respectively. If the empirical model of expression [2] is used to test for epistatic interactions then the left-hand-side of expression [3] remains unchanged, but the right-hand-side becomes 2 indicating that both dominance and additive effects can indeed contribute to phantom epistasis. In this system, phantom epistasis will emerge whenever Eðx 1i x 2i z i Þa þ Eðx 1i x 2i z 2 i Þd 6 ¼ 0: The conditions needed for phantom epistasis to emerge in a system involving dominance are the same as the ones described for the additive model. These include, first, imperfect LD between z i and z 2 i with the marker pair ðx 1i ; x 2i Þ such that either z i or z 2 i or both cannot be fully explained by a linear combination of the two markers. Second, phantom epistasis requires mutual LD at the three loci. If one of the markers (say x 2i ) is independent of the other-marker-QTL pair, then,

Simulation studies
The analytical results presented in the previous section indicate that multi-locus LD plays an important role in determining whether phantom epistasis may emerge. To shed light on the nature and the magnitude of the problem we conducted Monte Carlo simulations using real human genotypes of distantly related white Caucasian individuals from the UK-Biobank. In a first set of simulation scenarios, we generated data according to an additive model with a single causal locus (as in [1]) that explained either 0.5 or 1% of the phenotypic variance. The position of the QTL genotype z i was determined by randomly choosing a marker position on human chromosome 1; marker x 1i was always adjacent ("to the left") to the QTL whereas the other marker (x 2i ) was placed at increasing base pair distance ("to the right") of the QTL.
We analyzed the simulated phenotypes using an instrumental model such as the one in [2] extended with inclusion of an intercept and the top 5 SNP-derived PCs to avoid confounding due to any substructure that may be present (see section 3 of Supplementary Methods for details). The null hypothesis ðH 0 : b 12 ¼ 0Þ was rejected at a 0.05 significance level; therefore, rejection rates over 0.05 were interpreted as indicative of phantom epistasis. All analyses were done with R (Rcore development team (2012) using the BGData R-package (Grueneberg and de los Campos, 2019).
Since the power to detect a non-null interaction effect depends on sample size we analyzed data using a sample sizes of n = 10K, 50K, 100K and 250K (K = 1,000).

Alternative simulation scenarios
SNPs in different chromosomes: To assess the potential impact of long-range LD, in a new analysis setting we maintained the QTLproximal-marker pair (x 1i ; z i ) in chromosome 1 as in our main simulation scenario but now positioned the distal marker (x 2i ) in a randomly chosen position on chromosome 2. In this data set, we do not expect high levels of LD between SNPs in different chromosomes; however, at least in theory, this could happen if selection favors combinations of alleles, thus inducing LD at physically unlinked loci (Bulmer 1971).

Multi-locus models:
To consider the problem of phantom epistasis in multi-locus models we first assumed that the genetic architecture of the trait was determined by a major QTL plus an infinitesimal effect that was strictly additive. The trait heritability was 0.5, the main QTL explained 1% of the variance and the infinitesimal component the remaining 49% (see Section 3 of the Supplementary Methods for further details).
Second, we simulated data using a model that contains three marker-QTL pairs: ðx 1i ; z 1i Þ, ðx 2i ; z 2i Þ and ðx 3i ; z 3i Þ. As before, the trait was strictly additive with the genetic effect, P 3 j¼1 z ij b j , explaining 1% of the phenotypic variance (each QTL explained 1/3 of the total genetic variance). The three pairs were in chromosome 1. Within a pair, the marker was the SNP immediately adjacent to the QTL in the array. To study the effects of LD we moved pairs 1 and 3 further apart and always maintained the 2 nd pair in the middle point (based on base-pair distance). The empirical model used to test for interaction was as that in expression [2] with x 3i in place of x 2i ; that is we tested for phantom epistasis between SNPs in the first and third pair. This scenario aimed at describing problems that may emerge due to an unaccounted QTL (z 2i in this case) which may be in simultaneous LD between the two SNPs involved in the interaction.

Data availability
The genotypes used in the simulation were from the UK Biobank.

RESULTS
We begin presenting results from our main simulation scenario which involves a single-locus with two SNPs, one proximal (x 1i Þ and one distal (x 2i Þ to the QTL. Figure 1 shows measures of linkage disequilibrium between the three loci ðz i ; x 1i ; x 2i Þ involved in the system. The average (across Monte Carlo replicates) proportion of variance of the QTL (z i ) explained by the most adjacent marker (x 1i ) was about 0.085; however, the distribution of this statistic is highly skewed. When x 1i and x 2i were the two flanking markers of the QTL, on average they jointly explained 15% of the QTL variance. Therefore, on average there was a sizable rate of "missing" heritability. The R-sq. between x 2i and either the other marker or the QTL, falls very quickly for lags between 0-0.5Mb and reached near zero values at approximately 1 Mb (Figure 1). Figure 2 displays empirical rates of rejection by BP-distance between the QTL and the distal marker (x 2i ), by sample size and by proportion of variance explained by the QTL. In absence of phantom epistasis, the empirical rejection rate should be very close to the significance level (0.05). For the largest sample size, the curve relating empirical rejection rates with BP distance was clearly above 0.05 for distances of up to 2MB. In all the scenarios, the highest rejection rates were observed when x 2i and the QTL were at a distance of about 0.15 MB; here the empirical rejection rate was 0.13 when the QTL explained 0.5% of the phenotypic variance and as high as 0.17 when the QTL explained 1% of the variance. The latter (0.17) is more than three times the nominal rate of rejection expected under the absence of phantom epistasis (0.05). The curves relating empirical rejection rates with physical distance reach the nominal rejection rate of 0.05 at 1Mb for n = 10,000; however, for larger sample size the curves stayed above 0.05 even for distances longer than 1Mb. Figure 3 displays another way of viewing the simulation results of Figure 2 where the average rejection rate is calculated within bins of R-sq. between the two markers. When the two markers were uncorrelated, rejection rates were very close to 0.05 indicating absence of phantom epistasis. However very small LD between the two markers generates considerably higher rejection rates: an R 2 ðx 1i ; x 2i Þ 0:1 leads to rejection rates as high as 0.28 with the largest sample size when the proportion of variance explained by the QTL was 1%. The maximum rejection rates occur when the R-sq. between markers is between 0.1 to 0.2. Beyond this value in the range (0.2-0.9) rejection rates follow a linear decline.
Finally, Figure 4 shows the empirical rejection (for the scenario where the QTL explained 1% of the phenotypic variance) rates by minor-allele frequency and the R-squared between the two markers involved in the interaction. For any given bin of minor-allele frequency, the rejection rate was close to 0.05 when R-squared was close to zero, it increases reaching a maximum for R-squared values of 0.1-0.3, and then shows a decline reaching again values close to 0.05 when R-squared is larger than 0.9. These patterns are the same as the ones displayed in Figure 3. Interestingly, within a bin of R-squared, when phantom epistasis existed (i.e., when R-squared was neither null nor perfect) rejection rates were maximum for intermediate allele frequencies and decreased with extreme allele frequencies. This is likely a consequence of the fact that intermediate allele frequencies confer higher power to the test.
The results from the model with SNPs in different chromosomes are provided in Table S1. In all cases, the rejection rates were very close to 0.05 suggesting that phantom epistasis between SNPs on different chromosomes, if it exists, it occurs at a rate that does not induce an inflation of rejection rates above the significance level. Figure S1 summarizes the results from the simulation scenarios that included an infinitesimal effect. The empirical rejection rates were almost identical to those obtained, for the same sample size, with the single-QTL model (compare Figure S1 with the curves for N = 50K and 250K in the left panel of Figure 2). This suggests that the interaction contrasts were quasi orthogonal to the infinitesimal effect, which is probably a consequence of the short span of LD in this population.

Multi-locus models
The results from a scenario involving three marker-QTL pairs are summarized in Figure S2. The rates of rejection in this simulation scenario were very similar to the ones observed in the single QTL model (compare Figure S2 with curves corresponding to N-50K and N = 250K in the left panel of Figure 2) with mildly elevated rejection rates when the distance between pairs 1 and 3 was between 0.1-1Mb, due to the non-zero term b 2 Eðx 1i x 3i z 2i Þ of the right-hand-side of [3], that causes b 12 6 ¼ 0. However, as the distance between the markers involved in the interaction (i.e., the 1 st and 3 rd marker) increased, the estimated rejection rates approached the nominal rejection rate. Specifically, when the distance between the 1 st and the 2 nd pair is of 2Mb or larger the three pairs become independent and phantom epistasis vanishes.

DISCUSSION
There is a substantial amount of literature reporting the presence of epistasis affecting complex traits but results, when scrutinized, have been controversial. Sometimes the controversy spawns from the suspicion that epistatic interactions may be capturing additive signals that were missed by the baseline additive model used to test interactions. For instance, Hemani et al. (2014) identified 30 pairs of SNPs that interact influencing gene expression and that were replicated across two independent studies. In a subsequent study (Wood et al. 2014) replicated many of the interactions reported by Hemani et al.; however, in each case, using sequence data, a single third variant could explain all the apparent epistasis. This happened even after removal of all pairs of SNPs with r 2 , 0:1 which was suggested by Wei, Hemani, and Haley (2014) as a means to minimize confounding due to haplotype effects.
However, the problem of why and under what conditions additive effects may generate "epistatic signals" has not be formalized. In this work, we considered a simple three locus model to reveal conditions that lead to phantom epistasis. We show that phantom epistasis emerges in the presence of simultaneous but imperfect mutual LD between the three loci (the QTL and the two markers involved in the Figure 1 Average R-squared between pairs of loci and proportion of variance of the QTL genotype explained by the two markers, R 2 ðz i x 1i þ x 2i Þ, vs. distance between the QTL (z i ) and the distal marker (x 2i ). Marker x 1i was always adjacent to the QTL. interaction). This conceptually simple three loci model can be extended to more complex settings (e.g., multiple QTL-marker pairs) without affecting the underlying source of the principle: if additive QTL variance is imperfectly captured by linear regression on markers and the unexplained variation is not orthogonal to interaction contrasts, then phantom epistasis emerges.
We stress that simultaneous LD between the triplet ðx 1i ; x 2i ; z i Þ is required for phantom epistasis to emerge. At least in theory, it is possible to have cases where the two SNPs involved in an interaction are in LE yet, they are jointly in LD with an unobserved QTL. An example featuring such patterns is discussed by Wei et al. (2013) who present a model with dominance at a causal locus that generates phantom epistasis between two flanking markers that are marginally independent. In Wei's example the two SNPs are jointly in LD with the QTL and therefore, Eðx 1i x 2i z i Þ and Eðx 1i x 2i z 2 i Þ are nonzero, explaining why the model generates phantom epistasis. On the other hand, if one of the markers, e.g., x 1 , is independent of the remaining pair ðx 2i ; z i Þ; then Eðx 1i x 2i z i Þ ¼ Eðx 1i ÞEðx 2i z i Þ ¼ 0 and phantom epistasis does not not emerge.

Inferences Under imperfect LD
Several authors (M E Goddard 2009, de los Campos et al. 2013de Los Campos, Sorensen, and Gianola 2015;Gianola et al. 2015) have studied the role of imperfect LD on related inferential problems, including missing heritability and whether imperfect LD can lead to estimates of genomic correlations between traits that are different than the underlying genetic correlations . In all these cases, imperfect LD generates inferential difficulties; phantom epistasis is another inferential problem arising when the markers used for inferences are in imperfect LD with causal variants.
Testing interactions among weakly correlated SNPs only (e.g., considering only SNP-pairs with r 2 , 0:1Þ is not a solution: indeed, our simulation results show that weak LD between markers (e.g., R-sq. between 0.05 and 0.1) can lead to large numbers of false discoveries especially when sample size is large. However, our simulations also show that near independence between the two SNPs (e.g., R-sq. ,0.01), a condition that in the data set used in this study was achieved at distances of 1.5Mb-2Mb or longer, was enough in the scenarios tested to guarantee rejection rates very close to the significance level.

Perils of Big Data
If phantom epistasis exists (i.e., if the population coefficient b 12 6 ¼ 0) whether it is detected or not depends on the power of the study which increases with sample size. Our simulation results demonstrate this clearly: pairwise R-sq. of 0.1 between markers and large sample size (e.g., n . 100K) generates clear signs of phantom epistasis. However, rejection rates are not highly elevated over the significance level when sample size was smaller (n = 10k) because at that R-sq. the size of the interaction effect is small and therefore the power to detect such small interaction effect with small sample size is low. Big Data are a blessing for genomic analysis of complex traits; however, in some cases, large sample size can make an inferential problem even more problematic.
The proportion of variance explained by the QTL plays a role similar to that of sample size The larger the amount of variance explained by the QTL, the higher the power to detect phantom epistasis due to imperfect LD. To assess this, we repeated our simulation with a QTL that explained 50% of the phenotypic variance. The results (see Figure S3) showed, as one would expect, higher rejection rates than the one observed when the QTL explained a small fraction of the phenotypic variance (compare Figure 2 with Figure  S3). Importantly, and in agreement with what our model predicts, the rejection rate reaches the significance level when the two SNPs become independent. Figure 2 Empirical rejection rates vs. distance between the QTL and the distal marker, by proportion of variance explained by the QTL (left and right panels) and sample size (curves). In the simulations, a single QTL (z i ) had an additive effect that explained either 1% (left) or 0.5% (right) of the phenotypic variance. The empirical model considered two SNPs with no causal effect. One of them (x 1i ) was adjacent to the QTL and the other one (x 2i ) was placed at increasing distance from the pair (x 1i ; z i ). Rejection of the null hypothesis (no interaction between x 1i and x 2i Þ was conducted at a 0.05 significance level. Empirical rejection rates above 0.05 are indicative of phantom epistasis. Phantom Epistasis in multi-locus models The presence of a polygenic effect in our simulations did not lead to notoriously inflated rejection rates compared with the ones detected with the single locus model suggesting that the interaction contrasts were quasi orthogonal to the infinitesimal effect. This is probably because LD in this population spans over very short distances and therefore it is highly unlikely to find a pair of markers that are strongly correlated with infinitesimal effects emerging from large numbers of loci distributed over the genome. On the other hand, local polygenicity (i.e., the accumulation of many small-effect loci in a region with moderate or high LD) may lead to more serious apparent epistasis.
Local epistasis? Several studies have reported results highlighting the importance of 'local' epistatic interactions (e.g., Wei, Hemani, and Haley 2014;He et al. 2017). From a biological perspective, it is plausible that multiple mutations in a gene may have collectively a larger impact than the simple sum of the effects of each mutation individually. And this could manifest itself as "haplotype effects" (e.g., Haig 2011). However, phantom epistasis provides an alternative explanation for why epistatic interactions detected in GWAS occur between loci that are physically close. Indeed, we show analytically and empirically that LD between SNPs is required for phantom epistasis to appear, thus, phantom epistasis is expected to be predominantly a 'local' phenomena.
Inflation of rejection rates due to incorrect statistical assumptions? Based on an earlier version of this manuscript Dr. P. Visscher (personal communication) commented that under imperfect LD the distribution of the error terms of the empirical model is not normal. Indeed, because the QTL genotype has three levels, under imperfect LD the distribution of the error terms would be a mixture of three normal distributions with different means. This could certainly explain the inflation of rejection rates in small samples because the conditions for the test statistic to follow a t distribution are not met. However, in large samples, the test statistic follows a normal distribution even if errors are not Gaussian (this is simply an application of the Central Limit Theorem). Therefore, with the size of samples used in standard GWA studies, lack of normality of the error term should not be a cause for inflated rejection rates.
A second possible cause for inflated rejection rate of the null hypothesis can be underestimation of the error variance of the empirical model. However, it can be shown that the estimator of the error variance of the empirical model used in the simulations is unbiased with respect to the marginal distribution of the data. Therefore, phantom epistasis cannot be attributed to underestimation of the error variance.
The additive-non-additive conundrum Quantitative genetics studies properties of complex traits using regression analysis. In the field a careful distinction is made between observable and causal features of complex traits. For instance, it is well established that the linear regression of a phenotype on allele content yields estimates of the average effect of allele substitution and that both truly additive as well as dominance and epistatic effects can contribute to allele substitution effects. Furthermore, theoretical and empirical research has demonstrated that highly non-linear systems can generate signals that can often be explained almost completely with a linear model (Hill, Goddard, and Visscher 2008). For this reason, in general, one cannot make causal statements about gene action from observational variance component analyses (e.g., W. Huang and T. F. C. Mackay 2016). Complicating matters even further we show in this study that the opposite can happen: under a purely additive model, imperfect LD can generate non-additive signals! Figure 3 Empirical rejection rates vs. R-squared between the proximal and distal marker, by proportion of variance explained by the QTL (left and right panels) and sample size (curves). The simulation setting here was the same as that in Figure 2: a single QTL (z i ) had an additive effect that explained either 1% (left) or 0.5% (right) of the phenotypic variance. The empirical model considered two SNPs with no causal effect. One of them (x 1i ) was adjacent to the QTL and the other one (x 2i ) was placed at increasing distance from the pair (x 1i ; z i ). Rejection of the null hypothesis (no interaction between x 1i and x 2i Þ was conducted at a 0.05 significance level. Empirical rejection rates above 0.05 are indicative of phantom epistasis. The recognition that phantom epistasis may be an important phenomenon does not negate the relevance of gene-gene interactions at the causal level. It simply stresses the difficulties that one faces when trying to learn about causal features of a system using observational data and inputs (markers) which are proxies for the underlying variants that may have causal effects on traits.
Phantom epistasis: an opportunity to improve predictive performance? In this work, we have stressed that imperfect LD can limit the possibility to learn about causal effects. However, linear and non-linear genomic regressions can be very powerful predictive machines, and it is well established that the model that is best for inferences is not necessarily the best predictive tool. Phantom epistasis creates inferential problems but also opens opportunities for improving prediction models. Indeed, by capturing signals that are missed by an additive model, non-linear models using interactions between markers may increase the amount of genetic variance captured and improve prediction accuracy. This may explain, for instance why some non-linear models such as kernel regressions have shown better predictive performance than additive models, especially in breeding populations with longspan LD and low marker density (de los Campos et al. 2010).