A perspective on interaction effects in genetic association studies

ABSTRACT The identification of gene–gene and gene–environment interactions in human traits and diseases is an active area of research that generates high expectations and most often leads to high disappointment. This is partly explained by a misunderstanding of the inherent characteristics of standard regression‐based interaction analyses. Here, I revisit and untangle major theoretical aspects of interaction tests in the special case of linear regression; in particular, I discuss variable coding schemes, interpretation of effect estimates, statistical power, and estimation of variance explained with regard to various hypothetical interaction patterns. Linking these components, it appears first that the simplest biological interaction models, in which the magnitude of a genetic effect depends on a common exposure, are among the most difficult to identify. Second, I highlight the shortcomings of the current strategy for evaluating the contribution of interaction effects to the variance of quantitative outcomes and argue for the use of new approaches to overcome this issue. Finally, I explore the advantages and limitations of multivariate interaction models over univariate approaches when testing for interaction between multiple SNPs and/or multiple exposures. Together, these new insights can be leveraged for future method development and to improve our understanding of the genetic architecture of multifactorial traits.

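Appendix A: Effect estimates from standardized and unstandardized predictors
Following the notation defined in the main text, the (standardized) outcome can be expressed as:

$$\begin{aligned} Y &= \beta_G G + \beta_E E + \beta_{GE}\, G \times E + \varepsilon \\ &= \sigma_G(\beta_G + \beta_{GE}\mu_E)\, G' + \sigma_E(\beta_E + \beta_{GE}\mu_G)\, E' + \sigma_G \sigma_E \beta_{GE}\, G' \times E' + c' + \varepsilon \end{aligned}$$

where $G' = (G - \mu_G)/\sigma_G$ and $E' = (E - \mu_E)/\sigma_E$ are the standardized predictors and $c'$ is a constant that depends neither on $G$ nor on $E$. This leads to the following relationship between the standardized and unstandardized estimates:

$$\beta'_G = \sigma_G(\beta_G + \beta_{GE}\mu_E), \qquad \beta'_E = \sigma_E(\beta_E + \beta_{GE}\mu_G), \qquad \beta'_{GE} = \sigma_G \sigma_E \beta_{GE}$$

[...] can be compared with the non-centrality parameter (ncp) from the test of $G$ in a marginal model. The marginal effect of $G$, $\beta_{mG}$, is by definition the sum of the main effect of $G$ plus the marginal contribution from interaction terms involving $G$, which equals:

$$\beta_{mG} = \beta_G + \beta_{GE} \times \mu_E$$

The marginal estimated effect of $E$ can be derived similarly and equals:

$$\beta_{mE} = \beta_E + \beta_{GE} \times \mu_G$$

so that the two non-centrality parameters being compared, the latter being the ncp of the marginal test of $G$, can be expressed as follows: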
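Returning to the Appendix A relationship between standardized and unstandardized estimates, the following minimal sketch checks it numerically. It assumes the usual reparametrization, i.e. $\beta'_G = \sigma_G(\beta_G + \beta_{GE}\mu_E)$, $\beta'_E = \sigma_E(\beta_E + \beta_{GE}\mu_G)$ and $\beta'_{GE} = \sigma_G\sigma_E\beta_{GE}$; the allele frequency, exposure distribution and effect sizes are hypothetical, and only the predictors (not the outcome) are standardized here.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
G = rng.binomial(2, 0.3, size=n).astype(float)  # genotype, standard [0,1,2] coding
E = rng.normal(loc=4.0, scale=1.5, size=n)      # non-centered exposure
# Hypothetical effect sizes, chosen only for illustration
Y = 0.10 * G + 0.20 * E + 0.05 * G * E + rng.normal(size=n)


def ols(y, g, e):
    """Least-squares fit of y ~ 1 + g + e + g*e; returns (b_g, b_e, b_ge)."""
    X = np.column_stack([np.ones_like(y), g, e, g * e])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1], coef[2], coef[3]


# Fit with the raw (unstandardized) predictors
b_G, b_E, b_GE = ols(Y, G, E)

# Fit with predictors standardized by their sample mean and standard deviation
mu_G, mu_E = G.mean(), E.mean()
sd_G, sd_E = G.std(), E.std()
bs_G, bs_E, bs_GE = ols(Y, (G - mu_G) / sd_G, (E - mu_E) / sd_E)

# Standardized estimates predicted from the unstandardized ones
predicted = np.array([
    sd_G * (b_G + b_GE * mu_E),
    sd_E * (b_E + b_GE * mu_G),
    sd_G * sd_E * b_GE,
])
fitted = np.array([bs_G, bs_E, bs_GE])

# Columns: fitted standardized estimate, value predicted from the raw fit
print(np.column_stack([fitted, predicted]))
```

Because both designs span the same column space, the two columns should agree up to floating-point error when the sample means and standard deviations are used for standardization.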

Appendix C: Variance-covariance for the GxE term and its estimated effect
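To derive the covariance and the correlation between $G$ and $G \times E$, we first calculate $\sigma^2_{G \times E}$, the variance of the interaction term $G \times E$, under the assumption of independence between $G$ and $E$. Assuming the standard coding for $G$, [0,1,2], with $p$ the frequency of the coded allele and $q = 1 - p$, and assuming $E$ is normally distributed, so that $E^2/\sigma_E^2$ follows a non-central chi-square distribution with one degree of freedom, it can be expressed as:

$$\begin{aligned} \sigma^2_{G \times E} &= \mathbb{E}[G^2]\,\mathbb{E}[E^2] - (\mathbb{E}[G]\,\mathbb{E}[E])^2 \\ &= \left(2p(1-p) + 4p^2\right) \times \sigma_E^2 \times \left(1 + \frac{\mu_E^2}{\sigma_E^2}\right) - (2p)^2 \times \mu_E^2 \\ &= 2p \times (q + 2p) \times (\sigma_E^2 + \mu_E^2) - 4 \times p^2 \times \mu_E^2 \\ &= \sigma_G^2 \times \sigma_E^2 + \sigma_G^2 \times \mu_E^2 + \mu_G^2 \times \sigma_E^2 \end{aligned}$$

Under the same assumption, $cov(G, G \times E)$, the covariance between $G$ and $G \times E$, can be expressed as:

$$cov(G, G \times E) = \mathbb{E}[G^2 E] - \mathbb{E}[G]\,\mathbb{E}[G \times E] = \mu_E(\sigma_G^2 + \mu_G^2) - \mu_G^2 \mu_E = \mu_E \times \sigma_G^2$$

Similarly, one can derive the covariance between the exposure and the interaction term and show that $cov(E, G \times E) = \mu_G \times \sigma_E^2$. From this, the correlation between $G$ and $G \times E$ then equals:

$$cor(G, G \times E) = \frac{\mu_E \times \sigma_G}{\sqrt{\sigma_G^2 \sigma_E^2 + \sigma_G^2 \mu_E^2 + \mu_G^2 \sigma_E^2}}$$

We then derive the covariance and correlation between the estimated effects of $G$ and $G \times E$. In its general form, the variance-covariance matrix of the estimates from the interaction model can be expressed as:

$$\Sigma = \sigma_\varepsilon^2 (X^T X)^{-1}$$

where $X$, the matrix of predictor variables, equals $[1, G, E, G \times E]$ and $\sigma_\varepsilon^2$ is the variance of $\varepsilon$, the residual of $Y$. This is a relatively complex form when $G$ and $E$ are not standardized. However, when the predictors are standardized, $\mathbb{E}[G'] = \mathbb{E}[E'] = 0$, and assuming $G$ and $E$ are independent, the formulation of $\Sigma$ greatly simplifies, as all the off-diagonal elements of the matrix are null, so that:

$$\Sigma' \approx \frac{\sigma_\varepsilon^2}{N} I_4$$

which implies that all covariance terms, including $cov(\hat\beta'_G, \hat\beta'_{GE})$, are null. Building on this, and using the equations from Appendix A, we can derive $\gamma$, the covariance between $\hat\beta_G$ and $\hat\beta_{GE}$ for the unstandardized case:

$$\gamma = cov(\hat\beta_G, \hat\beta_{GE}) = -\frac{\mu_E \times \sigma_\varepsilon^2}{N \times \sigma_G^2 \times \sigma_E^2}$$

The correlation follows:

$$cor(\hat\beta_G, \hat\beta_{GE}) = -\frac{\mu_E}{\sqrt{\sigma_E^2 + \mu_E^2}}$$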
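The Appendix C quantities can be checked by simulation. The sketch below is illustrative only: the allele frequency and the exposure mean and variance are hypothetical values, and the closed forms it compares against ($\sigma^2_{G \times E} = \sigma_G^2\sigma_E^2 + \sigma_G^2\mu_E^2 + \mu_G^2\sigma_E^2$, $cov(G, G \times E) = \mu_E\sigma_G^2$, $cov(E, G \times E) = \mu_G\sigma_E^2$, and a correlation of $-\mu_E/\sqrt{\sigma_E^2 + \mu_E^2}$ between the estimated main and interaction effects) are the expressions that follow from assuming independence between $G$ and $E$.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 400_000
p, mu_E, sd_E = 0.3, 2.0, 1.0   # hypothetical allele frequency and exposure moments

G = rng.binomial(2, p, size=n).astype(float)  # standard [0,1,2] coding
E = rng.normal(mu_E, sd_E, size=n)            # drawn independently of G
GE = G * E

mu_G, var_G, var_E = 2 * p, 2 * p * (1 - p), sd_E ** 2

# Closed forms under G-E independence
var_GE_theory = var_G * var_E + var_G * mu_E ** 2 + mu_G ** 2 * var_E
cov_G_GE_theory = mu_E * var_G
cov_E_GE_theory = mu_G * var_E
cor_G_GE_theory = cov_G_GE_theory / np.sqrt(var_G * var_GE_theory)

# Empirical counterparts from the simulated sample
C = np.cov(np.vstack([G, E, GE]))
var_GE_emp, cov_G_GE_emp, cov_E_GE_emp = C[2, 2], C[0, 2], C[1, 2]
cor_G_GE_emp = np.corrcoef(G, GE)[0, 1]

# Correlation between the estimates of the G and GxE effects, read off
# (X'X)^-1; the residual variance is a common factor and cancels out.
X = np.column_stack([np.ones(n), G, E, GE])
V = np.linalg.inv(X.T @ X)
cor_est_emp = V[1, 3] / np.sqrt(V[1, 1] * V[3, 3])
cor_est_theory = -mu_E / np.sqrt(var_E + mu_E ** 2)

print("var(GxE):       ", var_GE_emp, var_GE_theory)
print("cov(G, GxE):    ", cov_G_GE_emp, cov_G_GE_theory)
print("cov(E, GxE):    ", cov_E_GE_emp, cov_E_GE_theory)
print("cor(G, GxE):    ", cor_G_GE_emp, cor_G_GE_theory)
print("cor(b_G, b_GxE):", cor_est_emp, cor_est_theory)
```

With an exposure mean that is large relative to its standard deviation, the last line should show a strongly negative correlation between the two estimates, which is the source of the power loss discussed for non-centered exposures.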

Appendix D: Derivation of the Pratt index
To estimate the variance explained by predictors, or other related measures, we first derive the expected variance of the outcome for a given generative model. For a single interaction term and assuming $G$-$E$ independence, it equals:

$$\sigma_Y^2 = \beta_G^2 \sigma_G^2 + \beta_E^2 \sigma_E^2 + \beta_{GE}^2 (\sigma_G^2 \sigma_E^2 + \sigma_G^2 \mu_E^2 + \mu_G^2 \sigma_E^2) + 2 \beta_G \beta_{GE} \mu_E \sigma_G^2 + 2 \beta_E \beta_{GE} \mu_G \sigma_E^2 + \sigma_\varepsilon^2$$

When more interaction terms are included, the outcome variance becomes a little more complex as additional covariance terms are added. For example, assuming interactions between $G_i$ and $E$, $i = 1 \ldots m$, the variance of $Y$ becomes: For simplicity, let us assume $\sigma_\varepsilon^2$ is set so that $var(Y) = 1$ in all further derivations. When testing a single interaction term and using the equivalences from Appendix A-B, it follows that the Pratt index can be expressed as a function of the estimates from the interaction model and the mean and variance of the genetic variant and the exposures considered: When summing the above Pratt indices we obtain: The cumulative contribution of multiple interactions involving independent SNPs can also be derived from summary statistics, although the derivation is a little less friendly because of additional covariance terms. For example, assuming interactions between $G_i$ and $E$, $i = 1 \ldots m$, we obtain (Figure S4): One should note that estimating the Pratt index for the exposure can be difficult in practice when the number of interactions is large, as it would require the estimated main exposure effect from a joint model including all SNP main effects and all interaction terms with the exposure. Also, because of the correlation between main and interaction terms, $r^{*2}$, like the standard $r^2$, only approximates the amount by which the variance would change if the predictor were held constant. For the latter measure, one can refer to [21].
The special case of a negative Pratt index: As opposed to the standard derivation of the variance explained, the Pratt index formula allows $r^{*2}$ to be negative. This characteristic has been a source of concern [Thomas, et al. 1998]. While an in-depth discussion of this characteristic is beyond the scope of this study, we consider a hypothetical example illustrating this phenomenon.
As negative values for the Pratt index can exist in the presence of correlated predictors (whether or not their interaction effects are modelled), we considered the simplest example of two highly correlated factors $A$ and $B$ (e.g. $A$ being coffee drinking and $B$ being smoking) that define a quantitative outcome $Y$ through the linear model: $Y = \beta_A A - \beta_B B$. We assumed that $A$ and $B$ are normal with variance 1, $cov(A, B) = 0.8$, and $\beta_A = 2\beta_B = 2$. It follows that the variance of $Y$ equals $\sigma_Y^2 = \beta_A^2 + \beta_B^2 - 2\beta_A\beta_B\,cov(A, B) = 1.8$. Using the standard approach we would conclude that $A$ and $B$ explain 80% and 20% of the variance of $Y$. Thus if $B$ was removed from the population (e.g. if everyone in the population stopped smoking), a naïve interpretation would be that the variance of $Y$ would decrease according to the amount of variance explained by $B$. However, if $B$ is removed from the population, then $Y = \beta_A A$ and its variance, $\sigma^2_{Y-B} = \beta_A^2 = 4$, is therefore larger than before. The Pratt index captures such effects by assigning negative values to some predictors. In this specific case, we would have $r^{*2}_A = 2.4$ and $r^{*2}_B = -0.6$, with the latter parameter highlighting the potential increase in the variance of $Y$ if $B$ is removed from the population.
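The numbers in this hypothetical example can be reproduced directly from the covariance matrix of the two factors. The sketch below assumes the Pratt contribution of each predictor is computed as its coefficient times its covariance with the outcome (so that the contributions sum to the outcome variance), and it reads the "standard approach" as the share of each squared coefficient; both readings are assumptions made for illustration, as are the variable names.

```python
import numpy as np

# Y = beta_A * A - beta_B * B with beta_A = 2 and beta_B = 1
beta = np.array([2.0, -1.0])
# Covariance matrix of (A, B): unit variances, covariance 0.8
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])

# Variance of the outcome (no residual term in this toy model)
var_Y = beta @ Sigma @ beta

# Covariance of each predictor with the outcome
cov_with_Y = Sigma @ beta

# Pratt contributions: coefficient times covariance with the outcome
pratt = beta * cov_with_Y

# Naive shares based on squared coefficients (assumed "standard approach")
naive_share = beta ** 2 / np.sum(beta ** 2)

# Outcome variance once B is removed from the population (Y = beta_A * A)
var_Y_without_B = beta[0] ** 2 * Sigma[0, 0]

print("var(Y)            =", var_Y)
print("Pratt (A, B)      =", pratt)
print("naive shares      =", naive_share)
print("var(Y) without B  =", var_Y_without_B)
```

The two Pratt contributions add up to the outcome variance, and the negative value for the second factor coincides with a variance that grows, rather than shrinks, once that factor is removed.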

Appendix F: GRS-based test, joint test and univariate test of multiple interaction effects
We denote $\hat\beta = (\hat\beta_1, \hat\beta_2, \ldots, \hat\beta_m)$ a vector of effects from $m$ independent SNPs, and $\sigma_i^2$ and $w_i$ the variance of each estimate and the weight of each SNP in the genetic risk score (GRS), respectively. The effect of the weighted GRS on the outcome equals: Consequently, the variance of this effect can be derived as follows: Hence, for standardized $G$ and $E$, the ncp of the GRS by $E$ interaction test equals where $N$ is the sample size and $\beta'_{G_iE}$ is the interaction effect from the standardized model, and the test follows a chi-square distribution with one degree of freedom. In comparison, the ncp for the test of the strongest pairwise interaction, i.e. the interaction that explains the largest amount of variance, equals $ncp_{pairwise} = \max_i(N \times \beta'^2_{G_iE})$.

Appendix E: Joint test of main and interaction effects
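The multiple regression least squares fit provides the estimated genetic main effect and interaction effect, $\hat\beta = (\hat\beta_G, \hat\beta_{GE})$, and their variance-covariance matrix $\Sigma$. The multivariate Wald test of the two parameters, which follows a chi-square distribution with 2 df, can be expressed as:

$$\hat\beta \Sigma^{-1} \hat\beta^T = \frac{\hat\beta_G^2 \sigma^2_{\hat\beta_{GE}} + \hat\beta_{GE}^2 \sigma^2_{\hat\beta_G} - 2 \hat\beta_G \hat\beta_{GE} \gamma}{\sigma^2_{\hat\beta_G} \sigma^2_{\hat\beta_{GE}} - \gamma^2} = \frac{A}{B}$$

where $\gamma$ is the covariance between $\hat\beta_G$ and $\hat\beta_{GE}$. It can be further developed using the variances and covariance of the estimates obtained under independence of $G$ and $E$ (Appendix C), that is $\sigma^2_{\hat\beta_G} = \sigma_\varepsilon^2(\sigma_E^2 + \mu_E^2)/(N\sigma_G^2\sigma_E^2)$, $\sigma^2_{\hat\beta_{GE}} = \sigma_\varepsilon^2/(N\sigma_G^2\sigma_E^2)$ and $\gamma = -\mu_E\sigma_\varepsilon^2/(N\sigma_G^2\sigma_E^2)$. For clarity we derive the numerator and the denominator separately, so that

$$A = \frac{\sigma_\varepsilon^2}{N \sigma_G^2 \sigma_E^2} \times \left[ (\hat\beta_G + \hat\beta_{GE}\mu_E)^2 + \hat\beta_{GE}^2 \sigma_E^2 \right]$$

The denominator $B$ equals:

$$B = \frac{\sigma_\varepsilon^4}{N^2 \sigma_G^4 \sigma_E^2}$$

So that the joint test of the $G$ and $G \times E$ effects equals:

$$\frac{A}{B} = \frac{N}{\sigma_\varepsilon^2} \times \left[ \sigma_G^2 (\hat\beta_G + \hat\beta_{GE}\mu_E)^2 + \sigma_G^2 \sigma_E^2 \hat\beta_{GE}^2 \right] = \frac{N}{\sigma_\varepsilon^2} \times \left( \hat\beta'^2_G + \hat\beta'^2_{GE} \right)$$

which is the sum of the individual Wald tests for the main effect and the interaction effect when $G$ and $E$ are standardized. Moreover, leveraging previous equivalences, we can express the joint test as a function of $\hat\beta''_G$ and $\hat\beta''_{GE}$, the estimated main and interaction effects from the model where $E$ has been centered. It follows (for $\sigma_\varepsilon^2 \approx 1$) that:

$$\hat\beta \Sigma^{-1} \hat\beta^T = N \times \hat\beta''^2_G \times \sigma_G^2 + N \times \hat\beta''^2_{GE} \times \sigma_G^2 \times \sigma_E^2$$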
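The behaviour of the joint test described in Appendix E can be illustrated with a small simulation. This is a minimal sketch under stated assumptions: the sample size, allele frequency, exposure distribution and effect size are hypothetical, and `fit` and `joint_wald` are helper names introduced here. It compares the 2 df Wald test from the model using the raw exposure, the same test after centering the exposure, and the sum of the two single-parameter chi-squares from the centered model.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000
G = rng.binomial(2, 0.3, size=n).astype(float)  # standard [0,1,2] coding
E = rng.normal(4.0, 1.0, size=n)                # non-centered exposure
Y = 0.02 * G * E + rng.normal(size=n)           # interaction effect only


def fit(y, g, e):
    """OLS fit of y ~ 1 + g + e + g*e; returns estimates and their covariance."""
    X = np.column_stack([np.ones_like(y), g, e, g * e])
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])  # residual variance
    return beta, sigma2 * XtX_inv


def joint_wald(beta, cov, idx):
    """Multivariate Wald statistic for the parameters listed in idx."""
    b = beta[idx]
    V = cov[np.ix_(idx, idx)]
    return float(b @ np.linalg.solve(V, b))


idx = [1, 3]  # positions of the G main effect and the GxE interaction

# Joint test with the raw exposure
beta_raw, cov_raw = fit(Y, G, E)
wald_raw = joint_wald(beta_raw, cov_raw, idx)

# Joint test after centering the exposure
E_centered = E - E.mean()
beta_c, cov_c = fit(Y, G, E_centered)
wald_centered = joint_wald(beta_c, cov_c, idx)

# Sum of the single-parameter chi-squares from the centered model
chi2_main = beta_c[1] ** 2 / cov_c[1, 1]
chi2_int = beta_c[3] ** 2 / cov_c[3, 3]
sum_chi2 = chi2_main + chi2_int

print("joint Wald, raw E:      ", wald_raw)
print("joint Wald, centered E: ", wald_centered)
print("sum of chi-squares:     ", sum_chi2)
print("main / interaction part:", chi2_main, chi2_int)
```

The joint statistic is unchanged by centering because the two models are reparametrizations of each other, while the sum of the individual chi-squares matches it only approximately, to the extent that the two estimates are uncorrelated in the sample.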

Figure S1. Linear interaction effect across different coding schemes
Pattern of contribution of an interaction term to an outcome when using the standard coding ([0,1,2], upper panel) or centered values ([-1,0,1], lower panel) for the genotype, and using a normal distribution for the exposure with variance 1 and mean of 0 (a,e), 1 (b,f), 5 (c,g) and 10 (d,h) in the generative model. The interaction effects were set so that they explain 1% of the variance of the outcome across all models.

Figure S2. Power comparison for linear regression
A normally distributed outcome Y is generated as a function of the main effect of a genetic variant G, the main effect of an exposure E and an interaction effect between G and E, using four different distributions of the exposure (lower panel). In scenario a) E is normally distributed with mean 0 and variance 1. In scenario b) E follows an exponential distribution with lambda parameter 1, so that the mean and the variance both equal 1. In scenario c) E follows a uniform distribution with minimum 0 and maximum 10. In scenario d) E is normally distributed with mean 4 and variance 1. The power of the test of the marginal genetic effect (marg.G), the main genetic effect (main.G), the interaction effect (int.GxE) and the joint test of main and interaction effects (Joint.G.GxE) is derived for each scenario at the significance level of $5 \times 10^{-8}$ across 1,000 simulations, each including 20,000 samples. The contribution of E to the variance of Y equals 1%, while the contribution of G is either null (upper panels) or equal to the contribution of the interaction term (middle panels). The effect sizes are set so that the joint test achieves 60% power on average. In all scenarios where the mean of the exposure is large compared to its standard deviation, the joint test and the marginal test of G have dramatically larger power than the interaction test, even when the underlying model includes an interaction effect only.

Figure S3. Power comparison for logistic regression
A binary outcome Y with a prevalence of 30% is generated as a function of the main effect of a genetic variant G, the main effect of an exposure E and an interaction effect between G and E, using four different distributions of the exposure (lower panel). In scenario a) E is normally distributed with mean 0 and variance 1. In scenario b) E follows an exponential distribution with lambda parameter 1, so that the mean and the variance both equal 1. In scenario c) E is normally distributed with mean 4 and variance 1. In scenario d) E follows a uniform distribution with minimum 0 and maximum 10. The power of the test of the marginal genetic effect (marg.G), the main genetic effect (main.G), the interaction effect (int.GxE) and the joint test of main and interaction effects (Joint.G.GxE) is derived for each scenario at the significance level of $5 \times 10^{-8}$ across 1,000 simulations, each including 20,000 samples. The exposure has an odds ratio of 1.1, while the odds ratios of G and GxE are set so that the joint test achieves 60% power on average. In all scenarios where the mean of the exposure is large compared to its standard deviation, the joint test and the marginal test of G have dramatically larger power than the interaction test, even when the underlying model includes an interaction effect only.

Figure S4. The Pratt index across multiple interactions.
Examples of the cumulative contribution of the genetic, exposure and interaction effects to the variance of an outcome Y when using the standard approach (variance explained by the interaction term on top of the marginal effect, $\sum r^2$) and the Pratt index ($\sum r^{*2}$). The outcome was generated as a function of 10,000 independent causal SNPs and a single exposure. In scenario a) the exposure is binary and rare (5% prevalence) and modifies the effect of 10% of the causal SNPs. In scenario b) the exposure is binary and very common (90% prevalence) and modifies the effect of 30% of the causal SNPs. In scenario c) the exposure is normally distributed with mean 4 and variance 1 and modifies the effect of 60% of the causal SNPs. The frequencies of the causal SNPs are drawn from a uniform distribution with a minimum of 0.01 and a maximum of 0.99. For simplicity, all main genetic effects and interaction effects have equal effect sizes, and the directions of the interaction effects are set so that the average contribution of interaction effects to the marginal effect of the exposure is null. The genetic main effects and interaction effects combined explain 60% of the variance of Y, and the exposure main effect explains 10%. The standard approach shows a similar pattern across the 3 scenarios, while the Pratt index correctly recovers the cumulative contribution of the GxE terms (i.e. as defined in the generative model).

Figure S5. Joint test of main genetic effect and multiple interaction effects in a linear regression.
A normally distributed outcome $Y$, a genetic variant $G$ and 10 non-centered normally distributed exposures $E = (E_1, E_2, \ldots, E_{10})$ are generated for 2,000 individuals across 100,000 replicates. The distributions of the -log10(p-value) of two tests for the joint analysis of the main genetic effect and the 10 interaction effects between $G$ and $E$ are compared under a null model of no main genetic effect and no interaction: the multivariate Wald test of estimates obtained from the interaction model when using the raw exposures (upper panels, blue), and a test based on the sum of chi-squares from individual main and interaction estimates obtained from the same model but after centering the exposures (middle panels, red). The correlation between these two tests is then compared under an alternative hypothesis where the variance explained by the main genetic effect and the interaction terms is drawn from a uniform [0, 0.002] (bottom panels). Four scenarios are considered: in a) $E$ is multivariate normal with no correlation between the exposures; in b) $E$ is multivariate normal with an average absolute pairwise correlation of 0.07 between the exposures (95% of the correlations are in [-0.19, 0.19]); in c) $E$ is multivariate normal with an average absolute pairwise correlation of 0.22 between the exposures (95% of the correlations are in [-0.44, 0.44]); and in d) $E$ is multivariate log-normal with no correlation between the exposures. Three patterns are considered for the genetic variant: i) G is not associated with $E$ (G 0 ); ii) G is causal for 3 out of the 10 exposures with effect size (i.e. the exposure's variance explained) drawn from a uniform [0, 0.01] (G 1 ); and iii) G is causal for 3 out of the 10 exposures with effect size drawn from a uniform [0, 0.05] (G 2 ).

Figure S6. Joint test of main genetic effect and multiple interaction effects in a logistic regression.
A binary outcome $Y$ with 40% prevalence, a genetic variant $G$ and 10 non-centered normally distributed exposures $E = (E_1, E_2, \ldots, E_{10})$ are generated for 2,000 individuals across 100,000 replicates. The distributions of the -log10(p-value) of two tests for the joint analysis of the main genetic effect and the 10 interaction effects between $G$ and $E$ are compared under a null model of no main genetic effect and no interaction: the multivariate Wald test of estimates obtained from the interaction model when using the raw exposures (upper panels, blue), and a test based on the sum of chi-squares from individual main and interaction estimates obtained from the same model but after centering the exposures (middle panels, red). The correlation between these two tests is then compared under an alternative hypothesis where the main genetic effect and the interaction effects have an odds ratio of 1.1 (bottom panels). Four scenarios are considered: in a) $E$ is multivariate normal with no correlation between the exposures; in b) $E$ is multivariate normal with an average absolute pairwise correlation of 0.07 between the exposures (95% of the correlations are in [-0.19, 0.19]); in c) $E$ is multivariate normal with an average absolute pairwise correlation of 0.22 between the exposures (95% of the correlations are in [-0.44, 0.44]); and in d) $E$ is multivariate log-normal with no correlation between the exposures. Three patterns are considered for the genetic variant: i) G is not associated with $E$ (G 0 ); ii) G is causal for 3 out of the 10 exposures with effect size (i.e. the exposure's variance explained) drawn from a uniform [0, 0.01] (G 1 ); and iii) G is causal for 3 out of the 10 exposures with effect size drawn from a uniform [0, 0.05] (G 2 ).

Figure S7. GRS-based statistic and meta-analysis of single SNP estimates in linear regression.
A normally distributed outcome Y is generated as a function of the main effect of 20 genetic variants, the main effect of an exposure E and 20 interaction effects between the exposure and each of the 20 genetic variants across 1,000 replicates, each including 2,000 samples. Three distributions of the exposure are considered. In scenario a) E is normally distributed with mean 3 and variance 1; in scenario b) E follows a uniform distribution with minimum 0 and maximum 10; and in scenario c) E follows an exponential distribution with parameter lambda of 1. The GRS-based tests of the marginal genetic effect (Y ~ GRS, left panels), and of the main (center panels) and interaction (right panels) effects from the interaction model (Y ~ GRS + E + GRS × E), are compared against the corresponding inverse-variance weighted meta-analysis of univariate statistics.

Figure S8. GRS-based statistic and meta-analysis of single SNP estimates in logistic regression.
A binary outcome Y with a prevalence of 40% is generated as a function of the main effect of 20 genetic variants, the main effect of an exposure E and 20 interaction effects between the exposure and each of the 20 genetic variants across 1,000 simulations, each including 2,000 samples. Three distributions of the exposure are considered. In scenario a) E is normally distributed with mean 3 and variance 1; in scenario b) E follows a uniform distribution with minimum 0 and maximum 10; and in scenario c) E follows an exponential distribution with parameter lambda of 1. The GRS-based tests of the marginal genetic effect (logit(P(Y=1)) ~ GRS, left panels), and of the main (center panels) and interaction (right panels) effects from the interaction model (logit(P(Y=1)) ~ GRS + E + GRS × E), are compared against the corresponding inverse-variance weighted meta-analysis of univariate statistics.