A systematic comparison of statistical methods to detect interactions in exposome-health associations

Background There is growing interest in examining the simultaneous effects of multiple exposures and, more generally, the effects of mixtures of exposures, as part of the exposome concept (being defined as the totality of human environmental exposures from conception onwards). Uncovering such combined effects is challenging owing to the large number of exposures, several of them being highly correlated. We performed a simulation study in an exposome context to compare the performance of several statistical methods that have been proposed to detect statistical interactions. Methods Simulations were based on an exposome including 237 exposures with a realistic correlation structure. We considered several statistical regression-based methods, including two-step Environment-Wide Association Study (EWAS2), the Deletion/Substitution/Addition (DSA) algorithm, the Least Absolute Shrinkage and Selection Operator (LASSO), Group-Lasso INTERaction-NET (GLINTERNET), a three-step method based on regression trees and finally Boosted Regression Trees (BRT). We assessed the performance of each method in terms of model size, predictive ability, sensitivity and false discovery rate. Results GLINTERNET and DSA had better overall performance than the other methods, with GLINTERNET having better properties in terms of selecting the true predictors (sensitivity) and of predictive ability, while DSA had a lower number of false positives. In terms of ability to capture interaction terms, GLINTERNET and DSA had again the best performances, with the same trade-off between sensitivity and false discovery proportion. When GLINTERNET and DSA failed to select an exposure truly associated with the outcome, they tended to select a highly correlated one. When interactions were not present in the data, using variable selection methods that allowed for interactions had only slight costs in performance compared to methods that only searched for main effects. Conclusions GLINTERNET and DSA provided better performance in detecting two-way interactions, compared to other existing methods. Electronic supplementary material The online version of this article (doi:10.1186/s12940-017-0277-6) contains supplementary material, which is available to authorized users.


Background
Many environmental exposures have been linked to health effects [1]. The fact that human biomonitoring and epidemiological studies are now able to measure a large number of environmental exposures in the same participants has led to the development of the exposome paradigm. The exposome is defined as the totality of human environmental exposures from conception onwards [2,3]. As in genome studies, most exposome studies rely on holistic data-driven approaches to discover associations between the exposome and a health outcome. Environmental exposures can have independent effects on health outcomes, but a promising feature of exposome research lies in the promise to examine potentially interacting exposures or, more generally, the effects of mixtures of exposures [4][5][6][7]. Two-or three-way interactions between environmental exposures have been described in the literature and statistical methods to uncover interactions among a large set of exposures have been suggested [8][9][10][11][12].
In a recent paper, Agier et al. [13] studied the performance of several variable selection algorithms in an exposome context. In particular, they considered the Environment-Wide Association Study (EWAS) and a two-step version of EWAS based on multiple linear regression (EWAS-MLR), Elastic net (ENET), sparse partial least squares regression (sPLS), Graphical Unit Evolutionary Stochastic Search (GUESS) and the Deletion/Substitution/Addition (DSA) algorithm. Their results showed the limitations of all methods to select the right exposures when those exposures are correlated, although they showed that GUESS and DSA provided a marginally better balance between sensitivity and specificity than the other methods. However, their simulations did not consider the presence of interactions and most of the methods tested could not accommodate a search for interaction terms.
In this paper, we want to extend their work by considering scenarios with statistical interactions and by providing a systematic comparison of methods that have been recommended to search for interactions. We are also interested in the performance of those methods in the absence of real interactions. This is of interest, as when analysing real data one never knows whether such interactions exist or not. We will restrict our analyses to linear models with main effects of exposures and two-way interactions, as they are the most commonly reported in the literature and because they are easily parameterized in regression equations, hence facilitating the comparison between methods. Specifically, we considered two-step Environment-Wide Association Study (EWAS 2 ), the Deletion/Substitution/Addition (DSA) algorithm, the Least Absolute Shrinkage and Selection Operator (LASSO), Group-Lasso INTERaction-NET (GLINTERNET), a three-step method based on regression trees and finally Boosted Regression Trees (BRT). Besides, we do not consider confounding by other covariates. The main focus of our analysis will be on variable selection, as we want to focus on the ability of the methods to correctly detect true associations. Other metrics such as bias or coverage of effect estimates will not be addressed, although we note that they will depend critically on the performance of variable selection.
The existing literature provides only a limited number of comparisons between the methods examined in this study and other alternatives. For instance, GLINTERNET was shown to perform comparably to the R package hierNet, with some advantages in computing time [8]. GLINTER-NET performed better than boosting in terms of false discovery probability [8]. Under simulations, the DSA algorithm seemed to be competitive with Logic Regression, which only handles binary variables [14]. Sun et al. [10] compared Bayesian Model Averaging, DSA, LASSO, Partial Least-Squares Regression and Supervised Principal Component Analysis under simulations with up to 20 variables. They also considered a two-step modelling strategy in which variables were screened for inclusion in the second step using Classification and Regression Trees (CART). There was no uniform dominance of one method across all examined simulation scenarios. However, to the best of our knowledge, the methods considered in our study have never been systematically compared under the characteristics of the exposome context, i.e. variable selection and interactions detection with a high number of potentially correlated exposures.

Simulating data
The exposome data were simulated based on an existing dataset with the observation of 237 exposures on 655 individuals from the INMA (INfancia y Medio Ambiente) mother-child cohort [15]. In particular, we computed the correlation matrix of the exposures, . In such matrix, 81% of absolute pairwise correlations were lower than 0.2 while 64% were lower than 0.1. The median absolute value was 0.06 and the absolute percentiles 2.5 th , 25 th , 75 th and 97.5 th were 0.003, 0.03, 0.15 and 0.61, respectively. 78% of exposures were correlated at absolute level higher than 0.6 with at least one other exposure (see Additional file 1: Section A). Then, this matrix was used to simulate the exposures E using a multivariate normal distribution with mean 0, The number of participants was set to N = 1200. Subsequently, the outcome variable Y was simulated as where F(E) is a function of the exposome E, hereafter called the true model, and σ is the residual standard deviation. We considered three scenarios, displayed in Table 1, all of them involving five exposures, hereafter called true predictors as they are the ones generating the outcome. Scenarios are characterized by the relationship between the true predictors and the outcome, F(E). Thus, scenario 1 corresponds to a case with no interactions, scenario 2 corresponds to a case with one 2-way interaction, and scenario 3 corresponds to a case with two 2-way interactions. For each of the three scenarios, we built subscenarios according to the coefficient of determination (R 2 ) of the model (set at either 0.1 or 0.3); the pairwise correlation between the true predictors (either mixed: any exposure can be selected as a true predictor regardless of correlation; or high: exposures are chosen so that all their pairwise correlations are above 0.6); the size of interaction effects (either strong: same size as the main effects; or moderate: half the size of the main effects) (Additional file 1: Section B); and the direction of the interaction effect (either +: same direction than the main effects; or −: opposite direction to the main effects). Subscenarios are summarized also in Table 1. In each scenario, the size of both the main and the interaction effects, and the value of σ were tuned so that they yielded the desired R 2 and a sensitivity of around 0.9 or as close as possible when a model including only the true terms was fitted and their significance was  Scenario 2. True model: F(E) = β 0 + β 1 X 1 + β 2 X 2 + β 3 X 3 + β 4 X 4 + β 5 X 5 + γ 12 X 1 X 2 (Model size = 6; Only one 2-way interaction) Scenario 3. True model: a In each of the three scenarios, the outcome Y was generated as Y = F(E) + , where F(E) is a function of the predictors X 1 , . . . , X 5 , and ∼ N(0, σ ). In each scenario, subscenarios were considered according to the pairwise correlation of the predictors ("Mixed", when selecting the predictors among the whole exposome, in which case the absolute pairwise correlation ranged from 0.0000 to 1.0000; or "High", when selecting the predictors among the subset of the 13 variables in the exposome for which all absolute pairwise correlations were 0.62 or higher); the size of the interaction terms ("Strong", corresponding to equal size than the main effects size; or "Moderate", corresponding to size 1/2 of the "Strong"), and the sign of the interaction terms (+ or −). Values for the adjusted R 2 correspond to the mean and percentiles 2.5 th and 97.5 th as a result of fitting the model to 100 simulated datasets. b The median of the mean pairwise correlation between the true predictors was 0.12 (percentiles 2.5 th and 97.5 th : (0.05, 0.25)) for "Mixed", and 0.78 (percentiles 2.5 th and 97.5 th : (0.72, 0.87)) for "High". The median of the mean pairwise correlation between the true predictors and the other exposures was 0.13 (percentiles 2.5 th and 97.5 th : (0.09, 0.16)) for "Mixed", and 0.18 (percentiles 2.5 th and 97.5 th : (0.17, 0.19)) for "High". c In all scenarios, β 0 = β 1 = · · · = β 5 = 1 assessed (sensitivity was the proportion of statistically significant terms over repeated simulations, see details on sensitivity tuning in an enlarged version of Table 1 in Additional file 1: Section C).
For each of the subscenarios in Table 1, we simulated 100 training datasets (N = 1200 individuals) in which the statistical methods described below were applied. In each of the simulated datasets, true predictors were randomly selected from the available set of exposures. Likewise, we generated the same number of validation datasets (N = 10000 individuals each) for the assessment of the out-ofsample prediction.

Statistical methods
We considered several statistical methods previously recommended in the literature to detect interactions. Specifically, we focused on methods that implemented a variable selection approach to detect statistical interactions in the form of a product of two variables in a linear model. An exception was the inclusion of Boosted Regression Trees (BRT), which does not have an explicit regression equation. Besides, we restricted the simulation to methods that were or could easily be implemented in the R [16] software. The R code to reproduce this study is shown in Additional file 2.

Two-steps EWAS (EWAS 2 )
The Environment-Wide Association Study (EWAS) is a method analogous to genome-wide association studies but considering environmental factors instead of loci [17]. Thus, a univariate regression model is fitted for each exposure, and p-values are corrected for multiple comparisons. As an extension to search for two-way interactions, we used a two-step EWAS (EWAS 2 ) as suggested by Kooperberg [11]. First, we fitted a simple linear regression model to test each exposure marginally at significance level α = 0.05 and applying the Benjamini and Yekutieli correction [18] for multiple comparisons. All the significant exposures entered the second step of the process, in which we fitted a linear regression model with a pair of exposures and the corresponding two-way interaction term, and repeated the process for all possible pairs. p-values were corrected again for multiple comparisons using the Benjamini and Yekutieli correction and α = 0.05. In the end, the selected terms in EWAS 2 were all the main effects retained in the first step and all the two-way interaction terms that were significant in the second step. Note that this procedure does not provide a single model but just a set of exposures and a set of twoway interaction terms marginally associated with the outcome. However, such sets were used to assess the performance of EWAS 2 through some of the measures defined later.

DSA algorithm
The Deletion/Substitution/Addition (DSA) algorithm [9] is an iterative process that starts with an empty model and uses deletion (removing a variable from the model), substitution (replacing a variable in the model by another not in the model) or addition (adding a variable in the model) moves to find the final model (further details are shown in Additional file 1: Section D). The final model is selected by minimizing the residual mean squared error (RMSE) using 5-fold cross-validation. We fitted two versions of DSA, one that only searches for main effects (DSA 1 ) and another that also searches for 2-way interactions (DSA 2 ). The way the DSA software is implemented, the version that searches for 2-way interactions also searches for quadratic terms. In all cases, we set the maximum model size to 10, which was never reached in the simulations. We used the R package DSA. It is noteworthy that this package, and the required package modelUtils, although working properly, are not included in the CRAN repository [19].

Sun 3-step method (Sun3step)
A 3-step method similar to that suggested by Sun et al. [10] was implemented (Sun3step). In the first step, we performed a correlation analysis to assess the collinearity within each group of exposures. In our data, there were 15 groups of exposures containing from 1 to 51 exposures each (see Additional file 1: Section A). When several exposures in the same group were highly correlated (Pearson correlation coefficient above 0.60), only the one with the smallest p-value in the single-exposure regression model was retained. In the second step, the selected exposures entered a Classification And Regression Tree (CART), which was subsequently pruned, with the criteria of minimizing the cross-validated error. The R package rpart was used. The variables selected in the construction of the regression tree entered the third step, which consisted of applying the DSA algorithm (allowing for 2-way interactions and quadratic terms), hence providing the final model.

Least absolute shrinkage and selection operator (LASSO)
LASSO is a method of estimation in linear models which penalizes large model sizes. Specifically, the method minimizes the residual sum of squares penalized by the sum of the absolute value of the regression coefficients, which tends to produce some coefficients being exactly zero and hence providing a variable selection procedure [20]. The LASSO method is not specifically designed to find interactions, but it was used to compare its performance with GLINTERNET, an extension of the LASSO method designed to look for interactions described below. Thus, no interaction terms were allowed in the LASSO method. We used 3-fold cross-validation for the sake of comparability with GLINTERNET. The R package glmnet was used.

Group-Lasso INTERaction-NET (GLINTERNET)
GLINTERNET is a variable selection algorithm that fits linear pairwise-interaction models that satisfy strong hierarchy: if an interaction coefficient is estimated to be nonzero, then its two associated main effects also have nonzero estimated coefficients [8]. It is based on the overlapped group-lasso [21], which considers the linear predictor as a linear combination of groups of terms (including main effects and two-way interactions). The particular case in which each group consists of only one variable corresponds to LASSO. Groups of variables are allowed to overlap, in the sense that one variable can be present in more than one group (e.g. the same variable can be present in two or more groups corresponding to different two-way interaction terms). In such cases, the final coefficient for a given variable is the sum of the coefficients of the groups in which the variable is present. A cross-validation using 3-folds (for computational reasons) was performed using the R package glinternet.

Boosted Regression Trees (BRT)
Lampa et al. [12] recommended the Boosted Regression Trees (BRT) as a tool to identify complex interactions. BRT combines the strengths of two algorithms: regression trees (models that relate a response to its predictors by recursive binary splits) and boosting (an adaptive method for combining many simple models to give improved predictive performance). The final BRT model can be understood as an additive regression model in which individual terms are simple trees, fitted in a forward, stagewise fashion [22]. Hence, unlike the previously described techniques, BRT does not provide a simple regression equation. In regression trees, data are partitioned into a set of disjoint regions, and each region is assigned a constant value of the outcome variable. The splits of the trees can capture nonlinear effects and complex interactions. We set the maximum number of trees to 5000 and the depth parameter, which can be thought of as the maximum order of interactions, to 4. We modified the BRT method by incorporating a variable selection procedure described in Díaz-Uriarte [23]. With this addition, besides measures of prediction, the technique can be compared to the other methods in terms of its performance in variable selection. Briefly, the variable selection algorithm works as follows. We proceed iteratively by fitting BRT and eliminating at each iteration a fraction f (we set f = 50%) of variables with the smallest importance. The importance of a given variable is based on the number of times it is selected for splitting, weighted by the squared improvement to the model as a result of each split, and averaged over all trees. Then, the final set of variables is chosen as the smallest set of variables which minimizes the out-ofsample error rate. Computations were performed using the R packages gbm and dismo.

Measures of performance
To assess the performance of each method in each scenario, we estimated, among the simulated datasets, the mean value of the measures defined below. Such measures, analogous to those used by Agier et al. [13], are based on the comparison of the fitted models when using the assessed method and when using the model that already generated the data. Note that some measures are defined according to the terms in the model, while others are based on the variables involved in the model. For example, a model including X 1 , X 2 , X 2 1 and X 1 X 2 contains two variables and four terms. • Relative out-of-sample R 2 (R 2 rel ): ratio between the out-of-sample R 2 of the fitted model (numerator) and the out-of-sample R 2 of the model that includes only the terms used to generate the data (denominator), using a simulated test dataset (N = 10000).
• Sensitivity (Sens): Proportion of terms in the true model correctly detected by the fitted model. • Alternative sensitivity (AltSens) [13]: The average highest correlation between a true predictor and any variable involved in the fitted model, where A is the subset of true predictors, n A is the number of true predictors, and B is the subset of variables in the fitted model. If all the true predictors were detected, both Sens and AltSens would take the value 1. If none of the true predictors were detected, but instead a set of variables having each a correlation of 0.9 with one of the true predictors were selected by the fitted model, AltSens would take the value 0.9 while Sens would take the value 0.
• Sensitivity for variables (Sensvar): proportion of true predictors involved in any term of the fitted model. One minus the average highest correlation between a variable selected by the fitted model and any true predictor, where B is the subset of variables involved in the fitted model and n B is the size of B. If no false predictors were selected, both AltFDP and FDP would take the value 0. If none of the selected variables by the fitted model were a true predictor but each of them had a correlation of 0.9 with a true predictor, Alt FDP would take the value 0.1 but FDP would take the value 0.
• False discovery proportion for variables (FDPvar): Proportion of variables selected by the fitted model that are not true predictors.
Regarding the detection of interaction terms, we considered the following measures: • Sensitivity for interaction terms (Sens 2 ): Proportion of true interaction terms correctly detected by the fitted model. • Alternative sensitivity for interaction terms (AltSens 2 ), analogous to AltSens: the average of the highest correlation between a true predictor involved in an interaction term and a variable involved in an interaction term in the fitted model, where A 2 is the subset of true predictors involved in interaction terms, n A 2 is the size of A 2 , and B 2 the subset of the variables involved in interaction terms in the fitted model. • False discovery proportion of interaction terms (FDP 2 ): Proportion of interactions terms in the fitted model that are not in the true model. • Alternative false discovery proportion of interaction terms (AltFDP 2 ), analogous to AltFDP: The average of the highest correlation between a variable involved in an interaction term in the fitted model and any variable involved in an interaction term in the true model.
where n B 2 is the size of B 2 .
Note that some of the measures of performance cannot be computed for some models. Table 2 shows the features of each model according to both the availability of the measures of performance and the model structure.

Model size
Figure 1(a) shows the number of variables in the model. Despite the application of a false discovery correction, EWAS 2 selected between 10 and 20 times the number of true predictors. LASSO and GLINTERNET also selected more variables than in the true model, but to a much lower extent. Specifically, LASSO selected models with between 2 and 4 times the number of variables in the true model, while for GLINTERNET, this ratio ranged from 1 to 2, and was close to one in scenarios with high correlation between the true predictors. In contrast, DSA 1 , DSA 2 and Sun3step selected fewer variables than the true model. DSA 2 was the method closest to the right number of variables, with ratios of around 0.5 and 1, in scenarios with "Mixed" and "High" correlation, respectively. Sun3step was the most restrictive model, with a number of variables of around one third of that in the true model. Similar results were found when assessing the number of terms (as opposed to variables) included in the model (Additional file 1: Section E).

Predictive ability
In terms of predictive ability, all methods achieved R 2 rel between 0.3 and 1, i.e. R 2 lower than the model that includes only the terms used to generate the data ( Fig. 1(b)). GLINTERNET was the method with the highest R 2 rel , being higher than 0.7 in all scenarios and very close to 1 in scenarios with high correlation between true predictors. Despite not considering interaction terms, LASSO achieved good values of R 2 rel , close to those of GLINTERNET, and similar or better than those of other methods except in scenarios with two strong interaction terms (scenarios 3e and 3f ). DSA 2 provided good values of R 2 rel , but lower than those of GLINTERNET, especially in cases with strong interactions and low correlations between predictors (scenarios 2a, 2b, 3a, and 3b). Sun3step and especially BRT provided the lowest values of R 2 rel , which in some scenarios were lower than 0.5.

Sensitivity
EWAS 2 was the method with the highest sensitivity for variables (Sensvar, Fig. 2(a)). In many cases, EWAS 2 showed higher sensitivity than when fitting the model that included only the terms used to generate the data. This was the case because some of those terms were selected as significant in EWAS 2 (i.e. in models including a single exposure at a time), but they were not significant when all terms were included in the same model. LASSO and GLINTERNET had values of sensitivity similar to each other, which were in turn very close to those of the model that included only the terms used to generate the data in all scenarios, especially in those with true interactions and high pairwise correlation among the predictors. DSA 1 and DSA 2 had similar sensitivity for variables, and they were about half of those of LASSO and GLINTERNET. BRT performed similarly to the DSA but with slightly smaller values. Sun3step had the worst sensitivity for variables among all methods. Results when  assessing sensitivity for terms (as opposed to variables) showed the same patterns (Additional file 1: Section F). EWAS 2 , LASSO and GLINTERNET had values of Alt-Sens ( Fig. 2(b)) between 0.9 and 1, indicating that when they did not select a true predictor they selected a highly correlated exposure. Those values ranged from 0.7 to 0.9 for the DSA algorithms, and were lower for BRT and Sun3step. In terms of the sensitivity for interaction terms (Sens 2 , Fig. 2(c)), GLINTERNET achieved substantially higher values than DSA 2 , Sun3step and EWAS 2 , except in scenarios with two interaction terms and high pairwise correlation (i.e. 3e, 3f, 3g and 3h), where EWAS 2 was the best method. Results on AltSens 2 are shown in the Additional file 1 (sections H and I). The values of this alternative measure of sensitivity were much higher than Sens 2 for GLINTERNET, DSA 2 and Sun3step, indicating that when a true interaction term was not selected an interaction term involving a highly correlated exposure was selected. For EWAS 2 , AltSens 2 was much lower than for the other methods (see Additional file 1: Section I).

False discovery proportion
Regarding the proportion of wrongly selected exposures (FDPvar, Fig. 3(a)), all methods had values greater than 0.4. EWAS 2 had values of around 0.9 for all scenarios. LASSO also had high values, greater than 0.7. GLIN-TERNET had values between 0.5 and 0.6. DSA 1 , DSA 2 , BRT and Sun3step tended to produce the lowest values. Similar results were obtained for the proportion of wrongly selected terms (as opposed to variables) associated with the outcome (FDP, Additional file 1: Section G). When looking at the alternative measure of false discovery (AltFDP, Fig. 3(a)), DSA 1 , DSA 2 , BRT and Sun3step tended to produce values lower than 0.15, indicating that when wrongly selecting an exposure, they tended to select one that was highly correlated to a true predictor. GLIN-TERNET produced slightly higher values, while EWAS 2 and LASSO had values between 0.4 and 0.5.
In terms of false discovery for interaction terms (FDP 2 , Fig. 3(d)), EWAS 2 performed worst, with values close to 1 in scenarios with high pairwise correlation among the true predictors and of around 0.9 in the other scenarios. a b c Fig. 3 Performance of the compared methods in terms of specificity. a False discovery proportion for variables (FDPvar), b Alternative false discovery proportion (AltFDP), and c False discovery proportion for interaction terms (FDP 2 ). Mean values based on 100 simulations. The vertical line separates scenarios according to the pairwise correlation between the true predictors as "Mixed" (any exposure can be selected as a true predictor regardless of correlation), or "High" (exposures are chosen so that all their pairwise correlations are above 0.6). Scenarios 1, 2 and 3 involve no interactions, one two-way interaction, and two two-way interactions, respectively DSA 2 provided the lowest values, except in cases with two interaction terms and high correlation between true predictors (scenarios 3e to 3h), in which cases GLINTER-NET provided better results. GLINTERNET tended to perform better than Sun3step. The alternative measure for false discovery proportion for interactions (AltFDP 2 , Additional file 1: Sections H and J) showed much lower values than FDP 2 . In particular, DSA 2 had values below 0.1, indicating that when DSA 2 wrongly selected an interaction term, the exposures involved in the selected interaction were highly correlated to the true ones. The other methods provided higher values. The usual trade-off between sensitivity and false discoveries for both main effects and interaction terms was systematically observed under the different methods, i.e. no method maximized both (see Additional file 1: Section K).

Impact of correlation between exposures
The pairwise correlation among the true predictors showed an important impact on method performance. In general, model size was reduced in the high correlation scenarios for all studied methods except EWAS 2 . R 2 rel was mostly above 0.8 for high correlation while it ranged from 0.3 to 0.9 for mixed correlation. Regarding sensitivity, higher correlation was associated with a reduction in both Sens and Sens 2 (while there was no clear pattern for Sensvar) but with an increase in AltSens (always above 0.8 for high correlation scenarios) to the point to achieve higher R 2 rel (mostly above 0.8 for high correlation while it ranged from 0.3 to 0.9 for mixed correlation). In terms of false discoveries, almost no changes were observed, except an increase in FDP 2 in high correlation scenarios.
In addition, we performed a sensitivity analysis (Additional file 1: Section L) for the impact of a low pairwise correlation among the true predictors on the performance of the analysed methods. Specifically, we created the new scenario 2i, which was tuned to be similar to scenarios 2a and 2e, but differing in the pairwise correlation among the true predictor. In scenario 2i, the true predictors are selected among the subset of 13 exposures for which all pairwise correlations are 0.1 or lower, while in scenarios 2a and 2e such correlations were "Mixed" and "High", respectively. Results showed almost no changes regarding model size and R 2 rel . Sensitivity decreased around 40% and FDP increased around 30% for almost all methods when changing from "low" to "High" pairwise correlation, although the alternative measures (i.e. AltSens and AltFDP) remained in general invariant.

Scenarios with no interaction
Both DSA 2 and GLINTERNET are able to look for interaction terms. DSA 1 and LASSO can be seen, respectively, as particular cases of those methods, restricted to look for main effects only. Table 3 shows the relative performance of these two pairs of methods regarding sensitivity and FDP in scenarios with no real interaction (1a to 1d). For DSA, looking for interactions when they do not exist had almost no cost in terms of sensitivity (variation between -4 and 2%). The difference in FDP ranged from -6 to 7%. When comparing GLINTERNET with LASSO, looking for interactions reduced the sensitivity by 3 to 12%, but led to a reduction in FDP between 19 and 25%. That is, GLINTERNET detected fewer true predictors but it also detected fewer false predictors than LASSO.

Discussion
We conducted a simulation study in an exposome context comparing the performance of several statistical methods that have been recommended to detect interactions. In addition, two methods that are not able to detect interactions (LASSO and DSA 1 ) were also considered for comparison purposes. Of the tested methods, GLINTER-NET and DSA 2 showed the best overall performance, with GLINTERNET having better properties in terms of sensitivity and predictive ability, and DSA 2 giving lower values of false discovery measures. GLINTERNET and DSA 2 also performed best when capturing interaction terms, with the same trade-off between sensitivity and false discovery proportion. When interactions were not present in the data, using variable selection methods that allow for interactions had almost no cost in sensitivity and only a slight reduction in false discovery rate, compared to methods that only search for main effects.
Both GLINTERNET and DSA 2 have some specific features. GLINTERNET forces the main effects in the model when an interaction term is detected, as it is commonly done in practice, although this is not the case for DSA 2 . The DSA algorithm allows for including interactions of higher orders and, when the order of interactions is set to 2, the model also looks for quadratic effects. Interestingly, both GLINTERNET and DSA 2 can be considered generalizations of variable selection methods that only search for main effects. In our simulations, when using the the DSA method, looking for interactions when they do not exist had a small effect on sensitivity and produced also a small reduction of FDP, of up to 7%. Given these numbers, researchers may decide if it is worth the cost including the search for interactions in their analyses. The comparison between LASSO and its generalization GLIN-TERNET was less clear. Looking for interactions implied a reduction in sensitivity, but FDP was actually improved up to 25%. This may be explained by the fact that the two algorithms are not exactly comparable, as the penalty in GLINTERNET affects groups of coefficients. EWAS 2 , i.e. a two-step method that searches for interactions without including all variables in the same model, offered a poor performance, with a very high percentage of false positives despite the multiple comparison correction. This is in agreement with the poor performance of the EWAS method in a similar simulation study that did not consider interactions [13]. EWAS 2 had the highest sensitivity because many exposures, including the true predictors, were selected. This result did not extend to the detection of interactions terms (e.g. GLINTERNET had better sensitivity than EWAS 2 ). This may be due to the high number of interaction terms that are tested in the second stage as a result of the high number of exposures selected in the first stage, and the multiple comparison correction.
Sun et al. [10] recommended a three-step method, in which in the first step only one exposure per family is retained. Thus, by definition, this method will miss some true predictors if there are more than one true predictor in the same family. We repeated the analysis excluding this first step, and the performance of the Sun3step method only changed minimally (data not shown). This approach achieved the lowest sensitivity, an R 2 rel substantially lower and a FDR higher than for other methods. Similar performances were observed when looking at the interaction terms only. BRT is a method that differs from the rest in that it has no regression equation for the final model and that it does not formally perform variable selection. In this study, we embedded a variable selection procedure to BRT. It is possible that such variable selection may have reduced the performance of the method, although it was implemented to minimize the out-of-sample error. In fact, BRT had one of the lowest sensitivities, although FDP was low and comparable to DSA 2 . Despite BRT is mainly seen as a predictive method, it produced the lowest R 2 rel . This can be partly explained by the way the data was simulated. The true model has a linear equation form, hence regression methods may be more suited to capture those effects. Thus, it is possible that BRT had better performances in more complex scenarios.
The pairwise correlation among the true predictors revealed as one of the main drivers of method performance, in some cases being even more important than the presence of interaction terms. Specifically, when such correlations were high, selected models tended to be smaller and it was more difficult to select true terms, as correlated exposures were selected instead. In terms of prediction, such models where a correlated exposure was selected instead of a true one could result in higher R 2 rel . However, epidemiological studies are usually not focused on prediction but on identifying causal associations. The latter task becomes more difficult in settings with highly correlated exposures.
We have performed sensitivity analyses to assess the impact of the tuning of the main parameters for DSA 2 , LASSO, GLINTERNET, BRT and Sun3step. Results showed only some slight, almost always non significant changes in the performance of the methods, which did not change the conclusions of the study (Additional file 1: Section M).
Statistical interactions are scale dependent, so our results depend on the assumed underlying model. Researchers interested in the causal interpretation of interactions should refer to the methods described in VanderWeele [24], although most of them are developed for binary exposures and outcomes. In this paper we only considered two-way interactions. It is likely that more complex interactions between environmental exposures exist. Yet, higher order interactions are complex to interpret [25] and are usually not investigated. Although some papers, usually with predefined hypotheses, have reported 3rd and higher order interactions, the fact that studies often have low power to detect them precludes their examination [26,27]. Nevertheless, future comparison of methods in a context of higher order interactions would be of interest. The problem of the effects of mixtures of pollutants is of high interest, and alternative methods have been suggested to address that problem. For example, Bobb et al. [28] suggest a method based on Bayesian kernel machine regression that incorporates variable selection and even a hierarchical variable selection procedure that accounts for structure in the exposures (e.g. families of highly correlated exposures). This method, implemented in the bkmr R package, can capture complex exposure-response functions of mixtures of exposures. Another example is the Bayesian Profile regression method [29], implemented in the PReMiuM R package, which aims at finding clusters of subjects sharing similar exposure profiles that at the same time show differences in the outcome. The inclusion of these two techniques in our simulation setting was not computationally feasible. However, the use of either of those techniques is not computationally problematic for the analysis of a single dataset of a similar size as the ones used here, so they remain as two attractive techniques to be considered in practice. These two methods can also capture complex non-linear associations. Our simulations did not consider non-linear effects, but some of the techniques used, such as DSA, Sun3step and BRT would be able to capture them ( Table 2).
The present study considered a limited number of scenarios. In particular, we only considered linear regression models and we did not consider issues such as non-linear main effects or the effect of confounders that are not in the set of exposures of interest. Even in that restricted setting, the number of scenarios considered was small, as many other combinations of parameters could be used. This is an issue in all simulation studies, but the presence of interaction terms adds another layer to the number of potential scenarios to be tested. We based our simulation on realistic scenarios using existing data, and included scenarios with different degrees of correlation between true predictors, different number of interaction terms with different strengths and directions, and different levels of R 2 of the models. Although many more scenarios could have been investigated, we believe we covered a large range of realistic scenarios and included extreme situations acting as stress test simulations for the methods assessment.
In practice, epidemiological studies have a set of confounders that need to be included in the model to obtain unbiased estimates of the effects of exposures (e.g. sociodemographic variables or seasonal trends). We did not consider that situation on our simulations, but we expect that considering confounders would only change the initial conditions of the scenarios (e.g. some exposures would not have true effects after confounder adjustment, and the residual variation of the model may be reduced). However, in practice it is important that models allow for the possibility to force confounders into the model. All of the methods analysed except GLINTERNET and BRT allow for this possibility (Table 2). For the other methods, one would need to use other approaches to deal with confounding, such as fitting an initial regression model with just the confounders, and performing the variable selection of exposures in a second stage using the residuals of that model.