Provable Boolean interaction recovery from tree ensemble obtained via random forests

Significance Random Forests (RFs) are among the most successful machine-learning algorithms in terms of prediction accuracy. In many domain problems, however, the primary goal is not prediction, but to understand the data-generation process—in particular, finding important features and feature interactions. There exists strong empirical evidence that RF-based methods—in particular, iterative RF (iRF)—are very successful in terms of detecting feature interactions. In this work, we propose a biologically motivated, Boolean interaction model. Using this model, we complement the existing empirical evidence with theoretical evidence for the ability of iRF-type methods to select desirable interactions. Our theoretical analysis also yields deeper insights into the general interaction selection mechanism of decision-tree algorithms and the importance of feature subsampling.


Introduction
Supervised machine learning algorithms have proven extremely powerful in a wide range of predictive tasks, from genomics to cosmology to pharmacology. Understanding how a model makes predictions is of paramount value in science and business alike [44]. For example, when a geneticist wants to understand a particular disease, e.g., breast cancer, a black-box algorithm predicting the risk of breast cancer from genotype features is useful, but it does not offer biological insight.
That is, discovery of genes and gene interactions driving a particular disease not only provides understanding as a basic goal in science, but also opens doors for therapeutic treatments. It is a pressing task, in genomics and beyond, to interpret supervised machine learning (ML) models or algorithms and extract mechanistic information in addition to predictions.
Among many supervised ML algorithms, tree ensembles such as those from Random Forests (RF) [5] and gradient boosted decision trees [13] stand out: they enjoy state-of-the-art prediction performance in a variety of practical problems and lead to relatively simple interpretations [35,27,46,26,23]. To interpret a tree ensemble model, two questions are central: • Feature importance: What features are important for the model's prediction?
• Interaction importance: What interactions among features are important for the model's prediction? While many studies (see [35,46,23,26] and the references therein) focus on RF feature importance, there are relatively few results on the second question. In genetics, Wan et al. [41] and Yoshida and Koike [43] seek (higher-order) gene interactions (or epistasis) by extracting genetic variant interactions from paths of ensembles of fitted decision trees. Wan et al. [41] use MegaSNPHunter, based on boosted trees, and interpret every group of features that jointly appears on one of the decision paths as a candidate interaction. Yoshida and Koike [43] propose to rank interactions of genetic variants based on how often they appear together on decision paths in an RF tree ensemble. Recently, iterative Random Forests (iRF) [1] were proposed to seek predictive, stable, and high-order non-linear or Boolean feature interactions. Even though iRF uses the idea that interacting features often appear together on individual decision paths of a tree in an RF ensemble, as in Yoshida and Koike [43], it adds several other ideas. iRF incorporates a soft dimension reduction step via iterative re-weighting of features in terms of their Gini importance, in order to stabilize individual decision paths in the trees. Using the random intersection trees (RIT) [34] algorithm, iRF extracts stable interactions of arbitrary order in a computationally efficient way, even when the number of features is large. There is strong evidence that iRF extracts predictive, stable, and high-order Boolean interaction information from RF in genomics and other fields [1,21,8]. While all the works mentioned above provide strong empirical evidence that interactions extracted from the ensemble of decision trees via RF or iRF are informative about underlying biological functional relationships, there are no theoretical results regarding interaction discovery using RF, iRF, or other tree-based
methods. In this paper, as a first step towards understanding the interaction discovery properties of tree-based methods, we investigate a key idea in the previous works [41,43,1], namely, that frequent joint appearance of features on decision paths in the RF tree ensemble suggests an interaction.
One of the most common assumptions made in previous theoretical analyses of RF is a family of smoothness conditions on the underlying mean regression function, such as Lipschitz smoothness; see, e.g., [4,33,40]. However, many biological processes show thresholding, i.e., discontinuous interacting behavior among biomolecules [42,16], which strongly violates the Lipschitz assumption. It is therefore necessary to introduce a model that captures this thresholding behavior through a discontinuous mean regression function.
The Locally Spiky Sparse (LSS) model. Motivated by this thresholding behavior of biomolecules, and inspired by RF's predictive successes on genomics data problems [19,7,38], we consider the locally spiky sparse (LSS) model: an additive regression model where the mean regression function is assumed to be a linear combination of Boolean interaction functions. The linear coefficients, as well as the threshold coefficients of the Boolean functions, are called model coefficients. Via Boolean functions, the LSS model is able to capture discontinuous thresholding behavior in biology, and hence can be more relevant for biologists than models with smoothness constraints. We believe the LSS model is suitable and useful as a new benchmark model under which to evaluate, theoretically and computationally, the interaction discovery performance of tree-based ML algorithms, including RF.
Our contributions. Assume that i.i.d. data samples from the LSS model are given and an RF is fit to this data.
1) For an RF tree ensemble, we first define signed features. For a set of signed features S^± appearing on a decision path in the ensemble, we then define a new quantity called depth-weighted prevalence (DWP). Intuitively speaking, the DWP of S^± measures how frequently the features in S^± appear together in an RF tree ensemble. We show that DWP has a universal upper bound that depends only on the size of the set of signed features. Moreover, the upper bound is attained with high probability as the sample size increases if and only if the signed features represent a union of interactions in the LSS model. Based on DWP, we show that a simple algorithm, LSSFind (Algorithm 1), can consistently recover interaction components in the LSS model regardless of the model coefficients.
2) Our theoretical results imply that the feature subsampling of RF is essential for recovering interactions from the RF tree ensemble. When too few features are sampled at each node, the tree ensemble is close to extremely randomized trees, and the DWP of any set of signed features is independent of the response, which means it contains no information about the LSS model; when too many features are sampled, all the trees in the ensemble are very similar to one another, which turns out to make it difficult to use tree structures to distinguish between interactions and non-interactions. More specifically, the ratio between the number of subsampled features m_try and the total number of features p should be a non-zero constant in order for our algorithm to learn higher-order interactions from tree paths.
Existing theoretical works on RF. Existing theoretical studies of RF and its variants fall into two categories. The first focuses on estimating the regression function, under Lipschitz or related conditions on the underlying regression function, by averaging the decision trees in the RF tree ensemble. The second category studies feature importance measures as an RF output. In contrast, we provide the first study of feature interaction selection consistency, under the new LSS model, using DWP extracted from the RF tree ensemble.
In particular, in the first category, Biau [4] considers "median forests" [11], originally introduced as a theoretical surrogate by Breiman [6], and obtains the L2 convergence rate under Lipschitz continuous models. Scornet et al. [33] give the first consistency result for Breiman's original RF, with sub-sampling instead of bootstrapping, in the low-dimensional setting when the data are generated via an additive regression model with continuous components. Wager and Athey [40] consider a variant of RF, called honest RF, in the causal inference setup and prove its point-wise consistency and asymptotic normality when the conditional mean function is Lipschitz continuous. Similarly, Mentch and Hooker [28] showed that, under some Lipschitz-type conditions, a moderately large number of trees approximates the infinite-tree ensemble well. Based on these asymptotic normality results, [29] derived hypothesis tests for the null hypothesis that the regression function is additive. Thus, if one defines feature interaction as the deviation from a continuous additive regression function, then their results enable testing a particular candidate. In contrast, in this work we define feature interaction via the discontinuous Boolean functions in the LSS model and derive consistent interaction selection via the RF tree ensemble, as opposed to a test for an individual interaction as in [29].
The second category focuses on theory regarding individual feature importance measures. Results in this line of work do not rely on Lipschitz conditions. However, to the best of our knowledge, these works study statistical properties of only noisy features, but do not provide results for signal features in finite samples. Louppe et al. [27] show that Mean Decrease Impurity (MDI) feature importance for randomized trees has a closed-form formula in the infinite-sample limit. Zhou and Hooker [46] use out-of-sample data to improve the MDI feature importance with unbiasedness guarantees. Li et al. [23] show that the MDI feature importance of noisy features is inversely proportional to the minimum leaf node size, and suggest a way to improve the MDI using out-of-bag samples. Löcher [26] gives a family of MDI feature importance measures, computed via out-of-bag samples, that are unbiased for noisy features. Moreover, many studies focus on permutation-based feature importance measures, in particular, Shapley effects [17,36,18,31,30,9,2]. Among these works, [2] shows some conceptual similarities to the DWP approach considered in this paper, as they also consider the concept of joint appearance of features on decision paths in the RF tree ensemble. However, instead of using this concept to extract feature interactions, as done in this work, they use it to define an importance sampling scheme to estimate the Shapley effects.
Also related to our work is the recent work [3], which analyzes the extraction of rule sets from an RF tree ensemble. This is very similar to interaction selection as considered in this work, except that the extracted rules in [3] also include specific estimated thresholds for the individual features. The theoretical analysis in [3] focuses on the stability of the selected rules without specifying a particular data-generating model. In contrast, this paper obtains model selection consistency results for LSSFind to estimate signed interactions of signal features under the LSS model.
The rest of the paper is organized as follows: Section 2 introduces the LSS model and Boolean interactions in more detail. Section 3 reviews the RF algorithm and formally defines DWP for a given set of signed features relative to an RF tree ensemble. Section 4 presents our main theoretical results for DWP and introduces LSSFind, a new theoretically inspired algorithm to detect interactions from RF tree ensembles via DWP. Section 5 contains simulation results. We conclude with a discussion in Section 6.

Locally Spiky Sparse (LSS) Model to describe Boolean interactions
In this section, we introduce necessary notation and a precise mathematical definition of the LSS model. To this end, for an integer N ∈ N, let [N] := {1, 2, . . ., N}. For a finite set S ⊂ [N], let |S| denote its cardinality, i.e., the number of elements in S. For any event A, let 1(A) denote the indicator function of A. We assume a given data set D = {(x_1, y_1), . . ., (x_n, y_n)} of n samples, with x_i = (x_i1, . . ., x_ip) ∈ R^p and y_i ∈ R.
We say that the data D is generated from a Locally Spiky Sparse (LSS) model when the following assumptions hold true.
LSS model. Assume D = {(x_1, y_1), . . ., (x_n, y_n)} are i.i.d. samples from a distribution P(X, Y) such that, for some fixed constants C_β > 0, C_γ ∈ (0, 0.5), the regression function takes the following form:

E(Y | X) = β_0 + Σ_{j=1}^{J} β_j Π_{k ∈ S_j} 1(X_k ⋚ γ_k),   (1)

where ⋚ in (1) means either ≤ or ≥, potentially different for every k. The coefficients β_j are bounded away from zero, i.e.,

min_{j=1,...,J} |β_j| ≥ C_β,   (2)

and the thresholds γ_k are bounded away from 0 and 1, i.e.,

γ_k ∈ (C_γ, 1 − C_γ)   (3)

for every feature k appearing in (1). S_1, . . ., S_J ⊂ [p] are sets of features called basic interactions. We associate ≤ in (1) with a negative sign (−1) and ≥ with a positive sign (+1), so that a signed feature can be written as a tuple (k, b) ∈ [p] × {−1, +1}. Note that for interactions with only one feature k, due to the sign ambiguity in the LSS model, i.e., 1(X_k ≤ a) = 1 − 1(X_k > a), both {(k, −1)} and {(k, +1)} are counted as an interaction.
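As a concrete illustration of the data-generating process (1), the following pure-Python sketch (function name, parameter values, and the particular interactions are our own, not from the paper) draws i.i.d. samples with uniform features and Gaussian noise:

```python
import random

def simulate_lss(n, p, interactions, betas, beta0=0.0, noise_sd=0.1, seed=0):
    """Draw n i.i.d. samples from a toy LSS model as in (1).

    interactions: list of dicts mapping a feature index k to a pair
    (gamma_k, sign), where sign -1 encodes 1(X_k <= gamma_k) and
    sign +1 encodes 1(X_k >= gamma_k).
    """
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        x = [rng.random() for _ in range(p)]      # uniform features on [0, 1]
        mean = beta0
        for beta, inter in zip(betas, interactions):
            ind = 1
            for k, (gamma, sign) in inter.items():
                ind *= int(x[k] <= gamma) if sign == -1 else int(x[k] >= gamma)
            mean += beta * ind                    # one Boolean interaction term
        X.append(x)
        y.append(mean + rng.gauss(0.0, noise_sd))
    return X, y

# Two disjoint basic interactions: S_1 = {0, 1} and S_2 = {2}.
X, y = simulate_lss(
    n=1000, p=10,
    interactions=[{0: (0.5, -1), 1: (0.5, -1)}, {2: (0.4, +1)}],
    betas=[1.0, 0.8],
)
```

Samples with x_0 ≤ 0.5 and x_1 ≤ 0.5 have their mean response shifted up by β_1 = 1, while the response is flat in all eight noisy features.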
The LSS model aims to capture interactive thresholding behavior, which has been observed for various biological processes [42,12,25,20,24,22]. For example, in gene regulatory networks often a few different expression patterns are possible. Switching between those patterns can be associated with individual components that interact via a threshold effect [25,20,24]. Such threshold behavior is also observed for other signal transduction mechanisms in cells, e.g., protein kinases [12] and cell differentiation [42]. Another example of a well-studied threshold effect is gene expression regulation via small RNA (sRNA) [22]. Although for most biological processes the precise functional mechanisms between different features and a response variable of interest are much more complicated than what the LSS model can capture, theoretical investigations of a particular learning algorithm, such as RF, are only feasible within a well-defined and relatively simple mathematical model, and are useful for practice when such a model is empirically relevant. Given the empirically observed interactive threshold effects in many real biological systems, the LSS model provides a useful enrichment of the current state of theoretical studies of RF and related methods, since current theoretical models do not capture this often-observed interactive threshold behavior.
In order to prove our main Theorem 2, we further impose the following constraints on the LSS model: C1 (Uniformity) X ~ Uniform([0, 1]^p); C2 (Bounded response) |Y| < 1 almost surely; C3 (Non-overlap) the basic interactions are disjoint, i.e., S_j1 ∩ S_j2 = ∅ whenever j_1 ≠ j_2.
This uniformity assumption (C1) implies that the features are mutually independent. Because any decision tree remains invariant under strictly monotone transforms of individual features, the uniform distribution assumption on X can be relaxed to the assumption that the individual features X_j, j ∈ [p], are independent, each with a distribution that has a Lebesgue density. We note that such an independence assumption might be violated in real-world problems. For example, for genetic data with SNPs or gene expression as features X_j, there will typically be strong correlation between features located close to each other on the chromosome. However, in many cases it is feasible to restrict to a subset of features (e.g., those located sufficiently far apart on the genome) in order to obtain approximate independence. In Section 5 we also demonstrate in simulations that, for sufficiently weak feature correlation, one can still obtain accurate interaction selection with LSSFind.
Note that although we assume |Y| < 1, the constant 1 can be replaced by any constant, as we can scale Y by any positive number and the conclusions of our main results remain intact. This boundedness condition can be further relaxed to requiring that the residual Z := Y − E(Y | X) is independent of X and 1-subgaussian, if we assume a slightly stronger condition on p and n than in C4. See Proposition 5 for more detail.
The non-overlap assumption that different interactions S_j1, S_j2 with j_1 ≠ j_2 are disjoint might not always be justified in real-world problems. However, it is a crucial assumption for our theorem to hold. The general problem with overlapping interactions in the LSS model is that such models can be non-identifiable, meaning that different forms of (1) can imply the same regression function E(Y | X). For example, the response 1(X_1 ≤ 0.5, X_2 ≤ 0.5) + 1(X_1 > 0.5, X_2 > 0.5) has, by the definition of signed interactions in the LSS model, the two basic signed interactions {(1, −1), (2, −1)} and {(1, +1), (2, +1)}. However, since the four quadrants partition the feature space, we can also write it as

1 − 1(X_1 ≤ 0.5, X_2 > 0.5) − 1(X_1 > 0.5, X_2 ≤ 0.5).

This means that a set of signed features which is an interaction in one of the representations is not an interaction in the other. Due to this identifiability problem, overlapping features can lead to both false positives and false negatives in terms of interaction recovery with RF. One may try to define interactions more broadly to avoid this identifiability problem: in the previous example, although the basic signed interactions are not unique, they always involve both X_1 and X_2. Whether the coefficients {β_j}, j = 0, . . ., J, are allowed to have different signs also affects identifiability; the previous example is identifiable if we only allow positive coefficients. For domain problems where interactions are believed to overlap, one should investigate different identifiability conditions, but as this depends on the precise application, we leave it for future work. Our work in this paper provides a pathway to investigate this in detail later. We demonstrate how overlapping features affect our results with a simulation study in Section 5.
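The non-identifiability in this example is a finite algebraic identity and can be checked mechanically; the small script below (our own illustration, with the boundary assigned to the ≤ side as in tree splits) verifies that the two representations coincide at every point of a grid:

```python
# Two representations of the same regression function from the example above:
# representation (a) uses the signed interactions {(1,-1),(2,-1)} and {(1,+1),(2,+1)};
# representation (b) uses {(1,-1),(2,+1)} and {(1,+1),(2,-1)} plus a constant offset.
def rep_a(x1, x2):
    return int(x1 <= 0.5 and x2 <= 0.5) + int(x1 > 0.5 and x2 > 0.5)

def rep_b(x1, x2):
    return 1 - int(x1 <= 0.5 and x2 > 0.5) - int(x1 > 0.5 and x2 <= 0.5)

grid = [i / 20.0 for i in range(21)]  # includes the threshold 0.5 itself
assert all(rep_a(a, b) == rep_b(a, b) for a in grid for b in grid)
```

The identity holds because the four events {X_1 ≤ 0.5} × {X_2 ≤ 0.5}, etc., partition the unit square, so their indicators sum to one.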
In Section 4 we show that a simple algorithm, LSSFind, which takes an RF tree ensemble as input, can consistently recover the basic interactions S_1, . . ., S_J in the LSS model. Besides recovering S_j ⊂ [p], LSSFind can also recover the sign of each feature k ∈ ∪_{j=1,...,J} S_j in the LSS model, which indicates whether the corresponding threshold behavior in (1) is given by a ≤- or a ≥-inequality. Without loss of generality, in the rest of the paper we assume that all inequalities in (1) are ≤, that is,

E(Y | X) = β_0 + Σ_{j=1}^{J} β_j Π_{k ∈ S_j} 1(X_k ≤ γ_k).

We stress, however, that all our results also hold for the general case (1). Because we assume that all the features in basic interactions have minus signs, we denote S^−_1, . . ., S^−_J ⊂ [p] × {−1, +1} with S^−_j = {(k, −1) : k ∈ S_j} as the basic signed interactions of the LSS model. As our theoretical results will show, the RF tree ensemble can recover not only the basic interactions S_j ⊂ [p] but also the basic signed interactions S^−_j ⊂ [p] × {−1, +1}. In other words, through DWP and under the LSS model, the RF tree ensemble can recover not only which features interact with each other, but also whether a particular feature in an interaction has to be larger or smaller than some threshold for the interaction to be active. Besides basic signed interactions, we also define a union signed interaction as a union of individual basic signed interactions, as made precise in the following definition.
Definition 1 (Union signed interactions). In the LSS model with basic signed interactions S^−_1, . . ., S^−_J, a set of signed features S^± ⊂ [p] × {−1, +1} is called a union signed interaction if S^± = ∪_{j ∈ A} S^−_j for some (possibly empty) set of indices A ⊂ [J], where, for a single-feature basic signed interaction S^−_j = {(k, −1)}, the sign-flipped counterpart {(k, +1)} may be used in place of, or in addition to, S^−_j in the union.

In other words, a union signed interaction is a union of one or more basic signed interactions, and for a single-feature signed interaction, its sign-flipped counterpart can also be added to the union. For example, for an LSS model with basic signed interactions S^−_1 = {(1, −1), (2, −1)} and S^−_2 = {(3, −1)}, the sets S^−_1, S^−_2, S^−_1 ∪ S^−_2, and S^−_1 ∪ {(3, +1)} are all union signed interactions. The theoretical results that we present in Section 4 are asymptotic, in the sense that they let the sample size n go to infinity. Denote by s the number of signal features ∪_{j=1,...,J} S_j in the LSS model, i.e., Σ_{j=1}^{J} |S_j| = s. We assume s is uniformly bounded regardless of n and p. However, the overall number of features p, and hence the number of noisy features p − s, can grow to infinity as n increases. Our theoretical results thus assume C4 (Sparsity): s = O(1) and log(p)/n → 0. This means that, in contrast to many theoretical works [10,33,40], our results hold in a high-dimensional setting as long as the overall number of signal features s is bounded. The condition log(p)/n → 0 is a common assumption in high-dimensional settings, for instance when analyzing consistency properties of the Lasso (see, e.g., [37,45,15]).

Depth-Weighted Prevalence (DWP) for an RF Tree Ensemble
In this section, we first review the RF algorithm and then define DWP for a given RF tree ensemble.

Review of RF
RF is an ensemble of classification or regression trees, where each tree T defines a mapping from the feature space to the response. Trees are constructed on a bootstrapped or subsampled data set D^(T) of the original data D. Note that the trees are conditionally independent of one another given the data. Any node t in a tree T represents a hyper-rectangle R_t in the feature space. A split of the node t is a pair (k_t, γ_t) which divides the hyper-rectangle R_t into two hyper-rectangles R_{t_l}(k_t, γ_t) = R_t ∩ {X_{k_t} ≤ γ_t} and R_{t_r}(k_t, γ_t) = R_t ∩ {X_{k_t} > γ_t}, corresponding to the left child t_l and right child t_r of node t, respectively. For a node t in a tree T, N_n(t) = |{i ∈ D^(T) : x_i ∈ R_t}| denotes the number of samples falling into R_t.
Each tree T is grown using a recursive procedure (the CART algorithm [5]), which proceeds in two steps for each node t. First, a subset M_try ⊂ [p] of features is chosen uniformly at random; the size of M_try is m_try. Then the optimal split k_t ∈ M_try, γ_t ∈ R is determined by maximizing the impurity decrease defined in (6):

Δ_n I(t; k_t, γ_t) = I_n(t) − (N_n(t_l)/N_n(t)) I_n(t_l) − (N_n(t_r)/N_n(t)) I_n(t_r),   (6)

where t_l (t_r) is the left (right) child of t and, for sample size n, I_n(t) is the impurity measure, defined in this paper as the variance of the responses y_i over all the samples in the region R_t, i.e.,

I_n(t) = (1/N_n(t)) Σ_{i: x_i ∈ R_t} (y_i − ȳ_t)^2,  with ȳ_t the mean of the y_i in R_t.

Note that the analysis of this paper holds only for the variance impurity measure, but it should be possible to extend it to other impurity measures, which is left as future work. The procedure terminates at a node t if the two children would contain too few samples, e.g., min{N_n(t_l), N_n(t_r)} ≤ 1, or if all responses are identical, i.e., I_n(t) = 0. For any tree T and any leaf node t_leaf ∈ T, denote by p(t_leaf) the path to that leaf node.
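To make the split criterion concrete, here is a minimal pure-Python sketch (helper names are our own) of the variance impurity I_n(t) and the impurity decrease Δ_n I(t; k, γ) for a single candidate split:

```python
def variance_impurity(ys):
    """I_n(t): variance of the responses falling into a node."""
    n = len(ys)
    mean = sum(ys) / n
    return sum((v - mean) ** 2 for v in ys) / n

def impurity_decrease(xs, ys, gamma):
    """Delta_n I(t; k, gamma) for splitting one feature column xs at gamma."""
    n = len(ys)
    left = [v for x, v in zip(xs, ys) if x <= gamma]
    right = [v for x, v in zip(xs, ys) if x > gamma]
    if not left or not right:
        return 0.0  # degenerate split: nothing moved, no decrease
    return (variance_impurity(ys)
            - len(left) / n * variance_impurity(left)
            - len(right) / n * variance_impurity(right))

# On a balanced grid with y = 1(x1 <= 0.5, x2 <= 0.5), splitting x1 at 0.5
# reduces the impurity from 3/16 to 1/8, a decrease of exactly 1/16.
m = 10
pts = [((i + 0.5) / m, (j + 0.5) / m) for i in range(m) for j in range(m)]
ys = [1.0 if x1 <= 0.5 and x2 <= 0.5 else 0.0 for x1, x2 in pts]
dec = impurity_decrease([x1 for x1, _ in pts], ys, 0.5)
```

The grid example anticipates the illustrative computation in Section 4: a quarter of the responses equal 1, so the root impurity is (1/4)(3/4) = 3/16, and the split isolates a pure right child.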
Definition 2 (Depth of a path). Given a path p(t_leaf) that connects the root node t_1 and the leaf node t_leaf in a tree T, we define the depth of the path p(t_leaf) to be the number of non-root nodes contained in the path.
For any hyper-rectangle R_t, μ(R_t) denotes its volume. We make the following assumptions on an RF tree ensemble: A1 (increasing depth of a tree in the RF ensemble). The minimum depth of any path in any tree goes to infinity, i.e., min over all trees T and leaf nodes t_leaf ∈ T of the depth of p(t_leaf) tends to ∞ as n → ∞.
A2 (balanced split in a tree of the RF ensemble). Each split (k_t, γ_t) is balanced: for any node t,

min{μ(R_{t_l}(k_t, γ_t)), μ(R_{t_r}(k_t, γ_t))} ≥ C_γ μ(R_t).

Note that, without loss of generality, we use the same C_γ here as in the LSS model; otherwise, we can always take C_γ to be the minimum of the two.
A3 (feature subsampling). The ratio m_try/p stays bounded away from 0 and 1, i.e., C_m ≤ m_try/p ≤ 1 − C_m for some constant C_m ∈ (0, 0.5).
A4 (no bootstrap or subsampling of samples). All the trees in the RF are grown on the whole data set, without bootstrapping or subsampling, i.e., D^(T) = D for every T.
A4 is a technical assumption that simplifies our notation and analysis. We assume that each tree is grown using all of the samples, which is quite different from the assumptions on subsampling in recent theoretical works on RF (e.g., [4] and [40]). The subsampling rate plays a crucial role in the analysis of the asymptotic distribution of the RF predictor [4,40], where it is assumed that the subsampling rate converges to zero at a suitable rate. However, since we focus on the features selected at each node, and not on the asymptotic distribution of the predictor, we do not require such assumptions on the subsampling rate.
A1 ensures that the length of any decision path in any tree tends to infinity. This assumption is reasonable, as tree depths in RF are usually of order O(log n), which tends to infinity as n → ∞. A2 ensures that each node split is balanced; similar conditions are commonly used in other papers [40]. A3 reflects the important role of the parameter m_try: roughly speaking, m_try can be neither too small nor too big. When m_try is too small, there are too many splits on irrelevant features, which makes the trees noisy. When m_try is too big, there is too little variability in the tree ensemble. This intuition is made rigorous in the proof of Theorem 2.

Depth weighted prevalence (DWP)
In this section, for a tree ensemble from RF, we formally introduce Depth-Weighted Prevalence (DWP). Given a decision tree T in an RF tree ensemble, we can randomly select a path P of T as follows: we start at the root node of T and then, at every node, randomly go left or right until we reach a leaf node. This is equivalent to selecting a path of depth D from all the paths in the decision tree with probability 2^{−D}. Denote the nodes in P by t_1, . . ., t_D, t_leaf. As such, any path P in a decision tree T can be associated with a sequence of signed features (k_{t_1}, b_{t_1}), . . ., (k_{t_D}, b_{t_D}), where D is the depth of the path and, for any inner node t ∈ [D] on the path, the sign b_t indicates whether the path at node t followed the ≤ direction (b_t = −1) or the > direction (b_t = +1) for the split on feature k_t ∈ [p]. For a given RF tree ensemble depending on data D, the randomly selected path P of tree T, and any fixed constant ε > 0, we now define F_ε(P, T, D) to be the set of signed features on P whose corresponding node had an impurity decrease of at least ε, that is,

F_ε(P, T, D) := {(k_t, b_t) : t ∈ [D] and Δ_n I(t; k_t, γ_t) ≥ ε}.

We use F_ε as a shorthand for F_ε(P, T, D) when the path P from tree T and the data D of interest are clear. Note that if a feature appears more than once on the path P, its sign in F_ε is the sign from the first time the feature appears with an impurity decrease above the threshold. Our main theorem will be stated in terms of the DWP of a signed feature set S^± ⊂ [p] × {−1, +1} on the random path P within F_ε. To formally define the DWP of S^±, we first need to identify the sources of randomness underlying F_ε. There are three layers of randomness involved: 1. (D: Data randomness) the randomness involved in the data generation; 2. (T: Tree randomness) the randomness involved in growing an individual tree with parameter m_try, given data D; 3. (P: Path randomness) the randomness involved in selecting a random path P of depth d with probability 2^{−d}, given a tree T from an RF tree ensemble with parameter m_try based on data D.
In the following definition of the DWP of signed feature sets, the probability is conditioned on data D, and taken only over the randomness of the tree T and the randomness of selecting one of its paths as in P.
Definition 3 (Depth-Weighted Prevalence (DWP)). Conditioning on the data, for any signed feature set S^± ⊂ [p] × {−1, +1}, we define the Depth-Weighted Prevalence (DWP) of S^± as the probability that S^± appears on the random path P within the set F_ε, that is,

DWP_ε(S^±) := P(S^± ⊆ F_ε(P, T, D) | D).   (8)

We emphasize that the probability of selecting a path P in a tree T is P(P | T) = 2^{−d}, where d is the depth of the path P.
While we only have a fixed sample size, which means the data randomness is inevitable, the tree randomness and path randomness are generated by the algorithm and thus can be eliminated by sampling as many trees and paths as we like. Because the DWP in (8) is conditioned only on the data, for any given ε > 0 and set of signed features S^±, it can be computed to arbitrary precision from an RF tree ensemble with sufficiently many trees (recall that, conditioned on the data D, the different trees in an RF tree ensemble are generated independently).
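Since, conditional on the data, DWP is an expectation over tree and path randomness only, it can be approximated by Monte Carlo. The sketch below (the tree encoding and function names are our own) represents each fitted tree as a nested dict storing the split feature and the node's impurity decrease, samples random root-to-leaf paths, and averages over trees and paths:

```python
import random

# A node is a leaf ({}) or a dict with the split feature, the node's
# impurity decrease Delta_n I(t), and "left"/"right" subtrees.
def sample_signed_set(tree, eps, rng):
    """Walk root-to-leaf, going left/right with probability 1/2 each, and
    collect signed features whose impurity decrease is at least eps."""
    node, out = tree, set()
    while "feature" in node:
        go_left = rng.random() < 0.5
        sign = -1 if go_left else +1
        k = node["feature"]
        if node["decrease"] >= eps and all(f != k for f, _ in out):
            out.add((k, sign))  # keep the first qualifying appearance only
        node = node["left"] if go_left else node["right"]
    return out

def estimate_dwp(trees, signed_set, eps, n_paths=20000, seed=0):
    """Monte Carlo estimate of DWP_eps(signed_set) given a fitted ensemble."""
    rng = random.Random(seed)
    hits = sum(
        signed_set <= sample_signed_set(rng.choice(trees), eps, rng)
        for _ in range(n_paths)
    )
    return hits / n_paths

# A small hand-built two-level tree: the root splits feature 1 and its
# left child splits feature 2; the set {(1,-1),(2,-1)} sits on exactly
# one depth-2 path, so its true DWP in this single tree is 1/4.
leaf = {}
tree = {"feature": 1, "decrease": 1 / 16,
        "left": {"feature": 2, "decrease": 1 / 4, "left": leaf, "right": leaf},
        "right": leaf}
est = estimate_dwp([tree], {(1, -1), (2, -1)}, eps=0.01)
```

With 20,000 sampled paths the estimate concentrates tightly around the true value 1/4, illustrating that the algorithmic randomness can be averaged out to any desired precision.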

Main results
In this section we present our main theoretical results, which concern the DWP introduced in the previous section. Our results show that LSSFind (Algorithm 1), which is based on DWP at an appropriate level ε described in Theorem 3, consistently recovers signed interactions under an LSS model. Before we state our main results in full detail, we illustrate them with a simple example.
Input: Dataset D, RF hyperparameter m_try, impurity threshold ε > 0, prevalence threshold η > 0, and maximum interaction size s_max ∈ N. Output: A collection of sets of signed features. Train an RF using dataset D with parameter m_try; for every set of signed features S^± with |S^±| ≤ s_max, compute DWP_ε(S^±) from the tree ensemble; return the sets whose DWP is within η of the universal upper bound 2^{−|S^±|}.
Algorithm 1: LSSFind(m_try, ε, η, s_max)

Illustrative example: Assume that p = 2, so there are just two features X_1 and X_2, that there is a single interaction (J = 1), and that the regression function is given by

E(Y | X_1, X_2) = 1(X_1 ≤ 0.5, X_2 ≤ 0.5).   (9)

The response surface of (9) is shown in the top middle plot of Figure 1. We consider the population case, where we have full access to the joint distribution P(X, Y), that is, we have access to an unlimited amount of data (n = ∞). When we apply the RF algorithm as in Section 3, then for each individual tree in the forest the root node either splits on feature X_1 or on feature X_2. Since X_1 and X_2 are completely symmetric in the distribution P(X, Y), if the RF algorithm grows more and more trees then, in the limit, half of them split on X_1 at the root node and half on X_2. For infinite data, this 50/50 split is introduced by the CART algorithm, since the two splits have identical decreases in impurity. Furthermore, the split at any node will be at 0.5 for either of the two features, since the splits X_1 ≤ 0.5 and X_2 ≤ 0.5 maximize the impurity decrease given infinite data. This is illustrated in Figure 1, where the bottom left plot shows a tree which splits on feature X_1 at the root node and the bottom right plot shows a tree which splits on feature X_2 at the root node.

Figure 1: Exemplary RF decision trees trained on data as in (9) to illustrate the results that will appear in Theorem 2. Top center: response surface of E(Y | X_1, X_2) as in (9), with X_1 ∈ [0, 1] on the x-axis and X_2 ∈ [0, 1] on the y-axis. Bottom left: a decision tree that splits on feature X_1 at the root node, with the respective regions and conditional response surfaces for the left and right child of the root node. Bottom right: a decision tree that splits on feature X_2 at the root node. The red-marked decision paths contain all signed features from the basic signed interaction S^− = {(1, −1), (2, −1)} of the LSS model in (9). For both trees, if one starts at the root node and randomly goes left or right at every node, then the probability that the basic signed interaction appears on the path is 2^{−2} = 1/4. In contrast, for any other set of signed features S^±, the probability falls strictly below 2^{−|S^±|}. This provides a simple example of the more general result in Theorem 2.

As each tree in the RF grows to purity, when the root node splits on feature X_1, then on the path which follows the (1, +1) direction, that is, the X_1 > 0.5 direction, the tree stops growing, as the respective response surface is already constant. However, on the path which follows the (1, −1) direction, that is, the X_1 ≤ 0.5 direction, the tree further splits on the remaining feature X_2, and then stops because the node reaches purity. Thus, we conclude that the forest consists of exactly the two different trees shown in Figure 1 and, in the limit where the number of trees grows to infinity, each of the two trees appears equally often.
For each node t in these trees, the impurity decrease satisfies ∆ n I (t) ≥ 1/16.Thus, for any < 1/16, we can show that the DWP of the basic signed interaction In the above example with infinite data, the tree depth is not going to infinity, which means it does not satisfy assumption A1.A1 is needed only for the finite sample case because, for finite samples, internal nodes in a tree can never reach purity due to noise.
In Figure 1 the paths which contain the basic signed interaction S− = {(1, −1), (2, −1)} are marked red. For all other sets of signed features S±, we have DWP(S±) < 0.5^{|S±|}. As we will formally state in the two theorems below, the same reasoning holds asymptotically for any RF trained on data from the LSS model: the DWP of a set of signed features S± ⊂ [p] × {−1, +1} is always upper bounded by 2^{−|S±|}, and this upper bound is attained if and only if S± is a union signed interaction. Recall that the DWP depends on the data D. It turns out that the general upper bound follows directly from the construction of the DWP and holds for any data D, i.e., independently of the LSS model, as the following theorem shows.
Theorem 1. For any impurity threshold ε > 0 and any set of signed features S± ⊂ [p] × {−1, +1}, for the RF algorithm from Section 3 it holds true that DWP(S±) ≤ 2^{−|S±|}. In addition, when the data D is generated from an LSS model, this general upper bound is attained asymptotically (as the sample size increases) if and only if S± is a union signed interaction, as the following theorem shows.
Theorem 2. Assume that the data D is generated from an LSS model satisfying the uniformity, bounded-response, non-overlapping-basic-interactions, and sparsity constraints (see C1-C4). For any impurity threshold ε > 0, let b(ε) be as in (10), with constants C_β as in (2), C_γ as in (3), s as in C4, and C_m as in A3. Given a set of signed features S± ⊂ [p] × {−1, +1}, for the RF algorithm from Section 3 it holds true that:
• (Interaction lower bound) when S± is a union signed interaction as in Definition 1, we have
• (Non-interaction upper bound) when S± is not a union signed interaction, then
where →_p denotes convergence in probability.
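The bound of Theorem 1 and the strict gap of Theorem 2 can be probed by Monte Carlo. The sketch below uses a simplified, assumed reading of the DWP: the probability that all signed features of S± appear (as first occurrences) on a uniformly random root-to-leaf walk through a randomly chosen tree; the ε-filtering of low-impurity splits is omitted for brevity, and the thresholded response is an assumed stand-in for an LSS model.

```python
# Simplified DWP estimator (assumption: DWP = probability that a uniformly
# random downward walk contains all signed features of S±; ε-filtering omitted).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def random_path_signed_features(tree, rng):
    """Walk from the root, going left (sign -1, i.e. X_k <= threshold) or right
    (sign +1) with probability 0.5; record each feature's first split sign."""
    t, node, first_sign = tree.tree_, 0, {}
    while t.children_left[node] != -1:                     # -1 marks a leaf
        k = int(t.feature[node])
        go_left = rng.random() < 0.5
        first_sign.setdefault(k, -1 if go_left else +1)
        node = t.children_left[node] if go_left else t.children_right[node]
    return {(k, s) for k, s in first_sign.items()}

def estimate_dwp(forest, signed_set, n_paths=4000, seed=1):
    rng = np.random.default_rng(seed)
    hits = sum(
        signed_set <= random_path_signed_features(
            forest.estimators_[rng.integers(len(forest.estimators_))], rng)
        for _ in range(n_paths))
    return hits / n_paths

rng = np.random.default_rng(0)
X = rng.uniform(size=(5000, 2))
y = ((X[:, 0] <= 0.5) & (X[:, 1] <= 0.5)).astype(float)   # assumed LSS-style data
rf = RandomForestRegressor(n_estimators=50, max_features=1, random_state=0).fit(X, y)

dwp_union = estimate_dwp(rf, {(0, -1), (1, -1)})   # union interaction: near 2**-2
dwp_other = estimate_dwp(rf, {(0, -1), (1, +1)})   # not an interaction: strictly below
print(f"union: {dwp_union:.3f}, non-interaction: {dwp_other:.3f}, bound: {0.5 ** 2}")
```

In this two-feature example the union interaction sits at the 2^{−2} bound, while a set with a flipped sign falls strictly below it, as Theorem 2 predicts.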
Proof Sketch: The detailed proof of Theorem 2 is deferred to Section 2. It has two major parts: first, showing the assertion for the idealized population case and second, extending the population case to the finite sample case.
In the first part, we define a population version of the set F, which we denote by F̄. The set F̄ contains only desirable features, that is, features of a path P that correspond to a positive decrease in impurity if the RF gets to see the full distribution P(X, Y) (not just a finite sample D). Note that desirable/non-desirable features are different from signal/noisy features: the definition of desirable/non-desirable features depends on the path in the tree under consideration. A noisy feature is always non-desirable, but a signal feature can become non-desirable once it has been split on along the path; see Definition 4. The set F̄ is an oracle, in the sense that its construction depends on the true underlying LSS model. This is in contrast to the set F, which can be computed for any given path of a tree in the RF. Given this definition of F̄, a sketch of the proof of the major assertions of Theorems 1 and 2 is as follows:
1. When a set of signed features S± appears in F̄, then every time a signed feature (k, b) ∈ S± appears on the way from the root node to the leaf, the splitting direction implied by b was selected for P, which gives rise to the general upper bound DWP(S±) ≤ 2^{−|S±|} (see Theorem 1).
2. If S± is a union signed interaction, then (assuming all leaf nodes of the tree are pure) a correct splitting direction for each of its features already implies that S± appears on P, and thus DWP(S±) ≈ 2^{−|S±|} (see the first part of Theorem 4 below).
3. If S± is not a union signed interaction, then it is always possible that, although every split on an encountered feature belonging to S± was made in the correct direction, some features in S± were simply never encountered; hence a correct splitting direction does not imply that S± appears on P, and DWP(S±) < 2^{−|S±|} (see the second part of Theorem 4 below).
In the second part of the proof, we show that the observed set F and the oracle set F̄ coincide with probability going to 1 as ε goes to zero and n goes to infinity. This would be straightforward if a tree grown from finite samples converged, in terms of its splitting features and thresholds, to the tree grown on the population as the sample size tends to infinity. However, this is not true: when a node splits on a non-desirable feature, all thresholds yield the same impurity decrease in the population case, so the threshold selected from finite samples can deviate from the population threshold no matter how many samples are used. Thus, we need to analyze desirable and non-desirable features separately, based on uniform convergence results.
Remark 1: Theorems 1 and 2 demonstrate that recovery of interactions becomes exponentially more difficult as the size of an interaction increases. An interaction S± corresponds to a region of size O(2^{−|S±|}), which means the sample size must be much larger than 2^{|S±|} to have enough samples in that region. Also, the DWP of a basic interaction S± at an appropriate impurity level is 2^{−|S±|}; to estimate it consistently, the number of independent paths must be much larger than 2^{|S±|}. Thus, to recover an interaction of size s, both the number of samples and the number of trees must be much larger than 2^s. This shows the intrinsic difficulty of estimating high-order interactions.
Using the conclusions in Theorem 2, one can show that LSSFind (Algorithm 1) can consistently recover all the basic interactions from the LSS model, as stated in Theorem 3.
Theorem 3. Let the output of LSSFind (Algorithm 1) be Ŝ(m_try, ε, η, s_max). Under the same settings as in Theorem 2, as long as m_try, ε, and η satisfy the assumptions in Theorem 2 and the following condition, with b(ε) defined in (10) and C_m in Assumption A3:
then, with probability approaching 1 as n → ∞, Ŝ is a superset of the basic signed interactions of size at most s_max and a subset of the union signed interactions. In particular, if we define
then U equals the set of basic signed interactions of size at most s_max.
Note that to recover all basic interactions, s_max needs to be larger than or equal to the order of every basic signed interaction; the latter, however, is unknown, as we do not know the underlying LSS model.
Proof: If S± is not a union signed interaction, then it follows from the second part of Theorem 2 that, with probability approaching 1 as n → ∞, S± is not selected. Thus, Ŝ is a subset of the union signed interactions. If S± is a basic signed interaction of size at most s_max, then it follows from the first part of Theorem 2 that, with probability approaching 1 as n → ∞, S± ∈ Ŝ. Thus, Ŝ is a superset of the basic signed interactions of size at most s_max.
Remark 2: One important assumption in our theorems is the sparsity of signal features. If there are many "weak" signal features, it is very hard for RF to perform well: at each node of a tree, only one feature is used, which means the total number of features used along each path is limited by the depth of the tree, usually of order O(log n). For our assertions in Theorem 2, the hard threshold ε in the set F has the purpose of selecting the signal features. Clearly, choosing an appropriate value of ε is hard in practice. The iterative Random Forest fitting procedure in iRF [1] (which uses joint prevalence on decision paths in RF to recover interactions, similarly to what Theorem 2 suggests) filters noisy features not with a hard but with a soft thresholding procedure: it grows several RFs iteratively and samples features at each node according to their feature importance from the previous iteration. In that way, one does not need to choose a single hard threshold, which leads to a much more practical algorithm. Unfortunately, such iterative soft thresholding makes theoretical analysis much harder, which is why we restrict ourselves to the hard threshold for the theoretical analysis in this work.
One remarkable aspect of the result in Theorem 3 is that the range of η is independent of any model coefficients in the LSS model (that is, the linear coefficients β and the thresholds γ). For sufficiently small ε, it depends only on the number of signal features s and the bound C_m on m_try, and nothing else. In a sense, this shows that the tree ensemble of an RF contains the qualitative, discrete-set information of which features interact with each other, independently of the quantitative information about the numerical parameters or model coefficients of the LSS model.
Another interesting aspect of the results from Theorem 3 is that they shed some light on the influence of m_try on the interaction recovery performance of RF. For the third assertion in Theorem 2 we actually show that DWP(S±) ≤ r_n(D, ε) + 0.5^{|S±|} (1 − 0.5 · min_{k ∈ ∪_j S_j} P(root node splits on feature k)).
When m_try is too large, min_{k ∈ ∪_j S_j} P(root node splits on feature k) can get very small, as particularly strong features (with a large initial impurity decrease) can mask weaker ones. As an extreme example, consider m_try = p, so that the root node gets to see all the features. In that case, the single feature with the highest impurity decrease, say X_1, will always appear at the root node, and hence, for S± = {(1, −1)} or S± = {(1, +1)}, one gets DWP(S±) = 2^{−|S±|} = 0.5, independent of whether S± is an interaction or not. This shows that when m_try is too large, DWPs corresponding to false interactions can attain the universal upper bound 2^{−|S±|}, which leads to false positives in terms of interaction recovery. On the other hand, when m_try is too small, a signal feature k ∈ ∪_j S_j can take a long time to be selected into the candidate feature set at a node. In particular, with finite samples it can happen that the tree reaches purity, due to lack of samples, without having split on any of the signal features. Hence the reasoning of Theorem 2, namely that a correct split direction plus a pure path implies that a union interaction appears on the path, no longer holds. This can lead to union interactions having a significantly smaller DWP than the universal upper bound 2^{−|S±|}, i.e., to false negatives in terms of interaction recovery.

LSSFind and simulation results
In this section, motivated by our theoretical results in the previous section, we evaluate LSSFind empirically in terms of its ability to recover interactions. Simulated experiments are carried out to assess whether LSSFind correctly recovers interactions from the LSS model, even when some of the LSS model assumptions are violated.
In LSSFind, one needs to search over all possible sets of size at most s_max to obtain the final result, which is computationally very intensive. A more efficient approach is to look only for sets of size at most s_max which also satisfy condition (12); this means that, instead of searching over all possible sets of size at most s_max, we only need to search for sets whose DWPs are larger than (1 − η) · 2^{−s_max}. Because many sets of size at most s_max are filtered out, this significantly reduces the search space. We use the FP-growth algorithm [14] to obtain those sets of signed features whose DWP exceeds this threshold. Note that the DWP is defined with respect to an infinite number of trees; to approximate it, we use 100 trees in the simulation. Since each tree contains thousands of paths, this gives hundreds of thousands of paths from which to estimate the DWP.
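For illustration, this frequency filtering can be sketched with a brute-force subset count in place of FP-growth (which one would use in practice for large path collections). The toy "transactions" below are hypothetical signed-feature sets of sampled paths, not output of the paper's pipeline; the threshold (1 − η) · 2^{−s_max} mirrors condition (12).

```python
# Brute-force stand-in for FP-growth: count how often each signed-feature set of
# size <= s_max occurs as a subset of a path's signed features, and keep those
# whose empirical frequency (a DWP estimate) exceeds (1 - eta) * 2**(-s_max).
from collections import Counter
from itertools import combinations

def frequent_signed_sets(transactions, s_max, threshold):
    counts = Counter()
    for tr in transactions:
        items = sorted(tr)
        for r in range(1, min(s_max, len(items)) + 1):
            for sub in combinations(items, r):
                counts[sub] += 1
    n = len(transactions)
    return {sub: c / n for sub, c in counts.items() if c / n >= threshold}

# hypothetical path transactions, loosely mimicking the two-feature example
paths = [{(1, -1), (2, -1)}, {(1, -1), (2, +1)}, {(1, +1)}, {(2, +1)},
         {(1, -1), (2, -1)}, {(1, +1)}, {(1, +1), (2, -1)}, {(2, +1)}]
eta, s_max = 0.1, 2
found = frequent_signed_sets(paths, s_max, (1 - eta) * 2 ** (-s_max))
print(found.get(((1, -1), (2, -1))))   # frequency 0.25, above the 0.225 threshold
```

Only the frequently co-occurring signed pair survives the filter; rare sign combinations are pruned before any exhaustive search.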

Simulated data from LSS models
In the following we present simulation results, where we generate data D from the LSS model for different numbers and orders of basic interactions and different signal-to-noise ratios (SNRs). We find that LSSFind recovers the true interactions from the LSS model with high probability whenever the overall number of basic interactions and their orders are small. More precisely, we consider p = 20 features and n = 1,000 samples, where each feature X_j is generated from a uniform distribution U([0, 1]), independently of the others. The number of basic interactions is denoted by J and the order of each interaction by L. We consider the same threshold τ for all features. The noise is Gaussian with variance σ² and the response is:
We consider different values for J, L, and σ², namely J = 1, 2, L = 2, 3, 4, and σ² such that the SNR is 0.5, 1, 2, or 5. For given J and L, the threshold τ is chosen such that about 50 percent of the samples fall into the union of hyper-rectangles, that is, into ∪_{j=1}^{J} ∩_{k=(j−1)·L+1}^{j·L} {X_k < τ}. Since the number of samples falling into this union, which can roughly be thought of as the label imbalance, has a high impact on the results, keeping it the same across different simulation settings makes the simulation outcomes more comparable. The results are averaged across 40 independent Monte Carlo runs. We grow RFs using the scikit-learn package with 100 trees. We apply LSSFind with parameters η = 0.01, ε = 0.01, and s_max = L + 1.
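A data-generating sketch for this setting is given below; the unit coefficients for the interaction terms and the definition SNR = Var(signal)/σ² are assumptions made for illustration.

```python
# LSS-style simulator sketch: J non-overlapping interactions of order L over
# features 0..J*L-1; tau set so ~50% of samples fall into the union of rectangles.
import numpy as np

def simulate_lss(n=1000, p=20, J=2, L=2, snr=1.0, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.uniform(size=(n, p))
    # P(union) = 1 - (1 - tau**L)**J; solve for 50% prevalence
    tau = (1.0 - 0.5 ** (1.0 / J)) ** (1.0 / L)
    signal = np.zeros(n)
    for j in range(J):                       # unit coefficients: an assumption
        block = X[:, j * L:(j + 1) * L]
        signal += np.all(block < tau, axis=1).astype(float)
    sigma = np.sqrt(signal.var() / snr)      # SNR := Var(signal) / sigma**2
    y = signal + rng.normal(scale=sigma, size=n)
    return X, y, tau

X, y, tau = simulate_lss(n=5000, p=20, J=2, L=2, snr=5.0)
in_union = np.zeros(len(X), dtype=bool)
for j in range(2):
    in_union |= np.all(X[:, j * 2:(j + 1) * 2] < tau, axis=1)
print(f"tau = {tau:.3f}, fraction in union: {in_union.mean():.3f}")
```

Solving 1 − (1 − τ^L)^J = 0.5 for τ keeps the "label imbalance" fixed at roughly 50% across all (J, L) settings, as described above.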
Recall that we use (12) to select candidate interactions. If s_max were set to L, condition (12) would be too restrictive in challenging situations, such as when the LSS model is violated, and LSSFind could end up finding no interactions. Given the set S* of K true basic signed interactions from the respective LSS model and the output Ŝ of LSSFind, we evaluate their proximity based on their Jaccard distance: Note that every element of S* and Ŝ is a set of signed features. This score gives no credit for partial recovery: if one interaction S± in S* is {(1, +1), (2, +1)}, there is no credit for Ŝ containing subsets of S± such as {(1, +1)}, or the same features with different signs such as {(1, +1), (2, −1)}. While this score can be overly restrictive for practical problems, it is suitable for our simulation because we want to evaluate whether LSSFind consistently recovers the interactions in the LSS model. The simulation results are shown in Figure S1. In general, the performance of LSSFind degrades sharply as the number of basic interactions and the order of interactions increase. For K = 1 and L = 2, 3, 4, LSSFind almost always recovers the correct basic signed interactions. For K = L = 2 it mostly recovers them, except for small SNR. When K = 2 and L = 3, 4, LSSFind rarely recovers the basic signed interactions in this simulation setup, resulting in a score of almost zero. This is consistent with our results in Theorem 2, which indicate that the problem is much harder for more interactions and higher-order interactions. We also explored the high-dimensional case. For p = 20, 50, 100, 200 and n = 1000 · (1 + log(p/20)), the score for LSSFind is shown in Figure 6. The scaling of p and n is chosen so that log(p)/n ≈ 0.001 and so that p = 20 gives n = 1000, which matches our previous numerical setting for better comparison. Recall that Theorems 2 and 3 require the condition log(p)/n → 0, as stated in Condition C4. We also
note that log(p)/n → 0 is commonly imposed when analyzing lasso problems as well [37,45,15]. As can be seen in Figure 6, the score increases and approaches 1 as the dimension p increases. This is consistent with Theorem 3, which shows that LSSFind can recover the underlying interactions in the high-dimensional case.
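The strict score above can be sketched as the Jaccard similarity between two collections of signed interactions, each interaction encoded as a frozenset of (feature, sign) pairs; this is a plausible reading of (14), written out here as an assumption.

```python
# Strict, all-or-nothing score: Jaccard similarity between collections of
# signed interactions; an interaction counts only when matched exactly.
def strict_score(true_interactions, found_interactions):
    A = {frozenset(s) for s in true_interactions}
    B = {frozenset(s) for s in found_interactions}
    if not A and not B:
        return 1.0
    return len(A & B) / len(A | B)

s_true = [{(1, +1), (2, +1)}]
print(strict_score(s_true, [{(1, +1), (2, +1)}]))   # exact match -> 1.0
print(strict_score(s_true, [{(1, +1)}]))            # subset: no credit -> 0.0
print(strict_score(s_true, [{(1, +1), (2, -1)}]))   # wrong sign: no credit -> 0.0
```

The frozenset encoding makes "same features with different signs" a distinct, unmatched element, which is exactly the no-partial-credit behavior described in the text.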

Robustness to LSS model violations
In the following, we present simulation results for LSSFind when the data is generated from a misspecified LSS model, that is, when some of the LSS model assumptions are violated. We find that the performance of LSSFind deteriorates under such violations. We consider a misspecified LSS model with SNR = 5 and two order-2 interactions, with p = 20 features and n = 1,000 samples, analogous to Figure 2, second row, first column, third bar. We consider the following violations of the LSS model assumptions:
• Overlapping interactions: different basic interactions have overlapping features. When overlap = 1, the basic interactions are {(1, −1), (2, −1)} and {(2, −1), (3, −1)}.
• Correlated features: different features are correlated instead of independent. When corr = α, the correlation between features j_1 and j_2 is α^{|j_1 − j_2|}.
• Heavy-tailed noise: the noise follows a Laplace or Cauchy distribution, which have heavier tails than (sub-)Gaussian distributions. The noise is normalized such that the SNR is 50.
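Correlated uniform features with correlation α^{|j1 − j2|} can be generated with a Gaussian-copula sketch; it is a modeling assumption that the AR(1) correlation holds for the latent Gaussians (the resulting uniforms have a slightly smaller Pearson correlation).

```python
# Gaussian-copula sketch: latent AR(1) Gaussians with corr(Z_j1, Z_j2) =
# alpha**|j1 - j2|, mapped through the normal CDF to uniform marginals.
import math
import numpy as np

def correlated_uniforms(n, p, alpha, seed=0):
    rng = np.random.default_rng(seed)
    Z = np.empty((n, p))
    Z[:, 0] = rng.standard_normal(n)
    for j in range(1, p):                    # AR(1) latent process
        Z[:, j] = alpha * Z[:, j - 1] + math.sqrt(1.0 - alpha ** 2) * rng.standard_normal(n)
    normal_cdf = np.vectorize(lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0))))
    return normal_cdf(Z)

U = correlated_uniforms(20_000, 5, alpha=0.5)
print(f"mean: {U.mean():.3f}, corr(U1, U2): {np.corrcoef(U[:, 0], U[:, 1])[0, 1]:.3f}")
```

This preserves the U([0, 1]) marginals assumed by the LSS model while injecting the geometric decay of correlation used in the violation experiment.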
The results of LSSFind are shown in Figure 3. For heavy-tailed noise, we observe a gradual drop in performance. For the correlated-feature case, LSSFind performs reasonably when the correlation is close to zero, but its performance deteriorates when the correlation is high. Similarly, the performance worsens in the overlapping-feature case.

Empirical comparison between LSSFind and iRF
Our original motivation to study the DWP of RF tree ensembles came from the strong positive empirical evidence for iRF [1,21]. There are three major reasons why the full iRF procedure is hard to analyze theoretically. First, the iterative re-weighting in iRF is based on the feature importance metric of mean decrease in impurity (MDI). Analyzing MDI for the RF algorithm is a challenging task on its own; in particular, the MDI of noisy features in deep trees is known to be biased [46,23,26], which may propagate through the iterations of iRF and makes a theoretical analysis very challenging. Second, the iRF procedure selects interactions from the paths of the RF tree ensemble via the RIT algorithm [34], where individual paths are weighted according to the number of observations falling into their respective leaf nodes. This means that the selected feature interactions of iRF cannot be derived from the RF tree ensemble directly, but depend on the data in a more complex way. Third, the outer stability layer of iRF, where interactions are evaluated based on their consistent appearance among several bootstrap replications of the procedure, adds yet another layer of complexity for theoretical analysis.
In order to still analyze the major aspects of iRF theoretically, we proposed the related LSSFind algorithm.Instead of iterative re-weighting via MDI, LSSFind introduces a single hard-threshold on the impurity index at individual tree nodes.Moreover, instead of selecting interactions via RIT, LSSFind is based on DWP, which is derived from the tree ensemble directly, without an additional data dependent sampling scheme.In other words, although a high DWP in LSSFind does not exactly correspond to the RIT interaction selection strategy employed in iRF, they both build on similar high-level quantities, namely, sets of stable features which often appear together on decision paths in an RF tree ensemble.Therefore, our theoretical results on DWP and LSSFind provide evidence that the general interaction discovery strategy of iRF is theoretically justified.In the following, we complement our theoretical findings about LSSFind with an empirical comparison between iRF and LSSFind.
We consider the same simulation setting as in Section 5.1, but replace the very strict performance measure in (14) by a weaker one. Specifically, given the set S* of K true basic signed interactions from the respective LSS model and the output Ŝ of LSSFind and iRF, respectively, we now evaluate their proximity based on (15), which corresponds to the Jaccard distance on the set of unsigned features appearing in any of the detected interactions. While the stricter metric in (14) is more appropriate for evaluating the finite-sample validity of Theorem 2, the relaxed version in (15) is arguably of more practical interest, because it gives partial credit for interactions that are almost, but not perfectly, recovered. If score(S*, Ŝ) is high, the features in the discovered interactions overlap with the features in the true interactions, which greatly narrows down the interaction search space and saves tremendous effort in subsequent analyses of a practical problem.
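The relaxed measure can be sketched as a feature-level Jaccard similarity, comparing the unsigned features appearing anywhere in the true versus the detected interactions; this is a plausible reading of (15), written out here as an assumption.

```python
# Relaxed score: Jaccard similarity on unsigned features pooled across all
# interactions, giving partial credit for near-misses.
def relaxed_score(true_interactions, found_interactions):
    A = {k for s in true_interactions for (k, _) in s}
    B = {k for s in found_interactions for (k, _) in s}
    if not A and not B:
        return 1.0
    return len(A & B) / len(A | B)

# true features {1, 2}; detected features {1, 2, 3} -> 2/3
print(round(relaxed_score([{(1, +1), (2, +1)}],
                          [{(1, +1)}, {(2, -1), (3, -1)}]), 3))   # -> 0.667
```

Unlike the strict score, a subset or a sign flip still contributes its features to the overlap, which is why complex settings look much better under this metric.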
For the iRF algorithm we used the signed iRF algorithm (siRF) from the Python iRF package, with default parameter settings and a threshold of 0.5 on iRF's stability score for interaction selection, as recommended in [21]. Simulation results are shown in Figure 4. When the LSS model in (13) is relatively simple, for example when it has only a single signed interaction (K = 1) or only a single feature per signed interaction (L = 1), iRF and LSSFind perform comparably (see the first row and the second row, first column of Figure 4). However, when the LSS model becomes more complex, with several additive interactions (K > 1), each having more than one signed feature (L > 1) (see second row, second and third columns in Figure S3), iRF outperforms LSSFind in terms of the metric (15). In summary, we find that iRF outperforms LSSFind in situations where the underlying LSS model is more complex and a flexible performance metric is chosen. This appears consistent with the fact that the iRF algorithm has witnessed empirical success on specific domain problems [1], whereas LSSFind was constructed specifically to reflect our result in Theorem 2.

Discussion
Relevant statistical theory starts with a model that is a good approximation to reality; it is thus important to derive theoretical results under a scientifically motivated model. Our proposed LSS model class provides such a family, reflecting the biological phenomenon of biomolecules interacting through thresholding. Moreover, analyzing RF-based algorithms under models other than the smoothness classes common in the literature can give insights into their empirical adaptivity. Our results are the first to show theoretically that the DWP of a set of features in an RF tree ensemble recovers high-order interactions under the LSS model and reasonable conditions on the RF hyperparameters. Moreover, the universality of the DWP of interactions in LSS models gives insights into the general difference between quantitative (e.g., prediction accuracy) and qualitative (e.g., interaction recovery) information extraction. In scientific problems, the latter is often of greater interest.
Thus, this work narrows the gap between theory and practice for Boolean interaction discovery and is of general interest to the fields of statistics, data science, machine learning, and scientific fields such as genomics.
Our theoretical analysis also gives some insight into the tuning of a crucial RF hyperparameter, m_try: given an interaction of fixed size, the non-interaction DWP upper bound in Theorem 2 depends only on C_m, and C_m is constrained only by m_try (A3). Therefore, one can find an optimal m_try that minimizes this upper bound. The optimal choice turns out to be m_try = p · (0.5 − s/(2(p − 2))). If one third of all features are signal features, that is, s = p/3, this recovers the default choice in standard RF implementations for regression, namely m_try ≈ p/3. However, when p ≫ s, the optimal choice from our theoretical results corresponds to m_try ≈ p/2, which suggests that in the presence of many noisy features m_try should be larger than the default p/3. Further investigations through data-inspired simulations and theoretical analyses are needed.
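The stated optimum can be evaluated directly; the small computation below only restates the formula from the text (the concrete p, s values are illustrative).

```python
# Optimal m_try from the non-interaction DWP upper bound:
# m_try = p * (0.5 - s / (2 * (p - 2)))
def optimal_mtry(p, s):
    return p * (0.5 - s / (2.0 * (p - 2)))

# s = p/3 recovers roughly the p/3 default; p >> s pushes m_try toward p/2
print(round(optimal_mtry(300, 100)))   # -> 100, about p/3
print(round(optimal_mtry(300, 5)))     # -> 147, close to p/2 = 150
```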
One might wonder whether the form of interaction defined by the LSS model constitutes a particularly difficult or a particularly easy form of feature interaction. In general, there appears to be no clear (mathematical) answer to this question, as one cannot cleanly define what is meant by feature interaction for a generic (possibly discontinuous) regression function f(X) = E(Y | X). For example, it is easy to check that for any multivariate function f : [0, 1]^p → R one can find (possibly discontinuous) univariate functions g, h_1, ..., h_p such that f(x_1, ..., x_p) = g(h_1(x_1) + ... + h_p(x_p)). We stress that we considered the LSS model in this work not because it defines a particularly easy form of interaction, but because of its biological relevance: the thresholding relationships captured in the LSS model are observed in various kinds of biological data.
Although the LSS model is motivated by biological phenomena, some of the assumptions we made to derive our theoretical results may be difficult to justify directly in real-world problems, in particular the independence condition C1 and the non-overlapping interaction-set condition C3. Note, however, that in many application settings these limitations can be overcome by appropriate data preprocessing, e.g., decorrelating features (recall the discussion after condition C1). Nevertheless, for future work it will be interesting to extend our results to a general LSS model (with possibly overlapping interaction sets and correlated features), or even to interaction models beyond Boolean interactions, in order to further close the gap between theory and practice.
Finally, it will also be of interest to compare LSSFind and iRF with methods that, more generally, employ an ML black-box model to extract interactions. For example, when individual features are independent, as we assume in C1, one can use Monte Carlo methods [32] to estimate higher-order Sobol indices for the fitted ML model.

Proof of Theorem 1 and Theorem 2

7.1 Proof of the population case: desirable features
Recall that there are three different sources of randomness:
1. (D) the randomness of the data D;
2. (T) the randomness of the tree T, given the data D;
3. (P) the randomness of the randomly selected path P, given the tree T.
Note that, although the random path P depends on all three sources of randomness (the data, the tree, and the additional path randomness), conditional on the tree T the random path P is independent of the data D. In the first part of the proof, we only consider the last two sources of randomness, namely the random tree and the randomly selected path in the tree. Also, recall that we define S_j^-, S_j^+ ⊂ [p] × {−1, +1} as the features in S_j ⊂ [p] with − and + sign, respectively. For each node t in a tree T, define Ḟ±(t) to be the set of signed features used by the ancestors of t in T and Ḟ(t) to be the corresponding (unsigned) features. For a feature j, (j, −) and (j, +) can appear together in Ḟ±(t). Furthermore, let F±(t) be the subset of Ḟ±(t) obtained by keeping, for each feature that appears multiple times on the path, only the signed feature corresponding to its first split. As a result, for any feature j, at most one of (j, +) and (j, −) appears in F±(t). Define F(t) to be the set of (unsigned) features in F±(t). Because Ḟ±(t) and F±(t) differ only in feature signs, they correspond to the same set of features, i.e., Ḟ(t) = F(t). Conditional on a tree T, at every node t of T we now define the set of desirable features with respect to the LSS model as follows.
Definition 4 (Desirable features). Define the desirable feature set U(t) ⊂ [p] to be
Note that the set of desirable features U(t) at a node t is only defined with respect to a particular LSS model; in particular, it depends on the basic signed interactions S_1^-, ..., S_J^-. Hence, for a given tree T with node t, U(t) is an oracle set, which cannot be computed from data. The way to think about U(t) is that it corresponds exactly to the set of features which would yield some impurity decrease if the tree were grown on the full data distribution P(X, Y), and hence with every split made at the correct split point. Moreover, let t_leaf denote the leaf node of P and define F̄ to be the desirable signed features of F±(t_leaf), that is, the signed features k_t such that, for the node t on the path P, we have k_t ∈ U(t), i.e.,
For notational simplicity, we use F̄ as shorthand for F̄(P). Further, we define the event Ω_0 to be that the desirable features are exhausted at the leaf node:
With these definitions we get the following lemma.
Lemma 1. For the event Ω_0 in (20) it holds true that
Proof. For an arbitrary interaction j ∈ [J], it follows from the definition of U(t) that Ω_0 implies either S_j^+ ∩ F±(t_leaf) ≠ ∅ or S_j^- ⊂ F±(t_leaf). First, assume S_j^+ ∩ F±(t_leaf) ≠ ∅ and consider the signed feature (k, +1) ∈ S_j^+ that appears first on the path. Because F±(t) only records signed features when they first appear on a path, (k, +1) was desirable and thus (k, +1) ∈ F̄, i.e., S_j^+ ∩ F̄ ≠ ∅. Second, consider S_j^- ⊂ F±(t_leaf). Then, for any (k, −1) ∈ S_j^-, no S_j^+ feature appeared on the path before (k, −1); hence (k, −1) was desirable, i.e., (k, −1) ∈ F̄, and thus S_j^- ⊂ F̄. Finally, recall that by the definition of F̄ both conditions in (21) can never happen at the same time.
We state the population version of our main results below.
Theorem 4. It holds almost surely that (23); if S± is a union signed interaction as in Definition 1, then almost surely (24); moreover, if S± is not a union signed interaction, then almost surely (25).
Proof of Theorem 4. Recall that the path P corresponding to F̄ is selected as follows: one starts at the root node t_root and then randomly follows the path in the tree, in either the plus (+1) or the minus (−1) direction, with probability 0.5 each. Let B denote a set of i.i.d. Bernoulli coin flips taking values +1 and −1 with equal probability 0.5. Assume that at every node in the tree we draw one of the Bernoulli coin flips B ∈ B to decide whether we follow the path in the plus (B = +1) or in the minus (B = −1) direction. In particular, for any feature k ∈ [p], let B_k ∈ B be the Bernoulli random variable drawn when k appears for the first time on P.
Proof of (23): Note that when (k, −1) ∈ F̄, then B_k = −1; similarly, when (k, +1) ∈ F̄, then B_k = +1. Consequently, for any S±
and hence
This completes the proof of (23).
Proof of (24): Consider any basic interaction S_j = {k_1, ..., k_{s_j}}, j ∈ [J]. By Lemma 1 we have that
Moreover, when s_j = 1, we also have that
Consequently, when S± is a union signed interaction as in Definition 1, it follows that
which shows (24).
Proof of (25): Assume that S± is not a union signed interaction. If either of the following holds:
• S± contains a noisy signed feature (k, b), i.e., one not contained in ∪_j (S_j^+ ∪ S_j^-);
• for some signal feature k ∈ ∪_j S_j we have both (k, +1) ∈ S± and (k, −1) ∈ S±;
then by the definition of U(t) in (19), P_P(S± ⊂ F̄ | T, D) = 0 and thus (25) holds.
Thus, we can assume that S± contains no noisy features and that there exists some interaction j ∈ [J] with s_j > 1 such that (S_j^- ∪ S_j^+) ∩ S± ≠ ∅ and, for some (k, −1) ∈ S_j^-, we have (k, −1) ∉ S±. First, assume that (k, +1) ∉ S±. Then, whenever t_root splits on feature k, the event {S± ⊂ F̄} implies B_k = −1 (this requires the interactions to be disjoint: otherwise the features in S± ∩ (S_j^+ ∪ S_j^-) may also appear in another interaction S_l with l ≠ j and k ∉ S_l, so that even when B_k = +1 it is possible that S± ∩ (S_j^+ ∪ S_j^-) ⊂ F̄). Thus,
P_{(P,T)}(S± ⊂ F̄ | D) ≤ 0.5^{|S±|} (1 − P_T(t_root splits on k | D)) + 0.5^{|S±|+1} P_T(t_root splits on k | D),
where we used that the tree T is independent of the Bernoulli random variables B. Second, assume that (k, +1) ∈ S±. If S± ∩ S_j^- ≠ ∅, then {S± ⊂ F̄} implies that t_root does not split on k (again, this requires the interactions to be disjoint: otherwise the features in S± ∩ (S_j^- ∪ S_j^+) \ {(k, +1)} may also appear in another interaction S_l with l ≠ j and thus may appear in F̄ even when t_root splits on k). If S± ∩ S_j^- = ∅, let k' ∈ S_j with k' ≠ k. Because (k, +1) ∈ S±, we can assume that (k', +1) ∉ S±; otherwise |S± ∩ S_j^+| > 1, which implies P(S± ⊂ F̄) = 0. When t_root splits on k', the event {S± ⊂ F̄} implies B_{k'} = −1, and the same bound as above holds with k replaced by k'. Thus, we have shown (25).
7.2 Proof of the finite sample case

Filtering of desirable features and impurity
Recall that R_{t,l} = R_t ∩ {X | X_{k_t} ≤ γ_t} and R_{t,r} = R_t ∩ {X | X_{k_t} > γ_t} denote the regions corresponding to the left and right children of node t; in other words, node t divides the region R_t into R_{t,l} and R_{t,r}. Recall that N_n(t) is the number of samples in the region R_t, i.e., N_n(t) = Σ_{i=1}^n 1(x_i ∈ R_t). We will use an equivalent formula for the impurity decrease, given in Lemma 2.
Lemma 2. ∆_n I(R_{t,l}, R_{t,r}) defined in (6) in the main paper is equivalent to (31).

Proof. If we denote A = Σ_{x_i ∈ R_{t,l}} y_i and B = Σ_{x_i ∈ R_{t,r}} y_i, the formula in (6) is the same as (31).

Let R denote the set of axis-aligned hyper-rectangles obtained by splitting the unit hyper-rectangle consecutively, where each split satisfies assumption A2 in the main text. We study R because it contains all possible rectangles that can represent the region of a node in a tree. Let R_d be the set of rectangles obtained by splitting the unit hyper-rectangle d times, where each split satisfies assumption A2 from the main text. Then R = ∪_{d≥1} R_d, and for any R ∈ R_d we have µ(R) ≤ (1 − C_γ)^d (recall that µ(R) denotes the volume of R). For any region R, we denote by N_R the number of points in R.
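As a numerical sanity check of Lemma 2, the weighted-variance form of the CART impurity decrease can be compared against an equivalent expression written purely in terms of the child sums A and B. Since the display (31) is not reproduced above, the `impurity_decrease_sums` form below is a standard algebraic rearrangement offered as an illustration, not necessarily the paper's exact formula:

```python
import numpy as np

def impurity_decrease_variance(y_left, y_right):
    """Impurity decrease as weighted variance reduction (CART regression)."""
    y = np.concatenate([y_left, y_right])
    n, nl, nr = len(y), len(y_left), len(y_right)
    return np.var(y) - (nl / n) * np.var(y_left) - (nr / n) * np.var(y_right)

def impurity_decrease_sums(y_left, y_right):
    """Equivalent formula using only the child sums A and B (cf. Lemma 2)."""
    A, B = y_left.sum(), y_right.sum()
    nl, nr = len(y_left), len(y_right)
    n = nl + nr
    return (A**2 / nl + B**2 / nr - (A + B)**2 / n) / n

rng = np.random.default_rng(0)
yl, yr = rng.normal(size=7), rng.normal(size=5)
# The two forms agree exactly (up to floating-point error)
assert abs(impurity_decrease_variance(yl, yr) - impurity_decrease_sums(yl, yr)) < 1e-12
```

The equivalence follows from expanding each variance as a mean of squares minus a squared mean; the sums-of-squares terms cancel, leaving only A, B, and the counts.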
Lemma 3. Suppose that assumption A2 from the main text is satisfied. Then for any d ≥ 1 it holds true that max … Therefore, … Since R_1 is arbitrary, we have max …

Proposition 4. Suppose that constraint C4 and assumption A2 from the main text hold true. Then … and …

Proof. For any fixed d, let G_n(R_d) be the growth function for the set of rectangles R_d, defined in Chapter 5.2 of Vapnik [39], i.e., … Here, for any set A, |A| denotes the number of elements in A.
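The volume bound µ(R) ≤ (1 − C_γ)^d used above can be illustrated directly: under assumption A2, every split leaves each child with at most a (1 − C_γ) fraction (and at least a C_γ fraction) of the parent's side length along the split coordinate. A minimal sketch, where the value of C_γ and the random split sequence are illustrative assumptions:

```python
import random

C_GAMMA = 0.2  # illustrative stand-in for the constant in assumption A2

def cell_volume_after_splits(depth, seed=0):
    """Follow one cell of the unit square through `depth` splits, each of
    which lands at least a C_GAMMA fraction away from both cell boundaries."""
    rng = random.Random(seed)
    sides = [1.0, 1.0]  # unit hyper-rectangle in p = 2 dimensions
    for _ in range(depth):
        k = rng.randrange(2)                      # split coordinate
        frac = rng.uniform(C_GAMMA, 1 - C_GAMMA)  # relative split position
        # descend into one child; its relative side length is frac or
        # 1 - frac, and both lie in [C_GAMMA, 1 - C_GAMMA]
        sides[k] *= frac if rng.random() < 0.5 else 1 - frac
    return sides[0] * sides[1]

d = 10
vol = cell_volume_after_splits(d)
assert vol <= (1 - C_GAMMA) ** d + 1e-12  # the bound from the text
```

The same argument gives the matching lower bound µ(R) ≥ C_γ^d, which is used later when choosing the depth d in the proof of Proposition 13.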
We claim that G_n(R_d) ≤ log(n(2np)^d). This is because at each of the d splits, we have at most p directions and at most n split points to choose from. Therefore, splitting d times can create no more than (2np)^d different separations of the n data points. Furthermore, within each rectangle, the indicator functions 1(y_i ≥ θ), θ ∈ R, can create at most n separations. Thus, … and … Therefore, by Theorem 5.1 of [39] (Theorem 5.1 in [39]: let A ≤ Q(z, α) ≤ B, α ∈ Λ, be a measurable set of bounded real-valued functions, and let G_n be the growth function of the indicator functions induced by Q; then the following inequality holds: …), we have max … Taking Y = 1, we have max … By Lemma 3 and the above equation, we have max … Since this holds for any fixed d > 0, we know the left-hand side of (37) converges to zero in probability. Combining (35) and (37), we have shown that … Since this holds for any bounded random variable Y, we can take Y = 1, and we have shown max … That completes the proof.
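The counting argument behind this growth-function claim can be checked by brute force for a single split (d = 1): enumerating all labelings of n points induced by axis-aligned thresholds placed at the data points themselves yields far fewer than 2np distinct separations. A small illustration, with arbitrary data and sizes:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 6, 3
X = rng.uniform(size=(n, p))

# Enumerate every labeling of the n points achievable by one axis-aligned
# split; candidate thresholds can be taken at the data points themselves,
# since moving a threshold between adjacent points does not change labels.
labelings = set()
for k in range(p):
    for gamma in X[:, k]:
        labelings.add(tuple(X[:, k] <= gamma))

# At most n*p candidate (feature, threshold) pairs, so the count is well
# within the crude 2np bound used in the text for a single split.
assert len(labelings) <= 2 * n * p
```

Iterating this over d successive splits multiplies the count by at most 2np per split, which is where the (2np)^d bound comes from.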
Proposition 5 (Subgaussian case). Suppose that assumption A2 from the main text holds true and that (log n)^(1+δ) log p / n → 0 for some δ > 0. Suppose Y = E(Y | X) + Z, where Z is independent of X and 1-subgaussian. Then … Here z_i = y_i − f(x_i) represents the noise terms. Our proof proceeds in the following two steps.
Step 1. Show that … Step 1 is similar to the proof of Proposition 4, but the difference is that here we need the convergence rate. Let … and take … Let G_n(R_d) be the growth function for the set of rectangles R_d. By (34), we have … Therefore, by Theorem 5.1 of [39], we have max … Since this holds for any bounded random variable Y, we can take Y = 1 and it follows that max … Since d → ∞, (1 − C_γ)^d → 0. Therefore, by Lemma 3, we have max … Combining (40) and (42), (39) is proved.
Step 2. Show that … It suffices to prove that both of the two terms on the right-hand side converge to 0 in probability. We begin with the first term: max … Since X and Z are independent and Z is 1-subgaussian, by Hoeffding's inequality, … for any rectangle R. Therefore, by the union bound, P(max …) ≤ … for any ε > 0. Since the above upper bound on the probability is independent of X, we conclude that max … We now turn to the second term, max … Let R_{s_0} be the set of rectangles with at most s_0 = n/(log n)^(1/2+δ_0) samples; then log |R_{s_0}| ≤ (s_0 + 1) log n. By the union bound, … Therefore, max … Hence, to prove (43), it suffices to show that ∪_{d_1 > d} R_{d_1} ⊂ R_{s_0} with probability tending to 1. Note that by the definition of δ_0, (1/2 + δ_0)/(1/2 − δ_0) = 1 + δ. Therefore, log(np)/n … By (32) and (41) we have max … Therefore, max_{R ∈ ∪_{d_1 > d} R_{d_1}} N_R ≤ s_0 with probability tending to 1. The proof is now complete.
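The first term above is handled by Hoeffding's inequality for subgaussian variables: for 1-subgaussian z_i, the normalized sum over the points in a rectangle satisfies a tail bound of the form 2 exp(−t²/2). A Monte Carlo sanity check with Gaussian noise, where the sample sizes and the threshold t are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
N, t, reps = 50, 2.0, 20000

# z ~ N(0,1) is 1-subgaussian; check the tail of the normalized sum
# sum(z_i) / sqrt(N) against the subgaussian Hoeffding bound.
sums = rng.normal(size=(reps, N)).sum(axis=1) / np.sqrt(N)
empirical = np.mean(np.abs(sums) > t)
bound = 2 * np.exp(-t**2 / 2)

assert empirical <= bound  # the observed tail frequency respects the bound
```

In the proof this per-rectangle bound is combined with a union bound over the (polynomially many) rectangles, which is why the exponential tail is needed.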
Define the population impurity decrease ∆I(t) at a node t to be … Similar to Lemma 2, it is equivalent to … The following proposition shows that the finite-sample impurity decrease converges to the population impurity decrease uniformly.
Proposition 6. Suppose that constraint C4 and assumption A2 from the main text are satisfied. Then we have the following two uniform convergence results: …

Proof. a. This follows directly from Proposition 4. b. …
Now we analyze the impurity decrease at each node of a tree. We consider three families of trees, T_0, T_1, and T_2:

T_0 ≜ {any tree that satisfies A2},

T_1 ≜ {any CART tree that satisfies A2 and A4}.

T_1 is the family of CART trees that satisfy our assumptions, but with arbitrary m_try. T_1 is more restricted than T_0 in the sense that the threshold γ_t of any node t of any tree in T_1 must maximize the finite-sample impurity decrease in (6). Thus, T_1 depends on the data. For any T ∈ T_0 and any t ∈ T such that Ů(t) ≠ ∅, its region R_t is a rectangle: … where c_{low,ℓ}, c_{high,ℓ} ∈ [0, 1].
Since F±(t) ⊂ Ḟ±(t), we have Ů(t) ⊂ U(t). For any γ, denote R_{t,l}(γ; k) = R_t ∩ {X | X_k ≤ γ} and R_{t,r}(γ; k) = R_t ∩ {X | X_k > γ}. First, for any node t ∈ T and any k ∈ Ů(t), we have a characterization of the impurity decrease.

Lemma 7. For any T ∈ T_0, t ∈ T, j ∈ [J], k ∈ S_j ∩ U(t), and γ ∈ (0, 1), …

Proof of Lemma 7. Since k ∈ U(t), we know that k is not in F(t). That means none of t's ancestors splits on k. In other words, R_t does not have any constraints on feature k, i.e., c_{low,k} = 0 and c_{high,k} = 1. Thus, we know that … and … Recall that ∆I in (44) has the equivalent formula (45): … where the conditional expectations are … and … Now we analyze (49) and (50). To ease notation, we define the following three events: … Then (49) becomes Σ_{j'=1}^J β_{j'} P(A_{j'} | B ∩ C_k). Because R_t has no constraints on k, B does not involve feature k. When j' ≠ j (namely, k ∉ S_{j'}), A_{j'} also does not involve feature k. Thus, C_k is independent of (A_{j'}, B): … Similarly, this holds for (50). Therefore, when j' ≠ j: … Therefore, (45) becomes: … That completes the proof.
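The factorization step in Lemma 7, that C_k = {X_k ≤ γ} is independent of events not involving feature k when the features are independent and R_t imposes no constraint on k, can be checked by simulation. The specific events and the value of γ below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma = 200_000, 0.3
X = rng.uniform(size=(n, 3))  # independent Uniform[0,1] features

A = X[:, 0] <= 0.5      # an event that does not involve feature k = 2
C_k = X[:, 2] <= gamma  # the event {X_k <= gamma} from the lemma

lhs = np.mean(A & C_k)          # empirical P(A and C_k)
rhs = np.mean(A) * np.mean(C_k) # empirical P(A) * P(C_k)
assert abs(lhs - rhs) < 0.01    # factorization holds up to Monte Carlo error
```

This independence is exactly what lets the conditional probabilities in (49) and (50) drop the conditioning on C_k for all interactions j' ≠ j.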
Lemma 8. For T ∈ T_0, t ∈ T, if there exists j ∈ [J] and k ∈ S_j such that k ∈ Ů(t), then …

Proof of Lemma 8. Because k ∈ S_j and k ∈ Ů(t), we know that S_j^+ ∩ Ḟ±(t) = ∅. That means node t is not on the right branch of any node that splits on features in S_j. Thus, c_{low,ℓ} = 0 when ℓ ∈ S_j.
(54) Also, c_{high,k} = 1 and c_{low,k} = 0 because k ∈ Ů(t). Then, P(∀ℓ ∈ S_j \ {k}, X_ℓ ≤ γ_ℓ | X ∈ R_t) is … That completes the proof.

Lemma 9. Suppose that constraint C4 from the main text holds. Then, for any fixed ε > 0 it holds true that …

Proof. First of all, we know from Proposition 6 that sup … Thus, in order to prove (55), we only need to show that inf … Recall that γ_k is the ground-truth threshold of feature k in the interaction. Here we can drop max_{γ∈[C_γ, 1−C_γ]} and use γ_k, because that results in a lower bound of the previous equation. Based on Lemma 7, we know that … The second inequality is due to µ …

Proof (of Lemma 10). To simplify the notation in the proof, let us denote … Using Proposition 6, we have sup … By Lemma 9 (see (56)), we know the second term is bounded uniformly away from zero: inf … Thus, the ratio converges to 1 in probability: sup … Similarly, this applies to b_n and b, i.e., sup … So by the continuous mapping theorem, we know that sup … By Lemma 7, we know that … Thus the ratio is … Thus, by the squeeze theorem, we complete the proof.
Lemma 11. Suppose that constraint C4 from the main text holds. Then the following statements are true: i) For any fixed ε, δ > 0, … ii) For any fixed ε > 0, …

Proof. We use mathematical induction to show that the above statements hold for any L ≥ 0: i) For any fixed ε, δ > 0, … ii) For any fixed ε > 0, …

That implies ∆I(R_{t,l}(γ; k), R_{t,r}(γ; k)) = 0. Second, assume that there exists j such that k ∈ S_j \ U(t). For j' ≠ j, by a similar deduction as before, we know that …

The impurity decrease ∆ … Again, we consider two cases. a) Because k ∉ U(t), either (k, −1) ∈ F±(t) or …, and we know that (62) converges to zero in probability. b) If …, it means there exists a parent of t, denoted t_parent, such that feature k is used to split that node and none of its own parents splits on k; in other words, k ∈ U(t_parent). By Lemma 11, we know that the corresponding threshold γ*_{t_parent,k} →_p γ_k. Since S_j^+ ∩ F±(t) = ∅, it follows that t is on the left branch of t_parent. Thus, we have that c_{high,k}(t) ≤ γ*_{t_parent,k}. For any fixed ε > 0, if µ(R_{t,l}(γ; k)) > ε and µ(R_{t,r}(γ; k)) > ε, then … Since ε is chosen arbitrarily, this implies ∆I(R_{t,l}(γ; k), R_{t,r}(γ; k)) →_p 0. Combining a) and b), we complete the proof.
With the help of the previous lemmas, we have the following proposition.

Proposition 13. Suppose t_leaf is a leaf of a path P from a random tree T ∈ T_2. Suppose that constraint C4 and assumptions A1-A4 from the main text hold true. For any fixed constant ε > 0, the following hold: i) … ii) … iii) …

Because P(D ≥ log ε / log C_γ) → 1, by the Markov inequality, we know the random variable … Thus, we know … where η_n(D, ε) is a random variable that depends only on D and η_n(D, ε) →_p 0. Because that holds for any ε, we have … iii) Denote by t_s the s-th node in a path P(t_leaf) for s ≥ 1. Based on the proof of ii), let d be an integer that (roughly) equals …; thus, the proof follows from Lemma 11.

Balanced root feature selection
Recall the definition of C_root(D) in (22), which appears in Theorem 4. Recall that for any tree T from RF, there are two different sources of randomness: first, the randomness of the data D = ((x_i, y_i))_{i=1}^n, which we denote by (D), and second, the randomness from the candidate feature selection, which we denote by (T). Denote by M_try(t) ⊂ [p] the set of candidate features selected at node t, and note that M_try(t) and the data D are independent.
Define the event A to be that, given data D, the maximum impurity decrease at the split of the root node for every signal feature k ∈ ∪_j S_j is larger than that of any noisy feature k' ∉ ∪_j S_j, that is, … Note that the event depends only on the data randomness D (and not on the tree randomness T or the path randomness P). Thus, A is independent of M_try(t_root). Note that it follows from Proposition 13 that P_D(A) → 1 as n → ∞.

Proof. For any k ∈ ∪_j S_j, define B_k to be the event that only signal feature k is selected in M_try(t_root), that is, …
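The event A compares, feature by feature, the best achievable impurity decrease at the root. A minimal sketch of this root-split mechanism (not the paper's implementation): given a candidate set M_try, the root splits on the candidate whose best threshold maximizes the CART impurity decrease, so on event A any candidate set containing a signal feature leads to a signal split. The data-generating model below is an illustrative stand-in:

```python
import numpy as np

def best_root_split(X, y, candidates):
    """Return the candidate feature whose best threshold maximizes the
    CART impurity decrease at the root (thresholds searched over data)."""
    n = len(y)
    def best_decrease(k):
        best = -np.inf
        for g in X[:, k]:
            left, right = y[X[:, k] <= g], y[X[:, k] > g]
            if len(left) == 0 or len(right) == 0:
                continue  # skip degenerate splits
            d = np.var(y) - len(left) / n * np.var(left) - len(right) / n * np.var(right)
            best = max(best, d)
        return best
    return max(candidates, key=best_decrease)

rng = np.random.default_rng(0)
n, p = 500, 10
X = rng.uniform(size=(n, p))
# Illustrative thresholded response: feature 0 is the only signal feature.
y = (X[:, 0] <= 0.5).astype(float) + 0.1 * rng.normal(size=n)

# With a strong signal, the signal feature wins whenever it is a candidate.
assert best_root_split(X, y, candidates=[0, 3, 7]) == 0
```

This is the mechanism that makes the intersection A ∩ B_k below force a root split on the signal feature k.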

Combining results
Our main result in Theorem 4 is formulated for the random (oracle) feature set F = F(D, T, P). Note that this is an oracle feature set, as it depends on the true interactions S_j, which are not known in practice. From the analysis in Section 7.2.1 we know that we can obtain a consistent estimate of the oracle feature set F by thresholding on MDI, as in F̂. Recall that, for a given threshold, the (random) set F̂ can easily be obtained without any knowledge of the true model. Based on Proposition 13, we observe the following.
Recall that Ω_0 is defined in (20), F̃ is defined in (19), and F̂ is defined in (7) in the main text. We have the following theorem.

Proof. (72) follows directly from Proposition 13 ii) and the definition of Ω_0 in (20).
To prove (73), observe from Proposition 13 i) that for any ε > 0, taking ε̃ = …, where the second inequality follows from Theorem 5.

Figure 5: Simulation results for the iRF analog, as in Figure 2. LSSFind has a higher score when the number of basic interactions is K = 1 or when the order of interactions is L = 2. In the other cases, neither method performs well. Note that when a different metric is used, the story is different; see Figure 4.

Additional figures
Figure 6: Simulation results for LSSFind when the dimension p grows and the sample size grows at the rate of log(p). LSSFind achieves a higher score as the dimension grows.
To show this, we compute:

DWP(S−) = P(S− ⊂ F | D)
= P_T(T's root splits on feature 1) · P(S− ⊂ F | D, T's root splits on feature 1)
+ P_T(T's root splits on feature 2) · P(S− ⊂ F | D, T's root splits on feature 2)
= 0.5 · 0.25 + 0.5 · 0.25 = 0.25,

where P_T(T's root splits on feature 1) = 0.5 corresponds to the left tree, P_T(T's root splits on feature 2) = 0.5 corresponds to the right tree, and each conditional probability equals 0.25 because only the red path satisfies the inclusion.
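The arithmetic of this DWP computation can be written out directly, using exactly the probabilities annotated above:

```python
# Two equally likely root splits (features 1 and 2); under each tree only
# the red path (probability 0.25) yields S- ⊂ F.
p_root = {1: 0.5, 2: 0.5}               # P_T(root splits on feature k)
p_path_given_root = {1: 0.25, 2: 0.25}  # P(S- ⊂ F | D, root split on k)

dwp = sum(p_root[k] * p_path_given_root[k] for k in p_root)
assert abs(dwp - 0.25) < 1e-12  # DWP(S-) = 0.5*0.25 + 0.5*0.25 = 0.25
```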

P_T(the root node of T splits on feature k | D).

Theorem 5. Assume that C_m p ≤ m_try ≤ (1 − C_m)(p − s + 1) + 1 for some constant C_m ∈ (0, 1). Conditional on D = ((x_i, y_i))_{i=1}^n, for any k ∈ ∪_j S_j, we have that

P_T(t_root splits on feature k | D) ≥ C_m^s − 1_{A^c},

and thus

C_root(D) ≥ C_m^s − 1_{A^c} →_p [C_m]^s as n → ∞.

B_k ≜ {M_try(t_root) ∩ ∪_j S_j = {k} and |M_try(t_root) \ ∪_j S_j| = m_try − 1}. Note that B_k depends only on M_try(t_root) and not on D, and A ∩ B_k ⊂ {t_root splits on feature k}.

P_T(t_root splits on feature k | D) ≥ P_T(B_k ∩ A | D) ≥ P_T(B_k | D) − P_T(A^c | D) = P(B_k) − 1_{A^c}. Moreover, we have that

P(B_k) = (m_try / p) · ∏_{i=0}^{s−2} (p − m_try − i)/(p − 1 − i) ≥ ((p − m_try − s + 2)/(p − s + 1))^{s−1} · (m_try / p),

which corresponds to a hypergeometric probability with n = p − 1, h = s − 1, and k = m_try − 1.
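The two natural expressions for P(B_k) — the combinatorial form C(p−s, m_try−1)/C(p, m_try) and the sequential product form — agree, and under the assumption C_m·p ≤ m_try ≤ (1 − C_m)(p − s + 1) + 1 the probability is bounded below by C_m^s, matching Theorem 5. A numerical check, where the parameter values are illustrative and s denotes the total number of signal features:

```python
from math import comb

def p_Bk(p, s, m_try):
    """P(B_k): the m_try candidates (drawn without replacement) contain
    signal feature k and none of the other s-1 signal features."""
    return comb(p - s, m_try - 1) / comb(p, m_try)

def p_Bk_product(p, s, m_try):
    """Equivalent product form: P(k in M_try) times the probability that
    the remaining m_try-1 draws avoid the other s-1 signal features."""
    prob = m_try / p
    for i in range(s - 1):
        prob *= (p - m_try - i) / (p - 1 - i)
    return prob

p_, s_, m_ = 20, 3, 6  # satisfies C_m*p <= m_try <= (1-C_m)(p-s+1)+1 for C_m = 0.3
assert abs(p_Bk(p_, s_, m_) - p_Bk_product(p_, s_, m_)) < 1e-12

C_m = m_ / p_  # here m_try = C_m * p exactly
assert p_Bk(p_, s_, m_) >= C_m ** s_ - 1e-12  # the C_m^s lower bound
```

The lower bound uses that m_try/p ≥ C_m and that each factor (p − m_try − i)/(p − 1 − i) is at least C_m when m_try ≤ (1 − C_m)(p − s + 1) + 1.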

Theorem 6. Under the assumptions of Proposition 13, it holds true that for any fixed ε > 0,

P_(P,T)(Ω_0^c | D) ≤ Cε + η_n(D, ε), with η_n(D, ε) →_p 0,

with C as in Proposition 13.

Figure 2: Simulation results for the performance of LSSFind (Algorithm 1). The data is generated from an LSS model with Gaussian noise as in (13), with n = 1,000 samples and p = 20 features. Standard deviations of the proximity scores are given as error bars. Different numbers of basic interactions K are shown in different rows, different interaction orders L in different columns, and a range of SNRs on the x-axis. The y-axis shows the proximity score in (14). A proximity score of one corresponds to perfect recovery of all interactions simultaneously.

Figure 3: Simulation results similar to those in Figure 2 for interactions of order L = 2 and J = 2, but where the data is generated from a mis-specified version of the LSS model, with n = 1,000 samples and p = 20 features. Left: signed features of different basic interactions overlap. When overlap = 1, the basic interactions are ((1, −1), (2, −1)) and ((2, −1), (3, −1)). Middle: different features are correlated instead of independent. When corr = α, the correlation between features j_1 and j_2 is α^|j_1 − j_2|. Right: the noise follows a Laplace or Cauchy distribution instead of a Gaussian distribution.

Figure 4: Simulation results of LSSFind (orange) and the iRF analog (blue), as in Figure 2, but with the performance measure (15) instead of (14).
[12] … C_γ and β_j ≥ C_β. Then using Lemma 8 leads to the conclusion.

(Lemma 10.) For a node t, denote γ*_{t,k} = argmax_{γ∈[C_γ, 1−C_γ]} ∆_n I(R_{t,l}(γ; k), R_{t,r}(γ; k)). Suppose that constraint C4 from the main text holds true; then we have sup …

… (k', +1) is the first positive signed feature in S_j^+ that enters F±(t). That means we can find a parent of t, denoted t_parent, that splits on feature k' and none of t_parent's parents splits on k'. That implies k' ∈ F(t_parent) and S_j^+ ∩ F±(t_parent) = ∅; in other words, k' ∈ U(t_parent). Recall that γ*_{t_parent,k'} denotes the threshold at node t_parent. By Lemma 11, we know that the threshold γ*_{t_parent,k'} … Since t is on the right branch of node t_parent, we have that c_{low,k'}(t) ≥ γ*_{t_parent,k'}. Thus, …

Let d = log ε / log C_γ. Then µ(R_{t_d}) ≥ ε and P(U(t_d) = ∅ | D) ≥ 1 − Cε + η_n(D, ε). When U(t_d) = ∅, it follows that U(t_s) ≠ ∅ implies s ≤ d and µ(R_{t_s}) ≥ ε. Thus, P(∃ t ∈ P(t_leaf) such that U(t) ≠ ∅ and µ(R_t) < ε | D) ≤ P(U(t_d) ≠ ∅ | D) ≤ Cε − η_n(D, ε).