Forest Garrote

Variable selection for high-dimensional linear models has received a lot of attention lately, mostly in the context of l1-regularization. Part of the attraction is the variable selection effect: parsimonious models are obtained, which are very suitable for interpretation. In terms of predictive power, however, these regularized linear models are often slightly inferior to machine learning procedures like tree ensembles. Tree ensembles, on the other hand, lack usually a formal way of variable selection and are difficult to visualize. A Garrote-style convex penalty for trees ensembles, in particular Random Forests, is proposed. The penalty selects functional groups of nodes in the trees. These could be as simple as monotone functions of individual predictor variables. This yields a parsimonious function fit, which lends itself easily to visualization and interpretation. The predictive power is maintained at least at the same level as the original tree ensemble. A key feature of the method is that, once a tree ensemble is fitted, no further tuning parameter needs to be selected. The empirical performance is demonstrated on a wide array of datasets.


Introduction
Given data (X i , Y i ), for i = 1, . . . , n, with a p-dimensional real-valued predictor variable X, where X = (X (1) , . . . , X (p) ) ∈ X , and a real-valued response Y , a typical goal of regression analysis is to find an estimatorŶ (x), such that the expected loss E L(Ŷ (X), X) is minimal, under a given loss function L : X × R → R + . For the following, the standard squared error loss is used. If the predictor can be of the 'black-box' type, tree ensembles have proven to be very powerful. Random Forests (Breiman, 2001) is a prime example, as are boosted regression trees (Yu and Bühlmann, 2003). There are many interesting tools available for interpretation of these tree ensembles, see for example Strobl et al. (2007) and the references therein.
While tree ensembles often have very good predictive performance, an advantage of a linear model is better interpretability. Measuring variable importance and performing variable selection are more easier to formulate and understand in the context of linear models. For high-dimensional data with p n, regularization is clearly imperative and the Lasso (Tibshirani, 1996;Chen et al., 2001) has proven to be very popular in recent years, since it combines a convex optimization problem with variable selection. A precursor to the Lasso was the nonnegative Garrote (Breiman, 1995). A disadvantage of the nonnegative Garrote is the reliance on an initial estimator, which could be the least squares estimator or a regularized variation. On the positive side, important variables incur less penalty and bias under the regularization than they do with the Lasso. For a deeper discussion of the properties of the nonnegative Garrote see Yuan and Lin (2007).
Here, it is proposed to use Random Forest as an initial estimator for the nonnegative Garrote. The idea is related to the Rule Ensemble approach of Friedman and Popescu (2008), who used the Lasso instead of the nonnegative Garrote. A crucial distinction is that rules fulfilling the same functional role are grouped in our approach. This is similar in spirit to the group Lasso (Meier et al., 2008;Yuan and Lin, 2006;Zhao, Rocha, and Yu, Zhao et al.). This produces a very accurate predictor that uses just a few functional groups of rules, discarding many variables in the process as irrelevant.
A unique feature of the proposed method is that is seems to work very well in the absence of a tuning parameter. It just requires the choice of an initial tree ensemble. This makes the procedure very simple to implement and computationally efficient. The idea and the algorithm is developed in Section 2, while a detailed numerical study on 15 datasets makes up Section 3.

Trees and Equivalent Rules
A tree T is seen here as a piecewise-constant function R p → R derived from the tree structure in the sense of Breiman et al. (1984). Friedman and Popescu (2008) proposed 'rules' as a name for simple rectangular-shaped indicator functions. Every node j in a tree is associated with a B j in R p -dimensional space, defined as the set of all values x ∈ R p that pass through node j if passed down the tree. All values x ∈ R p that do not pass throuh node j are outside of B j . The way rules are used here, they correspond to indicator functions R = R j , i.e. R j (x) = 1{X ∈ B j } is the indicator function for box B j . For a more detailed discussion see Friedman and Popescu (2008).
To give an example of a rule, take the well-know dataset on abalone (Nash et al., 1994) as an example. The goal is to predict age of abalone from physical measurements. For each of the 4177 abalone in the dataset, eight predictor variables (sex, length, diameter, height, while weight, shucked weight, viscera weight and shell weight) are available. An example of a rule is R 1 (x) = 1 {diameter ≥ 0.537 and shell weight ≥ 0.135} , and the presence of such a rule in a final predictor is easy to interpret, comparable to interpreting coefficients in a linear model. For the following, it is assumed that rules contain at most a single inequality for each variable. In other words, the boxes defined by rules in p-dimensional spaces are defined by at most a single hyperplane in each variable. If a rule violates this assumption, it can easily be decomposed into two or several rules satisfying the assumption. Every regression tree can be written as a linear superposition of rules. Suppose a tree T has J nodes in total. The regression functionT of this tree (ensemble) can then be written asT for someβ tree . The decomposition is not unique in general. We could, for example, assign non-zero regression coefficientsβ j only to leaf nodes. Here, we build the regression function incrementally instead, assigning non-zero regression coefficients to all nodes. The valueβ tree j are defined aŝ where E n is the empirical mean across the n observations and pa(j) is the parent node of j in the tree. Rule (1), in the abalone example above, receives a regression weightβ tree 1 = 0.0237. The contribution of rule (1) to the Random Forest fit is thus to increase the fitted value if and only if the diameter is larger than 0.537 and shell weight is larger than 0.135. The Random Forest fit is the sum of the contribution from all these rules, where each rule corresponds to one node in the tree ensemble.
To see that (2) and (3) really correspond to the original tree (ensemble) solution, consider just a single tree for the moment. Denote the predicted value for predictor variable X bŷ T (x). Let B leaf (x) be the rectangular area in p-dimensional space that corresponds to the leaf node leaf (x) of the tree in which predictor variable X falls. The predicted value follows then by adding up all relevant nodes and obtaining with (2) and (3), i.e. the predicted values is just the empirical mean of Y across all observations i = 1, . . . , n which fall into the same leaf node as X, which is equivalent to the original prediction of the tree. The equivalence for tree ensembles follows by averaging across all individual trees as in (2).

Rule Ensembles
The idea of Friedman and Popescu (2008) is to modify the coefficientsβ tree in (2), increasing sparsity of the fit by setting many regression coefficient to 0 and eliminating the corresponding rules from the fit, while hopefully not degrading the predictive performance of the tree ensemble in the process. The rule ensemble predictors are thus of the form Sparsity is enforced by penalizing the 1 -norm ofβ in LASSO-style (Tibshirani, 1996), This enforces sparsity in terms of rules, i.e. the final predictor will have typically only very few rules, at least compared to the original tree ensemble. The penalty parameter λ is typically chosen by cross-validation. Friedman and Popescu (2008) recommend to add the linear main effects of all variables into (5), which were omitted here for notational simplicity and to keep invariance with respect to monotone transformations of predictor variables. It is shown in Friedman and Popescu (2008) that the rule ensemble estimator maintains in general the predictive ability of Random Forests, while lending itself more easily to interpretation.

Functional grouping of rules
The rule ensemble approach is treating all rules equally by enforcing a 1 -penalty on all rules extracted from a tree ensemble. It does not take into account, however, that there are typically many very closely related rules in the fit. Take the RF fit to the abalone data as an example. Several hundred rules are extracted from the RF, two of which are with regression coefficientβ 1 = 0.023 and R 2 (x) = 1 {diameter ≥ 0.537 and shell weight ≥ 0.177} , with regression coefficientβ 2 = 0.019. The effect of these two rules, measured byβ 1 R 1 andβ 2 R 2 , are clearly very similar. In total, there are 32 rules 'interaction' rules that involve variables diameter and shell weight in the RF fit to the abalone data. Selecting some members of this group, it seems artificial to exclude others of the same 'functional' type. Sparsity is measured purely on a rule-by-rule basis in the 1 -penalty term of rule ensembles (5). Selecting the two rules mentioned above incurs the same sparsity penalty as if the second rule involved two completely different variables. An undesirable side-effect of not taking into account the grouping of rules is that many or even all original predictor variables might still be involved in the rules; sparsity is not explicitly enforced in the sense that many irrelevant original predictor variables are completely discarded in the selected rules.
It seems natural to let rules form functional groups. The question then turns up which rules form useful and interpretable groups. There is clearly no simple right or wrong answer to this question. Here, a very simple yet hopefully intuitive functional grouping of rules is employed. For the j-th rule, with coefficientβ j , define the interaction pattern σ j = (σ j,1 , . . . , σ j,p ) for variables k = 1, . . . , p by The meaning of interaction patterns is best understood if looking at examples for varying degrees, where the degree of a interaction pattern σ is understood to be the number of non-zero entries in σ and corresponds to the number of variables that are involved in a rule.
First degree (main effects). The simplest interaction patterns are those involving a single variable only, which correspond in some sense to the main effects of variables. The interaction pattern (0, +, 0, 0, 0, 0, 0, 0) for example collects all rules that involve the 2nd predictor variable length only and lead to a monotonically increasing fit (are thus of the form 1{length ≤ u} for some real-valued u if the corresponding regression coefficient were positive or 1{length ≥ u} if the regression coefficient were negative). The interaction pattern (0, 0, −, 0, 0, 0, 0, 0) collects conversely all those rules that yield a monotonically decreasing fit in the variable diameter, the third variable.
Second degree (interactions effects). Second degree interaction patterns are of the form (6) or (7). As diameter is the 3rd variable and shell weight the 8th, the interaction pattern of both rules (6) and (7) is (0, 0, +, 0, 0, 0, 0, +), making them members of the same functional group, as for both rules the fitted value is monotonically increasing in both involved variables. In other words, either a large value in both variables increases the fitted value or a very low value in both variables decreases the fitted value. Second degree interaction patterns thus form four categories for each pair of variables. A case could be made to merge these four categories into just two categories, as the interaction patterns do not conform nicely with the more standard multiplicative form of interactions in linear models. However, there is no reason to believe that nature always adheres to the multiplicative form of interactions typically assumed in linear models. The interaction patterns used here seemed more adapted to the context of rule-based inference. Factorial variables can be dealt with in the same framework (8) by converting to dummy variables first.

Garrote correction and selection
In contrast to the group Lasso approach of Yuan and Lin (2006), the proposed method does not only start with knowledge of natural groups of variables or rules. A very good initial estimator is available, namely the Random Forest fit. This is exploited in the following.
LetT σ be the part of the fit that collects all contributions from rules with interaction pattern σ,T Let G be the collection of all possible interaction patterns σ. The tree ensemble fit (2) can then be re-written as a sum over all interaction patternŝ A interaction pattern σ is called active if the corresponding fit in the tree ensemble is nonzero, i.e. if and only ifT σ is not identically 0. The Random Forest fit contains very often a huge number of active interaction pattern, involving interactions up to fourth and higher degrees. Most of those active patterns contribute just in a negligible way to the overall fit. The idea proposed here is to use (10) as a starting point and modify it to enforce sparsity in the final fit, getting rid of as many unnecessary predictor variables and associated interaction patterns as possible. The Lasso of Tibshirani (1996) was used in the rule ensemble approach of Friedman and Popescu (2008). Here, however, the starting point is the functional decomposition (10), which is already a very good initial (yet not sparse) estimator of the underlying regression function.
Hence it seems more appropriate to use Breiman's nonnegative Garrote (Breiman, 1995), penalizing contributions of interaction patterns less if their contribution to the initial estimator is large and vice versa. The beneficial effect of this bias reduction for important variables has been noted in, amongst others, Yuan and Lin (2007) and Zou (2006).
The Garrote-style Forest estimatorT gar is defined aŝ Each contributionT σ of an interaction pattern σ is multiplied by a factorγ σ . The original tree ensemble fit is obtained by setting all factors equal to 1. The multiplicative factor γ is chosen by least squares, subject to the constraint that the total 1 -norm of the multiplying coefficients is less than 1, The normalizing factor |G| −1 divides the 1 -norm of γ by the total number of interaction patterns and is certainly not crucial here but simplifies notation. The estimation ofγ is an application of Breiman's nonnegative Garrote (Breiman, 1995). As for the Garrote, the original predictorsT σ are not rescaled, thus putting effectively more penalty on unimportant predictors, with little variance ofT σ across the samples X 1 , . . . , X n and less penalty on the important predictors with higher variance, see (Yuan and Lin, 2007) for details. Algorithmically, the problem can be solved with an efficient Lars algorithm (Efron et al., 2004), which can easily be adapted to include the positivity constraint. Alternatively, quadratic programming can be used.
It might be surprising to see the 1 -norm constrained by 1 instead of a tuning parameter λ. Yet this is indeed one of the interesting properties of Forest Garrote. The tree ensemble is in some sense selecting a good level of sparsity. It seems maybe implausible that this would work in practice, but some intuitive reasons for its empirical success are given further below and ample empirical evidence is provided in the section with numerical results.
A drawback of the Garrote in the linear model setting is the reliance on the OLS estimator (or another suitable estimator), see also Yuan and Lin (2007). The OLS estimator is for example not available if p > n. The tree ensemble estimates are, in contrast, very reasonable estimators in a wide variety of settings, certainly including the high-dimensional setting p n. The entire Forest Garrote algorithm works thus as follows 1. Fit Random Forest or another tree ensemble approach to the data.
2. ExtractT σ from the tree ensemble for all σ ∈ G by first extracting all rules R j and corresponding regression coefficientsβ j and grouping them via (9) for each interaction pattern. (12) from the data, using for example the LARS algorithm (Efron et al., 2004).

The fitted Forest Garrote functionT gar is given by (11).
The whole algorithm is very simple and fast, as there is no tuning parameter to choose.

Lack of tuning parameter
In most regularization problems, like the Lasso (Tibshirani, 1996), choosing the regularization parameter is very important and it is usually a priori not clear what a good choice will be, unless the noise level is known with good accuracy (and it usually it is not). The most obvious approach would be cross-validation. Cross-validation can be computationally expensive and is usually not guaranteed to lead to optimal sparsity of the solution, selecting many more variables or interaction patterns than necessary, as shown for the Lasso in Meinshausen and Bühlmann (2006) and Leng et al. (2006). Since the starting point is a very useful predictor, the original tree ensemble, there is a natural tuning parameter for the Forest Garrote estimator. As noted before, γ = (1, 1, 1, . . . , 1, 1) corresponds to the original tree ensemble solution. The original tree ensemble solutionT is thus contained in the feasible region of the optimization problem (12) under a constraint on the 1 -norm of exactly 1. The solution (11) will thus be at least as sparse as the original tree ensemble solution in the sense if sparsity is measured in the same 1 -norm sense as in (12). Some variables might not be used at all even though they appear in the original tree ensemble.
On the other hand, the empirical squared error loss ofT gar is at least as low (and typically lower) than for the original tree ensemble solution asγ will reduce the empirical loss among all solutions in the feasible region of (12), which contains the original tree ensemble solution. The latter point does clearly not guarantee better generalization on yet unseen data, but a constraint of 1 on the 1 -norm turns out to be an interesting starting point and is a very good default choice of the penalty parameter.
Sometimes one might still be interested in introducing a tuning parameter. One could replace the constraint of 1 on the 1 -norm of γ in (12) by a constraint λ, (13) The range over which to search, with cross-validation, to over λ can then typically be limited to [0,2]. Empirically, it turned out that the default choice λ = 1 is very reasonable and actually achieves often better predictive and selection performance than the cross-validated solution, since the latter suffers from possibly high variance for finite sample sizes.

Example: diabetes data
The method is illustrated on the diabetes data from Efron et al. (2004) with p = 10 predictor variables, age, sex, body mass index, average blood pressure and six blood serum measurements. These variables were obtained for each of n = 442 diabetes patients, along with the response of interest, a 'quantitative measure of disease progression one year after baseline'. Applying a Random Forest fit to this dataset, the main effects and second-order interactions effects, extracted as in (9), are shown in the upper right diagonal of Figure 2, for 6 out of the 10 variables (chosen at random to facilitate presentation). All of these variables have non-vanishing main effects (on the diagonal) and the interaction patterns can be quite complex, making them somewhat difficult to interpret. Now applying a Forest Garrote selection to the Random Forest fit, one obtains the main effects and interaction plots shown in the lower left diagonal of Figure 2. Note that the x-axis in the interaction plots corresponds to the variable in the same column, while the y-axis refers to the variable in the same row. Interaction plots are thus rotated by 90 degrees between the upper right and the lower left diagonal. Some main effects and interactions are set to 0 by the Forest Garrote selection. Interaction effects that are not set to 0 are typically 'simplified' considerably. The same effect was observed and explained in Figure 1. The interaction plot of Forest Garrote seems thus much more amenable for interpretation.

Numerical Results
To examine the predictive accuracy, variable selection properties and computational speed, various standard datasets are used and augmented with two higher-dimensional datasets. The first of these is a motif regression dataset (henceforth called 'motif', p = 660 and n = 2587). The goal of the data collection is to help find transcription factor binding sites (motifs) in DNA sequences. The real-valued predictor variables are abundance scores for p candidate motifs (for each of the genes). Our dataset is from a heat-shock experiment with yeast. For a general description and motivation about motif regression see Conlon et al. (2003).
The method is applied to a gene expression dataset ('vitamin') which is kindly provided by DSM Nutritional Products (Switzerland). For n = 115 samples, there is a continuous response variable measuring the logarithm of riboflavin (vitamin B2) production rate of Bacillus Subtilis, and there are p = 4088 continuous covariates measuring the logarithm of gene expressions from essentially the whole genome of Bacillus Subtilis. Certain mutations of genes are thought to lead to higher vitamin concentrations and the challenge is to identify those relevant genes via regression, possibly using also interaction between genes.
The unexplained variance on test data for all these datasets with Forest Garrote is compared with that of Random Forests and Rule Ensembles in Figure 3. The tuning parameters in Random Forests (namely over how many randomly selected variables to search for the best splitpoint) is optimized for each dataset, using the out-of-bag performance measure. For comparison between the methods, the data are split into two parts of equal size, one half for training and the other for testing. Results are compared with Forest Garrote (CV), where the tuning parameter λ in (13) is not chosen to be 1 as for the standard Forest Garrote estimator, but is instead chosen by cross-validation. There are two main conclusions from the Figure. First, all three method (Forest Garrote, Forest Garrote (CV) and Rule Ensembles)  Ensembles for the various datasets. Forest Garrote uses the least computational resources since (i) it starts from a relative small set of dictionary elements (allT σ for σ ∈ G as opposed to all rules), (ii) the solution has to be computed only for a single regularization parameter and there is hence (iii) also no need for expensive cross-validation. Note that the times above are only for the rule selection steps (5) and (12) respectively and the overall relative speed difference is typically smaller as a tree ensemble fit needs to be computed as an initial estimator in all settings.
outperformed Random Forests in terms of predictive accuracy on almost all datasets. Second, the relative difference between these three methods is very small. Maybe surprisingly, using a cross-validated choice of λ did not help much in improving predictive accuracy for the Forest Garrote estimator. On the contrary, it rather lead to worse predictive performance, presumably due to the inherent variability of the selected penalty parameter. Forest Garrote (CV) has also obviously a computational disadvantage compared with the recommended Forest Garrote estimator, as shown in Table 1 which is comparing relative CPU times necessary to compute the relative estimators. All three methods could be speeded up considerably by clever computational implementation. Any such improvement would most likely be applicable to any of these three compared methods as they have very similar optimization problems at their heart. Only relative performance measurements seem to be appropriate and only the time it takes to solve the respective optimization problems (5), (12) and (13) is reported, including time necessary for cross-validation, if so required. Rule Ensembles is faring by far the worst here, since the underlying optimization problem is very high-dimensional. The dimensionality J in (5) is the total number of rules, which corresponds to the number of all nodes in the Random Forest fit. The total number |G| of interaction patterns in the optimization underlying (12) in the Forest Garrote fit is, on the other hand, very much smaller than the number J of all rules, since many rules are typically combined in each interaction patterns. The lack of cross-validation for the Forest Garrote estimator clearly also speeds computation up by an additional factor between 5 and 10, depending on which form of cross-validation is employed.
Finally, the number of variables selected by either method is examined. A variable is said to be selected for this purpose if it appears in any node in the Forest or in any rule that is selected with a non-zero coefficient. In other words, selected variables will be needed to compute predictions, not selected variables can be discarded. The results are shown in Table 2. Many variables are typically involved in a Random Forest Fit and both Rule Ensembles as well as Forest Garrote can cut down this number substantially. Especially for higher-dimensional data with large number p of variables, the effect can be pronounced. Between Rule Ensembles and Forest Garrote, the differences are very minor with a slight tendency of Forest Garrote to produce sparser results.

Conclusions
Balancing interpretability and predictive power for regression problems is a difficult act. Linear models lend themselves more easily to interpretation but suffer often in terms of predictive power. Random Forests (RF), on the other hand, are known deliver very accurate prediction. Tools exist to extract marginal variable importance measures from RF. However, the interpretability of RF could be improved if the very large number of nodes in the hundreds of trees fitted for RF could be reduced.
Here, the Forest Garrote was proposed as such a pruning method for RF or tree ensembles in general. It collects all rules or nodes in the Forest that belong to the same functional group. Using a Garrote-style penalty, some of these functional groups are then shrunken to zero, while the signal of other functional groups is enhanced. This leads to a sparser model and rather interpretable interaction plots between variables. Predictive power is similar or better to the original RF fit for all examined datasets.
The unique feature of Forest Garrote is that it seems to work very well without the use of a tuning parameter, as shown on multiple well known and less well known datasets. The lack of a tuning parameter makes the method very easy to implement and computationally efficient.