A Confidence Machine for Sparse High-Order Interaction Model

In predictive modeling for high-stakes decision-making, predictors must be not only accurate but also reliable. Conformal prediction (CP) is a promising approach for obtaining the confidence of prediction results under fewer theoretical assumptions. To obtain the confidence set by so-called full-CP, we need to refit the predictor for all possible values of the prediction result, which is feasible only for simple predictors. For complex predictors such as random forests (RFs) or neural networks (NNs), split-CP is often employed, where the data is split into two parts: one part for fitting and another for computing the confidence set. Unfortunately, because of the reduced sample size, split-CP is inferior to full-CP both in fitting and in confidence-set computation. In this paper, we develop a full-CP of the sparse high-order interaction model (SHIM), which is sufficiently flexible because it can take into account high-order interactions among variables. We resolve the computational challenge of full-CP for SHIM by introducing a novel approach called homotopy mining. Through numerical experiments, we demonstrate that SHIM is as accurate as complex predictors such as RF and NN while enjoying the superior statistical power of full-CP.


Introduction
The uncertainty in data-driven analysis is a major concern, particularly in risk-sensitive automated decision-making problems (for example, in medical diagnosis and criminal justice). Several strategies exist to quantify the uncertainty of a point estimator. For example, the Bayesian approach can provide a strong confidence bound, but it requires assumptions on the prior distribution. PAC analysis is another approach that provides bounds on the probability of error. Conformal prediction (CP) is one such uncertainty quantification method that is very generic and applicable to almost any point estimator Vovk et al. (2005); Shafer and Vovk (2008). The CP method has recently gained significant attention because it provides a valid finite-sample statistical coverage guarantee at any nominal level as long as the data are independently and identically distributed (i.i.d.). The coverage guarantee remains valid even when the model is misspecified. In this paper we are interested in full-CP, where the full data is used to compute the CP set. An alternative to this approach is split-CP, in which the data is split into two parts: one part is used for fitting and the remaining part is used to compute the CP set. The essential idea of the full-CP framework can be stated as follows: given a training set D n = {(x 1 , y 1 ), . . . , (x n , y n )} and a test instance x n+1 , both i.i.d., the goal of CP is to construct a 100(1 − α)% confidence set that contains the unobserved y n+1 . Here, α ∈ (0, 1) represents the significance (miscoverage) level. In other words, we are interested in all possible completions of the augmented dataset in the form of D n ∪ (x n+1 , τ ), such that y n+1 = τ is typical for the known data {D n , x n+1 }.
This typicality is measured by a typicalness function π(·), which is also called the p-value in analogy to classical hypothesis testing. In other words, the CP set for x n+1 is defined as the set of all τ for which the null hypothesis H 0 : y n+1 = τ is not rejected against the alternative hypothesis H 1 : y n+1 ≠ τ .

Figure 1: (A) An intuitive illustration of a SHIM solution space, which is a hyper-rectangle (i.e., a generalization of a rectangle to higher dimensions that essentially represents a Cartesian product of intervals). The decision sets, the predicted response (ŷ) and the associated confidence interval for a randomly chosen example from the ProPublica two-year recidivism criminal justice (COMPAS) dataset are shown. (B) A tree of patterns constructed by exploiting the hierarchical structure of high-order interaction features (note: not all patterns appear, due to pruning).
Related work: Since its inception, there have been several extensions and applications of the CP framework in diverse directions. Examples include the choice of conformity score and statistical efficiency Lei et al. (2013); Lei and Wasserman (2014), high-dimensional regression Lei et al. (2018), and classification Lei (2015); Bates et al. (2021). Recently, an approximate computation of the conformal set was proposed for regression in Takeuchi (2019, 2021), and in classification, significant advancement has been made in Cherubin et al. (2021) to compute exact full-CP. Despite its attractive properties, the application of exact full-CP to many practical problems remains an open problem owing to its high computational cost. By definition, the full-CP framework requires refitting the model on the augmented data for every possible candidate τ ∈ R of the unobserved y n+1 . In a regression setting, one needs to refit the model an infinite number of times for all possible candidates on the real line (∀τ ∈ R) and check the conformity by computing the p-values π(τ ).
Hence, an efficient computation of the full-CP set is possible only for a handful of simple models (e.g., ordinary LS regression, ridge regression) in which the solution is explicitly represented as a function of τ . This enables a closed-form derivation of full-CP set and avoids an exhaustive search over the real line Nouretdinov et al. (2001). Recently, Lei (2019) proposed a homotopy method to efficiently compute the full-CP set of the LASSO. The homotopy method exploits the piece-wise linearity of the LASSO solutions as a function of τ , and avoids the exhaustive search by keeping track of the transition points along the solution path and the use of a linear interpolation between transition points. Unfortunately, such a structure does not exist for most of the models, and the application of full-CP is still an open question for complex models such as a random forest (RF), neural network (NN) etc. For those complex models we need to rely on split-CP.
In CP, although the coverage property is satisfied even when the model is misspecified, it is better to use sufficiently complex models for complex data because this enables us to obtain a more compact confidence set (i.e., shorter confidence intervals). Furthermore, it is important to note that the confidence set obtained by full-CP is more compact than that obtained by split-CP because only a part of the instances in the available dataset is used for constructing the confidence set in split-CP. This means that it is valuable to construct a full-CP algorithm for sufficiently complex models. In this paper, we consider the sparse high-order interaction model (SHIM), which is complex but interpretable, and propose a homotopy-mining method to efficiently compute the full-CP set. We call the resulting machine the SHIM confidence machine. SHIM represents the model as a weighted sum of conjunction rules, which are highly interpretable as well as accurate decision sets Das et al. (2019); Lakkaraju et al. (2016). SHIM can capture the combinatorial interactions of multiple factors, which can prove to be beneficial for deciphering complex data. For binary covariates, a conjunction rule of a SHIM looks like z ℓ1 z ℓ2 · · · z ℓk = I(z ℓ1 = 1 ∧ · · · ∧ z ℓk = 1), where I(·) refers to the indicator function. An intuitive illustration of a SHIM is shown in Fig. 1A.
Contribution: Our main contribution in this paper is an efficient algorithm to conduct exact full-CP for SHIM, which can capture complex structures in the data by considering high-order interaction features. Exact full-CP for complex black-box models such as RF and NN is intractable, and no efficient computational method exists; for such black-box models, only split-CP is possible. The proposed method enables us to obtain a more compact confidence set (much better than linear models and comparable to non-linear models) by full-CP for the sufficiently flexible and complex SHIM. Because SHIM uses higher-order interaction features, full-CP is computationally challenging, but we overcome this difficulty by introducing a method called homotopy mining, which exploits the best of both homotopy and (pattern) mining methods. The computation of exact full-CP for SHIM by homotopy mining can be interpreted as an extension of the LASSO's exact full-CP in Lei (2019). The use of black-box or post-hoc-explainable machine learning models in high-stakes decision-making, such as in healthcare, criminal justice, and other domains, has been strongly criticized in the literature Rudin (2019). We believe that the exact full-CP of a SHIM has significant importance in practice, where accuracy, statistical reliability, and interpretability all matter. If a practitioner chooses SHIM as a regressor, how can the corresponding exact full-CP set be computed? The proposed method provides a computationally efficient answer. Notation: In our notation, y ∈ R n represents the response vector of n instances, and y(τ ) ∈ R n+1 is the augmented response vector constructed by augmenting the possible response value (τ ∈ R) of the (n + 1) th instance with y. We write y(τ ) as a function of the variable τ because the vector y(τ ) changes with the possible response value (τ ) of the (n + 1) th instance.
Here y i ∈ R, ∀i ∈ [n + 1], represents the scalar response of the i th instance, X ∈ R (n+1)×p is the design matrix of all (n + 1) instances, each having p features, x ℓ ∈ R n+1 is the column vector corresponding to some ℓ ∈ [p], where [p] = {1, . . . , p}, and X Aτ ∈ R (n+1)×|Aτ | is a smaller design matrix in which the columns are restricted to the elements of some subset A τ ⊆ [p].

Problem Statement
Consider a regression problem with a response vector y ∈ R n and m original covariate vectors z 1 , . . . , z m , where z ℓ ∈ R n and ℓ ∈ [m] = {1, . . . , m}. A high-order interaction model up to the d th order is then written as follows:

y = Σ_{ℓ1 ∈ [m]} θ ℓ1 z ℓ1 + Σ_{ℓ1 < ℓ2} θ ℓ1,ℓ2 z ℓ1 ◦ z ℓ2 + · · · + Σ_{ℓ1 < · · · < ℓd} θ ℓ1,...,ℓd z ℓ1 ◦ · · · ◦ z ℓd , (1)

where ◦ denotes the element-wise product. By collecting all p = Σ_{k=1}^{d} (m choose k) interaction features into a design matrix X, the high-order interaction model (1) is then simply written as a linear model y = Xβ. Unfortunately, p can be prohibitively large unless both m and d are fairly small. In SHIM, we consider a sparse estimation of a high-order interaction model. An example of SHIM is as follows:

y = θ 3 z 3 + θ 5 z 5 + θ 2,6 z 2 z 6 + θ 1,2,5,9 z 1 z 2 z 5 z 9 . (2)

Before delving into our proposed method, we briefly overview the conformal prediction framework.
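The expansion from m original covariates to the p interaction features can be sketched as follows (a minimal illustration, not the authors' implementation; for binary covariates the element-wise product of columns is exactly the conjunction feature):

```python
from itertools import combinations

import numpy as np

def interaction_design(Z, d):
    """Expand covariates Z (n x m) into all element-wise products of up
    to d distinct columns: the high-order interaction features of (1)."""
    cols, names = [], []
    m = Z.shape[1]
    for k in range(1, d + 1):
        for idx in combinations(range(m), k):
            cols.append(Z[:, idx].prod(axis=1))  # element-wise product
            names.append(idx)
    return np.column_stack(cols), names

rng = np.random.default_rng(0)
Z = (rng.random((5, 4)) > 0.4).astype(float)  # small binary covariate matrix
X, names = interaction_design(Z, d=2)
# p = C(4,1) + C(4,2) = 4 + 6 = 10 interaction features
print(X.shape)  # (5, 10)
```

This also makes the combinatorial blow-up concrete: p grows like Σ_k C(m, k), which is why SHIM needs the pruning machinery developed later in the paper.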

Conformal prediction
A mere point estimate is insufficient for risk-sensitive automated decision-making problems Rudin (2019). In such high-stakes decision-making problems, if the estimators are equipped with the associated confidence information, the decision maker will be more informed and sufficiently confident to make a prudent decision when the stakes are high.
Full-CP: Given a labelled dataset D n and a new observation x n+1 , the goal of the full-CP framework is to construct a set of likely values C(x n+1 ) of the unobserved y n+1 with a valid statistical coverage guarantee Vovk et al. (2005); Shafer and Vovk (2008), i.e.,

P( y n+1 ∈ C(x n+1 ) ) ≥ 1 − α, (3)

where α ∈ (0, 1) determines the level of confidence. If we define a prediction function µ(·) that maps the input X to the output y, then the essential idea of constructing a full-CP set is to fit a model µ τ (·) with the augmented data D n ∪ (x n+1 , τ ) for every possible candidate τ ∈ R and compare the prediction errors of all instances. More precisely, let y(τ ) = [y 1 , . . . , y n , τ ] ∈ R n+1 be the response vector augmented with τ . We define a score function that measures how well the model can predict each output variable, with the constraint that it must not depend on the order of the data instances Lei (2019). One such conformity score function generally used in the linear regression setting is the (coordinate-wise) absolute residual, i.e.,

s i (τ ) = | y i (τ ) − µ τ (x i ) |, ∀i ∈ [n + 1],

where we stack the input vectors in the design matrix X = [x 1 , . . . , x n+1 ] in R (n+1)×p . The conformity function can now be defined as

π(τ ) = (1 / (n + 1)) Σ_{i=1}^{n+1} I( s i (τ ) ≥ s n+1 (τ ) ),

which evaluates how the prediction of the candidate τ using the fitted model µ τ (x n+1 ) is ranked compared to the predictions of the previously observed data y i using µ τ (x i ). The conformal set is defined as

C(x n+1 ) = { τ ∈ R : π(τ ) ≥ α },

which is merely the collection of candidates whose conformity is large enough. Vovk et al. (2005) showed that the conformal set C(x n+1 ) so defined satisfies the coverage guarantee (3) as long as the data are i.i.d.
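The definitions above translate directly into a brute-force full-CP procedure. The sketch below is illustrative only: ridge regression stands in for the base predictor µ τ so each refit is a cheap closed-form solve, and the real line is discretized to a finite grid. It makes the refit-per-τ cost explicit, which is exactly the cost the homotopy approach in this paper removes:

```python
import numpy as np

def fit_ridge(X, y, lam=1.0):
    """Closed-form ridge solve, used as a cheap stand-in for mu_tau."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def full_cp_grid(X, y, x_new, alpha=0.1, grid=None, lam=1.0):
    """Naive full-CP: refit once per candidate tau and keep the taus
    whose typicalness pi(tau) is at least alpha."""
    Xa = np.vstack([X, x_new])                 # augmented design
    if grid is None:
        grid = np.linspace(y.min() - 3, y.max() + 3, 400)
    accepted = []
    for tau in grid:
        y_tau = np.append(y, tau)              # augmented response y(tau)
        beta = fit_ridge(Xa, y_tau, lam)       # refit with candidate tau
        s = np.abs(y_tau - Xa @ beta)          # conformity scores s_i(tau)
        pi = np.mean(s >= s[-1])               # typicalness (p-value)
        if pi >= alpha:
            accepted.append(tau)
    return min(accepted), max(accepted)

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=60)
x_new = rng.normal(size=(1, 3))
lo, hi = full_cp_grid(X, y, x_new, alpha=0.1)
print(lo, hi)  # a finite interval of accepted candidates
```

For models without a closed-form refit, each grid point costs a full training run, which is why exact full-CP is considered intractable for RF or NN.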

Split-CP:
In split-CP Papadopoulos et al. (2002), the dataset D n is split into two parts: the training set D tr = {(x 1 , y 1 ), . . . , (x m , y m )} and the calibration set D cal = {(x m+1 , y m+1 ), . . . , (x n , y n )}, with m < n. The model µ tr (·) is then fit to the training set D tr only once, and the p-values, denoted π split (·), are determined using the calibration set D cal as follows.
The split-CP set can then be defined as

C split (x n+1 ) = { τ ∈ R : | τ − µ tr (x n+1 ) | ≤ Q cal 1−α },

where Q cal 1−α is the (1 − α) quantile of the calibration scores S cal i = | y i − µ tr (x i ) |, ∀i ∈ {m + 1, . . . , n}. When the conformity score is defined as the absolute residual, the split-CP set can thus be conveniently written as the interval [ µ tr (x n+1 ) − Q cal 1−α , µ tr (x n+1 ) + Q cal 1−α ]. Although split-CP enjoys the computational efficiency of a single model fit, it suffers from poor statistical efficiency owing to the smaller sample size in both the model-fitting and calibration phases.

Cross-conformal predictors: Cross-conformal predictors, introduced in Vovk (2015), follow a similar trade-off of bias reduction versus variance reduction while trying to exploit the entire dataset for calibration and a significant proportion for training the model. The dataset is partitioned into K folds, and one computes a split conformal set by sequentially defining the k th fold as the calibration set and the remaining folds as the training set for k ∈ {1, . . . , K}. The main difficulty lies in aggregating the different p-values while maintaining the validity of the method; see Carlsson et al. (2014); Linusson et al. (2017). In general, it can be shown that the confidence level is inflated by a factor of 2, i.e., the (not improvable) theoretical coverage level is 1 − 2α instead of 1 − α; see Barber et al. (2021). Nevertheless, the practical performance is fairly acceptable both computationally and statistically.
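For contrast with the full-CP loop, the split-CP construction is only a few lines. The sketch below is dependency-free (plain least squares stands in for the fitted model µ tr; the half/half split and the quantile index follow the standard split-conformal recipe):

```python
import numpy as np

def split_cp(X, y, x_new, alpha=0.1):
    """Split-CP: one fit on the training half, then an interval from
    the conformal (1 - alpha) quantile of calibration residuals."""
    n = len(y)
    m = n // 2
    beta, *_ = np.linalg.lstsq(X[:m], y[:m], rcond=None)  # fit once on D_tr
    scores = np.abs(y[m:] - X[m:] @ beta)                 # calibration scores S_i
    k = int(np.ceil((1 - alpha) * (len(scores) + 1)))     # conformal quantile index
    q = np.sort(scores)[min(k, len(scores)) - 1]
    pred = float(x_new @ beta)
    return pred - q, pred + q                             # [mu(x) - Q, mu(x) + Q]

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=200)
lo, hi = split_cp(X, y, rng.normal(size=(1, 3)), alpha=0.1)
print(hi - lo)  # a single, non-degenerate interval width
```

The single fit is the computational win; the price, as the paper argues, is that both the fit and the quantile are computed from roughly half the data, inflating the interval.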
On the other hand, to compute the exact full-CP set for a regression problem where y ∈ R, one needs to refit the regression model µ τ (·) and evaluate π(τ ) for an infinite number of candidates τ . Hence, efficient computation of the exact full-CP set is possible only for a handful of predictors (e.g., ridge regression) for which a closed-form expression of µ τ (·) is available. Such structure does not exist for most machine learning models, and computing the exact full-CP set remains an open question for many of them. Recently, Lei (2019) proposed a homotopy method to efficiently compute the exact full-CP set of the LASSO, which was a non-trivial result. Lei (2019) considered the response y as a function of a scalar and derived a homotopy method to construct the exact LASSO solution path as a function of that scalar. After constructing the exact path, Lei (2019) used the solutions at the transition points of the path to efficiently compute the exact full-CP set of the LASSO. In this paper we extend that framework to the SHIM.

Proposed Method
We propose a homotopy-mining method to compute the exact full-CP set of a SHIM. The homotopy method refers to an optimization framework for solving a sequence of parameterized optimization problems. The basic idea of our method is to consider the following optimization problem:

β(τ ) = argmin_β (1/2) ‖ y(τ ) − Xβ ‖ 2 2 + λ ‖β‖ 1 .

At the optimum we can write

X ᵀ ( Xβ(τ ) − y(τ ) ) + λ s(τ ) = 0,

where s(τ ) ∈ ∂‖β(τ )‖ 1 is a subgradient of the ℓ 1 norm (Efron et al., 2004). In Lei (2019), the author showed that, for a fixed λ, the exact solution path of the LASSO with a variable response y(τ ), parameterized by τ , is piece-wise linear, as stated in Proposition 1.

Proposition 1. Let τ 1 and τ 2 be any two points, where τ 2 = τ 1 + ∆, ∆ > 0, such that the active set A τ and the signs of the coefficients β ℓ (τ ), ∀ℓ ∈ [p], remain the same ∀τ ∈ [τ 1 , τ 2 ), and X Aτ ᵀ X Aτ is invertible ∀τ ∈ [τ 1 , τ 2 ). Then one can show that

β Aτ (τ 2 ) = β Aτ (τ 1 ) + ∆ ν Aτ (τ 1 ), with ν Aτ (τ 1 ) = ( X Aτ ᵀ X Aτ ) −1 (x n+1 ) Aτ .

If τ t+1 is the next zero-crossing point and there is only one feature ℓ ∈ [p] that enters or leaves the active set A τt at τ = τ t+1 , then Lei (2019) showed that τ → β(τ ) is well defined and piece-wise linear in τ .
We call the mapping τ → β(τ ) the τ -path; the number of linear pieces of this τ -path is finite (see the complexity analysis in the appendix). This τ -path can be computed exactly using the homotopy method, where the sequence of τ values represents the breakpoints of the homotopy method (Efron et al., 2004; Mairal and Yu, 2012), indexed by t ∈ {1, · · · , T }. The basic idea of the homotopy method is stated below. At τ t+1 , which is the next zero-crossing point of the τ -path, either of the following two events occurs (Efron et al., 2004; Rosset and Zhu, 2007):

• A zero variable becomes non-zero, that is, a non-active feature j ∈ A c τt enters the active set when its absolute correlation | x j ᵀ w(τ ) | with the residual w(τ ) = y(τ ) − Xβ(τ ) reaches λ; the corresponding step size is

∆ 2 * = min_{j ∈ A c τt} [ ( (λ − x j ᵀ w(τ t )) / (x j ᵀ v(τ t )) ) ++ ∧ ( (λ + x j ᵀ w(τ t )) / (−x j ᵀ v(τ t )) ) ++ ], (9)

where v(τ t ) = e n+1 − X Aτt ν Aτt (τ t ) is the direction of change of the residual along the path.

• A non-zero variable becomes zero, that is, an active feature k ∈ A τt leaves the active set; the corresponding step size is ∆ 1 * = min_{k ∈ A τt} ( −β k (τ t ) / ν k (τ t ) ) ++ .

Overall, the next change in the active set (or change in the direction vector) occurs after the step ∆* = min(∆ 1 *, ∆ 2 *), where we use the convention that for any a ∈ R, (a) ++ = a if a > 0 and ∞ otherwise. However, naively determining the step size of inclusion (∆ 2 *) by minimization over all possible interaction terms is intractable for a SHIM-type problem, because in SHIM the search space grows exponentially due to the combinatorial effect of high-order interaction terms. Therefore, both fitting a SHIM and constructing its full-CP set are non-trivial: unless both m and d are very small, a high-order interaction model has a significantly large number of parameters to be considered. Several algorithms for fitting a sparse high-order interaction model have been proposed in the literature Tsuda (2007); Saigo et al. (2009); Nakagawa et al. (2016). A common approach adopted in these existing works is to exploit the hierarchical structure of the high-order interaction features. In other words, a tree structure as in Fig. 1B is considered, and a branch-and-bound strategy is employed to avoid handling the exponentially increasing number of high-order interaction features. Hence, we need efficient computational methods to make the computation practically feasible.
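Proposition 1 can be verified numerically from the KKT system: with the active set and signs held fixed, β A (τ ) = (X A ᵀ X A ) −1 (X A ᵀ y(τ ) − λ s A ) is affine in τ with slope ν A = (X A ᵀ X A ) −1 (x n+1 ) A . A small numpy sketch (treating the full feature set as active and fixing the sign vector s A are assumptions made purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 30, 4
X = rng.normal(size=(n + 1, p))   # augmented design: n train rows + 1 test row
y = rng.normal(size=n)
lam = 0.5
sA = np.array([1.0, -1.0, 1.0, -1.0])   # assumed fixed sign pattern s_A

def beta_closed_form(tau):
    """LASSO solution restricted to the active set, under the premise of
    Proposition 1 (active set = [p], signs sA fixed)."""
    y_tau = np.append(y, tau)
    return np.linalg.solve(X.T @ X, X.T @ y_tau - lam * sA)

b0, b1, b2 = beta_closed_form(0.0), beta_closed_form(1.0), beta_closed_form(2.0)
# Piece-wise linearity: equal steps in tau give equal steps in beta, and the
# per-unit step equals nu_A = (X_A' X_A)^{-1} x_{n+1}.
nu = np.linalg.solve(X.T @ X, X[-1])
print(np.allclose(b1 - b0, nu), np.allclose(b2 - b1, nu))  # True True
```

The homotopy method exploits exactly this: between breakpoints the path is a straight line, so only the breakpoints themselves need to be computed.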
In the following section, we present an efficient tree-pruning strategy that considers the tree structure of the interaction terms (or patterns). Here, each node of the tree represents an interaction term. The basic idea of tree pruning is that we construct the tree of interaction terms in a progressive manner. We keep track of the current minimum step size (up to the construction of the ℓ th pattern), i.e. ∆ 2 * = min_{j ∈ {1,2,...,ℓ}} {∆ 2 j }, as we construct the tree progressively, and prune a large part of the tree if some bound condition fails. Here, ∆ 2 j is the bracketed quantity in the expression of ∆ 2 * in (9) for any non-active feature j ∈ A c τt .

Tree pruning (τ -path)
Definition of Tree: A tree is constructed in such a way that, for any pair of nodes (ℓ, ℓ′) where ℓ is an ancestor of ℓ′ (i.e., ℓ ⊂ ℓ′), the support of the descendant is contained in that of the ancestor: x iℓ′ ≤ x iℓ , ∀i ∈ [n + 1]. The basic idea of our tree-pruning condition is stated below. The equicorrelation condition for any active feature k ∈ A τ at a fixed λ can be written as | x k ᵀ w(τ ) | = λ (see the definition of A τ below (7)).
Therefore, at τ = τ t+1 a non-active feature ℓ ∈ A c τt becomes active if

| x ℓ ᵀ ( w(τ t ) + ∆ v(τ t ) ) | = λ (10)

for some step ∆ > 0. Now, using the triangle inequality, one can show that (10) will not have any solution with ∆ < ∆ 2 * if

| x ℓ ᵀ w(τ t ) | + ∆ 2 * | x ℓ ᵀ v(τ t ) | < λ. (11)

Therefore, (11) can be used to derive the pruning condition of the τ -path, which is formally stated in Lemma 1. A similar idea has been used in the context of graph mining Tsuda (2007) and selective inference of SHIM Das et al. (2021). Note that Tsuda (2007) provided a pruning condition for the exact regularization path of graph data (λ-path), and Das et al. (2021) provided a pruning condition in the context of selective inference to characterize the conditional distribution of the test statistic. In our case, we adapt the same idea to compute the exact full-CP set of SHIM.

Lemma 1. Let ∆ 2 * = min_{j ∈ {1,2,...,ℓ}} {∆ 2 j } be the current minimum step size. If

b ℓ,w(τ t ) + ∆ 2 * b ℓ,v(τ t ) < λ, (12)

where b ℓ,u := max( Σ i [u i ] + x iℓ , Σ i [−u i ] + x iℓ ) with [a] + := max(a, 0), which upper-bounds | x ℓ′ ᵀ u | for every descendant ℓ′ of ℓ (by the anti-monotonicity x iℓ′ ≤ x iℓ ), then no feature in the subtree rooted at ℓ can enter the active set before the step ∆ 2 *. Lemma 1 essentially states that if the condition in (12) is satisfied, then one can safely ignore the subtree with ℓ as the root node, thereby dramatically improving the computational efficiency. We also extended our method to the elastic net, and we call the result El-SHIM. The details of El-SHIM, the proof of Lemma 1 and the algorithm to compute the exact full-CP of SHIM are provided in the appendix.

Results and Discussions
We evaluated our proposed method using both synthetic and real-world data. For all experiments, we considered a coverage guarantee of 90%, that is, α = 0.1. We compared the statistical efficiency of our proposed method (SHIM) with those of other simple (LASSO) and complex models (NN, RF). For the complex models, we reported the split-CP because it is the only available method for computing the CP for complex models. We also demonstrated the statistical efficiency of full-CP in comparison to that of split-CP both for LASSO and SHIM.
Synthetic data: We generated random i.i.d. samples (z i , y i ) ∈ {0, 1} m × R in such a way that, on average, 100(1 − ζ)% of the m features of z i contain a value of 1. Here, ζ ∈ [0, 1] is the sparsity-controlling parameter of the design matrix, whereas the sparsity of the model coefficients is controlled by the regularizer λ. The pruning effectiveness (Table 4) depends on the sparsity of the design matrix, as the pruning exploits the tree's anti-monotonicity property. High-dimensional real-world data is generally very sparse, and the choice of ζ in our experiments is for demonstration purposes only. The response y i ∈ R is randomly generated from a normal distribution N (µ(x i ), σ 2 ). For demonstration purposes, we considered a true model of up to fifth-order interactions, defined as µ(x i ) = 2.0z 1 + 2.0z 1 z 2 + 2.0z 1 z 2 z 3 + 2.0z 1 z 3 z 4 z 5 + 2.0z 1 z 2 z 3 z 4 z 5 , and set σ = 1. The choice of this model is merely for demonstration purposes, and the proposed method is equally applicable to any chosen model. We used the same true model µ(x i ) in both the low- and high-dimensional settings.

Table 3: Comparison of the statistical power of the proposed method (shim) with other simple (lasso) and complex models (mlp, rf) using COMPAS data. shim_2s and shim_2f respectively represent split-CP and full-CP for a 2 nd order SHIM. The bracketed values represent the standard deviations.

Figure 2: Comparison of the confidence interval lengths (CI length) of the proposed method (shim) with other simple (lasso) and complex models (mlp, rf) using synthetic data for different sample sizes. shim_3s and shim_3f respectively represent split-CP and full-CP for a 3 rd order SHIM.
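The data-generating process just described can be reproduced in a few lines (a sketch; the function and variable names are ours, not the paper's):

```python
import numpy as np

def make_synthetic(n, m, zeta, sigma=1.0, seed=0):
    """Binary design with ~100(1 - zeta)% ones per row, response drawn from
    the paper's 5th-order true model
    mu = 2(z1 + z1 z2 + z1 z2 z3 + z1 z3 z4 z5 + z1 z2 z3 z4 z5), sigma = 1."""
    rng = np.random.default_rng(seed)
    Z = (rng.random((n, m)) > zeta).astype(float)   # sparsity controlled by zeta
    z = [Z[:, j] for j in range(5)]                 # needs m >= 5
    mu = 2.0 * (z[0] + z[0] * z[1] + z[0] * z[1] * z[2]
                + z[0] * z[2] * z[3] * z[4]
                + z[0] * z[1] * z[2] * z[3] * z[4])
    y = mu + rng.normal(scale=sigma, size=n)
    return Z, y

Z, y = make_synthetic(n=150, m=10, zeta=0.4)
print(Z.mean())  # close to 0.6 = 1 - zeta
```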

Comparison of statistical powers
We compared the statistical efficiency using both synthetic and real-world data. For the synthetic data, we considered both low- and high-dimensional settings. In the low-dimensional setting, we considered a dataset with n = 150 instances and m = 10 original covariates, whereas in the high-dimensional setting, we considered n = 150 and m = 100. We kept the sparsity of the design matrix fixed at ζ = 0.4 in both settings. We varied the order of interaction d = 2, 3, · · · ; however, in almost all the experiments we found that the model gets saturated after 3 rd order interactions (i.e., the performance of SHIM did not change much for d ≥ 3). Hence, we report the results of up to 3 rd order interactions. Note that the maximum order of interaction d does not need to be specified beforehand: our pruning condition also handles the case where d is not specified, that is, when the whole search space is considered (Table 4). For both settings (low and high), we generated datasets of n = 150 training instances, each accompanied by 50 test instances. We generated 5 such independent random datasets and repeated the experiments 3 times. Hence, in total we report the average results over 15 independent datasets, i.e. 3 × 5 × 50 = 750 test instances. The details of the hyper-parameter selection are given in the appendix. We used a multi-layer perceptron (MLP) as the neural network architecture in our experiments.
From Table 1 and Table 2 it can be observed that, in both the low- and high-dimensional settings, all the methods produced perfect (or nearly perfect) coverage, i.e. cov = 0.90. The statistical efficiency of the individual methods is compared using the length of the confidence interval: a statistically efficient model is expected to produce a shorter confidence interval. Comparing the lengths of the confidence intervals, one can observe that the statistical efficiency of SHIM increases as we increase the order of interactions. A 3 rd order SHIM produced the shortest average confidence interval lengths in both the low- and high-dimensional settings. To compare the fitting power of the individual methods, we report the R-squared (r2) scores. It can be observed that the r2 scores of a 3 rd order SHIM are comparable to those of the best-performing complex model (RF) in both the low- and high-dimensional settings. We also compared the confidence interval lengths (CI length) and R-squared (r2) scores of the proposed method (SHIM) with those of other simple (LASSO) and complex models (MLP, RF) for three different sample sizes, n ∈ {100, 150, 200}, m = 10 (Figure 2). It can be observed that the full-CP of the 3 rd order SHIM (shim3_f) produced the best results (shortest avg. length, highest avg. r2 score) in all cases. It can also be observed that the performance of data splitting (in mlp, rf, lasso_s, shim3_s) is worse than that of full-CP. The split-CP tends to produce a longer confidence interval and a smaller r2 score, and suffers from high variance due to the smaller data size as well as the additional randomness introduced by data splitting.

Table 4: Computation time (in sec) with and without pruning using two different λ values (λ = 1, 10) for three different sparsity levels (ζ = 0.4, 0.7, 0.9). All computation times were measured on an Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz.

For the real data, we randomly split the data into a training set of n = 5000 and a test set of n = 2214. We selected the hyper-parameters by 5-fold cross-validation using the training data and report the statistical power on the test data. The range of hyper-parameters from which the optimal one is selected is the same as that used for the synthetic data experiments. We repeated this experimental setup 100 times and report the average results. From Table 3, one can observe that all the methods produced perfect coverage (cov = 0.90) and nearly the same r2 scores. A 2 nd order SHIM produced the shortest average confidence interval length. The performance of SHIM did not improve for d > 2; hence, we report the results of a 2 nd order SHIM. Additional results are provided in the appendix.

Comparison of computational efficiencies.
To demonstrate the computational efficiency of the proposed pruning strategy for the τ -path, we generated a synthetic dataset of n = 100 and m = 30 for three different sparsity levels of the design matrix (ζ = 0.4, 0.7, 0.9) using the same 5 th order model as used to demonstrate the statistical power. We compared both the fraction of nodes traversed (see appendix) and the time taken (Table 4) against different maximum interaction orders d for the three sparsity levels (ζ = 0.4, 0.7, 0.9) using two different λ values (λ = 1, 10). It can be observed that the pruning is more effective at the deeper nodes of the tree and saturates after a certain depth. This is expected, as the sparsity of the data increases at the deeper nodes, and the pruning exploits the anti-monotonicity of the high-order interaction terms organized as a tree of patterns.
In the case of the homotopy method without pruning, we stopped the execution of the program if the τ -path did not finish within one day. From Table 4, it can be observed that without tree pruning, the construction of the τ -path becomes impractical as we progress to the deeper nodes of the tree, because of the generation of an exponential number of high-order interaction terms. The maximum time taken by the τ -path with pruning was approximately 130 s for a "max-pat" size of 25, i.e. for 1,073,709,892 nodes at λ = 1, ζ = 0.4. In our numerical experiments, the number of kinks in the τ -path appears to be polynomial in the number of features (see appendix). The worst-case complexity of the τ -path is exponential (see appendix). Fortunately, however, it is well recognized that this worst case rarely happens in practice Li and Singer (2018); Le Duy and Takeuchi (2021), and this is also evident from our experimental results (see appendix).

Conclusion
In this paper we proposed an algorithm to efficiently compute the full-CP set of SHIM. Through numerical experiments, we demonstrated that SHIM is statistically superior to the vanilla LASSO and comparable to other complex models (NN, RF). A confidence interval accompanying a point estimate is more informative than a point estimate alone. The computation of a full-CP set for other complex models (NN, RF) remains an open question. SHIM is interpretable, and it generates the decision function as a weighted combination of decision sets (if-then rules). The homotopy-mining-based exact full-CP of SHIM is better than the split-CP of SHIM, and the efficient pruning strategy makes the computation of the exact full-CP set tractable for SHIM.
and the left hand side (l.h.s.) of (13) has an upper bound i.e.
The above two bounds are derived by considering the fact that for any a ∈ R, b ∈ R, c ∈ R and d ∈ R > 0 we can write the following: Therefore, for equation (13) to have a solution the following condition needs to be satisfied.
Equation (13) will not have any solution if (14) is not satisfied, and this fact can be used to derive the pruning condition. Therefore, the pruning condition of the SHIM τ -path can be written as follows: Before proving Lemma 1, we introduce the following two propositions. Proposition 2. We can write Proof of Proposition 2: We have Similarly, Therefore, using Proposition 2 we can further rewrite equation (16), which implies equation (15).
(16) Proposition 3. By using the definition of the tree, we have v(τ t ) . We now prove Lemma 1 by contradiction; that is, we assume that ∃ ℓ such that (16) holds and that there exists some ℓ′ ⊃ ℓ with ∆ 2 ℓ′ < ∆ 2 *, and then show that this is a contradiction.

Proof of Proposition 3: From the definition of tree we have
=⇒ ℓ′ is infeasible, using (14), (15), (16) =⇒ ∆ 2 ℓ′ ≮ ∆ 2 *. This completes the proof of Lemma 1. Hence, if the pruning condition in Lemma 1 holds, then we do not need to search the sub-tree with ℓ as the root node, thereby increasing the efficiency of the search procedure.

B Extension for Elastic Net (El-SHIM)
A common problem of the LASSO is that if the data has correlated features, then the LASSO picks only one of them and ignores the rest, which leads to instability. To solve this problem, Zou and Hastie (2005) proposed the Elastic Net (ElNet). This feature-correlation problem is very much evident in SHIM-type problems, and hence we extended our framework to the Elastic Net. We solve the following optimization problem:

β(τ ) = argmin_β (1/2) ‖ y(τ ) − Xβ ‖ 2 2 + λ ‖β‖ 1 + (α/2) ‖β‖ 2 2 . (17)

The elastic net optimization problem can actually be formulated as a LASSO optimization problem using augmented data. If we consider augmented data defined as X̃ = [ X ; √α I p ] ∈ R (n+1+p)×p and ỹ(τ ) = [ y(τ ) ; 0 ] ∈ R n+1+p , where I p ∈ R p×p is the identity matrix, 0 ∈ R p is a zero vector, and [ · ; · ] denotes row-wise stacking, then solving the elastic net optimization problem (17) for a fixed λ is equivalent to solving the following problem:

β(τ ) = argmin_β (1/2) ‖ ỹ(τ ) − X̃β ‖ 2 2 + λ ‖β‖ 1 ,

since ‖ ỹ(τ ) − X̃β ‖ 2 2 = ‖ y(τ ) − Xβ ‖ 2 2 + α ‖β‖ 2 2 .
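This augmented-data equivalence is easy to check numerically. The sketch below is illustrative: a tiny proximal-gradient (ISTA) solver stands in for whatever LASSO solver one prefers. It solves the LASSO on (X̃, ỹ) and then verifies the elastic net optimality condition X ᵀ(Xβ − y) + αβ + λs = 0, s ∈ ∂‖β‖ 1 , on the original data:

```python
import numpy as np

def ista_lasso(X, y, lam, iters=5000):
    """Proximal gradient (ISTA) for 0.5*||y - Xb||^2 + lam*||b||_1."""
    L = np.linalg.norm(X, 2) ** 2      # Lipschitz constant of the smooth part
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        z = b - X.T @ (X @ b - y) / L
        b = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-threshold
    return b

rng = np.random.default_rng(5)
n, p, lam, alpha = 40, 6, 2.0, 1.0
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

X_aug = np.vstack([X, np.sqrt(alpha) * np.eye(p)])   # X-tilde
y_aug = np.concatenate([y, np.zeros(p)])             # y-tilde
b = ista_lasso(X_aug, y_aug, lam)

# Elastic net stationarity on the ORIGINAL data:
# X'(Xb - y) + alpha*b + lam*s = 0 with s in the l1 subdifferential.
grad = X.T @ (X @ b - y) + alpha * b
active = np.abs(b) > 1e-8
print(np.allclose(grad[active], -lam * np.sign(b[active]), atol=1e-4),
      np.all(np.abs(grad[~active]) <= lam + 1e-4))  # expected: True True
```

The αI p rows make the augmented design full column rank, which is also why the elastic net stabilizes correlated interaction features.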
Step size (τ -path): If we consider two real values τ t and τ t+1 (τ t+1 > τ t ) at which the active set does not change and the signs also remain the same, then we can write

β Aτ t (τ t+1 ) = β Aτ t (τ t ) + ∆ ν Aτ t (τ t ), with ν Aτ t (τ t ) = ( X Aτ t ᵀ X Aτ t + αI |Aτ t | ) −1 (x n+1 ) Aτ t .

Note that here the only change compared to the vanilla SHIM is the addition of the αI |Aτ t | term in the expression of ν Aτ t . Now, one can also derive similar expressions for the step sizes of inclusion and deletion as done for the vanilla SHIM τ -path (9), considering the updated expression of ν Aτ t (τ t ).
Now, using Proposition 2, we can further simplify the above pruning condition, which yields (25). Similar to Lemma 1, one can formally prove Lemma 2 using (25) and Proposition 3.

C.1 Hyperparameter selection (synthetic data experiments)
Hyperparameter selection for all the methods (e.g., λ in LASSO and SHIM; hidden layer sizes in MLP) is done by 5-fold cross-validation on a separate set of 15 independent samples. For MLP, the activation is chosen from {identity, relu, logistic, tanh}, and the hidden layer sizes are chosen from all possible combinations of {50, 100, 150} nodes, considering both 2-hidden-layer and 3-hidden-layer architectures. The most frequent activation and architecture chosen over the separate set of 15 independent samples are used to report the average results. Similarly, for RF, the number of estimators (n_estimators) and the minimum number of samples in a leaf (min_samples_leaf) are chosen from {50, 100, 200} and {0.1, 0.05, 0.01}, respectively. The median λ value over the separate set of 15 independent samples is used to report the results of LASSO and SHIM. For the λ selection we considered a grid of 10 points in the range [λ_max/10, 0). For MLP and RF, we used the standard scikit-learn implementations.
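As a minimal sketch of the search spaces described above, the candidate grids can be enumerated as follows; the dictionary keys mirror the scikit-learn parameter names (`activation`, `hidden_layer_sizes`, `n_estimators`, `min_samples_leaf`), and the grids would then be passed to a 5-fold cross-validation routine.

```python
from itertools import product

# Candidate MLP architectures: all combinations of {50, 100, 150} nodes
# for both 2- and 3-hidden-layer networks (9 + 27 = 36 architectures).
node_options = (50, 100, 150)
architectures = [combo for depth in (2, 3)
                 for combo in product(node_options, repeat=depth)]

# Full MLP grid: every (activation, architecture) pair.
activations = ("identity", "relu", "logistic", "tanh")
mlp_grid = [{"activation": a, "hidden_layer_sizes": h}
            for a in activations for h in architectures]

# RF grid over the number of trees and the minimum leaf fraction.
rf_grid = [{"n_estimators": n, "min_samples_leaf": s}
           for n in (50, 100, 200) for s in (0.1, 0.05, 0.01)]
```

This yields 144 MLP candidates and 9 RF candidates per cross-validation run.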

C.3 Additional results
To further compare the performance of the proposed method, we considered highly sparse data (ζ = 0.6) in two experimental settings. (1) model-1: a model with strong signals, i.e., a true model with up to third-order interactions, defined as µ(x_i) = 5.0z_1 + 5.0z_1z_2 + 5.0z_1z_2z_3; and (2) model-2: a model with weak signals, i.e., a true model with up to third-order interactions, defined as µ(x_i) = 1.0z_1 + 1.0z_1z_2 + 1.0z_1z_2z_3. In both settings we set σ = 1. The choice of these models is merely for demonstration purposes; the proposed method is equally applicable to any chosen model. For both settings (strong and weak), we generated datasets of n ∈ {50, 100, 150} training instances, each accompanied by 10 test instances, with a covariate size of m = 5. We generated 3 such independent random datasets; hence, in total we report average results over 3 × 10 = 30 test instances. The hyperparameters were selected on another independent set of 3 samples using the same strategy as explained in the hyperparameter selection section of the appendix. For split-CP, we repeated the experiments 30 times to highlight the effect of randomness in the confidence-set generation. For SHIM, we did not specify any maximum pattern ("max_pat") size d, i.e., the entire search space is considered for exploration (tree generation), and the proposed tree-pruning condition takes care of the efficiency of the search. The best SHIM model is chosen automatically by the algorithm, corresponding to the best hyperparameter λ (explained in the hyperparameter selection section of the appendix).
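A minimal sketch of the synthetic data generation described above, with `signal=5.0` for model-1 and `signal=1.0` for model-2. Treating ζ as the probability that a covariate is zero is one plausible reading of the sparsity level and an assumption here, not the paper's actual code.

```python
import numpy as np

def generate_data(n, m=5, zeta=0.6, signal=5.0, sigma=1.0, rng=None):
    """Sample n instances of m binary covariates and responses from the
    third-order interaction model
        mu = signal * z1 + signal * z1*z2 + signal * z1*z2*z3.
    `zeta` is treated as P(z_j = 0), an assumed convention for the
    sparsity level. signal=5.0 gives model-1; signal=1.0 gives model-2."""
    rng = rng or np.random.default_rng(0)
    Z = (rng.random((n, m)) >= zeta).astype(float)  # sparse binary design
    mu = (signal * Z[:, 0]
          + signal * Z[:, 0] * Z[:, 1]
          + signal * Z[:, 0] * Z[:, 1] * Z[:, 2])
    y = mu + sigma * rng.standard_normal(n)  # additive Gaussian noise
    return Z, y

Z, y = generate_data(n=50)  # one model-1 training set
```

The same generator would be called three times with independent seeds to produce the three independent random datasets used for the averaged results.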
The results are shown in Fig. 3, where the left panel corresponds to model-1 (strong signal) and the right panel to model-2 (weak signal). In both settings, the full-CP methods (lasso_f, shim_f) produced more compact confidence sets irrespective of sample size. It can be observed that, in the case of highly sparse data, the SHIM model produces better results, particularly when the sample size is small and the signal is weak. Table 5 shows the true score (y), predicted score (ŷ), confidence interval (CI), length of the confidence interval (length), and the rule and coefficient (coef) of the individual components of a rule for five randomly chosen test instances of the COMPAS data. This demonstrates that the generated SHIM is easy to interpret, and one can easily judge the reliability of the model based on the provided confidence interval along with the point estimation.

Figure 3: Comparison of the confidence interval lengths (CI length) of the proposed method (shim) with other simple (lasso) and complex models (mlp, rf) using synthetic data for different sample sizes. The left panel shows the results for model-1 (strong signal) and the right panel for model-2 (weak signal).
Table 5: The SHIM model for 5 randomly chosen test instances of the COMPAS data. The true score (y), predicted score (ŷ), confidence interval, length of the confidence interval, and the rule and coefficient (coef) of the individual components of a rule are shown.
Figure 4: Variation of node counts ("fraction of node counts") for different max_pat (d) sizes. Here, the fraction of nodes for a specific d represents the number of nodes traversed divided by the total number of possible combinations of interaction terms with max_pat size d. The results are shown for two λ values (λ = 1, 10) and for three different sparsity levels (0.4, 0.7, 0.9). The pruning is more effective when the data is more sparse.

Fig. 4 shows the variation of node counts ("fraction of node counts") for different max_pat (d) sizes during the construction of the τ-path. One can observe that our pruning condition is more effective at the deeper nodes of the tree, as it exploits the tree's anti-monotonicity property. Table 6 shows the number of homotopy transition points (# kinks) along the τ-path. Although the worst-case complexity of the