Isotonic regression meets LASSO

Abstract: This paper studies a two step procedure for monotone increasing additive single index models with Gaussian designs. The proposed procedure is simple, easy to implement with existing software, and consists of consecutively applying LASSO and isotonic regression. Aside from formalizing this procedure, we provide theoretical guarantees regarding its performance: 1) we show that our procedure controls the in-sample squared error; 2) we demonstrate that one can use the procedure for predicting new observations, by showing that the absolute prediction error can be controlled with high probability. Our bounds show a tradeoff of two rates: the minimax rate for estimating high dimensional quadratic loss, and the minimax nonparametric rate for estimating a monotone increasing function.


Introduction
Linear regression modeling and least squares estimation are two closely related topics with pervasive applications in the field of statistics and beyond. Oftentimes, however, linear models are simply approximations to the true data generating mechanism. Nonparametric and semiparametric regressions present more flexible alternatives to the linear model, and have been studied extensively in the statistics literature in low dimensional regimes [36, 40, 21]. This paper focuses on the high dimensional, sparse, monotone increasing, additive, semiparametric generalization of the linear regression model:

Y = f(X⊤β∗) + ε, (1.1)

where the noise ε is independent of the design X. Here, the terms high dimensional and sparse refer to the fact that the vector β∗ ∈ R^p for a large ambient dimension p, but it is assumed to be sparse, i.e., its effective support (the set of its non-zero elements) is small; the terms semiparametric and monotone increasing refer to the "link" function f, which is assumed to be unspecified (unknown) and monotone increasing, while the term additive refers to the fact that the noise term ε is additive. In addition to assuming that f is increasing, in this paper we further suppose that f is a Lipschitz function; this is in general not necessary when studying models of the type (1.1), but it eases the theoretical analysis in the high dimensional regime. On an important note, model (1.1) is not fully identifiable, as multiplying β∗ by a constant and dividing f by the same constant generates the same outcome; similarly, adding a constant to f and subtracting it from ε yields the same outcome. However, (1.1) can be easily identified by assuming that f is increasing, fixing any function of β∗ which attains different values when scaling β∗ proportionally, such as any vector norm, and further assuming that the noise ε has zero mean. Notice that the linear regression model is a special case of model (1.1) where f is a linear function with positive slope.
Model (1.1) is also known as the monotone single index model (SIM), and is a special case of the general SIM, which imposes fewer restrictions on the outcome generation. Specifically, the general SIM framework assumes

Y = f(X⊤β∗, ε). (1.2)

Note that unlike (1.1), (1.2) makes no monotonicity assumptions on f, and the error term need not be additive. Assuming a general SIM (1.2) with Gaussian design X, and that the error term ε is independent of the design X, [32, 38] showed that a constrained variant of the LASSO [39] can successfully estimate β∗ (up to a scalar), while [30] showed that LASSO can recover the support of β∗, given that

Cov(Y, X⊤β∗) ≠ 0. (1.3)
A nice feature of model (1.1) is that condition (1.3) holds true by default. Indeed, it is not hard to check that in (1.1) with a non-constant increasing link function f, Gaussian design X ∼ N(0, Σ), and noise ε independent of the design, one has Cov(Y, X⊤β∗) = Cov(f(X⊤β∗), X⊤β∗) > 0. Therefore, under the above assumptions, given existing results, model (1.1) is amenable to a LASSO application for obtaining proportional estimates of the vector β∗.

Isotonic regression is a line of work which is highly relevant to model (1.1), yet very distinct from the aforementioned LASSO developments. Specifically, isotonic regression aims to estimate the increasing function f after observing n samples from the model

Y_i = f(X_i) + ε_i, i ∈ [n],

where the design points X_i ∈ R are considered fixed, and ε_i ∼ N(0, σ²) is Gaussian noise. The model fitting is done via the following optimization procedure:

argmin_{f increasing} (1/2n) Σ_{i=1}^n (Y_i − f(X_i))².

There has been a multitude of advances on isotonic regression [12, 45, 13, 8, 7, 4], where the authors showed estimation rates for general increasing, as well as for piecewise constant, f.

This manuscript puts the two ideas above into one procedure to tackle problems of the type (1.1). Specifically, we study a two step estimator, which starts by applying LASSO to obtain an initial estimate β̂ of β∗, then "plugs in" β̂ in place of β∗, effectively reducing the dimension from p to 1. In the second step the procedure uses isotonic regression to obtain an estimate of f. Finally, predicting a new observation is done on the basis of how close this observation is to a data point along the direction β̂. A detailed description of the procedure along with more in-depth motivation can be found in Section 2.
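As an illustration of the isotonic least squares problem above, the following sketch (with simulated data and a link function of our choosing) solves it with scikit-learn's pool-adjacent-violators implementation:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
x = np.sort(rng.normal(size=200))                     # fixed 1-dimensional design
y = np.tanh(2 * x) + rng.normal(scale=0.3, size=200)  # increasing link plus noise

# Solve argmin over increasing f of (1/2n) * sum_i (Y_i - f(X_i))^2
# via the pool-adjacent-violators algorithm
iso = IsotonicRegression(increasing=True)
f_hat = iso.fit_transform(x, y)                       # fitted values, nondecreasing in x
```

Since the fit is a Euclidean projection onto the convex cone of increasing sequences, and the true (increasing) link belongs to that cone, the fitted values are never farther from the truth than the raw observations.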

Contributions
This manuscript contains two main contributions, which are proved under the assumption that X_i⊤β∗ ∼ N(0, 1) and that the link function f is increasing and Lipschitz with Lipschitz constant L. In order to informally summarize our findings, recall that β̂ denotes the LASSO estimate of the direction β∗, and let f̂ denote the isotonic regression estimate of f, given a sample of n observations from (1.1). An informal version of our first main finding is that applying sequentially LASSO and isotonic regression gives the following bound on the in-sample squared ℓ₂ prediction error with high probability:

(1/n) Σ_{i=1}^n (f̂(X_i⊤β̂) − f(X_i⊤β∗))² ≲ r₁ + r₂. (1.4)

In the above inequality we observe the tradeoff of two rates r₁ and r₂, which, omitting constants for simplification, are

r₁ ≍ n^{−2/3}, r₂ ≍ L² s log p / n.
These two rates have natural interpretations: r₁ is the general nonparametric rate of isotonic regression for increasing functions as shown in [8, 4]; r₂ is the standard LASSO estimation rate for the squared ℓ₂ norm [5]. Our second main contribution is to derive guarantees for the conditional mean integrated absolute error (CMIAE), which is a measurement of the prediction capability of our estimate. The CMIAE is defined as

CMIAE = E_{X|D} [ |f̂(X⊤β̂) − f(X⊤β∗)| | X⊤β̂ ∈ [min_i X_i⊤β̂, max_i X_i⊤β̂] ],

and measures the absolute integrated error of prediction for a new observation X, conditional on the new observation lying within the data estimated range. Informally, we can show that up to constant factors, with high probability, the CMIAE satisfies:

CMIAE² ≲ r₁ + r₂.

In other words, conditional on a new observation falling within the range [min_i X_i⊤β̂, max_i X_i⊤β̂], the squared mean absolute integrated error obeys a similar inequality to the one in (1.4). The difference between (1.4) and the inequality in the preceding display is that in (1.4) we evaluate the performance of the estimator only on the dataset, while the CMIAE inequality shows that the estimator performs well on new observations. This is a non-trivial distinction, both in practice and in theory. The major difficulty in showing the above two results stems from the fact that the data is used first to estimate β̂, and therefore in the second step, when one applies isotonic regression, complicated data dependencies may occur.
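To get a feel for the tradeoff between the rates r₁ and r₂ appearing in the bounds above, one can tabulate the two (constant-free) rates for a few problem sizes; the numbers below are illustrative choices of ours, not from the paper:

```python
import math

def r1(n):
    # nonparametric isotonic-regression rate
    return n ** (-2 / 3)

def r2(n, s, p, L=1.0):
    # LASSO squared-l2 estimation rate
    return L ** 2 * s * math.log(p) / n

n = 1000
for s, p in [(1, 100), (20, 10_000), (50, 100_000)]:
    print(f"s={s:3d} p={p:7d}  r1={r1(n):.4f}  r2={r2(n, s, p):.4f}")
```

For very sparse signals the nonparametric term r₁ dominates the sum, while for denser signals the LASSO term r₂ takes over.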

Related work
Estimation for SIMs in the case when p is fixed has been studied extensively in the literature [see e.g., 42, 20, 31, 28, among others]. Recently there have been active developments for high dimensional SIMs. The first work on high dimensional SIMs was [2], where the authors proposed a PAC-Bayesian approach. [32] and later [38] demonstrated that under condition (1.3), running the least squares with ℓ₁ regularization can obtain a consistent estimate of the direction of β∗, while [30] showed that this procedure also recovers the signed support of the direction. Even more recently, [16, 43] extended some of those ideas to cases where the design has a known elliptical and not necessarily Gaussian distribution, yet none of these works offer estimation of the unknown function f. It is worth mentioning that condition (1.3), which is instrumental in the aforementioned papers, traces its roots to the seminal works [26, 25, 10] in the area of sufficient dimension reduction.
Some of the recent literature focuses not only on recovering the direction β∗ but also aims at estimating f, which is also the main goal of the present paper. [33], e.g., proposed a nonparametric least squares with an equality ℓ₁ constraint to handle simultaneous estimation of β∗ and f. This procedure is computationally challenging, and in order to establish estimation guarantees, [33] assumes that f is smooth; the rates obtained by [33] are suboptimal for monotone links f. Furthermore, these rates do not hold in the "ultra-high" dimensional setting p ≫ n, while our procedure provably works even in such extreme situations. [17] considered a smoothed-out U-process type of loss function with ℓ₁ regularization, and proved that their approach works for a model class which encompasses the monotone SIM (1.1). In contrast to [17], however, our nonparametric procedure requires only one tuning parameter, less computation, and has better estimation guarantees (albeit within a smaller model class). The algorithm proposed by [17] was in part inspired by one of the seminal works studying models with monotone link functions in the low dimensional setting, [27]. [27] proposed the maximum score estimator of the multinomial choice model, which is the minimizer of a non-smooth loss function, and established the consistency of this estimator. Later, [18, 19] suggested a smoothed version of [27]'s loss function and showed that the smoothed-out version can have computational and estimation benefits [see 37, for an overview].
Regularized procedures have also been proposed for specific choices of f and Y. For example, [44] considered the model Y = f(X⊤β∗) + ε with a known continuously differentiable and monotonic f, and developed estimation and inferential procedures based on the ℓ₁ regularized quadratic loss. None of the aforementioned works use isotonic regression to directly address prediction and simultaneous estimation of f and β∗ for monotone SIMs. That being said, this is not the first paper to consider an application of isotonic regression to SIMs. The use of isotonic regression in SIMs dates back to [29], where the authors proposed a heuristic approach of applying isotonic regression to SIM estimation. More recently, [22, 23] formalized the Isotron algorithm, which is a combination of the perceptron and isotonic regression. The authors gave estimation guarantees which are suboptimal compared to the ones we provide. [3, 9] give predictions for monotone SIMs, albeit not in the high dimensional sparse setting. A heuristic procedure for variable selection using LASSO, isotonic regression and kernel regression was given by [14].
In conclusion, even though there has been a flurry of works studying SIMs, monotone SIMs, and solving SIMs with isotonic regression, to the best of our knowledge our work is the first to derive estimation and prediction guarantees for monotone SIMs with Gaussian designs in a high dimensional setting, and more specifically to study the behavior of isotonic regression after LASSO has been applied.

Notation
Throughout the paper we adopt the convenient notation [n] = {1, 2, . . . , n}. For a vector u = (u₁, u₂, . . . , u_n) ∈ R^n, let u↑ denote the vector (u_(1), u_(2), . . . , u_(n)) of order statistics: u_(1) ≤ u_(2) ≤ · · · ≤ u_(n). Given model (1.1), we define shorthand notation for several population quantities under the assumptions that X⊤β∗ ∼ N(0, 1) and EY⁴ < ∞, so that all of these quantities are finite and well defined. In addition we assume EY⁴ does not scale with s, n, p asymptotically, which in turn ensures that all of these quantities are bounded and do not scale above a constant. For a sample of n observations {X_i}_{i∈[n]} of vectors X_i ∈ R^p, we define the n × p matrix X which stacks the vectors X_i into its rows. We will often use the sample covariance notation Σ̂ = (1/n) X⊤X. For a symmetric and positive semi-definite matrix Σ, we use Σ^{1/2} to denote its symmetric square root, i.e., if we have the eigendecomposition Σ = UDU⊤ then Σ^{1/2} = UD^{1/2}U⊤. The notations n and p are reserved for the sample size and the dimensionality of the signal vector β∗. Throughout the paper we assume that the vector β∗ is s-sparse, i.e., it has only s non-zero entries. We reserve S to denote the support of β∗, i.e., the set of all non-zero coefficients of β∗; hence |S| = |supp(β∗)| = s. Similarly, S^c will be used to denote [p] \ S, the set of zero coefficients of β∗. For a vector v ∈ R^p and a set C ⊂ [p], the vector v_C is the restriction of v to C, i.e., it contains only the entries of v which belong to C. Similarly, for a matrix M ∈ R^{p₁×p₂}, double indexing M_{C₁C₂} takes the entries of M with rows in C₁ and columns in C₂. For a vector v ∈ R^k and q ≥ 1 we use the standard notation ‖v‖_q = (Σ_{i=1}^k |v_i|^q)^{1/q} for the ℓ_q norm. For brevity we let ‖M‖₂ = ‖M‖_{2→2}, and let ‖M‖_max = max_{ij} |M_{ij}|. We use I_d to denote the d-dimensional identity matrix, where sometimes the index d will be omitted when the dimension is clear. We use standard asymptotic notation for sequences.
Given two sequences {a_n}, {b_n}, we write a_n = O(b_n) if there exists a constant C such that |a_n| ≤ C|b_n|; a_n = o(b_n) if a_n/b_n → 0; and a_n ≍ b_n if there exist positive constants c and C such that c < a_n/b_n < C. We also use a_n ≲ b_n and a_n ≪ b_n as shorthand for a_n = O(b_n) with a_n, b_n > 0 and a_n = o(b_n) with a_n, b_n > 0, respectively.

Organization
The paper is organized as follows. Section 2 formally states our procedure for monotone SIM prediction and gives initial motivation. Section 3 is dedicated to studying the in-sample prediction error of our procedure. Section 4 studies the prediction of our algorithm on a new observation. Finally Section 5 provides thorough numerical studies confirming the predictions of the main results of Sections 3 and 4. The majority of the proofs are relegated to the appendices.

Background and methodology
We consider model (1.1):

Y = f(X⊤β∗) + ε,

where f is a monotone increasing and L-Lipschitz function (recall that a function f : R → R is L-Lipschitz when for any two values x, y ∈ R we have |f(x) − f(y)| ≤ L|x − y|), X ∼ N(0, Σ), and ε ∼ N(0, σ²) is noise independent of X. For identifiability we assume ‖Σ^{1/2}β∗‖₂ = 1, which implies that X_i⊤β∗ ∼ N(0, 1). We propose the following two step estimation procedure:
• Step I. Let β̂ be the solution to

β̂ = argmin_{β∈R^p} (1/2n) ‖Y − Xβ‖₂² + λ‖β‖₁, (2.1)

for an appropriately chosen λ.

• Step II. Let π be a permutation sorting the values X_i⊤β̂ in an increasing order, where ties are broken arbitrarily. In other words, let π be such that X_{π_i}⊤β̂ ≤ X_{π_j}⊤β̂ for all i ≤ j. Fit an isotonic regression

f̂ = argmin_{f₁ ≤ f₂ ≤ · · · ≤ f_n} (1/2n) Σ_{i=1}^n (Y_{π_i} − f_i)²,

or in other words, let f̂(x) be the coefficient in the vector f̂ corresponding to the index argmin_{i : x ≥ X_{π_i}⊤β̂} (x − X_{π_i}⊤β̂) when x ≥ X_{π_1}⊤β̂, and f̂₁ otherwise.
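The two steps can be sketched in a few lines of Python with scikit-learn (our sketch; the link function, noise level, and the choice λ = √(log p / n) are illustrative assumptions, and the paper's own experiments use R with cross-validated λ):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(1)
n, p, s = 500, 200, 5
beta_star = np.zeros(p)
beta_star[:s] = 1 / np.sqrt(s)          # ||Sigma^{1/2} beta*||_2 = 1 for Sigma = I
X = rng.normal(size=(n, p))
f = lambda t: np.tanh(2 * t)            # a monotone increasing, Lipschitz link
y = f(X @ beta_star) + 0.3 * rng.normal(size=n)

# Step I: l1-penalized least squares; sklearn's Lasso minimizes
# (1/2n)||y - Xb||_2^2 + alpha*||b||_1, matching (2.1) with alpha = lambda
beta_hat = Lasso(alpha=np.sqrt(np.log(p) / n)).fit(X, y).coef_

# Step II: isotonic regression of y on the 1-dimensional projections X beta_hat;
# out_of_bounds="clip" reproduces the rule of extending by the nearest fitted value
proj = X @ beta_hat
iso = IsotonicRegression(increasing=True, out_of_bounds="clip").fit(proj, y)

# Predict new observations via their projections along beta_hat
X_new = rng.normal(size=(10, p))
y_pred = iso.predict(X_new @ beta_hat)
```

Note that scikit-learn's Lasso objective uses exactly the 1/(2n) scaling of (2.1), so its `alpha` parameter plays the role of λ.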
There are (at least) two ways one can go about implementing the above procedure. The first way (and the one used to state the algorithm) is to run both Step I and Step II on the entire dataset D. The second way is to split the data D into two halves D₁ and D₂, and run Step I on D₁ and Step II on D₂. The compelling reason for running the full data procedure is clear: by using the full data one takes advantage of the entire dataset when estimating both β∗ and f. A compelling reason for the second procedure is the fact that the estimate β̂ would be derived independently from the second half of the data, thus eliminating any potential "cherry-picking" when estimating f.
Multiple technical challenges arise when deriving guarantees for the full data procedure. The main issue lies in the fact that the data is used first for estimating β̂, implying that the permutation π depends on the data. Consequently, the distribution of the error terms {ε_{π_i}}_{i=1}^n becomes complicated. The situation does not become easier even when one conditions on the random design X_i, since the estimates {X_i⊤β̂}_{i=1}^n depend on the error terms ε_i. Data splitting is a simple way to remedy this effect. However, while splitting the data effectively breaks the dependency of the permutation π on the error terms, it may require as much as 1.5 times the number of samples to achieve the performance of the full data procedure. We will illustrate this effect on simulated data in Section 5. In theory we will argue that both procedures produce estimates satisfying the same rates, although under slightly different conditions; furthermore, the data splitting procedure is significantly easier to justify compared to the full data version.
In practice, Step I may benefit from a careful data dependent transformation, for example scaling and centering the outcome values. Before we formally justify the procedure and show some guarantees, we first provide general motivation. To see why optimizing (2.1) is useful for estimating β∗, recall that we assume that ε is independent of X and that X ∼ N(0, Σ) (the latter is not crucial, as long as X has an elliptical distribution). Using the closed form solution for the least squares we have

argmin_{β∈R^p} E(Y − X⊤β)² = Σ^{−1} E[Xf(X⊤β∗)] = Σ^{−1/2} E[X̃f(X̃⊤β̃∗)],

where X̃ = Σ^{−1/2}X and β̃∗ = Σ^{1/2}β∗. In terms of this notation X̃ ∼ N(0, I). Hence, by the properties of the Gaussian distribution, multiplying E[X̃f(X̃⊤β̃∗)] by any vector perpendicular to β̃∗ equals 0. Therefore it follows that E[X̃f(X̃⊤β̃∗)] ∝ β̃∗ and hence argmin_{β∈R^p} E(Y − X⊤β)² ∝ β∗. Now, argmin_{β∈R^p} E(Y − X⊤β)² ≠ 0, since supposing the contrary leads to Cov(Y, X⊤β∗) = 0, which cannot hold for a monotone non-constant f. Hence the population minimizer of the least squares is proportional to the true direction β∗, which serves as motivation for (2.1), where the ℓ₁ norm penalty is added to help with the high dimensionality of β∗. Applying isotonic regression after (2.1) is natural, since we are assuming that f is monotone increasing, and (2.1) gives us a plugin estimate which we can use to reduce the dimensionality from p to 1, as required in isotonic regression.
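The population identity E[X̃f(X̃⊤β̃∗)] ∝ β̃∗ is easy to check by Monte Carlo; the sketch below uses an identity covariance and an increasing link of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(2)
p, n = 5, 200_000
beta = np.array([3.0, -1.0, 2.0, 0.0, 0.5])
beta /= np.linalg.norm(beta)               # unit norm; with Sigma = I, beta~ = beta

X = rng.normal(size=(n, p))                # X ~ N(0, I)
f = lambda t: np.tanh(t) + 0.2 * t         # an arbitrary increasing link (our choice)
v = X.T @ f(X @ beta) / n                  # Monte Carlo estimate of E[X f(X^T beta)]

cos = v @ beta / np.linalg.norm(v)         # cosine with beta; close to 1
```

The cosine between the estimated vector and β is essentially 1, confirming the proportionality (by Stein's lemma the proportionality constant is E f′(Z) for Z ∼ N(0, 1)).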
In the next section we explore the in-sample prediction error of our procedure.

In-sample prediction
The main result outlined in this section shows the usefulness of the procedure motivated in Section 2, for the class of monotone increasing and Lipschitz link functions. All results presented in the main text hold when using the entire dataset for both steps I and II. In the supplement (see Appendix D) we argue that if one is willing to split the data some of the assumptions can be relaxed. We begin by showcasing (see Figure 1) three simple examples of monotone increasing Lipschitz functions, which we will later on use for simulation purposes.
In order to guarantee that the full data procedure is successful, we first ensure that for an appropriately chosen λ the estimate β̂ is sufficiently close to β∗, and that, moreover, its support is a subset of the support of β∗. The latter property of β̂ might appear unnecessary, but it allows us to carefully isolate the dependency of the error terms ε_{π_i} on β̂. To ensure that the estimate β̂ obeys these properties, we will require that the entries of X are not overly correlated. Below we use standard assumptions, which are often adopted in the high dimensional statistics literature.
To this end recall that Σ is the covariance matrix of X and let S := supp(β∗). Partition Σ (with a slight abuse of notation) as

Σ = [ Σ_SS  Σ_SS^c ; Σ_S^cS  Σ_S^cS^c ],

where Σ_SS corresponds to the covariance of X_S. Define the conditional covariance Σ_{S^c|S} = Σ_{S^cS^c} − Σ_{S^cS}Σ_{SS}^{−1}Σ_{SS^c}.

Assumption 3.1. We assume the following conditions: (i) ‖Σ_{S^cS}Σ_{SS}^{−1}‖_∞ ≤ 1 − κ for some κ > 0, where ‖·‖_∞ denotes the maximum row-wise ℓ₁ norm; (ii) λ_min ≤ λ_min(Σ_SS) ≤ λ_max(Σ_SS) ≤ λ_max for constants λ_min, λ_max > 0; (iii) max_j [Σ_{S^c|S}]_{jj} ≤ Ω. Furthermore, suppose that λ_min, λ_max, κ, Ω do not scale with (n, p, s).
The first condition in Assumption 3.1 is a somewhat stringent requirement on the covariance matrix, which controls how much correlation of X is allowed outside of the true support S. It is used to ensure that the support of the first step estimate is embedded within the true support of β∗ for large enough values of λ. Furthermore, it allows us to control the dependency of the error terms ε_i on β̂. This is likely not a necessary condition, but it facilitates our analysis greatly. The remaining two conditions in Assumption 3.1 are milder, and they require boundedness of the eigenvalues of Σ_SS and boundedness of the diagonal elements of Σ_{S^c|S}. While Assumption 3.1 depends on the true support of β∗, there are matrices which satisfy it universally for any support set S. For example, the identity matrix clearly satisfies Assumption 3.1 for any S. More generally, so does any Toeplitz matrix whose (i, j)th element is of the form ρ^{|i−j|} for 0 < |ρ| < 1.

Assumption 3.2. The outcome has a bounded fourth moment: EY⁴ = O(1).

Assumption 3.2 is mild, as it merely requires that the outcome has a bounded 4th moment. The main result of this section, outlined below, shows that our estimator controls the in-sample prediction error over the observed sample.
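Assuming the first condition takes the usual irrepresentable form ‖Σ_{S^cS}Σ_{SS}^{−1}‖_∞ ≤ 1 − κ (our reading of the condition), it can be checked numerically for the Toeplitz family ρ^{|i−j|} with an arbitrary (hypothetical) support set:

```python
import numpy as np

def irrepresentable_gap(Sigma, S):
    """kappa = 1 minus the max row-wise l1 norm of Sigma_{S^c S} Sigma_{SS}^{-1}."""
    Sc = np.setdiff1d(np.arange(Sigma.shape[0]), S)
    M = Sigma[np.ix_(Sc, S)] @ np.linalg.inv(Sigma[np.ix_(S, S)])
    return 1.0 - np.abs(M).sum(axis=1).max()

p, rho = 50, 0.5
Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
S = np.array([3, 10, 11, 40])      # an arbitrary support set (hypothetical)
kappa = irrepresentable_gap(Sigma, S)
```

For the identity matrix the gap equals exactly 1 (the off-support block is zero), consistent with the remark above.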

Theorem 3.3. Suppose Assumptions 3.1 and 3.2 hold and that the monotone increasing function f is L-Lipschitz.
In addition, let s/n ≤ 1/64, assume that the effective sample size n/(s log p) is large enough, and take the tuning parameter λ = C√(log p / n) for a sufficiently large constant C > 0. Then for some constants Ω₁, Ω₂, Ω₃ > 0, with high probability,

(1/n) Σ_{i=1}^n (f̂(X_i⊤β̂) − f(X_i⊤β∗))² ≤ Ω₁ ((f(X_{π_n}⊤β̂) − f(X_{π_1}⊤β̂)) / n)^{2/3} + Ω₂ L² s log p / n + Ω₃ s log p / n. (3.1)

Denote the rate on the right hand side of (3.1) by R_F, where F stands for the "full data" procedure.
The first rate, Ω₁((f(X_{π_n}⊤β̂) − f(X_{π_1}⊤β̂))/n)^{2/3}, is the standard nonparametric rate for approximating a monotone function [8, 4]. The rate Ω₂L² s log p/n + Ω₃ s log p/n matches the minimax rate (up to log factors) of estimating the parameter of a high dimensional linear model in squared ℓ₂ norm [5]. Notably, the rate in (3.1) depends on the expression f(X_{π_n}⊤β̂) − f(X_{π_1}⊤β̂). In a case when f is bounded, one can upper bound this difference independently of the values X_{π_n}⊤β̂ and X_{π_1}⊤β̂. However, in many cases, such as in the linear regression model, the boundedness assumption is violated. Under the assumption that f is L-Lipschitz, we know that

f(X_{π_n}⊤β̂) − f(X_{π_1}⊤β̂) ≤ L(X_{π_n}⊤β̂ − X_{π_1}⊤β̂). (3.2)

Since β̂ is close to β∗, one would anticipate that the two quantities X_{π_1}⊤β̂ and X_{π_n}⊤β̂ behave like min_i X_i⊤β∗ and max_i X_i⊤β∗. However, this is not immediately obvious, due to the complicated dependency of β̂ on the data. Our next result, Lemma 3.5, argues that under Assumptions 3.1 and 3.2 this is indeed the case, assuming additionally that s log p/n = O(1); its statement uses the n × p matrix X which stacks the n vectors X_i into its rows.

We now discuss (3.2) in view of Lemma 3.5. Let Z₁, . . . , Z_n be i.i.d. draws from a standard normal distribution. Since β∗ is a fixed vector with X_i⊤β∗ ∼ N(0, 1), the order statistics [Xβ∗]_(1) and [Xβ∗]_(n) are distributed as Z_(1) and Z_(n); for the concentration of these order statistics we refer to [Theorem 5.8 of 6]. In addition, using extreme value theory it is known that asymptotically EZ_(n) ≈ √(2 log n) and EZ_(1) ≈ −√(2 log n) [15]. By (3.3), it follows that with high probability [Xβ̂]_(n) ≤ C₁√(log n) for some constant C₁. Naturally, on the same event a similar high confidence bound holds for [Xβ̂]_(1). Therefore, continuing (3.2), with high probability f(X_{π_n}⊤β̂) − f(X_{π_1}⊤β̂) ≤ 2C₁L√(log n). The minimax rate is s log(p/s)/n [34], which for most values of s is of the order of s log p/n.
Here, to be precise, we say that s log p/n is the minimax rate up to logarithmic factors.

The above, in conjunction with (3.2), implies that even in the unbounded f case the first term of (3.1) is with high probability at most of order (L√(log n)/n)^{2/3}.
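The extreme value approximation EZ_(n) ≈ √(2 log n) invoked above is easy to confirm by simulation (the sample sizes here are our own choices):

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 10_000, 200
maxima = rng.normal(size=(reps, n)).max(axis=1)   # independent maxima of n normals

ratio = maxima.mean() / np.sqrt(2 * np.log(n))    # close to (though slightly below) 1
```

The ratio sits within roughly 10% of 1 already at n = 10⁴; the approximation improves slowly, since the correction terms decay only like log log n / √(log n).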

Prediction
Inequality (3.1) ensures that the Isotonic LASSO estimate does not behave erratically on the observed data points, by guaranteeing that the average in-sample squared error is small. On the other hand, (3.1) says nothing about predicting future observations, and in that sense is not completely satisfactory. What can one say about the average case error when predicting a fresh new sample? In classical nonparametric theory, one would typically analyze the Mean Integrated Squared Error (MISE) [40]. In this section, we confine ourselves to a slightly weaker, data dependent criterion which we refer to as the Conditional Mean Integrated Absolute Error (CMIAE). For a new observation X ∼ N(0, Σ), the CMIAE is defined conditionally on the observed data D:

CMIAE = E_{X|D} [ |f̂(X⊤β̂) − f(X⊤β∗)| | X⊤β̂ ∈ [X_{π_1}⊤β̂, X_{π_n}⊤β̂] ].

In the above, the conditional expectation E_{X|D} is taken with respect to the new observation X conditionally on the data D, which fixes the quantities f̂, β̂, X_{π_1}⊤β̂ and X_{π_n}⊤β̂. Since X is independent of the dataset, this notation is equivalent to simply taking the expectation with respect to X: E_X. However, we will use the notation E_{X|D} to underscore that this expectation is a function of the dataset D (hence a random variable with respect to the randomness of the data). Several remarks regarding the definition of CMIAE are in order. First, CMIAE measures the integrated absolute error in prediction, conditionally on the new observation's projection along the estimate β̂ falling within the "data range" [X_{π_1}⊤β̂, X_{π_n}⊤β̂]. Importantly, the data range we use in the CMIAE is procedure dependent, as it is conditional on the first step estimate β̂. Furthermore, the condition X⊤β̂ ∈ [X_{π_1}⊤β̂, X_{π_n}⊤β̂] excludes "extreme" new observations X from consideration in order to avoid extrapolation. Given that β̂ is a good proxy for β∗, intuitively one should be skeptical of the estimate f̂ beyond the observed range [X_{π_1}⊤β̂, X_{π_n}⊤β̂].
If one assumes that f is bounded, this condition can be traded off for an additional error term which depends on ‖f‖_∞ and the probability of observing an "extreme" point X such that X⊤β̂ ∉ [X_{π_1}⊤β̂, X_{π_n}⊤β̂]. We do not present this result here.
Second, if in analogy to the CMIAE we define the Conditional MISE (CMISE) as the corresponding squared error,

CMISE = E_{X|D} [ (f̂(X⊤β̂) − f(X⊤β∗))² | X⊤β̂ ∈ [X_{π_1}⊤β̂, X_{π_n}⊤β̂] ],

then by Jensen's inequality we obtain CMIAE ≤ CMISE^{1/2}. The inequality above shows that any bound on the CMISE implies a bound on the CMIAE, and therefore the CMIAE is a weaker measure of prediction. It turns out (unsurprisingly) that the CMIAE is more amenable to analysis for the isotonic LASSO algorithm, and we defer the analysis of the CMISE to future work.
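Both quantities are straightforward to approximate by Monte Carlo, and the Jensen inequality between them can be observed empirically; the sketch below reuses the two-step fit of Section 2 (all simulation choices, including the link and the value of λ, are ours):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(4)
n, p, s = 400, 100, 4
beta_star = np.zeros(p)
beta_star[:s] = 0.5                    # ||beta*||_2 = 1, Sigma = I
f = lambda t: np.tanh(2 * t)

# fit the two-step estimator of Section 2
X = rng.normal(size=(n, p))
y = f(X @ beta_star) + 0.3 * rng.normal(size=n)
beta_hat = Lasso(alpha=np.sqrt(np.log(p) / n)).fit(X, y).coef_
proj = X @ beta_hat
iso = IsotonicRegression(increasing=True, out_of_bounds="clip").fit(proj, y)

# Monte Carlo CMIAE/CMISE: fresh observations, keeping only those whose
# projection lands inside the data range [min proj, max proj]
X_new = rng.normal(size=(5000, p))
t = X_new @ beta_hat
inside = (t >= proj.min()) & (t <= proj.max())
err = iso.predict(t[inside]) - f(X_new[inside] @ beta_star)
cmiae = np.abs(err).mean()
cmise = (err ** 2).mean()
```

By the Cauchy–Schwarz inequality the empirical means satisfy cmiae ≤ √cmise exactly, mirroring the population inequality above.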
Below we state and prove the main result on prediction. To this end recall that R_F denotes the rate on the right hand side of (3.1).

Theorem 4.1. Suppose the conditions of Theorem 3.3 hold, and assume additionally that s log n ≪ √n. Then with high probability

CMIAE ≤ C √R_F, (4.2)

for some absolute constant C.
Before we turn to the proof of this theorem, we comment on a related result, Theorem 6.1 of [3]. Cast into our framework, informally speaking, Theorem 6.1 of [3] implies that CMISE^{1/2} is of order n^{−1/3} log(n)^{5/3} (see also the remark after that theorem), given that p is fixed, the function f is bounded, and an estimate β̂ of β∗ is available such that ‖β̂ − β∗‖₂ = O_p(n^{−1/3}). Even if one assumes that s and p are both fixed, one cannot derive Theorem 4.1 as a corollary of Theorem 6.1 of [3] using (4.1). The reason is that the rate [3] provide has an extraneous logarithmic factor (Theorem 4.1 gives the sharp rate of order n^{−1/3}), and the estimate β̂ in Theorem 6.1 of [3] is obtained from a sample splitting procedure. Our proof strategy is fundamentally different from that of Theorem 6.1 of [3], and utilizes the properties of the Gaussian distribution.
Proof of Theorem 4.1. Without loss of generality we will assume that the vector β̂ is the same as the one generated in Step II' of the proof of Theorem 3.3. By Proposition B.2 this is a high probability event (here and below, "high probability" is measured in terms of the randomness of the dataset D), and hence we can assume it holds without loss of generality. For brevity let x_i := X_{π_i}⊤β̂. Using our shorthand notation, the triangle inequality splits the error into two terms, (4.3) and (4.4). To control the first term (4.3), we apply the law of total expectation, where P_{X|D} denotes the probability of the new observation X given the dataset D. For the second term (4.4), we first note that by the law of total expectation it decomposes into two terms I₁ and I₂ (4.5). We will now handle the terms I₁ and I₂ individually. We start with I₁, which is simpler. Using the Lipschitz property of f and (B.3), we have a bound on I₁ with high probability. Put c = [β̂⊤Σβ̂]^{1/2} = ‖Σ^{1/2}β̂‖₂ for brevity. Note that conditionally on the data D, X⊤β̂ | D ∼ N(0, c²), since X is independent of the data and β̂ is fixed given D.
By (B.3) and the triangle inequality we know that |c − 1| ≤ C₂√(s log p/n). Thus, using Lemma 3.5 and the arguments after its statement, for sufficiently large n and n/(s log p) we obtain a lower bound in terms of the order statistics Z_(1) and Z_(n) of n standard normal samples and the standard normal cdf Φ. We conclude that with high probability (measured in terms of the randomness of the dataset D) the term I₁ is controlled. Next we tackle the term I₂. By the definition of f̂, when x ∈ [x_i, x_{i+1}] we have f̂(x) = f̂_i, while f(x_i) ≤ f(x) ≤ f(x_{i+1}), where the first and last inequalities hold since f is monotone increasing. Therefore, by (4.5), I₂ splits into two terms I₂₁ and I₂₂, which we handle in turn below, starting with I₂₁. Recall that c = [β̂⊤Σβ̂]^{1/2} = ‖Σ^{1/2}β̂‖₂, and that conditionally on D, X⊤β̂ | D ∼ N(0, c²), since X is independent of the data and β̂ is fixed given D.
By the Cauchy–Schwarz inequality and the triangle inequality we bound I₂₁, and I₂₂ is handled analogously. Next we need the following result (Proposition 4.2), where c is some absolute constant.
Since we assume s log(n) ≪ √n, the above probability converges to 1 asymptotically. Using (B.2) of Proposition B.1, on an event of high probability we can control the remaining term. Combining this with our result on I₁, the fact that P_{X|D}(X⊤β̂ ∈ [x₁, x_n]) ≥ 1/2 with high probability, and adjusting the constant C, completes the proof.

Remark 4.3. Aside from the conditions required by Theorem 3.3, Theorem 4.1 additionally requires that s log n ≪ √n. This means that the sparsity of β∗ has to be significantly smaller than the square root of the sample size. This condition is likely an artifact of our proof technique. The bottleneck is Proposition 4.2, which requires s log n ≪ √n in order for the term on the left hand side of (4.6) to be bounded. A similar result holds for the data splitting procedure, without the need to require s log n ≪ √n. For details we refer the reader to Appendix D.

Numerical experiments
In this section we present numerical studies with the three increasing Lipschitz link functions showcased in Figure 1 and X ∼ N(0, I). To enforce stability and robustness in the estimation Step I, i.e., the LASSO step, we scale and center the outcome values before running LASSO. The LASSO is run via the glmnet package, and we use 10-fold cross validation to select the tuning parameter by minimizing the Mean Squared Error (MSE). Isotonic regression is run using the isoreg function of the stats package. In Figure 2 we show the estimator f̂ in red. The x-axis consists of estimated values X_i⊤β̂, while the y-axis shows the true values Y_i. The collection of blue points is the set {(X_i⊤β̂, Y_i)}_{i∈[n]}, i.e., their y-axis is the true Y_i value and their x-axis is the estimated value X_i⊤β̂ (not the actual value X_i⊤β∗). The red curve is the isotonic regression fit, containing the points {(X_i⊤β̂, f̂(X_i⊤β̂))}_{i∈[n]}; the dashed purple curve connects the points {(X_i⊤β̂, f(X_i⊤β∗))}_{i∈[n]}. We see that in all three instances the red curve follows closely the purple dashed curve, giving visual evidence that isotonic regression after LASSO works well for estimating the link function f on the dataset.

Next we provide numerical evidence which directly corroborates the theoretical findings of Sections 3 and 4. Our setup is the following. For each of the three link functions of Figure 1, we set the sample size n = 500, and vary freely the dimension p and the sparsity s in the sets {100, 200, 500, 1000} and {5, 10, 15, 20, 25, 30, 40} respectively. This results in a range of possible values for the adjusted sample size n/(s log p). The error term, as in the previous example, is set to ε ∼ N(0, 1),

where ρ = 0, 0.2 and 0.5. The covariance matrix is given by Σ_{ρ,ij} = ρ^{|i−j|} (where we understand 0⁰ = 1). In all occasions the signal vector β∗ is initially selected with equal positive entries, and is properly normalized so that ‖Σ_ρ^{1/2}β∗‖₂ = 1. As before, the LASSO step is run using scaled and centered outcome values; the tuning parameter λ is selected via 10-fold cross validation of the MSE.
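The design just described can be generated as follows (a numpy sketch of our own; the paper's experiments use R, and the normalization uses the symmetric square root so that ‖Σ^{1/2}β∗‖₂ = 1 and hence X_i⊤β∗ ∼ N(0, 1)):

```python
import numpy as np

def make_design(n, p, s, rho, rng):
    # Toeplitz covariance with (i, j) entry rho^|i-j| (0^0 = 1)
    Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    # symmetric square root via the eigendecomposition Sigma = U D U^T
    w, U = np.linalg.eigh(Sigma)
    Sigma_half = U @ np.diag(np.sqrt(np.clip(w, 0, None))) @ U.T
    beta = np.zeros(p)
    beta[:s] = 1.0                              # equal positive entries
    beta /= np.linalg.norm(Sigma_half @ beta)   # enforce ||Sigma^{1/2} beta||_2 = 1
    X = rng.normal(size=(n, p)) @ Sigma_half    # rows ~ N(0, Sigma)
    return X, beta

rng = np.random.default_rng(5)
X, beta = make_design(n=500, p=200, s=10, rho=0.5, rng=rng)
z = X @ beta                                    # approximately N(0, 1) samples
```

The sample standard deviation of the projections z is close to 1, confirming the normalization.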
To evaluate the accuracy of the algorithm, we calculated both the square root in-sample prediction error (the square root of the left hand side of (3.1)) over 500 repetitions, and the CMIAE over 500 repetitions with 1000 fresh samples in each repetition. Since Theorems 3.3 and 4.1 bound the square root in-sample prediction error and the CMIAE with √(s log p/n), we compared the observed values to the value of √(s log p/n); see Figure 3. All trends appear relatively linear, hence confirming the predictions of Theorems 3.3 and 4.1. We also notice that there is a significant difference between the in-sample errors and the CMIAE. For example, it turns out that the smooth function (tanh) has smaller in-sample error compared to the non-smooth function 1, but has consistently larger CMIAE compared to this function. Furthermore, the slopes of all lines are steeper in the CMIAE simulation compared to the in-sample errors simulation; in addition, the slope of the non-smooth function 2 is much steeper for the CMIAE than for the in-sample error, implying that this function is easy to predict on the dataset at hand, even with a relatively small adjusted sample size, but that this is not the case for the CMIAE.
In addition to presenting simulations for the full data procedure, we also present results on the sample splitting version. Figure 4 demonstrates how sample splitting performs with n = 500 observations in comparison to Figure 3. We can see a difference in the "in-sample" errors in comparison to the results in Figure 3: the in-sample error is larger than for the full procedure by a factor of about 1.2 to 1.3. The CMIAE is higher than the CMIAE of the full data procedure by roughly a factor of 1.3 to 1.5. This is to be expected, as the full data procedure uses double the observations for each of the two steps. However, in defense of sample splitting, as we argued in Appendix D, some assumptions can be lifted when running the two steps independently.
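The sample-splitting variant differs from the full procedure only in which observations feed each step. A minimal sketch, under the same illustrative assumptions as before (tanh link, identity covariance, sklearn stand-ins; all names are ours):

```python
# Sketch of the sample-splitting version: Step I (LASSO) on one half of the data,
# Step II (isotonic regression) on the other half, then prediction on fresh draws.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(1)
n, p, s = 500, 200, 10
beta_star = np.zeros(p)
beta_star[:s] = 1.0 / np.sqrt(s)
X = rng.standard_normal((n, p))
Y = np.tanh(X @ beta_star) + rng.standard_normal(n)

half = n // 2
X1, Y1 = X[:half], Y[:half]          # first half: estimate beta via LASSO
X2, Y2 = X[half:], Y[half:]          # second half: estimate the link f

beta_hat = LassoCV(cv=10).fit(X1, (Y1 - Y1.mean()) / Y1.std()).coef_
iso = IsotonicRegression(out_of_bounds="clip").fit(X2 @ beta_hat, Y2)

# Out-of-sample absolute prediction error on fresh samples (CMIAE-style).
X_new = rng.standard_normal((1000, p))
pred = iso.predict(X_new @ beta_hat)
cmiae = np.mean(np.abs(pred - np.tanh(X_new @ beta_star)))
```

Because each step sees only half the observations, both error measures are expected to be somewhat larger than for the full-data procedure, consistent with the factors reported above.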
Additional simulation results can be found in Section E.

Discussion
This paper focused on a two step procedure applying LASSO and isotonic regression for monotone increasing SIMs. We showed guarantees for both the in-sample and out-of-sample prediction errors. One interesting technical question is whether we can relax the condition $s \log n \ll \sqrt{n}$ required in Theorem 4.1. The fact that sample splitting does not require this condition suggests that it may not be necessary for the full data procedure either.
Another interesting direction is to explore whether the condition of Gaussian covariates can be relaxed. It is known, see e.g., [26,25], that if the predictors have elliptically symmetric distributions, linear regression can recover the predictor up to a proportionality constant. Such an extension lies beyond the scope of the present manuscript: in our current proof of the full data version of the procedure we rely on results which require the Gaussianity of the covariates. We do believe, however, that the procedure should work for elliptically distributed covariates, and we defer this important question to future investigations. A technical question regarding the CMIAE was raised by one of the referees: whether one can remove the condition $X^\top \hat\beta \in [\min_i X_i^\top \hat\beta, \max_i X_i^\top \hat\beta]$ from the expectation. Currently we do not know how to remove this condition unless we impose boundedness on the function f. This represents a challenge that could be addressed in future work.
It is also of interest to see whether one can extend our results on the CMIAE to the stronger criterion, the CMISE, defined in Section 4. Our preliminary simulations (not reported) show that the CMISE behaves comparably to the CMIAE, suggesting that a result similar to Theorem 4.1 might hold for the CMISE. Another interesting question is whether consistency can be established without imposing Assumption 3.1 on the covariance.
We also conjecture that the rates we derived are minimax optimal. This conjecture is based on the fact that when f is linear, the rate $\frac{s \log p}{n}$ is known to be minimax for the $\ell_2^2$ error, while even if we know $\beta^*$ the other rate is minimax when estimating an increasing f. We defer investigating the validity of this conjecture to future work.

Appendix A: Auxiliary results
First we state a very useful inequality from random matrix theory regarding the singular values of a Gaussian matrix.
Lemma A.1 (Corollary 5.35 [41]). Let A be an $n \times s$ matrix whose entries are i.i.d. standard normal random variables. Then for every $t \ge 0$, with probability at least $1 - 2\exp(-t^2/2)$ one has
$$\sqrt{n} - \sqrt{s} - t \le s_{\min}(A) \le s_{\max}(A) \le \sqrt{n} + \sqrt{s} + t,$$
where $s_{\min}(A)$ and $s_{\max}(A)$ are the smallest and largest singular values of A correspondingly.
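As a quick numerical sanity check of Lemma A.1 (not part of the formal argument), the following sketch draws Gaussian matrices and counts how often the singular value bounds are violated; the observed violation fraction should be well below the theoretical bound $2\exp(-t^2/2)$.

```python
# Empirical check of the Gaussian singular value bounds of Lemma A.1:
# sqrt(n) - sqrt(s) - t <= s_min(A) <= s_max(A) <= sqrt(n) + sqrt(s) + t
# holds with probability at least 1 - 2 exp(-t^2 / 2).
import numpy as np

rng = np.random.default_rng(0)
n, s, t, reps = 400, 20, 3.0, 200
viol = 0
for _ in range(reps):
    A = rng.standard_normal((n, s))
    sv = np.linalg.svd(A, compute_uv=False)   # singular values, descending
    if sv[-1] < np.sqrt(n) - np.sqrt(s) - t or sv[0] > np.sqrt(n) + np.sqrt(s) + t:
        viol += 1
frac = viol / reps   # should be far below 2 * exp(-t**2 / 2) ~ 0.022 for t = 3
```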
Next, we remind the reader of a powerful result on sub-Gaussian concentration of non-Lipschitz functions proved by [1]. Before that we give the definitions of the $\|\cdot\|_{\psi_1}$ and $\|\cdot\|_{\psi_2}$ norms which will be used later in our analysis. For a real random variable X, define
$$\|X\|_{\psi_1} := \inf\{t > 0 : \mathbb{E}\exp(|X|/t) \le 2\}, \qquad \|X\|_{\psi_2} := \inf\{t > 0 : \mathbb{E}\exp(X^2/t^2) \le 2\}.$$
Recall that a random variable X is called sub-Gaussian if $\|X\|_{\psi_2} < \infty$ and sub-exponential if $\|X\|_{\psi_1} < \infty$.
First we introduce the notation of [1]. For an integer $\ell$, let $P_\ell$ denote the set of partitions of $[\ell]$ into non-empty disjoint sets. For a partition $\mathcal{J} = \{J_1, \ldots, J_k\} \in P_\ell$ and an $\ell$-indexed array $A = (a_{\mathbf{i}})_{\mathbf{i} \in [n]^\ell}$, define the norm
$$\|A\|_{\mathcal{J}} := \sup\Big\{\sum_{\mathbf{i} \in [n]^\ell} a_{\mathbf{i}} \prod_{k'=1}^{k} x^{(k')}_{\mathbf{i}_{J_{k'}}} \;:\; \big\|x^{(k')}\big\|_2 \le 1 \text{ for all } 1 \le k' \le k\Big\},$$
where the indexing should be understood as $\mathbf{i}_I := (i_k)_{k \in I}$. Given the convention that $|\mathcal{J}|$ is the cardinality of the set $\mathcal{J}$, we restate a version of Theorem 1.4 of [1].
Theorem A.2 (Theorem 1.4 [1]). Let $X = (X_1, \ldots, X_n)$ be a random vector with independent components, such that $\|X_i\|_{\psi_2} \le \Gamma$ for all $i \le n$. Then for every polynomial $f : \mathbb{R}^n \to \mathbb{R}$ of degree $D$ and every $t > 0$ we have
$$\mathbb{P}\big(|f(X) - \mathbb{E} f(X)| \ge t\big) \le 2 \exp\Big(-\frac{1}{C_D} \min_{1 \le \ell \le D} \min_{\mathcal{J} \in P_\ell} \Big(\frac{t}{\Gamma^\ell \|\mathbb{E} D^\ell f(X)\|_{\mathcal{J}}}\Big)^{2/|\mathcal{J}|}\Big),$$
where $C_D$ is a constant depending only on $D$. In the above, $D^\ell f$ is the $\ell$-th derivative of f.

Appendix B: In-sample prediction proofs (full data)
Proof of Theorem 3.3. Our first step is to show that $\hat\beta$ can be identified, with high probability, with the solution to the following restricted program:
$$\tilde\beta_S := \arg\min_{b \in \mathbb{R}^s} \frac{1}{2n} \sum_{i=1}^n (Y_i - X_{iS}^\top b)^2 + \lambda \|b\|_1,$$
where S is the support of $\beta^*$ and $X_{iS}$ is $X_i$ restricted to the set S. In addition we will argue that the vector $\tilde\beta_S$ is consistent for $\beta^*_S$. We have the following result, which describes the relationship between $\hat\beta$ and $\tilde\beta_S$.

Proposition B.1. Suppose all conditions of Theorem 3.3 hold. Then with probability at least $1 - F(n, s, p, f)$, where $F(n, s, p, f) \to 0$ as $n \to \infty$, we have $\mathrm{supp}(\hat\beta) \subseteq S$, $\hat\beta_S = \tilde\beta_S$, and the bounds (B.2) and (B.3) hold.
The first two claims involving the support of $\hat\beta$ have already been established in [30], while claims similar to the third and fourth (although for the square-root LASSO) were given by [32,38]. Since we were not able to find a proof of (B.2) and (B.3) in the LASSO literature (in particular under Assumptions 3.2 and 3.1), we provide a standalone proof of Proposition B.1 in Appendix B. This proof will also be useful later when we discuss the predictive properties of our procedure. Let $\mathcal{E}$ denote the event on which (B.2) and (B.3) hold.
To simplify our analysis, we now isolate the dependency of $\hat\beta$ on $\varepsilon$, and argue (in a somewhat roundabout manner) that it does not cause issues for the final estimator we propose.
Step II'. Construct the permutation $\pi$ sorting the values $\{X_{iS}^\top \tilde\beta_S\}_{i \in [n]}$ in increasing order††; in other words, let $\pi$ be such that $X_{\pi_i S}^\top \tilde\beta_S \le X_{\pi_j S}^\top \tilde\beta_S$ for all $i \le j$. Fit isotonic regression as in (2.2), and for a given X set the prediction accordingly. Since Proposition B.1 argues that $\mathcal{E}$ is a high probability event, analyzing Step II' ensures that similar guarantees hold for Step II with slightly smaller probability. From now on we focus on analyzing Step II'. For the following argument let us first condition on the matrix X.
Denote by $X_S$ the $n \times s$ matrix whose rows are the row vectors $\{X_{1S}, \ldots, X_{nS}\}$. Define the projections
$$P_{X_S} := X_S (X_S^\top X_S)^{-1} X_S^\top, \qquad P_{X_S^\perp} := I_n - P_{X_S}.$$
Since the error vector $\varepsilon$ has a $N(0, \sigma^2 I)$ distribution by assumption, conditionally on X the random vector $P_{X_S^\perp} \varepsilon$ is independent of $\tilde\beta_S$ (note that the covariance $\mathbb{E}[P_{X_S^\perp} \varepsilon \varepsilon^\top X_S] = 0$). Let $\varepsilon'$ be an independent copy of $\varepsilon$, so that $\varepsilon' \sim N(0, \sigma^2 I)$. We can rewrite our model as
$$Y_{\pi_i} = f(X_{\pi_i}^\top \beta^*) + [P_{X_S}(\varepsilon - \varepsilon')]_{\pi_i} + \xi_i, \qquad \xi_i := [P_{X_S^\perp} \varepsilon + P_{X_S} \varepsilon']_{\pi_i}.$$
Importantly, notice that the random variables $\xi_i \sim N(0, \sigma^2)$ are i.i.d., since by definition $P_{X_S^\perp} \varepsilon + P_{X_S} \varepsilon' \sim N(0, \sigma^2 I_n)$ is independent of $\tilde\beta_S$ and therefore of $\pi_i$. The motivation for defining the $\xi_i$ is that this property fails for the original noise variables $\varepsilon_i$, i.e., the distribution of $\varepsilon_{\pi_i}$ is complicated. One disadvantage of the $\xi_i$, however, is that they depend on the term $[P_{X_S}(\varepsilon - \varepsilon')]_{\pi_i}$.

†† Here we use $\pi$ with a slight abuse of notation to denote both permutations in Step II and Step II'; the two coincide on the high probability event $\mathcal{E}$ by Proposition B.1.
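As a numerical sanity check of this decoupling step (not part of the formal argument), the following sketch verifies that $\xi = P_{X_S^\perp}\varepsilon + P_{X_S}\varepsilon'$ indeed has the $N(0, \sigma^2 I_n)$ distribution, by checking its marginal variances over repeated draws; variable names are ours.

```python
# Check that xi = P_perp eps + P eps' has marginal variance sigma^2 in every
# coordinate, where P projects onto the column span of X_S (fixed design here).
import numpy as np

rng = np.random.default_rng(0)
n, s, sigma, reps = 200, 5, 1.0, 5000
X_S = rng.standard_normal((n, s))
P = X_S @ np.linalg.solve(X_S.T @ X_S, X_S.T)   # projection onto col(X_S)
P_perp = np.eye(n) - P

E = sigma * rng.standard_normal((reps, n))      # eps, one draw per row
Ep = sigma * rng.standard_normal((reps, n))     # independent copy eps'
Xi = E @ P_perp + Ep @ P                        # both projections are symmetric

var_hat = Xi.var(axis=0)                        # marginal variances, ~ sigma^2
```

Since $P_{X_S^\perp}\varepsilon$ and $P_{X_S}\varepsilon'$ are independent with covariances $\sigma^2 P_{X_S^\perp}$ and $\sigma^2 P_{X_S}$, the sum has covariance exactly $\sigma^2 I_n$, which the empirical variances confirm.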
We now introduce some shorthand notation. Let $u_i := f(X_{\pi_i S}^\top \tilde\beta_S)$, $\nu_i := f(X_{\pi_i}^\top \beta^*)$ and $\zeta_i := [P_{X_S}(\varepsilon - \varepsilon')]_{\pi_i}$, and set $\gamma := \nu + \zeta$. The function $f(x)$ is increasing, and moreover by definition $\{X_{\pi_i S}^\top \tilde\beta_S\}_{i=1}^n$ is an increasing sequence; we therefore have $u \in \mathcal{S}^\uparrow_n$. Even conditionally on the design $X_S$, the terms $\zeta_i$, $i \in [n]$, remain random and depend on the $\xi_i$; therefore known results such as Corollary 2.2 of [4] are not directly applicable in this situation. Following (1.21) of [4], using the cosine theorem and the fact that $\hat f$ is the projection of $Y = \gamma + \xi$ on the cone $\mathcal{S}^\uparrow_n$, for any $v \in \mathcal{S}^\uparrow_n$ we have
$$\|\hat f - v\|_2^2 \le \|\gamma + \xi - v\|_2^2 - \|\gamma + \xi - \hat f\|_2^2.$$
In particular the above inequality holds for $v = u$; we expand the norms and lower bound the left hand side of the resulting display by Cauchy–Schwarz. To this end we remind the reader of Theorem 2.3 of [4], which applies whenever there exists a $t_*(u)$ satisfying the condition stated there. For such a $t_*(u)$, inequality (B.5) shows that with probability at least $1 - p^{-1}$,
$$\|\hat f - \nu\|_2^2 \le 4\|u - \nu\|_2^2 + 6\|\zeta\|_2^2 + 4 t_*^2(u) + 8\sigma^2 \log p,$$
where we used that $2\|\zeta\|_2 \|u - \nu\|_2 \le \|\zeta\|_2^2 + \|u - \nu\|_2^2$ and $2\|\zeta\|_2 \|\hat f - \nu\|_2 \le 2\|\zeta\|_2^2 + \frac{1}{2}\|\hat f - \nu\|_2^2$. [8] showed that such a $t_*(u)$ can be taken as $t_*(u) = \sqrt{c}\,\sigma\big(1 + \frac{u_n - u_1}{\sigma}\big)^{1/3} n^{1/6}$ for some absolute constant c. We now control the terms $\|u - \nu\|_2^2$ and $\|\zeta\|_2^2$ appearing in (B.5). Since f is L-Lipschitz, we can bound $\|u - \nu\|_2^2 \le L^2 \|X_S(\tilde\beta_S - \beta^*_S)\|_2^2$, which is controlled on the event $\mathcal{E}$. Furthermore, since $P_{X_S}$ is a projection matrix of rank s, we have $\|P_{X_S}\|_F^2 = s$ and $\|P_{X_S}\|_2 = 1$. Using the Hanson-Wright inequality [35], for some absolute constant c we obtain a tail bound on $\|\zeta\|_2^2$ valid for any X; setting $t = \frac{s}{n}$ gives that with probability at least $1 - 2\exp(-cs)$ we have $n^{-1}\|\zeta\|_2^2 \le \frac{3s}{n}$. To summarize, conditionally on X we have established that with overwhelming probability the in-sample error is bounded by the Lipschitz term, $4 t_*^2(u)/n$, and $\frac{18s + 8\sigma^2 \log p}{n}$.
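The Hanson-Wright step can also be checked numerically. The sketch below (illustrative only, with $\sigma = 1$; permuting coordinates leaves the norm unchanged, so we check the unpermuted vector) confirms that $\|\zeta\|_2^2 = \|P_{X_S}(\varepsilon - \varepsilon')\|_2^2$, which is $2\sigma^2$ times a $\chi^2_s$ variable, concentrates near its mean $2\sigma^2 s$ and rarely exceeds the bound $3s$.

```python
# Empirical check of the concentration of ||zeta||^2 = ||P (eps - eps')||^2:
# eps - eps' ~ N(0, 2 I), so ||zeta||^2 ~ 2 * chi^2_s; the bound ||zeta||^2 <= 3 s
# should fail only with small probability (exponentially small in s).
import numpy as np

rng = np.random.default_rng(0)
n, s, reps = 500, 50, 2000
X_S = rng.standard_normal((n, s))
P = X_S @ np.linalg.solve(X_S.T @ X_S, X_S.T)   # rank-s projection, symmetric

diff = rng.standard_normal((reps, n)) - rng.standard_normal((reps, n))
zeta_sq = np.sum((diff @ P) ** 2, axis=1)       # ||P (eps - eps')||_2^2 per rep

mean_sq = zeta_sq.mean()                        # should be close to 2 * s
frac_exceed = np.mean(zeta_sq > 3 * s)          # empirical exceedance of 3 s
```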
Note that the above bound has been established conditionally on the design matrix X, and holds with high probability (independent of the design) for any realization; therefore it also holds unconditionally with high probability. Denote by $\mathcal{E}'$ the event on which the above bound holds. It follows that on the intersection event $\mathcal{E} \cap \mathcal{E}'$ (recall that $\mathcal{E}$ is the event on which (B.2) holds) we can further bound the in-sample error as claimed. This completes the proof.
then with high probability $\mathrm{supp}(\hat\beta) \subseteq S$ and $\hat\beta_S = \tilde\beta_S$ (the latter is not stated in the theorem but is implied by the proof). Note that under the assumptions of the Proposition this condition is satisfied when $\frac{n}{s \log p}$ is sufficiently large and $\lambda = C\sqrt{\frac{\log p}{n}}$ for a sufficiently large constant C. This completes the proof of our first claim. We will now show that the vector $\tilde\beta_S$ is close to $\beta^*_S$ in Euclidean distance. We start from the basic inequality for the restricted program, where we recall the definition of $\Upsilon$ from (1.5). Expanding the norms leads to a bound involving the term $n^{-1}\|w^\top X_S \Sigma_{SS}^{-1/2}\|_\infty$, which we now control. First suppose that $\Sigma_{SS} = I$ (hence, by the assumption $\|\Sigma_{SS}^{1/2} \beta^*_S\|_2 = 1$, it follows that $\|\beta^*_S\|_2 = 1$). We decompose $X_S$ using the projection $P_{\beta_S^{*\perp}} := I_s - \beta^*_S \beta^{*\top}_S$. Note that $X_S P_{\beta_S^{*\perp}}$ and w are independent; thus it is simple to check the conditional distribution of the corresponding term given w. We now argue that $n^{-1}\|w\|_2^2 \le 2\xi^2$ with probability at least $1 - \frac{\theta^2}{n\xi^2}$ (recall the definitions of $\theta^2, \xi^2$ in (1.5)). Note that $w = Y - \Upsilon X_S \beta^*_S$ is not necessarily a zero mean vector; however, Chebyshev's inequality applies, and setting $t = \xi^2$ brings the corresponding probability to 0 at the rate $\frac{\theta^2}{n\xi^2}$. Let $\mathcal{W} = \{w : n^{-1}\|w\|_2^2 \le 2\xi^2\}$. The diagonal entries of the covariance matrix $n^{-2}\|w\|_2^2 P_{\beta_S^{*\perp}}$ are at most $n^{-2}\|w\|_2^2$. Hence, by a standard Gaussian tail bound and a union bound, we obtain for some universal constant c a tail bound conditionally on $w \in \mathcal{W}$. By the law of total probability we may then integrate this bound over the density p(w) of w.
Therefore, setting $t \ge c_0 \xi \sqrt{\frac{\log p}{n}}$ with $c_0 = 2/\sqrt{c}$ bounds the first term in the above probability by $\frac{2s}{p}$. We now move to the second term of (B.7), using an elementary inequality together with Chebyshev's inequality: setting $t = \sqrt{\frac{2\gamma \log p}{n}}$ bounds the corresponding probability by $(\log p)^{-1}$. By Lemma 1 of [24], setting $t = 8\Upsilon\sqrt{\frac{\log p}{n}}$ bounds the corresponding probability by $2p^{-1}$. We conclude that with probability approaching one,
$$\|\tilde\beta_S - \beta^*_S\|_2 \le C(\Upsilon, \gamma, \xi)\sqrt{\frac{s \log p}{n}},$$
where $C(\Upsilon, \gamma, \xi) = 8|\Upsilon| + 2\gamma + c_0\xi$ and $c_0 = 2/\sqrt{c}$ is a universal constant.
Proof of Lemma 3.5. The claim follows from a chain of elementary inequalities, the last of which holds by (B.2).

Appendix C: Prediction proofs (full data)
Proof of Proposition 4.2. Since we are assuming that $\mathrm{supp}(\hat\beta) \subseteq S$ holds, we can restrict our analysis to the vectors $\hat\beta_S$ and $X_{iS}$. In order not to burden the notation, in this proof we will still write $\hat\beta$ for $\hat\beta_S$ and $X_i$ for $X_{iS}$. Before we begin the proof we note that $x_i = X_{\pi_i}^\top \hat\beta / c = X_{\pi_i}^\top \bar\beta$, where $\bar\beta := \hat\beta / c$ satisfies $\|\Sigma^{1/2} \bar\beta\|_2 = 1$. Note that by our argument above it follows that the vector $\bar\beta \in \mathcal{E}^{s-1}$.
For any two vectors $\gamma \in \mathcal{E}^{s-1}$ and $\beta \in N_\epsilon$, let $u_\gamma = [X\gamma]^\uparrow$ and $v_\beta = [X\beta]^\uparrow$; below we will suppress the dependency of $u_\gamma$ and $v_\beta$ on $\gamma$ and $\beta$ respectively, and use the bound provided by Lemma C.1. For any fixed vector $\gamma \in \mathcal{E}^{s-1}$ we have that $X_i^\top \gamma \sim N(0, 1)$, and therefore $\Phi(X_i^\top \gamma) \sim U(0, 1)$. Additionally, by Lemma A.1, the singular values of X are controlled with probability at least $1 - 2e^{-n/2}$. Select $\epsilon = \frac{c}{n}$ for some constant c. We obtain that with probability at least $1 - 2e^{-n/2} - \big(1 + \frac{2n}{c}\big)^s e^{-c_0\sqrt{n}}$ the RHS of (C.1) is bounded by $\sqrt{12} + c\sqrt{2}\big(2 + \sqrt{\frac{s}{n}}\big)/\sqrt{\pi}$. Selecting c appropriately, we can bound this quantity by $\sqrt{12} + \frac{1}{\sqrt{\pi}}$. Furthermore, since we are assuming that $s \log n \ll \sqrt{n}$, the above probability converges to 1. This completes the proof.
Proof of Lemma C.1. By the triangle inequality we first split the difference into two terms. Using the elementary inequality $(a - b)^2 \le 2(a^2 + b^2)$ and the fact that $\Phi(x)$ is a $\frac{1}{\sqrt{2\pi}}$-Lipschitz function, we obtain the claimed bound. The next to last inequality follows since for any two vectors $a \in \mathbb{R}^n$ and $b \in \mathbb{R}^n$ we have $\|a^\uparrow - b^\uparrow\|_2^2 \le \|a - b\|_2^2$. The latter is easily seen upon observing that if $a_i \le a_j$ but $b_i > b_j$, then swapping $b_i$ and $b_j$ does not increase $\sum_k (a_k - b_k)^2$.

[Figure: same setting as Figure 3, the difference being that n = 1000 throughout all simulation settings. We observe similar linear trends in all settings, confirming our theoretical predictions.]

[Figure: same setting as Figure 4, the difference being that n = 1000 throughout all simulation settings. We observe similar linear trends in all settings, confirming our theoretical predictions.]
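The rearrangement inequality used in the proof of Lemma C.1, $\|a^\uparrow - b^\uparrow\|_2 \le \|a - b\|_2$, is easy to verify numerically; the following sketch checks it on random Gaussian vectors.

```python
# Numerical check of the sorting contraction: for any vectors a, b,
# ||sort(a) - sort(b)||_2 <= ||a - b||_2.
import numpy as np

rng = np.random.default_rng(0)
ok = True
for _ in range(1000):
    a = rng.standard_normal(50)
    b = rng.standard_normal(50)
    lhs = np.linalg.norm(np.sort(a) - np.sort(b))
    rhs = np.linalg.norm(a - b)
    ok = ok and (lhs <= rhs + 1e-12)   # small tolerance for floating point
```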