Bounded isotonic regression

Isotonic regression offers a flexible modeling approach under monotonicity assumptions, which are natural in many applications. Despite this attractive setting and extensive theoretical research, isotonic regression has enjoyed limited interest in practical modeling primarily due to its tendency to suffer significant overfitting, even in moderate dimension, as the monotonicity constraints do not offer sufficient complexity control. Here we propose to regularize isotonic regression by penalizing or constraining the range of the fitted model (i.e., the difference between the maximal and minimal predictions). We show that the optimal solution to this problem is obtained by constraining the non-penalized isotonic regression model to lie in the required range, and hence can be found easily given this nonpenalized solution. This makes our approach applicable to large datasets and to generalized loss functions such as Huber’s loss or exponential family log-likelihoods. We also show how the problem can be reformulated as a Lasso problem in a very high dimensional basis of upper sets. Hence, range regularization inherits some of the statistical properties of Lasso, notably its degrees of freedom estimation. We demonstrate the favorable empirical performance of our approach compared to various relevant alternatives. MSC 2010 subject classifications: Primary 62G08; secondary 62J07.


Introduction
The statistics community has long had an interest in fitting monotone models to data [18, 2, 29, 25]. Assume we are given a set of observations (x_1, y_1), ..., (x_n, y_n), each consisting of a vector of covariates x ∈ X and a response y ∈ R, along with a partial order ⪯ on the covariate space X. The class of isotonic models G is defined as the collection of models on X which obey the partial order constraints, i.e., a model g : X → R is in G if for all x, z ∈ X, x ⪯ z implies g(x) ≤ g(z). The partial order on X can be translated to a set of order constraints on the observations indexed by I = {(i, j) : x_i ⪯ x_j}. Isotonic regression aims to find the model ĝ ∈ G which minimizes the residual sum of squares on the sample data:

ĝ = argmin_{g ∈ G} Σ_{i=1}^n (y_i − g(x_i))^2.   (1)

It is well known that an optimal solution of (1) can be written as a non-negative linear combination of upper sets plus an intercept. An upper set is any subset U ⊆ {x_1, ..., x_n} such that x_i ∈ U and x_i ⪯ x_j together imply x_j ∈ U. Assume we have N distinct upper sets (N is typically exponential in n) denoted U_1, ..., U_N. We thus know that the optimal solution has the form

ĝ(x_i) = α̂ + Σ_{l=1}^N β̂_l 1{x_i ∈ U_l}   (2)

for some intercept α̂ and nonnegative coefficients β̂_l ≥ 0. Note that this only defines the solution at our observed covariate vectors x_1, ..., x_n, but it can easily be extended to all of X by associating every point x ∈ X with the set of upper sets it "dominates". Prediction and extrapolation in this context have been previously discussed in Luss et al. [21].
The most commonly used covariate space and partial order are the standard Euclidean space X = R^d and the partial order defined on x, z ∈ R^d by x ⪯ z iff the inequality holds coordinate-wise, i.e., x_j ≤ z_j for all j = 1, . . . , d.
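To make the coordinate-wise partial order concrete, here is a minimal sketch in Python with NumPy (the helper name `order_pairs` is ours, not from the paper) that enumerates the constraint index set I:

```python
import numpy as np

def order_pairs(X):
    """Enumerate I = {(i, j) : x_i <= x_j coordinate-wise, i != j}.

    X is an (n, d) array of covariate vectors.  The quadratic scan is
    only meant to illustrate the definition, not to be efficient.
    """
    n = X.shape[0]
    return [(i, j) for i in range(n) for j in range(n)
            if i != j and np.all(X[i] <= X[j])]

# In 1-D every pair of distinct points is comparable ...
X1 = np.array([[1.0], [2.0], [3.0]])
# ... while in higher dimension many pairs are incomparable.
X3 = np.array([[1.0, 5.0, 2.0],
               [2.0, 1.0, 4.0],
               [3.0, 6.0, 5.0]])
```

As d grows, comparable pairs become rare for generic data, which is the root of the overfitting concern discussed next.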
In terms of model form, isotonic regression is clearly very attractive in situations where monotonicity is a reasonable assumption but commonly assumed structures such as linearity or additivity are not. Indeed, this formulation has found useful applications in biology [25, 21], medicine [29], statistics [2] and psychology [18], among others (see Tibshirani et al. [31] for a nearly-isotonic formulation that relaxes the isotonic assumption). However, two major concerns arise when considering the practical use of isotonic regression in modern situations, as the number of observations n, the data dimensionality d, and the number of isotonicity constraints m = |{(i, j) : x_i ⪯ x_j}| implied by (1) all grow large: statistical overfitting and computational difficulty. The notations n, d, and m will refer to these quantities throughout the paper.
The first concern is statistical overfitting. Beyond very low dimensions, the isotonicity constraints on the family G can become inefficient in controlling model complexity, and the isotonic regression solutions can be severely overfitted (for example, see Bacchetti [1] and Schell and Singh [29]). At the extreme, there may be no isotonicity constraints at all, because no two observations obey the coordinate-wise requirement for the partial order ⪯. In this case, every observation is a singleton upper set, and if we denote these n upper sets by U_1, ..., U_n, the isotonic model can fit the data perfectly using only these upper sets: i.e., α̂ = min_i y_i and β̂_l = y_l − min_i y_i for all l = 1, . . . , n, providing a perfect interpolation of the training data. As demonstrated in the literature [29, 21] and below, the overfitting concern is clearly well-founded when considering the optimal isotonic regression model implied by (1), even in non-extreme cases with a large number of constraints. In this case, regularization, i.e., fitting isotonic models that are constrained to a restricted subset of G, offers an approach that maintains isotonicity while controlling variance, leading to improved accuracy.
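The degenerate case described above is easy to verify numerically; the following sketch (our own illustration, with made-up responses) shows that singleton upper sets reproduce the training data exactly:

```python
import numpy as np

# If no two observations are comparable, every observation is its own
# (singleton) upper set.  The fit alpha + beta_l on each singleton then
# interpolates the training responses perfectly: zero training error and
# maximal overfitting.
y = np.array([3.0, -1.0, 2.0, 5.0])
alpha = y.min()            # intercept: the minimal response
beta = y - alpha           # one non-negative coefficient per singleton upper set
fit = alpha + beta         # fitted value at each observation
```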
A second important issue is computation. Traditional methods developed in the statistics community did not scale well with the dimension d [19,3]. However, in recent years, ideas from optimization have permeated this area and led to the development of very efficient algorithms which can be used to solve problems with tens of thousands of observations in any dimension. The most efficient algorithm for isotonic regression known to the authors is due to Hochbaum and Queyranne [14] where the problem is cast in a more general form they refer to as the convex cost closure problem. They show how to obtain the global isotonic solution with the complexity of solving a single minimum-cut problem (which deals with finding a minimal cut through the arcs of a graph). Furthermore, their algorithm can impose fixed upper and lower bounds on the isotonic model at all observation points.
The isotonic recursive partitioning (IRP) method proposed in Spouge et al. [30] and Luss et al. [21] (following previous related work by Maxwell and Muckstadt [23] and Roundy [28], among others) finds a solution of (1) through iterative partitioning of the space X (i.e., solving a sequence of minimum-cut problems). Each split can be computed efficiently and the procedure is guaranteed to converge to an optimal solution of (1). It offers a highly efficient approach for solving (1) that is also amenable to regularization through early stopping of the iterative partitioning process. However, as demonstrated there, IRP does not offer sufficient complexity control and regularization in many cases. For example, at dimension d = 6 for particular simulation models, the first iteration of IRP already performs more than half of the fitting of the optimal solution of (1) as measured by equivalent degrees of freedom [21]. IRP regularization also lacks a rigorous formulation of the problem solved, as one cannot explicitly write the optimization problem being solved to obtain the model after k IRP iterations.
We are thus interested in designing a rigorous regularization approach for isotonic regression that would allow for a continuum of regularized models with increasing model complexity. In the nonparametric context, where we fit a model by choosing it from a large function class, there has been extensive use of smoothness (or complexity) penalties, leading to important tools such as spline methods and kernel machines [12]. These nonparametric smoothness regularization problems are typically solved by identifying "equivalent" parametric representations and solving those using ridge- or lasso-like approaches. This includes smoothing splines, kernel machines, and total variation penalties [22], among others. Importantly, total variation penalties are intimately tied to Lasso-type penalties on spline bases, as demonstrated by Mammen and van de Geer [22] and others.
A similar approach can be proposed for isotonic regression, where a natural measure of model "complexity" is the range of model predictions. For a model g ∈ G, define:

range(g) = sup_{x ∈ X} g(x) − inf_{x ∈ X} g(x).

It is easy to see that for a model of the form (2) we have:

range(ĝ) ≤ Σ_{l=1}^N β̂_l,   (3)

so there is a close connection here too between range regularization and Lasso regularization, which we return to later. Our explicit range-regularized isotonic regression model is thus (in its constrained form):

ĝ = argmin_{g ∈ G, range(g) ≤ s} Σ_{i=1}^n (y_i − g(x_i))^2,   (4)

or in its equivalent penalized form:

ĝ_λ = argmin_{g ∈ G} Σ_{i=1}^n (y_i − g(x_i))^2 + λ · range(g).   (5)

Our main observation in this paper is that this problem has a simple optimal solution, obtained by solving the non-regularized isotonic regression problem and finding optimal thresholds on this solution which obey the range constraint. Thus, all solutions to range-regularized isotonic regression problems are in fact thresholded versions of the non-regularized solutions. This leads to a simple and efficient algorithm for deriving all range-regularized solutions, which we term Bounded Isotonic Regression (BIR) and present in Section 2.
In Section 3 we investigate the properties of our new formulation and algorithm. We show that BIR is in fact equivalent to solving a non-negative Lasso problem in the set of upper sets. We further examine the regularization behavior of BIR and demonstrate that it generally adds one degree of freedom per upper set added to the solution, as shown for Lasso by Zou et al. [33]. Hence, BIR offers a gentle fitting of isotonic models with increasing complexity. Finally, we show that the basic ideas and efficient implementation of the BIR framework are not limited to isotonic regression, but can be applied to isotonic modeling problems with other differentiable loss functions (e.g., exponential family log-likelihoods or Huber's loss).
We demonstrate our BIR algorithm on simulated datasets in Section 4. We compare BIR regularization to IRP, and show its continuous regularization behavior, as well as its improved predictive performance, compared to IRP. We demonstrate its superior predictive performance over multivariate adaptive regression splines (MARS) for models that are both isotonic and highly complex. We also compare model selection approaches for the regularization parameter in BIR: general-purpose cross-validation is compared to methods based on in-sample error, such as AIC and GCV, using the degrees-of-freedom approximation based on the Lasso connection.
It is interesting to compare our approach to a recent paper by Fang and Meinshausen [9], which also combines isotonic regression and Lasso penalties in an algorithm termed LISO (Lasso-Isotone). Fang and Meinshausen limit their interest to additive isotonic models, i.e., where a univariate isotonic function is fit to every covariate, and the overall isotonic model is the sum of these univariate functions. Additivity significantly limits the generality of the isotonic models being fit, but allows Fang and Meinshausen [9] to fit useful models to very high dimensional data (large d). Our approach makes fewer assumptions, and in lower dimensions (up to d = 8 or d = 10), when the models are complex and data is abundant, it demonstrates superior performance relative to LISO in our experiments (Section 4).
A result corresponding to our main observation about a simple solution to the range-regularized isotonic regression problem (5) was independently derived in parallel for the range-constrained isotonic regression problem (4) in Chen et al. [6]. While our results were discovered independently, our proof proceeds in a similar fashion. However, whereas our result on the range-regularized case directly leads to an algorithm for the entire path of solutions (i.e., a solution to problem (5) for all values of λ), the result in Chen et al. [6] for the range-constrained case was used to prove a theorem about the degrees of freedom of the estimator. Again, we have independently discovered the same properties, although our claims were established through the above-mentioned connection between BIR and Lasso. We address the connections to Chen et al. [6] more thoroughly in Sections 2.1 and 3.1.1.

Regularization path for bounded isotonic regression
In this section, we derive an efficient algorithm for solving BIR. It relies on the result in Theorem 1 (in Section 2.1 below) that the solutions to BIR are all thresholded versions of the non-regularized isotonic regression solution.

Derivation of the optimal solution to BIR
Our focus is on the range-regularized isotonic regression problem in its penalized form given by problem (5). We first reformulate it as

min_{ŷ, â, b̂}  Σ_{i=1}^n (y_i − ŷ_i)^2 + λ(b̂ − â)
subject to  ŷ_i ≤ ŷ_j for all (i, j) ∈ I,   â ≤ ŷ_i ≤ b̂ for all i ∈ {1, . . . , n},   (6)

where at the optimum â = inf{ŷ_i : i ∈ {1, . . . , n}} and b̂ = sup{ŷ_i : i ∈ {1, . . . , n}}, and the range of the isotonic model is now captured by b̂ − â. We will use the optimality conditions of (6) to derive an algorithm for efficiently generating the solution to (6) for all values of λ. The optimality conditions are:

(a) 2(ŷ_i − y_i) + Σ_{j:(i,j)∈I} μ_ij − Σ_{j:(j,i)∈I} μ_ji − γ_i + δ_i = 0 for all i;
(b) ŷ_i ≤ ŷ_j for all (i, j) ∈ I;
(c) â ≤ ŷ_i ≤ b̂ for all i;
(d) Σ_i γ_i = λ and Σ_i δ_i = λ;
(e) μ_ij (ŷ_i − ŷ_j) = 0 for all (i, j) ∈ I;
(f) γ_i (â − ŷ_i) = 0 and δ_i (ŷ_i − b̂) = 0 for all i;
(g) μ ≥ 0, γ ≥ 0, δ ≥ 0;

where μ ∈ R^{|I|}, γ ∈ R^n, δ ∈ R^n are the dual variables to the monotonicity constraints, the lower range constraints, and the upper range constraints, respectively. The following theorem shows that BIR solutions are all thresholded versions of the non-regularized isotonic regression solution (i.e., BIR with λ = 0). Thus, if we have this non-regularized solution, we can obtain the solution to all BIR problems with minimal effort. We note that this result extends a similar theorem of Fang and Meinshausen [9] from one-dimensional (complete order) isotonic regression to partial order isotonic regression. Furthermore, a corresponding result was recently derived in Chen et al. [6] for solving the constrained version of bounded isotonic regression. We discuss their result in more detail at the end of this section.

Theorem 1. Let ẑ be the optimal solution to the non-regularized isotonic regression problem. Then setting

ŷ_i = max(â, min(ẑ_i, b̂))   (7)

for all i ∈ {1, . . . , n} solves BIR problem (6), where â and b̂ solve the equations

2 Σ_i (â − ẑ_i)_+ = λ  and  2 Σ_i (ẑ_i − b̂)_+ = λ.

Proof. By construction, we prove that ŷ as defined by (7) solves the optimality conditions (a)-(g) for problem (6) given above. Let μ* be the optimal dual variables to the corresponding monotonicity constraints in the non-regularized problem (which has the same optimality conditions above when λ = 0). λ = 0 implies γ = δ = 0, and condition (a) can then be rewritten as

2(ẑ_i − y_i) + Σ_{j:(i,j)∈I} μ*_ij − Σ_{j:(j,i)∈I} μ*_ji = 0.   (8)

For λ > 0, set the dual variables to be the same as the optimal dual variables as given in (8), i.e., set μ = μ*.
First note that the complementarity conditions (e) are satisfied by construction (μ_ij > 0 ⇒ ẑ_i = ẑ_j ⇒ ŷ_i = ŷ_j, and ŷ_i < ŷ_j ⇒ ẑ_i < ẑ_j ⇒ μ_ij = 0). The next set of complementarity conditions (f) implies that either γ_i or δ_i can be nonzero, but not both. Nonnegativity of γ and δ, along with condition (a), which can be written 2(ŷ_i − ẑ_i) = γ_i − δ_i, then implies γ_i = 2(ŷ_i − ẑ_i)_+ and δ_i = 2(ẑ_i − ŷ_i)_+. Given the definition of ŷ in (7), note that (ŷ_i − ẑ_i)_+ = (â − ẑ_i)_+ and (ẑ_i − ŷ_i)_+ = (ẑ_i − b̂)_+. Next are the primal variables. Suppose ŷ_i = max(â, min(ẑ_i, b̂)) for some â, b̂. With this definition, conditions (d) are equivalent to 2 Σ_i (â − ẑ_i)_+ = λ and 2 Σ_i (ẑ_i − b̂)_+ = λ. Both â and b̂ are scalars and can be chosen to satisfy these equations, and then used to obtain ŷ from its definition. Given the optimal ŷ, the optimal dual variables can be computed.
Optimality conditions (a), (d), and (g) hold by construction. Condition (b) holds from the definition of ŷ since ẑ is feasible for the non-regularized problem, and condition (c) holds directly by definition of ŷ. Conditions (f) hold by construction and using condition (c): if ŷ_i > â, then ŷ_i = min(ẑ_i, b̂) ≤ ẑ_i, so γ_i = 2(ŷ_i − ẑ_i)_+ = 0 and hence γ_i (â − ŷ_i) = 0. A similar argument shows the other complementarity condition holds as well. Optimality conditions (a)-(g) hold and, thus, ŷ, â, b̂ are optimal solutions to the range-regularized isotonic regression problem (6). ∎

Theorem 1 is illustrated in Figure 1, where the solution to BIR problem (6) for a decreasing sequence of λ values is depicted. We use data from a well-known baseball dataset [13] which describes the dependence of salary on a collection of player properties. Models are limited to two covariates in order to facilitate visualization. The numbers of runs batted in and hits were selected since they seemed a priori most likely to comply with the isotonicity assumptions. The increasing range, given in each figure as the distance between the two lines corresponding to the optimal values of â and b̂, can be seen as λ is decreased. The first two figures, with λ > 0, are thresholded versions of the last figure, which depicts the non-regularized solution, i.e., the surface between the â and b̂ lines is identical to the corresponding surface in the non-regularized solution.

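As a concrete illustration of Theorem 1, the two scalar equations for â and b̂ can be solved by bisection given the non-regularized fit ẑ. This is our own sketch (the function names are ours), not the paper's implementation; it is valid for λ between 0 and the value at which â and b̂ meet at the mean:

```python
import numpy as np

def lower_threshold(z, lam, iters=100):
    """Solve 2 * sum_i (a - z_i)_+ = lam for a by bisection.

    The left-hand side is non-decreasing in a, equals 0 at a = min(z) and
    equals lam_max = 2 * sum_i (mean(z) - z_i)_+ at a = mean(z), so the
    root is bracketed for 0 <= lam <= lam_max.
    """
    z = np.asarray(z, dtype=float)
    lo, hi = z.min(), z.mean()
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if 2.0 * np.maximum(mid - z, 0.0).sum() < lam:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def bir_solution(z, lam):
    """Theorem 1: the BIR solution is the clipped non-regularized fit."""
    z = np.asarray(z, dtype=float)
    a = lower_threshold(z, lam)
    b = -lower_threshold(-z, lam)   # the equation for b is the mirror image
    return a, b, np.clip(z, a, b)
```

For example, for ẑ = (0, 0, 1, 2, 4) and λ = 2 the equations give â = 0.5 and b̂ = 3, so the BIR fit is (0.5, 0.5, 1, 2, 3).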
As mentioned above, Chen et al. [6] independently and concurrently developed a corresponding thresholding theory for the constrained version of bounded isotonic regression. In our notation, we could reformulate problem (4) as

min_{ŷ, â}  Σ_{i=1}^n (y_i − ŷ_i)^2
subject to  ŷ_i ≤ ŷ_j for all (i, j) ∈ I,   â ≤ ŷ_i ≤ â + s for all i ∈ {1, . . . , n},   (9)

where, again, â = inf{ŷ_i : i ∈ {1, . . . , n}} and â + s = sup{ŷ_i : i ∈ {1, . . . , n}} at the optimum, and the range is still captured by s. Rather than adding the variable â with our range constraints, Chen et al. [6] constrain the range by adding constraints of the form ŷ_j − ŷ_i ≤ s for all (i, j) such that no observation bounds y_i from below and no observation bounds y_j from above (with respect to the partial order). We next state the main result of Proposition 3.3 of Chen et al. [6] using our notation and in a form similar to Theorem 1 above:

Proposition (Chen et al. [6], Proposition 3.3). Let ẑ be the optimal solution to the non-regularized isotonic regression problem. Then setting ŷ_i = max(â, min(ẑ_i, â + s)) for all i ∈ {1, . . . , n} solves problem (9), where â solves the equation Σ_i (â − ẑ_i)_+ = Σ_i (ẑ_i − â − s)_+.

R. Luss and S. Rosset
The optimality conditions for problem (9) are almost equivalent to those of problem (6); the main difference is that the stationarity condition for â (condition (d)) becomes Σ_i γ_i = Σ_i δ_i, since the λ(b̂ − â) term is replaced by the fixed range s. Given the optimality conditions, the proof of the above proposition follows almost exactly as the proof of Theorem 1. The most significant difference is that the new equation for determining â is given by the new optimality condition (d). We note that this proof differs from that given in Chen et al. [6], where the optimality conditions are described in terms of a graph representation of problem (9) that falls under a class of problems known as transportation problems. Proposition 3.3 of Chen et al. [6] also states that â is a non-increasing function of s (which we did not mention in Theorem 1, but this is trivially seen from the equation that determines â above). Chen et al. [6] developed the above theory in order to prove a statement about the degrees of freedom of the BIR estimator, which we discuss further in Section 3.1.1. Our motivation, rather, is to derive efficient algorithms for generating solutions to problem (6) for a sequence of λ.
Lastly, regarding the above discussion on thresholding isotonic regression, an interesting and related work is due to Hu [15]. While he previously showed that problem (6) with fixed â and b̂ can be solved by projecting the non-regularized isotonic regression solution onto the bound constraints, he showed that the same is not true when the lower (or upper) bounds for each data point are not identical. His approach approximates the bounded isotonic regression problem with an extended and weighted isotonic regression problem, and he showed that the solutions of these weighted problems converge, as some of the weights go to infinity, to the solution of the bounded isotonic regression problem (with fixed but different bounds).

Regularization path
Given Theorem 1, we can compute BIR for any fixed value of the regularization parameter λ. Rather than compute solutions over a grid of λ, which would require solving the one-variable equations for â and b̂ at each grid point, we next derive an efficient path algorithm that generates all solutions for a sequence of λ. For fixed λ, define the following three sets of indices:

A_λ = {i : â < ẑ_i < b̂},   A_λ^a = {i : ẑ_i < â},   A_λ^b = {i : ẑ_i > b̂}.

The decreasing sequence of λ values for which we derive solutions contains those values at which the sets A_λ, A_λ^a, and A_λ^b change. Between those values the interpolation is trivial.
Our approach to problem (6) requires that we first solve the non-regularized isotonic regression problem (1) to get ẑ, which can be done using the IRP algorithm of [21]. We next consider the solution to the completely regularized problem when the range is constrained to zero, i.e., λ = ∞ and b̂ = â. Under this constraint, ŷ_i is fixed to the same constant for all i, and it is trivial to see that ŷ_i = ȳ = â = b̂, where ȳ is the mean response. Given this fit, we can obtain the dual variables γ and δ and determine that λ = γ^T 1 = Σ_i γ_i. It is easy to see that we would obtain the same solution for any larger value of λ.
Our next goal is to decrease λ until there is a change in A_λ, A_λ^a, or A_λ^b, in which case â must be decreased and b̂ must be increased. Clearly, from the optimality conditions (d), the equalities Σ_i γ_i = λ = Σ_i δ_i must be maintained as λ decreases. We make two observations: for each i ∈ A_λ^a, γ_i must be decreased by the same amount in order for the optimality condition to be maintained, and similarly, for each i ∈ A_λ^b, δ_i must be decreased by the same amount. A change in the sets will occur when a nonzero γ_i or δ_i becomes zero.
Let γ_min = min{γ_i : i ∈ A_λ^a} and δ_min = min{δ_i : i ∈ A_λ^b} be the most that either set of dual variables can be decreased. Recall that decreasing the dual variables decreases λ. Decreasing γ or δ as much as possible would decrease λ by γ_min |A_λ^a| or δ_min |A_λ^b|, respectively. If γ_min |A_λ^a| < δ_min |A_λ^b|, we decrease γ as much as possible (by γ_min for all i ∈ A_λ^a), and decrease â by γ_min / 2 and λ by γ_min |A_λ^a|. Note that δ cannot then be decreased by as much as possible, because optimality condition (d) would no longer hold. Instead, summing the optimality condition for b̂ over i ∈ A_λ^b and using condition (d), each δ_i for i ∈ A_λ^b must be decreased by γ_min |A_λ^a| / |A_λ^b|, so b̂ is increased by γ_min |A_λ^a| / (2 |A_λ^b|). Given the new boundaries â and b̂, the new model ŷ for the new regularization parameter λ can be computed. The cases γ_min |A_λ^a| > δ_min |A_λ^b| and γ_min |A_λ^a| = δ_min |A_λ^b| are handled in a similar manner. Obtaining the BIR regularization path is summarized in Algorithm 1. The output is the path of â and b̂ values, which can be used to generate the path of isotonic models via ŷ_i = max(â, min(ẑ_i, b̂)).
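For squared loss, the path just derived admits an equivalent closed-form description: â(λ) is piecewise linear in λ, with breakpoints where â crosses a distinct value of ẑ (and symmetrically for b̂). The sketch below is our own illustration of this view; it computes the breakpoints directly from the equation 2 Σ_i (â − ẑ_i)_+ = λ and does not reproduce the dual-variable bookkeeping of Algorithm 1:

```python
import numpy as np

def lower_knots(z):
    """Breakpoints (lam_k, a_k) of the piecewise-linear map lam -> a(lam),
    where a solves 2 * sum_i (a - z_i)_+ = lam.  A breakpoint occurs each
    time a reaches the next distinct value of z."""
    z = np.asarray(z, dtype=float)
    vals = np.unique(z)                       # sorted distinct fitted values
    lams = [2.0 * np.maximum(v - z, 0.0).sum() for v in vals]
    return np.array(lams), vals

def a_of_lam(z, lam):
    """Evaluate a(lam) by linear interpolation between breakpoints
    (valid for lam up to lam_max, where a and b meet at the mean)."""
    lams, vals = lower_knots(z)
    return np.interp(lam, lams, vals)

def b_of_lam(z, lam):
    # The upper threshold satisfies the mirror-image equation.
    return -a_of_lam(-np.asarray(z, dtype=float), lam)
```

For ẑ = (0, 0, 1, 2, 4), the lower breakpoints occur at λ = 0, 4, 10, 26 with â = 0, 1, 2, 4, so interpolation gives â(2) = 0.5 and b̂(2) = 3, matching the bisection view of Theorem 1.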
While we do not develop a corresponding algorithm for problem (9) for a sequence of s, we note that such an algorithm could be a subject of future research. Initial investigations suggest that formulation (9) may not lead to a simple algorithm, because it is difficult to control how the parameter s and the variable â move together. Another reformulation, where we add the variable b̂, change the upper bound constraints to ŷ_i ≤ b̂, and control the range with the constraint b̂ − â ≤ s, might lead to a corresponding efficient algorithm.

Properties of bounded isotonic regression
Here we discuss the statistical behavior of our new algorithm and the resulting models. We address two aspects: 1. The connection between BIR and Lasso, and the resulting conclusions about regularization behavior of the BIR "regularization path", using results on degrees of freedom of Lasso [33]. 2. Generalization of our algorithm to other loss functions besides the quadratic loss in (5), which turns out to be straightforward.

BIR as a Lasso problem
As noted in Section 1, any isotonic fit to data can be described as a non-negative linear combination of N upper set indicators, and the range of any such model is less than or equal to the sum of the upper set coefficients, as shown in (3). Here we show that in fact any optimal solution to (5) is also exactly an optimal solution to a non-negative Lasso problem in upper sets. Following (2), we propose the following Lasso-like problem, with an added non-negativity requirement:

(α̂, β̂) = argmin_{α, β ≥ 0} Σ_{i=1}^n (y_i − α − Σ_{l=1}^N β_l 1{x_i ∈ U_l})^2 + λ Σ_{l=1}^N β_l,   (11)

where we optimize a penalized (penalty on the sum of upper set coefficients) criterion over all isotonic functions. Denote the optimal solution:

ĝ_λ(x) = α̂ + Σ_{l=1}^N β̂_l 1{x ∈ U_l}.

Lemma 2. The upper sets for which β̂_l > 0 in the solution to (11) are a nested sequence, i.e., if β̂_{l1} > 0 and β̂_{l2} > 0 then either U_{l1} ⊆ U_{l2} or U_{l2} ⊆ U_{l1}.

Proof. Assume by negation that this does not hold for l1, l2. U_u = U_{l1} ∪ U_{l2}, U_{12} = U_{l1} \ U_{l2} and U_{21} = U_{l2} \ U_{l1} are all also trivially upper sets (U_{21} or U_{12} may be empty). Now increase β̂_{U_u} by min(β̂_{l1}, β̂_{l2}), increase β̂_{U_{12}} by β̂_{l1} − min(β̂_{l1}, β̂_{l2}), increase β̂_{U_{21}} by β̂_{l2} − min(β̂_{l1}, β̂_{l2}), and set β̂_{l1} = β̂_{l2} = 0. The new ĝ(x) remains unchanged for all x, but Σ_l β̂_l has decreased by min(β̂_{l1}, β̂_{l2}) > 0. Hence this new function would have a lower penalized loss than the original ĝ_λ in (11), which contradicts the optimality of ĝ_λ. ∎

Lemma 3. For the optimal solution ĝ_λ of (11), range(ĝ_λ) = Σ_l β̂_l.

Proof. Let l_min be the index of the minimal set in the nested sequence of Lemma 2, and let x̃ ∈ U_{l_min}. Then ĝ_λ(x̃) = α̂ + Σ_l β̂_l, because this x̃ is in all sets of the nested sequence. On the other hand, let l_max be the index of the maximal set in this sequence. Then U_{l_max} ≠ X, because if U_{l_max} = X we could set β̂_{l_max} = 0 and increase the non-penalized α̂ correspondingly, again contradicting optimality. Hence there is x* ∉ U_{l_max} such that ĝ_λ(x*) = α̂. Hence range(ĝ_λ) ≥ Σ_l β̂_l. The other inequality is trivial, and has been stated in the introduction. ∎

Theorem 4. Any optimal solution to the Lasso problem (11) is also an optimal solution to the BIR problem (5) with the same λ.
Proof. From Lemma 3, the optimal solution to (11) also gives a solution to (5) with the same penalized objective value. Denote this solution by ĝ_λ, as before. It remains to prove that (5) cannot have a better solution. Assume by negation that there is a solution to (5) with a lower penalized objective, denoted by f̂_λ. Assume WLOG that {f̂_λ(x_i) : i = 1, ..., n} is sorted in increasing order, and denote U*_i = {x_j : f̂_λ(x_j) ≥ f̂_λ(x_i)}. It is easy to verify that U*_1, ..., U*_n are a nested sequence of upper sets, and that we can express

f̂_λ(x) = f̂_λ(x_1) + Σ_{i=2}^n (f̂_λ(x_i) − f̂_λ(x_{i−1})) 1{x ∈ U*_i},

which is a solution for (11) with the same loss and penalty as f̂_λ has in (5), hence the same penalized objective. This contradicts the optimality of ĝ_λ. ∎
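The construction in the proof of Theorem 4 is easy to verify numerically. The sketch below (our own illustration) decomposes a sorted isotonic fit into the nested upper sets U*_i, with the consecutive increments as coefficients; the coefficients then sum exactly to the range of the fit:

```python
import numpy as np

# Fitted values f(x_1) <= ... <= f(x_n), assumed sorted as in the proof.
f = np.array([1.0, 1.0, 2.5, 2.5, 4.0])

alpha = f[0]                 # intercept: the minimal fitted value
beta = np.diff(f)            # coefficient of U*_i = {x_j : f(x_j) >= f(x_i)}, i = 2..n

# Reconstruct the fit: point x_i lies in U*_2, ..., U*_i, so its value is
# alpha plus the cumulative sum of the first i - 1 increments.
recon = alpha + np.concatenate(([0.0], np.cumsum(beta)))
```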

Implications of the Lasso connection
Lasso-type problems have been extensively studied in the literature from various theoretical, methodological and practical perspectives. We focus here on two important classes of Lasso-related results and their implications on BIR.
First is the statistical complexity of BIR models, as measured in degrees of freedom (df) and optimism [7, 12]. For a generic penalized Lasso problem, denote by A = {j : β̂_j(λ) ≠ 0} the set of active covariates with non-zero coefficients. Zou et al. [33] have shown that Stein's unbiased estimator for df is the number of covariates with non-zero coefficients in the solution: d̂f = |A|. If our Lasso-like formulation for BIR (11) did not contain the non-negativity constraint, this result would apply directly and would imply, using Lemma 2, that the number of blocks (distinct values of the function ĝ_λ) is the Stein estimate.
In the presence of the non-negativity constraint, we claim that the generic Lasso result still holds. This can be verified by carefully considering the proof of Zou et al. [33], which only relies on the behavior of the Lasso regularization path given the set of active variables. Since the non-negativity constraint only affects selection of variables into the active set, and not the path direction given this active set, the result still holds. Furthermore, considering Algorithm 1, it is easy to see that the BIR solutions only add variables into the model and never drop them. Hence the Stein unbiased estimate of degrees of freedom is simply the number of iterations the algorithm has performed. This result is also consistent with the result of Meyer and Woodroofe [24], showing that the Stein estimate of degrees of freedom of non-regularized isotonic regression is the number of blocks (groups of distinct values) in the solution, which is the number of iterations the algorithm takes to reach the solution as λ → 0.
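In practice the Stein estimate is therefore trivial to read off a BIR fit: count its blocks. A small sketch (our own illustration, with made-up values):

```python
import numpy as np

z = np.array([0.0, 0.0, 1.0, 2.0, 4.0])   # non-regularized isotonic fit: 4 blocks
df_unregularized = len(np.unique(z))

# Thresholding at (a, b) = (1.0, 2.0) merges the outer blocks into the
# bounds, so the df estimate of this BIR solution drops accordingly.
y_bir = np.clip(z, 1.0, 2.0)               # -> (1, 1, 1, 2, 2): 2 blocks
df_bir = len(np.unique(y_bir))
```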
The Lasso connection is not the only way to make our claim about the degrees of freedom of the BIR estimator. Indeed, Chen et al. [6] (equation (22)), again independently and concurrently, show that the number of degrees of freedom is equal to the number of blocks in the BIR solution. While they derive the result for range-constrained BIR, the claim is equivalent, although they pose it in terms of the number of connected components of a graph induced by the BIR estimator. It is interesting to note two completely different approaches for obtaining this result: we rely on results for the Lasso regularization path, while Chen et al. [6] rely on a result from algebraic graph theory pertaining to the rank of incidence matrices of graphs. Also note that essentially the same result we used for the regularized version of Lasso above is also known for the constrained version of Lasso (e.g., see Kato [16]). (In Figure 2, each path is the mean over 500 trials with 1000 observations.)
Behavior of our BIR regularization path is demonstrated empirically in Figure 2, where we compare the expected value of this Stein estimate to the actual df as estimated from repeated simulations using the optimism theorem of Efron [7]. In all dimensions, the number of iterations is seen empirically to be an unbiased estimate of the actual df.
This df result clarifies the regularization behavior of BIR, and in particular the gradual increase of model complexity as λ decreases. It also naturally facilitates using for BIR the multitude of model selection approaches developed for Lasso. We note that this non-increasing behavior of df as λ increases, which we illustrate empirically, was proved in Chen et al. [6] (see their Theorem 3.4; note that there df is non-decreasing in their parameter, which plays the role of the range bound s in the range-constrained case). Our Lasso connection can easily be used to make the same claim since, as noted above, BIR solutions only add variables and never drop them.
The second aspect we consider is the computational one. Efficient algorithms have been developed for calculating Lasso regularization paths [26, 8]. Can these be used to efficiently calculate BIR solutions? If we consider these algorithms in their standard forms, the answer is clearly no, since they include a step of enumerating over all covariates and solving a simple linear equation for each one, which is repeated many times (in every iteration). Since the covariates in our Lasso formulation are upper set indicators, and the number of upper sets is exponential in the number of observations, a direct application of these algorithms is unlikely to be efficient. An alternative approach for high-dimensional problems could be to replace this enumeration with an appropriate search algorithm, as proposed in Rosset et al. [27]. This turns out to be possible for BIR (details are omitted for brevity), but the resulting algorithm is still substantially less efficient than our algorithmic solution presented in Section 2, which relies on the specific structure of this problem. Hence we do not see a computational benefit in considering the Lasso connection, although this is open for further research.

Generalization of BIR for other loss functions
A natural question that arises is whether the useful structure identified in Section 2 is unique to isotonic regression with l_2 loss, or whether it generalizes to other loss functions. Of particular interest are exponential family log-likelihoods and robust regression loss functions, as discussed in Luss and Rosset [20] and references therein. The generic problem we consider is:

ĝ = argmin_{g ∈ G} Σ_{i=1}^n L_i(g(x_i)) + λ · range(g),   (12)

where L_i is some convex and differentiable loss function (usually a function of observation y_i) like the ones mentioned above. As noted in the Introduction, the non-regularized isotonic modeling problem can be solved for all such loss functions with the same complexity as the standard l_2 isotonic regression problem, as shown by Hochbaum and Queyranne [14], whose approach addresses the general loss function (and furthermore solves problem (12) for fixed â and b̂). Hence, for us to be able to solve the BIR version, we only need to verify that Theorem 1 also generalizes successfully. This is in fact true, as captured by the following generalized result, which is proven in the Appendix.

Theorem 5. Let ẑ be the optimal solution to the non-regularized isotonic regression problem with convex and differentiable loss functions L_i. Then setting
ŷ_i = max(â, min(ẑ_i, b̂))

for all i ∈ {1, . . . , n} solves the generalized BIR problem (12), where â and b̂ solve the equations

Σ_i (L_i′(â) − L_i′(ẑ_i))_+ = λ  and  Σ_i (L_i′(ẑ_i) − L_i′(b̂))_+ = λ.

Hence, Algorithm 1 can be generalized to solve the family of problems (12) by replacing averages and residuals with their appropriate loss-function-specific generalizations, following the same ideas as in Luss and Rosset [20]. We omit the details for brevity. We also note that there have been other works regarding bounded isotonic regression with a total order and fixed bounds (see Chakravarti [4] for an example with an l_1 loss function).
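The generalized lower-threshold equation can again be solved by one-dimensional bisection. The sketch below is our reading of Theorem 5 (the callable `dL` and the positive-part form of the equation are our notation, not a verbatim transcription); the sanity check is that for squared loss, where L_i′(u) = 2(u − y_i), it reduces exactly to the lower-threshold equation of Theorem 1:

```python
import numpy as np

def generalized_lower_threshold(z, dL, lam, iters=100):
    """Bisection for a solving sum_i (dL(i, a) - dL(i, z_i))_+ = lam,
    where dL(i, u) is the derivative of the (convex) i-th loss at u.

    For L_i(u) = (y_i - u)**2, dL(i, u) = 2*(u - y_i) and the equation
    becomes 2 * sum_i (a - z_i)_+ = lam, as in the squared-loss case.
    Assumes lam does not exceed the left-hand side at a = max(z).
    """
    z = np.asarray(z, dtype=float)
    lo, hi = z.min(), z.max()

    def F(a):
        return sum(max(dL(i, a) - dL(i, z[i]), 0.0) for i in range(len(z)))

    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if F(mid) < lam:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Note that convexity of each L_i makes L_i′ non-decreasing, so F is non-decreasing in a and the bisection is well defined.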

Experiments
Our simulations examine the performance of BIR from several perspectives: In Section 4.1, we compare BIR and IRP to the popular flexible modeling approach multivariate adaptive regression splines (MARS) [11] as a representative of modern competitive approaches which do not assume linearity or additivity. Reassuringly, BIR shows significantly improved performance for isotonic functions. In Section 4.2, we compare BIR and IRP to additive LISO [9], demonstrating the expected behavior: in lower dimension, with large amounts of data, the added flexibility of BIR allows fitting of better models. In Section 4.3, we concentrate on comparing our two isotonic modeling approaches, IRP [21] and BIR, demonstrating the advantage of BIR regularization as expressed by improved prediction performance. Finally, in Section 4.4, we describe the application of in-sample model selection approaches GCV and AIC to BIR, capitalizing on the Lasso connection, and compare their performance to cross-validation.
Simulations are carried out in the following manner: Training and testing data are generated. The training data is further split into training and validation folds. An IRP model is trained on the training fold, resulting in a regularization path. Models along this regularization path are used to generate a path of RMSEs by predicting responses in the validation fold. The model with the lowest RMSE is the chosen model, which is then used to predict the independent testing data, and results on this independent testing data are reported. The global isotonic solution generated by IRP from the training fold is then used to generate a path of BIR models, which is, in turn, used to select a model using the validation fold. As with IRP, this chosen BIR model is used to obtain prediction results on the independent testing data. Note that time results for BIR include the time to find the non-regularized solution with IRP. In all simulations, the extra computation that BIR required given the non-regularized solution was less than 5% of the computation required to find that non-regularized solution. All results are averaged over 50 trials.
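The protocol can be illustrated in a one-dimensional toy setting. This is a schematic only: the paper's experiments are multivariate and use IRP for the non-regularized fit, while here a basic pool-adjacent-violators routine and symmetric clipping around the fit's midpoint stand in; the data-generating model is invented for the example.

```python
import numpy as np

def pava(y):
    """Pool-adjacent-violators: one-dimensional isotonic least-squares fit."""
    merged = []  # list of [block value, block weight]
    for v in map(float, y):
        merged.append([v, 1.0])
        while len(merged) > 1 and merged[-2][0] > merged[-1][0]:
            v2, w2 = merged.pop()
            m = merged[-1]
            m[0] = (m[0] * m[1] + v2 * w2) / (m[1] + w2)
            m[1] += w2
    out = []
    for v, w in merged:
        out.extend([v] * int(round(w)))
    return np.array(out)

rng = np.random.default_rng(0)
n = 300
x = np.sort(rng.uniform(0.0, 1.0, n))
y = x ** 2 + rng.normal(0.0, 0.1, n)

# split the training data into training and validation folds
tr, va = np.arange(0, n, 2), np.arange(1, n, 2)
z = pava(y[tr])                      # non-regularized isotonic fit
full_range = z.max() - z.min()

# path of BIR models: clip the fit to shrinking (symmetric) ranges
center = 0.5 * (z.min() + z.max())
grid = np.linspace(full_range, 0.0, 20)
rmses = []
for R in grid:
    fit = np.clip(z, center - R / 2, center + R / 2)
    pred = np.interp(x[va], x[tr], fit)  # extend fit to validation points
    rmses.append(np.sqrt(np.mean((y[va] - pred) ** 2)))
best_R = grid[int(np.argmin(rmses))]   # model selected on the validation fold
```

The selected range `best_R` would then be used to score an independent test set, mirroring the reported results.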

Comparing with MARS
MARS is a well-known regression approach that builds models of the form f(x) = β_0 + Σ_m β_m h_m(x), where each basis function h_m(x) is either a hinge function (max(x_j − t, 0) or max(t − x_j, 0) for the j-th covariate and knot t ∈ R) or a product of hinge functions. The knots are typically determined by the training data, and β is then estimated by standard linear regression. Basis functions h_m(x) are added to the model in a greedy fashion. Fang and Meinshausen [9] show scenarios where LISO outperforms MARS, particularly in much higher dimensions than are considered here. In lower dimensions MARS performs very well, but in our experiments it suffers computationally with the large number of observations. In the simulated models, the z_ij indicator variables are uniform over {0, 1}, and the z_ij and x_ij variables are mutually independent. Each model has four indicator variables and four continuous variables for a total of eight dimensions. These are deemed "interaction" models because the terms in the summations are only included in the response if the two corresponding indicators are both on (i.e., both set to one). These are scenarios in which IRP outperforms MARS, as well as ones in which range regularization learns better isotonic models than the regularization of IRP. Furthermore, the timing results clearly show that MARS takes much longer to learn models than IRP and BIR. As the amount of training data increases, the computational complexity of MARS grows faster than that of IRP and BIR. The following simulation trains IRP and BIR models on 12000 observations, performs model selection using another 3000 observations, and bases the performance results on an independent test set of 3000 observations. MARS, which has an internal cross-validation mechanism, trains models on the first two folds (15000 observations), with performance results based on the same independent test set of 3000 observations. We next use Model 1 to examine how performance and regularization change as the training sample size increases.
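The hinge functions underlying MARS are simple to state directly; the sketch below shows a mirrored pair and one interaction term (knot locations are illustrative, and MARS's greedy knot selection is not reproduced):

```python
import numpy as np

def hinge(x, t, direction=+1):
    """MARS basis element: max(x - t, 0) if direction=+1, max(t - x, 0) if -1."""
    return np.maximum(direction * (x - t), 0.0)

x = np.array([0.0, 0.5, 1.0, 1.5])
# mirrored pair of hinges at knot t = 1.0
h_plus = hinge(x, 1.0, +1)   # -> [0.0, 0.0, 0.0, 0.5]
h_minus = hinge(x, 1.0, -1)  # -> [1.0, 0.5, 0.0, 0.0]
# an interaction basis term: elementwise product of two hinges
inter = hinge(x, 0.2, +1) * hinge(x, 1.2, -1)
```

Stacking such columns and running least squares on them is the linear-regression step described above.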
Figure 3 (left) demonstrates that RMSE on the hold-out test samples decreases as the number of training samples increases, due to better modeling of the entire space. Regarding the regularization parameter λ, the optimal λ increases as the number of training samples increases (not shown). To understand why, consider the BIR formulation as a Lasso-like problem (3). As the number of training samples increases, the number of upper sets (and hence variables) increases exponentially, and hence a higher regularization level is needed; consistency results for Lasso require λ_n to grow faster than √n but slower than n, e.g., at rate n^{1/2+c} for some small c (Theorem 4 of [32]). However, while Figure 3 (right) shows that λ_n grows slower than n, our empirical tests do not demonstrate that λ_n grows faster than √n as required for this particular consistency theorem.

Comparing with LISO
LISO builds models of the form f(x) = α + Σ_{j=1}^d h_j(x_j), where h_j(x_j) is a one-dimensional isotonic function of the j-th covariate. This additive isotonic regression is trained by taking each h_j(x_j) to be a positive linear combination of the upper sets formed by the j-th dimension and solving the following lasso problem:

  min_{α, β≥0}  Σ_{i=1}^n (y_i − α − Σ_{j=1}^d Σ_{l=1}^N β_jl I{x_ij ∈ U_jl})^2 + λ Σ_{j=1}^d Σ_{l=1}^N β_jl,    (15)

where β ∈ R^{d×N}. The coefficient for upper set l in the j-th dimension (represented by U_jl) is β_jl, and α is again the intercept. Then we have h_j(x_j) = Σ_{l=1}^N β_jl I{x_j ∈ U_jl}. Note that the number of upper sets N in each dimension is bounded by the number of training instances n. Problem (15) thus reduces to a lasso problem with nd parameters, as shown in Fang and Meinshausen [9], which can be solved by the classic LARS algorithm for low-dimensional problems to obtain a full regularization path of additive isotonic models. For high-dimensional problems, Fang and Meinshausen [9] offer an algorithm that solves (15) for fixed λ, and hence requires solving the problem many times over a λ-grid in order to generate a path of solutions. The following results use a LARS implementation of LISO in order to obtain a full regularization path for additive isotonic models.
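In one dimension the distinct upper sets are nested half-lines {x_j ≥ t} with t ranging over observed values, so the nd-column design matrix behind (15) can be built directly. The sketch below assumes exactly this nested-indicator structure; dropping each dimension's smallest threshold (whose column is all ones and thus absorbed by the intercept α) is a choice of this sketch, not something specified in the text.

```python
import numpy as np

def liso_design(X):
    """One indicator column per (dimension, threshold) pair: I{x_ij >= t},
    with t ranging over the sorted distinct values of each dimension.
    The smallest threshold of each dimension is dropped because its
    all-ones column is absorbed by the intercept."""
    n, d = X.shape
    cols = []
    for j in range(d):
        for t in np.unique(X[:, j])[1:]:
            cols.append((X[:, j] >= t).astype(float))
    return np.column_stack(cols)

X = np.array([[0.1, 5.0],
              [0.2, 3.0],
              [0.3, 4.0]])
D = liso_design(X)  # at most n*d columns; here shape (3, 4)
```

Running a nonnegative lasso (e.g., LARS with sign constraints) on `D` plus an intercept is then the additive isotonic fit.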
A path of models with increasing complexity (i.e., increasing number of upper sets in the solution) is learned for IRP, BIR, and LISO, and performance is shown in Table 2. For dimensions 2-5, the following simulation trains models on 3000 observations, performs model selection using another 3000 observations, and bases the performance results on an independent test set of 3000 observations. The limited training on 3000 observations is because LISO is computationally expensive (3000 observations with d dimensions translates to running LARS on 3000 observations with roughly 3000d variables). For dimension 5, we also give results for training IRP and BIR with 12000 observations (indicated by 5*) to show how performance improves when training with more data.
Neither model is additive, so we should expect isotonic regression (IRP and BIR) to outperform LISO in all dimensions. This is true for the first model; however, the second model with d = 5 shows better performance for LISO. This is because isotonic regression is less structured than additive isotonic regression and requires more data to learn the model as the dimension increases, as demonstrated by the improved performance of IRP and BIR when trained with 12000, rather than 3000, observations (LISO is too computationally expensive to train with this many observations and dimensions, as the time results show). These simulations further show that range regularization using BIR gives, in many cases, statistically significant improvements over the regularization provided by IRP.

Comparing with IRP
The previous subsection noted that IRP and BIR are less structured than other regression methods, such as LISO, and hence require more training data to learn a good model. This was exhibited in Table 2 in 5 dimensions by the improved performance between training with 3000 versus 12000 observations. In this section, we give two more examples in slightly higher dimension that are trained with 12000 observations. Results are shown in Table 3. Again, a validation set of 3000 observations is used to select models, and performance is measured on an independent test set. Further detail is given in Figure 4, which illustrates performance throughout the regularization path for both IRP and BIR on a single sample of data. Diamonds represent the minimum RMSE along the respective paths. First note that IRP performance is minimized very early in all regularization paths. This is consistent with the observation in Luss et al. [21] that IRP performs most of its fitting in its first few iterations. Next note that the overall trend is the same for IRP and BIR. In dimensions 4, 6, and 8, performance improves (RMSE decreases) as the models become more complex, until a certain point at which more complexity hurts performance and RMSE increases. The minimum of the BIR path is lower than the minimum of the IRP path for these dimensions under both models, a result due to the sounder and slower regularization employed by BIR compared to the early stopping of IRP. This slower model fitting can also be observed by comparing the expected degrees of freedom in Figure 2 to corresponding simulations for IRP in Luss et al. [21].
While each iteration here increases the degrees of freedom by about one, most of the total fitting in IRP was done in the first few iterations (usually more than half in the first iteration). Finally, we again note the equivalent performance in 10 dimensions: Figure 4 depicts the immediate overfitting, attributable to insufficient constraints, that occurs in both IRP and BIR.

Comparing model selection approaches for BIR
Our experiments so far have exclusively used cross-validation (CV) for selecting the regularization parameters of BIR and other approaches. This is convenient for comparing approaches which may not have in-sample model selection methods available. CV is also an appropriate model selection approach for observational situations where future prediction points are independently drawn ("random-X"), which arguably represent the majority of modern data analysis scenarios [12]. However, for BIR specifically, the Lasso connection allows us to use a variety of in-sample model selection methods developed or adapted for Lasso, including AIC, GCV and others [10, and references therein]. Here we briefly compare CV to AIC and GCV as approaches for selecting λ in BIR, where we use the Stein estimate of df as the number of parameters (degrees of freedom) in AIC and GCV.
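With the Stein degrees-of-freedom estimate plugged in for df, the two criteria can be computed directly along the path. The sketch below uses the standard Gaussian-loss textbook forms of AIC and GCV, which are assumptions of this illustration rather than formulas taken from the paper.

```python
import numpy as np

def aic(y, fit, df):
    """Gaussian AIC up to additive constants: n * log(RSS / n) + 2 * df."""
    y, fit = np.asarray(y, float), np.asarray(fit, float)
    n = len(y)
    rss = np.sum((y - fit) ** 2)
    return n * np.log(rss / n) + 2.0 * df

def gcv(y, fit, df):
    """Generalized cross-validation score: RSS / (n * (1 - df/n)^2)."""
    y, fit = np.asarray(y, float), np.asarray(fit, float)
    n = len(y)
    rss = np.sum((y - fit) ** 2)
    return rss / (n * (1.0 - df / n) ** 2)
```

Along a BIR path one would evaluate either criterion at each λ, using the Stein df estimate for that fit, and select the minimizer, with no validation fold required.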
Because the in-sample approaches are targeted at "fixed-X" situations, we slightly change the setup of the simulation for this experiment, and draw the test set for evaluating selected models at the same set of x-values as the training set. As before, for CV a portion of the training set is set aside for model selection, while for AIC and GCV the entire set (15000 observations) is used for model building. In general, in this situation, we expect in-sample approaches to do better than CV.
We compare the approaches on models 1 and 2 from Section 4.3, and show the results in Table 4. For model 1, which is a relatively simple model with limited non-additivity effects, CV does roughly as well as GCV and AIC and essentially selects the same models. Still, GCV appears to be slightly superior to both CV and AIC in higher dimensions. Model 2 is a much more "wild" model, hence prediction at new covariate vectors outside the training set is a much more difficult task than "fixed-X" prediction. Consequently for this model the difference between the CV model selection and GCV/AIC becomes evident. CV selects models with heavy regularization, while GCV/AIC tend to select smaller λ and hence models which are appropriate for predicting under the "fixed-X" assumption. The exception is the deteriorated performance of GCV at dimension d = 10, a phenomenon whose detailed investigation may reveal further insights but is a topic for future study. We also performed some experiments comparing the performance of the three approaches in prediction in the "random-X" scenario. As expected, CV was consistently superior to AIC/GCV in this setting (results not shown).

Discussion and conclusion
In this paper we propose to regularize isotonic regression by penalizing or constraining the range of the estimated function and name this new approach BIR. We demonstrate that, given the non-regularized isotonic regression model, all BIR solutions can be generated by a simple and efficient algorithm because they are obtained by thresholding the non-regularized solution from above and below. Furthermore, the BIR problem can be formulated as a non-negative Lasso problem in the basis of upper set indicators and thus inherits properties of Lasso, in particular its regularization behavior. Like Lasso, it adds about one degree of freedom with each iteration (each upper set added to the model). Thus, BIR combines a sound regularization approach, efficient computations, and interesting connections to other methods. Furthermore, we show that the BIR algorithm can easily be generalized to other loss functions, including exponential family log-likelihoods and robust regression. This significantly enhances the utility of BIR methods.
As mentioned in the Introduction, isotonic regression suffers from overfitting issues that severely limit its utility for modern high-dimensional problems. Our simulations demonstrate that BIR can significantly increase the range of usefulness of isotonic modeling compared to the non-regularized solution or the coarser regularization offered by IRP. Up to about dimension eight, BIR can significantly improve on isotonic regression and still generate useful models. Overall, we believe our simulations demonstrate that when isotonicity assumptions are appropriate, the true relationship is complex and non-additive, the dimension is relatively low, and data is abundant, properly regularized isotonic regression is likely to do very well compared to alternatives. It should be noted that our simulations are "space filling" in the sense that the covariate values are uniformly distributed in the covariate space X. This means that the actual dimension is also the effective dimension. Natural data are often highly structured (as captured, for example, by PCA) and can be closely approximated by lower dimensional spaces. In our context this means that the isotonic constraints needed to control model complexity can persist into higher dimension in natural data than in our simulations, allowing BIR to remain useful.
An interesting connection of BIR is to total variation penalties, which have become important in several application domains [22,5]. In d = 1 dimension, the range of a monotonic function is trivially also its total variation, so BIR can be thought of as a total variation approach with added isotonicity constraints, rather than isotonic regression with an added range constraint. In higher dimensions, total variation definitions become mathematically quite involved, but for isotonic functions they simply reduce back to the range. Hence BIR can also be thought of as total variation penalized isotonic regression.
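The d = 1 identity is easy to verify numerically: for a nondecreasing sequence the discrete total variation Σ|g_{i+1} − g_i| telescopes to g_n − g_1, i.e., the range. A small check using the discrete definition of total variation:

```python
import numpy as np

def total_variation(g):
    """Discrete total variation of a sequence: sum of |g[i+1] - g[i]|."""
    return float(np.sum(np.abs(np.diff(g))))

g = np.array([0.0, 0.2, 0.2, 0.7, 1.5])  # a nondecreasing (isotonic) sequence
# for a monotone sequence the total variation equals the range max - min
```

For a non-monotone sequence the total variation strictly exceeds the range, which is why the identity relies on the isotonicity constraints.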